Refactor Entropy Code 5-29-13
2015-01-13 azim58 - Refactor Entropy Code 5-29-13
I would like to refactor my entropy code so that it is a little cleaner
and can handle a wide variety of situations. I want to be able to prepare
the list of numbers from multiple sources. Then I want to take the lists
and be able to perform various types of calculations.
I want to be able to prepare the data from
- gpr file
- tab delimited text file (raw data)
- tab delimited text file (normalized data (I could use median and Combat
From this data I want to produce a file containing
- a raw number list
- a normalized number list and a normalized number list converted to
I want to take these number lists and calculate
- entropy (raw)
- entropy (normalized (from Combat and median normalized data))
- CV (raw)
- CV (normalized (from Combat and median normalized data))
I want to test my code on a few small lists of numbers to make sure it is
working: one roughly random list, one extreme list populated entirely
with the highest value, one extreme list populated entirely with the
lowest value, and one with no duplicates, so that no "bin" holds more
than one item.
mini gpr file here:
"F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\Mini_gpr.gpr"
mini tab delimited text file raw data here:
"F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\tab delimited raw.xlsx"
mini tab delimited text file normalized data here:
"F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\tab delimited normalized.xlsx"
I also copied these files to the shared drive so that programs on other
computers can access them:
"S:\Research\Cancer_Eradication\Discovering tumor specific antigens\entropy\5-29-13\entropy"
one kind of random dataset
random
65535, 861, 65535, 861, 65535, 556, 65535, 956, 255, 1, 1, 1, 255
one random with no highest or lowest values
random_nhl
235, 861, 235, 861, 235, 556, 80, 956, 255, 42000, 42000, 42000, 255
one extreme with highest numbers
all_high
65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535,
65535, 65535, 65535
one extreme with lowest numbers
all_low
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
one with no duplicates
all_different
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 65535
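A minimal sketch of how the five lists above could be run through an
entropy and CV calculation. This is not the original program: it assumes
each distinct value is its own "bin", that entropy is Shannon entropy in
bits (log base 2), and that CV is the sample standard deviation divided by
the mean. The class and method names (EntropyTestCases, entropy, cv,
filled) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class EntropyTestCases {
    // The five test lists from the notes above.
    static final int[] RANDOM =
        {65535, 861, 65535, 861, 65535, 556, 65535, 956, 255, 1, 1, 1, 255};
    static final int[] RANDOM_NHL =
        {235, 861, 235, 861, 235, 556, 80, 956, 255, 42000, 42000, 42000, 255};
    static final int[] ALL_HIGH = filled(65535);
    static final int[] ALL_LOW = filled(1);
    static final int[] ALL_DIFFERENT =
        {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 65535};

    // Builds a 13-element list holding the same value everywhere.
    static int[] filled(int value) {
        int[] a = new int[13];
        Arrays.fill(a, value);
        return a;
    }

    // Shannon entropy in bits. Assumption: every distinct value is one bin;
    // the original program may bin the values differently.
    static double entropy(int[] values) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / values.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Coefficient of variation. Assumption: sample standard deviation
    // (n - 1 in the denominator) divided by the mean.
    static double cv(int[] values) {
        double mean = 0.0;
        for (int v : values) {
            mean += v;
        }
        mean /= values.length;
        double sumSquares = 0.0;
        for (int v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSquares / (values.length - 1)) / mean;
    }

    public static void main(String[] args) {
        String[] names = {"random", "random_nhl", "all_high", "all_low", "all_different"};
        int[][] lists = {RANDOM, RANDOM_NHL, ALL_HIGH, ALL_LOW, ALL_DIFFERENT};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-14s entropy=%.4f cv=%.4f%n",
                names[i], entropy(lists[i]), cv(lists[i]));
        }
    }
}
```

Under these assumptions, all_high and all_low should both give an entropy
of 0 and a CV of 0, and all_different should give the maximum entropy for
13 items, log2(13), which is about 3.70 bits. If the real program bins
values into ranges instead, its numbers will differ.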
These entropy test cases with their entropy and CV values can be found
here:
"F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\entropy test cases 5-29-13.xlsx"
I'll also test a real list from a gpr with 10,000 values
The original name of the gpr file was
1009951_bot_N-19(152)_08132012.gpr
which came from the
2012 good gprs diseases 1-8
folder
I renamed the gpr to
"F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\test.gpr"
===========================================================================
6-11-13
Now that I have my test data ready to go, I can rewrite the code. I
should make sure I have the previous code copied to a safe place.
see also
how to calculate information entropy in excel
===========================================================================
6-13-13
Alright, I've basically refactored the code. The code can be found here:
"F:\kurt\storage\CIM Research Folder\DR\2013\6-13-13\entropy program\entropy of array src 6-13-13.zip"
Now I can go through and look at my test cases and fix errors.
I copied the test case code to here
"S:\Research\Cancer_Eradication\Discovering tumor specific antigens\entropy\6-13-13"
from
"S:\Research\Cancer_Eradication\Discovering tumor specific antigens\entropy\5-29-13\entropy"
Now I can test out the code.
===========================================================================
I tested the code and fixed some mistakes. I also measured the time it
took on different systems. Here's a message I sent to Lu Wang to test on
his computer as well.
message to Lu Wang
Hi Lu, I'm sharing these two files with you. Maybe you could help me see
how fast my program runs on your system. So far I have run the program on
two systems.
time taken to run a test gpr with Pentium 4 CPU 3.4 GHz 2 GB RAM system
(your old computer now at the far north wall of our lab)
2013/06/15 19:06:22
2013/06/15 19:14:55
8m33s
time taken to run a test gpr with AMD Phenom II X6 1055T CPU 2.8 GHz 8 GB
RAM system (my personal computer at my apartment)
2013/06/15 19:49:00
2013/06/15 19:51:31
2m31s
Let's see how long it takes on your system. This will only take a little
of your time. Here are the instructions.
-Start Eclipse. File->New->Java Project. Enter the project name as
EntropyOfArray. Navigate to the src folder for the project on your hard
drive and paste the source files there.
-Right-click the src folder under EntropyOfArray in Eclipse and click
Refresh; all of the source files should now show up.
-Place the test.gpr somewhere on your hard drive.
-Open the Test_Immunosignature_Data_030413 class and change the String
directory line so that it points to the directory on your hard drive that
contains test.gpr. Make sure the filepath string is surrounded by
quotes "" and that every backslash \ is written as two backslashes \\.
-Click the green arrow at the top of the Eclipse IDE editor to run the
program. The program should run for several minutes.
-Copy the text output of the program in the console which states the time
the program took and send it to me. If you could send me the name of your
processor, the number of GHz, and the amount of RAM that would be great
too.
-Go to the folder titled entropy inside of the folder that you put
test.gpr into. Open test_details.txt and send me the number that is
listed after Entropy of Distribution:
Thanks a lot for helping me out! I hope this doesn't take too much of
your time. Let me know if you have any questions at all.
Best,
Kurt
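As an illustration of the backslash step in the instructions above, here
is a small sketch. The variable name DIRECTORY and the path are
hypothetical; the actual line in Test_Immunosignature_Data_030413 may be
declared differently. It only shows that a doubled backslash in Java
source becomes a single backslash in the string at run time.

```java
class DirectoryExample {
    // Hypothetical location; replace it with the folder that holds test.gpr.
    // Each \\ in the source is one \ in the resulting string.
    static final String DIRECTORY = "C:\\data\\entropy_test";

    public static void main(String[] args) {
        // Prints the path with single backslashes.
        System.out.println(DIRECTORY);
    }
}
```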
Here are Lu's results and the specs of his system
Hi Kurt,
2013/06/15 20:18:29
2013/06/15 20:19:22
53s;
Entropy of Distribution: 6.2099199905835425
The processor of my computer is an i7-3770 at 3.9 GHz; the program used
about 2.5 GB of RAM.
My computer has 32 GB of RAM and usually 16 GB of it is free.
Best,
Lu
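The elapsed times quoted in the messages above can be rechecked from the
start and end timestamps. Below is a minimal sketch, assuming the
timestamps always follow the yyyy/MM/dd HH:mm:ss pattern shown; the class
and method names (ElapsedTime, elapsed) are hypothetical and are not part
of the entropy program.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class ElapsedTime {
    // Pattern matching the timestamps in the notes, e.g. 2013/06/15 19:06:22.
    static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");

    // Returns the time between two timestamps as "8m33s", or "53s" when
    // the run took less than a minute.
    static String elapsed(String start, String end) {
        Duration d = Duration.between(
            LocalDateTime.parse(start, STAMP),
            LocalDateTime.parse(end, STAMP));
        long minutes = d.toMinutes();
        long seconds = d.getSeconds() % 60;
        return minutes > 0 ? minutes + "m" + seconds + "s" : seconds + "s";
    }

    public static void main(String[] args) {
        // Pentium 4 system
        System.out.println(elapsed("2013/06/15 19:06:22", "2013/06/15 19:14:55"));
        // AMD Phenom II X6 system
        System.out.println(elapsed("2013/06/15 19:49:00", "2013/06/15 19:51:31"));
        // Lu's i7-3770 system
        System.out.println(elapsed("2013/06/15 20:18:29", "2013/06/15 20:19:22"));
    }
}
```

The three calls reproduce the 8m33s, 2m31s, and 53s figures reported for
the three systems.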