Work 102012

2015-01-13

azim58 - Work 102012

Some intervals for adding to the large table ran out of memory. It is
hard to find these intervals because the text file is so large. I would
like to make a Java class for handling large text files. This class could
find, sort, etc.

I created the LargeTextFileHandler class and had it search for lines in
a text file larger than 1 GB. I can now see how long the program takes
to do this.

Code used here (102112)

It takes about 1-5 min for the program to make one pass through the
whole 1 GB file.
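
A one-pass search like this can be sketched roughly as follows. This is
not the original LargeTextFileHandler code; the class name
LargeFileSearchSketch, the method findLines, and the UTF-8 encoding are
assumptions made only for illustration. The file is read one line at a
time, so memory use should stay small no matter how large the file is.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the LargeTextFileHandler class from this log.
public class LargeFileSearchSketch {

    /**
     * Streams the file line by line and returns the 1-based line numbers
     * of every line that contains the query string.
     * Assumes the file is valid UTF-8 text.
     */
    public static List<Long> findLines(Path file, String query) throws IOException {
        List<Long> hits = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            long lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (line.contains(query)) {
                    hits.add(lineNumber);
                }
            }
        }
        return hits;
    }

    // Usage: java LargeFileSearchSketch <file> <query>
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        List<Long> hits = findLines(Paths.get(args[0]), args[1]);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(hits.size() + " matching lines in " + elapsedMs + " ms");
    }
}
```

Only the matching line numbers are kept in memory. If a query matches a
very large share of the lines, it may be better to write the hits to
another file instead of collecting them in a list.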

I will re-add items 6011-8000 to the table.

===========================================================================
I think I am starting to get a clearer idea of how I would like to
analyze the data in my 1 GB table-of-results file. For each protein
matched by the 1st motif group (e.g. 0_3i_b_1_266i, 0_3i_b_1_716i,
etc.), I would like to see how many times it matched another protein in
the motif group matches. I would then like to know which percentile this
number of matches falls into (e.g. is it greater than 90% of the other
proteins' match counts?). Then I would like to get the average and
median e scores for all of the matches by that protein. Once I have
these percentile numbers, I would like to sort the data so that the
proteins with the greatest number of matches and the greatest median or
average scores (the proteins closest to the corner of a graph where both
the number of matches and the e score are at the 100th percentile) are
ranked towards the top. I would then want to see if any of these
proteins look interesting.
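
A minimal sketch of this ranking idea, under several assumptions: the
match scores have already been grouped per protein into an in-memory
map, every protein has at least one match, a higher score counts as
better, and "percentile" means the percent of the other proteins with a
strictly lower value. The class and method names (MatchStatsSketch,
percentOfOthersBelow, rankByCornerDistance) are hypothetical, and the
median is used for the score axis, although the mean could be swapped
in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the analysis described above.
public class MatchStatsSketch {

    /** Percent of the OTHER values strictly below v (v is assumed to be one of the values). */
    public static double percentOfOthersBelow(List<Double> values, double v) {
        if (values.size() < 2) {
            return 0.0; // no other values to compare against
        }
        long below = 0;
        for (double x : values) {
            if (x < v) {
                below++;
            }
        }
        return 100.0 * below / (values.size() - 1);
    }

    /** Arithmetic mean; assumes a non-empty list. */
    public static double mean(List<Double> xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.size();
    }

    /** Median of a copy of the list; assumes a non-empty list. */
    public static double median(List<Double> xs) {
        List<Double> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    /**
     * Orders proteins by distance to the (100, 100) corner of a plot of
     * match-count percentile versus median-score percentile, closest first.
     */
    public static List<String> rankByCornerDistance(Map<String, List<Double>> scoresByProtein) {
        List<Double> counts = new ArrayList<>();
        List<Double> medians = new ArrayList<>();
        for (List<Double> scores : scoresByProtein.values()) {
            counts.add((double) scores.size());
            medians.add(median(scores));
        }
        Map<String, Double> distance = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : scoresByProtein.entrySet()) {
            double countPct = percentOfOthersBelow(counts, e.getValue().size());
            double scorePct = percentOfOthersBelow(medians, median(e.getValue()));
            distance.put(e.getKey(), Math.hypot(100.0 - countPct, 100.0 - scorePct));
        }
        List<String> ranked = new ArrayList<>(distance.keySet());
        ranked.sort((a, b) -> Double.compare(distance.get(a), distance.get(b)));
        return ranked;
    }
}
```

For the full 1 GB table, the per-protein score lists would first have to
be built in a streaming pass over the file. If they do not fit in
memory, only the running counts and sums could be kept, at the cost of
losing the exact median.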