Work 092712
2015-01-13 azim58 - Work 092712
continued working on compareAllSequencesInFile1WithFile2 method
Made 2 artificial comparison files
"S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-26-12\test_comparison1.txt"
"S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-26-12\test_comparison2.txt"
Added a line from the 1st one to the 2nd one.
gb|EGH42662.1| RNA methyltransferase TrmH, group 2 [Pseudomonas ... 18.0 19006
Alright the compareAllSequencesInFile1WithFile2 method seems to be
working now.
Now I just need to make sure that the blast result from each motif group
gets compared with the others.
Then I can see how many matches each input has.
===========================================================================
I would also like to get my HEE sequence to match something from the
database.
Called this command:
blastp -db S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-26-12_database\nr -query x -word_size 2 -seg no -evalue 200000000000000 -comp_based_stats -out test_for_hee.txt
x contains the sequence HEEX
Searching this way yielded a list of matches, and some of them were SMC
related.
Searching this way for PMRE also yielded a list of matches, with some SMC
related matches as well.
These test search results can be found here:
C:\kurt\storage\CIM Research Folder\DR\2012\9-27-12\test_blast
===========================================================================
When I count the matches lower than the e-value cutoff, I don't think
numbers written in scientific notation like 3e-2 are getting counted, so
I'll need to fix this in the need_to_determine_number_of_matches section.
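A minimal sketch of counting e-values below a cutoff so that values like 3e-2 are included. This is not the code from my program; the class name, method name, and sample values are made up for illustration. It assumes the e-values arrive as strings and relies on Double.parseDouble, which accepts scientific notation.

```java
import java.util.Arrays;
import java.util.List;

class EValueCountSketch {
    // Counts how many e-value strings are strictly below the cutoff.
    // Double.parseDouble reads "3e-2" as 0.03, so scientific notation
    // is handled the same way as plain decimals.
    static int countBelow(List<String> eValues, double cutoff) {
        int count = 0;
        for (String text : eValues) {
            double value = Double.parseDouble(text.trim());
            if (value < cutoff) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical e-values, not taken from a real blast result.
        List<String> eValues = Arrays.asList("3e-2", "0.5", "19006", "2e-10");
        System.out.println(countBelow(eValues, 0.05)); // prints 2
    }
}
```

If the e-values are currently picked out with a regex that only allows digits and a decimal point, that would explain why 3e-2 is skipped, but that is only a guess.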
The match counting code appears to work. Now I can just add this match
information to the table as well.
Now I would like to rank the items in a manner so that the items with the
highest matches and lowest e-values are ranked the highest. Actually, I
can basically do this simply by sorting the numbers in excel.
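If I ever want the ranking inside the program instead of Excel, it could be sketched roughly like this. The Row class and its fields (name, matches, bestEValue) are hypothetical and only stand in for whatever the table actually holds. Rows are sorted by most matches first, with ties broken by the lowest e-value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RankingSketch {
    // Hypothetical table row: an input name, its match count, and its
    // lowest e-value.
    static class Row {
        final String name;
        final int matches;
        final double bestEValue;

        Row(String name, int matches, double bestEValue) {
            this.name = name;
            this.matches = matches;
            this.bestEValue = bestEValue;
        }
    }

    // Returns a sorted copy: highest match count first, then lowest e-value.
    static List<Row> rank(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.<Row>comparingInt(r -> r.matches).reversed()
                .thenComparingDouble(r -> r.bestEValue));
        return sorted;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("A", 3, 1e-5),
                new Row("B", 5, 0.1),
                new Row("C", 5, 1e-3));
        for (Row r : rank(rows)) {
            System.out.println(r.name + " " + r.matches + " " + r.bestEValue);
        }
    }
}
```

This matches a two-column sort in Excel (matches descending, then e-value ascending).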
The only other features that I wanted to add to my program involve better
logging, and bepipred functionality. I don't think either of these things
will be terribly difficult to implement.
Finished writing logging functions.
Started writing bepipred_handler class.
When I tried to use java to ssh I got the following message
Pseudo-terminal will not be allocated because stdin is not a terminal.
ssh for java
http://stackoverflow.com/questions/2514439/how-to-run-ssh-commands-on-remote-system-through-java-program
I'm getting the following error
cannot make a static reference to the non-static method exec(String) from
the type Runtime
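This compiler error comes from calling exec as if it were static (Runtime.exec(...)); exec is an instance method, so it has to be called on the object returned by Runtime.getRuntime(). A minimal sketch follows. The host name and remote command are placeholders, and the use of ssh's -T option (which disables pseudo-terminal allocation) to avoid the pseudo-terminal message is an assumption I have not tested against the real server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

class SshCommandSketch {
    // Builds the argument list for a non-interactive ssh call.
    // "-T" asks ssh not to allocate a pseudo-terminal.
    static List<String> buildSshCommand(String userAtHost, String remoteCommand) {
        return Arrays.asList("ssh", "-T", userAtHost, remoteCommand);
    }

    // Runs a command and returns its standard output as one string.
    // Note the call goes through Runtime.getRuntime(), not Runtime.exec.
    // stderr is not read here; if a command writes a lot to stderr,
    // ProcessBuilder with redirectErrorStream(true) may be the safer route.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = Runtime.getRuntime().exec(command.toArray(new String[0]));
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }

    public static void main(String[] args) {
        // Placeholder host and command; only prints the command it would run.
        System.out.println(buildSshCommand("user@example.org", "hostname"));
    }
}
```

Passing the arguments as a list rather than one long string also avoids problems with spaces inside an argument.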
I'll forget the bepipred stuff for now.
Now I'll try to run the program from scratch from here
S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-28-12\mpa
I should modify the log file for the comparison of the blast results so
that it states which two are being compared out of how many.
I added this feature.
For some reason the FSA files are not being created properly for some
sequences. I suspect this is a sequence and regex problem.
What is the command to see how many files are in a directory?
ls -1 | wc -l
Now I'll look into the fsa file issue.
Here's a file that was not found
blast_res_motif_group_0_3i_b_blast_res_motif_group_1_15980i.txt
blast_res_motif_group_0_3i.fsa was created
The file for
blast_res_motif_group_1_15980i was not made
instead this file was made
console_blast_res_motif_group_1_15980i.fsa.txt
The text in this file shows that the blast did not work (the blast help
commands are listed and everything). There is also this message
Error: Too many positional arguments (1), the offending value: Chain
Now I need to find out what this line was in the original blast result
document.
I'm not sure which line in the blast result document 15980 refers to. I
would think it refers to either line 15980 or line 15980 + 31 (the header
part of the document) = 16011, give or take 1 for each of these
possibilities. None of these entries contain the word "Chain" though.
Line 16021 does contain the word "Chain". Why would it be 10 off?
I see why it is off. The regex I used expects input that has a space
after the "|...|" but the lines with the word chain don't have a space
after the | and so they were not added. I'll need to fix this.
I think changing the regex from this
(.+?)\|(.+?)\|\s\s+?(.+?)\s\s+(.+?)\s\s+(.+)
to this
(.+?)\|(.+?)\|.*?\s\s+?(.+?)\s\s+(.+?)\s\s+(.+)
should work
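A quick check of the two regexes, sketched in Java. The first sample line is the EGH42662.1 entry from above with the column spacing I assume the raw blast output has (two or more spaces between columns); the second is a made-up "Chain" style line with a placeholder ID, where a chain letter follows the second | with no space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlastLineRegexSketch {
    static final Pattern OLD = Pattern.compile(
            "(.+?)\\|(.+?)\\|\\s\\s+?(.+?)\\s\\s+(.+?)\\s\\s+(.+)");
    static final Pattern NEW = Pattern.compile(
            "(.+?)\\|(.+?)\\|.*?\\s\\s+?(.+?)\\s\\s+(.+?)\\s\\s+(.+)");

    // Returns the five captured groups, or null if the whole line
    // does not match the pattern.
    static String[] parse(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
    }

    public static void main(String[] args) {
        // Spacing is assumed; the chain line and its ID are hypothetical.
        String normal = "gb|EGH42662.1|  RNA methyltransferase TrmH, group 2 [Pseudomonas ...  18.0    19006";
        String chain = "pdb|0XXX|A  Chain A, hypothetical protein  18.0    19006";

        System.out.println(parse(OLD, normal) != null); // true
        System.out.println(parse(OLD, chain) != null);  // false
        System.out.println(parse(NEW, normal) != null); // true
        System.out.println(parse(NEW, chain) != null);  // true
    }
}
```

One thing to keep in mind: with the new regex the text between the second | and the first run of spaces (the chain letter here) is matched by .*? and is not captured in any group.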
How long approximately will my program take to compare two 20,000 line
blast result files?
On Saturday at about 11:39am there were 78484 comparisons performed.
The program started on Friday at 11:44 am so let's say that 24 hours
passed
20000*20000 = 400,000,000 comparisons need to be made. How many hours
will this take?
24/78484=y/400000000
This will take about 122,318 hours. This will take 5097 days or 14 years.
A little long haha
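The same estimate as a small Java calculation, assuming the comparison rate stays constant at what was observed in the first ~24 hours. The class and method names are made up for this sketch.

```java
class ComparisonTimeEstimate {
    // Scales the elapsed time linearly: elapsed / done = total time / total.
    static double estimateHours(long comparisonsDone, double elapsedHours, long comparisonsTotal) {
        return elapsedHours * comparisonsTotal / comparisonsDone;
    }

    public static void main(String[] args) {
        long total = 20000L * 20000L;                 // 400,000,000 comparisons
        double hours = estimateHours(78484L, 24.0, total);
        double days = hours / 24.0;
        double years = days / 365.0;
        System.out.println(Math.round(hours) + " hours"); // 122318 hours
        System.out.println(Math.round(days) + " days");   // 5097 days
        System.out.println(Math.round(years) + " years"); // 14 years
    }
}
```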
Okay now I can take a look at getting ssh and bepipred working
ssh with java
===========================================================================
Actually spent time trying to get the blast to work. Wanted to blast an
accession against 20,000 accessions, but this doesn't seem to be working.
It looks like I may need to retrieve the sequences.
When I blasted one retrieved sequence against the other approximately
20,000 retrieved sequences, all in one FASTA file, this gave me the result
I wanted. It looks like the program will take approximately 5 days to
compare all of the sequences one at a time in file 1 with all of the
sequences at once in file 2. I think this is fairly reasonable. Much
better than the 11-16 year time period it was going to take before.
I would like to start cleaning up my code a little bit. I have 3 versions
of the comparison file method, but I think I can get rid of all but 1 and
store the others somewhere else. I would also like to make it so that
certain files are not created unless they need to be. I think I'll start
on this another time.