note from Mark 7-6-11

2013-12-01

azim58 - note from Mark 7-6-11

This is an e-mail conversation about the GEMODA program.

===========================================================================
Hi Mark Styczynski,

I am currently experimenting with the Gemoda program, which I
believe you worked on. I am using it to see how similar different
peptide sequences are to each other. Occasionally the program reports
a best significance of exactly 1.0 in cases where I don't think it
should. For example, when I compare just two peptides with extremely
similar sequences, I would expect the significance to be a very small
number, and it often is. However, sometimes it is 1.0.

Here's a concrete example:

QRQHSP and QRQHSPV have a best significance for their best motif of
1.0 when using the following command:

gemoda-s -l 4 -g 2 -m BLOSUM62 -i motif_file_for_gemoda.txt

I don't completely understand in what situations I obtain 1.0, and I
also don't understand whether or not this is the answer I am supposed
to get. What do you think about this? I'm basically just looking for
a tool to give me a score indicating how similar two short sequences
are to each other. I know this is a fairly detailed technical
question about something you probably have not looked at for a very
long time, so I completely understand if you cannot help me much.
However, any response would be greatly appreciated!

Best regards,
Kurt Whittemore

Graduate Student
Arizona State University
Biodesign Institute
727 E. Tyler Street
Tempe, AZ 85287

===========================================================================
Mark

===========================================================================
Kurt,

You are right, it has been easily four years since I have swum through
Gemoda code, and probably substantially more than that.

My guess would be that for degenerate, over-simplified cases, you are
finding these bad significance values.

I'll first refer you to our supplementary information:

http://web.mit.edu/bamel/gemoda/jensen2004supp.pdf

There are details on the significance calculations in there that you
should read. Once you read that, you'll see that the significance is
strictly based on your dataset, and whether the similarity "signal" you
are detecting is substantially different from the background noise. It is
*not* telling you the likelihood of two proteins having some level of
similarity given what is known about nature. This was done in order to
continue with the "data agnostic" approach --- the significance is only
analyzed on a problem-specific level. This means that the same run of
similarity, in different backgrounds, will have different significance.

What you have, then, is likely just "this is the only long similarity,
there isn't much to compare it to". If you were able to put in longer
runs of non-similarity on either side, or additional non-similar
sequences, your significance would likely become more like what you are
expecting.

Does that help?

===========================================================================
Me

===========================================================================

Yes! That actually helps a lot. Thanks for the information.
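
Mark's point that the same run of similarity gets a different
significance in a different background can be illustrated with a toy
sketch. This is NOT Gemoda's significance calculation (see the
supplementary information linked above for that) and it does not
reproduce the 1.0 value. It only counts what fraction of window pairs
in a dataset match exactly, using plain identity instead of BLOSUM62.
The function names and the two background peptides are made up for
illustration.

```python
from itertools import combinations

def windows(seq, length):
    """All contiguous windows of the given length in seq."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def identity(a, b):
    """Number of positions at which two equal-length windows agree."""
    return sum(x == y for x, y in zip(a, b))

def fraction_at_least(seqs, length, threshold):
    """Fraction of cross-sequence window pairs with identity >= threshold.

    A crude stand-in for "how much does this signal stand out from the
    rest of the dataset"; not the statistic Gemoda computes.
    """
    hits = 0
    total = 0
    for s, t in combinations(seqs, 2):
        for a in windows(s, length):
            for b in windows(t, length):
                total += 1
                if identity(a, b) >= threshold:
                    hits += 1
    return hits / total

# The two peptides from the question, with window length 4 (as in -l 4).
peptides = ["QRQHSP", "QRQHSPV"]

# Hypothetical unrelated peptides, invented purely as background.
background = ["AAGLWTKE", "MNDYFCIL"]

alone = fraction_at_least(peptides, 4, 4)
with_background = fraction_at_least(peptides + background, 4, 4)

print("fraction of exact-match window pairs, two peptides only:", alone)
print("fraction of exact-match window pairs, with background:  ", with_background)
```

With only the two peptides, the exact matches make up a large share of
all window pairs, so there is little to compare the signal against.
Adding unrelated sequences makes the same matches a much smaller share
of the dataset, which is the direction Mark describes.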
