Some Notes comparing regular expressions and PWM
2015-01-13++ Some Notes comparing regular expressions and PWM
Position Weight Matrix (PWM) and regular expression
<s:Position weight matrix and regular expression>
http://www.cs.cmu.edu/~epxing/Class/10810-05/Lecture6.pdf
{
Regular expressions can be limiting
• The regular expression syntax is still too rigid to represent many highly divergent protein motifs.
• Also, short patterns are sometimes insufficient with today's large databases. Even requiring perfect matches you might find many false positives. On the other hand some real sites might not be perfect matches.
• We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position.
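To make the point about "apparently equally likely alternatives" concrete, here is a minimal sketch. The aligned sites and the pattern are invented for illustration and do not come from the lecture. The regular expression accepts every listed alternative equally, while per-position counts keep how often each letter was actually observed.

```python
import re
from collections import Counter

# Hypothetical aligned binding sites, invented for illustration only.
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA", "TACAAA"]

# In a regular expression, the alternatives inside [...] are all equally
# acceptable: [TC] does not record that T is far more common than C.
pattern = re.compile(r"TA[TC]A[AT]A")

# One Counter per alignment column keeps the observed distribution instead.
counts = [Counter(column) for column in zip(*sites)]

for site in sites:
    print(site, "matches" if pattern.fullmatch(site) else "no match")

for position, column in enumerate(counts, start=1):
    total = sum(column.values())
    print(position, {base: column[base] / total for base in "ACGT"})
```

A site such as TAGAAA is rejected outright by the pattern, even though it differs from the common form at only one position.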
ftp.cse.buffalo.edu/users/azhang/disc/Seq_pattern_I.ppt
Regular expression for:
Pattern matching (sequence motifs),
Pattern discovery (promoter elements).
Position Weight Matrix (PWM) for:
Pattern matching (TransFac, TESS, etc),
Pattern discovery (MEME, Gibbs sampling).
Hidden Markov Models (HMMs) for protein domain analysis (next lecture).
http://en.wikipedia.org/wiki/Sequence_motif
A matrix of numbers containing scores for each residue or nucleotide at each position of a fixed-length motif. There are two types of weight matrices.
A position frequency matrix (PFM) records the position-dependent frequency of each residue or nucleotide. PFMs can be experimentally determined from SELEX experiments or computationally discovered by tools such as MEME using hidden Markov models.
A position weight matrix (PWM) contains log odds weights for computing a match score. A cutoff is needed to specify whether an input sequence matches the motif or not. PWMs are calculated from PFMs.
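The step from PFM to PWM, and the role of the cutoff, can be sketched as follows. The counts, the uniform background of 0.25, the pseudocount of 1 and the cutoff of 3.0 are all assumptions chosen for illustration; real tools pick these differently.

```python
import math

BASES = "ACGT"

# Hypothetical PFM: counts of each base at each of 4 motif positions.
pfm = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 1, "C": 1, "G": 0, "T": 8},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]

def pfm_to_pwm(pfm, background=0.25, pseudocount=1.0):
    """Turn counts into log2 odds weights; the pseudocount avoids log(0)."""
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in BASES
        })
    return pwm

def score(pwm, word):
    """Match score of a word: the sum of the weights of its letters."""
    return sum(column[base] for column, base in zip(pwm, word))

pwm = pfm_to_pwm(pfm)
cutoff = 3.0  # assumed threshold, chosen only for this example

for word in ["TATA", "TACA", "GGGG"]:
    s = score(pwm, word)
    print(word, round(s, 2), "match" if s >= cutoff else "no match")
```

Raising or lowering the cutoff trades false positives against missed sites, which is why a PWM is not usable for matching until one is chosen.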
http://biochem218.stanford.edu/Projects%202012/Lin.pdf
^I like the easy to understand wording of this document
Most motif finding algorithms fall into two major groups based on the combinatorial approach used: (1) word-based (string-based) method, represented by regular expressions (RE), or (2) probabilistic sequence models based on position weight matrices (PWM) [3]. The two methods have their own strengths and weaknesses.
----------------------
-regular expressions
--This method is a good choice for finding motifs where all instances are identical. However, for typical transcription factor motifs that often have several weakly constrained positions, the word-based method can suffer
-PWM
--This model assumes that each position in the motif is statistically independent of the others.
--The advantage of probabilistic approaches over word-based methods is that each letter can match a particular motif position to varying degrees, rather than just match or no match.
--On the other hand, PWMs allow for a more flexible description of motifs because each letter can match a particular motif position to varying degrees rather than simply matching or not matching
--The main disadvantage of PWMs for motif discovery is that they are far more difficult for computer algorithms to search for
-There are myriads of algorithms available for motif finding, each with its advantages and disadvantages. Diverse approaches, including combinatorial enumeration, probabilistic modeling, mathematical programming, neural networks, and genetic algorithms, have been used
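The contrast drawn in the notes above, match or no match versus matching to varying degrees, can be seen by scanning one sequence with both models. This is a rough sketch: the weights, the pattern and the sequence are invented, and the per-window sum relies on the independence assumption noted above.

```python
import re

# Hypothetical log-odds weights for a 4-letter motif (values invented).
pwm = [
    {"A": -1.8, "C": -1.8, "G": -1.8, "T": 1.7},
    {"A": 1.5, "C": -1.8, "G": -1.8, "T": -0.8},
    {"A": -0.8, "C": -0.8, "G": -1.8, "T": 1.4},
    {"A": 1.7, "C": -1.8, "G": -1.8, "T": -1.8},
]

# Word-based model: only the exact word counts as a hit.
pattern = re.compile(r"TATA")

def scan(sequence, pwm, pattern):
    """Slide a window over the sequence; report regex hit and PWM score."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Positions are treated as independent, so the score is a plain sum.
        pwm_score = sum(column[base] for column, base in zip(pwm, window))
        is_word = pattern.fullmatch(window) is not None
        hits.append((start, window, is_word, round(pwm_score, 2)))
    return hits

hits = scan("GGTATACGTACAGG", pwm, pattern)
for start, window, is_word, pwm_score in hits:
    print(start, window, is_word, pwm_score)
```

The window TACA is a plain miss for the regular expression but still gets a positive PWM score, lower than the score for TATA.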
http://www.jbc.org/content/280/22/21491.full.pdf+html
The ability to perform both regular expression searches and weight matrix searches on a large set of putative promoters is unique to our tool
www.stat115.org/lectures/MotifFinding.ppt
SeqLogo consists of stacks of symbols, one stack for each position in the sequence
The overall height of the stack indicates the sequence conservation at that position
The height of symbols within the stack indicates the relative frequency of each nucleic acid at that position
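One common way to compute these heights, sketched here as an assumption since the slides excerpted above do not give the formula, is to take the information content of a position in bits (log2 of the alphabet size minus the Shannon entropy) as the stack height, and scale each symbol by its frequency. The frequencies below are invented, and the small-sample correction used by some logo tools is left out.

```python
import math

BASES = "ACGT"

# Hypothetical per-position base frequencies (each column sums to 1).
freqs = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},      # fully conserved
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},      # partly conserved
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # no conservation
]

def stack_heights(column):
    """Return (stack height in bits, height of each symbol in the stack)."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    information = math.log2(len(BASES)) - entropy
    symbols = {base: column[base] * information for base in BASES}
    return information, symbols

for position, column in enumerate(freqs, start=1):
    information, symbols = stack_heights(column)
    print(position, round(information, 3),
          {base: round(h, 3) for base, h in symbols.items()})
```

Under this formula a fully conserved DNA position gives the tallest possible stack (2 bits), and a position with uniform frequencies gives a stack of height zero.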
}
</s:Position weight matrix and regular expression>