Some Notes comparing regular expressions and PWM

2015-01-13

++ Some Notes comparing regular expressions and PWM

Position Weight Matrix (PWM) and regular expression

<s:Position weight matrix and regular expression>



http://www.cs.cmu.edu/~epxing/Class/10810-05/Lecture6.pdf


{


Regular expressions can be limiting

• The regular expression syntax is still too rigid to represent many highly divergent protein motifs.
• Also, short patterns are sometimes insufficient with today’s large databases. Even requiring perfect matches you might find many false positives. On the other hand, some real sites might not be perfect matches.
• We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position.

ftp.cse.buffalo.edu/users/azhang/disc/Seq_pattern_I.ppt



Regular expression for:
Pattern matching (sequence motifs),
Pattern discovery (promoter elements).

Position Weight Matrix (PWM) for:
Pattern matching (TransFac, TESS, etc.),
Pattern discovery (MEME, Gibbs sampling).

Hidden Markov Models (HMMs) for protein domain analysis (next lecture).

http://en.wikipedia.org/wiki/Sequence_motif



A matrix of numbers containing scores for each residue or nucleotide at each position of a fixed-length motif. There are two types of weight matrices.


A position frequency matrix (PFM) records the position-dependent frequency of each residue or nucleotide. PFMs can be experimentally determined from SELEX experiments or computationally discovered by tools such as MEME using hidden Markov models.


A position weight matrix (PWM) contains log odds weights for computing a match score. A cutoff is needed to specify whether an input sequence matches the motif or not. PWMs are calculated from PFMs.
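A rough sketch of the PFM-to-PWM step and the cutoff described above. The counts, the pseudocount of 1, the uniform 0.25 background, and the cutoff of 3.0 are all invented for illustration and do not come from any of the sources quoted here; the function names are hypothetical too.

```python
import math

BASES = "ACGT"

# Hypothetical PFM: one dict of base counts per motif position,
# as if tallied from 10 aligned binding sites (invented numbers).
pfm = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 2, "C": 2, "G": 1, "T": 5},
]

def pfm_to_pwm(pfm, background=0.25, pseudocount=1.0):
    """Turn counts into log2-odds weights against a uniform background.

    The pseudocount keeps zero counts from producing log(0).
    """
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def score(pwm, site):
    """Match score: sum of the weights of the observed base at each position."""
    return sum(column[base] for column, base in zip(pwm, site))

pwm = pfm_to_pwm(pfm)
cutoff = 3.0  # assumed threshold; a real one has to be chosen per motif

for site in ("ACGT", "ACGA", "TTTT"):
    s = score(pwm, site)
    print(site, round(s, 2), "match" if s >= cutoff else "no match")
```

With these made-up numbers the consensus site scores highest, a site with one mismatch in the weakly constrained last position still clears the cutoff, and an unrelated site falls well below it.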

http://biochem218.stanford.edu/Projects%202012/Lin.pdf
^I like the easy to understand wording of this document



Most motif finding algorithms fall into two major groups based on the combinatorial approach used: (1) word-based (string-based) method, represented by regular expressions (RE), or (2) probabilistic sequence models based on position weight matrices (PWM) [3]. The two methods have their own strengths and weaknesses.








----------------------


-regular expressions
--This method is a good choice for finding motifs where all instances are identical. However, for typical transcription factor motifs that often have several weakly constrained positions, the word-based method can suffer





-PWM
--This model assumes that each position in the motif is statistically independent of the others.
--Compared with word-based methods, PWMs allow a more flexible description of motifs, because each letter can match a particular motif position to varying degrees rather than simply matching or not matching.
--The main disadvantage of PWMs for motif discovery is that they are far more difficult for computer algorithms to search for.
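To make the "match or no match" versus "varying degrees" contrast concrete, here is a minimal sketch. The motif (TATA followed by A or T, then A) and every probability in the table are invented for illustration; `site_probability` is a hypothetical helper, not part of any tool mentioned in these notes.

```python
import re

# Word-based description: a hypothetical consensus as a regular expression.
motif_re = re.compile(r"TATA[AT]A")

# Probabilistic description of the same hypothetical motif:
# one distribution over bases per position (each row sums to 1).
motif_probs = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.50, "C": 0.05, "G": 0.05, "T": 0.40},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]

def site_probability(site):
    """Multiply per-position probabilities (positions assumed independent)."""
    p = 1.0
    for column, base in zip(motif_probs, site):
        p *= column[base]
    return p

for site in ("TATAAA", "TATATA", "TATACA", "GCGCGC"):
    exact = motif_re.fullmatch(site) is not None
    print(site, "regex match:", exact, "probability:", f"{site_probability(site):.2e}")
```

The regular expression treats TATACA and GCGCGC identically (both fail), while the per-position distributions rank the one-mismatch site far above the unrelated one.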








-There are myriad algorithms available for motif finding, each with its own advantages and disadvantages. Diverse approaches, including combinatorial enumeration, probabilistic modeling, mathematical programming, neural networks, and genetic algorithms, have been used.

http://www.jbc.org/content/280/22/21491.full.pdf+html



The ability to perform both regular expression searches and weight matrix searches on a large set of putative promoters is unique to our tool

www.stat115.org/lectures/MotifFinding.ppt



SeqLogo consists of stacks of symbols, one stack for each position in the sequence.
The overall height of the stack indicates the sequence conservation at that position.
The height of symbols within the stack indicates the relative frequency of each nucleic acid at that position.
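A small sketch of how these heights are commonly computed, assuming conservation is measured as information content in bits (log2 of the alphabet size minus the Shannon entropy of the column) and ignoring the small-sample correction that real logo tools usually apply. The example columns and function names are made up for illustration.

```python
import math

def stack_height(column):
    """Information content of one position in bits.

    column maps each base to its relative frequency (values sum to 1).
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(len(column)) - entropy

def symbol_heights(column):
    """Each symbol gets a share of the stack proportional to its frequency."""
    total = stack_height(column)
    return {base: p * total for base, p in column.items()}

# Hypothetical columns: fully conserved, two equally likely bases, uniform.
columns = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

for column in columns:
    print(round(stack_height(column), 3), symbol_heights(column))
```

Under this definition a fully conserved DNA position gives a 2-bit stack, and a position where all four bases are equally likely gives a stack of height 0.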

}
</s:Position weight matrix and regular expression>