Some Notes comparing regular expressions and PWM
2015-01-13
++ Some Notes comparing regular expressions and PWM
Position Weight Matrix (PWM and regular expression)
<s:Position weight matrix and regular expression>
{
http://www.cs.cmu.edu/~epxing/Class/10810-05/Lecture6.pdf
{
Regular expressions can be limiting
• The regular expression syntax is still too rigid to represent many highly divergent protein motifs.
• Also, short patterns are sometimes insufficient with today's large databases. Even requiring perfect matches you might find many false positives. On the other hand some real sites might not be perfect matches.
• We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position.
}
ftp.cse.buffalo.edu/users/azhang/disc/Seq_pattern_I.ppt
{
Regular expression for: Pattern matching (sequence motifs), Pattern discovery (promoter elements).
Position Weight Matrix (PWM) for: Pattern matching (TransFac, TESS, etc), Pattern discovery (MEME, Gibbs sampling).
Hidden Markov Models (HMMs) for protein domain analysis (next lecture).
}
http://en.wikipedia.org/wiki/Sequence_motif
{
A matrix of numbers containing scores for each residue or nucleotide at each position of a fixed-length motif. There are two types of weight matrices.
A position frequency matrix (PFM) records the position-dependent frequency of each residue or nucleotide. PFMs can be experimentally determined from SELEX experiments or computationally discovered by tools such as MEME using hidden Markov models.
A position weight matrix (PWM) contains log odds weights for computing a match score. A cutoff is needed to specify whether an input sequence matches the motif or not. PWMs are calculated from PFMs.
}
http://biochem218.stanford.edu/Projects%202012/Lin.pdf
^I like the easy to understand wording of this document
{
Most motif finding algorithms fall into two major groups based on the combinatorial approach used: (1) word-based (string-based) method, represented by regular expressions (RE), or (2) probabilistic sequence models based on position weight matrices (PWM) [3]. The two methods have their own strengths and weaknesses.
----------------------
-regular expressions
--This method is a good choice for finding motifs where all instances are identical. However, for typical transcription factor motifs that often have several weakly constrained positions, the word based method can suffer
-PWM
--This model assumes that each position in the motif is statistically independent of the others.
--The advantage for probabilistic approaches is that, compared with word based methods, can have each letter match a particular motif position to varying degrees, rather than just match or no match.
--On the other hand, PWMs allow for a more flexible description of motifs because each letter can match a particular motif position to varying degree rather than simply matching or not matching
--The main disadvantage of PWMs for motif discovery is that they are far more difficult for computer algorithms to search for
-There are myriads of algorithms available for motif finding, each with their advantage and disadvantages.
Diverse approaches, including combinatorial enumeration, probabilistic modeling, mathematical programming, neural networks, and genetic algorithms, have been used
}
http://www.jbc.org/content/280/22/21491.full.pdf+html
{
The ability to perform both regular expression searches and weight matrix searches on a large set of putative promoters is unique to our tool
}
www.stat115.org/lectures/MotifFinding.ppt
{
SeqLogo consists of stacks of symbols, one stack for each position in the sequence
The overall height of the stack indicates the sequence conservation at that position
The height of symbols within the stack indicates the relative frequency of nucleic acid at that position
}
}
</s:Position weight matrix and regular expression>
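My own sketch (not from any of the sources above) to make the regular expression vs. PWM difference concrete. It builds a PFM from a few aligned sites, converts it to a log-odds PWM, and scans a sequence with a cutoff, next to a plain regular expression search. The example sites, the test sequence, the pseudocount of 1, the uniform background, and the cutoff of 5.0 are all made-up assumptions for illustration, and the function names are hypothetical.

```python
import math
import re

# Hypothetical aligned binding sites (made up for illustration, not real data).
SITES = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT", "TATATT"]
ALPHABET = "ACGT"
# Assumption: every base is equally likely in the background.
BACKGROUND = {base: 0.25 for base in ALPHABET}


def build_pfm(sites):
    """Position frequency matrix: count of each base at each motif position."""
    pfm = [{base: 0 for base in ALPHABET} for _ in range(len(sites[0]))]
    for site in sites:
        for pos, base in enumerate(site):
            pfm[pos][base] += 1
    return pfm


def build_pwm(pfm, pseudocount=1.0):
    """Position weight matrix: log2 odds of each base versus the background.

    The pseudocount (an assumed value) avoids log(0) for unseen bases.
    """
    pwm = []
    for counts in pfm:
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pwm.append({
            base: math.log2((counts[base] + pseudocount) / total / BACKGROUND[base])
            for base in ALPHABET
        })
    return pwm


def score(pwm, window):
    """Sum the per-position weights (positions are treated as independent)."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, cutoff):
    """Return (start, window, score) for every window scoring at or above cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        window_score = score(pwm, window)
        if window_score >= cutoff:
            hits.append((start, window, window_score))
    return hits


SEQUENCE = "GGTATAATCCTATGATGG"  # made-up test sequence
PWM = build_pwm(build_pfm(SITES))

# Exact-word regular expression: only perfect copies of the consensus match.
exact_hits = [m.start() for m in re.finditer("TATAAT", SEQUENCE)]

# Character-class regular expression: matches the variants, but every allowed
# letter counts equally, so there is no ranking of strong vs. weak sites.
class_hits = [m.start() for m in re.finditer("[TG]A[TC][AG][AT]T", SEQUENCE)]

# PWM scan: graded scores, with a cutoff (assumed value) deciding what counts.
pwm_hits = scan(PWM, SEQUENCE, cutoff=5.0)

print("exact regex starts:", exact_hits)
print("class regex starts:", class_hits)
for start, window, window_score in pwm_hits:
    print(f"PWM hit at {start}: {window} score={window_score:.2f}")
```

With these made-up inputs the exact regular expression only finds the perfect site, the character-class expression finds both sites but cannot say which one is stronger, and the PWM finds both and gives the perfect site the higher score. Moving the cutoff up or down trades false positives against missed sites, which is the "cutoff is needed" point in the Wikipedia quote.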
azim58wiki: