Bioinformatics: problem solving paradigms
2014-11-02 azim58 - Bioinformatics: problem solving paradigms
Bioinformatics: Problem Solving Paradigms
"C:\Users\kurtw_000\Documents\kurt\storage\Documents\Books\bioinformatics problem solving paradigms\whole book\Bioinformatics Problem Solving Paradigms.pdf"
On page 134
as of 7-14-13
received book from Tempe Public Library through ILL (interlibrary loan) a few days ago
book due July 30th
some notes (newest top to oldest bottom)
It is interesting to see that often applications of dynamic programming may be reinterpreted as Viterbi equations for a suitably designed Hidden Markov model. This shows, in a certain sense, a certain generality of the Hidden Markov approach.
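A minimal Viterbi sketch (my own illustration, not code from the book) to make the dynamic programming connection concrete: the toy two-state model and all probabilities below are made-up assumptions, but the recurrence is the standard Viterbi one.

# Viterbi decoding for a toy HMM, written as a plain dynamic program.
# The model (states, start/transition/emission probabilities) is made up
# for illustration; any HMM in this dict form would work.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best state path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: V[t - 1][r] * trans_p[r][s])
            V[t][s] = V[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][observations[t]]
            back[t][s] = best_prev
    # trace back the best path
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

states = ("exon", "intron")
start_p = {"exon": 0.6, "intron": 0.4}
trans_p = {"exon": {"exon": 0.8, "intron": 0.2},
           "intron": {"exon": 0.3, "intron": 0.7}}
emit_p = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
          "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
print(viterbi("ACGTGT", states, start_p, trans_p, emit_p))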
Knuth-Morris-Pratt algorithm
The Knuth-Morris-Pratt algorithm requires p times a preprocessing of a pattern of length m, followed by p times running through text T (of length n). This leads to a complexity of O(pm + pn). Using suffix trees this can be done better.
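For reference, a short sketch of the Knuth-Morris-Pratt idea (my own code, not the book's): preprocess the pattern into a failure table in O(m), then scan the text once in O(n), which is where the O(pm + pn) total for p patterns comes from.

# Knuth-Morris-Pratt: O(m) preprocessing of the pattern, O(n) scan of the text.

def kmp_failure(pattern):
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    fail = kmp_failure(pattern)
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)   # match ends at position i
            k = fail[k - 1]
    return hits

print(kmp_search("GTAGTAGTA", "AGTA"))   # -> [2, 5]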
Suffix trees represent all suffixes of a string T as labels of positions in a maximally compressed manner.
Soft Pattern Matching = Alignment
Computing a multiple alignment is NP-hard
gene assembly
The task arises to select an ascending chain r of non-overlapping candidate exons (the black shaded areas in Fig. 2.14) whose concatenation r* gives a greater or equal optimal alignment score with T than all other selections of ascending chains.
Thus, with some confidence, we find out which protein is probably synthesized by the expected gene within string G and, moreover, which substrings of G are the putative exons within that gene. Of course, simply trying out all ascending chains one by one is not a good idea for obtaining an efficient algorithm. We will later see that clever usage of dynamic programming leads to an efficient solution (Chap. 3).
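A rough sketch of how such a chain selection can be done by dynamic programming (my own simplification: each candidate exon is assumed to already carry a precomputed score, and "ascending" is read as non-overlapping and ordered by position; the real method scores by alignment against T).

# Toy dynamic program for picking a best ascending chain of non-overlapping
# candidate exons. Each candidate is (start, end, score); the scores stand in
# for the alignment contribution of that exon, which the real method would
# compute by alignment rather than take as given.

def best_exon_chain(candidates):
    cands = sorted(candidates, key=lambda c: c[1])            # by end position
    best = [0.0] * (len(cands) + 1)                           # best[i]: using first i candidates
    choice = [None] * (len(cands) + 1)
    for i, (start, end, score) in enumerate(cands, 1):
        # last earlier candidate that ends before this one starts (linear scan for simplicity)
        j = i - 1
        while j > 0 and cands[j - 1][1] >= start:
            j -= 1
        take = best[j] + score
        if take > best[i - 1]:
            best[i], choice[i] = take, (j, i - 1)             # take candidate i-1
        else:
            best[i], choice[i] = best[i - 1], None            # skip it
    # reconstruct the chosen chain
    chain, i = [], len(cands)
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            j, idx = choice[i]
            chain.append(cands[idx])
            i = j
    return best[-1], list(reversed(chain))

# hypothetical candidate exons (start, end, score)
print(best_exon_chain([(0, 10, 3.0), (5, 20, 5.0), (12, 30, 4.0), (31, 40, 2.0)]))
# picks the three compatible exons with total score 9.0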
The ability to solve a particular optimization problem in a recursive manner depends on the validity of the so-called Bellman principle (see [8], reprint [9]): assembly of an optimal solution y of instance X requires optimal solutions y1, ..., yk of sub-instances X1, ..., Xk.
dynamic programming
Stated differently, top-down recursive computation is replaced by a bottom-up iterative computation of solutions of sub-instances.
Consider testing satisfiability of Boolean formulas φ(x1, ..., xn) having n Boolean variables. Such a formula is satisfiable if and only if at least one of the formulas φ('true', x2, ..., xn), φ('false', x2, ..., xn) is satisfiable. This is a recursive formulation of satisfiability. Here, we end up with 2^n different subcalls. Indeed, satisfiability is an NP-hard problem so that a polynomial time solution is not possible, unless P = NP.
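A tiny sketch of that recursive formulation (mine, not the book's): a formula is given as a Python function of Booleans, and we branch on one variable at a time, which is exactly the exponential blow-up described.

# Recursive satisfiability test by branching on one variable at a time.
# Short-circuit "or" prunes some branches, but the worst case is still 2^n leaves.

def satisfiable(formula, n, assignment=()):
    if len(assignment) == n:
        return formula(*assignment)
    return (satisfiable(formula, n, assignment + (True,)) or
            satisfiable(formula, n, assignment + (False,)))

# example formula: (x1 or not x2) and (not x1 or x3)
print(satisfiable(lambda x1, x2, x3: (x1 or not x2) and (not x1 or x3), 3))  # True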
Thus the following general problem is associated with every sort of adaptive system with parameters that may be optimally fitted to training data. HIDDEN MARKOV MODEL TRAINING Given the structure of a Hidden Markov model (i.e. number of states and connection structure, but without having fixed parameters) and observed data D1, ..., Dk, compute model parameters such that the resulting model has maximum likelihood of generating the data.
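The quantity being maximized is the likelihood of the data under the model. Below is a small sketch of the forward algorithm that computes this likelihood for one observation sequence under fixed (made-up) parameters; the training step itself, which re-estimates the parameters to increase this value, is typically the Baum-Welch / EM procedure and is not shown here.

# Forward algorithm: likelihood of one observed sequence under fixed HMM
# parameters. Training repeatedly re-estimates start_p, trans_p, emit_p to
# increase exactly this quantity over the training sequences.

def likelihood(observations, states, start_p, trans_p, emit_p):
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][obs]
                 for s in states}
    return sum(alpha.values())

# toy two-state model; all numbers are made up for illustration
states = ("H", "L")
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
emit_p = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
          "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
print(likelihood("GGCA", states, start_p, trans_p, emit_p))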
Indeed, there are successful approaches to contact map prediction using support vector machines.
Recursive procedures often lead to exponential running time (measured in terms of input length), with the exact running time estimation depending on the number and size of the generated sub-instances and the costs for assembling sub-solutions into a solution of the original call.
The most prominent and simplest application of dynamic programming is the computation of Fibonacci numbers.
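The Fibonacci example in a few lines (standard bottom-up version, my own code): the naive recursion recomputes the same sub-instances exponentially often, while the table-filling version is linear.

# Naive recursion: exponential, because fib(n-2) etc. are recomputed over and over.
def fib_naive(n):
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

# Dynamic programming: fill the table of sub-solutions bottom-up, O(n).
def fib_dp(n):
    table = [0, 1]
    for i in range(2, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[n]

print(fib_dp(40))          # 102334155, instantly
# fib_naive(40) gives the same value but takes noticeably longer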
The main steps of any dynamic programming approach can be summarized as follows:
We illustrate these approaches by a couple of bioinformatics problems. The simplest example is alignment of strings. Here, dynamic programming in its basic and pure form can be studied
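A compact sketch of global alignment scoring by dynamic programming (Needleman-Wunsch style recurrence; the +1/-1/-1 scoring is an arbitrary choice of mine): D[i][j] is the best score of aligning the prefixes s[:i] and t[:j].

# Global alignment score by dynamic programming.
# Scoring (+1 match, -1 mismatch, -1 gap) is chosen just for the sketch.

def alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i * gap
    for j in range(1, len(t) + 1):
        D[0][j] = j * gap
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = D[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            D[i][j] = max(diag, D[i - 1][j] + gap, D[i][j - 1] + gap)
    return D[len(s)][len(t)]

print(alignment_score("GATTACA", "GCATGCU"))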
Concluding this brief introduction of suffix trees we can state that they are a main driving force within bioinformatics tools: "no suffix trees, no bioinformatics."
There are at least three reasons why exact string matching applications as discussed in the section before are of secondary interest in bioinformatics.
By deleting spacing symbols we obtain a string with maximum summed alignment score with all strings occurring in the lines of the alignment. Such a most similar string is known in algorithm theory as a Steiner string.
solutions which are at least half as good as the optimal solution. Such methods are called 2-approximation algorithms
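A sketch of the usual center-string idea behind such a 2-approximation (my reading of the passage, with plain edit distance standing in for the alignment score): among the input strings themselves, pick the one with the smallest summed distance to all others.

# Center string sketch: pick the input string with the smallest summed edit
# distance to all the others. Under metric distances this is the classical
# 2-approximation to the optimal Steiner string; edit distance is used here
# in place of the (negated) alignment score of the passage.

def edit_distance(a, b):
    D = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, D[0] = D[0], i
        for j in range(1, len(b) + 1):
            cur = min(prev + (a[i - 1] != b[j - 1]),   # substitute / match
                      D[j] + 1,                        # delete from a
                      D[j - 1] + 1)                    # insert into a
            prev, D[j] = D[j], cur
    return D[len(b)]

def center_string(strings):
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))

print(center_string(["ACGT", "ACGA", "AGGT", "TCGT"]))   # "ACGT"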
Introns between exons exhibit the GT-AG motif at the beginning and end.
Exons show a considerably higher frequency of CpG pairs than non-exon regions do
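A trivial illustration of both signals (the example sequences are made up by me): check a candidate intron for the GT...AG rule and compare CpG dinucleotide frequencies of two regions.

# Two simple sequence signals mentioned above: the GT...AG rule for introns
# and CpG (the dinucleotide "CG") frequency as a weak exon indicator.

def looks_like_intron(seq):
    return seq.startswith("GT") and seq.endswith("AG")

def cpg_frequency(seq):
    if len(seq) < 2:
        return 0.0
    return sum(seq[i:i + 2] == "CG" for i in range(len(seq) - 1)) / (len(seq) - 1)

print(looks_like_intron("GTAAGTCTCATTTTAG"))            # True
print(cpg_frequency("ACGCGTCGA"), cpg_frequency("ATATTAAT"))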
only a solid theoretical competence offers a chance of obtaining valid estimations of problem complexity and, on this basis, optimal algorithms.
obtaining good protein structure prediction methods thus counts as one of the Holy Grails of bioinformatics
PQ trees
The following concept of a tree with two sorts of nodes, P-nodes and Q-nodes, simplifies descriptions considerably.
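A minimal way to picture the two node types (a toy encoding of mine, not a real PQ-tree implementation): a P-node allows its children in any order, a Q-node only in the given order or its reverse; the admissible leaf orders are the tree's frontiers.

from itertools import permutations

# Toy PQ-tree: a leaf is a string; an internal node is ("P", children) or
# ("Q", children). This only enumerates admissible frontiers; real PQ-tree
# algorithms additionally update the tree under new constraints.

def frontiers(node):
    if isinstance(node, str):
        return [[node]]
    kind, children = node
    child_orders = ([list(p) for p in permutations(children)] if kind == "P"
                    else [list(children), list(reversed(children))])
    result = []
    for order in child_orders:
        combos = [[]]
        for child in order:
            combos = [c + f for c in combos for f in frontiers(child)]
        result.extend(combos)
    return result

tree = ("P", ["a", ("Q", ["b", "c", "d"])])
for f in frontiers(tree):
    print("".join(f))
# "a" may come before or after the Q-block, and b,c,d only as "bcd" or "dcb"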
As is well-known in complexity theory, this innocent-looking problem of separating a list into two sublists with identical sums is an NP-hard problem, known as the PARTITION problem.
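For intuition: PARTITION is NP-hard in general, but when the numbers are small integers the standard subset-sum dynamic program decides it in pseudo-polynomial time. A sketch (my own, not from the book):

# PARTITION via subset-sum DP: can the list be split into two halves with equal
# sums? NP-hard in general, but O(n * sum) when the numbers are small integers.

def can_partition(numbers):
    total = sum(numbers)
    if total % 2:
        return False
    target = total // 2
    reachable = {0}                      # sums achievable with some subset so far
    for x in numbers:
        reachable |= {s + x for s in reachable if s + x <= target}
    return target in reachable

print(can_partition([3, 1, 1, 2, 2, 1]))   # True: 3+2 = 1+1+2+1
print(can_partition([2, 3, 4]))            # False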
Thus S contains all fragments as substrings. We say that S is a common superstring for the considered fragments.
SHORTEST COMMON SUPERSTRING Given a substring-free set of strings, find a shortest common superstring.
SHORTEST COMMON SUPERSTRING is an NP-hard problem, thus there is no polynomial time algorithm solving it (unless P = NP). In Chap. 6, a polynomial time algorithm is presented that returns a common superstring that is at most four times longer than a shortest common superstring. Such an algorithm is called a 4-approximation algorithm.
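A sketch of the classical greedy merge heuristic for this problem (my own code; whether this is exactly the Chap. 6 algorithm is an assumption on my part): repeatedly merge the pair of strings with the largest suffix/prefix overlap.

# Greedy common superstring: repeatedly merge the two strings with the largest
# suffix/prefix overlap. This returns *a* common superstring, not necessarily
# a shortest one.

def overlap(a, b):
    # largest k such that the last k characters of a equal the first k of b
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings):
    strings = list(strings)
    while len(strings) > 1:
        best = max(((overlap(a, b), a, b)
                    for a in strings for b in strings if a != b),
                   key=lambda t: t[0])
        k, a, b = best
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[k:])        # merge a and b over their overlap
    return strings[0]

fragments = ["ACGTT", "GTTAC", "TACGG"]   # made-up substring-free fragments
s = greedy_superstring(fragments)
print(s, all(f in s for f in fragments))  # a common superstring containing all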
LONGEST COMMON SUBSTRING Given a set of strings, find a longest common substring.
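For two strings this is a textbook dynamic program (the book's version is over a set of strings; the two-string sketch below, my own, shows the idea): L[i][j] is the length of the longest common suffix of s[:i] and t[:j].

# Longest common substring of two strings by dynamic programming.
# L[i][j] = length of the longest common suffix of s[:i] and t[:j];
# the answer is the maximum entry.

def longest_common_substring(s, t):
    L = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best_len, best_end = 0, 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
                if L[i][j] > best_len:
                    best_len, best_end = L[i][j], i
    return s[best_end - best_len:best_end]

print(longest_common_substring("GATTACA", "ATTACCA"))   # "ATTAC"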