Bioinformatics Genes Proteins and Computers

2015-06-28

"C:\Users\kurtw_000\Documents\kurt\storage\Documents\Books\2014\08-25-2014d1251\Bioinformatics Genes Proteins and Computers.pdf"

I bought this book for $20.59 on 7/6/14
finished book pre 5-3-15

Bioinformatics: Genes, Proteins and Computers

by C.A. Orengo, D.T. Jones and J.M. Thornton
2003

bought this book 07-06-2014d0720
Heidi got this book. It mentions Markov models and support vector machines. It also looks like it has some good diagrams and pictures and is written well.

tried to sell on Amazon and gave it to Mengjiao to sell, but I closed the listing on 12-07-2014d1203. Here was the info on Amazon 12-07-2014d1203

Some Notes

q
The alignment of two biological sequences is the cornerstone of bioinformatics

Combinatorics **
pg. 87 of Bioinformatics Genes Proteins and Computers
permute n objects of k types 09-28-2014d1905
Suppose you have 10 colored balls, of which five are red, two green and three yellow.
The number of distinct ways you can order them in a line is 10!/(5!2!3!) = 2520. More
generally, given N objects that fall into K types, the number of ways they can be
permuted is given by the multinomial coefficient:

W = N!/(n1!*n2!*...*nK!)

where ni is the number of objects of type i, and n! is the factorial of n (e.g., for n = 5, n! = 5*4*3*2*1 = 120). Factorials produce big numbers and as the number of objects increases, W soon becomes awkward to calculate directly. Fortunately, we can find ln W with relative ease. Stirling's approximation states that ln N! approximately equals N*ln N - N. So substituting this into the multinomial coefficient above gives

ln W = -N*Sum(pi*ln pi)

where pi = ni/N, the fractional frequency of the ith type. Divide both sides by N*ln2 and you get Shannon's entropy:

S = -Sum(pi*log2(pi))

. . .
Thus, S is a convenient and intuitive measure of diversity of types among a set of symbols
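
My own sanity check of the passage above (not from the book): a small Python sketch that computes the multinomial coefficient for the ball example, the Shannon entropy S, and compares the exact ln W with the Stirling-based estimate N*ln(2)*S. The function names are mine, and the scaled-up counts are made-up inputs used only to see how the approximation behaves.

```python
from math import factorial, lgamma, log, log2

def multinomial(counts):
    """W = N!/(n1!*n2!*...*nK!): distinct orderings of N objects of K types."""
    w = factorial(sum(counts))
    for n_i in counts:
        w //= factorial(n_i)   # each partial quotient is an exact integer
    return w

def shannon_entropy(counts):
    """S = -Sum(pi*log2(pi)) in bits, with pi = ni/N."""
    total = sum(counts)
    return -sum((n_i / total) * log2(n_i / total) for n_i in counts if n_i > 0)

def ln_w_exact(counts):
    """ln W without huge integers, using lgamma(n + 1) = ln n!."""
    return lgamma(sum(counts) + 1) - sum(lgamma(n_i + 1) for n_i in counts)

def ln_w_stirling(counts):
    """Stirling-based estimate: ln W ~ -N*Sum(pi*ln pi) = N*ln(2)*S."""
    return sum(counts) * log(2) * shannon_entropy(counts)

balls = [5, 2, 3]                          # red, green, yellow
print(multinomial(balls))                  # 2520
print(round(shannon_entropy(balls), 3))    # 1.485
for scale in (1, 100, 10000):              # same proportions, more objects
    counts = [n_i * scale for n_i in balls]
    print(scale, ln_w_exact(counts), ln_w_stirling(counts))
```

For only 10 balls the estimate is rough (about 10.3 against the exact ln 2520 of about 7.83), but the ratio of the two moves toward 1 as the counts grow, which is the regime the approximation is meant for.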

Zvelebil's truth table of amino acid properties
properties: hydrophobic, polar, small, proline, tiny, aliphatic, aromatic, positive, negative, charged

The most common type of mutation data score is the sum of pairs (SP) score.

Combinatorics

number of pairs from n objects 09-28-2014d1907

How many pairs can you make from n objects?
n*(n-1)/2
pg. 89 of Bioinformatics Genes Proteins and Computers
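
A quick check of the pair count in Python (my own sketch, not from the book). I am assuming this is the count that matters for a sum-of-pairs style score, where every pair of sequences in an alignment column is compared once; the four sequence labels are placeholders.

```python
from itertools import combinations
from math import comb

def n_pairs(n):
    """Number of unordered pairs that can be formed from n objects."""
    return n * (n - 1) // 2

seqs = ["A", "B", "C", "D"]                # placeholder sequence labels
pairs = list(combinations(seqs, 2))        # enumerate every unordered pair
print(pairs)
print(len(pairs), n_pairs(len(seqs)), comb(len(seqs), 2))  # 6 6 6
```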

q
there is no single best way to measure
either amino acid similarity or sequence conservation.

q
Protein Domain Database

q
To allow the identification of homologous domains with low sequence identity (<30%), pattern and profile methods enhance sequence information

q
Motifs are typically about 10-20 aa

permissive regular expression (fuzzy regular expressions) (BLOCKS and PRINTS databases)
exact regular expression (PROSITE)
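
To make the "exact regular expression" idea concrete, here is a rough sketch of my own that turns a PROSITE-style pattern into a Python regular expression. The translation rules are the PROSITE syntax as I remember it (x = any residue, [..] = allowed residues, {..} = forbidden residues, (n) or (n,m) = repeats, < and > = sequence ends), so treat them as an assumption. The example pattern is the N-glycosylation site pattern as I recall it, and the test sequence is invented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                          # patterns end with a period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = 2 to 4 residues
    regex = regex.replace("<", "^").replace(">", "$")    # anchors at sequence ends
    return regex

regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)                                             # N[^P][ST][^P]

sequence = "MKNGTAANPSQ"                                 # invented test sequence
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(hits)                                              # [(2, 'NGTA')]
```

Note that re.finditer only reports non-overlapping matches, so a real motif scan would need extra care where sites can overlap.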

macromolecular motif representations

Fingerprints

modularity of proteins (pg. 139; 1.2.5)

rest of annotations from whole book

Extracted Annotations (5/2/2015, 11:53:51 PM)

"The alignment of two biological sequences is the cornerstone of bioinformatics. This" (Demaria et al 2014:795)

Extracted Annotations (5/2/2015, 11:58:03 PM)

"6.6 Conclusion Although the PDB is currently much smaller than the sequence databases by nearly three orders of magnitude, the international structural genomics initiatives will help to populate the database and provid" (Demaria et al 2014:748)

"3 Structural motifs and functional analogs" (Demaria et al 2014:810)

"Methods to detect recurring structural motifs" (Demaria et al 2014:812)

Extracted Annotations (5/3/2015, 12:00:30 AM)

"Protein structure is much more conserved than sequence in evolution, so knowledge of the three-dimensional structure provides a powerful tool for determining even distant homologies between proteins. I" (Demaria et al 2014:723)

"SCOP domains wi" (Demaria et al 2014:725)

"it is striking that the distribution of the number of families containing different numbers of domains follows a power law in all genomes. This mea" (Demaria et al 2014:729)

"likely that the similarity in the domain family-size distributions across genomes is due to there being a few ubiquitously useful domain families and many small families that have specialized functions in all of the genomes." (Demaria et al 2014:729)

"comprehensive information on small molecule [...] E. coli is contained in the EcoCyc database." (Demaria et al 2014:734)

"Jensen model of evolution." (Demaria et al 2014:735)

"As mentioned above, recruitment of domain families across pathways is common, and these duplications involve either conservation of reaction chemistry or conservation of a cofactor- or minor substrate-binding site. This" (Demaria et al 2014:735)

"pathways are constructed by recruitment for the sake of catalytic mechanism, with few instances of duplication of enzymes within a pathway or serial recruitment across pathways." (Demaria et al 2014:738)

"determined that over 90% of the enzymes that are in stable complexes in E. coli metabolic pathways are adjacent on the E. coli chromosome." (Demaria et al 2014:741)

"The yeast-two-hybrid system uses the transcription of a reporter gene driven by the Gal4 transcription factor to monitor whether or not two proteins are interacting. As shown in Figure 12.2a, if the interaction between two proteins, A and B, is being tested, one of their genes would be fused to the DNA-binding domain of the Gal4 transcription factor (Gal4-DBD) while the other would be fused to the activation domain (Gal4-AD). The DNA-binding domain chimeric protein will bind upstream of the reporter gene. If the activation domain chimeric protein interacts with the DNA-binding domain chimeric protein, the reporter gene will be transcribed" (Demaria et al 2014:742)

"the 'bait' protein, protein A here, fused to the DNA-binding domain of the yeast Gal4 transcription factor (Gal4-DBD), as indicated by the black line connecting the two proteins. The Gal4-DBD will bind in the promoter region of a reporter gene. The second chimeric protein consists of a 'prey' protein, protein B here, fused to the activation domain of the Gal4 transcription factor (Gal4-AD); the fusion is again indicated by a black line connecting the two proteins. If proteins A and B interact, the Gal4-AD will be recruited to the promoter of the reporter gene as well, and will activate the RNA polymerase and thus stimulate transcription of the reporter gene. Thus the interaction between proteins A and B can be monitored by expression of the reporter. (b) In the large-scale purification of protein complexes, fusion proteins are created as well. These consist of [...]" (Demaria et al 2014:742)

I'm a little surprised this works (note on p.742)

"The result of the known interactions between members of structural protein families is a graph of connections between families like that shown in Figure 12.3, where the nodes ar" (Demaria et al 2014:743)

"protein families and the edges represent an interaction between at least one of the domains from each of the two families. Most domain families o" (Demaria et al 2014:744)

"a method for predicting protein interactions. T" (Demaria et al 2014:745)

"It can be described very succinctly. Given two biological molecules of known structure that are known to interact can we determine their three-dimensional structure when in a complex" (Demaria et al 2014:750)

"Protein-protein docking" (Demaria et al 2014:751)

"Protein-ligand docking" (Demaria et al 2014:751)

"It has been known for some time that conservation of residues at the surface of a protein [...] related to function. This may be an enzyme active-site or binding site. Unlike electrostatic po" (Demaria et al 2014:754)

"certain highly populated protein folds called the superfolds, there is conservation of binding-site location even in the absence of homology. These have been termed supersites (see Chapter 10). Th" (Demaria et al 2014:755)

"Protein-ligand (small molecule) docking algorithms" (Demaria et al 2014:756)

"transform docking method pi" (Demaria et al 2014:757)

"Protein-protein docking." (Demaria et al 2014:757)

"Virtual screening and structure-based drug design" (Demaria et al 2014:760)

". The aim is to produce a manageable subset that can be screened experimentally with high-throughput methods." (Demaria et al 2014:760)

"Stochastic processes use a random sampling procedure to search conformational space. This includes methods such as Monte Carlo simulation, simulated annealing, Tabu search, genetic algorithms and evolutionary programming. T" (Demaria et al 2014:762)

"However, the color scheme used is not flexible. The GRASP and VRML have been incorporated into the GRASS server. This allows a Web-based interactive exploration of molecules in the PDB allowing the molecular properties to be viewed on the molecular surface. There are also several other popular molecular graphics programs that allow a similar visualization." (Demaria et al 2014:764)

"principles of the Minimum Information About a Microarray Experiment (MIAME" (Demaria et al 2014:766)

Extracted Annotations (5/2/2015, 11:49:55 PM)

"values f" (Demaria et al 2014:723)

"Normalizing genes" (Demaria et al 2014:724)

"An international consortium of array groups has defined a Minimum Information About a Microarray Experiment (MIAME) that provides a framework for defining the type of data that should be stored and dividing the large amount of array data types into defined groups for database implementation." (Demaria et al 2014:727)

"ArrayExpress, the proposed international microarray databa" (Demaria et al 2014:727)

"Data mining has been defined as the process of discovering knowledge or patterns hidden in (often) large datasets." (Demaria et al 2014:729)

"y, the size of machine-readable data sets has increased and the problem of 'data explosion' has become apparent. Many an" (Demaria et al 2014:730)

"In this chapter we will present some of the most commonly used methods for gene expression data exploration, including hierarchical clustering, K-means, and self-organizing map (SOM). Also, support vector machines (SVM) have become popular for classifying expression data," (Demaria et al 2014:730)

"Hierarchical clustering" (Demaria et al 2014:732)

"concept of the hierarchical representation of a data set was pri ogy." (Demaria et al 2014:732)

"hierarchical tree or dendrogram representing a nested set of partitions. Sec" (Demaria et al 2014:733)

"The K-means algorithm is popular because it is easy to understand, easy to implement, and has a good time complexity." (Demaria et al 2014:739)

"Another problem is that it is sensitive to the initial partition - the selection of the initial patterns, and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. A possible remedy is to run the algorithm with a number of different initial partitions. If they all lead to the same final partition, this implies that the global minimum of the square error has been achieved. However, this can be time-consuming, and may not always work." (Demaria et al 2014:739)
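
The "different initial partitions" remedy can be sketched in a few lines of Python. This is my own minimal version of K-means with random restarts, not the book's code; the function names, the number of restarts and the toy 2-D points are all made up for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, rng, max_iter=100):
    """One K-means run from a random initial choice; returns (centroids, square error)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        updated = [centroid(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:           # converged: assignments no longer change
            break
        centroids = updated
    error = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, error

def best_of_restarts(points, k, n_restarts=50, seed=0):
    """Repeat K-means from different starting points and keep the lowest-error run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

# Toy points forming three obvious groups (made-up numbers)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
centroids, error = best_of_restarts(points, k=3)
print(sorted(centroids), error)
```

Keeping the run with the smallest square error is only a heuristic: as the quote says, agreement between restarts is suggestive, and it costs one full run per restart.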

"Self-organizing maps" (Demaria et al 2014:739)

"particularly for partitional clustering and visualization. It is cap" (Demaria et al 2014:739)

"of the most important properties of SOM that similar input vectors are mapped to geometrically close winner nodes on the output map. This is called neighborhood preservation, which has turned out to be very useful for clustering similar data patterns." (Demaria et al 2014:740)

"known and remains one of the most important ways of validating the results. Hierarchical clustering, K-means and SOM are probably the most commonly used clustering methods for gene expression analysis. By" (Demaria et al 2014:741)

"the global search and optimization methods such as genetic" (Demaria et al 2014:741)

"Fuzzy clustering may have much to offer as dlre" (Demaria et al 2014:741)

"symptoms, personal or financial information, or gene expression level at different time points. Essentially we are interested in how to construct a classification procedure from a set of cases whose classes are known so that such a procedure can be used to classify new cases" (Demaria et al 2014:741)

"Support vector machines (SVM)" (Demaria et al 2014:742)

"k-fold cross-validation" (Demaria et al 2014:742)
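
A reminder to myself of what k-fold cross-validation does with the cases, as a small Python sketch of my own (the function name is hypothetical, and the cases are just indices). Each case lands in exactly one test fold and is used for training in the other k-1 rounds.

```python
def k_fold_indices(n_cases, k):
    """Yield (train, test) index lists; every case is tested exactly once."""
    indices = list(range(n_cases))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_cases // k + (1 if i < n_cases % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

In practice the cases would usually be shuffled first so that the folds do not follow the order in which the data were collected.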

"Without going into full details the maximal margin of separation can be uniquely constructed by solving a constrained quadratic optimization problem involving support vectors, a small subset of patterns that lie on the margin. The s" (Demaria et al 2014:742)

"Support vector machines offer a novel approach to classification. SVM training always finds a global minimum, while its neural network competitors may get stuck with a local minimum. [...] and it can be analyzed theoretically using" (Demaria et al 2014:747)

"There is no established theory that can guarantee that a given family of SVMs will have high accuracy on a given problem. SVM" (Demaria et al 2014:747)

"area of functional genomics, network modeling has begun more recently, a variety of approaches have already been proposed. These range" (Demaria et al 2014:748)

"cellular °file' The e" (Demaria et al 2014:749)

"major subdisciplines, which we term expression proteomics and cell-map proteomics. Th" (Demaria et al 2014:750)

"Cell-map proteomics is the large-scale characterization of protein interactions, and includes methods for studying protein—protein interactions, methods for studying the interaction between proteins and small molecules (ligands) and cellular localization studies. T" (Demaria et al 2014:750)

"s of thousands of proteins differing in abundance over [...] orders of magnitude. Es" (Demaria et al 2014:750)

"protein separation technology) and the characterization of individual proteins within such mixtures (protein annotation technology)." (Demaria et al 2014:751)

"the first experiments using 2-DE it proved possible to resolve over 1000 proteins from the bacterium Escherichia coli. The resolution of the technique has improved steadily, and it is now possible to separate 10,000 proteins in one experiment, although such skill is not readily transferable. How" (Demaria et al 2014:752)

"membrane proteins remain a challenge requiring case-by-case tuning of the detergents used. A" (Demaria et al 2014:752)

"chromatography separation methods are typically employed, using combinations of size exclusion, ion exchange and reverse phase chromatography. These app" (Demaria et al 2014:752)

"Affinity chromatography techniques for cell-map proteomics" (Demaria et al 2014:753)

"use of specific antibodies directed towards particular proteins (immunoprecipitation)," (Demaria et al 2014:753)

"technique that has had the biggest impact on protein annotation in proteomics is mass spectrometry" (Demaria et al 2014:753)

". A mass spectrometer has three components. The ionizer converts the analyte into gas phase ions and accelerates them towards the mass analyzer, which separates the ions according to their mass/charge ratio on their way to the ion detector, which records the impact of individual ions. Large molecules such as proteins and DNA are broken up by standard ionization procedures, b" (Demaria et al 2014:753)

"Figure 16.2" (Demaria et al 2014:754)

"Strategies for protein annotation by mass spectrometry" (Demaria et al 2014:754)

"A highly complex interaction map representing 1500 yeast proteins. Reproduced from Tucker CL et al." (Demaria et al 2014:755)

"The molecular weight of a whole protein is insufficiently discriminating for its identification, which is why trypsin digestion is required. If a protein exists in the database, it can be identified from as few as two or three peptides. Excellent commercial [...] and" (Demaria et al 2014:755)

"Mass analyzers. The two basic types of mass analyzer used in proteomics are the quadrupole and time of flight (TOF) analyzers. A quadrupole analyzer comprises four metal rods, pairs of which are electrically connected and carry opposing voltages that can be controlled by the operator. Mass spectra are obtained by varying the potential difference applied across the ion stream, allowing ions of different mass/charge ratios to be directed towards the detector. A time of flight analyzer measures the time taken by ions to travel down a flight tube to the detector, a factor that depends on the mass/charge ratio." (Demaria et al 2014:756)
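
Why the flight time depends on mass/charge, as a back-of-envelope Python sketch of my own (not from the book). It assumes an idealized linear TOF tube: an ion of charge z*e accelerated through a voltage V gains kinetic energy z*e*V = m*v^2/2 and then drifts a length L at constant speed, so t = L*sqrt(m/(2*z*e*V)). The tube length and voltage below are made-up values, and real instruments add corrections this ignores.

```python
from math import sqrt

E_CHARGE = 1.602176634e-19    # elementary charge in coulombs
DALTON = 1.66053906660e-27    # one dalton (atomic mass unit) in kilograms

def flight_time(mass_da, charge, tube_length_m, accel_voltage_v):
    """Idealized drift time in seconds: t = L * sqrt(m / (2 * z * e * V))."""
    mass_kg = mass_da * DALTON
    energy_j = charge * E_CHARGE * accel_voltage_v   # kinetic energy after acceleration
    return tube_length_m * sqrt(mass_kg / (2 * energy_j))

# Hypothetical setup: 1 m flight tube, 20 kV accelerating voltage
for mass_da, charge in [(1000, 1), (4000, 1), (2000, 2)]:
    t = flight_time(mass_da, charge, tube_length_m=1.0, accel_voltage_v=20000)
    print(mass_da, charge, t)
```

A singly charged 1000 Da peptide takes roughly 16 microseconds in this setup; quadrupling m/z doubles the time, and the 2000 Da doubly charged ion arrives together with the 1000 Da singly charged one because only the ratio matters.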

"SNP are more abundant in non-coding compared to coding DNA, and many" (Demaria et al 2014:757)

"de novo peptide sequencing." (Demaria et al 2014:757)

"Protein chips are miniature devices on which proteins, or specific capture agents that interact with proteins, are arrayed. As such, protein chips can act both to separate proteins (on the basis of specific affinity) and characterize them (if the capture agent is highly specific, as in the case of antibodies) (" (Demaria et al 2014:758)

"Solution arrays. New technologies based on coded microspheres or barcoded nanoparticles release the protein chip from its two-dimensional format and will probably emerge as the next generation of miniature devices used in proteomics." (Demaria et al 2014:759)

"Scientists have traditionally written programs in languages such as Perl, C or C++ to extract data relevant to their work from files, p" (Demaria et al 2014:764)

"(MySQL)," (Demaria et al 2014:766)

"relational model has been dominant, a" (Demaria et al 2014:766)

"PostgreSQL) an" (Demaria et al 2014:766)

"(Oracle)" (Demaria et al 2014:766)

"An alternative data model, the object-oriented model, is supported by some DBMSs. T" (Demaria et al 2014:767)

"example, rather than having data about a protein domain spread across many tables as would be the case in a relational database, in an object-oriented database all the information about a protein domain, including what a domain is related to and the ways in which domain data may be created and accessed, would be brought together in the definition of a protein domain object." (Demaria et al 2014:767)

"their continued wide use. Given this current dominance of relational DBMSs, object-oriente" (Demaria et al 2014:767)

"The standard language for accessing a relational database is SQL (Structured Query Language):" (Demaria et al 2014:768)

"SELECT clause" (Demaria et al 2014:769)

"FROM clause t" (Demaria et al 2014:769)

"WHERE clause" (Demaria et al 2014:769)
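
To remind myself how the three clauses fit together, a small self-contained sketch using Python's built-in sqlite3 module. The table name, column names and rows are all invented for illustration and are not taken from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("CREATE TABLE protein_domain (domain_id TEXT, family TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO protein_domain VALUES (?, ?, ?)",
    [("d1", "globin", 146), ("d2", "globin", 153), ("d3", "kinase", 290)],
)

rows = conn.execute(
    "SELECT domain_id, length "           # SELECT clause: which columns to return
    "FROM protein_domain "                # FROM clause: which table to read
    "WHERE family = ? "                   # WHERE clause: which rows to keep
    "ORDER BY domain_id",
    ("globin",),
).fetchall()
print(rows)                               # [('d1', 146), ('d2', 153)]
conn.close()
```

Passing the family name as a ? parameter rather than pasting it into the SQL string keeps the query safe if the value ever comes from user input.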

"Developing a full application will usually require programming in a language such as Perl, Java or C++ for all the non-database tasks of the application. Hence, mechanisms have been developed to ena" (Demaria et al 2014:769)

"data model presented by a DBMS - a series of tables in the case of a relational database - may differ enormously from the storage structures used in reality by the DBMS to hold the data." (Demaria et al 2014:769)

"Indexes in databases work in a similar fashion to indexes in books. A rea" (Demaria et al 2014:771)

Extracted Annotations (5/3/2015, 12:03:21 AM)

"A DBMS no longer just manages data in tables in a single local database; increasingly it provides access to data stored locally and remotely, the data being stored in conventional databases, or in files external to the database, or on the World Wide Web. 17.4 Challenges arising from biological data" (Demaria et al 2014:723)

"grid computing: turning the Web into integrated data and computational resources which a scientist can plug into in the same way that the electricity grid can be plugged into when electrical power is required." (Demaria et al 2014:725)

"l automated way, markup that is not only computer readable, but also computer understandable." (Demaria et al 2014:726)