This volume details several important databases and data mining tools. Data Mining Techniques for the Life Sciences, Second Edition guides readers through archives of macromolecular three-dimensional structures, databases of protein-protein interactions, thermodyna
Data Mining Techniques for the Life Sciences (Methods in Molecular Biology, 609)
Scribed by Oliviero Carugo (editor), Frank Eisenhaber (editor)
- Publisher
- Humana
- Year
- 2009
- Tongue
- English
- Leaves
- 408
- Category
- Library
Synopsis
Most life science researchers will agree that biology is not a truly theoretical branch of science. The hype around computational biology and bioinformatics beginning in the nineties of the 20th century was to be short-lived (1, 2). When almost no value of practical importance, such as the optimal dose of a drug or the three-dimensional structure of an orphan protein, can be computed from fundamental principles, it is still more straightforward to determine them experimentally. Thus, experiments and observations do generate the overwhelming part of insights into biology and medicine. The extrapolation depth and the prediction power of the theoretical argument in life sciences still have a long way to go. Yet, two trends have qualitatively changed the way biological research is done today. First, the number of researchers has grown dramatically and they, armed with the same protocols, have produced lots of similarly structured data. Second, high-throughput technologies such as DNA sequencing or array-based expression profiling have been around for just a decade. Nevertheless, with their high level of uniform data generation, they reach the threshold of totally describing a living organism at the biomolecular level for the first time in human history. Whereas getting exact data about living systems and the sophistication of experimental procedures have primarily absorbed the minds of researchers previously, the weight increasingly shifts to the problem of interpreting accumulated data in terms of biological function and biomolecular mechanisms.
Table of Contents
METHODS IN MOLECULAR BIOLOGY™
Data Mining Techniques for the Life Sciences
Preface
References
Contents
Contributors
Part 1: Databases
Nucleic Acid Sequence and Structure Databases
1. Introduction
2. Sequence Databases
2.1. General Nucleotide Sequence Databases
2.2. DNA Databases
2.2.1. Transcript Structures and Alternative Splicing
2.2.2. Repeats and Mobile Elements
2.2.3. Promoters and Regulation
2.3. RNA Databases
2.3.1. Noncoding RNA Sequences
2.3.2. mRNA Elements
2.3.3. RNA Editing
2.3.4. Specific RNA Families
2.3.5. Artificial Selected/Designed RNAs
3. Secondary Structures
4. Tertiary Structures
References
Genomic Databases and Resources at the National Center for Biotechnology Information
1. Introduction
2. Data Flow and Processing
3. Text Search and Retrieval System: Entrez
3.1. Organizing Principles
3.1.1. Query Examples
3.1.2. Links Between the Nodes
3.1.3. Links Within the Nodes
3.1.4. Links Outside the Nodes
3.2. Tools for Advanced Users
4. Genomic Databases
4.1. Trace Repositories
4.1.1. Trace Archive
4.1.2. Short Read Archive (SRA)
4.1.3. GenBank - Primary Sequence Archive
4.2. Entrez Databases
4.2.1. Entrez Genome
4.2.2. Entrez Genome Project
4.2.3. Entrez Protein Clusters
5. Analysis of Prokaryotic Genome Data
5.1. gMap - Compare Genomes by Genomic Sequence Similarity
5.2. Genome ProtMap - Compare Genomes by Protein Sequence Similarity
5.3. Concise BLAST
6. Browsing Eukaryotic Genome Data
7. Searching Data by Sequence Similarity (BLAST)
7.1. Organism-Specific Genomic BLAST
7.2. Multi-organism Genomic BLAST
8. FTP Resources for Genome Data
9. Conclusion
References
Protein Sequence Databases
1. Introduction
2. The Foundation Is the Sequence
3. What Is Known About the Protein?
4. Lower Quality Sequence Databases
5. Navigating the Labyrinth of Resources
References
Protein Structure Databases
1. Introduction
2. Structures and Structural Data
2.1. Terminology
2.2. The PDB and the wwPDB
2.3. Structural Data and Analyses
3. Atlases
3.1. The RCSB PDB
3.1.1. Summary Page
3.1.2. Other Information
3.1.3. Quality Assessment
3.1.4. Molecule of the Month
3.1.5. Structural Genomics Portal
3.2. The MSD
3.2.1. Search Tools
3.2.2. The AstexViewer™
3.3. JenaLib
3.4. OCA
3.5. PDBsum
3.5.1. Pfam Domain Diagrams
3.5.2. Quality Assessment
3.5.3. Enzyme Reactions
3.5.4. Figures from Key References
3.5.5. Secondary Structure and Topology Diagrams
3.5.6. Intermolecular Interactions
4. Homology Models and Obsolete Entries
4.1. Homology Modeling Servers
4.2. Threading Servers
4.3. Obsolete Entries
5. Fold Databases
5.1. Classification Schemes
5.2. Fold Comparison
6. Miscellaneous Databases
6.1. Selection of Data Sets
6.2. Uppsala Electron Density Server (EDS)
6.3. Curiosities
7. Summary
References
Protein Domain Architectures
1. Introduction
2. Protein Family and Domain Databases
2.1. CDD
2.2. InterPro
2.3. Everest
3. Tools for Analyzing Protein Domain Architectures
3.1. ProDom Domain Architecture Viewer
3.2. SMART
3.3. CDART
3.4. InterPro Domain Architectures
3.5. PRODOC
4. Discussion
References
Thermodynamic Database for Proteins: Features and Applications
1. Introduction
2. Thermodynamic Database for Proteins and Mutants, ProTherm
2.1. Contents of ProTherm
2.2. Search and Display Options in ProTherm
2.3. ProTherm Statistics
3. Factors Influencing the Stability of Proteins and Mutants
3.1. Relationship Between Amino Acid Properties and Protein Stability upon Mutations
3.2. Influence of Neighboring and Surrounding Residues to Protein Mutant Stability
3.3. Stabilizing Residues in Protein Structures
4. Prediction of Protein Mutant Stability
4.1. Average Assignment Method
4.2. Empirical Energy Functions
4.3. Torsion, Distance and Contact Potentials
4.4. Machine Learning Techniques
5. Conclusions
References
Enzyme Databases
1. Introduction
2. Classification and Nomenclature
3. Enzyme Information Resources
3.1. General Enzyme Databases
3.1.1. Enzyme Nomenclature Web Sites
3.1.1.1. IUBMB Nomenclature and ExplorEnz
3.1.1.2. SIB-ENZYME Nomenclature Database
3.1.1.3. IntEnz
3.1.2. Enzyme-Functional Databases
3.1.2.1. BRENDA
3.1.2.2. AMENDA/FRENDA
3.1.2.3. KEGG
3.2. Special Enzyme Databases
3.2.1. MEROPS
3.2.2. MetaCyc
3.2.3. REBASE
3.2.4. Carbohydrate-Active Enzymes (CAzy)
3.2.5. Databases Based on Sequence Homologies
References
Biomolecular Pathway Databases
1. Introduction
2. Current Development in Pathway Databases
2.1. Kyoto Encyclopedia of Genes and Genomes (KEGG)
2.2. Reactome
2.3. BioCyc
2.4. Pathway Interaction Database (PID)
3. Problems Associated with Pathway Analysis Based on Public Databases
4. Conclusions and Outlook
References
Databases of Protein-Protein Interactions and Complexes
1. Introduction
2. Recent Status of Protein-Protein Interaction and Complex Databases
2.1. Database of Interacting Proteins (DIP)
2.2. Molecular INTeraction database (MINT)
2.3. IntAct
2.4. BioGRID
2.5. Human Protein Reference Database (HPRD)
2.6. MPact
2.7. STRING
2.8. Unified Human Interactome (UniHI)
3. Comparison of Protein-Protein Interaction and Complex Databases
4. Conclusions and Future Developments
References
Part 2: Data Mining Techniques
Proximity Measures for Cluster Analysis
1. Introduction
2. Proximity Measures
2.1. Proximity Between Two Statistical Units
2.2. Proximity Between a Single Unit and a Group of Units
2.3. Proximity Between Two Groups
References
Clustering Criteria and Algorithms
1. Introduction
2. Hierarchical Agglomerative Clustering
3. Clustering Validation
3.1. Validation of an Individual Cluster
3.2. Validation of a Clustering Level
3.3. Comparison Between Alternative Clusterings
4. Clustering Tendency
5. Monotonicity and Crossover
References
Neural Networks
1. Introduction
2. Learning Theory
2.1. Parameterisation of a Neural Network
2.2. Learning Rules
2.2.1. Regression Analysis
2.2.2. Classification Analysis
2.3. Learning Algorithms
2.3.1. Regression
2.3.2. Classification
2.3.3. Procedure
3. Neural Networks in Action
3.1. Model Evaluation
3.2. Evaluation Statistics
3.2.1. Regression Analysis
3.2.2. Classification Analysis
3.3. Generalisation Issue
3.3.1. Over-Training
3.3.2. Over-Sized
3.4. Data Organisation for Model Evaluation
4. Applications to Bioinformatics
4.1. Bio-chemical Data Analysis
4.2. Gene Expression Data Analysis
4.3. Protein Structure Data Analysis
4.4. Bio-marker Identification
4.5. Sequence Data Analysis
5. Summary
References
A User's Guide to Support Vector Machines
1. Introduction
2. Preliminaries: Linear Classifiers
3. Kernels: from Linear to Nonlinear Classifiers
4. Large-Margin Classification
4.1. The Geometric Margin
4.2. Support Vector Machines
5. Understanding the Effects of SVM and Kernel Parameters
6. Model Selection
7. SVMs for Unbalanced Data
8. Normalization
9. SVM Training Algorithms and Software
10. Further Reading
References
Hidden Markov Models in Biology
1. Introduction
1.1. Definitions
2. A Markov Chain in Linkage Mapping
3. Hidden Markov Models
3.1. A Simple CpG Island Model
3.2. The Viterbi Algorithm
3.3. The Forward-Backward Algorithm
3.4. The Baum-Welch or Expectation-Maximization Algorithm
3.5. The Metropolis Sampler
4. Discussion
References
Part 3: Database Annotations and Predictions
Integrated Tools for Biomolecular Sequence-Based Function Prediction as Exemplified by the ANNOTATOR Software Environment
1. Introduction
2. Integrated Sequence Analytic Tools
2.1. Ad Hoc Scripting
2.2. Integrated Frameworks for Bioinformatic Workflows
3. The ANNOTATOR, an Example of an Integrated Sequence Analytic Tool
3.1. Algorithm Integration
3.2. Visualization
4. Conclusions
References
Computational Methods for Ab Initio and Comparative Gene Finding
1. Introduction
2. Methods
2.1. Ab Initio Gene Finding
2.2. Evidence-Based Gene Finding
2.3. Combining Ab Initio and Evidence-Based Methods
2.4. Measuring Gene Finding Accuracy
References
Sequence and Structure Analysis of Noncoding RNAs
1. Introduction
1.1. Typographical Conventions
2. Materials
2.1. Hardware
2.2. Software
2.3. Example Files
2.4. Additional Information
3. Methods
3.1. Software Installation
3.2. Prediction of RNA Secondary Structure
3.2.1. Structure Prediction for Single Sequences
3.2.2. Consensus Structure Prediction for Homologous Sequences
3.3. De Novo Prediction of Structural ncRNAs
3.3.1. Single Sequences
3.3.2. Aligned Sequences
3.3.3. Unaligned Sequences
3.4. Searching Databases for Homologous Sequences
3.4.1. Sequence-Based Algorithms
3.4.2. Sequence- and Structure-Based Algorithms
3.5. Annotation of Specific RNA Classes
3.5.1. tRNAs
3.5.2. rRNAs
3.5.3. snoRNAs
3.5.4. miRNAs
3.5.5. Other Classes
3.6. Interaction Partners
3.6.1. Hybridization Structure of Two RNAs
3.6.2. miRNA Target Prediction
4. Notes
4.1. Note 1
References
Conformational Disorder
1. Introduction
2. Materials
3. Methods
3.1. Running Individual Disorder Predictions
3.1.1. Predictors Trained on Data Sets of Disordered Proteins
3.1.1.1. PONDR
3.1.1.2. DisProt VSL2
3.1.1.3. DisProt (PONDR) VL3, VL3H, VL3E, and VL3P
3.1.1.4. Globplot 2
3.1.1.5. DisEMBL
3.1.1.6. Disopred2
3.1.1.7. RONN
3.1.1.8. DISpro
3.1.1.9. SPRITZ
3.1.1.10. PreLink
3.1.1.11. Ucon
3.1.1.12. OnD-CRF
3.1.1.13. POODLE-S
3.1.1.14. PrDOS
3.1.1.15. DisPSSMP
3.1.2. Predictors That Have Not Been Trained on Disordered Proteins
3.1.2.1. The Charge/Hydropathy Method and Its Derivative FoldIndex
3.1.2.2. NORSp
3.1.2.3. IUPred
3.1.2.4. FoldUnfold
3.1.2.5. DRIP-PRED
3.1.3. Hydrophobic Cluster Analysis (HCA): A Nonconventional Disorder Predictor
3.1.4. Predictor Specificities
3.2. Identifying Regions of Induced Folding
3.3. General Procedure for Disorder Prediction
3.4. Running the MeDor Metaserver for the Prediction of Disorder
References
Protein Secondary Structure Prediction
1. Introduction
1.1. How Do We Define a Secondary Structure?
1.2. Importance of Secondary Structure Prediction
1.3. Deciphering Prediction Rules from Residue Patterns
1.3.1. Information Retrieval
1.3.2. alpha-Helices
1.3.3. beta-Strands
1.3.4. Loops
1.4. Building Up Topology Models
2. Overview of Secondary Structure Prediction Methods
2.1. The Early Methods
2.2. A Note on Current Prediction Methods
2.3. State-of-the-Art Methods: The Power of Neural Networks
2.3.1. Principles of Neural Networks
2.3.2. PHD/PHDpsi
2.3.3. PSIPRED
2.3.4. PROF (King)
2.3.5. SSpro/Porter
2.3.6. PSSP/APSSP/APSSP2
2.3.7. SAM-T99/SAM-T02/SAM-T06
2.3.8. YASPIN
2.3.9. Jpred (Jnet)
3. Assessing Prediction Accuracy
3.1. How Do We Assess Predictions Using Experimentally Solved Protein Structures
3.2. CASP and CAFASP
3.3. The EVA Automatic Evaluation of Prediction Methods
4. Methods for Transmembrane Topology Prediction
5. Integrating Secondary Structure Information into Multiple Sequence Alignment
6. Challenges for the Field and Practical Considerations
6.1. Is There an Upper Limit?
6.2. Prediction Accuracy and Multiple Alignment Quality
6.3. Practical Recommendations
References
Analysis and Prediction of Protein Quaternary Structure
1. Introduction
2. QS, Symmetry, and Crystal Structures
3. Mining the Literature
4. QS in Databases
4.1. QS in the Protein Data Bank
4.1.1. Entry 1mkb
4.1.2. Entry 1otp
4.1.3. Entry 1r56
4.2. QS in Databases Derived from the PDB
4.2.1. Biounit
4.2.2. ProtBuD
4.2.3. 3D Complex
4.2.4. PiQSi
4.3. Deriving the QS from the Atomic Coordinates: PQS, PITA, PISA
4.3.1. PQS
4.3.2. PITA
4.3.3. PISA
5. Predicting QS from the Amino Acid Sequence
5.1. InterPreTS
5.2. Threading
6. Protein-Protein Docking
6.1. Docking Algorithms
6.2. Scoring and Refinement
6.3. Generating Oligomers by Docking
6.4. Assessing Docking Predictions: CAPRI
7. Conclusion
References
Prediction of Posttranslational Modification of Proteins from Their Amino Acid Sequence
1. Introduction
2. General Consideration for PTM Predictor Construction and Evaluation
2.1. PTMs as a Result of an Enzymatic Process
2.2. General Considerations for PTM Predictors
2.3. General Issues with Prediction Rate Accuracy
3. Predictors for Specific PTMs
3.1. Recognition of Targeting Signals with and Without Proteolytic Cleavage Sites
3.2. Lipid PTMs
3.3. Phosphorylation Prediction Tools
3.4. Glycosylation
4. Conclusions
References
Protein Crystallizability
1. Introduction
1.1. Protein Crystallization
1.2. Structural Genomics
2. Methods
2.1. Crystallization Target Selection
2.1.1. Overall Structural Determination Success (from Cloning to Structure)
2.1.2. Probability of Protein Crystallization
2.2. Construct Optimization
2.3. Optimizing Initial Conditions
3. Notes
3.1. Data
3.2. Methods
References
Subject Index
SIMILAR VOLUMES
Whereas getting exact data about living systems and sophisticated experimental procedures have primarily absorbed the minds of researchers previously, the development of high-throughput technologies has caused the weight to increasingly shift to the problem of interpreting accumulated data in
This third edition details new and updated methods and protocols on important databases and data mining tools. Chapters guide readers through archives of macromolecular sequences and three-dimensional structures, databases of protein-protein interactions, methods for prediction conformational disor