Statistical Genomics: Methods and Protocols (Methods in Molecular Biology, 1418)
✍ Scribed by Ewy Mathé (editor), Sean Davis (editor)
- Publisher: Humana
- Year: 2016
- Tongue: English
- Leaves: 419
- Category: Library
✦ Synopsis
This volume expands on statistical analysis of genomic data by discussing cross-cutting groundwork material, public data repositories, common applications, and representative tools for operating on genomic data. Statistical Genomics: Methods and Protocols is divided into four sections. The first section discusses overview material and resources that can be applied across topics mentioned throughout the book. The second section covers prominent public repositories for genomic data. The third section presents several different biological applications of statistical genomics, and the fourth section highlights software tools that can be used to facilitate ad-hoc analysis and data integration. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, step-by-step, readily reproducible analysis protocols, and tips on troubleshooting and avoiding known pitfalls.
Thorough and practical, Statistical Genomics: Methods and Protocols explores a range of both applications and tools and is ideal for anyone interested in the statistical analysis of genomic data.
✦ Table of Contents
Preface
Contents
Contributors
Part I: Groundwork
Chapter 1: Overview of Sequence Data Formats
1 Introduction
2 Results
2.1 FASTQ: Format of Raw Short Read Files
2.2 FASTA: Format of Reference Genome Files
2.3 BAM/SAM: Format of Alignment (Mapping) Data Files
2.3.1 Header Section
@HD Line
@SQ Lines
@RG Lines
@PG Lines
@CO Line
2.3.2 Alignment Section
2.4 GFF/GTF and BED: Formats for Genomic Feature Annotation Files
2.5 VCF: A Special File Format for Genomic Variations
3 Conclusions
References
Chapter 2: Integrative Exploratory Analysis of Two or More Genomic Datasets
1 Introduction
2 Methods
2.1 Analysis of Two Datasets Using Co-inertia Analysis
2.1.1 Co-inertia Analysis
2.2 Case Study I: Integration of NCI-60 Cell Line Transcriptomic and Proteomic Data
2.2.1 PCA of Individual Datasets
2.2.2 CIA of Both Datasets
2.2.3 CIA: Exploring the Variables
2.3 Exploration of Three or More Datasets
2.3.1 Multiple Co-inertia Analysis
2.3.2 Case Study 2: Cross Comparison of Gene Expression Data Obtained on Four Different Microarray Platforms
2.3.3 Example 2: Integrative Analysis of Transcriptome, Proteome, and Phosphoproteome of Stem Cell Lines
3 Session Info
References
Chapter 3: Study Design for Sequencing Studies
1 Introduction
2 Statistical Considerations
2.1 Statistical Background
2.1.1 Randomization
2.1.2 Sampling Distributions
2.1.3 Statistical Testing
2.1.4 Statistical Power
2.1.5 Confounding
2.2 Statistical Principles of Design
2.2.1 Pilot Experiments
2.2.2 Protocols
2.2.3 Blocking
2.2.4 Replication
2.3 Design of Sequencing Studies
2.3.1 Treatment Design
2.3.2 Sample Design
2.3.3 Sample Preparation Design
3 Sequencing Design
3.1 Library Construction
3.2 Library Validation
3.3 Read Length
3.4 Multiplexing
3.5 Sequencing Depth
3.6 Targeted Sequencing
4 Conclusion
References
Chapter 4: Genomic Annotation Resources in R/Bioconductor
1 Introduction
2 Materials
3 Methods
3.1 Using the AnnotationHub
3.2 OrgDb Objects
3.3 TxDb Objects
3.4 OrganismDb Objects
3.5 BSgenome Objects
3.6 biomaRt
3.7 Creating Annotation Objects
4 Notes
References
Part II: Public Genomic Data
Chapter 5: The Gene Expression Omnibus Database
1 Introduction
2 Methods
2.1 Retrieve a Specific GEO Record
2.2 Quick Search Using Keywords
2.3 Advanced Search Using Structured Queries and Filters
2.4 Search Programmatically
2.5 Query with a Nucleotide Sequence
2.6 DataSet Analysis Tools
2.7 GEO Profiles Analysis Tools
2.8 Analyze with GEO2R
2.9 Visualize Data as a Genome Track
2.10 Download GEO Data
3 Methods/Conclusion
4 Notes
References
Chapter 6: A Practical Guide to The Cancer Genome Atlas (TCGA)
1 Introduction to TCGA
1.1 Data Use Restrictions
1.2 Future Directions in Cancer Genomics
1.2.1 Exceptional Responders Initiative
1.2.2 ALCHEMIST Clinical Trial
1.2.3 Cancer Driver Discovery Project
2 Using the TCGA Data Coordinating Center (DCC) Portal
2.1 Getting Help
2.2 TCGA Data Access and Use
2.3 Data Types and Data Levels
2.4 TCGA Samples and Identifiers
2.5 Data Archives and Files
2.6 Querying and Downloading TCGA Data
2.6.1 Data Matrix
Data Matrix Application/Tutorial
2.6.2 File Search
File Search Application/Tutorial
2.6.3 Bulk Download
Bulk Download Application/Tutorial
2.6.4 HTTP Directories
Open Access Directories/Controlled Access Directories/Tutorial
2.7 Other Key Features and Resources
2.7.1 MAF Search Facility
2.7.2 Annotations Database
2.7.3 Biospecimen Metadata Browser
2.7.4 Clinical Data Dictionary
2.7.5 Publication Pages
2.7.6 Web Services
3 Accessing TCGA Sequence Data
3.1 Cancer Genomics Hub (CGHub) Overview
3.2 Query Metadata
3.2.1 CGHub Metadata Browser
3.2.2 cgquery
3.3 Steps to Download BAM/FASTQ Files
3.3.1 Get a dbGAP Authorization
3.3.2 Get a CGhub Keyfile
3.3.3 Get and Install GeneTorrent
3.3.4 Retrieve BAM/FASTQ Files
Scenario One
Scenario Two
3.4 Safeguard Raw Sequence Files
3.4.1 Regarding Redistribution of Raw Sequence Files
3.4.2 Publication Moratorium
References
Part III: Applications
Chapter 7: Working with Oligonucleotide Arrays
1 Introduction
1.1 Array Types and Applications
1.2 The Need for Flexible Software
1.3 Handling Oligonucleotide Microarrays Data
1.3.1 Data Import and Manipulation of Probe-Level Data
1.3.2 Preprocessing
Background Correction
Normalization
Summarization
1.3.3 Preprocessing Different Array Types
1.3.4 Using Preprocessed Data for Additional Analyses
2 Materials
3 Methods
3.1 Preprocessing Affymetrix Expression Arrays
3.2 Preprocessing Affymetrix Gene ST Arrays
3.3 Preprocessing Affymetrix SNP Arrays
4 Notes
References
Chapter 8: Meta-Analysis in Gene Expression Studies
1 Introduction
2 Materials
2.1 Microarray Dataset Identification
2.1.1 The Gene Expression Omnibus
2.2 Dataset Preparation
2.2.1 Curation
2.2.2 Preprocessing
2.2.3 Batch Effects
2.2.4 Gene Collapsing
2.2.5 Pathway or Gene Set Collapsing
2.2.6 Duplicate Checking
2.2.7 Gene Pre-filtering
3 Methods
3.1 Fixed Effects Meta-analysis
3.2 Random Effects Meta-analysis
3.3 Fixed vs. Random Effects Meta-analysis
3.4 Rank-Based Meta-analysis
3.5 Other Approaches to Synthesis
3.6 Case Study
3.7 Extensions to Predictive Modeling
4 Notes
References
Chapter 9: Practical Analysis of Genome Contact Interaction Experiments
1 Introduction
2 Materials
3 Methods
3.1 Getting Started with Genome Contact Interaction Data Analysis
3.2 Identifying Statistical Significant Hi-C Contacts Using a Model-Based Approach
3.3 Detecting Genome Contact Interaction Differences between Two Conditions
4 Notes/Conclusion
References
Chapter 10: Quantitative Comparison of Large-Scale DNA Enrichment Sequencing Data
1 Introduction
2 Materials
2.1 Installation of MEDIPS
2.2 Preprocessing of the Sequencing Data
3 Methods
3.1 Parameter Settings and Quality Control
3.1.1 Stacked Reads
3.1.2 Extend Reads
3.1.3 Window Size
3.1.4 Coverage Saturation Analysis
3.2 Importing Alignment Data
3.3 Differential Coverage Analysis
3.4 Merging Neighboring DERs and Annotation
3.5 Notes
3.6 R Script to Download the H3K4me2 ChIP-seq Data from SRA
3.7 Shell Script for the Alignment of the H3K4me2 ChIP-seq Data (Fastq Files)
3.8 R Script for Quality Control and Differential Enrichment Analysis of the Aligned H3K4me2 ChIP-seq Data Comparing TH2 and N...
References
Chapter 11: Variant Calling From Next Generation Sequence Data
1 Introduction
1.1 Statistical Methods in Variant Analysis
1.2 Choice of Analysis Software
1.3 Variant Filtering and Accuracy
1.4 Software Pipelines and Reproducibility
2 Materials
3 Methods
3.1 Alignment of Reads
3.1.1 FASTQ Files
3.1.2 Running Novoalign
3.2 Removal of PCR Duplicates
3.2.1 Duplicate Removal Option 1
3.2.2 Duplicate Removal Option 2
3.3 Variant and Genotype Calling
3.3.1 Variant Calling with Freebayes
3.3.2 Variant Calling with bam2mpg
3.3.3 Compression and Random Access with VCF Files
3.4 Annotation of VCF Files
3.5 Viewing and Filtering Variants
3.6 Supplementary Protocols
3.6.1 Obtaining Sequence Data from NCBI
3.6.2 Downloading a Genome's Reference Sequence
3.6.3 Removing Alternate Haplotypes and Masking Pseudo-Autosomal Regions in the Reference
3.6.4 Formatting a Novoalign Database
4 Notes
References
Chapter 12: Genome-Scale Analysis of Cell-Specific Regulatory Codes Using Nuclear Enzymes
1 Introduction
2 Analysis of Chromatin Accessibility
2.1 Assay Protocols, Biases, and Data Reproducibility
2.2 Building a Profile for Data Visualization in a Genome Browser
2.3 Region Detection Algorithm
2.4 Region Annotation and Integrative Analyses
2.5 Motif Analysis on Hotspots
3 Analysis of TF Footprints
3.1 Data Requirement
3.2 Cut Count Profiles
3.3 Artifacts from the Enzyme Bias on Sequence Patterns at Cut Sites
3.4 Cut Count Analysis When Matching ChIP-seq Data Are Available for TFs of Interest
3.5 Detection of Putative Footprints and Limitations in Inference of TF Occupancy In Vivo
3.6 Sequence Motif Analysis
References
Part IV: Tools
Chapter 13: NGS-QC Generator: A Quality Control System for ChIP-Seq and Related Deep Sequencing-Generated Datasets
1 Introduction
2 Materials
2.1 Data Description: Preliminary Datasets Quality Evaluation (FastQC)
2.2 Data Description: Datasets Format for NGS-QC Generator Analysis
3 Methods
3.1 Preliminary Sequencing Quality Assessment
3.1.1 Sequence Quality
3.1.2 Clonal Reads
3.1.3 Adapter Contamination
3.2 Exploring the NGS-QC Generator Portal
3.3 Accessing to the NGS-QC Generator Tool via the Dedicated Galaxy Instance
3.3.1 An Overview of the Galaxy Interface
3.3.2 Performing a Global Quality Control Analysis
3.3.3 Visualize Local Enrichment Patterns in the Context of Their Quality
3.4 Exploring the NGS-QC Generator Database
3.4.1 A General Overview
3.4.2 The NGS-QC Database as a Source of Information for Providing Potential Hints to Interpret ChIP-Sequencing Assays with Lo...
3.4.3 Identifying ChIP-seq Grade Antibodies from the NGS-QC Database Content
4 Conclusions and Perspectives
5 Notes
References
Chapter 14: Operating on Genomic Ranges Using BEDOPS
1 Introduction
2 Methods
2.1 Genomic Data Formats
2.2 Pipelining with Streams
3 Core Functionality
3.1 Set-Based Operations
3.2 Proximity
3.3 Statistics and Annotations
4 Working Efficiently with Big Data
4.1 Parallelization
4.2 Sorting
4.3 Compression
5 Further Reading
References
Chapter 15: GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality
1 Introduction
1.1 Brief History
2 Computational Features
2.1 Using Global Information
2.2 Known Splicing Information
2.3 Tolerance to SNPs, Bisulfite Sequencing, RNA-Editing, and PAR-CLIP Sequencing
2.4 Standardized Indel Positions
2.5 Trimming Ends
2.6 Ambiguous Splicing
2.7 Chimeric Alignments and Distant Splicing
2.8 Overlapping Alignments
2.9 Suboptimal Alignments
2.10 Ranking Alignments and Eliminating Duplicates
2.11 Circular Chromosomes
2.12 Alternate Scaffolds
3 Usage
3.1 Downloading and Compilation
3.2 Pre-processing a Genome
3.3 Auxiliary SNP and Splicing Information
3.4 Alignment
3.5 Parallel Computation
4 Data Structures and Algorithms
4.1 Linear Genome: Vertical Storage Format
4.2 Genomic Hash Tables: Increased Specificity Through Compression
4.3 Handling Large Genomes
4.4 Segmental Hash Tables
4.5 Diagonalization in GMAP
4.6 Resolving Gaps in GMAP: SIMD Dynamic Programming
4.7 ESAs: Greedy Match-and-Extend
4.8 Integration of GMAP Algorithm into GSNAP
4.9 Segment Chaining
5 User Features and Downstream Analysis
5.1 Output Formats
5.2 Alignment Types and Split Output
5.3 Utility Programs
5.4 Integration with R/Bioconductor
5.5 Analyzing Datasets
5.6 Biological Inference
6 Discussion
References
Chapter 16: Visualizing Genomic Data Using Gviz and Bioconductor
1 Introduction
2 Materials
2.1 How to Install
2.2 Data Set Description
3 Methods
3.1 Track Objects
3.2 Display Parameters
3.3 A Real World Example
4 Notes
References
Chapter 17: Introducing Machine Learning Concepts with WEKA
1 Data Mining and Machine Learning
1.1 Structured vs. Unstructured Data
1.2 Supervised vs. Unsupervised Learning
1.3 Prediction: Classification and Regression
1.4 Clustering and Associations
1.5 Nominal and Numeric Data
2 WEKA: Getting Started
2.1 Loading Data
2.2 Preprocessing Data
2.3 Choosing a Classifier
2.4 Training and Testing
2.5 Running the Experiment
2.6 Understanding the Results
3 Approaching a Bioinformatics Problem
4 Other Kinds of Learning
5 Concluding Remarks
References
Chapter 18: Experimental Design and Power Calculation for RNA-seq Experiments
1 Introduction
2 Materials
3 Method
3.1 Obtaining PROPER
3.2 Setting Up Simulation Scenario
3.2.1 Using PROPER Provided Simulation Parameters
3.2.2 Using User Identified Data Source to Set Up Simulation Basis
3.3 Running Simulation and DE Detection
3.4 Power Assessment and Comparison
3.5 More Samples or Deeper Sequencing
3.6 To Filter or Not to Filter
4 Notes
References
Chapter 19: It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Meth
1 Introduction
2 Basic Theory of the Differential Expression Analysis
2.1 Overview
2.2 From Reads to Genewise Counts
2.3 Modelling Variability in a Quasi-Negative Binomial Framework
2.4 Introducing the Generalized Linear Model
2.5 Normalizing for Composition Biases
3 Tutorial with Real Data
3.1 Description of the Data Set
3.2 Read Alignment and Processing
3.3 Count Loading and Annotation
3.4 Filtering to Remove Low Counts
3.5 Normalization for Composition Bias
3.6 Exploring Differences Between Libraries
3.7 Dispersion Estimation
3.8 Testing for Differential Expression
4 Advanced Usage
4.1 Analysis of Variance
4.2 Complicated Contrasts
4.3 Gene Ontology Enrichment Analysis
4.4 Rotation Gene Set Tests
4.5 Session Information
References
Index
SIMILAR VOLUMES
This detailed volume provides an overview of recent advances in the application of genomic technologies in several domains of marine biology, raising awareness of various DNA- and RNA-based technologies. Genomic methods are essential in identifying previously undetected taxonomic (e.g. DNA bar
This volume presents the latest protocols for both laboratory and bioinformatics based analyses in the field of marine genomics. The chapters presented in the book cover a wide range of topics, including the sampling and genomics of bacterial communities, DNA extraction in marine organisms,
Genomic imprinting is the process by which gene activity is regulated according to parent of origin. Usually, this means that either the maternally inherited or the paternally inherited allele of a gene is expressed while the opposite allele is repressed. The phenomenon is largely restricted t
This detailed book provides a comprehensive series of innovative research techniques and methodologies applied to the parasite genomics research area, all applying different approaches to analyzing parasite genomes and furthering the study of genetic complexity and the mechanisms of regulation