Computational Genomics with R

✍ Scribed by Altuna Akalin

Publisher: CRC Press
Year: 2020
Tongue: English
Leaves: 463
Category: Library

✦ Table of Contents

Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
About the Authors
1 Introduction to Genomics
1.1 Genes, DNA and central dogma
1.1.1 What is a genome?
1.1.2 What is a gene?
1.1.3 How are genes controlled? Transcriptional and post-transcriptional regulation
1.1.4 What does a gene look like?
1.2 Elements of gene regulation
1.2.1 Transcriptional regulation
1.2.2 Post-transcriptional regulation
1.3 Shaping the genome: DNA mutation
1.4 High-throughput experimental methods in genomics
1.4.1 The general idea behind high-throughput techniques
1.4.2 High-throughput sequencing
1.5 Visualization and data repositories for genomics
2 Introduction to R for Genomic Data Analysis
2.1 Steps of (genomic) data analysis
2.1.1 Data collection
2.1.2 Data quality check and cleaning
2.1.3 Data processing
2.1.4 Exploratory data analysis and modeling
2.1.5 Visualization and reporting
2.1.6 Why use R for genomics?
2.2 Getting started with R
2.2.1 Installing packages
2.2.2 Installing packages in custom locations
2.2.3 Getting help on functions and packages
2.3 Computations in R
2.4 Data structures
2.4.1 Vectors
2.4.2 Matrices
2.4.3 Data frames
2.4.4 Lists
2.4.5 Factors
2.5 Data types
2.6 Reading and writing data
2.6.1 Reading large files
2.7 Plotting in R with base graphics
2.7.1 Combining multiple plots
2.7.2 Saving plots
2.8 Plotting in R with ggplot2
2.8.1 Combining multiple plots
2.8.2 ggplot2 and tidyverse
2.9 Functions and control structures (for, if/else, etc.)
2.9.1 User-defined functions
2.9.2 Loops and looping structures in R
2.10 Exercises
2.10.1 Computations in R
2.10.2 Data structures in R
2.10.3 Reading in and writing data out in R
2.10.4 Plotting in R
2.10.5 Functions and control structures (for, if/else, etc.)
3 Statistics for Genomics
3.1 How to summarize collection of data points: The idea behind statistical distributions
3.1.1 Describing the central tendency: Mean and median
3.1.2 Describing the spread: Measurements of variation
3.1.3 Precision of estimates: Confidence intervals
3.2 How to test for differences between samples
3.2.1 Randomization-based testing for difference of the means
3.2.2 Using t-test for difference of the means between two samples
3.2.3 Multiple testing correction
3.2.4 Moderated t-tests: Using information from multiple comparisons
3.3 Relationship between variables: Linear models and correlation
3.3.1 How to fit a line
3.3.2 How to estimate the error of the coefficients
3.3.3 Accuracy of the model
3.3.4 Regression with categorical variables
3.3.5 Regression pitfalls
3.4 Exercises
3.4.1 How to summarize collection of data points: The idea behind statistical distributions
3.4.2 How to test for differences in samples
3.4.3 Relationship between variables: Linear models and correlation
4 Exploratory Data Analysis with Unsupervised Machine Learning
4.1 Clustering: Grouping samples based on their similarity
4.1.1 Distance metrics
4.1.2 Hierarchical clustering
4.1.3 K-means clustering
4.1.4 How to choose “k”, the number of clusters
4.2 Dimensionality reduction techniques: Visualizing complex data sets in 2D
4.2.1 Principal component analysis
4.2.2 Other matrix factorization methods for dimensionality reduction
4.2.3 Multi-dimensional scaling
4.2.4 t-Distributed Stochastic Neighbor Embedding (t-SNE)
4.3 Exercises
4.3.1 Clustering
4.3.2 Dimension reduction
5 Predictive Modeling with Supervised Machine Learning
5.1 How are machine learning models fit?
5.1.1 Machine learning vs. statistics
5.2 Steps in supervised machine learning
5.3 Use case: Disease subtype from genomics data
5.4 Data preprocessing
5.4.1 Data transformation
5.4.2 Filtering data and scaling
5.4.3 Dealing with missing values
5.5 Splitting the data
5.5.1 Holdout test dataset
5.5.2 Cross-validation
5.5.3 Bootstrap resampling
5.6 Predicting the subtype with k-nearest neighbors
5.7 Assessing the performance of our model
5.7.1 Receiver Operating Characteristic (ROC) curves
5.8 Model tuning and avoiding overfitting
5.8.1 Model complexity and bias variance trade-off
5.8.2 Data split strategies for model tuning and testing
5.9 Variable importance
5.10 How to deal with class imbalance
5.10.1 Sampling for class balance
5.10.2 Altering case weights
5.10.3 Selecting different classification score cutoffs
5.11 Dealing with correlated predictors
5.12 Trees and forests: Random forests in action
5.12.1 Decision trees
5.12.2 Trees to forests
5.12.3 Variable importance
5.13 Logistic regression and regularization
5.13.1 Regularization in order to avoid overfitting
5.13.2 Variable importance
5.14 Other supervised algorithms
5.14.1 Gradient boosting
5.14.2 Support Vector Machines (SVM)
5.14.3 Neural networks and deep versions of it
5.14.4 Ensemble learning
5.15 Predicting continuous variables: Regression with machine learning
5.15.1 Use case: Predicting age from DNA methylation
5.15.2 Reading and processing the data
5.15.3 Running random forest regression
5.16 Exercises
5.16.1 Classification
5.16.2 Regression
6 Operations on Genomic Intervals and Genome Arithmetic
6.1 Operations on genomic intervals with
6.1.1 How to create and manipulate a GRanges object
6.1.2 Getting genomic regions into R as GRanges objects
6.1.3 Finding regions that do/do not overlap with another set of regions
6.2 Dealing with mapped high-throughput sequencing reads
6.2.1 Counting mapped reads for a set of regions
6.3 Dealing with continuous scores over the genome
6.3.1 Extracting subsections of Rle and RleList objects
6.4 Genomic intervals with more information: SummarizedExperiment class
6.4.1 Create a SummarizedExperiment object
6.4.2 Subset and manipulate the SummarizedExperiment object
6.5 Visualizing and summarizing genomic intervals
6.5.1 Visualizing intervals on a locus of interest
6.5.2 Summaries of genomic intervals on multiple loci
6.5.3 Making karyograms and circos plots
6.6 Exercises
6.6.1 Operations on genomic intervals with the
6.6.2 Dealing with mapped high-throughput sequencing reads
6.6.3 Dealing with contiguous scores over the genome
6.6.4 Visualizing and summarizing genomic intervals
7 Quality Check, Processing and Alignment of High-throughput Sequencing Reads
7.1 FASTA and FASTQ formats
7.2 Quality check on sequencing reads
7.2.1 Sequence quality per base/cycle
7.2.2 Sequence content per base/cycle
7.2.3 Read frequency plot
7.2.4 Other quality metrics and QC tools
7.3 Filtering and trimming reads
7.4 Mapping/aligning reads to the genome
7.5 Further processing of aligned reads
7.6 Exercises
8 RNA-seq Analysis
8.1 What is gene expression?
8.2 Methods to detect gene expression
8.3 Gene expression analysis using high-throughput sequencing technologies
8.3.1 Processing raw data
8.3.2 Alignment
8.3.3 Quantification
8.3.4 Within sample normalization of the read counts
8.3.5 Computing different normalization schemes in R
8.3.6 Exploratory analysis of the read count table
8.3.7 Differential expression analysis
8.3.8 Functional enrichment analysis
8.3.9 Accounting for additional sources of variation
8.4 Other applications of RNA-seq
8.5 Exercises
8.5.1 Exploring the count tables
8.5.2 Differential expression analysis
8.5.3 Functional enrichment analysis
8.5.4 Removing unwanted variation from the expression data
9 ChIP-seq analysis
9.1 Regulatory protein-DNA interactions
9.2 Measuring protein-DNA interactions with ChIP-seq
9.3 Factors that affect ChIP-seq experiment and analysis quality
9.3.1 Antibody specificity
9.3.2 Sequencing depth
9.3.3 PCR duplication
9.3.4 Biological replicates
9.3.5 Control experiments
9.3.6 Using tagged proteins
9.4 Pre-processing ChIP data
9.4.1 Mapping of ChIP-seq data
9.5 ChIP quality control
9.5.1 The data
9.5.2 Sample clustering
9.5.3 Visualization in the genome browser
9.5.4 Plus and minus strand cross-correlation
9.5.5 GC bias quantification
9.5.6 Sequence read genomic distribution
9.6 Peak calling
9.6.1 Types of ChIP-seq experiments
9.6.2 Peak calling: Sharp peaks
9.6.3 Peak calling: Broad regions
9.6.4 Peak quality control
9.6.5 Peak annotation
9.7 Motif discovery
9.7.1 Motif comparison
9.8 What to do next?
9.9 Exercises
9.9.1 Quality control
10 DNA methylation analysis using bisulfite sequencing data
10.1 What is DNA methylation?
10.1.1 How DNA methylation is set?
10.1.2 How to measure DNA methylation with bisulfite sequencing
10.2 Analyzing DNA methylation data
10.3 Processing raw data and getting data into R
10.4 Data filtering and exploratory analysis
10.4.1 Reading methylation call files
10.4.2 Further quality check
10.4.3 Merging samples into a single table
10.4.4 Filtering CpGs
10.4.5 Clustering samples
10.4.6 Principal component analysis
10.5 Extracting interesting regions: Differential methylation and segmentation
10.5.1 Differential methylation
10.5.2 Methylation segmentation
10.5.3 Working with large files
10.6 Annotation of DMRs/DMCs and segments
10.6.1 Further annotation with genes or gene sets
10.7 Other R packages that can be used for methylation analysis
10.8 Exercises
10.8.1 Differential methylation
10.8.2 Methylome segmentation
11 Multi-omics Analysis
11.1 Use case: Multi-omics data from colorectal cancer
11.2 Latent variable models for multi-omics integration
11.3 Matrix factorization methods for unsupervised multi-omics data integration
11.3.1 Multiple factor analysis
11.3.2 Joint non-negative matrix factorization
11.3.3 iCluster
11.4 Clustering using latent factors
11.4.1 One-hot clustering
11.4.2 K-means clustering
11.5 Biological interpretation of latent factors
11.5.1 Inspection of feature weights in loading vectors
11.5.2 Making sense of factors using enrichment analysis
11.5.3 Interpretation using additional covariates
11.6 Exercises
11.6.1 Matrix factorization methods
11.6.2 Clustering using latent factors
11.6.3 Biological interpretation of latent factors
Bibliography
Index

📜 SIMILAR VOLUMES

Computational Genomics with R

📁 Computational Genomics with R

✍ Altuna Akalin 📂 Library 📅 2020 🏛 CRC Press/Chapman & Hall 🌐 English

Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners to sophisticated data analysis techniques in genomics. The book covers topics from R programming, to machine learning and statistics, to the latest geno

Population Genomics with R

📁 Population Genomics with R

✍ Emmanuel Paradis 📂 Library 📅 2020 🏛 CRC Press 🌐 English

Population Genomics with R

📁 Population Genomics with R

✍ Emmanuel Paradis 📂 Library 📅 2020 🏛 Chapman and Hall/CRC 🌐 English

Population Genomics With R presents a multidisciplinary approach to the analysis of population genomics. The methods treated cover a large number of topics from traditional population genetics to large-scale genomics with high-throughput sequencing data. Several dozen R packages

Population Genomics with R

📁 Population Genomics with R

✍ Emmanuel Paradis 📂 Library 📅 2020 🏛 CRC Press LLC 🌐 English

Computational Statistics with R

📁 Computational Statistics with R

✍ C. R. Rao 📂 Library 📅 2014 🏛 Elsevier 🌐 English

Computational Statistics with R

📁 Computational Statistics with R

✍ Marepalli B. Rao and C.R. Rao (Eds.) 📂 Library 📅 2014 🏛 Elsevier 🌐 English

R is open source statistical computing software. Since the R core group was formed in 1997, R has been extended by a very large number of packages with extensive documentation along with examples freely available on the internet. It offers a large number of statistical and numerical methods and g