Principles and Methods for Data Science (Handbook of Statistics, Volume 43)
Edited by Arni S.R. Srinivasa Rao and C.R. Rao
- Publisher: North Holland
- Year: 2020
- Language: English
- Pages: 498
- Series: Handbook of Statistics
- Edition: 1
- Category: Library
Synopsis
Principles and Methods for Data Science, Volume 43 in the Handbook of Statistics series, highlights new advances in the field, with this updated volume presenting interesting and timely topics, including: Competing risks, aims and methods; Data analysis and mining of microbial community dynamics; Support vector machines, a robust prediction method with applications in bioinformatics; Bayesian model selection for data with high dimension; High-dimensional statistical inference, theoretical development to data analytics; Big data challenges in genomics; Analysis of microarray gene expression data using information theory and stochastic algorithm; Hybrid models; Markov chain Monte Carlo methods, theory and practice; and more.
Table of Contents
Front Cover
Principles and Methods for Data Science
Copyright
Contents
Contributors
Preface
Chapter 1: Markov chain Monte Carlo methods: Theory and practice
1. Introduction
2. Introduction to Bayesian statistical analysis
2.1. Noninformative prior distributions
2.2. Informative prior distributions
2.2.1. Conjugate prior distributions
2.2.2. Nonconjugate prior distributions
2.3. Bayesian estimation
3. Markov chain Monte Carlo background
3.1. Discrete-state Markov chains
3.2. General state space Markov chain theory
4. Common MCMC algorithms
4.1. The Metropolis-Hastings algorithm
4.2. Multivariate Metropolis-Hastings
4.3. The Gibbs sampler
4.3.1. Sampling from intractable full conditional distributions
4.3.2. Rejection sampling
4.3.3. Adaptive rejection sampling
4.3.4. The tangent method
4.4. Slice sampling
4.5. Reversible jump MCMC
5. Markov chain Monte Carlo in practice
5.1. MCMC in regression models
5.2. Random effects models
5.3. Bayesian generalized linear models
5.4. Hierarchical models
6. Assessing Markov chain behavior
6.1. Using the theory to bound the mixing time
6.2. Output-based convergence diagnostics
6.2.1. Trace plots
6.2.2. Heidelberger and Welch (1983) diagnostic
6.2.3. Geweke (1992) spectral density diagnostic
6.2.4. Gelman and Rubin (1992) diagnostic
6.2.5. Yu and Mykland (1994) CUSUM plot diagnostic
6.3. Using auxiliary simulations to bound mixing time
6.3.1. Cowles and Rosenthal (1998) auxiliary simulation approach
6.3.2. An auxiliary simulation approach for random-scan random-walk Metropolis samplers
6.3.3. Auxiliary simulation approach for full-updating Metropolis samplers
6.4. Examining sampling frequency
7. Conclusion
References
Further reading
Chapter 2: An information and statistical analysis pipeline for microbial metagenomic sequencing data
1. Introduction
2. A brief overview of shotgun metagenomic sequencing analysis
2.1. Sequence assembly and contig binning
2.2. Annotation of taxonomy, protein, metabolic, and biological functions
2.3. Statistical analysis and machine learning
2.4. Reconstruction of pseudo-dynamics and mathematical modeling
2.5. Construction of analysis pipeline with reproducibility and portability
3. Computational tools and resources
3.1. Tools and software
3.1.1. BLAST
3.1.2. BWA
3.1.3. SAMtools
3.1.4. CD-HIT
3.1.5. MEGAHIT
3.1.6. MaxBin
3.1.7. Prodigal
3.1.8. VirFinder
3.1.9. DIAMOND
3.1.10. MEGAN
3.1.11. TSCAN
3.2. Public resources and databases
3.2.1. NCBI reference database (RefSeq)
3.2.2. Integrated Microbial Genomes (IMG) and Genome OnLine Database (GOLD)
3.2.3. UniProt
3.2.4. InterPro
3.2.5. KEGG
3.2.6. SEED
3.2.7. EggNOG
3.3. Do-It-Yourself information analysis pipeline for metagenomic sequences
4. Notes
Acknowledgments
References
Chapter 3: Machine learning algorithms, applications, and practices in data science
Abbreviations
1. Introduction
2. Supervised methods
2.1. Data sets
2.2. Linear regression
2.2.1. Polynomial fitting
2.2.2. Thresholding and linear regression
2.3. Logistic regression
2.4. Support vector machine-Linear kernel
2.5. Decision tree
Outline of decision tree
2.6. Ensemble methods
2.6.1. Boosting algorithms
2.6.2. Gradient boosting algorithm
2.7. Bias-variance trade off
2.7.1. Bias variance experiments
2.8. Cross validation and model selection
2.8.1. Model selection process
2.8.2. Learning curves
2.9. Multiclass and multivariate scenarios
2.9.1. Multivariate linear regression
2.9.2. Multiclass classification
2.9.2.1. Multiclass SVM
2.9.2.2. Multiclass logistic regression
2.10. Regularization
2.10.1. Regularization in gradient methods
2.10.2. Regularization in other methods
2.11. Metrics in machine learning
2.11.1. Confusion matrix
2.11.2. Precision-recall curve
2.11.3. ROC curve
2.11.4. Metrics for the multiclass classification
3. Practical considerations in model building
3.1. Noise in the data
3.2. Missing values
3.3. Class imbalance
3.4. Model maintenance
4. Unsupervised methods
4.1. Clustering
4.1.1. K-means
4.1.2. Hierarchical clustering
4.1.3. Density-based clustering
4.2. Comparison of clustering algorithms over data sets
4.3. Matrix factorization
4.4. Principal component analysis
4.5. Understanding the SVD algorithm
4.5.1. LU decomposition
4.5.2. QR decomposition
4.6. Data distributions and visualization
4.6.1. Multidimensional scaling
4.6.2. tSNE
4.6.3. PCA-based visualization
4.6.4. Research directions
5. Graphical methods
5.1. Naive Bayes algorithm
5.2. Expectation maximization
Example of email spam and nonspam problem-Posing as graphical model
5.2.1. E and M steps
5.2.2. Sampling error minimization
5.3. Markovian networks
5.3.1. Hidden Markov model
5.3.2. Latent Dirichlet analysis
Topic modeling of audio data
Topic modeling of image data
6. Deep learning
6.1. Neural network
6.1.1. Gradient magnitude issues
6.1.2. Relation to ensemble learning
6.2. Encoder
6.2.1. Vectorization of text
6.2.2. Autoencoder
6.2.3. Restricted Boltzmann machine
6.3. Convolutional neural network
6.3.1. Filter learning
6.3.2. Convolution layer
6.3.3. Max pooling
6.3.4. Fully connected layer
6.3.5. Popular CNN architectures
6.4. Recurrent neural network
6.4.1. Anatomy of simple RNN
6.4.2. Training a simple RNN
6.4.3. LSTM
6.4.4. Examples of sequence learning problem statements
6.4.5. Sequence to sequence mapping
6.5. Generative adversarial network
6.5.1. Training GAN
6.5.2. Applications of GANs
7. Optimization
8. Artificial intelligence
8.1. Notion of state space and search
8.2. State space-Search algorithms
8.2.1. Enumerative search methods
8.2.2. Heuristic search methods-Example A* algorithm
8.3. Planning algorithms
8.3.1. Example of a state
8.4. Formal logic
8.4.1. Predicate or propositional logic
8.4.2. First-order logic
8.4.3. Automated theorem proof
8.4.3.1. Forward chaining
8.4.3.2. Incompleteness of the forward chaining
8.4.3.3. Backward chaining
8.5. Resolution by refutation method
8.6. AI framework adaptability issues
9. Applications and laboratory exercises
9.1. Automatic differentiation
9.2. Machine learning exercises
9.3. Clustering exercises
9.4. Graphical model exercises
9.4.1. Exercise-Topics in text data
9.4.2. Exercise-Topics in image data
9.4.3. Exercise-Topics in audio data
9.5. Data visualization exercises
9.6. Deep learning exercises
References
Chapter 4: Bayesian model selection for high-dimensional data
1. Introduction
2. Classical variable selection methods
2.1. Best subset selection
2.2. Stepwise selection methods
2.3. Criterion functions
3. The penalization framework
3.1. LASSO and generalizations
3.1.1. Strong irrepresentable condition
3.1.2. Adaptive LASSO
3.1.3. Elastic net
3.2. Nonconvex penalization
3.3. Variable screening
4. The Bayesian framework for model selection
5. Spike and slab priors
5.1. Point mass spike prior
5.1.1. g-priors
5.1.2. Nonlocal priors
5.2. Continuous spike priors
5.3. Spike and slab LASSO
6. Continuous shrinkage priors
6.1. Bayesian LASSO
6.2. Horseshoe prior
6.3. Global-local shrinkage priors
6.4. Regularization of Bayesian priors
6.5. Prior elicitation-Hyperparameter selection
6.5.1. Empirical Bayes
6.5.2. Criterion-based tuning
7. Computation
7.1. Direct exploration of the model space
7.1.1. Shotgun stochastic search
7.2. Gibbs sampling
7.3. EM algorithm
7.4. Approximate algorithms
7.4.1. Laplace approximation
7.4.2. Variational approximation
8. Theoretical properties
8.1. Consistency properties of the posterior mode
8.2. Posterior concentration
8.3. Pairwise model comparison consistency
8.4. Strong model selection consistency
9. Implementation
10. An example
11. Discussion
Acknowledgments
References
Chapter 5: Competing risks: Aims and methods
1. Introduction
2. Research aim: Explanation vs prediction
2.1. In-hospital infection and discharge
2.1.1. Marginal distribution: Discharge as censoring event
2.1.2. Cause-specific distribution: Discharge as competing event
2.2. Causes of death after HIV infection
2.3. AIDS and pre-AIDS death
3. Basic quantities and their estimators
3.1. Definitions and notation
3.1.1. Competing risks
3.1.2. Multistate approach
3.2. Data setup
3.3. Nonparametric estimation
3.3.1. Complete data
3.3.2. Cause-specific hazard: Aalen-Johansen estimator
3.3.3. Incomplete data: Weights
3.3.4. Subdistribution hazard: Weighted PL estimator
3.3.5. Weighted ECDF estimator
3.3.6. Equivalence
3.4. Standard errors and confidence intervals
3.5. Regression models
3.6. Software
4. Time-varying covariables and the subdistribution hazard
4.1. Overall survival
4.2. Spectrum in causes of death
4.2.1. Internal approach
4.2.2. Pseudo-individual approach
4.3. Summary
5. Confusion
5.1. What is the appropriate analysis?
5.2. Is a marginal analysis feasible in practice?
5.3. If we fit a Cox model, do we need to assume that the competing risks are independent?
5.4. Is a regression model for the subdistribution hazard (such as a Fine and Gray model) the only truly competing risks ...
5.5. Is the subdistribution hazard a quantity that can be given an interpretation?
Acknowledgment
References
Chapter 6: High-dimensional statistical inference: Theoretical development to data analytics
1. Introduction
2. Mean vector testing
2.1. Independent observations
2.2. Projection-based tests
2.3. Random projections
2.4. Other approaches
2.5. Dependent observations
3. Covariance matrix
3.1. Estimation
3.2. Hypothesis testing
4. Discrete multivariate models
4.1. Multinomial distribution
4.2. Compound multinomial models
4.3. Other distributions
4.3.1. Bernoulli distribution
4.3.2. Binomial distribution
4.3.3. Poisson distribution
5. Conclusion
References
Chapter 7: Big data challenges in genomics
1. Introduction
2. Next-generation sequencing
3. Data integration
4. High dimensionality
5. Computing infrastructure
6. Dimension reduction
7. Data smoothing
8. Data security
9. Example
10. Conclusion
References
Chapter 8: Analysis of microarray gene expression data using information theory and stochastic algorithm
1. Introduction
1.1. Gene clustering algorithms
2. Methodology
2.1. Discretization
2.2. Genetic algorithm
2.3. The evolutionary clustering algorithm
2.3.1. Creation of individuals
2.3.2. Mutation operations
2.3.3. Fitness function
2.3.4. Selection
2.3.5. Termination
3. Results
3.1. Synthetic data
3.2. Real data
4. Section A: Studies on gastric cancer dataset (GDS1210)
4.1. Comparison of the algorithms based on the classification accuracy of samples
4.2. Analysis of classificatory genes
4.3. Comparison of algorithms based on the representative genes
4.4. Study of gene distribution in clusters
5. Section B: A brief study on colon cancer dataset
5.1. Comparison of the algorithms based on classification accuracy
6. Section C: A brief study on brain cancer (medulloblastoma metastasis) dataset (GDS232)
6.1. Comparison of the algorithms based on the classification accuracy
7. Conclusion
Appendices
Appendix A: A brief overview of the OCDD algorithm
Appendix B: Smoothing and Chi-square test method
References
Further reading
Chapter 9: Human life expectancy is computed from incomplete sets of data: Modeling and analysis
1. Introduction
2. Life expectancy of newly born babies
3. Numerical examples
Appendix. Analysis of the life expectancy function
References
Chapter 10: Support vector machines: A robust prediction method with applications in bioinformatics
1. Introduction
2. Mathematical prerequisites
2.1. Topology
2.2. Probability and measure theory
2.3. Functional and convex analysis
2.4. Derivatives in normed spaces
2.5. Convex programs, Lagrange multipliers and duality
3. An introduction to support vector machines
3.1. The generalized portrait algorithm
3.2. The hard margin SVM
3.3. The soft margin SVM
3.4. Empirical risk minimization and support vector machines
3.5. Kernels and the reproducing kernel Hilbert space
3.6. Loss functions
3.7. Bouligand-derivatives of loss functions
3.8. Shifting the loss function
4. An introduction to robustness and influence functions
5. Properties of SVMs
5.1. Existence, uniqueness and consistency of SVMs
5.2. Robustness of SVMs
6. Applications
6.1. Predicting blood pressure through BMI in the presence of outliers
6.2. Breast cancer distant metastasis through gene expression
6.3. Splice site detection
References
Index
Back Cover
SIMILAR VOLUMES
The Handbook of Computational Social Science is a comprehensive reference source for scholars across multiple disciplines. It outlines key debates in the field, showcasing novel statistical modeling and machine learning methods, and draws from specific case studies to demonstrate the opportunities…
Covering the key theories, tools, and techniques of this dynamic field, Handbook of Nanophysics: Principles and Methods elucidates the general theoretical principles and measurements of nanoscale systems. Each peer-reviewed chapter contains a broad-based introduction and enhances understanding of the…
This book focuses on methods and models in classification and data analysis and presents real-world applications at the interface with data science. Numerous topics are covered, ranging from statistical inference and modelling to clustering and factorial methods, and from directional data analysis to…
This book contains an exposition of some of the main developments of the last twenty years in the following areas of harmonic analysis: singular integral and pseudo-differential operators, the theory of Hardy spaces, L^p estimates involving oscillatory integrals and Fourier integral operators…