𝔖 Scriptorium
✦   LIBER   ✦


Probabilistic Machine Learning: An Introduction

✍ Scribed by Kevin P. Murphy


Publisher: The MIT Press
Year: 2022
Tongue: English
Leaves: 854
Series: Adaptive Computation and Machine Learning
Edition: 1
Category: Library

⬇ Acquire This Volume

No coin nor oath required. For personal study only.

✦ Synopsis


A detailed and up-to-date introduction to machine learning, presented through the unifying lens of probabilistic modeling and Bayesian decision theory.

This book offers a detailed and up-to-date introduction to machine learning (including deep learning) through the unifying lens of probabilistic modeling and Bayesian decision theory. The book covers mathematical background (including linear algebra and optimization), basic supervised learning (including linear and logistic regression and deep neural networks), as well as more advanced topics (including transfer learning and unsupervised learning). End-of-chapter exercises allow students to apply what they have learned, and an appendix covers notation.

Probabilistic Machine Learning grew out of the author’s 2012 book, Machine Learning: A Probabilistic Perspective. More than just a simple update, this is a completely new book that reflects the dramatic developments in the field since 2012, most notably deep learning. In addition, the new book is accompanied by online Python code, using libraries such as scikit-learn, JAX, PyTorch, and TensorFlow, which can be used to reproduce nearly all the figures; this code can be run inside a web browser using cloud-based notebooks, and provides a practical complement to the theoretical topics discussed in the book. This introductory text will be followed by a sequel that covers more advanced topics, taking the same probabilistic approach.
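As a flavor of what those notebooks do, here is a minimal Python sketch (illustrative only, not taken from the book's companion repository) that fits a binary logistic regression with scikit-learn on a synthetic 2-d dataset and plots its decision boundary, the kind of figure the book's classification chapters reproduce:

# Hypothetical example: fit a logistic regression classifier and plot
# the p(y=1|x) = 0.5 decision boundary, in the style of the book's figures.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 2-d binary classification data (stands in for the book's datasets).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

clf = LogisticRegression().fit(X, y)

# Evaluate the predicted class-1 probability on a grid covering the data.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
probs = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, probs, levels=20, cmap="RdBu", alpha=0.5)
plt.contour(xx, yy, probs, levels=[0.5], colors="k")  # decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="RdBu", edgecolors="k")
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()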

✦ Table of Contents


Preface
Introduction
What is machine learning?
Supervised learning
Classification
Regression
Overfitting and generalization
No free lunch theorem
Unsupervised learning
Clustering
Discovering latent "factors of variation"
Self-supervised learning
Evaluating unsupervised learning
Reinforcement learning
Data
Some common image datasets
Some common text datasets
Preprocessing discrete input data
Preprocessing text data
Handling missing data
Discussion
The relationship between ML and other fields
Structure of the book
Caveats
I Foundations
Probability: Univariate Models
Introduction
What is probability?
Types of uncertainty
Probability as an extension of logic
Random variables
Discrete random variables
Continuous random variables
Sets of related random variables
Independence and conditional independence
Moments of a distribution
Limitations of summary statistics *
Bayes' rule
Example: Testing for COVID-19
Example: The Monty Hall problem
Inverse problems *
Bernoulli and binomial distributions
Definition
Sigmoid (logistic) function
Binary logistic regression
Categorical and multinomial distributions
Definition
Softmax function
Multiclass logistic regression
Log-sum-exp trick
Univariate Gaussian (normal) distribution
Cumulative distribution function
Probability density function
Regression
Why is the Gaussian distribution so widely used?
Dirac delta function as a limiting case
Some other common univariate distributions *
Student t distribution
Cauchy distribution
Laplace distribution
Beta distribution
Gamma distribution
Empirical distribution
Transformations of random variables *
Discrete case
Continuous case
Invertible transformations (bijections)
Moments of a linear transformation
The convolution theorem
Central limit theorem
Monte Carlo approximation
Exercises
Probability: Multivariate Models
Joint distributions for multiple random variables
Covariance
Correlation
Uncorrelated does not imply independent
Correlation does not imply causation
Simpson's paradox
The multivariate Gaussian (normal) distribution
Definition
Mahalanobis distance
Marginals and conditionals of an MVN *
Example: conditioning a 2d Gaussian
Example: Imputing missing values *
Linear Gaussian systems *
Bayes rule for Gaussians
Derivation *
Example: Inferring an unknown scalar
Example: inferring an unknown vector
Example: sensor fusion
The exponential family *
Definition
Example
Log partition function is cumulant generating function
Maximum entropy derivation of the exponential family
Mixture models
Gaussian mixture models
Bernoulli mixture models
Probabilistic graphical models *
Representation
Inference
Learning
Exercises
Statistics
Introduction
Maximum likelihood estimation (MLE)
Definition
Justification for MLE
Example: MLE for the Bernoulli distribution
Example: MLE for the categorical distribution
Example: MLE for the univariate Gaussian
Example: MLE for the multivariate Gaussian
Example: MLE for linear regression
Empirical risk minimization (ERM)
Example: minimizing the misclassification rate
Surrogate loss
Other estimation methods *
The method of moments
Online (recursive) estimation
Regularization
Example: MAP estimation for the Bernoulli distribution
Example: MAP estimation for the multivariate Gaussian *
Example: weight decay
Picking the regularizer using a validation set
Cross-validation
Early stopping
Using more data
Bayesian statistics *
Conjugate priors
The beta-binomial model
The Dirichlet-multinomial model
The Gaussian-Gaussian model
Beyond conjugate priors
Credible intervals
Bayesian machine learning
Computational issues
Frequentist statistics *
Sampling distributions
Gaussian approximation of the sampling distribution of the MLE
Bootstrap approximation of the sampling distribution of any estimator
Confidence intervals
Caution: Confidence intervals are not credible
The bias-variance tradeoff
Exercises
Decision Theory
Bayesian decision theory
Basics
Classification problems
ROC curves
Precision-recall curves
Regression problems
Probabilistic prediction problems
Choosing the "right" model
Bayesian hypothesis testing
Bayesian model selection
Occam's razor
Connection between cross validation and marginal likelihood
Information criteria
Hypothesis testing without Bayes factors
Frequentist decision theory
Computing the risk of an estimator
Consistent estimators
Admissible estimators
Empirical risk minimization
Empirical risk
Structural risk
Cross-validation
Statistical learning theory *
Frequentist hypothesis testing *
Likelihood ratio test
Null hypothesis significance testing (NHST)
p-values
p-values considered harmful
Why isn't everyone a Bayesian?
Exercises
Information Theory
Entropy
Entropy for discrete random variables
Cross entropy
Joint entropy
Conditional entropy
Perplexity
Differential entropy for continuous random variables *
Relative entropy (KL divergence) *
Definition
Interpretation
Example: KL divergence between two Gaussians
Non-negativity of KL
KL divergence and MLE
Forward vs reverse KL
Mutual information *
Definition
Interpretation
Example
Conditional mutual information
MI as a "generalized correlation coefficient"
Normalized mutual information
Maximal information coefficient
Data processing inequality
Sufficient Statistics
Fano's inequality *
Exercises
Linear Algebra
Introduction
Notation
Vector spaces
Norms of a vector and matrix
Properties of a matrix
Special types of matrices
Matrix multiplication
Vector–vector products
Matrix–vector products
Matrix–matrix products
Application: manipulating data matrices
Kronecker products *
Einstein summation *
Matrix inversion
The inverse of a square matrix
Schur complements *
The matrix inversion lemma *
Matrix determinant lemma *
Application: deriving the conditionals of an MVN *
Eigenvalue decomposition (EVD)
Basics
Diagonalization
Eigenvalues and eigenvectors of symmetric matrices
Geometry of quadratic forms
Standardizing and whitening data
Power method
Deflation
Eigenvectors optimize quadratic forms
Singular value decomposition (SVD)
Basics
Connection between SVD and EVD
Pseudo inverse
SVD and the range and null space of a matrix *
Truncated SVD
Other matrix decompositions *
LU factorization
QR decomposition
Cholesky decomposition
Solving systems of linear equations *
Solving square systems
Solving underconstrained systems (least norm estimation)
Solving overconstrained systems (least squares estimation)
Matrix calculus
Derivatives
Gradients
Directional derivative
Total derivative *
Jacobian
Hessian
Gradients of commonly used functions
Exercises
Optimization
Introduction
Local vs global optimization
Constrained vs unconstrained optimization
Convex vs nonconvex optimization
Smooth vs nonsmooth optimization
First-order methods
Descent direction
Step size (learning rate)
Convergence rates
Momentum methods
Second-order methods
Newton's method
BFGS and other quasi-Newton methods
Trust region methods
Stochastic gradient descent
Application to finite sum problems
Example: SGD for fitting linear regression
Choosing the step size (learning rate)
Iterate averaging
Variance reduction *
Preconditioned SGD
Constrained optimization
Lagrange multipliers
The KKT conditions
Linear programming
Quadratic programming
Mixed integer linear programming *
Proximal gradient method *
Projected gradient descent
Proximal operator for ℓ1-norm regularizer
Proximal operator for quantization
Incremental (online) proximal methods
Bound optimization *
The general algorithm
The EM algorithm
Example: EM for a GMM
Blackbox and derivative free optimization
Exercises
II Linear Models
Linear Discriminant Analysis
Introduction
Gaussian discriminant analysis
Quadratic decision boundaries
Linear decision boundaries
The connection between LDA and logistic regression
Model fitting
Nearest centroid classifier
Fisher's linear discriminant analysis *
Naive Bayes classifiers
Example models
Model fitting
Bayesian naive Bayes
The connection between naive Bayes and logistic regression
Generative vs discriminative classifiers
Advantages of discriminative classifiers
Advantages of generative classifiers
Handling missing features
Exercises
Logistic Regression
Introduction
Binary logistic regression
Linear classifiers
Nonlinear classifiers
Maximum likelihood estimation
Stochastic gradient descent
Perceptron algorithm
Iteratively reweighted least squares
MAP estimation
Standardization
Multinomial logistic regression
Linear and nonlinear classifiers
Maximum likelihood estimation
Gradient-based optimization
Bound optimization
MAP estimation
Maximum entropy classifiers
Hierarchical classification
Handling large numbers of classes
Robust logistic regression *
Mixture model for the likelihood
Bi-tempered loss
Bayesian logistic regression *
Laplace approximation
Approximating the posterior predictive
Exercises
Linear Regression
Introduction
Least squares linear regression
Terminology
Least squares estimation
Other approaches to computing the MLE
Measuring goodness of fit
Ridge regression
Computing the MAP estimate
Connection between ridge regression and PCA
Choosing the strength of the regularizer
Lasso regression
MAP estimation with a Laplace prior (ℓ1 regularization)
Why does ℓ1 regularization yield sparse solutions?
Hard vs soft thresholding
Regularization path
Comparison of least squares, lasso, ridge and subset selection
Variable selection consistency
Group lasso
Elastic net (ridge and lasso combined)
Optimization algorithms
Regression splines *
B-spline basis functions
Fitting a linear model using a spline basis
Smoothing splines
Generalized additive models
Robust linear regression *
Laplace likelihood
Student-t likelihood
Huber loss
RANSAC
Bayesian linear regression *
Priors
Posteriors
Example
Computing the posterior predictive
The advantage of centering
Dealing with multicollinearity
Automatic relevancy determination (ARD) *
Exercises
Generalized Linear Models *
Introduction
Examples
Linear regression
Binomial regression
Poisson regression
GLMs with non-canonical link functions
Maximum likelihood estimation
Worked example: predicting insurance claims
III Deep Neural Networks
Neural Networks for Structured Data
Introduction
Multilayer perceptrons (MLPs)
The XOR problem
Differentiable MLPs
Activation functions
Example models
The importance of depth
The "deep learning revolution"
Connections with biology
Backpropagation
Forward vs reverse mode differentiation
Reverse mode differentiation for multilayer perceptrons
Vector-Jacobian product for common layers
Computation graphs
Training neural networks
Tuning the learning rate
Vanishing and exploding gradients
Non-saturating activation functions
Residual connections
Parameter initialization
Parallel training
Regularization
Early stopping
Weight decay
Sparse DNNs
Dropout
Bayesian neural networks
Regularization effects of (stochastic) gradient descent *
Other kinds of feedforward networks *
Radial basis function networks
Mixtures of experts
Exercises
Neural Networks for Images
Introduction
Common layers
Convolutional layers
Pooling layers
Putting it all together
Normalization layers
Common architectures for image classification
LeNet
AlexNet
GoogLeNet (Inception)
ResNet
DenseNet
Neural architecture search
Other forms of convolution *
Dilated convolution
Transposed convolution
Depthwise separable convolution
Solving other discriminative vision tasks with CNNs *
Image tagging
Object detection
Instance segmentation
Semantic segmentation
Human pose estimation
Generating images by inverting CNNs *
Converting a trained classifier into a generative model
Image priors
Visualizing the features learned by a CNN
Deep Dream
Neural style transfer
Neural Networks for Sequences
Introduction
Recurrent neural networks (RNNs)
Vec2Seq (sequence generation)
Seq2Vec (sequence classification)
Seq2Seq (sequence translation)
Teacher forcing
Backpropagation through time
Vanishing and exploding gradients
Gating and long term memory
Beam search
1d CNNs
1d CNNs for sequence classification
Causal 1d CNNs for sequence generation
Attention
Attention as soft dictionary lookup
Kernel regression as non-parametric attention
Parametric attention
Seq2Seq with attention
Seq2vec with attention (text classification)
Seq+Seq2Vec with attention (text pair classification)
Soft vs hard attention
Transformers
Self-attention
Multi-headed attention
Positional encoding
Putting it all together
Comparing transformers, CNNs and RNNs
Transformers for images *
Other transformer variants *
Efficient transformers *
Fixed non-learnable localized attention patterns
Learnable sparse attention patterns
Memory and recurrence methods
Low-rank and kernel methods
Language models and unsupervised representation learning
ELMo
BERT
GPT
T5
Discussion
IV Nonparametric Models
Exemplar-based Methods
K nearest neighbor (KNN) classification
Example
The curse of dimensionality
Reducing the speed and memory requirements
Open set recognition
Learning distance metrics
Linear and convex methods
Deep metric learning
Classification losses
Ranking losses
Speeding up ranking loss optimization
Other training tricks for DML
Kernel density estimation (KDE)
Density kernels
Parzen window density estimator
How to choose the bandwidth parameter
From KDE to KNN classification
Kernel regression
Kernel Methods *
Mercer kernels
Mercer's theorem
Some popular Mercer kernels
Gaussian processes
Noise-free observations
Noisy observations
Comparison to kernel regression
Weight space vs function space
Numerical issues
Estimating the kernel
GPs for classification
Connections with deep learning
Scaling GPs to large datasets
Support vector machines (SVMs)
Large margin classifiers
The dual problem
Soft margin classifiers
The kernel trick
Converting SVM outputs into probabilities
Connection with logistic regression
Multi-class classification with SVMs
How to choose the regularizer C
Kernel ridge regression
SVMs for regression
Sparse vector machines
Relevance vector machines (RVMs)
Comparison of sparse and dense kernel methods
Exercises
Trees, Forests, Bagging, and Boosting
Classification and regression trees (CART)
Model definition
Model fitting
Regularization
Handling missing input features
Pros and cons
Ensemble learning
Stacking
Ensembling is not Bayes model averaging
Bagging
Random forests
Boosting
Forward stagewise additive modeling
Quadratic loss and least squares boosting
Exponential loss and AdaBoost
LogitBoost
Gradient boosting
Interpreting tree ensembles
Feature importance
Partial dependency plots
V Beyond Supervised Learning
Learning with Fewer Labeled Examples
Data augmentation
Examples
Theoretical justification
Transfer learning
Fine-tuning
Adapters
Supervised pre-training
Unsupervised pre-training (self-supervised learning)
Domain adaptation
Semi-supervised learning
Self-training and pseudo-labeling
Entropy minimization
Co-training
Label propagation on graphs
Consistency regularization
Deep generative models *
Combining self-supervised and semi-supervised learning
Active learning
Decision-theoretic approach
Information-theoretic approach
Batch active learning
Meta-learning
Model-agnostic meta-learning (MAML)
Few-shot learning
Matching networks
Weakly supervised learning
Exercises
Dimensionality Reduction
Principal components analysis (PCA)
Examples
Derivation of the algorithm
Computational issues
Choosing the number of latent dimensions
Factor analysis *
Generative model
Probabilistic PCA
EM algorithm for FA/PPCA
Unidentifiability of the parameters
Nonlinear factor analysis
Mixtures of factor analysers
Exponential family factor analysis
Factor analysis models for paired data
Autoencoders
Bottleneck autoencoders
Denoising autoencoders
Contractive autoencoders
Sparse autoencoders
Variational autoencoders
Manifold learning *
What are manifolds?
The manifold hypothesis
Approaches to manifold learning
Multi-dimensional scaling (MDS)
Isomap
Kernel PCA
Maximum variance unfolding (MVU)
Local linear embedding (LLE)
Laplacian eigenmaps
t-SNE
Word embeddings
Latent semantic analysis / indexing
Word2vec
GloVe
Word analogies
RAND-WALK model of word embeddings
Contextual word embeddings
Exercises
Clustering
Introduction
Evaluating the output of clustering methods
Hierarchical agglomerative clustering
The algorithm
Example
Extensions
K means clustering
The algorithm
Examples
Vector quantization
The K-means++ algorithm
The K-medoids algorithm
Speedup tricks
Choosing the number of clusters K
Clustering using mixture models
Mixtures of Gaussians
Mixtures of Bernoullis
Spectral clustering *
Normalized cuts
Eigenvectors of the graph Laplacian encode the clustering
Example
Connection with other methods
Biclustering *
Basic biclustering
Nested partition models (Crosscat)
Recommender Systems
Explicit feedback
Datasets
Collaborative filtering
Matrix factorization
Autoencoders
Implicit feedback
Bayesian personalized ranking
Factorization machines
Neural matrix factorization
Leveraging side information
Exploration-exploitation tradeoff
Graph Embeddings *
Introduction
Graph Embedding as an Encoder/Decoder Problem
Shallow graph embeddings
Unsupervised embeddings
Distance-based: Euclidean methods
Distance-based: non-Euclidean methods
Outer product-based: Matrix factorization methods
Outer product-based: Skip-gram methods
Supervised embeddings
Graph Neural Networks
Message passing GNNs
Spectral Graph Convolutions
Spatial Graph Convolutions
Non-Euclidean Graph Convolutions
Deep graph embeddings
Unsupervised embeddings
Semi-supervised embeddings
Applications
Unsupervised applications
Supervised applications
Notation
Introduction
Common mathematical symbols
Functions
Common functions of one argument
Common functions of two arguments
Common functions of >2 arguments
Linear algebra
General notation
Vectors
Matrices
Matrix calculus
Optimization
Probability
Information theory
Statistics and machine learning
Supervised learning
Unsupervised learning and generative models
Bayesian inference
Abbreviations
Index

✦ Subjects


Machine Learning; Probabilistic Models; Neural Networks; Deep Learning; Supervised Learning; Python; Recommender Systems; Bayesian Inference; Clustering; Kernel Methods; Statistics; Optimization; Linear Regression; Logistic Regression; Linear Algebra; Probability Theory; Linear Models; Random Forest; Dimensionality Reduction; Bagging; Nonparametric Models; Boosting; Graph Embeddings


📜 SIMILAR VOLUMES


Probabilistic Machine Learning: An Intro
✍ Kevin P. Murphy 📂 Library 📅 2022 🏛 MIT Press 🌐 English

A detailed and up-to-date introduction to machine learning, presented through the unifying lens of probabilistic modeling and Bayesian decision theory. This book offers a detailed and up-to-date introduction to machine learning (including deep learning) through the unifying lens of probabilistic

Probabilistic Machine Learning: An Intro
✍ Kevin P. Murphy 📂 Library 📅 2022 🏛 The MIT Press 🌐 English

A detailed and up-to-date introduction to machine learning, presented through the unifying lens of probabilistic modeling and Bayesian decision theory. This book offers a detailed and up-to-date introduction to machine learning (including deep learning) through the unifying lens of probabilistic

Probabilistic Machine Learning: An Intro
✍ Kevin P. Murphy 📂 Library 📅 2021 🏛 The MIT Press 🌐 English

In 2012, I published a 1200-page book called “Machine learning: a probabilistic perspective”, which provided a fairly comprehensive coverage of the field of machine learning (ML) at that time, under the unifying lens of probabilistic modeling. The book was well received, and won the De Groot prize in 20