Information-Driven Machine Learning: Data Science as an Engineering Discipline

โœ Scribed by Gerald Friedland


Publisher: Springer
Year: 2023
Tongue: English
Leaves: 281
Edition: 1
Category: Library


✦ Synopsis


This groundbreaking book transcends traditional machine learning approaches by introducing information measurement methodologies that revolutionize the field.

Stemming from a UC Berkeley seminar on experimental design for machine learning tasks, these techniques aim to overcome the "black box" approach of machine learning by reducing reliance on conjectures such as magic numbers (hyperparameters) and model-type bias. Information-based machine learning enables data quality measurements, a priori estimation of task complexity, and reproducible design of data science experiments. The benefits include significantly smaller models, increased explainability, and enhanced resilience, all of which advance the discipline's robustness and credibility.

While bridging the gap between machine learning and disciplines such as physics, information theory, and computer engineering, this textbook maintains an accessible and comprehensive style, making complex topics digestible for a broad readership. Information-Driven Machine Learning explores the synergy among these disciplines to deepen our understanding of data science modeling. Instead of focusing solely on the "how," this text answers the "why" questions that permeate the field, shedding light on the underlying principles of machine learning processes and their practical implications. By advocating systematic methodologies grounded in fundamental principles, the book challenges industry practices that have often evolved from ideological or profit-driven motivations. It addresses a range of topics, including deep learning, data drift, and MLOps, using fundamental principles such as entropy, capacity, and high dimensionality.

Ideal for both academic and industry professionals, this textbook serves as a valuable tool for those seeking to deepen their understanding of data science as an engineering discipline. Its thought-provoking content stimulates intellectual curiosity and caters to readers who want more than code or ready-made formulas. The text invites readers to look beyond conventional viewpoints, offering an alternative perspective that promotes a big-picture view of integrating theory with practice. Suitable for upper-undergraduate or graduate-level courses, the book can also benefit practicing engineers and scientists in various disciplines by enhancing their understanding of modeling and of effective data measurement.



✦ Table of Contents


Preface
Contents
List of Figures
1 Introduction
1.1 Science
1.1.1 Step 1: Observation
1.1.2 Step 2: Hypothesis
1.1.3 Step 3: Experiment
1.1.4 Step 4: Conclusion
1.1.5 Additional Step: Simplification
1.2 Data Science
1.3 Information Measurements
1.4 Exercises
1.5 Further Reading
2 The Automated Scientific Process
2.1 The Role of the Human
2.1.1 Curiosity
2.1.2 Data Collection
2.1.3 The Data Table
2.2 Automated Model Building
2.2.1 The Finite State Machine
2.2.2 How Machine Learning Generalizes
Well-Posedness
2.3 Exercises
2.4 Further Reading
3 The (Black Box) Machine Learning Process
3.1 Types of Tasks
3.1.1 Unsupervised Learning
3.1.2 Supervised Learning
3.2 Black-Box Machine Learning Process
3.2.1 Training/Validation Split
3.2.2 Independent and Identically Distributed
Example
3.3 Types of Models
3.3.1 Nearest Neighbors
3.3.2 Linear Regression
Training
3.3.3 Decision Trees
Training
3.3.4 Random Forests
Training
3.3.5 Neural Networks
Neuron Model
Perceptron Learning
Backpropagation
3.3.6 Support Vector Machines
Linear Support Vector Machines
Kernel Support Vector Machines
3.3.7 Genetic Programming
Training
3.4 Error Metrics
3.4.1 Binary Classification
3.4.2 Detection
3.4.3 Multi-class Classification
3.4.4 Regression
3.5 The Information-Based Machine Learning Process
3.6 Exercises
3.7 Further Reading
4 Information Theory
4.1 Probability, Uncertainty, Information
4.1.1 Chance and Probability
4.1.2 Probability Space
4.1.3 Uncertainty and Entropy
4.1.4 Information
4.1.5 Example
4.2 Minimum Description Length
4.2.1 Example
4.3 Information in Curves
4.4 Information in a Table
4.5 Exercises
4.6 Further Reading
5 Capacity
5.1 Intellectual Capacity
5.1.1 Minsky's Criticism
5.1.2 Cover's Solution
5.1.3 MacKay's Viewpoint
5.2 Memory-Equivalent Capacity of a Model
5.3 Exercises
5.4 Further Reading
6 The Mechanics of Generalization
6.1 Logic Definition of Generalization
6.2 Translating a Table into a Finite State Machine
6.3 Generalization as Compression
6.4 Resilience
6.5 Adversarial Examples
6.6 Exercises
6.7 Further Reading
7 Meta-Math: Exploring the Limits of Modeling
7.1 Algebra
7.1.1 Garbage In, Garbage Out
Creativity and the Data Processing Inequality
7.1.2 Randomness
7.1.3 Transcendental Numbers
7.2 No Rule Without Exception
7.2.1 Example: Why Do Prime Numbers Exist?
7.2.2 Compression by Association
7.3 Correlation vs. Causality
7.4 No Free Lunch
7.5 All Models Are Wrong
7.6 Exercises
7.7 Further Reading
8 Capacity of Neural Networks
8.1 Memory-Equivalent Capacity of Neural Networks
8.2 Upper-Bounding the MEC Requirement of a Neural Network Given Training Data
8.3 Topological Concerns
8.4 MEC for Regression Networks
8.5 Exercises
8.6 Further Reading
9 Neural Network Architectures
9.1 Deep Learning and Convolutional Neural Networks
9.1.1 Convolutional Neural Networks
9.1.2 Residual Networks
9.2 Generative Adversarial Networks
9.3 Autoencoders
9.4 Transformers
9.4.1 Architecture
9.4.2 Self-attention Mechanism
9.4.3 Positional Encoding
9.4.4 Example Transformation
9.4.5 Applications and Limitations
9.5 The Role of Neural Architectures
9.6 Exercises
9.7 Further Reading
10 Capacities of Some Other Machine Learning Methods
10.1 k-Nearest Neighbors
10.2 Support Vector Machines
10.3 Decision Trees
10.3.1 Converting a Table into a Decision Tree
10.3.2 Decision Trees
10.3.3 Generalization of Decision Trees
10.3.4 Ensembling
10.4 Genetic Programming
10.5 Unsupervised Methods
10.5.1 k-Means Clustering
10.5.2 Hopfield Networks
10.6 Exercises
10.7 Further Reading
11 Data Collection and Preparation
11.1 Data Collection and Annotation
11.2 Task Definition
11.3 Well-Posedness
11.3.1 Chaos and How to Avoid It
11.3.2 Example
Example
11.3.3 Forcing Well-Posedness
11.4 Tabularization
11.4.1 Table Data
11.4.2 Time-Series Data
11.4.3 Natural Language and Other Varying-Dependency Data
Stop-Symbol Cycles
Non-linear Dependencies
11.4.4 Perceptual Data
Estimating the Signal-to-Noise Ratio
11.4.5 Multimodal Data
11.5 Data Validation
11.5.1 Hard Conditions
Rows of Different Dimensionality
Missing Input Values
Missing Target Values
Less Than Two Classes
Database Key/Maximum Entropy Column
Constant Values
Redundancy
Contradictions
11.5.2 Soft Conditions
High-Entropy Target Column
High/Low-Entropy Input Columns
Out-of-Range Number Columns
11.6 Numerization
11.7 Imbalanced Data
11.7.1 Extension Beyond Simple Accuracy
11.8 Exercises
11.9 Further Reading
12 Measuring Data Sufficiency
12.1 Dispelling a Myth
12.2 Capacity Progression
12.3 Equilibrium Machine Learner
12.4 Data Sufficiency Using the Equilibrium Machine Learner
12.5 Exercises
12.6 Further Reading
13 Machine Learning Operations
13.1 What Makes a Predictor Production-Ready?
13.2 Quality Assurance for Predictors
13.2.1 Traditional Unit Testing
13.2.2 Synthetic Data Crash Tests
13.2.3 Data Drift Test
13.2.4 Adversarial Examples Test
13.2.5 Regression Tests
13.3 Measuring Model Bias
13.3.1 Where Does the Bias Come from?
Add-One-In Test
Leave-One-Out Test
Cheating Experiments
13.4 Security and Privacy
13.5 Exercises
13.6 Further Reading
14 Explainability
14.1 Explainable to Whom?
14.2 Occam's Razor Revisited
14.3 Attribute Ranking: Finding What Matters
14.4 Heatmapping
14.5 Instance-Based Explanations
14.6 Rule Extraction
14.6.1 Visualizing Neurons and Layers
14.6.2 Local Interpretable Model-Agnostic Explanations (LIME)
14.7 Future Directions
14.7.1 Causal Inference
14.7.2 Interactive Explanations
14.7.3 Explainability Evaluation Metrics
14.8 Fewer Parameters
14.9 Exercises
14.10 Further Reading
15 Repeatability and Reproducibility
15.1 Traditional Software Engineering
15.2 Why Reproducibility Matters
15.3 Reproducibility Standards
15.4 Achieving Reproducibility
15.5 Beyond Reproducibility
15.6 Exercises
15.7 Further Reading
16 The Curse of Training and the Blessing of High Dimensionality
16.1 Training Is Difficult
16.1.1 Common Workarounds
Hardware Support
Early Stopping
Transfer Learning
Model Selection and AutoML
16.2 Training in Logarithmic Time
16.3 Building Neural Networks Incrementally
16.4 The Blessing of High Dimensionality
16.5 Exercises
16.6 Further Reading
17 Machine Learning and Society
17.1 Societal Reaction: The Hype Train, Worship, or Fear
17.2 Some Basic Suggestions from a Technical Perspective
17.2.1 Understand Technological Diffusion and Allow Society Time to Adapt
17.2.2 Measure Memory-Equivalent Capacity (MEC)
17.2.3 Focus on Smaller, Task-Specific Models
17.2.4 Organic Growth of Large-Scale Models from Small-Scale Models
17.2.5 Measure and Control Generalization to Solve Copyright Issues
17.2.6 Leave Decisions to Qualified Humans
17.3 Exercises
17.4 Further Reading
A Recap: The Logarithm
B More on Complexity
B.1 O-Notation
B.2 Kolmogorov Complexity
B.3 VC Dimension
B.4 Shannon Entropy as the Only Way to Measure Information
B.5 Physical Work
B.5.1 Example 1: Physical View of the Halting Problem
B.5.2 Example 2: Why Do We Expect Diamonds to Exist?
B.6 P vs NP Complexity
C Concepts Cheat Sheet
D A Review Form That Promotes Reproducibility
Bibliography
Index


📜 SIMILAR VOLUMES


Data-Driven Science and Engineering: Mac
โœ Steven L. Brunton, J. Nathan Kutz ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› Cambridge University Press ๐ŸŒ English

Data-driven discovery is revolutionizing the modeling, prediction, and control of complex systems. This textbook brings together machine learning, engineering mathematics, and mathematical physics to integrate modeling and control of dynamical systems with modern methods in data science. It highligh

Data Driven Science & Engineering
โœ Steven L. Brunton, J. Nathan Kutz ๐Ÿ“‚ Library ๐Ÿ“… 2017 ๐Ÿ› Brunton & Kutz ๐ŸŒ English

Data Driven Science & Engineering. 2017.