Statistics for Data Science and Analytics

✍ Scribed by Peter C. Bruce; Peter Gedeck; Janet Dobbins

Publisher: Wiley
Year: 2024
Tongue: English
Leaves: 366
Edition: 1
Category: Library

✦ Synopsis

Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration

Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, presenting important topics useful for data science such as prediction, correlation, and data exploration. The authors provide an introduction to statistical science and big data, as well as an overview of Python data structures and operations.

A range of statistical techniques are presented with their implementation in Python, including hypothesis testing, probability, exploratory data analysis, categorical variables, surveys and sampling, A/B testing, and correlation. The text introduces binary classification, a foundational element of machine learning, validation of statistical models by applying them to holdout data, and probability and inference via the easy-to-understand method of resampling and the bootstrap instead of using a myriad of “kitchen sink” formulas. Regression is taught both as a tool for explanation and for prediction.
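
To give a feel for the resampling approach mentioned above, here is a minimal sketch of a percentile bootstrap interval for a mean. It is not taken from the book: the function name `bootstrap_mean_ci`, its parameters, and the sample values are all hypothetical, and only the Python standard library is used.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, level=0.90, seed=0):
    """Percentile bootstrap interval for the mean of `data` (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the observed sample, with replacement.
        resample = rng.choices(data, k=n)
        means.append(statistics.mean(resample))
    means.sort()
    tail = (1 - level) / 2
    # Read the interval endpoints off the sorted resampled means.
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


sample = [12, 15, 9, 21, 17, 14, 11, 19, 16, 13]  # made-up values
low, high = bootstrap_mean_ci(sample)
print(f"90% bootstrap interval for the mean: {low:.2f} to {high:.2f}")
```

The interval comes straight from the spread of the resampled means, so no distributional formula is needed. A different seed or number of resamples will shift the endpoints slightly.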

This book is informed by the authors' experience designing and teaching both introductory statistics and machine learning at Statistics.com. Each chapter includes practical examples, explanations of the underlying concepts, and Python code snippets to help readers apply the techniques themselves.

Statistics for Data Science and Analytics covers sample topics such as:

Int, float, and string data types, numerical operations, manipulating strings, converting data types, and advanced data structures like lists, dictionaries, and sets
Experiment design via randomizing, blinding, and before-after pairing, as well as proportions and percents when handling binary data
Specialized Python packages like numpy, scipy, pandas, scikit-learn, and statsmodels (the workhorses of data science) and how to get the most value from them
Statistical versus practical significance, random number generators, functions for code reuse, and binomial and normal probability distributions

Written by and for data science instructors, Statistics for Data Science and Analytics is an excellent learning resource for data science instructors prescribing a required intro stats course for their programs, as well as for students and professionals seeking to transition to the data science field.

✦ Table of Contents

Front Matter
Title Page
Copyright
Contents
About the Authors
Acknowledgments
About the Companion Website
Introduction
Chapter 1
1.1 Big Data: Predicting Pregnancy
1.2 Phantom Protection from Vitamin E
1.3 Statistician, Heal Thyself
1.4 Identifying Terrorists in Airports
1.5 Looking Ahead
1.6 Big Data and Statisticians
1.6.1 Data Scientists
Chapter 2
2.1 Statistical Science
2.2 Big Data
2.3 Data Science
2.4 Example: Hospital Errors
2.5 Experiment
2.6 Designing an Experiment
2.6.1 A/B Tests; A Controlled Experiment for the Hospital Plans
2.6.2 Randomizing
2.6.3 Planning
2.6.4 Bias
2.6.4.1 Placebo
2.6.4.2 Blinding
2.6.4.3 Before‐after Pairing
2.7 The Data
2.7.1 Dataframe Format
2.8 Variables and Their Flavors
2.8.1 Numeric Variables
2.8.2 Categorical Variables
2.8.3 Binary Variables
2.8.4 Text Data
2.8.5 Random Variables
2.8.6 Simplified Columnar Format
2.9 Python: Data Structures and Operations
2.9.1 Primary Data Types
2.9.2 Comments
2.9.3 Variables
2.9.4 Operations on Data
2.9.4.1 Converting Data Types
2.9.5 Advanced Data Structures
2.9.5.1 Classes and Objects
2.9.5.2 Data Types and Their Declaration
2.10 Are We Sure We Made a Difference?
2.11 Is Chance Responsible? The Foundation of Hypothesis Testing
2.11.1 Looking at Just One Hospital
2.12 Probability
2.12.1 Interpreting Our Result
2.13 Significance or Alpha Level
2.13.1 Increasing the Sample Size
2.13.2 Simulating Probabilities with Random Numbers
2.14 Other Kinds of Studies
2.15 When to Use Hypothesis Tests
2.16 Experiments Falling Short of the Gold Standard
2.17 Summary
2.18 Python: Iterations and Conditional Execution
2.18.1 if Statements
2.18.2 for Statements
2.18.3 while Statements
2.18.4 break and continue Statements
2.18.5 Example: Calculate Mean, Standard Deviation, Subsetting
2.18.6 List Comprehensions
2.19 Python: Numpy, scipy, and pandas—The Workhorses of Data Science
2.19.1 Numpy
2.19.2 Scipy
2.19.3 Pandas
2.19.3.1 Reading and Writing Data
2.19.3.2 Accessing Data
2.19.3.3 Manipulating Data
2.19.3.4 Iterating Over a DataFrame
2.19.3.5 And a Lot More
Exercises
Chapter 3
3.1 Exploratory Data Analysis
3.2 What to Measure—Central Location
3.2.1 Mean
3.2.2 Median
3.2.3 Mode
3.2.4 Expected Value
3.2.5 Proportions for Binary Data
3.2.5.1 Percents
3.3 What to Measure—Variability
3.3.1 Range
3.3.2 Percentiles
3.3.3 Interquartile Range
3.3.4 Deviations and Residuals
3.3.5 Mean Absolute Deviation
3.3.6 Variance and Standard Deviation
3.3.6.1 Denominator of N or N–1?
3.3.7 Population Variance
3.3.8 Degrees of Freedom
3.4 What to Measure—Distance (Nearness)
3.5 Test Statistic
3.5.1 Test Statistic for this Study
3.6 Examining and Displaying the Data
3.6.1 Frequency Tables
3.6.2 Histograms
3.6.3 Bar Chart
3.6.4 Box Plots
3.6.5 Tails and Skew
3.6.6 Errors and Outliers Are Not the Same Thing!
3.7 Python: Exploratory Data Analysis/Data Visualization
3.7.1 Matplotlib
3.7.2 Data Visualization Using Pandas and Seaborn
Exercises
Chapter 4
4.1 Avoid Being Fooled by Chance
4.2 The Null Hypothesis
4.3 Repeating the Experiment
4.3.1 Shuffling and Picking Numbers from a Hat or Box
4.3.2 How Many Reshuffles?
4.3.3 The t‐Test
4.3.4 Conclusion
4.4 Statistical Significance
4.4.1 Bottom Line
4.4.1.1 Statistical Significance as a Screening Device
4.4.2 Torturing the Data
4.4.3 Practical Significance
4.5 Power
4.6 The Normal Distribution
4.6.1 The Exact Test
4.7 Summary
4.8 Python: Random Numbers
4.8.1 Generating Random Numbers Using the random Package
4.8.2 Random Numbers in numpy and scipy
4.8.3 Using Random Numbers in Other Packages
4.8.4 Example: Implement a Resampling Experiment
4.8.5 Write Functions for Code Reuse
4.8.6 Organizing Code into Modules
Exercises
Chapter 5
5.1 What Is Probability
5.2 Simple Probability
5.2.1 Venn Diagrams
5.3 Probability Distributions
5.3.1 Binomial Distribution
5.3.1.1 Example
5.4 From Binomial to Normal Distribution
5.4.1 Standardization (Normalization)
5.4.2 Standard Normal Distribution
5.4.2.1 z‐Tables
5.4.3 The 95 Percent Rule
5.5 Appendix: Binomial Formula and Normal Approximation
5.5.1 Normal Approximation
5.6 Python: Probability
5.6.1 Converting Counts to Probabilities
5.6.2 Probability Distributions in Python
5.6.3 Probability Distributions in random
5.6.4 Probability Distributions in the scipy Package
5.6.4.1 Continuous Distributions
5.6.4.2 Discrete Distributions
Exercises
Chapter 6
6.1 Two‐way Tables
6.2 Conditional Probability
6.2.1 From Numbers to Percentages to Conditional Probabilities
6.3 Bayesian Estimates
6.3.1 Let's Review the Different Probabilities
6.3.2 Bayesian Calculations
6.4 Independence
6.4.1 Chi‐square Test
6.4.1.1 Sensor Calibration
6.4.1.2 Standardizing Departure from Expected
6.5 Multiplication Rule
6.6 Simpson's Paradox
6.7 Python: Counting and Contingency Tables
6.7.1 Counting in Python
6.7.2 Counting in Pandas
6.7.3 Two‐way Tables Using Pandas
6.7.4 Chi‐square Test
Exercises
Chapter 7
7.1 Literary Digest—Sampling Trumps “All Data”
7.2 Simple Random Samples
7.3 Margin of Error: Sampling Distribution for a Proportion
7.3.1 The Confidence Interval
7.3.2 A More Manageable Box: Sampling with Replacement
7.3.3 Summing Up
7.4 Sampling Distribution for a Mean
7.4.1 Simulating the Behavior of Samples from a Hypothetical Population
7.5 The Bootstrap
7.5.1 Resampling Procedure (Bootstrap)
7.6 Rationale for the Bootstrap
7.6.1 Let's Recap
7.6.2 Formula‐based Counterparts to Resampling
7.6.2.1 FORMULA: The Z‐interval
7.6.2.2 Proportions
7.6.3 For a Mean: T‐interval
7.6.4 Example—Manual Calculations
7.6.5 Example—Software
7.6.6 A Bit of History—1906 at Guinness Brewery
7.6.7 The Bootstrap Today
7.6.8 Central Limit Theorem
7.7 Standard Error
7.7.1 Standard Error via Formula
7.8 Other Sampling Methods
7.8.1 Stratified Sampling
7.8.2 Cluster Sampling
7.8.3 Systematic Sampling
7.8.4 Multistage Sampling
7.8.5 Convenience Sampling
7.8.6 Self‐selection
7.8.7 Nonresponse Bias
7.9 Absolute vs. Relative Sample Size
7.10 Python: Random Sampling Strategies
7.10.1 Implement Simple Random Sample (SRS)
7.10.2 Determining Confidence Intervals
7.10.3 Bootstrap Sampling to Determine Confidence Intervals for a Mean
7.10.4 Advanced Sampling Techniques
7.10.4.1 Stratified Sampling for Categorical Variables
7.10.4.2 Stratified Sampling of Continuous Variables
Exercises
Chapter 8
8.1 Count Data—R × C Tables
8.2 The Role of Experiments (Many Are Costly)
8.2.1 Example: Marriage Therapy
8.3 Chi‐Square Test
8.3.1 Alternate Option
8.3.2 Testing for the Role of Chance
8.3.3 Standardization to the Chi‐Square Statistic
8.3.4 Chi‐Square Example on the Computer
8.4 Single Sample—Goodness‐of‐Fit
8.4.1 Resampling Procedure
8.5 Numeric Data: ANOVA
8.6 Components of Variance
8.6.1 From ANOVA to Regression
8.7 Factorial Design
8.7.1 Stratification and Blocking
8.7.2 Blocking
8.8 The Problem of Multiple Inference
8.9 Continuous Testing
8.9.1 Medicine
8.9.2 Business
8.10 Bandit Algorithms
8.10.1 Web Testing
8.11 Appendix: ANOVA, the Factor Diagram, and the F‐Statistic
8.11.1 Decomposition: The Factor Diagram
8.11.2 Constructing the ANOVA Table
8.11.3 Inference Using the ANOVA Table
8.11.4 The F‐Distribution
8.11.5 Different Sized Groups
8.11.5.1 Resampling Method
8.11.5.2 Formula Method
8.11.6 Caveats and Assumptions
8.12 More than One Factor or Variable—From ANOVA to Statistical Models
8.13 Python: Contingency Tables and Chi‐square Test
8.13.1 Example: Marriage Therapy
8.13.2 Example: Imanishi‐Kari Data
8.14 Python: ANOVA
8.14.1 Visual Comparison of Groups
8.14.2 ANOVA Using Resampling Test
8.14.3 ANOVA Using the F‐Statistic
Exercises
Chapter 9
9.1 Example: Delta Wire
9.2 Example: Cotton Dust and Lung Disease
9.3 The Vector Product Sum Test
9.3.1 Example: Baseball Payroll
9.3.1.1 Resampling Procedure
9.4 Correlation Coefficient
9.4.1 Inference for the Correlation Coefficient—Resampling
9.4.1.1 Hypothesis Test—Resampling
9.4.1.2 Example: Baseball Again
9.4.1.3 Inference for the Correlation Coefficient: Formulas
9.5 Correlation is not Causation
9.5.1 A Lurking External Cause
9.5.2 Coincidence
9.6 Other Forms of Association
9.7 Python: Correlation
9.7.1 Vector Operations
9.7.2 Resampling Test for Vector Product Sums
9.7.3 Calculating Correlation Coefficient
9.7.4 Calculate Correlation with numpy, pandas
9.7.5 Hypothesis Tests for Correlation
9.7.6 Using the t Statistic
9.7.7 Visualizing Correlation
Exercises
Chapter 10
10.1 Finding the Regression Line by Eye
10.1.1 Making Predictions Based on the Regression Line
10.2 Finding the Regression Line by Minimizing Residuals
10.2.1 The “Loss Function”
10.3 Linear Relationships
10.3.1 Example: Workplace Exposure and PEFR
10.3.2 Residual Plots
10.3.2.1 How to Read the Payroll Residual Plot
10.4 Prediction vs. Explanation
10.4.1 Research Studies: Regression for Explanation
10.4.2 Assessing the Performance of Regression for Explanation
10.4.3 Big Data: Regression for Prediction
10.4.4 Assessing the Performance of Regression for Prediction
10.5 Python: Linear Regression
10.5.1 Linear Regression Using Statsmodels
10.5.2 Using the Non‐formula Interface to statsmodels
10.5.3 Linear Regression Using scikit‐learn
10.5.4 Splitting Datasets and Evaluating Model Performance
Exercises
Chapter 11
11.1 Terminology
11.2 Example—Housing Prices
11.2.1 Explaining Home Prices
11.2.2 House Prices in Boston
11.2.3 Explore the Data
11.2.3.1 Performing and Interpreting a Regression Analysis
11.2.4 Using the Regression Equation
11.3 Interaction
11.3.1 Original Regression with No Interaction Term
11.3.2 The Regression with an Interaction Term
11.3.3 Does Crime Pay?
11.4 Regression Assumptions
11.4.1 Violation of Assumptions—Is the Model Useless?
11.5 Assessing Explanatory Regression Models
11.5.1 Overall Model Strength R²
11.5.2 Assessing Individual Coefficients
11.5.3 Resampling Procedure to Test Statistical Significance
11.5.4 Resampling Procedure for a Confidence Interval (the Pulmonary Data)
11.5.4.1 Interpretation
11.5.5 Formula‐based Inference
11.5.6 Interpreting Software Output
11.5.7 More Practice: Bootstrapping the Boston Housing Model
11.5.8 Inference for Regression—Hypothesis Tests
11.6 Assessing Regression for Prediction
11.6.1 Separate Training and Holdout Data
11.6.2 Root Mean Squared Error—RMSE
11.6.3 Tayko
11.6.4 Binary and Categorical Variables in Regression
11.6.5 Multicollinearity
11.6.6 Tayko—Building the Model
11.6.7 Reviewing the Output
11.6.8 Scoring the Model to the Validation Partition
11.6.9 The Naive Rule
11.7 Python: Multiple Linear Regression
11.7.1 Using Statsmodels
11.7.1.1 Adding Interaction Terms
11.7.2 Diagnostic Plots
11.7.3 Using Scikit‐learn
11.7.3.1 Adding Interaction Terms
11.7.4 Resampling Procedures
11.7.4.1 Estimating the Significance of the Coefficients
11.7.4.2 Estimating Confidence Intervals—The Bootstrap
Exercises
Chapter 12
12.1 K‐Nearest‐Neighbors
12.1.1 Predicting Which Customers Might be Pregnant
12.1.2 Small Hypothetical Example
12.1.3 Setting k
12.1.4 K‐Nearest‐Neighbors and Numerical Outcomes
12.1.5 Explanatory Modeling
12.2 Python: Classification
12.2.1 Classification Using scikit‐learn
12.2.2 Evaluating the Model
12.2.3 Streamlining Model Fitting Using Pipelines
Exercises
Index

📜 SIMILAR VOLUMES

📁 Statistics for Data Science and Analytics

✍ Peter C. Bruce; Peter Gedeck; Janet Dobbins 📂 Library 📅 2024 🏛 Wiley 🌐 English

Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration Introductory Statistics Using Python is a comprehensive guide to statistical analysis using Python, presenting important topics useful for data science such as prediction, co

📁 Statistics for Data Science and Analytics

✍ Peter C. Bruce, Peter Gedeck, Janet Dobbins 📂 Library 📅 2024 🏛 Wiley 🌐 English

Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, presenting important topics use

📁 Statistics and Data Analysis For Behavioral Sciences

✍ Dana S Dunn 📂 Library 📅 2000 🏛 McGraw-Hill Humanities/Social Sciences/Languages 🌐 English

Dana S. Dunn, author of The Practical Researcher: A Student Guide to Conducting Psychological Research, brings his twelve years of statistics teaching experience to life in the new Statistics and Data Analysis for the Behavioral Sciences. Dr. Dunn combines the quantitative aspects of statistics wit

📁 Statistics for Data Science and Policy Analysis

✍ Azizur Rahman (editor) 📂 Library 📅 2020 🏛 Springer 🌐 English

This book brings together the best contributions of the Applied Statistics and Policy Analysis Conference 2019. Written by leading international experts in the fields of statistics, data science, and policy evaluation, this book explores the theme of effective policy methods through the use o

📁 Statistics and Data Analysis for the Behavioral Sciences

✍ Dana S. Dunn, Suzanne Mannes 📂 Library 📅 2001 🏛 McGraw-Hill Companies 🌐 English

📁 Statistical Data Analysis for Ocean and Atmospheric Sciences

✍ H. Jean Thiebaux (Auth.) 📂 Library 📅 1994