Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling

✍ Scribed by Måns Thulin


Publisher: CRC Press
Year: 2024
Language: English
Pages: 492
Edition: 2
Category: Library


✦ Synopsis


The past decades have transformed the world of statistical data analysis, with new methods, new types of data, and new computational tools. Modern Statistics with R introduces you to key parts of this modern statistical toolkit. It teaches you:

• Data wrangling - importing, formatting, reshaping, merging, and filtering data in R.
• Exploratory data analysis - using visualisations and multivariate techniques to explore datasets.
• Statistical inference - modern methods for testing hypotheses and computing confidence intervals.
• Predictive modelling - regression models and machine learning methods for prediction, classification, and forecasting.
• Simulation - using simulation techniques for sample size computations and evaluations of statistical methods.
• Ethics in statistics - ethical issues and good statistical practice.
• R programming - writing code that is fast, readable, and (hopefully!) free from bugs.

No prior programming experience is necessary. Clear explanations and examples are provided to accommodate readers at all levels of familiarity with statistical principles and coding practices. A basic understanding of probability theory can enhance comprehension of certain concepts discussed within this book. In addition to plenty of examples, the book includes more than 200 exercises, with fully worked solutions available at: www.modernstatisticswithr.com.

✦ Table of Contents


Cover
Half Title
Title Page
Copyright Page
Dedication
Contents
List of Figures
1. Introduction
1.1. Welcome to R
1.2. About this book
2. The basics
2.1. Installing R and RStudio
2.2. A first look at RStudio
2.3. Running R code
2.3.1. R scripts
2.4. Variables and functions
2.4.1. Storing data
2.4.2. What's in a name?
2.4.3. Vectors and data frames
2.4.4. Functions
2.4.5. Mathematical operations
2.5. Packages
2.6. Descriptive statistics
2.6.1. Numerical data
2.6.2. Categorical data
2.7. Plotting numerical data
2.7.1. Our first plot
2.7.2. Colours, shapes and axis labels
2.7.3. Axis limits and scales
2.7.4. Comparing groups
2.7.5. Boxplots
2.7.6. Histograms
2.8. Plotting categorical data
2.8.1. Bar charts
2.9. Saving your plot
2.10. Data frames and data types
2.10.1. Types and structures
2.10.2. Types of tables
2.11. Vectors in data frames
2.11.1. Accessing vectors and elements
2.11.2. Adding and changing data using dollar signs
2.11.3. Filtering using conditions
2.12. Grouped summaries
2.13. Using |> pipes
2.13.1. Ceci n'est pas une pipe
2.13.2. Placeholders and with
2.14. Flavours of R: base and tidyverse
2.15. Importing data
2.15.1. Importing files through the RStudio menus
2.15.2. Importing csv files
2.15.3. File paths
2.15.4. Importing Excel files
2.16. Saving and exporting your data
2.16.1. Exporting data
2.16.2. Saving and loading R data
2.17. RStudio projects
2.18. Troubleshooting
3. The cornerstones of statistics
3.1. The three cultures
3.2. Frequencies, proportions, and cross-tables
3.2.1. Frequency tables
3.2.2. Publication-ready tables
3.2.3. Contingency tables
3.2.4. Three-way and four-way tables
3.3. Hypothesis testing and p-values
3.3.1. The lady tasting tea
3.3.2. How low does the p-value have to be?
3.3.3. Fisher's exact test
3.3.4. One- and two-sided hypotheses
3.3.5. The lady binging tea: power and how the sample size affects the analysis
3.3.6. Permutation tests
3.4. χ²-tests
3.4.1. When can we use χ²-tests?
3.5. Confidence intervals
3.5.1. Confidence intervals for proportions
3.5.2. Sample size calculations
3.6. Comparing mean values
3.6.1. The t-test for comparing two groups
3.6.2. One-sided hypotheses
3.6.3. The t-test for a single sample
3.6.4. The t-test for paired samples
3.6.5. When can we use the t-test?
3.6.6. Permutation t-tests
3.6.7. Bootstrap t-tests
3.6.8. Publication-ready tables for means
3.6.9. Sample size computations for the t-test
3.7. Multiple testing
3.7.1. Adjusting for multiplicity
3.7.2. Multivariate testing with Hotelling's T²
3.8. Correlations
3.8.1. Estimation
3.8.2. Hypothesis testing
3.9. Bayesian approaches
3.9.1. Inference for a proportion
3.9.2. Inference for means
3.10. Reporting statistical results
3.10.1. What should you include?
3.10.2. Citing R packages
3.11. Ethics and good statistical practice
3.11.1. Ethical guidelines
3.11.2. p-hacking and the file-drawer problem
3.11.3. Reproducibility
4. Exploratory data analysis and unsupervised learning
4.1. Reports with R Markdown
4.1.1. A first example
4.1.2. Formatting text
4.1.3. Lists, tables, and images
4.1.4. Code chunks
4.2. Customising ggplot2 plots
4.2.1. Modifying labels
4.2.2. Modifying axis scales
4.2.3. Using themes
4.2.4. Colour palettes
4.2.5. Theme settings
4.3. Exploring distributions
4.3.1. Density plots and frequency polygons
4.3.2. Asking questions
4.3.3. Violin plots
4.4. Combining multiple plots into a single graphic
4.5. Outliers and missing data
4.5.1. Detecting outliers
4.5.2. Labelling outliers
4.5.3. Missing data
4.5.4. Exploring data
4.6. Trends in scatterplots
4.7. Exploring time series
4.7.1. Annotations and reference lines
4.7.2. Longitudinal data
4.7.3. Path plots
4.7.4. Spaghetti plots
4.7.5. Seasonal plots and decompositions
4.7.6. Detecting changepoints
4.7.7. Interactive time series plots
4.8. Using polar coordinates
4.8.1. Visualising periodic data
4.8.2. Pie charts
4.9. Visualising multiple variables
4.9.1. Scatterplot matrices
4.9.2. 3D scatterplots
4.9.3. Correlograms
4.9.4. Adding more variables to scatterplots
4.9.5. Overplotting
4.9.6. Categorical data
4.9.7. Putting it all together
4.10. Sankey diagrams
4.11. Principal component analysis
4.11.1. Running a principal component analysis
4.11.2. Choosing the number of components
4.11.3. Plotting the results
4.12. Cluster analysis
4.12.1. Hierarchical clustering
4.12.2. Heatmaps and clustering variables
4.12.3. Centroid-based clustering
4.12.4. Fuzzy clustering
4.12.5. Model-based clustering
4.12.6. Comparing clusters
5. Dealing with messy data
5.1. Changing data types
5.2. Working with lists
5.2.1. Splitting vectors into lists
5.2.2. Collapsing lists into vectors
5.3. Working with numbers
5.3.1. Rounding numbers
5.3.2. Sums and means in data frames
5.3.3. Summaries of series of numbers
5.3.4. Scientific notation: 1e-03
5.3.5. Floating point arithmetics
5.4. Working with categorical data and factors
5.4.1. Creating factors
5.4.2. Changing factor levels
5.4.3. Changing the order of levels
5.4.4. Combining levels
5.5. Working with strings
5.5.1. Concatenating strings
5.5.2. Changing case
5.5.3. Finding patterns using regular expressions
5.5.4. Substitution
5.5.5. Splitting strings
5.5.6. Variable names
5.6. Working with dates and times
5.6.1. Date formats
5.6.2. Plotting with dates
5.7. Data manipulation with data.table, dplyr, and tidyr
5.7.1. data.table and tidyverse syntax basics
5.7.2. Modifying a variable
5.7.3. Computing a new variable based on existing variables
5.7.4. Renaming a variable
5.7.5. Removing a variable
5.7.6. Recoding factor levels
5.7.7. Grouped summaries
5.7.8. Filling in missing values
5.7.9. Chaining commands together
5.8. Filtering: select rows
5.8.1. Filtering using row numbers
5.8.2. Filtering using conditions
5.8.3. Selecting rows at random
5.8.4. Using regular expressions to select rows
5.9. Subsetting: select columns
5.9.1. Selecting a single column
5.9.2. Selecting multiple columns
5.9.3. Using regular expressions to select columns
5.9.4. Subsetting using column numbers
5.10. Sorting
5.10.1. Changing the column order
5.10.2. Changing the row order
5.11. Reshaping data
5.11.1. From long to wide
5.11.2. From wide to long
5.11.3. Splitting columns
5.11.4. Merging columns
5.12. Merging data from multiple tables
5.12.1. Binds
5.12.2. Merging tables using keys
5.12.3. Inner and outer joins
5.12.4. Semijoins and antijoins
5.13. Scraping data from websites
5.14. Other common tasks
5.14.1. Deleting variables
5.14.2. Importing data from other statistical packages
5.14.3. Importing data from databases
5.14.4. Importing data from JSON files
6. R programming
6.1. Functions
6.1.1. Creating functions
6.1.2. Local and global variables
6.1.3. Will your function work?
6.1.4. More on arguments
6.1.5. Namespaces
6.1.6. Sourcing other scripts
6.2. More on pipes
6.2.1. Ce ne sont pas non plus des pipes
6.2.2. Writing functions with pipes
6.3. Checking conditions
6.3.1. if and else
6.3.2. & and &&
6.3.3. ifelse
6.3.4. switch
6.3.5. Failing gracefully
6.4. Iteration using loops
6.4.1. for loops
6.4.2. Loops within loops
6.4.3. Keeping track of what's happening
6.4.4. Loops and lists
6.4.5. while loops
6.5. Iteration using vectorisation and functionals
6.5.1. A first example with apply
6.5.2. Variations on a theme
6.5.3. purrr
6.5.4. Specialised functions
6.5.5. Exploring data with functionals
6.5.6. Keep calm and carry on
6.5.7. Iterating over multiple variables
6.6. Measuring code performance
6.6.1. Timing functions
6.6.2. Measuring memory usage (and a note on compilation)
7. The role of simulation in modern statistics
7.1. Simulation and distributions
7.1.1. Generating random numbers
7.1.2. Some common distributions
7.1.3. Assessing distributional assumptions
7.1.4. Monte Carlo integration
7.2. Evaluating statistical methods using simulation
7.2.1. Comparing estimators
7.2.2. Type I error rate of hypothesis tests
7.2.3. Power of hypothesis tests
7.2.4. Power of some tests of location
7.2.5. Some advice on simulation studies
7.3. Sample size computations using simulation
7.3.1. Writing your own simulation
7.3.2. The Wilcoxon-Mann-Whitney test
7.4. Bootstrapping
7.4.1. A general approach
7.4.2. Bootstrap confidence intervals
7.4.3. Bootstrap hypothesis tests
7.4.4. The parametric bootstrap
8. Regression models
8.1. Linear models
8.1.1. Fitting linear models
8.1.2. Publication-ready summary tables
8.1.3. Dummy variables and interactions
8.1.4. Model diagnostics
8.1.5. Bootstrap and permutation tests
8.1.6. Transformations
8.1.7. Prediction
8.1.8. ANOVA
8.2. Linear models: advanced topics
8.2.1. Robust estimation
8.2.2. Interactions between numerical variables
8.2.3. Bootstrapping regression coefficients
8.2.4. Prediction intervals using the bootstrap
8.2.5. Prediction for multiple datasets
8.2.6. Alternative summaries with broom
8.2.7. Variable selection
8.2.8. Bayesian estimation of linear models
8.3. Modelling proportions: logistic regression
8.3.1. Generalised linear models
8.3.2. Fitting logistic regression models
8.3.3. Bootstrap confidence intervals
8.3.4. Model diagnostics
8.3.5. Prediction
8.4. Regression models for multicategory response variables
8.4.1. Ordinal response variables
8.4.2. Nominal response variables
8.5. Modelling count data
8.5.1. Poisson and negative binomial regression
8.5.2. Modelling rates
8.6. Bayesian estimation of generalised linear models
8.7. Missing data and multiple imputation
8.7.1. Multiple imputation
8.7.2. The effect of missing data
8.7.3. The effect of multiple imputation
8.8. Mixed models
8.8.1. Fitting a linear mixed model
8.8.2. Model diagnostics
8.8.3. Nested random effects and multilevel/hierarchical models
8.8.4. ANOVA with random effects
8.8.5. Generalised linear mixed models
8.8.6. Bayesian estimation of mixed models
8.9. Creating matched samples
8.9.1. Propensity score matching
8.9.2. Stepwise matching
8.10. Ethical issues in regression modelling
9. Survival analysis and censored data
9.1. The basics of survival analysis
9.1.1. Visualising survival
9.1.2. Testing for group differences
9.1.3. Hazard functions
9.2. Regression models
9.2.1. The Cox proportional hazards model
9.2.2. Repeated observation and frailty models
9.2.3. Accelerated failure time models
9.3. Competing risks
9.4. Recurrent events
9.5. Advanced topics
9.5.1. Multivariate survival analysis
9.5.2. Bayesian survival analysis
9.5.3. Power estimates for the logrank test
9.6. Left-censored data and nondetects
9.6.1. Estimation
9.6.2. Tests of means
9.6.3. Censored regression
10. Structural equation models, factor analysis, and mediation
10.1. Exploratory factor analysis
10.1.1. Running a factor analysis
10.1.2. Choosing the number of factors
10.1.3. Latent class analysis
10.2. Confirmatory factor analysis
10.2.1. Running a confirmatory factor analysis
10.2.2. Plotting path diagrams
10.2.3. Assessing model fit
10.3. Structural equation modelling
10.3.1. Fitting a SEM
10.3.2. Assessing and plotting the model
10.4. Mediation and moderation in regression models
10.4.1. Fitting a mediation model
10.4.2. Mediation with confounders
10.4.3. Moderation
10.4.4. Mediated moderation and moderated mediation
11. Predictive modelling and machine learning
11.1. Evaluating predictive models
11.1.1. Evaluating regression models
11.1.2. Test-training splits
11.1.3. Leave-one-out cross-validation and caret
11.1.4. k-fold cross-validation
11.1.5. Twinned observations
11.1.6. Bootstrapping
11.1.7. Evaluating classification models
11.1.8. Visualising decision boundaries
11.2. Ethical issues in predictive modelling
11.3. Challenges in predictive modelling
11.3.1. Handling class imbalance
11.3.2. Assessing variable importance
11.3.3. Extrapolation
11.3.4. Missing data and imputation
11.3.5. Endless waiting
11.3.6. Overfitting to the test set
11.4. Regularised regression models
11.4.1. Ridge regression
11.4.2. The lasso
11.4.3. Elastic net
11.4.4. Choosing the best model
11.4.5. Regularised mixed models
11.5. Machine learning models
11.5.1. Decision trees
11.5.2. Random forests
11.5.3. Boosted trees
11.5.4. Model trees
11.5.5. Discriminant analysis
11.5.6. Support vector machines
11.5.7. Nearest neighbours classifiers
11.6. Forecasting time series
11.6.1. Decomposition
11.6.2. Forecasting using ARIMA models
11.7. Deploying models
11.7.1. Creating APIs with plumber
11.7.2. Different types of output
12. Advanced topics
12.1. More on packages
12.1.1. Loading and auto-installing packages
12.1.2. Updating R and your packages
12.1.3. Alternative repositories
12.1.4. Removing packages
12.2. Speeding up computations with parallelisation
12.2.1. Parallelising for loops
12.2.2. Parallelising functionals
12.3. Linear algebra and matrices
12.3.1. Creating matrices
12.3.2. Sparse matrices
12.3.3. Matrix operations
12.4. Integration with other programming languages
12.4.1. Integration with C++
12.4.2. Integration with Python
12.4.3. Integration with Tensorflow and PyTorch
12.4.4. Integration with Spark
13. Debugging
13.1. Debugging
13.1.1. Find out where the error occurred with traceback
13.1.2. Interactive debugging of functions with debug
13.1.3. Investigate the environment with recover
13.2. Common error messages
13.2.1. +
13.2.2. could not find function
13.2.3. object not found
13.2.4. cannot open the connection and No such file or directory
13.2.5. invalid 'description' argument
13.2.6. missing value where TRUE/FALSE needed
13.2.7. unexpected '=' in ...
13.2.8. attempt to apply non-function
13.2.9. undefined columns selected
13.2.10. subscript out of bounds
13.2.11. Object of type 'closure' is not subsettable
13.2.12. $ operator is invalid for atomic vectors
13.2.13. (list) object cannot be coerced to type 'double'
13.2.14. arguments imply differing number of rows
13.2.15. non-numeric argument to a binary operator
13.2.16. non-numeric argument to mathematical function
13.2.17. cannot allocate vector of size ...
13.2.18. Error in plot.new() : figure margins too large
13.2.19. Error in .Call.graphics(C_palette2, .Call(C_palette2, NULL)) : invalid graphics state
13.2.20. Error in select(...) : unused argument (...)
13.3. Common warning messages
13.3.1. replacement has ... rows ...
13.3.2. the condition has length > 1 and only the first element will be used
13.3.3. number of items to replace is not a multiple of replacement length
13.3.4. longer object length is not a multiple of shorter object length
13.3.5. NAs introduced by coercion
13.3.6. package is not available (for R version x.x.x)
14. Mathematical appendix
14.1. Bootstrap confidence intervals
14.2. The equivalence between confidence intervals and hypothesis tests
14.3. Two types of p-values
14.4. Deviance tests
14.5. Regularised regression
References
Index


📜 SIMILAR VOLUMES



Data Wrangling with R: Load, explore, tr
✍ Gustavo R Santos 📂 Library 📅 2023 🏛 Packt Publishing 🌐 English

In this information era, where large volumes of data are being generated every day, companies want to get a better grip on it to perform more efficiently than before. This is where skillful data analysts and data scientists come into play, wrangling and exploring data to generate valuable business i

Data Mining and Exploration: From Tradit
✍ Chong Ho Alex Yu 📂 Library 📅 2022 🏛 CRC Press 🌐 English

This book introduces both conceptual and procedural aspects of cutting-edge data science methods, such as dynamic data visualization, artificial neural networks, ensemble methods, and text mining. There are at least two unique elements that can set the book apart from its rivals.

Learning Data Science: Data Wrangling, E
✍ Sam Lau, Joseph Gonzalez, Deborah Nolan 📂 Library 📅 2023 🏛 O'Reilly Media 🌐 English

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions – whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a mess
