Data Science for Infectious Disease Data Analytics: An Introduction with R

✍ Scribed by Lily Wang

Publisher: CRC Press/Chapman & Hall
Year: 2022
Tongue: English
Leaves: 420
Series: Chapman & Hall/CRC Data Science Series
Category: Library


✦ Synopsis

Data Science for Infectious Disease Data Analytics: An Introduction with R provides an overview of modern data science tools and methods that have been developed specifically to analyze infectious disease data. With a quick start guide to epidemiological data visualization and analysis in R, this book bridges the gap between academia and practice, providing many lively, instructive data analysis examples using the most up-to-date data, such as data on the newly discovered coronavirus disease (COVID-19).

The primary emphasis of this book is on the data science procedures in epidemiological studies, including data wrangling, visualization, interpretation, predictive modeling, and inference, which are of immense importance given the increasingly diverse and nonexperimental data across a wide range of fields. The knowledge and skills readers gain from this book are also transferable to other areas, such as public health, business analytics, environmental studies, or spatio-temporal data visualization and analysis in general.

Aimed at readers with an undergraduate knowledge of mathematics and statistics, this book is an ideal introduction to the development and implementation of data science in epidemiology.

Features

Describes the entire data science procedure, from how infectious disease data are collected and curated to how they are visualized and fed to predictive models, which facilitates effective communication between data sources, scientists, and decision-makers.
Explains practical concepts of infectious disease data and provides particular data science perspectives.
Provides an overview of the unique features and issues of infectious disease data and how they impact epidemic modeling and projection.
Introduces various classes of models and state-of-the-art learning methods to analyze infectious disease data, with valuable insights on how different models and methods could be connected.

✦ Table of Contents

Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
1. Introduction
1.1. Aims and Scope of This Book
1.2. The Structure of This Book
1.2.1. Infectious Disease Data
1.2.2. Basic Characteristics of the Infection Process
1.2.3. Data Visualization
1.2.4. Epidemic Modeling and Forecasting
2. Data Wrangling
2.1. An Introduction to R Packages “dplyr” and “tidyr”
2.2. Learning R Package “dplyr”
2.2.1. Tibbles
2.2.2. Importing Data
2.2.3. Common “dplyr” Functions
2.3. Selecting Columns and Filtering Rows
2.3.1. Subsetting Variables
2.3.2. Subsetting Observations
2.3.3. Pipes
2.3.4. Selecting Rows with Highest or Lowest Values of a Variable
2.3.5. Additional Features
2.4. Making New Variables with mutate()
2.5. Summarizing Data
2.6. Combining Datasets
2.6.1. The “Join” Family
2.6.2. Toy Examples with Joins
2.6.3. Practicing with Joins for Real Data
2.6.4. More on Combining Rows of Tables
2.7. Data Reshaping
2.7.1. From Wide to Long
2.7.2. From Long to Wide
2.8. Further Reading
2.9. Exercises
3. Data Visualization with R Package “ggplot2”
3.1. An Introduction
3.2. Types of Variables and Preparation
3.2.1. Types of Variables
3.2.2. Rules for Graph Designing
3.2.3. Installing Packages and Loading Data
3.2.4. A Simple Scatterplot
3.3. Position Scales and Axes
3.3.1. Changing the Label of the Axis
3.3.2. Changing the Range of the Axis
3.4. Color Scales and Size of geom_point()
3.4.1. Changing the Color of All Points
3.4.2. Coloring Observations by the Value of a Feature
3.4.3. Changing the Color Palette
3.4.4. Changing the Size by the Value of a Feature
3.4.5. An Example of a Row-labeled Dot Plot
3.5. Individual Geoms
3.5.1. Histograms
3.5.2. Bar Charts
3.5.3. The Default Bar Chart
3.5.4. Bar Charts with Assigned Values
3.5.5. Legends
3.5.6. Boxplots, Jittering and Violin Plots
3.6. Collective Geoms
3.6.1. Smoothers
3.7. Time Series
3.7.1. Basic Line Plots
3.7.2. Adding a Second Line
3.7.3. Adding Ribbons
3.7.4. Adjusting the Scale of the Time Axis
3.7.5. Adding Annotations
3.8. Maps
3.8.1. Making a Base Map
3.8.2. Customizing Choropleth Maps
3.8.3. Overlaying Polygon Maps
3.9. Other Useful Plots
3.9.1. Density and Conditional Density Plots
3.9.2. Adding Marginal Plots
3.10. Arranging Plots
3.10.1. Facets
3.10.2. Combining Plots Using R Package “patchwork”
3.11. Saving the Figure and Output
3.11.1. Saving in Figure Format
3.11.2. Saving in RDS Format
3.12. Further Reading
3.13. Exercises
4. Interactive Visualization
4.1. An Introduction
4.2. Creating Plotly Objects
4.2.1. Using plot_ly() to Create a Plotly Object
4.2.2. Using “dplyr” Verbs to Modify Data
4.2.3. Using ggplotly() to Create a Plotly Object
4.3. Scatterplots and Line Plots
4.3.1. Making a Scatterplot
4.3.2. Markers
4.3.3. A Single Time Series Plot
4.3.4. Hovering Text and Template
4.3.5. Multiple Time Series Plots
4.3.6. More Features About the Lines
4.3.7. Adding Ribbons
4.4. Pie Charts
4.4.1. Draw Static Pie Charts
4.4.2. Drawing Interactive Pie Charts
4.5. Animation
4.5.1. An Animation of the Evolution of Infected vs. Death Count
4.5.2. An Animation of the State-level Time Series Plot of Infected Count
4.6. Saving HTMLs
4.6.1. Saving as Standalone HTML Files
4.6.2. Saving as Non-self-contained HTML Files
4.7. Further Reading
4.8. Exercises
5. R Shiny
5.1. An Introduction to Shiny
5.1.1. Structure of a Shiny Application
5.1.2. Launching a Shiny Application
5.1.3. Creating the First Shiny Application
5.1.4. Creating a New Shiny Application in RStudio
5.1.5. Sharing the Shiny Application
5.2. Useful Input Widgets
5.3. Displaying Reactive Outputs
5.4. Rendering Plotly inside Shiny
5.5. Further Reading
6. Interactive Geospatial Visualization
6.1. An Introduction to Leaflet
6.1.1. Features and Installation
6.1.2. Basic Usage
6.2. The Data Object
6.2.1. Specifying Latitude/Longitude in Base R
6.2.2. Using R Package “sp”
6.2.3. Using R Package “maps”
6.3. Choropleth Maps
6.3.1. Creating a Base Map
6.3.2. Coloring the Map
6.3.3. Interactive Maps
6.4. Legends
6.4.1. Classification Schemes
6.4.2. Mapping Variables to Colors
6.5. Examples of County-level Maps
6.5.1. A County-level Map of COVID-19 Infection Risk
6.5.2. A County-level Map of COVID-19 Control Policy
6.6. Spot Maps
6.6.1. Adding Circles
6.6.2. Adding Popups
6.6.3. Adding Labels
6.7. Integrating Leaflet with R Shiny
6.8. Further Reading
6.9. Exercises
7. Epidemic Modeling
7.1. An Introduction to Epidemic Modeling
7.2. Mechanistic Models
7.2.1. Compartment Modeling
7.2.2. Agent-based Methods
7.3. Phenomenological Models
7.3.1. Time Series Analysis
7.3.2. Regression Methods
7.3.3. Machine Learning Methods
7.4. Hybrid Models and Ensemble Methods
7.5. Epidemic Modeling: Mathematical and Statistical Perspectives
7.6. Some Terms in Epidemic Modeling
7.7. Further Reading
8. Compartment Models
8.1. SIS Models
8.2. SIR Models
8.3. SIR Models with Births and Deaths
8.4. SEIR Models
8.5. Parameter Estimation for Compartment Models
8.5.1. Least-squares Method
8.5.2. Maximum Likelihood Method
8.6. Implementation of Parameter Estimation in R
8.6.1. An Application to Influenza-like Illness Data
8.6.2. An Application to COVID-19 Data
8.7. Basic and Effective Reproduction Number
8.7.1. Basic Reproduction Number
8.7.2. Effective Reproduction Number
8.7.3. Herd Immunity
8.8. Further Reading
8.9. Exercises
9. Time Series Analysis of Infectious Disease Data
9.1. Datasets and R Packages
9.1.1. Data
9.1.2. R Package “fable”
9.2. An Introduction to Time Series Analysis
9.2.1. Tsibble Objects
9.2.2. Working with Tsibble Objects
9.2.3. Drawing Time Series Plots
9.2.4. Objectives of Time Series Analysis
9.2.5. Stationarity
9.2.6. Autocovariance and Autocorrelation
9.3. Time Series Decomposition
9.3.1. Box-Cox Transformations
9.3.2. Methods for Estimating the Trend
9.3.3. Seasonal Component
9.3.4. Trend and Seasonality Estimation
9.4. Simple Time Series Forecasting Approaches
9.4.1. Average Method
9.4.2. Random Walk Forecasts
9.4.3. Seasonal Random Walk Forecasts
9.4.4. Random Walk with Drift Method
9.4.5. Displaying All Forecasting Results
9.4.6. Distributional Forecasts and Prediction Intervals
9.5. Residual Diagnostics and Accuracy Evaluation
9.5.1. Residual Diagnostics
9.5.2. Forecasting Accuracy
9.5.3. Selection of the Time Series
9.6. ARIMA Models
9.6.1. Differencing
9.6.2. ARMA Models
9.6.3. ARIMA Models
9.6.4. Seasonal ARIMA (SARIMA) Model
9.6.5. Building SARIMA Models
9.7. Model Comparison
9.7.1. Exponential Smoothing and ARIMA Models
9.7.2. Cross-validation for Time Series Analysis
9.8. Ensuring Forecasts Stay within Limits
9.8.1. Positive Forecasts
9.8.2. Forecasts Constrained to an Interval
9.9. Prediction and Prediction Intervals for Aggregates
9.10. Outliers and Anomalies
9.10.1. Empirical Rule
9.10.2. Boxplots
9.10.3. Outliers in Time Series
9.10.4. Tidy Anomaly Detection for Time Series with “anomalize”
9.10.5. A Discussion on Outlier and Anomalies Repair
9.11. Further Reading
9.12. Exercises
10. Regression Methods
10.1. Parametric Regression Methods
10.1.1. Linear Regression and Nonlinear Regression
10.1.2. Model Adequacy Checking
10.2. Nonparametric Regression Methods
10.2.1. Piecewise Constant Splines
10.2.2. Truncated Power Splines
10.2.3. B-splines and Natural Splines
10.2.4. Smoothing Splines
10.3. An Application to CDC FluView Portal Data
10.3.1. Trigonometric Regression
10.3.2. Smoothing Splines
10.4. Poisson Regression
10.4.1. Poisson Regression
10.4.2. Zero-inflated Poisson Regression
10.4.3. Count Time Series Analysis
10.5. Logistic Regression
10.5.1. Odds and Odds Ratios
10.5.2. Estimating Logistic Regression Coefficients
10.5.3. Logistic Regression with Multiple Explanatory Variables
10.6. Further Reading
10.7. Exercises
11. Neural Networks
11.1. A Single Neuron
11.2. Neural Network Structure
11.3. Neural Network Training
11.3.1. Forward Propagation
11.3.2. Backpropagation
11.4. Overfitting
11.5. Neural Network Auto-Regressive (NNAR) Models
11.6. COVID-19 Forecasting Using NNAR
11.7. Further Reading
11.8. Exercises
12. Hybrid Models
12.1. Ensembling Time Series Models
12.2. R Package “forecastHybrid”
12.2.1. Installation
12.2.2. An Introduction
12.2.3. Model Diagnostics
12.2.4. Forecasting
12.2.5. Performing Cross-Validation on a Time Series
12.2.6. Weights Selection Using Cross-Validation
12.3. R Package “opera”
12.3.1. Installation
12.3.2. An Introduction
12.4. Further Reading
12.5. Exercises
A. Appendix A
A.1. R Introduction and Preliminaries
A.1.1. The R Environment and Language
A.1.2. Obtaining R, RStudio and Installation
A.2. Starting RStudio
A.2.1. Source Pane
A.2.2. Console Pane
A.2.3. Error Messages
A.2.4. R Help
A.2.5. R Packages
A.2.6. Creating a Project and Setting a Working Directory
A.3. Exporting and Importing Data
A.3.1. Data Export
A.3.2. Data Import
A.3.3. The read.csv() Function
A.3.4. The “readr” Package
A.3.5. Importing an Excel File into R
A.3.6. Accessing Built-in Datasets
A.4. Control Structures in R
A.4.1. Grouped Expressions and Control Structures
A.4.2. Iterations
B. Appendix B
B.1. COVID-19 Data and Factors Integrated from Multiple Sources
B.1.1. Epidemic Data
B.1.2. Other Factors
B.1.3. Datasets
B.2. CDC FluView Portal Data
C. Appendix C
C.1. Classes: R Dates and Times
C.2. Formatting Date and Date/Time Variables
C.3. Creating Data/Time Objects in R
C.4. Parsing Date and Time
C.4.1. Date-time Conversion to and from Character Using Base R Functions
C.4.2. Parsing Date and Time Using “lubridate”
C.5. Setting and Extracting Information
C.5.1. Epidemiological Calendar
C.6. Merging Separate Date Information
C.7. Date Calculations in R
Bibliography
Index

📜 SIMILAR VOLUMES

Data Analytics with Hadoop: An Introduct

📁 Data Analytics with Hadoop: An Introduction for Data Scientists

✍ Benjamin Bengfort, Jenny Kim 📂 Library 📅 2016 🏛 O’Reilly Media 🌐 English

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of deployment, operations, or software development usually associated with distributed computing, you'll focus on particular analys

Data analytics with Hadoop: an introduct

📁 Data analytics with Hadoop: an introduction for data scientists

✍ Bengfort, Benjamin;Kim, Jenny 📂 Library 📅 2016 🏛 O'Reilly Media 🌐 English

The age of the data product -- An operating system for big data -- A framework for Python and Hadoop streaming -- In-memory computing with Spark -- Distributed analysis and patterns -- Data mining and warehousing -- Data ingestion -- Analytics with higher-level APIs -- Machine learning -- Summary :

Data Analytics with Hadoop: An Introduct

📁 Data Analytics with Hadoop: An Introduction for Data Scientists

✍ Bengfort, B.; Kim, J. 📂 Library 📅 2016 🏛 O’Reilly Media 🌐 English

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of deployment, operations, or software development usually associated with distributed computing, you’ll focus on particular ana

Quantitative Social Science Data with R:

📁 Quantitative Social Science Data with R: An Introduction

✍ Brian J Fogarty 📂 Library 📅 2019 🏛 SAGE Publications Ltd 🌐 English

"One of the few books that provide an accessible introduction to quantitative data analysis with R. A particular strength of the text is the focus on ′real world′ examples which help students to understand why they are learning these methods." - Dr Roxanne Connelly, University o

Introduction to Data Science Data Analys

📁 Introduction to Data Science Data Analysis and Prediction Algorithms with R

✍ Rafael A. Irizarry 📂 Library 📅 2019 🏛 CRC Press 🌐 English