This book introduces the reader to data science using R and the tidyverse. No prior knowledge of college-level programming or mathematics (e.g., calculus or statistics) is required. The book is self-contained so readers can immediately begin building data science workflows without needing to re
Introduction to Data Science in Biostatistics: Using R, the Tidyverse Ecosystem, and APIs
Scribed by Thomas W. MacFarland
- Publisher
- Springer
- Year
- 2024
- Tongue
- English
- Leaves
- 536
- Category
- Library
Synopsis
Introduction to Data Science in Biostatistics: Using R, the Tidyverse Ecosystem, and APIs defines and explores the term "data science" and discusses the many professional skills and competencies affiliated with the industry. With data science being a leading indicator of interest in STEM fields, the text also investigates this ongoing growth of demand in these spaces, with the goal of providing readers who are entering the professional world with foundational knowledge of required skills, job trends, and salary expectations. The text provides a historical overview of computing and the field's progression to R as it exists today, including the multitude of packages and functions associated with both Base R and the tidyverse ecosystem. Readers will learn how to use R to work with real data, as well as how to communicate results to external stakeholders. A distinguishing feature of this text is its emphasis on the emerging use of APIs to obtain data.
Table of Contents
Foreword
Preface
Acknowledgments
Contents
Chapter 1: Emergence of Data Science as a Critical Discipline in Biostatistics
Definition and History of Data Science
The State of Data Science and the Need for Data Scientists
Definition of Data
Emergence of Data as a Valued Problem-Solving Input
Emergence of Data Science as a Highly Valued Occupation and Career Paths
Biostatistics: Definition and Applications Allowing for Frequent Overlap
Agriculture
Biology
Epidemiology
Health Science
Academic Growth of Data Science Programs of Study in the Biological Sciences, Based on Classification of Instructional Programs (CIP) Codes Related to Biostatistics
CIP Series 01: Agricultural, Animal, Plant, Veterinary Science and Related Fields
CIP Series 26: Biological and Biomedical Sciences
CIP Series 27: Mathematics and Statistics
CIP Series 30: Multi-Interdisciplinary Studies
CIP Series 44: Public Administration and Social Service
CIP Series 51: Health Professions and Related Programs
Jobs and Job Requirements for a Data Scientist
Job Opportunities and Salaries in Data Science
Computing and Data Science
Pre-ENIAC (1946)
Mainframe Computing (1950s Onward)
Personal Computing (1980s Onward)
Widespread Acceptance of the Internet (1970s Onward) and the World Wide Web (1989 Onward)
Movement to Cloud Computing (2006 Onward)
Data Types Supported by R
Boolean (e.g., Logical) Data Expressing Comparisons and Order of Operations
Numeric Data
Decimal or Real Numeric Data
Integer Numeric Data
String or Character Data
Time and Dates
Missing Data
Data Structures Used in R
Dataframe (and tibble as a Special Type of Dataframe)
Factors
List
Matrix
Vector
Addendum 1: Syntax Used to Generate Six-Digit Classification of Instructional Programs (CIP) Completions
Addendum 2: National and State Data for OCC-Identified Jobs Associated with Data Science and Biostatistics
External Data and/or Data Resources Used in This Lesson
Chapter 2: Data Sources in Biostatistics
Personal Data Sources
Local Data Sources
State Data Sources
National Data Sources
United States Census Bureau
United States Centers for Disease Control and Prevention
United States Department of Agriculture
United States Department of Education
United States Department of Labor
United States Environmental Protection Agency
United States National Science Foundation
International Data Sources
European Centre for Disease Prevention and Control
The Organization for Economic Co-operation and Development
Our World in Data
United Nations Food and Agriculture Organization
World Bank
World Health Organization
Proprietary and Other Resources
Google Cloud Platform Datasets for COVID-19 Research
New York Times COVID-19 Data at GitHub
Addendum 1: Our World in Data
Addendum 2: United States Department of Labor, Bureau of Labor Statistics
External Data and/or Data Resources Used in This Lesson
Chapter 3: Role of Statistics for Decision-Making in Biostatistics
Ten-Point Process When Using R for Statistical Analysis
Identify Problems That Benefit from Statistical Analysis
Identify Potential Data Resources
Obtain the Data
Identify and Organize the Data and All Relevant Variables
Outline Potential Approach(es) for Analyses and Consider Alternate Approaches
Put Plans into Action, with Frequent Checks for Quality Assurance
Individual Review of All Outcomes
External Review of Outcomes Whenever Possible
Report at an Appropriate Level for the Intended Audience
Debrief to Establish Processes for Future Improvements
General Approach When Using R for Statistical Analysis
Exploratory Graphics
Exploratory Descriptive Statistics and Measures of Central Tendency
Exploratory Analyses
Addendum: Use Inferential Statistics and R Syntax to Address Differences in Percentage Deaths from COVID-19 by the Urban v Rural Continuum
External Data and/or Data Resources Used in This Lesson
Chapter 4: Data Science and R, Base R, and the tidyverse Ecosystem
Workflow for Reproducible, Efficient, and Accurate Analyses and Presentations
Base R
The tidyverse Ecosystem
The tidyverse Ecosystem as an Idea and the Need for Tidy Data
The Core tidyverse Ecosystem as a Set of Tools in R Packages for Data Science
Auxiliary Packages Outside of the Core tidyverse Ecosystem
Addendum 1: Complex Data Set on Birth Rates Easily Accommodated by Using the tidyverse Ecosystem
Addendum 2: Complex Data Set on Gross Domestic Product (GDP) and Comparison to Birth Rates by Using the tidyverse Ecosystem
Addendum 3: Individual Initiative of Planned Workflow, Analyses, and Graphical Presentations
Addendum 4: Essential tidyverse Ecosystem Functions That Every Data Scientist Should Master
External Data and/or Data Resources Used in This Lesson
Chapter 5: Statistical Analyses and Graphical Presentations in Biostatistics Using Base R and the tidyverse Ecosystem
Overview of Using R for Statistical Analysis
Background
Import Data
Code Book and Data Organization
Exploratory Graphics
Exploratory Descriptive Statistics and Measures of Central Tendency
Exploratory Analyses
Presentation of Outcomes
Examples of Leading Statistical Tests, Including All Syntax and Presentation of Screen Outcomes and Graphics
Nonparametric Tests
Parametric Tests
Addendum 1: A Parametric Approach to Statistical Analyses and Graphical Presentations for Data on Rates of Births and Rates of Deaths
Background
Description of the Data
Null Hypothesis
Import Data
Code Book and Data Organization
Exploratory Graphics
Graphics Using Base R
Graphics Using the tidyverse Ecosystem
Exploratory Descriptive Statistics and Measures of Central Tendency
Exploratory Analyses
Null Hypotheses
Prepare a National Dataset from the Four Regional Datasets
Presentation of Outcomes
Addendum 2: A Nonparametric Approach to Statistical Analyses and Graphical Presentations for Data on Rates of Births and Rates of Deaths
Addendum 3: Data Wrangling, and Then Statistical Analyses and Mapping
Background
Import the Data
Code Book and Data Organization
Exploratory Graphics
Exploratory Descriptive Statistics and Measures of Central Tendency
Exploratory Analyses
Presentation of Outcomes
Addendum 4: Prediction
Background
Code Book
Import the Data
Graphics (e.g., Figures and/or Maps)
Figures - Numeric-Type Object Variables
Figures - Factor-Type Object Variables (Include NAs Since They Are Relevant for These Figures)
Exploratory Descriptive Statistics and Measures of Central Tendency
Frequency Distributions of Factor-Type Variables
Descriptive Statistics and Measures of Central Tendency for Numeric Variables
Exploratory Analyses
Binary Logistic Regression
Probability Plot as a Value-Added Activity
External Data and/or Data Resources Used in This Lesson
Chapter 6: Use of R-Based APIs (Application Programming Interface) to Obtain Data
Emergence of APIs as a Data Resource
APIs and Reproducible Syntax
APIs and the Need for a Key
Structure of an API to Automate Data Retrieval
Structure of Data Returned by an API
Data in Returned Format
Data After Organization and Manipulation with Tidyverse Tools
Common API Resources in Biostatistics, Government and Proprietary
Addendum 1: Use of the tidyUSDA::getQuickstat() API
Addendum 2: Use an API to Obtain Multiple Files, Wrangle the Data, Merge Files, Review Absolute and Percentage Change Over Time
Obtain Data on Iowa Corn Prices, 1867 Onward
Obtain Data on Iowa Corn Acreage, 1926 Onward
Wrangle the Data into a Singular Dataset
Addendum 3: Use of Known URLs as a Proxy API (Application Programming Interface)
Addendum 4: API-Based Data in JavaScript Object Notation (JSON) Format
External Data and/or Data Resources Used in This Lesson
Chapter 7: Putting It All Together - R, the tidyverse Ecosystem, and APIs
Obtain Data from an API
Make the Data Tidy
Statistical Tests - Base R and tidyverse Ecosystem Functions
Beautiful Graphics
Grouped Data
Interval and Real Numeric Data
Beautiful Maps
Bar Plot
Mosaic Plot (Fig. 7.9)
Waffle Plot (e.g., Square Pie Chart)
Beanplot
Beeswarm Plot (Fig. 7.12)
Boxplot (Fig. 7.13)
Density Plot (Fig. 7.14)
Dotplot (Fig. 7.15)
Histogram (Fig. 7.16)
Line Chart - Static (Fig. 7.17)
Line Chart - Multiple
Pirate Plot
Quantile-Quantile (QQ) Plot (Fig. 7.20)
Scatter Plot
Scatter Plot Matrix
Violin Plot
International
National
State
County
Sub-county
R Markdown and LaTeX Demonstrations of a Summary Memorandum of Findings
R Markdown
LaTeX
Concluding Comments and Next Steps
Technical Skills of a Data Scientist
Soft Skills of a Data Scientist
Future Employment Opportunities
Contact the Author
External Data and/or Data Resources Used in This Lesson
Index
SIMILAR VOLUMES
Develop insights from data with tidy tools. Import, wrangle, visualize, and model data with the Tidyverse R packages. This book is intended for data scientists with some familiarity with the R programming language who are seeking to do Data Science using the Tidyverse family of packages. Through
Today, data science is an indispensable tool for any organization, allowing for the analysis and optimization of decisions and strategy. R has become the preferred software for data science, thanks to its open source nature, simplicity, applicability to data analysis, and the abundance of libraries
Through real-world datasets, this book shows the reader how to work with material in biostatistics using the open source software R. These include tools that are critical to dealing with missing data, which is a pressing scientific issue for those engaged in biostatistics. Readers will be equi
This book covers some introductory steps in biostatistics using the R programming language. Biostatistics is the branch of statistics that applies statistical methods to medical and biological problems. Biostatistics has become more important recently for studying the great amount of data that is
R is the most powerful tool you can use for statistical analysis. This definitive guide smooths R's steep learning curve with practical solutions and real-world applications for commercial environments. In R in Action, Third Edition you will learn how to: • Set up and install R and RStudio • Cl