Beginning Data Science in R: Data Analysis, Visualization, and Modelling for the Data Scientist
Scribed by Mailund, Thomas
- Publisher
- Apress
- Year
- 2017
- Tongue
- English
- Leaves
- 369
- Category
- Library
Synopsis
Discover best practices for data analysis and software development in R and start on the path to becoming a fully-fledged data scientist. This book teaches you techniques for both data manipulation and visualization and shows you the best way for developing new software packages for R.
Data Science in R details how data science is a combination of statistics, computational science, and machine learning. You'll see how to efficiently structure and mine data to extract useful patterns and build mathematical models. This requires computational methods and programming, and R is an ideal programming language for this.
This book is based on a number of lecture notes for classes the author has taught on data science and statistical programming using the R programming language. Modern data analysis requires computational skills and usually a minimum of programming.
What You Will Learn
- Perform data science and analytics using statistics and the R programming language
- Visualize and explore data, including working with large data sets found in big data
- Build an R package
- Test and check your code
- Practice version control
- Profile and optimize your code
Who This Book Is For
Those with some data science or analytics background, but not necessarily experience with the R programming language.
Table of Contents
Contents at a Glance......Page 4
Contents......Page 5
About the Author......Page 16
About the Technical Reviewer......Page 17
Acknowledgments......Page 18
Introduction......Page 19
Basic Interaction with R......Page 24
Simple Expressions......Page 26
Actually, All of the Above Are Vectors of Values…......Page 28
Indexing Vectors......Page 29
Vectorized Expressions......Page 30
Functions......Page 31
Getting Documentation for Functions......Page 32
Writing Your Own Functions......Page 33
A Quick Look at Control Structures......Page 35
Factors......Page 39
Data Frames......Page 41
Dealing with Missing Values......Page 43
Using R Packages......Page 44
Data Pipelines (or Pointless Programming)......Page 45
Writing Functions that Work with Pipelines......Page 46
The magical "." argument......Page 47
Defining Functions Using .......Page 48
Anonymous Functions......Page 49
Other Pipeline Operations......Page 50
Root Mean Square Error......Page 51
Chapter 2: Reproducible Analysis......Page 52
Creating an R Markdown/knitr Document in RStudio......Page 53
The YAML Language......Page 56
The Markdown Language......Page 57
Formatting Text......Page 58
Cross-Referencing......Page 61
Controlling the Output (Templates/Stylesheets)......Page 62
Running R Code in Markdown Documents......Page 63
Using Chunks when Analyzing Data (Without Compiling Documents)......Page 65
Displaying Data......Page 66
Add Caching......Page 67
Data Already in R......Page 68
Quickly Reviewing Data......Page 70
Reading Data......Page 71
Breast Cancer Dataset......Page 72
Boston Housing Dataset......Page 78
The readr Package......Page 79
Manipulating Data with dplyr......Page 81
select(): Pick Selected Columns and Get Rid of the Rest......Page 82
mutate():Add Computed Values to Your Data Frame......Page 84
arrange(): Reorder Your Data Frame by Sorting Columns......Page 85
filter(): Pick Selected Rows and Get Rid of the Rest......Page 86
summarise/summarize(): Calculate Summary Statistics......Page 87
Breast Cancer Data Manipulation......Page 88
Tidying Data with tidyr......Page 92
Exercises......Page 95
Using tidyr......Page 96
Basic Graphics......Page 97
The Grammar of Graphics and the ggplot2 Package......Page 105
Using qplot()......Page 106
Using Geometries......Page 110
Facets......Page 119
Scaling......Page 122
Themes and Other Graphics Transformations......Page 127
Figures with Multiple Plots......Page 131
Exercises......Page 133
Subsample Your Data Before You Analyze the Full Dataset......Page 134
Running Out of Memory During Analysis......Page 136
Too Large to Plot......Page 137
Too Slow to Analyze......Page 141
Too Large to Load......Page 142
Hex and 2D Density Plots......Page 145
Supervised Learning......Page 146
Regression versus Classification......Page 147
Inference versus Prediction......Page 148
Linear Regression......Page 149
Logistic Regression (Classification, Really)......Page 154
Model Matrices and Formula......Page 157
Evaluating Regression Models......Page 166
Evaluating Classification Models......Page 168
Confusion Matrix......Page 169
Accuracy......Page 170
Sensitivity and Specificity......Page 172
Other Measures......Page 173
Random Permutations of Your Data......Page 174
Cross-Validation......Page 178
Selecting Random Training and Testing Data......Page 180
Decision Trees......Page 182
Random Forests......Page 184
Neural Networks......Page 185
Naive Bayes......Page 186
Breast Cancer Classification......Page 187
Compare Classification Algorithms......Page 188
Principal Component Analysis......Page 189
Multidimensional Scaling......Page 197
Clustering......Page 201
k-Means Clustering......Page 202
Hierarchical Clustering......Page 208
Association Rules......Page 212
Project 1......Page 216
Importing Data......Page 217
Distribution of Quality Scores......Page 218
Is This Wine Red or White?......Page 219
Fitting Models......Page 223
Analyzing Your Own Dataset......Page 224
Arithmetic Expressions......Page 225
Boolean Expressions......Page 226
The Numeric Type......Page 227
The Logical Type......Page 228
Vectors......Page 229
Matrix......Page 230
Lists......Page 232
Indexing......Page 233
Named Values......Page 235
Selection Statements......Page 236
Loops......Page 238
A Word of Warning About Looping......Page 239
Functions......Page 240
Named Arguments......Page 241
Return Values......Page 242
Lazy Evaluation......Page 243
Scoping......Page 244
Recursive Functions......Page 247
Linear Time Merge......Page 249
More Sorting......Page 250
Selecting the k Smallest Element......Page 251
Working with Vectors and Vectorizing Functions......Page 252
Vectorizing Functions......Page 254
The apply Family......Page 256
apply......Page 257
lapply......Page 259
sapply and vapply......Page 260
Infix Operators......Page 261
Replacement Functions......Page 262
How Mutable Is Data Anyway?......Page 264
Anonymous Functions......Page 265
Functions Returning Functions (and Closures)......Page 266
Filter, Map, and Reduce......Page 267
Function Operations: Functions as Input and Output......Page 269
Ellipsis Parameters......Page 272
Factorial Again......Page 274
Function Composition......Page 275
Data Structures......Page 276
Example: Bayesian Linear Model Fitting......Page 277
Classes......Page 278
Polymorphic Functions......Page 280
Defining Your Own Polymorphic Functions......Page 281
Specialization as Interface......Page 282
Specialization in Implementations......Page 283
Polynomials......Page 286
Package Names......Page 287
.Rbuildignore......Page 288
Version......Page 289
URL and BugReports......Page 290
Using an Imported Package......Page 291
NAMESPACE......Page 292
Documenting Functions......Page 293
Import and Export......Page 294
File Load Order......Page 295
Adding Data to Your Package......Page 296
Building an R Package......Page 297
Exercises......Page 298
Unit Testing......Page 299
Automating Testing......Page 300
Using testthat......Page 301
Writing Good Tests......Page 302
Testing Random Results......Page 303
Exercise......Page 304
Version Control and Repositories......Page 305
Installing git......Page 306
Making Changes to Files, Staging Files, and Committing Changes......Page 307
Bare Repositories and Cloning Repositories......Page 309
Pushing Local Changes and Fetching and Pulling Remote Changes......Page 310
Working with Branches......Page 312
GitHub......Page 315
Moving an Existing Repository to GitHub......Page 317
Pull Requests......Page 318
Exercises......Page 319
Profiling......Page 320
A Graph-Flow Algorithm......Page 321
Speeding Up Your Code......Page 332
Parallel Execution......Page 334
Switching to C++......Page 337
Project 2......Page 339
Bayesian Linear Regression......Page 340
Sample from a Multivariate Normal Distribution......Page 341
Computing the Posterior Distribution......Page 343
Predicting Target Variables for New Predictor Values......Page 345
Formulas and Their Model Matrix......Page 347
Working with Model Matrices in R......Page 348
Model Matrices Without Response Variables......Page 351
Predicting New Targets......Page 352
Constructor......Page 353
Updating Distributions: An Example Interface......Page 354
coefficients......Page 357
print......Page 358
Organization of Source Files......Page 359
Adding README and NEWS Files to Your Package......Page 360
Conclusions......Page 361
R Programming......Page 362
Acknowledgements......Page 363
Index......Page 364
SIMILAR VOLUMES
Summary: Presenting best practices for data analysis and software development in R, this comprehensive book teaches you techniques for both data manipulation and visualization and shows you the best way for developing new software packages for R.
Discover best practices for data analysis and software development in R and start on the path to becoming a fully-fledged data scientist. Updated for the R 4.0 release, this book teaches you techniques for both data manipulation and visualization and shows you the best way for developing new s…