Data Mining Applications with R
Scribed by Zhao, Yanchang
- Publisher: Academic Press
- Year: 2013
- Language: English
- Pages: 493
- Category: Library
Synopsis
Data Mining Applications with R is a great resource for researchers and professionals to understand the wide use of R, a free software environment for statistical computing and graphics, in solving different problems in industry. R is widely used in leveraging data mining techniques across many different industries, including government, finance, insurance, medicine, scientific research and more. This book presents 15 different real-world case studies illustrating various techniques in rapidly growing areas. It is an ideal companion for data mining researchers in academia and industry looking for ways to turn this versatile software into a powerful analytic tool.
R code, data, and color figures for the book are provided at the RDataMining.com website.
Table of Contents
Front Cover
Data Mining Applications with R
Copyright
Contents
Objectives and Significance
Target Audience
Acknowledgments
Review Committee
Additional Reviewers
Foreword
References
1.1. Introduction
1.2. A Brief Overview of the Power Grid
1.3. Introduction to MapReduce, Hadoop, and RHIPE
1.3.1.1. An Example: The Iris Data
1.3.2. Hadoop
1.3.3. RHIPE: R with Hadoop
1.3.3.2. Iris MapReduce Example with RHIPE
1.3.3.2.1. The Map Expression
1.3.3.2.3. Running the Job
1.3.3.2.4. Looking at Results
1.3.4. Other Parallel R Packages
1.4. Power Grid Analytical Approach
1.4.1. Data Preparation
1.4.2.1. 5-min Summaries
1.4.2.2. Quantile Plots of Frequency
1.4.2.3. Tabulating Frequency by Flag
1.4.2.4. Distribution of Repeated Values
1.4.2.5. White Noise
1.4.3. Event Extraction
1.4.3.1. OOS Frequency Events
1.4.3.2. Finding Generator Trip Features
1.4.3.3. Creating Overlapping Frequency Data
1.5. Discussion and Conclusions
Appendix
References
2.1. Introduction
2.2. Related Works
2.3. Motivations and Requirements
2.3.1. R Packages Requirements
2.4. Probabilistic Framework of NB Classifiers
2.4.1. Choosing the Model
2.4.1.2. Multinomial Model
2.4.1.3. Poisson Model
2.4.2. Estimating the Parameters
2.5. Two-Dimensional Visualization System
2.5.1. Design Choices
2.5.2. Visualization Design
2.6.1. Description of the Dataset
2.6.2. Creating Document-Term Matrices
2.6.3. Loading Existing Term-Document Matrices
2.6.4.1. Comparing Models
2.7. Conclusions
References
3.1. Introduction
3.2. How Many Messages and How Many Twitter-Users in the Sample?
3.3. Who Is Writing All These Twitter Messages?
3.4. Who Are the Influential Twitter-Users in This Sample?
3.5. What Is the Community Structure of These Twitter-Users?
3.6. What Were Twitter-Users Writing About During the Meeting?
3.7. What Do the Twitter Messages Reveal About the Opinions of Their Authors?
3.8. What Can Be Discovered in the Less Frequently Used Words in the Sample?
3.9. What Are the Topics That Can Be Algorithmically Discovered in This Sample?
3.10. Conclusion
References
4.1. Introduction
4.2. Dataset Preparation
4.3.1. The Document-Term Matrix
4.3.2. Term Frequency-Inverse Document Frequency
4.3.3. Exploring the Document-Term Matrix
4.4.1. The Latent Dirichlet Allocation
4.4.2. Learning the Various Distributions for LDA
4.4.3. Using the Log-Likelihood for Model Validation
4.4.4. Topics Representation
4.4.5. Plotting the Topics Associations
4.5.1. Computing Similarities Between Documents
4.6.1. Constructing the Network as a Graph
4.6.2. Author Importance Using Centrality Measures
References
5.3. Evaluation
5.4. Collaborative Filtering Methods
5.5. Latent Factor Collaborative Filtering
5.6. Simplified Approach
5.7. Roll Your Own
5.8. Final Thoughts
References
6.1. Introduction/Background
6.2. Business Problem
6.3. Proposed Response Model
6.4.2. Data Preprocessing
6.4.2.2. Data Normalization
6.4.3.1. Target Variable Construction
6.4.3.2. Predictor Variables
6.4.3.3. Interaction Variables
6.4.4. Feature Selection
6.4.4.1. F-Score
6.4.4.2. Step1: Selection of Interaction Features Using F-Score
6.4.4.3. Step2: Selection of Features Using F-Score
6.4.4.4. Step3: Selection of Best Subset of Features Using Random Forest
6.4.5. Data Sampling for Training and Test
6.4.6. Class Balancing
6.4.7. Classifier (SVM)
6.5. Prediction Result
6.6. Model Evaluation
6.7. Conclusion
References
7.1. Introduction
7.2. Data Description and Initial Exploratory Data Analysis
7.2.1. Variable Correlations and Logistic Regression Analysis
7.3.1. Overview of Model Building and Validating
7.3.2. Review of Four Classifier Methods
7.3.3. RP Model
7.3.4. Bagging Ensemble
7.3.5. Support Vector Machine
7.3.6. LR Classification
7.3.7. Comparison of Four Classifier Models: ROC and AUC
7.3.8. Model Comparison: Recall-Precision, Accuracy-v-Cut-off, and Computation Times
7.4. Discussion of Results and Conclusion
Appendix A. Details of the Full Data Set Variables
Appendix B. Customer Profile Data-Frequency of Binary Values
Appendix C. Proportion of Caravan Insurance Holders vis-à-vis other Customer Profile Variables
Appendix D. LR Model Details
Appendix F. Commands for Cross-Validation Analysis of Classifier Models
References
8.1. Introduction
8.3. Data Extraction
8.4.1. Null Value Detection
8.4.2. Outlier Detection
8.5.1. Relevance Analysis
8.5.2. Data Set Balancing
8.5.3. Feature Selection
8.6. Modeling
8.8. Finding and Model Deployment
Appendix. Selecting Best Features for Predicting Bank Loan Default
References
9.1. Introduction
9.2.1. Aggregation Functions
9.2.2. Choquet Integral
9.2.3. Fuzzy Measure Representation
9.2.4. Shapley Value and Interaction Index
9.3.1. Installation
9.3.2. Toolbox Description
9.3.3. Preference Analysis Example
9.4.1. Traveler Preference Study and Hotel Management
9.4.2. Data Collection and Experiment Design
9.4.3. Model Evaluation
9.4.4. Result Analysis
9.4.4.1. Preference Profile Construction
9.4.4.2. Interaction Behavior Analysis
9.4.5. Discussion
9.5. Conclusions
References
10.2. Housing Prices and Indices
10.3. A Data Mining Approach
10.3.1. Data Capture
10.3.2. Geocoding
10.3.3. Price Evolution
10.4. Real Estate Pricing Models
10.4.1. Model 1: Hedonic Model Plus Smooth Term
10.4.2. Model 2: GWR Plus a Smooth Term
10.4.3. Relationship to Other Work
References
11.1. Introduction
11.2. Study Region and Data Processing
11.2.2. Data Processing of Seabed Hardness
11.2.3. Predictors
11.3. Dataset Manipulation and Exploratory Analyses
11.3.2. Exploratory Data Analyses
11.4. Application of RF for Predicting Seabed Hardness
11.5. Model Validation Using rfcv
11.6. Optimal Predictive Model
11.7. Application of the Optimal Predictive Model
11.8.1. Selection of Relevant Predictors and the Consequences of Missing the Most Important Predictors
11.8.2. Issues with Searching for the Most Accurate Predictive Model Using RF
11.8.3. Predictive Accuracy of RF and Prediction Maps of Seabed Hardness
11.8.4. Limitations
Appendix BA. R Function, rf.cv, Shows the Cross-Validated Prediction Performance of a Predictive Model
References
12.1. Background
12.2. Challenges
12.3. Data Extraction and Exploration
12.4. Data Preprocessing
12.5. Modeling
12.6. Model Evaluation
12.7. Model Deployment
12.8. Lessons, Discussion, and Conclusions
Acknowledgments
References
13.1. Introduction
13.2. Problem Definition
13.4. Data Exploration and Preprocessing
13.5. Visualizations
13.6. Modeling
13.7. Model Evaluation
13.8. Discussions and Improvements
References
14.1. Introduction to the Case Study and Organization of the Analysis
14.2. Background of the Analysis: The Italian Football Championship
14.3.1. Data Extraction
14.3.2. Data Exploration
14.4.1. Variable Importance Evaluation
14.4.2. Composite Indicators Construction
14.4.2.1. PCA for the Home Team
14.4.2.2. PCA for the Away Team
14.5. Model Development: Building Classifiers
14.5.1. Learning Step
14.5.1.1. Random Forest
14.5.1.2. Neural Network
14.5.1.4. Naïve Bayesian Classification
14.5.1.5. Multinomial Logistic Regression Model
14.5.2. Model Selection
14.5.3. Model Refinement
14.6. Model Deployment
14.7. Concluding Remarks
References
15.1. Introduction
15.2. Data Extraction from PCAP to CSV File
15.3. Data Importation from CSV File to R
15.4. Dimension Reduction Via PCA
15.5. Initial Data Exploration Via Graphs
15.6. Variables Scaling and Samples Selection
15.7. Clustering for Segmenting the FQDN
15.8. Building Routing Table Thanks to Clustering
15.9. Building Routing Table Thanks to Mixed Integer Linear Programming
15.10. Building Routing Table Via a Heuristic
15.11. Final Evaluation
15.12. Conclusion
References
Index
SIMILAR VOLUMES
Develop key skills and techniques with R to create and customize data mining algorithms. About This Book: Develop a sound strategy for solving predictive modeling problems using the most popular data mining algorithms. Gain understanding of the major methods of pre
"The versatile capabilities and large set of add-on packages make R an excellent alternative to many existing and often expensive data mining tools. Exploring this area from the perspective of a practitioner, Data mining with R: learning with case studies uses practical examples to illustrate the po