Introducing the IBM SPSS Modeler, this book guides readers through data mining processes and presents relevant statistical methods. There is a special focus on step-by-step tutorials and well-documented examples that help demystify complex mathematical algorithms and computer programs. The variety o
Data mining with SPSS Modeler : theory, exercises and solutions
Scribed by Tilo Wendler; Sören Gröttrup
- Year: 2021
- Tongue: English
- Leaves: 1285
- Edition: Second
- Category: Library
Table of Contents
Preface to the Second Edition
Preface to the First Edition
Contents
1: Introduction
1.1 The Concept of the SPSS Modeler
1.2 Structure and Features of This Book
1.2.1 Prerequisites for Using This Book
1.2.2 Structure of the Book and the Exercise/Solution Concept
1.2.3 Using the Data and Streams Provided with the Book
1.2.4 Datasets Provided with This Book
1.2.5 Template Concept of This Book
1.3 Introducing the Modeling Process
1.3.1 Exercises
1.3.2 Solutions
References
2: Basic Functions of the SPSS Modeler
2.1 Defining Streams and Scrolling Through a Dataset
2.2 Switching Between Different Streams
2.3 Defining or Modifying Value Labels
2.4 Adding Comments to a Stream
2.5 Exercises
2.6 Solutions
2.7 Data Handling Methods
2.7.1 Theory
2.7.2 Calculations
2.7.3 String Functions
2.7.4 Extracting/Selecting Records
2.7.5 Filtering Data
2.7.6 Data Standardization: Z-Transformation
2.7.7 Partitioning Datasets
2.7.8 Sampling Methods
2.7.9 Merge Datasets
2.7.10 Append Datasets
2.7.11 Exercises
2.7.12 Solutions
References
3: Univariate Statistics
3.1 Theory
3.1.1 Discrete Versus Continuous Variables
3.1.2 Scales of Measurement
3.1.3 Exercises
3.1.4 Solutions
3.2 Simple Data Examination Tasks
3.2.1 Theory
3.2.2 Frequency Distribution of Discrete Variables
3.2.3 Frequency Distribution of Continuous Variables
3.2.4 Distribution Analysis with the Data Audit Node
3.2.5 Concept of "SuperNodes" and Transforming a Variable to Normality
3.2.6 Reclassifying Values
3.2.7 Binning Continuous Data
3.2.8 Exercises
3.2.9 Solutions
References
4: Multivariate Statistics
4.1 Theory
4.2 Scatterplot
4.3 Scatterplot Matrix
4.4 Correlation
4.5 Correlation Matrix
4.6 Exclusion of Spurious Correlations
4.7 Contingency Tables
4.8 Exercises
4.9 Solutions
References
5: Regression Models
5.1 Introduction to Regression Models
5.1.1 Motivating Examples
5.1.2 Concept of the Modeling Process and Cross-Validation
5.2 Simple Linear Regression
5.2.1 Theory
5.2.2 Building the Stream in SPSS Modeler
5.2.3 Identification and Interpretation of the Model Parameters
5.2.4 Assessment of the Goodness of Fit
5.2.5 Predicting Unknown Values
5.2.6 Exercises
5.2.7 Solutions
5.3 Multiple Linear Regression
5.3.1 Theory
5.3.2 Building the Model in SPSS Modeler
5.3.3 Final MLR Model and Its Goodness of Fit
5.3.4 Prediction of Unknown Values
5.3.5 Cross-Validation of the Model
5.3.6 Boosting and Bagging (for Regression Models)
5.3.7 Exercises
5.3.8 Solutions
5.4 Generalized Linear (Mixed) Model
5.4.1 Theory
5.4.2 Building a Model with the GLMM Node
5.4.3 The Model Nugget
5.4.4 Cross-Validation and Fitting a Quadric Regression Model
5.4.5 Exercises
5.4.6 Solutions
5.5 The Auto Numeric Node
5.5.1 Building a Stream with the Auto Numeric Node
5.5.2 The Auto Numeric Model Nugget
5.5.3 Exercises
5.5.4 Solutions
References
6: Factor Analysis
6.1 Motivating Example
6.2 General Theory of Factor Analysis
6.3 Principal Component Analysis
6.3.1 Theory
6.3.2 Building a Model in SPSS Modeler
6.3.3 Exercises
6.3.4 Solutions
6.4 Principal Factor Analysis
6.4.1 Theory
6.4.2 Building a Model
6.4.3 Feature Selection vs. Feature Reduction
6.4.4 Exercises
6.4.5 Solutions
References
7: Cluster Analysis
7.1 Motivating Examples
7.2 General Theory of Cluster Analysis
7.2.1 Exercises
7.2.2 Solutions
7.3 TwoStep Hierarchical Agglomerative Clustering
7.3.1 Theory of Hierarchical Clustering
7.3.2 Characteristics of the TwoStep Algorithm
7.3.3 Building a Model in SPSS Modeler
7.3.4 Exercises
7.3.5 Solutions
7.4 K-Means Partitioning Clustering
7.4.1 Theory
7.4.2 Building a Model in SPSS Modeler
7.4.3 Exercises
7.4.4 Solutions
7.5 Auto Clustering
7.5.1 Motivation and Implementation of the Auto Cluster Node
7.5.2 Building a Model in SPSS Modeler
7.5.3 Exercises
7.5.4 Solutions
7.6 Summary
References
8: Classification Models
8.1 Motivating Examples
8.2 General Theory of Classification Models
8.2.1 Process of Training and Using a Classification Model
8.2.2 Classification Algorithms
8.2.3 Classification Versus Clustering
8.2.4 Decision Boundary and the Problem with Over- and Underfitting
8.2.5 Performance Measures of Classification Models
8.2.6 The Analysis Node
8.2.7 The Evaluation Node
8.2.8 A Detailed Example how to Create a ROC Curve
8.2.9 Exercises
8.2.10 Solutions
8.3 Logistic Regression
8.3.1 Theory
8.3.2 Building the Model in SPSS Modeler
8.3.3 Optional: Model Types and Variable Interactions
8.3.4 Final Model and Its Goodness of Fit
8.3.5 Classification of Unknown Values
8.3.6 Cross-Validation of the Model
8.3.7 Exercises
8.3.8 Solutions
8.4 Linear Discriminate Classification
8.4.1 Theory
8.4.2 Building the Model with SPSS Modeler
8.4.3 The Model Nugget and the Estimated Model Parameters
8.4.4 Exercises
8.4.5 Solutions
8.5 Support Vector Machine
8.5.1 Theory
8.5.2 Building the Model with SPSS Modeler
8.5.3 The Model Nugget
8.5.4 Exercises
8.5.5 Solutions
8.6 Neuronal Networks
8.6.1 Theory
8.6.2 Building a Network with SPSS Modeler
8.6.3 The Model Nugget
8.6.4 Exercises
8.6.5 Solutions
8.7 K-Nearest Neighbor
8.7.1 Theory
8.7.2 Building the Model with SPSS Modeler
8.7.3 The Model Nugget
8.7.4 Dimensional Reduction with PCA for Data Preprocessing
8.7.5 Exercises
8.7.6 Solutions
8.8 Decision Trees
8.8.1 Theory
8.8.2 Building a Decision Tree with the C5.0 Node
8.8.3 The Model Nugget
8.8.4 Building a Decision Tree with the CHAID Node
8.8.5 Exercises
8.8.6 Solutions
8.9 The Auto Classifier Node
8.9.1 Building a Stream with the Auto Classifier Node
8.9.2 The Auto Classifier Model Nugget
8.9.3 Exercises
8.9.4 Solutions
References
9: Using R with the Modeler
9.1 Advantages of R with the Modeler
9.2 Connecting with R
9.3 Test the SPSS Modeler Connection to R
9.4 Calculating New Variables in R
9.5 Model Building in R
9.6 Modifying the Data Structure in R
9.7 Solutions
References
10: Imbalanced Data and Resampling Techniques
10.1 Characteristics of Imbalanced Datasets and Consequences
10.2 Resampling Techniques
10.2.1 Random Oversampling Examples (ROSE)
10.2.2 Synthetic Minority Oversampling Technique (SMOTE)
10.2.3 Adaptive Synthetic Sampling Method (abbr. ADASYN)
10.3 Implementation in SPSS Modeler
10.4 Using R to Implement Balancing Methods
10.4.1 SMOTE-Approach Using R
10.4.2 ROSE-Approach Using R
10.5 Exercises
10.5.1 Exercise 1: Recap Imbalanced Data
10.5.2 Exercise 2: Resampling Application to Identify Cancer
10.5.3 Exercise 3: Comparing Resampling Algorithms
10.6 Solutions
10.6.1 Exercise 1: Recap Imbalanced Data
10.6.2 Exercise 2: Resampling Application to Identify Cancer
10.6.3 Exercise 3: Comparing Resampling Algorithms
References
11: Case Study: Fault Detection in Semiconductor Manufacturing Process
11.1 Case Study Background
11.2 The Standard Process in Data Mining
11.2.1 Business Understanding (CRISP-DM Step 1)
11.2.2 Data Understanding (CRISP-DM Step 2)
11.2.3 Data Preparation (CRISP-DM Step 3)
11.2.3.1 Merging Data (CRISP-DM Step 3.1)
11.2.3.2 Separating Training and Test Data (CRISP-DM Step 3.2)
11.2.3.3 Reducing Dimensionality of Data by Feature Removal (CRISP-DM Step 3.3)
11.2.3.3.1 Deleting Features or Records
11.2.3.3.2 Identifying Correlated Features
11.2.3.4 Outlier Identification and Treatment (CRISP-DM Step 3.4)
11.2.3.5 Impute Missing Values (CRISP-DM Step 3.5)
11.2.3.5.1 Reasons for Missing Values and Implications for the Data Analysis Process
11.2.3.5.2 Imputation Methods
11.2.3.5.3 Implementing Imputation Methods
11.2.3.6 Calculating New Features (CRISP-DM Step 3.6)
11.2.3.7 Identification of Important Features by Using Feature Selection (CRISP-DM Step 3.7)
11.2.4 Modeling (CRISP-DM Step 4)
11.2.4.1 Balancing (CRISP-DM Step 4.1)
11.2.4.2 Feature Scaling and Model Building (CRISP-DM Step 4.2)
11.2.5 Evaluation and Deployment of Model (CRISP-DM Step 5 and 6)
11.3 Lessons Learned
11.4 Exercises
11.5 Solutions
References
12: Appendix
12.1 Data Sets Used in This Book
12.1.1 adult_income_data.txt
12.1.2 bank_full.csv
12.1.3 beer.sav
12.1.4 benchmark.xlsx
12.1.5 car_simple.sav
12.1.6 car_sales_modified.sav
12.1.7 chess_endgame_data.txt
12.1.8 credit_card_sampling_data.sav
12.1.9 customer_bank_data.csv
12.1.10 diabetes_data_reduced.sav
12.1.11 DRUG1n.csv
12.1.12 EEG_Sleep_Signals.csv
12.1.13 employee_dataset_001 and employee_dataset_002
12.1.14 England Payment Datasets
12.1.15 Features_eeg_signals.csv
12.1.16 gene_expression_leukemia_all.csv
12.1.17 gene_expression_leukemia_short.csv
12.1.18 gravity_constant_data.csv
12.1.19 hacide_train.SAV and hacide_test.SAV
12.1.20 Housing.data.txt
12.1.21 income_vs_purchase.sav
12.1.22 Iris.csv
12.1.23 IT-projects.txt
12.1.24 IT user satisfaction.sav
12.1.25 longley.csv
12.1.26 LPGA2009.csv
12.1.27 Mtcars.csv
12.1.28 nutrition_habites.sav
12.1.29 optdigits_training.txt, optdigits_test.txt
12.1.30 Orthodont.csv
12.1.31 Ozone.csv
12.1.32 pisa2012_math_q45.sav
12.1.33 sales_list.sav
12.1.34 secom.sav
12.1.35 ships.csv
12.1.36 test_scores.sav
12.1.37 Titanic.xlsx
12.1.38 tree_credit.sav
12.1.39 wine_data.txt
12.1.40 WisconsinBreastCancerData.csv and wisconsin_breast_cancer_data.sav
12.1.41 z_pm_customer1.sav
References
SIMILAR VOLUMES
Springer, 2016. 1068 p. ISBN: 9783319287072
Introducing the IBM SPSS Modeler, this book guides readers through data mining processes and presents relevant statistical methods. There is a special focus on step-by-step tutorials and well-documented examples that help demys
This book contains 296 exercises and solutions covering a wide variety of topics in linear model theory, including generalized inverses, estimability, best linear unbiased estimation and prediction, ANOVA, confidence intervals, simultaneous confidence intervals, hypothesis testing, and varianc
This textbook presents a unified and rigorous approach to best linear unbiased estimation and prediction of parameters and random quantities in linear models, as well as other theory upon which much of the statistical methodology associated with linear models is based. The single most u
Official instructor's manual for "Principles and Theory for Data Mining and Machine Learning" (2010), obtained directly from Springer.com. The book is the holy book of the mathematical underpinnings of machine learning; you might struggle at the beginning, but it certainly pays off. Enjo