Introduction to Data Science

✍ Scribed by Gaoyan Ou, Zhanxing Zhu, Bin Dong


Publisher
World Scientific Publishing
Year
2023
Tongue
English
Leaves
445
Category
Library


✦ Synopsis


Data Science is an emerging discipline that emphasizes cultivating Big Data talent with interdisciplinary ability. This book systematically introduces the basic theoretical content of Data Science, including data preprocessing, basic methods of data analysis, the handling of special problems (such as text analysis), Deep Learning, and distributed systems, with the aim of comprehensively presenting the models and algorithms of the field from a technical point of view.

In addition, the book provides a large number of case studies for practical data analysis. Students can carry out hands-on training and interact with data on the iData-Course platform.

Artificial Intelligence (AI) has become a field with many practical applications and active research, attracting wide attention from both academia and industry. Through AI systems, we aim to automatically handle many types of tasks that would otherwise require human effort, such as understanding natural language, speech, images, and video, or assisting medical diagnosis. Deep Learning, developed over the past decade, provides a promising solution for these tasks. An important property of Deep Learning is its ability to learn complex features or representations of data by building a hierarchical network.
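As a toy illustration (plain Python, not code from the book), the hierarchical idea can be sketched as follows: each layer transforms the previous layer's output, so later layers operate on increasingly abstract features. The weights here are fixed by hand purely for demonstration; in practice they are learned by backpropagation (Section 14.1.5).

```python
def relu(v):
    """Elementwise ReLU non-linearity."""
    return [max(0.0, x) for x in v]

def dense(x, W, b):
    """One fully connected layer: y = W*x + b."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# A tiny 2-layer network: raw input -> hidden features -> output score.
W1 = [[0.5, -0.2], [0.3, 0.8]]   # layer 1: 2 inputs -> 2 hidden units
b1 = [0.1, 0.0]
W2 = [[1.0, -1.0]]               # layer 2: 2 hidden units -> 1 output
b2 = [0.0]

x = [1.0, 2.0]                   # raw input features
h = relu(dense(x, W1, b1))       # layer 1: intermediate representation
y = dense(h, W2, b2)             # layer 2: output built on top of layer 1
print(h, y)
```

Stacking more such layers is what makes the network "deep": each added layer composes a new representation out of the one below it.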

In Spark, data analysts process data using two types of operations defined on RDDs (Resilient Distributed Datasets): transformations and actions. Compared with MapReduce, Spark's main advantage is faster data processing. For example, when researchers trained logistic regression models on the same training set with both Hadoop and Spark, Spark was more than 100 times faster; the main reason is that Spark shares data through memory, while Hadoop shares it through disk. Spark is not a substitute for Hadoop; the two are complementary. When performing data processing and analysis, analysts can choose either MapReduce or Spark. The main difference is that Spark outperforms MapReduce on certain iterative and interactive tasks.
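The transformation/action distinction can be mimicked in a few lines of plain Python. `ToyRDD` below is a hypothetical stand-in, not Spark's actual API (real programs build RDDs via `pyspark`, e.g. `SparkContext.parallelize`), but it shows the key behavior: transformations only describe a pipeline, and computation happens when an action is invoked.

```python
class ToyRDD:
    """A minimal in-memory stand-in for a Spark RDD (illustration only:
    no partitioning, no fault tolerance, no cluster execution)."""

    def __init__(self, data):
        self._data = data  # an iterable; evaluation is deferred

    # --- transformations: return a new ToyRDD, nothing is computed yet ---
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    # --- actions: force evaluation and return a concrete result ---
    def collect(self):
        return list(self._data)

    def count(self):
        return sum(1 for _ in self._data)

nums = ToyRDD(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()   # only now does the pipeline run
print(result)
```

Because the chained generators are consumed lazily, no intermediate list is ever materialized, which loosely mirrors how Spark keeps intermediate results in memory rather than writing them to disk between stages.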

✦ Table of Contents


Contents
Preface
About the Authors
About the Translators
1. Introduction
1.1. The Fundamental Contents of Data Science
1.1.1. Core Problem of Data Analysis
1.1.2. Mathematical Structure of Data
1.1.3. Major Difficulties in Data Analysis
1.1.4. The Importance of Algorithms
1.2. Impact on the Development of Discipline
1.2.1. The Impact on Traditional Disciplines
1.2.2. The Birth of a New Discipline: Computational Advertising
1.3. Impact on Scientific Research
1.4. The Curricula of Data Science
1.5. Contents
2. Data Preprocessing
2.1. Feature Encoding
2.1.1. Numeric Encoding
2.1.2. One-Hot Encoding
2.2. Missing Value Processing
2.2.1. Deletion Method
2.2.2. Mean Imputation
2.2.3. Stochastic Imputation
2.2.4. Model-based Imputation
2.2.5. Other Missing Values Imputation Methods
2.3. Data Standardization
2.3.1. Z-score Standardization
2.3.2. Min-Max Standardization
2.3.3. Decimal Scaling Standardization
2.3.4. Logistic Standardization
2.3.5. Comparison of Standardization Methods
2.4. Data Discretization
2.4.1. Equal-width Discretization
2.4.2. Equal-frequency Discretization
2.4.3. Clustering Based Discretization
2.4.4. Information Gain Discretization
2.4.5. Chi-squared Discretization
2.4.6. Class-attribute Interdependence Maximization
2.4.7. Summary
2.5. Outliers
2.5.1. Statistics-Based Method
2.5.2. Nearest-neighbors-based Method
2.5.3. Summary
2.6. Other Preprocessing Methods
2.7. Case Studies and Exercises
3. Regression Model
3.1. Linear Regression
3.1.1. Simple Linear Regression
3.1.2. Multiple Linear Regression
3.1.3. Summary
3.2. Linear Regression Regularization
3.2.1. Ridge and LASSO
3.2.2. Other Regularized Linear Regression Models
3.3. Nonlinear Regression
3.3.1. Spline Regression
3.3.2. Radial Basis Function Network
3.4. Case Studies and Exercises
4. Classification Model
4.1. Logistic Regression
4.1.1. From Linear Regression to Logistic Regression
4.1.2. Parameter Estimation
4.1.3. Summary
4.2. K-Nearest Neighbor
4.2.1. Choosing the Value of k
4.2.2. Improving Prediction Speed
4.2.3. Summary
4.3. Decision Tree
4.3.1. Decision Tree Generation
4.3.2. Decision Tree Algorithms
4.3.3. Pruning of Decision Trees
4.3.4. Analysis of Decision Trees
4.4. Naive Bayes
4.4.1. Bayes Theorem
4.4.2. Naive Bayes Model
4.4.3. Parameter Estimation
4.4.4. Algorithm Analysis
4.5. Support Vector Machine (SVM)
4.5.1. Margin and Support Vector
4.5.2. Dual Problem and SMO Algorithm
4.5.3. Soft Margin SVM
4.5.4. Kernel Functions and Kernel Methods
4.5.5. Advantages and Disadvantages of Support Vector Machine
4.6. Case Studies and Exercises
5. Ensemble Method
5.1. Overview of Ensemble Method
5.1.1. Bagging
5.1.2. Boosting
5.1.3. Stacking
5.2. Random Forest
5.2.1. Random Forest Algorithm
5.2.2. Performance Evaluation and Feature Evaluation
5.2.3. Algorithm Analysis
5.3. AdaBoost
5.3.1. The Process of AdaBoost Algorithm
5.3.2. Error Analysis of AdaBoost
5.3.3. Objective Function of AdaBoost
5.3.4. Summary
5.4. Case Study: Personal Credit Risk Assessment
5.4.1. Background
5.4.2. Model Building
5.4.3. Evaluation
5.4.4. Summary
5.5. Case Studies and Exercises
6. Clustering Model
6.1. K-means Clustering
6.1.1. K-means Clustering Model
6.1.2. Choice of K
6.1.3. Choice of the Centroids
6.1.4. Variants of K-means
6.2. Hierarchical Clustering
6.2.1. Agglomerative Clustering
6.2.2. Divisive Clustering
6.3. Spectral Clustering
6.4. Density-Based Clustering
6.5. Summary
6.6. Case Studies and Exercises
7. Association Rule Mining
7.1. Association Rule
7.2. Apriori Algorithm
7.2.1. Apriori Property
7.2.2. Apriori Algorithm
7.2.3. Case Study of Apriori Algorithm
7.2.4. Association Rules Generation
7.2.5. Summary
7.3. FP-Growth Algorithm
7.3.1. Build FP-tree
7.3.2. FP-Growth Algorithm
7.3.3. Association Rule Generation
7.3.4. Summary
7.4. Case Studies and Exercises
8. Dimensionality Reduction
8.1. Principal Component Analysis
8.1.1. PCA Algorithm
8.1.2. Summary
8.2. Linear Discriminant Analysis
8.2.1. The Optimization Objective of LDA
8.2.2. The Solution of LDA
8.2.3. Summary
8.3. Multi-dimensional Scaling
8.3.1. The Optimization Objective of MDS
8.3.2. The Solution of MDS
8.3.3. Practical Example
8.3.4. Summary
8.4. Locally Linear Embedding
8.4.1. LLE Algorithm
8.4.2. Locally Linear Reconstruction
8.4.3. Low-Dimensional Representation
8.4.4. Summary
8.5. Other Dimensionality Reduction Methods
8.6. Case Studies and Exercises
9. Feature Selection
9.1. General Process of Feature Selection
9.2. Feature Selection Method
9.2.1. Filtering Method
9.2.2. Wrapper Method
9.2.3. Embedding Method
9.3. Unsupervised Feature Selection
9.4. Summary
9.5. Case Studies and Exercises
10. EM Algorithm
10.1. EM Algorithm
10.2. Application of EM: Gaussian Mixture Model
10.3. Summary
10.4. Case Studies and Exercises
11. Probabilistic Graphical Model
11.1. Overview of Probabilistic Graphical Model
11.1.1. Directed Graphical Model
11.1.2. Undirected Graphical Model
11.2. Hidden Markov Model
11.2.1. The Estimation Problem: Forward and Backward Algorithm
11.2.2. The Decoding Problem: Viterbi Algorithm
11.2.3. The Learning Problem: Baum–Welch Algorithm
11.2.4. Some Extensions of HMM
11.3. Conditional Random Field
11.3.1. Linear Chain CRF and its General Form
11.3.2. Feature Engineering
11.3.3. Parameter Estimation of CRF
11.3.4. Inference of CRF
11.4. Summary
11.5. Case Studies and Exercises
12. Text Analysis
12.1. Text Representation
12.1.1. Vector Space Model
12.1.2. Text Dimensionality Reduction
12.2. Topic Model
12.2.1. LDA Model
12.2.2. Parameter Estimation
12.2.3. Summary of the Topic Model
12.3. Sentiment Analysis
12.3.1. Sentiment Classification
12.3.2. Aspect-based Sentiment Analysis
12.3.3. Summary
13. Graph and Network Analysis
13.1. Basic Concepts
13.1.1. Basic Definition
13.1.2. Commonly Used Graphs
13.2. Geometric Property
13.2.1. Centrality
13.2.2. Clustering Coefficient
13.2.3. Modularity
13.3. Link Analysis
13.3.1. PageRank
13.3.2. Topic-sensitive PageRank
13.3.3. HITS Algorithm
13.4. Community Discovery
13.4.1. Algorithms Based on Hierarchical Clustering
13.4.2. Algorithm Based on Modularity Optimization
13.5. Knowledge Graph
13.5.1. Data Model of Knowledge Graph
13.5.2. Knowledge Graph Data Management Methods
13.5.3. Research on Knowledge Graph in Different Disciplines
13.6. Case Studies and Exercises
14. Deep Learning
14.1. Multi-Layer Perceptron
14.1.1. Activation Function
14.1.2. Network Structure Design
14.1.3. Output Layer
14.1.4. Loss Function
14.1.5. Backpropagation Algorithm
14.2. Optimization of Deep Learning Model
14.2.1. Momentum Method
14.2.2. Nesterov Momentum Method
14.2.3. Optimization of Adaptive Learning Rate
14.2.4. Batch Normalization
14.2.5. Summary
14.3. Convolutional Neural Networks
14.3.1. Convolution
14.3.2. Pooling
14.3.3. Common Convolutional Neural Networks Structures
14.4. Recurrent Neural Network
14.4.1. Computational Graph of RNN
14.4.2. Network Structure of RNN
14.4.3. Gradient Computation of RNN
14.4.4. Long Short-term Memory Networks
14.5. Summary
15. Distributed Computing
15.1. Hadoop: Distributed Storage and Processing
15.1.1. HDFS: Distributed Data Storage
15.1.2. MapReduce: Distributed Data Processing
15.2. MapReduce Implementation of Common Models
15.2.1. Statistical Query Model
15.2.2. MapReduce Implementation of Linear Regression
15.2.3. MapReduce Implementation of Support Vector Machine
15.2.4. MapReduce Implementation of K-means
15.2.5. MapReduce Implementation of PageRank
15.2.6. Summary
15.3. Spark: Distributed Data Analysis
15.3.1. Resilient Distributed Dataset
15.3.2. Execution Process of Spark Program
15.3.3. Spark vs. Hadoop
15.4. Other Distributed Systems
Appendix
A. Matrix Operation
A.1. Basic Concepts
A.1.1. Transpose of Matrix
A.1.2. Inverse of Matrix
A.1.3. Rank of Matrix
A.1.4. Trace of Matrix
A.1.5. Vector Norm and Matrix Norm
A.1.6. Positive Definiteness of the Matrix
A.2. Matrix Derivation
A.3. Matrix Decomposition
A.3.1. Eigenvalue Decomposition
A.3.2. Singular Value Decomposition
B. Probability Basis
B.1. Basic Concepts
B.2. Common Probability Distribution
B.2.1. Gaussian Distribution
B.2.2. Uniform Distribution
B.2.3. Bernoulli Distribution
B.2.4. Binomial Distribution
B.2.5. Multinomial Distribution
B.2.6. Beta Distribution
B.2.7. Dirichlet Distribution
C. Optimization Algorithm
C.1. Basic Concepts
C.1.1. Convex Function
C.1.2. Jensen’s Inequality
C.2. Gradient Descent
C.3. Lagrangian Multiplier Method
D. Distance
D.1. Euclidean Distance
D.2. Manhattan Distance
D.3. Mahalanobis Distance
D.4. Hamming Distance
D.5. Cosine Similarity
D.6. Pearson Correlation Coefficient
D.7. Jaccard Similarity
D.8. KL Divergence
E. Model Evaluation
E.1. Basic Concepts
E.1.1. Independent and Identically Distributed Data
E.1.2. Bias and Variance
E.1.3. Hyper-parameter and Parameter Tuning
E.1.4. Overfitting and Underfitting
E.2. Dataset Splitting
E.2.1. Hold-out
E.2.2. Cross Validation
E.2.3. Bootstrapping
E.3. Model Evaluation Metrics
E.3.1. Regression Metrics
E.3.2. Classification Metrics
E.3.3. Clustering Metrics
E.3.4. Rand Index
References


📜 SIMILAR VOLUMES


Introduction to Data Science
✍ Laura Igual & Santi Seguí 📂 Library 🏛 Springer International Publishing, Cham 🌐 English
Introduction to Data Science
✍ Gaoyan Ou, Zhanxing Zhu, Bin Dong 📂 Library 📅 2023 🏛 WSPC/HEP 🌐 English

The book systematically introduces the basic contents of data science, including data preprocessing and basic methods of data analysis, handling special problems (e.g. text analysis), deep learning, and distributed systems…

Introduction to Data Science
✍ Stanton J. 📂 Library 🌐 English

Portions, 2013. – 183 p. – ISBN: N/A. This book is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license. You are free to copy, distribute, and transmit this work. You are free to add or adapt the work. You must attribute the work to the auth…

Introduction to Environmental Data Science
✍ Jerry D. Davis 📂 Library 📅 2023 🏛 CRC Press/Chapman & Hall 🌐 English

Introduction to Environmental Data Science focuses on data science methods in the R language applied to environmental research, with sections on exploratory data analysis in R including data abstraction, transformation, and visualization; spatial data analysis in vector and raster models; statistics…

An Introduction to Data Science
✍ Jeffrey S. Saltz, Jeffrey M. Stanton 📂 Library 📅 2017 🏛 SAGE Publications, Inc 🌐 English

An Introduction to Data Science by Jeffrey S. Saltz and Jeffrey M. Stanton is an easy-to-read, gentle introduction for people with a wide range of backgrounds into the world of data science. Needing no prior codin…