Bobbio Scriptorium
LIBER

Applied Predictive Modeling || Data Pre-processing

โœ Scribed by Kuhn, Max; Johnson, Kjell


Book ID: 120344017
Publisher: Springer New York
Year: 2013
Weight: 971 KB
Category: Article
ISBN: 1461468493


Synopsis


Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data. Although this text is primarily concerned with modeling techniques, data preparation can make or break a model's predictive ability. Different models have different sensitivities to the type of predictors in the model; how the predictors enter the model is also important. Transformations of the data to reduce the impact of data skewness or outliers can lead to significant improvements in performance. Feature extraction, discussed in Sect. 3.3, is one empirical technique for creating surrogate variables that are combinations of multiple predictors. Additionally, simpler strategies such as removing predictors based on their lack of information content can also be effective.
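As a rough illustration of how a transformation can reduce skewness, the sketch below compares a skewness statistic before and after a log transform. The data values, the `skewness` helper, and the choice of a log transform are assumptions made for illustration; the synopsis does not say which transformation the chapter recommends.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed predictor (illustrative values, not from the book).
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 20.0, 100.0]

# A log transform is one common choice for strictly positive, right-skewed data.
logged = [math.log(v) for v in raw]

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

For this made-up sample, the statistic is noticeably smaller after the transform, meaning the long right tail pulls less on the distribution.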

The need for data pre-processing is determined by the type of model being used. Some procedures, such as tree-based models, are notably insensitive to the characteristics of the predictor data. Others, like linear regression, are not. In this chapter, a wide array of possible methodologies is discussed. For modeling techniques described in subsequent chapters, we will also discuss which, if any, pre-processing techniques can be useful.

This chapter outlines approaches to unsupervised data processing: the outcome variable is not considered by the pre-processing techniques. In other chapters, supervised methods, where the outcome is utilized to pre-process the data, are also discussed. For example, partial least squares (PLS) models are essentially supervised versions of principal component analysis (PCA). We also describe strategies for removing predictors without considering how those variables might be related to the outcome. Chapter 19 discusses techniques for finding subsets of predictors that optimize the ability of the model to predict the response.
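A minimal sketch of an unsupervised filter in this sense: it drops predictors that carry no information (a single distinct value) and never looks at the outcome. The function name `drop_constant_predictors` and the sample data are hypothetical, and this is only one simple rule, not necessarily the procedure the chapter describes.

```python
def drop_constant_predictors(rows, names):
    """Keep only predictor columns with more than one distinct value.

    The outcome variable is not an argument, so the filter is unsupervised.
    """
    keep = [j for j in range(len(names)) if len({row[j] for row in rows}) > 1]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows

# Hypothetical predictor matrix: column "b" is constant across all samples.
names = ["a", "b", "c"]
rows = [
    [1, 0, 3],
    [2, 0, 3],
    [3, 0, 4],
]

kept_names, kept_rows = drop_constant_predictors(rows, names)
print(kept_names)
print(kept_rows)
```

Because the rule depends only on the predictor values, it can be applied to the training set before any outcome-aware modeling step.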

How the predictors are encoded, called feature engineering, can have a significant impact on model performance. For example, using combinations of predictors can sometimes be more effective than using the individual values: the ratio of two predictors may be more effective than using two independent


SIMILAR VOLUMES


Data Matching || Data Pre-Processing
โœ Christen, Peter ๐Ÿ“‚ Article ๐Ÿ“… 2012 ๐Ÿ› Springer Berlin Heidelberg ๐ŸŒ German โš– 451 KB

Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database. Based on research in various domains inc