Clustering for Data Mining: A Data Recovery Approach
✍ Written by Boris Mirkin
- Publisher: Chapman and Hall/CRC
- Year: 2005
- Language: English
- Pages: 278
- Series: Computer Science and Data Analysis
- Edition: 1
- Category: Library
Free of charge, no registration required. For personal study only.
✦ Synopsis
This book gives a smooth, motivated and example-rich introduction to clustering, and it is innovative in many respects. It provides answers to important questions that are rarely addressed, if addressed at all. For example: (a) what to do if the user has no idea of the number of clusters and/or their location (use what is called intelligent K-Means); (b) what to do if the data contain both numeric and categorical features (use what is called the three-step standardization procedure); (c) how to catch anomalous patterns; and (d) how to validate clusters. Some of these answers may be open to criticism, but motivation is always supplied, and the results are always reproducible and thus testable.

The book introduces a number of non-conventional cluster interpretation aids derived from the data-geometry view adopted by the author and based on what are referred to as contribution weights, which essentially show the elements of cluster structures that distinguish clusters from the rest. These contribution weights, applied to categorical data, turn out to be highly compatible with what statisticians such as A. Quetelet and K. Pearson were developing over the past couple of centuries, which is a highly original and welcome development. The book also reviews a rich set of approaches accumulated in such hot areas as text mining and bioinformatics, and shows that clustering is not just a set of naive methods for data processing but an evolving area of data science.

I adopted the book as a text for my courses in data mining at bachelor's and master's level.
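The "intelligent K-Means" mentioned in (a) determines the number of clusters by repeatedly extracting Anomalous Pattern clusters, starting each pattern from the point farthest from the grand mean. The sketch below is an illustrative reconstruction of that idea, not the book's code; the function names and the `min_size` noise threshold are my own assumptions.

```python
import numpy as np

def anomalous_pattern(X, center):
    """Extract one Anomalous Pattern cluster: seed at the point farthest
    from the fixed reference point (the grand mean), then alternate
    assignment/update between the pattern centroid and that reference."""
    far = X[np.argmax(((X - center) ** 2).sum(axis=1))]
    while True:
        # assign each point to the nearer of: pattern centroid or grand mean
        in_pattern = ((X - far) ** 2).sum(axis=1) < ((X - center) ** 2).sum(axis=1)
        if not in_pattern.any():
            return in_pattern
        new = X[in_pattern].mean(axis=0)
        if np.allclose(new, far):
            return in_pattern
        far = new

def ik_means_centroids(X, min_size=2):
    """Peel off Anomalous Pattern clusters one by one; patterns smaller
    than min_size are treated as noise. The surviving centroids give both
    the number of clusters and the initial seeds for ordinary K-Means."""
    center = X.mean(axis=0)          # reference point stays fixed throughout
    rest, centroids = X.copy(), []
    while len(rest) > 0:
        mask = anomalous_pattern(rest, center)
        if not mask.any():
            break
        if mask.sum() >= min_size:
            centroids.append(rest[mask].mean(axis=0))
        rest = rest[~mask]
    return np.array(centroids)
```

The returned centroids would then seed a standard K-Means run, so the user never has to specify the number of clusters in advance.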
✦ Table of Contents
cover......Page 1
Clustering for Data Mining: A Data Recovery Approach......Page 2
Contents......Page 5
Preface......Page 9
Acknowledgments......Page 14
Author......Page 15
List of Denotations......Page 16
Introduction: Historical Remarks......Page 18
Base words......Page 23
Market towns......Page 25
Primates and Human origin......Page 26
Gene presence-absence profiles......Page 28
1.1.2 Description......Page 31
Body mass......Page 32
1.1.3 Association......Page 34
Digits and patterns of confusion between them......Page 35
Literary masterpieces......Page 37
1.1.4 Generalization......Page 39
One dimensional data......Page 43
One dimensional data within groups......Page 44
Block structure......Page 46
Visualization using an inherent topology......Page 48
Data......Page 49
Cluster structure......Page 50
1.2.2 Criteria for revealing a cluster structure......Page 51
1.2.3 Three types of cluster description......Page 53
1.2.4 Stages of a clustering application......Page 54
Statistics perspective......Page 55
Machine learning perspective......Page 56
Classification knowledge discovery perspective......Page 57
Base words......Page 59
2.1.1 Feature Scale Types......Page 62
2.1.2 Quantitative Case......Page 64
2.1.3 Categorical case......Page 67
2.2.1 Two quantitative variables......Page 69
2.2.2 Nominal and quantitative variables......Page 71
2.2.3 Two nominal variables cross-classified......Page 73
2.2.4 Relation between correlation and contingency......Page 79
2.2.5 Meaning of correlation......Page 80
2.3.1 Data Matrix......Page 82
2.3.2 Feature space: distance and inner product......Page 83
2.4 Pre-processing and standardizing mixed data......Page 86
2.5.1 Dissimilarity and similarity data......Page 92
Standardization of similarity data......Page 93
2.5.2 Contingency and flow data......Page 94
Base words......Page 97
3.1.1 Straight K-Means......Page 100
3.1.2 Square-error criterion......Page 104
3.1.3 Incremental versions of K-Means......Page 106
3.2.1 Traditional approaches to initial setting......Page 108
3.2.2 MaxMin for producing deviate centroids......Page 110
Reference-point-based clustering......Page 112
3.3.1 Iterated Anomalous Pattern for iK-Means......Page 115
3.3.2 Cross-validation of iK-Means results......Page 118
3.4.1 Conventional interpretation aids......Page 122
3.4.2 Contribution and relative contribution tables......Page 123
3.4.3 Cluster representatives......Page 127
3.4.4 Measures of association from ScaD tables......Page 129
3.5 Overall assessment......Page 131
Base words......Page 133
4.1 Agglomeration: Ward algorithm......Page 135
4.2 Divisive clustering with Ward criterion......Page 139
4.2.1 2-Means splitting......Page 140
4.2.2 Splitting by separating......Page 141
4.2.3 Interpretation aids for upper cluster hierarchies......Page 145
4.3 Conceptual clustering......Page 149
4.4.2 Hierarchical clustering for contingency and flow data......Page 154
4.5 Overall assessment......Page 157
Base words......Page 158
5.1 Statistics modeling as data recovery......Page 161
5.1.2 Linear regression......Page 162
5.1.3 Principal component analysis......Page 163
5.1.4 Correspondence factor analysis......Page 166
5.2.1 Equation and data scatter decomposition......Page 169
5.2.2 Contributions of clusters, features, and individual entities......Page 170
5.2.3 Correlation ratio as contribution......Page 171
5.2.4 Partition contingency coefficients......Page 172
5.3.1 Data recovery models with cluster hierarchies......Page 173
5.3.2 Covariances, variances and data scatter decomposed......Page 174
5.3.3 Direct proof of the equivalence between 2-Means and Ward criteria......Page 177
5.3.4 Gower's controversy......Page 178
5.4.1 Similarity and attraction measures compatible with K-Means and Ward criteria......Page 179
5.4.2 Application to binary data......Page 184
Agglomeration......Page 185
5.4.4 Extension to multiple data......Page 187
5.5.1 PCA and data recovery clustering......Page 189
5.5.2 Divisive Ward-like clustering......Page 190
5.5.3 Iterated Anomalous pattern......Page 191
5.5.4 Anomalous pattern versus Splitting......Page 192
5.5.5 One by one clusters for similarity data......Page 193
5.6 Overall assessment......Page 195
Base words......Page 197
6.1.1 Clustering criteria and implementation......Page 200
6.1.2 Partitioning around medoids (PAM)......Page 201
6.1.3 Fuzzy clustering......Page 203
6.1.4 Regression-wise clustering......Page 205
6.1.5 Mixture of distributions and EM algorithm......Page 206
6.1.6 Kohonen self-organizing maps (SOM)......Page 209
6.2.1 Single linkage, minimum spanning tree and connected components......Page 210
6.2.2 Finding a core......Page 214
6.3 Conceptual description of clusters......Page 217
6.3.2 Conceptually describing a partition......Page 218
6.3.3 Describing a cluster with production rules......Page 222
6.3.4 Comprehensive conjunctive description of a cluster......Page 223
6.4 Overall assessment......Page 226
Base words......Page 227
7.1.1 A review......Page 229
7.1.2 Comprehensive description as a feature selector......Page 231
7.1.3 Comprehensive description as a feature extractor......Page 232
7.2.1 Dis/similarity between entities......Page 235
7.2.2 Pre-processing feature based data......Page 236
7.2.3 Data standardization......Page 238
7.3 Similarity on subsets and partitions......Page 240
Set theoretic similarity measures......Page 241
7.3.2 Dis/similarity between partitions......Page 244
Matching based similarity versus Quetelet association......Page 248
Dissimilarity of a set of partitions......Page 249
7.4.1 Imputation as part of pre-processing......Page 250
7.4.4 Least squares approximation......Page 251
7.5.1 Index based validation......Page 252
Internal indexes......Page 253
Use of internal indexes to estimate the number of clusters......Page 255
7.5.2 Resampling for validation and selection......Page 256
Determining the number of clusters with K-Means......Page 260
Model selection for cluster description......Page 261
7.6 Overall assessment......Page 263
1. Pre-Processing......Page 264
2. Clustering......Page 265
3. Interpretation aids......Page 266
Bibliography......Page 268
📜 SIMILAR VOLUMES
Often considered more of an art than a science, books on clustering have been dominated by learning through example, with techniques chosen almost through trial and error. Even the two most popular, and most related, clustering methods, K-Means for partitioning and Ward's method for hierarchical clustering…
This book presents new approaches to data mining and system identification. Algorithms that can be used for the clustering of data are overviewed. New techniques and tools are presented for the clustering, classification, regression and visualization of complex datasets. Special attention is…