𝔖 Scriptorium
✦   LIBER   ✦

Clustering for Data Mining: A Data Recovery Approach

✍ Scribed by Boris Mirkin


Publisher: Chapman & Hall – CRC
Year: 2005
Tongue: English
Leaves: 277
Series: Computer Science and Data Analysis
Edition: 1
Category: Library


✦ Synopsis


Often considered more of an art than a science, the field of clustering has been dominated by learning through examples and by techniques chosen almost through trial and error. Even the most popular clustering methods--K-Means for partitioning the data set and Ward's method for hierarchical clustering--have lacked the theoretical attention that would establish a firm relationship between the two methods and relevant interpretation aids.

Rather than the traditional set of ad hoc techniques, Clustering for Data Mining: A Data Recovery Approach presents a theory that not only closes gaps in the K-Means and Ward methods, but also extends them into areas of current interest, such as clustering mixed-scale data and incomplete clustering. The author suggests original methods for both cluster finding and cluster description, addresses related topics such as principal component analysis, contingency measures, and data visualization, and includes nearly 60 computational examples covering all stages of clustering, from data pre-processing to cluster validation and results interpretation.

The author's unique attention to data recovery methods, theory-based advice, pre- and post-processing issues that are beyond the scope of most texts, and clear, practical instructions for real-world data mining make this book ideally suited for virtually all purposes: for teaching, for self-study, and for professional reference.

✦ Table of Contents


Title
Contents
Preface
Acknowledgments
Author
List of Denotations
Introduction: Historical Remarks
Base words
Market towns
Primates and Human origin
Gene presence-absence profiles
1.1.2 Description
Body mass
1.1.3 Association
Digits and patterns of confusion between them
Literary masterpieces
1.1.4 Generalization
One-dimensional data
One-dimensional data within groups
Block structure
Visualization using an inherent topology
Data
Cluster structure
1.2.2 Criteria for revealing a cluster structure
1.2.3 Three types of cluster description
1.2.4 Stages of a clustering application
Statistics perspective
Machine learning perspective
Classification knowledge discovery perspective
Base words
2.1.1 Feature Scale Types
2.1.2 Quantitative Case
2.1.3 Categorical case
2.2.1 Two quantitative variables
2.2.2 Nominal and quantitative variables
2.2.3 Two nominal variables cross-classified
2.2.4 Relation between correlation and contingency
2.2.5 Meaning of correlation
2.3.1 Data Matrix
2.3.2 Feature space: distance and inner product
2.4 Pre-processing and standardizing mixed data
2.5.1 Dissimilarity and similarity data
Standardization of similarity data
2.5.2 Contingency and flow data
Base words
3.1.1 Straight K-Means
3.1.2 Square error criterion
3.1.3 Incremental versions of K-Means
3.2.1 Traditional approaches to initial setting
3.2.2 MaxMin for producing deviate centroids
Reference point based clustering
3.3.1 Iterated Anomalous pattern for iK-Means
3.3.2 Cross-validation of iK-Means results
3.4.1 Conventional interpretation aids
3.4.2 Contribution and relative contribution tables
3.4.3 Cluster representatives
3.4.4 Measures of association from ScaD tables
3.5 Overall assessment
Base words
4.1 Agglomeration: Ward algorithm
4.2 Divisive clustering with Ward criterion
4.2.1 2-Means splitting
4.2.2 Splitting by separating
4.2.3 Interpretation aids for upper cluster hierarchies
4.3 Conceptual clustering
4.4.2 Hierarchical clustering for contingency and flow data
4.5 Overall assessment
Base words
5.1 Statistics modeling as data recovery
5.1.2 Linear regression
5.1.3 Principal component analysis
5.1.4 Correspondence factor analysis
5.2.1 Equation and data scatter decomposition
5.2.2 Contributions of clusters, features, and individual entities
5.2.3 Correlation ratio as contribution
5.2.4 Partition contingency coefficients
5.3.1 Data recovery models with cluster hierarchies
5.3.2 Covariances, variances and data scatter decomposed
5.3.3 Direct proof of the equivalence between 2-Means and Ward criteria
5.3.4 Gower's controversy
5.4.1 Similarity and attraction measures compatible with K-Means and Ward criteria
5.4.2 Application to binary data
Agglomeration
5.4.4 Extension to multiple data
5.5.1 PCA and data recovery clustering
5.5.2 Divisive Ward-like clustering
5.5.3 Iterated Anomalous pattern
5.5.4 Anomalous pattern versus Splitting
5.5.5 One-by-one clusters for similarity data
5.6 Overall assessment
Base words
6.1.1 Clustering criteria and implementation
6.1.2 Partitioning around medoids (PAM)
6.1.3 Fuzzy clustering
6.1.4 Regression-wise clustering
6.1.5 Mixture of distributions and EM algorithm
6.1.6 Kohonen self-organizing maps (SOM)
6.2.1 Single linkage, minimum spanning tree and connected components
6.2.2 Finding a core
6.3 Conceptual description of clusters
6.3.2 Conceptually describing a partition
6.3.3 Describing a cluster with production rules
6.3.4 Comprehensive conjunctive description of a cluster
6.4 Overall assessment
Base words
7.1.1 A review
7.1.2 Comprehensive description as a feature selector
7.1.3 Comprehensive description as a feature extractor
7.2.1 Dis/similarity between entities
7.2.2 Pre-processing feature based data
7.2.3 Data standardization
7.3 Similarity on subsets and partitions
Set-theoretic similarity measures
7.3.2 Dis/similarity between partitions
Matching-based similarity versus Quetelet association
Dissimilarity of a set of partitions
7.4.1 Imputation as part of pre-processing
7.4.4 Least squares approximation
7.5.1 Index-based validation
Internal indexes
Use of internal indexes to estimate the number of clusters
7.5.2 Resampling for validation and selection
Determining the number of clusters with K-Means
Model selection for cluster description
7.6 Overall assessment
Conclusion: Data Recovery Approach in Clustering
Bibliography


📜 SIMILAR VOLUMES


Clustering for Data Mining: A Data Recov
✍ Boris Mirkin 📂 Library 📅 2005 🏛 Chapman and Hall/CRC 🌐 English

This book gives a smooth, motivated and example-rich introduction to clustering, which is innovative in many aspects. Answers to important questions that are very rarely addressed, if addressed at all, are provided. Examples: (a) what to do if the user has no idea of the number of clusters and/or their lo

Clustering: A Data Recovery Approach, Se
✍ Boris Mirkin (Author) 📂 Library 📅 2013 🏛 Chapman and Hall/CRC

Often considered more of an art than a science, books on clustering have been dominated by learning through example with techniques chosen almost through trial and error. Even the two most popular, and most related, clustering methods--K-Means for partitioning and Ward's method for hierarchical cl

Cluster Analysis for Data Mining and Sys
✍ János Abonyi, Balázs Feil 📂 Library 📅 2007 🏛 Birkhäuser Basel 🌐 English

This book presents new approaches to data mining and system identification. Algorithms that can be used for the clustering of data have been overviewed. New techniques and tools are presented for the clustering, classification, regression and visualization of complex datasets. Special attention i
