
[ACM Press the 2004 ACM SIGKDD international conference - Seattle, WA, USA (2004.08.22-2004.08.25)] Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '04 - A probabilistic framework for semi-supervised clustering

✍ Scribed by Basu, Sugato; Bilenko, Mikhail; Mooney, Raymond J.


Book ID: 125804702
Publisher: ACM Press
Year: 2004
Weight: 183 KB
Category: Article


✦ Synopsis


Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints either to modify the objective function or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
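
To make the idea of a constraint-aware clustering objective concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the synopsis does not give the form of the penalty terms, so the sketch assumes a constant penalty `w` for every violated must-link or cannot-link pair, uses squared Euclidean distance as the distortion measure, and keeps the centroids fixed. The names `objective`, `assign`, `must_link`, `cannot_link`, and `w` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]  # indices of two constrained points


def sq_euclidean(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def objective(points: List[Point], centroids: List[Point], labels: List[int],
              must_link: List[Pair], cannot_link: List[Pair],
              w: float = 1.0) -> float:
    """Total distortion plus an assumed constant penalty w per violated constraint."""
    distortion = sum(sq_euclidean(p, centroids[k]) for p, k in zip(points, labels))
    ml_violations = sum(1 for i, j in must_link if labels[i] != labels[j])
    cl_violations = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return distortion + w * (ml_violations + cl_violations)


def assign(points: List[Point], centroids: List[Point], labels: List[int],
           must_link: List[Pair], cannot_link: List[Pair],
           w: float = 1.0) -> List[int]:
    """One greedy pass: each point in turn takes the label that minimizes
    its own distortion plus penalties, given the other points' current labels."""
    labels = list(labels)  # work on a copy
    for i, p in enumerate(points):
        best_k, best_cost = labels[i], float("inf")
        for k, c in enumerate(centroids):
            cost = sq_euclidean(p, c)
            for a, b in must_link:
                if i in (a, b):
                    other = b if a == i else a
                    if labels[other] != k:  # must-link would be violated
                        cost += w
            for a, b in cannot_link:
                if i in (a, b):
                    other = b if a == i else a
                    if labels[other] == k:  # cannot-link would be violated
                        cost += w
            if cost < best_cost:
                best_k, best_cost = k, cost
        labels[i] = best_k
    return labels


# Toy data: point 4 lies closer to centroid 0 but is must-linked to point 2.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0), (2.0, 0.0)]
centroids = [(0.5, 0.0), (4.5, 0.0)]
initial = [0, 0, 1, 1, 0]
unconstrained = assign(points, centroids, initial, [], [])
constrained = assign(points, centroids, initial, [(4, 2)], [], w=10.0)
```

In this toy run the unconstrained pass leaves point 4 in cluster 0, while the must-link pair with a large `w` pulls it into cluster 1 because paying the extra distortion is cheaper than paying the penalty. A full procedure would alternate such assignment passes with centroid re-estimation until the objective stops decreasing.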


📜 SIMILAR VOLUMES


[ACM Press the 2004 ACM SIGKDD internati
✍ Steyvers, Mark; Smyth, Padhraic; Rosen-Zvi, Michal; Griffiths, Thomas 📂 Article 📅 2004 🏛 ACM Press ⚖ 316 KB

We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probabilit