✦ LIBER ✦

Inference and evaluation of the multinomial mixture model for text clustering

✍ Scribed by Loïs Rigouste; Olivier Cappé; François Yvon

Book ID: 113663720
Publisher: Elsevier Science
Year: 2007
Tongue: English
Weight: 603 KB
Volume: 43
Category: Article
ISSN: 0306-4573
DOI: 10.1016/j.ipm.2006.11.001

✦ Synopsis

In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction.

Recent proposals have been made of probabilistic clustering models, which build "soft" theme-document associations. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between that document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space.

The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We empirically show that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, owing to the specific profile of the word count distributions. Using the fact that the model parameters can be analytically integrated out, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach.
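The basic EM procedure mentioned in the synopsis can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `em_multinomial_mixture`, the random soft initialization, and the additive `smoothing` constant are all assumptions chosen for illustration, and the synopsis itself stresses that initialization and smoothing choices matter in practice. The multinomial coefficient is dropped from the E-step because it is the same for every theme.

```python
import numpy as np

def em_multinomial_mixture(counts, n_themes, n_iter=50, smoothing=1.0, seed=0):
    """Illustrative EM for a mixture of multinomials over word counts.

    counts: (n_docs, n_words) array of word counts.
    Returns (alpha, beta, resp): theme weights, per-theme word
    probabilities, and per-document theme posteriors.
    All names and defaults here are hypothetical, not from the paper.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_docs, _ = counts.shape

    # Assumed initialization: random soft assignments of documents to themes.
    resp = rng.dirichlet(np.ones(n_themes), size=n_docs)

    for _ in range(n_iter):
        # M-step: re-estimate theme weights and word probabilities,
        # with additive smoothing so that no probability is exactly zero.
        alpha = (resp.sum(axis=0) + smoothing) / (n_docs + n_themes * smoothing)
        word_mass = resp.T @ counts + smoothing        # (n_themes, n_words)
        beta = word_mass / word_mass.sum(axis=1, keepdims=True)

        # E-step: posterior theme probabilities, computed in log space
        # to avoid underflow on long documents.
        log_post = np.log(alpha) + counts @ np.log(beta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

    return alpha, beta, resp
```

Each row of `resp` is the kind of per-document probability vector the synopsis describes, so it can be read as a soft cluster assignment or used as a low-dimensional representation of the document.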

📜 SIMILAR VOLUMES

Multinomial mixture model with feature selection for text clustering

✍ Minqiang Li; Liang Zhang 📂 Article 📅 2008 🏛 Elsevier Science 🌐 English ⚖ 185 KB

Evaluating mixture modeling for clustering: Recommendations and cautions.

✍ Steinley, Douglas; Brusco, Michael J. 📂 Article 📅 2011 🏛 American Psychological Association 🌐 English ⚖ 491 KB

Clustering for binary data and mixture models—choice of the model

✍ Nadif, M.; Govaert, G. 📂 Article 📅 1997 🏛 John Wiley and Sons 🌐 English ⚖ 101 KB

When cluster analysis is based on mixture models, choosing an appropriate model is a difficult problem. Previous studies usually addressed a part of this problem by estimating the number of clusters and assuming the type of model to be known. Various criteria to be minimized have been proposed to me

Approximate distribution and test of fit for the clustering effect in the Dirichlet multinomial model

✍ Wilson, Jeffrey R. 📂 Article 📅 1986 🏛 Taylor and Francis Group 🌐 English ⚖ 392 KB

SMIXTURE: strategy for mixture model clustering of multivariate images

✍ Thanh N. Tran; Ron Wehrens; Lutgarde M. C. Buydens 📂 Article 📅 2005 🏛 John Wiley and Sons 🌐 English ⚖ 408 KB

A finite mixture model for the clustering of mixed-mode data

✍ B.S. Everitt 📂 Article 📅 1988 🏛 Elsevier Science 🌐 English ⚖ 309 KB