✦ LIBER ✦

Data mining for text categorization with semi-supervised agglomerative hierarchical clustering

✍ Scribed by Antonio Gómez Skarmeta; Amine Bensaid; Nadia Tazi

Publisher: John Wiley and Sons
Year: 2000
Tongue: English
Weight: 119 KB
Volume: 15
Category: Article
ISSN: 0884-8173
DOI: 10.1002/(sici)1098-111x(200007)15:7<633::aid-int4>3.0.co;2-8


✦ Synopsis

In this paper we study the use of a semi-supervised agglomerative hierarchical clustering (ssAHC) algorithm for text categorization, which consists of assigning text documents to predefined categories. ssAHC is (i) a clustering algorithm that (ii) uses a finite design set of labeled data to (iii) help agglomerative hierarchical clustering (AHC) algorithms partition a finite set of unlabeled data and then (iv) terminates without the capability to label other objects. We first describe the text representation method we use in this work; we then present a feature selection method that is used to reduce the dimensionality of the feature space. Finally, we apply the ssAHC algorithm to the Reuters database of documents and show that its performance is superior to the Bayes classifier and to the Expectation-Maximization algorithm combined with the Bayes classifier. We also show that ssAHC helps AHC techniques to improve their performance.

📜 SIMILAR VOLUMES

Integrating WordNet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization

✍ Mohammed Benkhalifa; Abdelhak Mouradi; Houssaine Bouyakhf 📂 Article 📅 2001 🏛 John Wiley and Sons 🌐 English ⚖ 168 KB

Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application for many learning approaches, which have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, f