Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. Many learning approaches have been applied to TC and have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, f
Data mining for text categorization with semi-supervised agglomerative hierarchical clustering
✍ Scribed by Antonio Gómez Skarmeta; Amine Bensaid; Nadia Tazi
- Publisher: John Wiley and Sons
- Year: 2000
- Tongue: English
- Weight: 119 KB
- Volume: 15
- Category: Article
- ISSN: 0884-8173
✦ Synopsis
In this paper we study the application of a semi-supervised agglomerative hierarchical clustering (ssAHC) algorithm to text categorization, which consists of assigning text documents to predefined categories. ssAHC is (i) a clustering algorithm that (ii) uses a finite design set of labeled data to (iii) help agglomerative hierarchical clustering (AHC) algorithms partition a finite set of unlabeled data and then (iv) terminates without the capability to label other objects. We first describe the text representation method we use in this work; we then present a feature selection method that is used to reduce the dimensionality of the feature space. Finally, we apply the ssAHC algorithm to the Reuters database of documents and show that its performance is superior to that of the Bayes classifier and of the Expectation-Maximization algorithm combined with the Bayes classifier. We also show that ssAHC helps AHC techniques improve their performance.
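The synopsis does not say how the labeled design set steers the merging, so the following is only a minimal sketch of the general idea, not the authors' ssAHC. It assumes one plausible rule: two clusters may merge unless they carry different labels, and a merged cluster inherits whichever label is present. The function name `ss_ahc_sketch`, the single-linkage Euclidean distance, and the toy data are all illustrative assumptions.

```python
from itertools import combinations
from math import dist


def ss_ahc_sketch(points, labels):
    """Label-constrained agglomerative clustering (illustrative assumption).

    points: list of coordinate tuples (e.g. document feature vectors).
    labels: list of the same length; a class label for design-set items,
            None for unlabeled items.
    Returns one label per point. Like the algorithm described in the
    synopsis, it partitions the given data and cannot label new objects.
    """
    # Start with one cluster per point: (member indices, label or None).
    clusters = [([i], labels[i]) for i in range(len(points))]

    def compatible(a, b):
        # Assumed constraint: never merge clusters with different labels.
        return a[1] is None or b[1] is None or a[1] == b[1]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a[0] for j in b[0])

    while True:
        candidates = [
            (linkage(a, b), ia, ib)
            for (ia, a), (ib, b) in combinations(enumerate(clusters), 2)
            if compatible(a, b)
        ]
        if not candidates:
            break  # only clusters with conflicting labels remain
        _, ia, ib = min(candidates)
        a, b = clusters[ia], clusters[ib]
        merged = (a[0] + b[0], a[1] if a[1] is not None else b[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)

    # Propagate each cluster's label to all of its members.
    result = [None] * len(points)
    for members, label in clusters:
        for i in members:
            result[i] = label
    return result
```

With at least one labeled item per category, the loop stops when exactly one cluster per category remains, so every unlabeled item ends up in a labeled cluster. Single linkage was chosen only for brevity; any AHC linkage could be substituted in `linkage`.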
📜 SIMILAR VOLUMES