✦ LIBER ✦

Integrating WordNet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization

✍ Scribed by Mohammed Benkhalifa; Abdelhak Mouradi; Houssaine Bouyakhf

Publisher: John Wiley and Sons
Year: 2001
Tongue: English
Weight: 168 KB
Volume: 16
Category: Article
ISSN: 0884-8173
DOI: 10.1002/int.1042

✦ Synopsis

Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. Many learning approaches have been applied to TC and have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we propose integrating external WordNet lexical information to supplement the training data of a semi-supervised clustering algorithm for text categorization. This algorithm (i) uses a finite design set of labeled data to (ii) help agglomerative hierarchical clustering (AHC) algorithms partition a finite set of unlabeled data and then (iii) terminates without the capacity to classify other objects. It is called the "semi-supervised agglomerative hierarchical clustering algorithm" (ssAHC). Our experiments use the Reuters-21578 database and consist of binary classifications for categories selected from the 89 TOPICS classes of the Reuters collection. Using the vector space model (VSM), each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssAHC improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data.
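
The two ideas in the synopsis, augmenting a document's term vector with externally derived features and letting labeled documents guide an agglomerative clustering of unlabeled ones, can be illustrated with a small sketch. This is not the authors' ssAHC algorithm, whose details are not given here. It is a simplified stand-in that assumes average-link merging on cosine similarity and a rule that clusters carrying different labels are never merged. `TOY_LEXICON`, `augment`, `cosine`, and `seeded_ahc` are hypothetical names, and the tiny dictionary merely stands in for real WordNet lookups.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet: maps a term to related concepts.
TOY_LEXICON = {
    "wheat": ["grain", "cereal"],
    "maize": ["grain", "cereal"],
    "loan": ["debt", "finance"],
    "bank": ["finance"],
    "interest": ["finance"],
}

def augment(tokens, lexicon=TOY_LEXICON):
    """Term counts (original features) plus counts of lexicon concepts (external features)."""
    vec = Counter(tokens)
    for tok in tokens:
        for concept in lexicon.get(tok, []):
            vec["ext:" + concept] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def seeded_ahc(labeled, unlabeled):
    """Label unlabeled docs by agglomerative merging seeded with labeled docs.

    labeled: list of (tokens, label); unlabeled: list of token lists.
    Assumption: average-link merging; differently labeled clusters never merge.
    """
    clusters = [{"vecs": [augment(t)], "label": lab, "ids": []} for t, lab in labeled]
    clusters += [{"vecs": [augment(t)], "label": None, "ids": [i]}
                 for i, t in enumerate(unlabeled)]

    def avg_sim(c1, c2):
        sims = [cosine(u, v) for u in c1["vecs"] for v in c2["vecs"]]
        return sum(sims) / len(sims)

    # Merge until every remaining cluster carries a label.
    while any(c["label"] is None for c in clusters):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                li, lj = clusters[i]["label"], clusters[j]["label"]
                if li is not None and lj is not None and li != lj:
                    continue  # conflicting labels: merge not allowed
                s = avg_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best is None:
            break  # nothing left to merge (e.g. no labeled seeds at all)
        _, i, j = best
        ci, cj = clusters[i], clusters[j]
        merged = {
            "vecs": ci["vecs"] + cj["vecs"],
            "label": ci["label"] if ci["label"] is not None else cj["label"],
            "ids": ci["ids"] + cj["ids"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    assigned = {i: c["label"] for c in clusters for i in c["ids"]}
    return [assigned.get(i) for i in range(len(unlabeled))]

labeled = [(["wheat", "harvest"], "grain"), (["bank", "loan"], "money")]
unlabeled = [["maize", "export"], ["interest", "rate"]]
print(seeded_ahc(labeled, unlabeled))
```

In this toy input the unlabeled documents share no surface terms with the labeled ones, so any nonzero similarity comes only from the added `ext:` features. Like the procedure described in the synopsis, the sketch partitions the given unlabeled set and then stops; it produces no classifier for new documents.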

📜 SIMILAR VOLUMES

Data mining for text categorization with semi-supervised agglomerative hierarchical clustering

✍ Antonio Gómez Skarmeta; Amine Bensaid; Nadia Tazi 📂 Article 📅 2000 🏛 John Wiley and Sons 🌐 English ⚖ 119 KB

In this paper we study the use of a semi-supervised agglomerative hierarchical clustering (ssAHC) algorithm to text categorization, which consists of assigning text documents to predefined categories. ssAHC is (i) a clustering algorithm that (ii) uses a finite design set of labeled dat