✦ LIBER ✦

Nonhierarchical document clustering based on a tolerance rough set model

✍ Scribed by Tu Bao Ho; Ngoc Binh Nguyen

Publisher: John Wiley and Sons
Year: 2002
Tongue: English
Weight: 118 KB
Volume: 17
Category: Article
ISSN: 0884-8173
DOI: 10.1002/int.10016

✦ Synopsis

Document clustering, the grouping of documents into several clusters, has been recognized as a means for improving efficiency and effectiveness of information retrieval and text mining. With the growing importance of electronic media for storing and exchanging large textual databases, document clustering becomes more significant. Hierarchical document clustering methods, having a dominant role in document clustering, seem inadequate for large document databases as the time and space requirements are typically of order O(N^3) and O(N^2), where N is the number of index terms in a database. In addition, when each document is characterized by only several terms or keywords, clustering algorithms often produce poor results as most similarity measures yield many zero values. In this article we introduce a nonhierarchical document clustering algorithm based on a proposed tolerance rough set model (TRSM). This algorithm contributes two considerable features: (1) it can be applied to large document databases, as the time and space requirements are of order O(N log N) and O(N), respectively; and (2) it can be well adapted to documents characterized by a few terms due to the TRSM's ability of semantic calculation. The algorithm has been evaluated and validated by experiments on test collections.
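
The synopsis does not spell out how the TRSM is defined, so the following is only a minimal sketch of the zero-similarity problem it mentions and of one way a tolerance relation over terms can soften it. It assumes, as an illustration and not as the authors' stated method, that two terms are "tolerant" when they co-occur in at least `theta` documents, and that a document's upper approximation is every term whose tolerance class overlaps the document. The function names, the toy corpus, and the value of `theta` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two term sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tolerance_classes(docs, theta):
    """Map each term to itself plus every term it co-occurs with
    in at least `theta` documents (assumed tolerance relation)."""
    cooc = defaultdict(int)
    for d in docs:
        for s, t in combinations(sorted(d), 2):
            cooc[(s, t)] += 1
    classes = {t: {t} for d in docs for t in d}
    for (s, t), count in cooc.items():
        if count >= theta:
            classes[s].add(t)
            classes[t].add(s)
    return classes


def upper_approximation(doc, classes):
    """All corpus terms whose tolerance class shares a term with `doc`."""
    return {t for t, cls in classes.items() if cls & doc}


# Toy corpus: each document is a small set of index terms.
corpus = [
    {"rough", "set"},
    {"rough", "set", "tolerance"},
    {"tolerance", "clustering"},
    {"clustering", "document"},
    {"clustering", "document", "retrieval"},
    {"rough", "set", "approximation"},
]
classes = tolerance_classes(corpus, theta=2)

# Two short documents with no term in common.
a = {"rough", "tolerance"}
b = {"set", "clustering"}

raw = jaccard(a, b)
enriched = jaccard(upper_approximation(a, classes),
                   upper_approximation(b, classes))
print(raw, enriched)
```

In this toy setting the two keyword sets share no term, so their raw similarity is zero, while their upper approximations overlap because "rough" and "set" co-occur often enough in the corpus. Building the co-occurrence table this way is quadratic in the number of terms per document, which is acceptable only because each document is assumed to carry a handful of keywords.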