Nonhierarchical document clustering based on a tolerance rough set model
✍ Written by Tu Bao Ho; Ngoc Binh Nguyen
- Publisher
- John Wiley and Sons
- Year
- 2002
- Language
- English
- File size
- 118 KB
- Volume
- 17
- Category
- Article
- ISSN
- 0884-8173
✦ Synopsis
Document clustering, the grouping of documents into several clusters, is recognized as a means of improving the efficiency and effectiveness of information retrieval and text mining. With the growing importance of electronic media for storing and exchanging large textual databases, document clustering is becoming more significant. Hierarchical document clustering methods, which play a dominant role in document clustering, seem inadequate for large document databases, as their time and space requirements are typically of order O(N^3) and O(N^2), respectively, where N is the number of index terms in a database. In addition, when each document is characterized by only a few terms or keywords, clustering algorithms often produce poor results because most similarity measures yield many zero values. In this article we introduce a nonhierarchical document clustering algorithm based on a proposed tolerance rough set model (TRSM). The algorithm has two notable features: (1) it can be applied to large document databases, as its time and space requirements are of order O(N log N) and O(N), respectively; and (2) it is well suited to documents characterized by only a few terms, owing to the TRSM's capacity for semantic calculation. The algorithm has been evaluated and validated by experiments on test collections.
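The zero-similarity problem mentioned in the synopsis, and the way a tolerance-based model can ease it, can be illustrated with a small sketch. The synopsis does not spell out the model, so the code below rests on assumptions: a term's tolerance class is taken to be the term itself plus every term that co-occurs with it in at least `theta` documents, and a document's upper approximation is the set of terms whose tolerance class overlaps the document. The toy documents, the threshold value, the use of Jaccard similarity, and all function names are hypothetical and chosen only for illustration; this is not the article's algorithm.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Assumed definition: each term maps to itself plus all terms that
    co-occur with it in at least `theta` documents."""
    cooc = defaultdict(int)
    for terms in docs:
        # Sort so that each unordered pair is counted under a single key.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    vocab = set().union(*docs)
    classes = {t: {t} for t in vocab}
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Assumed definition: every term whose tolerance class shares
    at least one term with the document."""
    return {t for t, cls in classes.items() if cls & doc}


def jaccard(a, b):
    """Plain set-overlap similarity, used here only as a stand-in measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy collection: each document has only a few index terms.
docs = [
    {"rough", "sets"},
    {"tolerance", "relation"},
    {"rough", "tolerance"},
    {"rough", "tolerance", "sets"},
]

classes = tolerance_classes(docs, theta=2)
raw_similarity = jaccard(docs[0], docs[1])
enriched_similarity = jaccard(
    upper_approximation(docs[0], classes),
    upper_approximation(docs[1], classes),
)
print(raw_similarity, enriched_similarity)
```

In this toy collection the first two documents share no terms, so their raw similarity is zero. Their upper approximations overlap because "rough" and "tolerance" co-occur often enough to fall into each other's tolerance classes, which gives a nonzero similarity.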