## Aims: The objectives of this study were to replicate smoker profiles identified in Batra et al. (in press) and to develop a cluster-based classification system to categorize new cases into smoker profiles so that an appropriate tailored intervention could be applied. Methods: Participants w
Clustering and classification of large document bases in a parallel environment
Scribed by Ruocco, Anthony S.; Frieder, Ophir
- Publisher: John Wiley and Sons
- Year: 1997
- Tongue: English
- Weight: 189 KB
- Volume: 48
- Category: Article
- ISSN: 0002-8231
Synopsis
Development of cluster-based search systems has been hampered by prohibitive times involved in clustering large document sets. Once completed, maintaining cluster organizations is difficult in dynamic file environments. We propose the use of parallel computing systems to overcome the computationally intense clustering process. Two operations are examined. The first is clustering a document set and the second is classifying the document set. A subset of the TIPSTER corpus, specifically, articles from the Wall Street Journal, is used. Document set classification was performed without the large storage requirement (potentially as high as 522M) for ancillary data matrices. In all cases, the time performance of the parallel system was an improvement over sequential system times, and produced the same clustering and classification scheme. Some results show near linear speed up in higher threshold clustering applications.

…ing, information retrieval systems must be prepared to process large amounts of data. As systems become inundated with more and more information, it may not be possible for people to fully understand what they have collected. The ability to produce information that categorizes the data, that is, the ability to produce metadata, is in many cases as important as identifying specific pieces of data within a document set.

The ever-increasing size, coupled with the increasing requirements to classify, group, and process the document sets, all within nonprohibitive execution times, motivates the use of parallel processing computers. Parallel information retrieval focuses on this particular domain (Pogue, 1988; Rasmussen, 1991; …). Query processing assumes an organized data set as input. We, however, rely on parallel computing to organize the data by performing two cluster preprocessing operations. The first

*Ophir Frieder is currently on leave from the Department of Computer Science, George Mason
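The synopsis names two operations, clustering a document set and classifying documents against existing clusters, but this excerpt does not give the algorithm itself. The following is a minimal sequential sketch of one common approach (single-pass, threshold-based clustering with cosine similarity), offered only as an illustration. The function names `vectorize`, `cosine`, `classify`, and `cluster`, the bag-of-words weighting, and the choice of summed term counts as centroids are all assumptions and are not taken from the article. Documents are compared only against cluster centroids, so no document-by-document similarity matrix is ever stored.

```python
import math
from collections import Counter


def vectorize(text):
    """Hypothetical, deliberately simple term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(vec, centroids, threshold):
    """Return the index of the most similar centroid whose similarity
    is at least `threshold`, or None if no centroid qualifies."""
    best, best_sim = None, threshold
    for i, centroid in enumerate(centroids):
        sim = cosine(vec, centroid)
        if sim >= best_sim:
            best, best_sim = i, sim
    return best


def cluster(docs, threshold):
    """Single-pass clustering: each document joins the best matching
    existing cluster, or starts a new one if none reaches the threshold."""
    centroids, members = [], []
    for doc_id, text in enumerate(docs):
        vec = vectorize(text)
        idx = classify(vec, centroids, threshold)
        if idx is None:
            centroids.append(Counter(vec))
            members.append([doc_id])
        else:
            centroids[idx].update(vec)  # centroid kept as summed term counts
            members[idx].append(doc_id)
    return centroids, members


docs = [
    "stock market shares rise",
    "stock market shares fall",
    "football match score",
]
centroids, members = cluster(docs, threshold=0.5)
print(members)  # [[0, 1], [2]]
print(classify(vectorize("market shares"), centroids, threshold=0.5))  # 0
```

Once the centroids are fixed, each call to `classify` is independent of the others, which is one reason this step is a natural candidate for distribution across processors. How the article actually partitions the work is not described in this excerpt.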
SIMILAR VOLUMES
The Virtual Programming Laboratory (VPL) is a Web-based virtual programming environment built based on a client-server architecture. The system can be accessed on any platform (Unix, PC or Mac) using a standard Java-enabled browser. Software delivery over the Web imposes a novel set of constraints o
A small short muscle frequently acts across a joint in parallel with a vastly larger and longer muscle; therefore it should play a minimal role in the mechanical control of that joint. This study provides evidence suggesting that the small member of such a "parallel muscle combination" (PMC) may ser