Clustering and classification of large document bases in a parallel environment

✍ Scribed by Ruocco, Anthony S.; Frieder, Ophir


Publisher: John Wiley and Sons
Year: 1997
Tongue: English
Weight: 189 KB
Volume: 48
Category: Article
ISSN: 0002-8231

✦ Synopsis


Development of cluster-based search systems has been hampered by prohibitive times involved in clustering large document sets. Once completed, maintaining cluster organizations is difficult in dynamic file environments. We propose the use of parallel computing systems to overcome the computationally intense clustering process. Two operations are examined. The first is clustering a document set and the second is classifying the document set. A subset of the TIPSTER corpus, specifically, articles from the Wall Street Journal, is used. Document set classification was performed without the large storage requirement (potentially as high as 522M) for ancillary data matrices. In all cases, the time performance of the parallel system was an improvement over sequential system times, and produced the same clustering and classification scheme. Some results show near linear speed up in higher threshold clustering applications.

…ing, information retrieval systems must be prepared to process large amounts of data. As systems become inundated with more and more information, it may not be possible for people to fully understand what they have collected. The ability to produce information that categorizes the data, that is, the ability to produce metadata, is in many cases as important as identifying specific pieces of data within a document set. The ever-increasing size, coupled with the increasing requirements to classify, group, and process the document sets, all within nonprohibitive execution times, motivates the use of parallel processing computers. Parallel information retrieval focuses on this particular domain (Pogue, 1988; Rasmussen, 1991; …). Query processing assumes an organized data set as input. We, however, rely on parallel computing to organize the data by performing two cluster preprocessing operations. The first

*Ophir Frieder is currently on leave from the Department of Computer Science, George Mason


📜 SIMILAR VOLUMES


Development and validation of a cluster-
✍ Susan E. Collins; Iris Torchalla; Martina Schröter; Gerhard Buchkremer; Anil Bat 📂 Article 📅 2008 🏛 John Wiley and Sons 🌐 English ⚖ 126 KB

Aims: The objectives of this study were to replicate smoker profiles identified in Batra et al. (in press) and to develop a cluster-based classification system to categorize new cases into smoker profiles so that an appropriate tailored intervention could be applied. Methods: Participants w

Using Java and JavaScript in the Virtual
✍ Dincer, Kivanc; Fox, Geoffrey C. 📂 Article 📅 1997 🏛 John Wiley and Sons 🌐 English ⚖ 277 KB 👁 2 views

The Virtual Programming Laboratory (VPL) is a Web-based virtual programming environment built based on a client-server architecture. The system can be accessed on any platform (Unix, PC or Mac) using a standard Java-enabled browser. Software delivery over the Web imposes a novel set of constraints o

A comparison of spindle concentrations i
✍ D. Peck; D. F. Buxton; A. Nitz 📂 Article 📅 1984 🏛 John Wiley and Sons 🌐 English ⚖ 607 KB

A small short muscle frequently acts across a joint in parallel with a vastly larger and longer muscle; therefore it should play a minimal role in the mechanical control of that joint. This study provides evidence suggesting that the small member of such a "parallel muscle combination" (PMC) may ser