## Abstract
Web pages from a Web site can often be associated with concepts in an ontology, and pairs of Web pages also can be associated with relationships between concepts. With such associations, the Web site can be searched, browsed, or even reorganized based on the concept and relationship lab
Automating extraction of logical domains in a web site
✍ Scribed by Necip Fazıl Ayan; Wen-Syan Li; Okan Kolak
- Publisher: Elsevier Science
- Year: 2002
- Language: English
- File size: 535 KB
- Volume: 43
- Category: Article
- ISSN: 0169-023X
✦ Synopsis
The domain name field in a uniform resource locator (URL) has been viewed as a natural choice for organizing Web pages. For example, Web search results may be grouped by domain and presented to users as clusters for ease of visualization. With this approach, however, large Web sites such as Geocities, W3C, and www.cs.umd.edu tend to yield many matches, which leads to a few large, flat, and unorganized clusters. In fact, many pages in these sites are "logical domains" in their own right. For example, the Web sites for projects at a university or the XML section at W3C could be viewed as logical domains. In this paper, we propose the concept of a logical domain, which is identified by semantic relatedness, as opposed to a physical domain, which is identified simply by domain name. Identifying logical domains is important to many Web applications, such as query result reorganization, site map generation, and topic distillation. We have developed and implemented a set of rules based on link structure, path information, document metadata, and citations to identify logical domain entry pages (i.e., the root pages of logical domains). The importance of each rule is adjusted automatically using a novel decision tree algorithm and training data provided by human feedback. We also develop techniques to define the boundary of each logical domain based on the identified logical domain entry pages. We have conducted extensive experiments on real Web sites to evaluate the effectiveness of the proposed techniques. The experimental results show that our techniques perform very well in extracting logical domains in a Web site.
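
To make the contrast between physical and logical domains concrete, the following Python sketch groups URLs first by host name and then by a set of already-identified entry pages. It is only an illustration under stated assumptions: the function names, the `www.example.edu` URLs, and the path-prefix assignment are hypothetical and are not the rules or the boundary-definition technique described in the paper.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_physical_domain(urls):
    """Group URLs by the host name in the URL (the physical domain)."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


def assign_to_logical_domains(urls, entry_pages):
    """Assign each URL to the deepest entry page whose directory contains it.

    Assumption: path-prefix matching stands in for the paper's
    boundary-definition technique, which the abstract does not detail.
    URLs that match no entry page are collected under the key None.
    """
    def directory(url):
        parsed = urlparse(url)
        # Directory part of the path, always ending in "/".
        return parsed.netloc, parsed.path.rsplit("/", 1)[0] + "/"

    entries = {entry: directory(entry) for entry in entry_pages}
    assignment = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        best = None
        for entry, (host, prefix) in entries.items():
            if parsed.netloc == host and parsed.path.startswith(prefix):
                # Prefer the entry page with the longest (deepest) directory.
                if best is None or len(prefix) > len(entries[best][1]):
                    best = entry
        assignment[best].append(url)
    return dict(assignment)


# Hypothetical pages on a single physical domain.
urls = [
    "http://www.example.edu/index.html",
    "http://www.example.edu/projects/alpha/index.html",
    "http://www.example.edu/projects/alpha/papers/p1.html",
    "http://www.example.edu/projects/beta/index.html",
]
# Hypothetical entry pages, assumed to have been identified beforehand.
entry_pages = [
    "http://www.example.edu/index.html",
    "http://www.example.edu/projects/alpha/index.html",
]

physical = group_by_physical_domain(urls)
logical = assign_to_logical_domains(urls, entry_pages)
```

Here all four pages fall into one physical-domain cluster, while the entry-page grouping separates the pages under `/projects/alpha/` from the rest of the site.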
📜 SIMILAR VOLUMES
Extractive purification of proteins, using fumarase from Saccharomyces cerevisiae as a model system, is demonstrated to allow automatic and continuous processing desirable for industrial production purposes.