Tunneling enhanced by web page content block partition for focused crawling
✍ Scribed by Tao Peng; Changli Zhang; Wanli Zuo
- Book ID
- 102809637
- Publisher
- John Wiley and Sons
- Year
- 2007
- Language
- English
- File size
- 517 KB
- Volume
- 20
- Category
- Article
- ISSN
- 1532-0626
- DOI
- 10.1002/cpe.1211
✦ Synopsis
Abstract
The complexity of web information environments and the prevalence of multiple‐topic web pages significantly degrade the performance of focused crawling. A highly relevant region in a web page may be obscured by the low overall relevance of that page. Segmenting web pages into smaller units will significantly improve the performance. Traversing irrelevant pages to reach a relevant one (tunneling) can improve the effectiveness of focused crawling by expanding its reach. This paper presents a heuristic‐based method to enhance focused crawling performance. The method uses a Document Object Model (DOM)‐based page partition algorithm to segment a web page into content blocks with a hierarchical structure, and investigates how to take advantage of block‐level evidence to enhance focused crawling by tunneling. Page segmentation can transform an uninteresting multi‐topic web page into several single‐topic context blocks, some of which may be interesting. Accordingly, a focused crawler can pursue the interesting content blocks to retrieve the relevant pages. Experimental results indicate that this approach outperforms the Breadth‐First, Best‐First and Link‐context algorithms in harvest rate, target recall and target length. Copyright © 2007 John Wiley & Sons, Ltd.
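The block-level idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the partitioner below is flat rather than hierarchical, it assumes well-formed HTML with explicitly closed tags, and the names `BlockPartitioner`, `relevance`, `links_to_follow`, the `BLOCK_TAGS` set, the term-overlap score and the `0.5` threshold are all hypothetical choices made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags mark content-block boundaries; the paper's
# partition rules are not given in the abstract.
BLOCK_TAGS = {"div", "td", "li", "p", "section", "article"}


class BlockPartitioner(HTMLParser):
    """Collect text and links for each innermost block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished blocks: {"text": str, "links": [str]}
        self.stack = []   # currently open blocks, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"text": [], "links": []})
        elif tag == "a" and self.stack:
            href = dict(attrs).get("href")
            if href:
                self.stack[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            block = self.stack.pop()
            text = " ".join(block["text"])
            if text or block["links"]:
                self.blocks.append({"text": text, "links": block["links"]})

    def handle_data(self, data):
        # Text outside any block is ignored in this sketch.
        if self.stack and data.strip():
            self.stack[-1]["text"].append(data.strip())


def relevance(text, topic_terms):
    """Toy score: fraction of topic terms that occur in the text."""
    if not topic_terms:
        return 0.0
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & topic_terms) / len(topic_terms)


def links_to_follow(html, topic_terms, threshold=0.5):
    """Return links found in blocks whose score meets the threshold."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    picked = []
    for block in parser.blocks:
        if relevance(block["text"], topic_terms) >= threshold:
            picked.extend(block["links"])
    return picked


page = """
<div>Cheap flights and hotel deals <a href="/travel">travel</a></div>
<div>Focused crawler design for topical web search <a href="/crawler">more</a></div>
"""
print(links_to_follow(page, {"focused", "crawler", "web"}))
```

Scoring each block separately lets a crawler follow links out of an on-topic block even when the rest of the page is unrelated. A real implementation would need a proper relevance model and the hierarchical block structure the abstract describes.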