
[ACM Press the ninth ACM SIGKDD international conference - Washington, D.C. (2003.08.24-2003.08.27)] Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '03 - Eliminating noisy information in Web pages for data mining

โœ Scribed by Yi, Lan; Liu, Bing; Li, Xiaoli


Book ID: 125490627
Publisher: ACM Press
Year: 2003
Tongue: English
Weight: 458 KB
Category: Article
ISBN-13: 9781581137378


✦ Synopsis


A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has such blocks as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call these blocks that are not the main content blocks of the page the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining. Eliminating these noises is thus of great importance. In this paper, we propose a noise elimination technique based on the following observation: In a given Web site, noisy blocks usually share some common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, a Style Tree can be built for the site, which we call the Site Style Tree (SST). We then introduce an information based measure to determine which parts of the SST represent noises and which parts represent the main contents of the site. The SST is employed to detect and eliminate noises in any Web page of the site by mapping this page to the SST. The proposed technique is evaluated with two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination technique is able to improve the mining results significantly.
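The core observation in the synopsis (noisy blocks repeat across a site, main content varies) can be illustrated with a small Python sketch. This is a deliberately simplified stand-in and does not reproduce the paper's Style Tree or its information-based measure: it assumes each page has already been flattened into (style path, text) pairs, and it scores each style path by the normalized Shannon entropy of the texts seen there. The path strings, the `find_noisy_paths` and `clean_page` helpers, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def find_noisy_paths(pages, threshold=0.5):
    """Return style paths whose text barely varies across the sampled pages.

    pages: list of pages, each a list of (style_path, text) pairs.
    threshold: assumed cutoff on normalized entropy (0 = identical, 1 = all distinct).
    """
    texts_by_path = defaultdict(list)
    for page in pages:
        for path, text in page:
            texts_by_path[path].append(text)

    noisy = set()
    for path, texts in texts_by_path.items():
        total = len(texts)
        if total < 2:
            continue  # seen only once: no evidence of repetition, keep it
        counts = Counter(texts)
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        diversity = entropy / math.log(total)  # normalize to the range 0..1
        if diversity <= threshold:
            noisy.add(path)
    return noisy


def clean_page(page, noisy_paths):
    """Drop every block of a page whose style path was flagged as noisy."""
    return [(path, text) for path, text in page if path not in noisy_paths]


# Toy site sample: navigation and footer repeat, the main paragraph differs.
pages = [
    [("body/td.nav", "Home | Products | Contact"),
     ("body/div.main/p", "Review of camera A"),
     ("body/div.footer", "Copyright 2003")],
    [("body/td.nav", "Home | Products | Contact"),
     ("body/div.main/p", "Review of laptop B"),
     ("body/div.footer", "Copyright 2003")],
    [("body/td.nav", "Home | Products | Contact"),
     ("body/div.main/p", "Review of phone C"),
     ("body/div.footer", "Copyright 2003")],
]

noisy = find_noisy_paths(pages)
cleaned = clean_page(pages[0], noisy)
```

On this toy sample the navigation and footer paths are flagged and only the main paragraph survives cleaning. A flat path-to-text map ignores the tree structure and presentation styles that the synopsis says the Site Style Tree captures, so it should be read only as an intuition aid.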


📜 SIMILAR VOLUMES


[ACM Press the ninth ACM SIGKDD internat
โœ Yi, Lan; Liu, Bing; Li, Xiaoli ๐Ÿ“‚ Article ๐Ÿ“… 2003 ๐Ÿ› ACM Press ๐ŸŒ English โš– 458 KB

This conference brings together researchers and practitioners and focuses on new developments in knowledge discovery and data mining. The challenge of extracting knowledge from data is an area of common interest to researchers in several fields, including statistics, databases, pattern recognition,

[ACM Press the ninth ACM SIGKDD internat
โœ Last, Mark; Friedman, Menahem; Kandel, Abraham ๐Ÿ“‚ Article ๐Ÿ“… 2003 ๐Ÿ› ACM Press ๐ŸŒ English โš– 289 KB


[ACM Press the ninth ACM SIGKDD internat
โœ Hsu, Wynne; Dai, Jing; Lee, Mong Li ๐Ÿ“‚ Article ๐Ÿ“… 2003 ๐Ÿ› ACM Press ๐ŸŒ English โš– 398 KB
