✦ LIBER ✦

Link-based similarity measures for the classification of Web documents

✍ Scribed by Pável Calado; Marco Cristo; Marcos André Gonçalves; Edleno S. de Moura; Berthier Ribeiro-Neto; Nivio Ziviani

Publisher: John Wiley and Sons
Year: 2005
Tongue: English
Weight: 239 KB
Volume: 57
Category: Article
ISSN: 1532-2882
DOI: 10.1002/asi.20266

✦ Synopsis

Abstract

Traditional text‐based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text‐based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text‐based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.

📜 SIMILAR VOLUMES

Web-based text classification in the absence of manually labeled training documents

✍ Chen-Ming Hung; Lee-Feng Chien 📂 Article 📅 2006 🏛 John Wiley and Sons 🌐 English ⚖ 371 KB

Most text classification techniques assume that manually labeled documents (corpora) can be easily obtained while learning text classifiers. However, labeled training documents are sometimes unavailable or inadequate even if they are available. The goal of this article is to present a s

Conceptualizing documentation on the Web: An evaluation of different heuristic-based models for counting links between university Web sites

✍ Mike Thelwall 📂 Article 📅 2002 🏛 John Wiley and Sons 🌐 English ⚖ 138 KB

Similarity Measurement Method for the Cl

✍ Yoav Smith; Gershom Zajicek; Michael Werman; Galina Pizov; Yoav Sherman 📂 Article 📅 1999 🏛 Elsevier Science 🌐 English ⚖ 290 KB

A Web-Based Secure System for the Distributed Printing of Documents and Images

✍ Ping Wah Wong; Daniel Tretter; Thomas Kite; Qian Lin; Hugh Nguyen 📂 Article 📅 1999 🏛 Elsevier Science 🌐 English ⚖ 292 KB

We propose and consider a secure printing system for the distributed printing of documents and images over the World Wide Web. The main feature of the system is that it allows previewing and printing of selected documents and images, where only a certain number of hardcopies can be generated based o

“Link rot” limits the usefulness of web-based educational materials in biochemistry and molecular biology

✍ John Markwell; David W. Brooks 📂 Article 📅 2003 🏛 The American Society for Biochemistry and Molecula 🌐 English ⚖ 99 KB

The connection between the research of a university and counts of links to its web pages: An investigation based upon a classification of the relationships of pages to the research of the host university

✍ Mike Thelwall; Gareth Harries 📂 Article 📅 2003 🏛 John Wiley and Sons 🌐 English ⚖ 100 KB