
Automatic subject heading assignment for online government publications using a semi-supervised machine learning approach

✍ Scribed by Xiao Hu; Larry S. Jackson; Sai Deng; Jing Zhang


Publisher
Wiley (John Wiley & Sons)
Year
2006
Tongue
English
Weight
246 KB
Volume
42
Category
Article
ISSN
0044-7870


✦ Synopsis


Abstract

As the dramatic expansion of online publications continues, state libraries urgently need effective tools to organize and archive the huge number of government documents published online. Automatic text categorization techniques can be applied to classify documents approximately, given a sufficient number of labeled training examples. However, obtaining training labels is expensive because it requires substantial manual labor. We present a real‐world online government information preservation project (PEP) in the State of Illinois, together with a semi‐supervised machine learning approach: an Expectation‐Maximization (EM) algorithm‐based text classifier that automatically assigns subject headings to documents harvested in the PEP project. The EM classifier makes use of easily obtained unlabeled documents and thus reduces the demand for labeled training examples. This paper describes both the context and the procedure of this application. Experimental results are reported, and alternative approaches are also discussed.
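To make the idea of an EM-based text classifier concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which base classifier the paper uses, so this sketch assumes a multinomial naive Bayes model, a common pairing with EM for semi-supervised text classification. The function names (`train_nb`, `posterior`, `em_classifier`), the smoothing constant `alpha`, the subject headings "Finance" and "Environment", and the toy documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """M-step: estimate naive Bayes parameters from (possibly soft) labels.

    docs is a list of token lists; weights is a parallel list of
    {class: probability} dicts. alpha is an assumed smoothing constant.
    """
    prior = {c: alpha for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            p = w.get(c, 0.0)
            prior[c] += p
            for t in tokens:
                counts[c][t] += p  # fractional word counts
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_like


def posterior(tokens, model, classes):
    """E-step for one document: P(class | tokens) under the current model."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def em_classifier(labeled, unlabeled, iterations=10):
    """Train on labeled (tokens, heading) pairs plus unlabeled token lists."""
    classes = sorted({y for _, y in labeled})
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    lab_w = [{y: 1.0} for _, y in labeled]  # labeled documents keep hard labels
    model = train_nb(lab_docs, lab_w, classes, vocab)  # initialize from labels only
    for _ in range(iterations):
        unl_w = [posterior(d, model, classes) for d in unlabeled]  # E-step
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, classes, vocab)  # M-step

    def classify(tokens):
        probs = posterior(tokens, model, classes)
        return max(probs, key=probs.get)

    return classify


# Hypothetical toy data: two labeled documents and four unlabeled ones.
labeled = [
    ("tax revenue budget".split(), "Finance"),
    ("river water wildlife".split(), "Environment"),
]
unlabeled = [
    "budget audit spending".split(),
    "wildlife habitat conservation".split(),
    "audit spending tax".split(),
    "habitat river conservation".split(),
]
classify = em_classifier(labeled, unlabeled)
print(classify("audit spending".split()))
```

In this toy setup the words "audit" and "spending" never occur in a labeled document; the classifier can only associate them with a heading through the unlabeled documents in which they co-occur with labeled vocabulary, which is the sense in which unlabeled data reduces the demand for labeled examples.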