Automatic subject heading assignment for online government publications using a semi-supervised machine learning approach
✍ Authors: Xiao Hu; Larry S. Jackson; Sai Deng; Jing Zhang
- Publisher
- Wiley (John Wiley & Sons)
- Year
- 2006
- Language
- English
- File size
- 246 KB
- Volume
- 42
- Category
- Article
- ISSN
- 0044-7870
✦ Synopsis
Abstract
As online publication continues to expand dramatically, state libraries urgently need effective tools to organize and archive the large number of government documents published online. Automatic text categorization techniques can classify documents approximately, given a sufficient number of labeled training examples. However, obtaining training labels is expensive because it requires substantial manual labor. We present a real-world online government information preservation project (PEP) in the State of Illinois, together with a semi-supervised machine learning approach: an Expectation-Maximization (EM) algorithm-based text classifier that automatically assigns subject headings to documents harvested in the PEP project. The EM classifier makes use of easily obtained unlabeled documents and thus reduces the demand for labeled training examples. This paper describes both the context and the procedure of this application. Experimental results are reported, and alternative approaches are discussed.
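The abstract does not give the classifier's details, so the following is only a minimal sketch of the general idea of EM-based semi-supervised text classification. It assumes a multinomial Naive Bayes base model with Laplace smoothing, which is a common pairing with EM but is not confirmed by the abstract. The function names, the toy documents, and the two subject headings ("Finance", "Education") are all hypothetical and are not taken from the PEP project.

```python
import math
from collections import Counter

# Illustrative sketch only: Naive Bayes as the base model is an ASSUMPTION,
# and all names and data below are hypothetical.

def train_nb(docs, weights, classes, vocab):
    """M-step: estimate Laplace-smoothed Naive Bayes parameters.
    weights[i][c] is the (possibly fractional) membership of doc i in class c."""
    prior, cond = {}, {}
    for c in classes:
        total_w = sum(w[c] for w in weights)
        prior[c] = (total_w + 1.0) / (len(docs) + len(classes))
        counts = Counter()
        for doc, w in zip(docs, weights):
            for tok in doc:
                counts[tok] += w[c]
        denom = sum(counts.values()) + len(vocab)
        cond[c] = {t: (counts[t] + 1.0) / denom for t in vocab}
    return prior, cond

def posterior(doc, prior, cond, classes):
    """E-step for one document: P(class | doc), computed in log space."""
    logp = {}
    for c in classes:
        # Tokens outside the training vocabulary are skipped.
        logp[c] = math.log(prior[c]) + sum(
            math.log(cond[c][t]) for t in doc if t in cond[c])
    m = max(logp.values())
    exp = {c: math.exp(lp - m) for c, lp in logp.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_classifier(labeled, unlabeled, classes, iterations=10):
    """Train on labeled docs, then alternate E- and M-steps over all docs."""
    vocab = {t for doc, _ in labeled for t in doc}
    vocab |= {t for doc in unlabeled for t in doc}
    l_docs = [doc for doc, _ in labeled]
    # Labeled documents keep hard (0/1) class memberships throughout.
    l_weights = [{c: 1.0 if c == y else 0.0 for c in classes}
                 for _, y in labeled]
    prior, cond = train_nb(l_docs, l_weights, classes, vocab)
    for _ in range(iterations):
        # E-step: soft-label the unlabeled documents with the current model.
        u_weights = [posterior(d, prior, cond, classes) for d in unlabeled]
        # M-step: re-estimate parameters from labeled plus soft-labeled docs.
        prior, cond = train_nb(l_docs + unlabeled, l_weights + u_weights,
                               classes, vocab)
    return prior, cond

def predict(doc, prior, cond, classes):
    """Return the most probable class for a tokenized document."""
    post = posterior(doc, prior, cond, classes)
    return max(classes, key=lambda c: post[c])

# Hypothetical toy data: two labeled documents and four unlabeled ones.
classes = ["Finance", "Education"]
labeled = [(["tax", "revenue", "budget"], "Finance"),
           (["school", "teacher", "student"], "Education")]
unlabeled = [["budget", "audit", "revenue"],
             ["audit", "tax", "fiscal"],
             ["student", "curriculum", "school"],
             ["curriculum", "teacher", "classroom"]]

prior, cond = em_classifier(labeled, unlabeled, classes)
print(predict(["audit", "fiscal"], prior, cond, classes))
print(predict(["curriculum", "classroom"], prior, cond, classes))
```

In this toy setup, words such as "audit" and "curriculum" never appear in a labeled document. The model can associate them with a heading only because they co-occur with labeled vocabulary in the unlabeled documents, which is the mechanism by which unlabeled data reduces the need for labeled examples.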