Relevance weighting for combining multi-domain data for n-gram language modeling
✍ Authors: R. Iyer; M. Ostendorf
- Publisher
- Elsevier Science
- Year
- 1999
- Language
- English
- File size
- 155 KB
- Volume
- 13
- Category
- Article
- ISSN
- 0885-2308
✦ Synopsis
Standard statistical language modeling techniques suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. In this paper, we focus on improving the estimation of domain-dependent n-gram models by the selective use of out-of-domain text data. Previous approaches to estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this work aims at differentially weighting subsets of the out-of-domain data according to their style and/or content similarity to the given task, where "style" is represented by part-of-speech statistics and "content" by the particular choice of vocabulary items. In addition to n-gram estimation, the differential weights can be used for lexicon design. Recognition experiments are based on the Switchboard corpus of spontaneous conversations, with out-of-domain text drawn from the Wall Street Journal and Broadcast News corpora. The similarity weighting approach gives a 3-5% reduction in word error rate over a domain-specific n-gram language model, providing some of the largest language modeling gains reported for the Switchboard task in recent years.
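To illustrate the general idea, here is a minimal sketch of combining n-gram counts across domains with per-subset relevance weights. The paper's actual weighting uses part-of-speech and vocabulary-based similarity measures; as a stand-in, this sketch scores each out-of-domain subset by the cosine similarity of its unigram distribution to the in-domain data. The function names and the similarity proxy are illustrative assumptions, not the authors' method.

```python
from collections import Counter
import math

def unigram_dist(tokens):
    """Relative-frequency unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse unigram distributions."""
    num = sum(pv * q.get(w, 0.0) for w, pv in p.items())
    den = (math.sqrt(sum(v * v for v in p.values()))
           * math.sqrt(sum(v * v for v in q.values())))
    return num / den if den else 0.0

def weighted_bigram_counts(in_domain, out_domain_subsets):
    """Combine bigram counts, scaling each out-of-domain subset's
    counts by its similarity to the in-domain data (a crude proxy
    for the paper's style/content relevance weights)."""
    target = unigram_dist(in_domain)
    combined = Counter({bg: float(c)
                        for bg, c in Counter(zip(in_domain, in_domain[1:])).items()})
    for subset in out_domain_subsets:
        weight = cosine_similarity(unigram_dist(subset), target)
        for bg, c in Counter(zip(subset, subset[1:])).items():
            combined[bg] += weight * c
    return combined
```

With this scheme, a subset whose vocabulary closely matches the target task contributes its counts nearly at full strength, while a dissimilar subset (sharing no vocabulary) contributes nothing; the weighted counts would then feed a standard smoothed n-gram estimator.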