An empirical study of smoothing techniques for language modeling
Written by Stanley F. Chen; Joshua Goodman
- Publisher
- Elsevier Science
- Year
- 1999
- Language
- English
- File size
- 641 KB
- Volume
- 13
- Category
- Article
- ISSN
- 0885-2308
Synopsis
We survey the most widely used algorithms for smoothing models for language n-gram modeling. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980); Katz (1987); Bell, Cleary and Witten (1990); Ney, Essen and Kneser (1994), and Kneser and Ney (1995). We investigate how factors such as training data size, training corpus (e.g. Brown vs. Wall Street Journal), count cutoffs, and n-gram order (bigram vs. trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. We find that these factors can significantly affect the relative performance of models, with the most significant factor being training data size. Since no previous comparisons have examined these factors systematically, this is the first thorough characterization of the relative performance of various algorithms. In addition, we introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, results showing that improved language model smoothing leads to improved speech recognition performance are presented.
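The synopsis says that methods are compared by the cross-entropy of test data. As a rough illustration of that measurement only, the following Python sketch scores a bigram model that linearly interpolates a maximum-likelihood bigram estimate with an add-one unigram estimate, loosely in the spirit of Jelinek-Mercer interpolation. It is not the authors' implementation: the function names (`train`, `prob`, `cross_entropy`), the fixed weight `lam`, and the add-one unigram fallback are all assumptions made for this example, and `vocab_size` is assumed to cover every word type that can occur.

```python
import math
from collections import Counter


def train(tokens):
    """Collect the counts needed for an interpolated bigram model."""
    return {
        "uni": Counter(tokens),                     # word counts
        "hist": Counter(tokens[:-1]),               # counts of words used as a bigram history
        "bi": Counter(zip(tokens, tokens[1:])),     # bigram counts
        "n": len(tokens),                           # total training tokens
    }


def prob(model, prev, word, vocab_size, lam=0.7):
    """P(word | prev) as a fixed-weight mix of bigram and unigram estimates.

    `lam` is a hypothetical fixed interpolation weight; in practice such
    weights are usually tuned on held-out data.
    """
    # Add-one unigram estimate, so unseen words keep nonzero probability.
    p_uni = (model["uni"][word] + 1) / (model["n"] + vocab_size)
    history_count = model["hist"][prev]
    if history_count == 0:
        # Unseen history: fall back entirely to the unigram estimate.
        return p_uni
    p_ml = model["bi"][(prev, word)] / history_count
    return lam * p_ml + (1 - lam) * p_uni


def cross_entropy(model, test_tokens, vocab_size, lam=0.7):
    """Average negative log2 probability per predicted test token (bits/word)."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_sum = sum(math.log2(prob(model, p, w, vocab_size, lam)) for p, w in pairs)
    return -log_sum / len(pairs)


if __name__ == "__main__":
    train_tokens = "the cat sat on the mat".split()
    test_tokens = "the cat sat".split()
    model = train(train_tokens)
    vocab_size = len(set(train_tokens))
    print(cross_entropy(model, test_tokens, vocab_size))
```

Lower cross-entropy means the model assigns higher probability to the test data, which is the sense in which one smoothing method can be said to outperform another on a given test set.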