
A model for estimating the occurrence of same-frequency words and the boundary between high- and low-frequency words in texts

✍ Scribed by Sun, Qinglan; Shaw, Debora; Davis, Charles H.

Publisher: John Wiley and Sons
Year: 1999
Tongue: English
Volume: 50
Category: Article
ISSN: 0002-8231
DOI: 10.1002/(sici)1097-4571(1999)50:3<280::aid-asi11>3.0.co;2-h


✦ Synopsis

A simpler model is proposed for estimating the frequency of any same-frequency words and identifying the boundary point between high-frequency words and low-frequency words in a text. The model, based on a "maximum ranking method," assigns ranks to the words and estimates word frequency by the formula:

The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^{*} = D^{1/2}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words are dependent only on the vocabulary of a text (the number of different words) but not on its length. Like Zipf's Law, the model may be universally applicable.
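
As a rough illustration of the square-root boundary described above, the following Python sketch counts the different words D in a text and computes the boundary as the square root of D. It is not the authors' implementation. The function name `frequency_boundary`, the simple regex tokenizer (which suits space-delimited English only), and the choice to treat words occurring more often than the boundary as high-frequency are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def frequency_boundary(text):
    """Return (D, boundary, high, low) for a text.

    D is the number of different words; boundary is sqrt(D).
    Assumption: words whose count exceeds the boundary are treated
    as high-frequency, all others as low-frequency.
    """
    # Assumption: a naive tokenizer for space-delimited English text.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)

    d = len(counts)            # number of different words (vocabulary size)
    boundary = math.sqrt(d)    # boundary value, the square root of D

    high = {w for w, c in counts.items() if c > boundary}
    low = set(counts) - high
    return d, boundary, high, low


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the bird"
    d, boundary, high, low = frequency_boundary(sample)
    print("different words:", d)
    print("boundary:", round(boundary, 3))
    print("high-frequency:", sorted(high))
    print("low-frequency:", sorted(low))
```

Because the boundary depends only on D, appending more occurrences of words already present leaves it unchanged, which mirrors the abstract's point that the result depends on vocabulary rather than text length.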