𝔖 Bobbio Scriptorium
✦   LIBER   ✦

Text segmentation for Chinese spell checking

✍ Scribed by Lee, Kin Hong; Ng, Mau Kit Michael; Lu, Qin


Publisher
John Wiley and Sons
Year
1999
Tongue
English
Weight
202 KB
Volume
50
Category
Article
ISSN
0002-8231


✦ Synopsis


Chinese spell checking is different from its counterparts for Western languages because Chinese words in texts are not separated by spaces. Chinese spell checking in this article refers to identifying the misuse of characters in text composition. In other words, it is error correction at the word level rather than at the character level. Before Chinese sentences are spell checked, the text is segmented into semantic units. Error detection can then be carried out on the segmented text based on a thesaurus and grammar rules. Segmentation is not a trivial process because of ambiguities in the Chinese language and errors in texts. Because it is not practical to define all Chinese words in a dictionary, words that are not predefined must also be dealt with. The number of word combinations increases exponentially with the length of the sentence. In this article, a Block-of-Combinations (BOC) segmentation method based on frequency of word usage is proposed to reduce the word combinations from exponential growth to linear growth. In experiments carried out on Hong Kong newspapers, BOC correctly resolved 10% more ambiguities than the Maximum Match segmentation method. To make the segmentation more suitable for spell checking, user interaction is also suggested.
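
To make the baseline concrete, here is a minimal Python sketch of forward Maximum Match segmentation, together with a small counter that shows how quickly the number of candidate segmentations grows. It is not the authors' BOC method, which the synopsis does not describe in detail. The function names, the toy lexicon, the example sentence, and the `max_len` limit are all assumptions made for illustration.

```python
from typing import List, Set


def max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum match: at each position take the longest lexicon word."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so unknown text still segments.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def count_segmentations(text: str, lexicon: Set[str], max_len: int = 4) -> int:
    """Count every split of text into lexicon words or single characters."""
    ways = [0] * (len(text) + 1)
    ways[0] = 1  # the empty prefix has exactly one segmentation
    for end in range(1, len(text) + 1):
        for size in range(1, min(max_len, end) + 1):
            piece = text[end - size:end]
            if size == 1 or piece in lexicon:
                ways[end] += ways[end - size]
    return ways[len(text)]


# Hypothetical toy lexicon and sentence (assumptions, not data from the article).
lexicon = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"

print(max_match(sentence, lexicon))            # ['研究生', '命', '起源']
print(count_segmentations(sentence, lexicon))  # 10
```

In this toy case the greedy choice of 研究生 leaves 命 stranded, although the reading 研究 / 生命 / 起源 is also available in the lexicon. A six-character sentence with four lexicon entries already admits ten candidate segmentations, which illustrates the kind of growth and ambiguity the synopsis refers to.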


📜 SIMILAR VOLUMES


Color segmentation for text extraction
✍ Hiroyuki Hase; Masaaki Yoneda; Shogo Tokai; Jien Kato; Ching Y. Suen 📂 Article 📅 2003 🏛 Springer-Verlag 🌐 English ⚖ 873 KB
A heuristic method based on a statistica
✍ Christopher C. Yang; K. W. Li 📂 Article 📅 2005 🏛 John Wiley and Sons 🌐 English ⚖ 251 KB

The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estim

Compression techniques for Chinese text
✍ Phil Vines; Justin Zobel 📂 Article 📅 1998 🏛 John Wiley and Sons 🌐 English ⚖ 98 KB

With the growth of digital libraries and the internet, large volumes of text are available in electronic form. The majority of this text is English but other languages are increasingly well represented, including large-alphabet languages such as Chinese. It is thus attractive to compress text writte