
Compression techniques for Chinese text

Scribed by Phil Vines; Justin Zobel


Publisher: John Wiley and Sons
Year: 1998
Tongue: English
Weight: 98 KB
Volume: 28
Category: Article
ISSN: 0038-0644


Synopsis


With the growth of digital libraries and the internet, large volumes of text are available in electronic form. The majority of this text is English but other languages are increasingly well represented, including large-alphabet languages such as Chinese. It is thus attractive to compress text written in the large-alphabet languages, but the general-purpose compression utilities are not particularly effective for this application. In this paper we survey proposals for compressing Chinese text, then examine in detail the application to Chinese text of the partial predictive matching compression technique (PPM). We propose several refinements to PPM to make it more effective for Chinese text, and, on our publicly available test corpus of around 50 Mb of Chinese text documents, show that these refinements can significantly improve compression performance while using only a limited volume of memory.


SIMILAR VOLUMES


Text segmentation for Chinese spell chec
Lee, Kin Hong; Ng, Mau Kit Michael; Lu, Qin | Article | 1999 | John Wiley and Sons | English | 202 KB

Chinese spell checking is different from its counterparts for Western languages because Chinese words in texts are not separated by spaces. Chinese spell checking in this article refers to how to identify the misuse of characters in text composition. In other words, it is error correction at the wor

A study on word-based and integral-bit C
Cheng, Kwok-Shing; Young, Gilbert H.; Wong, Kam-Fai | Article | 1999 | John Wiley and Sons | English | 189 KB

Experimental results show that a word-based arithmetic coding scheme can achieve a higher compression performance for Chinese text. However, an arithmetic coding scheme is a fractional-bit compression algorithm which is known to be time consuming. In this article, we change the direction to study ho

Compression techniques for fast external
John Yiannis; Justin Zobel | Article | 2006 | Springer-Verlag | English | 444 KB

External sorting of large files of records involves use of disk space to store temporary files, processing time for sorting, and transfer time between CPU, cache, memory, and disk. Compression can reduce disk and transfer costs, and, in the case of external sorts, cut merge costs by reducing the num