𝔖 Bobbio Scriptorium
✦   LIBER   ✦

Lempel-Ziv compression of highly structured documents

✍ Scribed by Joaquín Adiego; Gonzalo Navarro; Pablo de la Fuente


Book ID
101655461
Publisher
John Wiley and Sons
Year
2007
Tongue
English
Weight
472 KB
Volume
58
Category
Article
ISSN
1532-2882


✦ Synopsis


Abstract

The authors describe Lempel‐Ziv to Compress Structure (LZCS), a novel Lempel–Ziv approach suitable for compressing structured documents. LZCS takes advantage of repeated substructures that may appear in the documents by replacing them with a backward reference to their previous occurrence. The result of the LZCS transformation is still a valid structured document, which is human‐readable and can be transmitted over ASCII channels. Moreover, LZCS‐transformed documents are easy to search, display, access at random, and navigate. In a second stage, the transformed documents can be further compressed using any semistatic technique, so that it is still possible to do all those operations efficiently, or with any adaptive technique to boost compression. LZCS is especially efficient in the compression of collections of highly structured data, such as extensible markup language (XML) forms, invoices, e‐commerce, and Web‐service exchange documents. The comparison with other structure‐aware and standard compressors shows that LZCS is a competitive choice for this type of document, whereas the others are not well‐suited to support navigation or random access. When joined to an adaptive compressor, LZCS obtains by far the best compression ratios.
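
As a rough illustration of the idea in the abstract (replacing a repeated substructure with a backward reference to an earlier occurrence), the following Python sketch may help. It is not the authors' LZCS encoding: the `ref` element, its `to` attribute, the numbering scheme, and the function names are all hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET


def canonical(node):
    """Hashable form of a subtree: tag, attributes, trimmed text, children."""
    return (
        node.tag,
        tuple(sorted(node.attrib.items())),
        (node.text or "").strip(),
        tuple(canonical(child) for child in node),
    )


def replace_repeats(root):
    """Replace each repeated subtree by a hypothetical <ref to="k"/> element.

    k numbers the elements that are kept, in document order (the root is 0),
    and points at the first occurrence of the repeated subtree.
    """
    first_seen = {}  # canonical subtree -> number of its first occurrence
    counter = 0      # number given to the most recently kept element

    def visit(node):
        nonlocal counter
        for index, child in enumerate(list(node)):
            key = canonical(child)
            if key in first_seen:
                # Repeated subtree: swap it for a backward reference.
                ref = ET.Element("ref", {"to": str(first_seen[key])})
                ref.tail = child.tail
                node[index] = ref
            else:
                # First occurrence: keep it, number it, and look inside it.
                counter += 1
                first_seen[key] = counter
                visit(child)

    visit(root)
    return root


# Illustrative input only; not data from the article.
doc = ET.fromstring(
    "<orders>"
    "<order><item>pen</item><qty>2</qty></order>"
    "<order><item>pen</item><qty>2</qty></order>"
    "<order><item>pen</item><qty>5</qty></order>"
    "</orders>"
)
print(ET.tostring(replace_repeats(doc), encoding="unicode"))
```

The output is still well-formed XML, which mirrors the property the abstract emphasizes. This sketch replaces every repeated element, however small; a practical scheme would presumably skip subtrees that are shorter than the reference itself.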


📜 SIMILAR VOLUMES


Application of Lempel–Ziv factorization
✍ Wojciech Rytter 📂 Article 📅 2003 🏛 Elsevier Science 🌐 English ⚖ 235 KB

We introduce a new type of context-free grammars, AVL-grammars, and show their applicability to grammar-based compression. Using this type of grammar, we present an O(n log |Σ|)-time, O(log n)-ratio approximation of the minimal grammar-based compression of a given string of length n over an alphabet Σ and O