Journal "Computational Technologies"

Article information

2015 , Volume 20, № 3, p.75-98

Hmelnov A.E.

A lossless compression algorithm for integer differences sequences by optimization of their division into intervals of constant bit depth values

Purpose. A principle for construction of algorithms for fast lossless compression of integer data values, which are distributed mainly near zero, is considered. Such data come from, e.g., differential encoding of integer sequences, which represent gradually varying quantities (the quantities, which take similar values in neighboring points). According to the degree of compression for the given kind of data, the algorithms considered are of superior performance compared to ZLib [1] in its strongest compression mode Z_BEST_COMPRESSION, but they take considerably less time for both compression and unpacking, since the proposed compression algorithm is characterized by a linear computational complexity in terms of the processed data size.

Design/methodology/approach. To find the best (most compact) representation of an integer sequence by series of intervals for values of constant bit depth we use the dynamic programming approach. The dynamic programming scheme used is of the VSEncoding algorithm [4] type. A concrete compression algorithm is defined by specifying a particular method for the interval header representation.

Findings. The VSEopt algorithm - an optimized version of VSEncoding algorithm is proposed, which allows to find the true minimum of compressed size without limiting the depth of the interval length search by the max K parameter. A theoretical justification of the VSEopt algorithm is presented. The performed tests have demonstrated that VSEopt works as fast as VSEncoding with impractically small max K=16, but even with max K=1024 (and a lot of time spent) the VSEncoding algorithm can`t achieve the optimal compression found by VSEopt. Another theoretical inference eliminates the use of the VSEopt compression in order to employ the buffer of limited size without losing the capability of finding the true optimal solution. The performed tests show that the algorithm still be capable to find the best solution even when the buffer may be as small as 2000 items (when compressing much larger blocks of 160000 values). Some particular methods of interval header encoding were considered and tested. The best results were obtained for encodings, which use the static Huffman codes obtained with the help of statistics for interval depths and lengths distribution. For the test performed, the process of finding the header encoding which matches the statistics resulting from its usage have converged to the smallest of all the tests executed compressed data size.

Originality/value. The VSEncoding algorithm is the well known tool for compression of document indexes. The optimized version of the algorithm -- VSEopt is a new result. Our formulation and solution addresses the problem how to use a buffer of limited capacity but still find the best compression. The methods used for the algorithms of compression of differential integer sequences are also new and extend the field of the algorithm usage.

[full text]
Keywords: lossless compression, specialized data compression algorithm, dynamic programming

Author(s):
Hmelnov Alexey Evgenievich
PhD. , Associate Professor
Position: Head of Laboratory
Office: Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences
Address: 664033, Russia, Irkutsk, 134 Lermontov str.
Phone Office: (3952) 45-30-71
E-mail: hmelnov@icc.ru
SPIN-code: 8041-3667

References:

[1] Gailly, J.-L., Adler, M. ZLib home site, available at: http://www.zlib.net.

[2] D. Vatolin, A. Ratushnyak, M. Smirnov, V. Yukin. Metody szhatiya dannykh. Ustroystvo arkhivatorov, szhatie izobrazheniy i video [Data compression methods. Construction of archivers, image and video compression]. Moscow: DIALOG-MIFI; 2003: 384. (In Russ.)

[3] Hmelnov, A.E. The MRG file format for compact representation and fast decompression of large digital elevation models. Computional Technologies. 2015; 20(1):63–74. (In Russ.)

[4] Silvestri, F., Venturini, R. VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming. Proceeding CIKM ’10 Proceedings of the 19th ACM International Conference on Information and Knowledge Management. USA: New York; 2010:1219–1228.

[5] A. R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, Wavelet transforms that map integers to integers. Technical report, Department of Mathematics. USA: Princeton University; 1996, available at: ftp://ftp.ldv.ei.tum.de/pub/lossless/wavelet_transf_integers.pdf

[6] Vatolin, D. All about data, image and video compression, available at: http://www.compression.ru. (In Russ.)

[7] Salomon, D. A Guide to data compression methods. New York: Springer-Verlag; 2002.

[8] Lemire, D., Boytsov, L. Decoding billions of integers per second through vectorization. Software: Practice and Experience. 2015; 45(1):1–29.

[9] Deveaux, J.-P., Rau-Chaplin, A., Zeh, N. Adaptive tuple differential coding. Database and Expert Systems Applications. Lecture Notes in Computer Science. 2007; (4653):109–119, available at: http://dx.doi.org/10.1007/978-3-540-74469-6_12

[10] Klimenko, S., Mitselmakher, G., Sazonov, A. Losless compression of LIGO data. Technical Note LIGO-T000076- 00- D, MIT 07.08.2000, available at: http://www.phys.ufl.edu/~klimenko/wavelets/lossless.pdf.

[11] Dubinin, M. The SRTM data description and sources, available at: http://gislab.info/qa/srtm.html. (In Russ.)

Bibliography link:
Hmelnov A.E. A lossless compression algorithm for integer differences sequences by optimization of their division into intervals of constant bit depth values // Computational technologies. 2015. V. 20. № 3. P. 75-98