Compression technologies: benchmarked

Hello! Today I will be benchmarking several popular compression algorithms and sharing the test conditions and results. The algorithms under examination are GZip, BZip2, LZMA/2, LZ4, and LZO.

These files were selected:

Linux source tarball

File: linux-3.16.57.tar
Size: 579,962,880 bytes (~553.1 MiB)
Origin: https://www.kernel.org/
SHA2-256: a6733aed49b13f8dbf2bf80caf978613f9aaf5a63ff6d1d50d7c3beb968e9be8

The Story of Satoshi Tajiri

File: satoshi-tajiri.webm
Size: 49,678,811 bytes (~47.4 MiB)
Origin: https://en.wikipedia.org/wiki/File:The_Story_of_Satoshi_Tajiri.webm
SHA2-256: ab72294d339ba670bae7337d3cc7b3b80c6f20a2b321b315d9596735de18e687
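
Anyone repeating the tests may want to confirm that their downloads match the SHA2-256 digests listed above before running anything. The following is a minimal sketch of one way to do that in Python; the function names are hypothetical and are not part of the benchmark scripts.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large inputs (such as the
    ~553 MiB kernel tarball) never have to fit in memory at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path, expected_hex):
    """Compare a file's digest with a listed one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_listed_digest("linux-3.16.57.tar", "a6733aed49b13f8dbf2bf80caf978613f9aaf5a63ff6d1d50d7c3beb968e9be8") should return True for an intact copy of the tarball.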

These machines were used:

bahariya

Model: Online.net Dedibox LT 2017
CPU: Intel Xeon E3-1240 v6 @ 3.7GHz (4.1GHz), quad-core HT CPU, 72W TDP
System: CentOS 7 (kernel 3.10.0-693.17.1.el7.x86_64)
RAM: 32GiB DDR4 ECC

These program versions were fetched and compiled for use on the above systems:

BZip2

Version: 1.0.6
File: bzip2-1.0.6.tar
Size: 2,590,720 bytes (~2.47 MiB)
SHA2-256: 351c652cb503ef907b77c7b33aaa2731742747ccbe4335f56d2187780794154c

GZip

Version: 1.9
File: gzip-1.9.tar
Size: 5,253,120 bytes (~5.01 MiB)
SHA2-256: baf4c156272460703eae41d071ce10ee112bb3424105cab5f5438e41d7d997db

LZ4

Version: 1.8.2
File: lz4-1.8.2.tar
Size: 1,361,920 bytes (~1.30 MiB)
SHA2-256: 9a992ac50c9612d12c6b8e68d877fc791af6f6ce369ad300bdd02014c9971cb9

LZMA/2 (using XZ)

Version: 5.2.4
File: xz-5.2.4.tar
Size: 5,654,480 bytes (~5.39 MiB)
SHA2-256: 7f77d67aec8207e4fef28c58f19919e51ef469621a58eafd13bf1f80ce956312

LZO (algorithm)

Version: 2.10
File: lzo-2.10.tar
Size:
SHA2-256: ef02813312a1c9de5bee3d8aad7a5af40515f4e2d3a78b58cb45c2c6f94f7a33

LZOP (frontend)

Version: 1.04
File: lzop-1.04.tar
Size: 3,328,000 bytes (~3.17 MiB)
SHA2-256: 2c336178760c20e74457542397bf84c3299f634b1143904fe621d3b6c162e6a1

These were the settings used to compile all of the above programs:

CC=clang
CFLAGS+=-O2 -pipe
All symbols stripped (strip -s)

Results

For testing, two shell scripts were written to track and record the performance of the algorithms; both are available as a Gist. The first, compressbench.sh, does most of the heavy lifting. The second, iterbench.sh, wraps the first and runs it 6 times, keeping the results of each run. This makes it possible to average the output for algorithms whose results sit within the margin of error at certain (usually lighter) compression levels.
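
The idea of repeating each run and averaging can be sketched roughly as follows. This is not a port of compressbench.sh or iterbench.sh: it uses Python's standard-library gzip, bz2 and lzma modules as stand-ins for the compiled command-line tools, works in memory instead of on files, and all function names are hypothetical. Numbers it produces are therefore not comparable to the results here.

```python
import bz2
import gzip
import lzma
import statistics
import time

# Assumption: stdlib modules stand in for the command-line programs.
# They implement the same families of algorithms but are not the
# binaries that were benchmarked.
COMPRESSORS = {
    "gzip": lambda data, level: gzip.compress(data, compresslevel=level),
    "bzip2": lambda data, level: bz2.compress(data, compresslevel=level),
    "xz": lambda data, level: lzma.compress(data, preset=level),
}


def bench_once(algo, data, level):
    """Time one compression run; return (seconds, compressed size in bytes)."""
    start = time.perf_counter()
    packed = COMPRESSORS[algo](data, level)
    return time.perf_counter() - start, len(packed)


def bench_iterated(algo, data, level, iterations=6):
    """Repeat the run and average the timings to smooth out noise."""
    runs = [bench_once(algo, data, level) for _ in range(iterations)]
    times = [seconds for seconds, _ in runs]
    return {
        "algo": algo,
        "level": level,
        "ctime_mean": statistics.mean(times),
        "ctime_stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
        # The output size does not change between runs of the same input and level.
        "size": runs[0][1],
    }
```

Reporting the standard deviation next to the mean makes it easier to see when two compression levels differ by less than the run-to-run noise.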

The raw CSV data for these results can be viewed here. The algo column lists the program in use; xze is the same as xz except with the -e flag passed as well. ctime and dtime (compression and decompression time) are measured in seconds. cmem and dmem (compression and decompression memory) are measured in kibibytes (multiples of 1024 bytes). size is measured in bytes and can be compared to the original file sizes shown above.
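
As a rough illustration of how the CSV could be digested, the sketch below averages repeated runs and derives a compression ratio from the size column. The column names algo, ctime, dtime, cmem, dmem and size come from the description above; the level column used for grouping is an assumption about the file's header, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize(csv_text, original_size, group_keys=("algo", "level")):
    """Average repeated runs and derive a compression ratio per group.

    `level` is an assumed column name; change `group_keys` to match
    the real CSV header.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[tuple(row[key] for key in group_keys)].append(row)

    summary = {}
    for key, rows in groups.items():
        size = mean(int(row["size"]) for row in rows)
        summary[key] = {
            "ctime_s": mean(float(row["ctime"]) for row in rows),
            "dtime_s": mean(float(row["dtime"]) for row in rows),
            # cmem and dmem are recorded in KiB; divide by 1024 for MiB.
            "cmem_mib": mean(float(row["cmem"]) for row in rows) / 1024,
            "dmem_mib": mean(float(row["dmem"]) for row in rows) / 1024,
            # Original size over compressed size: higher means smaller output.
            "ratio": original_size / size,
        }
    return summary
```

Passing 579962880 as original_size would give ratios relative to the kernel tarball, and 49678811 relative to the video.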

Conclusion

The results here are pretty interesting. Here are some quick facts:

From what the data shows, LZMA/2 is still the reigning champion for raw compression performance. I can understand why BZip2 has fallen out of favour, judging from its resource usage and its inferiority to LZMA/2. GZip, meanwhile, has stuck around, if not for legacy reasons, then for its speed and consistency.

LZO and LZ4 are very interesting for how much compression they give up in favour of speed. They are fast enough that they may easily be I/O limited, even on solid-state media. That makes them a lovely space-saving proposition: when the disk rather than the CPU is the bottleneck, compression costs no extra time, and because less data has to cross the I/O path, the more tightly the compressor packed the data, the more time is saved. In systems that use large amounts of text or other assets, using one of them would be a no-brainer.
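
The I/O-limited argument can be made concrete with a very simple model. The sketch below assumes that reading from disk and decompressing overlap perfectly, so the slower of the two stages sets the total time; the function names are hypothetical, and any throughput figures plugged in are the reader's own, not measurements from this benchmark.

```python
def read_time_raw(size_bytes, disk_bps):
    """Seconds to read an uncompressed file at `disk_bps` bytes per second."""
    return size_bytes / disk_bps


def read_time_compressed(size_bytes, ratio, disk_bps, decompress_bps):
    """Seconds to read and decompress, assuming the two stages fully overlap.

    `ratio` is original size divided by compressed size.
    `decompress_bps` is measured in bytes of *output* per second.
    """
    io_time = (size_bytes / ratio) / disk_bps  # only the compressed bytes hit the disk
    cpu_time = size_bytes / decompress_bps     # the decompressor must emit every byte
    return max(io_time, cpu_time)


def is_io_limited(ratio, disk_bps, decompress_bps):
    """True when the disk, not the decompressor, sets the total time."""
    return decompress_bps >= ratio * disk_bps
```

Under this model, whenever is_io_limited holds, the compressed read takes read_time_raw divided by ratio, which is why a tighter ratio saves more time for as long as the decompressor keeps up with the disk.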