Ratarmount is an excellent tool for mounting archives as filesystems, and I use it a lot, mostly for union-mounting tar.xz telemetry bundles created by sos report. The ratarmount README recommends indexed tar.xz archives created with pixz for performance, so let’s see which compression works best.
TL;DRs
- On huge archives, tar.gz is the fastest by a wide margin, no optimization required.
- Recompressing tar.xz to tar.gz gives the best performance. Recompressing with pixz also helps, but not as much as gzip.
- On tiny archives of only a few megabytes, performance is hard to predict and gzip may fall behind.
The “Backup” use-case
To get a reasonably sizable workload, I pick a 1.7 GB tar.gz, a leftover historical backup of a long-gone server:
-rw-r----- 1 root root 1.7G Aug 6 10:16 example.tar.gz
This is stored on a RAID-1 of 7200 rpm hard drives, which should amplify any seek-related performance issues. The host has 8 GB RAM and 6 physical CPU cores with 12 threads.
I prepare a list of 1000 random files from the archive that I’ll be reading from the mounted archive.
tar ztf example.tar.gz | grep -Ev '(/$|(sys|proc|dev|run))' | shuf | head -1000 > example.list
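A note on what that filter actually keeps: the pattern is unanchored, so it drops directories and also any path that merely contains sys, proc, dev or run somewhere in its name. The sketch below runs it against a made-up member listing (the paths and the list_members helper are hypothetical stand-ins for the tar output) and compares it with an anchored variant that only excludes those names as top-level directories. The anchored pattern is my assumption about the intent, not something the benchmark used.

```shell
# Hypothetical member listing, standing in for `tar ztf example.tar.gz`.
list_members() {
  printf '%s\n' \
    'etc/' \
    'etc/hostname' \
    'proc/cpuinfo' \
    'home/user/devnotes.txt' \
    'var/log/syslog' \
    'usr/bin/tar'
}

# Filter as used in the post: drops directories and every path
# containing sys, proc, dev or run anywhere in its name.
loose=$(list_members | grep -Ev '(/$|(sys|proc|dev|run))')

# Anchored variant (an assumption, not from the post): drops directories
# and only the top-level sys/, proc/, dev/ and run/ trees.
anchored=$(list_members | grep -Ev '(/$|^(\./)?(sys|proc|dev|run)/)')

printf 'loose:\n%s\n\nanchored:\n%s\n' "$loose" "$anchored"
```

For a random sample of 1000 files either filter is fine; the loose one just samples from a slightly smaller pool.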
Now, I mount the tar.gz for my baseline measurement.
umount ./mnt; ratarmount example.tar.gz ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real 0m13.974s
user 0m1.178s
sys 0m0.984s
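One caveat with the xargs -I{} invocation: xargs still interprets quotes and backslashes in its input lines, so a member name containing an apostrophe would abort the run. A minimal sketch of a NUL-delimited variant follows, using a scratch directory and made-up file names in place of the real mount point. It assumes an xargs that supports -0, which GNU and BSD xargs do.

```shell
# Scratch directory standing in for the ratarmount mount point (hypothetical).
root=$(mktemp -d)
printf 'a\n' > "$root/plain.txt"
printf 'b\n' > "$root/it's a log.txt"

# One member path per line, like example.list in the post.
list=$(mktemp)
printf '%s\n' 'plain.txt' "it's a log.txt" > "$list"

# NUL-delimit the list so xargs passes every name through literally,
# quotes and blanks included.
sums=$(tr '\n' '\0' < "$list" | xargs -0 -I{} md5sum "$root"/{})
printf '%s\n' "$sums"

# Clean up the scratch files.
rm -rf "$root" "$list"
```

Against the real mount, the same idea would read: tr '\n' '\0' < example.list | xargs -0 -I{} md5sum ./mnt/{}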
I recompress the tar from gzip to pixz and measure again:
gzip -dc example.tar.gz | pixz > example.tar.pxz
umount ./mnt; ratarmount example.tar.pxz ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real 0m57.408s
user 0m1.216s
sys 0m0.904s
About four times slower! Back to ratarmount’s README: “In contrast to bzip2 and gzip compressed files, true seeking on XZ and ZStandard files is only possible at block or frame boundaries.” Are you telling me gzip is not the issue and only naively compressed xz is? A quick recompress using vanilla xz, to compare it against the pixz-compressed archive:
gzip -dc example.tar.gz | xz --threads=$(nproc) > example.tar.xz
umount ./mnt; ratarmount example.tar.xz ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real 1m33.549s
user 0m1.109s
sys 0m0.838s
Indeed a noticeable, although not huge, penalty compared to pixz. Now that I’m here and have wasted this much time, a final measurement using bzip2:
gzip -dc example.tar.gz | bzip2 > example.tar.bz2
umount ./mnt; ratarmount example.tar.bz2 ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real 0m44.410s
user 0m1.301s
sys 0m1.164s
So ratarmount handles bzip2 somewhat faster than xz created by pixz (about 44 s versus 57 s).
Gzip is the fastest on this archive by a wide margin, even without any special treatment, and I assume this is because the multi-threaded rapidgzip literally is ratarmount’s sister project.
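To put the four wall-clock numbers side by side, here is a small sketch that divides each "real" time from the runs above by the gzip baseline. The values are copied from the measurements; nothing else is assumed.

```shell
# Wall-clock ("real") times from the backup measurements, in seconds.
# Each format is expressed as a multiple of the tar.gz baseline.
ratios=$(awk 'BEGIN {
  gz = 13.974
  printf "pixz  %.1fx\n", 57.408 / gz
  printf "xz    %.1fx\n", 93.549 / gz
  printf "bzip2 %.1fx\n", 44.410 / gz
}')
printf '%s\n' "$ratios"
# pixz  4.1x
# xz    6.7x
# bzip2 3.2x
```

These are single runs, so the factors are indicative rather than precise.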
The “Telemetry” use-case
Back to my tiny sosreport files in tar.xz format, still on the 7200-rpm HDD system. For consistency, I’ll use the same md5sum benchmark on 1000 archive members as above.
ls -lh sosreport.tar.xz
-rw------- 1 root root 9.3M Aug 14 16:07 sosreport.tar.xz
tar Jtf sosreport.tar.xz | grep -Ev '(/$|(sys|proc|dev|run))' | shuf | head -1000 > sosreport.list
umount ./mnt; ratarmount sosreport.tar.xz ./mnt
time xargs -I{} md5sum ./mnt/{} < sosreport.list
...
real 0m4.979s
user 0m0.940s
sys 0m0.625s
A conversion to pixz:
xz -dc sosreport.tar.xz | pixz > sosreport.tar.pxz
umount ./mnt; ratarmount sosreport.tar.pxz ./mnt
time xargs -I{} md5sum ./mnt/{} < sosreport.list
...
real 0m6.847s
user 0m0.829s
sys 0m0.553s
And a conversion to tar.gz:
xz -dc sosreport.tar.xz | gzip > sosreport.tar.gz
umount ./mnt; ratarmount sosreport.tar.gz ./mnt
time xargs -I{} md5sum ./mnt/{} < sosreport.list
...
real 0m13.202s
user 0m0.918s
sys 0m0.598s
gzip is suddenly the slowest here, and I believe it’s because the file turned out more than 50% larger than the xz versions. All three are tiny compared to the host’s RAM size of 8 GB:
-rw-r--r-- 1 root root 14M Aug 14 16:17 sosreport.tar.gz
-rw-r--r-- 1 root root 7.8M Aug 14 16:16 sosreport.tar.pxz
-rw------- 1 root root 9.3M Aug 14 16:07 sosreport.tar.xz
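The same arithmetic for the telemetry case, as a rough sketch: timings are relative to the original tar.xz, and size ratios are computed from the rounded ls -h values above, so they are approximate.

```shell
# "real" times (seconds) and ls -h sizes (MB) from the sosreport runs.
telemetry=$(awk 'BEGIN {
  xz_time = 4.979
  printf "time pixz/xz  %.1fx\n", 6.847 / xz_time
  printf "time gzip/xz  %.1fx\n", 13.202 / xz_time
  printf "size gzip/xz  %.1fx\n", 14 / 9.3
  printf "size gzip/pxz %.1fx\n", 14 / 7.8
}')
printf '%s\n' "$telemetry"
# time pixz/xz  1.4x
# time gzip/xz  2.7x
# size gzip/xz  1.5x
# size gzip/pxz 1.8x
```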
My initial notes on how to install ratarmount in a python virtualenv are documented in Too good to #0013.