Tag Archives: python

Benchmarking ratarmount

Ratarmount is an excellent tool for mounting archives as filesystems, and I use it a lot, mostly for union-mounting tar.xz telemetry bundles created by sos report. The ratarmount README suggests preferring indexed tar.xz archives created with pixz for performance, so let’s see which compression works best.


TL;DRs

  • On huge archives, tar.gz is consistently the fastest, no optimization required.
  • For best performance, recompress tar.xz to tar.gz. Recompressing with pixz also yields an improvement, but not as much as gzip.
  • On tiny archives of a few megabytes, performance is hard to predict and gzip may fall behind.

The “Backup” use-case

To give my test a reasonably sized workload, I pick a tar.gz that is a historical backup left over from a long-gone server:

-rw-r----- 1 root root 1.7G Aug  6 10:16 example.tar.gz

This is stored on a RAID-1 of 7200 rpm hard drives, which should amplify all seek performance issues. The host has 8 GB RAM and 6 physical CPU cores (12 threads).

I prepare a list of 1000 random files from the archive, which I’ll then read through the mount.

tar ztf example.tar.gz | grep -Ev '(/$|(sys|proc|dev|run))' | shuf | head -1000 > example.list
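
The same selection can be sketched in Python with the standard library's tarfile module. This is only an illustration of what the pipeline does, not part of the benchmark; the function name `sample_members` and its parameters are made up for this example. Like the grep pattern, it drops directories and any path containing sys, proc, dev or run.

```python
import random
import re
import tarfile

# Same exclusion as the grep pattern: any path containing one of these strings.
EXCLUDE = re.compile(r"sys|proc|dev|run")

def sample_members(archive_path, count=1000, seed=None):
    """Return up to `count` randomly chosen non-directory member names.

    Hypothetical helper for illustration; compression is auto-detected.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        names = [
            member.name
            for member in tar
            if not member.isdir() and not EXCLUDE.search(member.name)
        ]
    rng = random.Random(seed)  # pass a seed to get a reproducible list
    rng.shuffle(names)
    return names[:count]
```

Writing the returned names one per line to example.list gives a file that can be used the same way as the one produced by the shell pipeline.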

Now, I mount the tar.gz for my baseline measurement.

umount ./mnt; ratarmount example.tar.gz ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real    0m13.974s
user    0m1.178s
sys     0m0.984s
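
For comparison, here is a rough Python sketch of the same measurement: read every listed file through the mount point, hash it, and report the wall-clock time. The name `time_md5_reads` is hypothetical. Unlike the xargs call it does not spawn one md5sum process per file and it reads each file fully into memory, so the absolute numbers will not match the ones in this post.

```python
import hashlib
import time
from pathlib import Path

def time_md5_reads(mount_point, member_names):
    """Hash each listed file below mount_point; return (seconds, digests)."""
    digests = {}
    start = time.perf_counter()
    for name in member_names:
        # Strip leading slashes so absolute member names stay below the mount.
        path = Path(mount_point) / name.lstrip("/")
        digests[name] = hashlib.md5(path.read_bytes()).hexdigest()
    elapsed = time.perf_counter() - start
    return elapsed, digests
```

Called as `time_md5_reads("./mnt", open("example.list").read().splitlines())`, the first element of the result corresponds roughly to the "real" line of `time`.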

I recompress the tar from gzip to pixz and measure again:

gzip -dc example.tar.gz | pixz > example.tar.pxz
umount ./mnt; ratarmount example.tar.pxz ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real    0m57.408s
user    0m1.216s
sys     0m0.904s

Multiple times slower! Back to ratarmount’s README: “In contrast to bzip2 and gzip compressed files, true seeking on XZ and ZStandard files is only possible at block or frame boundaries.” Are you telling me gzip is not the issue and only naively compressed xz is? A quick recompress using vanilla xz, to compare against the pixz-compressed archive:

gzip -dc example.tar.gz | xz --threads=$(nproc) > example.tar.xz
umount ./mnt; ratarmount example.tar.xz ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real    1m33.549s
user    0m1.109s
sys     0m0.838s

Indeed a noticeable, although not huge, penalty compared to pixz. Now that I’m here and have wasted this much time, a final measurement using bzip2:

gzip -dc example.tar.gz | bzip2 > example.tar.bz2
umount ./mnt; ratarmount example.tar.bz2 ./mnt
time xargs -I{} md5sum ./mnt/{} < example.list
...
real    0m44.410s
user    0m1.301s
sys     0m1.164s

So ratarmount handles bzip2 at around the same speed as xz created by pixz, if anything slightly faster.

Gzip is always fastest, even without any special treatment, and I assume this is because multi-threaded rapidgzip literally is ratarmount’s sister project.
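
To put the four runs side by side, the "real" times measured above can be turned into slowdown factors relative to the tar.gz baseline. The helper name `slowdown_vs` is made up for this sketch; the numbers are copied from the runs in this post.

```python
# Wall-clock ("real") times from the runs above, in seconds.
real_seconds = {
    "tar.gz": 13.974,
    "tar.pxz (pixz)": 57.408,
    "tar.xz (xz)": 93.549,
    "tar.bz2": 44.410,
}

def slowdown_vs(baseline, timings):
    """Return each timing divided by the baseline timing, rounded to 2 digits."""
    base = timings[baseline]
    return {name: round(seconds / base, 2) for name, seconds in timings.items()}

for name, factor in slowdown_vs("tar.gz", real_seconds).items():
    print(f"{name:16} {factor:5.2f}x")
```

This works out to roughly 4.1x for pixz, 6.7x for plain xz and 3.2x for bzip2, each relative to gzip.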


The “Telemetry” use-case

Back to my tiny sosreport files in tar.xz format, still on the 7200-rpm HDD system. For consistency, I’ll use the same md5sum benchmark on 1000 archive members as above.

ls -lh sosreport.tar.xz
-rw------- 1 root root 9.3M Aug 14 16:07 sosreport.tar.xz
tar Jtf sosreport.tar.xz | grep -Ev '(/$|(sys|proc|dev|run))' | shuf | head -1000 > sosreport.list
umount ./mnt; ratarmount sosreport.tar.xz ./mnt
time xargs -I{} md5sum ./mnt/{} < sosreport.list
...
real    0m4.979s
user    0m0.940s
sys     0m0.625s

A conversion to pixz:

xz -dc sosreport.tar.xz | pixz > sosreport.tar.pxz
umount ./mnt; ratarmount sosreport.tar.pxz ./mnt
time xargs -I{} md5sum ./mnt/{} < sosreport.list
...
real    0m6.847s
user    0m0.829s
sys     0m0.553s

And a conversion to tar.gz:

xz -dc sosreport.tar.xz | gzip > sosreport.tar.gz
umount ./mnt; ratarmount sosreport.tar.gz ./mnt
time xargs -I{} md5sum ./mnt/{} < sosreport.list
...
real    0m13.202s
user    0m0.918s
sys     0m0.598s

gzip is suddenly slower here, and I believe it’s because the file turned out more than 50% larger than the xz versions. Note that all three are only a few megabytes, far below the host’s RAM size of 8 GB:

-rw-r--r-- 1 root root  14M Aug 14 16:17 sosreport.tar.gz
-rw-r--r-- 1 root root 7.8M Aug 14 16:16 sosreport.tar.pxz
-rw------- 1 root root 9.3M Aug 14 16:07 sosreport.tar.xz

My initial notes on how to install ratarmount in a python virtualenv are documented in Too good to #0013.

Too good to #0008

rinetd-style circuit level gateway in systemd

This accepts port 465/tcp and forwards all connections to a service running somewhere else on 1194/tcp.

The socket unit accepts the connection on port 465:

# /etc/systemd/system/tcp465-to-tcp1194.socket
[Unit]
Description="openvpn 465/tcp to 1194/tcp (socket)"

[Socket]
ListenStream=465

[Install]
WantedBy=sockets.target

systemd-socket-proxyd connects to the backend:

# /etc/systemd/system/tcp465-to-tcp1194.service
[Unit]
Description="openvpn 465/tcp to 1194/tcp (service)"

[Service]
ExecStart=/lib/systemd/systemd-socket-proxyd 10.12.13.14:1194
User=proxy

(Anyone old enough to remember that this was called a plug-gateway in the TIS Firewall Toolkit?)
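
To illustrate what "circuit level" means here, the following is a minimal asyncio sketch of such a gateway: it accepts a TCP connection, opens a second one to the backend, and copies bytes in both directions without looking at them. It is not how systemd-socket-proxyd is implemented and does no socket activation; the names `pipe` and `start_gateway` are made up for this example.

```python
import asyncio

async def pipe(reader, writer):
    """Copy bytes one way until EOF, then close the receiving side."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # peer went away mid-transfer; nothing left to relay
    finally:
        writer.close()

async def start_gateway(listen_host, listen_port, backend_host, backend_port):
    """Listen on one address and relay every connection to the backend."""
    async def handle(client_reader, client_writer):
        try:
            backend_reader, backend_writer = await asyncio.open_connection(
                backend_host, backend_port)
        except OSError:
            client_writer.close()  # backend unreachable: drop the client
            return
        # Relay both directions concurrently; the payload is never inspected.
        await asyncio.gather(
            pipe(client_reader, backend_writer),
            pipe(backend_reader, client_writer),
        )
    return await asyncio.start_server(handle, listen_host, listen_port)
```

With the addresses from the unit files, this would be started as `start_gateway("0.0.0.0", 465, "10.12.13.14", 1194)` followed by `serve_forever()` on the returned server. Binding to port 465 needs privileges, which is one reason the systemd variant is attractive: systemd opens the socket and the proxy itself runs as an unprivileged user.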


Python pip/virtualenv/pipenv micro-HOWTO

Clone project with wacky dependencies:

git clone https://github.com/example/project.git

Install dependencies (from requirements.txt):

pipenv install (-r requirements.txt)

Run:

pipenv run ./script

Template for git commit message

Create the template, I prefer it outside the repository:

(blank line)
(blank line)
foo#1234 is the neverending story I'm constantly working on

Configure the path, relative to the repository root:

git config commit.template ../commit-template-for-foo.txt

Too good to #0004

systemd, the good parts: monotonic timers

When systemd makes you suffer because “run job every 10 minutes” is infinitely harder to specify than in crontab, remember there are monotonic timers in systemd that aren’t derived from wallclock time, or “Calendar Events” as they call it.

Run the timer once 60 seconds after system startup, then again 10 minutes after each run of the job has finished:

# /etc/systemd/system/demo.timer
[Unit]
Description=demo monotonic timer

[Timer]
OnStartupSec=60
OnUnitInactiveSec=600

[Install]
# When in a system session:
WantedBy=timers.target
# When in a user session (~/.config/systemd, systemctl --user etc.):
# WantedBy=default.target
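
The difference from a wallclock schedule is easiest to see with numbers. The sketch below computes when the job would start, in seconds after startup, given how long each run takes. The function name `monotonic_schedule` is made up, and the model is simplified: it ignores systemd's timer accuracy window (AccuracySec) and assumes the unit becomes inactive exactly when the job ends.

```python
def monotonic_schedule(durations, on_startup_sec=60, on_unit_inactive_sec=600):
    """Return start times (seconds after startup) for consecutive job runs.

    durations: how long each run takes, in seconds (hypothetical input).
    """
    starts = []
    next_start = on_startup_sec
    for duration in durations:
        starts.append(next_start)
        # The next run is scheduled relative to the end of this one.
        next_start = next_start + duration + on_unit_inactive_sec
    return starts

print(monotonic_schedule([30, 120, 5]))
```

For runs taking 30, 120 and 5 seconds, the starts land at 60, 690 and 1410 seconds. The interval stretches with the job's runtime, so runs never overlap, which a fixed "every 10 minutes" calendar schedule does not guarantee by itself.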

Python: Use tabulate to format output in columns

#!/usr/bin/env python3
from tabulate import tabulate

data = [
	[ 'foo', 'bar', 'baz' ],
	[ 'spam', 'eggs', 'bacon' ]
]

headers = ['Eine', '2', 'Whatever']

print(tabulate(data, headers=headers, tablefmt='simple'))

Output:

$ ./tab.py 
Eine    2     Whatever
------  ----  ----------
foo     bar   baz
spam    eggs  bacon

virsh/libvirt, automate key presses

(* Updated to sleep 5 seconds after each keypress.)

for key in R E I S U B; do virsh send-key "${domain}" KEY_LEFTALT KEY_SYSRQ KEY_${key}; sleep 5; done