
A quick fix for a persistent annoyance

Almost every time I update StackOverflow's data files (by far the largest single data source I have, at 40 GB), I find out halfway through the process that I don’t have enough disk space and need to clear out some old data. This is incredibly annoying when it happens, so I took the opportunity today to make the failure a bit more graceful and prompt.

The main problem is that sort only reports that /tmp is too small once it has actually run out of space. By then it has already spent a long time working, and it has to redo all of that work once I clear out some room. This time was even more annoying: once I’d cleared out the disk and sort had finished, I found out that the output disk didn’t have enough space either! So I had to clear that up and then start over from the beginning again.

Since this step only shuffles data around, sort’s temporary files and the final output should each be roughly the size of the input file. That means I can check up front whether each filesystem has enough free space to fit the input. If it doesn’t, I can fail early rather than finding out about the problem after a bunch of work has already been done.

The resulting block in the script looks like this (the way I'm getting the file sizes probably isn’t the greatest, but it does work):

# Pre-emptively check for free space in tmp (where sort will do its thing)
# and the output directory - otherwise we'll waste a bunch of time getting
# halfway through the sort and then crash and have to redo all that work.
# NOTE: This relies on the knowledge that /tmp and the output are on two
# different filesystems. If that might not be the case you’ll need to check
# first, and instead double the required size and check just the one filesystem.
size_input="$(du -a -B1 "$input" | cut -d$'\t' -f1)"
free_tmp="$(df --output=avail -B1 /tmp/ | tail -n+2)"
if [ "$size_input" -gt "$free_tmp" ]; then
	echo "$0: Not enough free space in /tmp/" >&2
	exit 1
fi
free_dest="$(df --output=avail -B1 "$(dirname "$3")" | tail -n+2)"
if [ "$size_input" -gt "$free_dest" ]; then
	echo "$0: Not enough free space in destination" >&2
	exit 1
fi
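
For the case mentioned in the NOTE above, where /tmp and the output might share a filesystem, a rough sketch of a combined check could look something like the following. It assumes GNU coreutils (stat -c and df --output). The helper names free_bytes and check_space are hypothetical and not part of the original script, and comparing device numbers is only a heuristic for "same filesystem" that may not hold on every setup.

```sh
#!/bin/bash
# Sketch only: free_bytes and check_space are hypothetical helper names.
# Assumes GNU coreutils (stat -c, df --output).

# Print the free bytes on the filesystem containing directory $1.
free_bytes() {
	df --output=avail -B1 "$1" | tail -n+2 | tr -d '[:space:]'
}

# check_space INPUT_FILE TMP_DIR DEST_DIR
# Returns 0 if there is room, 1 if not, 2 if stat fails on the input.
check_space() {
	local size need
	# Apparent size of the input file in bytes.
	size="$(stat -c %s "$1")" || return 2
	# Matching device numbers are taken to mean both directories share one
	# filesystem, which then has to hold the temp files and the output.
	if [ "$(stat -c %d "$2")" = "$(stat -c %d "$3")" ]; then
		need=$((size * 2))
		if [ "$need" -gt "$(free_bytes "$2")" ]; then
			echo "$0: Not enough free space in $2 for temp files and output" >&2
			return 1
		fi
		return 0
	fi
	if [ "$size" -gt "$(free_bytes "$2")" ]; then
		echo "$0: Not enough free space in $2" >&2
		return 1
	fi
	if [ "$size" -gt "$(free_bytes "$3")" ]; then
		echo "$0: Not enough free space in $3" >&2
		return 1
	fi
}
```

In the script above it might be called as: check_space "$input" "${TMPDIR:-/tmp}" "$(dirname "$3")" || exit 1. GNU sort puts its temporary files in $TMPDIR when that is set (or wherever -T points) and falls back to /tmp otherwise, so passing ${TMPDIR:-/tmp} covers the common cases.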

It’s a small annoyance in the grand scheme of things, but it does rankle, and small things like this tend to add up into a looming frustration. Hopefully this check will make future updates that little bit less painful.