Experimenting with redo
I've gone through several different schemes for coordinating everything that needs to be done to get pages into the index. Today I had another go at it, and came up with a scheme that seems to be a good foundation, if a bit rocky in places.
StackExchange was the first datasource I wrote code for, and it's usually the one I use to experiment with new approaches. A new set of data dumps were released recently, so I took the opportunity to try another rearchitecture, since the approach I'm currently taking is confusing to work with and doesn't feel like the right direction.
Considering the problem
Here's a diagram I drew of the data flow for the StackExchange data source. The details of what all of the boxes do is a topic for another post; the main thing to see here is that there are a lot of steps that all have to happen in the right order.
After staring at this diagram and scribbling a lot, I had an epiphany: pages
and canonicalize
together create an API boundary: everything above them is specific to StackExchange, but everything below them can be written in a generic way. Once I drew some boxes on the diagram and zoomed out it became a lot clearer where the responsibilities belonged:
A simple (?) solution
Putting boxes on the diagram also made the way some other things ought to work a lot more obvious in my head, but for the rest of this post I want to focus just on the StackExchange side of things:
This alone looks pretty simple; it could just be a shell script:
# NOTE illustrative purposes only, some complications not shown
wget "https://archive.org/download/stackexchange/$1.7z"
7z e -so -- "$1.7z" Posts.xml | python3 ./extract_xml.py "$1"
sort -k1,1 -s -n "$1.answers.jsonlx" | cut -d' ' -f2- > "$1.answers.jsonl"
In fact, this is what I did originally. It has complications, though: it's annoying to do partial runs while I'm testing things—I have to comment bits of the script out and keep track of which files are out of date—and it's also tricky to coordinate across sixty different StackExchange sites.
However, this problem is starting to take on a familiar pattern: it's a directed acyclic graph of files that need to be updated by running a command when other files change. Hang on a minute... that's just a build system! It doesn't matter that the files I'm working with are big data blobs instead of source code, the principle remains the same: file Z can be obtained by running command Y, which depends on file W. Good old make
could take care of this for us!
I've never had a great relationship with make
. This is partly because most of my encounters with it have been large and impenetrable Makefile
s (often generated by autoconf
/automake
), and partly because of its inherent problems:
One of the reasons
make
was so simple was that it was only supposed to solve my problem, and help Steve too. So, I did it. I found it very useful for my little problem. But, very rapidly, I found myself adding junk. …the tool took off and because it took off so fast, I never went back and fixed many of the design errors… because I didn't want to screw up my fifteen person user community.
I remember meeting Stu Feldman at Bell Labs some 35 years ago: the first thing he told me after being introduced as the make guy was an apology, Hi, pleased to meet you, sorry about the TABs...
Anyway, I've found make
to be reasonably useful so long as I don't try to do anything especially complicated, but I decided to take this opportunity to try something different. There are a bunch of alternative build systems that are generic enough they'd work for this, but I decided to try what's known as "djb redo".
A brief introduction to redo
redo
isn't a build system of itself so much as it is a sketch of what a build system might look like, using the UNIX philosophy of taking a bunch of small programs and plumbing them together. It doesn't have any syntax of its own; instead, the build scripts are specially-named .do
files which are run to update each target. (They're sh
scripts by default, but you can use a shebang line to write them in any language.) Dependency management is done by invoking redo-ifchange
, which sorts out what's needed for the current build and also keeps track of the dependencies for future runs.
This outline of the redo
design has inspired a bunch of programmers to build different implementations; I'm using apenwarr/redo, a choice I made mainly because its documentation has a section called “What's so special about redo?” that gives an excellent overview. (In theory, redo
implementations are mostly interchangeable, though I don't know how well that works in practice.) After sorting out some complications with my weird symlink structure, I've been pretty pleased with the results so far.
I start off with a simple default.7z.do
:
wget --progress=dot:giga -O "$3" "https://archive.org/download/stackexchange/$2.7z"
Then, I have a default.questions.jsonl.do
(redo
doesn't support multiple targets, so I had to pick one output from the extraction to be the "primary" output):
# NOTE simplified for expository purposes
redo-ifchange "$2.7z" ".up/extract_xml.py"
# extract_xml takes .questions.jsonl, .answers.jsonlx, and .aqmap.dat.
# Since redo doesn't deal with multiple outputs well, use the .questions.jsonl
# as the primary output (and therefore use "$3" for it) and then store the
# other two files as a tmpfile we'll symlink in a separate target.
7z e -so -- "$2.7z" Posts.xml | \
python3 .up/extract_xml.py \
"$3" \
"$2.answers.jsonlx.redo" \
"$2.aqmap.dat.redo"
As mentioned in the comment, I also need a default.answers.jsonlx.do
:
# Workaround for redo multiple outputs
redo-ifchange "$2.questions.jsonl"
ln "$1.redo" "$3"
default.answers.jsonl.do
is similarly trivial:
redo-ifchange "$2.answers.jsonlx"
sort -k1,1 -s -n "$2.answers.jsonlx" | cut -d' ' -f2- > "$3"
And finally, with all that out of the way, I can write extract-all.do
to regenerate everything:
redo-ifchange stacksites.txt
for ext in .answers.jsonl .questions.jsonl .aqmap.dat; do
< stacksites.txt grep '^Y ' | awk '{ print "data/" $2 "'$ext'" }' | xargs redo-ifchange
done
Results!
With all this work out of the way, I can now mess about with the processing pipeline and redo
will figure out what needs to be done to update the file I asked for. When I'm ready to apply my changes to everything, I just run:
redo extract-all
and redo does whatever work needs to be done to finish off the process and get everything ready for the rest of the ingestion system.