wget does WARC

In my previous post, I talked about crawling the web and mentioned that I planned to start with the ability to download a list of URLs and write them to a WARC file. At the time, I was imagining that I’d have to write some code to do this, but while I was doing some research I discovered that wget added some basic support for WARC output a few years ago.

Using it is pretty straightforward; just add --warc-file=$myfile to the flags and it will create a $myfile.warc.gz containing every request/response, plus some metadata like the flags that were used and logs from wget. However, this already has some pitfalls:

To simplify this, I wrote a wrapper script called wget-warc:

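#!/bin/bash
# USAGE: wget-warc <warc file prefix> <wget args>...

set -eu -o pipefail

# The first argument is the WARC file prefix; everything after it is passed through to wget.
warcprefix="$1"
shift

wget -O /dev/null --warc-file="$warcprefix-$(date +'%F_%H-%M-%S')" --warc-cdx --no-verbose "$@"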
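
As a rough illustration of the naming scheme, here is a small sketch of how the timestamped file names come out. The helper name warc_file_name and the prefix favicons are made up for this example; the date format is the one used in the wrapper, and it means separate runs with the same prefix get distinct file names (as long as they start in different seconds).

```shell
#!/bin/bash
set -eu -o pipefail

# Hypothetical helper (not part of the wrapper): builds the value that is
# passed to --warc-file, i.e. "<prefix>-<YYYY-MM-DD>_<HH-MM-SS>".
warc_file_name() {
	printf '%s-%s\n' "$1" "$(date +'%F_%H-%M-%S')"
}

name="$(warc_file_name favicons)"

# wget appends the extensions itself: the archive goes to <name>.warc.gz
# and, with --warc-cdx, the index goes to <name>.cdx.
echo "$name.warc.gz"
echo "$name.cdx"
```

The second script relies on this layout: it finds earlier runs by globbing for every .cdx file that starts with the prefix.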

I also have a somewhat hacky script, wget-warc-ifnew, which adds support for downloading each URL only once:

#!/bin/bash
# Takes the list of URLs to download on stdin, and passes to wget only the ones that aren't already downloaded
# USAGE: wget-warc-ifnew <warc file prefix> <wget args>...

set -eu -o pipefail

warcprefix="$1"
shift

(
	if compgen -G "$warcprefix*.cdx" >/dev/null; then
		export LC_ALL=C
		# NOTE: this assumes the first field of the CDX is the URL; wget happens to hard-code this so it works.
		# comm -13 prints only the lines unique to the second input, i.e. URLs not yet in any CDX file.
		comm -13 \
			<(cut -d' ' -f1 "$warcprefix"*.cdx | sort -u) \
			<(sort -u)
	else
		# No CDX files yet, so nothing has been downloaded: pass every URL through.
		cat
	fi
) | "$(dirname "$0")"/wget-warc "$warcprefix" --input-file=- "$@"
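
To make the filtering step easier to follow, here is a minimal offline sketch of just the comm part, run against a fake CDX file instead of a real crawl. The function name filter_new_urls, the example.com URLs and the CDX rows are all made up for illustration; the only thing that matters is that the first space-separated field of each row is the URL.

```shell
#!/bin/bash
set -eu -o pipefail

# Hypothetical helper: reads URLs on stdin and prints the ones that do not
# appear as the first field of any "<prefix>*.cdx" file.
filter_new_urls() {
	local prefix="$1"
	if compgen -G "$prefix*.cdx" >/dev/null; then
		# comm needs both inputs sorted the same way, hence LC_ALL=C everywhere.
		LC_ALL=C comm -13 \
			<(cut -d' ' -f1 "$prefix"*.cdx | LC_ALL=C sort -u) \
			<(LC_ALL=C sort -u)
	else
		cat
	fi
}

tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT

# Made-up CDX contents. The header line starts with a space, so its first
# field is empty and it never matches a real URL.
cat >"$tmpdir/demo-2024-01-01_00-00-00.cdx" <<'EOF'
 CDX a b a m s k r M V g u
https://example.com/a 20240101000000 https://example.com/a text/html 200 - - - 0 demo.warc.gz -
https://example.com/b 20240101000001 https://example.com/b text/html 200 - - - 0 demo.warc.gz -
EOF

new_urls="$(printf '%s\n' https://example.com/b https://example.com/c https://example.com/a \
	| filter_new_urls "$tmpdir/demo")"
echo "$new_urls"
```

Here only https://example.com/c is printed, since the other two URLs are already in the index. Note that this matches on the exact URL string, so two spellings of the same address would count as different URLs.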

With these scripts, I’ve completed step 1 of my crawling roadmap, and moved my favicon processing over to using WARC files. I think I’ll also use them for step 2 (crawling a few hand-picked sites), but I’m going to have to replace wget with something more robust before doing general web crawling. The problem is that it has a number of limitations for this use-case:

As a result, I’m probably going to end up writing a custom downloader anyway; but wget was very easy to get started with, and it’s a good first step that should carry me through the prototyping phase without much setup work.