wget does WARC
In my previous post, I talked about crawling the web and mentioned that I planned to start with the ability to download a list of URLs and write them to a WARC file. At the time, I was imagining that I’d have to write some code to do this, but while doing some research I discovered that wget added some basic support for WARC output a few years ago.
Using it is pretty straightforward: just add --warc-file=$myfile to the flags and it will create a $myfile.warc.gz containing every request/response, plus some metadata like the flags that were used and logs from wget. However, this already has some pitfalls:
- The WARC file will be overwritten if it exists. I use --warc-file="$myfile-$(date +'%F_%H-%M-%S')" to avoid clobbering existing files.
- wget will also continue writing whatever other output files it would have written. Passing -O /dev/null suppresses this, leaving just the WARC.
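Putting the flag and both workarounds together, a one-off invocation looks roughly like this (the prefix and URL here are just placeholders):
wget -O /dev/null --warc-file="mycrawl-$(date +'%F_%H-%M-%S')" https://example.com/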
To simplify this, I wrote a wrapper script called wget-warc:
#!/bin/bash
# USAGE: wget-warc <warc file prefix> <wget args>...
set -eu -o pipefail
warcprefix="$1"
shift
wget -O /dev/null --warc-file="$warcprefix-$(date +'%F_%H-%M-%S')" --warc-cdx --no-verbose "$@"
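As a usage sketch (the prefix and URL are placeholders, and I'm assuming the script is in the current directory), running:
./wget-warc mycrawl https://example.com/
should leave behind a mycrawl-<timestamp>.warc.gz, plus a CDX index thanks to --warc-cdx, while the fetched page itself goes to /dev/null.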
I also have a somewhat hacky script, wget-warc-ifnew, which adds support for downloading URLs only once:
#!/bin/bash
# Takes the list of URLs to download on stdin, and passes to wget only the ones that aren't already downloaded
# USAGE: wget-warc-ifnew <warc file prefix> <wget args>...
set -eu -o pipefail
warcprefix="$1"
shift
(
    if compgen -G "$warcprefix*.cdx" >/dev/null; then
        export LC_ALL=C
        # NOTE: this assumes the first field of the CDX is the URL; wget happens to hard-code this so it works.
        comm -13 \
            <(cut -d' ' -f1 "$warcprefix"*.cdx | sort -u) \
            <(sort -u)
    else
        cat
    fi
) | "$(dirname "$0")"/wget-warc "$warcprefix" --input-file=- "$@"
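To try it out (again with placeholder URLs), feed the list on stdin:
printf '%s\n' https://example.com/ https://example.org/ | ./wget-warc-ifnew mycrawl
One caveat: the de-duplication is a plain byte-for-byte comparison against the URLs recorded in the CDX files, so the same resource written slightly differently (with or without a trailing slash, say) will be downloaded again.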
With these scripts, I’ve completed step 1 of my crawling roadmap, and moved my favicon processing over to using WARC files. I think I’ll also use them for step 2 (crawling a few hand-picked sites), but I’m going to have to replace wget with something more robust before doing general web crawling. The problem is that it has a number of limitations for this use case:
- It can’t limit the downloaded size of a file, so if I accidentally include a large download in the crawl it’ll save the whole thing.
- It doesn’t have any way to limit access or otherwise prevent Server-Side Request Forgery-type attacks, so it could be induced to crawl my local network. I don’t think there’s anything in my house that’s prone to security issues from GET requests, but at a minimum it’d be annoying and confusing if http://127.0.0.1:8080 appeared in my search results because I crawled it by mistake. (See the sketch after this list for the kind of guard a custom downloader would need.)
- It doesn’t have the smarts I’d like for dynamic rate limiting and giving up on hosts that seem to be down; it’ll just carry on down its list of URLs regardless, which isn’t very polite and is likely to get my bot banned.
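To illustrate the SSRF point, here’s a minimal sketch of the kind of pre-filter a custom downloader would need. It’s illustrative only: it assumes getent is available, extracts hosts with a crude regex, and only covers the obvious private, loopback, and link-local ranges:
#!/bin/bash
# Sketch: drop URLs whose hosts resolve to loopback, private, or link-local addresses.
# Reads URLs on stdin and writes the ones that look safe to stdout.
set -eu -o pipefail
while IFS= read -r url; do
    # Crude host extraction: strip the scheme, then everything after the first '/' or ':'.
    host="$(printf '%s\n' "$url" | sed -E 's#^[a-z]+://([^/:]+).*#\1#')"
    # Only the first resolved address is checked; resolution failures fall through as empty.
    ip="$(getent ahosts "$host" | awk '{print $1; exit}' || true)"
    case "$ip" in
        ''|10.*|127.*|169.254.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|::1|fe80:*|fc*|fd*)
            echo "skipping $url (resolves to '$ip')" >&2
            ;;
        *)
            printf '%s\n' "$url"
            ;;
    esac
done
Something like this could sit in front of wget-warc-ifnew in the pipeline, though a pre-filter alone can be defeated by DNS rebinding, so a real downloader would want to re-check the address at connect time.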
As a result, I’m probably going to end up writing a custom downloader anyway; but wget was very easy to get started with, and it’s a good first step that should carry me through the prototyping phase without having to do too much work to get it set up.