wget does WARC
In my previous post, I talked about crawling the web and mentioned that I planned to start with the ability to download a list of URLs and write them to a WARC file. At the time, I was imagining that I’d have to write some code to do this, but while doing some research I discovered that wget added some basic support for WARC output a few years ago.
Using it is pretty straightforward: just add --warc-file=$myfile to the flags and it will create a $myfile.warc.gz containing every request/response, plus some metadata like the flags that were used and logs from wget. However, this already has some pitfalls:
- The WARC file will be overwritten if it exists. I use --warc-file="$myfile-$(date +'%F_%H-%M-%S')" to avoid clobbering existing files.
- wget will also continue writing whatever other output files it would have written. Passing -O /dev/null suppresses this, leaving just the WARC.
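Putting the flag and both workarounds together, a one-off invocation looks roughly like this (the prefix and URL here are just placeholders):
wget -O /dev/null --warc-file="mycrawl-$(date +'%F_%H-%M-%S')" https://example.com/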
To simplify this, I wrote a wrapper script called wget-warc:
#!/bin/bash
# USAGE: wget-warc <warc file prefix> <wget args>...
set -eu -o pipefail
warcprefix="$1"
shift
wget -O /dev/null --warc-file="$warcprefix-$(date +'%F_%H-%M-%S')" --warc-cdx --no-verbose "$@"
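As a usage sketch (the prefix and URL are placeholders, and I'm assuming the script is in the current directory), running:
./wget-warc mycrawl https://example.com/
should leave behind a mycrawl-<timestamp>.warc.gz, plus a CDX index thanks to --warc-cdx, while the fetched page itself goes to /dev/null.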
I also have a somewhat hacky script, wget-warc-ifnew, which adds support for downloading URLs only once:
#!/bin/bash
# Takes the list of URLs to download on stdin, and passes to wget only the ones that aren't already downloaded
# USAGE: wget-warc-ifnew <warc file prefix> <wget args>...
set -eu -o pipefail
warcprefix="$1"
shift
(
    if compgen -G "$warcprefix*.cdx" >/dev/null; then
        export LC_ALL=C
        # NOTE: this assumes the first field of the CDX is the URL; wget happens to hard-code this so it works.
        comm -13 \
            <(cut -d' ' -f1 "$warcprefix"*.cdx | sort -u) \
            <(sort -u)
    else
        cat
    fi
) | "$(dirname "$0")"/wget-warc "$warcprefix" --input-file=- "$@"
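To try it out (again with placeholder URLs), feed the list on stdin:
printf '%s\n' https://example.com/ https://example.org/ | ./wget-warc-ifnew mycrawl
One caveat: the de-duplication is a plain byte-for-byte comparison against the URLs recorded in the CDX files, so the same resource written slightly differently (with or without a trailing slash, say) will be downloaded again.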
With these scripts, I’ve completed step 1 of my crawling roadmap, and moved my favicon processing over to using WARC files. I think I’ll also use them for step 2 (crawling a few hand-picked sites), but I’m going to have to replace wget with something more robust before doing general web crawling. The problem is that it has a number of limitations for this use case:
- It can’t limit the downloaded size of a file, so if I accidentally include a large download in the crawl it’ll save the whole thing.
- It doesn’t have any way to limit access or otherwise prevent Server-Side Request Forgery-type attacks, so it could be induced to crawl my local network. I don’t think there’s anything in my house that’s prone to security issues from GET requests, but at a minimum it’d be annoying and confusing if http://127.0.0.1:8080 appeared in my search results because I crawled it by mistake. (See the sketch after this list for the kind of guard a custom downloader would need.)
- It doesn’t have the smarts I’d like for dynamic rate limiting and giving up on hosts that seem to be down; it’ll just carry on down its list of URLs regardless, which isn’t very polite and is likely to get my bot banned.
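To illustrate the SSRF point, here’s a minimal sketch of the kind of pre-filter a custom downloader would need. It’s illustrative only: it assumes getent is available, extracts hosts with a crude regex, and only covers the obvious private, loopback, and link-local ranges:
#!/bin/bash
# Sketch: drop URLs whose hosts resolve to loopback, private, or link-local addresses.
# Reads URLs on stdin and writes the ones that look safe to stdout.
set -eu -o pipefail
while IFS= read -r url; do
    # Crude host extraction: strip the scheme, then everything after the first '/' or ':'.
    host="$(printf '%s\n' "$url" | sed -E 's#^[a-z]+://([^/:]+).*#\1#')"
    # Only the first resolved address is checked; resolution failures fall through as empty.
    ip="$(getent ahosts "$host" | awk '{print $1; exit}' || true)"
    case "$ip" in
        ''|10.*|127.*|169.254.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|::1|fe80:*|fc*|fd*)
            echo "skipping $url (resolves to '$ip')" >&2
            ;;
        *)
            printf '%s\n' "$url"
            ;;
    esac
done
Something like this could sit in front of wget-warc-ifnew in the pipeline, though a pre-filter alone can be defeated by DNS rebinding, so a real downloader would want to re-check the address at connect time.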
As a result, I’m probably going to end up writing a custom downloader anyway; but wget was very easy to get started with, and it’s a good first step that should carry me through the prototyping phase without having to do too much work to get it set up.