A roadmap to web crawling
I’m currently relying on various bulk exports (Stack Exchange XML dumps, Zim files from Wikipedia, and so on) to populate the search engine. This is a convenient way to get a lot of pages in very quickly (a single export file contains hundreds or thousands of pages, with generally well-structured HTML), but it’s also limiting: most of the web doesn’t come in nicely structured dump files, and I’d still like to have some of it in my searches. So, I’m going to need to implement web crawling at some point.
Crawling is pretty complicated, though, and there are a lot of ways it can go wrong. So rather than diving in and trying to implement a whole crawling system at once, I’m planning on doing it in stages:
- Write WARC files. WARC (Web ARChive) is a standard file format for storing web crawls, which is useful because there’s a bunch of tools for working with it and I won’t have to reinvent any wheels. My plan for this step is to be able to pass in a list of URLs and get a WARC file out; in particular, I’d like to avoid any robots.txt handling or HTML parsing. I’d still like to have something useful, though, so I’m going to apply this to favicon fetching: I can just take a list of domains and tack /favicon.ico on the end to get my URL list, and this will clean up my current favicon downloader, which is a shell script wrapping around curl and is pretty awkward to deal with.
- Index WARC files. Now that I have a way to download a list of URLs, apply it to some websites. (To avoid having to deal with robots.txt, I’ll hand-pick some sites that look particularly easy to work with, and check their robots settings beforehand.) In theory, I should be able to take the resulting WARC files and just treat them as another bulk data source, with a thin wrapper to feed them into the ingestion system. (In practice, I expect I’ll probably need to do at least a little fiddling with the way I handle HTML, since it was mostly designed around fragments, like the ones in the Stack Exchange dumps, rather than complete pages.)
- Reorganize link processing. This isn’t strictly about web crawling, but I’m going to want to do this before the next step: the way I handle links, PageRank, and so on is currently very ad hoc; I need to set up a schema and some utilities so I can ask questions like “what are the top N URLs by PageRank that have not yet been indexed?” without writing a bunch of code to pull data from all over the place and try to merge it.
- Generic web crawling. Use the above query, feed it through a robots.txt checker, and now I’m crawling all sorts of sites. That makes this sound easy, but I expect this to be by far the most complicated part: this is where I encounter all the edge cases, the bizarre things that can go wrong, the stuff that requires fiddling with tricky heuristics to try and make things work.
- JavaScript-enabled crawls. At some point, I’m going to want to also handle sites that for some reason require JavaScript to render the content. But this is a long way off, so I’ve not put too much thought into it yet.
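As a rough illustration of how the link-processing query and the robots.txt check from the list above could fit together, here’s a minimal sketch in Python (chosen just for the example; nothing above depends on it). The pages table, its columns, and the crawler_name user agent are all made up for illustration, and the robots.txt contents are passed in as plain strings so the example never touches the network.

```python
import sqlite3
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical schema: one row per known URL, with a PageRank score and
# a flag recording whether the page has already been indexed.
SCHEMA = """
CREATE TABLE pages (
    url      TEXT PRIMARY KEY,
    pagerank REAL NOT NULL,
    indexed  INTEGER NOT NULL DEFAULT 0
)
"""

def top_unindexed(db, n):
    """Return the top-n URLs by PageRank that have not been indexed yet."""
    rows = db.execute(
        "SELECT url FROM pages WHERE indexed = 0 "
        "ORDER BY pagerank DESC, url LIMIT ?",
        (n,),
    )
    return [url for (url,) in rows]

def allowed(urls, robots_by_host, user_agent):
    """Keep only the URLs that the host's robots.txt permits for user_agent.

    robots_by_host maps a hostname to the text of its robots.txt. Hosts
    missing from the map are skipped, to stay on the cautious side.
    """
    parsers = {}
    kept = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in robots_by_host:
            continue
        if host not in parsers:
            parser = RobotFileParser()
            parser.parse(robots_by_host[host].splitlines())
            parsers[host] = parser
        if parsers[host].can_fetch(user_agent, url):
            kept.append(url)
    return kept

# Made-up sample data, purely for demonstration.
db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
db.executemany(
    "INSERT INTO pages (url, pagerank, indexed) VALUES (?, ?, ?)",
    [
        ("https://example.com/", 0.9, 1),
        ("https://example.com/docs", 0.7, 0),
        ("https://example.com/private/admin", 0.6, 0),
        ("https://example.org/", 0.2, 0),
    ],
)
robots = {"example.com": "User-agent: *\nDisallow: /private/\n"}

candidates = top_unindexed(db, 3)
crawl_queue = allowed(candidates, robots, "crawler_name")
print(crawl_queue)  # ['https://example.com/docs']
```

Hosts with no robots.txt on file are skipped rather than assumed to be open, which seemed like the safer default for a sketch; a real crawler would fetch and cache the file first, and that is exactly where the edge cases start piling up.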
Having written that down, I’m glad to see it still makes sense the way it did in my head. I’m planning on doing step 1 soon, because the favicon handling is becoming a bit of a headache, but it’ll probably be a while before I get to the rest of it: I really need to work on improving the quality of the results I already have before I spend time adding more of them.
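For a sense of what step 1 involves, here’s a small Python sketch of the two halves: turning a list of domains into favicon URLs, and wrapping an already-fetched HTTP response in a WARC/1.0 response record. The function names, the choice of HTTPS, and the sample response bytes are all assumptions made for the example. In practice an existing WARC library would handle the record writing (that’s the point of using a standard format); the hand-rolled version is only here to show the shape of a record.

```python
import uuid
from datetime import datetime, timezone

def favicon_urls(domains):
    """Turn bare domain names into favicon URLs, assuming HTTPS."""
    urls = []
    for domain in domains:
        domain = domain.strip().lower().rstrip("/")
        if domain:
            urls.append(f"https://{domain}/favicon.ico")
    return urls

def warc_response_record(url, http_bytes, when=None):
    """Serialize one WARC/1.0 'response' record.

    http_bytes is the raw HTTP response (status line, headers, body)
    exactly as received from the server.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    # A blank line separates the WARC headers from the content block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs after the content block.
    return head + http_bytes + b"\r\n\r\n"

urls = favicon_urls(["example.com", " Example.org/ "])

# Stand-in for a response that a real fetcher would have downloaded.
fake_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: image/x-icon\r\n"
    b"\r\n"
    b"\x00\x00\x01\x00"
)
record = warc_response_record(urls[0], fake_response)
```

A WARC file is then just these records written one after another, which is why “list of URLs in, WARC file out” makes for a nicely self-contained first step.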