Why writing your own search engine is now easy
A 2004 article, Why Writing Your Own Search Engine is Hard, was posted on Hacker News today, and I wrote a comment comparing it with my experiences writing my own search engine 18 years later. This post is that comment, cleaned up and expanded a bit. (Go read the article first; I quote a bit from it here, but this post won’t make as much sense without context.)
Bandwidth: This is now also cheap; my residential service is 1 Gbit. However, the suggestion to wait until you’ve got indexing working well before optimizing crawling is IMO still spot-on; trying to make a polite, performant crawler that can deal with all the bizarre edge cases on the Web will drag you down. (I bypassed this problem by starting with the Stack Exchange data dumps and Wikipedia crawls, which are a lot more consistent than trying to deal with random websites. In fact I still don’t have a crawler, though I do have a pretty good idea of what it will look like.)
CPU: Computers are really fast now; I’m using a 2-core Pentium G2020T from 2014 and it does what I need just fine. (I do peg the CPU usage during indexing, but that can happen in the background and I’ve set things up so that it runs at night when I don’t care how long it takes. Dev work and querying are both plenty fast enough.)
Disk: SATA is the new thing now, of course, but the question these days is HDD vs SSD. SSDs are faster, but you can design your architecture so that this mostly doesn’t matter, and even a “slow” HDD will be running at capacity. (The trick is to do linear streaming as much as possible, and avoid seeks at all costs.) Still, it’s probably a good idea to store your production index on an SSD, and it’s useful for intermediate data as well; I have (by happenstance more than design) a large HDD and a small SSD and they balance each other nicely.
Storing files: 100% agree with this section, for the disk-seek reasons I mention above. The way to do this nowadays is with WARC files, since it’s a standardized format where all the problems have been ironed out for you and there’s lots of tooling available. One additional advantage: pages from the same website often compress very well against each other (since they’re using the same templates, large chunks of HTML can be squished down considerably), so if you’re pressed for space consider storing one GZIPped file per domain. (The tradeoff with compression is that you can’t arbitrarily seek, but ideally you’ve designed things so you don’t need to do that anyway.)
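To make the one-file-per-domain idea concrete, here is a minimal sketch in Python. It is an illustration, not my actual storage code: it uses gzipped JSON lines as a simple stand-in for WARC, and the names write_domain_archive and stream_pages are hypothetical. The point it shows is that all pages for a domain go through a single gzip stream (so shared templates compress against each other) and are read back strictly front to back.

```python
import gzip
import json
from pathlib import Path
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL, usable as an archive name."""
    return (urlsplit(url).hostname or "unknown").lower()


def write_domain_archive(store: Path, domain: str, pages) -> Path:
    """Write (url, html) pairs for one domain into a single gzip stream.

    Assumption: one JSON object per line stands in for a real WARC record.
    """
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{domain}.jsonl.gz"
    # One stream for the whole domain, so repeated template HTML is
    # compressed against earlier pages instead of page by page.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for url, html in pages:
            f.write(json.dumps({"url": url, "html": html}) + "\n")
    return path


def stream_pages(path: Path):
    """Yield (url, html) pairs in stored order, with no seeking."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["url"], record["html"]
```

With a real crawl you would group URLs by domain_of(url) first and call write_domain_archive once per group; anything downstream (text extraction, indexing) then just iterates over stream_pages.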
Networking: I skipped this by just storing everything on one computer; I expect to be able to continue doing this for a long time, since vertical scaling can get you very far these days.
Indexing: You basically don’t need to write anything to get started with this these days! I’m just using bog-standard Elasticsearch with some glue code to do html2text; it’s working fine and took all of an afternoon to set up from scratch. (That said, I’m not sure I’ll continue using Elastic: it has a ton of features I don’t need, which makes it very hard to understand and work with since there’s so much that’s irrelevant to me. I’m probably going to switch to either straight Lucene or Bleve soon.)
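For a sense of how small the html2text glue can be, here is a rough sketch using only Python’s standard library. It is not the code I actually run, and html_to_text is a hypothetical name; it just drops script and style contents and collapses whitespace, which is roughly what a full-text indexer wants as input.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script/style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the page's text with runs of whitespace collapsed.

    Text chunks are joined with spaces, so inline tags also become word
    breaks; that is acceptable for indexing but not for display.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

The resulting string would then go into whatever document body field your index uses, alongside the URL and title.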
PageRank: I added PageRank very early on in the hopes that it would improve my results, though I’m not really sure how helpful it is if your ranking isn’t decent to begin with. Still, the march of Moore’s law has made it an easy experiment: what Page and Brin’s server could compute in a week with carefully optimized C code, mine can do in less than 5 minutes (!) with a bit of JavaScript (!!).
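If you haven’t seen it before, the core of the algorithm really is short. Below is a minimal power-iteration sketch, written in Python here rather than the JavaScript I use, and not my actual implementation. The damping factor of 0.85 and the fixed iteration count are conventional defaults I’m assuming, and the link graph is assumed to fit in memory as a dict.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to. Pages that
    only appear as link targets are included automatically.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread over all pages,
        # so the total rank stays at 1.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}

        # Each page splits its damped rank evenly among its outlinks.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank

    return rank
```

For example, pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}) ranks c highest, since three pages link to it, and d lowest, since nothing does.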
Serving: Again, Elasticsearch has already solved this entire problem for me (at least to start with); all the frontend has to do is take the JSON result and poke it into an HTML template.
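As a rough illustration of how thin that frontend layer can be, here is a sketch in Python. It relies on the standard shape of an Elasticsearch search response (a hits object containing a hits list, each entry with a _source document); the title and url fields inside _source are assumptions about how the documents were indexed, and render_results is a hypothetical name, not my real frontend.

```python
import html
import json

PAGE = """<!doctype html>
<title>Results for {query}</title>
<ol>
{items}
</ol>"""

ITEM = '<li><a href="{url}">{title}</a></li>'


def render_results(query: str, response_json: str) -> str:
    """Turn a search response (as a JSON string) into a results page."""
    response = json.loads(response_json)
    hits = response.get("hits", {}).get("hits", [])

    items = []
    for hit in hits:
        source = hit.get("_source", {})
        # Escape everything that came from the index or the user, since
        # crawled titles and URLs are untrusted input.
        items.append(ITEM.format(
            url=html.escape(source.get("url", ""), quote=True),
            title=html.escape(source.get("title", "(untitled)")),
        ))

    return PAGE.format(query=html.escape(query), items="\n".join(items))
```

A real frontend would add snippets and paging, but the shape stays the same: parse, escape, fill in the template.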
It’s easier than ever to start building a search engine in your own home; the recent explosion of such services is an indicator of the feasibility, and the rising complaints about Google show that the demand is there. Come and join us, the water’s fine!