Bringing chunks to ElasticSearch
Previously, I introduced the concept of "chunks" for use in generating link data. This time, I'd like to apply the same principle to the process of importing pages into ElasticSearch. On the surface, this looks the same as for the link data; however, there are some complications because I'm now working in ElasticSearch instead of with files on a disk.
Right now, I'm storing everything in a single large Elastic index. This works mostly adequately, but is not ideal. For one thing, it causes problems with updates: to "update" a document, Elastic marks the previous version as deleted and inserts a new one, bloating the index until it can be re-compacted. (And my system has no way of telling which pages changed and which stayed the same, so when I reload stackoverflow.com it sees 21 million updates!) I'd like to avoid ever doing updates to existing items in an index, and instead just drop the entire index and start over; but I don't want to have to reload the entire index whenever anything changes.
Fortunately, ElasticSearch has index aliases, which can make multiple indexes appear together as one. So I can make multiple indexes, and then when I need to update something I can just reload everything in that index into a new one, switch them around, and cleanly delete the old index—no compaction needed.
I looked into using one index per chunk, which seemed the simplest solution, but it seems like Elastic probably prefers to have fewer large indexes rather than a bunch of small indexes. I don't think this would matter so much right now, since the dataset I have is relatively small (well, less than 45 GB, anyway); but I expect to have some datasources in the future that have a lot of very small chunks so I thought it best to prepare for the future. I therefore compromised on introducing "clumps", which are a collection of chunks, so I can have one clump per index. (The term is very evocative, but Elastic has already used the good words like "shards", "segments", and "clusters", and I wanted to avoid a naming collision.) By default, each datasource gets a single clump, but datasources can also define their own clumping. StackOverflow gets a clump of it's own, because it's so much bigger than all of the other chunks put together. I've mostly tried to aim for clumps that are 10GB or less, as an arbitrary threshold that doesn't take too long to load.
Once I had the clumps defined, I had to figure out how to manage indexing them. I wanted to support adding to an existing index, if there are new chunks in the clump, since the index doesn't need to be reloaded in this case; and if the index did need to be reloaded I also wanted to support keeping the current index around until the next one was ready, so clumps didn't disappear wholesale from the search results during an update. After a lot of thinking, the final logic ended up looking like this:
if not clump.current_index:
clump.current_index = create_index(clump)
if not index_obsolete(clump.current_index):
update_index(clump)
else:
if clump.nextindex and index_obsolete(clump.next_index):
delete_index(clump.next_index)
if not clump.next_index:
clump.next_index = create_index(clump)
update_index(clump)
update_alias(delete=clump.current_index, add=clump.next_index)
clump.current_index = clump.next_index
It looks simple here, but there are a lot of fiddly bits I've omitted. The information about which chunks are in each clump, and what the clump's current indexes are, is currently stored in a JSON file on disk; I should look into attaching it as metadata somewhere in Elastic but this works for the moment.
Right now it kind of feels like I've done all this work for nothing, since the end result looks exactly the same despite all the extra code, but this should make index updates faster and less error-prone in the future.