
A better approach to link processing

In my last post I showed a big diagram of all the steps to get a StackExchange post from a dump file into the search index, and drew some boxes around different parts. This time I'm looking at the .links.dat generation, which is shown in the diagram as part of pagerank, but was actually managed by the datasource.

graph LR
	subgraph "datasource"
		H[[pages]]
		I[[canonicalize]]
	end
	subgraph pagerank
		H & I --> P1{{generate_link_dat.js}} --> P2[.links.dat]
	end

Previously, generating the .links.dat files was the responsibility of each datasource, and they were stored next to the other data files, organized however that datasource saw fit. This felt a bit awkward, but I hadn't yet come up with a good abstraction for keeping track of them. Once I figured out the clean split between datasources and generic processing, it was obvious that the link data belonged on the “generic” side of that line, but I still wasn't sure what the interface should be. The problem is that, even if we don't know anything about the datasource's internals, we still need to be able to keep separate .links.dat files and know when they need to be rebuilt.

I considered implementing this with redo again, but that had some challenges: the default.links.dat.do script wouldn't know what files it should depend on for each datasource, or how to find them, or how to read them. A future datasource might not even have files, if I make one that stores its data in a relational database, for example. I briefly considered making a virtual target in each datasource, but it still wasn't clear to me exactly what the interface should be. I also realized that I should keep the entire ingestion pipeline in a single process to avoid the performance overhead of serializing and deserializing objects, so anything involving a shell pipeline was out.

Given this last constraint, I decided to consider what could be done using the existing ingestion system. I already have a “page source” interface standardized: each datasource defines a function which takes some arguments and returns a stream of pages. However, there isn't any way to know what arguments to pass, or when the result will have changed and the pipeline needs to be rerun. (I'm keeping track of all this in my head at the moment, and running various commands in the right order when things change. It's a bit difficult to do, and I'd like to automate it properly.) If there were a way to get this information automatically, I could write a script that would collect it from all of the datasources and run whatever updates are needed.
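Concretely, the page-source side looks something like this (a simplified sketch; the exact names and types here are not the real ones, just enough to follow along):

interface Page {
	url: string,
	// ...plus whatever else the ingestion pipeline needs
}

interface Datasource {
	// Datasource-specific arguments in, stream of pages out.
	source(...args: string[]): AsyncIterable<Page>,
}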

With that in mind, I added to the datasource interface as follows:

interface Datasource {
	// Enumerate the chunks this datasource currently provides.
	chunks(): Promise<Array<{
		// Arguments to pass to source() to load this chunk's pages.
		source_args: string[],
		// Stable name for the chunk, used in filenames.
		name: string,
		// Opaque value that changes whenever the chunk needs to be reloaded.
		cachekey: string,
	}>>
}

This introduces the concept of a "chunk", which each datasource is free to define as it likes. (StackExchange emits one chunk per site; others might split on time, or, for simple datasources, just have a single chunk.) Each chunk defines the arguments to pass to source() to load it, along with a cachekey: an opaque identifier that will change whenever the chunk needs to be reloaded. (StackExchange currently just uses the mtime of the .answers.jsonl file.)
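As a sketch (not the real code), a StackExchange-style chunks() could look something like this, with one chunk per site and the mtime of that site's .answers.jsonl file as the cachekey; the directory layout is made up for the example:

import { readdir, stat } from "fs/promises";
import { join } from "path";

async function chunks(dataDir: string) {
	// One chunk per site: assume each site has a <site>.answers.jsonl file in dataDir.
	const files = (await readdir(dataDir)).filter((f) => f.endsWith(".answers.jsonl"));
	return Promise.all(files.map(async (file) => {
		const site = file.replace(/\.answers\.jsonl$/, "");
		const { mtimeMs } = await stat(join(dataDir, file));
		return {
			source_args: [site],        // later passed to source() to stream that site's pages
			name: site,
			cachekey: String(mtimeMs),  // changes whenever the dump is reimported
		};
	}));
}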

With this system, updating all of the .links.dat files is simple. They are all stored together in a folder, and named ${datasource}:${chunk.name}:${chunk.cachekey}:.links.dat. Updating them is a three-step process:

  1. Get a list of all current chunks and their filenames.
  2. Delete all the files in the directory that we don't know about.
  3. Generate new files for chunks in the list that don't have corresponding files, using datasource.source(...source_args).

Because the filenames have the cache key in them, cache invalidation happens naturally: the old cache entry simply appears as a file with a name that's not on the list, and so gets deleted; and the new entry is then created. Chunks that are newly created or removed are handled in the same way.
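Sketched out in code, the whole update is just those three steps; here writeLinkDat is a stand-in for the existing generate_link_dat.js logic, and its name and signature are made up for the sketch:

import { readdir, unlink } from "fs/promises";
import { join } from "path";

interface Chunk { source_args: string[], name: string, cachekey: string }
interface Datasource {
	chunks(): Promise<Chunk[]>,
	source(...args: string[]): AsyncIterable<unknown>,  // stream of pages
}
declare function writeLinkDat(pages: AsyncIterable<unknown>, outPath: string): Promise<void>;

async function updateLinkData(linksDir: string, datasources: Record<string, Datasource>) {
	// 1. List all current chunks and the filenames they should map to.
	const wanted = new Map<string, { ds: Datasource, source_args: string[] }>();
	for (const [dsName, ds] of Object.entries(datasources)) {
		for (const chunk of await ds.chunks()) {
			const filename = `${dsName}:${chunk.name}:${chunk.cachekey}:.links.dat`;
			wanted.set(filename, { ds, source_args: chunk.source_args });
		}
	}

	// 2. Delete any file in the directory that isn't on the list
	//    (stale cachekeys and removed chunks both end up here).
	const existing = await readdir(linksDir);
	for (const file of existing) {
		if (!wanted.has(file)) await unlink(join(linksDir, file));
	}

	// 3. Generate files for chunks on the list that don't exist yet
	//    (new chunks and changed cachekeys both end up here).
	for (const [file, { ds, source_args }] of wanted) {
		if (existing.includes(file)) continue;
		await writeLinkDat(ds.source(...source_args), join(linksDir, file));
	}
}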

This approach to tracking the data and keeping it up to date is simple and easy to understand. Now that I've proven it with link data, I'm planning on applying the same principle to Elasticsearch ingestion; watch for a post about that soon.