A simple load-shedding scheme
To prepare for setting up a production environment, I just added some code to handle the situation where the server gets overloaded. The actual implementation is simple, but the theory behind it is quite interesting.
Rather than re-explain it, I’ll just reproduce the code and the comment I already wrote:
//// QUERY HANDLER LOAD-SHEDDING ////
// Make a search query, but fail fast if things have gotten clogged up.
//
// Elastic will queue requests if it’s too busy to service them immediately.
// This sounds helpful, but it means that if the server is overloaded, the
// queue will start backing up, and searches will take longer to complete.
// This is particularly annoying because, as things start to take longer,
// people will start abandoning searches, so some of the work we’ll be doing
// will be for users who have already closed the tab. If queries keep showing
// up faster than we can service them, eventually *every* query will be like
// this as things start timing out.
//
// So, rather than make everyone equally unhappy, I’d rather make as many
// people happy as possible, by continuing to serve whichever queries we can
// handle in a timely manner, and failing for everyone else as quickly as
// possible so they can make a decision for themselves. (Since it takes almost
// no effort to return an error, we don’t really care how fast they hammer the
// retry button, and we won’t spend time working on queries for users who have
// already given up and gone elsewhere.)
//
// To that end, this code limits the number of in-flight requests to Elastic
// to a small queue, which should saturate quickly if queries start backing up.
// Once this happens, we’ll start dropping queries; as demand rises, more and
// more queries will get dropped (with the occasional one getting lucky and
// nabbing a slot that was cleared as older queries finish), but the queue
// won’t fill any further and once demand goes down things should recover fast.
//
// Note that the limit is intentionally very small: it's bigger than Elastic's
// thread pool size (for my server, 4 threads, since I have 2 processors) but
// smaller than the queue that backs that thread pool (1000, I think; Elastic 7
// has an experimental autoscaler thingie that sounds like it tries to
// auto-twiddle this). However, I think Elastic might also turn our request
// into multiple work items for the thread pool? This number isn't very
// scientific, it's just "quite small but bigger than the minimum".
const ELASTIC_MAX_CONCURRENCY = 8
let active_fetches = 0
async function elastic_query(opts) {
  if (active_fetches >= ELASTIC_MAX_CONCURRENCY) {
    return http_error(503)
  }
  active_fetches++
  try {
    return await elastic_query_immediate(opts)
  } finally {
    active_fetches--
  }
}
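To see the behaviour the comment describes, the limiter can be exercised against a fake backend. The sketch below is only an illustration under stated assumptions: the bodies of `http_error` and `elastic_query_immediate`, the 50 ms delay, and the `burst` and `demo` helpers are all made up for the demo and are not the real ones.

```javascript
// Demo-only stand-ins. The real http_error() and elastic_query_immediate()
// are not shown in this post, so these bodies are assumptions.
const ELASTIC_MAX_CONCURRENCY = 8
let active_fetches = 0

// Hypothetical: represent an HTTP error as a plain object.
function http_error(status) {
  return { status }
}

// Hypothetical: pretend every search takes 50 ms to complete.
function elastic_query_immediate(opts) {
  return new Promise(resolve =>
    setTimeout(() => resolve({ status: 200, query: opts.q }), 50))
}

// Same limiter as above: refuse immediately once all slots are taken.
async function elastic_query(opts) {
  if (active_fetches >= ELASTIC_MAX_CONCURRENCY) {
    return http_error(503)
  }
  active_fetches++
  try {
    return await elastic_query_immediate(opts)
  } finally {
    active_fetches--
  }
}

// Fire n queries at the same instant and count served vs. shed.
async function burst(n) {
  const results = await Promise.all(
    Array.from({ length: n }, (_, i) => elastic_query({ q: `query ${i}` })))
  const served = results.filter(r => r.status === 200).length
  return { served, shed: n - served }
}

// An overload burst, followed by a small burst once the first has drained.
async function demo() {
  const first = await burst(20)
  const second = await burst(5)
  return { first, second }
}
```

Calling `demo()` should report 8 served and 12 shed for the first burst, then all 5 served for the second: the shed queries fail immediately instead of waiting, and the slots free up as soon as the in-flight queries finish.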
Queueing theory is very tricky and counter-intuitive; for a more complicated or sophisticated system, this could be taken a lot further. But I think this is enough to ameliorate the kinds of disasters I’m most likely to encounter in the near term (getting a “hug of death” from a popular link, or something going wrong and causing the server to thrash on swap), and it’s certainly easy to understand and work with.
Further reading that might be of interest: