Favicons
Today I added favicons to the search results, to make them more visually interesting.
Once again I'm using redo to implement the glue logic; it's a bit hairy in places but not too bad. The pipeline looks a bit like this (one of the .do scripts is sketched after the list):
- Get a list of domains (currently manually obtained from various datasources)
- Download /favicon.ico
- Convert the ICO to a 32x32 (or smaller) PNG:
  - Pick a frame from the ICO
  - Figure out the final size
  - Use imagemagick to extract, resize, and convert to PNG
- Apply `zopflipng` to squish the PNG
- Get a list of all PNG images and their sha256sums
- Make a SHA-based filename for each PNG and generate a lookup table for the frontend
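For flavor, here's a minimal sketch of what one link in that chain could look like as a .do script, assuming a per-domain file layout (example.com.ico next to example.com.png); it's not the exact script, and the frame-picking step is hand-waved for now.

```sh
# default.png.do (sketch): build <domain>.png from <domain>.ico.
# redo runs this with $1 = target, $2 = target minus ".png", $3 = temp output file.
redo-ifchange "$2.ico"
# Stand-in conversion: take frame 0 and shrink it; the real frame/size logic comes later.
convert "$2.ico[0]" -resize '32x32>' png:- > "$3"
```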
Written down like that, it seems pretty simple, but there are a lot of complications. Here are the ones I've encountered so far:
- Since I'm downloading from random servers, I want to make sure I can't end up with a multi-megabyte file somehow. Curl has a `--max-filesize` flag, but this is merely advisory: it relies on the server responding with a content-length header. Instead, I used `ulimit -f 50` to limit the size to 50 KB (see the sketch after this list).
- Because `ulimit -f` is in place, `redo-ifchange` can't make large changes to its internal database, and crashes with `OperationalError: disk I/O error`! I spent a lot of time scratching my head about this one. (The workaround is to be careful not to call `redo` inside the script after setting the ulimit.)
- Some websites return compressed content, even if the user agent didn't send an `Accept-Encoding` header (which curl doesn't, by default). The result of this will be a gibberish file, even though the icon appears fine in the browser. The workaround for this one is to use the `--compressed` flag, which does send `Accept-Encoding`, and also has the side effect of making curl look for a `Content-Encoding` header in the response and behave appropriately.
- Not every site has a /favicon.ico. The ones that don't may not serve an HTTP error; instead you might get e.g. a 200 with an HTML document. My current workaround for this is to use `file` to check for HTML, and clear out the file in that case. This won't work for other kinds of incorrect responses, but I'll cross that bridge when I get to it.
- Some sites serve a PNG image instead of an ICO, despite the extension. Fortunately imagemagick looks at magic numbers instead, so this hasn't been a problem.
- Some sites have a favicon even if they don't serve /favicon.ico; instead, they set it in the HTML of each page using `<link rel="icon"/>`. If I want to handle these I'll need to fetch the homepage and parse the HTML; I haven't done that yet.
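Putting the download workarounds together, the fetch step ends up looking something like the sketch below. It's illustrative rather than the exact script: the https-only URL, the lack of redirect handling, and the minimal error handling are all simplifications.

```sh
# default.ico.do (sketch): fetch the favicon for the domain named by the target,
# e.g. building example.com.ico fetches https://example.com/favicon.ico.
ulimit -f 50   # cap file writes at ~50 KB; --max-filesize only helps when the
               # server sends Content-Length. No redo-ifchange after this point!
: > "$3"       # make sure the output exists even if curl fails outright
curl --silent --compressed -o "$3" "https://$2/favicon.ico" || true
# Some sites answer 200 with an HTML page instead of an icon; blank those out.
if file --brief "$3" | grep -qi html; then
  : > "$3"
fi
```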
Once I've gotten a legitimate ICO file, there's still more fun to be had. ICO files can contain multiple images; the idea is that they can contain, for example, separate hand-drawn icons for different display sizes, or icons at multiple bit depths depending on graphics quality. (I'm more familiar with the classic Mac approach, which uses `ICN#`, `icl4`, `icl8`, etc. resources, grouped together into a `BNDL`. ICO files are separate, since Windows doesn't have a resource fork, but more flexible about the kinds of icon they can store.) Favicons are nominally 16x16, but I'm rendering them as 32x32 if possible to support high-resolution screens. (If there's only a 16x16 icon available, I don't bother scaling it up, since all that would do is make a bigger file that looks blurry.)
The heuristic I'm using right now, which works on the sites I have indexed so far, goes like this (there's an imagemagick sketch after the list):
- If there's only one image in the file, use that (obviously)
- Next, use the 32x32 image with the highest number of colors (on the theory that the lower-color ones are probably for lower bit depths, and don't look as nice)
- Finally, pick the highest-color image regardless of size.
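In imagemagick terms, that boils down to something like the sketch below. The `favicon.ico`/`favicon.png` names are placeholders, and the awkward non-square case I mention later isn't handled here.

```sh
# List each frame as "index width height unique-colors" (%p %w %h %k), then
# prefer the most-colorful 32x32 frame, falling back to the most-colorful frame overall.
frames=$(identify -format '%p %w %h %k\n' favicon.ico)
frame=$(printf '%s\n' "$frames" | sort -k4,4nr |
        awk '$2 == 32 && $3 == 32 { print $1; exit }')
[ -n "$frame" ] || frame=$(printf '%s\n' "$frames" | sort -k4,4nr |
                           awk 'NR == 1 { print $1 }')
# Extract that frame and shrink it to fit in 32x32 (never upscale), writing a PNG.
convert "favicon.ico[$frame]" -resize '32x32>' favicon.png
```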
This heuristic has some glaring gaps, but they turn out not to matter, at least not yet. It seems that all of the favicons I have so far fall into two buckets:
- There's only one (usually 16x16) image, or
- There's a bunch of different sizes: 16x16, 32x32, and 48x48 seem popular, but so far these files always contain at least one 32x32 image. (In some cases they contain several, with e.g. increasingly nice gradients.)
The one exception to these two buckets so far is the icon for fishshell.org, which contains two icons, of dimensions 15x13 and 32x27. The intent here appears to be that the browser will fit these into its favicon square; my color-based fallback picks the higher-resolution one, and I have some code that handles this appropriately. (Imagemagick doesn't seem to have an option to scale to fit in an aspect ratio; the promising-looking option averages the two values instead, so I have some special code to detect this case and pick the appropriate image size.)
Once I've produced a standardized PNG image, I pass it through `zopflipng` to compress it. `zopflipng` is a PNG encoder that uses the zopfli algorithm to do its gzip compression; this produces output that's much smaller than most gzip implementations manage, at the expense of more CPU time. These images are so tiny to begin with that the extra CPU time is negligible compared to all the other stuff I'm doing, and the result is often a file 70% smaller than the original imagemagick output. Most images end up well under 1 KB; the smallest (99 bytes) is underscorejs.org, which is two colors plus transparency; the largest is gis.stackexchange.com (2,370 bytes), which has a lot of shading, so almost no two colors are alike.
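The recompression step itself is essentially a one-liner; the filenames here are placeholders, and whether to spend the extra `-m` iterations is a matter of taste.

```sh
# Re-encode the imagemagick output with zopfli's more exhaustive deflate search.
# -y overwrites the output if it exists; -m spends extra iterations for a few more bytes.
zopflipng -m -y favicon-raw.png favicon.png
```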
Finally, I take the sha256sum (encoded as base64url) of each icon to use as the filename for the web server. Making the files content-addressed has several benefits:
- The icons on the results page update as soon as the frontend finds out about them, rather than after a cache expiration.
- Multiple domains that use the same icon (for example, separate marketing and documentation sites) are deduplicated.
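The naming and lookup-table step is roughly the sketch below. The TSV output and the `icons/` directory are stand-ins; the actual table the frontend reads may well be shaped differently.

```sh
# Copy each domain's PNG to a name derived from the base64url-encoded sha256 of its
# contents, and emit a "domain <tab> filename" table for the frontend to look up.
mkdir -p icons
: > icon-map.tsv
for png in *.png; do
  hash=$(openssl dgst -sha256 -binary "$png" | base64 | tr '+/' '-_' | tr -d '=')
  cp "$png" "icons/$hash.png"          # identical icons collapse to one file
  printf '%s\t%s.png\n' "${png%.png}" "$hash" >> icon-map.tsv
done
```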
The end result of all this work is a slightly prettier results page; hopefully the new icons will make it easier to skim the results at a glance.