A URL by many different names
Over time, the indexing and ranking code has gotten hard to follow, in part because I was using the term “URL” for what turned out to be several different things. I recently cleaned that up, which not only made the code easier to work with but also gave me a good opportunity to explain how the system thinks about links.
A raw URL is simply the destination of an <a href> link. This is calculated as new URL(a.href, window.document.baseURI). This turns a possibly-relative URL into an absolute URL, including handling the <base href> if the document has one.
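A minimal sketch of that resolution step, runnable outside a browser. The helper name resolveRawUrl is hypothetical, and the base URI is passed in as a string because there is no window.document here; returning null for an unparseable href is also an assumption, not something described above.

```javascript
// Hypothetical helper: resolve an href attribute against the document's base URI.
// In a browser, baseURI would come from window.document.baseURI, which already
// reflects a <base href> element if the document has one.
function resolveRawUrl(href, baseURI) {
  try {
    return new URL(href, baseURI).href;
  } catch {
    // new URL() throws a TypeError when the href cannot be parsed at all.
    return null;
  }
}

// Relative link, no <base href>: the base is the document's own URL.
console.log(resolveRawUrl("../faq.html#top", "https://example.com/docs/guide/intro.html"));
// https://example.com/docs/faq.html#top

// Relative link in a document whose <base href> points at another host.
console.log(resolveRawUrl("img/logo.png", "https://cdn.example.net/v2/"));
// https://cdn.example.net/v2/img/logo.png
```

An href that is already absolute ignores the base entirely, so the same call works for every link on the page.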
The canonical URL for a page is the URL I’ve chosen as its primary address, which will be shown in the search results. For example, all of these links have the same canonical URL, https://stackoverflow.com/questions/1732348:
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
https://stackoverflow.com/questions/1732348/you-can-actually-put-anything-here
https://stackoverflow.com/q/1732348
http://www.stackexchange.com/q/1732348
https://stackoverflow.com/a/1732454 (which redirects to https://stackoverflow.com/questions/1732348/.../1732454#1732454)
Turning a raw URL into a canonical one involves several layers of checks. First, if I’ve actually crawled the URL and it redirected, I follow the redirects to the final destination.
Then, for certain sites, I have code that can do structure-aware rewrites (like /w/index.php?title=... to /wiki/... for Wikipedia).
Finally, if all else fails, I just fall back to the raw URL.
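The three layers above can be sketched roughly as follows. This is an illustration under stated assumptions, not the real code: the redirect table (a Map from crawled URL to redirect target), the function names, and the exact Wikipedia rule are all hypothetical.

```javascript
// Layer 1: follow recorded redirects to the final destination.
// `redirects` is a hypothetical Map of crawled URL -> URL it redirected to.
function followRedirects(url, redirects, maxHops = 10) {
  const seen = new Set();
  let current = url;
  // Stop on unknown URLs, redirect loops, or after too many hops.
  while (redirects.has(current) && !seen.has(current) && seen.size < maxHops) {
    seen.add(current);
    current = redirects.get(current);
  }
  return current;
}

// Layer 2: a site-specific, structure-aware rewrite (simplified Wikipedia rule).
// Returns null when the rule does not apply.
function rewriteWikipedia(url) {
  const u = new URL(url);
  if (!u.hostname.endsWith(".wikipedia.org") || u.pathname !== "/w/index.php") return null;
  const title = u.searchParams.get("title");
  if (!title) return null;
  // Extra parameters (action=edit, oldid=...) select a different view, so skip those.
  if ([...u.searchParams.keys()].some((key) => key !== "title")) return null;
  // Simplification: real titles containing ":" or "/" would need gentler encoding.
  return `${u.origin}/wiki/${encodeURIComponent(title)}`;
}

// Layer 3: fall back to the (redirect-resolved) raw URL when no rewriter matches.
function canonicalize(rawUrl, redirects, rewriters) {
  const resolved = followRedirects(rawUrl, redirects);
  for (const rewrite of rewriters) {
    const rewritten = rewrite(resolved);
    if (rewritten !== null) return rewritten;
  }
  return resolved;
}

const redirects = new Map([["http://example.com/old", "https://example.com/new"]]);
console.log(canonicalize("https://en.wikipedia.org/w/index.php?title=Regular_expression", redirects, [rewriteWikipedia]));
// https://en.wikipedia.org/wiki/Regular_expression
console.log(canonicalize("http://example.com/old", redirects, [rewriteWikipedia]));
// https://example.com/new
```

Keeping each rewriter as a function that returns null when it does not apply makes it easy to add another site without touching the fallback logic.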
A normalized “URL” is a last-resort guess at figuring out which URLs refer to the same page. This avoids splitting the PageRank for a document that happens to have multiple URLs, and reduces duplicates among known-but-uncrawled pages. (For the latter case, I also have heuristics to choose the “best” raw URL to use as a canonical URL for the group.)
I put “URL” in scare quotes for this one because it isn’t actually a Resource Locator: it’s a sort of ‘perceptual hash’ that looks like a URL, but might not actually be a valid way of retrieving a document.
To discourage accidentally interpreting one as a URL, I try to handle them as version 5 UUIDs with a custom namespace, as a convenient way of making a stable opaque identifier. I still keep around the string representation, though, since it’s convenient to be able to look at it while debugging.
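For illustration, here is one way to derive a version 5 UUID from a normalized string in Node.js, following the standard construction (SHA-1 over the namespace bytes plus the name, with the version and variant bits overwritten). The namespace value and the identifier names are made up for this sketch; the real custom namespace is not shown here.

```javascript
const { createHash } = require("node:crypto");

// Standard DNS namespace UUID from RFC 4122, used here only to derive a namespace.
const NAMESPACE_DNS = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

function uuidv5(name, namespace) {
  const namespaceBytes = Buffer.from(namespace.replace(/-/g, ""), "hex");
  const hash = createHash("sha1").update(namespaceBytes).update(name, "utf8").digest();
  const bytes = hash.subarray(0, 16);    // a UUID keeps the first 16 of SHA-1's 20 bytes
  bytes[6] = (bytes[6] & 0x0f) | 0x50;   // set the version field to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80;   // set the RFC 4122 variant bits
  const hex = bytes.toString("hex");
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20)].join("-");
}

// Hypothetical custom namespace, derived once from an arbitrary placeholder name.
const NORMALIZED_URL_NAMESPACE = uuidv5("normalized-url.example", NAMESPACE_DNS);

// The same normalized string always maps to the same opaque identifier.
const id = uuidv5("http://example.com/document", NORMALIZED_URL_NAMESPACE);
console.log(id);
```

Because the identifier is a pure function of the string, it can be recomputed anywhere without a lookup table, while the string itself stays available for debugging.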
For example, take the following list of raw URLs:
https://example.com/document/index.aspx
http://www.example.com/document.html
http://example.com/document/?utm_source=ref
I normalize all of these to http://example.com/document, even though these URLs could, in theory, point to different documents with distinct content. If the server is conventionally configured, this normalized URL would likely also work to fetch the same document; but there’s no guarantee of that, so it can’t be used for retrieval, only to group similar URLs and guess that they probably refer to the same page.
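A small sketch of a normalizer that reproduces the example above. The specific rules (force http, drop a leading www., drop index.* filenames and a few common extensions, drop trailing slashes and utm_* parameters) are guesses inferred from the three sample URLs, not a description of the actual heuristics.

```javascript
// Hypothetical normalizer: the result is a grouping key, not something to fetch.
function normalizeUrl(rawUrl) {
  const u = new URL(rawUrl);
  const host = u.host.replace(/^www\./, "");
  const path = u.pathname
    .replace(/\/index\.[a-z0-9]+$/i, "/")   // /document/index.aspx -> /document/
    .replace(/\.(html?|aspx?|php)$/i, "")   // /document.html -> /document
    .replace(/\/+$/, "");                   // /document/ -> /document
  // Drop tracking parameters and sort the rest so parameter order does not matter.
  const params = [...u.searchParams]
    .filter(([key]) => !key.toLowerCase().startsWith("utm_"))
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  const query = new URLSearchParams(params).toString();
  // The scheme is always http and the fragment is discarded.
  return `http://${host}${path}${query ? "?" + query : ""}`;
}

const rawUrls = [
  "https://example.com/document/index.aspx",
  "http://www.example.com/document.html",
  "http://example.com/document/?utm_source=ref",
];
console.log(rawUrls.map(normalizeUrl));
// All three entries are "http://example.com/document"
```

Rules like these are deliberately lossy, which is why the output is only safe for grouping.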
Finally, one more type of value, as yet unimplemented but conceptually relevant, is a “version URL template”. This looks like https://docs.python.org/%s/library/datetime.html and represents the same page across different software versions, so I can add a version picker to search results. (Right now, searches often show the same page for many versions, cluttering the results.)
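Since this part is unimplemented, the following is purely a speculative sketch of how such templates might be derived and used. The version-detection pattern, the function names, and the sample URLs are all assumptions.

```javascript
// Hypothetical rule: a path segment that is a dotted number (3, 3.12) or "dev"
// is treated as the version.
const VERSION_SEGMENT = /^(?:\d+(?:\.\d+)*|dev)$/;

// Replace the first version-like path segment with %s; null if there is none.
function toVersionTemplate(url) {
  const u = new URL(url);
  const segments = u.pathname.split("/");
  const index = segments.findIndex((segment) => VERSION_SEGMENT.test(segment));
  if (index === -1) return null;
  const version = segments[index];
  segments[index] = "%s";
  return { template: `${u.origin}${segments.join("/")}`, version };
}

// Fill a template back in to get the URL for one specific version.
function fillTemplate(template, version) {
  return template.replace("%s", () => version);
}

// Group search results so each template appears once, with its known versions.
function groupByTemplate(urls) {
  const groups = new Map();
  for (const url of urls) {
    const parsed = toVersionTemplate(url);
    const key = parsed ? parsed.template : url;
    if (!groups.has(key)) groups.set(key, []);
    if (parsed) groups.get(key).push(parsed.version);
  }
  return groups;
}

const groups = groupByTemplate([
  "https://docs.python.org/3.11/library/datetime.html",
  "https://docs.python.org/3.12/library/datetime.html",
]);
console.log(groups);
// One entry: "https://docs.python.org/%s/library/datetime.html" with versions 3.11 and 3.12
```

Grouping by template is what would let a single search result carry a version picker instead of appearing once per version.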
Cleaning up how I think about URLs has made this part of the code much easier to work with. There are a lot of interesting edge cases; by formalizing the structure I’ve made the code simpler to understand for future changes, both small tweaks that I’ve been putting off and bigger features like a version selector.