
DuckDB + Parquet for looking at links

In my last post, I wrote about cleaning up URL processing. Those URLs are used to process information about pages and the links between them (for computing things like PageRank metrics), and all that data needs to get stored somewhere. Until recently I was using a collection of ad-hoc file formats, which meant that pulling any data out required writing a bunch of bespoke code. This was tedious at best and meant that I didn’t have much visibility into statistics about the pages I was indexing, because they were hard to query. I’ve just migrated all of this data to Parquet files, and now I can run SQL queries with DuckDB, which is much more convenient.

If I had fully normalized the schema for page and link data, it would look like this:

erDiagram
  chunk {
    enum datasource
    string name
    hash version
  }
  chunk ||--|{ page : "has many"
  page {
    string title
    bool nofollow
  }
  page ||--o{ link : "has many"
  page }|--|| canonical_url : "has a"
  link {
    string text
  }
  link }|--|| raw_url : "destination has a"
  raw_url {
    url raw_url
    bool can_rank
    string origin
  }
  raw_url }|--|| canonical_url : "has a"
  canonical_url {
    url canonical_url
    string origin
  }
  canonical_url }|--|| normalized_url : "has a"
  normalized_url {
    uuidv5 normalized_id
    float pagerank
  }

However, in practice the schema I'm using is heavily denormalized. This is partly for convenience when querying (I wish something like a key join were an accepted part of the SQL standard), and partly for performance. I tried joining the normalized tables when running queries, and DuckDB kept running out of memory: there is no convenient ordering that makes these joins efficient, so it attempted to construct hash tables, but the data needed to run the query was just too big, and spilling intermediate results to disk wasn’t enough to help.
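
To give a sense of what that looks like, here’s roughly the shape of query I was attempting, written against the normalized entities in the diagram above (a sketch only; the *_id join keys are made-up names for illustration):

select
  src_norm.normalized_id as src_normid,
  dst_norm.normalized_id as dst_normid,
  dst_norm.pagerank as dst_pagerank
from link
-- resolve the source side: link -> page -> canonical_url -> normalized_url
join page on link.page_id = page.page_id
join canonical_url as src_can on page.canonical_url_id = src_can.canonical_url_id
join normalized_url as src_norm on src_can.normalized_url_id = src_norm.normalized_url_id
-- resolve the destination side: link -> raw_url -> canonical_url -> normalized_url
join raw_url as dst_raw on link.raw_url_id = dst_raw.raw_url_id
join canonical_url as dst_can on dst_raw.canonical_url_id = dst_can.canonical_url_id
join normalized_url as dst_norm on dst_can.normalized_url_id = dst_norm.normalized_url_id;

Every hop in that chain turns into a hash table over millions of URL-sized rows, which is roughly where the memory went.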

The raw_url, canonical_url, and normalized_url tables are actually derived data—everything in these tables is computed from the raw URL string—but this calculation can be quite involved and isn't practical to implement in SQL, so it’s simplest to re-compute it for every row while I’m generating the pages and links tables; that way I don’t actually need to do any lookups while outputting the data. Combining all of this together results in a schema that’s quite wide:

create table pages (
    chunk_name varchar,
    url varchar,
    url_origin varchar,
    url_canonical varchar,
    url_normalized varchar,
    url_normid uuid,
    title varchar
);
create table links (
    src varchar,
    src_can_rank boolean,
    src_origin varchar,
    src_canonical varchar,
    src_normalized varchar,
    src_normid uuid,
    dst varchar,
    dst_can_rank boolean,
    dst_origin varchar,
    dst_canonical varchar,
    dst_normalized varchar,
    dst_normid uuid,
    text varchar,
    nofollow boolean
);
create table pageranks (
    normid uuid,
    pagerank float
);

The pageranks table is stored separately so I don’t have to rebuild all the other tables after recomputing rankings; fortunately, it’s small enough that joining to it isn’t a performance problem.

(Actually, the join is done not by UUID but by a serially-assigned integer called prid, but the way that works is quite involved and mostly an artifact of the way this code has evolved over time, so I’ve left it out of this post for simplicity.)

I created a couple of views on top to merge the Parquet files for each chunk and to join in the PageRank data for convenience:

create or replace view pagerank as (select * from "pagerank/data/pageranks.parquet");
create or replace view pages as (
  select pages.*, pagerank.rank as url_pagerank
  from "chunkinfo/data/cache/*/pages.parquet" as pages
  left outer join pagerank on pages.url_prid = pagerank.prid
);
create or replace view links as (
  select
    links.*,
    pagerank_src.rank as src_pagerank,
    pagerank_dst.rank as dst_pagerank,
    exists (select 1 from pages where pages.url_prid = dst_prid) as dst_indexed
  from "chunkinfo/data/cache/*/links.parquet" as links
  left outer join pagerank as pagerank_src on links.src_prid = pagerank_src.prid
  left outer join pagerank as pagerank_dst on links.dst_prid = pagerank_dst.prid
);
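
A nice side effect of wrapping everything in views is that DuckDB can report the merged schema for me, so I don’t have to remember which columns ended up where; a quick sanity check is just:

describe links;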

So now I can run queries, like quickly checking how many pages are currently in the index (it turns out the answer is 32,174,986):

select count(*) from pages;

Or finding the most popular Wikipedia pages that I’m still missing:

select distinct
  dst_canonical, dst_prid, dst_pagerank
from links
where
  not dst_indexed and
  dst_canonical like 'https://en.wikipedia.org/wiki/%'
order by dst_pagerank desc
limit 100

Or looking up back-links:

select src, dst, text
from links
where dst_canonical = 'https://en.wikipedia.org/wiki/Anti-pattern'

Or whatever other query I might need to run while investigating something. I’m very glad that DuckDB supports reading directly from external Parquet files; this means that I can manage the data using my existing file-based caching architecture, while still getting the advantages of a SQL OLAP database for querying. I expect this to be very useful for investigating bugs in the future, as well as for finding out which sites would be most helpful to scrape next.
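
Since the views are just globs over the per-chunk files, I can also go underneath them when I want per-file information. For example, something like this (a sketch; filename=true asks DuckDB to expose the source file path as a column) breaks the page count down by chunk file:

-- count how many pages each chunk's Parquet file contributes
select filename, count(*) as page_count
from read_parquet('chunkinfo/data/cache/*/pages.parquet', filename=true)
group by filename
order by page_count desc;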