Feep! » Blog » Post

The Constant DataBase file format

For a new feature I’m building (“quick results”, coming Soon™), I needed a simple key-value file format, ideally one that was fast and easy work with. The “CDB” format seemed to fit the bill, but there weren’t any JavaScript libraries for it that I liked, so I found myself digging in to the internals to undertand how it worked; and this blog post is a summary of my investigation into the details of the file format.

(Why not SQLite? Well, I was using that originally, but bulk-updating the database seems to have caused a lot of churn in the B-Tree index it uses; I want to store this data on a spinning HDD, and the resulting seeks caused updates to become very slow. CDB is optimized for linear writes, which is perfect for a HDD, and most lookups take two disk seeks or less, which is also about the optimum. The tradeoff is that I need to rewrite the entire database to update it, but that’s fine for this use case since updates are rare and rebuilding the .cdb isn’t the slow part anyway.)

Most of the information in this post was gleaned from the cdb format specification and cdb(5) from tinycdb. Yusuke Shinyama also has a diagram of the cdb file structure; and R. J. Howe’s implementation of cdb has some notes on the format which I only discovered after writing this post.

What is CDB?

“CDB” (short for “Constant DataBase”) is a file format for a write-once key-value store. It was created by D. J. Bernstein for use in djbdns, qmail, and others; as it’s a useful general-purpose format, various other libraries and utilities have since been created for it. (In particular, I’m using TinyCDB to build my database files, though I’ve written my own utility to read them.)

Some quick facts:

File layout

A CDB file is arranged in three sections, like so:

┌──────────────────────────────────────┐
│           Table of contents          │
├──────────────────────────────────────┤
│                                      │
│             Data section             │
│                                      │
│                   ┌──────────────────┤
├───────────────────┘                  │
│             Hash tables              │
│       (aka “Index section”) ┌────────┘
└─────────────────────────────┘

(There is no alignment or padding; everything is squished together as much as it will go.)

I believe the sections are ordered like this so that, when writing the database, the writer can:

  1. Reserve 2048 bytes at the start of the file
  2. Dump all the input into the data section and forget about it, retaining only the hash and offset
  3. Write the hash tables, and go back to fill in the TOC.

This means cdbmake only needs to keep 16 bytes per record (the hash and offset) in memory while preparing the database; the keys and data can total more than available memory without causing any problems.

Hash function

Since CDB files are based on hash tables, we need a hash function. CDB uses (seemingly arbitrarily) the following:

uint32 hash([]byte key)key.reduce(5381,(uint32 hv,byte c)(  ((hv< ⁣ ⁣<5)+hv)c  ))\begin{align*} &{\small\text{uint32 }}\texttt{hash}({\small\pmb{\normalsize[\,]}\text{byte }}key) \Rightarrow key\text{.reduce}(\\ &\hspace{2em} 5381,\\ &\hspace{2em} ({\small\text{uint32 }}hv, {\small\text{byte }} c) \Rightarrow \big(\; ((hv \mathbin{<\!\!<} 5) + hv) \mathbin{^\wedge} c \;\big)\\ &) \end{align*}

(Note that all hash calculations are done with uint32{\small\text{uint32}} values; I had some trouble because JavaScript’s << operator uses signed values. The fix for JS seems to be to add ‘>>> 0’ onto the end of the expression, to recast from sint32{\small\text{sint32}} to uint32{\small\text{uint32}}.)

Looking up records

To look up a key in the database, first calculate HV=hash(key)HV = \texttt{hash(}key\texttt{)}, and then read the file as explained below.

Read Table of Contents

The TOC (also known as the “initial pointers”) takes up the first 2048 bytes of the file (256 entries×2 uint32entry×4 bytesuint32)(256 \text{ entries} \times \frac{2 {\text{ uint32}}}{\text{entry}} \times \frac{4 \text{ bytes}}{\text{uint32}}), and has the type:

toc:[256]{uint32le fileOffset,uint32le slotCount}\texttt{toc} : \pmb{\big[} 256 \pmb{\big]} \{{\small\text{uint32le }}\texttt{fileOffset}, {\small\text{uint32le }}\texttt{slotCount}\}

Each key is bucketed into a hash table TT by T=(HVmod256)T = (HV \bmod 256). If toc[T].slotCount\texttt{toc}[T].\texttt{slotCount} is zero, that entire hash table doesn’t exist; if a hash value goes into that bucket we can immediately conclude that the key isn’t in the database.

Read hash table

Hash table TT is found at toc[T].fileOffset\texttt{toc}[T].\texttt{fileOffset} bytes into the file, and has the type:

table:[toc[T].slotCount]{uint32le hash,uint32le fileOffset}\texttt{table} : \pmb{\big[} \texttt{toc}[T].\texttt{slotCount} \pmb{\big]} \{{\small\text{uint32le }} \texttt{hash}, {\small\text{uint32le }}\texttt{fileOffset}\}

To find a hash HVHV in the hash table, start at slot S=((HV> ⁣ ⁣>3)mod(toc[T].slotCount))S = \big((HV \mathbin{>\!\!>} 3) \bmod (\texttt{toc}[T].\texttt{slotCount)}\big). (> ⁣ ⁣>3>\!\!>3 is equivalent to ÷256\div\:256, rounding down. For JavaScript, again, I had to use >>> 3 instead to get unsigned semantics.) Then proceed through the table; for each slot:

If you get to the end of the table before you stop, wrap around to the beginning (at toc[T].fileOffset\texttt{toc}[T].\texttt{fileOffset}) and continue until you get to an empty slot or come back to where you started.

Read data section

The data section contains all of the “records” (key+value pairs) in the order they were added to the database; each one has the type:

{uint32le keyLength,uint32le valueLength,[keyLength]byte key,[valueLength]byte value}\{ {\small\text{uint32le }}\texttt{keyLength}, {\small\text{uint32le }}\texttt{valueLength}, {\small \pmb{\normalsize[} \texttt{keyLength} \pmb{\normalsize]} \text{byte }} \texttt{key}, {\small \pmb{\normalsize[} \texttt{valueLength} \pmb{\normalsize]} \text{byte }} \texttt{value} \}

Normally, you’d find a record by getting a fileOffset\texttt{fileOffset} from the hash table, but it’s also possible to read all records in a file sequentially by starting at byte 2048 (just after the TOC) and scanning through records until byte min(...toc[].fileOffset)\min(...\texttt{toc[}{*}\texttt{].fileOffset}) (just before the first hash table).

When a slot with a matching hash is found in the hash table:

  1. Look up the record starting at R=table[S].fileOffsetR = \texttt{table}[S].\texttt{fileOffset} bytes into the file.
  2. If file[R].keyLength\texttt{file}[R].\texttt{keyLength} does not equal len(key)\text{len}(key): return to reading the hash table; don’t bother looking at file[R].key\texttt{file}[R].\texttt{key} since if the length is different it can’t possibly be the same key. (This happens when there’s a hash collision with a key of a different length.)
  3. If file[R].key\texttt{file}[R].\texttt{key} does not equal keykey: return. (This happens when there’s a hash collision with a key of the same length.)
  4. Yield file[R].value\texttt{file}[R].\texttt{value}, then return to reading the hash table for additional values.

Conclusion

And with that successful lookup, so ends my little tour of the CDB file format. This was the first time in a while that I’ve done any work involving binary formats, so it was an interesting diversion. I hope this description gives you an idea of how cdb works, and why I’ve decided to use it as the basis for this new feature.