Experimental single file content addressed database.
"Content address" means that data is keyed by a hash digest.
CADB is a single file database for storing content addressed block data. You can think of it like a key/value store where the keys MUST be hash digests, of a sufficient length and security, and the value is arbitrary binary data.
CADB is a special B+ tree which is both sorted and balanced using hash digests. The keys (digests) are sorted by binary comparison and the tree's structure is chunked using randomness derived from the tail of each hash. This produces a self-balancing and well sorted structure on-disc.
A key benefit of this approach is that the structure is deterministic: the tree stays balanced and sorted on-disc regardless of insertion order, without the need for compaction.
Every write to disc MUST be a complete page file. This means that every write to the append-only file must use write() or writev() for a complete page as described below. If these writes are not structured properly, an abrupt process termination can leave an unreadable CADB file. To avoid this database corruption, implementations MUST guarantee that complete page files are written to disc atomically.
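A minimal sketch of such an append, in Python. The function name is hypothetical and the big-endian byte order is an assumption (the spec does not state one); the only requirement illustrated is that the entire page, ending in the 12-byte DBTAIL, lands in a single write() call on an append-only file descriptor.

```python
import os
import struct

def append_page(fd, sections, root_addr):
    """Append one complete page file in a single write() call.

    `sections` is a list of bytes objects (block data and/or headers);
    `root_addr` is the (POS, LENGTH) ADDR of the new root header.
    """
    pos, length = root_addr
    dbtail = struct.pack(">QI", pos, length)   # 8-byte POS + 4-byte LENGTH
    page = b"".join(sections) + dbtail
    written = os.write(fd, page)               # one write for the whole page
    assert written == len(page)
    return written
```

On POSIX systems, opening the file with O_APPEND makes each write() land contiguously at the end of the file, which is what makes the single-call requirement practical.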
Page File Format Example:
+---------------+
| Block Data |
+---------------+
| Block Data |
+---------------+
| Leaf Header |
+---------------+
| Block Data |
+---------------+
| Branch Header |
+---------------+
| DBTAIL |
+---------------+
A few terms:

POS is an 8 byte BigUint64, used for all file position references.

LENGTH is a single 4 byte Uint32. This means CADB cannot store a single block larger than 4GB.

ADDR is an 8 byte POS followed by a 4 byte Uint32 LENGTH for a read.

Leaf is a node containing key/value pairs of a hash digest (key) and the ADDR of the block data (value).

Branch is a node containing child nodes (either more branches or leaves). It is a list of key/value pairs of the first hash digest in the child (key) and the ADDR of the child's header data (value).

All page file sections are optional and un-ordered (unless being compacted) except the DBTAIL.

The DBTAIL is a single ADDR (12 byte pair of POS and LENGTH) pointing to either a leaf or branch header for the current root of the database.
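These primitives map directly onto fixed-width integers. A sketch of packing them with Python's struct module; big-endian encoding is an assumption here, since the spec does not state a byte order:

```python
import struct

# POS: 8-byte unsigned integer (BigUint64). LENGTH: 4-byte Uint32.
# ADDR is a POS followed by a LENGTH -- 12 bytes total.
def pack_addr(pos, length):
    return struct.pack(">QI", pos, length)

def unpack_addr(buf):
    pos, length = struct.unpack(">QI", buf[:12])
    return pos, length
```

The DBTAIL is simply one such 12-byte ADDR written as the final bytes of a page file.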
If you need to stream block data into the database before committing it you can do so by writing page files with your Block Data
entries followed by the current DBTAIL.
Token is a Uint8.

Type is the header type, either LEAF or BRANCH. (Note: parsing rules are identical between types, but you must know the type in order to interpret the read(...ADDR) of each entry.)

Open/Closed is only necessary in branch headers. It indicates whether or not the last child's entry has closed the chunk.

SIZE is the size of the DIGEST in bytes.

+-----------------------------+
| T | Type | O/C | SIZE |
+-----------------------------+
| 1 | LEAF | N/A | 8 |
| 2 | LEAF | N/A | 16 |
| 3 | LEAF | N/A | 32 |
| 4 | LEAF | N/A | 64 |
| 5 | LEAF | N/A | 128 |
| 6 | LEAF | N/A | VAR |
| 7 | BRANCH | OPEN | 8 |
| 8 | BRANCH | OPEN | 16 |
| 9 | BRANCH | OPEN | 32 |
| 10 | BRANCH | OPEN | 64 |
| 11 | BRANCH | OPEN | 128 |
| 12 | BRANCH | OPEN | VAR |
| 13 | BRANCH | CLOSED | 8 |
| 14 | BRANCH | CLOSED | 16 |
| 15 | BRANCH | CLOSED | 32 |
| 16 | BRANCH | CLOSED | 64 |
| 17 | BRANCH | CLOSED | 128 |
| 18 | BRANCH | CLOSED | VAR |
+-----------------------------+
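The token table can be decoded mechanically, since tokens are grouped in runs of six digest sizes. A sketch (the function and constant names are illustrative, not part of the spec):

```python
LEAF, BRANCH = "LEAF", "BRANCH"
# Digest sizes in table order; "VAR" marks variable-length digests.
SIZES = [8, 16, 32, 64, 128, "VAR"]

def decode_token(token):
    """Map a Uint8 typing token to (type, open/closed, digest size)."""
    if 1 <= token <= 6:
        return (LEAF, None, SIZES[token - 1])
    if 7 <= token <= 12:
        return (BRANCH, "OPEN", SIZES[token - 7])
    if 13 <= token <= 18:
        return (BRANCH, "CLOSED", SIZES[token - 13])
    raise ValueError(f"unknown token {token}")
```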
There are two types of headers, one type for common fixed size digests and another for variably sized digests.
The basic parsing rules for headers are the same except for how to parse the individual entries.
FIXED_LENGTH
+--------------------------+
| TOKEN | ... ENTRIES |
+--------------------------+
VARIABLE_LENGTH
+----------------------------------------------+
| TOKEN | COUNT |...LENGTHS | ... LEAF_ENTRIES |
+----------------------------------------------+
ENTRY
+---------------+
| DIGEST | ADDR |
+---------------+
TOKEN refers to the typing token (see TOKEN_TABLE).

DIGEST is the hash digest key. Its size is determined either by the typing token or by the list of LENGTHS.

COUNT is the number of entries in the header. Given the COUNT you can parse LENGTHS, since every length is a fixed size of 4 bytes.

LENGTHS is an ordered Uint32 list of DIGEST lengths. You can parse the entries by adding 12 bytes to every length (ADDR is 12 bytes).

In a branch header, the DIGEST is the first digest in the child, and read(...ADDR) should be parsed as a HEADER.

In a leaf header, the DIGEST is the hash digest key of the block data, and read(...ADDR) should be parsed as block value data.
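Once the digest size is known from the token, fixed-length entries can be parsed by slicing the buffer into DIGEST + ADDR pairs. A hedged sketch, again assuming big-endian ADDR encoding:

```python
import struct

def parse_fixed_entries(buf, digest_size):
    """Parse the ENTRIES of a fixed-size header (typing token already consumed).

    Each entry is a DIGEST of `digest_size` bytes followed by a 12-byte
    ADDR. Returns a list of (digest, pos, length) tuples.
    """
    entry_size = digest_size + 12
    entries = []
    for off in range(0, len(buf), entry_size):
        digest = buf[off:off + digest_size]
        pos, length = struct.unpack(">QI", buf[off + digest_size:off + entry_size])
        entries.append((digest, pos, length))
    return entries
```

For a branch header, each returned (pos, length) is then read and parsed as another header; for a leaf header, it is read as block value data.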
The database is a binary sorted tree of DIGEST keys. The leaf nodes need to be chunked into headers, and the branches that build the sorted tree also need to be chunked.

The algorithm used for chunking is deterministic. This means that we end up with the same tree structure regardless of the order in which entries were inserted, unlike a typical BTree. Ensuring this deterministic chunking means a little more work when we mutate the tree, but it also means the tree remains balanced without the need to compact the database.

Since hash digests ensure randomization, we use the tail of these hashes to determine where to chunk entries into a single header.
If a DIGEST in a leaf header ends in Uint8 0, it closes the leaf header.

When creating branch headers, we SHA2-256 hash the child header, and if the hash ends in Uint8 0 it closes the branch header. In this case, a TOKEN that is CLOSED must be used for the branch. The last branch header at every layer of depth may end with an entry for a child branch that does not close, and must therefore use an OPEN token.
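The two boundary rules above could be sketched like this. This is a simplified illustration, not a full implementation; the function names are hypothetical:

```python
import hashlib

def chunk_leaf_digests(digests):
    """Split sorted leaf digests into leaf-header chunks.

    A digest whose final byte is 0 closes the current chunk, so
    boundaries fall deterministically regardless of insertion order.
    """
    chunks, current = [], []
    for digest in digests:
        current.append(digest)
        if digest[-1] == 0:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)  # trailing chunk that did not close
    return chunks

def branch_closes(child_header_bytes):
    # A branch header closes when the SHA2-256 of the child header
    # ends in a zero byte.
    return hashlib.sha256(child_header_bytes).digest()[-1] == 0
```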
The reason we use these CLOSED/OPEN tokens is so that we can mutate branch data without reading the tail data and re-hashing it, and without storing the hash of every child in every branch.
The tree manipulation is relatively straightforward if you're familiar with BTree manipulation, except for managing the consistent chunking. Whenever you mutate a header:

OPEN headers need to be re-chunked on any mutation.

If a header is CLOSED and a new entry is appended to the end of the header or the last entry is removed, the header's entries must be concatenated with the following header's entries and re-chunked.

If a header is CLOSED, any new child entries must be checked to see if they close the chunk, and the header split accordingly.

One goal of compaction is that the resulting database file byte-for-byte matches any other compacted store from any other implementation.

TODO: ordering rules.