Building a Lock-Free Cache-Oblivious B-Tree

Part 0: A Prelude

5/5/2020

I like a challenge.

I also have no idea what I'm doing. I've never really built a formal data structure before, let alone a high-performance concurrent one, and so this blog series will be much less of a how-to and more of a "come along with me" type of journey.

Background

This idea was borne out of a project I was working on last year. We were transitioning the primary key of a very large MySQL table from a plain auto-incrementing id column to a composite key (e.g. [routing_key, id]). This netted us somewhere between 4-10x improvements in range reads due to the records being queried together were actually stored together. Who woulda guessed.

The downside of this solution is that it also resulted in a 20-40x decrease in write performance, because of the additional time to seek through the table's B-Tree to find the correct page to insert into (and potentially causing a page split to make room).

Given that result, a few thoughts went through my mind:

Colocating records has massive benefits for query performance and database health
There can only be one sort order chosen for the row data
That write penalty is prohibitively expensive on high-insert tables

The Project

That lead me to my latest poorly-thought-through side project. My goal is to build a cache in front of MySQL that builds fully-covering indexes to common queries stored in the sort order of the query. It's philosophically the same as a normal secondary index on the table, but this will be done automatically as it finds "hot" queries. The design of the system suggest there's going to be a lot of data duplication in here, but that's the tradeoff I'm accepting up-front.

MySQL's usage of B-Trees is obviously a great design choice for maximizing the amount of data being read from disk or sent across the wire, but we can do better! Cache-oblivious data structures are a primarily-academic concept that abstracts away the idea of cache block size which can vary wildly from the L1-L3 cache on the CPU, to the system's memory, all the way up to disk storage. Instead, it tries to lay out the data in a way that implicitly optimizes searches down a binary tree. As the search descends down the tree, each branch is contained in a smaller and smaller window of the memory block, eventually leading to some level where all remaining searches in the tree are inside of the cache and now free—like magic!

The Code

Before getting into the MySQL caching bit, which is likely a massive project on its own, I'm starting "small" with the data structure. I picked Rust as my language due to its concurrency-safety guarantees which will hopefully tell me all of the things I'm doing wrong as I try to build this.

I've started a GitHub repo where the code will live. I make no guarantees about the quality of this project: jahfer/cache-oblivious-b-tree. In the next part of this series, I'm going to be delving into the actual implementation process.

Fingers crossed.