Jahfer's Blog

Building a Lock-Free Cache-Oblivious B-Tree

Part 2: Building the Static-Search Tree


If you missed the intro, check out Part 0: A Prelude.

To be honest, I started pecking away at this project a number of weeks ago, but I'll categorize that effort as "learning Rust". At this point I'm done with my pre-prototypes and ready to start on the real prototype.

Finally, Some Code

This is the design of the search tree I've landed on so far:

struct StaticSearchTree<K: Eq, V> {
  memory: Pin<Box<[Node<K, V>]>>,
  root: NonNull<Node<K, V>>,
  _pin: PhantomPinned,
}

The memory field contains a block of contiguous storage where all of the nodes of the tree will exist. I'm also using Pin<T> to allow me to slice references off of it and store them without fear of the data moving underneath me. root will be the initial access point for a search to begin its traversal.

A core property of the search tree is that it needs to be lock-free to allow concurrent searches during a write event. Since inserts to the packed memory array always happen rightward, an outdated pointer can still scan right until it finds what it was looking for. Eventually the tree is updated so wasteful scans are kept to a minimum.

This is what a single node in that binary search tree looks like:

enum Node<K: Eq, V> {
  Leaf(Element<K>, Block<K, V>),
  Branch {
    min_rhs: Element<K>,
    left: MaybeUninit<NonNull<Node<K, V>>>,
    right: MaybeUninit<NonNull<Node<K, V>>>,
  },
}

A Node::Leaf contains both a pointer to a section in the packed-memory array (Block<K, V>), and a key describing the minimum value in that subarray. A Node::Branch holds pointers to the locations of the left and right nodes as well as the minimum value kept on the rightward branch (as in a normal binary tree).

There are quite a few other structures I built underneath this as well:


pub enum Element<T> {
  Infimum,
  Value(T),
  Supremum,
}

An Element<T> is a simple enum that implements std::cmp::Ord for any <T: Ord>, while providing a generic negative and positive infinity. This way, our tree doesn't need to know how to write infinity for all supported key types, and it doesn't take up any more space than T would itself.

Block<K, V>

struct Block<K: Eq, V> {
  min_key: Element<K>,
  max_key: Element<K>,
  cells: NonNull<[Cell<K, V>]>,
}

As mentioned above, a block is a region of the packed memory array. Specifically, it is a slice of Cell structs.

Cell<K, V>

struct Cell<K, V> {
  version: AtomicU16,
  marker: AtomicPtr<Marker<K, V>>,
  empty: UnsafeCell<bool>,
  key: UnsafeCell<Option<K>>,
  value: UnsafeCell<Option<V>>,
}

A cell represents the actual data stored in the packed memory array. At first glance, this seems to bypass a lot of the safety guarantees that Rust offers: we can create multiple references to a Cell<K, V> and actually mutate the contents! I'll address how this works in the next section.

Marker<K, V>

enum Marker<K, V> {
  Empty(u16),
  Move(u16, K, V),
  InsertCell(u16, K, V),
  DeleteCell(u16, K),
}

This is referenced inside of a cell and lets us record the operation we are about to perform. Concurrent processes can read this instruction and carry out the change in parallel (or before our process does) without having to acquire a lock beforehand. Importantly, each variant in the enum holds a u16 that is synchronized with the version field on the Cell<K, V>.

Concurrent Mutation

The reason Cell<K, V> can skirt the normal rules of mutation in Rust is twofold:

  1. AtomicU16 and AtomicPtr from std::sync::atomic have a store method that only requires &self to write new data to it
  2. We'll only touch the UnsafeCell<_> fields once we've interacted with the atomic fields in a way that ensures unique access for the rest of the fields

The types in std::sync::atomic expose CPU-level atomic operations, ensuring only one thread in the system can read or write that location at any given instant. This works great for small values like AtomicBool or AtomicU8, where you want atomic access to the value itself. We can leverage the same concept in our tree to make sure that whenever we're performing a read, we can access the pointer without locking.

When we go to make a change to the cell, we will perform the following steps in an infinite loop:

  1. Read the current marker
  2. Read the current cell version
  3. If the marker has a version greater than the cell's version, start the loop over
  4. If the marker is not Marker::Empty, perform the described task inside of the marker
  5. If the marker is Marker::Empty, perform a compare-and-swap operation on the field to insert the Marker for the current operation with an incremented version number
  6. If the CAS operation failed, start the loop over
  7. If the CAS operation succeeded, we now have exclusive access to the remaining fields while the Cell's marker and version are out of sync
  8. Perform the operation, and bump the version of the Cell to match the Marker, and set the marker back to Marker::Empty

AtomicPtr<T>: Here be Dragons

The only piece of a Cell<K, V> whose data doesn't live inside of the packed-memory array is the marker. This is because of the potential space it might take up in describing the action to be performed. Instead, we use an AtomicPtr<T> to point to some short-term data that describes the action to be done.

From my understanding, the existence of AtomicPtr<T> is both a blessing and a curse. AtomicPtr<T> is much more complex than the other atomic datatypes: pointers are not values themselves, but just references to where a value lives in memory. While the CPU will prevent two threads from stepping over one another (e.g. one trying to read out the address while another writes a new pointer address), it does not prevent a process from reclaiming the memory behind a pointer after it's been replaced.

I've cooked up a contrived example of where this can go wrong:

use std::sync::atomic::{AtomicPtr, Ordering};

fn main() {
    let mut x = 1;
    let atomic = AtomicPtr::new(&mut x);

    // Process A
    let ptr = atomic.load(Ordering::Acquire);

    // Process B
    let mut y = 2;
    atomic.store(&mut y, Ordering::Release);
    println!("Current address: {:#?}", atomic.load(Ordering::Acquire));

    // Process A
    println!("ptr address: {:#?}", ptr);
}

Running this prints something like:

Current address: 0x00007fff1501f520
ptr address: 0x00007fff1501f504

Oof! When we compare the current value of the pointer to the one stored in ptr, we can see Process A is holding a reference to the old location!

Attempting to dereference this could result in a couple of very serious problems. In the more obvious case, the program crashes (e.g. with a segmentation fault) from accessing invalid memory. In the painfully subtle one, the data that used to live at that address has since been freed, and new data from a wholly unrelated part of the program has taken its place. You might never know something was wrong.

Fortunately, Rust tries to hint at the dangers and forces you to wrap any pointer dereferences in an unsafe { .. } block:

// Process A
let value = unsafe { *ptr }; /*šŸ”„*/

In the case of our data structure, we're going to leverage Pin<T> to make sure our data doesn't move from where it was originally allocated. Since our leaves point into a range of the packed-memory array, a pointer that hasn't been updated since an insert or delete still points to a valid location, and we can scan rightward from the old position to find the value we actually want.

Next Steps

Now that we have a rough grasp on the data structure design, the next post will get into the actual logic of initializing and performing searches on the tree.