The "SQL" Magic Box is a Lie
Why I built my own database, and the physical constraints of software.
For a long time, I thought the database was just a magic box.
You build an application. You need to save user data. You send a formatted SQL string over the wire, and the box sends you JSON back. I used to tell myself that at the end of the day, it is all just a DB query.
That works when you are building standard web apps. But it completely falls apart the second you step into high-throughput, event-driven systems.
When you start dealing with massive concurrency, the magic box breaks. You realize that a database is not a query engine. It is an operating system in disguise. It is deeply constrained by physical reality.
I realized how little I actually understood about what happens after you hit execute. So I decided to build my own database from scratch (B+Tree, append only).
Btw B-Tree and B+Tree, there is a difference.
Here is what you learn when you finally open the box.
Writing to a file is a lie
If you want to save data, your instinct is to just open a file and write to it.
The problem is that the operating system lies to you. When the write function returns a success code, the data isn't on the disk. It is sitting in the OS page cache in memory. If someone kicks the server's power cable, that data is gone forever. If you try to overwrite a file and crash halfway through, you corrupt the whole file.
To actually, truly save data, you have to fight the operating system. You use a very specific pattern:
Write data to a temporary file.
Call
fsync(temp)to force the hardware to physically flush data to the disk.Call
fsync(dir)to flush the directory metadata (inode).Call
rename(temp, target).
The rename operation is guaranteed atomic by POSIX. If the process crashes before the rename, the original file is untouched. This is exactly how SQLite survives power failures.
Hash maps are useless on disk
Every software interview tests you on hash maps. They are incredibly fast in memory.
But on a physical disk, they are terrible. If you want to find all users who signed up between January and March, a hash map cannot help you because it has no order.
To find data fast on a disk, you need a B+ Tree.
A disk reads data in chunks, usually 4 kilobytes at a time. A B+ Tree is designed so that every node is exactly the size of one of those pages. You pack as many routing keys into that 4KB block as possible. When you search for a record, you are just hopping from one 4KB block to the next, minimizing the amount of work the physical hardware has to do.
You can't lock the world
In an event-driven system, you have massive read and write concurrency. If a reader has to wait for a writer to finish updating a row, the entire system grinds to a halt.
The solution is an immutable, Copy-on-Write (CoW) B-Tree.
When you update a record, you don't actually overwrite the old data. You create a brand new copy of that specific path in the tree. The readers keep looking at the old version without being blocked. The writers build the new version in the background. Once the write is completely safe on the disk via a two-phase commit, you flip a single pointer in the master page to make the new tree the source of truth.
There are no massive locks. The system just keeps moving.
Why not just put it on a GPU?
Once I understood this architecture, my first thought was: if this is just math and memory mapping, why don't we put the database on a GPU? A GPU has thousands of cores. It should be infinitely faster.
I was wrong again.
GPUs are built for massive parallel processing of identical instructions (SIMD). A B-Tree traversal is the exact opposite. It is entirely branching logic. If key < X, go left. Else, go right. When different GPU cores try to take different branches in the tree, it causes "warp divergence," and the GPU panics, executing one path at a time while the other cores sit idle.
Add in the fact that moving data across the motherboard's PCIe bus to the GPU VRAM takes longer than a CPU takes to just execute the point query, and you realize why transactional databases live strictly on CPUs.
GPUs are for scanning massive, unbranched columnar arrays. CPUs are for navigating complex, branching trees.
What happens when you close the box
Building this permanently breaks how you view software.
Now, when I write a basic update query at work, I don't see a string of text anymore. I see the B-tree traversal. I see the exact 4KB page allocations in memory. I see the hardware fsync barriers keeping the data alive against power failures.
I used to think it was all just a query. Now I know you can never ignore the physical machine.
Reference Architecture
If you want to build this yourself, here are the core technical concepts I used to structure the engine.
1. The B-Tree Node Format (Binary Layout) Nodes are raw byte slices as the same format in memory and on disk. | type (2B) | nkeys (2B) | pointers (nkeys x 8B) | offsets (nkeys x 2B) | key-values | Each Key-Value pair is encoded as: | klen (2B) | vlen (2B) | key (klen B) | val (vlen B) |
2. The Two-Phase Commit Every update follows a strict barrier protocol:
Phase 1: Write all new B-tree pages to disk ->
fsync()(data must reach disk before phase 2).Phase 2: Update the master page with the new root pointer ->
fsync()(must reach disk before responding to the client).
3. Multi-Version Concurrency Control (MVCC) Because the tree is CoW, old pages are orphaned on every update. A free list tracks pages no longer reachable from the current root. A freed page can only be safely reused when its version is older than the oldest active reader transaction in the system.
4. Secondary Indexes Each secondary index is a separate KV prefix in the B-tree.
Index key: encoded index columns + primary key columns.
Index value: empty. To fetch a row, you decode the primary key from the index key, then execute a primary key lookup.