
B-Trees: More Than I Thought I'd Want to Know

dragontamer 2021-08-18 21:17:42 +0000 UTC [ - ]

I feel like FAT and FAT32 back in the DOS days were so simple and easy to explain that my generation of programmers hit the "sweet spot" in terms of learning technology.

FAT / FAT32 wasn't about B-trees or balancing or anything. It was a simple linked list that got severely tangled up during use and required an hour of "defragmenting" to get your hard drive back to full speed.

As a computer user at the time, I was too young / ignorant to know why it happened like that, but that experience stuck with me until college. In college, I learned how the FAT32 filesystem worked fundamentally (though obsolete, it's so simple it's a great introduction).

From there, I learned why we have moved on to more advanced data structures like B-Trees, why things like B-trees are less prone to fragmentation, and so on.
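To make the "simple linked list" part concrete, here's a tiny Python sketch (my own illustration, not real FAT code) of how a FAT-style table chains a file's clusters together; a fragmented file is just a chain whose clusters are scattered across the disk:

```python
# Minimal sketch of a FAT-style allocation table: entry i names the cluster
# that follows cluster i in the file's chain. Values here are made up.
END_OF_CHAIN = -1

# Hypothetical fragmented file: its chain is 2 -> 9 -> 3 -> 17 -> end.
fat = {2: 9, 9: 3, 3: 17, 17: END_OF_CHAIN}

def cluster_chain(fat, first_cluster):
    """Follow the linked list of clusters that makes up one file."""
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        yield cluster            # on spinning media, each hop may be a long seek
        cluster = fat[cluster]

print(list(cluster_chain(fat, 2)))   # [2, 9, 3, 17]
```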

----------

A big part of my learning process was the hours of waiting I'd put in to defrag my hard drive back in the day. Those memories stuck with me: I know what the "downsides" are if I don't use a self-balancing B-Tree to store data on a hard drive.

Perhaps today, we're no longer at a point where we can just convince legions of new computer-users (or programmers) to use inferior file systems like FAT32 in their daily lives. But we can at least tell stories about those times, so that they can understand why we put so much complexity into our modern filesystem data-structures.

wirthjason 2021-08-18 23:56:20 +0000 UTC [ - ]

Gone are the days of selling software that magically makes your computer faster. Half of those programs just ran defrag under the hood.

nanidin 2021-08-18 22:07:27 +0000 UTC [ - ]

Wasn’t defragmentation more important in general on spinning media to reduce random seeks? And now most performance oriented things are happening on SSD?

NikolaNovak 2021-08-19 01:27:15 +0000 UTC [ - ]

Defragmentation was indeed more important on spinning disks; I think your "to reduce random seeks" is largely correct. Another way to think about it is that it ensured sequential reads ("read in this 10MB file") did not become random seeks ("Sure, but that 10MB file is in these 5,381 chunks all over the place").

You are correct that SSDs do not require defragmentation; to a certain degree defragmentation is even somewhat harmful there, in the sense that it forces unnecessary read/write cycles.

nayuki 2021-08-19 06:09:00 +0000 UTC [ - ]

If I understand correctly, NTFS describes a file as a sequence of cluster ranges, e.g. [7,12), [0,4), [20,25). So the more fragments a file or folder has, the more space is required to describe how it's stored. This applies to the file system and is independent of HDD/SSD technology.

As for SSDs, they need some degree of garbage collection - it's considered bad for an erasable block (e.g. 1 MiB) to contain a mix of live pages and deleted pages (4 KiB each).
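A rough Python sketch of the extent-list idea (the structures are illustrative, not NTFS's actual on-disk format): the file's size is the same either way, but a fragmented file needs more ranges to describe it.

```python
# A file described as a list of half-open cluster ranges (extents).
def total_clusters(extents):
    """Size of the file in clusters, however fragmented it is."""
    return sum(end - start for start, end in extents)

contiguous = [(0, 14)]                      # one extent: 14 clusters in a row
fragmented = [(7, 12), (0, 4), (20, 25)]    # the same 14 clusters, in 3 pieces

print(total_clusters(contiguous), len(contiguous))   # 14 clusters, 1 range
print(total_clusters(fragmented), len(fragmented))   # 14 clusters, 3 ranges
# More fragments mean more ranges to record, and more seeks to read the file.
```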

dragontamer 2021-08-18 22:18:44 +0000 UTC [ - ]

I'm very confused by your question. I don't think FAT or FAT32 was ever a performance-oriented filesystem.

Back then, we were just happy having a personal computer. I'm sure there was better tech being used somewhere. But MS-DOS / Windows was just your cheap $99 OS in an age when other compilers and OSes cost you something like $1000 to $10,000.

You formatted your hard-drive in FAT because MS-DOS used FAT. You used MS-DOS because that's what PCs and PC-clones came with. There was no alternative, and PCs were already considered a luxury / nerd item. (I know people brag about their DEC Alphas, Unix copies or OS/2... but you got enough "nerd cred" back then if you simply had a PC at all)

It's just how computers worked back then. Every PC was using FAT, and every few months of regular use the disk reads/writes would get so slow that performance was unbearable.

At that point, you spent an hour defragmenting the hard drive, and everything worked great again.

-------

Eventually 4GB hard drives were made, and you upgraded your computer to FAT32. Soon after that, Windows promised an end to defragmenting with its next-generation NTFS filesystem (based on the B-trees discussed in this article) and we never looked back.

NikolaNovak 2021-08-19 01:28:18 +0000 UTC [ - ]

>>"Windows promised an end to defragmenting with its next-generation NTFS filesystem"

I thought this mostly reduced fragmentation of the filesystem metadata, not the fragmentation of the actual data files themselves?

rasz 2021-08-19 05:29:24 +0000 UTC [ - ]

How would btrees on their own solve the fragmentation problem? The source of the slowdown was random seek speed, not the data structure used. You can optimize a FAT32 driver to always fit files into sequential gaps; early Windows implementations didn't.

dragontamer 2021-08-19 05:48:24 +0000 UTC [ - ]

Files grow and shrink. Think about even the simplest "log" file (ex: Apache logs) appending constantly to the end of the file. First you have access.log (+1 line to the end). Then Apache opens error.log and adds some info to the end. Then PHP logs something somewhere, again opening and closing a new set of files.

Even if these files were all perfectly allocated in sequential order at the beginning, a few operations like this quickly fragments the files.
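Here's a toy simulation of that point (my own illustration, not any real allocator): a naive "hand out the next free block" policy interleaves two growing logs on disk, even though each one started out contiguous.

```python
# Two files appended to in alternation under a naive next-free-block allocator.
next_free_block = 0
files = {"access.log": [], "error.log": []}

def append_block(name):
    global next_free_block
    files[name].append(next_free_block)
    next_free_block += 1

for _ in range(4):              # simulate alternating appends
    append_block("access.log")
    append_block("error.log")

print(files["access.log"])      # [0, 2, 4, 6] -- every other block
print(files["error.log"])       # [1, 3, 5, 7] -- interleaved with the first file
```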

--------------

I don't recall all the files MS-DOS or Win95 opened / closed on a regular basis, but play a video game (save file changed), open a Word document, edit a photo, and come back to the video game... and the save file will now start getting fragmented as the save files change size and new blocks need to be allocated.

rasz 2021-08-19 07:11:25 +0000 UTC [ - ]

- How would btrees help with that?

- Can you name a couple of games that save state by appending to / modifying files in place on disk?

bern4444 2021-08-18 17:55:19 +0000 UTC [ - ]

I had to implement a simple B+ tree for a databases class in college (since they're often used when implementing databases).

It was one of those assignments few people fully completed. Most completed parts of it to varying degrees. I got more than halfway through but wasn't able to complete it.

It was an extremely difficult assignment and stands out in my mind.

They're very interesting and very complex. Nice article!

lmkg 2021-08-18 19:17:14 +0000 UTC [ - ]

Implementing a B+ tree was the final assignment in a grad-level Algorithms & Data Structures class that I took. I thought it was a reasonable assignment in that context. It felt like a good capstone to the class, building on what we had learned and finally crossing the line from toy examples to something that approached real-world usefulness.

It seems very out-of-place for a databases course. Yes, B-Trees are important for a database, but I don't see how implementing one teaches you about how a database uses one or why it's important.

jaytaylor 2021-08-19 00:05:58 +0000 UTC [ - ]

As another data point, I implemented a B+ tree with rebalancing deletion functionality back in CS3090 at UVU circa 2008 with Dr. Curtis Wellborn.

It was an undergraduate course, but the professor emphasized that everywhere else it would be a graduate-level course.

My B+ tree implementation served as the foundation for my own (shitty) DBMS ;D

As a funny sidenote, the professor claimed that even Stanford's DB courses had a buggy / broken algorithm for balanced deletes. While preparing the course, he'd spent a chunk of the summer researching it and had developed his algorithm in Prolog. He shared the concepts and our implementations worked fine. However, it was ultimately mostly an academic exercise, as balanced deletions come with significant tradeoffs which make them not as useful as I initially thought.

Even full-fledged database systems don't typically implement B+ tree indexes with deletion+rebalancing support (the usual technique is marking nodes deleted and cleaning them up when the index is rebuilt).
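A hedged sketch of that "mark deleted, clean up later" approach, scoped to a single leaf (the Leaf class here is hypothetical, not any particular engine's):

```python
class Leaf:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.deleted = set()           # tombstones instead of rebalancing

    def delete(self, key):
        if key in self.keys:
            self.deleted.add(key)      # no node merging, no parent updates

    def lookup(self, key):
        return key in self.keys and key not in self.deleted

    def compact(self):
        """Run during an index rebuild: physically drop tombstoned keys."""
        self.keys = [k for k in self.keys if k not in self.deleted]
        self.deleted.clear()

leaf = Leaf([3, 7, 9])
leaf.delete(7)
print(leaf.lookup(7))   # False, even though 7 still occupies space in the leaf
leaf.compact()
print(leaf.keys)        # [3, 9]
```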

psykotic 2021-08-19 05:48:34 +0000 UTC [ - ]

> Even full-fledged database systems don't typically implement B+ tree indexes with deletion+rebalancing support (the usual technique is marking nodes deleted and cleaning them up when the index is rebuilt).

Yes, and it took a surprisingly long time before anyone bothered to do the theoretical analysis. Some good papers to check out if anyone is interested are Deletion Without Rebalancing in Multiway Search Trees (http://sidsen.azurewebsites.net/papers/relaxed-b-tree-tods.p...) and Deletion Without Rebalancing in Binary Search Trees (http://sidsen.azurewebsites.net//papers/ravl-trees-journal.p...). Jeff Erickson also has some good notes on using local and global rebuilds for search trees: https://jeffe.cs.illinois.edu/teaching/algorithms/notes/10-s...

It's intuitively clear that worst-case performance [1] can degrade to a function of the number of insertions since a rebuild (as opposed to the number of currently live items) when you use relaxed deletion. If you can afford to globally rebuild the tree you get to play the usual amortization tricks, but global rebuilds can be problematic in a large-scale live database; one solution is to rebuild from a snapshot in the background (on another thread or machine) and then replay late arriving operations before switching over to the rebuilt tree. But if someone is okay with the amortization spikes from rehashing a hash table, they should be okay with non-incremental global rebuilds for search trees; a hash table where deletions leave behind tombstones and the table is rehashed when the ratio of tombstones gets too high is basically the same idea.
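A compact sketch of that hash-table analogy (illustrative only: fixed capacity, linear probing, bare keys, no resizing on load; the threshold is arbitrary): deletes leave tombstones, and the whole table is rebuilt once tombstones pile up.

```python
EMPTY, TOMBSTONE = None, object()

class TombstoneSet:
    def __init__(self, capacity=16):
        self.slots = [EMPTY] * capacity
        self.tombstones = 0

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:                          # assumes the table never fills completely
            yield i
            i = (i + 1) % len(self.slots)

    def contains(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return False
            if self.slots[i] == key:
                return True

    def add(self, key):
        if self.contains(key):
            return
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                if self.slots[i] is TOMBSTONE:
                    self.tombstones -= 1     # reusing a dead slot
                self.slots[i] = key
                return

    def remove(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] == key:
                self.slots[i] = TOMBSTONE    # relaxed deletion
                self.tombstones += 1
                if self.tombstones * 4 > len(self.slots):
                    self._rebuild()          # the non-incremental global rebuild
                return

    def _rebuild(self):
        live = [s for s in self.slots if s not in (EMPTY, TOMBSTONE)]
        self.slots = [EMPTY] * len(self.slots)
        self.tombstones = 0
        for k in live:
            self.add(k)
```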

Tostino 2021-08-18 20:43:11 +0000 UTC [ - ]

There are actually some pretty good database courses, at CMU for example, that go into what is required to build a database from the ground up.

So don't think all DB classes are the same; there are vast differences between that type of class and a class teaching you how to use SQL.

ddlutz 2021-08-18 20:15:32 +0000 UTC [ - ]

I've seen quite a few database courses have you implement one, since B-trees are the backbone of many database systems.

cogman10 2021-08-18 17:59:40 +0000 UTC [ - ]

Perhaps it was just my professor, but what really helped was that we first did red-black trees, and after that our teacher explained to us that red-black trees are just a type of B-Tree. Once you have that, it's pretty easy to go from one to the other.
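A small Python sketch of that correspondence (the Node class is hypothetical): fold each black node together with its red children and you get one node of a 2-3-4 tree, i.e. a B-tree of order 4.

```python
class Node:
    def __init__(self, key, color, left=None, right=None):
        self.key, self.color, self.left, self.right = key, color, left, right

def to_234_node(black_node):
    """Collect the keys of one 2-3-4 node: a black node plus any red children."""
    keys = [black_node.key]
    for child in (black_node.left, black_node.right):
        if child is not None and child.color == "red":
            keys.append(child.key)
    return sorted(keys)

# A black 10 with red children 5 and 15 encodes the 3-key B-tree node [5, 10, 15].
n = Node(10, "black", Node(5, "red"), Node(15, "red"))
print(to_234_node(n))   # [5, 10, 15]
```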

jasonwatkinspdx 2021-08-19 05:15:01 +0000 UTC [ - ]

That's the approach Sedgewick of the (in?)famous textbook takes. He also has one neat tweak that simplifies rebalancing: when traversing a B-tree for insertion, preemptively split any full nodes as you visit them. This ensures there will always be room in a parent node if child nodes force keys up.
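A rough Python sketch of that "split full nodes on the way down" idea, in the spirit of Sedgewick/CLRS rather than taken from either book; T is the minimum degree, so nodes hold at most 2*T - 1 keys.

```python
T = 2   # minimum degree: nodes hold at most 2*T - 1 = 3 keys

class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf

    def is_full(self):
        return len(self.keys) == 2 * T - 1

def split_child(parent, i):
    """Split parent's full child i; the median key moves up into the parent."""
    child = parent.children[i]
    right = BTreeNode(leaf=child.leaf)
    mid = T - 1
    parent.keys.insert(i, child.keys[mid])
    parent.children.insert(i + 1, right)
    right.keys = child.keys[mid + 1:]
    child.keys = child.keys[:mid]
    if not child.leaf:
        right.children = child.children[T:]
        child.children = child.children[:T]

def insert(root, key):
    if root.is_full():                        # split a full root before descending
        new_root = BTreeNode(leaf=False)
        new_root.children.append(root)
        split_child(new_root, 0)
        root = new_root
    node = root
    while not node.leaf:
        i = sum(1 for k in node.keys if k < key)
        if node.children[i].is_full():
            split_child(node, i)              # split on the way down...
            if key > node.keys[i]:            # ...so this parent always has room
                i += 1
        node = node.children[i]
    node.keys.insert(sum(1 for k in node.keys if k < key), key)
    return root

root = BTreeNode()
for k in [8, 3, 10, 1, 6, 14, 4, 7, 13]:
    root = insert(root, k)
print(root.keys)    # [3, 8]
```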

nicolaslem 2021-08-18 19:32:05 +0000 UTC [ - ]

I got interested in the topic a few years ago and wrote an implementation in Python[0] for myself. It's just a toy, but the funny thing is that I regularly get waves of students starring the repository, probably when a teacher gives a similar assignment.

[0] https://github.com/NicolasLM/bplustree

samsquire 2021-08-19 09:22:01 +0000 UTC [ - ]

B-trees can be really simple. I've written a super simple, readable B-tree here in Python.

https://github.com/samsquire/btree

It doesn't balance between neighbours on the left and right like some B-trees do, but it handles splits in a really simple way and balances otherwise.

I would recommend everybody write the core algorithm in a simple language like Python first, to get the core algorithm understood, before trying to implement it in a low-level language like C or Rust.
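For instance, the lookup half of the algorithm fits in a few lines of Python (a minimal sketch with a hypothetical node shape, not code from the linked repository): binary-search the keys in a node, then descend into exactly one child.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical node shape: sorted keys, and children (empty list for a leaf).
Node = namedtuple("Node", ["keys", "children"])

def search(node, key):
    """Standard B-tree lookup: binary-search the keys, then descend one child."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if not node.children:                   # leaf: key is not in the tree
        return False
    return search(node.children[i], key)    # child i holds keys between keys[i-1] and keys[i]

leaf_a = Node([1, 4], [])
leaf_b = Node([9, 12], [])
root = Node([7], [leaf_a, leaf_b])
print(search(root, 9), search(root, 5))     # True False
```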

pkaye 2021-08-19 00:29:18 +0000 UTC [ - ]

I still remember my B-tree implementation in college. The professor wanted us to run it against his data. I quickly commented out my debug prints, which introduced a bug in an "if" statement, and it failed for some of his test cases, but I got most of the credit. Later I found the cause of the bug and got pissed off at such a silly mistake.

macintux 2021-08-19 00:30:57 +0000 UTC [ - ]

Discovering that your debugging code introduced a new bug is intensely frustrating. It's not nearly as rare an occasion for me as I would prefer.

efficax 2021-08-18 20:59:23 +0000 UTC [ - ]

An in-memory B-tree or one stored on disk? The first one shouldn't be a huge amount of work beyond implementing, say, a red-black tree. But sorting out the disk storage is hard (and in the end, you've just written a database!)

mad_vill 2021-08-18 17:58:40 +0000 UTC [ - ]

UW Madison?

jwitthuhn 2021-08-18 21:52:59 +0000 UTC [ - ]

I'm not the original commenter, but I sure did implement a B-tree for my databases course at UW Madison.

cryptonector 2021-08-18 18:53:12 +0000 UTC [ - ]

The best B-tree resource is SQLite3's documentation of its B-tree on-disk format.

emptybits 2021-08-18 20:19:17 +0000 UTC [ - ]

Fantastic. Also a great related listen is the recent Corecursive podcast with Richard Hipp as guest. He's the man who created SQLite, and he has good technical stories to share.

“The Untold Story of SQLite” https://corecursive.com/066-sqlite-with-richard-hipp/

craigching 2021-08-19 02:07:33 +0000 UTC [ - ]

Postgres has great documentation for its concurrent B-tree algorithm, based on the Lehman & Yao paper: https://github.com/postgres/postgres/blob/master/src/backend...

jgwil2 2021-08-18 19:05:24 +0000 UTC [ - ]

cryptonector 2021-08-19 01:04:52 +0000 UTC [ - ]

jgwil2 2021-08-18 21:58:17 +0000 UTC [ - ]

Instead of downvoting could someone just provide the correct link?

macintux 2021-08-19 00:34:38 +0000 UTC [ - ]

I've upvoted your question, and keep hoping someone will chime in. That certainly looks like the closest match.

spanspan 2021-08-18 20:38:54 +0000 UTC [ - ]

This is really cool. I recently attempted to build a toy database and subsequently implemented my own B-tree (https://github.com/spandanb/learndb-py). I ended up running into a lot of these issues.

I also did a write-up on why everyone should engage in similar projects: https://www.spandanbemby.com/joys-of-database.html

eatonphil 2021-08-18 18:19:49 +0000 UTC [ - ]

It would be great to see a walkthrough writing a minimal btree from scratch with reading/writing from disk.

If you've got one (walkthrough with code, not just an existing implementation), link it please!

spanspan 2021-08-18 20:40:49 +0000 UTC [ - ]

This incomplete tutorial is about how to write a sqlite clone, with particular emphasis on how to implement the underlying b-tree: https://cstack.github.io/db_tutorial/

ddlutz 2021-08-18 22:10:39 +0000 UTC [ - ]

For YEARS that tutorial has been stuck on "Alright. One more step toward a fully-operational btree implementation. The next step should be splitting internal nodes. Until then!". Must not be worth the time to finish writing the tutorial.

craigching 2021-08-19 02:11:55 +0000 UTC [ - ]

I just implemented one in Go and am working on the REPL now. I plan to eventually write it in C++ for a real database, but I'm doing this for my team, to teach them the difference between LSM trees and B-trees. I'm hoping to show that a specialized concurrent B-tree, though harder to write, performs better than LSM trees for the same data. Maybe I'll make a blog post about it.

EDIT: For more context, this is for time series metrics, so super specialized data.

jasonwatkinspdx 2021-08-19 05:18:14 +0000 UTC [ - ]

I'd be interested in that blog post for sure.

CodeWriter23 2021-08-18 20:31:24 +0000 UTC [ - ]

IMO, unlike the opening drawing, in a disk-based B-Tree index leaf nodes should not have "next" pointers; one should instead go back up the tree node(s) (iterating upward when necessary) to find the next leaf node. Next pointers are a duplication of information that violates referential integrity within the index, and they will be invalidated when an index update causes a split or merge rebalancing operation, creating concurrency issues.

SamReidHughes 2021-08-18 22:48:48 +0000 UTC [ - ]

When you split a node, you're probably [1] already modifying the node you're splitting, so there's no problem updating unidirectional sibling pointers. If duplication of information is a problem, well, you have a bug in your storage engine somewhere.

[1] It occurs to me that you don't necessarily need to modify the node you're splitting, if the key you're writing is on the new node's half of the split.

However, sibling pointers would be the dumbest way to traverse the data, even on a read-only tree, because they incur a serial round-trip latency to disk for every block. You would never use them for traversal.

They can be useful, for a fancy concurrency hack, if you use them to split the node without updating the parent, and then update the parent in a separate operation after the fact. This lets write operations release their lock on the parent before accessing the child. In that case any code traversing to a child makes note of the child's right sibling stored in the parent and uses the child's sibling pointer only if it's different from the right sibling as stored in the parent (which means the parent hasn't been updated yet).
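A heavily hedged sketch of that "move right" check, in the spirit of Lehman & Yao B-link trees; the field names (children, right_sibling, high_key) are assumptions for illustration, not taken from any particular engine.

```python
# Sketch only: assumes each node records its right sibling and a high_key
# bounding the keys it owns, as in Lehman & Yao style B-link trees.
def descend(parent, i, key):
    """Move from parent into child i, compensating for a split the parent
    hasn't been told about yet."""
    expected = parent.children[i + 1] if i + 1 < len(parent.children) else None
    child = parent.children[i]
    # If the child's own sibling pointer disagrees with the parent, the child
    # split after we read the parent; follow the pointer only when the key
    # now lives in the split-off right half.
    while child.right_sibling is not expected and key > child.high_key:
        child = child.right_sibling
    return child
```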

hashmash 2021-08-18 23:21:53 +0000 UTC [ - ]

Concurrency issues are a challenge, but a bigger problem arises when designing a copy-on-write B-Tree, which is what I did. With such a design, any change to a node causes a replacement to be allocated, and then the pointer to it must be changed too. This means that the referencing node must be copied, and so on.

If only the parent pointers need to be changed (up to the root), this isn't a big deal, considering that B-Trees aren't very deep, and a good implementation will cache modifications anyhow.

When sibling pointers are introduced, a copy-on-write B-Tree will on average need to read and rewrite half of the B-Tree on every modification. For this reason, I don't use sibling pointers.
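A small Python sketch of the path-copying behaviour described above (node layout is hypothetical and simplified; node splitting is omitted): changing one leaf allocates a new leaf plus new copies of every ancestor up to the root, while the old tree stays intact.

```python
class CowNode:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []        # empty list for a leaf

def cow_insert(node, key):
    """Return a new root; the old tree is left untouched (a free snapshot)."""
    if not node.children:                                # leaf: copy it with the new key
        return CowNode(sorted(node.keys + [key]))
    i = sum(1 for k in node.keys if k < key)             # choose which child to descend into
    new_children = list(node.children)
    new_children[i] = cow_insert(node.children[i], key)
    return CowNode(list(node.keys), new_children)        # every ancestor is copied too

old_root = CowNode([10], [CowNode([1, 5]), CowNode([12, 17])])
new_root = cow_insert(old_root, 7)
print(old_root.children[0].keys)   # [1, 5]     -- the old version is still intact
print(new_root.children[0].keys)   # [1, 5, 7]  -- only the path to this leaf was copied
```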

a1369209993 2021-08-19 06:23:47 +0000 UTC [ - ]

> one should instead go back up the [parent] node(s) (iterating upward when necessary) to find the next Leaf node.

Nitpick: rather, you should not go anywhere, but refer back to the parent nodes that you already have in memory, because how else did you find the current leaf node in the first place? (This also allows you to avoid round-trip latency by fetching sibling-after-next before you actually get the data in the sibling node.)
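A sketch of iterating that way, without sibling pointers, by keeping an explicit stack of the ancestors you already read on the way down (node layout is hypothetical: sorted keys, children empty at leaves):

```python
class Node:
    def __init__(self, keys, children=()):
        self.keys, self.children = keys, list(children)

def in_order(root):
    """Yield every key in order, using a stack of (node, next_child_index) pairs."""
    stack = [(root, 0)]
    while stack:
        node, i = stack.pop()
        if not node.children:                   # leaf: emit its keys, then backtrack
            yield from node.keys
            continue
        if 0 < i <= len(node.keys):
            yield node.keys[i - 1]              # separator between child i-1 and child i
        if i < len(node.children):
            stack.append((node, i + 1))         # resume here after finishing child i
            stack.append((node.children[i], 0))

tree = Node([7], [Node([1, 4]), Node([9, 12])])
print(list(in_order(tree)))    # [1, 4, 7, 9, 12]
```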

tomnipotent 2021-08-19 01:53:42 +0000 UTC [ - ]

> Leaf nodes should not have "next" pointers

In a standard B-tree, yes.

PostgreSQL and MySQL, for example, use B-tree variants that have pointers between leaf nodes. Postgres is based on Lehman & Yao (itself a B+(*?) Tree variant), and InnoDB uses a B+ Tree with a next pointer.

masklinn 2021-08-18 18:22:14 +0000 UTC [ - ]

> In my college Data Structures and Algorithms course, we covered B-Trees, but I didn’t grok why I’d choose to use one. As presented, B-Trees were essentially “better” Binary Search Trees, with some hand-waving done that they had improved performance when used in database applications.

With modern memory hierarchies that also tends to be the case in-memory (with lower node densities more suited to caches and cache lines).

pkaye 2021-08-18 19:09:54 +0000 UTC [ - ]

This article talks about some of the reasons in the section "Disk-Induced Constraints".

masklinn 2021-08-18 19:45:53 +0000 UTC [ - ]

Yes, except for #2 those reasons apply essentially as-is to in-memory b-trees if you replace “memory” with “cache” and “disk” with “memory”.

dragontamer 2021-08-18 21:27:34 +0000 UTC [ - ]

DDR4 has a burst-length of 8, meaning the smallest you can write is a block of 64 bytes (corresponding to a modern cache-line).

It should also be noted that DDR4 has "rows" roughly 1024 bytes in size. A RAS command opens a "row" (aka 1024 bytes), while a CAS command reads a column (aka 64 bytes).

DDR4 cells can only be read once (!!!), and afterwards, the data gets mangled. To prevent the loss of data, all "reads" are first into sense-amplifiers... and these sense-amplifiers can be read infinitely.

This "read into sense-amplifiers" is called a RAS command, and it transfers the entire 1024-byte row into sense amplifiers. A CAS can then read 64-byte chunks from the page at high speeds and randomly.

--------

Before a new RAS command can be issued, there needs to be a few steps.

1. The "current row" of data needs to be written back to DRAM (remember, the data in the DRAM was lost when you read the data in the first place).

2. After writing the data, the sense-amplifiers must be emptied (aka: precharged), so that they're ready to receive the new data.

3. After #1 and #2, it is finally legal to issue a RAS

--------

So in fact, DDR4 RAM is also a block device. Sure, RAS is much faster than a hard-drive seek, but given how much slower RAS (row address strobe) is than a CAS (column address strobe)... a lot of these disk-based discussions (such as B-trees) end up applying to DDR4 in practice.

--------

The only thing that really works like classic RAM is L3 / L2 / L1 caches. Everything else in the modern world is basically a block device.

toolslive 2021-08-18 18:09:04 +0000 UTC [ - ]

B+Trees minimize the maximum depth of the lookup. All the values have exactly the same cost. In all honesty, I don't care about the lookup cost of the values I never need.

bob1029 2021-08-18 18:36:46 +0000 UTC [ - ]

Sounds like you are looking for a Splay Tree implementation. Their dynamic optimality properties ensure your most popular data lives right at the root.

toolslive 2021-08-18 18:59:34 +0000 UTC [ - ]

For example. In essence, any tree that allows you to "rotate" based on some statistics about subtree popularity. This becomes more and more interesting in combination with copy-on-write strategies, because there you have to rewrite the top of the tree anyway. So why not aggregate some statistics while you're at it?

jlokier 2021-08-19 01:53:21 +0000 UTC [ - ]

The "hot items" optimisation you're thinking of have less of an effect on trees with wide fanout, such as B-trees with a substantial block size.

This is because the ratio of the number of "hot" items you can store at a reduced depth to the total number of items is much smaller with a high-fanout tree than with a low-fanout tree; and at the same time, the depths are lower with a high-fanout tree, reducing the benefit of reduced depth anyway.

In other words, you can only give the benefit to a smaller number of hot items, and the benefit you can give is smaller.
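A rough back-of-the-envelope version of that argument (my own illustrative numbers, assuming a classic B-tree that stores items in interior nodes, where roughly a 1/fanout fraction of items can live above the leaf level):

```python
import math

n = 16_000_000
for fanout in (2, 256):
    depth = math.ceil(math.log(n, fanout))   # levels from root to leaf
    above_leaves = 1 / fanout                # rough share of items not stored in leaves
    print(f"fanout {fanout:>3}: depth ~{depth}, "
          f"~{above_leaves:.1%} of items can sit above the leaves, "
          f"max saving {depth - 1} levels")
# fanout   2: depth ~24, ~50.0% of items can sit above the leaves, max saving 23 levels
# fanout 256: depth ~3, ~0.4% of items can sit above the leaves, max saving 2 levels
```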

On top of that, the added complications of rotations and statistics mean fewer items can be stored in interior nodes (which I'd call index nodes) for a given node size. This reduces the fanout and increases the full depth.

And the rotations and access statistics cause more writing.

There are tradeoffs and perhaps it is worth it for some tree parameters. Perhaps for exceptionally large trees with particular access patterns. It would be interesting to see how it's implemented.

But generally for a high-fanout B-tree, you might as well just store frequently accessed items in a separate small tree as a cache, and keep that cached in memory.