B-Trees: More Than I Thought I'd Want to Know
bern4444 2021-08-18 17:55:19 +0000 UTC [ - ]
It was one of those assignments few people fully completed; most finished parts of it to varying degrees. I got more than halfway through but wasn't able to finish it.
It was an extremely difficult assignment and stands out in my mind.
They're very interesting and very complex. Nice article!
lmkg 2021-08-18 19:17:14 +0000 UTC [ - ]
It seems very out of place for a databases course. Yes, B-Trees are important for a database, but I don't see how implementing one teaches you about how a database uses one or why it's important.
jaytaylor 2021-08-19 00:05:58 +0000 UTC [ - ]
It was an undergraduate course, but the professor emphasized that everywhere else it would be a graduate-level course.
My B+ tree implementation served as the foundation for my own (shitty) DBMS ;D
As a funny sidenote, the professor claimed that even Stanford DB courses had a buggy / broken algorithm for balanced deletes. While preparing the course, he'd spent a chunk of the summer researching it and had developed his algorithm in Prolog. He shared the concepts and our implementations worked fine. However, it was ultimately mostly an academic exercise, as balanced deletions come with significant tradeoffs which make them not as useful as I initially thought.
Even full-fledged database systems don't typically implement B+ tree indexes with deletion+rebalancing support (the usual technique is marking nodes deleted and cleaning them up when the index is rebuilt).
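Roughly, a "mark it dead, clean it up at rebuild time" leaf looks something like the toy sketch below (the names and layout are invented for illustration, not taken from any particular DBMS):

    # Toy leaf: (key, value, deleted) entries; deletes only flip the flag.
    class Leaf:
        def __init__(self):
            self.entries = []                 # list of [key, value, deleted]

        def lookup(self, key):
            for k, v, dead in self.entries:
                if k == key:
                    return None if dead else v
            return None

        def delete(self, key):
            for e in self.entries:
                if e[0] == key:
                    e[2] = True               # tombstone; no merging or rebalancing

    def rebuild(leaves):
        # A periodic/offline index rebuild is what finally drops the dead entries.
        return [(k, v) for leaf in leaves for k, v, dead in leaf.entries if not dead]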
psykotic 2021-08-19 05:48:34 +0000 UTC [ - ]
Yes, and it took a surprisingly long time before anyone bothered to do the theoretical analysis. Some good papers to check out if anyone is interested are Deletion Without Rebalancing in Multiway Search Trees (http://sidsen.azurewebsites.net/papers/relaxed-b-tree-tods.p...) and Deletion Without Rebalancing in Binary Search Trees (http://sidsen.azurewebsites.net//papers/ravl-trees-journal.p...). Jeff Erickson also has some good notes on using local and global rebuilds for search trees: https://jeffe.cs.illinois.edu/teaching/algorithms/notes/10-s...
It's intuitively clear that worst-case performance [1] can degrade to a function of the number of insertions since a rebuild (as opposed to the number of currently live items) when you use relaxed deletion. If you can afford to globally rebuild the tree you get to play the usual amortization tricks, but global rebuilds can be problematic in a large-scale live database; one solution is to rebuild from a snapshot in the background (on another thread or machine) and then replay late arriving operations before switching over to the rebuilt tree. But if someone is okay with the amortization spikes from rehashing a hash table, they should be okay with non-incremental global rebuilds for search trees; a hash table where deletions leave behind tombstones and the table is rehashed when the ratio of tombstones gets too high is basically the same idea.
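For anyone who wants the hash-table analogy spelled out, here's a toy open-addressing table where deletes leave tombstones and a full rehash (the "global rebuild") kicks in past a threshold; the class name and thresholds are invented for illustration:

    TOMBSTONE = ("<deleted>",)   # unique sentinel for dead slots

    class TombstoneTable:
        # Open-addressing hash table: deletes leave tombstones, and the whole
        # table is rebuilt (rehashed) once tombstones pile up past a threshold.
        def __init__(self, capacity=8):
            self.capacity = capacity
            self.slots = [None] * capacity      # None = empty slot
            self.live = 0
            self.dead = 0

        def _find(self, key):
            i = hash(key) % self.capacity
            while self.slots[i] is not None:
                if self.slots[i] is not TOMBSTONE and self.slots[i][0] == key:
                    return i                    # found a live entry
                i = (i + 1) % self.capacity     # skip tombstones and mismatches
            return i                            # first empty slot

        def get(self, key):
            slot = self.slots[self._find(key)]
            return None if slot is None else slot[1]

        def put(self, key, value):
            if (self.live + self.dead + 1) * 3 > self.capacity * 2:
                self._rebuild()                 # keep load factor under ~2/3
            i = self._find(key)
            if self.slots[i] is None:
                self.live += 1
            self.slots[i] = (key, value)

        def delete(self, key):
            i = self._find(key)
            if self.slots[i] is not None:
                self.slots[i] = TOMBSTONE
                self.live -= 1
                self.dead += 1
                if self.dead * 4 > self.capacity:   # too many tombstones: rebuild
                    self._rebuild()

        def _rebuild(self):
            old = [s for s in self.slots if s is not None and s is not TOMBSTONE]
            self.capacity = max(8, 2 * self.live)
            self.slots = [None] * self.capacity
            self.live = self.dead = 0
            for k, v in old:
                self.put(k, v)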
Tostino 2021-08-18 20:43:11 +0000 UTC [ - ]
So don't think all DB classes are the same; there are vast differences between that type of class and a class teaching you how to use SQL.
ddlutz 2021-08-18 20:15:32 +0000 UTC [ - ]
cogman10 2021-08-18 17:59:40 +0000 UTC [ - ]
jasonwatkinspdx 2021-08-19 05:15:01 +0000 UTC [ - ]
nicolaslem 2021-08-18 19:32:05 +0000 UTC [ - ]
samsquire 2021-08-19 09:22:01 +0000 UTC [ - ]
https://github.com/samsquire/btree
It doesn't redistribute keys between the left and right neighbours like some B-trees do, but it handles splits in a really simple way and stays balanced otherwise.
I would recommend everybody write the core algorithm in a simple language like Python first, to really understand it, before trying to implement it in a low-level language like C or Rust.
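To give a flavour of what "handles splits in a really simple way" can look like, here's a generic Python sketch of insert-with-split (this is not code from the repo above; MAX_KEYS and the structure are arbitrary):

    import bisect

    MAX_KEYS = 4   # toy order; real trees pick this from the page size

    class Node:
        def __init__(self, keys=None, children=None):
            self.keys = keys or []
            self.children = children or []   # empty for leaf nodes

        def is_leaf(self):
            return not self.children

    def insert(root, key):
        # Insert into the tree rooted at `root`, returning the (possibly new) root.
        split = _insert(root, key)
        if split:                                # root overflowed: grow upward
            sep, right = split
            return Node([sep], [root, right])
        return root

    def _insert(node, key):
        if node.is_leaf():
            bisect.insort(node.keys, key)
        else:
            i = bisect.bisect_right(node.keys, key)
            split = _insert(node.children[i], key)
            if split:                            # child split: absorb separator
                sep, right = split
                node.keys.insert(i, sep)
                node.children.insert(i + 1, right)
        if len(node.keys) > MAX_KEYS:            # overflow: split this node
            mid = len(node.keys) // 2
            sep = node.keys[mid]                 # median key moves up to the parent
            right = Node(node.keys[mid + 1:], node.children[mid + 1:])
            node.keys = node.keys[:mid]
            node.children = node.children[:mid + 1]
            return sep, right
        return None

    def search(node, key):
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        return False if node.is_leaf() else search(node.children[i], key)

    root = Node()
    for k in [5, 1, 9, 2, 7, 3, 8, 6, 4]:
        root = insert(root, k)
    assert all(search(root, k) for k in range(1, 10))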
pkaye 2021-08-19 00:29:18 +0000 UTC [ - ]
macintux 2021-08-19 00:30:57 +0000 UTC [ - ]
efficax 2021-08-18 20:59:23 +0000 UTC [ - ]
cryptonector 2021-08-18 18:53:12 +0000 UTC [ - ]
emptybits 2021-08-18 20:19:17 +0000 UTC [ - ]
“The Untold Story of SQLite” https://corecursive.com/066-sqlite-with-richard-hipp/
craigching 2021-08-19 02:07:33 +0000 UTC [ - ]
jgwil2 2021-08-18 19:05:24 +0000 UTC [ - ]
cryptonector 2021-08-19 01:04:52 +0000 UTC [ - ]
spanspan 2021-08-18 20:38:54 +0000 UTC [ - ]
I also did a write-up on why everyone should engage in similar projects: https://www.spandanbemby.com/joys-of-database.html
eatonphil 2021-08-18 18:19:49 +0000 UTC [ - ]
If you've got one (walkthrough with code, not just an existing implementation), link it please!
spanspan 2021-08-18 20:40:49 +0000 UTC [ - ]
ddlutz 2021-08-18 22:10:39 +0000 UTC [ - ]
craigching 2021-08-19 02:11:55 +0000 UTC [ - ]
EDIT: For more context, this is for time series metrics, so super specialized data.
CodeWriter23 2021-08-18 20:31:24 +0000 UTC [ - ]
SamReidHughes 2021-08-18 22:48:48 +0000 UTC [ - ]
[1] It occurs to me that you don't necessarily need to modify the node you're splitting, if the key you're writing is on the new node's half of the split.
However, sibling pointers would be the dumbest way to traverse the data, even on a read-only tree, because that traversal incurs a serial round-trip to disk for every block. You would never use them for traversal.
They can be useful for a fancy concurrency hack: use them to split a node without updating the parent, and then update the parent in a separate operation after the fact. This lets write operations release their lock on the parent before accessing the child. In that case, any code traversing to a child notes the child's right sibling as stored in the parent, and follows the child's own sibling pointer only if it differs from that (which means the child was split and the parent hasn't been updated yet).
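A toy in-memory sketch of that read-side check (field names are made up, and comparing against the node's last key is a stand-in for a proper high key):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        keys: List[int]
        children: List["Node"] = field(default_factory=list)  # empty for leaves
        right: Optional["Node"] = None    # sibling pointer, set at split time

    def descend(parent: Node, key: int) -> Node:
        # Pick the child the parent currently points at for `key`, and note
        # which right sibling the parent *thinks* that child has.
        i = 0
        while i < len(parent.keys) and key >= parent.keys[i]:
            i += 1
        child = parent.children[i]
        expected = parent.children[i + 1] if i + 1 < len(parent.children) else None
        # If the child's sibling pointer differs from the parent's record, the
        # child was split and the parent hasn't caught up yet; if our key is past
        # the child's last key, it must have moved into the new sibling.
        while child.right is not expected and child.keys and key > child.keys[-1]:
            child = child.right
        return child

    # Example: leaf A was split into A and B, but the parent still records
    # A's right sibling as C and doesn't know about B yet.
    C = Node(keys=[90, 95])
    B = Node(keys=[50, 60], right=C)     # new node produced by the split
    A = Node(keys=[10, 20], right=B)     # A's sibling pointer already updated
    parent = Node(keys=[80], children=[A, C])
    assert descend(parent, 60) is B      # reader follows the sibling pointer
    assert descend(parent, 15) is A      # no detour needed for keys still in A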
hashmash 2021-08-18 23:21:53 +0000 UTC [ - ]
If only the parent's pointer needs to be changed (and so on up to the root), this isn't a big deal, considering that B-Trees aren't very deep, and a good implementation will cache modifications anyhow.
When sibling pointers are introduced, a copy-on-write B-Tree will on average need to read and rewrite half of the B-Tree on every modification: copying a node changes its location, which forces the sibling that points to it to be rewritten, which in turn moves that sibling, and so on down the chain. For this reason, I don't use sibling pointers.
a1369209993 2021-08-19 06:23:47 +0000 UTC [ - ]
Nitpick: rather, you should not go anywhere, but refer back to the parent nodes that you already have in memory, because how else did you find the current leaf node in the first place? (This also allows you to avoid round-trip latency by fetching sibling-after-next before you actually get the data in the sibling node.)
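Something like the sketch below, for a B+-tree-style layout where the data lives in the leaves (a toy in-memory version that ignores concurrent writers, which is where sibling pointers earn their keep):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        keys: List[int]
        children: List["Node"] = field(default_factory=list)  # empty for leaves

    def scan_leaves(root: Node):
        # Walk the leaves left to right using the ancestors we already hold in
        # memory, rather than leaf-to-leaf sibling pointers. `stack` holds
        # (node, index of the child we are currently under).
        stack, node = [], root
        while node.children:               # descend to the leftmost leaf
            stack.append((node, 0))
            node = node.children[0]
        while True:
            yield from node.keys           # emit this leaf
            # Climb until an ancestor still has an unvisited child to the right.
            while stack and stack[-1][1] + 1 == len(stack[-1][0].children):
                stack.pop()
            if not stack:
                return
            parent, i = stack.pop()
            stack.append((parent, i + 1))  # parent is already in memory
            node = parent.children[i + 1]
            while node.children:           # descend to its leftmost leaf
                stack.append((node, 0))
                node = node.children[0]

    leaves = [Node([1, 2]), Node([3, 4]), Node([5, 6])]
    root = Node([3, 5], children=leaves)
    assert list(scan_leaves(root)) == [1, 2, 3, 4, 5, 6]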
tomnipotent 2021-08-19 01:53:42 +0000 UTC [ - ]
In a standard B-tree, yes.
PostgreSQL and MySQL, for example, use B-tree variants that have pointers between leaf nodes. Postgres is based on Lehman & Yao (itself a B+(*?) Tree variant), and InnoDB uses a B+ Tree with a next pointer.
masklinn 2021-08-18 18:22:14 +0000 UTC [ - ]
With modern memory hierarchies that also tends to be the case in-memory (with lower node densities more suited to caches and cache lines).
pkaye 2021-08-18 19:09:54 +0000 UTC [ - ]
masklinn 2021-08-18 19:45:53 +0000 UTC [ - ]
dragontamer 2021-08-18 21:27:34 +0000 UTC [ - ]
It also should be noted that DDR4 has "rows" of roughly 1024 bytes. A RAS command opens a "row" (1024 bytes), while a CAS command reads a column (a 64-byte chunk) from it.
DDR4 cells can only be read once (!!!); the read is destructive. To prevent the loss of data, all "reads" go first into sense amplifiers, and those sense amplifiers can be read indefinitely.
This "read into the sense amplifiers" is the RAS command: it transfers the entire 1024-byte row into the sense amplifiers. CAS commands can then read 64-byte chunks from that open row quickly and in any order.
--------
Before a new RAS command can be issued, there needs to be a few steps.
1. The "current row" of data needs to be written back to DRAM (remember, the data in the DRAM cells was lost when you first read it).
2. After writing the data, the sense-amplifiers must be emptied (aka: precharged), so that they're ready to receive the new data.
3. After #1 and #2, it is finally legal to issue a new RAS.
--------
So in fact, DDR4 RAM is also a block device. Sure, a RAS is much faster than a hard-drive seek, but given how much slower a RAS (row address strobe) is than a CAS (column address strobe)... a lot of these disk-based discussions (such as B-trees) end up applying to DDR4 in practice.
--------
The only things that really work like classic RAM are the L1 / L2 / L3 caches. Everything else in the modern world is basically a block device.
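A back-of-the-envelope toy model of that (the sizes and costs below are illustrative, not real DDR4 timings): sequential 64-byte reads mostly hit the already-open row, while scattered reads pay the precharge + RAS on nearly every access.

    import random

    ROW_BYTES, LINE_BYTES = 1024, 64     # illustrative row / cache-line sizes
    CAS_COST, ROW_MISS_COST = 1, 10      # made-up relative costs (CAS vs precharge+RAS+CAS)

    def access_cost(addresses):
        # Charge a cheap CAS when the row is already open in the sense
        # amplifiers, and a row miss (precharge + RAS + CAS) when it isn't.
        open_row, cost = None, 0
        for addr in addresses:
            row = addr // ROW_BYTES
            cost += CAS_COST if row == open_row else ROW_MISS_COST
            open_row = row
        return cost

    n_lines = 1 << 16
    sequential = [i * LINE_BYTES for i in range(n_lines)]
    scattered = random.sample(range(0, n_lines * LINE_BYTES, LINE_BYTES), n_lines)

    print("sequential:", access_cost(sequential))   # ~1 row miss per 16 lines
    print("random:    ", access_cost(scattered))    # nearly every access misses the row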
toolslive 2021-08-18 18:09:04 +0000 UTC [ - ]
bob1029 2021-08-18 18:36:46 +0000 UTC [ - ]
toolslive 2021-08-18 18:59:34 +0000 UTC [ - ]
jlokier 2021-08-19 01:53:21 +0000 UTC [ - ]
This is because the ratio of the number of "hot" items you can store at reduced depth to the total number of items is much smaller with a high-fanout tree than with a low-fanout tree; and at the same time, the depths are lower with a high-fanout tree, reducing the benefit of reduced depth anyway.
In other words, you can only give the benefit to a smaller number of hot items, and the benefit you can give is smaller.
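Rough numbers, assuming a B+-tree-ish layout where all items normally sit in the leaves and the interior key slots are where hot items could be hoisted to (the formula and fanouts are just for illustration):

    import math

    def hot_item_stats(total_items, fanout):
        # Interior key slots number roughly total/(fanout-1), which caps how many
        # "hot" items can be hoisted above the leaves; depth is what a lookup
        # traverses anyway.
        hoistable = total_items / (fanout - 1)
        depth = math.ceil(math.log(total_items, fanout))
        return hoistable / total_items, depth

    for fanout in (4, 16, 256):
        frac, depth = hot_item_stats(1_000_000, fanout)
        print(f"fanout {fanout:>3}: ~{frac:.1%} of items fit above the leaves, depth ~{depth}")

    # fanout   4: ~33% of items could sit higher up, but lookups cost ~10 levels
    # fanout 256: only ~0.4% fit above the leaves, and lookups are ~3 levels anyway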
On top of that, the added complications of rotations and statistics mean fewer items can be stored in interior nodes (which I'd call index nodes) for a given node size. This reduces the fanout and increases the full depth.
And the rotations and access statistics cause more writing.
There are tradeoffs and perhaps it is worth it for some tree parameters. Perhaps for exceptionally large trees with particular access patterns. It would be interesting to see how it's implemented.
But generally for a high-fanout B-tree, you might as well just store frequently accessed items in a separate small tree as a cache, and keep that cached in memory.
dragontamer 2021-08-18 21:17:42 +0000 UTC [ - ]
FAT / FAT32 wasn't about B-trees or balancing or anything. It was a simple linked list that got severely tangled up during use and required an hour of "defragmenting" to get your hard drive back up to speed.
As a computer user of the time, I was too young / ignorant to know why it happened like that, but the experience stuck with me until college. In college, I learned how the FAT32 filesystem worked fundamentally (though obsolete, it's so simple that it's a great introduction).
From there, I learned why we have moved on to more advanced data structures like B-trees, why things like B-trees are less prone to fragmentation, etc.
----------
A big part of my learning process was the hours of waiting I'd put in to defrag my hard drive back in the day. Those memories stuck with me; I know what the "downsides" are if I don't use a self-balancing B-Tree to store data on a hard drive.
Perhaps today we're no longer at a point where we can just convince legions of new computer users (or programmers) to use inferior file systems like FAT32 in their daily lives. But we can at least tell stories about those times, so that they can understand why we put so much complexity into our modern filesystem data structures.
wirthjason 2021-08-18 23:56:20 +0000 UTC [ - ]
nanidin 2021-08-18 22:07:27 +0000 UTC [ - ]
NikolaNovak 2021-08-19 01:27:15 +0000 UTC [ - ]
You are correct that SSDs do not require defragmentation; to a certain degree, defragmentation is actually harmful, in the sense that it forces unnecessary read/write cycles.
nayuki 2021-08-19 06:09:00 +0000 UTC [ - ]
As for SSDs, they need some degree of garbage collection - it's considered bad for an erasable block (e.g. 1 MiB) to contain a mix of live pages and deleted pages (4 KiB each).
dragontamer 2021-08-18 22:18:44 +0000 UTC [ - ]
Back then, we were just happy having a personal computer. I'm sure there was better tech being used somewhere. But MS-DOS / Windows was just your cheap $99 OS in an age when other compilers and OSes cost you something like $1000 to $10,000.
You formatted your hard drive as FAT because MS-DOS used FAT. You used MS-DOS because that's what PCs and PC clones came with. There was no alternative, and a PC was already considered a pretty luxurious / nerdy item. (I know people brag about their DEC Alphas, Unix copies, or OS/2... but you got enough "nerd cred" back then if you simply had a PC at all.)
It's just how computers worked back then. Every PC used FAT, and after every few months of regular use the disk reads/writes would get so slow that performance was unbearable.
At that point, you spent an hour defragmenting the hard drive, and everything worked great again.
-------
Eventually 4GB hard drives were made, and you upgraded your computer to FAT32. Soon after that, Windows promised an end to defragmenting with its next-generation NTFS filesystem (based on the B-trees discussed in this article), and we never looked back.
NikolaNovak 2021-08-19 01:28:18 +0000 UTC [ - ]
I thought this mostly reduced fragmentation of the filesystem metadata, but not the fragmentation of the actual data files themselves?
rasz 2021-08-19 05:29:24 +0000 UTC [ - ]
dragontamer 2021-08-19 05:48:24 +0000 UTC [ - ]
Even if these files were all perfectly allocated in sequential order at the beginning, a few operations like this quickly fragment them.
--------------
I don't recall all the files MS-DOS or Win95 opened and closed on a regular basis, but play a video game (save file changed), open a Word document, edit a photo, and come back to the video game... and the save file will now start getting fragmented, as the save files change size and newly allocated blocks need to be found for them.
rasz 2021-08-19 07:11:25 +0000 UTC [ - ]
- can you name a couple of games that saved state by appending to / modifying files in place on disk?