
Classical data structures that can outperform learned indexes (2018)

dataflow 2021-08-16 07:28:55 +0000 UTC [ - ]

Why do cuckoo hashing advertisements always sound like snake oil sales pitches?

Claim: "A simple and beautiful technique that can achieve 99% occupancy and serve all lookups with just two memory accesses thanks to the power of two choices."

Great, so apparently I can achieve 99% occupancy?

Reality [1]: "Insertions succeed in expected constant time [...] as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%."

Ok so never mind...

[1] https://en.wikipedia.org/wiki/Cuckoo_hashing#Theory

FreakLegion 2021-08-16 09:25:57 +0000 UTC [ - ]

> Ok so never mind...

Keep reading, specifically the Variations section. 50% occupancy is for constructions with a per-bucket capacity of 1. At a capacity of 2, occupancy improves to a little under 90%, and at 4 to just under 98%. The linked write-up uses 8, which does in fact achieve very high occupancy.

They could've done better, though. By using windows instead of buckets (i.e. allowing the buckets to overlap), a capacity of 2 already yields > 96% occupancy, and 4 reaches 99.9%, so smaller and faster (lookups examine fewer keys on average). This approach is detailed in "3.5-Way Cuckoo Hashing for the Price of 2-and-a-Bit".
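
To make the lookup cost concrete, here is a minimal sketch of a bucketized cuckoo lookup (illustrative only, not the write-up's code; the two hash functions h1/h2 and the table layout are placeholders). The point is that a query only ever touches the two candidate buckets:

    def cuckoo_lookup(table, key, h1, h2):
        """Bucketized cuckoo lookup: `table` is a list of buckets, each a
        list of up to B (key, value) pairs (B = 8 in the linked write-up).
        A query reads at most the two buckets chosen by h1 and h2."""
        for i in (h1(key) % len(table), h2(key) % len(table)):
            for k, v in table[i]:
                if k == key:
                    return v
        return None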

dataflow 2021-08-16 09:43:53 +0000 UTC [ - ]

Thanks, that explains what they're saying now. But I'm still skeptical. How achievable is 98% fill for 4-element buckets in the first place? Intuitively I feel like you'd frequently have to scrap the table and rehash or enlarge it... is that not the case?

FreakLegion 2021-08-17 20:49:53 +0000 UTC [ - ]

Like with Bloom filters (which also draw their power from hashing items into multiple buckets), the statistical guarantees are quite strong, and things work well in practice.

There are other options for the insertion algorithm besides the standard random walk, too. E.g. take a look at page 18 of "Load Thresholds for Cuckoo Hashing with Overlapping Blocks" [1] for empirical data on 2-3 hash functions with windows of size 2-3. They don't show windows of size 4 or 8, but you get the gist.

Like nullc says below, it's trivial to tune things for your requirements. Want fast insertions? Overprovision. Want more predictable insertions? Use a smarter insertion algorithm. Want small tables? Use more hash functions and/or larger buckets. Want fast queries? Keep the hash count and bucket sizes small.

Rehashing isn't really a concern beyond the case of 2 hash functions and non-overlapping buckets of size 1.

1. https://arxiv.org/pdf/1707.06855.pdf

jasonwatkinspdx 2021-08-16 18:28:06 +0000 UTC [ - ]

So, for the complete formal answer get Mitzenmacher's book: https://www.amazon.com/Probability-Computing-Randomized-Algo...

There are incrementally resizing versions, generally under the name Level Hashing. The most state-of-the-art versions of these are lock-free. (Example: https://www.usenix.org/conference/atc20/presentation/chen)

phkahler 2021-08-16 16:04:10 +0000 UTC [ - ]

Seems like a good point. If you're at 90+ percent occupancy you're going to run into trouble adding more data.

throwaway81523 2021-08-16 08:03:20 +0000 UTC [ - ]

IIRC you can get to high occupancy by doing enough rehashes. After that, if you do no more insertions and only lookups, it is a good deal since each lookup takes at most two memory accesses. So this is useful if you are willing to spend a long time building what will then be a read-only table. Obviously there are uses for that.

dataflow 2021-08-16 08:22:21 +0000 UTC [ - ]

> IIRC you can get to high occupancy by doing enough rehashes.

Yeah so how many total insertions (as part of the rehashing) do you expect to have to do to achieve 99% occupancy? Wouldn't it be even worse than O(n^2)?

And if you're going to spend a lot of time rebuilding the hash table all the time, then why not just use a perfect hash generator?

throwaway81523 2021-08-16 08:30:14 +0000 UTC [ - ]

Large perfect hash tables have to store an awful lot of information. I don't remember quite how it's done, but it's not a free lunch: the hash function itself has a size that grows with the number of keys.

I don't know offhand how many rehashes you need to get 98% occupancy with cuckoo hashing. There may be ways to optimize it by sharding the table into smaller ones. I'll re-read the wikipedia article when I get a chance. It's a fun algorithm and I've sometimes looked for places to use it.

nullc 2021-08-16 23:54:19 +0000 UTC [ - ]

> Yeah so how many total insertions (as part of the rehashing) do you expect to have to do to achieve 99% occupancy? Wouldn't it be even worse than O(n^2)?

It depends on your bucket sizes and how many hashes you use.

If you attempt 99% occupancy with 2 hashes and 4 entries per bucket, then you are going to be doing a LOT of kicking. 92% with that geometry, OTOH, is fine and will end up with just a couple kicks per insert on average.

signa11 2021-08-16 08:03:29 +0000 UTC [ - ]

I found this (https://codecapsule.com/?s=hashing) to be quite instructive as an overview of various hashing techniques. Check it out for some fun.


truenindb 2021-08-16 07:43:10 +0000 UTC [ - ]

It's because the internet is a perfect Stirling engine, guaranteeing that all eyeballs will be monetized in an adiabatic flow. It is known as "The Permanent November of ImaginosVictory Law" to those of us that have been using the internet since Salman Rushdie invented email, which is one of the best exemplars of a good protocol since Google Wave.

lrem 2021-08-16 08:00:49 +0000 UTC [ - ]

Was this generated with some GPT-3-like model or something?

orf 2021-08-16 09:16:05 +0000 UTC [ - ]

I think so; the only result on Google for “ImaginosVictory” is that very comment.

chana_masala 2021-08-16 08:05:35 +0000 UTC [ - ]

Do you have an ICO I can buy into?

xarope 2021-08-16 07:47:40 +0000 UTC [ - ]

bingo?

DiabloD3 2021-08-16 09:05:16 +0000 UTC [ - ]

I don't know who's running a GPT bot on HN, but this is a beautiful work of art.

the_duke 2021-08-16 07:43:01 +0000 UTC [ - ]

It wasn't clear to me how Cuckoo hash tables are supposed to work if both locations are full.

Here is the relevant explanation from Wikipedia:

> The new key is inserted in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found, completing the algorithm. However, it is possible for this insertion process to fail, by entering an infinite loop or by finding a very long chain (longer than a preset threshold that is logarithmic in the table size). In this case, the hash table is rebuilt in-place using new hash functions

https://en.wikipedia.org/wiki/Cuckoo_hashing#Operation
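
In code, that kicking loop looks roughly like this (a sketch of the textbook variant with one slot per position, not the article's implementation; h1/h2 are placeholder hash functions):

    def cuckoo_insert(table, key, value, h1, h2, max_kicks=None):
        """Textbook cuckoo insert: one slot per position, two hash functions.
        Returns True on success, or False if the kick chain got too long and
        the caller should rebuild the table with new hash functions."""
        n = len(table)
        if max_kicks is None:
            max_kicks = 4 * max(1, n.bit_length())  # roughly logarithmic cap
        pos = h1(key) % n
        for _ in range(max_kicks):
            if table[pos] is None:
                table[pos] = (key, value)
                return True
            # Kick out the resident entry and take its slot.
            displaced = table[pos]
            table[pos] = (key, value)
            key, value = displaced
            # Send the displaced entry to its *other* candidate position.
            p1, p2 = h1(key) % n, h2(key) % n
            pos = p2 if pos == p1 else p1
        return False  # rebuild in-place with new hash functions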

Sounds like with high occupancy the table would need to be rebuilt constantly.

toxik 2021-08-16 09:51:17 +0000 UTC [ - ]

Ah, O(wildly erratic) insertion. Great for that jittery feel to your programs.

j-pb 2021-08-16 10:46:22 +0000 UTC [ - ]

Since you double the table size after failing to insert/displace above a constant threshold (e.g. 8 kicks), you get amortized O(1) insert performance. You don't need to rehash the table either: you can add additional hash functions on each grow, or reuse your existing hash function and just widen the bit mask (that one does require a copy of the table, albeit no rehashing, and copying chunks of memory is something our CPUs are really good at).

adrianN 2021-08-16 11:47:57 +0000 UTC [ - ]

Most popular data structures only provide average case guarantees with bad worst-case bounds. Dynamic arrays also have O(wildly erratic) append. I think for most use cases this is perfectly fine and does not result in any perceivable jitter in programs.

toxik 2021-08-16 12:09:18 +0000 UTC [ - ]

In fact they do not have erratic behavior; they are quite predictable. The cuckooing process, on the other hand, is not predictable, by design.

touisteur 2021-08-16 12:22:56 +0000 UTC [ - ]

I recently went on a deep dive into how predictable sorting algorithms actually are, and for latency-sensitive workloads most of the things you'd use because they're 'simple/standard' (quicksort, mergesort, ...) don't shine, with their horrid worst-case complexity, and their behavior also depends a lot on your input data. Quicksort with a badly chosen pivot, for example, has caused me headaches recently.

phkahler 2021-08-16 16:13:28 +0000 UTC [ - ]

Heapsort is O(n log n) in the worst case. It is also O(n log n) in most cases, including already-sorted data. Most implementations also seem to have a slightly larger constant factor than quicksort, but I think that's largely due to implementation details (one should not actually swap values that are likely to be immediately swapped again).

Which one is right really comes down to how much you care about typical/average/worst-case time complexity and the actual size of your data set.

karpierz 2021-08-16 13:15:30 +0000 UTC [ - ]

Mergesort's worst case is O(n log n). If there's a downside to it, it's that you need to allocate memory.

touisteur 2021-08-17 22:15:58 +0000 UTC [ - ]

Which is where you get problematic tail latencies. Or your data fits on the stack.

jcelerier 2021-08-16 16:43:04 +0000 UTC [ - ]

> I think for most use cases this is perfectly fine and does not result in any perceivable jitter in programs.

we must not use the same programs :-(

nullc 2021-08-16 23:51:53 +0000 UTC [ - ]

Below the transition fullness, the insertion process needs n kicks with exponentially decreasing probability in n (for constants depending on the fullness level and the number of hash functions), e.g. on the order of 1/2^n for n kicks.

Not something you can really describe as wildly erratic.

If latency is particularly critical in your application, you could couple the hash table with a small "stash" map and perform a constant maximum number of kicks per insert, and when you reach the limit stick the current straggler in the map to be continued on the next insertion. As long as the constant maximum is well above the average the map will stay empty most of the time.
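
A rough sketch of that idea (my own illustration, not a reference implementation: one slot per position, placeholder hash functions h1/h2, and an arbitrary cap of 32 kicks):

    class StashedCuckoo:
        """Sketch: cap the number of kicks per insert and park any straggler
        in a small side dict (the "stash"), retried on later insertions."""

        MAX_KICKS = 32  # well above the average number of kicks per insert

        def __init__(self, size, h1, h2):
            self.table = [None] * size
            self.h1, self.h2 = h1, h2
            self.stash = {}

        def _bounded_insert(self, key, value):
            """Do at most MAX_KICKS displacements; return the leftover entry
            that still lacks a slot, or None on success."""
            n = len(self.table)
            pos = self.h1(key) % n
            for _ in range(self.MAX_KICKS):
                if self.table[pos] is None:
                    self.table[pos] = (key, value)
                    return None
                displaced = self.table[pos]
                self.table[pos] = (key, value)
                key, value = displaced
                p1, p2 = self.h1(key) % n, self.h2(key) % n
                pos = p2 if pos == p1 else p1
            return (key, value)

        def insert(self, key, value):
            leftover = self._bounded_insert(key, value)
            if leftover is not None:
                self.stash[leftover[0]] = leftover[1]
            elif self.stash:
                # Retry one stashed entry now that an insert went through.
                k, v = self.stash.popitem()
                leftover = self._bounded_insert(k, v)
                if leftover is not None:
                    self.stash[leftover[0]] = leftover[1]

        def lookup(self, key):
            n = len(self.table)
            for pos in (self.h1(key) % n, self.h2(key) % n):
                entry = self.table[pos]
                if entry is not None and entry[0] == key:
                    return entry[1]
            return self.stash.get(key)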

toxik 2021-08-17 09:09:34 +0000 UTC [ - ]

The fact that you're talking about probabilities sort of proves my point. You're instead saying it could be worse, and I'm saying that random execution times in your critical paths are generally undesirable. It depends on your application of course, but if you have anything realtime or near-realtime, I would definitely take consistent execution times.

There is a related e-mail from id Software's John Carmack about this at [1] which I found very interesting.

[1] http://number-none.com/blow/blog/programming/2014/09/26/carm...

nullc 2021-08-18 05:48:36 +0000 UTC [ - ]

There are 'random' execution times in the critical paths of essentially everything on a modern general purpose computer, unfortunately.

What matters for realtime operation is that the worst case is bounded and that the bound is acceptable.

TimonKnigge 2021-08-16 08:14:19 +0000 UTC [ - ]

In the linked article they resolve collisions via chaining:

> A typical hash function distributes keys randomly across the slots in a hash table, causing some slots to be empty, while others have collisions, which require some form of chaining of items

I.e. each field in the table is a linked list of the values that hash to that position, and a new value is inserted into the shorter of the two lists it hashes to.
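
In code, that reads roughly like this (a sketch of two-choice chaining as described here, not the article's code; h1/h2 are placeholder hash functions):

    def chained_insert(table, key, value, h1, h2):
        """Two-choice chaining: each slot holds a list (chain), and a new
        entry goes into the shorter of its two candidate chains."""
        b1, b2 = h1(key) % len(table), h2(key) % len(table)
        target = table[b1] if len(table[b1]) <= len(table[b2]) else table[b2]
        target.append((key, value))

    def chained_lookup(table, key, h1, h2):
        """A lookup scans both candidate chains."""
        for b in (h1(key) % len(table), h2(key) % len(table)):
            for k, v in table[b]:
                if k == key:
                    return v
        return None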

eutectic 2021-08-16 09:15:37 +0000 UTC [ - ]

No, chaining is presented as an alternative to Cuckoo hashing.

pclmulqdq 2021-08-16 12:34:08 +0000 UTC [ - ]

The learned index paper struck me as a bit of a marketing gimmick, but this paper also smells a bit. The central sleight of hand that both papers pull without admitting it is that they are creating mostly-read data structures (or in the first half of the learned index paper, read-only data structures). The learned index paper then compares them to read-write data structures and claims a win.

Due to cache locality, cuckoo hashing usually underperforms compared to linear probing hash tables, except when you want super high density and you don't expect to do much inserting. It gets especially bad if the keys or values are large. 99% of the time, you're better off with something other than cuckoo hashing.

resters 2021-08-16 13:58:20 +0000 UTC [ - ]

Learned indexes are an optimization technique that can help a lot in specific scenarios, but the authors were not under the impression that they are applicable to most use-cases.

thesz 2021-08-16 22:53:24 +0000 UTC [ - ]

The logarithmic method can transform static structures (what you call read-only, i.e. structures with a high cost of change) into dynamic ones.

Examples include B-trees (log-structured merge trees are the most famous application of the logarithmic method), kd-trees, sorted arrays (cache-oblivious lookahead arrays, COLA), and more.

Usually, some variant of a merge operation is much more efficient than applying changes one at a time. This is obvious for sorted arrays (merge sort). It is true for kd-trees: they can be constructed efficiently from sorted data, and it is easy to fetch sorted data back out of them. It is also largely true for B-trees: they degrade (hugely so) under random data loads but do well when changes arrive in order, and log-structured merge trees make such effectively-static B-trees dynamic again.
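
For the sorted-array case, a toy version of the logarithmic method looks like this (my own sketch, loosely COLA-shaped, not from the paper; a learned index would hang a per-level model off the same layout instead of binary search):

    import heapq
    from bisect import bisect_left

    class LogLevels:
        """Logarithmic method over sorted arrays: level i is either empty
        or a sorted run of 2**i keys. Updates only ever *merge* sorted runs;
        queries binary-search each non-empty level."""

        def __init__(self):
            self.levels = []  # levels[i] is [] or a sorted list of 2**i keys

        def insert(self, key):
            carry = [key]
            i = 0
            # Carry-propagate like binary addition, merging sorted runs.
            while True:
                if i == len(self.levels):
                    self.levels.append([])
                if not self.levels[i]:
                    self.levels[i] = carry
                    return
                carry = list(heapq.merge(self.levels[i], carry))
                self.levels[i] = []
                i += 1

        def contains(self, key):
            for level in self.levels:
                j = bisect_left(level, key)
                if j < len(level) and level[j] == key:
                    return True
            return False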

The merge operation for learned indices is, well, learning from two sources and merging the information. My limited experience with machine learning tells me that adjusting a model is easier than training it anew.

rurban 2021-08-17 13:51:06 +0000 UTC [ - ]

And modern data structures always outperform classical ones.

Why cuckoo, when you'd usually use Swiss tables or Stanford tables for longs? Or https://greg7mdp.github.io/parallel-hashmap/ for concurrent hash tables?

Learned indices (i.e. dynamic data structures optimized for the data) usually outperform these as well if they are properly optimized. They can be neural nets, but also perfect hashes or just better dynamic lookup methods, like three-way array lookups, properly compressed, as with Unicode. Not everything needs to be an NN, though it helps if you see no structure in the data.

dang 2021-08-16 19:29:13 +0000 UTC [ - ]

Discussed at the time:

Classical Data Structures That Can Outperform Learned Indexes - https://news.ycombinator.com/item?id=16138857 - Jan 2018 (20 comments)

dicroce 2021-08-16 12:38:43 +0000 UTC [ - ]

As I read this I thought "interesting, I would have thought probing would be better on modern computers with memory pre-fetching"... Then I came to the comments and found out I'm not the only one thinking that.

pclmulqdq 2021-08-16 12:46:50 +0000 UTC [ - ]

You're right: benchmarks have shown that probing is better until you reach ~90% capacity. Cuckoo hashing suffers from terrible memory locality, and you can only make up for it when a probing table would have even worse locality.

anonu 2021-08-16 15:13:30 +0000 UTC [ - ]

Another great lesson that you can't slap "AI" on everything...

truenindb 2021-08-16 07:39:05 +0000 UTC [ - ]

Look you don't need to learn any algorithms because SQL will allow the computer to algorithm for you. SQL will guarantee ACID properties and also BASE properties. You are going to be rich. Simply buy Larry Ellison a boat and a bunch of ads on the back of the economist to socially prove to the finance world and the folks from Dave Graeber's actually very good essay that has not yet been followed up with the great american novel, and you are gonna be rich with no algorithms not done by the computer. Don't worry, the computer is gonna take care of it, you won't need handwriting or food. Here is a link to Mr. Graeber's high quality novel, "On the phenomenon of bullshit jobs": https://www.theatlantic.com/magazine/archive/2004/07/i-agree...