Classical data structures that can outperform learned indexes (2018)
the_duke 2021-08-16 07:43:01 +0000 UTC [ - ]
Here is the relevant explanation from Wikipedia:
> The new key is inserted in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found, completing the algorithm. However, it is possible for this insertion process to fail, by entering an infinite loop or by finding a very long chain (longer than a preset threshold that is logarithmic in the table size). In this case, the hash table is rebuilt in-place using new hash functions
https://en.wikipedia.org/wiki/Cuckoo_hashing#Operation
Sounds like with high occupancy the table would need to be rebuilt constantly.
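A rough sketch of that insertion loop (illustrative only; the class name, hash choices, and rebuild policy are made up, and a real rebuild would typically pick new hash functions at the same size rather than just growing the table):

    class CuckooTable:
        """Minimal two-choice cuckoo hash table with one slot per position (sketch only)."""

        def __init__(self, capacity=16):
            self.capacity = capacity
            self.slots = [None] * capacity
            self.max_kicks = max(1, capacity.bit_length())  # threshold ~ log2(capacity)

        def _h1(self, key):
            return hash(("h1", key)) % self.capacity

        def _h2(self, key):
            return hash(("h2", key)) % self.capacity

        def contains(self, key):
            return self.slots[self._h1(key)] == key or self.slots[self._h2(key)] == key

        def insert(self, key):
            if self.contains(key):
                return
            pos = self._h1(key)
            for _ in range(self.max_kicks):
                if self.slots[pos] is None:
                    self.slots[pos] = key
                    return
                # "Kick out" the resident key and continue with it at its alternative slot.
                key, self.slots[pos] = self.slots[pos], key
                pos = self._h2(key) if pos == self._h1(key) else self._h1(key)
            # Too many kicks: rebuild the table (here by growing; real cuckoo hashing
            # usually rehashes in place with freshly chosen hash functions).
            self._rebuild(pending=key)

        def _rebuild(self, pending):
            old_keys = [k for k in self.slots if k is not None] + [pending]
            self.capacity *= 2
            self.slots = [None] * self.capacity
            self.max_kicks = max(1, self.capacity.bit_length())
            for k in old_keys:
                self.insert(k)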
toxik 2021-08-16 09:51:17 +0000 UTC [ - ]
j-pb 2021-08-16 10:46:22 +0000 UTC [ - ]
adrianN 2021-08-16 11:47:57 +0000 UTC [ - ]
toxik 2021-08-16 12:09:18 +0000 UTC [ - ]
touisteur 2021-08-16 12:22:56 +0000 UTC [ - ]
phkahler 2021-08-16 16:13:28 +0000 UTC [ - ]
Which one is right really comes down to how much you care about typical, average, or worst-case time complexity, and the actual size of your data set.
jcelerier 2021-08-16 16:43:04 +0000 UTC [ - ]
we must not use the same programs :-(
nullc 2021-08-16 23:51:53 +0000 UTC [ - ]
Not something you can really describe as wildly erratic.
If latency is particularly critical in your application, you could couple the hash table with a small "stash" map and perform a constant maximum number of kicks per insert; when you reach the limit, stick the current straggler in the map to be continued on the next insertion. As long as the constant maximum is well above the average, the map will stay empty most of the time.
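A sketch of that idea, with made-up names and a made-up kick budget: each insert does at most a fixed amount of displacement work, and any leftover key parks in a small stash that later inserts drain.

    class StashedCuckooTable:
        """Cuckoo-style table with a hard per-insert kick budget plus a small stash (sketch)."""

        MAX_KICKS_PER_INSERT = 8  # chosen well above the average number of kicks

        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.slots = [None] * capacity
            self.stash = set()  # stragglers that ran out of budget; usually empty

        def _h1(self, key):
            return hash(("h1", key)) % self.capacity

        def _h2(self, key):
            return hash(("h2", key)) % self.capacity

        def contains(self, key):
            return (self.slots[self._h1(key)] == key
                    or self.slots[self._h2(key)] == key
                    or key in self.stash)

        def _place(self, key, budget):
            """Spend at most `budget` kicks placing `key`; return the leftover key, if any."""
            pos = self._h1(key)
            for _ in range(budget):
                if self.slots[pos] is None:
                    self.slots[pos] = key
                    return None
                key, self.slots[pos] = self.slots[pos], key
                pos = self._h2(key) if pos == self._h1(key) else self._h1(key)
            return key

        def insert(self, key):
            if self.contains(key):
                return
            leftover = self._place(key, self.MAX_KICKS_PER_INSERT)
            if leftover is not None:
                self.stash.add(leftover)
            # Resume work on one previously parked straggler, again with a bounded budget.
            if self.stash:
                pending = self.stash.pop()
                again = self._place(pending, self.MAX_KICKS_PER_INSERT)
                if again is not None:
                    self.stash.add(again)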
toxik 2021-08-17 09:09:34 +0000 UTC [ - ]
There is a related e-mail from id Software's John Carmack about this at [1], which I found very interesting.
[1] http://number-none.com/blow/blog/programming/2014/09/26/carm...
nullc 2021-08-18 05:48:36 +0000 UTC [ - ]
What matters for realtime operation is that the worst case is bounded and that the bound is acceptable.
TimonKnigge 2021-08-16 08:14:19 +0000 UTC [ - ]
> A typical hash function distributes keys randomly across the slots in a hash table, causing some slots to be empty, while others have collisions, which require some form of chaining of items
I.e. each slot in the table is a linked list of values that hash to that position, and a new value is inserted into the shorter of the two lists it hashes to.
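A minimal sketch of that two-choice chaining (class name and sizing are just for illustration):

    class TwoChoiceChainedTable:
        """Chained hash table where each key has two candidate buckets and joins the shorter chain."""

        def __init__(self, num_buckets=64):
            self.num_buckets = num_buckets
            self.buckets = [[] for _ in range(num_buckets)]  # each bucket is a chain

        def _positions(self, key):
            return (hash(("a", key)) % self.num_buckets,
                    hash(("b", key)) % self.num_buckets)

        def contains(self, key):
            p1, p2 = self._positions(key)
            return key in self.buckets[p1] or key in self.buckets[p2]

        def insert(self, key):
            if self.contains(key):
                return
            p1, p2 = self._positions(key)
            # The "power of two choices": joining the shorter chain keeps the longest
            # chain dramatically shorter than with a single hash function.
            target = p1 if len(self.buckets[p1]) <= len(self.buckets[p2]) else p2
            self.buckets[target].append(key)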
eutectic 2021-08-16 09:15:37 +0000 UTC [ - ]
pclmulqdq 2021-08-16 12:34:08 +0000 UTC [ - ]
Due to cache locality, cuckoo hashing usually underperforms compared to linear probing hash tables, except when you want super high density and you don't expect to do much inserting. It gets especially bad if the keys or values are large. 99% of the time, you're better off with something other than cuckoo hashing.
resters 2021-08-16 13:58:20 +0000 UTC [ - ]
thesz 2021-08-16 22:53:24 +0000 UTC [ - ]
Examples include B-trees (log-structured merge trees are the most famous application of the logarithmic method), kd-trees, sorted arrays (cache-oblivious lookahead arrays, or COLA), and more.
Usually, some variant of a merge operation is much more efficient than applying changes one at a time. It is obvious for sorted arrays (merge sort). It is true for kd-trees: they can be efficiently constructed from sorted data, and it is easy to fetch sorted data from a kd-tree. It is also quite true for B-trees: they degrade (hugely so) under random insertion load, but do well when changes arrive in order. Log-structured merge trees make B-trees, which are effectively static under random load, dynamic again.
The merge operation for learned indices is, well, learning from two sources and merging the information. My limited experience with machine learning tells me that adjusting a model is easier than training it anew.
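For the sorted-array case above, a COLA-flavoured sketch of the logarithmic method (illustrative, not from the article): keep levels of sorted arrays with doubling sizes and resolve every insertion purely by merging.

    from bisect import bisect_left

    class LogarithmicSortedArrays:
        """Level i holds either nothing or a sorted array of exactly 2**i keys (sketch)."""

        def __init__(self):
            self.levels = []  # levels[i] is [] or a sorted list of length 2**i

        def insert(self, key):
            carry = [key]
            i = 0
            # Propagate like binary addition: merge upward until an empty level is found.
            while True:
                if i == len(self.levels):
                    self.levels.append([])
                if not self.levels[i]:
                    self.levels[i] = carry
                    return
                carry = self._merge(self.levels[i], carry)
                self.levels[i] = []
                i += 1

        @staticmethod
        def _merge(a, b):
            out, i, j = [], 0, 0
            while i < len(a) and j < len(b):
                if a[i] <= b[j]:
                    out.append(a[i]); i += 1
                else:
                    out.append(b[j]); j += 1
            out.extend(a[i:])
            out.extend(b[j:])
            return out

        def contains(self, key):
            for level in self.levels:
                if level:
                    k = bisect_left(level, key)
                    if k < len(level) and level[k] == key:
                        return True
            return False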
rurban 2021-08-17 13:51:06 +0000 UTC [ - ]
Why cuckoo, when you would usually use Swiss tables or Stanford tables for longs? Or https://greg7mdp.github.io/parallel-hashmap/ for concurrent hash tables?
Learned indices (i.e. dynamic data structures optimized to the data) usually outperform these as well if they are properly optimized. They can be neural nets, but also perfect hashes or just better dynamic lookup methods, like three-way array lookups, properly compressed, as with Unicode (a toy version is sketched below). Not everything needs to be a NN, though it helps if you see no structure in the data.
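A toy two-stage version of that kind of multi-level array lookup (the real Unicode tables use more stages and tighter packing; the block size and names here are made up):

    BLOCK_SIZE = 256

    def build_two_stage_table(values, default=0):
        """Compress a dense value array into a stage-1 index plus shared, deduplicated blocks."""
        blocks, seen, stage1 = [], {}, []
        for start in range(0, len(values), BLOCK_SIZE):
            block = tuple(values[start:start + BLOCK_SIZE])
            if len(block) < BLOCK_SIZE:
                block += (default,) * (BLOCK_SIZE - len(block))
            if block not in seen:            # identical blocks are stored only once
                seen[block] = len(blocks)
                blocks.append(block)
            stage1.append(seen[block])
        return stage1, blocks

    def lookup(stage1, blocks, code_point):
        """Constant-time lookup: one index into stage1, then one into the shared block."""
        return blocks[stage1[code_point // BLOCK_SIZE]][code_point % BLOCK_SIZE]

    # Example: a toy per-code-point property that is 1 for ASCII letters, 0 otherwise.
    values = [1 if 65 <= cp <= 90 or 97 <= cp <= 122 else 0 for cp in range(0x110000)]
    stage1, blocks = build_two_stage_table(values)
    assert lookup(stage1, blocks, ord("q")) == 1
    assert lookup(stage1, blocks, 0x4E2D) == 0
    print(len(blocks), "distinct blocks instead of", len(values), "entries")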
dang 2021-08-16 19:29:13 +0000 UTC [ - ]
Classical Data Structures That Can Outperform Learned Indexes - https://news.ycombinator.com/item?id=16138857 - Jan 2018 (20 comments)
dicroce 2021-08-16 12:38:43 +0000 UTC [ - ]
pclmulqdq 2021-08-16 12:46:50 +0000 UTC [ - ]
anonu 2021-08-16 15:13:30 +0000 UTC [ - ]
truenindb 2021-08-16 07:39:05 +0000 UTC [ - ]
dataflow 2021-08-16 07:28:55 +0000 UTC [ - ]
Claim: "A simple and beautiful technique that can achieve 99% occupancy and serve all lookups with just two memory accesses thanks to the power of two choices."
Great, so apparently I can achieve 99% occupancy?
Reality [1]: "Insertions succeed in expected constant time [...] as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%."
Ok so never mind...
[1] https://en.wikipedia.org/wiki/Cuckoo_hashing#Theory
FreakLegion 2021-08-16 09:25:57 +0000 UTC [ - ]
Keep reading, specifically the Variations section. 50% occupancy is for constructions with a per-bucket capacity of 1. At a capacity of 2, occupancy improves to a little under 90%, and at 4 to just under 98%. The linked write-up uses 8, which does in fact achieve very high occupancy.
They could've done better, though. By using windows instead of buckets (i.e. allowing the buckets to overlap), a capacity of 2 already yields > 96% occupancy, and 4 reaches 99.9%, so smaller and faster (lookups examine fewer keys on average). This approach is detailed in "3.5-Way Cuckoo Hashing for the Price of 2-and-a-Bit".
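A hypothetical sketch of that windowed layout (lookup plus a random-walk insert); WINDOW and the hash choices are placeholders, and a production version would add the rehash path:

    import random

    WINDOW = 4  # consecutive slots examined per hash position

    class WindowedCuckooTable:
        """Cuckoo-style table where each hash position opens a window of WINDOW consecutive
        slots; windows from different keys overlap instead of partitioning the table (sketch)."""

        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.slots = [None] * capacity

        def _starts(self, key):
            return (hash(("h1", key)) % self.capacity,
                    hash(("h2", key)) % self.capacity)

        def _window(self, start):
            return [(start + i) % self.capacity for i in range(WINDOW)]

        def contains(self, key):
            return any(self.slots[p] == key
                       for start in self._starts(key)
                       for p in self._window(start))

        def insert(self, key, max_kicks=500):
            if self.contains(key):
                return True
            for _ in range(max_kicks):
                s1, s2 = self._starts(key)
                candidates = self._window(s1) + self._window(s2)
                for p in candidates:
                    if self.slots[p] is None:
                        self.slots[p] = key
                        return True
                # Both windows full: kick a random resident and continue the walk with it.
                victim = random.choice(candidates)
                key, self.slots[victim] = self.slots[victim], key
            return False  # a real implementation would rehash here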
dataflow 2021-08-16 09:43:53 +0000 UTC [ - ]
FreakLegion 2021-08-17 20:49:53 +0000 UTC [ - ]
There are other options for the insertion algorithm than the standard random walk, too. E.g. take a look at page 18 of "Load Thresholds for Cuckoo Hashing with Overlapping Blocks" [1] for empirical data on 2-3 hash functions with windows of size 2-3. They don't show 4- or 8-size windows, but you get the gist.
Like nullc says below, it's trivial to tune things for your requirements. Want fast insertions? Overprovision. Want more predictable insertions? Use a smarter insertion algorithm. Want small tables? Use more hash functions and/or larger buckets. Want fast queries? Keep the hash count and bucket sizes small.
Rehashing isn't really a concern beyond the case of 2 hash functions and non-overlapping buckets of size 1.
1. https://arxiv.org/pdf/1707.06855.pdf
jasonwatkinspdx 2021-08-16 18:28:06 +0000 UTC [ - ]
There are incrementally resizing versions, generally under the name Levelized Hashing. The most state-of-the-art versions of these are lock-free. (Example: https://www.usenix.org/conference/atc20/presentation/chen)
phkahler 2021-08-16 16:04:10 +0000 UTC [ - ]
throwaway81523 2021-08-16 08:03:20 +0000 UTC [ - ]
dataflow 2021-08-16 08:22:21 +0000 UTC [ - ]
Yeah so how many total insertions (as part of the rehashing) do you expect to have to do to achieve 99% occupancy? Wouldn't it be even worse than O(n^2)?
And if you're going to spend a lot of time rebuilding the hash table anyway, why not just use a perfect hash generator?
throwaway81523 2021-08-16 08:30:14 +0000 UTC [ - ]
I don't know offhand how many rehashes you need to get 98% occupancy with cuckoo hashing. There may be ways to optimize it by sharding the table into smaller ones. I'll re-read the Wikipedia article when I get a chance. It's a fun algorithm and I've sometimes looked for places to use it.
nullc 2021-08-16 23:54:19 +0000 UTC [ - ]
It depends on your bucket sizes and how many hashes you use.
If you attempt 99% occupancy with 2 hashes and 4 entries per bucket, then you are going to be doing a LOT of kicking. 92% with that geometry, OTOH, is fine and will end up with just a couple kicks per insert on average.
signa11 2021-08-16 08:03:29 +0000 UTC [ - ]
truenindb 2021-08-16 07:43:10 +0000 UTC [ - ]
lrem 2021-08-16 08:00:49 +0000 UTC [ - ]
orf 2021-08-16 09:16:05 +0000 UTC [ - ]
chana_masala 2021-08-16 08:05:35 +0000 UTC [ - ]
xarope 2021-08-16 07:47:40 +0000 UTC [ - ]
DiabloD3 2021-08-16 09:05:16 +0000 UTC [ - ]