A fast alternative to the modulo reduction (2016)
pwuille 2021-08-18 15:56:23 +0000 UTC [ - ]
A write-up is here, in case anyone is interested: https://github.com/sipa/writeups/tree/main/uniform-range-ext...
kwillets 2021-08-18 17:14:20 +0000 UTC [ - ]
timdaub 2021-08-18 08:46:20 +0000 UTC [ - ]
Most folks I asked about it concluded that I'd end up with a pigeonhole effect, where it becomes likely that an index is suggested multiple times [2].
It may not be super cryptographically sophisticated, but what I want is a hash function that's (at least) somewhat aware of collisions without using too much storage. Bloom filters were recommended, but it seems they need too much storage as well. I need to be able to map Ethereum accounts to a smaller domain, e.g. 0 to 2^32, to save storage costs in smart contracts.
Since I asked on Stackexchange, the problem has become kind of stale and I stopped working on it. But it's fascinating and I should continue to dive deeper...
Interesting article! Thanks
- 1: https://crypto.stackexchange.com/questions/91742/can-there-b...
FreakLegion 2021-08-18 09:21:04 +0000 UTC [ - ]
- https://github.com/0xcb/Golomb-coded-map
timdaub 2021-08-18 13:36:01 +0000 UTC [ - ]
- 1 https://github.com/rugpullindex/awesome-ethereum-rollups#mem...
wizzwizz4 2021-08-18 08:34:42 +0000 UTC [ - ]
Reverse the bits before applying the trick.
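A minimal sketch of that, assuming a 32-bit value x whose entropy sits in the low bits (reverse32 and reduce are illustrative names, not from the article):

    #include <stdint.h>

    /* Reverse the bits of a 32-bit word with the classic swap ladder. */
    static uint32_t reverse32(uint32_t x) {
        x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
        x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
        x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
        x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
        return (x >> 16) | (x << 16);
    }

    /* The multiply-shift trick: maps x into [0, n) using x's high bits. */
    static uint32_t reduce(uint32_t x, uint32_t n) {
        return (uint32_t)(((uint64_t)x * n) >> 32);
    }

    /* reduce(reverse32(x), n) then keys off the low bits of the original x. */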
thxg 2021-08-18 10:05:55 +0000 UTC [ - ]
nullc 2021-08-18 21:45:06 +0000 UTC [ - ]
Look a generation or two back and you'll see that the division is a LOT slower than 6 cycles, too.
But it should be noted that if the modulus is a compile time constant the compiler will turn the % into a multiply and add.
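A sketch of that strength reduction, using modulus 9 as an arbitrary example (div9 and mod9 are illustrative names; the magic constant satisfies 0x38E38E39 * 9 == 2^33 + 1, which is what makes the shift exact for all 32-bit inputs):

    #include <stdint.h>

    /* What compilers emit for x / 9: multiply by a magic constant, then shift. */
    static uint32_t div9(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0x38E38E39u) >> 33);
    }

    /* x % 9 then falls out with one more multiply and a subtract. */
    static uint32_t mod9(uint32_t x) {
        return x - 9 * div9(x);
    }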
The RISC-V bitmanip instruction set has permutes that can reverse the bits in a word (along with other useful permutations). ... still doesn't exist in an actual ASIC yet. Maybe some time in my life we'll get it on common platforms.
deepsun 2021-08-18 08:50:49 +0000 UTC [ - ]
flohofwoe 2021-08-18 10:19:49 +0000 UTC [ - ]
https://c9x.me/x86/html/file_module_x86_id_210.html
...so the compiler could even drop the right-shift and instead just use the register where the upper half of the result is stored.
deepsun 2021-08-18 14:00:08 +0000 UTC [ - ]
But the type conversion is on the results, not the input. Is it guaranteed that we're always multiplying two 32-bit numbers and not two 64-bit numbers in "(uint64_t) x * (uint64_t) N"? Looking at the code, it's more natural to think we're multiplying 64-bit integers, so we might need some CPU cycles to extend "x" to 64 bits first, and likewise for "N", before doing the multiplication.
pwuille 2021-08-18 18:18:53 +0000 UTC [ - ]
wizzwizz4 2021-08-18 09:11:36 +0000 UTC [ - ]
bbojan 2021-08-18 08:53:04 +0000 UTC [ - ]
It shows how to use an unsigned 64-bit multiplication to restrict the range of a 32-bit random number generator to 0..N-1. No shift is needed, as the result lands in the upper 32-bit register.
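For reference, the reduction being discussed, essentially as it appears in the article (it matches the expression quoted elsewhere in this thread):

    #include <stdint.h>

    /* Map a uniformish 32-bit x into [0, N): keep the high 32 bits of x * N.
       The >> 32 is free in practice, since after a 32x32->64 multiply the
       high half already sits in its own register. */
    uint32_t reduce(uint32_t x, uint32_t N) {
        return (uint32_t)(((uint64_t)x * (uint64_t)N) >> 32);
    }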
kazinator 2021-08-18 18:23:20 +0000 UTC [ - ]
What we can do is take enough bits from the number to cover the range, in this case 4.
Then we test whether the value is in the range 0 to 13. If not, we throw the random number away and try again with another four bits.
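A minimal sketch of that rejection loop (rand4() is a hypothetical source of four random bits):

    #include <stdint.h>

    extern uint32_t rand4(void);   /* hypothetical: returns 4 uniform random bits */

    /* Draw uniformly from 0..13: take 4 bits, reject 14 and 15, retry. */
    uint32_t rand14(void) {
        uint32_t v;
        do {
            v = rand4() & 0xF;
        } while (v > 13);
        return v;
    }

pwuille 2021-08-18 18:48:19 +0000 UTC [ - ]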
However, in a setting such as finding buckets in a hash table whose size isn't a power of two, this would amount to potentially computing multiple hashes. In certain scenarios that may be unacceptable for computational reasons. If strict uniformity isn't required, the approach presented here may be good enough: it won't be exactly uniform if the range isn't a power of two, but it'll be equally close to uniform as modulo reduction is.
kazinator 2021-08-18 20:12:02 +0000 UTC [ - ]
I think I'd never want to do such a thing, but I will remember this multiplication trick for a legitimate situation where a non-power-of-two residue is being calculated, but the mathematical residue per se is not required.
nullc 2021-08-18 21:41:55 +0000 UTC [ - ]
Particularly for large open hash tables, having to double the size of your table to reach the next power of 2 is a really significant memory overhead... and if the size is variable, the multiply is a lot faster. It's a good technique.
kazinator 2021-08-18 22:35:50 +0000 UTC [ - ]
There is simply no modern use case for non-power-of-two hash tables. Even in open addressing.
Table sizes that are prime numbers are theoretically important because they allow techniques like quadratic probing to guarantee visiting every possible table offset, thanks to Euler's theorem:
https://en.wikipedia.org/wiki/Euler%27s_theorem
(Only if n is prime is the phi(n) totient function equal to n - 1, meaning that all values of the congruence are traversed by the successive powers of a.)
This will go out the window if you start doubling the table, so that n is no longer prime; you instead need a predetermined sequence of primes to serve as table sizes.
Quadratic probing is silly because it's been found that linear probing, in spite of its issues, wins in terms of caching performance. The only reason to go to open addressing over chained hashing would be for better caching, due to not chasing chain pointers, and then you might as well reap the full improvement from better caching by using linear probing.
nullc 2021-08-18 23:44:20 +0000 UTC [ - ]
I think I'd much rather a hash table that needs 32.1 GB actually take 32.1 GB and not 64 GB especially when I need several of them, thank you very much!
With the extra memory saved and the TLB/cache footprint reduced I can get a lot of performance to offset the minuscule latency difference between a widening multiply and a logical and!
In a recent data structure I've worked on, using an algebraic code (one requiring a couple of bisection searches in a couple of tiny tables to decode) just to shave 4 bits off each table entry measurably improved performance, simply from reducing the cache/TLB footprint a few percent. As computers have become faster, multiplies and adds have become increasingly free relative to memory latency/bandwidth.
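A sketch of the widening-multiply bucket computation being argued for here (bucket is an illustrative name):

    #include <stdint.h>
    #include <stddef.h>

    /* Initial probe slot for 32-bit hash h in a table of any size n:
       no division, no power-of-two requirement. */
    static size_t bucket(uint32_t h, uint32_t n) {
        return (size_t)(((uint64_t)h * n) >> 32);
    }

    /* Linear probing then walks slot, slot+1, ..., wrapping at n. */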
NWDD 2021-08-18 21:29:46 +0000 UTC [ - ]
The topic of unbiased bounded generation is pretty interesting in itself, and both Lemire and M.E. O'Neill (author of PCG) wrote blog posts and an academic paper comparing different approaches (some using the fixed-point trick as a building block):
https://www.pcg-random.org/posts/bounded-rands.html
https://lemire.me/blog/2019/06/06/nearly-divisionless-random...
https://arxiv.org/pdf/1805.10941.pdf
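For the curious, the nearly-divisionless algorithm from the linked Lemire paper looks roughly like this (rand32() is a hypothetical uniform 32-bit generator):

    #include <stdint.h>

    extern uint32_t rand32(void);  /* hypothetical uniform 32-bit PRNG */

    /* Unbiased integer in [0, range), range >= 1: multiply-shift, with a
       rejection test on the low 32 bits that rarely triggers; the threshold
       (2^32 mod range) is only computed on the slow path. */
    uint32_t bounded(uint32_t range) {
        uint64_t m = (uint64_t)rand32() * range;
        uint32_t l = (uint32_t)m;
        if (l < range) {
            uint32_t t = (uint32_t)(-range) % range;  /* 2^32 mod range */
            while (l < t) {
                m = (uint64_t)rand32() * range;
                l = (uint32_t)m;
            }
        }
        return (uint32_t)(m >> 32);
    }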
thomasmg 2021-08-18 10:06:37 +0000 UTC [ - ]
But it does look like the technique was forgotten, or at least not used very often in the past.
adrian_b 2021-08-18 11:10:16 +0000 UTC [ - ]
If you interpret a random n-bit integer as random fraction >= 0 and < 1, it is obvious that if you multiply it by K you will have a random number >= 0 and < K.
If you take the integer part and discard the fractional part, i.e. you keep just the upper word from multiplication, you get a random integer >= 0 and <= (K-1).
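For instance, with n = 4 bits and K = 6: x = 13 stands for 13/16 = 0.8125, and 0.8125 * 6 = 4.875, so the result is 4. That is exactly the top 4 bits of the 8-bit product 13 * 6 = 78 = 0b01001110.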
This has been well known since the earliest days of pseudo-random numbers, i.e. immediately after WW2. At that time, before the introduction of floating-point numbers, interpreting binary numbers as fractions instead of integers was much more common.
Using modulo became popular only in higher-level languages, where it was more complex to write a double-word multiplication, but it is clearly an inferior solution compared to using a multiplication for the same job.
Moreover, the multiplicative method ensures that all the integer values possible for the result have the same probabilities, while with the modulo K method it is possible that the last value (K-1) has a lower probability (corresponding to fewer possible values of the input number), which is not what is normally desired.
For ancient congruential PRNGs, the modulo method was even worse, because those PRNGs had the lower bits less random than the upper bits.
formerly_proven 2021-08-18 14:11:13 +0000 UTC [ - ]
Well technically floating point was invented around 1900 and then invented again independently by Zuse in the 30s (all of his computers were floating point machines).
lifthrasiir 2021-08-18 11:19:42 +0000 UTC [ - ]
Uh, you can't do that for any non-power-of-two K. The multiplicative method doesn't remove but only moves those lower probabilities around when used without rejection.
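A tiny example: with 2 input bits (x in 0..3) and K = 3, the multiplicative method gives floor(x*3/4) = 0, 0, 1, 2, while modulo gives x % 3 = 0, 1, 2, 0. Four inputs can never cover three outputs evenly, so some output necessarily appears twice; the two methods merely place the surplus differently (here the doubled output 0 comes from inputs 0 and 1 for the multiply, versus 0 and 3 for modulo).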
adrian_b 2021-08-18 11:28:29 +0000 UTC [ - ]
If you wanted a very wide range for the output random numbers, which is seldom the case, the comparison of the methods could be reversed.
atq2119 2021-08-18 12:42:14 +0000 UTC [ - ]
pwuille 2021-08-18 16:06:18 +0000 UTC [ - ]
eternalban 2021-08-18 10:43:10 +0000 UTC [ - ]
"We can interpret a 32-bit hash value v as a binary fraction, i.e., the representation of a number between zero and one whose digits after the (binary) decimal point are given by v. Given a hash table size s, we can obtain a slot number by multiplying s and v and shifting right 32-bits. This way, arbitrary table sizes are possible without using expensive division or modulus operations. Computing s∗v can be performed with one 32-bit multiplication (generating a 64- bit result). If s is a 16-bit number, as might be expected for a small hash table, then two 16-bit multiplications, a shift, and a 32-bit addition suffice."
Kenneth A. Ross, Efficient Hash Probes on Modern Processors, IBM Research Report RC24100 (W0611-039) November 8, 2006
https://dominoweb.draco.res.ibm.com/reports/rc24100.pdf
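A sketch of the 16-bit variant Ross describes (slot16 is an illustrative name; s < 2^16 is the table size):

    #include <stdint.h>

    /* Slot for 32-bit hash v in a table of size s, using two 16x16->32
       multiplies instead of one 32x32->64. The result equals the high
       word of the 48-bit product, i.e. (v * s) >> 32, so it lies in [0, s). */
    uint16_t slot16(uint32_t v, uint16_t s) {
        uint32_t hi = (v >> 16) * s;          /* high half of v times s */
        uint32_t lo = (v & 0xFFFFu) * s;      /* low half of v times s  */
        return (uint16_t)((hi + (lo >> 16)) >> 16);
    }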
kwillets 2021-08-18 15:33:07 +0000 UTC [ - ]
kazinator 2021-08-18 18:25:09 +0000 UTC [ - ]
The method redistributes the bias compared to straight modulo.
You won't get a uniform distribution with this, just a differently shaped one compared to modulo.
pwuille 2021-08-18 16:08:03 +0000 UTC [ - ]
kwillets 2021-08-18 16:52:03 +0000 UTC [ - ]