Hugo Hacker News

A fast alternative to the modulo reduction (2016)

bbojan 2021-08-18 08:53:04 +0000 UTC [ - ]

This technique is not new. E.g. I remember it described in ARM system developer's guide published back in 2004.

It shows how to use unsigned 64-bit multiplication to restrict a range of a 32-bit random number generator to 0..N. No shifts are needed, as the result is in the upper 32-bit register.

kazinator 2021-08-18 18:23:20 +0000 UTC [ - ]

For random numbers, that wont' be uniform. Suppose we have a good quality, uniformly distributed 32 bit unsigned random number, that we want to reduce, to say, 0..13 which also has a uniform distribution.

What we can do take enough bits from the number to cover the range, in this case 4:

  out = r % 16;
Then, we test whether the value is in the range 0 to 13. If not, we throw the random number away, and try it with a another four bits.

pwuille 2021-08-18 18:48:19 +0000 UTC [ - ]

Yes, that's called rejection sampling, and it's the only way to produce truly uniform output if the range doesn't evenly divide the input range.

However in a setting of say finding buckets in a hashtable which isn't a power-of-two size, this would amount to potentially computing multiple hashes. In certain scenarios, that may be unacceptable for computational reasons. If strict uniformity isn't required, the approach presented here may be good enough. It won't be exactly uniform if the range isn't a power of two, but it'll be equally close to uniform as modulo reduction is.

kazinator 2021-08-18 20:12:02 +0000 UTC [ - ]

> hashtable which isn't a power-of-two size

I think, I'd never want to do such a thing, but I will remember this multiplication trick in a legitimate situation where a non-power-of-two residue is being calculated, but the mathematical residue per se is not required.

nullc 2021-08-18 21:41:55 +0000 UTC [ - ]

If your non-power-of-two sized hashtable is compile-time-constant, the modulus will get compiled into similarly fast code. It's only needed when the size isn't a constant.

Particularly for large open hashtables having to double the size of your table to go up to the next power of 2 is a really significant memory overhead... and if the size is variable, the multiply is a lot faster. It's a good technique.

kazinator 2021-08-18 22:35:50 +0000 UTC [ - ]

A power of two hash table doubling in size just has to adjust a clipping mask. E.g. when the size is 16 we have mask == 0xF. When the size goes to 32, we have mask == 0x1F. Reducing the hash value to the table size is just hash & mask.

There is simply no modern use case for non-power-of-two hash tables. Even in open addressing.

Table sizes which are prime numbers are theoretically important because they allow techniques like quadratic probing to guarantee to visit every possible table offset, thanks to Euler's theorem. This one:

https://en.wikipedia.org/wiki/Euler%27s_theorem

(Only if n is prime is the phi(n) totient function equal to n - 1, meaning that all values of the congruence are traversed by the successive powers of a.)

This will go out the window if you start doubling the table, so that n is no longer prime; you instead need a predetermined sequence of primes to serve as table sizes.

Quadratic probing is silly because it's been found that linear probing, in spite of its issues, wins in terms of caching performance. The only reason to go to open addressing over chained hashing would be for better caching, due to not chasing chain pointers, and then you might as well reap the full improvement from better caching by using linear probing.

nullc 2021-08-18 23:44:20 +0000 UTC [ - ]

> There is simply no modern use case for non-power-of-two hash tables. Even in open addressing.There is simply no modern use case for non-power-of-two hash tables.

I think I'd much rather a hash table that needs 32.1 GB actually take 32.1 GB and not 64 GB especially when I need several of them, thank you very much!

With the extra memory saved and the TLB/cache footprint reduced I can get a lot of performance to offset the minuscule latency difference between a widening multiply and a logical and!

In a recent data structure I've worked on, using an algebraic code requiring a couple bisection searches in a couple tiny tables to decode-- to just reduce table entries by 4 bits measurably improved performance, simply from reducing the cache/tlb footprint a few percent. As computers have become faster multiplies and adds have become increasingly free relative to memory latency/bandwidth.

NWDD 2021-08-18 21:29:46 +0000 UTC [ - ]

For many use cases biased is good enough.

The topic of unbiased bounded generation is pretty interesting itself and both lemire and M.E. O'Neill (author of pcg) wrote blog posts and an academic paper comparing different approaches (some using as building block the fixed point trick)

https://www.pcg-random.org/posts/bounded-rands.html https://lemire.me/blog/2019/06/06/nearly-divisionless-random... https://arxiv.org/pdf/1805.10941.pdf

thomasmg 2021-08-18 10:06:37 +0000 UTC [ - ]

Yes it is not new; I read the multiplication trick was described even in 1986, in "The Metafont book" (Donald E. Knuth). (I don't have the book however).

But it does look like the technique was forgotten, or at least not used very often in the past.

adrian_b 2021-08-18 11:10:16 +0000 UTC [ - ]

It is much older than that.

If you interpret a random n-bit integer as random fraction >= 0 and < 1, it is obvious that if you multiply it by K you will have a random number >= 0 and < K.

If you take the integer part and discard the fractional part, i.e. you keep just the upper word from multiplication, you get a random integer >= 0 and <= (K-1).

This was well known since the oldest times when pseudo-random numbers began to be used, i.e. immediately after WW2. At that time, before the introduction of the floating-point numbers, interpreting the binary numbers as fractions instead of integers was much more frequent.

Using modulo became popular only in higher level languages, where it was more complex to write a double word multiplication, but it is clearly an inferior solution compared to using a multiplication for the same job.

Moreover, the multiplicative method ensures that all the integer values possible for the result have the same probabilities, while with the modulo K method it is possible that the last value (K-1) has a lower probability (corresponding to fewer possible values of the input number), which is not what is normally desired.

For ancient congruential PRNGs, the modulo method was even worse, because those PRNGs had the lower bits less random than the upper bits.

formerly_proven 2021-08-18 14:11:13 +0000 UTC [ - ]

> immediately after WW2. At that time, before the introduction of the floating-point numbers

Well technically floating point was invented around 1900 and then invented again independently by Zuse in the 30s (all of his computers were floating point machines).

lifthrasiir 2021-08-18 11:19:42 +0000 UTC [ - ]

> the multiplicative method ensures that all the integer values possible for the result have the same probabilities

Uh, you can't do that for any non-power-of-two K. The multiplicative method doesn't remove but only moves those lower probabilities around when used without rejection.

adrian_b 2021-08-18 11:28:29 +0000 UTC [ - ]

You are right in general, but for the common case, when you want random numbers with the maximum value much less than the maximum integer that can be represented in a word, e.g. you want dice numbers or lottery numbers, the probabilities of the integers produced by multiplication are much more uniform than of those produced by modulo.

If you would want a very wide range for the output random numbers, which is seldom the case, the comparison of the methods could be reversed.

atq2119 2021-08-18 12:42:14 +0000 UTC [ - ]

No. The two methods are qualitatively equally good. In both cases, some outputs have probability floor(2^n/K)/2^n while others have probability ceil(2^n/K)/2^n. The numbers of outputs with each probability are such that they average out to 1/K. The only difference is the distribution of which output numbers have which of these two probabilities.

pwuille 2021-08-18 16:06:18 +0000 UTC [ - ]

Exactly, both the modulo and the multiplicative method are equally close to uniform. In fact, under the constraint of starting from a uniform input with 2^n possibilities, both produce an output that is as close to uniform as possible. A necessary and sufficient requirement for that is the maximum and minimum number of inputs that maps to any given output differs by at most one, which is the case for both methods.

eternalban 2021-08-18 10:43:10 +0000 UTC [ - ]

Kenneth A. Ross mentions it in passing in his 2006 IBM research report (a great paper btw):

"We can interpret a 32-bit hash value v as a binary fraction, i.e., the representation of a number between zero and one whose digits after the (binary) decimal point are given by v. Given a hash table size s, we can obtain a slot number by multiplying s and v and shifting right 32-bits. This way, arbitrary table sizes are possible without using expensive division or modulus operations. Computing s∗v can be performed with one 32-bit multiplication (generating a 64- bit result). If s is a 16-bit number, as might be expected for a small hash table, then two 16-bit multiplications, a shift, and a 32-bit addition suffice."

Kenneth A. Ross, Efficient Hash Probes on Modern Processors, IBM Research Report RC24100 (W0611-039) November 8, 2006

https://dominoweb.draco.res.ibm.com/reports/rc24100.pdf

kwillets 2021-08-18 15:33:07 +0000 UTC [ - ]

What is new here is the method to remove bias.

kazinator 2021-08-18 18:25:09 +0000 UTC [ - ]

I don't believe it removes bias. By pigeon-hole-principle-related reasoning, you cannot map, say, 65536 values to 17 values without bias.

The method redistributes the bias compared to straight modulo.

You won't get a uniform distribution with this, just a differently shaped one compared to modulo.

pwuille 2021-08-18 16:08:03 +0000 UTC [ - ]

No, it's not. The method described by ARM is functionally identical, so it is equally biased as the method described in the blog post here.

kwillets 2021-08-18 16:52:03 +0000 UTC [ - ]

I see, this is one of several posts on this topic. There is a link near the end to the unbiased method, which is more original, but as you note fixed-point multiplication is not new.

pwuille 2021-08-18 15:56:23 +0000 UTC [ - ]

I recently discovered a generalization of this approach, which allows mapping a single hash to multiple independent numbers, each in their own range, while maintaining various uniformity properties.

A write-up is here, in case anyone is interested: https://github.com/sipa/writeups/tree/main/uniform-range-ext...

kwillets 2021-08-18 17:14:20 +0000 UTC [ - ]

This may save random bits, which are generally more costly than the range reduction.

timdaub 2021-08-18 08:46:20 +0000 UTC [ - ]

I recently discovered a problem in this domain of modulo reduction. I'm looking for a way to create a smaller index for Ethereum accounts that are in the range of 0 to 2^160 [1]. It's mainly to reduce gas storage cost [3].

Most folks I asked about it, ended up concluding that I'll end up with this pidgeon hole effect where it'll become likely that an index is suggested multiple times [2].

It may not be super cryptographically-sophisticated: But what I want is a hash function that's (at least) somewhat aware of collisions without using too much storage. Bloomfilters were recommended - but it seems they need to much storage too. I need to be able to map Ethereum accounts to a smaller sized domain e.g. 0 to 2^32 to safe storage costs in smart contracts.

Since I asked on Stackexchange, the problem has become kind of stale and I stopped working on it. But it's fascinating and I should continue to dive deeper...

Interesting article! Thanks

- 1: https://crypto.stackexchange.com/questions/91742/can-there-b...

- 2: https://en.wikipedia.org/wiki/Pigeonhole_principle

- 3: https://github.com/rugpullindex/honeybatcher

FreakLegion 2021-08-18 09:21:04 +0000 UTC [ - ]

There are smaller alternatives to Bloom filters, if size is the only issue. A few examples:

- https://github.com/0xcb/Golomb-coded-map

- https://github.com/efficient/cuckoofilter

- https://github.com/FastFilter/xor_singleheader/

timdaub 2021-08-18 13:36:01 +0000 UTC [ - ]

Thanks, helpful links. I've looked into them and found them a great fit for my awesome-ethereum-rollups list [1].

- 1 https://github.com/rugpullindex/awesome-ethereum-rollups#mem...

wizzwizz4 2021-08-18 08:34:42 +0000 UTC [ - ]

> I also suspect that in many cases, given a list of IDs you can get reasonable results by doing just ID % . In this case, you can get away without applying a hashing transformation. With your trick, you do need this.

Reverse the bits before applying the trick.

thxg 2021-08-18 10:05:55 +0000 UTC [ - ]

Yes, although the trick is used in the first place in order to replace a 6-cycle division by a 1-cycle multiplication (on x86_64, assuming no CPU stalls). Roughly, in order to still make it a net positive, we have a 4-cycles budget for additional operations. That's very tight for reversing the bits of a uint32_t.

nullc 2021-08-18 21:45:06 +0000 UTC [ - ]

> used in the first place in order to replace a 6-cycle division by a 1-cycle multiplication

Look a generation or two back and you'll see that the division is a LOT slower than 6 cycles, too.

But it should be noted that if the modulus is a compile time constant the compiler will turn the % into a multiply and add.

The riscv bitmanip instruction set has permutes that can reverse the bits in a word (along with other useful permutations). ... still doesn't exist in an actual asic yet. Maybe some time in my life we'll get it on common platforms.

deepsun 2021-08-18 08:50:49 +0000 UTC [ - ]

What about type conversion? The values were `uint32_t`, but converted to `uint64_t`, doesn't that consume cycles as well, or for one operation, optimizer compiles that away?

flohofwoe 2021-08-18 10:19:49 +0000 UTC [ - ]

Multiplying two 32-bit numbers yields a 64-bit result down on the CPU instruction level, so the type conversion is basically built in, details may differ by CPU, but here's the vanilla x86 multiply which spreads the results over two registers:

https://c9x.me/x86/html/file_module_x86_id_210.html

...so the compiler could even drop the right-shift and instead just use the register where the upper half of the result is stored.

deepsun 2021-08-18 14:00:08 +0000 UTC [ - ]

Interesting, thank you.

But the type conversion is on the results, not input. Is it guaranteed that we're always multiplying two 32-bit numbers and not two 64-bit numbers in "(uint64_t) x * (uint64_t) N"? Because looking at the code, it's more natural to think we're multiplying 64-bit integers, so we may need some CPU cycles to make "x" occupy 64 bits first, same for "N", and then do multiplication on them.

pwuille 2021-08-18 18:18:53 +0000 UTC [ - ]

The conversion is free, at least on x86-like platforms, because there are no separate 32-bit and 64-bit registers. Instead, there is a fixed shared set of registers, and the instructions signal whether they operate on the 32-bit or 64-bit value on it. When assigning to a register in 32-bit mode the top 32 bits are cleared, so if x and n were loaded into 32-bit registers, a 64-bit multiplication can be applied to it directly, without any conversion instructions.

wizzwizz4 2021-08-18 09:11:36 +0000 UTC [ - ]

Some CPUs have a single instruction for “multiply but get the high word”.