In-Place Parallel Super Scalar Samplesort (IPS⁴o)

scott_s 2021-08-18 13:36:48 +0000 UTC [ - ]

Academic paper: "Engineering In-place (Shared-memory) Sorting Algorithms", Michael Axtmann, Sascha Witt, Daniel Ferizovic, Peter Sanders, https://arxiv.org/abs/2009.13569

To start discussion, I think it's interesting to note how I landed on this repo and paper. I got curious about Tim Sort, and some searching on HN brought up a 2018 story, "On the Worst-Case Complexity of TimSort", https://news.ycombinator.com/item?id=17883461. In the comments, poster lorenzhs has an excellent comment (https://news.ycombinator.com/item?id=17884612) which points out IPS4o as a sorting algorithm that takes modern hardware into account.

h2odragon 2021-08-18 15:23:19 +0000 UTC [ - ]

> The parallel version of IPS⁴o requires 16-byte atomic compare-and-exchange instructions to run the fastest. Most CPUs and compilers support 16-byte compare-and-exchange instructions nowadays.

I was so thrilled when I got 4-byte CAS on Intel, SPARC, and Alpha (all I had were uniprocessor Alphas, but it made the code feel cleaner). Then I wandered into other things and haven't been keeping up with this stuff. Damn, it's got fancy.

Does memory latency still suck rocks, with lots of handwaving about "but we have so much width now"?

dragontamer 2021-08-18 16:57:53 +0000 UTC [ - ]

> Does memory latency still suck rocks with lots of handwave about "but we have so much width now"?

Basically.

Intel processors can process something like 2 or 3 load/stores per clock cycle to L1 cache at full rate, no questions asked.

When you use AVX-512 and do 512-bit (64-byte) loads/stores, that's still only 2 or 3 per clock cycle. But if you use typical 64-bit registers, you're still only able to do 2 or 3 loads/stores per cycle. It's not even memory latency here; it's literally the execution port that's the holdup.

-------

Now I've never looked at AVX-512 / AVX2 (256-bit) from the perspective of multithreaded / atomic access / memory ordering. But from a performance perspective, that load/store is the full 512-bit or 256-bit width, and probably atomic.

------

I'm assuming the 16-byte atomic compare-and-exchange instructions are NEON on ARM (128-bit width), and probably some AVX instruction I'm not aware of on Intel/AMD (256-bit).

EDIT: It seems like the 128-bit instruction on x86 is cmpxchg16b. The cache itself operates on 64-byte (512-bit) cache lines, however. Intel has a TSX-NI (transactional memory) extension that operates over 64 bytes (making it possible to implement a cmpxchg64b, though over multiple instructions). It's not actually AVX-512 or AVX2 related at all.

ithkuil 2021-08-18 23:09:00 +0000 UTC [ - ]

> its literally the execution port that's the holdup.

Is it also a limitation of the register file ports/granularity? If not, what would stop Intel from adding an instruction that spills and fills a set of GP registers back and forth from memory (e.g. the stack)?

h2odragon 2021-08-19 00:09:44 +0000 UTC [ - ]

Knuth stopped including the "sequential access" stuff just as it came back to usefulness :)

chmod775 2021-08-18 16:58:47 +0000 UTC [ - ]

> "but we have so much width now"?

We also have 16MB of L3 and smarter prefetching!