
System CPU Usage and Glibc

AshamedCaptain 2021-08-19 12:56:31 +0000 UTC [ - ]

Nothing in this article makes any sense. You simply cannot use MAP_SHARED by default, because then it will wreak havoc the moment anyone uses fork() (for anything other than to immediately exec). And I fail to see any reason why MAP_SHARED vs MAP_PRIVATE would alter the performance characteristics of multi-threading within the same process: although the multiple threads are technically forks/clones, they all share the same VM space regardless of whether you used MAP_PRIVATE or MAP_SHARED. And neither MAP_SHARED nor MAP_PRIVATE will alter the way entries in the TLB are prepopulated or not, which is going to be at the choice of the processor itself (after all, it is a _cache_).

So can someone please explain? Maybe what he actually wants is madvise or hugetlbfs?
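
To make the fork() point concrete, here is a minimal sketch (mine, not from the article): a child's write to a MAP_PRIVATE page is copy-on-write and stays invisible to the parent, while a MAP_SHARED write is seen by both, which is why MAP_SHARED-by-default would break anything that forks.

    // Build with: g++ -O2 cow_probe.cpp
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    static int probe(int flags) {
        int *p = static_cast<int *>(mmap(nullptr, sizeof(int),
                                         PROT_READ | PROT_WRITE,
                                         flags | MAP_ANONYMOUS, -1, 0));
        if (p == MAP_FAILED) return -1;
        *p = 0;
        if (fork() == 0) {        // child writes...
            *p = 42;
            _exit(0);
        }
        wait(nullptr);            // ...parent reads after the child exits
        int seen = *p;
        munmap(p, sizeof(int));
        return seen;
    }

    int main() {
        std::printf("MAP_PRIVATE: parent sees %d\n", probe(MAP_PRIVATE)); // 0
        std::printf("MAP_SHARED:  parent sees %d\n", probe(MAP_SHARED));  // 42
    }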

st_goliath 2021-08-19 13:39:00 +0000 UTC [ - ]

> Nothing in this article makes any sense.

Well, that's basically it.

The author appears to have heard some explanations of Unix & OS internals, but not quite understood them, and confuses a lot of things. From the looks of it, the author did some voodoo problem-solving and then, convinced of having understood the problem, decided to write an article about it.

There's a lot of second-guessing, e.g. about what the mmap flags do based solely on their names, resulting in the bizarre connection to security guarantees; casually dismissing overcommit (because "nobody would do such a thing"), etc...

formerly_proven 2021-08-19 16:04:37 +0000 UTC [ - ]

This reminds me a little bit of a lengthy blog article by some FAANG engineer about picking optimal chunk sizes for I/O where he did a lot of benchmarking with /dev/null and /dev/zero.

scottlamb 2021-08-19 14:04:56 +0000 UTC [ - ]

> So can someone please explain? Maybe what he actually wants is madvise or hugetlbfs ?

I think you're getting warm. It's hard to know anything from this confused article, but I'll assume they're correct that there was only one process, the pages were filled prior to the load test, and there were many minor page faults later. My best guess is that the page faults are due to the kernel moving things to transparent huge pages in the background (and struggling to find/make unfragmented physical address space). The behavior change was due to the mmap size change they observed: before, with the extra bytes, the kernel didn't initially use huge pages; after, it did, so no background operation was necessary. (Or maybe MAP_SHARED helped because they had the mappings backed by a non-tmpfs filesystem, where huge pages aren't supported. Or maybe some behavior changes were just because they tested multiple times without rebooting and the physical page space got fragmented. Again, hard to know much from this article.)

If I were them, I'd confirm this with "perf record" or eBPF and/or by disabling transparent huge page compaction. Then I'd switch to using dedicated huge pages, because huge pages really do speed things up by something like 15% for a memory-heavy workload; they're worth it if you can avoid THP's sometimes pathological behavior.
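
The dedicated-huge-pages route could look roughly like this (a hedged sketch, not the article's code; it assumes huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages, and uses the system's default huge page size, typically 2 MiB on x86-64):

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        const size_t len = 1UL << 30;  // 1 GiB, matching the dataset size in the post
        // Explicitly back the allocation with preallocated huge pages,
        // sidestepping THP's background compaction entirely.
        void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            std::perror("mmap(MAP_HUGETLB)");  // typically ENOMEM if no pages were reserved
            return 1;
        }
        // ... load the biometric templates into p and hand it to the workers ...
        munmap(p, len);
        return 0;
    }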

Edit to add: Here is a much clearer article explaining the same problem. https://pingcap.com/blog/why-we-disable-linux-thp-feature-fo... I don't agree with the pingcap folks that you should permanently disable THP altogether but they did good debugging.

gopalv 2021-08-19 16:10:16 +0000 UTC [ - ]

> If I were them, I'd confirm this by "perf record" or eBPF and/or by disabling transparent huge page compaction

The report does look a lot like transparent hugepage defrag moving memory around randomly and system CPU spiking for no good reason.

And the workaround is valid, because THP can currently only map anonymous memory regions such as heap and stack space.

Literally the first thing I have to do on a Hadoop cluster is go around and turn off THP defrag for the multi-threaded bits we have, or system CPU goes over 20% as soon as the system takes on real workloads.

The other way you end up with a thundering herd causing permanent issues is with NUMA balancing (plus false sharing). If you scale up a system slowly, you end up allocating a whole section (like a 1 GB array) in the same NUMA zone, which makes traversing it faster; but if this happens at the same time as another thread touching the same data, that thread gets scheduled over in a different zone, and the kernel can do a lot of busy work with zone rebalancing as you go above 50% memory usage. This used to be a big deal when the first ccNUMA Intel boxes were coming out, and MySQL at Yahoo had so much trouble with NUMA messing up memory access speed assumptions[1].

The 96 cores + 235 GiB of data suggest there is some NUMA messiness going on for sure, particularly because the design pins each worker to a CPU.

The NUMA issues were front-and-center when the Power8 port of Hadoop was going on, particularly because Java only gives you two things you can tweak directly in the GC (UseNUMA and UseTLAB), and because it was a lot of cores on the bus (128 cores x 1 TB; NUMA access is like 1x-4x slower, which is hard to optimize for in a general sense).

[1] - https://blog.jcole.us/2012/04/16/a-brief-update-on-numa-and-...
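
On the NUMA-balancing point: one way to take first-touch placement out of the picture is to interleave a big, read-mostly array across nodes, so automatic NUMA balancing has less reason to migrate pages when threads on different nodes scan it. A sketch of that mitigation (mine, not anything from the article; assumes libnuma is installed):

    // Build with: g++ -O2 numa_interleave.cpp -lnuma
    #include <numa.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "libnuma not available on this system\n");
            return 1;
        }

        const size_t len = 1UL << 30;      // 1 GiB, like the datasets in the article
        // Spread pages round-robin over all allowed nodes instead of letting
        // first-touch place the whole array on the allocating thread's node.
        void *buf = numa_alloc_interleaved(len);
        if (!buf) { std::perror("numa_alloc_interleaved"); return 1; }

        std::memset(buf, 0, len);          // touch the pages so placement happens now

        // ... hand buf to worker threads pinned on different nodes ...

        numa_free(buf, len);
        return 0;
    }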

scottlamb 2021-08-19 16:26:49 +0000 UTC [ - ]

Great point about NUMA. It sounds like one request is basically "scan the entire gallery for matches to this image". This is similar to websearch. The ideal thing would be for each of the pinned, per-core threads to initialize its own memory region and be responsible for scanning just that memory region, so the 96 threads all work on the same request then move on to the next. This would be more efficient as well as have lower latency at <100% utilization. I don't think they're doing that or they wouldn't have talked about increasing concurrency "from 10, 20, 40, 60, 80 and 96".
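
Roughly this shape, I mean (a sketch of the design I'm guessing at, not their code; names and shard sizes are made up):

    // Build with: g++ -O2 -pthread shards.cpp
    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    struct Shard {
        std::vector<float> embeddings;  // this worker's slice of the gallery
    };

    int main() {
        const unsigned n = std::thread::hardware_concurrency();  // e.g. 96
        std::vector<Shard> shards(n);
        std::vector<std::thread> workers;

        for (unsigned i = 0; i < n; ++i) {
            workers.emplace_back([i, &shards] {
                // Pin the worker so its shard stays on one core / NUMA node.
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(i, &set);
                pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

                // First touch from the pinned thread places the shard's pages
                // on the local node.
                shards[i].embeddings.resize(1 << 20);  // placeholder shard size

                // ... block on a request queue; for each probe image, scan only
                // shards[i] and contribute partial matches to a shared reducer ...
            });
        }
        for (auto &t : workers) t.join();
    }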

> And the workaround is valid, because THP can currently only map anonymous memory regions such as heap and stack space.

Also tmpfs, fwiw. https://www.kernel.org/doc/html/latest/admin-guide/mm/transh... says "Currently THP only works for anonymous memory mappings and tmpfs/shmem. But in the future it can expand to other filesystems." I'm eagerly waiting for that future...

I don't think they're using tmpfs because they said "The default tmpfs on the host was untouched and was left at 50% (128 GiB)", and 235 GiB doesn't fit in 128 GiB.

tyingq 2021-08-19 13:36:08 +0000 UTC [ - ]

He's saying that MAP_PRIVATE|MAP_ANONYMOUS causes copy-on-write not just for forked processes, but for threads using std::vector. So he redesigned with an RAII wrapper that explicitly uses mmap() with MAP_SHARED to avoid the copy-on-write.

It's confusing because his rewrite isn't multi-process but multi-threaded. And then he keeps calling threads processes.
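
Presumably something along these lines (not the author's actual code, just a guess at the shape of that wrapper):

    #include <sys/mman.h>
    #include <cstddef>
    #include <stdexcept>

    // RAII wrapper that allocates with mmap(MAP_SHARED | MAP_ANONYMOUS) and
    // unmaps in the destructor, instead of letting std::vector go through malloc.
    class SharedBuffer {
    public:
        explicit SharedBuffer(std::size_t len)
            : len_(len),
              data_(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0)) {
            if (data_ == MAP_FAILED)
                throw std::runtime_error("mmap failed");
        }
        ~SharedBuffer() { munmap(data_, len_); }

        SharedBuffer(const SharedBuffer &) = delete;
        SharedBuffer &operator=(const SharedBuffer &) = delete;

        void *data() const { return data_; }
        std::size_t size() const { return len_; }

    private:
        std::size_t len_;
        void *data_;
    };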

AshamedCaptain 2021-08-19 13:43:18 +0000 UTC [ - ]

> He's saying that MAP_PRIVATE|MAP_ANONYMOUS causes copy-on-write not just for forked processes

But that is just plain wrong; it's practically in the definition of threads that they share the VM between themselves. If MAP_PRIVATE meant "copy-on-write even for threads using std::vector" most multi-threaded programs would stop working, save for perhaps a couple of purely functional examples.

On the positive side, there wouldn't be any data-races. :)
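
A tiny illustration of the point (my example): a MAP_PRIVATE | MAP_ANONYMOUS region mapped by one thread is fully visible and writable from every other thread in the process; there is no copy-on-write between threads.

    // Build with: g++ -O2 -pthread thread_visibility.cpp
    #include <sys/mman.h>
    #include <cassert>
    #include <thread>

    int main() {
        int *p = static_cast<int *>(mmap(nullptr, sizeof(int),
                                         PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        assert(p != MAP_FAILED);
        *p = 0;
        std::thread t([p] { *p = 42; });  // another thread writes the "private" page
        t.join();
        assert(*p == 42);                 // the main thread sees the write
        munmap(p, sizeof(int));
    }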

nyrikki 2021-08-19 17:03:07 +0000 UTC [ - ]

If you are using glibc, you really need to see what clone() flags are being passed, as fork() just maps to clone() with default flags now.

Some of the other libc variants used by space-optimized containers like Alpine will still use the legacy fork().

But I constantly see developers confused because they were taught that the v7-style fork() is still used in modern POSIX.

That model is didactic, not a rule; focusing on clone() helps get past the change in behavior for both threads and processes.

This author seems to have missed this change in behavior and went down a few crazy paths.

tyingq 2021-08-19 13:59:47 +0000 UTC [ - ]

Maybe he means "copy on write" for copies of a std::vector, instead of a real copy.

yakubin 2021-08-19 12:58:54 +0000 UTC [ - ]

Several paragraphs don't make sense to me:

>Our biometric datasets are sized at 1 GiB for easier file management as we could turn on transparent huge pages, if required. So it is guaranteed that mmap is being used under the hood by std::vector, so why was there TLB misses in the first place? That’s because the MAP_PRIVATE flag instructs the kernel to turn off VM_SHARED flag when creating the mapping in the kernel. This causes TLB to be cold and populate it only on demand, via page faults. As long as this translation is small, there’s no overhead. However as the memory pressure increases with concurrency, the kernel has to fight hard to populate TLB.

There is the MAP_POPULATE flag to populate the mapping eagerly without waiting for page faults. I'm not sure what MAP_SHARED has to do with that.
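
For reference, a minimal sketch of what MAP_POPULATE does (my example, not the article's code): the kernel prefaults the whole mapping at mmap() time, so later accesses don't take minor page faults, and no MAP_SHARED is needed for that.

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        const size_t len = 1UL << 30;  // 1 GiB, like one of the article's datasets
        void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
        // ... pages are already resident here; touching them won't fault ...
        munmap(p, len);
        return 0;
    }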

> That aside, the reasoning behind glibc allocator’s use of mmap with MAP_PRIVATE|MAP_ANONYMOUS as defaults, is privacy and anonymity. Which makes sense, because we do not want other processes to peak into the memory region of our process. That would be a security nightmare. But I’m not sure I agree on this for inter-thread design. Threads inherently share the same process space and thus the heap. MAP_PRIVATE gets a copy-on-write mapping for performance reasons (may be). MAP_ANONYMOUS will ensure the mapping is initialized to zero. So the performance optimization is wasted away. However this doesn’t make sense for user-space data structures std::vector or malloc for that matter (FWIW, malloc, although a function, has it’s own internal data structure under the hood). No one is going to throw away a memory region after allocation without writing something into it, after all, why else will they allocate in the first place? As the size of memory being requested is large, it makes sense to use MAP_SHARED|MAP_ANONYMOUS as the default in glibc.

CoW is for separate processes, not for just any task. A different thread in the same process writing to a page won't trigger a copy. You don't need to pass MAP_SHARED to mmap to share memory between threads; the arguments to the clone(2) syscall which created the thread already made sure that you share the memory.

What I really suspect happened is that they didn't call std::vector::reserve at the start, so there was a lot of overhead from std::vector resizing (allocating new memory, copying the contents, freeing the old memory); when they moved to calling mmap directly, they effectively reserved memory at the start, as would have happened if they had called std::vector::reserve. Unfortunately, the post doesn't contain enough details to really be sure about that.
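
If that's what happened, the fix would have been roughly a one-liner; a guess, since the post doesn't show the allocation code:

    #include <cstddef>
    #include <vector>

    std::vector<float> load_dataset(std::size_t n_floats) {
        std::vector<float> v;
        v.reserve(n_floats);    // one allocation up front, no reallocation/copying
        for (std::size_t i = 0; i < n_floats; ++i)
            v.push_back(0.0f);  // ... fill from the dataset file ...
        return v;
    }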

ReactiveJelly 2021-08-19 13:51:25 +0000 UTC [ - ]

> the reasoning behind glibc allocator’s use of mmap with MAP_PRIVATE|MAP_ANONYMOUS as defaults, is privacy and anonymity. Which makes sense, because we do not want other processes to peak into the memory region of our process. That would be a security nightmare

This is also a big 'hmm'

amluto 2021-08-19 13:30:59 +0000 UTC [ - ]

> This causes TLB to be cold and populate it only on demand, via page faults.

This article is confused. Page faults populate the page tables, not the TLB.

hipokampa 2021-08-19 13:16:59 +0000 UTC [ - ]

> No one is going to throw away a memory region after allocation without writing something into it, after all, why else will they allocate in the first place?

Hah! Just try running a server with vm.overcommit_memory=2.
