Exploring Clang/LLVM optimization on programming horror
eatonphil 2021-08-17 21:27:59 +0000 UTC [ - ]
I still cannot fathom how you can encode induction detection into an algorithm, any pointers welcome (keeping it simple, please, and ideally with working code).
The only case that makes sense to me is if you did numeric computation against some fixed number of cases and if that worked out then you assume it's right.
I guess this is what proof assistants do (among other things). Maybe I should look into how to write a basic proof assistant.
jcranmer 2021-08-17 22:34:38 +0000 UTC [ - ]
loop.preheader:
; This is the block that executes before the loop does.
br label %loop.body
loop.body:
%i = phi i32 [0, %loop.preheader], [%i.next, %loop.body]
%val = phi i1 [0, %loop.preheader], [%val.next, %loop.body]
%val.next = xor i1 %val, true
%i.next = add i32 %i, 1
%cmp = icmp slt i32 %i.next, %N
br i1 %cmp, label %loop.body, label %loop.exit
loop.exit:
; After the loop is done
Because of how phis work, in constructing this form we immediately learn how to describe a value either in terms of its value in the first iteration (the value it takes coming from the preheader), or solely in terms of its value in the previous iteration. In other words, %val.next = f(%val) for some function f, which we can easily identify in the source code.
Now, if we can compute the repeated composition of f easily, we can replace the recurrence in the loop with repeated composition. Repeated addition of the same value is multiplication, and repeated multiplication of the same value is exponentiation: two very easy examples. We recognize that there is no use of the value except its final occurrence, so instead of running the loop, we can generate the composition itself. Thus, after N iterations of the loop, we can conclude that %val.next = f^N(0), and if we can identify the loop iteration count (that's %N in this case), we can simply write out f^N instead of having to compute every individual value.
rocqua 2021-08-18 07:55:03 +0000 UTC [ - ]
I've seen LLVM do the same optimization for triangular and pyramidal numbers. Is there a list of hardcoded functions it can handle? Does it try a specific algorithm to find the direct formula?
salawat 2021-08-18 02:45:27 +0000 UTC [ - ]
jcranmer 2021-08-18 02:59:37 +0000 UTC [ - ]
I haven't looked at the exact criteria in this transformation, but it's very likely that there are also a couple of constraints on how precisely the loop is structured.
dw-im-here 2021-08-18 08:50:57 +0000 UTC [ - ]
contravariant 2021-08-17 21:39:27 +0000 UTC [ - ]
even = !even
and the line
    numberCompare++
resulting in
    even = (indvar & 1)
    numberCompare = indvar
You can then solve from the exit condition that number == numberCompare, which gives you the indvar, which can then be substituted into 'even'. I'm not saying it isn't magical, and it certainly requires some symbolic solving, but it's doable.
Of course the real goal is to eventually get it to optimize
while (number != 1) {
if (number & 1)
number = number * 3 + 1;
number >>= 1;
}
missblit 2021-08-17 22:26:14 +0000 UTC [ - ]
Valid C++ programs are guaranteed to be able to make forward progress [1].
So if `number` is a normal stack allocated integer (and not volatile, etc), then the infinite looping case is undefined behavior here.
So it would be a valid optimization to transform this to `number = 1;`.
(And indeed: https://godbolt.org/z/eodhfWe6h )
secondcoming 2021-08-17 22:43:26 +0000 UTC [ - ]
icc does not make this astounding leap in logic.
jcranmer 2021-08-18 00:00:38 +0000 UTC [ - ]
In general, compilers face an iron triangle of optimization: do you want a fast compiler, a small binary, or fast output? Optimizations that help all three are rare (especially ones that aren't already implemented in -O1), and even two of the three is often difficult to obtain.
account42 2021-08-18 12:30:03 +0000 UTC [ - ]
Why? Reasoning that the value must be 1 after the loop is not even a loop optimization, just a simple consequence of the branching instruction. This leaves the loop but nothing it computes is used anymore so since C++ guarantees that it exits you can simply remove it.
The transformation as a whole might not be useful often, but having a constant for the induction variable after the loop and being able to delete useless loops will both have runtime wins.
> In general, compilers face an iron triangle of optimization: do you want a fast compiler, a small binary, or fast output?
I often wish compilers provided more options to just brute force a bit more performance out of release builds. You can manually tune many parameters but then you need to know a lot more about how the passes work than just saying --compile-for=5h (or something like that, doesn't need to have a hard time limit).
matttb 2021-08-18 07:09:03 +0000 UTC [ - ]
onedognight 2021-08-17 22:55:08 +0000 UTC [ - ]
number = 1
for at least up to 64bit types?
vnorilo 2021-08-17 22:06:33 +0000 UTC [ - ]
Consider indvar[n] = indvar[n-1] + k, indvar[0] = a. It follows that indvar[n] = a + n * k.
This is the type of indvar canonicalization that can zap loops such as the one in the article. Suddenly the final state is just a function of the loop trip count.
The return value is replaced with canonicalized indvar. That makes the loop itself dead (nobody observes any of its effects) and removed by subsequent dead code elimination pass.
My example transforms addition to multiplication. Another common one is multiplication into power. That's close to what's going on in the article example: xor is just a multiplication of signed one-bit integers.
geofft 2021-08-17 22:27:52 +0000 UTC [ - ]
I think "induction" here isn't exactly the sense of proofs by mathematical induction as you learn in school (as in proving something for all positive integers given a base case and a recursive case) - I think all they mean by it is "a loop, over a single integer." Quoting https://en.wikipedia.org/wiki/Induction_variable :
> In computer science, an induction variable is a variable that gets increased or decreased by a fixed amount on every iteration of a loop or is a linear function of another induction variable.
The description at https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/ makes it clearer:
> An induction variable is any variable whose value can be represented as a function of: loop invariants; the number of loop iterations that have executed; and other induction variables.
That is to say, the link to proofs-by-induction is that if you have a loop like "for (i = 0; i < 10; i++)", you can prove the statement "after N loops, i = N". Why? Because after 0 loops, i = 0 (the first part), and on each loop, i is one greater (the i++ part), and nothing else that happens in the loop changes i (which is what that post means by "loop invariants").
In other words, it's not that the compiler is detecting that the programmer is doing induction (the programmer is not, in fact the programmer is simply writing a for loop and not thinking about it), it's that the compiler is performing induction on the loop.
So a simple case is if you have a loop like
int i;
for (i = 0; i < 10; i++)
printf("Hi!\n");
printf("i = %d\n", i);
You can prove that at the end of the loop, i = 10, and you don't actually need to increment i during each iteration to get there. And if you didn't have that printf (i.e., if the loop body was empty), you could optimize the loop out entirely and just print out 10.
So it's not terribly magical that LLVM is able to turn
while (number != numberCompare)
{
even = !even;
numberCompare++;
}
into, in pseudocode,
    numberCompare = number;
    even = !!!...!!!even (numberCompare times);
i.e., that it's able to get numberCompare out of the loop, to conclude that even in turn is only a function of numberCompare, and then that there's no loop left.
What's more magical to me, I think, is that it's able to make sense of "run the ! operation numberCompare times" and turn that into "if (numberCompare % 2 == 1) even = !even". I don't think that's too surprising either: a compiler should know that, for booleans, !!foo can be simplified to foo. But repeatedly applying that rule across an abstract number of calls to !!, plus possibly one more, seems pretty clever.
If I'm skimming the code right, that part of the simplification is a different optimization pass entirely. I don't know off hand where that's implemented. I'd be curious if someone can find it.
rocqua 2021-08-18 17:23:06 +0000 UTC [ - ]
(The sum of the first N integers and first N squares respectively)
I really wonder what they use for this and how much it generalizes.
benmmurphy 2021-08-17 21:37:24 +0000 UTC [ - ]
CalChris 2021-08-17 21:20:36 +0000 UTC [ - ]
woodruffw 2021-08-17 21:32:27 +0000 UTC [ - ]
Most engineers are intimidated by compilers, but they really shouldn't be! LLVM's source code is very approachable and well-isolated by concern, and most of the fundamental concepts in an optimizing compiler are understandable with a beginner's understanding of data structures, assembly, and resource allocation.
One slightly funny thing, though: LLVM's optimizations are on IR, so every frontend can theoretically benefit from them. However, many assume patterns originating from the C/C++ frontend or heavy canonicalization, so you need to be a little careful as a language frontend developer to take full advantage of all optimizations!
CalChris 2021-08-17 22:48:23 +0000 UTC [ - ]
LLVM is pass oriented but broadly there are front ends (Clang, Rust, Fortran), a middle end (LLVM IR SSA optimizations) and multiple in-tree and out-of-tree backends. I was a little resentful of having to write all of 5 lines or so inside of Clang. Otherwise, I happily live in the backend. Doing a merge with a new release is an hour or so of conflict resolution.
The pass structure importantly means that you can test your backend code, even at the GlobalISel subpass level, without needing to understand Clang, LLVM, .... Pretty sweet if you ask me.
jagger27 2021-08-17 22:42:26 +0000 UTC [ - ]
So does this mean that it’s possible to express exotic patterns in LLVM-IR that aren’t typically produced when compiling C or C++?
Are Rust’s optimizer passes significantly different than clang's? I would guess borrow checking has a special pass that wouldn’t be applicable to C.
This topic is really fascinating to me!
Rusky 2021-08-18 01:51:59 +0000 UTC [ - ]
Borrow checking is more like a type checking pass, not an optimization pass- it's pass/fail, and doesn't change the IR.
Probably the most interesting thing to look at here is Rust's use of LLVM noalias metadata. This captures the assumption that, while a reference (of type &T or &mut T) is live, the value it points to is not allowed to change via any other name. (The borrow checker ensures this assumption is valid for safe code, but the rule applies to unsafe code too- the important thing is the reference types.)
This metadata enables LLVM to perform more aggressive optimizations of pointer reads and writes. C can actually generate similar metadata via `restrict`, but this is less common because there's no borrow checker to verify that it's used correctly. And interestingly, LLVM has been very buggy in this area! Rust has had to disable its use of noalias metadata several times as those bugs have been found and fixed.
jamincan 2021-08-17 23:14:58 +0000 UTC [ - ]
sillycross 2021-08-18 01:11:42 +0000 UTC [ - ]
I think the major reason is that TBAA optimization is only possible if you manually emit those metadata (clang does, but I didn't, due to time limitation and fear of aliasing-related bugs).
But that LLVM is designed to work best on the pattern emitted by Clang is definitely another reason. In fact, even the official documentation recommends users to emit IR in a way that looks similar to what is emitted by Clang, so you have a better chance to keep the LLVM optimizer happy.
aspaceman 2021-08-18 11:15:43 +0000 UTC [ - ]
LLVM project is used to compile many different shading languages into a similar IL, then compiled into a bytecode for GPU. The project has been really useful for managing all the compiler infrastructure needed for graphics nowadays imo.
mhh__ 2021-08-17 22:32:04 +0000 UTC [ - ]
I find GCC machine def files actually slightly easier to read than LLVM's
dragontamer 2021-08-17 21:28:22 +0000 UTC [ - ]
So the key is to first understand SSA form. After that, a lot of optimization passes become obvious to think about.
---------
SSA has two bits to understand:
1. Single static assignment -- variables can be assigned, but once assigned they can never change. (This part is easy).
2. Graph-based, and therefore requires the 'Phi' function to understand: a variable may take on different values depending on which predecessor block control arrived from. EX: "r0" may come from Node#1 or from Node#2, and may have been renamed %1 in Node#1 and %2 in Node#2. "Phi" is a "fake function" that resolves the ambiguity and kinda makes SSA code "real" again.
Hmm, maybe that's a bad explanation. But whatever, my point is that understanding "Phi" is the concept you want to focus on in SSA. Phi is the harder part, but its still not that hard to understand.
Once you get it, the whole form "clicks" in your brain and you suddenly understand how optimizations can become possible.
jcranmer 2021-08-17 22:00:27 +0000 UTC [ - ]
SSA is static single assignment. Every variable has a single point of declaration. If you want to change the value of a variable (say, i = i + 1), you instead have to create a new variable. The advantage of this form is that tracking the definition of a variable is trivial; in LLVM IR, to use a variable, you literally pass in a pointer to the instruction that defined it. The inverse, finding all uses of a value, is done by building a use-list: whenever you create an instruction, it adds itself to the use-list of all the values it's using, which allows easy iteration of the users of a value.
In short, SSA gives you a trivially correct-by-construction way to understand the dataflow of a program (at least, for data that lives in registers; memory is more complicated). There's no separate analysis that has to be maintained throughout optimizations, and hence no possible source of bugs (as there is for failure to maintain the dominator tree through CFG transformations, a historically bug-prone step in LLVM).
Now, as you may notice, there are cases where it would seem hard to pin down a single definition of a variable: conditional assignment. SSA solves this by creating Phi values. Phi values can be better thought of as basic block arguments: when you enter a basic block, you must provide a value for each of the phi values in that basic block. The actual value of the phi is dependent on which edge you used to enter a basic block, or in other words, the control dependencies of the code.
In short, phis are the intersection of control flow in the code with the dataflow--and they are the only things in that intersection.
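A hand-written sketch (not actual compiler output) of a conditional assignment like `x = c ? 1 : 2` in phi form; the value of %x depends only on which edge control used to enter %merge:

```llvm
entry:
  br i1 %c, label %then, label %else
then:
  br label %merge
else:
  br label %merge
merge:
  ; "basic block argument" view: %merge receives 1 from %then, 2 from %else
  %x = phi i32 [ 1, %then ], [ 2, %else ]
```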
mhh__ 2021-08-17 22:30:25 +0000 UTC [ - ]
srcreigh 2021-08-17 21:36:05 +0000 UTC [ - ]
In case anybody's interested in learning the basics, here's some course notes to a compilers class which I love!
CalChris 2021-08-17 22:14:53 +0000 UTC [ - ]
srcreigh 2021-08-17 23:11:47 +0000 UTC [ - ]
dleslie 2021-08-17 22:16:02 +0000 UTC [ - ]
Looks like it's still 379; not sure about the optional component.
macintux 2021-08-17 22:19:27 +0000 UTC [ - ]
thrasumachos 2021-08-17 21:48:23 +0000 UTC [ - ]
titzer 2021-08-17 23:44:47 +0000 UTC [ - ]
ufo 2021-08-18 01:18:18 +0000 UTC [ - ]
Does anyone know? Perhaps it becomes tail recursive after one round of inlining?
pertymcpert 2021-08-18 07:00:30 +0000 UTC [ - ]
ufo 2021-08-18 19:45:58 +0000 UTC [ - ]
An accumulator is when we add an extra parameter to hold the "loop variable", like this:
bool doit(int n, bool acc)
{
if (n == 0) return acc;
else return doit(n-1, !acc);
}
bool isEven(int n)
{
return doit(n, true);
}
The function is now tail-recursive and behaves similarly to this loop:
    bool acc = true;
    while (n != 0) {
        acc = !acc;
        n = n - 1;
    }
    return acc;
According to the comment on the top of the file, apparently llvm can create this accumulator variable if the operation is associative and commutative. That's because after introducing the accumulator, the evaluation order is outside->inside instead of inside->outside.
martincmartin 2021-08-17 22:25:02 +0000 UTC [ - ]
In complexity theory, the size of a problem is the number of bits needed to represent the input. So if the input integer is i, the size is n = log_2(i), and the algorithm is actually exponential in the number of bits it takes to represent the input.
dragontamer 2021-08-17 22:30:40 +0000 UTC [ - ]
But in practice, you're not. Case in point: matrix multiplication is commonly quoted as O(n^3) (naive), when in fact, the amount of data used is O(n^2), and therefore should be quoted as O(size^2) (quadratic) with respect to data-size. (EDIT: was bad at math for a second).
But no, people mean "n-by-n matrix" has "O(n^3) naive" implementation. In practice, the "n" is just whatever is most convenient for a problem.
In many cases, n is proportional to the input size. But in other cases (such as the famous matrix multiplication examples), n is the sqrt(input-size).
jcranmer 2021-08-18 00:05:23 +0000 UTC [ - ]
It's actually O(size^1.5), not O(size^2). Well, assuming you're talking about Θ(n) when you mean O(n) as most people do.
[If you're not familiar with the difference, Θ(f(n)) means a function that grows as fast as f(n), up to a constant, whereas O(f(n)) is a function that grows no faster. So n^2 is O(n^3) but not Θ(n^3).]
wittycardio 2021-08-18 00:08:31 +0000 UTC [ - ]
trias 2021-08-17 22:28:57 +0000 UTC [ - ]
Ar-Curunir 2021-08-17 23:10:53 +0000 UTC [ - ]
creata 2021-08-18 01:35:38 +0000 UTC [ - ]
Ar-Curunir 2021-08-18 15:05:39 +0000 UTC [ - ]
jagrsw 2021-08-18 07:53:21 +0000 UTC [ - ]
> O(n) isEven function compared to the obvious constant time O(1) modulo algorithm
In the holy name of nitpickiness: the modulo operation is not always O(1), as div/mod CPU instructions have complex internal implementations and require a varying number of cycles depending on what the input args are. However, for x%2 it'll probably be O(1), both at the CPU level, and b/c it'll get optimized to something like x&1 by most compilers (at least for unsigned values)
8192kjshad09- 2021-08-18 08:05:35 +0000 UTC [ - ]
jagrsw 2021-08-18 08:42:37 +0000 UTC [ - ]
laszlokorte 2021-08-18 09:38:13 +0000 UTC [ - ]
There are other notations like little o(), Omega(), omega() or Theta() to capture the asymptotic runtime behavior more or less precisely.
athrowaway3z 2021-08-18 09:12:30 +0000 UTC [ - ]
woodruffw 2021-08-17 21:25:18 +0000 UTC [ - ]
One nitpicky observation: mem2reg doesn't "move values from memory [...] to registers (directly inside the CPU)." It's an SSA expansion pass, taking the minimal SSA form produced by the C/C++ frontend and turning it into a more aggressive SSA form by eliminating as many `alloca`s as possible. SSA is characterized by the presence of infinitely many "fresh" abstract registers, none of which correspond to actual CPU registers -- LLVM is responsible at a later stage for performing efficient lowering and allocation of registers/stack slots for the SSA registers.
CyberShadow 2021-08-18 08:30:41 +0000 UTC [ - ]
maattdd 2021-08-18 09:36:44 +0000 UTC [ - ]
slavik81 2021-08-17 22:38:04 +0000 UTC [ - ]
NobodyNada 2021-08-17 23:02:51 +0000 UTC [ - ]
Memory is mutable whereas SSA registers are immutable, so the compiler frontend just put all the C variables in memory to express the semantics of the C program in the simplest way possible. The mem2reg pass then moved all the variables into SSA registers, by simply creating a new register for each assignment and adding phi nodes to track the flow of variables between basic blocks (like the individual iterations of the loop body).
slavik81 2021-08-17 23:40:23 +0000 UTC [ - ]
dragontamer 2021-08-17 22:53:58 +0000 UTC [ - ]
Instead, I'll point you out to the keyword "Dominator Frontier", and "SSA Construction Algorithm".
https://en.wikipedia.org/wiki/Static_single_assignment_form#...
You should read that only after you understand SSA and Phi functions in particular. Once you know how Phi works, the next question is the algorithm for how they're constructed.
https://en.wikipedia.org/wiki/Dominator_(graph_theory)
This page may also help.
dragontamer 2021-08-17 21:34:51 +0000 UTC [ - ]
> -mem2reg: Promote Memory to Register
> This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.
-----------
https://llvm.org/docs/LangRef.html#alloca-instruction
> The ‘alloca’ instruction allocates memory on the stack frame of the currently executing function, to be automatically released when this function returns to its caller. If the address space is not explicitly specified, the object is allocated in the alloca address space from the datalayout string.
--------
Based on my reading and understanding of alloca and mem2reg above, I have to disagree with your assessment. It seems like alloca roughly corresponds to a "push" in your typical x86 assembly language, or maybe a "sub esp, #ofBytes".
By removing alloca instructions, the mem2reg step is turning stack-memory into registers.
woodruffw 2021-08-17 21:52:55 +0000 UTC [ - ]
This is true, but it's also misleading: the transformation of stack slots into machine registers is a beneficial side effect of mem2reg. Reducing stack use is an important optimization, but producing an optimal SSA form is even more important, since key passes like folding, DCE and SRoA rely heavily on the SSA form. The "reg" in "mem2reg" refers explicitly to the latter (SSA registers), as the text you've excerpted directly says.
You can also prove this to yourself by contriving an IR function that contains more SSA registers than machine registers: you'll see that, in addition to any ABI constraints, LLVM will be forced to spill any excess SSA registers back onto the stack.
Edit: but also yes, to confirm: `alloca` corresponds more or less directly to `sub ESP, <adjust>` within the x86 stack model. But it doesn't have to! LLVM's semantics are target independent even when the IR isn't.
dragontamer 2021-08-17 21:56:23 +0000 UTC [ - ]
I have crude familiarity with SSA (no personal experience but I've read a book once) but not with LLVM's terminology. I'm looking at the example in the blogpost, and I can see that its clearly adding in the important Phi functions needed
Strange that llvm names their aggressive SSA step in this manner. But yes, adding the phi functions in this manner is a very important step before we can think about optimizations.
The "inductive variable" is far easier to detect once its in a "Phi-form" so to speak.
woodruffw 2021-08-17 22:03:17 +0000 UTC [ - ]
The end result being that we have the same word "register" referring to two opposed resources (one infinite, one extremely finite) :-).
ufo 2021-08-18 01:15:26 +0000 UTC [ - ]
gsnedders 2021-08-17 22:44:50 +0000 UTC [ - ]
account42 2021-08-18 12:13:14 +0000 UTC [ - ]
Having more SSA register than machine registers does not mean there will be any spilling as unlike SSA registers, machine registers can be reused once the old value is no longer needed.
woodruffw 2021-08-18 15:48:07 +0000 UTC [ - ]