After diving into SIMD (Single Instruction Multiple Data) in our previous video, Amir Kirsh will walk us through the different optimization options of the compiler optimizer with GCC and Clang.
Transcript: Comparing Clang and GCC Optimization
Hi, everybody. My name is Amir Kirsh, and I’m a Dev Advocate at Incredibuild.
Today we are going to continue with our sessions on compiler optimizations and we are going to play a bit with some code and see optimizations that we get through the use of the compiler optimizer with GCC and Clang.
So let’s just share my Compiler Explorer. I hope you can see my screen now. We have an empty program, and we will start with a very simple function, something that we already saw in a previous session. Let’s have a function called foo, which gets a simple int, and inside the function, we want to loop over “i” till “i” reaches 10 and we want to return “i”. We already asked in a previous session whether we actually have a loop here. Well, we do have a loop, but do we actually need this loop? And the answer is no, there isn’t any need for an actual loop here and the optimizer is able to see that and to perform loop unrolling, which means that eventually, we have here a simple “if”. So, the optimizer here with GCC and -O2 decides to put 10 in the return value register and then compare the return value register to the actual parameter that we got and then, conditional upon this comparison, either to move the parameter into the return value register or, if not, to keep the 10.
So in the end, either we return 10 in case “i” was smaller than 10 or, if “i” was greater than or equal to 10, we return “i”, which means if “i” was 15, for example, we just don’t get into the loop and we return 15. This is a simple “if”, no need for a loop. But if we go with -O0, if you recall, then we do see a loop here. And this is something that we already saw. The question is, if we complicate the code a bit, if we add things into this function, would the optimizer still be able to do loop unrolling?
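The transcript does not show the code on screen, so the following is only a reconstruction from the description, with the exact loop form being an assumption. The second function, foo_branch, is a hypothetical hand-written equivalent of the simple “if” that the optimized code boils down to. Strictly speaking, the loop here is eliminated rather than unrolled, since no copies of the loop body remain.

```cpp
// Reconstruction of the function described in the talk (exact form assumed):
// advance i until it reaches 10, then return it.
int foo(int i) {
    for (; i < 10; ++i) {
        // empty body: only the counter changes
    }
    return i;
}

// Hypothetical hand-written equivalent of the optimized result:
// start from 10 and keep the parameter instead if it is >= 10.
int foo_branch(int i) {
    return i < 10 ? 10 : i;
}
```

Both functions return 10 for any input below 10 and return the input unchanged otherwise.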
So let’s complicate this function a bit. Let’s still call it foo; we don’t have to change the name of the function. But let’s add ‘sum’ initialized to zero, and let’s decide that what we want to do inside the loop is to add to sum the values of the iteration variable “i”. And we see here that now GCC doesn’t do loop unrolling with -O2. Let’s add a simple main and check what happens there. Let’s return foo of 5, let’s say. And we can see that in main, the GCC optimizer is smart enough not to call foo, so we have some kind of inlining here, and to calculate the actual number at compile time. The arithmetic series from 5 to 9 actually sums to 35. So 35 is the return value of main without any need to go through the loop.
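As a rough sketch of the function being described at this point (again reconstructed from the narration, so the exact spelling is an assumption, including the fact that it returns sum):

```cpp
// Reconstruction of the "complicated" foo: accumulate the values of i
// from the starting value up to, but not including, 10.
int foo(int i) {
    int sum = 0;
    for (; i < 10; ++i) {
        sum += i;
    }
    return sum;
}

// In the talk, main simply does: return foo(5);
// The terms are 5 + 6 + 7 + 8 + 9.
```

With a constant argument like 5, the compiler can evaluate the whole loop at compile time, which is why main ends up just returning 35.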
Does it mean that not doing loop unrolling here is OK? Well, not necessarily. Because suppose that it was not 5; suppose that it was a variable. If it is a variable, then it is not something that can be done at compile time. Let’s check it. Suppose that we get int argc and char** argv, and we send argc to foo. Now we can see that the optimizer did inlining. So we do have foo inlined into main, but we actually have a loop, where probably we could just calculate the number based on the variable. Let’s see what Clang does with this.
So let’s change the compiler to Clang, and we see that Clang does use loop unrolling. We see that the loop is gone; we don’t need a loop. There is some calculation here that calculates the arithmetic series without the need for a loop. Maybe GCC can do that with -O3. So let’s go back to GCC, and we are back to the loop, which we probably don’t need here.
And let’s check whether -O3 would do the trick. And what we see here is that the code changes, the assembly changes. But this is not loop unrolling. GCC has somehow decided to do something else here. Do you see something that you recognize? Well, I will remind you that we already saw assembly commands like movdqa and things like paddd. And what we have here is SIMD instructions. So what GCC did in this case, instead of being efficient and using loop unrolling, which is better here, is that with -O3 it just went to SIMD. What Clang did in this case, both with -O3 and -O2, is, I would say, a bit better: using loop unrolling.
This is a very simple example. But I think what we can learn from it is that optimizations behave differently with different optimization flags and with different compilers, and once you have in mind a certain optimization that can take place, in some cases it might be better to perform the optimization yourself. Of course, if this is not a bottleneck, if this is something that doesn’t affect your performance too much, then keep the code as is and count on your optimizer. But in some cases, the optimizer might not achieve what can be achieved. And it might be better in those cases that you would go and say, OK, I think that I can calculate it without a loop.
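One way to “perform the optimization yourself” for this example is to write the sum as a closed-form arithmetic series. This is only a sketch: the names foo_loop and foo_closed_form are hypothetical, the loop version is reconstructed from the narration, and the closed form assumes inputs for which the original loop would not overflow an int.

```cpp
// Reconstruction of the loop version discussed in the talk.
int foo_loop(int i) {
    int sum = 0;
    for (; i < 10; ++i) {
        sum += i;
    }
    return sum;
}

// Hypothetical hand-written replacement: the sum of the terms i, i+1, ..., 9
// is (number of terms) * (first + last) / 2.
int foo_closed_form(int i) {
    if (i >= 10) {
        return 0;  // the loop body would never run
    }
    const long long count = 10LL - i;           // number of terms
    const long long first_plus_last = i + 9LL;  // first term + last term
    // count and first_plus_last add up to 19, so exactly one of them is even
    // and the division by 2 is always exact.
    return static_cast<int>(count * first_plus_last / 2);
}
```

Whether this is worth doing by hand depends on whether the loop is really a hot spot; the closed form trades a little readability for constant-time work on any compiler and optimization level.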
Thank you for now. This is it for today. And we’ll see you later in our next videos. Bye-bye.