Joseph Sibony
reading time:
Compiler optimization is the process of improving the compiled code to allow it to be more efficient, whether by improving its runtime performance or by minimizing the code size.
In this video (the transcript is presented here below), Amir Kirsh, Incredibuild Dev Advocate, is focusing on some specific and interesting optimizations performed by the C++ compiler.
Transcript: A Peek into the C++ Optimizer
Hi everybody my name is Amir Kirsh and I’m a dev advocate at Incredibuild.
Today I want to talk about compiler optimization.
A peek into the C++ Compiler
How the c++ optimizer can optimize our code?
Let’s just dive into a piece of code. I will share with you my Coliru online compiler.
Multi-threading without Locking
In this example I have here a multi-threaded program and there is a reason why I show you a multi-threaded program, I will explain shortly. We have here a function called “adder”, which adds number into the global variable sum. The global variable sum is being approached here in this program from several threads in parallel, without locking, which is not good. We know that. We have here a race condition, and the result can be anything. But the idea is to check what is the actual result. So, sum is a global variable, as said, and “adder” runs a loop and adds some multiplication of the index “i” and some x into sum. And in the main, we launch 10 threads. Each thread runs “adder”, and all the threads, all the 10 threads, add numbers into the same global variable sum.
So, if we want to check what would be the results, probably if we run it only once, it might be that we’ll get the expected result, which we can calculate and see what happens if there is a lock. But if we run the program more than once, we will see that the results are indeed not always the same. So what I did here is to run a small bash script. Here it is, I run the program 500 times, and then I sort the results and count the unique results, which eventually what I have is number of times which I got each certain output. So we can see that 490 times out of the 500, I get some result, which is the expected one, which is the right calculation, if I do have a lock. But we do have 10 other results, which are something else. That’s nice.
And it did show that we have a reason we need here either to use some atomic integer or to use a lock. But my question here is how come we got all the results as a multiplication of 50. How come that we don’t have other results like 267355, 267384 or many other results that can come here. And the answer is compiler optimization; something being done here by the optimizer turns the result to be around the multiplication of 10, or multiplication of 50 in the result. And the thing that is being done by the optimizer is loop unrolling.
Loop unrolling
The optimizer sees this loop, and then it says, well, I’m not sure that you actually need a loop here. Maybe I can calculate the entire thing without a loop. So the function adder becomes just a single calculation, and the loop is gone, which is called loop unrolling.
Once the loop is gone, then the competition between the threads, the race condition, is on the entire calculation, which is being calculated by the optimizer. What would be the sum inside the loop, which is in multiplication of 50 or something like that. Now, can we check this theory? Well, we can; let’s try to change the optimization flag from -O2 to -O0 and compile again. So I’m changing here, the optimization flag to -O0 compiling again. And we can see that the results are now much more spread across all sorts of numbers because there is no loop unrolling now. And the loop actually is in the code. So we actually go into the loop, and the race condition now is on each addition inside the loop and not on a single calculation. And we do have most of the times the same number that we got to before, but the others are quite more spread.
Assembly
If we go to see the assembly code, we can use Compiler Explorer for that, another online tool. And we can see that -O0 and -O2 on the same code results with different assembly code. With a -O0, the “adder” function that we see here, do have a loop. We see that we go to the code that has some calculation and then raise a comparison here this line. And if the comparison is greater than a certain number, then you continue, otherwise you jump back to the loop.
So this is the loop that we see and in the code that uses the -O2 optimization the assembly code, the assembly commands inside the “adder” functional are just the calculation, one line of calculation or two lines, two commands of calculations. And that’s it. Without any loop inside, this is the loop unrolling that we see here.
Final note
So we saw today the optimization done by the C++ compiler called loop unrolling. And we saw how it affects, in our example, a multi-threaded application. It didn’t save us from a race condition because the race condition is in the “adder” function anyhow, but the race condition behaved a bit differently then if we do not have a loop unrolling. This is for today; we will continue on in the next talks to see what the optimizer can do for us. So thank you for listening. Thank you for being with us today. See you.
Table of Contents
Shorten your builds
Incredibuild empowers your teams to be productive and focus on innovating.