After diving into the different loop unrolling scenarios in our previous video, Amir Kirsh walks us through another kind of loop optimization – Single Instruction, Multiple Data (SIMD) – and shows how in some cases a longer assembly is actually better, as well as shows some examples in Clang and GCC.
Transcript: A Deeper Dive into the C++ Compiler Optimization
My name is Amir Kirsh and I’m a dev advocate at Incredibuild.
Today we are going to continue our video series on C++ compiler optimization. So let’s take a look at a piece of code. I will share my screen and I hope you can see the loop on this slide.
The example on this slide comes from a talk at CppCon 2016 by Timur Doumler, and the question here is, how can we make this loop more efficient? Now, the current loop that you see on this slide cannot be optimized. The optimizer takes a look at this loop. The optimizer hesitates. Maybe it can make it more efficient, but it cannot, at least for now. And the question is, can we change the code so it will react the same, it will behave the same, which will result in the same output with the same outcome, but allow the optimizer to do something with this loop, which could make it much more efficient.
Don’t take a look at the slide that you see inside my slide from Timur’s talk, because Timur’s talk covers a lot of things. And here this problem does not exactly relate to instructions caching or to memory caching. It relates to something else and it doesn’t relate also to loop unrolling, which we dealt with in previous talks. It is another kind of loop optimization. So, can you see something that can be done with this loop that will allow the optimizer to make it more efficient?
SIMD – Single Instruction Multiple Data
And the answer relates to a term called SIMD. SIMD Is the abbreviation for Single Instruction Multiple Data, which means if it is possible to take a single instruction and run it on multiple data, we can have more efficient code at runtime and the optimizer is able to recognize code that might become more efficient with SIMD and compile it to machine code that will use SIMD.
Now, in our original loop, there is a problem. SIMD, or the ability to use SIMD, requires a loop to be able to take several iterations and perform several iterations at the same time. So, for example, if I can perform iteration for index i equals zero and i equals one and two and three together, on the calculations that I’m going to perform inside the loop, so doing all the calculations at the same time for indices zero to three – this is SIMD.
This requires that there are no dependencies between the calculations or the operations that we do in the different indices. So, if the calculation for i equals zero does not affect the calculation for i equals one, we can, in a way, put the data that relates to i equals zero and put the data that relates to i equals one in a certain position in memory, and then do the calculation on both, with a single instruction. This is SIMD. SIMD, by the way, is a single tool that is part of a broader optimization called loop vectorization. Loop vectorization is the ability to take a look at a loop and say, okay, I can break the loop to parallel computation and SIMD is one of the tools to do this parallel computation for a loop.
So, the problem with the current loop that we see here is that the indices or the loop iterations are dependent on each other. We calculate something on an index i which is based on b on Index i, but we also calculate something or change b of index i+1. This means that if we will try to do together at the same time in parallel indices i zero to three or zero to eight, we will touch at the same time, b at index i and b at index i+1 which would not allow us to do this in parallel. So the problem here is that a certain iteration is dependent on a previous one, or the next one is dependent on the current one. The right way, if we can, is to break this loop or change the way the loop is written, so we would break the dependency.
Can we do that? Well, let’s try. So in order to try that, let’s move to compiler explorer so we can play with the code. So, I’m sharing with you now my compiler. You would see it in a moment. Here it is. So, you see here the original code, this is the code. We will not look at the assembly right now, I’m putting this aside, we will take a look at it in a moment. What we want to do here is to ask, okay, we see the loop here and we do have dependency between a certain iteration and the previous iteration, so we want to break the dependency. So, we would not use index “i” and index “i+1” in the same iteration, can we do that? Well, I think that we can.
Let’s do that by copying the same loop, so we can make sure later that both loops do the same thing. So, in order to make sure that both loops, the one that we see here and the next one that I’m going to write, actually do the same thing. I think that the best thing to do is to copy the loop and then check that we actually get the same result.
OK, so this is the first loop. And this is the second one. And we would act on a1 which would be equal to a. And then we would have on the second loop, we would preserve the arrays in order to work with the proper arrays without any change so we have here 3 arrays, and in the first loop, we work on this copy, which is “a1”, “b1”, “c1”. This is the original code. But now with “a1”, “b1”, “c1”, and now we can make the other loop work on “a2”, “b2”, “c2”. So, in this case, we can just make sure that we didn’t break anything, that the actual code behaves the same and we will do that by assert. So let’s add an assert that a1 equals “a2” and the same for we actually do not change “c”, so there is no need to check for c, we just have to check that “a1” equals “a2” and “b1” equals “b2” and for that, we need to add include for assert. So let’s do that. OK. So, I think the current assert should pass. Let’s take a look and make sure that currently… well it’s the same loop, we didn’t change anything yet. So, let’s take a look at the output, and the output is fine, we passed the assert. Just to make sure let’s mutate the assert, make sure that the assert checks something, so, let’s assert that a1 doesn’t equal “a2” and we see that… We break. We abort. Yeah. The assert actually checks something. So, we get back to asserting “a1” equals “a2” and now we can go back to… Okay, if we want to change the loop to make it more efficient. So, let’s do that. More efficient means that, yeah, again, in order to allow the optimizer to take the loop and say, you know what, I can do several iterations at the same time using SIMD. Single instruction, multiple data. But in order to achieve that, I need to break the dependency.
So I think that in order to break the dependency, what we can do is to start with “a” at index zero, would take, add, “b” at index zero, and now, since we already started with doing the call on a. By the way, that probably why the assert now breaks. I think the assert now breaks. Let’s take a look. Hmm. It doesn’t break because… Oh, it’s not on “a”. I have to do that on “a2”. It’s good that we have an assert. We can check things. Okay, now the assert breaks because we are adding to “a2” twice index zero of “b2”. No, we need to change the actual loop inside.
So inside the loop, I will run from index 1 and I will end at index 999, so I can take off this last calculation, and inside, I will just change the order. So first, take “b2” at index I and add to it “c2” at index “i” minus one, then take “a2” at index “i” and add to it “b2” at index “i”. Now, I think, it seems to me that this is the same logic as the loop above. But again, we have an assert we can check. So let’s check, let’s take a look at the assert and the assert passes. Well fine, we have the same result. Now, the question is, okay, we have the same result, is it more efficient? Well, we started in order to… We did the change in order to make things more efficient. Let’s take a look at the assembly.
Compiling in Clang
So I’m in compiler explorer. I am compiling it in Clang with -O3 and we can see that the first loop, the first loop is we see it. Let’s take a look. Okay. The first loop is here. Okay, well, here, this is the loop. The loop has a jump based on checking whether we reached 998. And this is the loop. This is an ordinary loop. If we take a look at the other loop, the one that is underneath, we would see some kind of a different assembly, by the way, a longer one.
You know, that longer assembly doesn’t mean less efficient code. In some cases, like, for example, this case, longer assembly is better because it is longer, but it achieves better performance because it’s doing things more efficiently. What we see here is moving a bulk of data in order to perform bulk calculation by a single instruction.
So we would not go through the entire assembly here, but I can show you that when we compare the index, we stop at 992. Why? Because at 992, we already have the next… eight calculations, eight iterations as a bulk. and this is what happens here. We actually have SIMD. We actually have probably a more efficient implementation. So the question here is, is it actually more efficient? Let’s check. How can we check that? Well, the best way to check that is to run a benchmark. So in order to run benchmark, what we would do is we would use another online tool, which is called Quick C++ bench. Let’s do that.
Quick C++ bench
And I prepared already the benchmark with the two loops. This is the first loop with which has a dependency. And this is the second loop where the dependency is gone. It is written here a bit differently from the way I wrote it in the other… in Compiler Explorer, here I run to 998 and work on “i+1”. But it’s the same. And we can see in the benchmark that the code that allows SIMD allows it in the way that does not prevent in the way that there isn’t any dependency between iterations inside the loop is much faster. It’s five times faster.
And this is something that the optimizer couldn’t achieve with something which is very similar, but not the same, which means that we have to think about the code and in this loop, we can achieve five times faster run time by rewriting the loop. To allow SIMD. This is optimization because, without optimization, let’s take a look without optimization, same code, same code, exact same code. But now I run it without any optimization. Flag -O0 and of course, both would react the same because it would be the same assembly. By the way, we can take a look at the assembly here in Quick Bench, and the assembly for the two would be the same. Both would be without SIMD, both would be without any optimization. So it is two things. One, having optimization and the other one allowing the optimizer to do the optimization by preventing any dependency between loop iterations and since we do not have any dependency in the loop iterations, the optimizer can go for SIMD. Which is in this case, five times faster.
By the way, if we are already in Quick bench, let’s perform some other quick benchmarks.
Like, for example, which is faster? Oh, this this one we saw… Which is faster: emplace_back or push_back?
So, we have here an example. It’s not always that emplace_back is better than push_back, but in this example where we had a string built out of three characters or built inside the vector, emplace is better, and we see it in the benchmark. Let’s do another benchmark. Just out of curiosity, if we are talking about benchmarks like, for example, is it better to run on rows or columns? Suppose that we want to sum a matrix. Is it better to sum the values running over the rows or running over the columns? So probably you heard about cache locality. We already have the raw in memory. So it is better to run over rows. And, yeah, we can see it. It is two times faster.
Back to SIMD
So we talked a bit about benchmarks. Let’s get back to SIMD. We saw that SIMD allows us to have much faster run time. In order to achieve that we have to break the dependency inside the loop. In many cases we can do that. And once there isn’t any dependency between iterations the optimizer may try to create, to use, SIMD during compilation. If we ask ourselves, did the compiler use SIMD or not, we can go back. Well, let’s do that.
We can go back to Compiler Explorer and take a look at what we saw there. So as we saw, the first loop doesn’t use SIMD. We can see, it stops at 998 and it just does each iteration at a time. And the other one does use SIMD, and we can even ask Clang to show us whether the optimizer was able to use some kind of optimization or not. And to do that, we would go to a tool called optimization output.
So, I’m in Compiler Explorer. If you’re in a shell prompt, you could do that directly using Clang, and I have here the window with the code and output from the optimizer. And I can see here many interesting outputs from the optimizer and here in the loop, the optimizer says that the loop was not vectorized. Why it was not vectorized, you can get some more information about why. In some cases, it will give you some more information. In this loop, it is vectorized, and you can see that you get the green color here. There is another problem here there is red here because it says that it is… it loads some info and couldn’t use something from the cache, but the loop itself is being optimized by vectorizing the loop. So, we could see here from the optimizer output, what was done and what the optimizer tried, or thought about doing, but couldn’t do.
The last thing is maybe to ask, is it something that works only on Clang, or maybe GCC has the same behavior, So let’s… Or does it relate to -O3 or to -O2? So, if I go to Clang and in Clang change it to -O2… the same result. It is said that -O2 and -O3 in Clang are very similar. They are quite similar, but not the same. There are optimizations that are in -O3 and not in -O2, but in this case, SIMD in Clang is used both in -O2 and -O3. You can see that, I’m here in -O2 with Clang and we got the same optimization on the second loop, which allowed SIMD. On the other hand, if I go to GCC, let’s go to GCC, I’m here with GCC say with -O2 GCC doesn’t have the optimized output. So I don’t get any information on this window. But here I can take a look at the assembly and I can see that the first loop doesn’t have SIMD but also the second loop.
So no SIMD for you, sir, but I’ll go and change this in GCC to -O3 then… GCC is giving me the SIMD for the other loop. So, both GCC and Clang may allow the use of SIMD by vectorizing the loop using SIMD in order to optimize the code in GCC, this required here -O3, in Clang -O2 was enough. But both have this option.
This is it for today. Thank you for being with us and we will talk a bit more about optimization in our next video. Bye-bye.