Tutorial Video - A Deeper Dive into the C++ Compiler Optimization

After diving into the different loop unrolling scenarios in our previous video, Amir Kirsh walks us through another kind of loop optimization – Single Instruction, Multiple Data (SIMD) – and shows how in some cases a longer assembly is actually better, as well as shows some examples in Clang and GCC.

Transcript: A Deeper Dive into the C++ Compiler Optimization

Hi, everybody.

My name is Amir Kirsh and I’m a dev advocate at Incredibuild.

Today we are going to continue our video series on C++ compiler optimization. So let’s take a look at a piece of code. I will share my screen and I hope you can see the loop on this slide.

The example on this slide comes from a talk at CppCon 2016 by Timur Doumler, and the question here is, how can we make this loop more efficient? Now, the current loop that you see on this slide cannot be optimized. The optimizer takes a look at this loop. The optimizer hesitates. Maybe it can make it more efficient, but it cannot, at least for now. And the question is, can we change the code so it will react the same, it will behave the same, which will result in the same output with the same outcome, but allow the optimizer to do something with this loop, which could make it much more efficient.

Don’t take a look at the slide that you see inside my slide from Timur’s talk, because Timur’s talk covers a lot of things. And here this problem does not exactly relate to instructions caching or to memory caching. It relates to something else and it doesn’t relate also to loop unrolling, which we dealt with in previous talks. It is another kind of loop optimization. So, can you see something that can be done with this loop that will allow the optimizer to make it more efficient?

SIMD – Single Instruction Multiple Data

And the answer relates to a term called SIMD. SIMD Is the abbreviation for Single Instruction Multiple Data, which means if it is possible to take a single instruction and run it on multiple data, we can have more efficient code at runtime and the optimizer is able to recognize code that might become more efficient with SIMD and compile it to machine code that will use SIMD.

Now, in our original loop, there is a problem. SIMD, or the ability to use SIMD, requires a loop to be able to take several iterations and perform several iterations at the same time. So, for example, if I can perform iteration for index i equals zero and i equals one and two and three together, on the calculations that I’m going to perform inside the loop, so doing all the calculations at the same time for indices zero to three – this is SIMD.

This requires that there are no dependencies between the calculations or the operations that we do in the different indices. So, if the calculation for i equals zero does not affect the calculation for i equals one, we can, in a way, put the data that relates to i equals zero and put the data that relates to i equals one in a certain position in memory, and then do the calculation on both, with a single instruction. This is SIMD. SIMD, by the way, is a single tool that is part of a broader optimization called loop vectorization. Loop vectorization is the ability to take a look at a loop and say, okay, I can break the loop to parallel computation and SIMD is one of the tools to do this parallel computation for a loop.

So, the problem with the current loop that we see here is that the indices or the loop iterations are dependent on each other. We calculate something on an index i which is based on b on Index i, but we also calculate something or change b of index i+1. This means that if we will try to do together at the same time in parallel indices i zero to three or zero to eight, we will touch at the same time, b at index i and b at index i+1 which would not allow us to do this in parallel. So the problem here is that a certain iteration is dependent on a previous one, or the next one is dependent on the current one. The right way, if we can, is to break this loop or change the way the loop is written, so we would break the dependency.

Compiler Explorer

Can we do that? Well, let’s try. So in order to try that, let’s move to compiler explorer so we can play with the code. So, I’m sharing with you now my compiler. You would see it in a moment. Here it is. So, you see here the original code, this is the code. We will not look at the assembly right now, I’m putting this aside, we will take a look at it in a moment. What we want to do here is to ask, okay, we see the loop here and we do have dependency between a certain iteration and the previous iteration, so we want to break the dependency. So, we would not use index “i” and index “i+1” in the same iteration, can we do that? Well, I think that we can.

Let’s do that by copying the same loop, so we can make sure later that both loops do the same thing. So, in order to make sure that both loops, the one that we see here and the next one that I’m going to write, actually do the same thing. I think that the best thing to do is to copy the loop and then check that we actually get the same result.

OK, so this is the first loop. And this is the second one. And we would act on a1 which would be equal to a. And then we would have on the second loop, we would preserve the arrays in order to work with the proper arrays without any change so we have here 3 arrays, and in the first loop, we work on this copy, which is “a1”, “b1”, “c1”. This is the original code. But now with “a1”, “b1”, “c1”, and now we can make the other loop work on “a2”, “b2”, “c2”. So, in this case, we can just make sure that we didn’t break anything, that the actual code behaves the same and we will do that by assert. So let’s add an assert that a1 equals “a2” and the same for we actually do not change “c”, so there is no need to check for c, we just have to check that “a1” equals “a2” and “b1” equals “b2” and for that, we need to add include for assert. So let’s do that. OK. So, I think the current assert should pass. Let’s take a look and make sure that currently… well it’s the same loop, we didn’t change anything yet. So, let’s take a look at the output, and the output is fine, we passed the assert. Just to make sure let’s mutate the assert, make sure that the assert checks something, so, let’s assert that a1 doesn’t equal “a2” and we see that… We break. We abort. Yeah. The assert actually checks something. So, we get back to asserting “a1” equals “a2” and now we can go back to… Okay, if we want to change the loop to make it more efficient. So, let’s do that. More efficient means that, yeah, again, in order to allow the optimizer to take the loop and say, you know what, I can do several iterations at the same time using SIMD. Single instruction, multiple data. But in order to achieve that, I need to break the dependency.

So I think that in order to break the dependency, what we can do is to start with “a” at index zero, would take, add, “b” at index zero, and now, since we already started with doing the call on a. By the way, that probably why the assert now breaks. I think the assert now breaks. Let’s take a look. Hmm. It doesn’t break because… Oh, it’s not on “a”. I have to do that on “a2”. It’s good that we have an assert. We can check things. Okay, now the assert breaks because we are adding to “a2” twice index zero of “b2”. No, we need to change the actual loop inside.

So inside the loop, I will run from index 1 and I will end at index 999, so I can take off this last calculation, and inside, I will just change the order. So first, take “b2” at index I and add to it “c2” at index “i” minus one, then take “a2” at index “i” and add to it “b2” at index “i”. Now, I think, it seems to me that this is the same logic as the loop above. But again, we have an assert we can check. So let’s check, let’s take a look at the assert and the assert passes. Well fine, we have the same result. Now, the question is, okay, we have the same result, is it more efficient? Well, we started in order to… We did the change in order to make things more efficient. Let’s take a look at the assembly.

Compiling in Clang

So I’m in compiler explorer. I am compiling it in Clang with -O3 and we can see that the first loop, the first loop is we see it. Let’s take a look. Okay. The first loop is here. Okay, well, here, this is the loop. The loop has a jump based on checking whether we reached 998. And this is the loop. This is an ordinary loop. If we take a look at the other loop, the one that is underneath, we would see some kind of a different assembly, by the way, a longer one.

You know, that longer assembly doesn’t mean less efficient code. In some cases, like, for example, this case, longer assembly is better because it is longer, but it achieves better performance because it’s doing things more efficiently. What we see here is moving a bulk of data in order to perform bulk calculation by a single instruction.

So we would not go through the entire assembly here, but I can show you that when we compare the index, we stop at 992. Why? Because at 992, we already have the next… eight calculations, eight iterations as a bulk. and this is what happens here. We actually have SIMD. We actually have probably a more efficient implementation. So the question here is, is it actually more efficient? Let’s check. How can we check that? Well, the best way to check that is to run a benchmark. So in order to run benchmark, what we would do is we would use another online tool, which is called Quick C++ bench. Let’s do that.

Quick C++ bench

And I prepared already the benchmark with the two loops. This is the first loop with which has a dependency. And this is the second loop where the dependency is gone. It is written here a bit differently from the way I wrote it in the other… in Compiler Explorer, here I run to 998 and work on “i+1”. But it’s the same. And we can see in the benchmark that the code that allows SIMD allows it in the way that does not prevent in the way that there isn’t any dependency between iterations inside the loop is much faster. It’s five times faster.

And this is something that the optimizer couldn’t achieve with something which is very similar, but not the same, which means that we have to think about the code and in this loop, we can achieve five times faster run time by rewriting the loop. To allow SIMD. This is optimization because, without optimization, let’s take a look without optimization, same code, same code, exact same code. But now I run it without any optimization. Flag -O0 and of course, both would react the same because it would be the same assembly. By the way, we can take a look at the assembly here in Quick Bench, and the assembly for the two would be the same. Both would be without SIMD, both would be without any optimization. So it is two things. One, having optimization and the other one allowing the optimizer to do the optimization by preventing any dependency between loop iterations and since we do not have any dependency in the loop iterations, the optimizer can go for SIMD. Which is in this case, five times faster.

Quick benchmarks

By the way, if we are already in Quick bench, let’s perform some other quick benchmarks.

Like, for example, which is faster? Oh, this this one we saw… Which is faster: emplace_back or push_back?

So, we have here an example. It’s not always that emplace_back is better than push_back, but in this example where we had a string built out of three characters or built inside the vector, emplace is better, and we see it in the benchmark. Let’s do another benchmark. Just out of curiosity, if we are talking about benchmarks like, for example, is it better to run on rows or columns? Suppose that we want to sum a matrix. Is it better to sum the values running over the rows or running over the columns? So probably you heard about cache locality. We already have the raw in memory. So it is better to run over rows. And, yeah, we can see it. It is two times faster.

Back to SIMD

So we talked a bit about benchmarks. Let’s get back to SIMD. We saw that SIMD allows us to have much faster run time. In order to achieve that we have to break the dependency inside the loop. In many cases we can do that. And once there isn’t any dependency between iterations the optimizer may try to create, to use, SIMD during compilation. If we ask ourselves, did the compiler use SIMD or not, we can go back. Well, let’s do that.

We can go back to Compiler Explorer and take a look at what we saw there. So as we saw, the first loop doesn’t use SIMD. We can see, it stops at 998 and it just does each iteration at a time. And the other one does use SIMD, and we can even ask Clang to show us whether the optimizer was able to use some kind of optimization or not. And to do that, we would go to a tool called optimization output.

Optimization output

So, I’m in Compiler Explorer. If you’re in a shell prompt, you could do that directly using Clang, and I have here the window with the code and output from the optimizer. And I can see here many interesting outputs from the optimizer and here in the loop, the optimizer says that the loop was not vectorized. Why it was not vectorized, you can get some more information about why. In some cases, it will give you some more information. In this loop, it is vectorized, and you can see that you get the green color here. There is another problem here there is red here because it says that it is… it loads some info and couldn’t use something from the cache, but the loop itself is being optimized by vectorizing the loop. So, we could see here from the optimizer output, what was done and what the optimizer tried, or thought about doing, but couldn’t do.

The last thing is maybe to ask, is it something that works only on Clang, or maybe GCC has the same behavior, So let’s… Or does it relate to -O3 or to -O2? So, if I go to Clang and in Clang change it to -O2… the same result. It is said that -O2 and -O3 in Clang are very similar. They are quite similar, but not the same. There are optimizations that are in -O3 and not in -O2, but in this case, SIMD in Clang is used both in -O2 and -O3. You can see that, I’m here in -O2 with Clang and we got the same optimization on the second loop, which allowed SIMD. On the other hand, if I go to GCC, let’s go to GCC, I’m here with GCC say with -O2 GCC doesn’t have the optimized output. So I don’t get any information on this window. But here I can take a look at the assembly and I can see that the first loop doesn’t have SIMD but also the second loop.

So no SIMD for you, sir, but I’ll go and change this in GCC to -O3 then… GCC is giving me the SIMD for the other loop. So, both GCC and Clang may allow the use of SIMD by vectorizing the loop using SIMD in order to optimize the code in GCC, this required here -O3, in Clang -O2 was enough. But both have this option.

This is it for today. Thank you for being with us and we will talk a bit more about optimization in our next video. Bye-bye.

Incredibuild Team

Incredibuild is a software platform that speeds up the entire development cycle. We do that by accelerating your compilations, giving you real time visibility of your builds, and how to get the most out of your assets, all right out of the box, without needing to change code or the tool stack.

Cookie	Duration	Description
ARRAffinity	session	ARRAffinity cookie is set by Azure app service, and allows the service to choose the right instance established by a user to deliver subsequent requests made by that user.
ARRAffinitySameSite	session	This cookie is set by Windows Azure cloud, and is used for load balancing to make sure the visitor page requests are routed to the same server in any browsing session.
cf_use_ob	past	Cloudflare sets this cookie to improve page load times and to disallow any security restrictions based on the visitor's IP address.
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
_gat	1 minute	This cookie is installed by Google Universal Analytics to restrain request rate and thus limit the collection of data on high traffic sites.
_uetsid	1 day	Bing Ads sets this cookie to engage with a user that has previously visited the website.
_uetvid	1 year 24 days	Bing Ads sets this cookie to engage with a user that has previously visited the website.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_UA-8508435-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
_hjAbsoluteSessionInProgress	30 minutes	Hotjar sets this cookie to detect the first pageview session of a user. This is a True/False flag set by the cookie.
_hjFirstSeen	30 minutes	Hotjar sets this cookie to identify a new user’s first session. It stores a true/false value, indicating whether it was the first time Hotjar saw this user.
_hjIncludedInPageviewSample	2 minutes	Hotjar sets this cookie to know whether a user is included in the data sampling defined by the site's pageview limit.
_hjIncludedInSessionSample	2 minutes	Hotjar sets this cookie to know whether a user is included in the data sampling defined by the site's daily session limit.
_hjTLDTest	session	To determine the most generic cookie path that has to be used instead of the page hostname, Hotjar sets the _hjTLDTest cookie to store different URL substring alternatives until it fails.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
MR	7 days	This cookie, set by Bing, is used to collect user information for analytics purposes.
utm_campaign	2 months	Google Ad Services sets this cookie to store session campaign value if present.
utm_content	2 months	This cookie is used for storing the session content value if present.
utm_source	2 months	This cookie is used to record from where the visitor came to the website orginally. This information is used by the website operator to know the efficiency of their marketing.
utm_term	2 months	This cookie is used to record from where the visitor came to the website orginally. This information is used by the website operator to know the efficiency of their marketing.
vuid	2 years	Vimeo installs this cookie to collect tracking information by setting a unique ID to embed videos to the website.

Cookie	Duration	Description
_fbp	3 months	This cookie is set by Facebook to display advertisements when either on Facebook or on a digital platform powered by Facebook advertising, after visiting the website.
_mkto_trk	2 years	This cookie, provided by Marketo, has information (such as a unique user ID) that is used to track the user's site usage. The cookies set by Marketo are readable only by Marketo.
fr	3 months	Facebook sets this cookie to show relevant advertisements to users by tracking user behaviour across the web, on sites that have Facebook pixel or Facebook social plugin.
IDE	1 year 24 days	Google DoubleClick IDE cookies are used to store information about how the user uses the website to present them with relevant ads and according to the user profile.
MUID	1 year 24 days	Bing sets this cookie to recognize unique web browsers visiting Microsoft sites. This cookie is used for advertising, site analytics, and other operations.
personalization_id	2 years	Twitter sets this cookie to integrate and share features for social media and also store information about how the user uses the website, for tracking and targeting.
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
utm_medium	2 months	This cookie is used to record from where the visitor came to the website orginally. This information is used by the website operator to know the efficiency of their marketing.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Tutorial Video – A Deeper Dive into the C++ Compiler Optimization

Transcript: A Deeper Dive into the C++ Compiler Optimization

SIMD – Single Instruction Multiple Data

Compiler Explorer

Compiling in Clang

Quick C++ bench

Quick benchmarks

Back to SIMD

Optimization output

Incredibuild Team

Table of Contents

Shorten your builds

Related Posts

14 minutes 8 Reasons Why You Need Build Observability

14 minutes These 4 advantages of caching are a game-changer for development projects

14 minutes What Level of Build Observability Is Right for You?

Cookie	Duration	Description
_hjSession_2537450	30 minutes	No description
_hjSessionUser_2537450	1 year	No description
AnalyticsSyncHistory	1 month	No description
BIGipServersn-mch-v2-80	session	No description
BIGipServersn02web-nginx-app_https	session	No description
ib_last_referrer	2 months	No description
incap_ses_1319_2167377	session	No description
li_gc	2 years	No description
muc_ads	2 years	No description
nlbi_2167377	session	No description
original_req_url	past	No description
referrer66_00f	1 month	No description
visid_incap_2167377	1 year	No description
visitorId	1 year	No description