16/05/2012 Compiler SSE code generation

During the first years of GLM development, I took an interest in the efficiency of compiler-generated code in order to answer a simple question: is writing SSE code with intrinsics worth the effort? In other words, will it give a performance gain? For a long time the answer was easy: yes, because the compiler did a poor job. Today's compilers do a much better job; however, they are still incapable of vectorizing the code...

One of the benefits of the SSE and AVX instruction sets is that they provide SIMD instructions, which process Multiple Data with a Single Instruction. However, these are only a subset of the SSE and AVX instructions, as the packed (Multiple Data) arithmetic instructions also come with scalar (Single Data) equivalents. It is these Single Data instructions that the compilers are actually generating. The following code sample is the one we will use to generate the ASM code with GCC 4.4, GCC 4.8, ICC 13, VC12, Clang 3.0 and Clang 3.2.
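To make the packed versus scalar distinction concrete, here is a minimal sketch using two SSE intrinsics that exist in both forms. The function names mul_packed, mul_scalar and store4 are hypothetical helpers chosen for this illustration; they are not part of GLM or of the sample used in this post.

```cpp
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_mul_ps, _mm_mul_ss

// Packed (Multiple Data) multiply: all four lanes are multiplied at once.
// Hypothetical helper name, for illustration only.
inline __m128 mul_packed(__m128 a, __m128 b)
{
    return _mm_mul_ps(a, b);
}

// Scalar (Single Data) multiply: only lane 0 is multiplied,
// lanes 1 to 3 of the result are copied from a.
// Hypothetical helper name, for illustration only.
inline __m128 mul_scalar(__m128 a, __m128 b)
{
    return _mm_mul_ss(a, b);
}

// Copy the four lanes of v into out[0..3], lane 0 first.
// Hypothetical helper name, for illustration only.
inline void store4(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}
```

With a = (1, 2, 3, 4) and b = (2, 2, 2, 2), listed lane 0 first, the packed version yields (2, 4, 6, 8) while the scalar version yields (2, 2, 3, 4). A compiler that only emits the scalar form needs four such multiplications to do the work of one packed multiplication.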

Source used to analyse compiler generated ASM
  • #include <emmintrin.h>
  • #include <cstddef>
  • struct float4
  • {
  • float4(){}
  • float4(float const & s) :
  • x(s), y(s), z(s), w(s)
  • {}
  • float4(float const & x, float const & y, float const & z, float const & w) :
  • x(x), y(y), z(z), w(w)
  • {}
  • float operator[](std::size_t i) const
  • {
  • return (&x)[i];
  • }
  • union
  • {
  • struct {float r, g, b, a;};
  • struct {float s, t, p, q;};
  • struct {float x, y, z, w;};
  • __m128 data;
  • };
  • };
  • inline float4 operator*(float4 const & a, float4 const & b)
  • {
  • return float4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
  • }
  • float4 mul_intrinsic(float4 const m[4], float4 const & v)
  • {
  • __m128 v0 = _mm_shuffle_ps(v.data, v.data, _MM_SHUFFLE(0, 0, 0, 0));
  • __m128 v1 = _mm_shuffle_ps(v.data, v.data, _MM_SHUFFLE(1, 1, 1, 1));
  • __m128 v2 = _mm_shuffle_ps(v.data, v.data, _MM_SHUFFLE(2, 2, 2, 2));
  • __m128 v3 = _mm_shuffle_ps(v.data, v.data, _MM_SHUFFLE(3, 3, 3, 3));
  • __m128 m0 = _mm_mul_ps(m[0].data, v0);
  • __m128 m1 = _mm_mul_ps(m[1].data, v1);
  • __m128 a0 = _mm_add_ps(m0, m1);
  • __m128 m2 = _mm_mul_ps(m[2].data, v2);
  • __m128 m3 = _mm_mul_ps(m[3].data, v3);
  • __m128 a1 = _mm_add_ps(m2, m3);
  • __m128 a2 = _mm_add_ps(a0, a1);
  • float4 f;
  • f.data = a2;
  • return f;
  • }
  • float4 mul_cpp(float4 const m[4], float4 const & v)
  • {
  • return float4(
  • m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3],
  • m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3],
  • m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3],
  • m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3]);
  • }
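  • // Component-wise addition, needed by mul_inst_like below
  • inline float4 operator+(float4 const & a, float4 const & b)
  • {
  • return float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
  • }
  • // Same sequence of operations as mul_intrinsic, written in plain C++
  • float4 mul_inst_like(float4 const m[4], float4 const & v)
  • {
  • float4 const Mov0(v[0]);
  • float4 const Mov1(v[1]);
  • float4 const Mul0 = m[0] * Mov0;
  • float4 const Mul1 = m[1] * Mov1;
  • float4 const Add0 = Mul0 + Mul1;
  • float4 const Mov2(v[2]);
  • float4 const Mov3(v[3]);
  • float4 const Mul2 = m[2] * Mov2;
  • float4 const Mul3 = m[3] * Mov3;
  • float4 const Add1 = Mul2 + Mul3;
  • float4 const Add2 = Add0 + Add1;
  • return Add2;
  • }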
  • int main()
  • {
  • return 0;
  • }

Reading the generated assembly, we immediately see that no compiler is capable of generating vectorized instructions.

We can notice some useless mov instructions generated by some compilers. Also, Clang tries to interleave different instructions while ICC regroups identical instructions. The other compilers more or less interleave or regroup identical instructions, but in every case the compiler is capable of massively reordering the instructions, to the point that GCC 4.8 generates exactly the same assembly code for both mul_cpp and mul_inst_like. Yet it is still incapable of vectorizing the code.

It seems to me that being capable of such reordering shows how compiler optimizations have focused on the result and the dependencies of this result. With such a strategy based on ASTs, compilers can remove dead code and useless operations like sequences of mov instructions. However, today's CPU performance is bound more by memory usage: how we maximize cache usage and how we reduce data movement and transfers. Two consequences: there is still a lot of room for compiler optimizations, and hand-writing code with intrinsics remains relevant.

There is some ongoing research aiming to resolve the issue of generating vectorized code. ISPC seems inspired by GPU architectures and it generates C++ source code using SIMD instruction sets on demand. Polly is a compiler optimizer that directly tackles the issue of memory access patterns. Finally, LLVM 3.3 is going to integrate a new optimizer called the SLP Vectorizer.

For GLM, what I would enjoy is figuring out an approach where I could avoid writing intrinsic code but still write my C++ code in such a way that the compiler generates the SSE code I expect. Even if I have to look at the assembly code, such an approach would allow me to have a single implementation of each operation, making it easier to maintain.
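As a rough sketch of what such C++ code might look like, here is the matrix by vector product written as loops over plain arrays. The types vec4f and mat4f and the function mul_loop are hypothetical names for this illustration, not GLM's. Whether a given compiler turns the inner loop into packed SSE instructions depends on the compiler, its version and the optimization flags, so the generated assembly still has to be checked.

```cpp
#include <cstddef>

// Hypothetical plain-array types for illustration, not GLM's.
struct vec4f { float e[4]; };
struct mat4f { vec4f col[4]; }; // column-major: col[c] is column c

// For each column c, accumulate column c scaled by v[c].
// The inner loop has a fixed trip count of four and each iteration
// touches a different lane of r, which is the kind of loop an
// auto-vectorizer could map to one packed multiply and one packed add.
inline vec4f mul_loop(mat4f const & m, vec4f const & v)
{
    vec4f r = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (std::size_t c = 0; c < 4; ++c)
        for (std::size_t i = 0; i < 4; ++i)
            r.e[i] += m.col[c].e[i] * v.e[c];
    return r;
}
```

The appeal of this form is that a single source serves every target: if the compiler vectorizes it we get the SSE code for free, and if it does not the code is still correct.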

So far GLM provides dedicated simdVec4 and simdMat4 classes for SIMD optimizations. David Reid even contributed a SIMD version of GLM quaternions for GLM 0.9.5. It is obvious that using GLM to write very fast code is not a good idea, but this is no reason why GLM shouldn't be as fast as possible. Ideally it should be fast transparently, but for that the compilers will need to do a better job.

Viewing x86 assembly with Xcode

Since Xcode 4.1, we can display the assembly of a file using the menu "Product/Generate Output/Assembly File". However, with Clang the IDE will show the LLVM IR, which might be great for the compiler to use but which I find harder to read than old-fashioned CPU or GPU instructions. Fortunately, we can enable x86 assembly generation using the argument "-no-integrated-as". This argument can be set using the menu "Product/Scheme/Manage Schemes". Conversely, "-integrated-as" explicitly requests LLVM IR.

Online C++ compilers

I discovered a few months ago that many compilers can be used online. This is a very convenient idea that allows us to quickly test code on different compilers. Isn't it nice to be able to use VC12 on Mac OS X? My favourite website is gcc.godbolt.org, which supports many versions of GCC but also Clang 3.0 and ICC 13. A great feature is that this website displays the generated ASM code. For Visual C++ 12 there is the great rise4fun.com/vcpp; however, it doesn't show the ASM code, only the compiler errors.

Copyright Christophe Riccio 2002-2014 all rights reserved