Performance gain using SSE instructions

Tuesday 20 of January, 2009

SSE

SSE (Streaming SIMD Extensions, see http://en.wikipedia.org/wiki/SSE2) is the SIMD instruction set extension of modern x86 Intel and AMD CPUs. Using 128-bit wide registers, 4 floats (or 2 doubles) can be processed with a single instruction.

The Problem

One idea for an efficient light competition calculation is to additively apply a small "stamp" (here: 8x8 float values) onto a large grid, many millions of times.

SSE solution

Using SSE instructions the code might look like:

#include <xmmintrin.h> // SSE intrinsics (_mm_xxx)

// MS Visual Studio decl-spec to force the (necessary) alignment in memory
#define CACHE_LINE  64
#define CACHE_ALIGN __declspec(align(CACHE_LINE))
#define CNT 1000000
float CACHE_ALIGN fData[CNT*4];


int test4()
{
  __m128 empty = _mm_setzero_ps();
  __m128 stamp[8];
  for (int i=0;i<8;++i) 
    stamp[i]=empty;
  stamp[3] = _mm_set1_ps( 3.343f); // any value
  __m128 *p1, *p2, *p3, *p4, *s;
  TTickTack t;
  float *f1=fData;
  __m128 val;
  const int lineoffset=100;
  for (int i=0;i<10000;i++)
    for (int j=0;j<10000;j++) {
      //apply(stamp, j);
      s = stamp;
      f1 = fData + 16*j; // force different starting positions
      val = _mm_load_ps(f1); // load 128bit (4x float)
      val = _mm_add_ps(val, *s++); // add 4 float values (of the stamp)
      _mm_store_ps(f1, val); // store the result back to memory
      f1+=4; // advance grid-pointer
      val = _mm_load_ps(f1);
      val = _mm_add_ps(val, *s++);
      _mm_store_ps(f1, val);
      f1+=lineoffset; // go to next line
      val = _mm_load_ps(f1);
      val = _mm_add_ps(val, *s++);
      _mm_store_ps(f1, val);
      f1+=4;
      val = _mm_load_ps(f1);
      val = _mm_add_ps(val, *s++);
      _mm_store_ps(f1, val);
      f1+=lineoffset;
      val = _mm_load_ps(f1);
      val = _mm_add_ps(val, *s++);
      _mm_store_ps(f1, val);
      f1+=4;
      val = _mm_load_ps(f1);
      val = _mm_add_ps(val, *s++);
      _mm_store_ps(f1, val);
      f1+=lineoffset;
      val = _mm_load_ps(f1);
      val = _mm_add_ps(val, *s++);
      _mm_store_ps(f1, val);
      f1+=4;
      val = _mm_load_ps(f1);
      val = _mm_add_ps(val, *s++);
      _mm_store_ps(f1, val);
      f1+=lineoffset;

    }

    std::cout << "applied stamp (SSE): " << t.elapsedStr() << "\n";
    t.reset();
    float sum=0.f;
    for (int j=0;j<10000;j++)
      sum+=fData[j]; 

    std::cout << "sum: " << sum << "\n";
    return 0;
}

Important to note: the data structures used must be aligned in memory (to at least 16 bytes; 64-byte alignment, i.e. a full cache line, seems preferable for faster cache access). The "_mm_xxx()" functions are translated into single assembler instructions, as the disassembly shows:

      __m128 val = _mm_load_ps(f1);
00401735  movaps      xmm1,xmmword ptr [eax-4E0h] 
      val = _mm_add_ps(val, *s++);
0040173C  addps       xmm1,xmm0 
      _mm_store_ps(f1, val);
0040173F  movaps      xmmword ptr [eax-4E0h],xmm1 

Without using SSE

An algorithm that uses no special instructions could look as follows:

int test5()
{
  float stamp[64];
  for (int i=0;i<64;++i)
    stamp[i]=0.f;
  stamp[20]=3.343f; // some value
  TTickTack t;
  float *f1, *s;
  int ii;
  const int lineoffset=100;
  for (int i=0;i<10000;i++)
    for (int j=0;j<10000;j++) {
      //apply(stamp, j);
      s = stamp;
      f1 = fData + 16*j; // force different starting positions
      for (ii=0;ii<8;++ii) {
        *f1++ += *s++;
        *f1++ += *s++;
        *f1++ += *s++;
        *f1++ += *s++;
        *f1++ += *s++;
        *f1++ += *s++;
        *f1++ += *s++;
        *f1++ += *s++;
        f1+=lineoffset; // go to next line
      }
    }
  std::cout << "applied stamp (no SSE): " << t.elapsedStr() << "\n";
  t.reset();
  return 0;
}

As it stands, this code applies an 8x8 stamp (64 float values) 10000x10000 times (i.e. 100,000,000 times). It is somewhat optimized (pointer arithmetic instead of indexing, manual unrolling of the innermost loop, and so on).

Results

The code is executed on a T2300 Intel Core Duo processor. The results are:

Without SSE    With SSE
8.6s           1.6s (ca. 18%)

This is quite impressive, and considering the 100,000,000 repetitions, both versions are remarkably fast. In reality, however, this example is somewhat artificial: for one thing, the "stamp" stays the same for all calculations (and would therefore remain sitting in the cache); for another, there is no bounds checking (which is needed when stamping along the borders of the big grid). I also tried a kind of standard implementation with a bunch of nested loops, with the innermost loops looking like this:

  f1 = fData + 16*j; // start somewhere 
  for (x=0;x<8;x++) {
    for (y=0;y<8;y++) {
       f1[y*gridwidth+x] += stamp[y*stampwidth+x];
    }
  }

Surprisingly, this version performs very similarly to the version above (without SSE); at least the Microsoft compiler seems to do a very good job of optimizing the code. To confirm this, I also did a run with the compiler in debug mode. The results are then:

no SSE, nested loops    no SSE, pointer optimized    SSE
77s                     53s                          16s
100%                    68%                          21%

In general, using SSE instructions together with careful memory alignment and cache-aware programming offers a huge potential for performance improvement.


Permalink: http://iland-model.org/blogpost2-Performance-gain-using-SSE-instructions