Saturday, March 8, 2014

Playing With C …

Most of the time, when you're interested in C programming, you're also interested in performances. Doing homemade micro-benchmarks is, in itself, an interesting activity even if it's not productive at all …

Recently, after reading an article about pipeline and optimization (Playing with the CPU pipeline), I decided to test that by myself. This was also the occasion to (re)discover classical math functions.

Computing sinus

The easiest way to compute a trigonometric function is to use the traditional Taylor decomposition, there's better way, but we're not here to provide the best sinus function, we just want to play around …

I'll use the same approximation as my starting articles (since I want to test the code in it … ) The first naive version uses a homemade power and factorial functions.

uint64_t fact(uint64_t n) {
  uint64_t      r = 1;
  for (; n; --n) r *= n;
  return r;
}

double qpower(double x, int p) {
  return p == 1 ? x : qpower(x * x, p >> 1) * (p % 2 ? x : 1);
}

// Taylor version
double sin_taylor(double x) {
  double        r = x;
  int           s = -1;
  for (int i = 3; i < 16; i += 2, s *= -1)
    r += s * qpower(x, i) / (double)fact(i);
  return r;
}

I hope you know how a quick exponentiation works (it has a O(log p) complexity).

The main issue here is that we're computing several time constant values using a function. So, the next step is to pre-compute the 8 values (I'll use the approximation provides in the aforementioned article that are closer to minimax approximation, which yield more accurate results) and put them in an array (for easier access.)

static const double Coef[] = {
  +1.0,
  -1.666666666666580809419428987894207e-1,
  +8.333333333262716094425037738346873e-3,
  -1.984126982005911439283646346964929e-4,
  +2.755731607338689220657382272783309e-6,
  -2.505185130214293595900283001271652e-8,
  +1.604729591825977403374012010065495e-10,
  -7.364589573262279913270651228486670e-13,
};

Now we can remove the call to fact (and the division by the way):

double sin_taylor2(double x) {
  double        r = x;
  for (int i = 3; i < 16; i += 2)
    r += qpower(x, i) * Coef[i >> 1];
  return r;
}

Next optimization deals with the exponentiation: the goal is to rearrange and factorize in order to have less multiplication and try take advantage of the pipeline and out-of-order parallelism. I chose this version (spoil: it will give us the best execution time) upon all variation presented in the original paper:

double sin_v6(double x) {
  double        x2 = x * x;
  double        x4 = x2 * x2;
  double        x8 = x4 * x4;
  double        A = Coef[0] + x2 * (Coef[1] + x2 * (Coef[2] + x2 * Coef[3]));
  double        B = Coef[4] + x2 * (Coef[5] + x2 * (Coef[6] + x2 * Coef[7]));
  return x * (A + x8 * B);
}

Is it faster ? It's time to test !

Measuring Performances ?

How can we measure correctly execution time ? We don't have good solutions, the only way is to find an accurate clock (probably provided by the system) take its value before and after calling our function and computing the difference.

I choose the POSIX clock_gettime(2) function with CLOCK_MONOTONIC which seems suited for our job. This will look like that:

  struct timespec       c0, c1;
  clock_gettime(CLOCK_MONOTONIC, &c0);
  // Do your dirty work here
  clock_gettime(CLOCK_MONOTONIC, &c1);

OK, that's how we do usually. To compute the difference between clocks, we have to take a look at the timespec structure since operations like timersub(3) operate on timeval structure not timespec (timeval provides microseconds while timespec provides nanoseconds … )

According to the man-page, timespec contains (at least) two fields, one with the seconds, the other with the nanoseconds. The simplest way to compute difference is to convert timespec into double float:

double spec_to_double(struct timespec *ts) {
  return ts->tv_sec + 1e-9 * ts->tv_nsec;
}

I prefer writing a function, and then tag it as static inline rather than directly doing computation in the code, it's much more readable and the compiler is able to optimize this correctly. Never be afraid of functions, they are better documentation than any comments.

OK, I'm now able to measure execution time, normally with a precision of a nanosecond. I don't really believe it: what is the cost of calling clock_gettime(2) ? According to its manual section, its a system call. And, is my function slow enough to be measured ?

I got several answers to that:

  • Take some measures ! We'll just measure time between two consecutive call to clock_gettime(2)
  • Don't run your function only once ! We'll call our function about thousands or more time, to obtain some average …
  • Run your test several time, if the result is stable, it may have some meaning.
So, I wrote a small piece of code that compute the time between two consecutive calls of clock_gettime(2), then the same with a call to the libc sin function. I run that 10 times and compute an average. I also print intermediary results. Here is the code:

#define _XOPEN_SOURCE 500

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static inline
double spec_to_double(struct timespec *ts) {
  return ts->tv_sec + 1e-9 * ts->tv_nsec;
}

int main() {
  struct timespec       c0, c1;
  double                r, s0, s1, a=0;

  for (int i = 0; i < 10; ++i) {
    clock_gettime(CLOCK_MONOTONIC, &c0);
    clock_gettime(CLOCK_MONOTONIC, &c1);
    s0 = spec_to_double(&c1) - spec_to_double(&c0);

    clock_gettime(CLOCK_MONOTONIC, &c0);
    r = sin(1.5);
    clock_gettime(CLOCK_MONOTONIC, &c1);
    s1 = spec_to_double(&c1) - spec_to_double(&c0);

    a += s1 - s0;
    printf("> %g\n", s1 - s0);
  }
  printf("%g\n", a / 10);
  return 0;
}

I ran that piece of code on a i7 3770 compiled with clang and -O0 optimization flag and get the following result:

> 1.034e-05
> 3.35043e-07
> 7.19447e-08
> 6.07688e-08
> -1.81608e-08
> 7.0082e-08
> 9.19681e-08
> 1.28057e-07
> 9.17353e-08
> 8.68458e-08
1.12583e-06

Wait, what was that ? I've got a negative result … This is exactly what I was looking for: our clock is not sufficiently precise and sometimes (Ok, that was not the first run, but it happens say 1 time out of 3) the time without the sin is longer.

There's also this first value, which is often longer. There's probably a rational for that related to processor cache (the first time the code of sin need to be copied from library to the cache, and it's small enough to fit in and stay there for the other run.)

Anyway, those values are not interesting, they only show the fact that we're not able to perform precise measures, but we can use loops. Replacing the call to sin by a loop performing the same line thousand of time yield the following results:

> 3.72389e-05
> 4.90241e-05
> 4.39051e-05
> 4.30089e-05
> 4.2367e-05
> 4.86639e-05
> 4.08019e-05
> 3.8858e-05
> 3.86741e-05
> 3.90329e-05
4.21575e-05

Ok, that's a little bit more stable, but we still got a variation of about 20 or 25% of the measured time ! Adding a zero, give us a more accurate result:

> 0.00025304
> 0.000266933
> 0.000270017
> 0.000247684
> 0.000247303
> 0.000247422
> 0.000249467
> 0.000247646
> 0.000248894
> 0.000247764
0.000252617

Let's Go !

So, I'm now able to do all my tests. I've implemented almost all versions presented in the original article with some minor variations. And since I was planning to see the interaction with compiler optimizations, my first runs was compiled using gcc and -O3, then clang and -O3 also. Let's have a look at the result:
  • Empty Loop: just a loop doing sum in a variable
  • libm sin: sin from C math library
  • sin_taylor: the first Taylor series version
  • sin_taylor2: the same but with constant rather than call to fact
  • sin_v6: high/low factor split

CompilerEmpty Looplibm sinsin_taylorsin_taylor2sin_v6
gcc -O30.23550350801.46981147893.88085019010.60061714310.3628679269
clang -O30.23594911191.48031646397.70664868694.29427291900.3918996351
gcc version:    4.8.2 20140206 (prerelease) (Arch Linux package)
clang version:  3.4 (tags/RELEASE_34/final) (Arch Linux package)

Each run corresponds to 100,000,000 calls to the corresponding sinus function. The loop is also accumulating the errors (variation) against standard sin function (libm.) We use a pre-filled vector of random float numbers to avoid accounting the cost of the random generator.

Nice results, isn't it ? But, I cheated a little bit, my first version yielded strange timing, so let's see why and how I obtain these results.

Strange Timing


In a previous version, I got the following values:

CompilerEmpty Looplibm sinsin_taylorsin_taylor2sin_v6
gcc -O30.23170267600.23144390110.23148966420.23143743000.2321709669
clang -O30.23139798321.42392848320.23179650800.23211171990.2316789902

Almost all timing was identical, and worst, it was similar to the empty loop.

I made two errors here:
  • rather than having a vector of random float, I used the same float for all calls
  • I forgot the builtin functions of gcc
What's the deal with benching with the same value ? All my functions are in my main file and with -O3 optimization, the compiler is able to understand that the expression is constant in the loop and compute it just one time. Fail.

The second error is about testing the libm function: it seems that gcc is able to optimize it the same way as local functions, while clang runs it at each loop step. Why ? It takes me sometimes to realize what happens.

The traditional way of solving this kind of issues is to take a look at the generated code (assembly) and try to understand how your functions are called. My testing code is rather ugly and not factorized at all, combine with -O3, it becomes very difficult to find the difference. In that case, there's a better test: the option -fno-builtin !

Yeah … gcc has a builtin function for sinus and thus it can safely do constant propagation to avoid computing the sinus at each step !

So, when comparing your code against standard functions, you should deactivate builtin function most of the time in order to have coherent results.

Compiler Optimization Impact


Running your code in -O3 is nice, things seems to go faster, OK. But what is the real impact is your code really efficient ? You should consider running your code with various level of optimization to see the differences. Here is the same example with optimization ranging from -O0 to -O3, using gcc and clang. I also add -ffast-math (I know, it may yield altered floating points results.)

Compilerlibm sinsin_taylorsin_taylor2sin_v6
gcc -O01.640969305123.357415701010.57552155781.2111645590
gcc -O11.523625704011.47179266305.86691553100.4126090182
gcc -O21.47107294508.22070184285.28496960290.4066506480
gcc -O31.46981147893.88085019010.60061714310.3628679269
gcc -O3 -ffast-math1.47539282793.76906145810.50010442610.3611039480
clang -O01.561618749126.048279444911.43707089411.0181730469
clang -O11.48056166088.90578887804.73330464190.4122427311
clang -O21.47776968387.77558478694.44361459300.3908856600
clang -O31.48031646397.70664868694.29427291900.3918996351
clang -O3 -ffast-math1.47998775197.68622586114.29045542800.3916952400

Small gains on the first column, come probably from loop optimizations (clang does not optimize it, and I deactivate builtin for gcc.) Tests for the empty loop (not shown here) yield a globally constant time for all compiler's options. Which tends to enforce our testing process: the loop in itself can safely be removed from execution time.

Another interesting point is the average error of our functions. I choose to use the standard sin function from libm as a referent. Globally, sin_v6 has an average error around of -9.60791e-23 (on the range -π/2 to π/2.) Even with -ffast-math, it seems that this version output a result with the same average precision.

It could be interesting to investigate a little bit more on precision, my actual version is not a real minimax polynomial approximation, but the global code is quite the same and we should have similar performances. If you're interest on this precision topic, there's another article on the same website as the aforementioned article: Better function approximations: Taylor vs. Remez.

Conclusion

Ok, what do we got:
  • Before evaluating performances, you should be sure of your tools for measuring.
  • Beware of your compiler's impact: builtin functions and optimization may invalidate some of your results.
  • Even if the compiler is better than you for most optimization, a clever code can be faster.
  • Yes, you can have an homemade math function performing better than the corresponding libc function (but only for a reduced range … )

Bonus

You may ask why haven't you wrote an assembly version ? Good question, here it is an implementation using Intel FPU instruction FSIN:

double sin_v7(double x) {
  double        r;
  __asm (
         "fldl %1\n"
         "fsin\n"
         "fstpl %0\n"
         : "=m"(r): "m"(x));
  return r;
}

I've test it using the same framework, and guess what ? It's slower !

For the same condition (hundreds of million calls) I get a time of 2.87384s which is nearly ten time slower than the sin_v6 function !

You want to know why ? That's probably due to the way FPU works: it induce a context switch just for one operation, so even if the operation itself is faster, the function is inefficient …