Saturday, March 8, 2014

Playing With C …

Most of the time, when you're interested in C programming, you're also interested in performance. Doing homemade micro-benchmarks is, in itself, an interesting activity, even if it's not productive at all …

Recently, after reading an article about pipelining and optimization (Playing with the CPU pipeline), I decided to test it myself. This was also the occasion to (re)discover classical math functions.

Computing Sine

The easiest way to compute a trigonometric function is to use the traditional Taylor expansion. There are better ways, but we're not here to provide the best sine function, we just want to play around …

I'll use the same approximation as the starting article (since I want to test the code in it … ) The first naive version uses homemade power and factorial functions.

#include <stdint.h>

uint64_t fact(uint64_t n) {
  uint64_t      r = 1;
  for (; n; --n) r *= n;
  return r;
}

// quick exponentiation: O(log p) multiplications (for p >= 1)
double qpower(double x, int p) {
  return p == 1 ? x : qpower(x * x, p >> 1) * (p % 2 ? x : 1);
}

// Taylor version
double sin_taylor(double x) {
  double        r = x;
  int           s = -1;
  for (int i = 3; i < 16; i += 2, s *= -1)
    r += s * qpower(x, i) / (double)fact(i);
  return r;
}

I hope you know how quick exponentiation works (it has O(log p) complexity).

The main issue here is that we're computing constant values several times, each with a function call. So, the next step is to pre-compute the 8 coefficients (I'll use the values provided in the aforementioned article, which are closer to a minimax approximation and thus yield more accurate results) and put them in an array (for easier access.)

static const double Coef[] = {
  +1.0,
  -1.666666666666580809419428987894207e-1,
  +8.333333333262716094425037738346873e-3,
  -1.984126982005911439283646346964929e-4,
  +2.755731607338689220657382272783309e-6,
  -2.505185130214293595900283001271652e-8,
  +1.604729591825977403374012010065495e-10,
  -7.364589573262279913270651228486670e-13,
};

Now we can remove the calls to fact (and the divisions, by the way):

double sin_taylor2(double x) {
  double        r = x;
  for (int i = 3; i < 16; i += 2)
    r += qpower(x, i) * Coef[i >> 1];
  return r;
}

The next optimization deals with the exponentiation: the goal is to rearrange and factorize in order to have fewer multiplications and to take advantage of the pipeline and out-of-order parallelism. I chose this version (spoiler: it will give us the best execution time) among all the variations presented in the original article:

double sin_v6(double x) {
  double        x2 = x * x;
  double        x4 = x2 * x2;
  double        x8 = x4 * x4;
  double        A = Coef[0] + x2 * (Coef[1] + x2 * (Coef[2] + x2 * Coef[3]));
  double        B = Coef[4] + x2 * (Coef[5] + x2 * (Coef[6] + x2 * Coef[7]));
  return x * (A + x8 * B);
}

Is it faster ? It's time to test !

Measuring Performance ?

How can we measure execution time correctly ? There's no perfect solution: the only way is to find an accurate clock (probably provided by the system), take its value before and after calling our function, and compute the difference.

I chose the POSIX clock_gettime(2) function with CLOCK_MONOTONIC, which seems suited for the job. It looks like this:

  struct timespec       c0, c1;
  clock_gettime(CLOCK_MONOTONIC, &c0);
  // Do your dirty work here
  clock_gettime(CLOCK_MONOTONIC, &c1);

OK, that's the usual way. To compute the difference between clocks, we have to take a look at the timespec structure, since operations like timersub(3) operate on timeval structures, not timespec (timeval provides microseconds while timespec provides nanoseconds … )

According to the man-page, timespec contains (at least) two fields, one with the seconds, the other with the nanoseconds. The simplest way to compute a difference is to convert the timespec into a double:

double spec_to_double(struct timespec *ts) {
  return ts->tv_sec + 1e-9 * ts->tv_nsec;
}

I prefer writing a function and tagging it static inline, rather than doing the computation directly in the code: it's much more readable and the compiler is able to optimize it correctly. Never be afraid of functions, they are better documentation than any comment.

OK, I'm now able to measure execution time, theoretically with a precision of a nanosecond. I don't really believe it: what is the cost of calling clock_gettime(2) ? According to its manual section, it's a system call. And is my function even slow enough to be measured ?

I got several answers to that:

  • Take some measurements ! We'll just measure the time between two consecutive calls to clock_gettime(2).
  • Don't run your function only once ! We'll call our function thousands of times or more, to obtain an average …
  • Run your test several times: if the result is stable, it may have some meaning.
So, I wrote a small piece of code that computes the time between two consecutive calls to clock_gettime(2), then does the same with a call to the libc sin function in between. I run that 10 times and compute an average. I also print the intermediate results. Here is the code:

#define _XOPEN_SOURCE 500

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static inline
double spec_to_double(struct timespec *ts) {
  return ts->tv_sec + 1e-9 * ts->tv_nsec;
}

int main() {
  struct timespec       c0, c1;
  double                r, s0, s1, a=0;

  for (int i = 0; i < 10; ++i) {
    clock_gettime(CLOCK_MONOTONIC, &c0);
    clock_gettime(CLOCK_MONOTONIC, &c1);
    s0 = spec_to_double(&c1) - spec_to_double(&c0);

    clock_gettime(CLOCK_MONOTONIC, &c0);
    r = sin(1.5);
    clock_gettime(CLOCK_MONOTONIC, &c1);
    s1 = spec_to_double(&c1) - spec_to_double(&c0);

    a += s1 - s0;
    printf("> %g\n", s1 - s0);
  }
  printf("%g\n", a / 10);
  return 0;
}

I ran that piece of code on an i7 3770, compiled with clang and the -O0 optimization flag, and got the following results:

> 1.034e-05
> 3.35043e-07
> 7.19447e-08
> 6.07688e-08
> -1.81608e-08
> 7.0082e-08
> 9.19681e-08
> 1.28057e-07
> 9.17353e-08
> 8.68458e-08
1.12583e-06

Wait, what was that ? I've got a negative result … This is exactly what I was looking for: our clock is not precise enough and sometimes (OK, that was not the first run, but it happens, say, 1 time out of 3) the measurement without the sin is longer.

There's also that first value, which is often larger. There's probably a rationale for that related to the processor cache (the first time, the code of sin needs to be copied from the library into the cache, and it's small enough to fit in and stay there for the other runs.)

Anyway, those values are not interesting in themselves; they only show that we're not able to perform precise measurements directly, but we can use loops. Replacing the call to sin with a loop performing the same line a thousand times yields the following results:

> 3.72389e-05
> 4.90241e-05
> 4.39051e-05
> 4.30089e-05
> 4.2367e-05
> 4.86639e-05
> 4.08019e-05
> 3.8858e-05
> 3.86741e-05
> 3.90329e-05
4.21575e-05

OK, that's a little bit more stable, but we still get a variation of about 20 or 25% of the measured time ! Adding a zero (10,000 iterations) gives us a more accurate result:

> 0.00025304
> 0.000266933
> 0.000270017
> 0.000247684
> 0.000247303
> 0.000247422
> 0.000249467
> 0.000247646
> 0.000248894
> 0.000247764
0.000252617

Let's Go !

So, I'm now able to do all my tests. I've implemented almost all the versions presented in the original article, with some minor variations. And since I was planning to look at the interaction with compiler optimizations, my first runs were compiled using gcc with -O3, then clang with -O3. Let's have a look at the results; the columns are:
  • Empty Loop: just a loop doing a sum in a variable
  • libm sin: sin from the C math library
  • sin_taylor: the first Taylor series version
  • sin_taylor2: the same, but with constants rather than calls to fact
  • sin_v6: high/low factor split

Compiler     Empty Loop     libm sin       sin_taylor     sin_taylor2    sin_v6
gcc -O3      0.2355035080   1.4698114789   3.8808501901   0.6006171431   0.3628679269
clang -O3    0.2359491119   1.4803164639   7.7066486869   4.2942729190   0.3918996351
gcc version:    4.8.2 20140206 (prerelease) (Arch Linux package)
clang version:  3.4 (tags/RELEASE_34/final) (Arch Linux package)

Each run corresponds to 100,000,000 calls to the corresponding sine function. The loop also accumulates the error (deviation) against the standard sin function (libm.) We use a pre-filled vector of random floating-point numbers to avoid accounting for the cost of the random generator.
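
For reference, here is a minimal sketch of that kind of benchmark loop (an assumed reconstruction, not the exact harness I used: the vector size and the way the error is measured are illustrative; spec_to_double is the helper shown earlier):

#define NCALLS 100000000
#define NVALS  4096

static double input[NVALS];   /* pre-filled with random values in [-pi/2, pi/2] */

/* time f over NCALLS calls, then measure its average deviation from libm sin */
double bench(double (*f)(double), double *err) {
  struct timespec c0, c1;
  double acc = 0, e = 0;

  clock_gettime(CLOCK_MONOTONIC, &c0);
  for (long i = 0; i < NCALLS; ++i)
    acc += f(input[i % NVALS]);    /* accumulate so the calls can't be optimized away */
  clock_gettime(CLOCK_MONOTONIC, &c1);

  for (int i = 0; i < NVALS; ++i)
    e += f(input[i]) - sin(input[i]);
  *err = e / NVALS;
  (void)acc;

  return spec_to_double(&c1) - spec_to_double(&c0);
}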

Nice results, aren't they ? But I cheated a little bit: my first version yielded strange timings, so let's see why and how I obtained these results.

Strange Timing


In a previous version, I got the following values:

Compiler     Empty Loop     libm sin       sin_taylor     sin_taylor2    sin_v6
gcc -O3      0.2317026760   0.2314439011   0.2314896642   0.2314374300   0.2321709669
clang -O3    0.2313979832   1.4239284832   0.2317965080   0.2321117199   0.2316789902

Almost all timings were identical and, worse, they were similar to the empty loop.

I made two errors here:
  • rather than having a vector of random floats, I used the same float for all calls
  • I forgot about gcc's builtin functions
What's the deal with benchmarking on the same value ? All my functions are in my main file, and with -O3 optimization the compiler is able to see that the expression is constant inside the loop and compute it just once. Fail.

The second error is about testing the libm function: it seems that gcc is able to optimize it the same way as local functions, while clang runs it at each loop step. Why ? It took me some time to realize what happens.

The traditional way of solving this kind of issue is to take a look at the generated code (assembly) and try to understand how your functions are called. My testing code is rather ugly and not factorized at all; combined with -O3, it becomes very difficult to spot the difference. In this case, there's a better test: the option -fno-builtin !

Yeah … gcc has a builtin function for sine, and thus it can safely do constant propagation to avoid computing the sine at each step !

So, when comparing your code against standard functions, you should deactivate builtin functions most of the time in order to get coherent results.

Compiler Optimization Impact


Running your code with -O3 is nice, things seem to go faster, OK. But what is the real impact, and is your code really efficient ? You should consider running your code with various levels of optimization to see the differences. Here is the same example with optimizations ranging from -O0 to -O3, using gcc and clang. I also added -ffast-math (I know, it may yield altered floating-point results.)

Compiler               libm sin       sin_taylor      sin_taylor2     sin_v6
gcc -O0                1.6409693051   23.3574157010   10.5755215578   1.2111645590
gcc -O1                1.5236257040   11.4717926630    5.8669155310   0.4126090182
gcc -O2                1.4710729450    8.2207018428    5.2849696029   0.4066506480
gcc -O3                1.4698114789    3.8808501901    0.6006171431   0.3628679269
gcc -O3 -ffast-math    1.4753928279    3.7690614581    0.5001044261   0.3611039480
clang -O0              1.5616187491   26.0482794449   11.4370708941   1.0181730469
clang -O1              1.4805616608    8.9057888780    4.7333046419   0.4122427311
clang -O2              1.4777696838    7.7755847869    4.4436145930   0.3908856600
clang -O3              1.4803164639    7.7066486869    4.2942729190   0.3918996351
clang -O3 -ffast-math  1.4799877519    7.6862258611    4.2904554280   0.3916952400

The small gains in the first column probably come from loop optimizations (clang does not optimize the call itself, and I deactivated builtins for gcc.) Tests for the empty loop (not shown here) yield a globally constant time across all compiler options, which tends to confirm our testing process: the loop itself can safely be ignored in the execution time.

Another interesting point is the average error of our functions. I chose to use the standard sin function from libm as the reference. Globally, sin_v6 has an average error around -9.60791e-23 (on the range -π/2 to π/2.) Even with -ffast-math, it seems that this version outputs results with the same average precision.

It could be interesting to investigate precision a little bit more; my current version is not a real minimax polynomial approximation, but the overall code is much the same and we should get similar performance. If you're interested in this precision topic, there's another article on the same website as the aforementioned one: Better function approximations: Taylor vs. Remez.

Conclusion

OK, here's what we've got:
  • Before evaluating performance, you should be sure of your measuring tools.
  • Beware of your compiler's impact: builtin functions and optimizations may invalidate some of your results.
  • Even if the compiler is better than you at most optimizations, clever code can be faster.
  • Yes, you can have a homemade math function performing better than the corresponding libc function (but only on a reduced range … )

Bonus

You may ask: why haven't you written an assembly version ? Good question; here is an implementation using the x87 FPU instruction FSIN:

double sin_v7(double x) {
  double        r;
  __asm (
         "fldl %1\n"
         "fsin\n"
         "fstpl %0\n"
         : "=m"(r): "m"(x));
  return r;
}

I've tested it using the same framework, and guess what ? It's slower !

Under the same conditions (hundreds of millions of calls) I get a time of 2.87384s, which is nearly ten times slower than the sin_v6 function !

You want to know why ? That's probably due to the way the FPU works: it induces a context switch just for one operation, so even if the operation itself is fast, the function is inefficient …

Friday, November 8, 2013

More C++11 threading: I want my lock-free !

Last time, I tried to describe the changes in C++11 that impact multi-threading. Now, I'll talk about implementing lock-free data structures.

Sharing Memory


Before talking about lock-free, let me refresh your memory about the most important issue when doing parallel code: concurrent memory access.

If two (or more) threads share (i.e. use) a common memory location, like a counter or the entry point of a data-structure, each time they access that shared location, conflicts can arise.

The most striking example is the shared counter problem: some threads share a common integer (our shared counter) and try to simply increment it. That's it, they just want to do a simple ++i !

Even with this simple example, you'll get bad behavior. To convince you, I've got some interesting results: I take a bunch of threads (from 2 to 256) and each thread runs a simple loop (with exactly 2²² rounds) that does a simple ++i, with i a shared counter. So normally the final value should be number_of_threads * 2²², and thus we can compute a kind of error measure: (expected_value - real_value) / expected_value (in percent.) Here are the results:


Number of threads    Error
2                    49.24%
4                    68.99%
8                    84.21%
256                  89.87%
This test was run on an Intel Core i7-3770 (so 4 cores and 2 hyper-threads per core.)

Impressive, isn't it ?

How can a simple increment be so wrong ? It's quite simple: to increment your shared variable you've got at least two operations (from the memory point of view): first you load the value at the given location; then you store the newly computed value (based on the loaded value) back to the memory location. The issue is that between your load and your store, some other threads may have modified this value and you won't see it.

To solve that kind of issue, the usual solution is to use locks (for the given example, you can now use atomic types.)
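
For instance, with C++11 atomic types the shared counter can be fixed directly (a minimal sketch, not the exact benchmark I ran; the 4 threads and 2²² rounds just mirror the experiment above):

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic<int> counter(0);   // the shared counter, now atomic

void worker() {
  for (int i = 0; i < (1 << 22); ++i)
    ++counter;                 // atomic read-modify-write, no lost update
}

int main() {
  std::vector<std::thread> pool;
  for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
  for (auto &t : pool) t.join();
  std::printf("%d (expected %d)\n", counter.load(), 4 * (1 << 22));
}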

When dealing with a data structure (lists, queues, trees … ) protecting accesses is even more critical since not only may you lose information, but you can also completely break the data structure.

Lock free ?


« OK, I got your point about protecting memory access, so what's that lock-free stuff ? »

Locks (any kind of locks) are probably the kind of operation that you want to avoid.

When a thread holds a lock, all other threads trying to obtain the same lock are blocked until the current owner releases it. That's the expected behavior. But those threads stay blocked even when the owner is not active.

The problem really arises when the number of threads or processes on the system grows. Once again, it's quite simple: if the thread holding the lock is scheduled out, all other threads waiting for the lock are blocked while the owner of the lock isn't even using the data !

So, intuitively, a lock-free algorithm provides a global progression property: there can always be an active thread making progress.

« I've heard about wait-free algorithm, is that the same thing ? »

No … here are some definitions:

  • lock-free: an algorithm (or the operation on a data-structure) is lock-free if « the suspension of one or more threads will not stop the potential progress of the remaining threads. »
  • wait-free: an algorithm is wait-free if « every operation has a bound on the number of steps the algorithm will take before the operation completes. »
In a lock-free algorithm, the system (that is, our set of threads) maintains progress, while in a wait-free algorithm, every thread makes progress.

What do we need ?


Let me guess … brain grease ?

There are a lot of results about the primitives needed to implement lock-free or wait-free operations. The main result is that only one basic mechanism is needed, and two are available for the job:
  • Atomic Compare&Swap (CAS)
  • Load-Link/Store-Conditional (ll/sc)
A CAS belongs to the family of read-modify-write operations, but while most operations of this family don't offer a consensus number better than 2 (they can only be used to provide a lock-free algorithm for 2 processes), the CAS has an infinite consensus number. This operation is quite simple:

bool CAS(int *mem, int value, int cmp) {
  if (*mem == cmp) {
    *mem = value;
    return true;
  }
  return false;
}

On the other hand, ll/sc is a pair of operations: the first one (load-link) loads the content of a shared memory location and keeps track of that load; then the store-conditional updates the memory location only if it hasn't changed in between.

The ll/sc mechanism is, theoretically, more powerful than a simple CAS. But that's only in theory: for various reasons, most (all) concrete ll/sc implementations miss the theoretical semantics, breaking a lot of algorithms.

Since implementing CAS with ll/sc is simple (even with a weakened ll/sc), most algorithms prefer CAS.

How do we use CAS ?


So, we found that the CAS operation will be our primitive. It is available among the C++11 atomic operations, so we can use it (if you take a look at that article, you'll find a template providing a double-word compare-and-swap for x86 and x86-64 using inline ASM.)

Here are some examples using CAS to provide fetch&add and to implement a simple spin-lock.

First, the fetch&add (like operator +=):

template <typename T>
T fetch_and_add(std::atomic<T>& x, T value) {
  T             tmp;
  tmp = x.load();
  // on failure, compare_exchange_weak reloads tmp with the observed value
  while (!x.compare_exchange_weak(tmp, tmp + value)) {}
  return tmp + value;
}

And the simple spin-lock class:

// needs <atomic> and <thread>
class mylock {
public:

  struct guard {
    guard(mylock& _lock) : lock(_lock) {
      lock.lock();
    }
    ~guard() {
      lock.unlock();
    }
    mylock&           lock;
  };

  mylock() : _on(false) {}
  mylock(const mylock& x) = delete;

  void lock() {
    // C++11 CAS needs a reference for the expected value
    bool                tmp = false;
    while (!_on.compare_exchange_weak(tmp, true)) {
      tmp = false; // needed since C++11 CAS stores the observed value in tmp
      std::this_thread::yield(); // don't waste cycles
    }
  }

  void unlock() {
    _on = false; // OK, atomic store
  }

  bool try_lock() {
    bool                tmp = false;
    return _on.compare_exchange_weak(tmp, true);
  }

private:
  // we use a CAS here for the demo, but C++11 provides atomic_flag which
  // would be more appropriate here.
  std::atomic<bool>     _on;
};

So as you can see, most of the time we use a kind of retry-loop scheme to achieve our goal.

You may have noticed that I used compare_exchange_weak() in my code. What's weak all about ? According to the documentation, the weak version may fail spuriously but is less expensive in a retry-loop than the strong version. That's a matter of lower-level implementation, and since I'm using a loop anyway, I'll use the weak version.
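
As a small illustration, here is a hypothetical usage of that spin-lock (not taken from a real program): the RAII guard takes the lock for the duration of the scope.

int     counter = 0;
mylock  m;

void worker() {
  for (int i = 0; i < 100000; ++i) {
    mylock::guard g(m);   // lock() here, unlock() at the end of the scope
    ++counter;            // safe: only one thread at a time runs this line
  }
}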

Lock Free Queue ?


So, how do I implement a lock-free queue ?

There are a few main ideas behind that kind of data structure: being able to fail and retry (like a transaction), finding linearization points and trying to concentrate global structure modifications into one or possibly two atomic operations, and finally, finding a way to finish other threads' work.

A linearization point is where the effects of an operation appear to happen, from the point of view of observers (other threads.)

For example, suppose you have a pointer to the first element of a linked list and want to remove that first element. We want something like this (where Head is a shared global entry point of some list):

tmp = Head;     // retrieve the head pointer
Head = tmp->next; // cut-off the head
... // do what you want with it

The main issue in a concurrent context is that Head and Head->next may change while we're using them. Also note that the linearization point is when we switch the head with its successor. This operation must be atomic and thus it needs a CAS.

Let's Build That Queue


So, the overall idea is to fetch the pointers we need and try to switch the head; if that fails (if the head has changed since our previous read) we need to retry. Here is a possible implementation (pseudo-code):

do {
  tmp = Head;       // fetch the head
  next = tmp->next; // fetch the next element
  // be sure our pointer is still accurate
  if (tmp != Head) continue;
  // get the content
  // switch the head and the next element
  if (CAS(&Head, next, tmp)) break;
  // if CAS fails, try again !
} while (true);

This example doesn't handle the empty-list case and some other details. To avoid issues when dealing with an empty list or a list with one element (there are more issues with one element than with an empty list), we use a sentinel: the head of the list is always a dummy node, only the second one matters. Here is a complete version of the pop operation (C++11 version):

struct List {
  struct node {
    int                 value;
    std::atomic<node*>  next;
    node() : value(0) { next = 0; }
    node(int x) : value(x) { next = 0; }
  };

  std::atomic<node*>    Head, Tail;

  List() {
    node       *sentinel = new node();
    Head = sentinel;
    Tail = sentinel;
  }

  bool pop(int *lvalue) {
    node       *head, *tail, *next;
    do {
      // acquire pointers
      head = Head.load();
      tail = Tail.load();
      next = head->next.load();
      // check that we're still coherent
      if (Head != head) continue;
      if (next == 0) return false; // empty queue
      if (head == tail) {
        // missing completion on tail (see later)
        // do the job of an unfinished push
        Tail.compare_exchange_weak(tail, next);
        continue;
      }
      *lvalue = next->value; // copy the value
      // done ? try to cut the head off
      if (Head.compare_exchange_weak(head, next)) break;
      // failed ? we have to retry
    } while (true);
    // note: the old dummy node is not freed here, see the memory section below
    return true;
  }

  void push(int x); // see later
};

The next step is the push operation. While for the pop operation we only have one modification of the list (replacing the head with its next), for the push operation we need two modifications: changing the next of the tail element, and then replacing the tail with the new element.

A thread can thus find the queue with the next field of the tail element not null (that is: another thread has attached its element but hasn't updated the tail pointer yet.) In that case, the current thread can try to finish the operation. We've already done that in the pop operation: when the head and the tail are equal but their next pointer is not null, the popping thread can do the CAS itself to update the tail.

An interesting point with this CAS is that if the original pushing thread does the update itself between the test and the CAS of the other thread (the one that wants to finish the job), then since the tail pointer has already been updated, the CAS will fail silently without any consequence !

To summarize: when pushing, you build your new list cell and try to attach it to the last cell (the tail) of the list; once that's done, you can replace the tail pointer with a pointer to the new cell. During an update (push or pop), if a thread finds the queue in an intermediate state, it can do the missing update itself.

Here is the code:

void List::push(int x) {
  node         *n = new node(x);
  node         *tail, *next;
  do {
    tail = Tail.load();
    next = tail->next.load();
    // be sure our pointers are still accurate
    if (Tail != tail) continue;
    if (next != 0) {
      // missing completion on tail: finish the other thread's push
      Tail.compare_exchange_weak(tail, next);
      continue;
    }
    // attach the new cell: update the next of the tail (expected to be null)
    if (tail->next.compare_exchange_weak(next, n)) break;
  } while (true);
  // finally update the tail (may fail, but we don't care)
  Tail.compare_exchange_weak(tail, n);
}

Note that the final CAS can fail, but if it does, the only possible reason is that someone has already done the job for us, so we don't need to care !

Memory Issues


In the previous example we avoided the question of releasing/recycling memory, for a good reason.

Why ?

You could delete a list cell in the pop operation once you've swapped the head with its next pointer (at the end of the function, after the loop.) But are you sure you can safely delete it ?

What can go wrong ? Some other thread may hold the same pointer and try to use it.

Once again, it's a question of timing: a thread trying to take an element from the queue keeps and uses the pointer to the head in order to access its next field, so if the corresponding node has been freed, this access may fail.

A lot of lock-free implementations are done in Java, where there's a garbage collector: a node won't be freed as long as some thread still holds a reference to it. But in C++ there's no such thing …

A possible solution to avoid a segfault is to use a recycling allocator: cells are not deleted but reused as new nodes. Thus, accessing a node will always make sense and won't fault.
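
As an illustration of the idea, here is a minimal sketch of such a recycling scheme (an assumption for illustration, not the allocator from the cited papers): freed nodes are pushed onto a lock-free free-list and reused by later allocations. Note that the pop on this free-list is itself exactly the kind of code that suffers from the ABA problem described below.

std::atomic<List::node*> FreeList(nullptr);

void recycle(List::node *n) {
  List::node *top;
  do {
    top = FreeList.load();
    n->next = top;                    // link the node on top of the free-list
  } while (!FreeList.compare_exchange_weak(top, n));
}

List::node *alloc_node(int x) {
  List::node *top;
  do {
    top = FreeList.load();
    if (top == nullptr) return new List::node(x);   // free-list empty
  } while (!FreeList.compare_exchange_weak(top, top->next));
  top->value = x;
  top->next  = nullptr;
  return top;
}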

Are we safe now ? No, another issue arises with recycling: the ABA problem.

ABA ?

No, we're not talking about Swedish pop music. The ABA issue is the weak point of the CAS operation: only observing whether a pointer has changed may not be enough.

In brief: a first thread reads the head of the queue, finds a pointer A and gets scheduled out; during its inactivity phase, another thread takes the first element of the queue and the head pointer becomes B, then further changes occur and finally the head pointer becomes A again (although its content has changed.) The first thread won't see that change, and if it has stored the head and its next, the next value will be stale, but the CAS won't fail and we lose data coherency !

[Figure: ABA issue in a stack]

There are various solutions to the ABA problem:

  1. Use a garbage collector (like in Java)
  2. Use the so-called IBM solution with tagged pointers
  3. Use Hazard Pointers
The GC solution is still not available in C++ and, in fact, it's not really interesting. Let's look at the two others.

Tagged Pointers


The idea of tagged pointers is to associate a counter with our pointer and manipulate the pair (stored contiguously) as a whole value. The counter is not a reference counter: it's not really possible to implement an efficient ref-count in a lock-free way. Our counter here is just a kind of tag that we increase each time we update the pointer. So when comparing the pointer/counter pair, even if the pointer has the same value, if it has been updated the counter will have a different value and the CAS will fail.

This technique is quite simple and efficient, but it requires a double-word CAS, that is, a CAS that works on one location spanning two contiguous words (not to be confused with a double CAS, which works like two CASes on two different locations.) Such an operation exists on most 32-bit processors but is only available on modern x86 processors with the 64-bit extension. C++11 provides 16-, 32- and 64-bit CAS for atomic types but no 128-bit version.
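
Here is a conceptual sketch of the pair (in the same semantics-only style as the CAS function shown earlier, and purely an illustration: a real implementation must perform this as a single atomic 128-bit instruction such as cmpxchg16b, see the article linked below):

struct tagged_ptr {
  void     *ptr;   // the actual pointer
  uint64_t  tag;   // bumped on every successful update (uint64_t from <cstdint>)
};

bool dwcas(tagged_ptr *mem, tagged_ptr value, tagged_ptr cmp) {
  // succeeds only if BOTH the pointer and the tag are unchanged, so reusing
  // the same pointer value with a different tag cannot fool the comparison
  if (mem->ptr == cmp.ptr && mem->tag == cmp.tag) {
    *mem = value;
    return true;
  }
  return false;
}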

This method is used in the original article describing the lock-free queue: Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms

I've also written an article explaining how to use the cmpxchg16b instruction with C++ templates and inline ASM: Implementing generic double-word compare and swap for x86/x86-64

Hazard Pointers


Hazard pointers are a technique described in another article from Maged Michael (one of the authors of the work on the lock-free queue): Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (pdf)

It's like a simplified GC: each thread has some hazard pointer variables where it stores pointers that need to be preserved, and when a thread wants to delete some pointer, it first verifies that this pointer is not in use (stored in a hazard pointer variable.)

Hazard pointer variables (HPs) are in fact shared variables that are written by one thread and readable by all others. When a thread wants to delete a pointer, it first adds it to a wait-list; when this list reaches a threshold, it retrieves all the pointers that are stored in an HP and only deletes those that are not in use.
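
Here is a simplified sketch of that retire/scan mechanism (the bounds MAX_THREADS and THRESHOLD, and the reuse of the queue's node type, are assumptions for illustration; it needs <atomic>, <set> and <vector>, and my full implementation linked below is the real thing):

const size_t MAX_THREADS = 64;
const size_t THRESHOLD   = 128;

std::atomic<List::node*> hazard[MAX_THREADS];       // one published pointer per thread
thread_local std::vector<List::node*> wait_list;    // nodes waiting to be freed

void retire(List::node *p) {
  wait_list.push_back(p);
  if (wait_list.size() < THRESHOLD) return;
  // collect every pointer currently published as hazardous
  std::set<List::node*> in_use;
  for (size_t i = 0; i < MAX_THREADS; ++i)
    if (List::node *h = hazard[i].load()) in_use.insert(h);
  // free only the nodes nobody is holding, keep the others for later
  std::vector<List::node*> keep;
  for (List::node *q : wait_list)
    if (in_use.count(q)) keep.push_back(q);
    else delete q;
  wait_list.swap(keep);
}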

Despite some tricky details about managing the attribution of HPs, the implementation is quite simple, it is wait-free, and with the right data structures it can be pretty fast. In practice, I've seen no performance difference compared to the IBM technique (tagged pointers.)

I've got a full implementation available here: hazard_pointer.hh

More ?


So that's just a beginning: I've started playing around, programming all those data structures (available in a beta version), and you can take a look at them in my bitbucket repository: https://bitbucket.org/marwan_burelle/lockfreeexperiment

Of course there's a lot of information on the net, but start from the two papers already cited (lock-free queue and hazard pointers.)

My next step will be to implement wait-free data structures using ideas presented in recent works announcing very good performance, but that's another story …

Tuesday, May 14, 2013

Playing With C++11 Multi-Threading Stuff

So, you may be interested in what I'm doing in my spare time ? I'm playing with concurrency-friendly data structures. The idea is to implement in C++ the data structures that are available to Java programmers in the Java framework (and some newer ones if possible.)

It started when I was looking for algorithmic examples for a parallel programming lecture. I found the classic article from Michael & Scott published at PODC'96: Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.

This paper presents two queue algorithms: a blocking one with two locks and a non-blocking one (also called lock-free.) So I wanted to implement both algorithms in order to explain them to the students.

The blocking algorithm is very simple; in fact it belongs to the kind of algorithm that seems obvious once you've read it. It echoed some ideas I had tried by myself and solved all the issues I was thinking of.

The non-blocking algorithm required more time. In fact, I was blocked by the fact that my programming language (C) doesn't provide a compare-and-swap operation (I finally found what I was looking for in the gcc intrinsics, only to run into another issue with the 64-bit version.)

I finally managed to solve that question by mixing ASM and C++ templates in order to get a working double-word compare-and-swap operation. You can find the whole story in this article published on the LSE's blog.

And then I started reading about C11 and C++11. I found that it had become the right time to explore both the new elements in the standards and all those funny data structures.

But let's start with C++11; we'll talk about lock-free later.

What's the deal with all that stuff on the memory model ?

In fact what is a memory model anyway ?

So, we need to take a look at some notions from compilers and programming languages.

Semantics ?

If you're a programmer, you probably have some knowledge of that compiling stuff: you give higher-level code to some program (the compiler) and it gives you back a binary file that is supposed to be machine-level code. If you've tried to write ASM by yourself, you've probably found that there's quite a huge gap between your favorite programming language's constructions and assembly code. Of course, to an experienced eye, even if you don't know the target assembly, you'll see patterns in the code and probably understand the link between the two worlds.

But anyway, the translation between higher-level code and assembly code is not simple. It's not just a basic line-by-line translation. The formal way to understand this translation is what we call semantics. Language semantics explains what value an expression (for example) should return, and possibly what memory cells are affected by the computation.

When dealing with well-founded semantics, as with most functional programming languages, the description is theoretical: even with small-step operational semantics that can be directly implemented in an interpreter, we don't really have a description of how to obtain results, but of what results are expected. This way, the language can be implemented in several ways, and doesn't rely on specific architectures or algorithms.

C and C++ don't belong to that kind of language (some may say fortunately … ) But a lot of people think (and I also thought that way when I was a student) that the compiler defines the semantics. That is: the semantics of a C program is defined by how standard compilers translate the code.

This is false and stupid. C and C++ have a semantics, but where a well-founded programming language's validation system tries to forbid code for which no semantics is clearly defined, C and C++ let the compiler decide. So, this is not simple, but the thing you need to understand is that the semantics gives a lot of liberty to the compiler in how to translate things, and it only gives a loose definition of the expected results for a subset of the language.

One of the key ideas behind this approach is to give maximum liberty to the compiler so that it can optimize the code using all the available tricks.

Take this little code sample:

int res = p1->x * p2->x + p1->y * p2->y + p1->z * p2->z;

(given that p1 and p2 are pointers to some kind of 3D point structure.)

Edit: I've switched this code from float to integer, because, as you know, re-ordering and floating point …

If you've played with the SIMD extensions of some CPUs, you know that we can obtain the same result with faster code using these extensions. But it will change the order (OK, not by much) in which the expression is evaluated. You know that it doesn't change the final result, but an overly restrictive semantics would forbid that kind of rewriting.

What's the deal with re-ordering ? It doesn't change the result, does it ? Right, but only because we don't have any side effects here, nor do we need to respect a particular order in our memory accesses (think about accessing hardware, for example.)

So, rather than making overly restrictive rules (in the hope that clever compiler writers find a way to auto-magically detect when the rules can safely be broken), the people writing the standards for C and C++ decided to give maximum freedom and fewer constraints, with the drawback that some pieces of code have undefined behavior.

There's a lot to explain about this approach, but a good example is better: the standard doesn't give any rule about the order in which this kind of thing is executed …

t1[i++] = t2[i++];

This is undefined behavior and, for that matter, the compiler may decide to produce code that wipes out your drives, because the standard doesn't say it can't do that.

Going Parallel

Now add some fun to that: multiple processors, multiple threads and a single memory. Do you begin to see what's coming next ?

So, let's dive into the hardware first. When dealing with memory accesses in the presence of multiple processors, we've got a lot of issues to solve.

Basically the question is: do we need to synchronize memory operations ?

Let's take a real-life example: at the crossing of two roads, we may add traffic lights to avoid accidents. But most of the time we don't need them. Why ? Because there aren't enough cars, and if we use traffic lights we may block cars when it's not needed.

For shared memory it's the same idea: most memory operations don't require any synchronization because they won't conflict. Beware, I'm not saying that accessing the same memory cell from two processors carries no risk; I'm saying that most of the time, processors won't access the same memory cells. So you'll need to specify what happens when you know conflicts can arise.

So rather than synchronizing all memory operations, or even trying to synchronize all dangerous memory accesses, most hardware is based on what we call a relaxed memory model: you must explicitly describe when you need synchronization; when you don't, you have no guarantee of what will happen.

So what kinds of synchronization do we have ?

  • atomic operations: an atomic operation is a complex operation (doing more than one memory access) that executes as if it were a single access. The usual operations are atomic exchange, compare-and-swap …
  • acquire barrier: this barrier ensures that operations placed after the barrier are not re-ordered before it (that is, the processor won't anticipate load operations that appear after the barrier.)
  • release barrier: this barrier ensures that all operations begun before the barrier are finished before crossing it (that is, the processor won't defer the actual memory stores until after the barrier.)
  • full memory fence: a full memory fence is a kind of combined acquire and release barrier: all pending operations must finish before the fence and no later operation may start before it.
The acquire/release things often disturb newcomers to the world of concurrent memory accesses. The first thing you need to know is that your processor doesn't act in a linear fashion, for a lot of good reasons. To be short: it is able to start loading data from memory as soon as it is able to decode the target address, even if some prior instructions are not yet finished; it is also able to defer a store operation to a later point, that is, to start operations that come after the store before concretely storing the data (note that your compiler may also try to do similar transformations on your code.)

The classic example is the I-have-finished boolean mistake. Let's take a look at the following pseudo code:

shared int x = 0;
shared bool ok = false;

worker0() {
  x = 42;
  ok = true;
}

worker1 {
  if (ok)
    print(x);
  else
    print("not yet");
}

So, a quick reading lets you think that this program will either print "42" or "not yet", but never "0", right ? Wrong !

The processor may defer some of the store operations: if writing to x takes longer, it can set ok to true before storing 42 into x. On the other side, it can also start reading x before ok, and thus read the old value of x and then the true flag in ok.

Barriers can be used this way here:

shared int x = 0;
shared bool ok = false;

worker0() {
  x = 42;
  release-barrier;
  ok = true;
}

worker1 {
  if (ok) {
    acquire-barrier;
    print(x);
  } else
    print("not yet");
}

This will solve the issue.

And thus, what about C and C++ ?

For years, the standards avoided the question. C99 and C++98 don't even mention that kind of issue as undefined behavior; no, they simply don't deal with concurrency. Until the release of the 2011 standards, C11 and C++11 (in fact, it had been in C++0x drafts for quite some time.)

What's the difference now ? Hey, guess what ? (most) concurrent accesses are undefined behavior !

What a (r)evolution, you may say. But in fact it is one, because the standard also provides a way to introduce barriers. And thus, we are now able to write code using atomic operations and memory barriers without relying on the supposed behavior of a particular compiler and processor.
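
For example, the I-have-finished flag from the previous section can now be written with C++11 atomics (a minimal sketch; worker0 and worker1 would run in two different threads):

#include <atomic>
#include <cstdio>

int               x = 0;
std::atomic<bool> ok(false);

void worker0() {
  x = 42;
  ok.store(true, std::memory_order_release);  // release: the store to x is visible before ok
}

void worker1() {
  if (ok.load(std::memory_order_acquire))     // acquire: if we read true, we also see x == 42
    std::printf("%d\n", x);                   // prints 42, never 0
  else
    std::printf("not yet\n");
}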

If you have looked for implementations of, or discussions about, lock-free data structures, you may have noticed that most of them are done in Java. The main reason is that Java made the move to a relaxed memory model (basically the one I described to you) with the release of Java 1.5. So you can rely on that model to justify your implementation and be sure that your benchmarks make sense.

I hope that the arrival of the new C++ memory model will give us a chance to see more C++ implementations.

And the other stuff ?

You want to know more about C++11 threads ? Read the specs !

I've got some slides (in English) on the subject here: Threads

You can also find interesting information in this (old but still accurate) article by Maurice Herlihy: Wait-Free Synchronization (pdf format.)

The next article will talk about my implementation of Hazard Pointers, their use in lock-free queues and maybe some words about testing all that code.