Tuesday, May 14, 2013

Playing With C++11 Multi-Threading Stuff

So, you may be interested in what I'm doing in my spare time? I'm playing with concurrent-friendly data structures. The idea is to implement in C++ the kind of data structures that are available to Java programmers in the Java framework (and some newer ones if possible.)

It started when I was looking for algorithmic examples for a parallel programming lecture. I found the classical article by Michael & Scott, published at PODC '96, called: Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.

This paper presents two queue algorithms: a blocking one with two locks and a non-blocking one (also called lock-free.) So I wanted to implement both algorithms in order to explain them to the students.

The blocking algorithm is very simple; in fact, it belongs to that kind of algorithm that seems obvious once you've read it. It echoed some ideas I had tried myself, and it solved all the issues I had been thinking about.
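
To give you an idea, here is my own C++11 sketch of it (names and details are mine, not the paper's pseudo-code.) The trick is a dummy node that keeps enqueuers and dequeuers from ever touching the same pointers, so each end of the queue can have its own lock:

#include <mutex>

template <typename T>
class two_lock_queue {
  struct node {
    T value;
    node *next;
    explicit node(T v = T()) : value(v), next(nullptr) {}
  };
  node *head;                       // always points to a dummy node
  node *tail;                       // last node (equals head when empty)
  std::mutex head_lock, tail_lock;

public:
  two_lock_queue() : head(new node()), tail(head) {}

  void enqueue(T v) {
    node *n = new node(v);
    std::lock_guard<std::mutex> guard(tail_lock);
    tail->next = n;
    tail = n;
  }

  bool dequeue(T &out) {
    std::lock_guard<std::mutex> guard(head_lock);
    node *first = head->next;       // first real element, if any
    if (first == nullptr)
      return false;                 // queue is empty
    out = first->value;
    node *old_dummy = head;
    head = first;                   // first becomes the new dummy
    delete old_dummy;
    return true;
  }
};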

The non-blocking algorithm required more time. In fact, I was blocked by the fact that my programming language (C) doesn't provide a compare-and-swap operation (I finally found what I was looking for in the gcc intrinsics, only to find another issue in the 64-bit version.)

I finally managed to solve that problem by mixing ASM and C++ templates in order to get a working double-word compare-and-swap operation. You can find the whole story in this article published on the LSE's blog.

And then I started reading about the C11 and C++11 stuff, and found that it was the right time to explore both the new elements in the standards and all those funny data structures.

But let's start with C++11; we'll talk about lock-free later.

What's the deal with all that stuff on the memory model?

In fact, what is a memory model anyway?

So, we need to take a look at some notions from compilers and programming languages.

Semantics?

If you're a programmer, you probably have some knowledge of that compiling stuff: you give higher-level code to some program (the compiler) and it gives you back a binary file, which is supposed to contain some machine-level code. If you have ever tried to write ASM by yourself, you've probably found that there's quite a huge gap between your favorite programming language's constructions and assembly code. Of course, to the experienced eye, even if you don't know the target assembly, you'll see patterns in the code and probably understand the link between the two worlds.

But, anyway, the translation between higher-level code and assembly code is not simple. It's not just a basic line-to-line translation. The formal way to understand this translation is what we call semantics. Language semantics explains what value an expression (for example) should return, and possibly what memory cells are affected by the computation.

When dealing with well-founded semantics, as with most functional programming languages, the description is theoretical. Even with small-step operational semantics, which can be directly implemented in an interpreter, we don't really have a description of how to obtain results, but of what results are expected. This way, the language can be implemented in several ways, and doesn't rely on specific architectures or algorithms.

C and C++ don't belong to that kind of languages (one may say fortunately…) But a lot of people think (and I also thought that way when I was a student) that the compiler defines the semantics. That is: the semantics of a C program is defined by how standard compilers translate the code.

This is false and stupid. C and C++ do have a semantics, but where a well-founded programming language's validation system tries to forbid code for which no semantics is clearly defined, C and C++ let the compiler decide. So, this is not simple, but the thing you need to understand is that the semantics gives a lot of liberty to the compiler on how to translate things, and it only gives a loose definition of the expected results for a subset of the language.

One of the key ideas behind this approach is to give maximum liberty to the compiler so that it can optimize the code using all the available tricks.

Take this little code sample:

int res = p1->x * p2->x + p1->y * p2->y + p1->z * p2->z;

(given that p1 and p2 are pointers to some kind of 3D point structure.)

Edit: I've moved this code from float to integer, because, as you know, re-ordering and floating point…

If you have played with the SIMD extensions of some CPUs, you know that we can obtain the same result with faster code using these extensions. But it will change the order (OK, not by much) in which the expression is evaluated. You know that it doesn't change the final result, but a too-restrictive semantics would forbid that kind of rewriting.
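
If you're curious, here is a hand-written sketch of what such a vectorized version could look like, using SSE4.1 intrinsics (the padding field and the function name are my own, added so that a point fills a 128-bit register): the three multiplications now happen at once, and the additions are re-associated.

#include <smmintrin.h>   // SSE4.1 intrinsics (_mm_mullo_epi32)

struct point { int x, y, z, pad; };  // padded to fill a 128-bit register

int dot(const point *p1, const point *p2) {
  __m128i a = _mm_loadu_si128((const __m128i *) p1);
  __m128i b = _mm_loadu_si128((const __m128i *) p2);
  __m128i m = _mm_mullo_epi32(a, b);  // x1*x2, y1*y2, z1*z2 in one go
  int r[4];
  _mm_storeu_si128((__m128i *) r, m);
  return r[0] + r[1] + r[2];          // the padding lane is ignored
}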

What's the deal with re-ordering? It doesn't change the result? Yes, but only because we don't have any side effects here, nor do we need to respect a particular order in our memory accesses (think about accessing a hardware device, for example.)

So, rather than making too-restrictive rules (with the hope that clever compiler writers would find a way to auto-magically detect when the semantics can be broken), the people writing the standards for C and C++ decided to give maximum freedom and fewer constraints, with the drawback that some pieces of code have undefined behavior.

There's a lot of stuff to explain about this approach, but a good example is better: the standard doesn't give any rule on the order in which this kind of statement is executed…

t1[i++] = t2[i++];

This is undefined behavior and, for that matter, the compiler may decide to produce code that wipes out your drives, because the standard doesn't say it can't do that.

Going Parallel

And now, add some fun to that: multiple processors, multiple threads and a single memory. Do you begin to see what's coming next?

So, let's dive into the hardware first. When dealing with memory accesses in the presence of multiple processors, we have a lot of issues to solve.

Basically, the question is: do we need to synchronize memory operations?

Let's take a real-life example: at the crossing of two roads, we may add some traffic lights to avoid accidents. But most of the time we don't need them. Why? Because there aren't enough cars, and if we used traffic lights we would block cars when it's not needed.

For shared memory it's the same idea: most memory operations don't require any synchronization because they won't conflict. Beware, I'm not saying that there isn't any risk when two processors access the same memory cell; I'm saying that most of the time processors won't access the same memory cells. So, you'll need to specify what happens when you know conflicts can arise.

So, rather than synchronizing all memory operations, or even trying to synchronize all dangerous memory accesses, most hardware is based on what we call a relaxed memory model: you must explicitly describe when you need synchronization; when you don't, you have no guarantee of what will happen.

So, what kinds of synchronization do we have? (You'll find a quick C++11 translation right after the list.)

  • atomic operations: an atomic operation is a complex operation (doing more than one memory access) that executes as if it were a single access. The usual operations are atomic exchange, compare-and-swap…
  • acquire barrier: this barrier ensures that operations issued after the barrier won't be re-ordered before it (that is, the processor won't try to anticipate load operations that appear after the barrier.)
  • release barrier: this barrier ensures that all operations begun before the barrier will have finished before crossing it (that is, the processor won't defer concrete memory stores until after the barrier.)
  • full memory fence: a full memory fence is a kind of combined acquire and release barrier: all pending operations must finish before the barrier, and no new operation will start before it.
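
Before digging into what acquire and release really mean, note that C++11 (which we'll come back to below) has a direct spelling for each of these notions. A minimal sketch, just to show the vocabulary:

#include <atomic>

std::atomic<int> a(0);

void examples() {
  // atomic operations: complex operations that execute as a single access
  int old = a.exchange(42);                  // atomic exchange
  int expected = 42;
  a.compare_exchange_strong(expected, 7);    // compare-and-swap
  (void) old;

  // stand-alone barriers (fences)
  std::atomic_thread_fence(std::memory_order_acquire);  // acquire barrier
  std::atomic_thread_fence(std::memory_order_release);  // release barrier
  std::atomic_thread_fence(std::memory_order_seq_cst);  // full memory fence
}
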
The acquire/release thing often disturbs newcomers to the world of concurrent memory accesses. The first thing you need to know is that your processor doesn't act in a linear fashion, for a lot of good reasons. To be short: it is able to start loading data from memory as soon as it is able to decode the target address, even if some prior instructions are not yet finished; and it is able to defer store operations to a later point, that is, to start operations that come after the store before concretely storing the data (note that your compiler may also try to make similar changes to your code.)

The classic example is the I-have-finished boolean mistake. Let's take a look at the following pseudo-code:

shared int x = 0;
shared bool ok = false;

worker0() {
  x = 42;
  ok = true;
}

worker1() {
  if (ok)
    print(x);
  else
    print("not yet");
}

So, a quick reading lets you think that this program will either print "42" or "not yet", but never "0", right? Wrong!

The processor will probably defer some of the store operations: if writing to x takes longer, it can set ok to true before storing 42 in x. On the reader's side, it can also start reading x before ok, and thus read the old value of x and then the true flag in ok.

Barriers can be used this way here:

shared int x = 0;
shared bool ok = false;

worker0() {
  x = 42;
  release-barrier;
  ok = true;
}

worker1() {
  if (ok) {
    acquire-barrier;
    print(x);
  } else
    print("not yet");
}

This will solve the issue.
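
To anticipate a bit on the next section, here is my translation of that pseudo-code into C++11, using std::atomic and explicit fences (one possible spelling among several):

#include <atomic>
#include <cstdio>

int x = 0;
std::atomic<bool> ok(false);

void worker0() {
  x = 42;
  std::atomic_thread_fence(std::memory_order_release);  // release barrier
  ok.store(true, std::memory_order_relaxed);
}

void worker1() {
  if (ok.load(std::memory_order_relaxed)) {
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire barrier
    std::printf("%d\n", x);    // guaranteed to print 42 now
  } else {
    std::printf("not yet\n");
  }
}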

And thus, what about C and C++?

For years, the standards avoided the question. C99 and C++98 don't even mention that kind of issue as undefined behavior; no, they simply don't deal with concurrency at all. That lasted until the release of the 2011 standards, C11 and C++11 (in fact, the material had been around in C++0x for quite some time.)

What's the difference now? Hey, guess what? (Most) concurrent accesses are undefined behavior!

What a (r)evolution, you may say. But in fact it is one, because the standard also provides a way to introduce barriers. And thus, we are now able to code using atomic operations and memory barriers, without relying on the supposed behavior of a compiler and a processor.
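
As a small teaser, that kind of code is now expressible in portable C++11. Here is a minimal sketch (my own, not taken from the Michael & Scott paper) of the push side of a lock-free stack, built on compare-and-swap:

#include <atomic>

template <typename T>
struct lock_free_stack {
  struct node { T value; node *next; };
  std::atomic<node *> head;

  lock_free_stack() : head(nullptr) {}

  void push(T v) {
    node *n = new node;
    n->value = v;
    n->next = head.load(std::memory_order_relaxed);
    // Retry until no other thread has changed head between our load
    // and the compare-and-swap; on failure, n->next is refreshed with
    // the current value of head.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
      ;
  }
};

Popping is the hard part: you can't safely delete a node while another thread may still be reading it, which is exactly where the hazard pointers mentioned at the end of this post come in.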

If you have looked for implementations of, or discussions about, lock-free data structures, you may have noticed that most of them are done in Java. The main reason is that Java made the move to a relaxed memory model (basically the one I described to you) back in Java 1.5. So, you can rely on that model to justify your implementation, and be sure that your benchmarks make sense.

I hope that the arrival of the new C++ memory model will give us a chance to see more C++ implementations.

And the other stuff?

You want to know more about C++11 threads? Read the specs!

I've got some slides (in English) on the subject here: Threads

You can also find interesting information in this (old but still accurate) article by Maurice Herlihy: Wait-Free Synchronization (pdf format.)

The next article will talk about my implementation of Hazard Pointers, their use in lock-free queues and maybe some words about testing all that code.