Friday, November 8, 2013

More C++11 threading: I want my lock-free !

Last time, I tried to describe the changes in C++11 that impact multi-threading. Now, I'll talk about implementing lock-free data structures.

Sharing Memory


Before talking about lock-free, let me refresh your memory about the most important issue when writing parallel code: concurrent memory access.

If two (or more) threads share (i.e. use) a common memory location, like a counter or the entry point of a data structure, conflicts can arise each time they access that shared location.

The most striking example is the shared counter problem: some threads share a common integer (our shared counter) and try to simply increment it. That's it, they just want to do a simple ++i !

Even with this simple example, you'll get bad behavior. To convince you, I've got some interesting results: I ran a bunch of threads (from 2 to 256), each executing a simple loop (with exactly 2²² rounds) that does a simple ++i, with i a shared counter. Normally the final value should be number_of_threads * 2²², so we can compute a kind of error measure: (expected_value - real_value) / expected_value (in percent.) Here are the results:


Number of threads    Error
                2    49.24%
                4    68.99%
                8    84.21%
              256    89.87%
This test was run on an Intel Core i7-3770 (so 4 cores with 2 hardware threads per core.)

Impressive, isn't it ?
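
For the curious, here is roughly how such a test can be reproduced (a minimal sketch; thread_count is the parameter of the test, and the increment is deliberately left unprotected):

#include <iostream>
#include <thread>
#include <vector>

int counter = 0; // shared and unprotected: that's the point

void worker() {
  for (unsigned i = 0; i < (1u << 22); ++i)
    ++counter; // racy increment (formally a data race)
}

int main() {
  const unsigned thread_count = 4;
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < thread_count; ++i)
    threads.emplace_back(worker);
  for (auto& t : threads)
    t.join();
  double expected = double(thread_count) * (1u << 22);
  std::cout << "error: " << 100 * (expected - counter) / expected
            << " %" << std::endl;
}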

How can a simple increment be so wrong ? It's quite simple: to increment your shared variable you need at least two operations (from the memory point of view): first you load the value at the given location, then you store the newly computed value (based on the loaded value) back into the memory location. The issue is that between your load and your store, some other threads may have modified this value, and you won't see it.
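
Seen from the memory, the single ++i statement expands to something like this (pseudo code):

reg = load(&i);   // read the current value
reg = reg + 1;    // compute the new value
store(&i, reg);   // write it back: any concurrent update since the load is lost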

To solve that kind of issue, the usual solution is to use locks (for this particular example, you can now use atomic types.)
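
In C++11, the counter example can be fixed with an atomic integer; each increment then becomes a single indivisible read-modify-write (a minimal sketch):

#include <atomic>

std::atomic<int> counter(0);

void worker() {
  for (unsigned i = 0; i < (1u << 22); ++i)
    counter.fetch_add(1); // atomic increment: no update can be lost
}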

When dealing with a data structure (lists, queues, trees … ) protecting access is even more critical since not only may you lose information, you can also completely break the data structure.

Lock free ?


« OK, I got your point about protecting memory access, so what's that lock-free stuff ? »

Locks (any kind of lock) are probably the first kind of operation you want to avoid.

When a thread holds a lock, all other threads trying to obtain the same lock are blocked until the owner releases it. That's the expected purpose. But threads stay blocked even when the owner is not running.

The problem really arises when the number of threads or processes on the system grows. Once again, it's quite simple: if the thread holding the lock is scheduled out, all other threads waiting for the lock stay blocked while the owner of the lock isn't even using the data !

So, intuitively, a lock-free algorithm provides a global progression property: there can always be an active thread making progress.

« I've heard about wait-free algorithms, is that the same thing ? »

No … here are some definitions:

  • lock-free: an algorithm (or an operation on a data structure) is lock-free if « the suspension of one or more threads will not stop the potential progress of the remaining threads. »
  • wait-free: an algorithm is wait-free if « every operation has a bound on the number of steps the algorithm will take before the operation completes. »
In a lock-free algorithm, the system (that is, our set of threads) as a whole makes progress, while in a wait-free algorithm, every thread makes progress.

What do we need ?


Let me guess … brain grease ?

There are a lot of results about the primitives needed to implement lock-free or wait-free operations. The main result is that only one basic mechanism is needed, and two are available for the job:
  • Atomic Compare&Swap (CAS)
  • Load-Link/Store-Conditional (ll/sc)
A CAS belongs to the family of read-modify-write operations, but while most operations of this family don't offer a consensus number better than 2 (they can only be used to provide a lock-free algorithm for 2 processes), the CAS has an infinite consensus number. This operation is quite simple:

// executed atomically by the hardware: compare *mem with cmp and,
// only if they are equal, replace it with value
bool CAS(int *mem, int value, int cmp) {
  if (*mem == cmp) {
    *mem = value;
    return true;
  }
  return false;
}
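
In C++11, the same operation is exposed on atomic types as compare_exchange_strong() (and its weak variant, see below); note that the expected value is passed by reference and receives the observed value when the call fails:

std::atomic<int> mem(0);

int expected = 0;
// if mem == expected: mem is set to 42 and the call returns true;
// otherwise: expected receives the current value of mem and the call returns false
bool ok = mem.compare_exchange_strong(expected, 42);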

On the other hand, ll/sc is a pair of operations: the first one (load-link) loads the content of a shared memory location and keeps track of that load, then the store-conditional updates the memory location, but only if it hasn't changed in the meantime.

The ll/sc mechanism is, theoretically, more powerful than a simple CAS. But that's only in theory. For various reasons, most (if not all) concrete ll/sc implementations miss the theoretical semantics, breaking a lot of algorithms.

Since implementing CAS with ll/sc is simple (even with a broken ll/sc), most algorithms prefer CAS.
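
For illustration, here is how a CAS can be sketched on top of ll/sc (pseudo code: load_link() and store_conditional() stand for the hardware primitives):

bool CAS(int *mem, int value, int cmp) {
  do {
    if (load_link(mem) != cmp)
      return false;             // the value differs: fail, like a CAS would
  } while (!store_conditional(mem, value)); // retry on spurious sc failures
  return true;
}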

How do we use CAS ?


So, we found that the CAS operation will be our primitive. It is available in the C++11 atomic operations, so we can use it (if you take a look at that article, you'll find a template providing double-word compare-and-swap for x86 and x86-64 using inline ASM.)

Here are some examples using CAS to provide fetch&add and to implement a simple spin-lock.

First the fetch&add (like operator +=):

template <typename T>
T fetch_and_add(std::atomic<T>& x, T value) {
  T             tmp = x.load();
  // on failure, compare_exchange_weak reloads the current value into tmp
  while (!x.compare_exchange_weak(tmp, tmp + value)) {}
  return tmp + value;
}

And the simple spin-lock class:

class mylock {
public:

  struct guard {
    guard(mylock& _lock) : lock(_lock) {
      lock.lock();
    }
    ~guard() {
      lock.unlock();
    }
    mylock&           lock;
  };

  mylock() : _on(0) {}
  mylock(const mylock& x) = delete;

  void lock() {
    // the C++11 CAS needs a reference to the expected value
    int                 tmp = 0;
    while (!_on.compare_exchange_weak(tmp, 1)) {
      tmp = 0; // needed since the C++11 CAS stores the observed value in tmp
      std::this_thread::yield(); // don't waste cycles
    }
  }

  void unlock() {
    _on = 0; // a plain atomic store is enough here
  }

  bool try_lock() {
    int                 tmp = 0;
    return _on.compare_exchange_weak(tmp, 1);
  }

private:
  // we use a CAS on an int for the demo, but C++11 provides atomic_flag,
  // which would be more appropriate here.
  std::atomic<int>      _on;
};
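
Usage follows the classical RAII pattern:

mylock          lock;
int             shared_data = 0;

void safe_update() {
  mylock::guard g(lock); // takes the lock
  ++shared_data;         // protected section
}                        // releases the lock when g goes out of scope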

So, as you can see, most of the time we use a kind of retry-loop scheme to achieve our goal.

You may have noticed that I used compare_exchange_weak() in my code. What's this weak all about ? According to the documentation, the weak version may fail spuriously but is cheaper than the strong version when used in a retry loop. That's a matter of lower-level implementation (typically, on ll/sc architectures the weak version can map to a single ll/sc attempt), and since I'm using a loop anyway, I'll use the weak version.

Lock Free Queue ?


So, how do I implement a lock free queue ?

There are a few main ideas behind this kind of data structure: being able to fail and retry (like a transaction), finding linearization points and trying to concentrate global structure modifications into one (or possibly two) atomic operations, and finally, finding a way to finish other threads' work.

A linearization point is the point where the effects of an operation appear to happen for observers (other threads.)

For example, suppose you have a pointer to the first element of a linked list and want to delete that first element. We want something like this (where Head is a shared global entry point of some list):

tmp  = Head;      // retrieve the head pointer
Head = tmp->next; // cut off the head
...               // do what you want with it

The main issue in a concurrent context is that Head and Head->next may change while we're using them. Also note that the linearization point is when I switch the head with its successor. This operation must be atomic and thus needs a CAS.

Let's Build That Queue


So, the global idea is to fetch the pointers we need and try to switch the head; if that fails (i.e. the head has changed since our previous read), we retry. Here is a possible implementation (pseudo code):

do {
  tmp  = Head;      // fetch the head
  next = tmp->next; // fetch the next element
  // be sure our pointers are still accurate
  if (tmp != Head) continue;
  // get the content we want, then switch the head and the next element
  // (with our CAS above: if Head == tmp then Head = next)
  if (CAS(&Head, next, tmp)) break;
  // if the CAS fails, try again !
} while (true);

This example doesn't handle the empty-list case and some other details. To avoid issues when dealing with an empty list or a list with one element (there are more issues with one element than with an empty list), we use a sentinel: the head of the list is always a dummy node; only the second one matters. Here is a complete version of a pop operation (C++11 version):

struct List {
  struct node {
    int                 value;
    std::atomic<node*>  next;
    node() : next(0) {}
    node(int x) : value(x), next(0) {}
  };

  std::atomic<node*>    Head, Tail;

  List() {
    node *sentinel = new node(); // dummy head node
    Head = sentinel;
    Tail = sentinel;
  }

  bool pop(int *lvalue) {
    node       *head, *tail, *next;
    do {
      // acquire pointers
      head = Head.load();
      tail = Tail.load();
      next = head->next.load();
      // check that we're coherent
      if (Head.load() != head) continue;
      if (next == 0) return false; // empty queue
      if (head == tail) {
        // missing completion on tail (see later):
        // do the job of an unfinished push
        Tail.compare_exchange_weak(tail, next);
        continue;
      }
      *lvalue = next->value; // copy the value
      // done ? try to cut the head off
      if (Head.compare_exchange_weak(head, next)) break;
      // failed ? retry !
    } while (true);
    // the old head should be released or recycled here (see memory issues below)
    return true;
  }

  void push(int x); // see later
};

The next step is the push operation. While for the pop operation we only have one modification of the list (replacing the head with its next), for the push operation we need two modifications: changing the next pointer of the tail element and then replacing the tail with the new element.

A thread can arrive on the queue with the next field of the tail element not null (that is: someone has attached a new element but hasn't updated the tail pointer yet.) In that case, the current thread can try to finish the operation. We've already done that in the pop operation: when the head and the tail are equal but their next pointer is not null, the popping thread can do the CAS itself to update the tail.

An interesting point with this CAS: if the original pushing thread does the update itself between the test and the CAS of the other thread (the one that wants to finish the job), the CAS will fail silently and without any consequences, since the tail pointer has already been updated !

To summarize: when pushing, you build your new list cell and try to attach it to the last cell (the tail) of the list; once that's done you can replace the tail pointer with a pointer to the new cell. During an update (push or pop), if a thread finds the queue in an intermediate state, it can do the update itself.

Here is the code:

void List::push(int x) {
  node         *n = new node(x); // don't shadow the type name
  node         *tail, *next;
  do {
    tail = Tail.load();
    if (Tail.load() != tail) continue;
    next = tail->next.load();
    if (next != 0) {
      // missing completion on tail: finish the other thread's push
      Tail.compare_exchange_weak(tail, next);
      continue;
    }
    // attach the new node after the tail (next is null here)
    if (tail->next.compare_exchange_weak(next, n)) break;
  } while (true);
  // finally update the tail (may fail, but we don't care)
  Tail.compare_exchange_weak(tail, n);
}

Note that the final CAS can fail, but if it does, the only possible reason is that someone has already done the job for us, so we don't need to care !
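
Putting everything together, the queue is used like any other (a small usage sketch):

List queue;

void example() {
  int value;
  queue.push(42);
  if (queue.pop(&value))
    std::cout << value << std::endl; // prints 42
}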

Memory Issues


In the previous example we avoided the question of releasing/recycling memory, for a good reason.

Why ?

You could delete the list cell in the pop operation once you've swapped the head with its next pointer (at the end of the function, after the loop.) But are you sure you can safely delete it ?

What can go wrong ? Some other thread holds the same pointer and may try to use it.

Once again, it's a question of timing: a thread trying to take an element from the queue will keep and use the pointer to the head in order to access its next field, so if the corresponding node has been freed, this access may fail.

A lot of lock-free implementations are done in Java, where there's a garbage collector: a node won't be freed as long as some thread can still reach it. But in C++ there's no such thing …

A possible solution to avoid a segfault is to use a recycling allocator: cells are not deleted but reused as new nodes. Thus, accessing a node will always make sense, and won't fail.
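
A minimal sketch of the idea, reusing our node type (note that this naive lock-free free-list itself has the very problem described below; it's only here to show the principle):

// nodes go to a free-list instead of being deleted
std::atomic<List::node*> Free(0);

void recycle(List::node *n) {
  List::node *top = Free.load();
  do {
    n->next = top;
  } while (!Free.compare_exchange_weak(top, n));
}

List::node *alloc_node() {
  List::node *top = Free.load();
  while (top != 0 && !Free.compare_exchange_weak(top, top->next)) {}
  return top != 0 ? top : new List::node();
}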

Are we safe now ? No, another issue arises with recycling: the ABA problem.

ABA ?

No, we're not talking about Swedish pop music. The ABA issue is the weak point of the CAS operation: observing only whether a pointer has changed may not be enough.

In brief: a first thread reads the head of the queue, finds a pointer A, and gets scheduled out. During its inactivity, another thread pops the first element and the head pointer becomes B; after some more changes, the head pointer becomes A again (but its content has changed.) The first thread won't see that change: it has stored the head and its next, and the stored next is now wrong, but the CAS won't fail and we lose data coherency !
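
Translated to our queue, a possible timeline (with recycled nodes) looks like this:

// Thread 1: head = Head (== A); next = head->next (== B); gets scheduled out
// Thread 2: pops A, pops B, recycles both, then pushes A back:
//           Head is A again, but A->next is no longer B
// Thread 1: wakes up, Head.compare_exchange_weak(head, next) succeeds
//           (Head still holds A) and installs B as the new head,
//           a node that is not in the queue anymore !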

(Figure: the ABA issue in a stack)

There are various solutions to the ABA problem:

  1. Use a garbage collector (like in Java)
  2. Use the so-called IBM solution with tagged pointers
  3. Use Hazard Pointers
The GC solution is still not available in C++ and, in fact, it's not really interesting. Let's look at the other two.

Tagged Pointers


The idea of tagged pointers is to associate a counter with our pointer and manipulate the pair (stored contiguously) as a whole value. The counter is not a reference counter: it's not really possible to implement an efficient ref-count in a lock-free way. The counter here is just a kind of tag that we increment each time we reuse the pointer. So when comparing the pointer/counter pair, even if the pointer has the same value, the counter will differ if the pointer has been reused, and the CAS will fail.

This technique is quite simple and efficient, but it requires a double-word CAS, that is, a CAS that works on a single location spanning two contiguous words (not to be confused with a double CAS, which works like two CAS on two different locations.) Such an operation exists on most 32-bit processors but is only available on modern x86 processors with 64-bit extensions. C++11 provides 16-, 32- and 64-bit CAS for atomic types, but no 128-bit version.
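
The pair can be sketched like this (the 16-byte CAS itself relies on cmpxchg16b, see the article linked below):

#include <cstdint>

struct tagged_ptr {
  node     *ptr; // the pointer itself
  uint64_t  tag; // incremented at each reuse of ptr
};
// two tagged_ptr values are equal only if both fields match: even if a
// recycled pointer comes back (same ptr), its tag has changed, and the
// double-word CAS comparing the whole pair will fail.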

This method is used in the original article describing the lock-free queue: Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms

I've also written an article explaining how to use the cmpxchg16b instruction using C++ templates and inline ASM: Implementing generic double-word compare and swap for x86/x86-64

Hazard Pointers


Hazard pointers are a technique described in another article by Maged Michael (one of the authors of the work on the lock-free queue): Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (pdf)

It's like a simplified GC: each thread owns some hazard pointer variables where it stores the pointers that need to be preserved, and before a thread actually deletes a pointer, it first verifies that this pointer is not in use (i.e. stored in a hazard pointer variable.)

Hazard pointer variables (HPs) are in fact shared variables that are written by one thread and readable by all others. When a thread wants to delete a pointer, it first adds it to a wait-list; when this list reaches a threshold, the thread retrieves all the pointers currently stored in HPs and deletes only those that are not in use.
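
In our pop operation, protecting the head would look roughly like this (a sketch: hp0 stands for the current thread's hazard pointer slot, the rest of the loop is unchanged):

do {
  head = Head.load();
  hp0.store(head);          // publish: "I'm using this node"
  if (Head.load() != head)  // re-check: head may have been popped before
    continue;               // our publication became visible
  next = head->next.load(); // safe: head can't be freed while in hp0
  // ... rest of the pop loop ...
} while (true);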

Despite some tricky details about managing the attribution of HPs, the implementation is quite simple; it is wait-free, and with the right data structures it can be pretty fast. In practice, I've seen no performance difference with the IBM technique (tagged pointers.)

I've got a full implementation available here: hazard_pointer.hh

More ?


So, that's just a beginning. I've started playing around with implementations of all those data structures (available in a beta version) and you can take a look at them in my bitbucket repo: https://bitbucket.org/marwan_burelle/lockfreeexperiment

Of course there's a lot of information on the net, but start with the two papers already cited (on the lock-free queue and on hazard pointers.)

My next step will be to implement wait-free data structures using ideas presented in recent works that report very good performance, but that's another story …