Monday, August 3, 2015

Reference counting: practice and trouble

Reference counting is a well known strategy used to provide simple garbage collection. As automatic garbage collection, it has a lot of issues among which not correctly handling circular data-structures is the most obvious.
On the other hand, used for semi-automatic GC, it is pretty accurate, easy to implement and solves probably most of the use cases. People using C++11 shared_ptr probably see what I mean.
But I'm here to talk about concurrent programming and especially non-blocking data-structures. There’s a lot of academic explaining why ref-count won't help you, but still a lot of people think that we can use ref-count without impact.

Safe world: sequential code and lock protection

The classical way of using ref-counting, in a safe context, is to store the counter in the data (the memory cell.) While the accounted pointer is reachable, the counter count at least one reference corresponding to the placeholder of the pointer and thus will never get down to 0 and, thus, memory won't be freed.
Being alone on the data (due to lock protection or simply because there’s only one thread currently running) a thread can safely increment the pointer without any risk of hazardous memory access. Even when using optimistic or lazy locking, locks may prevent conflicting increment/decrement situations.
Anyway, we can present a simple and usable ref-count in C using intrusive data structure.

Simple and generic ref-count in C

We now introduce a simple ref-count using intrusive data-structure. Intrusive data structures have been discussed in this blog in a previous post.
In our case, what we need is: a small structure containing the counter and a pointer to the delete function, an operation to increase the counter, an operation to decrease it and eventually call the delete function and finally some macro to retrieve the parent structure pointer from the pointer of our intrusive structure. Here is the basic code (without any locking support):
/* Simple ref-count */

# ifndef REF_COUNT_H_
# define REF_COUNT_H_

// needed for offsetof
# include <stddef.h>

// Intrusive tools to retrieve access to the including structure
# define containerof(TYPE_, FIELD_, PTR_)               \
  ((TYPE_*)((char*)(PTR_) - offsetof(TYPE_, FIELD_)))

// Typedef for the delete function pointer
// increase readability
typedef void (*delete_ptr)(void *);

// The ref-count intrusive structure
struct ref_count {
  unsigned             counter;
  delete_ptr           deleter;
};

// Init function, set the counter to 1
static inline
void ref_count_init(struct ref_count *ref, delete_ptr ptr) {
  ref->counter = 1;
  ref->deleter = ptr;
}

// Increasing count
static inline
void ref_count_get(struct ref_count *ref) {
  ref->counter += 1;
}

// Decreasing count
// call deleter if count get downto 0
static inline
void ref_count_release(struct ref_count *ref) {
  ref->counter -= 1;
  if (ref->counter == 0)
    ref->deleter(ref);
}

# endif /* REF_COUNT_H_ */
And, because that’s better with usage example:
// Testing our ref-count

# include <stdio.h>
# include <stdlib.h>

# include "ref_count.h"

// Some data
// last field ref is our ref-count intrusive structure
struct data {
  int                  an_int;
  float                a_float;
  struct ref_count     ref;
};

// Now the delete function
void data_delete(void *p) {
  struct data         *main_struct;
  main_struct = containerof(struct data, ref, p);
  free(main_struct);
}

// A builder for our data
struct data* new_data(int a, float b) {
  struct data         *d;
  d = malloc(sizeof (struct data));
  d->an_int = a;
  d->a_float = b;
  ref_count_init(&(d->ref), data_delete);
  return d;
};

// A toy function using the data
void do_some_stuff(struct data *d) {
  // Holding the pointer, get it !
  ref_count_get(&(d->ref));
  printf("Data (%p) : %d %g\n", d, d->an_int, d->a_float);
  // We're done, release the pointer
  ref_count_release(&(d->ref));
}

// run using valgrind or clang's address sanitizer to check for memory leaks.
int main() {
  struct data         *d;
  d = new_data(42, 3.14);       // count already set to 1
  do_some_stuff(d);             // get and release
  do_some_stuff(d);             // get and release
  ref_count_release(&(d->ref)); // we're done
  return 0;
}
==20934== Memcheck, a memory error detector
==20934== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==20934== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==20934== Command: ./testing
==20934== 
Data (0x51d8040) : 42 3.14
Data (0x51d8040) : 42 3.14
==20934== 
==20934== HEAP SUMMARY:
==20934==     in use at exit: 0 bytes in 0 blocks
==20934==   total heap usage: 1 allocs, 1 frees, 24 bytes allocated
==20934== 
==20934== All heap blocks were freed -- no leaks are possible
==20934== 
==20934== For counts of detected and suppressed errors, rerun with: -v
==20934== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
Of course, this implementation is not thread safe. Having a safe locking mechanism protecting the ref-count is not that easy and is outside of the scope of this article.

More complete example: data in 2 containers

Here I provide an example using intrusive lists as containers and where data chunks belong to two lists at the same time. Follow code comments:
First the intrusive list implementation:
// list.h : Basic intrusive lists

# ifndef LIST_H_
# define LIST_H_

# include <stddef.h>

// No explicit ref-count, but we need containerof
# include "ref_count.h"

struct list {
  struct list          *next, *prev;
};

// Init circular double linked list
// Used to init sentinel
static inline
void list_init(struct list *l) {
  l->next = l->prev = l;
}

// list is empty
static inline
int list_is_empty(struct list *l) {
  return l->next == l;
}

// Insert before current node
static inline
void list_insert_before(struct list *entry, struct list *cell) {
  cell->next = entry;
  cell->prev = entry->prev;
  entry->prev = cell;
  cell->prev->next = cell;
}

// Insert after current node
static inline
void list_insert_after(struct list *entry, struct list *cell) {
  cell->prev = entry;
  cell->next = entry->next;
  entry->next = cell;
  cell->next->prev = cell;
}

// Detach current node
static inline
void list_detach(struct list *entry) {
  entry->prev->next = entry->next;
  entry->next->prev = entry->prev;
  entry->next = NULL;
  entry->prev = NULL;
}

// Foreach tools

// Simple foreach
// entry = list entry point (sentinel)
// cur   = variable used as iterator, only active in loop
# define list_foreach(entry, cur)                                            \
  for (struct list *cur = entry->next; cur != entry; cur = cur->next)


// deque like interface
// should be used only with sentinel as input
// mainly wrappers

// deque is empty
int deque_is_empty(struct list *entry) {
  return entry->next == entry;
}

// Push front
static inline
void deque_push_front(struct list *entry, struct list *elm) {
  list_insert_after(entry, elm);
}

// Push back
static inline
void deque_push_back(struct list *entry, struct list *elm) {
  list_insert_before(entry, elm);
}

// Getting (not detached) front cursor
// return NULL if empty
static inline
struct list* deque_front(struct list *entry) {
  if (deque_is_empty(entry)) return NULL;
  return entry->next;
}

// Getting (not detached) back cursor
// return NULL if empty
static inline
struct list* deque_back(struct list *entry) {
  if (deque_is_empty(entry)) return NULL;
  return entry->prev;
}

# endif /* LIST_H_ */
And now the code:
// Demonstration of intrusive lists and ref-count

# define _XOPEN_SOURCE 500

# include <stdlib.h>
# include <stdio.h>

# include "list.h"
# include "ref_count.h"

// Kind of a selection sort
// we look for the smallest element in a list and push it in front of the
// deque but only if it is not already there
// After building the deque, we remove clear the original list, then print the
// content of the deque and finally clear it also

// Data: simple int
// Data structure: each element belongs to 2 lists and has a ref-count

struct data {
  int                   val;
  struct list           main_list;
  struct list           deque;
  struct ref_count      ref;
};

// Now the delete function
void data_delete(void *p) {
  struct data          *main_struct;
  main_struct = containerof(struct data, ref, p);
  free(main_struct);
}

// For simplicity we put almost everything in a single function
// sentinel for each list are local entity, and thus automatically freed

void example(size_t len) {
  // Sentinels
  struct list           list_entry;
  struct list           deque_entry;

  // Building the list
  list_init(&list_entry);
  for (size_t i = 0; i < len; i++) {
    struct data        *cell = malloc(sizeof (struct data));
    cell->val = random() % 100;
    // init ref counting, no need to account current copy
    ref_count_init(&(cell->ref), data_delete);
    list_insert_after(&list_entry, &(cell->main_list));
  }
  // Display the list
  list_foreach((&list_entry), iter) {
    struct data        *cell;
    cell = containerof(struct data, main_list, iter);
    printf("%02d->", cell->val);
  }
  printf("end\n");

  // Starting the sort
  // iterate over the list to find smallest element
  // add the cell in the deque if it is not already thre

  // lastval in order to avoid the same value again and again
  // No negative entry, use -1 as initial value
  int                   lastval = -1;
  list_init(&deque_entry);
  for (size_t i = 0; i < len; i++) {
    struct data        *min = NULL;
    int                 minval = 100; // maximum possible value
    list_foreach((&list_entry), iter) {
      struct data      *cell;
      cell = containerof(struct data, main_list, iter);
      if (cell->val > lastval && cell->val < minval) {
        min = cell;
        minval = min->val;
      }
    }
    if (min != NULL && min->val > lastval) {
      // Add an element
      ref_count_get(&(min->ref));
      deque_push_back(&deque_entry, &(min->deque));
      lastval = min->val;
    }
  }

  // Display the deque
  list_foreach((&deque_entry), iter) {
    struct data        *cell;
    cell = containerof(struct data, deque, iter);
    printf("%02d->", cell->val);
  }
  printf("end\n");

  // Clear the main list
  while (!list_is_empty(&list_entry)) {
    // Extract next element
    struct list        *iter = list_entry.next;
    list_detach(iter);
    // No more ref in the list
    ref_count_release(&(containerof(struct data, main_list, iter)->ref));
  }

  // Display the deque
  list_foreach((&deque_entry), iter) {
    struct data        *cell;
    cell = containerof(struct data, deque, iter);
    printf("%02d->", cell->val);
  }
  printf("end\n");

  // Now clear the deque
  while (!deque_is_empty(&deque_entry)) {
    struct data        *cell;
    cell = containerof(struct data, deque, deque_front(&deque_entry));
    list_detach(&(cell->deque));
    // No more ref in the list
    ref_count_release(&(cell->ref));
  }

}

int main() {
  example(10);
  return 0;
}
Once again, the code runs without any memory leak (try with valgrind … )

Trying to be non-blocking

Obviously, using ref-counting in a multithreaded context requires locks.
The legitimate question is: if we use atomic integer for the counter and we limit the usage of our ref-count to some safe situation, can we have non-blocking ref-count ?

Safe Concurrent Ref-counting ?

Let’s consider the following case: we have a pointer between shared between a group of reader threads and a single updater thread, updates are only performed by replacing the pointer, not modifying the content. This is, basically, the traditional example used to illustrate RCU.
During the update, the writer thread first get the pointer, allocate a new one, eventually copy the old data in the new allocated area and when ready atomically replace the old pointer with new one. Once the pointer has been replaced, no readers can access it, and the writer can then wait for readers still active on the old data before releasing the memory.
Now that we want to implement the wait for readers stage using ref-count on pointers. The protected pointer goes through two states: before update, since it is attached to some placeholders, its ref-count is at least 1 even if no threads hold it; after the update, since the pointer is not anymore attached to the placeholders, its ref-count value reflect the number of still active readers. The nice aspect of this is that starting from the update, the ref-count is now only decreasing, and thus, when the last reader release the pointer, automatic garbage collection frees the memory.
So, if the pointer is visible, you have the guarantee that the count will never get down to 0 and thus it will stay valid, on the other hand, when the pointer has been replaced, no new threads can grab it and increase the counter.
Stated this way, it seems that we have a less difficult synchronization problem: we just need to be sure that counts are correctly handled, and a simple atomic storage can provide that. Are we really safe ?
In parallel computing, issues rise in the details, and in the complexity of transition states. Here, the potential problem may rise when the pointer is replaced. Consider the following schema:
  • no reader hold the pointer
  • a writer is active and ready to update the pointer
  • a new reader arrive and get its hand on the shared pointer before the update
  • the writer performs the change, decreasing the ref-counter, at this point, the only remaining reference is in the writer’s hand.
  • the writer releases its own copy before the reader had time to increase the counter (the reader may have been scheduled for example)
  • Being the last thread holding a reference, the writer thread see the ref-count getting downto 0 and enter in the memory release code
  • now, even if the reader succeed at increasing the ref-count, the pointer will be invalidated, causing potential improper dereference.
We're facing a classical race-condition.
In order to prevent that kind of situation, you need lock, or more clever constructions (hey RCU here you are … ), but the ref-count doesn’t help alone. The problem is that a reader need to take two actions: grab the data and increase the counter at the same time, there’s no atomic operations able to do that. Eventually, you can think of a kind of tagged pointer, where the ref-count is contiguously attached to the pointer, that sounds good, but it means that getting the pointer no longer requires one atomic load, but a fail/retry loop of compare-and-swap which will be expensive when in the presence of a lot of readers.

Ref-counting and Lock-free Data Structures

Another use-case we may consider is life-time of memory cells in lock-free data structures. As an example, I’ll consider the classical lock-free queue of 
Micheal&Scott Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.
In this case we have two problem related to pointers: use after free and ABA. 
Let’s have a quick overview of the part of the algorithm that introduce these 
issues:
  • The queue is single linked list with 2 entry points (tail and head/)
  • The head points to a recycled sentinel which points to the real top of the queue.
  • Update operations are based on compare-and-swap.
  • A consumer (after checking for emptyness) will get the value from the next of the head, detach the old head and replace it with its next (recycling the element as a sentinel.)
The problem: an acting consumer grab the head pointer and a pointer to its next element and keep it until it succeeds in updating the head pointer or fail (and the retry its operation.) Consider what happen to the head: when the a consumer thread arrives, the first things it does is to get the head pointer, but in the same time, another active consumer may finish its job and update this pointer. Atomicity, enforce the fact that the new consumer thread will either see the old head or the new head, if it sees the old one, it’ll try to access the content in order to grab the next pointer. Again, in a race condition scenario, if the thread that update the head had time to release the memory cell, the new thread will dereference a freed pointer.
Since the lock-free algorithm relies heavily on compare-and-swap and thus often check pointers, one can avoid the use-after-free by simply recycling the memory cells rather than freeing them. Now, we introduce, or at least we increase the risk of, another problem: ABA.
ABA is a classical problem associated with compare-and-swap updates. Most CAS based algorithms check that a pointer hasn't changed before updating it, in the presence of recycling, it may happen that the same pointer has been detached and then reused while the thread was scheduled. Thus, seeing the same pointer doesn’t mean that it hasn’t change.
Again, can we use ref-count to solve these issues ?
Like before, the idea is that the pointer has two different life: when its attached (and its ref-count will be at least 1) and when it has been detached (and then its ref-count will only decrease.) The problem is still in the transition phase. Just consider the use-after-free scenario: when the new consumer thread get the head pointer, it may fetch a pointer that will be detached immediately after. Once again, if add a ref-count, where is it ? And how can we fetch the pointer and increment the ref-count at the same time so that it won’t be freed ? Race condition, again !
As expected (this isn't a new result), ref-count is not the right tools for non-blocking algorithm.

Solutions

In the previous examples, I describes problems that can’t be solved using ref-counting, but is there any solution anyway ?

RCU, Hazard Pointers and Procrastination Based Programming

Managing life-time of pointers in a (almost) non-blocking manner, can be done using techniques that basically belongs to procrastination or relativistic programming. The base idea is that you don't need to do it as soon as you can, but just wait for a better moment.
RCU is an approach heavily used in the Linux kernel and is becoming available in userland. How does it work ? I’ll write something about it later, but in short, the writer (the one who wants to delete the pointer) waits while readers are still active. How does he now that readers are active ? All threads go through a quiescent state from time to time where they no longer need coherency, all the job of the RCU framework is to provide the simplest possible code (on the reader side) that provides the correct wait period. Things are not exactly simple in userland, but it works. A lot of ressources are available in Introduction to RCU
Hazard Pointers are a little bit more classical. They provide two aspects: deferred deletes and protection for in use pointers. Each thread has a set of hazard pointers (variable shared for reading only) where they put their current pointers; on the delete side, pointers scheduled for delete are accumulated in a wait list, when this list reach a given size, the current thread take the difference between the wait list and the pointers in the hazard pointers of other threads and delete the result. The original article presents most of what you need to know on the subject Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.

Is it worth the effort ?

I've made some experiment with my own hazard pointers implementation and C++11 shared_ptr, in order to see the impact of synchronization in smart pointer using ref-counting.
The experiment is pretty simple: a bunch of threads reading a shared buffer, eventually another thread comes and updates the buffer (change the pointer). The buffer is not protected by synchronization but we fetch the pointer using atomic load (with full sequential consistency.) As a reference, we have a version with only readers and no extensions related to memory reclamation (not needed in this case.) The two others versions support only reading and sporadic updates, the first one use smart pointers and the second uses hazard pointers.
Hazard pointers add some code to the readers: once the pointer has been fetched and stored in the hazard pointer, we need to check that it hasn’t change. The pointer is only protected when stored in the hazard pointer, meaning that you can have an update between the first fetch and the storage in the HP, with the risk of seeing the pointer invalidated between the two operations.
The read itself only do a kind of strlen on a buffer of 256 bytes, exactly 2²⁴ times, while the reader updates the buffer with a sleeping time of 50 micro-seconds between each update.
This is a micro-bench, and thus you should take it with the usual care, but results are pretty self explanatory: smart pointers performances are way down below the others.
Experiment were run on a two dual-core CPU (Dual-Core AMD Opteron Processor 2220) thus four physical cores, with 16GB of RAM in two NUMA banks. The code were produced with gcc 5.1.0 using -O3 flag. I use out of the box Archlinux distribution (and thus pre-emptive kernel.) Timings are done inside the code directly using C++11 monotonic wall-clock.
Raw results for reading only cases: the table presents execution time for each version and for different number of threads.
Version24816
No sync1.678291.6723153.275036.50652
shared_ptr13.1170526.932262.7429129.811
Hazard ptr1.814621.7698053.493456.977785
The second table presents results in the presence of a single writer:
Version24816
shared_ptr8.8009528.933965.05137.608
Hazard ptr1.784791.810973.568377.05612
Read operations per second per thread
Read operations per second per thread


Conclusion

So, while ref-counting can be useful tool in a lot of cases (like in my simple C piece of code) using it in multithreaded environment may become expensive.
I started this article because I was, once again, trap by the false idea that we can use ref-count (in some specific cases) safely, but as you can see, removing constraints doesn’t solve the synchronization issue. While I'm speaking, there isn't official implementation of real non-blocking memory reclamation techniques like hazard pointers. A userland RCU implementation is available (and I’m in the process of writing my own in C++ style) but it’s not really portable (at least, it works on most modern Unixes, but not all architectures.)
There exists other approaches, like tagged pointers, that solves ABA problems in lock-free data structures, but as it seems, deferred memory reclamations seems to be pretty efficient. I strongly encourage you to read RCU papers (both web articles about its uses in Linux and academics ones.) With working RCU, a lot of concurrent accesses problems can be solved easily with almost no specific code and almost no overhead on the reader side.