Tuesday, July 1, 2014

Playing With Parallel Computing and Graphs

Parallel Algorithms ?

Parallel programming is an important field in modern programming; it introduces issues that traditional (sequential) programming doesn’t have:
  • Determinism
  • Concurrent memory access
  • Communication
  • Synchronization
And of course, we need to design new algorithms for a lot of problems that already have sequential solutions. In some cases the migration is obvious or doesn’t really need specific attention; in others it may be impossible.

Divide and Conquer

The divide and conquer principle is a classical strategy in algorithm design: the idea is to split the task into smaller pieces (divide) and merge the sub-results (conquer.) Lots of algorithms are based on this principle, the best known probably being QuickSort.
Divide and conquer strategies are also well suited to parallelism: divided tasks can easily be distributed among threads. In fact, lots of recursive algorithms (and even iterative ones) can be re-implemented that way.
As a small example, consider finding the smallest element in an unsorted array. The classical strategy is to scan the whole array sequentially. We can rewrite this in the following way:
int find_min(int *begin, int *end) {
  if (end - begin == 1)
    return *begin;
  int lmin, hmin;
  int *mid = begin + (end - begin)/2;
  lmin = find_min(begin, mid);
  hmin = find_min(mid, end);
  return lmin < hmin ? lmin : hmin;
}
This implementation is less efficient than the traditional scan when run sequentially, but the recursive calls can be executed in separate threads to obtain a parallel version of the code. In practice, to get an efficient algorithm, you will need a split bound (a minimal size below which you fall back to the sequential scan) and probably a task system to avoid starting a new thread for each call.
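As an illustration (not the article’s actual code), a goroutine-based variant of find_min with a split bound might look like this in Go; the function and parameter names are mine:

```go
package main

import "fmt"

// findMin returns the smallest element of a non-empty slice.
// Below splitBound it falls back to a sequential scan; above it,
// the upper half is computed in a separate goroutine.
func findMin(a []int, splitBound int) int {
	if len(a) <= splitBound {
		min := a[0]
		for _, v := range a[1:] {
			if v < min {
				min = v
			}
		}
		return min
	}
	mid := len(a) / 2
	ch := make(chan int)
	go func() { ch <- findMin(a[mid:], splitBound) }()
	lmin := findMin(a[:mid], splitBound)
	hmin := <-ch // wait for the upper half
	if lmin < hmin {
		return lmin
	}
	return hmin
}

func main() {
	data := []int{7, 3, 9, 1, 4, 8, 2, 6}
	fmt.Println(findMin(data, 2)) // prints 1
}
```

Note that without a real task system each recursion level still spawns a goroutine; goroutines are cheap, but a proper implementation would bound them too.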

Parallel or Distributed ?

There are lots of definitions in the parallel computing world, but we can distinguish two major groups of parallel programs depending on whether data sharing is based on shared memory or on communication. For the sake of clarity, I’ll call parallel algorithms those using shared memory and distributed algorithms those relying on communication.
Both kinds have different advantages and, of course, issues. In fact, advantages and issues are tightly coupled: sharing memory permits efficient data exchange while inducing a lot of concurrent access issues (and thus determinism issues); on the other hand, communication suppresses concurrent access issues but introduces synchronization and communication overhead. You may use communication in a multi-threaded context, but if you plan to really distribute your application across multiple processors over a network, you can’t use shared memory (yeah, I know, that’s obvious.)
Now, what is interesting in the divide and conquer strategy is that you never really share the data (most of the time). In fact, the divide part can be simplified to the point where you just split index references to the original data and restrict data exchange to the recollecting part (conquer.)
OK, that’s the theory; in practice things are far more complicated, otherwise computer scientists in the parallel and distributed fields would have lost their jobs a long time ago.

An Example: Computing Connected Components in an Undirected Graph

The purpose of the following example is not to describe a new or particularly efficient algorithm, but to study how we can derive a parallel implementation from a sequential algorithm.
Let’s consider a classical algorithm for computing connected components in an undirected graph: union-find.
The principle is very straightforward: we associate a leader to each vertex and, when we encounter a new edge, we search for the leaders of the two ends of the edge and (if they differ) make one the leader of the other. Here are the two basic functions, assuming that vertices have a unique integer identifier (ranging from 0 to the order of the graph):
int find(int leaders[], int v) {
  while (v != leaders[v])
    v = leaders[v];
  return v;
}

/* "union" is a reserved keyword in C, hence the name "unify" */
void unify(int leaders[], int v0, int v1) {
  int l0 = find(leaders, v0);
  int l1 = find(leaders, v1);
  if (l0 != l1)
    leaders[l1] = l0;
}
At the beginning, each vertex is its own leader. The main algorithm then traverses the set of edges and calls union for each pair of vertices. There are two major optimizations: path compression and union by rank.
This algorithm is somewhat incremental: at each step, the current leaders correspond to the real leaders of the graph restricted to the edges seen so far. We can also merge the results of separate runs on separate sets of edges. In fact, after computing leaders for a given subset of edges, we can build a smaller set of edges where each vertex in a connected component is linked to the leader of that component.
Are we sure the sets we build are smaller ? If we consider a connected component made of N vertices and K edges, the smallest value for K is N-1. On the other hand, the set of edges built by linking each vertex in the component to the component leader has exactly N-1 edges. So we cannot guarantee that the new set is strictly smaller, but it can never be bigger. Like other divide and conquer strategies, this approach is not more efficient in itself, but since it introduces parallelism, if we bound the induced overhead, it will run faster.
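To make the counting argument concrete, here is a small self-contained Go sketch (a simplified array-based union-find without rank or compression; the helper names are mine, not the project’s): a 5-vertex component described by 6 edges is rebuilt as exactly N-1 = 4 leader edges.

```go
package main

import "fmt"

type edge struct{ src, dst int }

func find(leaders []int, v int) int {
	for v != leaders[v] {
		v = leaders[v]
	}
	return v
}

func union(leaders []int, v0, v1 int) {
	l0, l1 := find(leaders, v0), find(leaders, v1)
	if l0 != l1 {
		leaders[l1] = l0
	}
}

// rebuild links every non-leader vertex to its component leader,
// producing exactly N-1 edges per component of N vertices.
func rebuild(leaders []int) []edge {
	out := []edge{}
	for v := range leaders {
		if l := find(leaders, v); l != v {
			out = append(out, edge{l, v})
		}
	}
	return out
}

func main() {
	// One component over 5 vertices, described with 6 edges (K > N-1).
	leaders := []int{0, 1, 2, 3, 4}
	for _, e := range []edge{{0, 1}, {1, 2}, {2, 3}, {3, 4}, {0, 4}, {1, 3}} {
		union(leaders, e.src, e.dst)
	}
	fmt.Println(len(rebuild(leaders))) // prints 4, i.e. N-1
}
```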

Strategies for Parallelism

The way we split the input is now obvious: we simply split the set of edges into smaller subsets. The question is: how do we merge the sub-results ?
Consider that we have two subsets of edges. As a first approximation, we can run union-find on both subsets, then build two new edge sets as described earlier, merge them and run union-find again on the result. It works, but is it efficient ?
It’s obvious that in most cases this will almost double the linear cost of the algorithm (if the number of edges is already minimal, the second pass will be useless … )
On the other hand, if we run the merge step only on the edges that matter, i.e. edges that can connect components built in different subsets, we can decrease the cost of the merge pass.
In Introduction to Parallel Computing (Addison Wesley, ISBN 0-201-64865-2, 2003, by Grama, Gupta, Karypis and Kumar), the authors present a similar approach to the same problem. They choose to compute the connected components of the split subsets of edges using a depth-first traversal, and to perform the merge step using union-find. They also present a formal study of the algorithm and its cost for both the shared memory and the communication models.

Implementation ?

I chose to make an experimental implementation of the parallel version described above using the Go language (http://golang.org/).
Why Go ? Good question: first, I was looking for an interesting (but short) project to increase my knowledge of the language and to test the goroutine mechanism a little further. The idea is also to extend this algorithm into a distributed one (work in progress, stay tuned for the next article) using the nsq messaging system (http://nsq.io/).
I’ll publish the code in a public git repository soon (it is still in a dirty experimental form and needs more testing.) But don’t be afraid, here are some hints and explanations on how it works ;)

Union-Find, ranks and path compression

Union-find is quite straightforward to implement, but in order to get the best performance, we need two simple optimizations: ranks and path compression.
Ranks: the idea is to add a simple heuristic to choose which of the two leaders found will become the leader of the merged component. For that, we store a count of the vertices represented by each leader; at the beginning, each vertex is its own leader and thus represents one vertex. When union is called, we choose the leader with the biggest count, updating its count by adding that of the other leader; when counts are equal, we choose the first element of the pair. (The code below actually stores a rank, a bound on the tree height, which plays the same role.)
Path compression: most of the algorithm’s time is spent in the find function, due to the iteration required to reach the leader. We can simply notice that all the vertices encountered while searching for the leader share that same leader, so we can re-affect them directly to it. While this adds a second loop to the function, it greatly reduces the number of steps needed to find the leader in subsequent calls.
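The two loops just described can be sketched like this in Go, independently of the Components type shown later (a standalone illustration on a plain map):

```go
package main

import "fmt"

// findCompress locates the leader of v (first loop), then walks the
// path a second time to point every visited vertex at that leader.
func findCompress(leaders map[int]int, v int) int {
	root := v
	for root != leaders[root] {
		root = leaders[root]
	}
	for v != root {
		// Advance v and compress the current entry in one assignment:
		// the right-hand side is evaluated before the stores.
		v, leaders[v] = leaders[v], root
	}
	return root
}

func main() {
	// A chain 4 -> 3 -> 2 -> 1 -> 0, whose leader is 0.
	leaders := map[int]int{0: 0, 1: 0, 2: 1, 3: 2, 4: 3}
	fmt.Println(findCompress(leaders, 4)) // prints 0
	fmt.Println(leaders[4], leaders[3])   // both now point directly to 0
}
```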
Combining both optimizations gives an amortized complexity of O(α(n)) per operation, where α(n) is the inverse Ackermann function, which grows so slowly that it is practically constant.

The implementation supposes that each vertex has a unique integer id, so we could use a simple array to store the leaders; but since the parallel implementation won’t see all vertices in each subset, we replace the array with a map from int to int. For the same reason, we use another map for the rank.
Here is the main union-find code:
package union_find

type Edge struct {
    Src, Dst int
}

type Components struct {
    leader map[int]int
    rank   map[int]int
}

// NewComponent returns an empty component set with both maps
// initialized (it is used by the subgraph code below).
func NewComponent() *Components {
    return &Components{leader: map[int]int{}, rank: map[int]int{}}
}

func (Comp *Components) Find(key int) (lead int) {
    if _, ok := Comp.leader[key]; !ok {
        // First time we see this vertex: it is its own leader.
        Comp.leader[key] = key
        Comp.rank[key] = 0
    }
    if lead = Comp.leader[key]; lead != key {
        // Path compression: point key directly at the real leader.
        Comp.leader[key] = Comp.Find(lead)
        lead = Comp.leader[key]
    }
    return
}

func (Comp *Components) Unify(v0, v1 int) int {
    l0, l1 := Comp.Find(v0), Comp.Find(v1)
    if l0 == l1 {
        return l0 // already in the same component
    }
    if Comp.rank[l0] < Comp.rank[l1] {
        l0, l1 = l1, l0 // keep the higher-ranked leader
    }
    Comp.leader[l1] = l0
    if Comp.rank[l0] == Comp.rank[l1] {
        Comp.rank[l0] += 1
    }
    return l0
}

// AddPair records an edge directly in the leader map (used by Merge
// when both endpoints are new to the subgraph); if Dst is already
// known, fall back to a regular union to avoid losing its component.
func (Comp *Components) AddPair(e Edge) {
    if _, ok := Comp.leader[e.Dst]; ok {
        Comp.Unify(e.Src, e.Dst)
        return
    }
    Comp.Find(e.Src) // register Src if needed
    Comp.leader[e.Dst] = e.Src
    Comp.rank[e.Dst] = 0
}

Parallelism

The parallel split and merge are done using a divide and conquer strategy. I use another data structure that stores the union-find maps plus the set of vertices present in the current subset. The main idea is to split the set of edges until we reach our threshold, then build a subgraph containing the set of vertices and the result of the union-find algorithm.
In order to merge a subgraph g2 into a subgraph g1, I rebuild a set of edges mapping each vertex of g2 to its leader, and for each new edge (s,d): if s or d already appears in g1, I call union on the edge; otherwise I simply add (s,d) to the leader map.
package subgraph

type SubGraph struct {
    VertexList map[int]bool
    Components *union_find.Components
}

func NewSubGraph(edges []union_find.Edge) SubGraph {
    g := SubGraph{make(map[int]bool), union_find.NewComponent()}
    for _, e := range edges {
        g.VertexList[e.Src] = true
        g.VertexList[e.Dst] = true
        g.Components.Unify(e.Src, e.Dst)
    }
    return g
}

func (g SubGraph) Merge(edges []union_find.Edge) {
    q := make(map[int]bool) // vertices not yet in this subgraph
    for _, e := range edges {
        newsrc := false
        newdst := false
        if newsrc = !g.VertexList[e.Src]; newsrc {
            q[e.Src] = true
        }
        if newdst = !g.VertexList[e.Dst]; newdst {
            q[e.Dst] = true
        }
        if !(newsrc && newdst) {
            // At least one endpoint is already known: a real union
            // may connect two existing components.
            g.Components.Unify(e.Src, e.Dst)
        } else {
            // Both endpoints are new: just record the pair.
            g.Components.AddPair(e)
        }
    }
    for v := range q {
        g.VertexList[v] = true
    }
}

// GetComputedEdges rebuilds the reduced edge set described earlier:
// one (leader, vertex) edge per non-leader vertex of the subgraph.
func (g SubGraph) GetComputedEdges() []union_find.Edge {
    edges := []union_find.Edge{}
    for v := range g.VertexList {
        if l := g.Components.Find(v); l != v {
            edges = append(edges, union_find.Edge{Src: l, Dst: v})
        }
    }
    return edges
}

func rec_compute(edges []union_find.Edge, grain int) (g SubGraph) {
    mid := len(edges) / 2
    if mid > grain {
        hout := make(chan int)
        l := edges[:mid]
        h := edges[mid:]
        var gh SubGraph
        // Compute the upper half in its own goroutine ...
        go func() {
            gh = rec_compute(h, grain)
            hout <- 1
        }()
        // ... while the current goroutine handles the lower half.
        g = rec_compute(l, grain)
        <-hout // wait for the upper half before merging
        g.Merge(gh.GetComputedEdges())
    } else {
        g = NewSubGraph(edges)
    }
    return
}

func ComputeEdges(edges []union_find.Edge, grain int) (comp_edges []union_find.Edge) {
    g := rec_compute(edges, grain)
    comp_edges = g.GetComputedEdges()
    return
}
Parallelism is obtained through goroutines, and synchronization uses a simple int channel.
This code is a first draft; I’ve tried another implementation without the merge optimization, and testing is underway to see whether the optimization really pays off.

Performances ?

I’m still exploring various aspects of the implementation to track down useless computations and bad practices (I’ve still got a lot to learn about Go.) I’m also re-considering various choices made for optimization …
Testing is currently done naively: I build a huge random set of edges with the simplest possible generator. Around 16 million edges, I start to see an interesting speed-up over a straight sequential union-find. But I’ve also noticed that replacing the goroutine calls with plain recursive calls slows the process down (w.r.t. the sequential version) by 70%; there’s probably something to improve in the whole process to get better results.
For that matter, here is the output of go test -bench . -cpu 8:
PASS
BenchmarkNewSubGraphLinear-8           1    10374479688 ns/op
BenchmarkComputeEdgeLinear-8           1    17708146949 ns/op
BenchmarkComputeEdgeParallel-8         1    5025817943 ns/op
ok      git.lse.epita.fr/users/burell_m/distributed_union_find/subgraph 45.617s
The first bench tests the basic sequential version, the second the divide and conquer version without goroutines, and the last one corresponds to the code shown in this article. The bench runs on a random graph of 2^24 edges with a threshold of 2^20.
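For reference, these numbers come from Go’s standard benchmark machinery (functions of the form BenchmarkXxx(b *testing.B)); the same machinery can also be driven programmatically. Here is a minimal sketch with an illustrative stand-in workload, not the project’s real benchmark functions:

```go
package main

import (
	"fmt"
	"testing"
)

// sequentialMin stands in for the measured work; the real benchmarks
// run the union-find variants on a 2^24-edge random graph.
func sequentialMin(a []int) int {
	min := a[0]
	for _, v := range a[1:] {
		if v < min {
			min = v
		}
	}
	return min
}

func main() {
	data := make([]int, 1<<16)
	for i := range data {
		data[i] = (i*2654435761 + 7) % 1000003 // cheap pseudo-random fill
	}
	// testing.Benchmark picks an adaptive iteration count b.N,
	// exactly like `go test -bench` does for BenchmarkXxx functions.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sequentialMin(data)
		}
	})
	fmt.Println(res.N, "iterations,", res.NsPerOp(), "ns/op")
}
```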

Next Step ?

So, I’m still improving the basic algorithm and testing strategies to obtain an efficient parallel version. I’m also looking for real input data, or at least a random graph generator where I can control some properties. There are also open questions about the order in which edges are presented and split …
Once all that is fixed, the next step will be to distribute this code using nsq.