## Optimization and Details

I’ve already talked about optimizing simple computations like trigonometric functions (see one of my previous posts), trying to find the best implementation. I’m back on the subject, looking at the small details, and this time playing with integers, the most-significant bit, and the integer square root.

## Most Significant Bit

I’ve recently read a post (Fast Power of Two Equal or Larger than a 32-bit Integer [1]) about computing the smallest power of two bigger than a given integer. The idea is to find the most significant bit (msb): if k is the msb of x, then we know that 2^k ≤ x < 2^(k+1). The easiest way to compute the msb is to divide x by two until it reaches 0, counting the number of steps. We can compute the power of two in the same loop, yielding the following code (with unused computation removed):
```c
static inline
unsigned next_pow2_a(unsigned x) {
  unsigned r = 1;
  for (; x > r; r <<= 1) {}
  return r;
}
```
The post proposes another version, which computes the msb with the following idea: we can set all the bits below the msb of an integer using 5 shifts. In fact, this operation directly computes the power of 2 we’re looking for; the code (extracted from [1]) looks like this:

```c
static inline
unsigned npow(unsigned x) {
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  return x + 1;
}

static inline
unsigned next_pow2_b(unsigned x) {
  if (!x || ((x & (x - 1)) == 0))
    return x;
  return npow(x);
}
```
For completeness, I decided to test two other methods based on a binary search. The first one is the classical binary search applied to bits:
```c
static inline
unsigned next_pow2_c(unsigned x) {
  unsigned l = 0, r = 32;
  while (r - l > 1) {
    unsigned m = (l + r) >> 1;
    if (x < (1u << m)) // unsigned literal, since m can reach 31
      r = m;
    else
      l = m;
  }
  return 1u << r;
}
```
The second one takes advantage of the fact that, on 32-bit integers, we can completely unroll the loop; after some simplification we arrive at the following code:
```c
static inline
unsigned next_pow2_d(unsigned x) {
  unsigned r = 1;
  if (x >= 1 << 16) {
    r <<= 16;
    x >>= 16;
  }
  if (x >= 1 << 8) {
    r <<= 8;
    x >>= 8;
  }
  if (x >= 1 << 4) {
    r <<= 4;
    x >>= 4;
  }
  if (x >= 1 << 2) {
    r <<= 2;
    x >>= 2;
  }
  if (x >= 2)
    r <<= 1;
  return r;
}
```
Finally, I wanted to compare all these methods against the x86 bsr instruction:
```c
static inline
unsigned next_pow2_e(unsigned x) {
  unsigned msb;
  __asm__("bsrl %1,%0" : "=r"(msb) : "r"(x));
  return 1 << (msb + 1);
}
```
In order to compare all the functions, I needed a simple framework to automate the calls against a huge array of random integers without losing the inlining. Playing a little bit with macros, I finally wrote the following code:

```c
// Time diff for bench
double nanodiff(struct timespec t0, struct timespec t1) {
  double s0 = t0.tv_sec + ((double)t0.tv_nsec) * 1e-9;
  double s1 = t1.tv_sec + ((double)t1.tv_nsec) * 1e-9;
  return s0 - s1;
}

// Type of the core loop bench functions
typedef void (*func) (unsigned*, unsigned);

// Macro generating core loops for bench
// Here we try to keep the functions inlined (avoiding function pointers)
#define CORE_BENCH_NAME(FNAME_, VER_) core_bench_##FNAME_##_##VER_
#define CORE_BENCH(FNAME_, VER_)                                      \
  void core_bench_##FNAME_##_##VER_(unsigned data[], unsigned len) {  \
    for (unsigned *cur = data; cur != data + len; ++cur) {            \
      unsigned r = FNAME_##_##VER_(*cur);                             \
      assert(!r || ((r & (r - 1)) == 0));                             \
    }                                                                 \
  }

// Generate core bench loops for the various versions
CORE_BENCH(next_pow2, a)
CORE_BENCH(next_pow2, b)
CORE_BENCH(next_pow2, c)
CORE_BENCH(next_pow2, d)
CORE_BENCH(next_pow2, e)

void bench(unsigned data[], unsigned len, func core, char *name) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
  core(data, len);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
  printf("Bench %s: \t%gs\n", name, nanodiff(t1, t0));
}
```
Here are the timing results, compiled with clang and run on an i7-3770, for 2^26 integers, with compiler optimizations turned off (-O0), sorted by time:
```
Bench a (naive version):    3.92489s
Bench c (loop version):     1.63613s
Bench d (if-bin version):   1.01329s
Bench b (fast version):     0.781902s
Bench e (bsrl version):     0.291132s
```

The same with -O3:
```
Bench a (naive version):    1.34297s
Bench c (loop version):     0.703433s
Bench d (if-bin version):   0.398467s
Bench b (fast version):     0.116604s
Bench e (bsrl version):     0.0957188s
```

So, obviously, the asm version is the fastest in all cases, and the version from [1] performs well. It’s also worth noticing that the loop version of the binary search is nearly two times slower than the unrolled version.

## Square Root

Computing the msb is useful to set up a first approximation of a square root, and thus I found it interesting to see the impact of the msb computation time on an integer square root function.
I used Heron’s method (a specialization of Newton’s method). The basic version uses the argument as the first approximation, giving the following code:
```c
static inline
unsigned int_sqrt(unsigned x) {
  unsigned r = x; // assumes x > 0
  while (r > x / r) {
    r = (r + x / r) >> 1;
  }
  return r;
}
```
The idea is then to replace the first approximation, using the following observation: if 2^k ≤ x < 2^(k+1), then √x ≤ 2^((k+1)/2). So, once we get the msb, we can build a first approximation with an addition and a shift (the division by two).
In order to avoid rewriting the same code several times, I once again played with macros:
```c
// Automatic square root function
#define INT_SQRT_NAME(VER_) int_sqrt_##VER_
#define INT_SQRT(VER_)                           \
  static inline                                  \
  unsigned int_sqrt_##VER_(unsigned x) {         \
    unsigned r = msb_##VER_(x - 1);              \
    while (r > x / r) {                          \
      r = (r + x / r) >> 1;                      \
    }                                            \
    return r;                                    \
  }
```
These macros use a function returning the approximation. The first version just returns its argument (the name is badly chosen, since it doesn’t return the msb):
```c
static inline
unsigned msb_a(unsigned x) { return x; }
```
Then comes the naive loop again:
```c
static inline
unsigned msb_b(unsigned x) {
  unsigned r = 0;
  for (; x; x >>= 1)
    r += 1;
  return 1 << ((1 + r) >> 1);
}
```
We can’t use the msb computation from [1], since it doesn’t compute the rank but directly the corresponding power of 2, but we can recycle our binary search versions:
```c
static inline
unsigned msb_c(unsigned x) {
  unsigned l = 0, h = 32;
  while (h - l > 1) {
    unsigned m = (h + l) >> 1;
    if (x < (1u << m))
      h = m;
    else
      l = m;
  }
  return 1 << ((1 + h) >> 1);
}

static inline
unsigned msb_d(unsigned x) {
  unsigned r = 1;
  if (x >= 1 << 16) {
    r += 16;
    x >>= 16;
  }
  if (x >= 1 << 8) {
    r += 8;
    x >>= 8;
  }
  if (x >= 1 << 4) {
    r += 4;
    x >>= 4;
  }
  if (x >= 1 << 2) {
    r += 2;
    x >>= 2;
  }
  if (x >= 2) r += 1;
  return 1 << ((1 + r) >> 1);
}
```
And finally, the asm version:
```c
static inline
unsigned msb_e(unsigned x) {
  unsigned msb;
  __asm__("bsrl %1,%0" : "=r"(msb) : "r"(x));
  return 1 << ((2 + msb) >> 1);
}
```
I’ve tested this code with the same method presented earlier, and got the following results for -O0 (and the same 2^26 random integers):
```
Bench a (simple version):       12.1927s
Bench b (basic msb version):    6.90088s
Bench c (bin loop version):     4.4s
Bench d (if bin version):       3.61818s
Bench e (bsrl version):         3.30704s
```

And with -O3:
```
Bench a (simple version):       7.05185s
Bench b (basic msb version):    2.22483s
Bench c (bin loop version):     2.10562s
Bench d (if bin version):       1.58189s
Bench e (bsrl version):         1.55542s
```

The most important result is that a good approximation makes a huge difference! Note also that the unrolled binary search is nearly as good as the asm version, giving an opportunity to write fast portable code (well, almost portable).

## More Hardware Backend

Most modern CPUs provide square root functions that are (supposed to be) really fast. I decided to try SSE code using intrinsics. After reading a stackoverflow question [2], I decided to use the reciprocal square root SSE instruction.
The first issue was that this instruction computes a floating point approximation that is less accurate than my other functions. Rather than decreasing my accuracy, I decided to complete the SSE code with some more steps of the previous algorithm. Here is the function (you can see that I borrowed the code from [2]):
```c
static inline
unsigned int_sqrt_sse(unsigned x) {
  float pIn = (float)x, pOut = 0.0;
  __m128 in = _mm_load_ss(&pIn);
  _mm_store_ss(&pOut, _mm_mul_ss(in, _mm_rsqrt_ss(in)));
  unsigned out = (unsigned)pOut;
  do {
    out = (out + x / out) >> 1;
  } while (out > x / out);
  return out;
}
```
I then tested it using the same framework described earlier. The result is, without any doubt, the fastest version! With a timing of 0.45167s (with -O3, using clang and the same tests as before), it’s three times faster than the other versions!

## Accuracy?

Square roots are approximations: only a few numbers have a square root that can be represented exactly on a computer. So how can we define the accuracy of our computation?

For the integer square root, we first observe that given an integer x, we can find two square numbers (call them y0 and y1) such that y0 ≤ x ≤ y1. Since the square root is a monotonic function, we have √y0 ≤ √x ≤ √y1. Finding the best accuracy means finding the smallest ε = √y1 − √y0, and, if we suppose that x is not a square number, we know that ε is minimized when there is no other square number between y0 and y1. Thanks again to the monotonicity of the square root, we can easily deduce that the minimum value of ε is obtained when √y1 − √y0 = 1 and, if z = √y0, the solution is y0 = z² and y1 = (z+1)². In that case, the best integer square root for x is z.
So, in order to verify our result, we just have to check that z² ≤ x < (z+1)². In my tests this is done using an assert:
```c
assert(square(r) <= *cur && *cur < square(r + 1));
```
What about floating point? We can’t rely on the next-square-number idea, but we’re still trying to minimize a distance, here |x − sqrt(x)²|. Most of the time, we fix a threshold for that distance, defining it as our accuracy level. Unfortunately, this is not the best way of computing errors with floats.
There are valuable pages (mainly Comparing floating point numbers [3] and Comparing Floating Point Numbers, 2012 Edition [4]) about comparing floats. From our point of view, the interesting aspect is to establish our error not as a difference (or even a relative difference) between two floating point numbers, but rather as the kind of digit distance described in [3] and [4]. I won’t give you the whole discussion; if you’re really interested in the subject, read these articles (especially [4]).
So, as far as we’re concerned, what we need is to see how close our square root approximation is to the real square root. Here is a possible check for accuracy:

```c
union float_int {
  float   f;
  int32_t i;
};

// Works only for non-null positive float numbers,
// without regard to special cases (NaN, infinity, ...)
int ulp_positive_float_compare(float a, float b, int32_t ulp_threshold) {
  // Some acceptable constraints
  assert(sizeof (float) == sizeof (int32_t));             // to be sure
  assert(a > 0 && b > 0);                                 // non-null positive
  assert(ulp_threshold > 0 && ulp_threshold < (1 << 22)); // proper threshold

  if (a == b) return 1;

  union float_int A, B;
  A.f = a;
  B.f = b;
  return abs(A.i - B.i) <= ulp_threshold;
}
```
When exploring various implementations of the integer square root, I rejected the use of the libc sqrtf, because it was too divergent for me: once translated back to an integer, the root doesn’t satisfy my test. In the same vein, the SSE square root needed some more iterations of the main algorithm in order to return a correct value. Why? Testing the sqrtf result with the previous function shows differences that are less than one ULP, meaning that we are as close as we can be, from a float point of view, but that doesn’t mean that the integer translation of this result is the closest integer to the square root.

## Approximation and Optimization

The computation of the float square root relies on an iterative process that gets closer to the perfect solution at each step. That’s a recurrent pattern: many numerical approximations work by narrowing a range around the solution. The accuracy of your result is directly linked to the number of iterations, and thus the speed of the computation is inversely proportional to the accuracy.
If, for a given problem, you can obtain a fast approximation, you can cut down the number of steps required to obtain an accurate result. For the integer square root, the approximation based on the msb is a pretty good example: even with a slow msb computation, the square root is three times faster.
That’s the root of all fast square root functions for floating point numbers. The format of floating point numbers offers various tricks to get a very fast approximation of the square root that can be used as a starting point for a Heron/Newton iterative method.
A first version uses the fact that we can obtain a good approximation of log2(x) directly from the binary representation of x (noted x_int in the following expression): log2(x) ≈ x_int / 2^23 − 127 (where 127 is the exponent bias of the 32-bit floating point representation and 23 the width of the mantissa). From log2(x) we can obtain our square root approximation. Here is a basic implementation (based on the Wikipedia articles) where a is a bias adjustment to reduce the approximation error:
```c
union float_int {
  float f;
  int   i;
};

const int a = 0x4c000;

// sqrt based on log2
float sqrt_log2(float x) {
  union float_int X;
  X.f = x;
  union float_int Y;
  Y.i = (1 << 29) + (X.i >> 1) - (1 << 22) + a;
  return Y.f;
}
```
There’s an even faster version based on the reciprocal square root (1/√x). This version gained fame due to its use in Quake III and the fact that it looks somewhat magic (to be fair, this version appeared first in some SGI code). While the original code was used directly for the reciprocal square root (needed in computer graphics), you can obtain the square root by taking the reciprocal again (or with a step of Newton’s method). The following version (again inspired by the Wikipedia version) is extended with 3 steps of Newton’s method, giving a result as accurate as the libc sqrtf:
```c
float sqrt_fastinv(float x) {
  union float_int X, Y;
  X.f = x;
  Y.i = 0x5f3759df - (X.i >> 1);
  Y.f = 0.5f * (1 / Y.f + x * Y.f); // Newton step using 1/sqrt(x)
  Y.f = 0.5f * (Y.f + x / Y.f);     // classical Newton steps
  Y.f = 0.5f * (Y.f + x / Y.f);
  return Y.f;
}
```
To obtain the same accuracy with the classic Newton method, 5 to 20 steps are required (for a nine-digit number).

## Conclusion

OK, we’ve finally finished this survey of basic computations. It’s interesting to see how obviously simple computations can be optimized and rewritten to obtain measurable performance gains.
I’m not a specialist of numerical algorithms, but I do know that in programming and algorithms in general, details matter, and most of the time you’re looking for a trade-off between accuracy and computation time. You also need to consider that a fast but not really accurate approximation can serve as a good starting point to speed up a more accurate algorithm.