Floating point numbers are useful (should I say mandatory) for scientific computations. But they’re also one of the most complex subject for programmers. From a purely theoretical point of view, they are the perfect illustration of the edge between discrete math (arithmetic, symbolic computation … ) and continuous math (math dealing with continuous values like real number and so on … ) Unfortunately for us, almost every scientific field use continuous math and only logic and computer science really care about discrete math.

Concretely, we have to represent values that have potentially infinite representation into a finite representation.

## Floating Point Numbers Representation

There’s a standard describing how to encode real values in our finite world: floating point numbers (IEEE 754.)

You don’t need the details of the standard to do basic stuff (that should interesting for more advanced computations and optimization.) But you do need to understand the idea behind.

You’ve probably encounters values express in the form 1.234×10³. The idea is to encode in the exponent how large is number (that’s a generalization of unit used in the metric system.) Very large numbers will have a very important exponent and very small numbers a negative one. The exponent is some how moving the point in the number. Floating point numbers are based on the same idea, except that we’re on a computer and used power of two !

Representing real values that way let us have very high or very small value encoded using the same representation.

The literature is full of details and explanation about floating point, but the whole things you need is:

- A float is a pair (m,e) (hey, I forgot the sign … ;)
- m (the mantissa) is a number with a fixed representation
- e (the exponent) is a signed integer (or biased-integer)

## Issues ?

The main issue are often called absorption: small values may be absorbed by wider ones.

What does it mean ? Let’s go back to human readable stuff: we decide to have a fixed mantissa with four digits of the form X.XXX and use exponent of 10. Now you wan to compute 1×10⁰ + 1×10¯⁴, the result should be 1.0001, but it requires five digits and thus we get as answer 1×10⁰.

The same apply to floating point numbers, let’s try in C:

```
#include <stdio.h>
int main() {
float x = 1.0, y = 1e-10;
if (x+y == x)
printf("y was absorbed !\n");
else
printf("y wasn' absorbed!\n");
return 0;
}
```

This code will print the first message, of course !

That’s a classical one and you have probably seen it by yourself previously. But, this simple effect can have huge impact on your code and your results.

## Vector Sum

A classical case where absorption effect may have strange result is array sum. Let’s take a simple example: we a huge array of float number and we want to compute the sum or the average value. We choose to fill our array is 1, so that the effect is more obvious.

```
#include <stdio.h>
#include <stdlib.h>
int main(int ac, char *av[]) {
unsigned long size = 100000000;
float sum = 0.0, *tab;
if (ac > 1) size = strtoul(av[1], NULL, 10);
tab = malloc(size * sizeof (float));
for (float *cur = tab; cur < tab + size; ++cur)
*cur = 1.0;
for (float *cur = tab; cur < tab + size; ++cur)
sum += *cur;
printf("size:\t%lu\nsum:\t%.8g\n", size, sum);
return 0;
}
```

The output of this code is disturbing:

```
size: 100000000
sum: 16777216
```

The sum stop at some point: that’s a typical absorbing effect, when 1 becomes to small with regards to the current sum, the value stop to change. We can easily infer this value: the mantissa is represented using 24bits (in fact, it’s 23, there’s an implicit bit always set to 1 at the beginning), and 2²⁴ = 16777216 !

To be sure, let’s play with the array size:

```
size: 16777215
sum: 16777215
size: 16777216
sum: 16777216
size: 16777217
sum: 16777216
```

Can we solve this ? Yes, of course, the good solution is to perform a sum by chunks of the array. Here is two examples of implementation:

```
#define CHUNK_SIZE 65536
float splitted_sum(float *begin, float *end) {
if (end - begin <= CHUNK_SIZE) {
float sum = 0.0;
for (float *cur = begin; cur != end; ++cur)
sum += *cur;
return sum;
}
float *mid = begin + (end - begin)/2;
return splitted_sum(begin, mid+1) + splitted_sum(mid+1, end);
}
float splitted_sum2(float *begin, float *end) {
float gsum = 0.0;
for (float *chunk = begin; chunk < end; chunk += CHUNK_SIZE) {
float sum = 0.0;
for (float *cur = chunk; cur != chunk + CHUNK_SIZE && cur != end; ++cur)
sum += *cur;
gsum += sum;
}
return gsum;
}
```

First, we should verify that we get a better result: the answer is yes, tested against the same context we get the awaited values:

```
size: 100000000
sum: 16777216
splitted_sum: 1e+08
splitted_sum2: 1e+08
```

Some words about that two functions: the global idea is to perform the sum piece by piece, i.e. we work on smaller sub-array so that individual values won’t get absorbed by the growing result.

The first version is a classic using recursive range division: you divide the current range until you reach the threshold, and only then do you do the array sum.

The second one is even simpler: two loops, one ranging over sub-arrays and the inner one doing the sum on each sub-array.

The positive point is that both solutions work for our case, but which one is better ? Good question ;)

Globally the recursive version is more robust: since we’re staging sums by pairs, most summed values will be in the same range, while the second version may encounter absorption (for my example, 100,000,000 of 1s, chunks smaller than 16 show absorption.)

What about performances ? I’ve made some quick tests … there isn’t any really differences. For small chunks (about 16 cells) the second version is really faster, but we’re also flirting with the error-zone. When around 32/64 the recursive version is faster (but not faster than the loop version with a chunk size of 16.) For bigger chunks, there’s no real differences. Here are some results (time added to previous output) for chunk size of 32:

size: 100000000`sum: 16777216 (time 0.0937701s) splitted_sum: 1e+08 (time 0.0697868s) splitted_sum2: 1e+08 (time 0.0825487s)`

There’s another interesting point in building split sum:

**parallelism**! Moving to parallel here just requires computing in separate threads the sums of each chunk. It can be done by hand using threads library or stuff like parallel reduce of TBB …
Here is an example using C++11

`std::async`

, (*note that this is not the best way to do that*) :```
double simple_sum(double *begin, double *end) {
double r = 0;
for (auto cur = begin; cur != end; ++cur)
r += *cur;
return r;
}
double async_sum(double *begin, double *end) {
auto len = end - begin;
if (len <= CHUNK_SIZE)
return simple_sum(begin, end);
auto mid = begin + len/2;
auto part1 = std::async(async_sum, begin, mid);
return async_sum(mid, end) + part1.get();
}
```

This implementation spawn too many threads and consume too much memory. A good parallel implementation should separate the parallel splitting and the splitting of the array. And note that the

`std::async`

and the `std::future`

is too expensive to (at least in current implementation.)## Approximation And Testing

Another consequence of floating point approximation is the

*instability*of computation upon expression re-ordering (*i.e.*using associativity and commutativity.)
Let’s take a simple example. Here are implementations of the factorial function:

```
double fact(int n) {
if ( n < 2 ) return 1;
return n * fact(n - 1);
}
double fact_term(int n, double a) {
if ( n < 2 ) return a;
return fact_term(n - 1, n * a);
}
```

They are classical examples of recursive implementations of the factorial. Here is a simple test:

```
int main() {
int n = 51;
double r0, r1, r2;
r0 = fact(n);
r1 = fact_term(n, 1);
printf("fact(%d) = r0 = %g\n", n, r0);
printf("fact_term(%d) = r1 = %g\n", n, r1);
printf("r1 - r0 = %g\n", r1 - r0);
return 0;
}
```

The output explains the problem itself:

```
fact(51) = r0 = 1.55112e+66
fact_term(51) = r1 = 1.55112e+66
r1 - r0 = -5.61217e+50
```

**That’s make a huge difference !**

Just to be sure, I tracked the moment when the errors shows up, let’s look at the same code with 50 rather 51 …

```
fact(50) = r0 = 3.04141e+64
fact_term(50) = r1 = 3.04141e+64
r1 - r0 = 0
```

**No more errors !**

No you understand the difficulty ? Errors in floating point can comes quickly for a simple reordering and can disappear with certain range of values. Our main issue is to be able to test various implementations of the same computation.

As an example, I’ll take a Heron based square root and compare it with the result of the standard library function

`sqrt(3)`

.```
// Simple Heron method for square root
// threshold indicate when to stop refinement
double mysqrt(double x, double threshold) {
double r0 = x, r1 = 0;
while (fabs(r0 - r1) > threshold) {
r1 = r0;
r0 = (r0 + x/r0)/2;
}
return r0;
}
```

The method works by iterative refinement, at each step we try to narrow our range around the square root. We stop as soon as the difference between the current value and the next one falls under the given threshold. Our purpose is to find a good threshold, for that I’ll compute

`sqrt(2, th)`

with increasing threshold and stop when the difference between my function and the standard function is stable … Just look at the code:```
int main() {
double x = 2, my_root, c_root;
double th = 1, diff=HUGE_VAL, prev_diff, mem = 0;
unsigned c = 0;
c_root = sqrt(x);
do {
prev_diff = diff;
my_root = mysqrt(x,th);
diff = fabs(my_root - c_root);
if (prev_diff - diff <= 0) {
if (c == 0) mem = th;
c += 1;
} else c = 0;
th = th/8;
} while (prev_diff - diff > 0 || c < 4);
printf("Threshold: %g - %a\nDiff: %g\n", mem, mem, diff);
return 0;
}
```

Ok, the code is more a script in C than a real testing program, but you can notice the tricks: we won’t get the same result, unless we use exactly the same algorithm (implemented the same way … ) So, we compute the difference between the expected result and our own value, it appears that we can’t do better than something around 2.22×10¯²². Here is the output:

```
Threshold: 5.96046e-08 - 0x1p-24
Diff: 2.22045e-16
```

## Real float format ?

A float (32bit floating point numbers) :

These elements are stored in that way:

**f = (-1^S) × 1.SIGNIFICAND × 2^(EXPONENT - 127)**

These elements are stored in that way:

As you may notice, the one before the point is not stored, the idea is that the mantissa is scaled such that it start with a one (remember, we’re in binary, that’s not a problem.)

The main consequences it’s some times difficult to compare values and non-strict comparing are sometime fuzzy. That’s a complex field of study and still a problem for many applications.

A funny consequence of the format is the existence of two zero ! The sign in the representation is an independent bit, not like in signed integer and thus we can all the value set to zero but the sign bit set to one ! A classical example is this absolute value function:

```
float fabsf(float x) {
if (x > 0.0) return x;
if (x < 0.0) return x;
return 0.0;
}
```

## Conclusion

Dealing with floating point numbers is not easy task, there’s a lot of traps, and you haven’t seen all here, I’m not a scientific computation specialist, I just report my simple experience.