3. Errors in numerical computation: Floating point numbers#

  • Numbers are represented in bits https://en.wikipedia.org/wiki/Byte?useskin=vector

  • There is no way to assign infinite bits for infinite precision. A finite number of bits is assigned: some of them are spent on the exponent, others on the mantissa, and one on the sign. https://en.wikipedia.org/wiki/Floating-point_arithmetic?useskin=vector

  • This means a limit in precision, range, and even density (https://docs.python.org/3/tutorial/floatingpoint.html). There are 8388607 single-precision numbers between 1.0 and 2.0, but only 8191 between 1024.0 and 1025.0.

(Figure: precision of floating-point numbers across their range. From: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/FloatingPointPrecisionAugmented.png/1000px-FloatingPointPrecisionAugmented.png)
  • To avoid differences across platforms, the IEEE 754 standard established rules for rounding, truncation, and so on: https://en.wikipedia.org/wiki/IEEE_754?useskin=vector . Languages like C/C++/Fortran/Python support the standard. In particular, for scientific applications try to use ‘double’ precision.

  • There are several possible representations of floating-point numbers, like fixed point (https://en.wikipedia.org/wiki/Fixed-point_arithmetic), decimal representation, etc., but the common one in the IEEE 754 standard is an “exponential” scientific notation. Play with https://bartaz.github.io/ieee754-visualization/

    | Format | Bits | Exponent bits | Significand bits | Machine eps   |
    |--------|------|---------------|------------------|---------------|
    | float  | 32   | 8             | 23               | \(10^{-7}\)   |
    | double | 64   | 11            | 52               | \(10^{-16}\)  |
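
You can verify these figures on your own machine with std::numeric_limits and std::nextafter; a minimal sketch (the probed values 1.0 and 1024.0 are just illustrative choices):

#include <cmath>
#include <iostream>
#include <limits>

int main(void)
{
  // machine epsilon as reported by the standard library
  std::cout << "float  eps: " << std::numeric_limits<float>::epsilon() << "\n";
  std::cout << "double eps: " << std::numeric_limits<double>::epsilon() << "\n";
  // the gap between consecutive representable floats grows with magnitude
  std::cout << "gap after 1.0f   : " << std::nextafter(1.0f, 2.0f) - 1.0f << "\n";
  std::cout << "gap after 1024.0f: " << std::nextafter(1024.0f, 2048.0f) - 1024.0f << "\n";
  return 0;
}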

(Figure: under-over-truncation)
  • For more floating-point representation details, check:

    • https://bartaz.github.io/ieee754-visualization/

    • https://float.exposed/0x44c00000

    • https://www.h-schmidt.net/FloatConverter/IEEE754.html

    • https://en.wikipedia.org/wiki/Floating-point_arithmetic

    • https://trekhleb.dev/blog/2021/binary-floating-point/

Some warnings:

  • Up to 14 decimal places can be lost in a single operation.

  • Do not cast floats to integers and vice versa (see the sketch after this list).

  • Addition is no longer associative: \(x + (y+z) \ne (x+y) + z\). Try \(x = -1.5\times 10^{38}\), \(y = +1.5\times 10^{38}\), \(z = 1\) in single precision.

  • Not every number can be exactly represented in binary (1/3, 0.1, etc.).

  • Compilers have many flags that affect floating-point behaviour; explore them (-ffast-math, -ffloat-store).
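
As promised above, a small sketch of how a float-to-int cast silently drops a unit: the cast truncates toward zero, and many decimal constants are stored slightly below their nominal value (the constant 0.29 is just an illustrative choice):

#include <cstdio>

int main(void)
{
  double x = 0.29;
  // 100*x is stored as 28.999999999999996..., so the cast truncates to 28
  std::printf("%.17g -> %d\n", 100*x, static_cast<int>(100*x));
  return 0;
}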

For some dramatic examples of FP errors, check:

  • https://www.iro.umontreal.ca/~mignotte/IFT2425/Disasters.html

  • https://web.ma.utexas.edu/users/arbogast/misc/disasters.html

  • https://slate.com/technology/2019/10/round-floor-software-errors-stock-market-battlefield.html

  • https://stackoverflow.com/questions/2732845/real-life-example-fo-floating-point-error

3.1. Overflow for integers#

We will keep doubling an int until some error occurs:

#include <cstdio>

int main(void)
{
  int a = 1;
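  // keep doubling; signed-integer overflow is undefined behaviour in C++,
  // but on typical two's-complement hardware the value wraps to negative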
  while(a > 0) {
    a *= 2 ;
    std::printf("%10d\n", a);
  }
}
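
Note that signed-integer overflow is formally undefined behaviour in C++; compiling with -fsanitize=undefined (GCC/Clang) will report it at run time.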

3.2. Underflow and overflow for floats#

#include <iostream>
#include <cstdlib>

typedef float REAL;

int main(int argc, char **argv)
{
  std::cout.precision(16);
  std::cout.setf(std::ios::scientific);
  int N = std::atoi(argv[1]);
  REAL under = 1.0, over = 1.0;
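  // under will traverse the subnormal range before flushing to 0.0;
  // over will saturate at inf once the float exponent range is exceeded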
  for (int ii = 0; ii < N; ++ii){
    under /= 2.0;
    over *= 2.0;
    std::cout << ii << "\t"
              << under << "\t"
              << over << "\n";
  }

}
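
With REAL = float you should see over become inf after about 128 doublings, while under survives down to the smallest subnormal (about \(1.4\times 10^{-45}\)) before flushing to zero near step 150.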

3.3. Test for associativity#

#include <cstdio>

int main(void)
{
  float x = -1.5e38, y = 1.5e38, z = 1;
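  // in x + (y+z) the 1 is absorbed by y+z, giving 0;
  // in (x+y) + z the cancellation happens first, so the 1 survives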
  printf("%16.6f\t%16.6f\n", x + (y+z), (x+y) + z);
  printf("%16.16f\n", 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1);

  return 0;
}

3.4. Machine eps#

You can also compute/verify the machine precision \(\epsilon_m\) of your machine for the different types. The machine precision is, roughly, the largest number that is rounded away when added to one in an operation: it satisfies \(1_c + \epsilon_m = 1_c\), where \(1_c\) is the computational representation of the number 1. This means that any real number \(x\) is actually represented as \(x_c = x(1+\epsilon)\), with \(|\epsilon| \le \epsilon_m\).

Implement the following algorithm to compute it and report your results:

eps = 1.0
begin do N times
    eps = eps/2.0
    one = 1.0 + eps
    print one, eps
end do

A possible implementation:

#include <iostream>
#include <cstdlib>

typedef long double REAL;

int main(int argc, char **argv)
{   
  std::cout.precision(20);
  std::cout.setf(std::ios::scientific);

  int N = std::atoi(argv[1]);
  REAL eps = 1.0, one = 0.0;
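  // halve eps until 1.0 + eps is no longer distinguishable from 1.0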
  for (int ii = 0; ii < N; ++ii){
    eps /= 2.0;
    one = 1.0 + eps;
    std::cout << ii << "\t"
              << one << "\t"
              << eps << "\n";
  }

}
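
You can cross-check the result against std::numeric_limits<REAL>::epsilon() from the <limits> header.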

3.5. Practical example: Series expansion of \(e^{-x}\)#

The function \(e^{-x}\) can be expanded as \[e^{-x} = \sum_{i=0}^{\infty} (-1)^i \frac{x^i}{i!} = 1 - x + \frac{x^2}{2} - \frac{x^3}{6} + \ldots \quad (|x| < \infty).\]

This is a great expansion, valid for all finite values of \(x\). But, what numerical problems do you see?

Implement a function that receives \(x\) and \(N\) (the maximum number of terms) and saves the running value of the series as a function of \(i\) in a file called sumdata.txt. Then load the data and plot it. Use \(N = 10000\) and \(x = 1.8766\). You will also need to implement a factorial function.

#include <iostream>
#include <cmath>

int factorial(int n);
double fnaive(double x, int N);
double fsmart(double x, int N);

int main(void)
{
  std::cout.precision(16); std::cout.setf(std::ios::scientific);
  double x = 1.234534534;
  for (int NMAX = 0; NMAX <= 100; ++NMAX) {
    std::cout << NMAX
              << "\t" << fnaive(x, NMAX)
              << "\t" << std::fabs(fnaive(x, NMAX) - std::exp(-x))/std::exp(-x)
              << "\t" << fsmart(x, NMAX)
              << "\t" << std::fabs(fsmart(x, NMAX) - std::exp(-x))/std::exp(-x)
              << std::endl;
  }
  return 0;
}

double fnaive(double x, int N)
{
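  // each term is recomputed from scratch with pow and factorial;
  // beware: int factorial overflows for k >= 13, corrupting those terms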
  double term = 0, suma = 0;
  for(int k = 0; k <= N; ++k){
    term = std::pow(-x, k)/factorial(k);
    suma += term;
  }
  return suma;
}

int factorial(int n)
{
  if (n <= 0) return 1;
  return n*factorial(n-1);
}

double fsmart(double x, int N)
{
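  // build each term from the previous one, term_{k+1} = term_k*(-x)/(k+1),
  // avoiding both pow and the factorial overflow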
  double term = 1, suma = 1;
  for(int k = 0; k < N; ++k){
    term *= (-x)/(k+1);
    suma += term;
  }
  return suma;
}

3.5.1. Exercise: Subtractive cancellation with the quadratic equation#
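
The statement is not written out here; a minimal sketch of the usual setup, assuming the exercise compares the textbook formula \(x = (-b \pm \sqrt{b^2-4ac})/2a\) with its rationalized variant when \(b^2 \gg 4ac\) (the coefficients below are illustrative):

#include <cmath>
#include <cstdio>

int main(void)
{
  // for a x^2 + b x + c = 0 with b^2 >> 4ac, the small root suffers cancellation
  double a = 1.0, b = 1.0e8, c = 1.0;
  double disc = std::sqrt(b*b - 4*a*c);
  double naive = (-b + disc)/(2*a);   // subtracts two nearly equal numbers
  double smart = -2*c/(b + disc);     // algebraically identical, no subtraction
  std::printf("naive: %.16e\nsmart: %.16e\n", naive, smart);
  return 0;
}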

3.5.2. Exercise: Summing up and down#

Is there any computational difference between the following sums?

(3.1)#\[\begin{align} S_{up}(N) &= \sum_{n=1}^{N} \frac{1}{n},\\ S_{down}(N) &= \sum_{n=N}^{1} \frac{1}{n}. \end{align}\]

Implement both sums and plot the relative difference between them as a function of \(N\).
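
A minimal single-precision sketch for one fixed \(N\) (the exercise asks for a scan over \(N\) and a plot):

#include <cstdio>

int main(void)
{
  const int N = 1000000;
  float sup = 0.0f, sdown = 0.0f;
  // upward: big terms accumulate first, so late small terms are partially absorbed
  for (int n = 1; n <= N; ++n) sup += 1.0f/n;
  // downward: small terms accumulate first, losing less precision
  for (int n = N; n >= 1; --n) sdown += 1.0f/n;
  std::printf("up: %.8f  down: %.8f  rel diff: %.3e\n",
              sup, sdown, (sup - sdown)/sdown);
  return 0;
}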

3.6. Errors#

  • FP is well defined through the IEEE754 standard.

  • A simple subtraction can destroy up to 15 decimal places of precision.

  • You should not cast floats to integers

  • You should normalize your models to natural units

  • Addition is not always associative: \(x + (y + z) \ne (x+y) + z\), when \(x = -1.5\times 10^{38}, y = +1.5\times 10^{38}, z = 1.0\) (single precision)

  • “All numbers can be represented in binary”: false. Check 0.1 or 0.3.


Kinds of errors:

  • Propagated errors: the longer the chain start \(\to U_1 \to U_2 \to \ldots \to U_n \to\) end, the higher the probability that some step introduces an error.

  • Blunders: typographical mistakes, a wrong program, etc.

  • Random errors: electronics, alien invasions, etc.

  • Approximation errors: truncation of a mathematical series, for instance.

  • Roundoff errors: truncation of a number in the finite computer representation.

3.6.1. Roundoff/truncation example#

Let’s compute the following sum as a function of \(k\): \(f(k) = \left|\frac{k}{10} - \sum_{i=1}^k 0.1\right|.\) Mathematically, this function should always be 0. Is that true on the computer?
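
A small sketch of this check (the sampled values of \(k\) are arbitrary):

#include <cmath>
#include <cstdio>

int main(void)
{
  for (int k = 1; k <= 10000; k *= 10) {
    double sum = 0.0;
    for (int i = 0; i < k; ++i) sum += 0.1;  // 0.1 is not exact in binary
    std::printf("%6d  %.3e\n", k, std::fabs(k/10.0 - sum));
  }
  return 0;
}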

3.6.2. Subtractive cancellation#

Given \(a = b-c\), then \(a_c = b_c-c_c\), therefore

(3.2)#\[\begin{align} a_c &= a(1+\epsilon_a ) = b_c - c_c = b(1+\epsilon_b ) - c(1+\epsilon_c )\\ \frac{a_c}{a} &= 1 + \epsilon_b \frac{b}{a} - \frac{c}{a}\epsilon_c \end{align}\]

so the error in \(a\) is about \[\epsilon_a \simeq \frac{b}{a}(\epsilon_b - \epsilon_c),\] which is much larger when \(b\) and \(c\) are close (so that \(a\) is small).

3.6.2.1. Example of subtractive cancellation#

The following series, \[S_N^{(1)} = \sum_{n=1}^{2N} (-1)^n \frac{n}{n+1},\] can be written in two mathematically equivalent ways: \[S_N^{(2)} = -\sum_{n=1}^{N} \frac{2n-1}{2n} + \sum_{n=1}^{N} \frac{2n}{2n+1},\] and \[S_N^{(3)} = \sum_{n=1}^{N} \frac{1}{2n(2n+1)}.\]

Could there be any computational difference? Why?

Write a program that computes all three sums and, assuming that \(S_N^{(3)}\) is the best option (why?), plots the relative difference of the other sums with respect to it as a function of \(N\).
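
A minimal sketch of the three sums for a single fixed \(N\) (the exercise asks for a scan and a plot):

#include <cstdio>

double s1(int N) {  // alternating sum, term by term
  double s = 0;
  for (int n = 1; n <= 2*N; ++n) s += (n % 2 == 0 ? 1.0 : -1.0)*n/(n + 1.0);
  return s;
}

double s2(int N) {  // two separate sums, subtracted at the end
  double sa = 0, sb = 0;
  for (int n = 1; n <= N; ++n) {
    sa += (2.0*n - 1)/(2.0*n);
    sb += 2.0*n/(2.0*n + 1);
  }
  return sb - sa;
}

double s3(int N) {  // all-positive form, no cancellation
  double s = 0;
  for (int n = 1; n <= N; ++n) s += 1.0/(2.0*n*(2.0*n + 1));
  return s;
}

int main(void)
{
  const int N = 1000;
  std::printf("%.16e  %.16e  %.16e\n", s1(N), s2(N), s3(N));
  return 0;
}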

3.7. Total errors in algorithms and their computational implementation#

As you can see, there are several sources of error in a computation. Some come from the mathematical approximation, called \(\epsilon_a\), and others are intrinsic to the numerical representation; call them \(\epsilon_r\). Sometimes the rounding/truncation error is modeled as a random walk, \(\epsilon_r \simeq \sqrt{N} \epsilon_m\), where \(\epsilon_m\) is the machine precision and \(N\) is the representative number of “steps”. From here, the total error can be estimated as

(3.3)#\[\begin{align} \epsilon_{\rm tot} &= \epsilon_a + \epsilon_r \\ &= \frac{\alpha}{N^\beta} + \sqrt{N} \epsilon_m. \end{align}\]
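
Setting the derivative with respect to \(N\) to zero gives a quick estimate of the optimal number of steps:

\[\frac{d\epsilon_{\rm tot}}{dN} = -\frac{\alpha\beta}{N^{\beta+1}} + \frac{\epsilon_m}{2\sqrt{N}} = 0 \quad\Rightarrow\quad N_{\rm opt} = \left(\frac{2\alpha\beta}{\epsilon_m}\right)^{1/(\beta+1/2)}.\]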

From this expression you can compute the optimal value of \(N\) (as sketched above), which will depend on the order of the mathematical algorithm and on the machine precision. The next figure shows some examples that illustrate this point:

(Figure: under-over-truncation)

3.8. How to minimize numerical errors?#