3. Errors in numerical computation: Floating point numbers#

  • Numbers are represented in bits https://en.wikipedia.org/wiki/Byte?useskin=vector

  • There is no way to assign infinite bits for infinite precision. A finite number of bits is assigned: some of them are spent on the exponent, others on the mantissa, and one on the sign. https://en.wikipedia.org/wiki/Floating-point_arithmetic?useskin=vector

  • This means a limit in precision, range, and even density (https://docs.python.org/3/tutorial/floatingpoint.html). There are 8388607 single-precision numbers between 1.0 and 2.0, but only 8191 between 1024.0 and 1025.0.

(Figure: precision of floating-point numbers across their range. From: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/FloatingPointPrecisionAugmented.png/1000px-FloatingPointPrecisionAugmented.png)
  • To avoid differences across platforms, the IEEE 754 standard established rules for rounding, truncation, and so on: https://en.wikipedia.org/wiki/IEEE_754?useskin=vector . Languages like C/C++/Fortran/Python support the standard. In particular, for scientific applications try to use ‘double’ precision.

  • There are several possible representations of floating-point numbers, like fixed point (https://en.wikipedia.org/wiki/Fixed-point_arithmetic), decimal representation, etc., but the common one in the IEEE 754 standard is an “exponential” scientific notation. Play with https://bartaz.github.io/ieee754-visualization/

    | Format | Bits | Exponent bits | Significand bits | Machine eps   |
    |--------|------|---------------|------------------|---------------|
    | float  | 32   | 8             | 23               | \(10^{-7}\)   |
    | double | 64   | 11            | 52               | \(10^{-16}\)  |
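
You can verify these figures on your own machine with std::numeric_limits and std::nextafter; a minimal sketch (the probed values 1.0 and 1024.0 are just illustrative choices):

#include <cmath>
#include <iostream>
#include <limits>

int main(void)
{
  // machine epsilon as reported by the standard library
  std::cout << "float  eps: " << std::numeric_limits<float>::epsilon() << "\n";
  std::cout << "double eps: " << std::numeric_limits<double>::epsilon() << "\n";
  // the gap between consecutive representable floats grows with magnitude
  std::cout << "gap after 1.0f   : " << std::nextafter(1.0f, 2.0f) - 1.0f << "\n";
  std::cout << "gap after 1024.0f: " << std::nextafter(1024.0f, 2048.0f) - 1024.0f << "\n";
  return 0;
}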

(Figure: under-over-truncation)
  • For more floating-point representation details, check:

    • https://bartaz.github.io/ieee754-visualization/

    • https://float.exposed/0x44c00000

    • https://www.h-schmidt.net/FloatConverter/IEEE754.html

    • https://en.wikipedia.org/wiki/Floating-point_arithmetic

    • https://trekhleb.dev/blog/2021/binary-floating-point/

Some warnings:

  • Up to 14 decimal places can be lost in a single operation.

  • Do not cast floats to integers and vice versa (see the sketch after this list).

  • Addition is no longer associative: \(x + (y+z) \ne (x+y) + z\). Try \(x = -1.5\times 10^{38}\), \(y = +1.5\times 10^{38}\), \(z = 1\) in single precision.

  • Not every number can be exactly represented in binary (1/3, 0.1, etc.).

  • Compilers have many flags that affect floating-point behaviour; explore them (-ffast-math, -ffloat-store).
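
As promised above, a small sketch of how a float-to-int cast silently drops a unit: the cast truncates toward zero, and many decimal constants are stored slightly below their nominal value (the constant 0.29 is just an illustrative choice):

#include <cstdio>

int main(void)
{
  double x = 0.29;
  // 100*x is stored as 28.999999999999996..., so the cast truncates to 28
  std::printf("%.17g -> %d\n", 100*x, static_cast<int>(100*x));
  return 0;
}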

For some dramatic examples of FP errors, check:

  • https://www.iro.umontreal.ca/~mignotte/IFT2425/Disasters.html

  • https://web.ma.utexas.edu/users/arbogast/misc/disasters.html

  • https://slate.com/technology/2019/10/round-floor-software-errors-stock-market-battlefield.html

  • https://stackoverflow.com/questions/2732845/real-life-example-fo-floating-point-error

3.1. Overflow for integers#

We will keep doubling an int until some error occurs:

#include <cstdio>

int main(void)
{
  int a = 1;
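  // keep doubling; signed-integer overflow is undefined behaviour in C++,
  // but on typical two's-complement hardware the value wraps to negative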
  while(a > 0) {
    a *= 2 ;
    std::printf("%10d\n", a);
  }
}
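
Note that signed-integer overflow is formally undefined behaviour in C++; compiling with -fsanitize=undefined (GCC/Clang) will report it at run time.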

3.2. Underflow and overflow for floats#

#include <iostream>
#include <cstdlib>

typedef float REAL;

int main(int argc, char **argv)
{
  std::cout.precision(16);
  std::cout.setf(std::ios::scientific);
  int N = std::atoi(argv[1]);
  REAL under = 1.0, over = 1.0;
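  // under will traverse the subnormal range before flushing to 0.0;
  // over will saturate at inf once the float exponent range is exceeded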
  for (int ii = 0; ii < N; ++ii){
    under /= 2.0;
    over *= 2.0;
    std::cout << ii << "\t"
              << under << "\t"
              << over << "\n";
  }

}
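
With REAL = float you should see over become inf after about 128 doublings, while under survives down to the smallest subnormal (about \(1.4\times 10^{-45}\)) before flushing to zero near step 150.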

3.3. Test for associativity#

#include <cstdio>

int main(void)
{
  float x = -1.5e38, y = 1.5e38, z = 1;
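  // in x + (y+z) the 1 is absorbed by y+z, giving 0;
  // in (x+y) + z the cancellation happens first, so the 1 survives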
  printf("%16.6f\t%16.6f\n", x + (y+z), (x+y) + z);
  printf("%16.16f\n", 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1);

  return 0;
}

3.4. Machine eps#

You can also compute/verify the machine precision \(\epsilon_m\) of your machine for the different types. The machine precision is, roughly, the largest number that is rounded away when added to one in an operation: it satisfies \(1_c + \epsilon_m = 1_c\), where \(1_c\) is the computational representation of the number 1. This means that any real number \(x\) is actually represented as \(x_c = x(1+\epsilon)\), with \(|\epsilon| \le \epsilon_m\).

Implement the following algorithm to compute it and report your results:

eps = 1.0
begin do N times
    eps = eps/2.0
    one = 1.0 + eps
    print one, eps
end do

A possible implementation:

#include <iostream>
#include <cstdlib>

typedef long double REAL;

int main(int argc, char **argv)
{   
  std::cout.precision(20);
  std::cout.setf(std::ios::scientific);

  int N = std::atoi(argv[1]);
  REAL eps = 1.0, one = 0.0;
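  // halve eps until 1.0 + eps is no longer distinguishable from 1.0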
  for (int ii = 0; ii < N; ++ii){
    eps /= 2.0;
    one = 1.0 + eps;
    std::cout << ii << "\t"
              << one << "\t"
              << eps << "\n";
  }

}
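
You can cross-check the result against std::numeric_limits<REAL>::epsilon() from the <limits> header.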

3.5. Practical example: Series expansion of \(e^{-x}\)#

The function \(e^{-x}\) can be expanded as \[e^{-x} = \sum_{i=0}^{\infty} (-1)^i \frac{x^i}{i!} = 1 - x + \frac{x^2}{2} - \frac{x^3}{6} + \ldots \quad (|x| < \infty).\]

This is a great expansion, valid for all finite values of \(x\). But, what numerical problems do you see?

Implement a function that receives \(x\) and \(N\) (the maximum number of terms) and saves the running value of the series as a function of \(i\) in a file called sumdata.txt. Then load the data and plot it. Use \(N = 10000\) and \(x = 1.8766\). You will also need to implement a factorial function.

#include <iostream>
#include <cmath>

int factorial(int n);
double fnaive(double x, int N);
double fsmart(double x, int N);

int main(void)
{
  std::cout.precision(16); std::cout.setf(std::ios::scientific);
  double x = 1.234534534;
  for (int NMAX = 0; NMAX <= 100; ++NMAX) {
    std::cout << NMAX
              << "\t" << fnaive(x, NMAX)
              << "\t" << std::fabs(fnaive(x, NMAX) - std::exp(-x))/std::exp(-x)
              << "\t" << fsmart(x, NMAX)
              << "\t" << std::fabs(fsmart(x, NMAX) - std::exp(-x))/std::exp(-x)
              << std::endl;
  }
  return 0;
}

double fnaive(double x, int N)
{
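  // each term is recomputed from scratch with pow and factorial;
  // beware: int factorial overflows for k >= 13, corrupting those terms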
  double term = 0, suma = 0;
  for(int k = 0; k <= N; ++k){
    term = std::pow(-x, k)/factorial(k);
    suma += term;
  }
  return suma;
}

int factorial(int n)
{
  if (n <= 0) return 1;
  return n*factorial(n-1);
}

double fsmart(double x, int N)
{
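  // build each term from the previous one, term_{k+1} = term_k*(-x)/(k+1),
  // avoiding both pow and the factorial overflow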
  double term = 1, suma = 1;
  for(int k = 0; k < N; ++k){
    term *= (-x)/(k+1);
    suma += term;
  }
  return suma;
}

3.5.1. Exercise: Subtractive cancellation with the quadratic equation#
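
The statement is not written out here; a minimal sketch of the usual setup, assuming the exercise compares the textbook formula \(x = (-b \pm \sqrt{b^2-4ac})/2a\) with its rationalized variant when \(b^2 \gg 4ac\) (the coefficients below are illustrative):

#include <cmath>
#include <cstdio>

int main(void)
{
  // for a x^2 + b x + c = 0 with b^2 >> 4ac, the small root suffers cancellation
  double a = 1.0, b = 1.0e8, c = 1.0;
  double disc = std::sqrt(b*b - 4*a*c);
  double naive = (-b + disc)/(2*a);   // subtracts two nearly equal numbers
  double smart = -2*c/(b + disc);     // algebraically identical, no subtraction
  std::printf("naive: %.16e\nsmart: %.16e\n", naive, smart);
  return 0;
}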

3.5.2. Exercise: Summing up and down#

Is there any computational difference between the following sums?

(3.1)#\[\begin{align} S_{up}(N) &= \sum_{n=1}^{N} \frac{1}{n},\\ S_{down}(N) &= \sum_{n=N}^{1} \frac{1}{n}. \end{align}\]

Implement both sums and plot the relative difference between them as a function of \(N\).
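
A minimal single-precision sketch for one fixed \(N\) (the exercise asks for a scan over \(N\) and a plot):

#include <cstdio>

int main(void)
{
  const int N = 1000000;
  float sup = 0.0f, sdown = 0.0f;
  // upward: big terms accumulate first, so late small terms are partially absorbed
  for (int n = 1; n <= N; ++n) sup += 1.0f/n;
  // downward: small terms accumulate first, losing less precision
  for (int n = N; n >= 1; --n) sdown += 1.0f/n;
  std::printf("up: %.8f  down: %.8f  rel diff: %.3e\n",
              sup, sdown, (sup - sdown)/sdown);
  return 0;
}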

3.6. Errors#

  • FP is well defined through the IEEE754 standard.

  • A simple subtraction can destroy up to 15 decimal places of precision.

  • You should not cast floats to integers

  • You should normalize your models to natural units

  • Addition is not always associative: \(x + (y + z) \ne (x+y) + z\), when \(x = -1.5\times 10^{38}, y = +1.5\times 10^{38}, z = 1.0\) (single precision)

  • “All numbers can be represented in binary”: false. Check 0.1 or 0.3.


Kinds of errors:

  • Propagated errors: the longer the chain start \(\to U_1 \to U_2 \to \ldots \to U_n \to\) end, the higher the probability that some step introduces an error.

  • Blunders: typographical mistakes, a wrong program, etc.

  • Random errors: electronics, alien invasions, etc.

  • Approximation errors: truncation of a mathematical series, for instance.

  • Roundoff errors: truncation of a number in the finite computer representation.

3.6.1. Roundoff/truncation example#

Let’s compute the following sum as a function of \(k\): \(f(k) = \left|\frac{k}{10} - \sum_{i=1}^k 0.1\right|.\) Mathematically, this function should always be 0. Is that true on the computer?
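
A small sketch of this check (the sampled values of \(k\) are arbitrary):

#include <cmath>
#include <cstdio>

int main(void)
{
  for (int k = 1; k <= 10000; k *= 10) {
    double sum = 0.0;
    for (int i = 0; i < k; ++i) sum += 0.1;  // 0.1 is not exact in binary
    std::printf("%6d  %.3e\n", k, std::fabs(k/10.0 - sum));
  }
  return 0;
}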

3.6.2. Subtractive cancellation#

Given \(a = b-c\), then \(a_c = b_c-c_c\), therefore

(3.2)#\[\begin{align} a_c &= a(1+\epsilon_a ) = b_c - c_c = b(1+\epsilon_b ) - c(1+\epsilon_c )\\ \frac{a_c}{a} &= 1 + \epsilon_b \frac{b}{a} - \frac{c}{a}\epsilon_c \end{align}\]

so the error in \(a\) is about \[\epsilon_a \simeq \frac{b}{a}(\epsilon_b - \epsilon_c),\] which is much larger when \(b\) and \(c\) are close (so that \(a\) is small).

3.6.2.1. Example of subtractive cancellation#

The following series, \[S_N^{(1)} = \sum_{n=1}^{2N} (-1)^n \frac{n}{n+1},\] can be written in two mathematically equivalent ways: \[S_N^{(2)} = -\sum_{n=1}^{N} \frac{2n-1}{2n} + \sum_{n=1}^{N} \frac{2n}{2n+1},\] and \[S_N^{(3)} = \sum_{n=1}^{N} \frac{1}{2n(2n+1)}.\]

Could there be any computational difference? Why?

Write a program that computes all three sums and, assuming that \(S_N^{(3)}\) is the best option (why?), plots the relative difference of the other sums with respect to it as a function of \(N\).
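
A minimal sketch of the three sums for a single fixed \(N\) (the exercise asks for a scan and a plot):

#include <cstdio>

double s1(int N) {  // alternating sum, term by term
  double s = 0;
  for (int n = 1; n <= 2*N; ++n) s += (n % 2 == 0 ? 1.0 : -1.0)*n/(n + 1.0);
  return s;
}

double s2(int N) {  // two separate sums, subtracted at the end
  double sa = 0, sb = 0;
  for (int n = 1; n <= N; ++n) {
    sa += (2.0*n - 1)/(2.0*n);
    sb += 2.0*n/(2.0*n + 1);
  }
  return sb - sa;
}

double s3(int N) {  // all-positive form, no cancellation
  double s = 0;
  for (int n = 1; n <= N; ++n) s += 1.0/(2.0*n*(2.0*n + 1));
  return s;
}

int main(void)
{
  const int N = 1000;
  std::printf("%.16e  %.16e  %.16e\n", s1(N), s2(N), s3(N));
  return 0;
}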

3.7. Total errors in algorithms and their computational implementation#

As you can see, there are several sources of error in a computation. Some come from the mathematical approximation, called \(\epsilon_a\), and others are intrinsic to the numerical representation; call them \(\epsilon_r\). Sometimes the rounding/truncation error is modeled as a random walk, \(\epsilon_r \simeq \sqrt{N} \epsilon_m\), where \(\epsilon_m\) is the machine precision and \(N\) is the representative number of “steps”. From here, the total error can be estimated as

(3.3)#\[\begin{align} \epsilon_{\rm tot} &= \epsilon_a + \epsilon_r \\ &= \frac{\alpha}{N^\beta} + \sqrt{N} \epsilon_m. \end{align}\]
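
Setting the derivative with respect to \(N\) to zero gives a quick estimate of the optimal number of steps:

\[\frac{d\epsilon_{\rm tot}}{dN} = -\frac{\alpha\beta}{N^{\beta+1}} + \frac{\epsilon_m}{2\sqrt{N}} = 0 \quad\Rightarrow\quad N_{\rm opt} = \left(\frac{2\alpha\beta}{\epsilon_m}\right)^{1/(\beta+1/2)}.\]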

From this expression you can compute the optimal value of \(N\) (as sketched above), which will depend on the order of the mathematical algorithm and on the machine precision. The next figure shows some examples that illustrate this point:

(Figure: under-over-truncation)

3.8. How to minimize numerical errors?#