3. Errors in numerical computation: Floating point numbers#
Numbers are represented in bits: https://en.wikipedia.org/wiki/Byte
There is no way to assign infinitely many bits for infinite precision. A finite number of bits is assigned: some of them are spent on the exponent, others on the mantissa, and one on the sign. https://en.wikipedia.org/wiki/Floating-point_arithmetic
This implies limits on precision, range, and even density (https://docs.python.org/3/tutorial/floatingpoint.html): there are 8388607 single-precision numbers strictly between 1.0 and 2.0, but only 8191 between 1024.0 and 1025.0.
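One can check this uneven density directly with std::nextafter (from <cmath>), which returns the adjacent representable value; a minimal sketch:

#include <cstdio>
#include <cmath>

int main(void)
{
  // gap between consecutive single-precision floats near 1.0 and near 1024.0
  float gap1 = std::nextafter(1.0f, 2.0f) - 1.0f;
  float gap2 = std::nextafter(1024.0f, 2048.0f) - 1024.0f;
  std::printf("gap near 1.0    = %.8e\n", gap1); // 2^-23 ~ 1.19e-07
  std::printf("gap near 1024.0 = %.8e\n", gap2); // 2^-13 ~ 1.22e-04
  return 0;
}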
To avoid differences across platforms, the IEEE 754 standard establishes rules for rounding, truncation, and so on: https://en.wikipedia.org/wiki/IEEE_754 . Languages like C/C++/Fortran/Python support the standard. In particular, for scientific applications try to use double precision.
There are several possible representations of floating-point numbers, such as fixed point (https://en.wikipedia.org/wiki/Fixed-point_arithmetic), decimal representations, etc., but the common one in the IEEE 754 standard is an “exponential” scientific notation. Play with https://bartaz.github.io/ieee754-visualization/
| Format | Bits | Exponent bits | Significand bits | Machine eps |
|--------|------|---------------|------------------|-------------|
| float  | 32   | 8             | 23               | \(10^{-7}\) |
| double | 64   | 11            | 52               | \(10^{-16}\) |
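To see the table's layout in action, one can reinterpret the bits of a float (the value -6.25 is an arbitrary example); a minimal sketch:

#include <cstdio>
#include <cstdint>
#include <cstring>

int main(void)
{
  float x = -6.25f;
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));     // reinterpret the 32 bits of the float
  std::uint32_t sign = bits >> 31;          // 1 sign bit
  std::uint32_t expo = (bits >> 23) & 0xFF; // 8 exponent bits, biased by 127
  std::uint32_t frac = bits & 0x7FFFFF;     // 23 significand (fraction) bits
  std::printf("sign = %u, exponent = %u (unbiased %d), fraction = 0x%06X\n",
              sign, expo, static_cast<int>(expo) - 127, frac);
  return 0;
}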
For more fp representation details, check
https://bartaz.github.io/ieee754-visualization/
https://float.exposed/0x44c00000
https://www.h-schmidt.net/FloatConverter/IEEE754.html
https://en.wikipedia.org/wiki/Floating-point_arithmetic
https://trekhleb.dev/blog/2021/binary-floating-point/
Some warnings:
About 14 decimal places can be lost in a single operation (e.g., subtracting nearly equal numbers).
Do not cast floats to integers and vice versa.
Addition is no longer associative: \(x + (y+z) \ne (x+y) + z\). Try \(x = -1.5\times 10^{38}\), \(y = +1.5\times 10^{38}\), \(z = 1\) (single precision).
Not every number can be exactly represented in binary (1/3, 0.1, etc.).
Compilers have many floating-point-related flags; explore them (-ffast-math, -ffloat-store).
For some dramatic examples of FP errors, check:
https://www.iro.umontreal.ca/~mignotte/IFT2425/Disasters.html
https://web.ma.utexas.edu/users/arbogast/misc/disasters.html
https://slate.com/technology/2019/10/round-floor-software-errors-stock-market-battlefield.html
https://stackoverflow.com/questions/2732845/real-life-example-fo-floating-point-error
3.1. Overflow for integers#
We will repeatedly double an integer until some error occurs:
#include <cstdio>

int main(void)
{
  int a = 1;
  while (a > 0) {
    a *= 2; // doubles until signed overflow (undefined behaviour) flips the sign
    std::printf("%10d\n", a);
  }
}
3.2. Underflow and overflow for floats#
#include <iostream>
#include <cstdlib>

typedef float REAL; // switch to double or long double to compare ranges

int main(int argc, char **argv)
{
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " N\n";
    return 1;
  }
  std::cout.precision(16);
  std::cout.setf(std::ios::scientific);
  const int N = std::atoi(argv[1]);
  REAL under = 1.0, over = 1.0;
  for (int ii = 0; ii < N; ++ii) {
    under /= 2.0; // eventually underflows through denormals to 0
    over *= 2.0;  // eventually overflows to inf
    std::cout << ii << "\t" << under << "\t" << over << "\n";
  }
}
3.3. Test for associativity#
#include <cstdio>

int main(void)
{
  float x = -1.5e38f, y = 1.5e38f, z = 1;
  // (y + z) rounds back to y, so x + (y + z) = 0, while (x + y) + z = 1
  std::printf("%16.6f\t%16.6f\n", x + (y + z), (x + y) + z);
  // ten copies of 0.1 do not sum exactly to 1.0
  std::printf("%16.16f\n", 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1);
  return 0;
}
3.4. Machine eps#
You can also compute/verify the machine precision \(\epsilon_m\) of your machine for different types. The machine precision is, roughly, the largest number that is lost when added to one: \(1_c + \epsilon_m = 1_c\), where \(1_c\) is the computational representation of the number 1. In practice this means that any real number \(x\) is actually represented as \(x_c = x(1+\epsilon)\), with \(|\epsilon| \le \epsilon_m\).
Implement the following algorithm to compute it and report your results:
eps = 1.0
repeat N times:
    eps = eps/2.0
    one = 1.0 + eps
    print one, eps
#include <iostream>
#include <cstdlib>

typedef long double REAL; // try float and double too

int main(int argc, char **argv)
{
  std::cout.precision(20);
  std::cout.setf(std::ios::scientific);
  const int N = std::atoi(argv[1]); // number of halvings
  REAL eps = 1.0, one = 0.0;
  for (int ii = 0; ii < N; ++ii) {
    eps /= 2.0;
    one = 1.0 + eps; // once eps < eps_m, this prints exactly 1.0
    std::cout << ii << "\t" << one << "\t" << eps << "\n";
  }
}
3.5. Practical example: Series expansion of \(e^{-x}\)#
The function \(e^{-x}\) can be expanded as \(e^{-x} = \sum_{i=0}^{\infty} (-1)^i \frac{x^i}{i!} = 1 - x + \frac{x^2}{2} - \frac{x^3}{6} + \ldots\), valid for \(|x| < \infty\).
This is a great expansion, valid for all finite values of \(x\). But, what numerical problems do you see?
Implement a function that receives \(x\) and \(N\) (max number of terms) and saves the partial sums of the series as a function of \(i\) in a file called sumdata.txt. Then load the data and plot it. Use \(N = 10000\) and \(x = 1.8766\). You will also need to implement a factorial function.
#include <iostream>
#include <cmath>

double factorial(int n);
double fnaive(double x, int N);
double fsmart(double x, int N);

int main(void)
{
  std::cout.precision(16);
  std::cout.setf(std::ios::scientific);
  double x = 1.234534534;
  for (int NMAX = 0; NMAX <= 100; ++NMAX) {
    std::cout << NMAX
              << "\t" << fnaive(x, NMAX)
              << "\t" << std::fabs(fnaive(x, NMAX) - std::exp(-x))/std::exp(-x)
              << "\t" << fsmart(x, NMAX)
              << "\t" << std::fabs(fsmart(x, NMAX) - std::exp(-x))/std::exp(-x)
              << std::endl;
  }
  return 0;
}

// naive: recompute each term from scratch with pow and factorial
double fnaive(double x, int N)
{
  double term = 0, suma = 0;
  for (int k = 0; k <= N; ++k) {
    term = std::pow(-x, k)/factorial(k);
    suma += term;
  }
  return suma;
}

// returns double: an int return type would overflow for n > 12
double factorial(int n)
{
  if (n <= 0) return 1;
  return n*factorial(n-1);
}

// smart: build each term from the previous one, avoiding pow and factorial
double fsmart(double x, int N)
{
  double term = 1, suma = 1;
  for (int k = 0; k < N; ++k) {
    term *= (-x)/(k + 1);
    suma += term;
  }
  return suma;
}
3.5.1. Exercise: Subtractive cancellation with the quadratic equation#
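The roots of \(a x^2 + b x + c = 0\) are \(x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\). When \(b^2 \gg 4ac\), the root that subtracts \(\sqrt{b^2 - 4ac}\) from a nearly equal \(b\) loses precision, while the algebraically identical form \(x_{1,2}' = \frac{-2c}{b \pm \sqrt{b^2 - 4ac}}\) computes that same root stably. A minimal sketch (the coefficients are an illustrative choice):

#include <cstdio>
#include <cmath>

int main(void)
{
  // illustrative coefficients with b*b >> 4*a*c
  double a = 1.0, b = 1.0e8, c = 1.0;
  double sq = std::sqrt(b*b - 4.0*a*c);
  double x1_naive = (-b + sq)/(2.0*a); // subtracts two nearly equal numbers
  double x1_smart = -2.0*c/(b + sq);   // same root, no cancellation
  std::printf("naive x1 = %.16e\n", x1_naive);
  std::printf("smart x1 = %.16e\n", x1_smart);
  return 0;
}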
3.5.2. Exercise: Sums up and down#
Is there any computational difference between the following sums, which add the same terms upward and downward?
\(S^{({\rm up})} = \sum_{n=1}^{N} \frac{1}{n}, \qquad S^{({\rm down})} = \sum_{n=N}^{1} \frac{1}{n}\)
Implement and plot the relative difference between them as a function of \(N\).
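A minimal sketch of the two sums, in single precision where the effect appears sooner (N is an arbitrary choice):

#include <cstdio>
#include <cmath>

int main(void)
{
  const int N = 1000000;
  float up = 0.0f, down = 0.0f;
  for (int n = 1; n <= N; ++n) up += 1.0f/n;   // big terms first: later small terms get absorbed
  for (int n = N; n >= 1; --n) down += 1.0f/n; // small terms first: they accumulate before the big ones
  std::printf("up = %.8e\tdown = %.8e\trel. diff = %.3e\n",
              up, down, std::fabs(up - down)/(0.5f*(up + down)));
  return 0;
}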
3.6. Errors#
FP arithmetic is well defined through the IEEE 754 standard.
A simple subtraction could destroy 15 decimal places of precision
You should not cast floats to integers
You should normalize your models to natural units
Addition is not always associative: \(x + (y + z) \ne (x+y) + z\), when \(x = -1.5\times 10^{38}, y = +1.5\times 10^{38}, z = 1.0\) (single precision)
All numbers can be represented in binary: false. Check 0.1 or 0.3.
Kinds of errors
Probability of an error: an error can enter at any stage of the chain start \(\to U_1 \to U_2 \to \ldots \to U_n \to\) end
Blunders: typographical errors, a wrong program, etc.
Random errors: electronics, alien invasion, etc.
Approximation: mathematical series truncation
Roundoff and truncation of a number in the computer representation
3.6.1. Roundoff/truncation example#
Let’s compute the following difference as a function of \(k\): \(f(k) = \left|\frac{k}{10} - \sum_{i=1}^k 0.1\right|.\) Mathematically, this function should always give 0. Is that true?
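A minimal sketch to check it (the range of \(k\) is an arbitrary choice):

#include <cstdio>
#include <cmath>

int main(void)
{
  double sum = 0.0;
  for (int k = 1; k <= 100; ++k) {
    sum += 0.1;                       // 0.1 is not exactly representable in binary
    double f = std::fabs(k/10.0 - sum);
    std::printf("%d\t%.16e\n", k, f); // mathematically 0, numerically not
  }
  return 0;
}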
3.6.2. Subtractive cancellation#
Given \(a = b - c\), the computer actually evaluates \(a_c = b_c - c_c = b(1+\epsilon_b) - c(1+\epsilon_c)\), therefore
\(\frac{a_c}{a} = 1 + \frac{b}{a}\epsilon_b - \frac{c}{a}\epsilon_c,\)
so the error in \(a\) is about
\(\epsilon_a \simeq \frac{b}{a}(\epsilon_b - \epsilon_c),\)
which is much larger when \(b\) and \(c\) are close (so \(a\) is small).
3.6.2.1. Example of subtractive cancellation#
The following series,
\(S_N^{(1)} = \sum_{n=1}^{2N} (-1)^n \frac{n}{n+1},\)
can be written in two mathematically equivalent ways:
\(S_N^{(2)} = -\sum_{n=1}^{N} \frac{2n-1}{2n} + \sum_{n=1}^{N} \frac{2n}{2n+1},\)
and
\(S_N^{(3)} = \sum_{n=1}^{N} \frac{1}{2n(2n+1)}.\)
Could there be any computational difference? Why?
Write a program that computes all three sums and, assuming that \(S_N^{(3)}\) is the best option (why?), plots the relative difference of the other two with respect to it as a function of \(N\).
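A minimal sketch of the three sums (the values of \(N\) are an arbitrary choice; redirect the output to a file to plot it):

#include <cstdio>
#include <cmath>

int main(void)
{
  for (int N = 1; N <= 1000000; N *= 10) {
    double s1 = 0, s2a = 0, s2b = 0, s3 = 0;
    for (int n = 1; n <= 2*N; ++n) s1 += (n % 2 == 0 ? 1.0 : -1.0)*n/(n + 1.0);
    for (int n = 1; n <= N; ++n) {
      s2a += (2.0*n - 1)/(2.0*n);
      s2b += (2.0*n)/(2.0*n + 1);
      s3  += 1.0/(2.0*n*(2.0*n + 1));
    }
    double s2 = -s2a + s2b;
    std::printf("%d\t%.3e\t%.3e\n", N,
                std::fabs(s1 - s3)/std::fabs(s3),  // relative difference S1 vs S3
                std::fabs(s2 - s3)/std::fabs(s3)); // relative difference S2 vs S3
  }
  return 0;
}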
3.7. Total errors in algorithms and their computational implementation#
As you can see, there are several sources of error in a computation. Some come from the mathematical approximation, called \(\epsilon_a\), and others are intrinsic to the numerical representation; we can call them \(\epsilon_r\). Sometimes the roundoff/truncation errors are modeled as a random walk, \(\epsilon_r \simeq \sqrt{N} \epsilon_m\), where \(\epsilon_m\) is the machine precision and \(N\) is the representative number of “steps”. From here, the total error can be estimated as
\(\epsilon_{\rm tot} = \epsilon_a + \epsilon_r \simeq \frac{\alpha}{N^{\beta}} + \sqrt{N}\,\epsilon_m,\)
for an algorithm whose approximation error decreases as \(\alpha/N^{\beta}\).
You can differentiate this equation and compute the optimal value for \(N\), which will depend on the order of the mathematical algorithm and the machine precision. The next table shows some examples that illustrate this point:
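As a worked example (the values \(\beta = 2\), \(\alpha \approx 1\), and \(\epsilon_m \approx 10^{-16}\) are illustrative assumptions): setting \(\frac{d\epsilon_{\rm tot}}{dN} = -\frac{\beta\alpha}{N^{\beta+1}} + \frac{\epsilon_m}{2\sqrt{N}} = 0\) gives \(N^{\beta+1/2} = \frac{2\beta\alpha}{\epsilon_m}\); for the assumed values, \(N^{5/2} = 4\times 10^{16}\), so the optimal \(N \approx 4\times 10^{6}\).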
3.8. How to minimize numerical errors?#
Use numerical libraries: LAPACK, Eigen, the standard C++ library (https://en.cppreference.com/w/, https://hackingcpp.com/cpp/cheat_sheets.html)
Analyze extreme cases and try to rewrite the expressions (see the sketch after this list): http://herbie.uwplse.org/
Minimize subtractions of similar numbers and operations between very large and/or very small numbers; normalize your models.
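As an example of such a rewrite (a classic identity, not Herbie output): for large \(x\), \(\sqrt{x+1} - \sqrt{x}\) suffers cancellation, but it equals \(\frac{1}{\sqrt{x+1} + \sqrt{x}}\), which does not. A minimal sketch in single precision:

#include <cstdio>
#include <cmath>

int main(void)
{
  float x = 1.0e6f;
  float naive = std::sqrt(x + 1.0f) - std::sqrt(x);        // subtracts two close numbers
  float smart = 1.0f/(std::sqrt(x + 1.0f) + std::sqrt(x)); // algebraically identical, stable
  std::printf("naive = %.8e\nsmart = %.8e\n", naive, smart);
  return 0;
}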
Some interesting refs: