12. Computational linear algebra: Performance exploration#

In this section we will learn how to benchmark some matrix operations to get an idea of their performance. We will do this manually, but in the future you can use specialized tools like google/benchmark, https://github.com/google/benchmark , or nanobench, https://nanobench.ankerl.com/ , etc.

You can explore some general remarks and specific techniques at

In our example, we will use the Eigen C++ library. Therefore, we will start with some Eigen C++ examples.

12.1. Eigen C++ library#

“Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.” Taken from http://eigen.tuxfamily.org/index.php?title=Main_Page

For a quick start go to: http://eigen.tuxfamily.org/dox/GettingStarted.html

12.1.1. Creating basic matrices#

The following shows how to create a dynamic 2x2 matrix of doubles (X means dynamic, d means double), and also shows how to access a given element.

#include <iostream>
#include <eigen3/Eigen/Dense>

int main(void)
{
    Eigen::MatrixXd m(2, 2);
    m(0, 1) = -9.877;
    std::cout << m << std::endl;

    return 0;
}

     0 -9.877
     0      0

The following shows how to create a random matrix:

#include <iostream>
#include <cstdlib>
#include <eigen3/Eigen/Dense>

int main(int argc, char **argv)
{
    //const int SEED = std::atoi(argv[1]);
    //srand(SEED); // control seed
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(5,5);
    std::cout << m << std::endl;

    return 0;
}

  0.680375  -0.604897 -0.0452059    0.83239  -0.967399
 -0.211234  -0.329554   0.257742   0.271423  -0.514226
  0.566198   0.536459  -0.270431   0.434594  -0.725537
   0.59688  -0.444451  0.0268018  -0.716795   0.608354
  0.823295    0.10794   0.904459   0.213938  -0.686642

Exercise: Create an 8x8 random matrix and print it to the screen. Control the seed, and print it twice to check that the elements are the same.

Matrix-vector multiplication is very simple with the overloaded * operator.

#include <iostream>
#include <cstdlib>
#include <eigen3/Eigen/Dense>

int main()
{
    const int N = 3;
    srand(10); // control the seed
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(N, N);
    m = (m + Eigen::MatrixXd::Constant(N, N, 1.2)) * 50;
    std::cout << "m =" << std::endl << m << std::endl;
    Eigen::VectorXd v(N);
    v << 1, 2, 3;
    std::cout << "m * v =" << std::endl << m * v << std::endl;
}

12.1.2. Linear algebra with Eigen C++#

In this example we are going to use Eigen to solve the system \(\boldsymbol{A}\vec x = \vec{b}\) using the Householder QR decomposition, as done in http://eigen.tuxfamily.org/dox/group__TutorialLinearAlgebra.html .

#include <iostream>
#include <eigen3/Eigen/Dense>

int main()
{
    Eigen::Matrix3f A;
    Eigen::Vector3f b;
    A << 1,2,3,  4,5,6,  7,8,10;
    b << 3, 3, 4;
    std::cout << "Here is the matrix A:\n" << A << std::endl;
    std::cout << "Here is the vector b:\n" << b << std::endl;
    Eigen::Vector3f x = A.colPivHouseholderQr().solve(b);
    std::cout << "The solution is:\n" << x << std::endl;
    std::cout << "Verification:\n" << (A*x - b).norm() << std::endl;
}

Here is the matrix A:
 1  2  3
 4  5  6
 7  8 10
Here is the vector b:
3
3
4
The solution is:
      -2
0.999997
       1
Verification:
1.43051e-06

Exercise: Create a random 40x40 matrix \(A\), a random vector \(b\) of size 40, and solve the system \(Ax = b\) using FullPivHouseholderQR. Compare the execution time with a 100x100 problem size.

Exercise: Compile the previous example with profiling/debugging info, run it under valgrind with the cachegrind tool (valgrind --tool=cachegrind ./a.out), and check the output. Can you change or optimize any of that?

Now use Eigen to compute the eigenvalues of a given matrix:

#include <iostream>
#include <eigen3/Eigen/Dense>

int main()
{
    Eigen::Matrix2d A;
    A << 1, 2, 2, 3;
    std::cout << "Here is the matrix A:\n" << A << std::endl;
    Eigen::SelfAdjointEigenSolver<Eigen::Matrix2d> eigensolver(A);
    std::cout << "The eigenvalues of A are:\n" << eigensolver.eigenvalues() << std::endl;
    std::cout << "Here's a matrix whose columns are eigenvectors of A \n"
              << "corresponding to these eigenvalues:\n"
              << eigensolver.eigenvectors() << std::endl;
}

Exercise: Compute the eigenvalues of a random 20x20 matrix (if you use SelfAdjointEigenSolver, remember that it expects a self-adjoint matrix, so symmetrize your random matrix first, for instance as \(A + A^T\)).

12.2. Matrix operations performance: LU operations to solve Ax = b#

In the following we want to explore the performance of the program below as a function of the system size. We will explore the role of the compiler optimization flags, and possibly parallelization using threads. By reading https://eigen.tuxfamily.org/dox/TopicMultiThreading.html, we see that Eigen supports parallelization for some operations; we will take solving with partial-pivot LU as the example.

This is a very simple example of how to use it:

#include <iostream>
#include <cstdlib>
#include <eigen3/Eigen/Dense>

void solvesystem(int size);

int main(int argc, char ** argv)
{
  const int M = std::atoi(argv[1]); // Matrix size
  solvesystem(M);
}

void solvesystem(int size)
{
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(size, 1);
  // chrono start
  //Eigen::MatrixXd x = A.fullPivLu().solve(b);
  //Eigen::MatrixXd x = A.partialPivLu().solve(b);
  Eigen::MatrixXd x = A.lu().solve(b);
  // chrono end
  double relative_error = (A*x - b).norm() / b.norm(); // norm() is L2 norm
  std::cout << A << std::endl;
  std::cout << b << std::endl;
  std::cout << "The relative error is:\n" << relative_error << std::endl;
}

// use chrono to measure the time
// perf
// remove printing
// modify printing

 -0.999984  -0.562082  -0.232996  0.0594004  -0.165028
 -0.736924  -0.905911  0.0388327   0.342299   0.373545
  0.511211   0.357729   0.661931  -0.984604   0.177953
-0.0826997   0.358593  -0.930856  -0.233169   0.860873
 0.0655345   0.869386  -0.893077  -0.866316   0.692334
 0.0538576
  -0.81607
  0.307838
 -0.168001
  0.402381
The relative error is:
7.87638e-16

Now, to be able to call this in a parametric way, we keep the solve inside a function and remove the printing:

#include <iostream>
#include <cstdlib>
#include <eigen3/Eigen/Dense>

void solvesystem(int size);

int main(int argc, char ** argv)
{
  const int M = std::atoi(argv[1]); // Matrix size
  solvesystem(M);
}

void solvesystem(int size)
{
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(size, 1);
  // chrono start
  Eigen::MatrixXd x = A.fullPivLu().solve(b);
  // chrono end
  // double relative_error = (A*x - b).norm() / b.norm(); // norm() is L2 norm
  //std::cout << A << std::endl;
  //std::cout << b << std::endl;
  //std::cout << "The relative error is:\n" << relative_error << std::endl;
}

// use chrono to measure the time
// gprof / perf
// remove printing
// modify testing

If you use just

time ./a.out

you will be measuring the whole execution time, including the memory allocation, the actual operation, and the memory freeing. It is much better to measure only what we are interested in, which is the actual solve operation.

To create a reasonable benchmark, we need to

  • Fill the matrix randomly, controlling the seed

  • Measure only the processing time (both wall and cpu times)

  • Compute statistics (mean and standard deviation over several repetitions)

  • Print the time as a function of N and plot it (run this in the shell)

  • Use threads (openmp)


Filling the matrix randomly

For repeatability, we can control the seed with the srand function (check the manual, and also https://stackoverflow.com/questions/67387539/set-random-seed-in-eigen?noredirect=1&lq=1 and https://stackoverflow.com/questions/21292881/matrixxfrandom-always-returning-same-matrices ).
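For instance, a minimal sketch (Eigen's Random uses the std::rand generator internally, so srand fixes the sequence):

#include <iostream>
#include <cstdlib>
#include <eigen3/Eigen/Dense>

int main(int argc, char **argv)
{
  const int seed = std::atoi(argv[1]); // read the seed from the command line
  srand(seed); // fixes the sequence used by Eigen::MatrixXd::Random
  Eigen::MatrixXd m = Eigen::MatrixXd::Random(3, 3);
  std::cout << m << std::endl; // same seed -> same matrix on every run

  return 0;
}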


Measuring processing time only (both wall and cpu time)

To measure only the times associated with the processing, we will use the chrono library (https://en.cppreference.com/w/cpp/chrono). Before proceeding, please read first the documentation of system_clock, steady_clock, and high_resolution_clock, and decide which one to use. Also consider whether they measure the cpu time; if not, how could you measure it?
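As a minimal sketch of one possible approach: steady_clock measures wall time, while std::clock (from <ctime>) measures cpu time, which on typical Linux systems accumulates over all threads of the process:

#include <iostream>
#include <chrono>
#include <ctime>

int main()
{
  auto wstart = std::chrono::steady_clock::now(); // wall time start
  std::clock_t cstart = std::clock();             // cpu time start

  // ... here goes the operation to be measured ...

  std::clock_t cend = std::clock();
  auto wend = std::chrono::steady_clock::now();

  double wtime = std::chrono::duration<double>(wend - wstart).count();
  double ctime = double(cend - cstart) / CLOCKS_PER_SEC;
  std::cout << wtime << "\t" << ctime << std::endl;

  return 0;
}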


Statistics

We will need a function that repeats the measurement several times and then computes both the mean,

(12.1)#\[\begin{equation} \bar x = \dfrac{1}{N} \sum x_i, \end{equation}\]

and the standard deviation,

(12.2)#\[\begin{equation} \sigma = \sqrt{\dfrac{1}{N-1}\sum (x_i - \bar x)^2}. \end{equation}\]
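Note that if you keep only the running sums \(\sum x_i\) and \(\sum x_i^2\) (the wsum and wsum2 variables in the skeleton below), the standard deviation can be computed equivalently as \(\sigma = \sqrt{\left(\sum x_i^2 - N \bar x^2\right)/(N-1)}\), so there is no need to store all the samples (beware: this form can be numerically delicate when \(\sigma\) is much smaller than \(\bar x\)).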

Your main function must be something like

    #include <iostream>
    #include <chrono>
    #include <cstdlib>
    #include <ctime>
    #include <cmath>
    #include <eigen3/Eigen/Dense>

    void stats(const int size, const int reps, // input parameters
               double & mean_wtime, double & sigma_wtime, // output
               double & mean_ctime, double & sigma_ctime); // output

    int main(int argc, char ** argv)
    {
      const int M = std::atoi(argv[1]); // Matrix size
      const int R = std::atoi(argv[2]); // Repetitions

      double mean_wtime, sigma_wtime;
      double mean_ctime, sigma_ctime;
      stats(M, R, mean_wtime, sigma_wtime, mean_ctime, sigma_ctime);
      std::cout << M
                << "\t" << mean_wtime << "\t" << sigma_wtime
                << "\t" << mean_ctime << "\t" << sigma_ctime
                << std::endl;

      return 0;
    }

Exercise: Create a function that receives the size and the number of repetitions, and performs the corresponding call to the kernel function and computes and prints the stats.

    void stats(const int size, const int reps,
               double & mean_wtime, double & sigma_wtime,
               double & mean_ctime, double & sigma_ctime)
    {
      double wsum = 0, wsum2 = 0, csum = 0, csum2 = 0;
      double wtime = 0, ctime = 0;
      // TODO: loop over reps, time the kernel, and accumulate the sums
      std::cout << size
                << "\t" << mean_wtime << "\t" << sigma_wtime
                << "\t" << mean_ctime << "\t" << sigma_ctime
                << std::endl;
    }
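One possible way to fill in the TODO (a sketch, not the only valid approach): time the solvesystem kernel from the previous section with steady_clock for the wall time and std::clock for the cpu time, and use the running-sums form of the standard deviation. It assumes reps >= 2 and the headers already included in the main skeleton; printing is left to main as shown above.

    void stats(const int size, const int reps,
               double & mean_wtime, double & sigma_wtime,
               double & mean_ctime, double & sigma_ctime)
    {
      double wsum = 0, wsum2 = 0, csum = 0, csum2 = 0;
      for (int rep = 0; rep < reps; ++rep) {
        auto wstart = std::chrono::steady_clock::now();
        std::clock_t cstart = std::clock();
        solvesystem(size); // the kernel being benchmarked, defined elsewhere
        std::clock_t cend = std::clock();
        auto wend = std::chrono::steady_clock::now();
        const double wtime = std::chrono::duration<double>(wend - wstart).count();
        const double ctime = double(cend - cstart) / CLOCKS_PER_SEC;
        wsum += wtime; wsum2 += wtime*wtime;
        csum += ctime; csum2 += ctime*ctime;
      }
      mean_wtime = wsum/reps;
      sigma_wtime = std::sqrt((wsum2 - reps*mean_wtime*mean_wtime)/(reps - 1));
      mean_ctime = csum/reps;
      sigma_ctime = std::sqrt((csum2 - reps*mean_ctime*mean_ctime)/(reps - 1));
    }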

12.2.1. Threads#

Eigen includes parallelization with OpenMP threads. To use it, you need to compile with the -fopenmp flag and then execute as

OMP_NUM_THREADS=n ./a.out ...

where n is the number of threads. Check the following to see the kind of ops that can be parallelized: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html

Run the code for a large enough size and check if the flag has any effect.
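You can also check the setting from inside the code: Eigen::nbThreads and Eigen::setNbThreads are part of Eigen's multi-threading API (documented on the page linked above). A minimal sketch:

#include <iostream>
#include <eigen3/Eigen/Dense>

int main()
{
  // With -fopenmp, Eigen honors OMP_NUM_THREADS by default
  std::cout << "Eigen will use " << Eigen::nbThreads() << " threads\n";
  // Eigen::setNbThreads(4); // would override the environment setting
  return 0;
}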

12.2.2. Measuring time as a function of size#

Now you are ready to measure both the wall and cpu time as a function of size, for at least 5 samples per size, and also for several numbers of threads. Start using -O3 -fopenmp and plot the curves for 1, 2, 4, 8 and the maximum number of threads in your PC. Then try the same with -O0 -fopenmp, though you may not be able to reach large sizes. How will you run this in parallel? Take into account the memory consumption, and also that parallel -j sets the number of processes, which is different from the number of threads (a process can spawn many threads, and you should not launch more threads than are available in your machine).

12.3. Exercise: Measuring time for matrix-matrix multiplication, with optimization and parallelization#

Now do the same study but with matrix multiplication. Implement the matmul with Eigen (see the code snippet at the end), and also implement the typical three-loop matrix multiplication (a sketch is shown below; do not use it for large matrix sizes). Always compile with the flags -fopenmp -O3, and now execute in the following way

OMP_NUM_THREADS=VAR ./a.out ...

where VAR takes values from 1, 2, 3, 4, … up to the output of the command nproc. This means that you are starting to use parallelization. You must run this on a multicore system (or give your virtual machine enough resources) with at least 8 threads. Plot the time (with error bars) as a function of the system size; you will have one curve on that graph for each VAR value.
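For the three-loop version, something along these lines could work (a sketch, assuming row-major storage in a flat std::vector; the i-k-j loop order shown here is one common cache-friendly choice):

#include <vector>

// Naive O(N^3) matmul: C += A*B, all matrices N x N, row-major, C zero-initialized
void matmul(const std::vector<double> &A, const std::vector<double> &B,
            std::vector<double> &C, int N)
{
  for (int i = 0; i < N; ++i) {
    for (int k = 0; k < N; ++k) {
      const double aik = A[i*N + k];
      for (int j = 0; j < N; ++j) {
        C[i*N + j] += aik * B[k*N + j];
      }
    }
  }
}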

Basic snippet for matmul in eigen:

#include <chrono>
#include <iostream>
#include <eigen3/Eigen/Dense>

double multiply(int size)
{
  // create matrices
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(size, size);

  auto start = std::chrono::steady_clock::now();
  Eigen::MatrixXd C = A*B; // MULTIPLY (do not use auto here: it would keep a lazy, unevaluated expression)
  double tmp = C(0, 0); // use the matrix to make eigen compute it
  auto end = std::chrono::steady_clock::now();

  std::clog << tmp << std::endl; // use the matrix to make eigen compute it

  std::chrono::duration<double> elapsed_seconds = end - start;
  return elapsed_seconds.count();
}

12.4. Using gnu parallel to run several matmuls and compute first parallel metrics#

Now let’s use the previous matmul exercise (eigen + blas) to show how to use parallel to run several matrix sizes at the same time, and also to compute some parallel metrics. The code is the eigen-matmul.cpp program compiled below.

We have two goals:

  1. To compute the wall time as a function of the matrix size (changing the problem size at a fixed number of threads), using blas.

  2. To compute the speedup and parallel efficiency for a fixed matrix size as the number of threads changes (strong scaling).

In the first case we will just use parallel to run as many simulations as possible. In the second case we will compute some metrics to check when our code is most efficient.

It is assumed that you have blas installed with spack, so you can load it as

spack load openblas

or you might have to install it using

spack install openblas threads=openmp cxxflags="-O3" cflags="-O3" target=x86_64

(the last flag is just to have the same target independent of the actual machine, like the conditions we have in the computer room)

  1. Time as a function of matrix size: one thread, eigen + blas

    Naively, we could just run the program serially for each matrix size, but if we are on a computer with multiple cores/threads it is better to run as many cases as possible at the same time (adapt the following instructions to your case):

    #source $HOME/repos/spack/share/spack/setup-env.sh
    spack load openblas
    #g++ -fopenmp -O3 -I $CMAKE_PREFIX_PATH/include -L $CMAKE_PREFIX_PATH/lib eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
    g++ -fopenmp -O3  eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
    parallel 'OMP_NUM_THREADS=1 ./eigen_blas.x {} 10 2>/dev/null' ::: 10 50 100 200 500 700 1000 2000 5000
    

    Check that your programs are running in parallel, as expected.

    1. Exercise: Scaling with problem size for eigen eigenvectors

      With the following code, study how the time eigen needs to compute eigenvalues and eigenvectors (using SelfAdjointEigenSolver) scales with the matrix size:

      #include <iostream>
      #include <cstdlib>
      #include <chrono>
      #include <eigen3/Eigen/Dense>
      
      void solve_eigensystem(int size, double &time);
      
      int main(int argc, char **argv)
      {
        const int M = atoi(argv[1]); // Matrix size
        const int R = atoi(argv[2]); // Repetitions
        const int S = atoi(argv[3]); // seed
        srand(S);
      
        double totaltime = 0, auxtime = 0;
        for(int irep = 0; irep < R; ++irep) {
          solve_eigensystem(M, auxtime);
          totaltime += auxtime;
        }
        std::cout << M << "\t" << totaltime/R << "\n";
      
       }
      
      void solve_eigensystem(int size, double &time)
      {
        double aux;
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
        auto start = std::chrono::steady_clock::now();
        Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> eigensolver(A);
        aux = eigensolver.eigenvalues()(0);
        auto end = std::chrono::steady_clock::now();
        std::clog << "The first eigenvalue of A is:\n" << aux << std::endl;
        std::chrono::duration<double> diff = end - start;
        time = diff.count();
      }
      

      Compile as

      g++ -O3 eigen-eigenvectors.cpp  -o eigen.x
      

      Execute as

      ./eigen.x M 5 3 2>/dev/null
      

      You must change M, the matrix size.

      Example data (they depend on the machine, but the ratios among them should be similar):

      10      9.7568e-06
      20      3.55734e-05
      50      0.000312481
      80      0.000890043
      ...
      2700    72.8078
      3000    81.0619
      

      Plot the data and analyze.

  2. Strong scaling: number of threads and parallel metrics

    Here we will compute some key parallel metrics that tell us how efficient our code is when running in parallel. Now you do not want to use gnu parallel, since you have a variable number of threads per process:

    source $HOME/repos/spack/share/spack/setup-env.sh
    spack load openblas
    #g++ -fopenmp -O3 -I $CMAKE_PREFIX_PATH/include -L $CMAKE_PREFIX_PATH/lib eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
    g++ -fopenmp -O3 eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
    parallel -j 1 'echo -n "{}  "; OMP_NUM_THREADS={} ./eigen_blas.x 4000 10 2>/dev/null' ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
    

    The following is data obtained from a run on an 8-core/16-thread computer, running only with eigen (columns: threads, size, mean wall time, its standard deviation, mean cpu time, its standard deviation):

     1  4000  6.75223  0.00290168   6.75225  0.0029001
     2  4000  3.52052  0.00405504   7.01616  0.00819132
     3  4000  2.40281  0.0117795    7.15847  0.0355119
     4  4000  1.85186  0.0049013    7.33257  0.0187681
     5  4000  1.72682  0.176218     8.53451  0.880953
     6  4000  1.65921  0.00946933   9.83127  0.0574367
     7  4000  1.52068  0.00538196  10.4943   0.0370317
     8  4000  1.39755  0.0326183   11.006    0.260568
     9  4000  2.26355  0.00254841  19.9546   0.0452903
    10  4000  2.04808  0.00732663  20.0991   0.0807175
    11  4000  2.00821  0.00876695  21.7043   0.104527
    12  4000  1.76768  0.0276189   20.834    0.324801
    13  4000  1.77771  0.00686642  22.7412   0.0887671
    14  4000  1.59293  0.0116353   21.9208   0.213236
    15  4000  1.56692  0.00829185  23.1334   0.121202
    16  4000  1.49262  0.0321579   23.3921   0.417985
    

    And these are the results running with eigen+blas:

     1  4000  1.76345   0.00750705   1.76345  0.0075023
     2  4000  0.930922  0.00450842   1.83688  0.00900039
     3  4000  0.666528  0.0122499    1.94996  0.0365303
     4  4000  0.523076  0.00175201   2.01795  0.00654245
     5  4000  0.442096  0.00226719   2.1107   0.0108692
     6  4000  0.394103  0.00867531   2.23982  0.0513271
     7  4000  0.371224  0.000725666  2.44876  0.00691333
     8  4000  0.35202   0.00564542   2.64095  0.0441576
     9  4000  0.53266   0.00218486   4.59061  0.02803
    10  4000  0.496156  0.00207424   4.73416  0.0281744
    11  4000  0.461317  0.00102704   4.82462  0.0111843
    12  4000  0.550406  0.0431109    6.32975  0.518631
    13  4000  0.514583  0.0119228    6.38601  0.144949
    14  4000  0.494073  0.0166192    6.58912  0.231588
    15  4000  0.484151  0.0121776    6.90909  0.179303
    16  4000  0.493364  0.0316198    7.45327  0.429279
    

    The two most important parallel metrics are the speedup and the parallel efficiency.

    The speedup is defined as

    (12.3)#\[\begin{equation} S(n) = \dfrac{T_1}{T_n}, \end{equation}\]

    where \(T_1\) is the reference time with one thread, and \(T_n\) the time with \(n\) threads. For perfect scaling, \(T_n = T_1/n\), so \(S_{\rm theo}(n) = n\). This does not always happen; it depends on the actual machine, the communication patterns, and so on. In the following we will use \(T_1 = 6.75223\) for eigen and \(T_1 = 1.76345\) for eigen+blas.

    The parallel efficiency measures roughly how efficiently we are using all the threads. It is defined as

    (12.4)#\[\begin{equation} E(n) = \dfrac{S(n)}{n} = \dfrac{T_1}{n\,T_n}, \end{equation}\]

    and, therefore, for perfect scaling it is theoretically equal to 1.
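    For example, taking the eigen data above at \(n = 8\): \(S(8) = 6.75223/1.39755 \approx 4.8\), so \(E(8) = 4.8/8 \approx 0.60\), i.e. roughly 60% parallel efficiency.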

    To easily create the data we need, we could use a spreadsheet or, better, the command line:

    awk '{print $1, 6.75223/$3, 6.75223/$3/$1 }' codes/eigen.txt > codes/eigen_metrics.txt
    awk '{print $1, 1.76345/$3, 1.76345/$3/$1 }' codes/eigen_blas.txt > codes/eigen_blas_metrics.txt
    

    Then we can plot and analyze

    #set term png enhanced giant; set out 'tmpspeedup.png'
    set key l t; set xlabel 'nthreads'; set ylabel 'Parallel speedup'; set title 'Computer with 8cores/16threads'
    plot [:17][:] x lt -1 t 'theo', 'codes/eigen_metrics.txt' u 1:2 w lp t 'eigen', 'codes/eigen_blas_metrics.txt'  u 1:2 w lp t 'eigen+blas'
    
    #set term png; set out 'efficiency.png'
    set key l b; set xlabel 'nthreads'; set ylabel 'Parallel efficiency'; set title 'Computer with 8cores/16threads'
    plot [:17][-0.1:1.1] 1 lt -1, 'codes/eigen_metrics.txt' u 1:3 w lp t 'eigen', 'codes/eigen_blas_metrics.txt'  u 1:3 w lp t 'eigen+blas', 0.6 lt 4
    

    Normally, if you want to buy some cloud services to run your code, you should use a number of threads that gives an efficiency at or above 60-70%.