12. Performance measurement for some matrix ops#

In this section we will learn how to benchmark some matrix operations to get an idea of their performance. This will be done manually, but in the future you can use specialized tools like google/benchmark or nanobench (https://nanobench.ankerl.com/), among others; see https://news.ycombinator.com/item?id=39714053 for a discussion of more options.

12.1. LU operations to solve Ax = b#

In the following we want to explore the performance of this program as a function of the system size:

#include <iostream>
#include <eigen3/Eigen/Dense>

void solvesystem(int size);

int main(int argc, char ** argv)
{
  if (argc < 2) {
    std::cerr << "Usage: " << argv[0] << " SIZE\n";
    return 1;
  }
  const int M = atoi(argv[1]); // Matrix size
  solvesystem(M);
}

void solvesystem(int size)
{
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(size, 1);
  // chrono start
  //Eigen::MatrixXd x = A.fullPivLu().solve(b);
  Eigen::MatrixXd x = A.lu().solve(b);
  // chrono end
  double relative_error = (A*x - b).norm() / b.norm(); // norm() is L2 norm
  std::cout << A << std::endl;
  std::cout << b << std::endl;
  std::cout << "The relative error is:\n" << relative_error << std::endl;
}

// use chrono to measure the time
// gprof
// remove printing
// modify printing

Example output for size 5 (the 5x5 matrix A, the vector b, and the relative error):

-0.999984  -0.562082  -0.232996   0.0594004  -0.165028
-0.736924  -0.905911   0.0388327  0.342299    0.373545
 0.511211   0.357729   0.661931  -0.984604    0.177953
-0.0826997  0.358593  -0.930856  -0.233169    0.860873
 0.0655345  0.869386  -0.893077  -0.866316    0.692334
 0.0538576
-0.81607
 0.307838
-0.168001
 0.402381
The relative error is:
7.87638e-16

#include <iostream>
#include <eigen3/Eigen/Dense>

void solvesystem(int size);

int main(int argc, char ** argv)
{
  const int M = atoi(argv[1]); // Matrix size
  solvesystem(M);
}

void solvesystem(int size)
{
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(size, 1);
  // chrono start
  Eigen::MatrixXd x = A.fullPivLu().solve(b);
  // chrono end
  // double relative_error = (A*x - b).norm() / b.norm(); // norm() is L2 norm
  //std::cout << A << std::endl;
  //std::cout << b << std::endl;
  //std::cout << "The relative error is:\n" << relative_error << std::endl;
}

// use chrono to measure the time
// gprof
// remove printing
// modify printing

If you use just

time ./a.out

you will be measuring the whole execution time, including the memory allocation, the actual operation, and the memory freeing. It is much better to measure only what we are interested in, which is the actual solver operation.

To create a reasonable benchmark, we need to

  • Fill the matrix randomly, controlling the seed

  • Measure only processing time (wall and cpu times)

  • Compute statistics over many repetitions (mean and standard deviation)

  • Print the time as a function of N and plot (run this in the shell)

  • Use threads (openmp)

12.1.1. Filling the matrix randomly#

For repeatability, we can control the seed with the srand function (check the manual, and also https://stackoverflow.com/questions/67387539/set-random-seed-in-eigen?noredirect=1&lq=1 and https://stackoverflow.com/questions/21292881/matrixxfrandom-always-returning-same-matrices ).

12.1.2. Measuring processing time only (both wall and cpu time)#

To measure only the times associated with the processing, we will use the chrono library (https://en.cppreference.com/w/cpp/chrono). Before proceeding, please read first the documentation of system_clock, steady_clock, and high_resolution_clock, and decide which one to use. Also, consider whether they measure the cpu time; if not, how can you measure it?

To check:

  • [ ] What happens if you comment out the printing and compile with -O3?

  • [ ] Compare the times you are getting with /usr/bin/time.

12.1.3. Stats#

Now introduce the ability to repeat the timing many times and compute statistics. The number of reps will be a parameter.

12.1.4. Threads#

Eigen includes parallelization with OpenMP threads. To use them, you need to compile with the -fopenmp flag and then execute as

OMP_NUM_THREADS=n ./a.out ...

where n is the number of threads. Check the following to see the kind of ops that can be parallelized: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html

12.2. Exercise: Measuring time for matrix-matrix multiplication, with optimization and parallelization#

Now do the same study but with matrix multiplication. Implement the matmul with eigen (see the code snippet at the end). Always compile with the flags “-fopenmp -O3”, and execute in the following way

OMP_NUM_THREADS=VAR ./a.out ...

where VAR takes values 1, 2, 3, 4, … up to the output of the command nproc. This means that you are starting to use parallelization. You must run this on a multicore system (or give your virtual machine enough resources) with at least 8 threads. Plot the time (with error bars) as a function of the system size; you will have several curves on that graph, one for each VAR value.

Basic snippet for matmul in eigen:

double multiply(int size)
{
  // create matrices
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(size, size);

  auto start = std::chrono::steady_clock::now();
  Eigen::MatrixXd C = A*B; // MULTIPLY (assign to MatrixXd to force evaluation; with auto this would stay a lazy expression)
  double tmp = C(0, 0); // use the result so it cannot be optimized away
  auto end = std::chrono::steady_clock::now();

  std::clog << tmp << std::endl; // print so the computation is not optimized away

  std::chrono::duration<double> elapsed_seconds = end-start;
  return elapsed_seconds.count();
}
#include <iostream>
#include <chrono>
#include <ctime>
#include <cmath>
#include <eigen3/Eigen/Dense>


void multiply(const int size, double & wtime, double & ctime); // store time spent multiplying
void stats(const int size, const int reps,
           double & mean_wtime, double & sigma_wtime,
           double & mean_ctime, double & sigma_ctime);

int main(int argc, char ** argv)
{
  if (argc < 3) {
    std::cerr << "Usage: " << argv[0] << " SIZE REPS\n";
    return 1;
  }
  const int M = atoi(argv[1]); // Matrix size
  const int R = atoi(argv[2]); // Repetitions

  double mean_wtime, sigma_wtime;
  double mean_ctime, sigma_ctime;
  stats(M, R, mean_wtime, sigma_wtime, mean_ctime, sigma_ctime);
  std::cout << M
            << "\t" << mean_wtime << "\t" << sigma_wtime
            << "\t" << mean_ctime << "\t" << sigma_ctime
            << std::endl;

  return 0;
}
void stats(const int size, const int reps,
           double & mean_wtime, double & sigma_wtime,
           double & mean_ctime, double & sigma_ctime)
{
  double wsum = 0, wsum2 = 0, csum = 0, csum2 = 0;
  double wtime = 0, ctime = 0;
  for (int rep = 0; rep < reps; ++rep) {
    srand(rep+1); // change seed for each repetition
    multiply(size, wtime, ctime);
    wsum  += wtime;
    wsum2 += wtime*wtime;
    csum  += ctime;
    csum2 += ctime*ctime;
  }
  mean_wtime = wsum/reps;
  sigma_wtime = std::sqrt(reps*(wsum2/reps - mean_wtime*mean_wtime)/(reps-1));
  mean_ctime = csum/reps;
  sigma_ctime = std::sqrt(reps*(csum2/reps - mean_ctime*mean_ctime)/(reps-1));
}


void multiply(const int size, double & wtime, double & ctime)
{
  // create matrices
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(size, size);

  auto start = std::chrono::system_clock::now(); // measures wall time
  std::clock_t c1 = std::clock(); // measures cpu time
  Eigen::MatrixXd C = A*B; // MULTIPLY (assign to MatrixXd to force evaluation)
  double tmp = C(0, 0); // use the result so it cannot be optimized away
  auto end = std::chrono::system_clock::now(); // wall time
  std::clock_t c2 = std::clock(); // cpu time

  std::clog << tmp << std::endl; // print so the computation is not optimized away

  ctime = 1.0*(c2-c1)/CLOCKS_PER_SEC;
  std::chrono::duration<double> elapsed_seconds = end-start;
  wtime = elapsed_seconds.count();
}