12. Performance measurement for some matrix ops#
In this section we will learn how to benchmark some matrix operations to get an idea of their performance. We will do this manually; later you can use specialized tools like google/benchmark or nanobench, https://nanobench.ankerl.com/ . For a discussion of benchmarking tools see https://news.ycombinator.com/item?id=39714053
12.1. LU operations to solve Ax = b#
In the following we want to explore the performance of the program below as a function of the system size:
#include <iostream>
#include <eigen3/Eigen/Dense>
void solvesystem(int size);
int main(int argc, char ** argv)
{
const int M = atoi(argv[1]); // Matrix size
solvesystem(M);
}
void solvesystem(int size)
{
Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
Eigen::MatrixXd b = Eigen::MatrixXd::Random(size, 1);
// chrono start
//Eigen::MatrixXd x = A.fullPivLu().solve(b);
Eigen::MatrixXd x = A.lu().solve(b);
// chrono end
double relative_error = (A*x - b).norm() / b.norm(); // norm() is L2 norm
std::cout << A << std::endl;
std::cout << b << std::endl;
std::cout << "The relative error is:\n" << relative_error << std::endl;
}
// use chrono to measure the time
// gprof
// remove printing
// modify printing
Sample output (the printouts of A and b are omitted; only the last lines are shown):

The relative error is:
7.87638e-16
#include <iostream>
#include <eigen3/Eigen/Dense>
void solvesystem(int size);
int main(int argc, char ** argv)
{
const int M = atoi(argv[1]); // Matrix size
solvesystem(M);
}
void solvesystem(int size)
{
Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
Eigen::MatrixXd b = Eigen::MatrixXd::Random(size, 1);
// chrono start
Eigen::MatrixXd x = A.fullPivLu().solve(b);
// chrono end
// double relative_error = (A*x - b).norm() / b.norm(); // norm() is L2 norm
//std::cout << A << std::endl;
//std::cout << b << std::endl;
//std::cout << "The relative error is:\n" << relative_error << std::endl;
}
// use chrono to measure the time
// gprof
// remove printing
// modify testing
If you use just
time ./a.out
you will be measuring the whole execution time, including the memory allocation, the actual operation, and the memory deallocation. It is much better to measure only what we are interested in, which is the actual matrix operation.
To create a reasonable benchmark, we need to:

- Fill the matrix randomly, controlling the seed
- Measure only the processing time (both wall and cpu time)
- Compute statistics over several repetitions
- Print the time as a function of N and plot it (run this in the shell)
- Use threads (openmp)
12.1.1. Filling the matrix randomly#
For repeatability, we can control the seed with the srand function (check the manual, and also
https://stackoverflow.com/questions/67387539/set-random-seed-in-eigen?noredirect=1&lq=1
and
https://stackoverflow.com/questions/21292881/matrixxfrandom-always-returning-same-matrices
).
12.1.2. Measuring processing time only (both wall and cpu time)#
To measure only the time associated with the processing, we will use the chrono library (https://en.cppreference.com/w/cpp/chrono). Before proceeding, please read first the documentation of system_clock, steady_clock, and high_resolution_clock, and decide which one to use. Also, consider whether they measure cpu time; if not, how could you measure it?

To check:

- [ ] What happens if you comment out the printing and compile with -O3?
- [ ] Compare the times you are getting with /usr/bin/time.
12.1.3. Stats#
Now introduce the ability to repeat the timing many times and compute statistics. The number of reps will be a parameter.
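The statistics can be accumulated with running sums, as done in the full program at the end of this section; for R repetitions with times $t_i$,

$$\bar{t} = \frac{1}{R}\sum_{i=1}^{R} t_i, \qquad
\sigma_t = \sqrt{\frac{R}{R-1}\left(\frac{1}{R}\sum_{i=1}^{R} t_i^2 - \bar{t}^{\,2}\right)},$$

where $\sigma_t$ uses the unbiased (R-1) normalization, so it requires R > 1.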
12.1.4. Threads#
Eigen includes parallelization with openmp threads. To use it, you need to compile with the -fopenmp flag and then execute as
OMP_NUM_THREADS=n ./a.out ...
where n is the number of threads. Check the following page to see the kinds of operations that can be parallelized:
https://eigen.tuxfamily.org/dox/TopicMultiThreading.html
12.2. Exercise: Measuring time for matrix-matrix multiplication, with optimization and parallelization#
Now do the same study but with matrix multiplication. Implement the matmul with eigen (see the code snippet at the end). Always compile with the flags “-fopenmp -O3”, and execute in the following way
OMP_NUM_THREADS=VAR ./a.out ...
where VAR takes the values 1, 2, 3, 4, … up to the output of the command nproc. This means that you are starting to use parallelization. You must run this on a multicore system (or give your virtual machine enough resources) with at least 8 threads. Plot the time (with error bars) as a function of the system size; on that graph you will have one curve for each VAR value.
Basic snippet for matmul in eigen:
#include <chrono>
#include <iostream>
#include <eigen3/Eigen/Dense>

double multiply(int size)
{
  // create matrices
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(size, size);
  auto start = std::chrono::steady_clock::now();
  Eigen::MatrixXd C = A*B; // MULTIPLY (assigning to MatrixXd forces eigen to compute the full product here; with auto, eigen would delay the evaluation)
  double tmp = C(0, 0); // use the matrix so the compiler cannot optimize the product away
  auto end = std::chrono::steady_clock::now();
  std::clog << tmp << std::endl;
  std::chrono::duration<double> elapsed_seconds = end - start;
  return elapsed_seconds.count();
}
#include <iostream>
#include <chrono>
#include <ctime>
#include <cmath>
#include <eigen3/Eigen/Dense>
void multiply(const int size, double & wtime, double & ctime); // store time spent multiplying
void stats(const int size, const int reps,
double & mean_wtime, double & sigma_wtime,
double & mean_ctime, double & sigma_ctime);
int main(int argc, char ** argv)
{
const int M = atoi(argv[1]); // Matrix size
const int R = atoi(argv[2]); // Repetitions
double mean_wtime, sigma_wtime;
double mean_ctime, sigma_ctime;
stats(M, R, mean_wtime, sigma_wtime, mean_ctime, sigma_ctime);
std::cout << M
<< "\t" << mean_wtime << "\t" << sigma_wtime
<< "\t" << mean_ctime << "\t" << sigma_ctime
<< std::endl;
return 0;
}
void stats(const int size, const int reps,
double & mean_wtime, double & sigma_wtime,
double & mean_ctime, double & sigma_ctime)
{
double wsum = 0, wsum2 = 0, csum = 0, csum2 = 0;
double wtime = 0, ctime = 0;
for (int rep = 0; rep < reps; ++rep) {
srand(rep+1); // change seed for each repetition
multiply(size, wtime, ctime);
wsum += wtime;
wsum2 += wtime*wtime;
csum += ctime;
csum2 += ctime*ctime;
}
mean_wtime = wsum/reps;
sigma_wtime = std::sqrt(reps*(wsum2/reps - mean_wtime*mean_wtime)/(reps-1));
mean_ctime = csum/reps;
sigma_ctime = std::sqrt(reps*(csum2/reps - mean_ctime*mean_ctime)/(reps-1));
}
void multiply(const int size, double & wtime, double & ctime)
{
// create matrices
Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
Eigen::MatrixXd B = Eigen::MatrixXd::Random(size, size);
auto start = std::chrono::system_clock::now(); // measures wall time
std::clock_t c1 = std::clock(); // measures cpu time
Eigen::MatrixXd C = A*B; // MULTIPLY (assigning to MatrixXd forces eigen to compute the full product here)
double tmp = C(0, 0);
auto end = std::chrono::system_clock::now(); // wall time
std::clock_t c2 = std::clock(); // cpu time
std::clog << tmp << std::endl; // print tmp so the compiler cannot optimize the product away
ctime = 1.0*(c2-c1)/CLOCKS_PER_SEC;
std::chrono::duration<double> elapsed_seconds = end-start;
wtime = elapsed_seconds.count();
}