13. Introduction to High Performance Computing#
Courses
High-Performance Computing Technologies Course: https://www.hpc.temple.edu/mhpc/hpc-technology/
What to do about parallel programming: https://news.ycombinator.com/item?id=36318280
Resources
https://www.archer.ac.uk/training/courses/index.php#hands_on_intro
https://www.sdsc.edu/services/service_rates_summary.html
Containers in HPC
https://apptainer.org/docs/user/main/docker_and_oci.html
Debuggers
EPCC PRACE Training
https://www.archer2.ac.uk/training/courses/211202-package-users/#materials
https://www.archer2.ac.uk/training/courses/210000-openmp-self-service/
https://www.archer2.ac.uk/training/courses/210000-mpi-self-service/
Youtube video list: https://www.youtube.com/watch?v=_h55hwpLwoE&list=PLD0xgZGaUd1IV8VgXb1ggOLkEv19JmZiP
Reference Materials
See the presentations hpc-intro.pdf, 07-HPC-Distributed.pdf, and 07-Parallel-general-metrics.pdf. Pay special attention to the metrics.
13.1. Introduction#
There is a point where a serial version of our code is no longer the optimal way to exploit our computational resources (although it might be for embarrassingly parallel problems, where you can simply run several independent programs at once). For instance, you might want to use all the cores on your multicore system, ideally reducing the execution time, or you might need to explore larger system sizes that consume a lot of memory or require too much time.
Typically, Moore's law meant that you could just wait a bit to get a better machine, so your algorithms would run faster.
https://ourworldindata.org/uploads/2020/11/Transistor-Count-over-time.png
But due to physical limitations and power considerations, it is now typical to have multicore systems:
https://www.youtube.com/watch?v=Qlv5pB6u534 (The physics ending Moore's law)
https://i.stack.imgur.com/fRJgk.png
Recently, power considerations have become more and more relevant:
https://en.wikipedia.org/wiki/Performance_per_watt?useskin=vector
https://i.extremetech.com/imagery/content-types/07CyoCCWMzGjurj8zpuiYO4/images-2.jpg
At the same time, the size and/or complexity of computational problems has been steadily increasing over time, requiring distributed computing techniques
(see also https://en.wikipedia.org/wiki/Flynn’s_taxonomy?useskin=vector). Recently, besides CPU parallelization, GPU parallelization has become very relevant (see https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units?useskin=vector), where CUDA, OpenACC, and others are the relevant technologies.
In our case, we will focus on the cluster computing aspect, although there are other HPC approaches, like grid computing, cloud computing, and so on. One of the goals of HPC is to get better results faster and/or to exploit current or future resources better.
13.2. Basics of parallel metrics#
But, as usual, you should always measure. All programs have a serial part that cannot be parallelized and a parallel part that can. Using more processors/threads can reduce only the parallel part, so a 100% serial program cannot really take advantage of a parallel system. This is known as Amdahl's law: https://en.wikipedia.org/wiki/Amdahl’s_law?useskin=vector
https://upload.wikimedia.org/wikipedia/commons/e/ea/AmdahlsLaw.svg
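In formulas (a standard statement of Amdahl's law, where \(p\) is the fraction of the run time that can be parallelized and \(n\) is the number of processors/threads), the maximum speedup is

\[
S(n) = \frac{1}{(1-p) + p/n} \xrightarrow[\;n \to \infty\;]{} \frac{1}{1-p}.
\]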
In the end, the user must also gauge their application's performance. Blindly reserving HPC resources leads to inefficient cluster use and higher costs. In this regard, parallel metrics are crucial. The next two figures show the speedup and the parallel efficiency. As you can see, they are limited by the hardware (and the algorithms):
Speedup:
Parallel efficiency:
13.3. Practical overview of a cluster's resources and use#
There are many aspects to take into account in the HPC field. If you are
a user, you should know about the type of parallelization (shared
memory, distributed memory, GPU programming), the type of hardware you
are using, the resource manager, the data storage, and so on. The goal of
a system administrator is to make that easier, but that is not always
possible. Check the 12-12-hkhlr_quick_reference-goethe-hlr
(from
https://csc.uni-frankfurt.de/wiki/doku.php?id=public:start) for an
example of a typical cluster config and offerings.
These are examples from the Archer cluster at https://www.archer2.ac.uk/
In the following, we will see some basic examples for HPC, such as
Shared memory: with openmp
Distributed memory: using mpi
Multiple processes: using gnu parallel
C++ threads
TODO C++ parallel algorithms
13.5. MPI, distributed memory#
MPI, the Message Passing Interface, is a library API that allows processes to exchange data in a distributed memory context. It is more complex than OpenMP, but it also opens the door to a much larger scale, since we can use many computers, increasing both our computational power and our memory capacity (if done correctly and efficiently).
The following shows the basic structure of an MPI program. It creates several processes that can communicate with each other, and it can be run on multiple machines (for an introduction, see https://mpitutorial.com/tutorials/mpi-introduction/):
#include <mpi.h>
#include <cstdio>
int main(int argc, char** argv)
{
// Initialize the MPI environment
MPI_Init(&argc, &argv);
// Get the number of processes
int np;
MPI_Comm_size(MPI_COMM_WORLD, &np);
// Get the rank of the process
int pid;
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
// Print off a hello world message
printf("Hello world from processor %s, rank %d out of %d processes\n",
processor_name, pid, np);
// Finalize the MPI environment.
MPI_Finalize();
}
You can compile it as
mpic++ mpi.cpp
(If you want to see all the flags, use mpic++ --showme)
And now run it as
mpirun -np 4 ./a.out
You can also specify different machines to run on, but you will need to have configured passwordless access to those machines.
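For instance, assuming passwordless ssh access to two machines that we will call node1 and node2 (hypothetical names), a minimal sketch with Open MPI is to list them in a hostfile and pass it to mpirun:

# file: hosts.txt (hypothetical machine names)
node1 slots=2
node2 slots=2

mpirun -np 4 --hostfile hosts.txt ./a.out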
13.6. Simple parallelization: farm task and gnu parallel or xargs#
Sometimes you do not need to actually parallelize your code, but rather
to run it for many parameter combinations. Let's assume that we have a task
that depends on one parameter and can be executed independently for each
parameter value. It could be a very complex program, but for now it will be
just a very simple bash script that prints a value. Save the
following code in a bash script (like script.sh) that will use the
stress command to stress a single core:
# file: script.sh
echo "First arg: ${1}"
stress -t 10 -c 1 # stress one core
echo "Stress test done"
When executed, it prints the first argument and then stresses one core for 10 seconds:
bash codes/script.sh 23
What if we want to execute this task for 4 different arguments? Naively, we would just do it sequentially:
date +"%H-%M-%S"
bash codes/script.sh 23
bash codes/script.sh 42
bash codes/script.sh 10
bash codes/script.sh 57
date +"%H-%M-%S"
This takes 40 seconds in total. Remember that this example is very simple, but
assume that the script is a very long task. Then the previous approach
will take four times the time of a single task. What if we have a
machine with four available threads? It would be useful to run all the
commands in parallel. To do so you might just put them in the background
with the & character at the end. But what will happen if you need to
run 7 different arguments and you have only 4 threads? Then it would
not be optimal to have all of them running at the same time with less than
100% of cpu usage each. It would be better to run 4 of them and, when one of
them finishes, launch the next one, and so on. To do this
programmatically, you can use gnu parallel,
https://www.gnu.org/software/parallel/ (check the tutorial in the
documentation section, or the cheatsheet,
https://www.gnu.org/software/parallel/parallel_cheat.pdf). You can
install it with spack install parallel, or load it with spack load parallel
if it is not installed already. For our case, it would be used as follows:
date +"%H-%M-%S"
parallel 'bash codes/script.sh {} ' ::: 23 42 10 57
date +"%H-%M-%S"
08-12-25
First arg: 23
stress: info: [83775] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 42
stress: info: [83779] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 10
stress: info: [83781] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 57
stress: info: [83785] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
08-12-36
Around 10 seconds now! Gnu parallel detects the number of cores and launches the processes accordingly, taking care of the job distribution. Read the manual for the many options of this powerful tool, which is used even on large clusters. For instance, try to run 7 processes:
date +"%H-%M-%S"
parallel 'bash codes/script.sh {} ' ::: 23 42 10 57 21 8 83
date +"%H-%M-%S"
08-13-20
First arg: 23
stress: info: [84082] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 42
stress: info: [84086] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 10
stress: info: [84088] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 57
stress: info: [84091] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 21
stress: info: [84161] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 8
stress: info: [84165] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 83
stress: info: [84168] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
08-13-41
You can play with the -j n flag to control how many jobs parallel runs simultaneously. By default it runs one job per available core/thread.
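If gnu parallel is not available, a similar effect can be obtained with xargs (a minimal sketch; -P sets the maximum number of simultaneous processes):

printf "%s\n" 23 42 10 57 | xargs -P 4 -I{} bash codes/script.sh {}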
13.6.1. Using gnu parallel to run several matmul and compute metrics#
Now let's use the previous matmul exercise with eigen and blas to show how to use parallel to run several matrix sizes at the same time, and also to compute some parallel metrics. The code is
We have two goals:
To compute the wall time as a function of the matrix size (changing the problem size at a fixed number of threads), using blas.
To compute the speedup and parallel efficiency for a fixed matrix size.
In the first case we will just use parallel to run as many simulations as possible at once. In the second case we will compute some metrics to check when our code is most efficient.
It is assumed that you have installed blas with spack, so you can load it as
spack load openblas
or you might have to install it first using
spack install openblas threads=openmp cxxflags="-O3" cflags="-O3" target=x86_64
(the last flag is just to have the same target independently of the actual machine, like the conditions we have in the computer room)
Scaling with problem size: time as a function of matrix size, one thread, eigen + blas
Naively, we could just run the program serially for each matrix size, but if we are on a computer with multiple cores/threads it would be better to run as many parameter combinations as possible at the same time (adapt the following instructions to your case):
#source $HOME/repos/spack/share/spack/setup-env.sh
spack load openblas
#g++ -fopenmp -O3 -I $CMAKE_PREFIX_PATH/include -L $CMAKE_PREFIX_PATH/lib eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
g++ -fopenmp -O3 eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
parallel 'OMP_NUM_THREADS=1 ./eigen_blas.x {} 10 2>/dev/null' ::: 10 50 100 200 500 700 1000 2000 5000
Check that your programs are running in parallel, as expected.
Exercise: scaling with problem size for eigen eigenvectors
With the following code, compute the scaling with the matrix size of eigen when computing eigenvectors with the more general method:
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <eigen3/Eigen/Dense>

void solve_eigensystem(int size, double &time);

int main(int arg, char **argv)
{
  const int M = atoi(argv[1]); // Matrix size
  const int R = atoi(argv[2]); // Repetitions
  const int S = atoi(argv[3]); // seed
  srand(S);

  double totaltime = 0, auxtime = 0;
  for(int irep = 0; irep < R; ++irep) {
    solve_eigensystem(M, auxtime);
    totaltime += auxtime;
  }
  std::cout << M << "\t" << totaltime/R << "\n";
}

void solve_eigensystem(int size, double &time)
{
  double aux;
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(size, size);
  auto start = std::chrono::steady_clock::now();
  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> eigensolver(A);
  aux = eigensolver.eigenvalues()(0);
  auto end = std::chrono::steady_clock::now();
  std::clog << "The first eigenvalue of A is:\n" << aux << std::endl;
  std::chrono::duration<double> diff = end - start;
  time = diff.count();
}
Compile as
g++ -O3 eigen-eigenvectors.cpp -o eigen.x
Run as
./eigen.x M 5 3 2>/dev/null
where you should change M, the matrix size.
Example data (they depend on the machine, but the ratios among them should be similar):
10 9.7568e-06
20 3.55734e-05
50 0.000312481
80 0.000890043
...
2700 72.8078
3000 81.0619
Plot the data and analyze it.
Scaling with the number of threads and parallel metrics
Here we will compute some key parallel metrics that tell us how efficiently our code runs in parallel. In this case you do not want to use gnu parallel to launch several runs at once, since each run uses a variable number of threads per process:
source $HOME/repos/spack/share/spack/setup-env.sh
spack load openblas
#g++ -fopenmp -O3 -I $CMAKE_PREFIX_PATH/include -L $CMAKE_PREFIX_PATH/lib eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
g++ -fopenmp -O3 eigen-matmul.cpp -DEIGEN_USE_BLAS -lopenblas -o eigen_blas.x
parallel -j 1 'echo -n "{} "; OMP_NUM_THREADS={} ./eigen_blas.x 4000 10 2>/dev/null' ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
The following is data obtained from a run on an 8-core/16-thread computer, running with eigen only:
1 4000 6.75223 0.00290168 6.75225 0.0029001
2 4000 3.52052 0.00405504 7.01616 0.00819132
3 4000 2.40281 0.0117795 7.15847 0.0355119
4 4000 1.85186 0.0049013 7.33257 0.0187681
5 4000 1.72682 0.176218 8.53451 0.880953
6 4000 1.65921 0.00946933 9.83127 0.0574367
7 4000 1.52068 0.00538196 10.4943 0.0370317
8 4000 1.39755 0.0326183 11.006 0.260568
9 4000 2.26355 0.00254841 19.9546 0.0452903
10 4000 2.04808 0.00732663 20.0991 0.0807175
11 4000 2.00821 0.00876695 21.7043 0.104527
12 4000 1.76768 0.0276189 20.834 0.324801
13 4000 1.77771 0.00686642 22.7412 0.0887671
14 4000 1.59293 0.0116353 21.9208 0.213236
15 4000 1.56692 0.00829185 23.1334 0.121202
16 4000 1.49262 0.0321579 23.3921 0.417985
And these are the results when running with eigen+blas:
1 4000 1.76345 0.00750705 1.76345 0.0075023
2 4000 0.930922 0.00450842 1.83688 0.00900039
3 4000 0.666528 0.0122499 1.94996 0.0365303
4 4000 0.523076 0.00175201 2.01795 0.00654245
5 4000 0.442096 0.00226719 2.1107 0.0108692
6 4000 0.394103 0.00867531 2.23982 0.0513271
7 4000 0.371224 0.000725666 2.44876 0.00691333
8 4000 0.35202 0.00564542 2.64095 0.0441576
9 4000 0.53266 0.00218486 4.59061 0.02803
10 4000 0.496156 0.00207424 4.73416 0.0281744
11 4000 0.461317 0.00102704 4.82462 0.0111843
12 4000 0.550406 0.0431109 6.32975 0.518631
13 4000 0.514583 0.0119228 6.38601 0.144949
14 4000 0.494073 0.0166192 6.58912 0.231588
15 4000 0.484151 0.0121776 6.90909 0.179303
16 4000 0.493364 0.0316198 7.45327 0.429279
The two most important parallel metrics are the speedup and the parallel efficiency.
The speedup is defined as
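\[
S(n) = \frac{T_1}{T_n},
\]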
where \(T_1\) is the reference time with one thread, and \(T_n\) the time with \(n\) threads. For perfect scaling, \(T_n = T_1/n\), so \(S_{\rm theo}(n) = n\). This does not always occur and depends on the actual machine, the communication patterns, and so on. In the following we will use \(T_1 = 6.75223\) for eigen and \(T_1 = 1.76345\) for eigen+blas.
The parallel efficiency measures roughly how efficiently we are using all the threads. It is defined as
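\[
E(n) = \frac{S(n)}{n} = \frac{T_1}{n\,T_n},
\]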
and, therefore, theoretically it is equal to 1.
To easily create the data we need, we could use a spreadsheet or, better, the command line:
awk '{print $1, 6.75223/$3, 6.75223/$3/$1 }' codes/eigen.txt > codes/eigen_metrics.txt
awk '{print $1, 1.76345/$3, 1.76345/$3/$1 }' codes/eigen_blas.txt > codes/eigen_blas_metrics.txt
Then we can plot and analyze:
set term png enhanced giant #; set out 'tmpspeedup.png'
set key l t; set xlabel 'nthreads'; set ylabel 'Parallel speedup'; set title 'Computer with 8cores/16threads'
plot [:17][:] x lt -1 t 'theo', 'codes/eigen_metrics.txt' u 1:2 w lp t 'eigen', 'codes/eigen_blas_metrics.txt' u 1:2 w lp t 'eigen+blas'

#set term png; set out 'efficiency.png'
set key l b; set xlabel 'nthreads'; set ylabel 'Parallel efficiency'; set title 'Computer with 8cores/16threads'
plot [:17][-0.1:1.1] 1 lt -1, 'codes/eigen_metrics.txt' u 1:3 w lp t 'eigen', 'codes/eigen_blas_metrics.txt' u 1:3 w lp t 'eigen+blas', 0.6 lt 4
Normally, if you are going to pay for cloud services or cluster time to run your code, you should use a number of threads that gives a parallel efficiency of at least 60-70%.
13.7. Threads from c++11#
The c++11 standard included, among many other useful things, support for
threads. A thread is a lightweight process that can be launched in
parallel with other threads from a parent process. In the following we
will see some very simple examples since, in the end, we will focus mainly
on OpenMP (where threads are the key and the memory is shared) and MPI
(where processes are the basic unit and memory is distributed).
The following examples are based on
The following example shows how to create a thread from a given process, and its output:
#include <iostream>
#include <thread>
void func(int x);
int main(int argc, char **argv) {
std::thread th1(&func, 100);
std::thread th2(&func, 200);
th1.join();
std::cout << "Outside thread" << std::endl;
th2.join();
return 0;
}
void func(int x) {
std::cout << "Inside thread " << x << std::endl;
std::thread::id this_id = std::this_thread::get_id();
std::cout << "This is thread_id: " << this_id << std::endl;
}
Compile it as
g++ -std=c++11 thread-v1.cpp
The following is an example of the output:
Inside thread Inside thread 100200
This is thread_id: This is thread_id: 0x700003c34000
0x700003cb7000
Outside thread
Run it several times; you will obtain different outputs, and many times they will be mangled. Why? Because the threads run in parallel and their output is not synchronized, so it can interleave.
To check that we are really running several threads, let's increase the
computational effort inside the function func and then, while the program
is running, use top or htop to check what is running on your computer.
Notice that the cpu usage percentage is well above 100% (roughly 100% per
busy thread):
#include <iostream>
#include <thread>
#include <chrono>
#include <cmath>
void func(double x, int nsecs);
int main(int argc, char **argv) {
const int secs = std::atoi(argv[1]);
std::thread th1(&func, 100, secs);
std::thread th2(&func, 200, secs);
std::thread th3(&func, 300, secs);
th1.join();
std::cout << "Outside thread" << std::endl;
th2.join();
th3.join();
return 0;
}
void func(double x, int nsecs) {
std::cout << "Inside thread " << x << std::endl;
std::this_thread::sleep_for (std::chrono::seconds(nsecs)); // make this sleep, does not consume a lot of resources
for (int ii = 0; ii < 100000000; ++ii) {
x += std::fabs(x*std::sin(x) + std::sqrt(x)/3.4455)/(ii+1);
}
std::cout << "Getting out of thread " << x << std::endl;
}
The following is an example of the output:
Inside thread Inside thread 100
200
Getting out of thread 9387820000.0
Outside thread
Getting out of thread 18849600000.0
To synchronize threads, you can use a mutex. This is useful in case you need to sort out the printing or, more importantly, to synchronize a write operation on some common variable. The following is a simple example:
#include <iostream>
#include <thread>
#include <chrono>
#include <mutex>
std::mutex g_display_mutex;
void foo()
{
std::thread::id this_id = std::this_thread::get_id();
g_display_mutex.lock();
std::cout << "thread " << this_id << " sleeping...\n";
g_display_mutex.unlock();
std::this_thread::sleep_for(std::chrono::seconds(1));
}
int main()
{
std::thread t1(foo);
std::thread t2(foo);
t1.join();
t2.join();
return 0;
}
thread 0x70000709a000 sleeping...
thread 0x70000711d000 sleeping...
Repeat several times. Although the thread id will change, the output will not be mangled.
There is much more about threads, but since our focus will turn to OpenMP, we will stop here. For more info check https://en.cppreference.com/w/cpp/thread/thread
13.7.1. Exercise#
Fix the following code, which has a race condition, using a mutex (ref: https://www.classes.cs.uchicago.edu/archive/2013/spring/12300-1/labs/lab6/) :
#include <iostream>
#include <vector>
#include <thread>
void square(const int x, int & result);
int main() {
int accum = 0;
std::vector<std::thread> ths;
for (int i = 1; i <= 20; i++) {
ths.push_back(std::thread(&square, i, std::ref(accum)));
}
for (auto & th : ths) {
th.join();
}
std::cout << "accum = " << accum << std::endl;
return 0;
}
void square(int x, int &result) {
result += x * x;
}
accum = 2870
The correct answer is 2870, but if you repeat the execution many times
you will find different results. For instance, repeating the execution
1000 times and checking for the unique answers, one gets
for i in {1..1000}; do ./a.out; done | sort | uniq -c
2 accum = 2509
1 accum = 2674
2 accum = 2749
1 accum = 2806
1 accum = 2834
4 accum = 2845
1 accum = 2854
1 accum = 2861
2 accum = 2866
6 accum = 2869
979 accum = 2870
which shows that it does not always yield 2870.
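A minimal sketch of one possible fix: protect the shared accumulator with a std::mutex (here through a std::lock_guard, which unlocks automatically when it goes out of scope), so the read-modify-write on result cannot interleave between threads:

#include <mutex>

std::mutex accum_mutex;  // protects the shared accumulator

void square(int x, int &result) {
  std::lock_guard<std::mutex> guard(accum_mutex); // lock while updating result
  result += x * x;
}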
13.8. Parallel algorithms in c++#
Since c++17, it is possible to execute some stl algorithms in parallel
(shared memory) without explicitly using threads. See:
This is a simple example performing a sum:
#include <iostream>
#include <algorithm>
#include <numeric>
#include <vector>
#include <execution>
int main() {
const long ARRAY_SIZE = 100000000;
std::vector<double> myArray(ARRAY_SIZE);
std::iota(myArray.begin(), myArray.end(), 0); // fill array with 0, 1, 2, ..., ARRAY_SIZE-1
// sequential execution
auto sum_seq = std::accumulate(myArray.begin(), myArray.end(), 0.0);
std::cout << "Sequential sum: " << sum_seq << std::endl;
// parallel execution
auto sum_par = std::reduce(std::execution::par, myArray.begin(), myArray.end());
std::cout << "Parallel sum: " << sum_par << std::endl;
return 0;
}
To compile, use
g++ -std=c++17 par.cpp -ltbb
This links against the Intel TBB (Threading Building Blocks) implementation of the parallel algorithms.
You can of course measure how much time is spent on each part. To do so, we will use chrono:
Implementation 1:
#include <iostream>
#include <algorithm>
#include <numeric>
#include <vector>
#include <execution>
#include <chrono>
int main() {
const long ARRAY_SIZE = 100000000;
std::vector<double> myArray(ARRAY_SIZE);
std::iota(myArray.begin(), myArray.end(), 0); // fill array with 0, 1, 2, ..., ARRAY_SIZE-1
// sequential execution
auto start_time = std::chrono::high_resolution_clock::now();
auto sum_seq = std::accumulate(myArray.begin(), myArray.end(), 0.0);
auto end_time = std::chrono::high_resolution_clock::now();
auto seq_duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
std::cout << "Sequential sum: " << sum_seq << "( took : " << seq_duration.count()/1000.0 << " s)" << std::endl;
// parallel execution
start_time = std::chrono::high_resolution_clock::now();
auto sum_par = std::reduce(std::execution::par, myArray.begin(), myArray.end());
end_time = std::chrono::high_resolution_clock::now();
seq_duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
std::cout << "Parallel sum: " << sum_par << "( took : " << seq_duration.count()/1000.0 << " s)" << std::endl;
return 0;
}
Implementation 2:
#include <iostream>
#include <algorithm>
#include <numeric>
#include <vector>
#include <execution>
#include <chrono>
template<typename Func>
void time_function(Func func);
int main() {
const long ARRAY_SIZE = 100000000;
std::vector<double> myArray(ARRAY_SIZE);
std::iota(myArray.begin(), myArray.end(), 0); // fill array with 0, 1, 2, ..., ARRAY_SIZE-1
// sequential execution
auto serial = [&myArray](){return std::accumulate(myArray.begin(), myArray.end(), 0.0);};
time_function(serial);
// parallel execution
auto parallel = [&myArray](){return std::reduce(std::execution::par, myArray.begin(), myArray.end());};
time_function(parallel);
return 0;
}
template<typename Func>
void time_function(Func func) {
auto start = std::chrono::high_resolution_clock::now();
func();
auto end = std::chrono::high_resolution_clock::now();
auto duration_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Elapsed time: " << duration_ms/1000.0 << " s" << std::endl;
}
The standard does not specify a way to control the number of threads. If you want to do so, and you are using the Intel TBB implementation, you can add the following header
#include <tbb/task_scheduler_init.h>
and then, at some point, specify the total number of threads (in this case, 4)
tbb::task_scheduler_init init(4);
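Note that tbb::task_scheduler_init is deprecated in recent oneTBB releases; there, a roughly equivalent sketch (assuming the oneTBB headers are available) uses tbb::global_control:

#include <tbb/global_control.h>

// limit the TBB thread pool to 4 threads while this object is alive
tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 4);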
To learn more about the parallel algorithms, check
13.9. TODO Gpu Programming intro#
13.9.1. Environment setup for cuda#
Here I show two ways to set up the dev environment. One is based on a local computer with a graphics card, and the other uses Google Colab.
Local environment
Here we will set up a computer which has an Nvidia Quadro P1000 card. You need to install both the driver and the cuda toolkit (the latter is better installed as part of the nvidia hpc sdk).
Driver download for quadro P1000: https://www.nvidia.com/Download/driverResults.aspx/204639/en-us/
Nvidia sdk: https://developer.nvidia.com/hpc-sdk-downloads
Nvidia singularity: This is the recommended way. The image is built at /packages/nvhpc_23.3_devel.sif. More instructions at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc
Accessing a shell inside the container, but with visibility to all user account files:
singularity shell --nv /packages/nvhpc_23.3_devel.sif
Compiling
singularity exec --nv /packages/nvhpc_23.3_devel.sif nvc++ -g cuda_02.cu
Executing with nvprof
singularity exec --nv /packages/nvhpc_23.3_devel.sif nvprof ./a.out
Local module: Load the nvidia sdk (sala2):
module load /packages/nvidia/hpc_sdk/modulefiles/nvhpc/23.3
Compile as
nvc++ -std=c++17 -o offload.x offload.cpp
The docker container is installed. Unfortunately it does not run, since the device compute capability is not enough:
docker run --gpus all -it --rm nvcr.io/nvidia/nvhpc:23.3-devel-cuda_multi-ubuntu20.04
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
More info about container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc
Google Colab
Open a Colab notebook, go to Runtime, Change runtime type, Hardware accelerator -> GPU, GPU type -> T4, Save. Then you will have a runtime with a T4 card, for free. If you want an even better card, you can pay for Colab Pro.
Inside the notebook, you can prefix commands with ! to run them as in a console. For instance, to get the device properties, you can run !nvidia-smi to get something like
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
To create local files, like filename.cu, use the magic %%writefile filename.cu at the beginning of the cell and then put the file contents in the same cell. Finally, to compile and run, just execute the following:
!nvcc filename.cu -o name.x
!nvprof ./name.x
13.9.2. TODO Cuda intro#
REF: https://en.wikipedia.org/wiki/CUDA?useskin=vector
Tutorial 1: https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
Tutorial 2: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
Tutorial 1
https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
Example in C:
Compile as
gcc example_01.c
Now the same but in cuda:
Compile as
nvcc example_01.cu
Execution will show errors, because the code is NOT actually running on the device.
We need to allocate memory on the device (cudaMalloc and cudaFree), and transfer data to and from it (cudaMemcpy).
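The following is a minimal sketch of that workflow (the kernel and file names are hypothetical, not part of the tutorial code): allocate on the device, copy the data in, launch a kernel, and copy the result back:

// file: memcpy_sketch.cu (hypothetical): explicit device allocation and transfers
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(int n, float *d_x)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_x[i] *= 2.0f;
}

int main(void)
{
  const int N = 1 << 10;
  float h_x[N];                          // host buffer
  for (int i = 0; i < N; ++i) h_x[i] = 1.0f;

  float *d_x = nullptr;
  cudaMalloc(&d_x, N * sizeof(float));                              // allocate on the device
  cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

  scale<<<(N + 255) / 256, 256>>>(N, d_x);                          // launch the kernel
  cudaMemcpy(h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host

  std::printf("h_x[0] = %f\n", h_x[0]);                             // should print 2.000000
  cudaFree(d_x);                                                    // release device memory
  return 0;
}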
Tutorial 2
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}
Compile as
g++ -g -std=c++17 cuda_01.cpp
Cuda example
#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
To compile, use nvc++. If you have a singularity container with the nvidia sdk, you can just run the following:
singularity exec --nv /packages/nvhpc_23.3_devel.sif nvc++ -g cuda_02.cu
singularity exec --nv /packages/nvhpc_23.3_devel.sif ./a.out
singularity exec --nv /packages/nvhpc_23.3_devel.sif nvprof ./a.out
and get something like
==16094== NVPROF is profiling process 16094, command: ./a.out
Max error: 0
==16094== Profiling application: ./a.out
==16094== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.54774s         1  2.54774s  2.54774s  2.54774s  add(int, float*, float*)
      API calls:   93.27%  2.54776s         1  2.54776s  2.54776s  2.54776s  cudaDeviceSynchronize
                    6.71%  183.20ms         2  91.602ms  20.540us  183.18ms  cudaMallocManaged
                    0.02%  468.25us         2  234.13us  216.27us  251.98us  cudaFree
                    0.01%  213.75us       101  2.1160us     141ns  150.11us  cuDeviceGetAttribute
                    0.00%  32.127us         1  32.127us  32.127us  32.127us  cudaLaunchKernel
                    0.00%  22.239us         1  22.239us  22.239us  22.239us  cuDeviceGetName
                    0.00%  6.1330us         1  6.1330us  6.1330us  6.1330us  cuDeviceGetPCIBusId
                    0.00%  1.5730us         3     524ns     197ns  1.1650us  cuDeviceGetCount
                    0.00%     808ns         2     404ns     141ns     667ns  cuDeviceGet
                    0.00%     530ns         1     530ns     530ns     530ns  cuDeviceTotalMem
                    0.00%     243ns         1     243ns     243ns     243ns  cuDeviceGetUuid

==16094== Unified Memory profiling result:
Device "Quadro P1000 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  735.2380us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  337.3770us  Device To Host
      24         -         -         -           -  2.855987ms  Gpu page fault groups
Total CPU Page faults: 36
You can also run it on Google Colab, where you will have an NVIDIA T4 card available for free (after changing the runtime), with the following typical output:
==18853== NVPROF is profiling process 18853, command: ./cuda_02.x
Max error: 0
==18853== Profiling application: ./cuda_02.x
==18853== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  108.83ms         1  108.83ms  108.83ms  108.83ms  add(int, float*, float*)
      API calls:   72.48%  290.34ms         2  145.17ms  36.191us  290.31ms  cudaMallocManaged
                   27.17%  108.84ms         1  108.84ms  108.84ms  108.84ms  cudaDeviceSynchronize
                    0.28%  1.1298ms         2  564.90us  537.96us  591.84us  cudaFree
                    0.05%  182.13us       101  1.8030us     264ns  75.268us  cuDeviceGetAttribute
                    0.01%  48.553us         1  48.553us  48.553us  48.553us  cudaLaunchKernel
                    0.01%  28.488us         1  28.488us  28.488us  28.488us  cuDeviceGetName
                    0.00%  8.6520us         1  8.6520us  8.6520us  8.6520us  cuDeviceGetPCIBusId
                    0.00%  2.3140us         3     771ns     328ns  1.6230us  cuDeviceGetCount
                    0.00%     919ns         2     459ns     315ns     604ns  cuDeviceGet
                    0.00%     580ns         1     580ns     580ns     580ns  cuDeviceTotalMem
                    0.00%     532ns         1     532ns     532ns     532ns  cuModuleGetLoadingMode
                    0.00%     382ns         1     382ns     382ns     382ns  cuDeviceGetUuid

==18853== Unified Memory profiling result:
Device "Tesla T4 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  809.9640us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  360.6320us  Device To Host
      12         -         -         -           -  2.564287ms  Gpu page fault groups
Total CPU Page faults: 36
If you increase just the number of threads to 256 (check the change in <<<...>>>), and split the work correctly using the cuda variables threadIdx.x (the thread id within the block) and blockDim.x (the number of threads in the block), as shown,
(number of threads in the block), as shown,__global__ void add(int n, float *x, float *y) { int index = threadIdx.x; int stride = blockDim.x; for (int i = index; i < n; i += stride) y[i] = x[i] + y[i]; }
// Run kernel on 1M elements on the GPU
add<<<1, 256>>>(N, x, y);
then you get the following output
Quadro P1000 : From 2.5 secs to 0.022 secs!
==21739== NVPROF is profiling process 21739, command: ./a.out
Max error: 0
==21739== Profiling application: ./a.out
==21739== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  21.978ms         1  21.978ms  21.978ms  21.978ms  add(int, float*, float*)
      API calls:   87.86%  164.24ms         2  82.118ms  12.398us  164.22ms  cudaMallocManaged
                   11.76%  21.980ms         1  21.980ms  21.980ms  21.980ms  cudaDeviceSynchronize
                    0.24%  457.32us         2  228.66us  177.89us  279.43us  cudaFree
                    0.11%  206.80us       101  2.0470us     128ns  144.81us  cuDeviceGetAttribute
                    0.02%  29.041us         1  29.041us  29.041us  29.041us  cudaLaunchKernel
                    0.01%  20.149us         1  20.149us  20.149us  20.149us  cuDeviceGetName
                    0.00%  5.5860us         1  5.5860us  5.5860us  5.5860us  cuDeviceGetPCIBusId
                    0.00%  2.1000us         3     700ns     277ns     958ns  cuDeviceGetCount
                    0.00%     952ns         2     476ns     330ns     622ns  cuDeviceGet
                    0.00%     391ns         1     391ns     391ns     391ns  cuDeviceTotalMem
                    0.00%     259ns         1     259ns     259ns     259ns  cuDeviceGetUuid

==21739== Unified Memory profiling result:
Device "Quadro P1000 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  734.5940us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  338.5950us  Device To Host
      24         -         -         -           -  1.764587ms  Gpu page fault groups
Total CPU Page faults: 36
Tesla T4: From 0.108 secs to 0.004 secs!
==21448== NVPROF is profiling process 21448, command: ./cuda_03.x
Max error: 0
==21448== Profiling application: ./cuda_03.x
==21448== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  3.7978ms         1  3.7978ms  3.7978ms  3.7978ms  add(int, float*, float*)
      API calls:   98.24%  291.22ms         2  145.61ms  73.005us  291.15ms  cudaMallocManaged
                    1.28%  3.8044ms         1  3.8044ms  3.8044ms  3.8044ms  cudaDeviceSynchronize
                    0.36%  1.0699ms         2  534.95us  512.29us  557.62us  cudaFree
                    0.08%  222.64us       101  2.2040us     174ns  102.62us  cuDeviceGetAttribute
                    0.02%  62.588us         1  62.588us  62.588us  62.588us  cudaLaunchKernel
                    0.02%  44.725us         1  44.725us  44.725us  44.725us  cuDeviceGetName
                    0.00%  8.1290us         1  8.1290us  8.1290us  8.1290us  cuDeviceGetPCIBusId
                    0.00%  3.2970us         3  1.0990us     266ns  2.6840us  cuDeviceGetCount
                    0.00%  1.7320us         2     866ns     352ns  1.3800us  cuDeviceGet
                    0.00%     632ns         1     632ns     632ns     632ns  cuDeviceTotalMem
                    0.00%     549ns         1     549ns     549ns     549ns  cuModuleGetLoadingMode
                    0.00%     377ns         1     377ns     377ns     377ns  cuDeviceGetUuid

==21448== Unified Memory profiling result:
Device "Tesla T4 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      48  170.67KB  4.0000KB  0.9961MB  8.000000MB  825.8720us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  360.3130us  Device To Host
      13         -         -         -           -  2.951606ms  Gpu page fault groups
Total CPU Page faults: 36
Cuda devices group parallel processors into Streaming Multiprocessors (SM), each of which can run several threads in parallel. In our case, by using the command deviceQuery (for the Quadro P1000 system it is at /opt/cuda/extras/demo_suite/deviceQuery), we get:
Quadro P1000: 5 SM, 128 threads/SM
Tesla T4: 32 SM, 128 threads/SM
So the ideal launch configuration changes per card, and we will compute the number of blocks as
int blockSize = 128;
int numBlocks = (N + blockSize - 1) / blockSize; // what if N is not divisible by blocksize?
add<<<numBlocks, blockSize>>>(N, x, y);
Notice that you can also query this constant by using the following code (generated by bard.google.com):
// Get the number of threads per multiprocessor.
int threadsPerMultiprocessor;
cudaError_t err = cudaDeviceGetAttribute(&threadsPerMultiprocessor, cudaDevAttrMaxThreadsPerMultiprocessor, device);
if (err != cudaSuccess) {
  // Handle error.
}
The kernel will now become
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
based on the job distribution done by the tutorial https://developer-blogs.nvidia.com/wp-content/uploads/2017/01/cuda_indexing.png
Now we get
Nvidia Quadro P1000: From 2.500 to 0.022 to 0.006 secs!
==10662== NVPROF is profiling process 10662, command: ./a.out
Max error: 0
==10662== Profiling application: ./a.out
==10662== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  6.0868ms         1  6.0868ms  6.0868ms  6.0868ms  add(int, float*, float*)
      API calls:   96.03%  165.28ms         2  82.641ms  13.911us  165.27ms  cudaMallocManaged
                    3.54%  6.0887ms         1  6.0887ms  6.0887ms  6.0887ms  cudaDeviceSynchronize
                    0.27%  460.56us         2  230.28us  184.71us  275.85us  cudaFree
                    0.13%  215.37us       101  2.1320us     133ns  151.55us  cuDeviceGetAttribute
                    0.02%  30.822us         1  30.822us  30.822us  30.822us  cudaLaunchKernel
                    0.01%  22.122us         1  22.122us  22.122us  22.122us  cuDeviceGetName
                    0.00%  5.7430us         1  5.7430us  5.7430us  5.7430us  cuDeviceGetPCIBusId
                    0.00%  1.3810us         3     460ns     203ns     945ns  cuDeviceGetCount
                    0.00%     921ns         2     460ns     163ns     758ns  cuDeviceGet
                    0.00%     438ns         1     438ns     438ns     438ns  cuDeviceTotalMem
                    0.00%     234ns         1     234ns     234ns     234ns  cuDeviceGetUuid

==10662== Unified Memory profiling result:
Device "Quadro P1000 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      59  138.85KB  4.0000KB  0.9961MB  8.000000MB  740.3880us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  337.8280us  Device To Host
      32         -         -         -           -  2.253582ms  Gpu page fault groups
Total CPU Page faults: 36
Tesla T4: From 0.108 to 0.004 to 0.003 secs
==8972== NVPROF is profiling process 8972, command: ./cuda_04.x
Max error: 0
==8972== Profiling application: ./cuda_04.x
==8972== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.9741ms         1  2.9741ms  2.9741ms  2.9741ms  add(int, float*, float*)
      API calls:   98.47%  250.63ms         2  125.31ms  38.785us  250.59ms  cudaMallocManaged
                    1.18%  2.9959ms         1  2.9959ms  2.9959ms  2.9959ms  cudaDeviceSynchronize
                    0.24%  613.16us         2  306.58us  302.27us  310.89us  cudaFree
                    0.07%  188.26us       101  1.8630us     169ns  86.068us  cuDeviceGetAttribute
                    0.02%  38.874us         1  38.874us  38.874us  38.874us  cuDeviceGetName
                    0.01%  37.051us         1  37.051us  37.051us  37.051us  cudaLaunchKernel
                    0.00%  5.7050us         1  5.7050us  5.7050us  5.7050us  cuDeviceGetPCIBusId
                    0.00%  2.2980us         3     766ns     224ns  1.8050us  cuDeviceGetCount
                    0.00%     979ns         2     489ns     195ns     784ns  cuDeviceGet
                    0.00%     587ns         1     587ns     587ns     587ns  cuDeviceTotalMem
                    0.00%     367ns         1     367ns     367ns     367ns  cuModuleGetLoadingMode
                    0.00%     324ns         1     324ns     324ns     324ns  cuDeviceGetUuid

==8972== Unified Memory profiling result:
Device "Tesla T4 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
     106  77.282KB  4.0000KB  980.00KB  8.000000MB  969.6510us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  363.6760us  Device To Host
      11         -         -         -           -  2.908132ms  Gpu page fault groups
Total CPU Page faults: 36
13.9.3. TODO Openmp offload to gpu#
REF:
Code
#include <iostream>
#include <cstdio>
#include <omp.h>
int main()
{
int a[100], b[100], c[100];
int i;
// Initialize arrays a and b
for (i = 0; i < 100; i++) {
a[i] = i;
b[i] = 2 * i;
}
int num_devices = omp_get_num_devices();
printf("Number of available devices %d\n", num_devices);
// Offload computation to GPU
#pragma omp target teams distribute parallel for map(to:a[0:100], b[0:100]) map(from:c[0:100])
for (i = 0; i < 100; i++) {
c[i] = a[i] + b[i];
}
// Print results
for (i = 0; i < 100; i++) {
std::cout << c[i] << " ";
}
std::cout << std::endl;
return 0;
}
Does not work in sala2 due to the error
OpenMP GPU Offload is available only on systems with NVIDIA GPUs with compute capability '>= cc70
It seems that the sala2 compute capability is 6.1. It can be obtained with
nvidia-smi --query-gpu=compute_cap --format=csv
Using Google Colab, I can compile it with
!nvcc -arch sm_75 -O3 -o openmp_offload openmp_offload.cpp -lgomp
and get
Number of available devices 1
0 3 6 9 12 15 18 21 24 27 30 33 36 ...
Check:
/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
int main()
{
int num_devices = omp_get_num_devices();
printf("Number of available devices %d\n", num_devices);
#pragma omp target
{
if (omp_is_initial_device()) {
printf("Running on host\n");
} else {
int nteams= omp_get_num_teams();
int nthreads= omp_get_num_threads();
printf("Running on device with %d teams in total and %d threads in each team\n",nteams,nthreads);
}
}
return 0;
}
13.9.4. TODO OpenACC intro#
REF:
https://enccs.github.io/OpenACC-CUDA-beginners/1.02_openacc-introduction/
https://ulhpc-tutorials.readthedocs.io/en/latest/gpu/openacc/basics/
Check if we are using the gpu or the cpu:
#include <stdio.h>
#include <openacc.h>
int main() {
int device_type = acc_get_device_type();
if (device_type == acc_device_nvidia) {
printf("Running on an NVIDIA GPU\n");
} else if (device_type == acc_device_radeon) {
printf("Running on an AMD GPU\n");
//} else if (device_type == acc_device_intel_mic) {
//printf("Running on an Intel MIC\n");
} else if (device_type == acc_device_host) {
printf("Running on the host CPU\n");
} else {
printf("Unknown device type\n");
}
return 0;
}
Compile as
gcc -fopenacc mycode.c
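With the NVIDIA HPC compilers (assuming the HPC SDK mentioned above is installed), the analogous command would be something like nvc -acc -Minfo=accel mycode.c, where -Minfo=accel reports which loops were offloaded.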
Simple example:
#include <stdio.h>
int main() {
int i;
float a[100], b[100], c[100];
// Initialize arrays
for (i = 0; i < 100; i++) {
a[i] = i;
b[i] = i;
}
// Compute element-wise sum
#pragma acc parallel loop
for (i = 0; i < 100; i++) {
c[i] = a[i] + b[i];
}
// Print result
for (i = 0; i < 100; i++) {
printf("%f\n", c[i]);
}
return 0;
}