14. Introduction to OpenMP (Shared memory)#

OpenMP is an extension for languages like C/C++ (and Fortran). It is targeted at multi-threaded, shared-memory parallelism. It is very simple to use: you just add some pragma directives which, if not supported by the compiler, are simply ignored and the code runs serially. It is ideal for a machine with several cores and shared memory. For much more info, see http://www.openmp.org/ and https://hpc-tutorials.llnl.gov/openmp/

14.1. Introduction: Hello world and thread id#

A typical Hello World:

#include <cstdio>
#include "omp.h"

int main(void)
{
    double x = 9.0;
//#pragma omp parallel num_threads(4)
#pragma omp parallel
    {// the threads are spawned here
        std::printf("Hello, world.\n");
        std::printf("Hello, world2.\n");
    } // the threads join and die here

    return 0;
}

This makes a parallel version if you compile with the -fopenmp flag. Test it. What happens if you write

export OMP_NUM_THREADS=8

and then you run the executable?
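
For example, assuming the source file is called hello_omp.cpp, you could compile and run it as

g++ -fopenmp hello_omp.cpp -o hello_omp.x
export OMP_NUM_THREADS=8
./hello_omp.x

With OMP_NUM_THREADS=8 you should see each message printed eight times, once per thread, in a non-deterministic order.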

The actual directive is like

#pragma omp parallel [clause ...] newline
                     if (scalar_expression)
                     private (list)
                     shared (list)
                     default (shared | none)
                     firstprivate (list)
                     reduction (operator: list)
                     copyin (list)
                     num_threads (integer-expression)
  structured_block

But we will keep things simple.

What is really important to keep in mind is which variables are going to be shared and which are going to be private, to avoid errors, race conditions, etc.
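
As an illustration, here is a minimal sketch (not part of the original examples) that uses the default(none) clause, so the compiler forces you to state explicitly how every variable is shared. This is a good habit to avoid accidental races:

#include <cstdio>
#include <omp.h>

int main(void)
{
  double x = 9.0; // will be shared (read-only inside the region)
  int tid = -1;   // will be private: each thread gets its own copy

#pragma omp parallel default(none) shared(x) private(tid) num_threads(4)
  {
    tid = omp_get_thread_num(); // writes only to the private copy
    std::printf("thread %d sees x = %f\n", tid, x);
  }

  return 0;
}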

If you want to know the number of threads and the thread id, you can use

int nth =  omp_get_num_threads();
int tid =  omp_get_thread_num();

inside your code, as in the following example:

#include <cstdio> // printf
#include <omp.h>

int main(void)
{
  double x = 9.0;

  int nth = omp_get_num_threads();
  int thid = omp_get_thread_num();
  std::printf("Hello world from thid: %d, out of %d .\n",
              thid, nth);

//#pragma omp parallel num_threads(4)
#pragma omp parallel
  {// the threads are spawned here
    int nth = omp_get_num_threads(); // declared here, so private to each thread
    int thid = omp_get_thread_num();
    std::printf("Hello world from thid: %d, out of %d .\n",
                thid, nth);
  } // the threads join here

  std::printf("Hello world from thid: %d, out of %d .\n",
              thid, nth);


  return 0;
}

Please use the thread sanitizer:

g++ -fopenmp -g -fsanitize=thread code.cpp

There are some other env variables that could be useful:

OMP_NUM_THREADS=4
OMP_DISPLAY_ENV=TRUE
OMP_DISPLAY_AFFINITY=TRUE
OMP_STACK_SIZE=1000000
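
These can also be set for a single run, without export, for example

OMP_NUM_THREADS=4 OMP_DISPLAY_ENV=TRUE ./a.out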

14.2. Private and shared variables#

You can also declare variables as shared or private directly in the directive clauses:

#include <cstdio>
#include <omp.h>

int main(void)
{
  double x = 9.0;

  int nth = omp_get_num_threads();
  int thid = omp_get_thread_num();
  std::printf("Hello world from thid: %d, out of %d .\n",
              thid, nth);

//#pragma omp parallel num_threads(4)
#pragma omp parallel private(thid, nth)
  {// the threads are spawned here
    thid = omp_get_thread_num(); // private, even though it has the same name and type and was not redeclared
    nth = omp_get_num_threads(); // private, even though it has the same name and type and was not redeclared
    std::printf("Hello world from thid: %d, out of %d .\n",
                thid, nth);
  } // the threads join here

  return 0;
}
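
Related to private there is the firstprivate clause (listed in the directive syntax above): a private variable starts uninitialized inside the region, while a firstprivate one starts as a copy of the value it had before the region. A minimal sketch illustrating the difference:

#include <cstdio>
#include <omp.h>

int main(void)
{
  int a = 10, b = 10;

#pragma omp parallel private(a) firstprivate(b) num_threads(2)
  {
    // 'a' is uninitialized here (reading it before assigning is undefined
    // behavior), so we assign it first; 'b' starts at 10 in every thread
    a = omp_get_thread_num();
    std::printf("thread %d: a = %d, b = %d\n", a, a, b);
  }

  // the private copies do not affect the originals
  std::printf("after the region: a = %d, b = %d\n", a, b);
  return 0;
}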

In this example we print the memory addresses of shared and private variables to illustrate those concepts:

#include <omp.h>
#include <iostream>
#include <cstdio>

int main(int argc, char *argv[]) {

  int nthreads, tid;

  /* Fork a team of threads with each thread having a private tid variable */
#pragma omp parallel private(tid)
  {

    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    std::printf("Hello World from thread = %d\n", tid);
    std::cout << "Memory address for tid = " << &tid << std::endl;
    std::cout << "Memory address for nthreads = " << &nthreads << std::endl;

    /* Only master thread does this */
    if (tid == 0) 
      {
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
      }

  }  /* All threads join master thread and terminate */

  return 0;
}

Selecting code to be executed only by the master thread:

#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {

  int nthreads, tid;

  /* Fork a team of threads with each thread having a private tid variable */
#pragma omp parallel private(tid)
  {

    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }

  }  /* All threads join master thread and terminate */

  return 0;
}
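
Instead of checking tid == 0 by hand, OpenMP also provides directives for this. A sketch using the standard single directive, where one thread (not necessarily the master) executes the block and the others wait at an implicit barrier:

#include <cstdio>
#include <omp.h>

int main(void)
{
#pragma omp parallel
  {
    std::printf("Hello from thread %d\n", omp_get_thread_num());

#pragma omp single
    {
      // executed by exactly one thread; the rest wait at the implicit barrier
      std::printf("Number of threads = %d\n", omp_get_num_threads());
    }
  }
  return 0;
}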

14.3. Parallelizing a for loop#

Problem setup:

  • Array of size N.

  • We have nth threads.

  • Nlocal = N/nth

  • imin = tid*Nlocal, imax = (tid+1)*Nlocal = imin + Nlocal

For example, with N = 1000 and nth = 4, each thread gets Nlocal = 250 elements, and thread tid = 2 processes the range [500, 750). When N is not divisible by nth, the last thread must also take the remaining N % nth elements, as done in the code below.


14.3.1. Manually#

The following code assigns the elements of a vector in parallel:

#include <omp.h>
#include <iostream>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <numeric>

void fill(std::vector<double> & array);
double suma(const std::vector<double> & array);

int main(int argc, char *argv[])
{
  const int N = std::atoi(argv[1]);
  std::vector<double> data(N);

  // fill the array
  fill(data);
  //std::cout << data[0] << "\n";

  // compute the sum and the average
  double total = suma(data);
  std::cout << total/data.size() << "\n";

  return 0;
}

void fill(std::vector<double> & array)
{
  int N = array.size();
#pragma omp parallel
  {
    int thid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    int Nlocal = N/nth;
    int iimin = thid*Nlocal;
    int iimax = iimin + Nlocal;
    if (thid == nth-1) iimax = N; // the last thread also takes the N%nth remainder
    for(int ii = iimin; ii < iimax; ii++) {
      array[ii] = 2*ii*std::sin(std::sqrt(ii/56.7)) +
        std::cos(std::pow(1.0*ii*ii/N, 0.3));
    }
  }
}

double suma(const std::vector<double> & array)
{
  int N = array.size();
  std::vector<double> sumaparcial;
#pragma omp parallel
  {
    int thid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    if (0 == thid) {
      sumaparcial.resize(nth);
    }
#pragma omp barrier
    double localsum = 0.0;
    int Nlocal = N/nth;
    int iimin = thid*Nlocal;
    int iimax = iimin + Nlocal;
    if (thid == nth-1) iimax = N; // the last thread also takes the N%nth remainder
    for(int ii = iimin; ii < iimax; ii++) {
      localsum += array[ii];
    }
    sumaparcial[thid] = localsum;
  }
  return std::accumulate(sumaparcial.begin(), sumaparcial.end(), 0.0);
}


An older version of the same idea, using raw pointers:

#include <omp.h>
#include <iostream>
#include <cmath>
#include <vector>

int main(int argc, char *argv[]) {

  const int N = 80000000;
  int i;
  double *a = new double[N];

#pragma omp parallel private(i)
  {
    int thid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    int localsize = N/nth;
    int iimin = thid*localsize;
    int iimax = iimin + localsize;
    for(i = iimin; i < iimax; i++) {
      a[i] = 2*i*std::sin(std::sqrt(i/56.7)) +
        std::cos(std::pow(1.0*i*i, 0.3)); // 1.0* avoids int overflow in i*i
    }
  }

  std::cout << a[1] << "\n";

  delete [] a;
  return 0;
}

14.3.2. Using omp parallel for#

#include <omp.h>
#include <iostream>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <numeric>

void fill(std::vector<double> & array);
double suma(const std::vector<double> & array);

int main(int argc, char *argv[])
{
  const int N = std::atoi(argv[1]);
  std::vector<double> data(N);

  // fill the array
  fill(data);
  //std::cout << data[0] << "\n";

  // compute the sum and the average
  double total = suma(data);
  std::cout << total/data.size() << "\n";

  return 0;
}

void fill(std::vector<double> & array)
{
  const int N = array.size();
#pragma omp parallel for
  for(int ii = 0; ii < N; ii++) {
      array[ii] = 2*ii*std::sin(std::sqrt(ii/56.7)) +
        std::cos(std::pow(1.0*ii*ii/N, 0.3));
  }
}

double suma(const std::vector<double> & array)
{
  int N = array.size();
  double suma = 0.0;
#pragma omp parallel for reduction(+:suma)
  for(int ii = 0; ii < N; ii++) {
    suma += array[ii];
  }
  return suma;
}


An older version using raw arrays:

#include <omp.h>
#include <iostream>
#include <cmath>

void fill(double * array, int size);
double average(double * array, int size);

int main(int argc, char *argv[]) {
  const int N = std::atoi(argv[1]);
  double *a = new double[N];

  fill(a, N);
  std::cout << a[1] << "\n";

  //std::cout << average(a, N) << "\n";

  double suma = 0.0;
#pragma omp parallel for reduction(+:suma)
  for(int i = 0; i < N; i++) {
    suma += a[i];
  }
  std::cout <<  suma/N << "\n";

  delete [] a;
  return 0;
}

void fill(double * array, int size)
{
#pragma omp parallel for
  for(int i = 0; i < size; i++) {
    array[i] = 2*i*std::sin(std::sqrt(i/56.7)) +
      std::cos(std::pow(1.0*i*i, 0.3));
  }
}
double average(double * array, int size)
{
  double suma = 0.0;
#pragma omp parallel for reduction(+:suma)
  for(int i = 0; i < size; i++) {
    suma += array[i];
  }
  return suma/size;
}

Exercise: Modify the code to compute the mean in parallel. Play with larger and larger arrays and with an increasing number of threads. Can you compute any parallel metrics (speedup, efficiency)?

#include <omp.h>
#include <iostream>
#include <cmath>

void fill_data(double *a, int size);
double average(double *a, int size);

int main(int argc, char *argv[]) {
    std::cout.precision(15); std::cout.setf(std::ios::scientific);

    const int N = 80000000;
    int i;
    double *a = new double[N];

    fill_data(a, N);
    std::cout << a[1] << std::endl;
    double avg = average(a, N);
    std::cout << avg << "\n";

    delete [] a;
    return 0;
}

void fill_data(double *a, int size)
{
    long int i;
#pragma omp parallel for
    for(i = 0; i < size; i++) {
        a[i] = 2*i*std::sin(std::sqrt(i/56.7)) +
            std::cos(std::pow(1.0*i*i, 0.3)); // 1.0* keeps the product in double
    }
}

double average(double *a, int size)
{
    double result = 0;
#pragma omp parallel for reduction(+:result)
    for(int ii = 0; ii < size; ++ii) {
      result = result + a[ii];
    }
    result /= size;
    return result;
}
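
To get the parallel metrics, you can time the computation with omp_get_wtime (used again in the exercises below) and run with different values of OMP_NUM_THREADS. A minimal sketch, here timing only the parallel fill:

#include <omp.h>
#include <iostream>
#include <cmath>
#include <vector>

int main(void)
{
  const int N = 80000000;
  std::vector<double> a(N);

  double t1 = omp_get_wtime();
#pragma omp parallel for
  for(int i = 0; i < N; ++i) {
    a[i] = 2*std::sin(i/35.6);
  }
  double t2 = omp_get_wtime();

  // Run with OMP_NUM_THREADS = 1, 2, 4, ... and compute, by hand,
  // the speedup S(p) = T(1)/T(p) and the efficiency E(p) = S(p)/p.
  std::cout << omp_get_max_threads() << "\t" << t2 - t1 << "\n";
  return 0;
}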

14.3.3. Using reductions#

This example shows how to reduce a variable using the reduction keyword:

#include <iostream>
#include <cmath>

int main(int argc, char ** argv)
{
  const int N = 100000000;
  int i;
  double *a;
  a = new double [N];
  double suma = 0;

#pragma omp parallel for reduction(+:suma)
  for (i = 0; i < N; ++i) {
    a[i] = 2*std::sin(i/35.6);
    suma += a[i];
  }

  std::cout << suma << std::endl;

  delete [] a;
  return 0;
}
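
To see what the reduction clause is doing for you, here is a sketch of the same sum written by hand: each thread accumulates a private partial sum, which is then combined inside a critical section. The reduction version is both simpler and usually faster:

#include <iostream>
#include <cmath>
#include <omp.h>

int main(void)
{
  const int N = 100000000;
  double *a = new double[N];
  double suma = 0;

#pragma omp parallel
  {
    double localsum = 0.0; // private partial sum, no races
#pragma omp for
    for (int i = 0; i < N; ++i) {
      a[i] = 2*std::sin(i/35.6);
      localsum += a[i];
    }
#pragma omp critical
    suma += localsum; // only the final combination is serialized
  }

  std::cout << suma << std::endl;
  delete [] a;
  return 0;
}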

Now read the following code, try to predict its outcome, and then run it to test your prediction (from https://computing.llnl.gov/tutorials/openMP/#Introduction). Two clauses are new here: schedule(dynamic, chunk) hands out iterations to threads in chunks of size chunk as they become free, and nowait removes the implicit barrier at the end of the for construct.

#include <omp.h>
#define N 1000
#define CHUNKSIZE 100

int main(int argc, char *argv[]) {

  int i, chunk;
  float a[N], b[N], c[N];

  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;

#pragma omp parallel shared(a,b,c,chunk) private(i)
  {

#pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];

  }   /* end of parallel region */
  return 0;
}

14.4. Tutorials and extra materials#

Parallel STL https://www.modernescpp.com/index.php?option=com_content&view=article&id=572&catid=49 https://www.modernescpp.com/index.php?option=com_content&view=article&id=573&catid=49

14.5. Exercises#

Refs: CT-LAB OpenMP exercises; https://computing.llnl.gov/tutorials/openMP/exercise.html

  1. Write a program to compute the mean and the standard deviation for a large array using OpenMP, with only one reduction. Also compute the parallel metrics.

    // stats.cpp
    #include <omp.h>
    #include <iostream>
    #include <cmath>
    #include <vector>
    #include <cstdlib>
    #include <numeric>
    
    void fill(std::vector<double> & array);
    void stats(const std::vector<double> & array, double &mean, double &sigma);
    
    int main(int argc, char *argv[])
    {
      const int N = std::atoi(argv[1]);
      std::vector<double> data(N);
    
      // fill the array
      fill(data);
    
      // compute the statistics
      double mean{0.0}, sigma{0.0};
      double start = omp_get_wtime();
      stats(data, mean, sigma);
      double time = omp_get_wtime() - start;
      std::printf("%.15le\t\t%.15le\t\t%.15le\n", mean, sigma, time);
    
      return 0;
    }
    
    void fill(std::vector<double> & array)
    {
      const int N = array.size();
    #pragma omp parallel for
      for(int ii = 0; ii < N; ii++) {
          array[ii] = 2*ii*std::sin(std::sqrt(ii/56.7)) +
            std::cos(std::pow(1.0*ii*ii/N, 0.3));
      }
    }
    
    void stats(const std::vector<double> & array, double & mean, double & sigma)
    {
      int N = array.size();
      double suma = 0.0, suma2 = 0.0;
      // a single reduction clause accumulates both the sum and the sum of squares
    #pragma omp parallel for reduction(+:suma, suma2)
      for(int ii = 0; ii < N; ii++) {
        suma += array[ii];
        suma2 += array[ii]*array[ii];
      }
      mean = suma/N;
      sigma = std::sqrt(suma2/N - mean*mean); // population standard deviation
    }
    
  2. Write a program to compute the integral of the function \(y = x^{2}\) for \(x \in [0,10]\). Fill the following table:

    | # Threads | Runtime [s] | Speedup | Efficiency |
    |-----------|-------------|---------|------------|
    |           |             |         |            |

    Explain your results in terms of the number of processors available. Use \(N=12000\). Do it in two ways:

    1. Distributing the N intervals across all threads, so each one computes a smaller part.

    2. Keeping the resolution per thread constant, that is, each thread has N intervals.

    What is the difference between the two cases regarding precision and runtime?

#include <iostream>
#include <omp.h>

using fptr = double(double);

double f(double x);
double integral_serial(double a, double b, int N, fptr f);
double integral_openmp(double a, double b, int N, fptr f);

int main(void)
{
  // declare vars
  const double XA = 0.0; 
  const double XB = 10.0; 
  const int N = 100000000;

  // print result
  //std::cout << "Serial integral: " << integral_serial(XA, XB, N, f) << "\n";
  //std::cout << "Serial openmp  : " << integral_openmp(XA, XB, N, f) << "\n";
  double t1 = omp_get_wtime();
  integral_openmp(XA, XB, N, f);
  double t2 = omp_get_wtime();

#pragma omp parallel
  {
    if(0 == omp_get_thread_num()) {
      std::cout << omp_get_num_threads() << "\t" << t2 - t1 << std::endl;
    }
  }
}

double f(double x)
{
  return x*x;
}

double integral_serial(double a, double b, int N, fptr f)
{
  const double dx = (b-a)/N; 
  // compute integral
  double sum = 0, x;
  for(int ii = 0; ii < N; ++ii) {
    x = a + ii*dx;
    sum += dx*f(x);
  }
  return sum;
}

double integral_openmp(double a, double b, int N, fptr f)
{
  // same rectangle rule as the serial version, parallelized with a reduction
  const double dx = (b-a)/N;
  double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
  for(int ii = 0; ii < N; ++ii) {
    sum += dx*f(a + ii*dx);
  }
  return sum;
}
  3. Parallelize a matrix-matrix multiplication. Compare the performance when you use one, two, three, and four threads.

  4. Is this loop parallelizable? If not, why?

    #pragma omp parallel for
    for (int i = 1; i < N; i++)
    {
      A[i] = B[i] - A[i - 1];
    }
    
  5. Parallelize a matrix-vector multiplication. What must be shared? What should be private?

  6. Parallelize the matrix transposition.

  7. Check and solve the exercises at https://computing.llnl.gov/tutorials/openMP/exercise.html .

  8. Check and solve http://www.hpc.cineca.it/content/training-openmp .

  9. Some other examples are at https://www.archer.ac.uk/training/course-material/2015/03/thread_prog/ .