15. Introduction to OpenMP (Shared memory)#

OpenMP is an extension for languages like C, C++, and Fortran. It targets multi-threaded, shared-memory parallelism. It is very simple to use: you just add some pragma directives which, if not supported by the compiler, are simply ignored and the code still runs serially. It is ideal for a machine with several cores and shared memory. For much more info, see http://www.openmp.org/ and https://hpc-tutorials.llnl.gov/openmp/. For a discussion of a recent release, check https://news.ycombinator.com/item?id=42139843.

15.1. Introduction: Hello world and threadid#

A typical hello world:

#include <cstdio>
#include "omp.h"

int main(void)
{
    double x = 9.0; // declared outside the parallel region: shared by all threads
//#pragma omp parallel num_threads(4)
#pragma omp parallel //  This starts a parallel region
    {// the threads are spawned here
        std::printf("Hello, world.\n");
        std::printf("Hello, world2.\n");
    } // the threads die here

    return 0;
}

This becomes a parallel program if you compile with the -fopenmp flag:

g++ -fopenmp -g -fsanitize=thread code.cpp

Test it. What happens if you write

export OMP_NUM_THREADS=8

and then you run the executable?

The full directive syntax is:

#pragma omp parallel [clause ...]  newline 
                   if (scalar_expression) 
                   private (list) 
                   shared (list) 
                   default (shared | none) 
                   firstprivate (list) 
                   reduction (operator: list) 
                   copyin (list) 
                   num_threads (integer-expression)
 structured_block  

But we will keep things simple.
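For instance, a minimal sketch combining a few of these clauses (the variable names here are just illustrative) could look like:

#include <cstdio>
#include <omp.h>

int main(void)
{
  int n = 100;     // will be explicitly shared
  double x0 = 1.0; // each thread gets a private copy, initialized to this value

  // default(none) forces us to declare the sharing of every variable used inside
#pragma omp parallel num_threads(4) default(none) shared(n) firstprivate(x0)
  {
    x0 += omp_get_thread_num(); // safe: x0 is private to each thread
    std::printf("n = %d, my x0 = %.1f\n", n, x0);
  }

  return 0;
}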

What is really important to keep in mind is which variables are going to be shared and which are going to be private, to avoid errors, race conditions, etc.
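To see why this matters, here is a small sketch of a race condition: several threads increment a shared counter without any protection, so updates are lost; the reduction clause (used extensively below) is one way to fix it.

#include <cstdio>
#include <omp.h>

int main(void)
{
  const int N = 1000000;
  long counter = 0;

#pragma omp parallel for // WRONG: counter is shared and updated concurrently
  for(int ii = 0; ii < N; ii++) {
    counter++; // data race: the final value is usually smaller than N
  }
  std::printf("racy counter    = %ld (expected %d)\n", counter, N);

  counter = 0;
#pragma omp parallel for reduction(+:counter) // each thread accumulates privately
  for(int ii = 0; ii < N; ii++) {
    counter++;
  }
  std::printf("reduced counter = %ld (expected %d)\n", counter, N);

  return 0;
}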

If you want to know the number of threads and the thread id, you can use

int nth =  omp_get_num_threads();
int tid =  omp_get_thread_num();

inside your code, as in the following,

#include <cstdio> // printf
#include <omp.h>

int main(void)
{
  double x = 9.0;

  int nth = omp_get_num_threads(); // outside a parallel region, this returns 1
  int thid = omp_get_thread_num(); // and this returns 0
  std::printf("Hello world from thid: %d, out of %d .\n",
              thid, nth);

//#pragma omp parallel num_threads(4)
#pragma omp parallel
  {// the threads are spawned here
    int nth = omp_get_num_threads(); // declared here, so private to each thread
    int thid = omp_get_thread_num();
    std::printf("Hello world from thid: %d, out of %d .\n",
                thid, nth);
  } // the threads die here

  std::printf("Hello world from thid: %d, out of %d .\n",
              thid, nth);


  return 0;
}

Please use the thread sanitizer:

g++ -fopenmp -g -fsanitize=thread code.cpp

There are some other environment variables that could be useful:

OMP_NUM_THREADS=4
OMP_DISPLAY_ENV=TRUE
OMP_DISPLAY_AFFINITY=TRUE
OMP_STACK_SIZE=1000000

15.2. Running OpenMP processes with Slurm#

To run on the cluster, run as

export OMP_NUM_THREADS=8  # Request 8 threads for your OpenMP program
srun --cpus-per-task=8 --pty ./my_openmp_program

Here the key is to use --cpus-per-task=8, which means using 8 threads per process (task). You can even specify everything in the command:

srun --cpus-per-task=8 --pty bash -c "export OMP_NUM_THREADS=8; ./my_openmp_program"

or, for better consistency and to avoid duplication, asking Slurm for the number of reserved CPUs:

srun --cpus-per-task=4 --time=00:01:00 --pty bash -c "export OMP_NUM_THREADS=\$SLURM_CPUS_PER_TASK; ./my_openmp_program"

Of course, it is much better to use a script,

#!/bin/bash

#SBATCH --job-name=openmp_job      # Job name
#SBATCH --output=openmp_job_%j.out # Standard output and error log
#SBATCH --error=openmp_job_%j.err  # %j is replaced with the job ID
#SBATCH --nodes=1                  # Request 1 node
#SBATCH --ntasks=1                 # Request 1 task (your OpenMP program is a single task)
#SBATCH --cpus-per-task=8          # Request 8 CPUs for this task (adjust as needed for your program)
#SBATCH --time=00:10:00            # Wall-clock time limit (HH:MM:SS)
#SBATCH --mem=4G                   # Memory per node (adjust as needed)

# Load any necessary modules (e.g., GCC if it's not in your default path)
# module load gcc/11.2.0 # Example

# Set the number of OpenMP threads
# This should ideally be equal to --cpus-per-task for optimal performance
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

echo "Starting OpenMP job on node: $(hostname)"
echo "Number of requested CPUs per task: $SLURM_CPUS_PER_TASK"
echo "OMP_NUM_THREADS set to: $OMP_NUM_THREADS"

# Run your OpenMP program
./a.out

echo "OpenMP job finished."

and then submit it with sbatch:

sbatch openmp.sh

15.3. Private and shared variables#

The following example contrasts shared and private variables:

#include <cstdio>
#include <omp.h>

int main(void)
{
  double x = 9.0; // "public" var, seen and shared by all threads 

  int nth = omp_get_num_threads();
  int thid = omp_get_thread_num();
  std::printf("Hello world from thid: %d, out of %d .\n",
              thid, nth);

//#pragma omp parallel num_threads(4) // set the num_threads instead of relying on OMP_NUM_THREADS
#pragma omp parallel private(thid, nth) // makes thid and nth private, so there is no problem with overwriting
  {// the threads are spawned here
    thid = omp_get_thread_num(); // private, even though it has the same name and type and was not redeclared
    nth = omp_get_num_threads(); // idem: private inside the region
    std::printf("Hello world from thid: %d, out of %d .\n",
                thid, nth);
  } // the threads die here

  return 0;
}

In this new example, we see the memory address for shared and private variables to illustrate those concepts:

#include <omp.h>
#include <iostream>
#include <cstdio>

int main(int argc, char *argv[]) {

  int nthreads, tid;

  /* Fork a team of threads with each thread having a private tid variable */
#pragma omp parallel private(tid)
  {

    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    std::printf("Hello World from thread = %d\n", tid);
    std::cout << "Memory address for tid = " << &tid << std::endl;
    std::cout << "Memory address for nthreads = " << &nthreads << std::endl;

    /* Only master thread does this */
    if (tid == 0) 
      {
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
      }

  }  /* All threads join master thread and terminate */

  return 0;
}

Knowing the thread id allows you to distribute the work. For instance, the following shows how to select code that runs only on the master thread:

#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {

  int nthreads, tid;

  /* Fork a team of threads with each thread having a private tid variable */
#pragma omp parallel private(tid)
  {

    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }

  }  /* All threads join master thread and terminate */

}
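OpenMP also has a directive for this pattern, so you do not need to check the id yourself; a minimal sketch:

#include <cstdio>
#include <omp.h>

int main(void)
{
#pragma omp parallel
  {
#pragma omp master // this block is executed only by thread 0 (no implied barrier)
    {
      std::printf("Number of threads = %d\n", omp_get_num_threads());
    }
  }
  return 0;
}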

15.4. Parallelizing a for loop#

We want to sum all the elements of a vector, in parallel. If the vector is of size N, and we have nth threads, how do we distribute the job? Each thread will process a local part of size Nlocal = N/nth (what happens if N is not divisible by nth?). Then each thread needs to process its own segment. For example, let’s say that N = 1000000 and nth = 5; then Nlocal = 200000. Here it is important to think in terms of local and global indexes:

  • local indexes: Each thread will run from a local index 0 up to Nlocal-1. This index represents the local part being processed.

  • global indexes: This is the original index for the whole array, running from 0 to N-1. In our case, thread 0 will run globally from 0 up to Nlocal-1, thread 1 from Nlocal up to 2Nlocal-1, and so on. In general, for a thread id tid we have a minimum global index equal to imin = tid*Nlocal, and a maximum (not inclusive) equal to imax = imin + Nlocal = (tid+1)*Nlocal.

It is crucial to compute this job partition among threads.

15.4.1. Manually#

The following code fills the elements of a vector in parallel, and then sums them, also in parallel:

#include <omp.h>
#include <iostream>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <numeric>

void fill(std::vector<double> & array);
double suma(const std::vector<double> & array);

int main(int argc, char *argv[])
{
  const int N = std::atoi(argv[1]);
  std::vector<double> data(N);

  // fill the array with some data
  fill(data);
  //std::cout << data[0] << "\n";

  // compute the sum and the mean
  double total = suma(data);
  std::cout << total/data.size() << "\n";

  return 0;
}

void fill(std::vector<double> & array)
{
  int N = array.size();
#pragma omp parallel
  {
    int thid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    int Nlocal = N/nth; // note: if N % nth != 0, the last N % nth elements are skipped (see the sketch at the end of this subsection)
    int iimin = thid*Nlocal;
    int iimax = iimin + Nlocal;
    for(int ii = iimin; ii < iimax; ii++) {
      array[ii] = 2*ii*std::sin(std::sqrt(ii/56.7)) +
        std::cos(std::pow(1.0*ii*ii/N, 0.3));
    }
  }
}

double suma(const std::vector<double> & array)
{
  int N = array.size();
  int max_threads_possible = omp_get_max_threads();
  std::vector<double> sumaparcial(max_threads_possible, 0.0);

#pragma omp parallel
  {
    int thid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    double localsum = 0.0;
    int Nlocal = N/nth;
    int iimin = thid*Nlocal;
    int iimax = iimin + Nlocal;
    for(int ii = iimin; ii < iimax; ii++) {
      localsum += array[ii];
    }
    sumaparcial[thid] = localsum; // save the localsum in a different place per thread
  }
  return std::accumulate(sumaparcial.begin(), sumaparcial.end(), 0.0); // sum up all partial sums
}

Then, after compilation,

g++ -std=c++17 -g -fopenmp suma.cpp -o suma.x

you can run it as

srun --cpus-per-task=1 --time=00:01:00 bash -c "export OMP_NUM_THREADS=\$SLURM_CPUS_PER_TASK; ./suma.x 50000000"
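As noted in the code, this manual partition silently skips the last N % nth elements when N is not divisible by nth. One common convention (a sketch, not the only option) is to give the first N % nth threads one extra element each:

#include <cstdio>
#include <omp.h>

int main(void)
{
  const int N = 10; // deliberately not divisible by typical thread counts

#pragma omp parallel
  {
    int nth = omp_get_num_threads();
    int tid = omp_get_thread_num();
    int Nlocal = N/nth; // base chunk size
    int rest = N%nth;   // leftover elements
    // the first `rest` threads take Nlocal+1 elements, the others take Nlocal
    int iimin = tid*Nlocal + (tid < rest ? tid : rest);
    int iimax = iimin + Nlocal + (tid < rest ? 1 : 0);
    std::printf("thread %d processes [%d, %d)\n", tid, iimin, iimax);
  }

  return 0;
}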

15.4.2. Using omp parallel for#

Actually, it is much easier to use a parallel for and a reduction to accumulate the data. With this, OpenMP takes care of all the work splitting.

#include <omp.h>
#include <iostream>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <numeric>

void fill(std::vector<double> & array);
double suma(const std::vector<double> & array);

int main(int argc, char *argv[])
{
  const int N = std::atoi(argv[1]);
  std::vector<double> data(N);

  // fill the array
  fill(data);
  //std::cout << data[0] << "\n";

  // compute the sum and the mean
  double total = suma(data);
  std::cout << total/data.size() << "\n";

  return 0;
}

void fill(std::vector<double> & array)
{
  const int N = array.size();
#pragma omp parallel for
  for(int ii = 0; ii < N; ii++) {
      array[ii] = 2*ii*std::sin(std::sqrt(ii/56.7)) +
        std::cos(std::pow(1.0*ii*ii/N, 0.3));
  }
}

double suma(const std::vector<double> & array)
{
  int N = array.size();
  double suma = 0.0;
#pragma omp parallel for reduction(+:suma)
  for(int ii = 0; ii < N; ii++) {
    suma += array[ii];
  }
  return suma;
}

Then, after compilation,

g++ -std=c++17 -g -fopenmp parallelfor.cpp -o parallelfor.x

you can run it as

srun --cpus-per-task=1 --time=00:01:00 bash -c "export OMP_NUM_THREADS=\$SLURM_CPUS_PER_TASK; ./parallelfor.x 50000000"
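If you want some control over how the iterations are split, you can add a schedule clause to the parallel for: schedule(static) deals out contiguous blocks up front, while schedule(dynamic, chunk) hands out chunks on demand, which helps when iterations have irregular cost. A minimal sketch (the chunk size of 1000 is arbitrary):

#include <cmath>
#include <cstdio>
#include <omp.h>

int main(void)
{
  const int N = 10000000;
  double total = 0.0;

  // dynamic scheduling: threads grab chunks of 1000 iterations as they finish
#pragma omp parallel for schedule(dynamic, 1000) reduction(+:total)
  for(int ii = 0; ii < N; ii++) {
    total += std::sqrt(ii); // stand-in for real, iteration-dependent work
  }
  std::printf("total = %f\n", total);

  return 0;
}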

15.4.3. Exercise#

Compute the parallel metrics (speedup and efficiency) for the last code. Use omp_get_wtime().
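A minimal timing sketch: wrap the region of interest with omp_get_wtime(), run with different values of OMP_NUM_THREADS, and compute the speedup S(p) = T(1)/T(p) and the efficiency E(p) = S(p)/p from the recorded times.

#include <cstdio>
#include <omp.h>

int main(void)
{
  const long N = 100000000;
  double suma = 0.0;

  double start = omp_get_wtime(); // wall-clock time, in seconds
#pragma omp parallel for reduction(+:suma)
  for(long ii = 0; ii < N; ii++) {
    suma += ii*1.0e-8;
  }
  double elapsed = omp_get_wtime() - start;

  std::printf("threads: %d, time: %f s, sum: %f\n",
              omp_get_max_threads(), elapsed, suma);
  return 0;
}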

15.5. Tutorial and extra materials#

Parallel STL
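For example, with C++17 many standard algorithms accept an execution policy that asks the implementation to run in parallel (with GCC this typically requires linking against TBB, e.g. g++ -std=c++17 -O2 code.cpp -ltbb); a minimal sketch:

#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main(void)
{
  std::vector<double> data(10000000, 1.0);

  // std::execution::par requests a parallel reduction
  double total = std::reduce(std::execution::par, data.begin(), data.end(), 0.0);

  std::cout << total << "\n";
  return 0;
}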

Please notice that recent OpenMP standards allow you to offload work to the GPU:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main() {
    float *a = (float*) malloc(N * sizeof(float));
    float *b = (float*) malloc(N * sizeof(float));
    float *c = (float*) malloc(N * sizeof(float));

    // Initialize input vectors
    for (int i = 0; i < N; i++) {
        a[i] = i * 1.0f;
        b[i] = i * 2.0f;
    }

    // Offload the work to the GPU
    #pragma omp target data map(to: a[0:N], b[0:N]) map(from: c[0:N])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Verify result
    int errors = 0;
    for (int i = 0; i < N; i++) {
        if (c[i] != a[i] + b[i]) {
            errors++;
        }
    }

    if (errors == 0)
        printf("Success! All values correct.\n");
    else
        printf("There were %d errors.\n", errors);

    free(a); free(b); free(c);
    return 0;
}

and then compile as

g++ -fopenmp -foffload=nvptx-none -O2 -o vec_add vec_add.c # you can also offload to AMD GPUs, using the AMD compiler

15.6. Exercises#

Refs: CT-LAB OpenMP exercises; https://computing.llnl.gov/tutorials/openMP/exercise.html

  1. Write a program to compute the mean and the standard deviation for a large array using OpenMP, with only one reduction. Also compute the parallel metrics.

    // stats.cpp
    #include <omp.h>
    #include <cstdio> // std::printf
    #include <iostream>
    #include <cmath>
    #include <vector>
    #include <cstdlib>
    #include <numeric>
    
    void fill(std::vector<double> & array);
    void stats(const std::vector<double> & array, double &mean, double &sigma);
    
    int main(int argc, char *argv[])
    {
      const int N = std::atoi(argv[1]);
      std::vector<double> data(N);
    
      // fill the array
      fill(data);
    
      // compute the stats
      double mean{0.0}, sigma{0.0};
      double start = omp_get_wtime();
      stats(data, mean, sigma);
      double time = omp_get_wtime() - start;
      std::printf("%.15le\t\t%.15le\t\t%.15le\n", mean, sigma, time);
    
      return 0;
    }
    
    void fill(std::vector<double> & array)
    {
      const int N = array.size();
    #pragma omp parallel for
      for(int ii = 0; ii < N; ii++) {
          array[ii] = 2*ii*std::sin(std::sqrt(ii/56.7)) +
            std::cos(std::pow(1.0*ii*ii/N, 0.3));
      }
    }
    
    void stats(const std::vector<double> & array, double & mean, double & sigma)
    {
      int N = array.size();
      double suma = 0.0;
    #pragma omp parallel for reduction(+:suma)
      for(int ii = 0; ii < N; ii++) {
        suma += array[ii];
      }
      mean = suma/N;
      // TODO: compute sigma (that is the exercise; see the hint below)
    }
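    A possible hint (one way to do it, not the only one): a single reduction clause can list several variables, so inside stats() you can accumulate both the sum and the sum of squares in one pass and use \(\sigma^{2} = \langle x^{2}\rangle - \langle x\rangle^{2}\):

    double suma = 0.0, suma2 = 0.0;
    #pragma omp parallel for reduction(+:suma, suma2)
    for(int ii = 0; ii < N; ii++) {
      suma += array[ii];
      suma2 += array[ii]*array[ii];
    }
    mean = suma/N;
    sigma = std::sqrt(suma2/N - mean*mean); // requires <cmath>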
    
  2. Write a program to compute the integral of the function \(y = x^{2}\), for \(x \in [0,10]\). Fill the following table:

    | # Threads | Runtime [s] | Speedup | Efficiency |
    |-----------|-------------|---------|------------|

    Explain your results in terms of the number of processors available. Use \(N=12000\). Do it in two ways:

    1. Distributing the N intervals across all threads, so each one has a smaller part.

    2. Keeping the resolution per thread constant, that is, each thread has N intervals.

    What is the difference between the two cases, regarding the precision and the time?

#include <iostream>
#include <omp.h>

using fptr = double(double);

double f(double x);
double integral_serial(double a, double b, int N, fptr f);
double integral_openmp(double a, double b, int N, fptr f);

int main(void)
{
  // declare vars
  const double XA = 0.0; 
  const double XB = 10.0; 
  const int N = 100000000;

  // print result
  //std::cout << "Serial integral: " << integral_serial(XA, XB, N, f) << "\n";
  //std::cout << "Serial openmp  : " << integral_openmp(XA, XB, N, f) << "\n";
  double t1 = omp_get_wtime();
  integral_openmp(XA, XB, N, f);
  double t2 = omp_get_wtime();

#pragma omp parallel
  {
    if(0 == omp_get_thread_num()) {
      std::cout << omp_get_num_threads() << "\t" << t2 - t1 << std::endl;
    }
  }
}

double f(double x)
{
  return x*x;
}

double integral_serial(double a, double b, int N, fptr f)
{
  const double dx = (b-a)/N; 
  // compute integral
  double sum = 0, x;
  for(int ii = 0; ii < N; ++ii) {
    x = a + ii*dx;
    sum += dx*f(x);
  }
  return sum;
}

double integral_openmp(double a, double b, int N, fptr f)
{
  // TODO: parallelize the serial version above (this is the exercise)
  return 0.0; // placeholder so the code compiles
}
  3. Parallelize a Matrix-Matrix multiplication. Compare the performance when you use one, two, three, and four threads.

  4. Is this loop parallelizable? If not, why?

    #pragma omp parallel for
    for (int i = 1; i < N; i++)
    {
      A[i] = B[i] * A[i - 1];
    }
    
  5. Parallelize a matrix-vector multiplication. What must be shared? What should be private?

  6. Parallelize the matrix transposition.

  7. Check and solve the exercises on https://computing.llnl.gov/tutorials/openMP/exercise.html .

  8. Check and solve http://www.hpc.cineca.it/content/training-openmp .

  9. Some other examples are at https://www.archer.ac.uk/training/course-material/2015/03/thread_prog/ .