9. Simple parallelization: farm task and GNU parallel or xargs

Sometimes you do not actually need to parallelize your code, but only to run it with many parameter combinations. Let's assume that we have a task that depends on one parameter and can be executed independently for each parameter value. It can be a very complex program, but for now it will be just a few simple bash instructions that print a value and keep a core busy. Save the following code in a bash script (like script.sh, here stored under codes/) that uses the stress command to stress a single core for 10 seconds:

# file: script.sh
echo "First arg: ${1}"
stress -t 10 -c 1 # stress one core
echo "Stress test done"

When executed, it prints its first argument, stresses one core for 10 seconds, and then prints a completion message:

bash codes/script.sh 23

What if we want to execute this task for 4 different arguments? We could just do it sequentially:

date +"%H-%M-%S"
bash codes/script.sh 23
bash codes/script.sh 42
bash codes/script.sh 10
bash codes/script.sh 57
date +"%H-%M-%S"

40 seconds in total. Remember that this example is very simple, but assume that the script is a very large task. Then the previous run takes four times as long as a single task. What if we have a machine with four available threads? It would be useful to run all the commands in parallel. To do so, you might just put them in the background with the & character at the end of each command. But what happens if you need to run 7 different arguments and you have only 4 threads? Then it would not be optimal to have all of them running at the same time, each with less than 100% of CPU usage. It would be better to run 4 of them and, when one of them finishes, launch the next one, and so on.

To do this programmatically, you can use GNU parallel, https://www.gnu.org/software/parallel/ (check the tutorial in the documentation section, or the cheatsheet, https://www.gnu.org/software/parallel/parallel_cheat.pdf). If it is not installed already, you can install it with spack install parallel, and then load it with spack load parallel. For our case, it is very useful:

date +"%H-%M-%S"
parallel 'bash codes/script.sh {} ' ::: 23 42 10 57
date +"%H-%M-%S"

08-12-25
First arg: 23
stress: info: [83775] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 42
stress: info: [83779] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 10
stress: info: [83781] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 57
stress: info: [83785] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
08-12-36

Around 10 seconds now! GNU parallel detects the number of cores and launches the processes accordingly, taking care of the job distribution. Read the manual for the many options of this powerful tool, which is used even on large clusters. For instance, try to run 7 processes:

date +"%H-%M-%S"
parallel 'bash codes/script.sh {} ' ::: 23 42 10 57 21 8 83
date +"%H-%M-%S"

08-13-20
First arg: 23
stress: info: [84082] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 42
stress: info: [84086] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 10
stress: info: [84088] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 57
stress: info: [84091] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 21
stress: info: [84161] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 8
stress: info: [84165] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
First arg: 83
stress: info: [84168] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Stress test done
08-13-41
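For comparison, the plain background approach mentioned earlier (a trailing & on each command, followed by wait) can be sketched as below. This is only an illustration under some assumptions: the task function is a hypothetical stand-in for codes/script.sh that uses sleep 1 instead of stress, and the temporary log directory is introduced here just to keep each job's output separate. Note that all 7 jobs start at once, regardless of how many threads the machine has, which is exactly the oversubscription that GNU parallel avoids.

```shell
# Hypothetical stand-in for codes/script.sh: sleep replaces stress
task() {
    echo "First arg: ${1}"
    sleep 1
    echo "Task ${1} done"
}

logdir=$(mktemp -d)   # one log file per job, so outputs do not interleave
start=$(date +%s)
for arg in 23 42 10 57 21 8 83; do
    task "$arg" > "$logdir/$arg.log" &   # all 7 jobs start immediately
done
wait   # block until every background job has finished
end=$(date +%s)

cat "$logdir"/*.log
finished=$(cat "$logdir"/*.log | grep -c 'done$')   # count completed tasks
echo "Finished tasks: $finished"
echo "Elapsed: $((end - start)) s"
rm -r "$logdir"
```

With a real CPU-bound task, the 7 simultaneous jobs would compete for the 4 threads, and there is no simple way here to start a new job only when a slot becomes free.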

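You can play with the -j n flag to control how many jobs parallel runs at the same time. By default, it runs one job per available CPU thread, which is why the 7 jobs above took about 20 seconds on a 4-thread machine: the first 4 ran together and the remaining 3 started as slots became free.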
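The title also mentions xargs, which offers a similar job limit through its -P option (available in the GNU and BSD implementations, although it is not part of POSIX). The following is a minimal sketch, not a drop-in replacement for the commands above: task.sh is a hypothetical stand-in for codes/script.sh that sleeps for one second instead of calling stress, and it is written to a temporary directory only to keep the example self-contained.

```shell
# Create a hypothetical stand-in for codes/script.sh (sleep replaces stress)
workdir=$(mktemp -d)
cat > "$workdir/task.sh" <<'EOF'
echo "First arg: ${1}"
sleep 1
echo "Task ${1} done"
EOF

start=$(date +%s)
# -n 1: pass one argument per invocation; -P 2: run at most 2 jobs at a time
output=$(printf '%s\n' 23 42 10 57 | xargs -n 1 -P 2 bash "$workdir/task.sh")
end=$(date +%s)

echo "$output"
echo "Elapsed: $((end - start)) s"
rm -r "$workdir"
```

Here -P 2 plays the same role as -j 2 in parallel: four one-second tasks finish in roughly two seconds instead of four. Unlike parallel, xargs does not group the output of each job, so lines from different jobs may appear interleaved.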