Compute Canada

Parallel Programming

Please note: The FAQ pages at the HPCVL website are continuously being revised. Some pages might pertain to an older configuration of the system. Please let us know if you encounter problems or inaccuracies, and we will correct the entries.

This is a short introduction to carrying code over from a serial programming environment to the multi-processor systems used at HPCVL. It is meant to give the user a basic idea of what to do to get code running on several processors. The document is organized in an "FAQ" manner, i.e. a list of "obvious" questions is presented as a guideline. Please feel free to contact us if you would like to see more questions included.

Will my serial code run in parallel without changes?

No. At the very least, you will have to recompile it with "parallel options" and set a few environment variables. For most code, even that will not be enough. Fortunately, in many cases it is not difficult to get the compiler to produce code that shows some performance gain from multi-threading.
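
For multi-threaded code (whether auto-parallelized or using OpenMP), the number of threads is typically controlled by an environment variable; for OpenMP programs this is the standard OMP_NUM_THREADS variable (the details for auto-parallelized code depend on the compiler). A minimal example in sh/bash syntax:

 export OMP_NUM_THREADS=4

This tells the runtime system to use four threads when the multi-threaded program is executed.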

How do I "parallelize" my code?

Here are four steps to consider when "parallelizing" your code:

  1. Optimize the serial version as much as you can. Try to make it as "simple" as possible, avoiding nested loops and loops with dependencies, i.e. where the operations inside one iteration depend on the results of a previous one. Dependencies may be hidden in function calls or in references to global variables or COMMON blocks. Often, a program spends most of its execution time in a few loops; those are the candidates for parallel performance. Try to find them (e.g. by running analyzer software, such as is available inside the "sunstudio" development tools, or by explicitly inserting timing routines like etime() into the code) and focus on simplifying those loops. A short example of a loop dependency is shown after this list.
  2. Use the auto-parallelization flags of the compiler (see the next question).
  3. Force multi-threading via OpenMP compiler directives (see the question on compiler directives below).
  4. If the above approaches do not work, or you need to deploy the resulting parallel code on a cluster, use MPI routines to run separate processes that communicate with each other (see the MPI questions below). This usually requires a "from-scratch" approach.
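
To illustrate what a loop dependency looks like, here is a small, hypothetical C sketch (not taken from any particular application). The first loop can be parallelized because its iterations are independent; the second cannot simply be parallelized because each iteration needs the result of the previous one:

#include <stddef.h>

/* Illustration only: a, b, c are arrays of length n supplied by the caller. */
void loop_examples(double *a, const double *b, const double *c, size_t n)
{
    /* Independent iterations: each a[i] is computed from b[i] and c[i] alone,
       so the iterations can safely be distributed over several threads. */
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* Dependent iterations: a[i] requires a[i-1] from the previous iteration,
       so the iterations cannot simply be executed in parallel. */
    for (size_t i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];
}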

How can I use multiple threads to get parallel performance out of my serial code?

The compilers running on HPCVL clusters are discussed in our Compiler FAQ. They have options that cause them to attempt to parallelize loops that have no dependencies by multi-threading them. The compiler flags to get this done are

  • -xautopar identifies loops that are obviously free of dependencies and creates multithreaded code for them
  • -xreduction allows parallelization of reductions, i.e. loops that combine values into a single variable, for example by summing over them
  • -xloopinfo shows which loops were parallelized and which were not (and why)
  • -stackvar allocates local variables on the stack; this is necessary for the above to work

This will only work if the loops to be parallelized do not have any dependencies. Since the compiler is very conservative, even simple function calls from inside a loop cause it to reject auto-parallelization, because function calls could hide access to global variables (COMMON blocks or modules in Fortran) that establish dependencies. As a result, auto-parallelization is often not an option.
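
For reference, a typical auto-parallelized Fortran build using the flags above might look like the following (the source and program names are placeholders):

 f90 -xautopar -xreduction -xloopinfo -stackvar -o myprog myprog.f

The -xloopinfo output then reports which loops were multithreaded and which were rejected, and why.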

How do I force multi-threaded parallelization? How do I use compiler directives?

The compiler is very conservative about multithreading loops automatically. If there is the slightest possibility of a data dependency, it will refuse to do so when -xautopar is used. Function calls within loops, if statements that depend on variables changed inside the loop, and many other features are considered "dangerous" and inhibit parallelization. The reason is that such features can make the result depend on the order in which the loop iterations are carried out, and therefore stand in the way of parallel execution.

However, you often know more than the compiler. You might be certain that a function call does not alter the value of variables that are shared between loop iterations. If this is the case, there are ways to tell the compiler to parallelize anyhow. This is done via compiler directives that look like comments but, if the code is compiled with the proper flags, guide the compiler in parallelizing it. The most common ones are OpenMP compiler directives. Here is an example in Fortran:

!$OMP PARALLEL DO PRIVATE(a)
do i = 1, n
   a(1) = b(i)
   do j = 2, n
      a(j) = a(j-1) + b(j) * c(j)
   end do
   x(i) = f(a)
end do

and in C:

#pragma omp parallel for private(a,j)
for (i = 1; i < n+1; i++) {
   a[1] = b[i];
   for (j = 2; j < n+1; j++) {
      a[j] = a[j-1] + b[j] * c[j];
   }
   x[i] = f(a);
}

The initial "!" in the first line of the Fortran segment causes that line to be interpreted as a comment, unless the code is compiled with the compiler flag -xopenmp. In that case, the first line tells the compiler to parallelize the loop that directly follows it. The PRIVATE declaration causes a separate copy of the array "a" to be used by each parallel thread (i.e. "a" is treated as a private variable).

Some commonly used compiler flags for this approach are:

  • -xopenmp enables the use of OpenMP compiler directives and implies several other flags (see the man pages). This is the most commonly used multi-threading flag if you are doing explicit (as opposed to automatic) parallelization; the others are only occasionally needed.
  • -vpara gives verbose output about dependencies in the explicitly parallelized loops.
  • -xloopinfo issues messages about the parallelization of loops.
  • -stackvar allocates private variables on the stack. This option is implied by -xopenmp.
  • -xopenmp=noopt turns off the automatic increase of the optimization level (to -xO3) implied by -xopenmp.
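
As an illustration, the C example above might be compiled and run roughly as follows, with OMP_NUM_THREADS set to the desired number of threads (the file and program names are placeholders):

 cc -xopenmp -xloopinfo -o omp_test omp_test.c
 ./omp_test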

Because the platform-independent OpenMP compiler directives are now the standard, the use of older, vendor-specific directives, while still supported, is strongly discouraged.

A separate OpenMP FAQ is available that contains more information about this programming technique.

What is MPI and when do I use it?

Sometimes it is necessary to re-write the code in a parallel fashion so that it can be executed on several separate processors, or indeed separate machines. For this, some communication between the processes has to be established, which is usually done by some form of message passing. A platform-independent standard for this is a set of almost 300 routines, available in Fortran, C, and C++, that make up the MPI (Message Passing Interface) standard. Using these routines requires some rethinking of the code structure, but is reasonably simple and effective in many cases.

MPI is best used if your code has good potential to employ many processors independently, with none sitting idle. It is also advantageous if only relatively little communication is necessary between processes. Examples are numerical integration (where independent evaluations of the integrand can be done separately), Monte-Carlo methods, and finite-difference and finite-element methods (if the problem can be divided into blocks of equal size with minimal communication). MPI requires serious re-coding in some cases, but great scaling can be achieved with a relatively small number of routines.

How do I parallelize my code with MPI?

A very simple example of how to parallelize code with MPI is given in the monte.f Fortran program.

Only a few MPI commands are necessary to parallelize this Monte-Carlo calculation of pi. The first

 call MPI_INIT(ierr) 

sets up the MPI system and has to be called in any MPI program. The next two

 call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr) 
call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

are used to determine the "rank", i.e. the number of the presently running process, and the total number of running processes (the size). The identifier MPI_COMM_WORLD labels a group of processes assigned to this task, called a "communicator". With

 call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, &
                 MPI_SUM, 0, MPI_COMM_WORLD, ierr)

the partial sums (pi) from the different processes are summed up (reduced) into the total (pisum). This is done simultaneously with the gathering of the results from the processes, and is called "reduction". Finally,

 call MPI_FINALIZE(ierr)

closes the MPI system.

To get an idea of how to use MPI and what the various routines do, check out the MPI workshop at the Maui HPC Centre site. For a list of routines in the MPI standard, and a reference manual of their usage, go to the Sun Documentation Website and search for the Sun MPI Programming and Reference Guide.

We offer a separate MPI FAQ with more information about this system.

Although the MPI standard comprises hundreds of routines, you can write very stable and scalable code with only a dozen or so routines. In fact, often the simpler you keep it the better it will work.
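
To make the structure concrete, here is a minimal, self-contained C sketch of a similar Monte-Carlo calculation of pi. It is not the monte.f program itself, only an illustration of the MPI calls discussed above; the file name mpi_pi.c and all variable names are hypothetical:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, np;
    long i, n = 1000000, hits = 0, total = 0;

    MPI_Init(&argc, &argv);                  /* set up the MPI system        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of the present process  */
    MPI_Comm_size(MPI_COMM_WORLD, &np);      /* total number of processes    */

    srand((unsigned)rank + 1);               /* a different seed per process */
    for (i = 0; i < n; i++) {
        /* count random points that fall inside the unit quarter circle */
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;
    }

    /* Sum the partial counts from all processes onto process 0 ("reduction") */
    MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %f\n",
               4.0 * (double)total / ((double)n * (double)np));

    MPI_Finalize();                          /* close the MPI system         */
    return 0;
}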

How do I compile and run MPI code on HPCVL clusters?

To use MPI on our clusters, you will have to do the following things:

  • Include the MPI header file at the top of all subroutines that use MPI, i.e.
    for Fortran INCLUDE 'mpif.h' and for C
     #include <mpi.h>
    This is important for the definition of variables and constants that are used by the MPI system.
  • Compile and link with the following flags:
     -I/opt/SUNWhpc/include -L/opt/SUNWhpc/lib -R/opt/SUNWhpc/lib -lmpi 
    These tell the compiler, linker, and runtime environment where to look for include files, static libraries, and runtime dynamic libraries. The option -lmpi links the MPI library.
  • As an alternative to the above flags, you can use the
     tmf90, tmcc, or tmCC
    macros for Fortran, C, and C++, respectively, instead of the standard compilers/linkers. These automatically supply the right flags, including -lmpi (see the example after this list).
  • For running MPI programs, a special multi-processor runtime environment is needed. This allows you to specify how many processes are used for the execution of the program, from which pool of processes they should be taken, etc. The most important command is
     mpirun [options]
    where options specify the parameters of the run.
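
For example, an MPI C program might be compiled with the tmcc macro into the executable test_par used below (the source file name is a placeholder):

 tmcc -o test_par test_par.c

The resulting executable is then started with mpirun, as described next.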

The mpirun command is part of the ClusterTools programming environment and is necessary to run MPI programs and allocate the separate processes across the multi-processor system. The setup for ClusterTools is part of the default environment on our clusters; the /opt/SUNWhpc/bin directory must be in your PATH (which it is in the default environment).

mpirun lets you specify the number of processors, e.g.

 mpirun -np 4 test_par

runs the MPI program test_par on 4 processors. There is a myriad of other options for this command, many of which concern details of process allocation that are handled automatically by the system on HPCVL clusters and therefore do not have to concern the user.

For help on ClusterTools, consult Sun's Documentation Site and search for HPC Cluster Tools User's Guide.

It doesn't work. Where can I get help?

All of these things are documented at http://docs.sun.com , but the mass of information on that site makes it a bit difficult to know where to look. Try using the search engine.

If you have questions that you can't resolve by checking documentation, you can Contact us. We have several user support people who can help you with code migration to the parallel environment of the HPCVL facilities. If you want to start a larger project that involves making code executable on parallel machines, they might be able to help you. Keep in mind that we support many people at any given time, so we cannot do the coding for you. But we can do our best to help you get your code ready for multi-processor machines.

Of course, some programs are inherently non-parallel, and trying to make them scalable might be too much effort to be worth it. In that case, the best one can do is try to improve the serial performance by adapting the code to modern computer architectures. The performance enhancement that can be achieved is sometimes quite amazing. It seems, however, that most programs have good potential to be executed in parallel, and a little effort in that direction often goes a long way.