Parallel Programming FAQ

This is a short introduction to porting code from a serial programming environment to the SunFire multi-processor system used by HPCVL. It is meant to give the user a basic idea of what to do to get code running on several processors. We assume that the code is written in Fortran, but most considerations carry over directly to C/C++ code. The document is organized as an FAQ, i.e. a list of "obvious" questions serves as a guideline. Please feel free to contact Hartmut Schmider if you want to see more questions included.

Frequently Asked Questions:

1.      Where are the Fortran and C/C++ compilers located?

2.      Which environment variables do I have to set, what does my path have to look like if I want to do program development?

3.      How do I compile and link serial programs? Which compiler flags should I use?

4.      Will my serial code run in parallel without changes?

5.      How do I "parallelize" my code?

6.      How can I use multiple threads to get parallel performance out of my serial code?

7.      How do I force multi-thread parallelization? How to use compiler directives?

8.      What is MPI and when do I use it?

9.      How do I parallelize my code with MPI?

10.  How do I compile and run MPI code on the SUN?

11.  How can I check out performance of my serial, multi-threaded, or MPI code?

12.  It doesn't work. Where can I get help?

Answers:

1.Where are the Fortran and C/C++ compilers located?

On the SunFires of HPCVL, the Fortran and C/C++ compilers and the needed headers, libraries and tools can be found under the /opt/studioXX/SUNWspro directory tree, where XX stands for the version. The current version is 12. The compilers for F77, F90, F95, C and C++, together with a development tool called "sunstudio", are under /opt/studioXX/SUNWspro/bin. Various libraries are under /opt/studioXX/SUNWspro/lib. This includes dynamic ones, so if your program complains about not finding a shared library such as "mickey_mouse.so", setting

  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/studioXX/SUNWspro/lib 
might be a good idea. This directory tree contains much more, including online documentation, so you can get help by pointing your web browser on the SunFire login node to
file:///opt/studioXX/SUNWspro/docs/index.html.

2.Which environment variables do I have to set, what does my path have to look like if I want to do program development?

On our new cluster, you do not have to add anything to your default setup to use program development tools such as compilers and debuggers. We use a program called usepackage, which replaces lengthy environment-variable settings with a simple command, "use". Without issuing any additional use commands, you start with standard user settings that include the latest compilers and development tools. If you want to change this, you can do so by issuing the command "use package", where package stands for one of the following:

   ct6 - Sun ClusterTools 6
   studio12 - Sun Studio 12 Compilers and Tools
   studio11 - Sun Studio 11 Compilers and Tools
   studio10 - Sun Studio 10 Compilers and Tools
   studio8 - Sun Studio 8 Compilers and Tools
   studio7 - Sun Studio 7 Compilers and Tools
   workshop6 - Sun Workshop 6 Compilers and Tools

You can do things manually, of course. You should have something like

   PATH=$PATH:/opt/studio10/SUNWspro/bin:/opt/SUNWhpc/bin 
   export PATH
   MANPATH=$MANPATH:/opt/studio10/SUNWspro/man:/opt/SUNWhpc/man
   export MANPATH
in your setup file (.profile, .bash_profile or .bashrc; the syntax shown is for sh, ksh and bash). The first sets your search path, the second your "manual path" (if you want to use the Unix man command). The first entry in each case is for the standard compilers, the second is for "High Performance" tools, compilers and libraries. With these settings you should be able to run the development tool "sunstudio" and get started editing, compiling and debugging programs.

3.How do I compile and link serial programs? Which compiler flags should I use?

You will use the "Sun Studio" compilers which reside in /opt/studioXX/SUNWspro/bin to compile and link. To compile a Fortran 77, Fortran 90, Fortran 95, C, or C++ program, you issue the f77, f90, f95, cc, or CC commands, respectively. Compiling and linking is best done with a makefile. But you can also issue the commands by hand.

To compile:

  compiler -c [options] name.ext
(compiler = f77, f90, f95, cc or CC; name = name of your program source file; ext = extension, i.e. f for Fortran, c for C, cpp or C for C++, etc.; [options] denotes compiler flags, which usually start with a '-')
Note for Fortran programmers: You are actually using the Fortran 90 (f90) compiler even if you are compiling F77 programs. The f77 command just adds compiler flags that concern compatibility.

To link:

  compiler -o name [options] [libraries] list 
(compiler: see above; name: name of the executable; [options]: see above; [libraries]: libraries that need to be linked in, usually as a list of file names with full path, or as '-L' and '-l' combinations [see below]; list: list of object files, usually with .o extension)

Using the compilers and the linker in the above manner requires the proper setting of the PATH environment variable.

There are hundreds of compiler flags, and many of them are not required most of the time. A few that are in more frequent use are:

-xOn optimizes your code. n is a number from 1 to 5; higher levels alter the code more aggressively, but usually also yield a larger gain. Levels up to -xO3 are generally rather safe to use. But you should, of course, always check results against an un-optimized version: they might differ.

-fast is a combination of optimization flags that is quite safe to use and often improves performance a lot. However, the resulting code is specific to the current UltraSparc-IV+ machines and cannot be executed on older Sun machines (including the UltraSparc-III based 15K's that are still part of our cluster). Note that this overrides the -xOn option if it comes after it, since compiler options are processed from left to right!

-g produces code that can be debugged. Unlike for other compilers, -g and -xOn are not mutually exclusive, so it is a good flag to have in the development stage of a program.

-v produces verbose output, which can make it easier to track down problems.

-lname is used to bind in a library called libname.a (static) or libname.so (dynamic). This flag is used only for linking.

-Ldirname is used in conjunction with -lname and lets the linker know where to look for libraries. dirname is a directory name such as /opt/studio12/SUNWspro/prod/lib.

-Rdirname is used to tell the program where to get dynamic libraries at runtime.

There are many more flags. They are documented in the man pages (man f90 or man cc), as well as in various documents that may be downloaded in pdf format from the Sun documentation website. The latter is a good place to look to resolve problems in any case. Use the search engine to obtain User's Guides and Reference Manuals.

Some compiler flags are only useful for parallel programs and will be discussed later. Sometimes there is a considerable performance gain from using specific options (such as -xchip and -xtarget), but the code becomes less general.

4.Will my serial code run in parallel without changes?

No. At the very least, you will have to recompile it with "parallel options" and to set a few environment variables. For most code, that will not be enough either. Fortunately, in many cases, it is not difficult to get the compiler to produce code that will show some performance gain from multi-threading.

5. How do I "parallelize" my code?

In essence there are 4 steps that should be considered to "parallelize" your code:

  1. Optimize the serial version as much as you can. Try to make it as "simple" as possible, avoiding nested loops and loops with dependencies, i.e. loops where the operations inside one iteration depend on the results from a previous one. Dependencies may be hidden in function calls or in references to global variables or COMMON blocks. Often, a program spends most of its execution time in a few loops. Those are the candidates for parallel performance. Try to find them (e.g. by running analyzer software, such as is available inside the "sunstudio" development tools, or by explicitly inserting timing routines like etime() into the code). Focus on the simplification of those loops.
  2. Use auto-parallelization flags of the compiler (see section 6)
  3. Force multi-threading via OpenMP compiler directives (see section 7)
  4. Use MPI routines to run separate processes that communicate with each other (see section 8).
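
As a rough illustration of the timing approach mentioned in step 1, here is a minimal sketch that times two loops by hand. It uses the standard Fortran 95 cpu_time intrinsic in place of etime(); the program name, the array size and the loop bodies are made up for this example.

```fortran
program time_loops
  implicit none
  integer, parameter :: n = 1000000      ! hypothetical problem size
  double precision, allocatable :: x(:)
  double precision :: total
  real :: t0, t1
  integer :: i

  allocate(x(n))

  ! Time the first candidate loop
  call cpu_time(t0)
  do i = 1, n
    x(i) = sqrt(dble(i))
  end do
  call cpu_time(t1)
  print *, 'fill loop: ', t1 - t0, ' s'

  ! Time the second candidate loop
  call cpu_time(t0)
  total = 0.0d0
  do i = 1, n
    total = total + x(i)
  end do
  call cpu_time(t1)
  print *, 'sum loop:  ', t1 - t0, ' s, total = ', total

  deallocate(x)
end program time_loops
```

Keep in mind that cpu_time reports CPU time rather than elapsed time. For multi-threaded runs, a wall-clock timer such as the system_clock intrinsic is typically the more meaningful choice.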

6.How can I use multiple threads to get parallel performance out of my serial code?

The compilers running on the SunFire cluster have options that cause them to attempt to parallelize loops that have no dependencies by "multi-threading" them. The compiler flags to get this done are

  • -autopar identifies loops that are obviously non-dependent and creates multithreaded code for them
  • -reduction lets -autopar also parallelize loops that reduce the elements of an array to a single value, for example by summing over them
  • -loopinfo shows which loops were parallelized, and which not (and why)
  • -stackvar Necessary. Allocates local variables on the stack.
This will only work when the loops to be parallelized do not have any dependencies.
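
To make the notion of a dependency concrete, here is a small sketch with three loops: one with independent iterations, one that performs a reduction, and one with a genuine dependency between iterations. The program and variable names are made up. Which loops the compiler actually parallelizes is something you should check with -loopinfo rather than take from this example.

```fortran
! Possible compile line, using the flags listed above (file name is hypothetical):
!   f90 -autopar -reduction -loopinfo -stackvar -o autopar_demo autopar_demo.f90
program autopar_demo
  implicit none
  integer, parameter :: n = 100000
  double precision :: a(n), b(n), s
  integer :: i

  ! Independent iterations: a(i) depends only on i.
  ! This is the kind of loop -autopar is meant for.
  do i = 1, n
    a(i) = 2.0d0 * dble(i)
  end do

  ! Reduction: every iteration updates the same scalar s.
  ! This is the pattern that -reduction is meant to recognize.
  s = 0.0d0
  do i = 1, n
    s = s + a(i)
  end do

  ! Recurrence: b(i) needs b(i-1) from the previous iteration.
  ! This is a true dependency, so the loop has to run in order.
  b(1) = a(1)
  do i = 2, n
    b(i) = b(i-1) + a(i)
  end do

  ! Both values are the sum over a, so they should agree.
  print *, 's = ', s, '  b(n) = ', b(n)
end program autopar_demo
```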

7. How do I force multi-thread parallelization? How to use compiler directives?

The compiler will be very conservative about multithreading loops. If there is the slightest possibility of data dependencies, it will refuse to do it if -autopar is used. Function calls within loops, if statements that depend on variables which change in the loop, and many other features will be considered "dangerous" and inhibit parallelization. The reason is that such features have a potential to make the result dependent on the order in which the loop iterations are carried out, and therefore go against a parallel execution.

However, often you know more than the compiler. You might be certain that a function call does not alter the value of variables that are shared with other loop iterations. If this is the case, there are ways to tell the compiler to parallelize anyhow. This is done via compiler directives that look like comments but, if compiled with the proper flags, guide the compiler in parallelizing the code. The most common ones are OpenMP compiler directives. Here is an example:

   !$OMP PARALLEL DO PRIVATE(a)
   do i = 1, n
     a(1) = b(i)
     do j = 2, n
       a(j) = a(j-1) + b(j) * c(j) 
     end do
     x(i) = f(a) 
   end do
The initial "!" in the first line of this Fortran segment causes that line to be interpreted as a comment, unless this is compiled with the compiler flag "-xopenmp". In that case, the first line tells the compiler to parallelize the DO loop directly following it. The PRIVATE instruction causes a separate copy of the array to be used for each parallel thread (i.e. a is used as a "thread local variable").

Some of the compiler flags for this approach are:

  • -xexplicitpar parallelize when I tell you to by compiler directives
  • -vpara verbose output about dependencies in the explicitly parallelized loops
  • -parallel same as -xexplicitpar, but additionally auto-parallelizes if possible
  • -mp=type specify type of directives, type can be "sun", "cray" or "openmp"; note that the openmp type selects the platform-independent OpenMP compiler directives, which are the de-facto industry standard. The use of these is strongly encouraged.
  • -xopenmp includes all necessary flags for usage of OpenMP compiler directives, and therefore replaces the -xexplicitpar -mp=openmp combination. Also includes some other flags (see man pages).
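
For a complete program to experiment with, here is a minimal sketch of a directive-parallelized loop. It approximates pi by integrating 4/(1+x*x) over the interval from 0 to 1. The program name, the number of intervals and the choice of integrand are assumptions made for this example. The PRIVATE clause gives each thread its own x, and the REDUCTION clause tells the compiler to combine the per-thread partial sums into total.

```fortran
! Possible compile line (file name is hypothetical):
!   f90 -xopenmp -o omp_pi omp_pi.f90
program omp_pi
  implicit none
  integer, parameter :: n = 10000000     ! number of integration intervals
  double precision :: h, x, total
  integer :: i

  h = 1.0d0 / dble(n)
  total = 0.0d0

  ! Without -xopenmp the two directive lines are plain comments
  ! and the loop simply runs serially.
  !$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:total)
  do i = 1, n
    x = h * (dble(i) - 0.5d0)            ! midpoint of interval i
    total = total + 4.0d0 / (1.0d0 + x*x)
  end do
  !$OMP END PARALLEL DO

  print *, 'pi is approximately ', h * total
end program omp_pi
```

The number of threads can be chosen at run time through the standard OpenMP environment variable OMP_NUM_THREADS.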

Note that a separate OpenMP FAQ is available that contains more information about this programming technique.

8.What is MPI and when do I use it?

The ultimate parallelization is, of course, achieved by re-writing the code in a parallel fashion, so that it can be executed as separate processes on several processors, or indeed on several machines. For this, it is necessary to establish some communication between the processes, and this is usually done by some form of message passing. A platform-independent standard for this is a set of more than 200 routines, available in Fortran and C, that comprise the MPI (Message Passing Interface) standard. Using these routines requires a little rethinking of the code structure, but is in many cases rather simple and effective.

MPI is best used if your code has a good potential to employ many processors independently with none sitting idle. It is also advantageous if only relatively little communication is necessary between processes. Examples are numerical integration (where independent evaluations of the integrand can be done separately), Monte-Carlo methods, finite-difference and finite-element methods (if the problem can be divided up into blocks of equal size with minimal communication). MPI requires some serious re-coding in some cases, but with a relatively small number of routines, great scaling can be achieved.
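
Dividing a problem into blocks of (nearly) equal size comes down to a bit of index arithmetic. The following serial sketch shows one common way to do it. It loops over hypothetical process numbers and prints which items each one would handle. The values of n and np and all names are made up. In a real MPI program each process would compute only its own range from its rank.

```fortran
program block_split
  implicit none
  integer, parameter :: n = 10, np = 4   ! hypothetical item and process counts
  integer :: rank, base, extra, first, last

  base  = n / np          ! every process gets at least this many items
  extra = mod(n, np)      ! the first "extra" processes get one more

  do rank = 0, np - 1
    first = rank * base + min(rank, extra) + 1
    last  = first + base - 1
    if (rank < extra) last = last + 1
    print *, 'rank', rank, ': items', first, 'to', last
  end do
end program block_split
```

With the values above, the four ranges are 1 to 3, 4 to 6, 7 to 8 and 9 to 10, so no process holds more than one item more than any other.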

9.How do I parallelize my code with MPI?

A very simple example of how to parallelize code with MPI is given in the monte.f Fortran program.

Only a few MPI commands are necessary to parallelize this Monte-Carlo calculation of pi. The first

  call MPI_INIT(ierr) 
sets up the MPI system and has to be called in any MPI program. The next two
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
are used to determine the "rank", i.e. number of the presently running process, and the total number of processes running (size). The identifier MPI_COMM_WORLD is used to label a group of processes assigned to this task, called a "communicator". With
  call MPI_REDUCE(pi,pisum,1,MPI_DOUBLE_PRECISION,&
       MPI_SUM,0,MPI_COMM_WORLD,ierr)
the partial sums (pi) from the different processes are summed up (reduced) into the total (pisum). This is done simultaneously with the gathering of the results from the processes, and is called "reduction". Finally,
  call MPI_FINALIZE(ierr)
closes the MPI system.
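
To show how these calls fit together, here is a minimal sketch of a Monte-Carlo calculation of pi. It is not the monte.f program itself. The program name, the sample count and the seeding scheme are assumptions made for this example. Each process counts random points that fall inside the unit quarter circle and scales its estimate by 1/np, so that the reduction over all processes yields the average.

```fortran
program monte_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: ntotal = 4000000   ! hypothetical total sample count
  integer :: rank, np, ierr, i, nlocal, hits, seed_size
  integer, allocatable :: seed(:)
  double precision :: x, y, pi, pisum

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

  ! Give each process a different seed. This is a simplistic scheme
  ! chosen for illustration; it does not guarantee independent streams.
  call random_seed(size=seed_size)
  allocate(seed(seed_size))
  seed = 12345 + 1000 * rank
  call random_seed(put=seed)

  ! Each process handles its own share of the samples
  nlocal = ntotal / np
  hits = 0
  do i = 1, nlocal
    call random_number(x)
    call random_number(y)
    if (x*x + y*y <= 1.0d0) hits = hits + 1
  end do

  ! Partial result, scaled so that the sum over all processes gives pi
  pi = 4.0d0 * dble(hits) / dble(nlocal) / dble(np)

  call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, &
       MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  ! Only the process with rank 0 holds the reduced total
  if (rank == 0) print *, 'pi is approximately ', pisum

  call MPI_FINALIZE(ierr)
end program monte_sketch
```

Note that only rank 0 prints the result, because MPI_REDUCE with root 0 delivers the sum to that process alone.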

To get an idea of how to use MPI and what the various routines do, check out the MPI workshop at the Maui HPC Centre site. For a list of routines in the MPI standard, and a reference manual of their usage, go to the Sun Documentation Website and search for the Sun MPI Programming and Reference Guide. Note that we offer a separate MPI FAQ with more information about this system.

Although the MPI standard comprises hundreds of routines, you can write very stable and scalable code with only a dozen or so routines. In fact, often the simpler you keep it the better it will work.

10.How do I compile and run MPI code on the SUN?

To use MPI on our Sunfire cluster, you will have to do the following things:

  • Include header files at the top of all subroutines that use MPI, i.e.
    for Fortran INCLUDE 'mpif.h' and for C/C++ #include <mpi.h>
    This is important for the definition of variables and constants that are used by the MPI system.
  • Compile and link with the following flags:
    -I/opt/SUNWhpc/include -L/opt/SUNWhpc/lib -R/opt/SUNWhpc/lib -lmpi
    These tell the compiler, linker and runtime environment where to look for include files, static libraries and runtime dynamic libraries. The flag -lmpi links in the MPI library.
  • As an alternative to the above flags, you can use the tmf90, tmcc, or tmCC macros for Fortran, C, and C++, respectively, instead of the standard compilers/linkers. These will automatically supply the right flags. However, the -lmpi library flag still has to be issued.
  • For running MPI programs, a special multi-processor runtime environment (CRE) is needed. This allows you to specify how many processes are used for the execution of the program, from which pool of processes they should be taken, etc.
The CRE runtime environment that Sun provides has the following important components:
  • mprun Run programs
  • mpps Monitor processes
  • mpkill Shut down processes
  • mpinfo Show information about the runtime environment
The setup for CRE is part of the default on our cluster. The /opt/SUNWhpc/bin directory must be in your PATH.

    mprun lets you specify the number of processors, e.g. mprun -np 4 test_par runs the MPI program test_par on 4 processors.

    mpps works just like the Unix ps command and lets you monitor running processes, and identify their number.

    mpkill works similarly to the Unix kill command and is used to cancel running processes, e.g. mpkill -9 1512 will terminate job number 1512 (the -9 makes sure the process is completely killed).

    mpinfo gives information about partitions, processors, etc... It is usually called with the -N or -p switches.

    For help on the runtime environment on the Sun machines, consult the Sun Documentation Site and search for HPC Cluster Tools User's Guide.

    11.How can I check out performance of my serial, multi-threaded, or MPI code?

    The Sun machines are equipped with a powerful interface for program development called Sun Studio. If you have the proper shell setup, you can call it by simply typing sunstudio. The program is quite complex, so only an outline of how to use it for profiling serial and multi-threaded code can be given here. An online guide is available at

    file:///opt/studioXX/SUNWspro/prod/lib/locale/C/html/index.html
    on the SunFire login node. Other documentation can be found at the Sun Docs Site.

    In order to analyze your program with the Sun Studio Tool, you need to compile it with the -g option. After calling sunstudio a GUI will appear. Then click on Analyze on the tool bar, choose File and Collect Experiment, then specify the program on the popup menu. After pressing Run, data from a program run will be collected. After completion, these data will be stored in a file called test.1.er and a (hidden) directory called .test.1.er. Now you are ready to have a look at them. Close the sampling collector window and go back to the main sunstudio tool bar. Click on Analyze -> File -> Open Experiment and load test.1.er. You will get an Analyzer window that lets you see the total exclusive and inclusive time spent in the various subroutines, the percentage of time used by each, and much more. Try the Metrics and the Callers-Callees windows to get more information.

    If you do not like GUI's, there is a collect command that lets you produce test.1.er from the command line. Check out the man pages with man collect. And if you prefer a printed report for analyzing the experiment, there is a utility that does that, called er_print, also documented in the man pages: man er_print. These come in handy if you do not have a desktop environment available.

    This tool lets you analyze where most of the execution time in your program is spent. It can also handle multiple processes which it collects into separate experiments.

    12.It doesn't work. Where can I get help?

    All of these things are documented at http://docs.sun.com , but the mass of information on that site makes it a bit difficult to know where to look. Try using the search engine.

    If you have questions that you can't resolve by checking documentation, you can contact us. We have several user support people who can help you with code migration to the parallel environment of the HPCVL facilities. If you want to start a larger project that involves making code executable on parallel machines, they might be able to help you. Keep in mind that we support many people at any given time, so we cannot do the coding for you. But we can do our best to help you get your code ready for multi-processor machines.

    Of course, some programs are inherently non-parallel, and trying to make them scalable might be too much effort to be worth it. In that case, the best one can do is try to improve the serial performance by adapting the code to modern computer architecture. The performance enhancement that can be achieved is sometimes quite amazing. It seems, however, that most programs have a good potential to be executed in parallel, and a little effort in that direction often goes a long way.

    © HPCVL 2007