|
This is a short introduction on how to carry over code from a
serial programming environment to the SUNFire multi-processor system
used by HPCVL. It is meant to give the user a basic idea of what to do
to get the code running on several processors. We assume that the code
is written in FORTRAN, but most considerations carry over directly to
C/C++ code. The document is organized in an "FAQ" manner, i.e. a list
of "obvious" questions is presented as a guideline. Please feel free
to contact Hartmut Schmider if you want to see more questions
included.
Frequently Asked Questions:
1. Where are the Fortran and C/C++ compilers located?
2. Which environment variables do I have to set, what does my path have to look like if I want to do program development?
3. How do I compile and link serial programs? Which compiler flags should I use?
4. Will my serial code run in parallel without changes?
5. How do I "parallelize" my code?
6. How can I use multiple threads to get parallel performance out of my serial code?
7. How do I force multi-thread parallelization? How to use compiler directives?
8. What is MPI and when do I use it?
9. How do I parallelize my code with MPI?
10. How do I compile and run MPI code on the SUN?
11. How can I check out performance of my serial, multi-threaded, or MPI code?
12. It doesn't work. Where can I get help?
Answers:
1. Where are the Fortran and C/C++ compilers located?
On the SUNFires of HPCVL, the Fortran and C++ compilers and the
needed headers, libraries and tools can be found under the
/opt/studioXX/SUNWspro subdirectory system. XX stands
for the version. The current version is 12. The compilers for
F77, F90, F95, C and C++, together with a development tool called
"sunstudio" are under /opt/studioXX/SUNWspro/bin. Various libraries
are under /opt/studioXX/SUNWspro/lib. This includes dynamic ones,
so if your program complains about not finding "mickey_mouse.so",
setting
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/studioXX/SUNWspro/lib
might be a good idea. There is a lot of other material under this
subdirectory, including online documentation, so you can get help by
pointing your web browser on the SunFire login node to
file:///opt/studioXX/SUNWspro/docs/index.html.
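The LD_LIBRARY_PATH setting above can be sketched as follows (sh, ksh, bash syntax). This is only an illustration: it assumes XX = 12, the version named above, and the variable name STUDIO_LIB is made up for this example. The directory is appended only once, and no empty entry is created if LD_LIBRARY_PATH was not set before.

```shell
# Assumption: Sun Studio 12, i.e. XX = 12 in /opt/studioXX.
STUDIO_LIB=/opt/studio12/SUNWspro/lib   # illustrative variable name

# Append the directory only if it is not already listed.
case ":${LD_LIBRARY_PATH:-}:" in
    *":$STUDIO_LIB:"*)
        ;;  # already present, nothing to do
    *)
        # ${VAR:+...} yields "old value plus colon" only if VAR is non-empty
        LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$STUDIO_LIB"
        ;;
esac
export LD_LIBRARY_PATH

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Placed in your setup file, this takes effect at the next login.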
2. Which environment variables do I have to set, what does my path have to look like if I want to do program development?
On our new cluster, you do not have to add anything to your default
setup to use program development tools such as compilers and
debuggers. We are using a program called usepackage, which
replaces lengthy settings of environment variables with a
simple command, "use". Without issuing any additional use
commands, you start with standard user settings that include
the latest compilers and development tools. If you want to change
this, you can do so by issuing the command use package,
where package stands for one of the following:
ct6 - Sun ClusterTools 6
studio12 - Sun Studio 12 Compilers and Tools
studio11 - Sun Studio 11 Compilers and Tools
studio10 - Sun Studio 10 Compilers and Tools
studio8 - Sun Studio 8 Compilers and Tools
studio7 - Sun Studio 7 Compilers and Tools
workshop6 - Sun Workshop 6 Compilers and Tools
You can do things manually, of course. You should have something like
PATH=$PATH:/opt/studio10/SUNWspro/bin:/opt/SUNWhpc/bin
export PATH
MANPATH=$MANPATH:/opt/studio10/SUNWspro/man:/opt/SUNWhpc/man
export MANPATH
in your setup file (.profile, .bash_profile or .bashrc; the syntax
shown is for sh, ksh and bash). The first sets your search path, the
second your "manual path" (needed if you want to use the Unix man
command). The first entry in each case is for the standard compilers,
the second is for "High Performance" tools, compilers and libraries.
With these settings you should be able to run the development tool
"sunstudio" and get started editing, compiling and debugging programs.
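A quick way to check whether the setup worked is to ask the shell where it finds the development commands. The following is a small sketch using only the standard "command -v" built-in; the list of tools is taken from this document and can be shortened or extended as needed.

```shell
# Report where each development command is found (or that it is missing).
found=0
for tool in f77 f90 f95 cc CC sunstudio mprun; do
    if location=$(command -v "$tool" 2>/dev/null); then
        echo "$tool -> $location"
        found=$((found + 1))
    else
        echo "$tool: not found in PATH"
    fi
done
echo "$found of 7 commands found"
```

If a compiler is reported as missing, or is found in an unexpected directory, the PATH entries are the first thing to check.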
3. How do I compile and link serial programs? Which compiler flags should I use?
You will use the "Sun Studio" compilers which reside in
/opt/studioXX/SUNWspro/bin to compile and link. To compile a
Fortran 77, Fortran 90, Fortran 95, C, or C++ program, you issue the
f77, f90, f95, cc, or CC commands, respectively. Compiling and linking
is best done with a makefile. But you can also issue the commands by
hand.
To compile:
compiler -c [options] name.ext
(compiler = f77, f90, f95, cc or CC; name = name of your
program source file; ext = extension, i.e. f for Fortran, c for
C, cpp or C for C++, etc.; [options] denotes compiler flags
that usually start with a '-')
Note for Fortran programmers: You are actually using the
Fortran 90 (f90) compiler even if you are compiling F77 programs. The
f77 command merely adds compiler flags that concern
compatibility.
To link:
compiler -o name [options] [libraries] list
(compiler = see above; name = name of the executable;
[options] = see above; [libraries] = libraries that need to
be linked in, usually as a list of file names with full path, or as
'-L' and '-l' combinations [see below]; list =
list of object files, usually with .o extension)
Using the compilers and the linker in the above manner requires the
proper setting of the PATH environment variable.
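As a concrete sketch of the two steps, consider a program that consists of two Fortran source files. The file names main.f and solver.f and the executable name myprog are hypothetical; the flags -g and -xO3 are explained in the list of compiler flags in this answer.

```shell
FC=f90                # compiler command; f77, f95, cc or CC work the same way
FFLAGS="-g -xO3"      # debugging information plus moderate optimization

if command -v "$FC" >/dev/null 2>&1; then
    # Compile each source file into an object file (main.o, solver.o),
    # then link the object files into the executable myprog.
    "$FC" -c $FFLAGS main.f &&
    "$FC" -c $FFLAGS solver.f &&
    "$FC" -o myprog $FFLAGS main.o solver.o
else
    echo "$FC not found: check your PATH setting"
fi
```

The "&&" makes sure that linking is only attempted if both compile steps succeeded. In a makefile the same commands would appear as separate rules, so that only changed files are recompiled.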
There are hundreds of compiler flags, and many of them are not
required most of the time. A few that are in more frequent use are:
-xOn optimizes your code. n is a number from 1 to 5; higher
levels make more aggressive alterations to the code, but also
yield a larger gain. Up to -xO3 is
generally rather safe to use. But you should, of course, always check
results against an un-optimized version: they might differ.
-fast is a combination of optimization flags that is quite
safe to use and often improves performance a lot. However, the
resulting code is specific for the current UltraSparc-IV+ machines and
cannot be executed on older SUN's (including the UltraSparc-III based
15K's that are still part of our cluster). Note that this overrides
the -xOn option if it comes after it, since compiler options
are executed from left to right!
-g produces code that can be debugged. Unlike for other
compilers, -g and -xOn are not mutually exclusive, so it
is a good flag to have in the development stage of a program.
-v produces more output than you can
handle, which makes it easier to track down problems.
-lname is used to link in a library called
libname.a (static) or libname.so (dynamic). This flag is
used only when linking.
-Ldirname is used in conjunction with
-lname and lets the linker know where to look for
libraries. dirname is a directory name such as
/opt/studio12/SUNWspro/prod/lib.
-Rdirname is used to tell the program
where to get dynamic libraries at runtime.
There are many more flags. They are documented in the man
pages (man f90 or man cc), as well as in various
documents that may be downloaded in PDF format from the Sun documentation website. The latter
is a good place to look to resolve problems in any case. Use the
search engine to obtain User's Guides and Reference
Manuals.
Some compiler flags are only useful for parallel programs and will
be discussed later. Sometimes there is a considerable performance gain
from using specific options (such as -xchip and
-xtarget), but the code becomes less general.
4. Will my serial code run in parallel without changes?
No. At the very least, you will have to recompile it with
"parallel options" and to set a few environment variables. For most
code, that will not be enough either. Fortunately, in many cases, it
is not difficult to get the compiler to produce code that will show
some performance gain from multi-threading.
5. How do I "parallelize" my code?
In essence there are 4 steps that should be considered to
"parallelize" your code:
- Optimize the serial version as much as you can. Try
to make it as "simple" as possible, avoiding nested loops and loops
with dependencies, i.e. where the operations inside one iteration
depend on the results from a previous one. Dependencies may be hidden
in function calls or by reference to global variables or COMMON's.
Often, a program spends most of the execution time in a few loops.
Those are candidates for parallel performance. Try to find them (e.g.
by running analyzer software, such as is available inside the
"sunstudio" development tools, or by explicitly inserting timing
routines like etime() into the code). Focus on the
simplification of those loops.
- Use auto-parallelization flags of the compiler (see section 6)
- Force multi-threading via OpenMP compiler directives (see section 7)
- Use MPI routines to run separate processes that
communicate with each other (see section 8).
6. How can I use multiple threads to get parallel performance out of my serial code?
The compilers running on the Sunfire cluster have options that
cause them to attempt to parallelize loops that have no dependencies by
"multi-threading" them. The compiler flags to get this done
are
- -autopar identifies loops that are obviously non-dependent and
creates multithreaded code for them
- -reduction also parallelizes loops that reduce the elements of
arrays into single values, for example by summing over them
- -loopinfo shows which loops were parallelized, and which
not (and why)
- -stackvar Necessary. Allocates local variables on the
stack.
This will only work when the loops to be parallelized do not have any
dependencies.
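Put together, a build and run with these flags might look like the sketch below. The file name loops.f and the thread count of 4 are hypothetical. Two points are assumptions that should be checked against "man f90" for your compiler version: that -autopar needs an optimization level of -xO3 or higher, and that the number of threads is taken from the PARALLEL or OMP_NUM_THREADS environment variable at run time.

```shell
FC=f90
PARFLAGS="-xO3 -autopar -reduction -loopinfo -stackvar"

# Number of threads at run time (assumed variable names, see above).
# Both are set to the same value to avoid conflicting settings.
NTHREADS=4
PARALLEL=$NTHREADS;        export PARALLEL
OMP_NUM_THREADS=$NTHREADS; export OMP_NUM_THREADS

if command -v "$FC" >/dev/null 2>&1; then
    # -loopinfo reports for each loop whether it was parallelized
    "$FC" -o loops_par $PARFLAGS loops.f && ./loops_par
else
    echo "$FC not found: check your PATH setting"
fi
```

It is worth reading the -loopinfo messages first: if the time-consuming loops are reported as not parallelized, more threads will not help.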
7. How do I force multi-thread parallelization? How to use compiler directives?
The compiler will be very conservative about multithreading
loops. If there is the slightest possibility of data dependencies, it
will refuse to do it if -autopar is used. Function calls within loops,
if statements that depend on variables which change in the
loop, and many other features will be considered "dangerous" and
inhibit parallelization. The reason is that such features have a
potential to make the result dependent on the order in which the loop
iterations are carried out, and therefore go against a parallel
execution.
However, often you know more than the compiler. You might be
certain that a function call does not alter the value of variables
that are shared with other loop iterations. If this is the case, there
are ways to tell the compiler to parallelize anyhow. This is done via
compiler directives that look like comments, but if compiled
with the proper flags, will guide the compiler in parallelizing the
code. The most common ones are OpenMP compiler directives. Here is
an example:
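!$OMP PARALLEL DO PRIVATE(a)
do i = 1, n
   ! a is a work array: every thread fills its own copy
   a(1) = b(i)
   do j = 2, n
      a(j) = a(j-1) + b(j) * c(j)
   end do
   ! f must not modify data shared between iterations
   x(i) = f(a)
end do
!$OMP END PARALLEL DO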
The initial "!" in the first line of this Fortran segment causes that
line to be interpreted as a comment, unless this is compiled with the
compiler flag "-xopenmp". In that case, the first line tells the
compiler to parallelize the DO loop directly following it. The
PRIVATE instruction causes a separate copy of the array to be
used for each parallel thread (i.e. a is used as a "thread local
variable").
Some of the compiler flags for this approach are:
- -xexplicitpar parallelize when I tell you to by compiler
directives
- -vpara verbose output about dependencies in the
explicitly parallelized loops
- -parallel same as -xexplicitpar, but additionally
autoparallelize if possible
- -mp=type specify type of directives,
type can be "sun", "cray" or "openmp";
note that "openmp" selects the platform-independent OpenMP
compiler directives, which are the de-facto industry standard. The use
of these is strongly encouraged.
- -xopenmp includes all necessary flags for usage of OpenMP
compiler directives, and therefore replaces the -xexplicitpar
-mp=openmp combination. Also includes some other flags (see the man
pages).
Note that a separate OpenMP FAQ is
available that contains more information about this programming
technique.
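A simple way to gain confidence in a forced parallelization is to build the same source twice, once without and once with -xopenmp, and to compare the output. The sketch below assumes a hypothetical source file loops_omp.f that prints its results; OMP_NUM_THREADS is the standard OpenMP environment variable for the number of threads, and the value 4 is only an example.

```shell
FC=f90
OMP_NUM_THREADS=4
export OMP_NUM_THREADS

if command -v "$FC" >/dev/null 2>&1; then
    # Without -xopenmp the !$OMP lines are plain comments: serial reference build.
    # With -xopenmp the directives are honoured; -vpara reports on those loops.
    "$FC" -o loops_serial -xO3 loops_omp.f &&
    "$FC" -o loops_omp -xopenmp -vpara loops_omp.f &&
    ./loops_serial > serial.out &&
    ./loops_omp > omp.out &&
    diff serial.out omp.out && echo "serial and OpenMP results agree"
else
    echo "$FC not found: check your PATH setting"
fi
```

If the two outputs differ by more than rounding, a data dependency was probably overlooked, for instance a variable that should have been declared PRIVATE.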
8. What is MPI and when do I use it?
The ultimate parallelization is, of course, achieved by re-writing
the code in a parallel fashion, so that it can be executed on several
separate processors, or indeed on separate machines. For this, it is
necessary to establish some communication between the processes, and
this is usually done by some form of message passing. A platform
independent standard for this is a set of more than 200 routines,
available in Fortran and C, that comprise the MPI (Message Passing
Interface) standard. Using these routines requires a little
rethinking of the code structure, but is in many cases rather simple
and effective.
MPI is best used if your code has a good potential to employ many
processors independently with none sitting idle. It is also
advantageous to have only relatively little communication being
necessary between processes. Examples are numerical integration (where
independent evaluations of the integrand can be done separately),
Monte-Carlo methods, finite-difference and finite-element methods (if
the problem can be divided up into blocks of equal size with minimal
communication). MPI requires some serious re-coding in some
cases, but with a relatively small number of routines, great scaling
can be achieved.
9. How do I parallelize my code with MPI?
A very simple example of how to parallelize code with MPI is given
in the monte.f Fortran program.
Only a few MPI commands are necessary to parallelize
this Monte-Carlo calculation of pi. The first
call MPI_INIT(ierr)
sets up the MPI system and has to be called in any MPI
program. The next two
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
are used to determine the "rank", i.e. number of the presently running
process, and the total number of processes running (size). The
identifier MPI_COMM_WORLD is used to label a group of processes
assigned to this task, called a "communicator". With
call MPI_REDUCE(pi,pisum,1,MPI_DOUBLE_PRECISION,&
MPI_SUM,0,MPI_COMM_WORLD,ierr)
the partial sums (pi) from the different processes are
summed up (reduced) into the total (pisum). This is done
simultaneously with the gathering of the results from the processes,
and is called "reduction". Finally,
call MPI_FINALIZE(ierr)
closes the MPI system.
To get an idea of how to use MPI and what the various
routines do, check out the MPI
workshop at the Maui HPC Centre site. For a list of routines in
the MPI standard, and a reference manual of their usage, go to
the Sun Documentation Website and
search for the Sun MPI Programming and Reference Guide. Note
that we offer a separate
MPI FAQ with more information about this system.
Although the MPI standard comprises hundreds of routines,
you can write very stable and scalable code with only a dozen or so
routines. In fact, often the simpler you keep it the better it will
work.
10. How do I compile and run MPI code on the SUN?
To use MPI on our Sunfire cluster, you will have to do the
following things:
- Include header files at the top of all subroutines that use
MPI, i.e.
for Fortran INCLUDE 'mpif.h' and for C/C++
#include <mpi.h>
This is important for the definition of
variables and constants that are used by the MPI system.
- Compile and link with the following flags:
-I/opt/SUNWhpc/include -L/opt/SUNWhpc/lib -R/opt/SUNWhpc/lib
-lmpi
These tell the compiler, linker and runtime environment
where to look for include files, static libraries and runtime dynamic
libraries. The -lmpi flag links in the MPI library.
- As an alternative to the above flags, you can use the tmf90,
tmcc, or tmCC macros for Fortran, C, and C++,
respectively, instead of the standard compilers/linkers. These will
automatically set the right flags. However, the -lmpi library
flag still has to be issued.
- For running MPI programs, a special multi-processor
runtime environment (CRE) is needed. This allows you to specify
how many processes are used for the execution of the program, from
which pool of processes they should be taken, etc...
The CRE runtime environment that SUN provides has the following
important components:
mprun Well ... let me guess ... running programs?
mpps Monitor processes
mpkill Shutting down processes
The setup for CRE is part of the default on our cluster. The
/opt/SUNWhpc/bin directory must be in your PATH.
mprun lets you specify the number of processors, e.g.
mprun -np 4 test_par runs the MPI program test_par on 4
processors.
mpps works just like the Unix ps command and lets you
monitor running processes, and identify their number.
mpkill works similarly to the Unix kill command and is
used to cancel running processes, e.g. mpkill -9 1512 will
terminate job number 1512 (the -9 makes sure the process is completely
killed).
mpinfo gives information about partitions, processors,
etc... It is usually called with the -N or -p switches.
For help on the runtime environment on the SUN's, consult their Documentation Site and search for
HPC Cluster Tools User's Guide.
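The steps above can be combined into a short sketch. The source file test_par.f is hypothetical (test_par is the program name used in the mprun example above); the directories and flags are the ones listed in this answer, and MPIHOME and NPROCS are illustrative variable names.

```shell
MPIHOME=/opt/SUNWhpc
MPIFLAGS="-I$MPIHOME/include -L$MPIHOME/lib -R$MPIHOME/lib"
NPROCS=4      # number of processes, example value

if command -v f90 >/dev/null 2>&1 && command -v mprun >/dev/null 2>&1; then
    # Compile and link in one step; -lmpi comes last.
    # With the macro this would read: tmf90 -o test_par test_par.f -lmpi
    f90 -o test_par $MPIFLAGS test_par.f -lmpi &&
    mprun -np "$NPROCS" test_par
else
    echo "f90 or mprun not found: check your PATH setting"
fi
```

While the program is running, mpps can be used from a second terminal to look up the job number, which is what mpkill expects.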
11. How can I check out performance of my serial, multi-threaded, or MPI code?
The SUN's are equipped with a powerful interface for program
development called Sun Studio. If you have the proper shell
setup, you can call it by simply typing sunstudio. The program
is quite complex, so I can only outline here how to use it for
profiling serial and multi-threaded code. An online guide is available
at
file:///opt/studioXX/SUNWspro/prod/lib/locale/C/html/index.html
on the SunFire login node. Other documentation can be found at the Sun Docs Site.
In order to analyze your program with the Sun Studio Tool, you
need to compile it with the -g option. After calling
sunstudio a GUI will appear. Then click on Analyze on
the tool bar, choose File and Collect Experiment, then
specify the program on the popup menu. After pressing Run,
data from a program run will be collected. After completion, these
data will be stored in a file called test.1.er and a (hidden)
directory called .test.1.er. Now you are ready to have a look
at them. Close the sampling collector window and go back to the main
sunstudio tool bar. Click on Analyze -> File -> Open Experiment
and load test.1.er. You will get an Analyzer window that lets
you see the total exclusive and inclusive time spent in various
subroutines, the % time used by these, and much more. Try the
Metrics and the Callers-Callees windows to get more
information.
If you do not like GUI's, there is a collect command that lets
you produce test.1.er from the command line. Check out the man pages
with man collect. And if you prefer a printed report for
analyzing the experiment, there is a utility that does that, called
er_print, also documented in the man pages: man
er_print. These come in handy if you do not have a desktop
environment available.
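A minimal command-line session might look like the following sketch. The executable name myprog is hypothetical and is assumed to have been compiled with -g; the experiment name test.1.er is the default mentioned above (the number may be higher if earlier experiments already exist in the directory). The -functions option asks er_print for the list of functions with their timings; see man er_print for the other reports.

```shell
PROGRAM=./myprog          # hypothetical executable, compiled with -g
EXPERIMENT=test.1.er      # default name of the first experiment

if command -v collect >/dev/null 2>&1 && command -v er_print >/dev/null 2>&1; then
    # Run the program under the collector, then print the function list.
    collect "$PROGRAM" &&
    er_print -functions "$EXPERIMENT"
else
    echo "collect or er_print not found: check your PATH setting"
fi
```

The functions at the top of the list are the ones worth looking at first when searching for loops to parallelize.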
This tool lets you analyze where most of the execution time in
your program is spent. It can also handle multiple processes which it collects into separate experiments.
12. It doesn't work. Where can I get help?
All of these things are documented at http://docs.sun.com , but the mass of
information on that site makes it a bit difficult to know where to
look. Try using the search engine.
If you have questions that you can't resolve by checking
documentation, you can contact us. We have several
user support people who can help you with code migration to the
parallel environment of the HPCVL
facilities. If you want to start a larger project that involves making
code executable on parallel machines, they might be able to help
you. Keep in mind that we support many people at any given time, so we
cannot do the coding for you. But we can do our best to help you get
your code ready for multi-processor machines.
Of course, some programs are inherently non-parallel, and trying
to make them scalable might be too much effort to be worth it. In that
case, the best one can do is try to improve the serial performance by
adapting the code to modern computer architecture. The performance
enhancement that can be achieved is sometimes quite amazing. It seems,
however, that most programs have a good potential to be executed in
parallel, and a little effort in that direction often goes a long way.
|