Abstract:
This is a short introduction to porting code from a serial
programming environment to the SUNFire multi-processor system used by
HPCVL. It is meant to give the user a basic idea of what to do to get the
code running on several processors. We assume that the code is written in
FORTRAN, but most considerations carry over directly to C/C++ code. The
document is organized in an "FAQ" manner, i.e. a list of
"obvious" questions is presented as a guideline. Please feel free
to contact Hartmut Schmider if you want to see more questions
included.
Frequently Asked Questions:
1. Where are the Fortran and C/C++ compilers located?
2. Which environment variables do I have to set, what does my path have to look like if I want to do program development?
3. How do I compile and link serial programs? Which compiler flags should I use?
4. Will my serial code run in parallel without changes?
5. How do I "parallelize" my code?
6. How can I use multiple threads to get parallel performance out of my serial code?
7. How do I force multi-thread parallelization? How to use compiler directives?
8. What is MPI and when do I use it?
9. How do I parallelize my code with MPI?
10. How do I compile and run MPI code on the SUN?
11. How can I check out performance of my serial, multi-threaded, or MPI code?
12. It doesn't work. Where can I get help?
Answers:
Where are the Fortran and C/C++ compilers located?
On the SUNFires of HPCVL, the Fortran and C/C++ compilers and the needed headers, libraries and tools can be found under the /opt/s1s7 directory tree. The compilers for F77, F90, F95, C and C++, together with a development tool called "Studio", are under /opt/s1s7/bin. Various libraries are under /opt/s1s7/lib. This includes dynamic ones, so if your program complains about not finding a shared library such as "mickey_mouse.so", setting
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/s1s7/lib
might be a good idea. There is a lot of other material under this directory, including online documentation, so you can get help by pointing your web browser on the SunFire login node to "file:///opt/s1s7/docs/index.html".
Which environment variables do I have to set, what does my path have to look like if I want to do program development?
You should have something like
PATH=$PATH:/opt/s1s7/bin:/opt/SUNWhpc/bin
export PATH
MANPATH=$MANPATH:/opt/s1s7/man:/opt/SUNWhpc/man
export MANPATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/s1s7/lib:/opt/SUNWhpc/lib
export LD_LIBRARY_PATH
in your setup file (.profile, .bashrc). The first sets your search path, the second your "manual path" (if you want to use the Unix man command), and the third tells the system where to look for dynamic libraries. The first entry in each case is for standard compilers, the second is for "High Performance" tools, compilers and libraries. With these settings you should be able to run the development tool "Studio" and get started editing, compiling and debugging programs.
How do I compile and link serial programs? Which compiler flags should I use?
You will use the "Forte" compilers which reside in /opt/s1s7/bin to compile and link. To compile a Fortran, C, or C++ program, you issue the f77, f90, f95, cc, or CC commands. Compiling and linking is best done with a makefile, but you can also issue the commands by hand.
To compile:
compiler -c [options] name.ext
(compiler = f77, f90, f95, cc or CC; name = name of your program source file; ext = extension, e.g. f for Fortran, c for C, cpp or C for C++, etc.; [options] denotes compiler flags, which usually start with a '-')
Note for Fortran programmers: It is a good idea to use the Fortran90 (f90) compiler even if you are compiling F77 programs. It should be able to handle all f77 code, and it is the one that is "supported".
To link:
compiler -o name [options] [libraries] list
(compiler = see above; name = name of the executable; [options] = see above; [libraries] = libraries that need to be linked in, usually as a list of file names with full path, or as '-L' and '-l' combinations [see below]; list = list of object files, usually with .o extension)
Using the compilers and the linker in the above manner requires the proper setting of the PATH environment variable.
There are literally hundreds of compiler flags, and many of them are not required most of the time. The ones that I use most often are:
-xOx optimizes your code. x is a number from 1 to 5 with increasing severity of alterations made to the code, but also increasing gain. Up to -xO3 is generally rather safe to use. But you should, of course, always check results against an un-optimized version: they might differ.
-fast is a combination of optimization flags that is quite safe to use and often improves performance a lot. However, the resulting code is specific to UltraSparc machines and cannot be executed on older SUN machines. Note that this overrides the -xOx option if it comes after it, since compiler options are processed from left to right!
-g produces code that can be debugged. Unlike for other compilers, -g and -O are not mutually exclusive, so it is a good flag to have in the development stage of a program.
-v produces more output than you can handle, which makes it easier to track down problems.
-lname is used to bind in a library called libname.a (static) or libname.so (dynamic). This flag is used for linking only.
-Ldirname is used in conjunction with -lname and lets the linker know where to look for libraries. dirname is a directory name such as /opt/s1s7/WS6U1/lib.
-Rdirname is used to tell the program where to get dynamic libraries at runtime.
There are many more flags. They are documented at http://docs.sun.com, which is a good place to look to resolve problems in any case. Some compiler flags are only useful for parallel programs, and I discuss them later. Sometimes there is a considerable performance gain from using specific options (such as -xchip and -xtarget), but the code becomes less general.
Will my serial code run in parallel without changes?
No. At the very least, you will have to recompile it with "parallel options" and set a few environment variables. For most code, that will not be enough either. Fortunately, in many cases, it is not difficult to get the compiler to produce code that will show some performance gain from multi-threading.
How do I "parallelize" my code?
In essence there are 4 steps that should be considered to "parallelize" your code:
--> Optimize the serial version as much as you can. Try to make it as "simple" as possible, avoiding nested loops and loops with dependencies, i.e. where the operations inside one iteration depend on the results from a previous one. Dependencies may be hidden in function calls or in references to global variables or COMMON blocks. Often, a program spends most of the execution time in a few loops. Those are candidates for parallel performance. Try to find them (e.g. by running analyzer software, such as is available inside the "Studio" development tools, or by explicitly inserting timing routines like etime() into the code). Focus on the simplification of those loops.
--> Use auto-parallelization flags of the compiler (see section 6).
--> Force multi-threading via compiler directives (see section 7).
--> Use MPI routines to run separate processes that communicate with each other (see section 8).
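The difference between a loop without dependencies and one with a loop-carried dependency can be sketched as follows. This is an illustration in C (the same reasoning applies to Fortran DO loops); the function names scale and running_sum are hypothetical and do not come from any HPCVL code.

```c
#include <stddef.h>

/* Independent iterations: y[i] depends only on x[i], so the iterations
   can be executed in any order. This kind of loop is a candidate for
   parallelization. */
void scale(size_t n, double factor, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = factor * x[i];
}

/* Loop-carried dependency: s[i] needs s[i-1], which is computed in the
   previous iteration, so the iterations have to run in order as written. */
void running_sum(size_t n, const double *x, double *s)
{
    if (n == 0)
        return;
    s[0] = x[0];
    for (size_t i = 1; i < n; i++)
        s[i] = s[i - 1] + x[i];
}
```

A loop like the first one is what the later steps try to distribute over threads. A loop like the second one usually has to be restructured before any of them can help.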
How can I use multiple threads to get parallel performance out of my serial code?
The compilers running on the SUNFires have options that cause them to attempt to parallelize loops that have no dependencies by "multi-threading" them. The compiler flags to get this done are
-autopar (identifies loops that are obviously non-dependent and creates multithreaded code for them)
-reduction (parallelizes loops that reduce the elements of arrays into single values, for example by summing over them)
-loopinfo (shows which loops were parallelized, and which were not, and why)
-stackvar (sometimes useful; allocates local variables on the stack, but occasionally causes the program to stop working)
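As an illustration of what a "reduction" is, here is a minimal sum loop, written in C for brevity. The function name array_sum is hypothetical. Every iteration updates the same scalar, which looks like a dependency, but because addition can be regrouped, each thread can accumulate a partial sum that is combined at the end.

```c
#include <stddef.h>

/* Sum reduction: all iterations update the single variable "sum".
   A parallel version adds up partial sums per thread, so the order of
   the additions (and therefore the rounding) may differ slightly from
   the serial result. */
double array_sum(size_t n, const double *x)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

This is one reason to compare parallel results against the serial ones with a tolerance rather than bit for bit.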
Automatic parallelization will only work when the loops to be parallelized do not have any dependencies.
How do I force multi-thread parallelization? How to use compiler directives?
The compiler will be very conservative
about multithreading loops. If there is the slightest possibility
of data dependencies, it will refuse to do it if -autopar is used.
Function calls within loops, if statements that
depend on variables which change in the loop, and many other
features will be considered "dangerous" and inhibit
parallelization. The reason is that such features have a potential
to make the result dependent on the order in which the loop
iterations are carried out, and therefore go against a parallel
execution.
However, often you know more than the compiler. You might be certain that a function call does not alter the value of variables that are shared with other loop iterations. If this is the case, there are ways to tell the compiler to parallelize anyhow. This is done via compiler directives that look like comments but, if the code is compiled with the proper flags, will guide the compiler in parallelizing it. Here is an example:
C$PAR DOALL PRIVATE(a)
      do i = 1, n
         a(1) = b(i)
         do j = 2, n
            a(j) = a(j-1) + b(j) * c(j)
         end do
         x(i) = f(a)
      end do
The initial "C" in the first line of this Fortran segment causes that line to be interpreted as a comment, unless the code is compiled with the compiler flags "-explicitpar -mp=sun". In that case, the first line tells the compiler to parallelize the DO loop directly following it. The PRIVATE clause causes a separate copy of the array a to be used for each parallel thread (i.e. a is used as a "local variable"). The compiler flags for this approach are:
-explicitpar (parallelize when I tell you to by compiler directives)
-vpara (verbose output about dependencies in the explicitly parallelized loops)
-parallel (same as -explicitpar, but additionally autoparallelize if possible)
-mp=type (specify the type of directives; type can be "sun", "cray" or "openmp")
"sun" directives use the syntax "C$PAR ..." and are specific to UltraSparcs, "cray" ones begin with "!MIC$ ..." and are there for compatibility with programs developed for CRAY supercomputers, and "openmp" directives use "C$OMP" and are platform-independent. The latter are only available for the f95 compiler, but will soon (later in 2001, with Forte version 6 update 2) be available for C as well. The use of OpenMP is encouraged because of its platform independence.
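For comparison, the same loop can be sketched in C with an OpenMP directive. This is an illustration only, not taken from the HPCVL examples: the fixed size N, the function name doall, and the stand-in f (which simply returns the last element of a) are assumptions made to keep the sketch self-contained. If OpenMP support is not switched on, the pragma is ignored and the loop runs serially with the same result.

```c
#define N 8   /* hypothetical problem size, fixed to keep the sketch short */

/* Hypothetical stand-in for f(a) in the Fortran fragment:
   returns the last element of the work array. */
static double f(const double *a)
{
    return a[N - 1];
}

void doall(const double *b, const double *c, double *x)
{
    double a[N];   /* work array; each thread gets its own copy */

    /* private(a) plays the role of PRIVATE(a) in the Fortran directive. */
    #pragma omp parallel for private(a)
    for (int i = 0; i < N; i++) {
        a[0] = b[i];
        for (int j = 1; j < N; j++)
            a[j] = a[j - 1] + b[j] * c[j];
        x[i] = f(a);
    }
}
```

The directive is safe here because each iteration fills a from scratch and writes only to its own x[i]. Whether f has hidden side effects is exactly the kind of knowledge the programmer has and the compiler does not.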
What is MPI and when do I use it?
The ultimate parallelization is, of course,
achieved by re-writing the code in a parallel fashion, so that it
can be executed on several separate processors, or indeed machines,
separately. For this, it is necessary to establish some
communication between the processes, and this is usually done by
some form of message passing. A platform independent standard for
this is a set of more than 200 routines, available in Fortran and
C, that comprise the MPI (Message
Passing Interface) standard. Using these routines requires a
little rethinking of the code structure, but is in many cases
rather simple and effective. MPI is best used if your code has a good potential to employ many processors independently with none sitting idle. It is also advantageous if only relatively little communication is necessary between processes. Examples are numerical integration (where independent evaluations of the integrand can be done separately), Monte-Carlo methods, finite-difference and finite-element methods (if the problem can be divided up into blocks of equal size with minimal communication). MPI requires some serious re-coding in some cases, but with a relatively small number of routines, great scaling can be achieved.
How do I parallelize my code with MPI?
A very simple example of how to parallelize code with MPI is given in /opt/SUNWhpc/examples/mpi/monte.f. Only a few MPI commands are necessary to parallelize this Monte-Carlo calculation of PI. The first,
call MPI_INIT(ierr)
sets up the MPI system and has to be called in any MPI program. The next two,
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
are used to determine the "rank", i.e. the number of the presently running process, and the total number of processes running (size). The identifier MPI_COMM_WORLD is used to label a group of processes assigned to this task, called a "communicator". With
call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
the partial sums (pi) from the different processes are summed up (reduced) into the total (pisum) on process 0. The summation is done simultaneously with the gathering of the results from the processes, and is called "reduction". Finally,
call MPI_FINALIZE(ierr)
closes the MPI system.
To get an idea of how to use MPI and
what the various routines do, check out the following web
site:
http://www.mhpcc.edu/training/workshop/mpi/MAIN.html
For a list of routines in the MPI standard, and a reference manual of their usage, go to http://docs.sun.com, then click "By Subject", "Programming", "Tools", "Sun HPC 3.1 Answer Book Collection", "Sun MPI 4.1 Programming and Reference Guide". Although the MPI standard comprises hundreds of routines, you can write very stable and scalable code with only MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV, MPI_BCAST, MPI_GATHER, MPI_REDUCE, and MPI_FINALIZE. In fact, the simpler you keep it the better it will work.
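The way work is divided by rank and then reduced can be sketched without any MPI calls. The C code below is a serial stand-in, not the contents of monte.f: it uses a deterministic midpoint-rule integration of 4/(1+x*x) instead of the Monte-Carlo method, and the names partial_pi and emulate_reduce are hypothetical. In a real MPI program every process would call partial_pi once with its own rank, and MPI_REDUCE with MPI_SUM would take the place of the loop in emulate_reduce.

```c
/* Contribution of one process to the midpoint-rule estimate of
   pi = integral from 0 to 1 of 4/(1+x*x). Process "rank" out of "np"
   handles the intervals rank, rank+np, rank+2*np, ... */
double partial_pi(int rank, int np, long nsteps)
{
    double h = 1.0 / (double)nsteps;
    double sum = 0.0;
    for (long i = rank; i < nsteps; i += np) {
        double x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Serial stand-in for the reduction step: adds up the partial results
   that MPI_REDUCE with MPI_SUM would combine on process 0. */
double emulate_reduce(int np, long nsteps)
{
    double pisum = 0.0;
    for (int rank = 0; rank < np; rank++)
        pisum += partial_pi(rank, np, nsteps);
    return pisum;
}
```

Note that the processes need no communication until the final reduction, which is the situation in which MPI scales best.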
How do I compile and run MPI code on the SUN?
To use MPI, you will have to do the following things:
--> Include header files at the top of all subroutines that use MPI, i.e. for Fortran include 'mpif.h' and for C/C++ include 'mpi.h'. This is important for the definition of variables and constants that are used by the MPI system.
--> Compile and link with the following flags:
-I/opt/SUNWhpc/include -L/opt/SUNWhpc/lib -R/opt/SUNWhpc/lib -lmpi
These tell the compiler, linker and runtime environment where to look for include files, static libraries and runtime dynamic libraries. The flag -lmpi links in the MPI routines.
--> For running MPI programs, a special multi-processor runtime environment is needed. This allows you to specify how many processes are used for the execution of the program, from which pool of processors they should be taken, etc. The CRE runtime environment that SUN provides has the following important components:
* mprun (Well ... let me guess ... running programs)
* mpps (Monitor processes)
* mpkill (Shutting down processes)
In order to use it, you need to include /opt/SUNWhpc/bin in your PATH.
mprun lets you specify the number of processors, e.g.
mprun -np 4 test_par
runs the MPI program test_par on 4 processors from the "standard partition". Partitions are groups of processors on which your processes will run. You can specify which one to use by the -p partition_name switch for the mprun command.
mpps works just like the Unix ps command and lets you monitor running processes, and identify their number.
mpkill works similarly to the Unix kill command and is used to cancel running processes, i.e.
mpkill -signal job_number
mpinfo gives you information about partitions, processors, etc. It is usually called with the -N or -p switches.
For help on the runtime environment on the SUN machines, try http://docs.sun.com, then click "By Subject", "Programming", "Tools", "Sun HPC 3.1 Answer Book Collection", "Sun HPC Cluster Tools 3.1 User's Guide".
How can I check out performance of my serial, multi-threaded, or MPI code?
The SUN machines are equipped with a rather powerful interface for program development called "Studio". If you have the environment variables in question 2 set, you can call it by simply typing Studio. The program is quite complex, so I can only outline here how to use it for profiling serial and multi-threaded code. An online guide is available under "file:///opt/s1s7/prod/lib/locale/C/html/index.html" on the SunFire login node.
In order to debug your program with Studio, you need to compile it with the -g option. If you are working from a remote terminal, you will have to set the environment variable DISPLAY, e.g. by typing "export DISPLAY=ip_number:0" where you substitute your machine's IP number for ip_number. You also might have to allow external access to your display by typing something like "xhost +".
When you have done that, call Studio
and close the initial GUI box. Then click on "Debug"
on the tool bar, and call "New Program"
on the popup menu. Choose the program you want to test out, and a
new "Debug" window, as well
as an editor with your source code will appear. Call yet another
window by clicking "Windows"
on the toolbar of the "Debug"
window, and then choosing "Sampling
Collector". This will call a tool that lets you run
experiments with your program. You can now click on the "Collect
Data: For one run only" box and then on the "Start
- run program from the beginning" icon in the upper left
corner of the Sampling Collector window. Your program will now
execute, and the sampling collector will collect timing data about
your code. After completion, these data will be stored in a file
called "test.1.er" and a
(hidden) directory called ".test.1.er".
Now you are ready to have a look at them. Close the sampling
collector window and go back to the main Studio tool bar. Click on
"Tools" and choose "Analyzer"
and "New" in the popup menu.
Load "test.1.er" and you will get an Analyzer window that lets you see the total exclusive and inclusive time spent in the various subroutines, the % time used by these, and much more. Try the "Metrics" and the "Callers-Callees" windows to get more information.
If you do not like GUIs,
there is a "collect" command
that lets you produce "test.1.er"
from the command line. Check out the man pages with "man
collect". And if you prefer a printed report for
analyzing the experiment, there is a utility that does that, called
"er_print", also documented
in the man pages: "man er_print".
These come in handy if you do not have a desktop environment
available, e.g. if you work from a vt100 terminal.
This tool lets you analyze where most of the execution time in your program is spent. However, it cannot handle simultaneous separate processes, such as in an MPI program. For that, you need a debugger called Prism. This program is documented online as well: point a web browser on the SunFire login node to "file:///opt/SUNWhpc/doc/prism/html/help.html". There are too many commands for me to explain the usage of prism here. You call the program by typing prism (don't forget to set the environment variable DISPLAY if you are working from a remote terminal). The actual usage is similar to Studio, but complicated by having to deal with several parallel processes.
Quite often, the best way to check the performance of an MPI program is to time it by inserting suitable routines. MPI supplies a "wall-clock" routine called MPI_WTIME() that lets you determine how much actual time was spent in a specific segment of your code. Another method is calling the subroutines ETIME and DTIME, which can give you information about the actual CPU time used. However, it is advisable to read the documentation of these routines carefully before using them with MPI programs.
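The pattern is the same whichever timer is used: read the clock before and after the segment and take the difference. Here is a minimal sketch in C using the standard clock() function, which reports CPU time (roughly the role ETIME plays in Fortran) and not wall-clock time. The names cpu_seconds, busy_work and time_segment are hypothetical. In an MPI program the two calls to cpu_seconds would be replaced by calls to MPI_WTIME to obtain elapsed wall-clock time.

```c
#include <time.h>

/* CPU time used by the program so far, in seconds (standard C clock()). */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Hypothetical workload to be timed: a partial harmonic sum. */
double busy_work(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

/* Returns the CPU seconds spent in busy_work(n) and stores the
   computed value in *result so the work cannot be optimized away. */
double time_segment(long n, double *result)
{
    double t0 = cpu_seconds();
    *result = busy_work(n);
    double t1 = cpu_seconds();
    return t1 - t0;
}
```

For short segments the timer resolution can be coarser than the time measured, so it helps to time a loop over many repetitions.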
It doesn't work. Where can I get help?
All of these things are
rather well documented on http://docs.sun.com but the mass
of information on that site makes it a bit difficult to know where
to look. If you have questions that you can't
resolve by checking documentation, you can call or send email to Hartmut Schmider, who wrote
this document, and who works for HPCVL as a scientific
programmer. My job is the support of code migration to the parallel
environment of the HPCVL
facilities. If you want to start a larger project that involves
making code executable on parallel machines, I may be able to help
you. Keep in mind that I support many people at any given time, so
I cannot do the coding for you. But I can do my best to help you
get your code ready for multi-processor machines.
Of course,
asking me for help might not work either. Some programs are
inherently non-parallel, and trying to make them scalable might be
too much effort to be worth it. In that case, the best one can do
is to try to improve the serial performance by adapting the code to
modern computer architecture. The performance enhancement that can
be achieved is sometimes quite amazing. It seems, however, that
most programs have a good potential to be executed "in
parallel", and a little effort in that direction goes a long
way.