|
This is a short introduction to how to carry over code from a
serial programming environment to the multi-processor systems used at
HPCVL. It is meant to give the user a basic idea of what to do to get
the code running on several processors. We assume that the code is
written in FORTRAN, but most considerations carry over directly to
C/C++ and other code. The document is organized in an "FAQ" manner,
i.e. a list of "obvious" questions is presented as a guideline. Please
feel free to contact us if you want to see more
questions included.
Frequently Asked Questions:
1. Where are the Fortran and C/C++ compilers located?
2. Which environment variables do I have to set, what does my path have to look like if I want to do program development?
3. How do I compile and link serial programs? Which compiler flags should I use?
4. Will my serial code run in parallel without changes?
5. How do I "parallelize" my code?
6. How can I use multiple threads to get parallel performance out of my serial code?
7. How do I force multi-thread parallelization? How to use compiler directives?
8. What is MPI and when do I use it?
9. How do I parallelize my code with MPI?
10. How do I compile and run MPI code on HPCVL clusters?
11. How can I check out performance of my serial, multi-threaded, or MPI code?
12. It doesn't work. Where can I get help?
Answers:
1. Where are the Fortran and C/C++ compilers located?
On the SUNFires of HPCVL, the Fortran and C++ compilers and the
needed headers, libraries and tools can be found under the
/opt/SUNWspro subdirectory system. The current version
is 12. The compilers for F77, F90, F95, C and C++, together
with a development tool called "sunstudio" are
under /opt/SUNWspro/bin. Various libraries are
under /opt/SUNWspro/lib. This includes dynamic ones, so if your
program complains about not finding "mickey_mouse.so", setting
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/SUNWspro/lib
might be a good idea. There is a lot of other stuff under this
subdirectory, including online-documentation, so you can get help by
pointing your web browser on the SunFire login node to
file:///opt/SUNWspro/docs/index.html.
2. Which environment variables do I
have to set, what does my path have to look like if I want to do
program development?
For the most part, you do not have to add anything to your default
setup to use program development tools such as compilers and
debuggers. We are using a program called usepackage which
replaces the issuing of lengthy settings for environment variables with a
simple command "use". Without issuing any additional use
commands, you start with standard-user-settings that include
the latest compilers and development tools. If you want to change
this, you can do so by issuing the use package command,
where package stands for one of the following:
ct6 - Sun ClusterTools 6
ct7 - Sun ClusterTools 7.1
ct8 - Sun ClusterTools 8.1
studio12 - Sun Studio 12 Compilers and Tools
studio11 - Sun Studio 11 Compilers and Tools
studio10 - Sun Studio 10 Compilers and Tools
studio8 - Sun Studio 8 Compilers and Tools
studio7 - Sun Studio 7 Compilers and Tools
workshop6 - Sun Workshop 6 Compilers and Tools
You can do things manually, of course. You should have something like
export PATH=$PATH:/opt/SUNWspro/bin:/opt/SUNWhpc/HPC8.1/bin
export MANPATH=$MANPATH:/opt/SUNWspro/man:/opt/SUNWhpc/HPC8.1/man
in your setup file (sh, ksh, bash syntax, .profile, .bash_profile,
.bashrc). The first sets your search path, the second your "manual
path" (if you want to use the Unix man command). The first entry
in each case is for standard compilers, the second is for "High
Performance" tools, compilers and libraries. With these settings you
should be able to run the development tool "sunstudio" and get started
editing, compiling and debugging programs.
3. How do I compile and link serial programs? Which
compiler flags should I use?
You will use the "Sun Studio" compilers which reside in
/opt/SUNWspro/bin to compile and link. To compile a Fortran 77,
Fortran 90, Fortran 95, C, or C++ program, you issue the f77, f90,
f95, cc, or CC commands, respectively. Compiling and linking is best
done with a makefile. But you can also issue the commands by hand.
To compile:
compiler -c [options] name.ext
(compiler = f77, f90, f95, cc or CC; name = name of your
program source file; ext = extension, i.e. f or f90 for Fortran(90), c for
C, cpp or C for C++, etc.; [options] denotes compiler flags
that usually start with a '-')
Note for Fortran programmers: You are actually using the
Fortran90 (f90) compiler even if you are compiling F77 programs. The
f77 command issues additional compiler flags that concern
compatibility.
To link:
compiler -o name [options] [libraries] list
(compiler: see above; name: name of the executable;
[options]: see above; [libraries]: libraries that need to
be linked in, usually as a list of file names with full path, or as
'-L' and '-l' combinations [see below]; list:
list of object files, usually with .o extension)
Using the compilers and the linker in the above manner requires the
proper setting of the PATH environment variable.
There are hundreds of compiler flags, and many of them are not
required most of the time. A few that are in more frequent use are:
-xOn optimizes your code. n is a number from 1 to 5
with increasing severity of alterations made to the code, but also
increasing gain. Up to -xO3 is generally rather safe to use. But you
should, of course, always check results against an un-optimized
version: they might differ.
-fast is a combination of optimization flags that is quite
safe to use and often improves performance a lot. However, the
resulting code is optimized specifically for the current machine
architecture and cannot be executed on older SUN's (including the
UltraSparc-III). Note that this overrides the -xOn option if it
comes after it, since compiler options are executed from left to
right! If you use this flag for compiling, you also need to include it
at the linking stage.
-g produces code that can be debugged. Unlike for other
compilers, -g and -xOn are not mutually exclusive, so it
is a good flag to have in the development stage of a program.
-v produces more output than you can handle, which makes it
easier to track down problems.
-lname is used to bind in a library called
libname.a (static) or libname.so (dynamic). This flag is
used to link only.
-Ldirname is used in conjunction with
-lname and lets the linker know where to look for
libraries. dirname is a directory name such as
/opt/studio12/SUNWspro/prod/lib.
-Rdirname is used to tell the program
where to get dynamic libraries at runtime.
There are many more flags. They are documented in the man
pages (man f90 or man cc), as well as in various
documents that may be downloaded in pdf format from
the Sun documentation website. The
latter is a good place to look to resolve problems in any case. Use
the search engine to obtain User's Guides and Reference
Manuals.
Some compiler flags are only useful for parallel programs and will
be discussed later. Sometimes there is a considerable performance gain
from using specific options (such as -xchip and
-xtarget), but the code becomes less general.
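As a minimal sketch of the compile and link steps above, the following writes a tiny Fortran 90 program and builds it in two steps. The file name area.f90 and the program itself are hypothetical; only the f90 command and the -c, -o, -xO3 and -g flags are taken from the text above.

```shell
# Write a tiny Fortran 90 program (area.f90 is a hypothetical file name).
cat > area.f90 <<'EOF'
program area
  implicit none
  double precision :: r
  r = 2.0d0
  print *, 'area =', 3.141592653589793d0 * r * r
end program area
EOF

# Step 1: compile to an object file. Step 2: link the object file into
# an executable. Step 3: run it. The fallback message appears if any
# step fails, for example when the compilers are not in your PATH.
f90 -c -xO3 -g area.f90 &&
f90 -o area -xO3 -g area.o &&
./area ||
echo "build or run failed: check that /opt/SUNWspro/bin is in your PATH"
```

For more than a handful of source files, the same two commands would normally go into a makefile, as recommended above.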
4. Will my serial code run in parallel without
changes?
No. At the very least, you will have to recompile it with
"parallel options" and to set a few environment variables. For most
code, that will not be enough either. Fortunately, in many cases, it
is not difficult to get the compiler to produce code that will show
some performance gain from multi-threading.
5. How do I "parallelize" my
code?
In essence there are 4 steps that should be considered to
"parallelize" your code:
- Optimize the serial version as much as you can. Try to
make it as "simple" as possible, avoiding nested loops and loops with
dependencies, i.e. where the operations inside one iteration depend on
the results from a previous one. Dependencies may be hidden in
function calls or by reference to global variables or COMMON's.
Often, a program spends most of the execution time in a few loops.
Those are candidates for parallel performance. Try to find them (e.g.
by running analyzer software, such as is available inside the
"sunstudio" development tools, or by explicitly inserting timing
routines like etime() into the code). Focus on the
simplification of those loops.
- Use auto-parallelization flags of the compiler
(see section 6)
- Force multi-threading via OpenMP compiler directives
(see section 7)
- Use MPI routines to run separate processes that
communicate with each other (see section 8).
6. How can I use
multiple threads to get parallel performance out of my serial
code?
The compilers running on HPCVL clusters have options that cause them
to attempt to parallelize loops that have no dependencies by
"multi-threading" them. The compiler flags to get this done
are
- -xautopar identifies loops that are obviously non-dependent
and creates multithreaded code for them
- -xreduction reduces the elements of arrays into single
values, for example by summing over them
- -xloopinfo shows which loops were parallelized, and which
not (and why)
- -stackvar Necessary. Allocates local variables on the
stack.
This will only work when the loops to be parallelized do not have any
dependencies.
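The difference between loops with and without dependencies can be tried out with a small test program. The sketch below is only an illustration: the file name loops.f90 and the program are hypothetical, and we assume that the number of threads is taken from the OMP_NUM_THREADS environment variable at run time (check man f90 for the variables your compiler version honours).

```shell
# Write a test program with three kinds of loops (hypothetical example).
cat > loops.f90 <<'EOF'
program loops
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  double precision, allocatable :: a(:), b(:)
  double precision :: total

  allocate(a(n), b(n))
  do i = 1, n
     b(i) = dble(i)
  end do

  ! Independent iterations: a candidate for -xautopar.
  do i = 1, n
     a(i) = 2.0d0 * b(i) + 1.0d0
  end do

  ! Summing into one scalar: the kind of loop -xreduction is meant for.
  total = 0.0d0
  do i = 1, n
     total = total + a(i)
  end do

  ! Iteration i needs the result of iteration i-1: a true dependency,
  ! so this loop is expected to stay serial.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, 'total =', total, ' a(n) =', a(n)
end program loops
EOF

# Build with the auto-parallelization flags listed above, then run on
# 4 threads (OMP_NUM_THREADS is an assumption, see the text).
f90 -xO3 -xautopar -xreduction -xloopinfo -stackvar -o loops loops.f90 &&
OMP_NUM_THREADS=4 ./loops ||
echo "build or run failed: check your compiler setup"
```

The -xloopinfo messages printed during compilation are the quickest way to see which of the loops the compiler actually chose to parallelize.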
7. How do I force multi-thread parallelization? How
to use compiler directives?
The compiler will be very conservative about multithreading loops
automatically. If there is the slightest possibility of data
dependencies, it will refuse to do it if -xautopar is
used. Function calls within loops, if statements that depend on
variables which change in the loop, and many other features will be
considered "dangerous" and inhibit parallelization. The reason is
that such features have a potential to make the result dependent on
the order in which the loop iterations are carried out, and therefore
go against a parallel execution.
However, often you know more than the compiler. You might be
certain that a function call does not alter the value of variables
that are shared with other loop iterations. If this is the case, there
are ways to tell the compiler to parallelize anyhow. This is done via
compiler directives that look like comments, but if compiled
with the proper flags, will guide the compiler in parallelizing the
code. The most common ones are OpenMP compiler directives. Here is
an example:
!$OMP PARALLEL DO PRIVATE(a)
do i = 1, n
   a(1) = b(i)
   do j = 2, n
      a(j) = a(j-1) + b(j) * c(j)
   end do
   x(i) = f(a)
end do
The initial "!" in the first line of this Fortran segment causes that
line to be interpreted as a comment, unless this is compiled with the
compiler flag "-xopenmp". In that case, the first line tells the
compiler to parallelize the DO loop directly following it. The
PRIVATE instruction causes a separate copy of the array to be
used for each parallel thread (i.e. a is used as a "thread local" or
"private" variable).
Some commonly used compiler flags for this approach are:
- -xopenmp includes all necessary flags for usage of OpenMP
compiler directives. It includes several other flags (see man
pages). This is the most commonly used multi-threading flag if you are
doing explicit (as opposed to automatic) parallelization.
- -vpara verbose output about dependencies in the
explicitly parallelized loops.
Others are only occasionally used:
- -xexplicitpar parallelize when I tell you to by compiler
directives
- -parallel same as -xexplicitpar, but additionally
auto-parallelizes if possible
- -mp=type specify type of directives,
type can be "sun", "cray" or "openmp";
note that "openmp" selects the platform-independent OpenMP
compiler directives which are the de-facto industry standard. The use
of these is strongly encouraged.
Note that a separate OpenMP FAQ is available that contains more
information about this programming technique.
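To try the directive approach end to end, the sketch below wraps a loop of the kind discussed in this section into a complete program and builds it with -xopenmp. The file name omp_demo.f90, the array contents, and the use of the intrinsic sum(a) as a stand-in for the function f(a) are assumptions made for illustration. The PARALLEL DO directive is used so that the directive both starts the threads and distributes the loop iterations among them.

```shell
# Write a complete OpenMP test program (hypothetical example).
cat > omp_demo.f90 <<'EOF'
program omp_demo
  implicit none
  integer, parameter :: n = 1000
  integer :: i, j
  double precision :: a(n), b(n), c(n), x(n)

  do i = 1, n
     b(i) = 1.0d0 / i
     c(i) = 0.5d0
  end do

  ! Each thread works on its own copy of a; the inner loop has a
  ! dependency and stays serial inside each thread.
!$OMP PARALLEL DO PRIVATE(a, j)
  do i = 1, n
     a(1) = b(i)
     do j = 2, n
        a(j) = a(j-1) + b(j) * c(j)
     end do
     x(i) = sum(a)   ! stand-in for f(a)
  end do
!$OMP END PARALLEL DO

  print *, 'x(1) =', x(1), ' x(n) =', x(n)
end program omp_demo
EOF

# Build with OpenMP enabled and run on 4 threads.
f90 -xopenmp -xO3 -o omp_demo omp_demo.f90 &&
OMP_NUM_THREADS=4 ./omp_demo ||
echo "build or run failed: check your compiler setup"
```

Compiling the same file without -xopenmp gives a serial program, since the directives are then treated as comments. Comparing the output of both builds is a simple check that the parallel version is correct.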
8. What is
MPI and when do I use it?
The ultimate parallelization is, of course, achieved by re-writing
the code in a parallel fashion, so that it can be executed on several
separate processors, or indeed machines, separately. For this, it is
necessary to establish some communication between the processes, and
this is usually done by some form of message passing. A platform
independent standard for this is a set of almost 300 routines,
available in Fortran and C, that comprise the MPI (Message
Passing Interface) standard. Using these routines requires a little
rethinking of the code structure, but is in many cases rather simple
and effective.
MPI is best used if your code has a good potential to employ many
processors independently with none sitting idle. It is also
advantageous to have only relatively little communication being
necessary between processes. Examples are numerical integration (where
independent evaluations of the integrand can be done separately),
Monte-Carlo methods, finite-difference and finite-element methods (if
the problem can be divided up into blocks of equal size with minimal
communication). MPI requires some serious re-coding in some
cases, but with a relatively small number of routines, great scaling
can be achieved.
9. How do I parallelize my code with MPI?
A very simple example of how to parallelize code with MPI
is given in the monte.f Fortran program.
Only a few MPI commands are necessary to parallelize this
Monte-Carlo calculation of pi. The first
call MPI_INIT(ierr)
sets up the MPI system and has to be called in any MPI
program. The next two
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
are used to determine the "rank", i.e. number of the presently running
process, and the total number of processes running (size). The
identifier MPI_COMM_WORLD is used to label a group of processes
assigned to this task, called a "communicator". With
call MPI_REDUCE(pi,pisum,1,MPI_DOUBLE_PRECISION,&
MPI_SUM,0,MPI_COMM_WORLD,ierr)
the partial sums (pi) from the different processes are
summed up (reduced) into the total (pisum). This is done
simultaneously with the gathering of the results from the processes,
and is called "reduction". Finally,
call MPI_FINALIZE(ierr)
closes the MPI system.
To get an idea of how to use MPI and what the various
routines do, check out the MPI workshop at the Maui HPC Centre site.
For a list of routines in the MPI standard, and a reference manual of
their usage, go to the Sun Documentation Website and search for the
Sun MPI Programming and Reference Guide. Note that we offer a
separate MPI FAQ with more information about this system.
Although the MPI standard comprises hundreds of routines,
you can write very stable and scalable code with only a dozen or so
routines. In fact, often the simpler you keep it the better it will
work.
10. How do I compile and run MPI code on HPCVL clusters?
To use MPI on our clusters, you will have to do the
following things:
The mpirun command is part of the ClusterTools
programming environment, and is necessary to run MPI programs and
allocate the separate processes across the multi-processor system.
The setup for ClusterTools is part of the default on our
cluster. The /opt/SUNWhpc/bin directory must be in
your PATH.
mpirun lets you specify the number of processors, e.g.
mpirun -np 4 test_par runs the MPI program test_par on 4
processors. There is a myriad of other options for this command, many
of which are concerned with details of process allocation that are
automatically handled by the system on HPCVL clusters, and do
therefore not have to concern the user.
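The following sketch puts the five MPI calls from section 9 and the mpirun command together. It is not the monte.f program referred to there, but a hypothetical stand-in (monte_par.f90) written for illustration. We assume that the compiler wrapper is called mpif90, as in the Open MPI based ClusterTools releases; older ClusterTools versions may use a different driver name, so check the ClusterTools documentation for your version.

```shell
# Write a small Monte-Carlo estimate of pi that uses MPI
# (hypothetical example, not the original monte.f).
cat > monte_par.f90 <<'EOF'
program monte_par
  implicit none
  include 'mpif.h'
  integer, parameter :: ntotal = 4000000
  integer :: rank, np, ierr, i, nlocal, nseed
  integer, allocatable :: seed(:)
  double precision :: u(2), pi, pisum, hits

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

  ! Give every process a different random number sequence.
  call random_seed(size=nseed)
  allocate(seed(nseed))
  seed = 12345 + 7919 * rank
  call random_seed(put=seed)

  ! Each process samples its share of points in the unit square and
  ! counts those that fall inside the quarter circle.
  nlocal = ntotal / np
  hits = 0.0d0
  do i = 1, nlocal
     call random_number(u)
     if (u(1)**2 + u(2)**2 <= 1.0d0) hits = hits + 1.0d0
  end do
  pi = 4.0d0 * hits / (dble(nlocal) * np)

  ! Sum the partial results on process 0.
  call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, &
                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'pi is approximately', pisum

  call MPI_FINALIZE(ierr)
end program monte_par
EOF

# Compile with the MPI wrapper (name assumed, see text) and run on
# 4 processors.
mpif90 -o monte_par monte_par.f90 &&
mpirun -np 4 ./monte_par ||
echo "build or run failed: check that the ClusterTools setup is active"
```

Only process 0 prints the result, because MPI_REDUCE delivers the sum to the process whose rank is given as the root argument (here 0).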
For help on ClusterTools, consult
Sun's Documentation Site and
search for
HPC Cluster Tools User's Guide.
11. How can I check out performance of my
serial, multi-threaded, or MPI
code?
The SUN's are equipped with a powerful interface for program
development called Sun Studio. If you have the proper shell
setup, you can call it by simply typing sunstudio. The program
is quite complex, so we can only outline here how to use it for
profiling serial and multi-threaded code. An online guide is available
at
file:///opt/SUNWspro/prod/lib/locale/C/html/index.html
on our systems. Other documentation can be found at the Sun Docs Site.
In order to analyze your program with the Sun Studio Tool, you
need to compile it with the -g option. After calling
sunstudio a GUI will appear. Then click on Analyze on
the tool bar, choose File and Collect Experiment, then
specify the program on the popup menu. After pressing Run,
data from a program run will be collected. After completion, these
data will be stored in a file called test.1.er and a (hidden)
directory called .test.1.er. Now you are ready to have a look
at them. Close the sampling collector window and go back to the main
sunstudio tool bar. Click on Analyze -> File -> Open Experiment
and load test.1.er. You will get an Analyzer window that lets
you see the total exclusive and inclusive time spent in various
subroutines, the % time used by these, and many more. Try the
Metrics and the Callers-Callees windows to get more
information.
If you do not like GUIs, there is a collect command that lets
you produce test.1.er from the command line. Check out the man pages
with man collect. And if you prefer a printed report for
analyzing the experiment, there is a utility that does that, called
er_print, also documented in the man pages: man
er_print. These come in handy if you do not have a desktop
environment available.
This tool lets you analyze where most of the execution time in
your program is spent. It can also handle multiple processes which it
collects into separate experiments.
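A command-line profiling session might look like the sketch below. The program hot.f90 is a hypothetical example with one time-consuming subroutine. We assume that collect is run without options in a directory that holds no earlier experiment, so that the experiment is named test.1.er as described above, and we use the functions report of er_print; see man collect and man er_print for the other options.

```shell
# Write a program that spends nearly all its time in one subroutine
# (hypothetical example).
cat > hot.f90 <<'EOF'
program hot
  implicit none
  double precision :: s
  call work(20000000, s)
  print *, 's =', s
contains
  subroutine work(n, s)
    integer, intent(in) :: n
    double precision, intent(out) :: s
    integer :: i
    s = 0.0d0
    do i = 1, n
       s = s + sin(dble(i))
    end do
  end subroutine work
end program hot
EOF

# Compile with -g so the analyzer can relate timings to the source,
# record an experiment, and print the time spent per function.
f90 -g -xO3 -o hot hot.f90 &&
collect ./hot &&
er_print -functions test.1.er ||
echo "profiling failed: check that the Sun Studio tools are in your PATH"
```

In the resulting report, the subroutine work should account for almost all of the run time, which is the kind of loop to look at first when parallelizing.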
12. It doesn't work. Where can I get
help?
All of these things are documented at http://docs.sun.com, but the
mass of information on that site makes it a bit difficult to know
where to look. Try using the search engine.
If you have questions that you can't resolve by checking
documentation, you can contact us. We have
several user support people who can help you with code migration to
the parallel environment of the HPCVL facilities. If you want
to start a larger project that involves making code executable on
parallel machines, they might be able to help you. Keep in mind that
we support many people at any given time, so we cannot do the coding
for you. But we can do our best to help you get your code ready for
multi-processor machines.
Of course, some programs are inherently non-parallel, and trying
to make them scalable might be too much effort to be worth it. In that
case, the best one can do is try to improve the serial performance by
adapting the code to modern computer architecture. The performance
enhancement that can be achieved is sometimes quite amazing. It seems,
however, that most programs have a good potential to be executed in
parallel, and a little effort in that direction often goes a long way.
|