This is a short introduction to the OpenMP industry standard for
shared-memory parallel programming. It outlines basic features of the
system and explains its usage on the HPCVL Sunfire SMP machines. This
is not an introduction to OpenMP programming. References and links for
further details are given.
Frequently Asked Questions:
What is OpenMP?
What kind of system uses OpenMP?
How is OpenMP used?
Give me an example.
How is OpenMP implemented on the HPCVL Sunfire machines?
How do I compile OpenMP code on the Sun?
How do I run OpenMP programs on a Sunfire machine?
Where can I learn details about OpenMP?
Are there any tools to help me with OpenMP programming?
It doesn't work. Where can I get help?
Answers:
What is OpenMP?
OpenMP is a system of so-called "compiler directives"
that are used to express parallelism on a shared-memory
machine. OpenMP has become an industry standard for such
directives, and at this point, most parallel enabled compilers
that are used on SMP machines are capable of processing OpenMP
directives. The OpenMP standard has had a rather short and
steep career: it was introduced in 1997 and has since sidelined
all other similar systems.
OpenMP is exclusively designed for shared-memory machines, and
is based on "multi-threading", i.e. the dynamic
spawning of so-called "light-weight" sub-processes,
commonly within loops. In favorable cases it is quite possible
to create a well-scaling parallel program from a serial code by
inserting a few lines of OpenMP directives into the serial
precursor and recompiling. The simplicity and ease of use of
OpenMP directives have made it a popular alternative to the
more involved (and arguably more powerful) communication system
MPI, which was designed for distributed-memory systems.
What kind of system uses OpenMP?
OpenMP was designed from the outset for shared-memory machines,
commonly called SMP (Symmetric
Multi-Processor) machines. These types of
parallel computers have the advantage of not requiring
communication between processors for parallel processing, and
therefore bypassing the associated overhead. In addition, they
allow multi-threading, which is a dynamic form of parallelism
in which sub-processes are created and destroyed during program
execution. In some cases this can be done automatically at
compile time. In other cases, the compiler needs to be
instructed about details of the "parallel region" of
code where multi-threading is to take place. OpenMP was
designed to perform this task.
OpenMP therefore needs both a shared-memory (SMP) computer and a
compiler that understands OpenMP directives. The Sunfire machines at
HPCVL fulfill both of these requirements.
OpenMP will not work across the nodes of a distributed-memory
cluster, such as a Beowulf. It may, however, sometimes be used in
combination with distributed-memory parallel systems such as MPI,
provided that each node in the cluster has at least 2 CPUs
available.
How is OpenMP used?
OpenMP is usually used in the stepwise parallelization of
pre-existing serial programs. Shared-memory parallelism is
often called "loop parallelism" because loops are the
typical situation in which OpenMP compiler directives are an
option.
The OpenMP compiler directives are inserted into the serial
code by the user. They instruct the compiler to distribute
the tasks performed in a certain region of the code (usually
a loop) over several sub-processes, which in turn may be
executing on different CPUs.
For instance, the following Fortran loop looks as if the
repeated calls to the function point() could be done
in separate processes, or better on separate CPUs:
do imesh=inz,nnn,nstep
  svec(1)=xmesh(imesh)
  svec(2)=ymesh(imesh)
  svec(3)=zmesh(imesh)
  integral=integral+wints(imesh)*point(svec)
end do
If we are using a compiler that is able to automatically
parallelize code, and try to use that feature, we will find
that things are not that simple. The function call to
point may hide a "loop dependency", i.e. a
situation where data computed in one loop iteration depend on
data calculated in another. The compiler will therefore
commonly reject parallelizing such a loop as
"unsafe".
The use of OpenMP directives can solve this problem:
!$omp parallel do private (imesh,svec) &
!$omp shared (inz,nnn,nstep,xmesh,ymesh,zmesh,wints) &
!$omp reduction(+:integral)
do imesh=inz,nnn,nstep
  svec(1)=xmesh(imesh)
  svec(2)=ymesh(imesh)
  svec(3)=zmesh(imesh)
  integral=integral+wints(imesh)*point(svec)
end do
!$omp end parallel do
The three lines of directives have the effect of forcing the
compiler to distribute the tasks performed in each of the
loop iterations over separate, dynamically created
processes. Furthermore, they inform the compiler which
variables can be used by all sub-processes (i.e.,
shared), and which have different values for each
process (i.e., private). Finally, they direct the
compiler to collect values of integral separately in
each process and then "reduce" them to a common
value by summing them up.
OpenMP programs need to be compiled with special compiler
options and will then yield parallel code. It must be pointed
out that since the compiler is forced to multi-thread
specific regions of the code, it is the responsibility of the
programmer to ensure that such multi-threading is safe,
i.e. that no dependency exists between iterations of the
parallelized loop. In the above example, that means that the
tasks performed inside the point call must indeed be
independent.
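The difference between a loop that is safe to parallelize and one that is not can be sketched as follows. This is a minimal illustration only; the program name and the arrays a, b, and c are hypothetical and do not appear in the example above.

```fortran
program dependency_demo
  implicit none
  integer, parameter :: n = 8
  integer :: i
  double precision :: a(n), b(n), c(n)

  b = 1.0d0
  a = 0.0d0

  ! Independent iterations: c(i) depends only on b(i),
  ! so this loop may safely be distributed over threads.
  !$omp parallel do private(i) shared(b, c)
  do i = 1, n
     c(i) = 2.0d0*b(i)
  end do
  !$omp end parallel do

  ! Loop-carried dependency: a(i) needs a(i-1) from the previous
  ! iteration. Forcing this loop to run in parallel could give
  ! wrong results, so it is deliberately left serial.
  a(1) = b(1)
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  write(*,*) 'c(n)=', c(n), ' a(n)=', a(n)
end program dependency_demo
```

With b set to 1 everywhere, c(n) should come out as 2 and a(n) as 8, with or without OpenMP enabled.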
Give me an example
The working principle of OpenMP is perhaps best illustrated by a
programming example. The following program, written in Fortran 90,
computes the sum of the square roots of all integers from 0 up to a
specific limit m:
program example02
  call demo02
  stop
end
subroutine demo02
  integer:: m, i
  real*8 :: mys
  write(*,*)'how many terms?'
  read(*,*) m
  mys=0.d0
!$omp parallel do private (i) &
!$omp shared (m) &
!$omp reduction (+:mys)
  do i=0,m
    mys=mys+dsqrt(dfloat(i))
  end do
!$omp end parallel do
  write(*,*) 'mys=',mys, ' m:',m
  return
end
It is instructive to compare this example with the one in our MPI FAQ which performs exactly the same task. It is
obvious that the OpenMP version is a good deal shorter. In fact,
apart from the OpenMP directives (starting with !$omp), this
is just a simple serial program.
In Fortran 90, anything after a ! sign is interpreted as a
comment, so that the above example when compiled without special
options will just yield the serial version of the program. If the
OpenMP option (-xopenmp for the Sun compilers) is specified at
compile time, the compiler will use the OpenMP directives to create
a multi-threaded executable.
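The same comment mechanism can be used for ordinary statements: the OpenMP standard treats free-form lines that begin with the sentinel !$ followed by a blank as code only when OpenMP is enabled. The following is a minimal sketch; the program and variable names are hypothetical.

```fortran
program sentinel_demo
  ! The next line is a plain comment unless OpenMP is enabled.
  !$ use omp_lib
  implicit none
  integer :: nthreads

  ! Serial default, used when the program is built without OpenMP.
  nthreads = 1

  ! Compiled only when OpenMP is enabled.
  !$ nthreads = omp_get_max_threads()

  write(*,*) 'threads available: ', nthreads
end program sentinel_demo
```

Compiled without the OpenMP option, this should report one thread; compiled with it, it reports the number of threads the runtime would use for a parallel region.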
The instruction parallel do will cause the next do-loop to be
treated as a parallel region. In other words, before executing that
loop, multiple sub-processes will be created and the loop iterations
will be distributed to those processes. The order in which the
iterations are executed must not matter, which is why we have to be
sure that they are independent.
The instruction private(i) ensures that a separate value of
the loop index is used for each process or thread. The index of a
parallelized do-loop is private, i.e. thread-specific, by default,
so we are specifying this only for demonstration purposes. It is in
any case a good idea not to rely on default settings.
The instruction shared(m) makes sure that all threads are
using the same maximum value. In OpenMP, variables are shared by
default unless declared otherwise, so this declaration is not
strictly necessary either, but it documents the intent. Since m is
only read inside the loop, it is safe to use a common value.
Finally, we instruct the compiler to treat the value of mys
specially. The reduction(+:mys) instruction causes each thread to
work on a private copy of mys, which is initialized to zero (the
neutral element of the sum). After all loop iterations have been
completed, the different private values are reduced to a single one
by a sum (+ sign in the directive) and added to the value mys had
before the loop.
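What the reduction clause does can be imitated by hand. The sketch below is an illustration of the principle, not the way a compiler necessarily implements it; the variable partial and the fixed limit m = 1000 are assumptions made for this example.

```fortran
program manual_reduction
  implicit none
  integer :: m, i
  double precision :: mys, partial

  m = 1000
  mys = 0.0d0

  !$omp parallel private(i, partial) shared(m, mys)
  ! Each thread starts its own partial sum at zero.
  partial = 0.0d0

  ! The iterations of the loop are divided among the threads.
  !$omp do
  do i = 0, m
     partial = partial + sqrt(dble(i))
  end do
  !$omp end do

  ! Combine the partial sums, one thread at a time.
  !$omp critical
  mys = mys + partial
  !$omp end critical
  !$omp end parallel

  write(*,*) 'mys=', mys, ' m:', m
end program manual_reduction
```

The reduction clause expresses the same idea in a single line and leaves the choice of how to combine the partial sums to the compiler.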
After compilation, we can convince ourselves easily that we have in
fact created a parallel program. Here is the execution with a
maximum of m=100,000,000 and only one thread:
bash-2.05$ timex a.out < in
how many terms?
mys= 6.66666671666567E+11 m: 100000000
real 3.90
user 3.88
sys 0.01
And here's the same run with two threads:
bash-2.05$ timex a.out < in
how many terms?
mys= 6.66666671666484E+11 m: 100000000
real 2.00
user 3.92
sys 0.02
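We note that the result is the same to 12 significant digits, but
the time of the second run is only slightly more than 1/2 of the
first, a speedup of 3.90/2.00 = 1.95. The trailing digits differ
because the partial sums are added in a different order, and
floating-point addition is not associative. The difference of about
0.05 seconds from a perfect halving may be attributed to a constant
overhead for reading in m and writing out the results.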
How is OpenMP implemented on the HPCVL Sunfire machines?
The Sun Studio 10 compilers on the HPCVL Sunfire machines are
capable of processing OpenMP directives. No special settings
need to be specified in setup files to use this
capability. It is just convenient to have the location of the
compiler suite in the PATH variable. This location is
presently (version 10): /opt/ss10/SUNWspro/bin.
How do I compile OpenMP code on the Sun?
To enable the interpretation of OpenMP compiler directives, the
-xopenmp compiler option has to be specified both at the
compile and at the link level. This holds for the Fortran, C, and
C++ compilers equally. At the compile level, it is often useful to
also use the -xloopinfo option which creates a list of loops
and information on whether they have been parallelized or not. For
instance, in the case of a Fortran program, the compiler calls will
be, for compiling:
f90 -c -xopenmp -xloopinfo test.f90
and for linking
f90 -o test.exe -xopenmp test.o
Both can of course be combined:
f90 -o test.exe -xopenmp -xloopinfo test.f90
Note that the -xopenmp option is a macro which includes
several sub-options. Also, if no optimization is specified (as in
the above lines), the optimization level will automatically be
increased to -xO3 to support multi-threading. This cannot
be disabled.
How do I run OpenMP programs on a Sunfire machine?
Unlike MPI programs, shared-memory parallel OpenMP programs do not
need a special runtime environment to run in parallel. They only
need to be instructed about the number of threads (or processes)
that should be used. This is usually done by setting an environment
variable. The default variable used on any system that is OpenMP
enabled is OMP_NUM_THREADS. For instance, in a
csh, the following sequence will cause the OpenMP program
test_omp.exe to be executed with 16 threads:
setenv OMP_NUM_THREADS 16; test_omp.exe
An alternative variable that is specific to the Sun Solaris
operating environment is PARALLEL. The following
sequence when typed in a bash shell will cause our test
program to run with 4 threads:
export PARALLEL=4; test_omp.exe
Incidentally, the number of threads may also be set from inside the
program by means of a function call. The line
call omp_set_num_threads(16)
inside the program test_omp.f90 will have the same effect as
setting the environment variable. This will take precedence over
external settings. However, it is rarely done.
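A minimal sketch of setting the thread count from inside a program is shown below. The program name and the choice of 4 threads are assumptions for illustration; omp_set_num_threads and omp_get_num_threads are standard OpenMP library routines made available through the omp_lib module, so the program has to be compiled with OpenMP enabled.

```fortran
program threads_demo
  use omp_lib
  implicit none

  ! Request 4 threads for subsequent parallel regions.
  ! This takes precedence over OMP_NUM_THREADS.
  call omp_set_num_threads(4)

  !$omp parallel
  ! Let only one thread report the size of the team.
  !$omp single
  write(*,*) 'running with ', omp_get_num_threads(), ' threads'
  !$omp end single
  !$omp end parallel
end program threads_demo
```

The runtime system may provide fewer threads than requested, so it is safer to query the actual number than to assume it.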
Where can I learn details about OpenMP?
As already pointed out, this FAQ is not an introduction to OpenMP
programming. In fact, we barely scratch the surface of what can be
done. OpenMP includes many directives and functions that warrant
study before they can be used properly. It is necessary to point
out that shared-memory programming has its pitfalls: hidden
dependencies may lead to so-called "race conditions", and
cache-effects such as "false sharing" can seriously
degrade performance. Often, a detailed analysis of a parallel
region or a loop is necessary to determine if and how it may be
parallelized using OpenMP.
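A race condition can be sketched with a shared counter. In the hypothetical program below, unsafe_count is updated by all threads without protection, while the update of safe_count is guarded by the standard atomic directive. All names are assumptions made for this illustration.

```fortran
program race_demo
  implicit none
  integer, parameter :: n = 100000
  integer :: i, unsafe_count, safe_count

  unsafe_count = 0
  safe_count = 0

  !$omp parallel do private(i) shared(unsafe_count, safe_count)
  do i = 1, n
     ! Race condition: several threads may read and write
     ! unsafe_count at the same time, so updates can be lost.
     unsafe_count = unsafe_count + 1

     ! The atomic directive protects the update that follows it.
     !$omp atomic
     safe_count = safe_count + 1
  end do
  !$omp end parallel do

  write(*,*) 'unsafe:', unsafe_count, ' safe:', safe_count, &
             ' expected:', n
end program race_demo
```

When run with several threads, the unprotected counter may fall short of n by an amount that changes from run to run, which is what makes such errors hard to find. For a plain sum like this one, a reduction clause is usually the better remedy than atomic.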
A good textbook on OpenMP is:
Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff
McDonald, and Ramesh Menon: Parallel Programming in
OpenMP, Academic Press, San Diego, California, 2001; ISBN
1-55860-671-8
This text includes examples that are worked out in detail and
explains concepts of shared-memory programming that might be
unfamiliar to many users.
A good online tutorial for OpenMP shared-memory programming can be
found at
Lawrence Livermore National Laboratory.
There is a website devoted
specifically to all things OpenMP, which is a good starting point
for learning about it.
For Sun and Solaris specific questions, including CRE and Sun MPI,
visit the Sun Documentation Site
and use their Search Engine to look for "OpenMP".
HPCVL also organizes workshops on a regular basis, and one of them
is devoted to OpenMP programming. They are announced on our web
site. We might see you there sometime soon.
Are there any tools to help me with OpenMP programming?
The standard debugging and profiling tool on the Sunfire machines
at HPCVL is Sun Studio, which provides a GUI and is well
documented internally. It is able to handle multi-threaded code,
i.e. programs that were written using OpenMP. If the home directory
of the compiler suite (/opt/ss10/SUNWspro/bin) is in
your PATH, the tool may be invoked by simply typing
sunstudio at the command prompt. Help is easily
invoked, for example by clicking on "Help", selecting
"Contents" and then going to "Managing Threads"
in the menu that appears. Sun Studio supplies a means both to
debug your program and to perform timing experiments on it
(profiling). The tool can be used to track down bottlenecks to
the level of single lines of code, or even assembly language
instructions.
Quite often, the best way to check the performance of a
multi-threaded program is timing it by insertion of suitable
routines. This can be done by calling the subroutines ETIME and
DTIME, which can give you information about actual CPU time
used. However, it is advisable to carefully read the documentation
before using them with OpenMP programs. In this case, go to docs.sun.com and search for "Sun
Studio 10: Fortran Library Reference".
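As an alternative to ETIME and DTIME, the OpenMP standard itself provides the function omp_get_wtime, which returns elapsed wall-clock time rather than CPU time. The following sketch times the loop from the earlier example; the program name and the fixed value of m are assumptions, and the program has to be compiled with OpenMP enabled.

```fortran
program timing_demo
  use omp_lib
  implicit none
  integer :: i, m
  double precision :: mys, t_start, t_end

  m = 10000000
  mys = 0.0d0

  ! Wall-clock time immediately before the parallel loop.
  t_start = omp_get_wtime()

  !$omp parallel do private(i) shared(m) reduction(+:mys)
  do i = 0, m
     mys = mys + sqrt(dble(i))
  end do
  !$omp end parallel do

  ! Wall-clock time immediately after the parallel loop.
  t_end = omp_get_wtime()

  write(*,*) 'mys=', mys
  write(*,*) 'elapsed wall-clock seconds: ', t_end - t_start
end program timing_demo
```

Wall-clock time is usually the quantity of interest for scaling tests, since the CPU time summed over all threads stays roughly constant as threads are added, as the user times in the example run show.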
HPCVL also provides a package called the HPCVL Working
Template (HWT), which was created by Gang Liu and has now
reached version 5.1. The HWT provides 3 main functionalities:
- Maintenance of multiple versions of the same code from a single
source file. This is very useful, if your OpenMP code is based on a
serial code that you want to convert, which usually is the case.
- Automatic Relative Debugging which allows you to use
pre-existing code (for example the serial version of your program)
as a reference to check the correctness of your OpenMP code.
- Simple Timing which is needed to determine bottlenecks
for parallelization, to optimize code, and to check its scaling
properties.
The HWT is based on libraries and script files. It is easy to use
and portable (written largely in Fortran). Fortran, C, C++, and any
mixture thereof are supported, as well as OpenMP and MPI for
parallelism. Documentation of the HWT
is available. The package is installed on the Sunfire cluster in
/usr/local/hwt.
It doesn't work. Where can I get help?
Most of the Sun specific issues addressed in this FAQ are
documented at http://docs.sun.com. The search
engine provides a reliable means to find specific documents.
You can also call or send email to one of our support
staff, who include several scientific programmers. Keep in mind
that we support many people at any given time, so we cannot do the
coding for you. But we will do our best to help you get your code
ready for multi-processor machines.