FAQ

Abstract:

This is a short introduction to carrying code over from a serial programming environment to the SUNFire multi-processor system used by HPCVL. It is meant to give the user a basic idea of what to do to get the code running on several processors. We assume that the code is written in FORTRAN, but most considerations carry over directly to C/C++ code. The document is organized in an "FAQ" manner, i.e. a list of "obvious" questions is presented as a guideline. Please feel free to contact Hartmut Schmider if you want to see more questions included.

Frequently Asked Questions:

  1. Where are the Fortran and C/C++ compilers located?

  2. Which environment variables do I have to set, what does my path have to look like if I want to do program development?

  3. How do I compile and link serial programs? Which compiler flags should I use?

  4. Will my serial code run in parallel without changes?

  5. How do I "parallelize" my code?

  6. How can I use multiple threads to get parallel performance out of my serial code?

  7. How do I force multi-thread parallelization? How to use compiler directives?

  8. What is MPI and when do I use it?

  9. How do I parallelize my code with MPI?

  10. How do I compile and run MPI code on the SUN?

  11. How can I check out performance of my serial, multi-threaded, or MPI code?

  12. It doesn't work. Where can I get help?



Answers:

  1. Where are the Fortran and C/C++ compilers located?

    On the SUNFires of HPCVL, the Fortran and C/C++ compilers and the needed headers, libraries and tools can be found under the /opt/s1s7 subdirectory system. The compilers for F77, F90, F95, C and C++, together with a development tool called "Studio", are under /opt/s1s7/bin. Various libraries are under /opt/s1s7/lib. This includes dynamic ones, so if your program complains about not finding "mickey_mouse.so", setting
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/s1s7/lib
    might be a good idea. There is a lot of other stuff under this subdirectory, including online documentation, so you can get help by pointing your web browser on the SunFire login node to
    "file:///opt/s1s7/docs/index.html".

  2. Which environment variables do I have to set, what does my path have to look like if I want to do program development?

    You should have something like

    PATH=$PATH:/opt/s1s7/bin:/opt/SUNWhpc/bin
    export PATH
    MANPATH=$MANPATH:/opt/s1s7/man:/opt/SUNWhpc/man
    export MANPATH
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/s1s7/lib:/opt/SUNWhpc/lib
    export LD_LIBRARY_PATH

    in your setup file (.profile, .bashrc). The first sets your search path, the second your "manual path" (if you want to use the Unix man command), and the third tells the system where to look for dynamic libraries. The first entry in each case is for the standard compilers, the second is for the "High Performance" tools, compilers and libraries. With these settings you should be able to run the development tool "Studio" and get started editing, compiling and debugging programs.

  3. How do I compile and link serial programs? Which compiler flags should I use?

    You will use the "Forte" compilers which reside in /opt/s1s7/bin to compile and link. To compile a Fortran, C, or C++ program, you issue the f77, f90, f95, cc, or CC commands. Compiling and linking is best done with a makefile. But you can also issue the commands by hand.

    To compile:
    compiler -c [options] name.ext
    (compiler = f77, f90, f95, cc or CC; name = name of your program source file; ext = extension, i.e. f for Fortran, c for C, cpp or C for C++, etc., [options] denotes compiler flags that usually start with an '-')
    Note for Fortran programmers: It is a good idea to use the Fortran90 (f90) compiler even if you are compiling F77 programs. It should be able to handle all f77 code, and it is the one that is "supported".

    To link:
    compiler -o name [options] [libraries] list
    (compiler = see above; name = name of the executable; [options] = see above; [libraries] = libraries that need to be linked in, usually as a list of file names with full path, or as '-L' and '-l' combinations [see below]; list = list of object files, usually with .o extension)
    Using the compilers and the linker in the above manner requires the proper setting of the PATH environment variable.

    There are literally hundreds of compiler flags, and many of them are not required most of the time. The ones that I use most often are:

    -xOx optimizes your code. x is a number from 1 to 5; higher levels alter the code more aggressively, but also give a larger gain. Up to -xO3 is generally rather safe to use. But you should, of course, always check results against an un-optimized version: they might differ.

    -fast is a combination of optimization flags that is quite safe to use and often improves performance a lot. However, the resulting code is specific to UltraSparc machines and cannot be executed on older SUN machines. Note that this overrides the -xOx option if it comes after it, since compiler options are processed from left to right!

    -g produces code that can be debugged. Unlike with some other compilers, -g and -O are not mutually exclusive, so it is a good flag to have in the development stage of a program.

    -v produces more output than you can handle, which makes it easier to track down problems.

    -lname is used to bind in a library called libname.a (static) or libname.so (dynamic). This flag is used for linking only.

    -Ldirname is used in conjunction with -lname and lets the linker know where to look for libraries. dirname is a directory name such as /opt/s1s7/WS6U1/lib.

    -Rdirname is used to tell the program where to get dynamic libraries at runtime.

    There are many more flags. They are documented at http://docs.sun.com, which is a good place to look to resolve problems in any case. Some compiler flags are only useful for parallel programs, and I discuss them later. Sometimes there is a considerable performance gain from using specific options (such as -xchip and -xtarget), but the code becomes less general.


  4. Will my serial code run in parallel without changes?

    No. At the very least, you will have to recompile it with "parallel options" and set a few environment variables. For most code, that will not be enough either. Fortunately, in many cases, it is not difficult to get the compiler to produce code that will show some performance gain from multi-threading.

  5. How do I "parallelize" my code?

    In essence there are 4 steps that should be considered to "parallelize" your code:

    --> Optimize the serial version as much as you can.
    Try to make it as "simple" as possible, avoiding nested loops and loops with dependencies, i.e. where the operations inside one iteration depend on the results from a previous one. Dependencies may be hidden in function calls or by reference to global variables or COMMON blocks.
    Often, a program spends most of the execution time in a few loops. Those are candidates for parallel performance. Try to find them (e.g. by running analyzer software, such as is available inside the "Studio" development tools, or by explicitly inserting timing routines like etime() into the code). Focus on the simplification of those loops.

    --> Use auto-parallelization flags of the compiler (see section 6)

    --> Force multi-threading via compiler directives (see section 7)

    --> Use MPI routines to run separate processes that communicate with each other (see section 8).
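
    As a rough illustration of the timing approach mentioned under the first step, the following sketch brackets a loop with calls to etime(). The program name, loop body and problem size are made up for this example. etime() is a library extension rather than standard Fortran, so the declaration shown here is an assumption that should be checked against the compiler's library documentation.

```fortran
      program timeit
      implicit none
      integer n, i
      parameter (n = 1000000)
c     etime is a library extension: it returns a REAL value and
c     fills tarray with user and system time (assumed interface).
      real etime, tarray(2), t0, t1
      double precision s
      s = 0.0d0
c     read the timer before the loop of interest
      t0 = etime(tarray)
      do i = 1, n
         s = s + 1.0d0 / (dble(i) * dble(i))
      end do
c     read the timer again after the loop
      t1 = etime(tarray)
      write (*,*) 'sum =', s
      write (*,*) 'seconds spent in loop:', t1 - t0
      end
```

    For a serial run, the difference t1 - t0 is the time spent between the two calls. Loops that account for most of the total are the ones worth simplifying first.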


  6. How can I use multiple threads to get parallel performance out of my serial code?

    The compilers running on the SUNFires have options that cause them to attempt to parallelize loops that have no dependencies by "multi-threading" them. The compiler flags to get this done are

    -autopar (identifies loops that are obviously non-dependent and creates multithreaded code for them)
    -reduction (used together with -autopar; also parallelizes loops that reduce the elements of an array to a single value, for example by summing over them)
    -loopinfo (shows which loops were parallelized, and which not (and why))
    -stackvar (sometimes useful: allocates local variables on the stack. In some cases this causes the program to fail)

    This will only work when the loops to be parallelized do not have any dependencies.
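
    To make the distinction concrete, here is a small sketch with three loops: one with independent iterations, one sum reduction, and one recurrence. The program and array names are invented for this illustration, and how the compiler actually treats each loop should be confirmed with -loopinfo rather than taken from the comments.

```fortran
      program loops
      implicit none
      integer n, i
      parameter (n = 100000)
      double precision x(n), y(n), s
c     Independent iterations: each i touches only its own elements,
c     so this loop is a candidate for -autopar.
      do i = 1, n
         x(i) = dble(i)
         y(i) = 1.0d0 / dble(i)
      end do
c     Sum reduction: every iteration updates s, so -autopar alone
c     is not expected to parallelize it; -reduction is meant for this.
      s = 0.0d0
      do i = 1, n
         s = s + x(i) * y(i)
      end do
c     Recurrence: iteration i needs the result of iteration i-1.
c     This is a true dependency and should stay serial.
      do i = 2, n
         x(i) = x(i-1) + y(i)
      end do
      write (*,*) 's =', s, ' x(n) =', x(n)
      end
```

    A file like this could be compiled with something like f90 -fast -autopar -reduction -loopinfo loops.f. With the Sun compilers the number of threads used at run time is normally controlled by the PARALLEL environment variable (for example PARALLEL=4; export PARALLEL). This is not stated elsewhere in this document, so check it against the compiler documentation.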

  7. How do I force multi-thread parallelization? How to use compiler directives?

    The compiler will be very conservative about multithreading loops. If there is the slightest possibility of data dependencies, it will refuse to do it if -autopar is used. Function calls within loops, if statements that depend on variables which change in the loop, and many other features will be considered "dangerous" and inhibit parallelization. The reason is that such features have the potential to make the result dependent on the order in which the loop iterations are carried out, and therefore rule out parallel execution.

    However, often you know more than the compiler. You might be certain that a function call does not alter the value of variables that are shared with other loop iterations. If this is the case, there are ways to tell the compiler to parallelize anyhow. This is done via compiler directives that look like comments, but if compiled with the proper flags, will guide the compiler in parallelizing the code. Here is an example:

    C$PAR DOALL PRIVATE(a)
    do i = 1, n
       a(1) = b(i)
       do j = 2, n
          a(j) = a(j-1) + b(j) * c(j)
       end do
       x(i) = f(a)
    end do


    The initial "C" in the first line of this Fortran segment causes that line to be interpreted as a comment, unless this is compiled with the compiler flags "-explicitpar -mp=sun". In that case, the first line tells the compiler to parallelize the DO loop directly following it. The PRIVATE instruction causes a separate copy of the array to be used for each parallel thread (i.e. a is used as a "local variable").
    The compiler flags for this approach are:
    -explicitpar (parallelize when I tell you to by compiler directives)
    -vpara (verbose output about dependencies in the explicitly parallelized loops)
    -parallel (same as -explicitpar, but additionally auto-parallelizes where possible)
    -mp=type (specifies the type of directives; type can be "sun", "cray" or "openmp". "sun" directives use the syntax "C$PAR ..." and are specific to UltraSparcs, "cray" ones begin with "!MIC$ ..." and are there for compatibility with programs developed for CRAY supercomputers, and "openmp" directives use "C$OMP ..." and are platform-independent. The latter are only available for the f95 compiler, but will soon (later in 2001, with Forte version 6 update 2) be available for C as well. The use of OpenMP is encouraged because of its platform independence.)
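
    For comparison, the DOALL example above might look roughly as follows with OpenMP directives. This is a sketch rather than tested production code: the program name, the array size, the initial values and the function f (here taking the array length as a second argument) are all invented for illustration. It would be compiled with f95 and the OpenMP setting of the -mp flag described above.

```fortran
      program ompdemo
      implicit none
      integer n, i, j
      parameter (n = 100)
      double precision a(n), b(n), c(n), x(n), f
      external f
c     made-up input data
      do i = 1, n
         b(i) = dble(i)
         c(i) = 1.0d0
      end do
c     Each thread gets its own copy of the work array a and of the
c     inner loop index j; the outer index i is private automatically.
C$OMP PARALLEL DO PRIVATE(a, j) SHARED(b, c, x)
      do i = 1, n
         a(1) = b(i)
         do j = 2, n
            a(j) = a(j-1) + b(j) * c(j)
         end do
         x(i) = f(a, n)
      end do
C$OMP END PARALLEL DO
      write (*,*) x(1), x(n)
      end

c     Hypothetical stand-in for f: it only reads its argument, so
c     calling it from several threads at once has no side effects.
      double precision function f(a, n)
      implicit none
      integer n
      double precision a(n)
      f = a(n)
      end
```

    The inner loop still carries a dependency from j-1 to j, which is fine: only the outer loop is split across threads, and each thread runs its inner loops serially on its private copy of a.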

  8. What is MPI and when do I use it?

    The ultimate parallelization is, of course, achieved by re-writing the code in a parallel fashion, so that it can be executed on several separate processors, or indeed on separate machines. For this, it is necessary to establish some communication between the processes, and this is usually done by some form of message passing. A platform independent standard for this is a set of more than 200 routines, available in Fortran and C, that comprise the MPI (Message Passing Interface) standard. Using these routines requires a little rethinking of the code structure, but is in many cases rather simple and effective.
    MPI is best used if your code has a good potential to employ many processors independently with none sitting idle. It is also advantageous if only relatively little communication is necessary between processes. Examples are numerical integration (where independent evaluations of the integrand can be done separately), Monte-Carlo methods, finite-difference and finite-element methods (if the problem can be divided up into blocks of equal size with minimal communication). MPI requires some serious re-coding in some cases, but with a relatively small number of routines, great scaling can be achieved.


  9. How do I parallelize my code with MPI?

    A very simple example of how to parallelize code with MPI is given in
    /opt/SUNWhpc/examples/mpi/monte.f:

    Only a few MPI commands are necessary to parallelize this Monte-Carlo calculation of PI. The first
    call MPI_INIT(ierr)
    sets up the MPI system and has to be called in any MPI program. The next two

    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

    are used to determine the "rank", i.e. number of the presently running process, and the total number of processes running (size). The identifier MPI_COMM_WORLD is used to label a group of processes assigned to this task, called a "communicator". With
    call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    the partial sums (pi) from the different processes are summed up (reduced) into the total (pisum). This is done simultaneously with the gathering of the results from the processes, and is called "reduction". Finally,
    call MPI_FINALIZE(ierr)
    closes the MPI system.
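
    Put together, a complete program built from these five calls could look like the sketch below. It is not the contents of monte.f: to stay short and deterministic it swaps the Monte-Carlo sampling for a midpoint-rule integration of 4/(1+x*x) over [0,1]. The program name, the helper variables h, x and n, and the number of intervals are assumptions made for this example.

```fortran
      program pimpi
      implicit none
      include 'mpif.h'
      integer n, i, rank, np, ierr
      parameter (n = 1000000)
      double precision h, x, pi, pisum
c     start MPI and find out who we are and how many we are
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
c     each process sums every np-th interval, starting at rank+1
      h = 1.0d0 / dble(n)
      pi = 0.0d0
      do i = rank + 1, n, np
         x = h * (dble(i) - 0.5d0)
         pi = pi + 4.0d0 / (1.0d0 + x * x)
      end do
      pi = h * pi
c     add up the partial results on the process with rank 0
      call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierr)
      if (rank .eq. 0) write (*,*) 'pi is approximately', pisum
      call MPI_FINALIZE(ierr)
      end
```

    Each process handles its own share of the intervals, so no communication is needed until the final MPI_REDUCE. Only the process with rank 0 receives the total and prints it.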

    To get an idea of how to use MPI and what the various routines do, check out the following web site:
    http://www.mhpcc.edu/training/workshop/mpi/MAIN.html

    For a list of routines in the MPI standard, and a reference manual of their usage, go to
    http://docs.sun.com, then click "By Subject", "Programming", "Tools", "Sun HPC 3.1 Answer Book Collection", "Sun MPI 4.1 Programming and Reference Guide".
    Although the MPI standard comprises hundreds of routines, you can write very stable and scalable code with only MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV, MPI_BCAST, MPI_GATHER, MPI_REDUCE, and MPI_FINALIZE. In fact, the simpler you keep it the better it will work.

  10. How do I compile and run MPI code on the SUN?

    To use MPI, you will have to do the following things:

    --> Include header files at the top of all subroutines that use MPI, i.e. for Fortran include 'mpif.h' and for C/C++ include 'mpi.h'. This is important for the definition of variables and constants that are used by the MPI system.

    --> Compile and link with the following flags:
    -I/opt/SUNWhpc/include -L/opt/SUNWhpc/lib -R/opt/SUNWhpc/lib -lmpi
    These tell the compiler, linker and runtime environment where to look for include files, static libraries and runtime dynamic libraries. The flag -lmpi links in the MPI routines.

    --> For running MPI programs, a special multi-processor runtime environment is needed. This allows you to specify how many processes are used for the execution of the program, from which pool of processes they should be taken, etc...
    The CRE runtime environment that SUN provides has the following important components:
    * mprun (Well ... let me guess ... running programs)
    * mpps (Monitor processes)
    * mpkill (Shutting down processes)
    In order to use it, you need to include /opt/SUNWhpc/bin in your PATH.
    mprun lets you specify the number of processors, e.g. mprun -np 4 test_par runs the MPI program test_par on 4 processors from the "standard partition". Partitions are groups of processors on which your processes will run. You can specify which one to use by the -p partition_name switch for the mprun command.
    mpps works just like the Unix ps command and lets you monitor running processes, and identify their number.
    mpkill works similarly to the Unix kill command and is used to cancel running processes, e.g. mpkill -signal job_number.
    mpinfo gives you information about partitions, processors, etc... It is usually called with the -N or -p switches.
    For help on the runtime environment on the SUNs, go to http://docs.sun.com, then click "By Subject", "Programming", "Tools", "Sun HPC 3.1 Answer Book Collection", "Sun HPC Cluster Tools 3.1 User's Guide".

  11. How can I check out performance of my serial, multi-threaded, or MPI code?

    The SUNs are equipped with a rather powerful interface for program development called "Studio". If you have the environment variables in question 2 set, you can call it by simply typing Studio. The program is quite complex, so I can only outline here how to use it for profiling serial and multi-threaded code. An online guide is available under
    "file:///opt/s1s7/prod/lib/locale/C/html/index.html"
    on the SunFire login node.

    In order to debug your program with Studio, you need to compile it with the -g option. If you are working from a remote terminal, you will have to set the environment variable DISPLAY, e.g. by typing "export DISPLAY=ip_number:0" where you substitute your machine's IP number for ip_number. You also might have to allow external access to your display by typing something like "xhost +".

    When you have done that, call Studio and close the initial GUI box. Then click on "Debug" on the tool bar, and call "New Program" on the popup menu. Choose the program you want to test out, and a new "Debug" window, as well as an editor with your source code, will appear. Call yet another window by clicking "Windows" on the toolbar of the "Debug" window, and then choosing "Sampling Collector". This will call a tool that lets you run experiments with your program. You can now click on the "Collect Data: For one run only" box and then on the "Start - run program from the beginning" icon in the upper left corner of the Sampling Collector window. Your program will now execute, and the sampling collector will collect timing data about your code. After completion, these data will be stored in a file called "test.1.er" and a (hidden) directory called ".test.1.er".

    Now you are ready to have a look at them. Close the sampling collector window and go back to the main Studio tool bar. Click on "Tools" and choose "Analyzer" and "New" in the popup menu. Load "test.1.er" and you will get an Analyzer window that lets you see the total exclusive and inclusive time spent in the various subroutines, the % time used by these, and much more. Try the "Metrics" and the "Callers-Callees" windows to get more information.

    If you do not like GUIs, there is a "collect" command that lets you produce "test.1.er" from the command line. Check out the man pages with "man collect". And if you prefer a printed report for analyzing the experiment, there is a utility that does that, called "er_print", also documented in the man pages: "man er_print". These come in handy if you do not have a desktop environment available, e.g. if you work from a vt100 terminal.

    This tool lets you analyze where most of the execution time in your program is spent. However, it cannot handle simultaneous separate processes, such as in an MPI program. For that, you need a debugger called Prism. This program is documented online as well: point a web browser on the SunFire login node to
    "file:///opt/SUNWhpc/doc/prism/html/help.html".
    There are too many commands for me to explain the usage of prism here. You call the program by typing prism (don't forget to set the environment variable DISPLAY if you are working from a remote terminal). The actual usage is similar to Studio, but complicated by having to deal with several parallel processes.

    Quite often, the best way to check the performance of an MPI program is timing it by insertion of suitable routines. MPI supplies a "wall-clock" routine called MPI_WTIME() that lets you determine how much actual time was spent in a specific segment of your code. Another method is calling the subroutines ETIME and DTIME, which can give you information about the actual CPU time used. However, it is advisable to carefully read the documentation of these routines before using them with MPI programs.
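
    A minimal sketch of timing with MPI_WTIME() is shown below. The program name and the loop being timed are placeholders; the pattern is simply to read the wall clock before and after the segment of interest on each process.

```fortran
      program timempi
      implicit none
      include 'mpif.h'
      integer i, rank, ierr
      double precision t0, t1, s
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
c     wall-clock time before the segment of interest
      t0 = MPI_WTIME()
      s = 0.0d0
      do i = 1, 10000000
         s = s + sqrt(dble(i))
      end do
c     wall-clock time after the segment
      t1 = MPI_WTIME()
      write (*,*) 'rank', rank, ' sum', s, ' seconds', t1 - t0
      call MPI_FINALIZE(ierr)
      end
```

    Because every process reports its own elapsed time, large differences between ranks point to load imbalance.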

  12. It doesn't work. Where can I get help?

    All of these things are rather well documented on http://docs.sun.com but the mass of information on that site makes it a bit difficult to know where to look. If you have questions that you can't resolve by checking documentation, you can call or send email to Hartmut Schmider, who wrote this document, and who works for HPCVL as a scientific programmer. My job is the support of code migration to the parallel environment of the HPCVL facilities. If you want to start a larger project that involves making code executable on parallel machines, I may be able to help you. Keep in mind that I support many people at any given time, so I cannot do the coding for you. But I can do my best to help you get your code ready for multi-processor machines.

    Of course, asking me for help might not work either. Some programs are inherently non-parallel, and trying to make them scalable might be too much effort to be worth it. In that case, the best one can do is trying to improve the serial performance by adapting the code to modern computer architecture. The performance enhancement that can be achieved is sometimes quite amazing. It seems, however, that most programs have a good potential to be executed "in parallel", and a little effort in that direction goes a long way.



 
 
   
© HPCVL 2007