DMSM Library

Please note: The FAQ pages at the HPCVL website are continuously being revised. Some pages might pertain to an older configuration of the system. Please let us know if you encounter problems or inaccuracies, and we will correct the entries.

This FAQ file explains the structure and usage of a library that implements the Double-layer Master-Slave Model (DMSM), a flexible and generally applicable model for parallel programming that can be deployed across clusters with multicore nodes. The model is based on a two-tier structure:

  • The Message-Passing Interface (MPI) is used to distribute work-groups across a cluster, based on a simple Master-Slave approach.
  • Locally, on each cluster node, OpenMP compiler directives are used to work through the work items in a group, based on a dynamic "All-Slaves" model.

Several variations of this basic structure were implemented in a library in Fortran and C. Here we discuss the working principle and components of this library. We also outline the capabilities of the model.

  1. What is the HPCVL DMSM Library?
  2. Where can I get the DMSM library and will it run on my computer?
  3. How do I use the DMSM Library?
  4. Which variations of the Double-layer Master-Slave Model can I use with the DMSM library?
  5. What are the functions I have to supply to work with the library?
  6. Can I run parallel programs from inside this parallel library?
  7. Where can I get more details?

1. What is the HPCVL Double-layer Master-Slave Model (DMSM) Library?

Modern clusters usually consist of nodes with multiple CPUs or cores, sometimes "multi-threaded". The standard approach to exploiting parallelism on a cluster is the Message-Passing Interface (MPI), which requires communication between nodes in a distributed-memory fashion. While it is possible to run multiple independent MPI processes on a single multicore cluster node, this may not be the most efficient use of resources.

A more appropriate approach to using multicore nodes is through compiler directives such as OpenMP. The resulting multithreaded programs require no communication resources; instead, multiple threads exchange information through shared memory.

For distributed-memory clusters, a common model is the "Master-Slave Model", in which one process - the Master - is dedicated to work distribution, while the others - the Slaves - work on individual tasks. This model is useful when load imbalances are likely because the workload of each task is unpredictable. On a shared-memory system such as a multicore node, it is usually not necessary to dedicate a specific processor to the Master tasks. Instead, each process that runs out of work supplies itself with the next task, using a counter variable to keep track of what has already been done. We call this an "All-Slaves Model".
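
As an illustration, here is a minimal Fortran/OpenMP sketch of such an All-Slaves scheme (this is not the library's own code; do_task stands for an arbitrary user routine that performs one task). A shared counter is captured and incremented atomically, so that each thread that runs out of work supplies itself with the next task:

subroutine all_slaves(ntasks)
implicit none
integer :: ntasks, next, mytask
next = 1                       ! shared counter: next unassigned task
!$omp parallel private(mytask)
do
!$omp atomic capture
   mytask = next               ! grab the next task number...
   next = next + 1             ! ...and advance the counter atomically
!$omp end atomic
   if (mytask > ntasks) exit   ! nothing left: this thread is done
   call do_task(mytask)        ! hypothetical routine for a single task
end do
!$omp end parallel
end subroutine all_slaves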

It is easy to see that an MPI Master-Slave Model can be expanded to a hybrid model if each of the tasks is further divided into sub-tasks which are then executed in an OpenMP parallel fashion. In practice it is best to combine multiple tasks into "packages" and distribute these packages across nodes using an MPI Master-Slave approach. Each of the tasks in the package is then internally distributed on each node via an OpenMP "All-Slaves Model".

So here is an outline of the "Double-layer Master-Slave Model" (DMSM); a minimal code sketch of the outer MPI layer follows the list:

  • On a cluster, one node is the Master node; all other nodes are Slaves. Communication between them is done via MPI.
  • In the beginning, each Slave signals to the Master that it is idle, and the Master sends out instructions and data to all Slaves establishing their workload. This workload consists of many smaller tasks that are handled internally by the Slave nodes.
  • Each Slave node works through such a workload package without further communication with the Master. Once the package is finished, it sends a signal to the Master requesting another one. This continues until a stop signal is received.
  • The Master continues to serve requests for new packages until all the available work is distributed. Then a stop signal is sent. 
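
The following minimal sketch shows the Master's side of this outer MPI layer. All names, message tags, and the choice of -1 as the stop signal are hypothetical; the DMSM library hides these details from the user:

subroutine master_loop(npackages, nslaves)
use mpi
implicit none
integer, intent(in) :: npackages, nslaves
integer :: next, req, stop_sig, finished, ierr
integer :: stat(MPI_STATUS_SIZE)
next = 1; stop_sig = -1; finished = 0
do while (finished < nslaves)
   ! wait for any Slave to signal that it is idle
   call MPI_Recv(req, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, &
                 MPI_COMM_WORLD, stat, ierr)
   if (next <= npackages) then
      ! hand out the number of the next work package
      call MPI_Send(next, 1, MPI_INTEGER, stat(MPI_SOURCE), 0, &
                    MPI_COMM_WORLD, ierr)
      next = next + 1
   else
      ! all work is distributed: send the stop signal
      call MPI_Send(stop_sig, 1, MPI_INTEGER, stat(MPI_SOURCE), 0, &
                    MPI_COMM_WORLD, ierr)
      finished = finished + 1
   end if
end do
end subroutine master_loop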

The DMSM library is an implementation of the above scheme; variations of the basic principle are supported. The library consists of several routines that are called in sequence, and the user needs to supply at least one function that defines the individual workload. The library is available in both a Fortran 90 and a C version; both are distributed in the form of source code.


2. Where can I get the DMSM library and will it run on my computer?

The DMSM library is distributed in the form of source code, which can be downloaded directly as a tar file. The archive is easily extracted with the tar command (e.g. tar xvf dmsm.tar; the actual file name may differ). All files are expanded into a single directory to avoid complicated installation procedures.

The Fortran version of the library consists of a single source file, dmsm.f90, that contains all subroutines, functions, and modules. For the C version, an additional header file is required; the C components are called dmsm.c and dmsm.h.

The following system components are required to use the library successfully:

  • Operating system: preferably Unix-based (Solaris, Linux, AIX, IRIX, etc.)
  • Standard tools: an MPI library.
  • OpenMP-enabled Fortran 90 and C compilers, preferably native to the operating system. However, the cross-platform GNU compilers should work as well.

The library does not make use of any platform-specific features and is programmed in standard Fortran or C. It should therefore compile straightforwardly on any platform.

3. How do I use the DMSM library?

To use the library, it is only necessary to compile the source code into an object or archive file (.o or .a), for example with an OpenMP-enabled MPI compiler wrapper such as mpif90 or mpicc, and then link this file with the rest of your code.

The library supplies a number of routines that initialize the model, perform a complete run, check for proper execution, and finalize the run. There are also some auxiliary routines. In this FAQ file, we only briefly discuss the interface of one routine, which performs all of the above tasks in sequence:

SUBROUTINE DMSM_ALL(      &
  THREADS_PER_PROCESS,    &
  JOB_DISTRIBUTION_PLAN,  &
  TOTAL_JOBS,             &
  NUM_OF_JOBS_PER_GROUP,  &
  DO_THE_JOB,             &
  JOB_GROUP_PREPARATION,  &
  RESULT_COLLECTION,      &
  COLLECTION_ENABLED)

This routine requires up to 8 arguments:

  • THREADS_PER_PROCESS defines how many threads will be used to work on a job package on a given node.
  • JOB_DISTRIBUTION_PLAN selects a specific variation of the model (see next section).
  • TOTAL_JOBS tells the routine how many jobs are to be performed.
  • NUM_OF_JOBS_PER_GROUP defines the size of a job package.
  • DO_THE_JOB, JOB_GROUP_PREPARATION, and RESULT_COLLECTION are the names of user-supplied functions (see below). Only the first of these is required in all cases; it names the routine that is used to perform a single job.
  • COLLECTION_ENABLED is an integer that is used only in specific cases.

The above interface is for Fortran 90; in the simplest case the routine is called with only the first five arguments. For C, the arguments are equivalent, but the routine is called with all 8 arguments, and NULL is used for those that are not required. Please consult the manual for details.
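
For example, in the simplest Fortran case the call might look as follows (the values here are hypothetical: 4 threads per process, model variation 11, 1000 jobs handed out in packages of 20, and a user-supplied routine SINGLE_JOB that executes a single job):

CALL DMSM_ALL(4, 11, 1000, 20, SINGLE_JOB)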

4. Which variations of the Double-layer Master-Slave Model can I use with the DMSM library?

Depending on the complexity and flexibility of the basic algorithm, the Double-layer Master-Slave Model can be used in nine variations. These are labeled by an integer that is passed to the library as the JOB_DISTRIBUTION_PLAN argument described above. In all variations, the main thread on the Master node serves as the Master that distributes job packages.

  Argument   Other threads on the Master   Slave Communication
     11      none                          serial
     12      pre-allocated                 serial
     13      dynamic                       serial
     21      none                          parallel 0
     22      pre-allocated                 parallel 0
     23      dynamic                       parallel 0
     31      none                          parallel any
     32      pre-allocated                 parallel any
     33      dynamic                       parallel any

The first column of the above table indicates the integer label that must be passed to the library to select the corresponding variation of the model. Details about the interface of the library routines can be found in the DMSM manual.

The second column, "Other threads on the Master", indicates which tasks are performed by the non-Master threads (if any) on the Master node. To make good use of the available resources, these threads can remain idle ("none"), work through packages according to a pre-allocated static schedule, or dynamically obtain job packages from the Master and work through them using an All-Slaves scheme.

The last column, "Slave Communication", explains how MPI communication takes place on the side of the Slave nodes. "serial" indicates that work packages are obtained before a parallel thread-set is invoked, i.e. from the serial region of the code. "parallel 0" means that a specific thread (number 0) obtains new packages from inside the OpenMP parallel region when needed. "parallel any" means that any of the Slave threads can communicate with the Master from inside the parallel region when a new work package is needed.

Clearly, Variation 11 is the most static but also the safest, since no communication takes place from inside OpenMP parallel regions at all, and OpenMP and MPI parallelism are therefore completely separated. Variation 33 is the most complex and the best suited to avoiding load imbalances: whenever a Slave thread detects that no unallocated work items are left, it obtains another work package from the Master. In addition, threads on the Master node work on their own work packages to make optimal use of the computing resources.

5. What are the functions I have to supply to work with the library?

The user needs to supply at least one function to the library, namely the one that executes a specific single job. It requires only a single integer argument, which is the number of the job to be done. In Fortran, it is recommended to make an INTERFACE declaration for the routine to make sure that its name is properly passed to the DMSM library. The actual code for the routine can then be placed in a separate file:

interface
   subroutine single_job(m)
   implicit none
   integer :: m
   end subroutine single_job
end interface
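
The implementation in the separate file then has a matching signature. What is done for a given job number is entirely application-specific; a trivial placeholder might look like this:

subroutine single_job(m)
implicit none
integer :: m
! perform job number m; here we only report it
write(*,*) 'working on job ', m
end subroutine single_job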

In some cases it is necessary to supply job data to the nodes dynamically, at the time when a new job package is issued by the Master. In this case the user has to supply another routine with 4 integer input arguments, the so-called "Job Group Preparation routine".

If it is also necessary to collect results from the nodes on the Master in a dynamic fashion, then a third user-supplied routine with 2 arguments, the "Result Collection routine", is required.
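
By analogy with the example above, interface declarations for these two routines might look as follows. Note that this is only a sketch: the routine and argument names are hypothetical, and the argument types of the Result Collection routine are assumed to be integers purely for illustration; the DMSM manual gives the authoritative interfaces.

interface
   subroutine group_prep(n1, n2, n3, n4)   ! Job Group Preparation: 4 integer input arguments
   implicit none
   integer :: n1, n2, n3, n4
   end subroutine group_prep
   subroutine collect_results(r1, r2)      ! Result Collection: 2 arguments, types assumed here
   implicit none
   integer :: r1, r2
   end subroutine collect_results
end interface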

Details about these routines and their interfaces can be found in the DMSM manual.

6. Can I run parallel programs from inside this parallel library?

Yes. It is possible to set the number of threads per MPI process to 1, reducing the model to a simple MPI Master-Slave model, while still using multiple OpenMP threads for each individual job to be executed in parallel. As a result, many OpenMP-parallel jobs may be executed simultaneously on multiple Slave nodes. However, no computations can be performed on the Master node in this model.
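
For example, with THREADS_PER_PROCESS set to 1, the user-supplied job routine itself may open an OpenMP parallel region. A minimal sketch (the actual work is application-specific):

subroutine single_job(m)
use omp_lib
implicit none
integer :: m
!$omp parallel
! ... OpenMP-parallel work for job number m ...
write(*,*) 'job', m, ' handled by thread', omp_get_thread_num()
!$omp end parallel
end subroutine single_job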

You can also use this library, reduced to the pure MPI Master-Slave Model, to distribute independent MPI program runs to the nodes of a cluster. This implies that MPI is used on two different levels: one to communicate program runs and the corresponding data to the nodes, and another to execute the individual jobs on the nodes. The latter are usually considerably more communication-intensive than the former, which makes the shared-memory structure of a single node desirable. The library supplies additional functions to generate the required communicators and to run this type of model.
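
As a generic illustration (these are standard MPI calls, not the library's own routines), such two-level communicators can be created with MPI_Comm_split:

program two_levels
use mpi
implicit none
integer :: ierr, rank, group, subcomm, subrank
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
group = rank/4   ! hypothetical: four consecutive ranks form one job group
call MPI_Comm_split(MPI_COMM_WORLD, group, rank, subcomm, ierr)
call MPI_Comm_rank(subcomm, subrank, ierr)
! MPI_COMM_WORLD would serve the Master-Slave layer,
! subcomm the communication inside one independent MPI program run
call MPI_Finalize(ierr)
end program two_levels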

For details, please consult the manual.

7. Where can I get more details?

All further details about the library, its routines, and their interfaces can be found in the DMSM manual.
