Compute Canada

Can you give me an example?

Please note: The FAQ pages at the HPCVL website are continuously being revised. Some pages might pertain to an older configuration of the system. Please let us know if you encounter problems or inaccuracies, and we will correct the entries.

The working principle of OpenMP is perhaps best illustrated by a programming example. The following program, written in Fortran 90, computes the sum of the square roots of all integers from 0 up to a user-specified limit m:

program example02
  call demo02
  stop
end

subroutine demo02
  integer :: m, i
  real*8  :: mys
  write(*,*) 'how many terms?'
  read(*,*) m
  mys = 0.d0
! the directives below parallelize the loop when compiled with OpenMP support
!$omp parallel do private (i) &
!$omp shared (m) &
!$omp reduction (+:mys)
  do i = 0, m
     mys = mys + dsqrt(dfloat(i))
  end do
!$omp end parallel do
  write(*,*) 'mys=', mys, ' m:', m
  return
end

It is instructive to compare this example with the one in our MPI FAQ, which performs exactly the same task. The OpenMP version is a good deal shorter; in fact, apart from the OpenMP directives (the lines starting with !$omp), it is just a simple serial program.

In Fortran 90, anything after a ! sign is interpreted as a comment, so the above example, compiled without special options, simply yields the serial version of the program. If the -openmp option is specified at compile time, the compiler uses the OpenMP directives to create a multi-threaded executable.
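For instance, the two builds might look as follows (the compiler name f90 is an assumption here, and the exact flag depends on the compiler: -openmp as in the text for some compilers, -fopenmp for GNU gfortran):

bash-2.05$ f90 example02.f90 -o example02_serial
bash-2.05$ f90 -openmp example02.f90 -o example02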

The directive parallel do causes the next do-loop to be treated as a parallel region: before the loop is executed, multiple threads are created, and the loop iterations are distributed among them. The order in which the iterations are executed is not defined, so we have to be sure that the iterations are independent of each other.
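For contrast, here is a hypothetical loop that must not be parallelized this way, because each iteration reads the result of the previous one (a so-called loop-carried dependence):

  real*8 :: a(0:100)
  a(0) = 1.d0
  do i = 1, 100
     a(i) = a(i-1) + dsqrt(dfloat(i))   ! iteration i depends on iteration i-1
  end do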

The clause private(i) ensures that each thread uses its own copy of the loop index. The index of a parallel do-loop is in fact private by default, so we are specifying this only for demonstration purposes; it is in any case a good idea not to rely on default settings.
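The clause becomes essential as soon as a scratch variable is written in every iteration. In the following variation of our loop (tmp is a hypothetical temporary), each thread needs its own copy of tmp, or the threads would overwrite each other's values:

  real*8 :: tmp
!$omp parallel do private (i, tmp) shared (m) reduction (+:mys)
  do i = 0, m
     tmp = dfloat(i)                    ! thread-specific scratch value
     mys = mys + dsqrt(tmp)
  end do
!$omp end parallel do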

The clause shared(m) makes sure that all threads use the same maximum value. Variables other than the loop index are in fact shared by default, so this clause, too, is given mainly for clarity. Since m is only read inside the loop, it is safe for all threads to access a common value.
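If, on the other hand, the threads had to write to a shared variable, the update would have to be protected. A minimal sketch using a critical directive (the counter nbig is a hypothetical addition that counts terms whose square root exceeds 100):

  integer :: nbig
  nbig = 0
!$omp parallel do private (i) shared (m, nbig)
  do i = 0, m
     if (dsqrt(dfloat(i)) > 100.d0) then
!$omp critical
        nbig = nbig + 1                 ! only one thread at a time updates nbig
!$omp end critical
     end if
  end do
!$omp end parallel do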

Finally, we instruct the compiler to treat the variable mys specially. The reduction(+:mys) clause causes each thread to work on its own private copy of mys, initialized to zero for the + operator. After all loop iterations have been completed, the private copies and the original value of mys are combined into a single result by summation (the + sign in the directive).
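Roughly speaking, the reduction clause is shorthand for keeping a private partial sum in each thread and combining the partial sums at the end. A sketch of the equivalent done by hand (mysum is a hypothetical per-thread variable):

  real*8 :: mysum
  mys = 0.d0
!$omp parallel private (i, mysum) shared (m, mys)
  mysum = 0.d0                          ! per-thread partial sum
!$omp do
  do i = 0, m
     mysum = mysum + dsqrt(dfloat(i))
  end do
!$omp end do
!$omp critical
  mys = mys + mysum                     ! combine partial sums, one thread at a time
!$omp end critical
!$omp end parallel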

After compilation, we can easily convince ourselves that we have in fact created a parallel program. Here is the execution with a limit of m=100,000,000 and only one thread:

bash-2.05$ timex a.out < in
how many terms?
mys= 6.66666671666567E+11 m: 100000000 

real 3.90
user 3.88
sys 0.01
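
The number of threads is set through the standard OpenMP environment variable OMP_NUM_THREADS before the program is started:

bash-2.05$ export OMP_NUM_THREADS=2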

And here's the same run with two threads:

bash-2.05$ timex a.out < in
how many terms?
mys= 6.66666671666484E+11 m: 100000000

real 2.00
user 3.92
sys 0.02

We note that the result is the same to 12 significant digits, but the second run takes only slightly more than half the time of the first. The remaining difference of about 0.1 seconds may be attributed to a constant serial overhead for reading in m and writing out the results.