Please note: The FAQ pages at the HPCVL website are continuously being revised. Some pages might pertain to an older configuration of the system. Please let us know if you encounter problems or inaccuracies, and we will correct the entries.
The Abaqus jobs you will want to run on the HPCVL machines are likely to be quite large. To exploit the parallel structure of a cluster such as ours, Abaqus offers several options for executing the solver in a parallel environment, i.e. on several CPUs simultaneously.
HPCVL clusters consist of several interconnected nodes, each of which is a shared-memory machine with up to 512 cores or processors. The cluster can execute both distributed-memory parallel programs (usually employing MPI) and shared-memory (multi-threaded) programs. Abaqus achieves a certain degree of parallel scaling with both of these methods. The parallel portions of Abaqus are restricted to the solver and to operations on the elements. Here is a list of operations with the corresponding parallel modes that Abaqus supports:
Element operations - MPI only
Iterative solver - MPI or threads
Direct solver - Threads only
Lanczos solver - Threads only
Note that at present only shared-memory parallelism is used on our clusters. Before a parallel Abaqus run, you must decide which parallel mode (if any) is to be used (on our clusters, use "threads") and how many processes are to be started.
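As a rough illustration of these two decisions, the bash sketch below only prints (does not run) an Abaqus command line for a threads-mode run. The options job=, cpus=, mp_mode=threads and interactive are standard Abaqus execution options, but check them against the documentation for the installed release. The default thread count is assumed to come from NSLOTS, the variable Grid Engine sets to the number of granted slots. The function name abaqus_threads_cmd is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an Abaqus command line for a shared-memory
# (threads) parallel run. It does not execute Abaqus.
#   $1 = job name
#   $2 = number of threads (defaults to $NSLOTS from Grid Engine, else 1)
abaqus_threads_cmd() {
    local job="$1"
    local ncpus="${2:-${NSLOTS:-1}}"
    # mp_mode=threads selects shared-memory parallelism (the mode used on
    # our clusters); cpus= sets how many processors the run may use.
    echo "abaqus job=${job} cpus=${ncpus} mp_mode=threads interactive"
}

abaqus_threads_cmd my_model 8
```

Inside a Grid Engine job script the same options would normally be written out directly, with cpus=$NSLOTS so that the thread count always matches the number of slots requested.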
Production jobs on the HPCVL clusters must be submitted via the Grid Engine scheduling software. Since most parallel Abaqus jobs fall into this category, we have made a sample script for Grid Engine submission. Note that Grid Engine allocates all processors on a single node.
Processes are not the only resources that need to be allocated when a parallel Abaqus job is submitted. Since the number of Abaqus licenses is limited, the scheduler must also determine whether enough license tokens are still available. For this purpose a special parallel environment, abaqus.pe, is used; it is requested in the "#$ -pe" line of the sample scripts above. Note that the following limitations apply for Abaqus production jobs:
This is to ensure fair access to the limited number of tokens and to avoid shared-memory problems that occur on some nodes if too many processes are used for a single Abaqus job.
Grid Engine can interact with the Abaqus license manager to check whether sufficient licenses are available for a run. This keeps the scheduler from starting a job merely because enough processors are free, only to have it stop again because there are not enough licenses. Grid Engine keeps an internal counter of available "token slots" that is updated frequently. Every time Grid Engine attempts to schedule an Abaqus job but cannot because not enough licenses are available, it "requeues" the job. Since each requeue triggers an email if the email notification line (#$ -m) is present, this line should be omitted. Instead, Grid Engine has been configured to send notifications at the beginning and end of job execution whenever the email address line (#$ -M) is present. Therefore, if you want to be notified, include the #$ -M line; otherwise omit it. Do not include the #$ -m line, because it will flood your mailbox with notifications.
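The following is only a sketch of how these points might fit together in a job script, written as a bash function that prints the script to standard output. The parallel environment name abaqus.pe comes from this FAQ. The function name make_abaqus_job, its arguments, the directive set (-S, -cwd, -V, -N), the input file name and the slot count in the example are assumptions, and the site limits on Abaqus production jobs still apply. In line with the notification advice above, it emits a #$ -M line only when an address is given and never emits #$ -m.

```shell
#!/usr/bin/env bash
# Hypothetical generator for a Grid Engine job script that runs Abaqus in
# threads mode. Arguments: job name, number of slots, optional e-mail.
make_abaqus_job() {
    local job="$1" slots="$2" email="${3:-}"
    # Unquoted here-document: ${job} and ${slots} are filled in now,
    # while \$ produces a literal "$" in the generated script.
    cat <<EOF
#!/bin/bash
#\$ -S /bin/bash
#\$ -cwd
#\$ -V
#\$ -N ${job}
#\$ -pe abaqus.pe ${slots}
EOF
    # Only the address line (-M); no -m line, so license requeues
    # do not generate e-mail.
    if [ -n "$email" ]; then
        echo "#\$ -M ${email}"
    fi
    # \$NSLOTS stays literal so Grid Engine's value is used at run time.
    cat <<EOF
abaqus job=${job} input=${job}.inp cpus=\$NSLOTS mp_mode=threads interactive
EOF
}

make_abaqus_job my_model 8 user@example.com
```

Redirecting the output into a file (for example make_abaqus_job my_model 8 user@example.com > my_model.sh) gives a script that can then be submitted with qsub.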
After altering the script by substituting the items enclosed in < >, you can submit it to Grid Engine with
qsub batch_file_name
from sfnode0 (the Grid Engine submit host). Note that the job will appear as a parallel job in Grid Engine's qstat or qmon. Note also that submitting a parallel job in this way is only worthwhile for large systems that use many CPU cycles, since the overhead of assigning processes, preparing nodes, and communicating between them is considerable.
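Before running qsub, a script can be checked against the email advice given earlier. The bash function below is a hypothetical helper, not part of Grid Engine or Abaqus: it returns a non-zero status if the script contains a #$ -m line and reports whether a #$ -M line is present. The temporary file in the example exists only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical pre-submission check based on the e-mail advice in this FAQ.
# Returns non-zero if the job script contains a "#$ -m" line.
check_mail_directives() {
    local script="$1" status=0
    # A "#$ -m" line makes every license-related requeue send an e-mail.
    if grep -qE '^#\$[[:space:]]+-m([[:space:]]|$)' "$script"; then
        echo "WARNING: remove the '#\$ -m' line from $script"
        status=1
    fi
    # A "#$ -M" line alone is enough for begin/end notification here.
    if grep -qE '^#\$[[:space:]]+-M([[:space:]]|$)' "$script"; then
        echo "NOTE: begin/end e-mail requested via '#\$ -M'"
    else
        echo "NOTE: no e-mail notification requested"
    fi
    return "$status"
}

# Example: check a throwaway job script before handing it to qsub.
job_script=$(mktemp)
printf '#!/bin/bash\n#$ -pe abaqus.pe 8\n#$ -m be\n' > "$job_script"
check_mail_directives "$job_script" || echo "fix the script before running qsub"
rm -f "$job_script"
```

A typical use would be check_mail_directives my_model.sh && qsub my_model.sh, so that a script with a #$ -m line is never submitted.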