Enterprise M9000 Servers

Our default compute cluster consists of eight large shared-memory servers of the Enterprise M9000 type. This page explains the essential features of these machines and serves as a basic guide to their usage.

What are these Servers?

Our cluster consists of eight high-end shared-memory machines, Sun SPARC Enterprise M9000 Servers, which Sun Microsystems built in partnership with Fujitsu. Access is handled exclusively by Grid Engine, including test jobs that are specific to these servers. The server nodes are called m9k0001 through m9k0008.

Enterprise M9000 Servers

Each of these servers contains 64 quad-core 2.52 GHz Sparc64 VII processors. Each core is capable of Chip Multi Threading with 2 hardware threads, so a single server can work simultaneously on up to 64 x 4 x 2 = 512 threads, and in total the eight servers can process 4096 threads. As each core carries two floating-point units that can handle additions and multiplications in a "fused" manner (FMA), the cluster has a theoretical peak performance (TPP) of up to 20 TFlops.

Chip Multi Threading (CMT) is a technology that allows multiple threads (processes) to simultaneously share a single computing resource, such as a core. This increases the efficiency with which the core is used. At the same time, multiple cores share chip resources, thereby improving their utilization.

Each of our servers has a total of 2 TB of memory (8 GB per core). These machines are therefore suitable for very-high-memory applications.

For more information on the Sparc64 VII Architecture, please check out this website.

Why these Servers?

The main emphasis of these high-end shared-memory servers is to deliver the maximum possible floating-point performance without compromising on memory capacity. These machines are to some degree complementary to our Victoria Falls Cluster: there the emphasis is on "throughput", while here the focus is on sheer TFlops.

The large memory of these servers makes them ideally suited for large-scale computations. Large L2 caches keep memory latencies low, while chip-multithreading technology increases core utilization. These are true high-performance machines.

Who Should Use this Cluster?

Large-memory shared-memory machines are ideally suited for applications that require fast access to well-organized memory, combined with rapid execution of floating-point operations. While the multi-threaded chip does not explicitly distinguish between multi-threaded shared-memory applications and multi-processing distributed-memory programs, the former are somewhat preferred because of their lower communication latencies. Applications that are very floating-point intensive, or that depend crucially on cache, may be good candidates for running on these servers.

We suggest you consider using this compute cluster if

  • Your application is very floating-point intensive and has little else to do, and you need very large amounts of memory. These compute nodes have 2 TB of RAM.
  • Your application is explicitly or automatically multi-threaded (for instance, using OpenMP) and shows good scaling for large numbers of threads (>50).
  • Your application is explicitly parallel (for instance, using MPI) and communication intensive, or your application combines MPI-type parallelism with multi-threading on the processes.
  • Your application uses a commercial license that is scaled per process; in such cases it is favourable to use machines with the maximum per-process performance.
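For the multi-threaded case, the number of threads is usually controlled through the environment before the program is started. A minimal sketch, assuming an OpenMP (or automatically parallelized) executable; the program name ./myprog and the thread count are placeholders:

```shell
# Sketch: run a multi-threaded binary with 64 threads.
# Program name and thread count are illustrative only.
export OMP_NUM_THREADS=64
./myprog
```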

The new cluster might not be suitable if

  • You need to perform a large number of relatively short jobs, each serial or small-scale multi-threaded. Jobs like this should be sent to the "Victoria Falls" cluster.
  • Your application is "embarrassingly parallel", i.e. it scales trivially well and requires very little communication. Such jobs are typically run on a distributed-memory cluster, and might be considered for the VF cluster.

If you think your application could run more efficiently on these machines with some modifications, please contact us (help@hpcvl.org) to discuss any concerns and let us assist you in getting started.

Note that on these SMP machines, we have to enforce dedicated cores or CPUs to avoid the overhead of sharing and context switching. No "overloading" can be allowed.

How Do I Use These Servers?

... to access

The servers are accessed from the login node sflogin0 (also called sfnode0), which is reached via the HPCVL Secure Portal at https://portal.hpcvl.queensu.ca/.

Clicking on the "Secure Desktop" tab in the portal will present you with a list of applications. Choose the one saying dtterm (sfnode0) or xterm (sfnode0). This will bring up a login terminal on the login node sflogin0.

The file systems for all of our clusters are shared, so you will be using the same home directory. The login node can be used for compilation, program development, and testing only, not for production jobs.

... to compile and link

Since the architecture of the Sparc64 VII chips in the M9000 servers differs in some important details from that of the login node, it is a good idea to re-compile your code whenever possible. In most cases this is very simple:

  • Make sure you are using Studio 12 compilers. This is the default, but if you have entries in your shell setup that reset the compiler, you might have to modify these by typing use studio12
  • Many optimization options in the Studio compilers, such as -fast, imply settings that involve -native, i.e. they optimize for the architecture and chipset of the machine on which you compile. You will want to override these settings, since they imply optimization for the login node (presently a Sunfire 2900), which may be somewhat sub-optimal for the M9000 servers. The compilation should therefore include additional options that override the existing ones.
  • Explicitly architecture-dependent optimization options include -xtarget=sparc64vii -xcache=64/64/2:6144/256/12 -xarch=sparcima. These are best added to the right of pre-existing compiler options such as -fast, because later options override earlier settings. The environment variable M9KFLAGS is set to these flags in the default setup, so instead of the above settings you can simply type $M9KFLAGS.
  • To include "fused multiplication/addition" (FMA) in the compilation, specify -xarch=sparcfmaf -fma=fused after the other options (note that -xarch needs to be overridden). The environment variable FMAFLAGS is set by default and may be used instead of these settings.
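Putting the items above together, a recompilation for the M9000 nodes might look as follows. This is a sketch: the compiler invocation and file names are illustrative, and $M9KFLAGS/$FMAFLAGS are assumed to be set by the default shell setup as described above.

```shell
# Architecture flags go to the RIGHT of -fast so that they
# override its implied -native settings (which target the login node).
cc -fast $M9KFLAGS $FMAFLAGS -o myprog myprog.c

# Equivalent explicit form; the rightmost -xarch wins:
cc -fast -xtarget=sparc64vii -xcache=64/64/2:6144/256/12 -xarch=sparcima \
   -xarch=sparcfmaf -fma=fused -o myprog myprog.c
```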

For applications that cannot be re-compiled (for instance, because the source code is not accessible), binaries compiled for any UltraSparc chip later than the USIII will usually work quite well.

For a general introduction to program development, compilation and application building on our systems see the HPCVL Parallel Programming FAQ.

... to run jobs

As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes. Production runs must be submitted to Grid Engine. For a description of how to use Grid Engine, see the HPCVL Grid Engine FAQ.

Unless stated otherwise, Grid Engine schedules jobs to a default pool of machines. This default pool presently contains only the M9000 nodes m9k0001-8, so no special script lines are needed to be scheduled to these servers exclusively.

Note that your jobs will run on dedicated threads, i.e. up to 512 processes can be scheduled to a single server. Grid Engine does the scheduling; there is no way for the user to determine which processes run on which cores.
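As an illustration, a Grid Engine submission script for a multi-threaded job on these nodes might look like the following sketch. The parallel-environment name (shm.pe) and the slot count are assumptions, not the actual settings; consult the Grid Engine FAQ for the values in force on our systems.

```shell
#!/bin/bash
#$ -S /bin/bash                  # run the job under bash
#$ -cwd                          # start in the submission directory
#$ -V                            # export the submission environment
#$ -pe shm.pe 32                 # request 32 slots (PE name is hypothetical)
export OMP_NUM_THREADS=$NSLOTS   # match threads to the granted slots
./myprog
```

The script would then be submitted with qsub myscript.sh, and Grid Engine places the job on one of the m9k nodes.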


...to find more information

For a more thorough review of the multi-core environment, please check out this PDF. You may want to follow some of the links provided in that document. General information about using HPCVL facilities can be found in our FAQ pages.

We also supply user support (please contact us at help@hpcvl.org), so if you experience problems, we can assist you.