Sun Sparc Enterprise M9000 Servers

HPCVL's Sun Sparc Enterprise M9000 Servers

HPCVL has installed eight new SMP machines of the Enterprise M9000 type that greatly expand our computing capacity. This page explains the essential features of the new machines and is meant as a basic guide to their usage.

1. What are these servers?

We are installing a cluster of eight new SMP machines. These are Sun SPARC Enterprise M9000 Servers, which Sun Microsystems built in partnership with Fujitsu. The servers will be made available gradually. There will be no special login node, as access is handled exclusively by Grid Engine, including test jobs that are specific to these servers. The server nodes are called m9k0001...m9k0008.

Each of these servers consists of 64 quad-core 2.52 GHz Sparc64 VII processors. Each of these chips has 4 compute cores, and each core is capable of Chip Multi Threading with 2 hardware threads. This means that each of the servers can work simultaneously on up to 512 threads. Once fully installed, the eight servers together will be able to process more than 4000 threads. As each core carries two Floating-Point Units that can handle Additions and Multiplications in a "Fused" manner (FMA), the cluster adds about 20 TFlops of theoretical peak performance (TPP) to our computing capacity.

Chip Multi Threading (CMT) is a technology that allows multiple threads (processes) to share a single computing resource, such as a core, simultaneously. This increases the efficiency with which the core is used. At the same time, multiple cores share chip resources, thereby improving their utilization.

Each of the new servers has a total of 2 TByte (!) of memory (8 GB per core). These machines are clearly meant for very-high-memory applications.
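The thread, performance, and memory figures quoted in this section can be reproduced with a little arithmetic. The following is a minimal sketch, not an official benchmark. It assumes that each of the two Floating-Point Units completes one fused multiply-add (counted as 2 floating-point operations) per clock cycle, giving 4 operations per core per cycle. That reading of the FMA sentence above is our assumption, and all variable names are chosen for illustration only.

```shell
# Hardware figures taken from the description above
SERVERS=8
CHIPS_PER_SERVER=64
CORES_PER_CHIP=4
THREADS_PER_CORE=2
CLOCK_GHZ=2.52
GB_PER_CORE=8

# Assumption: 2 FPUs per core x 2 flops per fused multiply-add
FLOPS_PER_CYCLE=4

CORES_PER_SERVER=$((CHIPS_PER_SERVER * CORES_PER_CHIP))
THREADS_PER_SERVER=$((CORES_PER_SERVER * THREADS_PER_CORE))
TOTAL_THREADS=$((SERVERS * THREADS_PER_SERVER))
TOTAL_CORES=$((SERVERS * CORES_PER_SERVER))
MEM_GB_PER_SERVER=$((CORES_PER_SERVER * GB_PER_CORE))

# Theoretical peak: cores x clock (GHz) x flops per cycle, converted to TFlops
PEAK_TFLOPS=$(LC_ALL=C awk -v c="$TOTAL_CORES" -v g="$CLOCK_GHZ" -v f="$FLOPS_PER_CYCLE" \
  'BEGIN { printf "%.1f", c * g * f / 1000 }')

echo "Threads per server:    $THREADS_PER_SERVER"
echo "Threads in cluster:    $TOTAL_THREADS"
echo "Memory per server:     $MEM_GB_PER_SERVER GB"
echo "Peak performance:      $PEAK_TFLOPS TFlops"
```

Under these assumptions the script reports 512 threads per server, 4096 threads in total, 2048 GB of memory per server, and a peak of roughly 20.6 TFlops.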

For more information on the Sparc64 VII Architecture, please check out this website.

2. Why these servers?

The main emphasis of these high-end SMP servers is on delivering the maximum possible floating-point performance without compromising on memory requirements. These machines are to some degree complementary to our Victoria Falls Cluster, where the emphasis is on "Throughput", while here the focus is on sheer TFlops.

The very large memory of these servers makes them ideally suited for large-scale computations. Large L2 caches keep memory latencies low, while chip multithreading technology increases core utilization. These are true high-performance machines.

3. Who should use this cluster?

Large-memory SMP machines are ideally suited for applications that require fast access to well-organized memory, combined with rapid execution of floating-point operations. While the multi-threaded chip does not explicitly distinguish between multi-threaded shared-memory applications and multi-processing distributed-memory programs, there is some preference for the former because of their lower communication latencies. Applications that are very floating-point intensive, or that depend crucially on cache, may be good candidates for running on these servers.

We suggest you consider using the new compute cluster if

  • Your application is very floating-point intensive and has little else to do, and you need very large amounts of memory. These compute nodes have 2 TB of RAM.
  • Your application is explicitly or automatically multi-threaded (for instance, using OpenMP) and shows good scaling for large numbers of threads (>50).
  • Your application is explicitly parallel (for instance, using MPI) and communication intensive, or your application combines MPI-type parallelism with multi-threading on the processes.
  • Your application uses a commercial license that is scaled per process; in such cases it is favourable to use machines with the maximum per-process performance.

The new cluster might not be suitable if

  • You need to perform a large number of relatively short jobs, each serial or small-scale multi-threaded. Jobs like this should be sent to the "Victoria Falls" cluster.
  • Your application is "embarrassingly parallel", i.e. it scales trivially well but uses very little communication. Such jobs are typically run on a distributed-memory cluster, and might be considered for the VF cluster.
  • Your application is able to use many processors but has a very small memory footprint. In this case you are not making proper use of the memory capacity of these machines, and would be better off using even more processes on a "flat cluster".

If you think your application could run efficiently on these machines, please contact us (help@hpcvl.org) to discuss any concerns and let us assist you in getting started.

Note that on these SMP machines, we have to enforce dedicated cores or CPUs to avoid sharing and context switching overheads. No "overloading" can be allowed.

4. How do I use these servers?

a) ... to access

The servers are accessed via the HPCVL Secure Portal at https://portal.hpcvl.queensu.ca/ from the same login node as the Sunfire cluster (sflogin0).
Clicking on the "Secure Desktop" tab in the portal will present you with a list of applications. Choose the one saying dtterm (sfnode0) or xterm (sfnode0). This will bring up a login terminal on the login node sflogin0. Alternatively, you can submit jobs to the M9000 servers from the Victoria Falls login node vflogin0.

The file systems for all of our clusters are shared, so you will be using the same home directory. Everything will be very similar to the Sunfire cluster, including OS, shell setup, and Grid Engine usage. The login node can be used for compilation, program development, and testing only, not for production jobs. This is also just as on the SF25K cluster.

b) ... to compile and link

Since the architecture of the Sparc64 VII chips of the M9000 Servers differs in some important details from the Sunfire (US IV+) one, it may be a good idea to re-compile your code whenever possible. This is in most cases very simple:

  • Make sure you are using Studio 12 compilers. This is the default, but if you have entries in your shell setup that reset the compiler, you might have to modify these by typing "use studio12".
  • Many optimization options in the Studio compilers, such as -fast, imply settings that involve -native, i.e. they optimize for the architecture and chipset of the machine on which you are doing the compilation. On the login node (presently a Sunfire 2900) this means optimizing for the login node itself, which might be sub-optimal for the M9000 servers. The compilation should therefore include additional options that override these settings.
  • Explicitly architecture-dependent optimization options include
    -xtarget=sparc64vii, -xcache=64/64/2:6144/256/12, and -xarch=sparcima.
    These are best added to the right of pre-existing compiler options such as -fast, because this way they override the previous settings. An environment variable M9KFLAGS is set to these flags when "use studio12" is specified, so that instead of the above settings, you can just type $M9KFLAGS.
  • To include "fused multiplication/addition" (FMA) in the compilation you need to specify
    -xarch=sparcfmaf -fma=fused
    after the other options (note that the earlier -xarch setting needs to be overridden). An environment variable FMAFLAGS is set by "use studio12" and may be used instead of these settings.
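The ordering rules in the list above can be sketched as follows. This is an illustration only: myprog.c and myprog are hypothetical file names, cc stands for the Studio C compiler driver, and the fallback values are simply copied from the flag lists above in case M9KFLAGS and FMAFLAGS are not already set in your shell. The sketch prints the command lines rather than running them.

```shell
# Fallbacks copied from the flag lists above. On the cluster, "use studio12"
# is described as setting both variables, in which case those values are used.
M9KFLAGS="${M9KFLAGS:--xtarget=sparc64vii -xcache=64/64/2:6144/256/12 -xarch=sparcima}"
FMAFLAGS="${FMAFLAGS:--xarch=sparcfmaf -fma=fused}"

# myprog.c and myprog are hypothetical names.
# The architecture flags go to the right of -fast so that they take precedence.
BUILD_CMD="cc -fast $M9KFLAGS -o myprog myprog.c"

# The FMA flags come last, because their -xarch setting must replace the earlier one.
BUILD_FMA_CMD="cc -fast $M9KFLAGS $FMAFLAGS -o myprog myprog.c"

# Print the command lines for inspection.
echo "$BUILD_CMD"
echo "$BUILD_FMA_CMD"
```

To compile for real, type the printed command line on the login node, substituting your own source files and the appropriate compiler driver for your language.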

Otherwise program development, compilation and application building are done the same way as on the Sunfire cluster. For a general introduction, see http://www.hpcvl.org/faqs/programming/parallel-prog-faq.html.

For applications that cannot be re-compiled (for instance, because the source code is not accessible), binaries compiled for any post-USIII UltraSparc chip will work.

c) ... to run jobs

As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes. Production runs must be submitted to Grid Engine. This is exactly as on the Sunfire cluster. For a description of how to use Grid Engine, see the HPCVL Grid Engine FAQ.

Grid Engine will schedule jobs to a default pool of machines unless otherwise stated. This default pool contains presently only the Sunfire 25K's, i.e. hpcvl0-hpcvl6. To include the M9000 servers in the machine pool, you need to add the line

#$ -q m9k.q

in your submission script. With this line, the job can go either to the 25K's or the M9K's. If you want to restrict the submission to the M9000 servers only, also include the line

#$ -l qname=m9k.q

in the script. Your job will then be sent to the Enterprise M9000's exclusively.
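Put together, a small test job restricted to the M9000 servers might look like the sketch below. Only the two m9k.q lines come from the description above. The -S and -cwd lines are standard Grid Engine options that we assume are appropriate for your setup, and the body of the script is a hypothetical placeholder for your own program.

```shell
#!/bin/bash
# Assumption: bash is the desired job shell
#$ -S /bin/bash
# Run the job in the directory it was submitted from
#$ -cwd
# Include the M9000 servers in the machine pool
#$ -q m9k.q
# Restrict the job to the M9000 servers only
#$ -l qname=m9k.q

# NSLOTS is set by Grid Engine inside a running job; default to 1 elsewhere.
SLOTS="${NSLOTS:-1}"
JOB_HOST="$(uname -n)"
echo "Test job running on $JOB_HOST with $SLOTS slot(s)"

# Replace the echo above with your own program, e.g. ./myprog (hypothetical).
```

If the script is saved as, say, m9k_test.sh (a hypothetical name), it can be submitted with "qsub m9k_test.sh". The reported host name should then be one of m9k0001...m9k0008.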

Note that your jobs will run on dedicated threads, i.e. up to 512 processes can be scheduled to a single server. The Grid Engine will do the scheduling, i.e. there is no way for the user to determine which processes run on which cores.

5. Help?

...to find more information

For a more thorough review of multi-core environments, please check out this PDF. You might want to follow some of the links provided in this document. General information about using HPCVL facilities can be found in our FAQ pages.

We also supply user support (please contact us at help@hpcvl.org), so if you experience problems, we can assist you.

 
 
   
© HPCVL 2008