Victoria Falls Cluster

HPCVL's Throughput Compute Cluster

HPCVL operates a compute cluster that greatly expands our capacity and makes use of the latest chip multi-threading technology. This page explains the essential features of the cluster and serves as a basic guide to its usage.

What is the Cluster?

The Victoria Falls compute cluster is based on Sun SPARC Enterprise T5140 Servers. The compute nodes are named vf0000.... A total of 73 nodes form the production cluster.

Each of these nodes includes two 1.2 GHz UltraSparc T2+ chips. Each of these chips has 8 compute cores, and each core is capable of Chip Multi-Threading with 8 hardware threads. This means that each node is capable of working simultaneously on up to 128 threads. The "Victoria Falls" cluster as a whole is therefore able to process almost 10,000 threads (73 nodes × 128 threads = 9,344).

Ten of the cluster nodes have 64 GB of memory (4 GB per core); the others have 32 GB (2 GB per core).
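
If you would like to verify this on a node, the standard Solaris processor tools report the hardware threads as "virtual processors"; a quick check (assuming an interactive session on a compute node) might look like this:

    psrinfo | wc -l    # number of virtual processors (hardware threads); 128 on a T5140 node
    psrinfo -pv        # per-chip summary (chip type, clock, virtual processor count)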

Chip Multi-Threading (CMT) is a technology that allows multiple threads (processes) to simultaneously share a single computing resource, such as a core. This greatly increases how efficiently the core is used. At the same time, multiple cores share chip resources, such as memory controllers and caches, thereby improving their utilization. The result is unprecedented per-chip performance. For an introduction to CMT, see this collection of papers. For more background, see this paper.

For more information on the UltraSparc T2 Server Architecture, please check out this whitepaper (pdf).

Why this Cluster?

The main emphasis in CMT clusters is on "getting the job done". Since most jobs do not just consist of a long sequence of floating-point operations, the emphasis is shifted away from "FLOPS" towards per-chip performance and "throughput".

Modern superscalar processing units can perform multiple operations per clock cycle. However, it is quite common that large portions of these capabilities are not realized because operations of one type have to wait until operations of another have finished. In particular, memory operations are comparatively slow and have large latencies. They will therefore often lead to poor utilization of the available "CPU slots".

This problem can be addressed by either increasing the core speed, or by giving the core opportunities to pick operations from multiple incoming instruction strands or threads. Since the former route is clearly approaching physical limitations, the latter is taken by CMT. An added benefit of this approach is the sharing of hardware resources, and the corresponding efficiency in terms of energy usage, space requirements, as well as acquisition and operating cost.

Who Should Use this Cluster?

CMT machines are ideally suited for applications that require "a good mix" of operations and that scale well in parallel mode. While the multi-threaded chip does not explicitly distinguish between multi-threaded shared-memory applications and multi-processing distributed-memory programs, there is some preference for the former due to a greater degree of resource sharing. Applications that are very floating-point intensive, or that depend crucially on cache usage, may run into contention on the floating-point unit (FPU) or the cache, since both are shared by all threads on a core.

We suggest you consider using the compute cluster if

  • Your application is explicitly or automatically multi-threaded (for instance, using OpenMP) and shows at least some scaling for moderately large numbers of threads (>20).
  • Your application is explicitly parallel (for instance, using MPI) and not too communication-intensive. Ideally, your application combines MPI-type parallelism with multi-threading within the processes; a launch sketch follows this list.
  • You need to perform a large number of relatively short jobs, each serial or preferably multi-threaded.
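
As a rough sketch, a hybrid MPI/OpenMP run on a single node might be launched along the following lines. The binary name hybrid_app and the process and thread counts are placeholders, and we assume an Open MPI-based mpirun such as the one shipped with Sun HPC ClusterTools, where -x exports an environment variable to the MPI processes:

    export OMP_NUM_THREADS=16                       # 16 OpenMP threads per MPI process
    mpirun -np 8 -x OMP_NUM_THREADS ./hybrid_app    # 8 x 16 = 128 threads, one full T5140 node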

The cluster might not be suitable if

  • Your application uses a commercial license that is scaled per process; such jobs should be run on dedicated CPUs.
  • Your application is very floating-point intensive and has little else to do. You might still want to test the actual behaviour, as CMT machines can "mask" memory latencies and thereby make very efficient use of the limited FPUs available.
  • You need very large amounts of memory. These compute nodes have 32 or 64 GB of RAM.

If you think your application could run efficiently on these machines, please contact us (help@hpcvl.org) to discuss any concerns and let us assist you in getting started.

Note that on the CMT machines, the number of processes chosen for a given application is usually larger than for standard shared-memory machines. For the latter, it is desirable to use dedicated cores or CPUs to avoid sharing and context-switching overheads. For the CMT machines, "overloading" is the rule. How many threads or processes are optimal usually has to be determined by experimentation.
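
A simple way to find a reasonable setting is to scan over thread counts and compare timings. The sketch below assumes an OpenMP binary called myapp (a placeholder) running interactively on a compute node:

    # Time the same run with increasing thread counts
    for t in 8 16 32 64 128; do
        echo "threads: $t"
        OMP_NUM_THREADS=$t /usr/bin/time ./myapp > /dev/null
    done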

How Do I Use this Cluster?

... to access

Login access to the general login node is available via the HPCVL Secure Portal. Clicking on the "Secure Desktop" tab in the portal will present you with a list of applications. Choose the one labelled "dtterm (sfnode0)" or "xterm (sfnode0)". This will bring up a login terminal on the login node sflogin0 or, equivalently, sfnode0.

The file systems for all our compute clusters are shared, so you will be using the same home directory. Everything else is very similar to the other clusters, including OS, shell setup, and Grid Engine usage. The login node can be used for compilation, program development, and testing only, not for production jobs.

... to compile and link

Since the architecture of the Victoria Falls cluster differs substantially from the Sunfire one, it is likely a good idea to re-compile your code whenever possible. This is in most cases very simple:

  • Make sure you are using Studio 12 compilers. This is the default, but if you have entries in your shell setup that reset the compiler, you might have to modify these by typing
    use studio12
  • Many optimization options in the Studio compilers, such as -fast, imply settings such as -xtarget=native, i.e. they optimize for the architecture and chipset of the machine on which you compile. These settings do not have to be changed; the compilation should simply be redone on a Victoria Falls node. For instructions on how to access such a node interactively, please contact us.
  • Explicitly architecture-dependent optimization options such as -xtarget=ultra4plus need to be changed. For the Victoria Falls cluster, use the following flags (a sample compile line is shown after this list):
    -xtarget=ultraT2
    -xcache=8/16/4:4096/64/16
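
For illustration, a complete compile line for an OpenMP code could then look as follows. The source and output names are placeholders; drop -xopenmp for serial code, and note that the explicit target flags simply override the native settings implied by -fast:

    cc -fast -xtarget=ultraT2 -xcache=8/16/4:4096/64/16 -xopenmp -o mycode mycode.c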

Otherwise, program development, compilation and application building are done the same way as on the other clusters. For a general introduction, see the HPCVL parallel programming FAQ.

For applications that cannot be re-compiled (for instance, because the source code is not accessible), executables compiled for any post-USIII UltraSparc chip will work.

... to run jobs

As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes. Production runs must be submitted to Grid Engine. For a description of how to use Grid Engine, see the HPCVL Grid Engine FAQ.

Grid Engine will schedule jobs to a default pool of machines unless instructed otherwise. This default pool does not contain the Victoria Falls cluster nodes. To include the cluster in the machine pool, you need to add the line

#$ -q vf.q

in your submission script. With this line, the job can go either to the default pool or to the Victoria Falls nodes. If you want to restrict the submission to the Victoria Falls machines only, also include the line

#$ -l qname=vf.q

in the script. Your job will then be sent to the Victoria Falls nodes exclusively.
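
Putting these pieces together, a minimal submission script for a multi-threaded job might look like the sketch below. The job, file, and binary names are placeholders, and the parallel environment requested with -pe is an assumption; please ask us which parallel environment and slot count are appropriate for your application:

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -o myjob.out
    #$ -e myjob.err
    #$ -q vf.q                 # make the Victoria Falls nodes eligible
    #$ -l qname=vf.q           # ...and restrict the job to them
    #$ -pe shm.pe 32           # placeholder parallel environment and slot count

    export OMP_NUM_THREADS=32  # number of threads; tune by experimentation
    ./myapp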

Note that the number of processes for CMT machines should be chosen substantially larger than for standard shared-memory systems. Your jobs will no longer run on dedicated processors; instead, several (up to 8) threads will be scheduled to the same core. Which number to choose must be determined largely by experimentation for each specific application.

... to optimize

We encourage our users to have a look at the Sun CoolTools distribution. This is a collection of tools that are designed to "improve the ease of deployment of UltraSPARC T1 and UltraSPARC T2 based servers". Some of them are rather low-level, and therefore more suited for system managers, but others can be useful to tune your applications for usage on the Victoria Falls cluster.

For instance, SPOT, the Simple Performance Optimization Tool, provides reports about the performance of a given application and is meant to detect conditions such as cache misses. Another example is the Thread Analyzer, which is integrated into the Sun Studio 12 software and is able to detect race conditions and deadlocks in multi-threaded programs.
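
As an example of the latter, a race-detection run with the Thread Analyzer typically proceeds in two steps: collect an experiment, then inspect it. The binary name a.out is a placeholder, and test.1.er is the default experiment name created by collect; consult the Sun Studio documentation for the full set of options:

    collect -r race ./a.out   # run the program and record potential data races
    tha test.1.er             # examine the recorded experiment in the Thread Analyzer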

Help?

...to find more information

For a more thorough review of the multi-core environment, please check out this PDF. You might want to follow some of the links provided in this document. General information about using HPCVL facilities can be found in our FAQ pages. There is also an article in the HPCVL Labnote newsletter (Summer 2008) with a good overview of the Victoria Falls cluster.

We also provide user support (please contact us at help@hpcvl.org), so if you experience problems, we can assist you.