HPCVL operates a compute cluster that greatly expands our capacity and makes use of the latest chip multi-threading technology. This page explains the essential features of the cluster and is meant as a basic guide to its usage.
The Victoria Falls compute cluster is based on Sun SPARC Enterprise T5140 Servers. The compute nodes are named vf0000.... A total of 73 nodes form the production cluster.
Each of these nodes includes two 1.2 GHz UltraSparc T2+ chips. Each of these chips has 8 compute cores, and each core is capable of Chip Multi-Threading with 8 hardware threads. This means that each node can work simultaneously on up to 128 threads (2 chips x 8 cores x 8 threads). With 73 nodes, the "Victoria Falls" cluster is therefore able to process almost 10,000 threads (73 x 128 = 9,344).
Ten of the cluster nodes have a total of 64 GB of memory (4 GB per core); the others have 32 GB (2 GB per core).
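The thread and memory figures above follow from simple multiplication. The short bash sketch below reproduces them; the variable names are chosen for illustration only, and the numbers are the ones quoted on this page.

```shell
#!/bin/bash
# Hardware figures as quoted in the text above.
chips_per_node=2
cores_per_chip=8
threads_per_core=8
nodes=73

cores_per_node=$((chips_per_node * cores_per_chip))
threads_per_node=$((cores_per_node * threads_per_core))
cluster_threads=$((nodes * threads_per_node))

# Memory per core for the two node types (64 GB and 32 GB per node).
mem_per_core_large=$((64 / cores_per_node))
mem_per_core_small=$((32 / cores_per_node))

echo "cores per node:              $cores_per_node"
echo "hardware threads per node:   $threads_per_node"
echo "hardware threads in cluster: $cluster_threads"
echo "GB per core (64 GB nodes):   $mem_per_core_large"
echo "GB per core (32 GB nodes):   $mem_per_core_small"
```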
Chip Multi-Threading (CMT) is a technology that allows multiple threads (or processes) to simultaneously share a single computing resource, such as a core. This greatly increases the efficiency with which the core is used. At the same time, multiple cores share chip resources, such as memory controllers and caches, thereby improving their utilization. The result is very high per-chip performance. For an introduction to CMT, see this collection of papers. For more background, see this paper.
For more information on the UltraSparc T2 Server Architecture, please check out this whitepaper (pdf).
The main emphasis in CMT clusters is on "getting the job done". Since most jobs do not just consist of a long sequence of floating-point operations, the emphasis is shifted away from "FLOPS" towards per-chip performance and "throughput".
Modern superscalar processing units can perform multiple operations per clock cycle. However, it is quite common that large portions of these capabilities are not realized because operations of one type have to wait until operations of another have finished. In particular, memory operations are comparatively slow and have large latencies. They will therefore often lead to poor utilization of the available "CPU slots".
This problem can be addressed by either increasing the core speed, or by giving the core opportunities to pick operations from multiple incoming instruction strands or threads. Since the former route is clearly approaching physical limitations, the latter is taken by CMT. An added benefit of this approach is the sharing of hardware resources, and the corresponding efficiency in terms of energy usage, space requirements, as well as acquisition and operating cost.
CMT machines are ideally suited for applications that require "a good mix" of operations, and that scale well in parallel mode. While the multi-threaded chip does not explicitly distinguish between multi-threaded shared-memory applications and multi-processing distributed-memory programs, there is some preference for the former due to a greater degree of resource sharing. Applications that are very floating-point intensive, or that depend crucially on cache usage, may run into contention on the floating-point processing unit (FPU) or the cache, since both are shared by all threads on a core.
If you think your application could run efficiently on these machines, please contact us (email@example.com) to discuss any concerns and let us assist you in getting started.
Note that on the CMT machines, the number of processes chosen for a given application is usually larger than on standard shared-memory machines. For the latter, it is desirable to use dedicated cores or CPUs to avoid sharing and context-switching overheads. For the CMT machines, "overloading" is the rule. How many threads or processes are optimal usually has to be determined by experimentation.
Login access to the general login node is available via the HPCVL Secure Portal. Clicking on the "Secure Desktop" tab in the portal will present you with a list of applications. Choose the one labelled "dtterm (sfnode0)" or "xterm(sfnode0)". This will bring up a login terminal on the login node sflogin0 or, equivalently, sfnode0.
The file systems for all our compute clusters are shared, so you will be using the same home directory. Everything else is very similar to the other clusters, including OS, shell setup, and Grid Engine usage. The login node can be used for compilation, program development, and testing only, not for production jobs.
Since the architecture of the Victoria Falls cluster differs substantially from the Sunfire one, it is likely a good idea to re-compile your code whenever possible. This is in most cases very simple:
Otherwise, program development, compilation and application building are done the same way as on the other clusters. For a general introduction, see the HPCVL parallel programming FAQ.
For applications that cannot be re-compiled (for instance, because the source code is not accessible), binaries compiled for any post-USIII UltraSparc chip will work.
As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes. Production runs must be submitted to Grid Engine. For a description of how to use Grid Engine, see the HPCVL Grid Engine FAQ.
Grid Engine will schedule jobs to a default pool of machines unless otherwise stated. This default pool does not contain the Victoria Falls cluster nodes. To include the cluster in the machine pool, you need to add the line
#$ -q vf.q
in your submission script. With this line, the job can go either to the default pool or to the Victoria Falls cluster. If you want to restrict the submission to only the Victoria Falls machines, also include the line
#$ -l qname=vf.q
in the script. Your job will then be sent exclusively to the Victoria Falls cluster.
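Put together, a submission script that restricts a job to the Victoria Falls nodes might look like the sketch below. Only the two vf.q lines come from this page. The job name, the output file names and the program name my_program are hypothetical placeholders, and the remaining directives (-S, -cwd, -N, -o, -e) are standard Grid Engine options that you may or may not need. Any parallel environment request your job requires is site-specific and is not shown here.

```shell
#!/bin/bash
# Sketch of a Grid Engine submission script (placeholders marked below).
#$ -S /bin/bash
#$ -cwd
# Hypothetical job name and output files:
#$ -N vf_test
#$ -o vf_test.out
#$ -e vf_test.err
# Include the Victoria Falls queue and restrict the job to it:
#$ -q vf.q
#$ -l qname=vf.q

# NSLOTS is set by Grid Engine for parallel jobs; default to 1 otherwise.
slots=${NSLOTS:-1}
echo "Job running on $(hostname) with $slots slot(s)"

# Hypothetical executable; replace with your own program.
program=./my_program
if [ -x "$program" ]; then
    "$program"
else
    echo "Program $program not found or not executable" >&2
fi
```

Such a script would be submitted with qsub, as described in the Grid Engine FAQ.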
Note that the number of processes for CMT machines should be chosen substantially greater than for standard shared-memory systems. Your jobs will no longer run on dedicated processors; instead, several (up to 8) threads will be scheduled to the same core. The best number must be determined largely by experimentation, separately for each application.
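One way to carry out this experimentation is to run the same program with several thread counts and compare the elapsed times. The bash function below is a minimal sketch of that idea, not a recommended procedure. It assumes an OpenMP program that honours the OMP_NUM_THREADS environment variable; the function name scan_threads and the program name in the usage note are hypothetical. Timing uses the shell's built-in SECONDS counter, so the resolution is whole seconds.

```shell
#!/bin/bash
# scan_threads PROGRAM COUNT [COUNT ...]
# Runs PROGRAM once per thread count and prints the elapsed wall time.
scan_threads() {
    local program=$1
    shift
    local threads
    for threads in "$@"; do
        SECONDS=0                      # reset the shell's elapsed-time counter
        OMP_NUM_THREADS=$threads "$program" > /dev/null
        echo "$threads threads: $SECONDS s"
    done
}
```

For example, "scan_threads ./my_openmp_program 8 16 32 64 128" inside a submission script would print one line per thread count. For runs of more than test length, this belongs in a Grid Engine job rather than on the login node.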
We encourage our users to have a look at the Sun CoolTools distribution. This is a collection of tools that are designed to "improve the ease of deployment of UltraSPARC T1 and UltraSPARC T2 based servers". Some of them are rather low-level, and therefore more suited for system managers, but others can be useful to tune your applications for usage on the Victoria Falls cluster.
For instance, SPOT, the Simple Performance Optimization Tool, provides reports about the performance of a given application and is meant to detect conditions such as cache misses. Another example is the Thread Analyzer, which is integrated into the Sun Studio 12 software and is able to detect race conditions and deadlocks in multi-threaded programs.
For a more thorough review of multi-core environments, please check out this PDF. You might want to follow some of the links provided in that document. General information about using HPCVL facilities can be found in our FAQ pages. There is an article in the HPCVL Labnote newsletter (Summer 2008) with a good overview of the Victoria Falls cluster.
We also supply user support (please contact us at firstname.lastname@example.org), so if you experience problems, we can assist you.