How does automatic relative debugging work in the HWT?

Please note: The FAQ pages at the HPCVL website are continuously being revised. Some pages might pertain to an older configuration of the system. Please let us know if you encounter problems or inaccuracies, and we will correct the entries.

The second function of the HWT is Automatic Relative Debugging (ARD). This is a debugging method in which reference data, often from another version of the program, are used to determine automatically whether intermediate data in a program execution are correct. Although the reference data may be taken from any source, the method is typically used to compare a "correct" version of the program with the one that needs to be debugged. For instance, if the goal is to parallelize a program, the serial version may serve as a reference for the parallel one under development.

To achieve this, it is necessary to define which data are to be compared with which. The HWT solves this problem by means of a set of library routines that are supplied with the package. Calls to these routines define a unique Data Identifier, which is then written out together with the data that need to be compared. A comparison takes place only if the data identifier of the reference data matches that of the debugging data. The data identifiers used in the HWT have a standard form that consists of three components:

  1. A Principal Component, which is usually a descriptive string indicating the physical meaning of the data to be compared.
  2. An Instance Component, which consists of both strings and integer variables and indicates the instance of those data in the code. This is often used to record one or more loop indices.
  3. A Physical Index (an integer), which uniquely labels simple data items inside a data structure. This enables the comparison of ordinary arrays with distributed ones in parallel programming.

Since it is often whole arrays that need to be compared, the Physical Index becomes important if the internal structure of the array in the debugging version of the program differs from the one in the reference. This is the case when non-distributed arrays in a serial program serve as reference for distributed arrays in a parallel program.
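The role of the three components can be illustrated with a small sketch. It is written in Python purely for illustration and does not use the HWT library: the class `DataIdentifier`, the helpers `label_serial` and `label_block`, and the simple block distribution with an `offset` are all hypothetical names and assumptions, not part of the package.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class DataIdentifier:
    """Hypothetical stand-in for an HWT-style data identifier."""
    principal: str                         # descriptive name of the data
    instance: Tuple[Union[str, int], ...]  # strings and integers, e.g. loop indices
    physical_index: int                    # global position of one element


def label_serial(name, instance, values):
    """Label every element of a non-distributed array by its global index."""
    return {DataIdentifier(name, instance, i): v for i, v in enumerate(values)}


def label_block(name, instance, local_values, offset):
    """Label one process's block of a distributed array.

    `offset` is the global index of the block's first element (assumed
    block distribution), so local index j maps to global index offset + j.
    """
    return {DataIdentifier(name, instance, offset + j): v
            for j, v in enumerate(local_values)}


# Reference: a serial array of four values at loop step 3.
reference = label_serial("temperature", ("step", 3), [1.0, 2.0, 3.0, 4.0])

# Debugging version: the same array split across two processes.
debug = {}
debug.update(label_block("temperature", ("step", 3), [1.0, 2.0], offset=0))
debug.update(label_block("temperature", ("step", 3), [3.0, 4.0], offset=2))

# Both versions now carry exactly the same set of identifiers.
same_keys = set(reference) == set(debug)
```

Because each element is labelled by its global position rather than its local one, the two halves of the distributed array line up with the serial reference even though their internal layout differs.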

The routine calls used to construct data identifiers are too extensive to describe here; they are explained in the manual. The basic steps of a debugging run are as follows:

  1. The user inserts calls to debugging routines into the code. These calls are enclosed in pre-processor constructs and are therefore specific to a debugging version of the code. This is done in both the version to be debugged and the reference version. The calls construct data identifiers and initiate the output of the debugging data into intermediate files.
  2. Both the debugging version and the reference version are compiled with the HWT (which links in the HWT library) and then executed. This generates the intermediate files that contain the data used for debugging.
  3. A call to the script call.debugger.hwt causes the HWT to compare the corresponding data items and detect deviations that exceed a user-defined tolerance limit. Error reports are generated that contain the information necessary to locate the problems.
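The comparison in the last step can be pictured with the following sketch. It is not the HWT implementation or its file format: the function `compare`, the use of plain tuples as identifiers, the sample values, and the choice of an absolute (rather than relative) tolerance are all assumptions made for illustration.

```python
def compare(reference, debug, tolerance):
    """Compare two data sets keyed by identifier.

    Only identifiers present in BOTH data sets are compared. Returns a
    list of (identifier, reference_value, debug_value) for every pair
    whose absolute difference exceeds `tolerance`.
    """
    report = []
    for ident in sorted(reference.keys() & debug.keys()):
        ref_val, dbg_val = reference[ident], debug[ident]
        if abs(ref_val - dbg_val) > tolerance:
            report.append((ident, ref_val, dbg_val))
    return report


# Identifiers are (principal, instance, physical index) tuples here.
reference = {
    ("pressure", ("iter", 1), 0): 1.00,
    ("pressure", ("iter", 1), 1): 2.00,
    ("pressure", ("iter", 2), 0): 3.00,   # no counterpart in debug data
}
debug = {
    ("pressure", ("iter", 1), 0): 1.00 + 1e-12,  # within tolerance
    ("pressure", ("iter", 1), 1): 2.50,          # deviates
    ("density", ("iter", 1), 0): 9.90,           # no counterpart in reference
}

report = compare(reference, debug, tolerance=1e-6)
for ident, ref_val, dbg_val in report:
    print(f"deviation at {ident}: reference={ref_val}, debug={dbg_val}")
```

In this sketch, items whose identifier appears in only one of the two data sets are skipped silently, mirroring the rule that a comparison takes place only when the identifiers match.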

In many cases, four or five different routines are sufficient to debug relatively large data structures and locate errors. Since every comparison is done automatically, no "manual" comparison of individual data values is required.