IBM High Productivity Computing Systems Toolkit     



IBM High Productivity Computing Systems Toolkit - overview

The IBM High Productivity Computing Systems Toolkit (HPCST) is a framework that provides services for performance data collection, bottleneck identification, solution discovery and implementation, and iteration of the tuning process. It provides access to a wide array of information from static analysis, runtime behavior, algorithm properties, architecture features, and expert domain knowledge. Based on this information, the framework provides a mechanism to compare and correlate performance metrics across different aspects (e.g., computation, memory, communication, I/O) and to pinpoint the causes of performance problems. The framework also attempts to mitigate these problems by suggesting and implementing solutions.

The methodology can be summarized as follows. We collect the causes of performance problems from the literature and from performance experts, and store them as patterns defined on performance metrics. The framework inspects and instruments the application, and actively searches for the known patterns in the pattern database. Once a pattern is matched, the corresponding bottleneck is considered found, and the framework consults the knowledge database for possible solutions. Solutions are evaluated, and implemented if the user so chooses. For more detailed information regarding the HPCS toolkit, please see the accompanying HPCS toolkit document.
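The detection loop described above can be sketched as follows. This is a minimal illustration, assuming that a pattern is a predicate over a dictionary of named metrics and that the knowledge database maps each bottleneck name to a list of candidate solutions. The metric names, thresholds, and the `Pattern`, `PATTERN_DB`, `KNOWLEDGE_DB`, and `detect_bottlenecks` names are hypothetical and are not taken from the HPCS toolkit itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Pattern:
    """A bottleneck pattern: a named predicate over performance metrics (hypothetical)."""
    bottleneck: str
    matches: Callable[[Metrics], bool]


# Hypothetical pattern database; the thresholds are illustrative only.
PATTERN_DB: List[Pattern] = [
    Pattern("load_imbalance",
            lambda m: m["mpi_wait_time"] / m["total_time"] > 0.2),
    Pattern("poor_cache_reuse",
            lambda m: m["l1_misses"] / m["l1_accesses"] > 0.1),
]

# Hypothetical knowledge database: bottleneck name -> candidate solutions.
KNOWLEDGE_DB: Dict[str, List[str]] = {
    "load_imbalance": ["redistribute work across ranks"],
    "poor_cache_reuse": ["loop tiling", "array padding"],
}


def detect_bottlenecks(metrics: Metrics) -> Dict[str, List[str]]:
    """Return each matched bottleneck together with its candidate solutions."""
    found: Dict[str, List[str]] = {}
    for pattern in PATTERN_DB:
        if pattern.matches(metrics):
            # A matched pattern is treated as a discovered bottleneck;
            # look up possible solutions in the knowledge database.
            found[pattern.bottleneck] = KNOWLEDGE_DB.get(pattern.bottleneck, [])
    return found


if __name__ == "__main__":
    sample = {"mpi_wait_time": 30.0, "total_time": 100.0,
              "l1_misses": 5.0, "l1_accesses": 100.0}
    # Only the load-imbalance pattern matches (30/100 > 0.2, 5/100 <= 0.1).
    print(detect_bottlenecks(sample))
```

Evaluating a candidate solution and applying it with the user's approval would be a separate step after this lookup; the sketch stops at reporting matches so that the pattern/solution split stays visible.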

Our framework contains many built-in metrics and rules for bottleneck detection. The runtime metrics are collected through the IBM High Performance Computing Toolkit (IHPCT) and include hardware event counters, MPI profiling and tracing, I/O profiling and tracing, and OpenMP profiling data. Additional metrics come from static analysis and compiler analysis. One of the important features of our framework is its extensibility: we have been able to incorporate the metrics provided by the Scalasca module into our framework. Scalasca is a toolset developed at the Jülich Supercomputing Centre in cooperation with the University of Tennessee. On the solution side, the framework can automatically tune for several performance problems. To demonstrate the advantages of performance optimization through our framework, we present a case study using the Lattice Boltzmann Magneto-Hydrodynamics code (LBMHD).
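One way to picture this extensibility is a metric registry to which each data source contributes named metrics, so that rules can correlate values coming from different tools. The sketch below is an assumption made for illustration only: the provider functions stand in for IHPCT hardware counters, MPI profiling, and an external module such as Scalasca, and every name and value in it (`MetricRegistry`, `hw_counters`, `mpi_profile`, `external_module`, the metric keys) is hypothetical rather than part of any of these tools' actual interfaces.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]
Provider = Callable[[], Metrics]


class MetricRegistry:
    """Collects metrics from pluggable providers under '<source>.<name>' keys (hypothetical)."""

    def __init__(self) -> None:
        self._providers: Dict[str, Provider] = {}

    def register(self, source: str, provider: Provider) -> None:
        # A new data source (e.g., a third-party tool) plugs in here.
        if source in self._providers:
            raise ValueError(f"source already registered: {source}")
        self._providers[source] = provider

    def collect(self) -> Metrics:
        # Merge all providers into one namespace, prefixing each metric
        # with its source so that names from different tools cannot collide.
        merged: Metrics = {}
        for source, provider in self._providers.items():
            for name, value in provider().items():
                merged[f"{source}.{name}"] = value
        return merged


# Hypothetical stand-ins for real data sources; the values are made up.
def hw_counters() -> Metrics:
    return {"cycles": 2.0e9, "instructions": 1.0e9}


def mpi_profile() -> Metrics:
    return {"comm_time": 12.0, "wall_time": 48.0}


def external_module() -> Metrics:
    return {"wait_state_time": 3.0}


if __name__ == "__main__":
    registry = MetricRegistry()
    registry.register("hpm", hw_counters)
    registry.register("mpi", mpi_profile)
    registry.register("scalasca", external_module)
    m = registry.collect()
    cpi = m["hpm.cycles"] / m["hpm.instructions"]
    # A derived metric that combines values from two different sources.
    wait_share = m["scalasca.wait_state_time"] / m["mpi.comm_time"]
    print(f"CPI={cpi:.2f}, wait share of communication={wait_share:.2f}")
```

Prefixing each metric with its source keeps the built-in metrics and externally contributed ones in a single flat namespace, which is one simple way to let a rule reference metrics from several tools at once.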

The HPCS toolkit differs from traditional performance tools. It aims to bridge the productivity gap between hardware complexity and software limitations in current and next-generation systems, and it allows users at any level of experience to conduct performance analysis and tuning of scientific applications. It tries to encode solved problems (i.e., problems that a user has already identified and solved) so that it can detect and solve them in other applications. We recognize that the effectiveness of the framework depends on the number and quality of the bottleneck rules and solutions in its database. The framework is therefore designed to be open and extensible, offering common utilities and services that help expert users expand the bottleneck rules and solutions. We are currently working with the Barcelona Supercomputing Center and the University of Oregon to extend the metrics and bottleneck rules using data taken directly from their performance tools.