IBM High Performance Computing Toolkit - overview
The IBM High Performance Computing Toolkit is a suite of performance-related tools and libraries to assist in application tuning. The toolkit is an integrated environment for performance analysis of sequential and parallel applications that use the MPI and OpenMP paradigms. Scientists can collect rich performance data from selected parts of an execution, digest the data at a very high level, and plan improvements within a single unified interface. It provides a common framework for IBM's mid-range server offerings, including pSeries and eSeries servers and Blue Gene systems, on both AIX and Linux. An IBM Redbook covering the High Performance Computing Toolkit for Blue Gene/P is also available for download.
The main characteristics of our framework are:
- It offers an integrated environment for simultaneous investigation of a number of aspects of performance (e.g. hardware counter data, threads, message passing, I/O, memory).
- It requires no source code modifications. The binary instrumentation facility that we have developed rewrites the application executable and inserts the requested calls to the instrumentation libraries.
- It is able to collect different aspects of performance data in a single run, thereby significantly reducing the time needed to profile the application.
- Performance data can be mapped back to individual source code statements.
- It allows the user to select the granularity of data collection. For instance, it is possible to gather information at the function boundary level and later drill down into the details of selected functions.
All the features above are available within the same peekperf GUI. The peekperf GUI is the control center of our framework. The entire performance tuning process, from instrumentation to execution and analysis of data, can be conducted from here.
The dimensions of performance data provided in our toolkit are:
Hardware performance counters, including measurements for cache misses at all levels of cache, number of floating point instructions executed, number of load instructions resulting in TLB misses, and other measurements that are supported by the hardware. These measurements help the algorithm designer or developer identify and eliminate performance bottlenecks. The hardware performance counter tools allow you to run individual tasks of an MPI application with a different group of hardware counters monitored in each MPI task, so that you can obtain measurements for more than one hardware counter group within a single execution. These tools also allow you to summarize or aggregate hardware performance counter measurements from a set of MPI tasks, using plug-ins provided with the IBM HPC Toolkit or provided by you. On AIX, you can multiplex (time slice) multiple hardware counter groups in a single task, allowing you to count hardware performance counter events from multiple groups in the same application process.
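The aggregation of measurements across MPI tasks can be pictured with a small sketch. This is not the toolkit's plug-in interface: the function name `aggregate_counters`, the data layout, and the counter names are all hypothetical, chosen only to illustrate combining per-task counts when different tasks monitor different counter groups.

```python
from collections import defaultdict


def aggregate_counters(per_task_counts):
    """Combine per-task hardware counter readings (illustrative only).

    per_task_counts maps an MPI task rank to {counter_name: count}.
    Tasks may monitor different counter groups, so a counter can be
    absent from some tasks. Returns {counter_name: (total, average, n_tasks)},
    where the average is taken over the tasks that reported the counter.
    """
    totals = defaultdict(int)
    seen = defaultdict(int)
    for counts in per_task_counts.values():
        for name, value in counts.items():
            totals[name] += value
            seen[name] += 1
    return {name: (totals[name], totals[name] / seen[name], seen[name])
            for name in totals}


# Hypothetical readings: ranks 0 and 1 monitor one counter group,
# ranks 2 and 3 monitor another.
sample = {
    0: {"L1_DCACHE_MISS": 1200, "FP_INSTR": 50000},
    1: {"L1_DCACHE_MISS": 1000, "FP_INSTR": 70000},
    2: {"TLB_MISS": 30, "LOAD_INSTR": 90000},
    3: {"TLB_MISS": 50, "LOAD_INSTR": 110000},
}
summary = aggregate_counters(sample)
for name in sorted(summary):
    total, average, n_tasks = summary[name]
    print(f"{name}: total={total} average={average:.1f} tasks={n_tasks}")
```

Averaging only over the tasks that actually reported a counter keeps the result meaningful when each task monitors a different group.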
MPI profiling/tracing, where you can generate a trace of MPI calls in your application so you can observe communication patterns and match MPI calls to your source code. MPI profiling also obtains performance metrics, including the time spent in each MPI function and MPI message sizes.
OpenMP profiling, where you can obtain information about time spent in OpenMP constructs in your program, information about overhead in OpenMP constructs, and information about how workload is balanced across OpenMP threads in your application.
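One common way to quantify how well work is balanced across threads is to compare the slowest thread with the average. The sketch below assumes you already have per-thread times for a parallel region; the function name `load_imbalance` and the sample numbers are hypothetical, and the metric shown is a generic one, not necessarily what the toolkit reports.

```python
def load_imbalance(thread_times):
    """Return (slowest, mean, imbalance) for per-thread busy times.

    imbalance = slowest / mean - 1, so 0.0 means perfectly balanced
    work and larger values mean one thread carries more of the load.
    """
    if not thread_times:
        raise ValueError("need at least one thread time")
    slowest = max(thread_times)
    mean = sum(thread_times) / len(thread_times)
    if mean == 0:
        return slowest, mean, 0.0
    return slowest, mean, slowest / mean - 1.0


# Hypothetical seconds spent by four threads inside one parallel region.
times = [2.0, 2.0, 2.0, 6.0]
slowest, mean, imbalance = load_imbalance(times)
print(f"slowest={slowest:.1f}s mean={mean:.1f}s imbalance={imbalance:.2f}")
```

Because a parallel region finishes only when its slowest thread does, a high imbalance value points to time that the other threads spend waiting.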
Application I/O profiling, where you can obtain information about I/O calls made in your application to help you understand application I/O performance and identify possible I/O performance problems. The toolkit's MIO (Modular I/O) library can also act on what it observes. For example, when an application reads large files sequentially, MIO detects the behavior and invokes its asynchronous prefetching module to prefetch user data. Experiments with the AIX JFS file system demonstrate a significant improvement in system throughput when using MIO.
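The idea of detecting sequential reads and prefetching ahead can be sketched as follows. This is an assumption-laden illustration, not MIO's actual algorithm: the function names, the `min_run` threshold, and the `window` size are all hypothetical, and the sketch only computes which byte ranges a prefetcher might request next.

```python
def trailing_sequential_run(reads):
    """Count how many of the most recent reads form one contiguous run.

    reads is a list of (offset, length) tuples in the order issued.
    A read is contiguous when it starts where the previous one ended.
    """
    if not reads:
        return 0
    run = 1
    for i in range(len(reads) - 1, 0, -1):
        prev_offset, prev_length = reads[i - 1]
        offset, _ = reads[i]
        if offset != prev_offset + prev_length:
            break
        run += 1
    return run


def next_prefetch(reads, min_run=4, window=2):
    """Suggest (offset, length) ranges to prefetch after the last read.

    Returns an empty list unless the last min_run reads were sequential.
    """
    if trailing_sequential_run(reads) < min_run:
        return []
    last_offset, last_length = reads[-1]
    start = last_offset + last_length
    return [(start + i * last_length, last_length) for i in range(window)]


# Hypothetical trace: four back-to-back 4 KiB reads from the start of a file.
trace = [(0, 4096), (4096, 4096), (8192, 4096), (12288, 4096)]
print(next_prefetch(trace))  # [(16384, 4096), (20480, 4096)]
```

Checking only the trailing run means a burst of random accesses immediately stops further prefetch suggestions, which avoids fetching data the application is unlikely to use.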
Xprofiler/Application Profiling, where you can identify functions in your application where the most time is being spent, or where the amount of time spent is different from your expectations. This information is presented in a graphical display that helps you better understand the relationships between functions in your application. With the GUI, it is very easy to find the application's performance-critical areas.