Featured Story Corner
Integrated Performance Monitoring (IPM) at SDSC
—Nick Wright, SDSC & David Skinner, NERSC
As high-performance computing resources become larger and more complicated, using them to their full potential for scientific research also becomes increasingly challenging. One way of helping to overcome these challenges is to provide analysis tools that give users easy-to-digest information about how their code is behaving, enabling them to make improvements that realize the code's full performance potential.
Many approaches are available to a user who wishes to understand how their code is performing, ranging from simple benchmarking all the way to line-by-line source code tracing. IPM fills the needs of users who want to take the simple benchmarking approach. Its design is minimalist: it requires no recompilation, operates with low overhead, and is portable across different architectures. This makes it very useful to users who want a basic understanding of how their code is behaving but lack the time or inclination to learn a sophisticated analysis tool, because their focus has to be, understandably, on generating results of scientific merit.
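IPM keeps its overhead low by intercepting MPI calls through the standard profiling interface and accumulating a small, fixed-size summary table rather than a full trace. The sketch below illustrates that idea in Python with entirely hypothetical names (`mpi_send` is a stand-in, not a real API): a function is interposed without touching the caller's code, and each call is folded into a table keyed by an event signature, so memory use stays constant no matter how long the program runs.

```python
import time
from collections import defaultdict

# Hypothetical stand-in for an MPI send; the name is illustrative only.
def mpi_send(buf, dest):
    pass  # real communication would happen here

# Fixed-size profile: (call name, message size) -> [count, total seconds].
# This mirrors the hashed-summary idea: O(1) memory per distinct event
# signature, regardless of run length.
profile = defaultdict(lambda: [0, 0.0])

def wrap(fn):
    def wrapper(buf, dest):
        t0 = time.perf_counter()
        result = fn(buf, dest)
        entry = profile[(fn.__name__, len(buf))]
        entry[0] += 1
        entry[1] += time.perf_counter() - t0
        return result
    return wrapper

# Interpose without recompiling or editing the "application" code.
mpi_send = wrap(mpi_send)

for _ in range(3):
    mpi_send(b"x" * 1024, dest=1)

count, seconds = profile[("mpi_send", 1024)]
print(count)  # three calls recorded under one event signature
```

The same interposition trick is what lets a profiling layer be linked in (or preloaded) without any source changes.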
IPM provides a summary of the computation and communication in a parallel program in the areas outlined below. Click on the graphs and tables to see enlarged images of IPM Web page output.
Using this information, a user can determine whether their code is running at adequate speed, examine its communication patterns, and see how well the load is balanced. Providing a performance profile in an easy-to-use and scalable manner is the core goal of IPM. Again, all of this comes from a single tool with no recompilation required.
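From a per-task summary of this kind, simple whole-run health metrics fall out directly. The sketch below (with made-up numbers, not IPM output) computes two such metrics: the fraction of aggregate time spent in communication, and a common load-balance ratio of mean to maximum compute time, where 1.0 means perfectly balanced.

```python
# Illustrative per-task summary of the kind a profile reports:
# wall time and MPI time per task (fabricated numbers).
wall = [10.0, 10.0, 10.0, 10.0]
mpi  = [ 1.0,  1.0,  4.0,  1.0]   # task 2 spends far longer communicating

# Fraction of all task time spent in MPI.
comm_fraction = sum(mpi) / sum(wall)

# Load-balance ratio: mean compute time over max compute time.
compute = [w - m for w, m in zip(wall, mpi)]
balance = (sum(compute) / len(compute)) / max(compute)

print(round(comm_fraction, 3), round(balance, 3))
```

A balance ratio well below 1.0, as here, is exactly the signal that sends a user looking at per-task views like the one discussed next.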
Computing centers such as NERSC at LBL have made heavy use of IPM in gathering performance profiles for workload analysis and performance improvement. The 2004 and 2005 DOE INCITE projects relied on IPM to quickly test for and improve the scalable performance of their massively parallel codes. The performance gains from a chemistry code are summarized below.
How does this kind of performance testing and improvement happen? Diagnosing a performance bottleneck is not always easy, particularly at high concurrency, and no tool will reliably improve parallel performance automatically. IPM aims to provide a performance profile as a starting point for application or runtime-environment changes that improve performance; exposing bottlenecks is often the first concrete outcome of such profiling. One performance view that IPM provides is a wide range of metrics indexed by the tasks in the parallel code. This perspective often makes the diagnosis of load imbalance and other performance "gotchas" transparent. The figure below shows a clear-cut case of load imbalance:
In the above example IPM serves largely to quantitatively diagnose a performance bottleneck due to load imbalance. Of the 1024 tasks, 64 arrive late to a synchronizing MPI_Alltoall collective. Over the full execution of the code, each of the remaining 960 tasks waits 6.25 seconds on those slow 64 tasks. By adjusting the domain decomposition to level the work, the performance discontinuity disappears and roughly 6000 seconds of aggregate task time are recovered as a result.
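The savings quoted above follow directly from the figures in the profile; a few lines make the arithmetic explicit:

```python
tasks = 1024
late = 64
waiters = tasks - late   # 960 tasks blocked at the MPI_Alltoall
wait_each = 6.25         # seconds each waiter loses over the full run

aggregate_wait = waiters * wait_each
print(aggregate_wait)    # 6000.0 seconds of task time spent waiting
```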
Another common class of optimizations for parallel codes concerns message sizes. IPM provides several views of which message sizes occupy the largest fraction of wall-clock time. Through judicious changes to the application's configuration and runtime settings, the message-size distribution can often be shifted away from small, latency-bound messages into a bandwidth-driven regime. The example below shows a 64-way PARATEC NSF benchmark (run on seaborg.nersc.gov) before and after such an optimization. Each graph shows a cumulative inventory of wall time spent within each MPI call across buffer sizes. In the "Before" scenario, waiting in MPI_Wait for small, latency-bound messages dominates the communication time. After performance tuning, the red curve has diminished, as has the total communication time. Here an overall performance improvement of just over a factor of two in wall-clock time was obtained.
In this case the performance improvement, which moves the message-size distribution to the right, is not itself performed by IPM. But the profile, with its thorough accounting of communication and computation, often points to the next steps to undertake in achieving scalable performance.
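Why does shifting the distribution to the right help? A common first-order cost model for a point-to-point message is t(n) = latency + n / bandwidth, so many small messages pay the fixed latency over and over while a few large ones amortize it. The sketch below uses illustrative constants (not measured values for any machine) to show the effect of sending the same total volume in fewer, larger messages:

```python
# First-order message cost model: t(n) = latency + n / bandwidth.
# Constants are illustrative only, not measurements.
latency = 20e-6      # 20 microseconds of fixed per-message cost
bandwidth = 300e6    # 300 MB/s sustained transfer rate

def cost(nbytes, nmsgs):
    """Total time to send nmsgs messages of nbytes each."""
    return nmsgs * (latency + nbytes / bandwidth)

# Same 10 MB total volume, two ways:
before = cost(1_000, 10_000)   # 10000 messages of 1 KB (latency-bound)
after = cost(100_000, 100)     # 100 messages of 100 KB (bandwidth-driven)

print(round(before, 4), round(after, 4))
```

Under this model the aggregated case is several times faster, which is the same qualitative shift the PARATEC tuning achieved.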
This article illustrates how IPM produces information that most HPC users can interpret without considerable effort. There are many other facets to application and workload profiling via IPM that have not been examined here; the authors are actively pursuing HPC research topics that include communication topology and detailed profiling of application flow. More information about using IPM at SDSC is available on the User Services Web site.