SDSC Thread Graphic Issue 6, June 2006


User Services Director:
Anke Kamrath

Subhashini Sivagnanam

Graphics Designer:
Diana Diehl

Application Designer:
Fariba Fana

Featured Story Corner

Integrated Performance Monitoring (IPM) at SDSC

—Nick Wright, SDSC & David Skinner, NERSC

As high-performance computing resources become larger and more complicated, using them to their full potential for scientific research also becomes increasingly challenging. One way of helping to overcome these challenges is to provide analysis tools that give users easy-to-digest information about how their code is behaving, enabling them to make improvements that realize their code's full performance potential.

Many approaches are available to a user who wishes to understand how their code is performing, ranging from simple benchmarking all the way to line-by-line source code tracing. IPM is a tool that fills the needs of users wishing to take the simple benchmarking approach. Its approach is a minimalist one: it requires no recompilation and is designed to operate with low overhead. Additionally, it is portable across different architectures. This makes it very useful to users who are trying to obtain a basic understanding of how their code is behaving but do not have the time or the inclination to learn a sophisticated analysis tool, because their focus has to be, understandably, on generating results of scientific merit.

IPM provides a summary of the computation and communication in a parallel program in the areas outlined below. The graphs and tables referenced are examples of IPM's Web page output.

  • The job, its memory usage, and the speed it ran at (Gflop/s)

    [screen shot of memory usage and job speed output of IPM program]

  • MPI calls

    [screen shot of MPI call information as pie chart]

    [screen shot of MPI call information as data table]

  • HPM: PAPI or PMAPI performance events. These tell a user about the single-processor performance of their code: the flops, the number of cache misses, the number of CPU cycles, etc.

    [screen shot of HPM performance event data from IPM program]

  • Load balance, both in terms of MPI calls and in terms of performance events

    [screen shot of communication balance graph]

    [screen shot of load balance graph (HPM counters)]
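A typical IPM workflow looks something like the sketch below. The module name, library flag, and output file name are assumptions here and vary by site; on most installations IPM is enabled by relinking against (or preloading) its library, and the HTML report is generated afterwards from the XML profile the run leaves behind.

```shell
# Hedged sketch of an IPM workflow; exact module names, paths and
# flags differ between systems -- consult the SDSC user documentation.
module load ipm                   # make the IPM library available
mpicc -o mycode mycode.c -lipm    # relink against IPM (no source changes)
mpirun -np 64 ./mycode            # run as usual; IPM adds little overhead
ipm_parse -html mycode.ipm.xml    # turn the profile into the Web pages shown above
```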

Using this information a user can determine whether their code is running at adequate speed, examine its communication patterns, and see how well the load is balanced. Providing a performance profile in an easy-to-use and scalable manner is the core goal of IPM. Again, all of this is obtained from a single tool with no recompilation required.
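As a small illustration of the load-balance idea (not IPM's actual output format), one metric that can be computed from per-task timings like those IPM reports is the ratio of the slowest task's time to the mean; the data below are synthetic.

```python
# Toy example: per-task compute times (seconds) for a 4-task run.
# A perfectly balanced code has max/mean == 1.0; larger values mean
# some tasks carry more work than others and the rest sit waiting.
task_times = [10.0, 10.0, 10.0, 14.0]

mean_t = sum(task_times) / len(task_times)   # 11.0
imbalance = max(task_times) / mean_t

print(round(imbalance, 2))  # → 1.27
```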

Case Studies:

Computing centers such as NERSC at LBL have made heavy use of IPM in gathering performance profiles for workload analysis and performance improvement. The 2004 and 2005 DOE INCITE projects relied on IPM to quickly test for and improve the scalable performance of their massively parallel codes. The performance gains from a chemistry code are summarized below.

[screen shot of performance gains from chemistry code]

How does this kind of performance test and improvement happen? Diagnosing a performance bottleneck is not always easy, particularly at high concurrency. There is no tool that will reliably improve parallel performance automatically. IPM aims to provide a performance profile as a starting point for application or runtime environment changes to improve performance. Exposing bottlenecks is often the first concrete outcome of such profiling. One performance view that IPM provides is a wide range of metrics indexed by the tasks in the parallel code. This perspective often makes the diagnosis of load imbalance and other performance "gotchas" transparent. The figure below is a clear-cut case of load imbalance:

[screen shot of load imbalance]

In the above example IPM serves largely to quantitatively diagnose a performance bottleneck due to load imbalance. Of the 1024 tasks, 64 arrive late to a synchronizing MPI_Alltoall collective. Over the full execution of the code, the remaining 960 tasks each wait 6.25 seconds on those slow 64 tasks. By adjusting the domain decomposition to level the work, the performance discontinuity disappears and 960 × 6.25 = 6000 seconds of aggregate wall time are recovered as a result.
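The accounting in that example can be sketched in a few lines; the numbers are the ones quoted above, with the per-task wait time treated as uniform across the 960 on-time tasks.

```python
# Load-imbalance accounting for the 1024-task MPI_Alltoall example:
# 64 "slow" tasks arrive late, so each of the other tasks spends
# 6.25 s blocked in the collective over the full run.
n_tasks = 1024
n_slow = 64
wait_per_task = 6.25  # seconds each on-time task spends waiting

n_waiting = n_tasks - n_slow              # 960 on-time tasks
wasted_wall_time = n_waiting * wait_per_task

print(n_waiting, wasted_wall_time)  # → 960 6000.0
```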

Another common set of optimizations of parallel codes has to do with message sizes. IPM provides several views of which message sizes occupy the largest fraction of wall clock time. By judicious changes to the configuration of the application and runtime settings, the message size distribution can often be shifted away from small, latency-bound messages into a bandwidth-driven regime. The example below shows a 64-way PARATEC NSF benchmark before and after such optimization. Each graph shows a cumulative inventory of wall time expended within each MPI call across buffer sizes. In the Before scenario, "MPI_Wait-ing" for small, latency-bound messages dominates the communication time. After performance tuning, the red curve has diminished, as has the total communication time. Here an overall performance improvement of just over a factor of two in wall clock time was obtained.

[screen shot of performance before optimization]
[screen shot of performance after optimization]

In this case, the performance improvement, which moves the message size distribution to the right, is not itself provided by IPM. The profile, which includes a thorough accounting of communication and computation, often points to the next steps to undertake in achieving scalable performance.
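The "cumulative inventory of wall time across buffer sizes" that those graphs plot can be mimicked with a short sketch. The samples below are synthetic stand-ins for the per-call timing data IPM records; only the bookkeeping is shown.

```python
from collections import defaultdict
from itertools import accumulate

# Synthetic (seconds, buffer_bytes) samples for one MPI call,
# standing in for the per-call data an IPM profile contains.
samples = [(0.5, 64), (1.5, 64), (0.25, 1024), (0.75, 65536)]

# Total time spent at each buffer size...
by_size = defaultdict(float)
for t, size in samples:
    by_size[size] += t

# ...then a running total over increasing buffer size, which is what
# each curve in the Before/After graphs plots. A curve dominated by
# its small-size end indicates latency-bound communication.
sizes = sorted(by_size)
cumulative = list(accumulate(by_size[s] for s in sizes))

print(list(zip(sizes, cumulative)))  # → [(64, 2.0), (1024, 2.25), (65536, 3.0)]
```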

This article illustrates how IPM produces information that is easily interpreted by most HPC users without considerable effort. There are many other facets to application and workload profiling via IPM that have not been examined here, but the authors are actively pursuing HPC research topics that include communication topology and detailed profiling of application flow. More information about using IPM at SDSC is available on the User Services Web site.

IPM was developed by David Skinner at NERSC, Michael L. Welcome at LBL, and Nick Wright at SDSC.

Did you know ..?

that SDSC has limited the core file size to 32 MB?
To make good use of this limit, it is recommended to use the MP_COREFILE_FORMAT environment variable (or its associated command-line flag, -corefile_format) to set the format of core files to lightweight core files.
See the Thread article in Issue 4, February 2006 for more details.
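For example, a setting along these lines requests lightweight core files from IBM POE jobs. The value shown is an assumed base name for the output files; consult the IBM Parallel Environment documentation for the full set of accepted values.

```shell
# Sketch: ask POE for lightweight core files instead of full dumps.
# "light_core" is a hypothetical base name for the files produced.
export MP_COREFILE_FORMAT=light_core
# or, equivalently, on the poe command line:
#   poe ./mycode -corefile_format light_core
```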