[an error occurred while processing this directive]  

Scientific Computing at NPACI (SCAN)

Single processor optimization of AMBER for the CRAY T3E and IBM SP

AMBER is a widely used package of codes commonly employed to study problems in biochemistry. In contrast to quantum chemistry programs such as Gaussian and GAMESS, AMBER uses classical methods to perform energy minimization and dynamics simulations. This classical approach makes AMBER ideal for investigating problems such as protein folding and ligand binding. For a more detailed description of AMBER and its capabilities, see the AMBER home page.

While the current release of AMBER consists of roughly 65 programs, it is the SANDER module which performs the costly molecular dynamics simulations and accounts for the bulk of the AMBER's CPU usage. Not only is AMBER the most heavily used application on SDSC's CRAY T3E, but it receives widespread use from researchers in the biochemistry and phameceutical communities throughout the world. For these reasons, there was a strong motivation to work on the optimization of AMBER.

In recent years, two efforts were made to boost the perfomance of AMBER. Yong Duan and Lu Wang, working with Peter Kollman at UCSF, had previously focused on improving the parallel scaling of AMBER, allowing runs to be done efficiently on large numbers of processors. The work described in this SCAN article, by contrast, focuses on optimizing the single processor performance of AMBER. Complete details of code modifications will be included with the next AMBER release.

IDENTIFYING COMPUTATIONAL HOTSPOTS

AMBER runs on nearly all major computational platforms, with portability achieved through the extensive use of preprocessor directives. Optimization efforts first targeted the CRAY T3E. The CRAY T3E processors, with their small 8 KB primary data cache, are probably the most difficult to optimize for, and it was assumed that cache-level optimizations performed for the T3E would be applicable to other machines with larger data caches.

The SANDER module of AMBER was intially profiled using both Apprentice and PAT. We ultimately settled on using PAT, since it introduces less overhead, is easier to use, and does not require recompilation of code. While Apprentice tends to do a reasonable job of subtracting out its own overhead before reporting timing results, it interferes with the numerous timing routines that had already been manually inserted into the source. Throughout the optimization of AMBER, the code was periodically reprofiled to determine which routines should be the next focus of attention. Progress was measured by monitoring the CPU time for several test cases involving solvated biomolecules:

The first of these test cases stresses the routines which are responsible for evaluating water-water interactions, while the second and third emphasize the water-protein and protein-protein interactions. As expected, PAT identified the routines responsible for calculating the intermolecular forces, square roots, and periodic boundary conditions as being the most significant users of CPU time.

OPTIMIZING SQUARE ROOT OPERATIONS

Combining the inverse and square root operations

The most dramatic performance gains were realized by optimizing the square root operations. In several of the routines responsible for calculating the intermolecular forces, inverse distances between pairs of molecules need to be evaluated. In the original version of the code, this was done in a two-step process. First, the inverse-squared particle distances were calculated near the start of the subroutine as follows:


do jn=1,npr
   rw(jn) = 1.0/(xwij(1,jn)*xwij(1,jn) +
                  xwij(2,jn)*xwij(2,jn) +
                  xwij(3,jn)*xwij(3,jn))
enddo

In the force vector evaluations performed later in the subroutine, the inverse particle distances were calculated as needed:


df2 = -cgoh*sqrt(rw(jn))

Typically, the inverse square root operation can be performed considerably faster than the inverse and square root operations separately. On the CRAY T3E, for the first test case mentioned above (ala-dipeptide), this modification alone slightly more than doubled the speed of the code. Since the inverse and inverse-squared distances are both required in these subroutines, this code modification results in an extra multiplication having to be performed when the line df2 = -cgoh*sqrt(rw(jn)) is replaced by the following pair of operations:


df2 = -cgoh*rw(jn)
rw(jn) = rw(jn)*rw(jn)

Nonetheless, the savings due to combining the square root and inverse operations more than offset the cost of the extra multiplication. It should also be mentioned that it is only in the simulations involving solvated biomolecules that the square root needs to be evaluated. Additional coding was added to avoid unnecessary square root operations for the non-solvated molecules.

Taking advantage of the vector intrinsics

It turns out that the code modifications made above do more than just combine the square root and inverse operations. Using PAT to profile the original and modified codes indicated that the original code spent a significant amount of time in a routine named _SQRT_CODE, while the modified code makes a call to _SQRTINV_V. This routine is the Cray intrinsic vector inverse square root function. The f90 compiler was able to recognize that the inverse square root operations could be pulled out of the loop and evaluated using a vector version of the intrinsic routine. (In the vector version of an intrinsic routine, operations are carried out on a vector of array elements rather than one element at a time, thereby making better use of the arithmetic functional units).

We had expected that the compilers on the IBM SP would have recognized the opportunity to do the same type of optimization, but this was not the case. To get the vector intrinsic routine, the following steps had to be taken

Using preprocessor directives to maintain portability, the new code fragment is written as


#ifdef IBM_VECTOR
            DO JN=1,NPR
               RW(JN) = ( XWIJ(1,JN)*XWIJ(1,JN) +
     +              XWIJ(2,JN)*XWIJ(2,JN) +
     +              XWIJ(3,JN)*XWIJ(3,JN) )
            ENDDO
            call vrsqrt(rw,rw,npr)
#else
            DO JN=1,NPR
               RW(JN) = ONE/SQRT( XWIJ(1,JN)*XWIJ(1,JN) +
     +              XWIJ(2,JN)*XWIJ(2,JN) +
     +              XWIJ(3,JN)*XWIJ(3,JN) )
            ENDDO
#endif

and the compile line is modified to include -L/usr/local/apps/mass/lib -qdpc -lmassvp2

CACHE OPTIMIZATIONS

Loop fusion

In the original version of the subroutine responsible for calculating the water-water interactions (QIKTIP), the force vectors were calculated in six separate loops encompassing the six different possible interactions:

Fusing these six loops into two resulted in significant data reuse and hence better performance.

Restructuring of work arrays

Three separate work arrays (RW1, RW2, RW3) are passed as arguments to subroutine QIKTIP. After the loop fusion described above is performed, the corresponding elements of these three arrays are always used together. To enable better use of cache, these three arrays can be combined into a single array RWX(3,1000), where the second dimension was chosen so as to avoid the possibility of cache conflict. On the T3E, this involves avoiding dimensions very close to 1024. Similarly, the work array FWX(9,1000) was defined to replace FW(3,*). To avoid the possibility of cache conflict between arrays RWX and FWX, the two arrays were collected into a common block to fix the relative location of the two arrays in memory. Similar modifications were made in the routines which calculate protein-water and protein-protein interactions.

OPTIMIZING PERIODIC IMAGING -- ELIMINATING UNNECESSARY OPERATIONS

Periodic boundary conditions are commonly used in molecular dynamics simulations to avoid the problems associated with finite-sized domains. This introduces a minor complication in that molecules must be allowed to interact with their neighbors that lie just across the boundary. To handle this, each molecule typically will interact with the nearest periodic image of its neighbors.

To minimize computational costs, it is often assumed that molecules interact only with neighboring molecules that lie within a specified cutoff distance, thereby reducing the molecular dynamics simulations from O(N 2) to O(N).

In the original coding of AMBER, periodic imaging was always applied. By taking advantage of the fact that periodic imaging does not need to be performed for molecules that lie sufficiently far away from the edges of the computational domain, a significant number of arithmetic and logical operations can be avoided.

MAKING THE COMMON CASE FAST

Tuning for the NVT ensemble

A great deal of flexibility has been built into AMBER to allow computations based on different statistical ensembles including NVT (constant number of molecules, volume, and temperature) and NPT (constant number of molecules, pressure, and temperature). While NPT calculations are commonly used during equilibration, the NVT ensemble is used for most long calculations.

The NPT calculations require the storage of intermediate arrays used in evaluating the collision virials. By replacing references to these arrays with scalar quantities in the NVT calculations, significant speedups can be realized, since the amount of data being moved between memory and cache is greatly reduced.

Separating code blocks for single and dual cutoffs

As mentioned in the previous section, most MD simulations take advantage of the short range of intermolecular forces to reduce the problem from O(N2) to O(N). To further reduce the computational cost, multiple cutoffs are sometimes employed. For example, the forces between molecules within the first cutoff may be updated every timestep, while those between the first and second cutoffs are updated every nth timestep.

Unfortunately, the general block of code used to construct the neighbor lists for the single (common case) and dual cutoff cases is much less efficient for the single cutoff case than code that is written specifically for it. By writing separate blocks of code for the two cases, the single cutoff neighbor list construction can be made much faster.

THE IMPORTANCE OF COMPILER OPTIONS

The first step taken in optimizing AMBER for the CRAY T3E was to identify the best set of compiler flags. Although the makefiles distributed with AMBER did include flags to turn on optimization, they did not perform all possible optimizations. Replacing the -Oscalar3 flag with the combination -Oscalar3 -Ounroll2 -Opipeline2 -Ovector3, in addition to loading the fast vector math library, -lmfastv, had very little effect and actually caused the code to run slower for certain problems. Interestingly though, these compiler options enhanced the code changes described above. A summary of speedups obtained for four different cases is given below:

Speedup relative to original code compiled with original flags
Non-setup routines, protein kinase test case, 4 CRAY T3E processors
case speedup
Original code / Original flags -
Original code / New flags 0.97
Hand tuned code / Original flags 1.48
Hand tuned code / New flags 1.76

The effect of the compiler flags on the IBM SP was much less pronounced. Using the suggested set of options for best performance, including flags which specify the particular processors (Power2) and cache properties,

-O3 -qhot -qarch=pwr2 -qtune=pwr2 -qcache=assoc=4:cost=8:level=2:line=256:type=C -qmaxmem=1 -qipa -qdcp

had virtually no effect on performance for either the original or hand-tuned codes.

FINAL RESULTS

The table below summarizes the final results obtained for the standard plastocyanine benchmark run on SDSC's CRAY T3E and IBM SP. Note that speedup is defined as the ratio of run times for the original code and the modified code for the same number of processors. Keep in mind that the diminished speedup of the modified code relative to the original code at larger number of processors is due to message passing overhead. It is our hope that integration of the optimized message passing mentioned in the Introduction and the optimizations described in this article will remedy this situation.

Results of standard plastocyanine benchmark
  IBM SP CRAY T3E
Number
Processors
Original code (s) Modified code (s) Speedup Original code (s) Modified code (s) Speedup
1 315 220 1.43 - - -
2 160 113 1.41 316 184 1.72
4 84.4 60.8 1.39 162 96.2 1.69
8 44.8 33.59 1.33 85.0 51.8 1.64
16 25.9 20.3 1.27 46.2 29.6 1.56
32 18.3 15.2 1.20 25.6 17.6 1.45
64 17.1 15.3 1.12 16.4 12.3 1.33

© 1998 Online: News about the NPACI and SDSC Community