Techniques for parallelization of ab initio codes are not new. Investigators have parallelized many aspects of the quantum mechanical algorithmic procedures, including the two-electron repulsion and gradient integrals, the integral transformation, and the MP2, configuration interaction, and coupled-cluster energies. Much of the detail of the parallelization strategies used for GAMESS can be found in the original papers. GAMESS was originally parallelized using TCGMSG (Theoretical Chemistry Group Message Passing), a message-passing toolkit developed by the theoretical chemistry group at Argonne National Laboratory. This package implements a distributed memory MIMD programming model on distributed memory hardware, as well as on some shared memory multiprocessor systems.
Since the original parallel implementation via TCGMSG, several modifications have been made in order to implement GAMESS on other multiprocessor machines. To run on Alpha clusters, DEC (Ireland) produced a version of TCGMSG for AXP clusters that uses 32-bit integers. To run on the CRAY/T3D platform, Martin Feyereisen wrote special code to translate TCGMSG calls to PVM. The Paragon and SP2 parallel versions of GAMESS are now also being implemented without TCGMSG, using only the native message-passing environments of these machines. Recently, we have ported GAMESS to our 256-node CRAY/T3E parallel processor using PVM.
GAMESS will currently evaluate the energy and gradient of any of the Hartree-Fock wavefunctions (RHF through GVB) in parallel. To parallelize these SCF wavefunctions, the following sections of the program were modified to run in parallel: the 1e- integrals, 1e- ECP integrals, and 2e- integrals, the matrix manipulations that set up the SCF equations, a "semiparallelization" of the matrix diagonalization, and the 1e- gradient, 1e- ECP gradient, and 2e- gradient. Once the energy and gradient run in parallel, many other run types are effectively parallelized as well, for example geometry searches and numerical Hessians.
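The essential pattern behind parallelizing these steps can be sketched schematically: each node computes its share of the integral batches and contributes a partial result, and the partial results are combined by a global sum. The sketch below is an illustrative Python model of that idea, assuming a simple round-robin (static) work distribution; it is not GAMESS's actual FORTRAN code or its real distribution scheme.

```python
# Schematic model of a parallel integral-driven step: each node evaluates
# only its round-robin share of the integral batches and forms a partial
# sum; a global sum (stand-in for a message-passing reduce) combines them.

def node_partial_sum(node, nprocs, batches):
    """Sum only the batches assigned to this node (round-robin split)."""
    return sum(b for i, b in enumerate(batches) if i % nprocs == node)

def global_sum(nprocs, batches):
    """Combine all nodes' partial results, as a reduce operation would."""
    return sum(node_partial_sum(n, nprocs, batches) for n in range(nprocs))

batches = [0.5, 1.25, -0.75, 2.0, 0.25]   # toy integral contributions
print(global_sum(3, batches) == sum(batches))  # distributed sum == serial sum
```

Because every batch is assigned to exactly one node, the distributed result reproduces the serial one regardless of the node count; the real code follows the same invariant for the Fock matrix and gradient accumulations.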
For illustration of performance, we have chosen the cyclophane HSi[(CH2)3]3C6H3 (Figure 1: sicage.pict), since comparison data are available for several other parallel processors. Although this cyclophane has C3 symmetry, the test data were deliberately obtained by running in C1 symmetry to simulate an asymmetric case. With the 6-31G(d) basis set, the calculation involves 288 basis functions. Reported here are the total time (wall-clock time), the speedup ratio s(p) = t(1)/t(p), and the efficiency of the parallelization, s(p)/p * 100%, where p is the number of processors and t(p) is the wall-clock time on p processors, for the molecules under consideration.
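These two measures can be computed directly from the wall-clock times. The timing values in the short sketch below are illustrative placeholders, not the measured GAMESS benchmark numbers:

```python
# Speedup and efficiency as defined in the text:
#   s(p) = t(1) / t(p)          (speedup on p processors)
#   e(p) = s(p) / p * 100%      (parallel efficiency)

def speedup(t1, tp):
    return t1 / tp

def efficiency(t1, tp, p):
    return 100.0 * speedup(t1, tp) / p

t1 = 1000.0   # hypothetical wall-clock time on 1 processor (seconds)
for p, tp in [(8, 130.0), (64, 25.0)]:
    print(p, round(speedup(t1, tp), 2), round(efficiency(t1, tp, p), 1))
# prints: 8 7.69 96.2
#         64 40.0 62.5
```

Note that a rising speedup curve is compatible with falling efficiency, which is exactly the trade-off examined in the benchmark discussion below.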
Figure 2 shows a log plot of a partial task distribution for the silicon cage molecule. In general, the bulk of the computational cost lies in the RHF step and in the calculation of the 2-electron gradients. Both of these steps scale nearly linearly until the amount of work per node becomes small, at which point one sees a leveling off in efficiency. For the cyclophane this happens relatively quickly (~64 nodes), but for larger molecular systems the nearly linear scaling extends out to many more nodes.
It is interesting to examine the balance between speedup and the efficiency of the run. The crossover point between the speedup and efficiency curves gives a general guideline for the largest number of nodes on which a calculation can still be run efficiently. Figure 3 shows this compromise between speedup and efficiency for the silicon cage benchmark. At 64 nodes one still sees over 50% efficiency and a continued increase in speedup; at 96 nodes the two curves are nearly identical, with the efficiency dwindling to around 45%.
A number of larger test cases have been calculated to test the bounds of the parallel environments. The calculations span a variety of basis sets, from 320 to 1100 basis functions, and a variety of node counts (96-192). In particular, a series of buckminsterfullerene fragments has been studied using the DZV(2d,p) basis set. Figure 4 illustrates one of the largest calculations in this set, C50H10, which involves 1100 basis functions and was run on 192 nodes of the T3E.
The following figure compares five parallel platforms for the calculation of the silicon cage cyclophane. The T3D and Paragon platforms behave similarly. The workstation-based parallel platforms show very good performance for GAMESS in general: a DEC/AXP node is at least 3 times, and an SP2 node at least 7 times, faster than a Paragon node. These machines in principle have much better I/O capability than a Paragon, since the scratch disks can be connected directly to each node. A comparison of the T3D with the T3E over many molecular systems shows that the T3E is about 6 times faster than the T3D. These calculations again show the importance of targeting the right size and number of nodes for a particular calculation: too few nodes provide too little aggregate memory, causing a significant decrease in efficiency due to paging, while too many nodes degrade the efficiency due to Amdahl's Law.
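Amdahl's Law quantifies why adding nodes eventually degrades efficiency: if a fraction f of the computation is inherently serial, the speedup on p processors is bounded by 1/(f + (1-f)/p). A minimal sketch, with an illustrative serial fraction rather than one measured for GAMESS:

```python
# Amdahl's Law: with serial fraction f, the achievable speedup on
# p processors is bounded by 1 / (f + (1 - f)/p).
# The serial fraction used here is an illustrative assumption.

def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

f = 0.01   # hypothetical 1% serial work
for p in (16, 64, 256):
    print(p, round(amdahl_speedup(f, p), 1))
# prints: 16 13.9
#         64 39.3
#         256 72.1
print(round(1.0 / f, 1))  # asymptotic limit as p grows: 100.0
```

Even 1% of serial work caps the speedup at 100 no matter how many nodes are added, which is why the efficiency curves in the benchmarks flatten once the per-node work becomes small.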
Analysis of Results
With larger calculations come larger data sets to analyze, and visual analysis of such data sets becomes crucial to interpreting the results. Examples include the energy-minimized structure of a molecule, reaction path trajectories, 3-D molecular electrostatic potential maps, and 3-D molecular orbital data. To address the individual molecular quantum chemical modeling needs of chemists, SDSC principal scientist Kim Baldridge and staff scientist Jerry Greenberg created a molecular visual analysis software package, QMView, that accepts the input and output data formats used by a variety of calculation programs. QMView runs on a variety of workstations: the original version was written specifically for the SGI platform using the IRIS GL™ graphics library, while a newer version, written with the OpenGL™ application programming interface and the GLUT utility library, runs on any UNIX platform that has an OpenGL server. QMView can either connect directly with GAMESS running on a parallel computer via a socket connection or read output files created by GAMESS. A QMView library is also available to link with a FORTRAN program running on a remote supercomputer.
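A direct socket link of this kind can be sketched in miniature as follows. The message format, function names, and use of an ephemeral port below are illustrative assumptions, not QMView's actual protocol:

```python
# Hypothetical sketch of a direct socket link, in the spirit of QMView
# receiving data from a running GAMESS job.  The payload format and all
# names here are illustrative assumptions, not QMView's real protocol.
import socket
import threading

def serve_geometry(sock, payload):
    """Stand-in for the compute side: accept one viewer and send data."""
    conn, _ = sock.accept()
    with conn:
        conn.sendall(payload)

# Compute side: listen on any free port (a real tool would use a fixed one).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

payload = b"C 0.000 0.000 0.000\n"   # one toy Cartesian coordinate line
t = threading.Thread(target=serve_geometry, args=(server, payload))
t.start()

# Viewer side: connect and read the geometry update.
client = socket.create_connection(("127.0.0.1", port))
data = client.recv(1024)
client.close()
t.join()
server.close()
print(data.decode().strip())  # prints "C 0.000 0.000 0.000"
```

The appeal of the socket route over file-based exchange is that the viewer can display intermediate geometries while the remote calculation is still running, rather than waiting for output files to be written.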