Jeffrey Mandell(1)
mandell@sdsc.edu
Victoria A. Roberts(2)
vickie@scripps.edu
Michael E. Pique(2)
mp@scripps.edu
(1) San Diego Supercomputer Center
Computational Center for Macromolecular Structure
P.O. Box 85608
San Diego, CA 92186-9784
(2) The Scripps Research Institute
Department of Molecular Biology
10666 North Torrey Pines Road
La Jolla, CA 92037
The purpose of the molecular interaction program DOT (Daughter of Turnip) is rapid computation of the electrostatic potential energy between two proteins or other charged molecules. DOT exhaustively tests all six degrees of freedom, rotational and translational, and produces a grid of approximate interaction energies and orientations. It is able to do this because the problem is cast as the convolution of the potential field of the first molecule and any rotated charge distribution of the second. The algorithm lends itself to both parallelization and vectorization, permitting huge increases in computational speed over other methods for obtaining the same information. For example, a complete mapping of interactions between plastocyanin and cytochrome c was done in eight minutes using 256 nodes of an Intel Paragon. DOT is expected to be particularly useful as a rapid screen to find configurations for more detailed study using exact energy models.
Detection and analysis of the interactions of biological macromolecules is a key problem in molecular biology. Many complex cellular regulatory systems are controlled through such interactions. The subtle web of chemical processes that constitutes cellular metabolism depends on the interactions of macromolecules. Knowledge of the most stable relative orientations between interacting macromolecules is essential in understanding how proteins associate with one another. Knowledge of specific protein-to-protein interactions may be applied to protein subunit aggregation (for the study, e.g., of supramolecular structures), to computer-aided drug design, and to fundamental problems of cellular signaling and expression.
Most methods for modeling macromolecular interactions depend on energy minimization, molecular dynamics, free energy simulations, and related methods to find energy minima. DOT takes a different approach. The objective of DOT is to perform a rapid calculation that will immediately identify all of the "hot spots"--the relatively small number of configurations that are notably more favorable than others. The resulting list of possibilities can then be examined in greater detail by more comprehensive (and computationally expensive) methods.
DOT achieves its objective by efficiently enumerating all of the possibilities, thereby reducing the minimization problem to one of scanning the results. DOT exhaustively tests all six degrees of freedom, rotational and translational. The energies calculated by DOT include electrostatic interaction energies and van der Waals collisions, and DOT can also count the number of atoms that are in contact between the two molecules. The result of the calculation is a grid of approximate interaction energies and orientations. Since DOT is restricted to rigid molecules, calculates energies only on a regular grid in Cartesian space, and makes some approximations in the potential energy calculations, the favorable points found by DOT are not exact, but should be further analyzed by energy minimizations, molecular dynamics, or other related techniques. DOT does not currently take solvation energy into account, and thus it is most successful at present where the molecular interaction is known to be dominated by electrostatic forces. (Methods are presented in this paper that enable the calculation of buried surface area, which can be used to approximate solvation effects.) In most situations, however, DOT will quickly accomplish what have been among the most rate-limiting steps in the typical search of configuration space.
DOT (Daughter of Turnip) is based on TURNIP [1], a program developed by one of us (VR) for the study of electrostatically interacting proteins. TURNIP did not scale well to larger problems, evaluated only a simple Coulombic electrostatic model, and did not consider geometric fit apart from constraining the closest distance between protein molecules. DOT addresses these issues. The electrostatic model used by DOT requires the calculation of a potential field once during the calculation, but only once: the use of full Poisson-Boltzmann models, therefore, does not significantly affect the cost of the computation. The method for detection of overlap between two molecules used by DOT can also count the number of atoms close to contact, giving a direct estimate of the buried surface area.
DOT drastically reduces the computational cost of the problem. It has been written to support parallel computation in the PVM environment [2] on systems ranging from heterogeneous workstation networks through a DEC Alpha farm to an Intel Paragon. In the test reported here, DOT completely mapped the interactions between plastocyanin and cytochrome c in about eight minutes on 256 nodes of the Intel Paragon at the San Diego Supercomputer Center. Plastocyanin was held still and cytochrome c was compared in 2.6 billion different configurations. A similar calculation on a Convex C2 minisupercomputer took 30 hours to make far fewer energy evaluations -- by a factor of 780.
DOT achieves its computational speed by casting the problem in terms of a molecule moving inside the potential field of another molecule. The first molecule is held still while the second is moved through all six degrees of freedom on a fine mesh. This permits evaluation of the energies as correlations, which can in turn be evaluated using fast Fourier transforms (FFTs). Harrison et al. [3] have shown how potentials and energies can be computed using the convolution theorem. DOT does not currently use these methods for calculation of potentials, but does use them for evaluation of energies.
The electrostatic energy of a system consisting of a charge distribution Q(r) in a potential field V(r) is given by
If the charge distribution is rotated by an angle theta and displaced from the origin by a vector r-nought, the energy of the system is given by
Mathematically this is the correlation of the potential field V(r) and the rotated charge distribution Q(r).
The treatment of geometric fit is analogous to the treatment of electrostatics. Let G(r) be a step-function potential such that
where M is a number greater than the number of atoms in molecule 2, "inside" is defined with reference to the solvent-accessible surface of molecule 1, and the "surface layer" consists of grid points within an atomic radius or so of the solvent-accessible surface. Let A(r) be the set of delta functions at atomic centers of molecule 2. The fit between molecule 1 and molecule 2, translated and rotated as above, is
where j is the number of atoms from molecule 2 that are inside molecule 1 (which cannot exceed M), and k is the number of atoms from molecule 2 which are in surface contact with molecule 1. This formulation of the overlap computation is also a correlation. This method for computing overlap is essentially identical with that described by Katchalski-Katzir et al. [4] and expanded by Vakser and Aflalo [5]. Harrison et al. [3] describe an alternative method for evaluating the non-bonded potentials as convolutions, which is probably more accurate but which requires separate potential functions for the attractive and repulsive van der Waals terms. This significantly increases the storage requirements, and does not seem necessary for our present purposes. We plan to use the Harrison model in further, more detailed work in the future.
Correlations can be calculated very efficiently through the use of the Convolution Theorem. Given two functions, X(r) and Y(r), and their Fourier transforms
(where K is a normalization constant dependent on the dimensionality of the problem), the convolution of the functions can be calculated as
Correlation and convolution are very closely related. If the function Y(r) is inverted through the origin, the convolution becomes a correlation. This is very easy to do with Fourier transforms, since
where the asterisk denotes the complex conjugate. The desired correlation is thus
If the functions X and Y are defined on a grid of N points, the correlation calculation has the discrete analog
Evaluating the correlation directly from the definition for N points requires N x N multiplications. Evaluating the correlation using FFTs costs three FFTs, each proportional to NlogN, and N multiplications, so the computational cost is proportional to N(3 log N + 1).
The cost can be further reduced by noting that where multiple correlations are required and only one of the functions changes, one of the FFTs can be omitted from all but the first calculation. This reduces the cost multiplier from 3 to 2.
The algorithm DOT employs lends itself to both parallelization and vectorization, which permits huge increases in computational speed. Not only can orientations be farmed out to various processors, but also the convolutions themselves can be easily vectorized. The ratio of computational costs between TURNIP and DOT contains complex scale factors because the cost of TURNIP depends on the number of atoms while the cost of DOT depends on the number of grid points. The number of grid points required is, however, proportional to the number of atoms. TURNIP thus scales as N*N and DOT scales as N log N. The speed increase gives DOT the potential to be used interactively, and a graphical interface is under development that will permit such use across many platforms.
DOT operates in a master/slave mode. The Fourier transforms of the potential grid and the van der Waals exclusion mask of the stationary molecule are precalculated. The Fourier transforms are distributed to the slave processors, along with the coordinates of the moving molecule. Each slave processor is then given a list of Eulerian angles to apply. The slave processors each rotate their copy of the moving molecule and place the charges of the rotated atoms on a grid. The Fourier-transform of the charge grid is calculated, multiplied by the Fourier transform of the electrostatic potential, and inverse-transformed. The result is the electrostatic energy of interaction through space for the given rotation. A second convolution is performed using the rotated atoms; this convolution gives the steric information. For each slave processor the answers are accumulated on another grid. For each point in the grid the best answer that is not excluded by steric collision is saved, along with the orientation that produced it.
Load balancing is achieved by giving each slave a relatively small number of orientations to process at a time. When a slave finishes its list, it notifies the master, which records the corresponding orientations as complete and gives the slave more work. When no more work remains to be done, the slave is ordered to report its results back to the master. If two slaves are eligible to report at the same time, they are ordered to merge their results first. The effect of this strategy is cost proportional to log N for gathering of results in a homogeneous situation, as on the Intel Paragon or the DEC Alpha farm. This method also provides effective load balancing and merging for heterogeneous workstations connected by an Ethernet. An added benefit of this approach is that the master processor can time the slaves to detect relative speeds and hung processors. The system is thus robust against failures of slaves with little increase in overhead for the homogeneous situation. It is not robust against failure of the master processor.
The amount of data communication required is low except at the beginning and the end of the computation, at which points large arrays must be communicated. The communication costs are dependent on the message passing architecture. The amount of data to be passed among processors at both beginning and end of the calculation is 2MN, where M is the number of processors and N is the number of grid points. When message passing can be done in parallel, this can be accomplished in log M time. The initial data is invariant for all processors, which means that a broadcast function can be used if such a function is available. DOT uses broadcast on the Intel Paragon for a significant saving in startup time. DOT implements a log M method in gathering the results on all platforms.
DOT implements a tcl/tk (X-window-based) graphical user interface that makes all manipulable objects and information accessible. A data interface translates external input data into DOT format. The computational module manages the molecular surface generation and evaluation of the grid. Grid potential generation will be incorporated in a future release; it is presently generated using either the University of Houston's UHBD [6] code or Biosym Technologies' DelPhi [7,8] code, both of which employ comprehensive Poisson-Boltzmann calculations. The number of data points generated in the calculation is so large that at present only a list of most favorable orientations and interaction energies for each position in space can be stored and maintained for subsequent visualization.
DOT requires a potential file (joules per coulomb) and a molecular volume file for the stationary molecule (molecule 1), and an atomic coordinate/charge file for the moving molecule (molecule 2). DOT currently supports a .grd input format for the potential, obtainable from DelPhi; corrections must be made to match DelPhi and DOT grid sizes by shifting the grid center appropriately. Molecular volume files are generated by vol, a module based on the program Surf [9] by Amitah Varshney of the University of North Carolina. Surf generates a triangulated description of a molecular surface extremely rapidly. In vol the code goes on to generate a list of all gridpoints inside the molecular surface, with user-defined resolution, and a list of coordinates in Cartesian form is output for use by DOT.
The atomic coordinate information for molecule 2 can be supplied in a variety of file formats. DOT will support .dot, .car, and .atm formats. The .dot format is x y z c, where x y z are the spatial coordinates of the atom and c is the partial charge. The .car format is that used by software from Biosym Technologies, Inc. A Protein Data Bank (.pdb) file does not contain charge information, so a program like Insight (Biosym Technologies) must be used to generate a .car or .atm file containing the charge information.
The grid is a 3D lattice of points with user-defined resolution on which the potential in joules/coulomb of the stationary molecule (molecule 1) is carried (as generated at present by UHBD [6] or DelPhi [7,8]).
The three rotational degrees of freedom are defined relative to the geometric center of the moving molecule and the three translational degrees about the geometric center of the stationary molecule. The convolution algorithm expedites the translational component of the calculation. Instead of moving molecule 2 through all defined 3D space for each orientation, the convolution computes the 3 translational degrees of freedom by multiplying the Fourier transforms of their values together. This multiplication step is easily vectorized for substantial increases in speed, while each orientation can be parallelized, as explained above.
The program then screens for overlap by an implementation of the step-function potential (G(r)) method described above. The energies associated with overlap conditions are labeled invalid.
The results for each orientation are merged locally. Each processor maintains a list of the best valid energies and orientations at each grid point. At present only one value is retained but additional values are potentially useful in examining the correlation of energy and orientations at neighboring grid points.
As mentioned earlier, it is impractical to store all the data points generated for each orientation. A list of the best energies (user-defined cutoff) for a given orientation is maintained for visualization. A DOT module called colvol displays 3D scalar data coded such that the most favorable interaction energies are color-coded at the blue end of the spectrum and the less favorable energies run toward the red. A DOT module called vis displays the surface generated by Surf in vol.
TestThe visualizations of our preliminary studies were produced using AVS. The problem is the same one found in ref. [1], the interaction of plastocyanin and cytochrome c.
As explained in detail in ref. [1], the plastocyanin/cytochrome c interaction is faster than expected from random collision, but too slow for study by molecular dynamics techniques. Significant electrostatic potentials are concentrated on one-fourth (969 square Ångstroms) of the plastocyanin surface, with the greatest negative potential centered on the tyrosine 83 hydroxyl within the acidic patch, and on one-eighth (632 square Ångstroms) of the cytochrome c surface, with the greatest positive potential centered near the exposed heme edge. Coherent electrostatic fields occur only over these regions, suggesting that local, rather than global, charge complementarity controls productive recognition. Using TURNIP, Roberts et al. found that the three energetically favored families of precollision orientations all directed the positive region surrounding the heme edge of cytochrome c toward the acidic patch of plastocyanin but differed in heme plane orientation. Analysis of electrostatic fields, electrostatic energies of precollision orientations (with 12- and 6-Ångstrom separation distances), and surface topographies suggested that the favored orientations should converge to productive complexes promoting a single electron-transfer pathway from the cytochrome c heme edge to tyrosine 83 of plastocyanin. Direct interactions of the exposed Cu ligand in plastocyanin with the cytochrome c heme edge were not unfavorable sterically or electrostatically, but the analysis suggested they should occur no faster than randomly, indicating that such an interaction is not the primary pathway for electron transfer.
The TURNIP calculations took about 30 hours on a Convex C2 at the Scripps Research Institute, but the number of interactions studied was smaller by a factor of 780 than the 2.6 billion energy evaluations made with DOT in just 8.7 minutes on the SDSC Intel Paragon.
Visualizing the results of 2.6 billion energy evaluations is a demanding task. It is simple enough to apply volume visualization techniques to the merged grid, but there is much more information present than the simple energy minima. Coherence of orientation at nearby minima is also important. AVS (Advanced Visual Systems, Inc.) was used to display the results of the calculations. AVS allows composition of visualizations which include both the scalar information from the numerical grids and the geometric information of the molecular structures.
Figure 1 shows two representations of the proteins analyzed by DOT. Plastocyanin (the stationary molecule) is on the left, cytochrome c on the right. The lower portion of the image shows the alpha carbon backbone of each protein and gives a good feel for the proteins' general shapes. The top part of the image shows a potential isosurface for each protein, displaying charge characteristics, with red negative and blue positive. Plastocyanin has a large negative lobe centered about tyrosine 83 (the pale red ring) and cytochrome c has a large positive lobe. The problem is to determine the exact relative orientation the two molecules have when docked.
DOT calculated the energies of about 2.6 billion different configurations of cytochrome c positioned about plastocyanin (the stationary molecule). Figure 2 shows plastocyanin (white tubes with gold sphere for the bound copper ion) surrounded by spheres representing centers of the 2,000 most favorable configurations of cytochrome c found by DOT. In this energy evaluation, only the configuration with the best energy for each grid point was saved. A blue sphere indicates a more favorable energy (larger negative number) than a red one. A single cluster of spheres is seen, agreeing with the results from TURNIP, in which cytochrome c was found to approach the face of plastocyanin that contains the tyrosine 83 side chain (red ring). The volume of space (a 128 Ångstrom cube) surveyed around the plastocyanin molecule is shown in Figure 2b.
Figure 3 shows the full backbone representation of the cytochrome c molecule in the configuration found as the second most favorable energy. DOT retains the rotation of the moving molecule, as well as the energy, so that the complete molecular complex can be rapidly regenerated for viewing and analysis.
In Figure 4, the spheres are colored by energy as in Figure 2, but the size of the spheres is scaled by the similarity of orientation at neighboring grid points. This gives a visual correlation of the relationship between two distinct properties of the data. Here, similar orientations are concentrated in the middle of the cluster of most favorable energies, indicating the importance of electrostatic steering in this system.
Figure 5 gives the performance on a Digital Equipment Corporation Alpha Farm at the San Diego Supercomputer Center. This machine contains nine DEC 3000/400 systems each of which has a performance of 70.8 Mflops on the 64-bit Linpack 1000x1000 benchmark and 26.4 Mflops on the corresponding 100x100 benchmark. Figure 5a shows the real time speedup; Figure 5b plots the same data on a log-log scale to show the linearity of time and processors. The speed is perfectly linear in the number of processors up to the size of the machine.
Figure 6 shows the performance on the Intel Paragon XP/S. This system contains 400 GP nodes, each of which performs at 34 Mflops on the Linpack 1000x1000 benchmark and 10 Mflops on the Linpack 100x100 benchmark. Only 256 of the 400 GP nodes have 32MB of memory, which is preferred for running DOT. The larger number of nodes finally showed where the scalability of the calculation starts to break down. The speedup is completely linear up to 100 processors. At 256 processors each node has only 40 orientations to compute, so the time is dominated by the communications cost of merging the answers. However, the whole calculation takes only 8.7 minutes.
This research was supported by grant BIR 9223760 from the National Science Foundation and used the facilities of the San Diego Supercomputer Center. Merry Maisel, Gina Caputo, and Gail Bamber of SDSC supplied major assistance in the preparation of this article.
[1] Roberts, V.A., H.C. Freeman, A.J. Olson, J.A. Tainer, E.D. Getzoff (1991): Electrostatic orientation of the electron-transfer complex between plastocyanin and cytochrome c, Journal of Biological Chemistry, vol. 266, p. 13431.
[2] Geist, G.A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V.S. Sunderam (1994): PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge. Additional information can be found at the PVM Home Page (URL is http://www.epm.ornl.gov/pvm/pvm_home.html).
[3] Harrison, R.W., I.V. Kourinov, L.C. Andrews (1994): Fourier-Green's function and the rapid evaluation of molecular potentials, Protein Engineering, vol. 7, p. 359.
[4] Katchalski-Katzir, E., I. Shariv, M. Eisenstein, et al. (1992): Molecular surface recognition--Determination of geometric fit between proteins and their ligands by correlation techniques, Proceedings of the National Academy of Sciences vol. 89, p. 2195.
[5] Vakser, I.A., C. Aflalo (1994): Hydrophobic Docking: A proposed enhancement to molecular recognition techniques, PROTEINS: Structure, Function, and Genetics, vol. 20, p. 320.
[6] Gilson, M.K., M.E. Davis, B.A. Luty, J.A. McCammon (1993): Computation of electrostatic forces on solvated molecules using the Poisson-Boltzmann equation, Journal of Physical Chemistry, vol. 97, p. 3591.
[7] Gilson, M.K., K. Sharp, B. Honig (1987): Calculating the electrostatic potential of molecules in solution, Journal of Computational Chemistry, vol. 9, p. 327.
[8] Gilson, M.K., B. Honig (1988): Energetics of charge-charge interactions in proteins, Proteins, vol. 3, p.32.
[9] Varshney, A., F.P. Brooks, W.V. Wright (1994): Computing smooth molecular surfaces, IEEE Computer Graphics and Applications, v. 14, p. 19.