teneyckl@sdsc.edu

**Jeffrey Mandell**(1)

mandell@sdsc.edu

**Victoria A. Roberts**(2)

vickie@scripps.edu

**Michael E. Pique**(2)

mp@scripps.edu

(1) San Diego Supercomputer Center

Computational Center for Macromolecular Structure

P.O. Box 85608

San Diego, CA 92186-9784

(2) The Scripps Research Institute

Department of Molecular Biology

10666 North Torrey Pines Road

La Jolla, CA 92037

The purpose of the molecular interaction program DOT (Daughter of Turnip) is
rapid computation of the electrostatic potential energy between two proteins or other charged molecules. DOT exhaustively tests
all six degrees of freedom, rotational and translational, and produces a grid
of approximate interaction energies and orientations. It is able to do this
because the problem is cast as the convolution of the potential field of the
first molecule and any rotated charge distribution of the second. The
algorithm lends itself to both parallelization and vectorization, permitting
huge increases in computational speed over other methods for obtaining
the same information. For example, a complete mapping of interactions between
plastocyanin and cytochrome *c* was done in eight minutes using 256 nodes of an
Intel Paragon. DOT is expected to be particularly useful as a rapid screen to find configurations for more detailed study using exact energy models.

Copyright 1995 by the Association for Computing Machinery, Inc. (ACM).

Detection and analysis of the interactions of biological macromolecules is a key problem in molecular biology. Many complex cellular regulatory systems are controlled through such interactions. The subtle web of chemical processes that constitutes cellular metabolism depends on the interactions of macromolecules. Knowledge of the most stable relative orientations between interacting macromolecules is essential in understanding how proteins associate with one another. Knowledge of specific protein-to-protein interactions may be applied to protein subunit aggregation (for the study, e.g., of supramolecular structures), to computer-aided drug design, and to fundamental problems of cellular signaling and expression.

Most methods for modeling macromolecular interactions depend on energy minimization, molecular dynamics, free energy simulations, and related methods to find energy minima. DOT takes a different approach. The objective of DOT is to perform a rapid calculation that will immediately identify all of the "hot spots"--the relatively small number of configurations that are notably more favorable than others. The resulting list of possibilities can then be examined in greater detail by more comprehensive (and computationally expensive) methods.

DOT achieves its objective by efficiently enumerating all of the
possibilities, thereby reducing the minimization problem to one of scanning the
results.** **DOT exhaustively tests all six degrees of freedom, rotational
and translational. The energies calculated by DOT include electrostatic
interaction energies and van der Waals collisions, and DOT can also count the
number of atoms that are in contact between the two molecules.** **The
result of the calculation is a grid of approximate interaction energies and
orientations. Since DOT is restricted to rigid molecules, calculates energies
only on a regular grid in Cartesian space, and makes some approximations in the
potential energy calculations, the favorable points found by DOT are not exact,
but should be further analyzed by energy minimizations, molecular dynamics, or
other related techniques. DOT does not currently take solvation energy into
account, and thus it is most successful at present where the molecular
interaction is known to be dominated by electrostatic forces. (Methods are
presented in this paper that enable the calculation of buried surface area,
which can be used to approximate solvation effects.) In most situations,
however, DOT will quickly accomplish what have been among the most
rate-limiting steps in the typical search of configuration space.

DOT (Daughter of Turnip) is based on TURNIP [1], a program developed by one of us (VR) for the study of electrostatically interacting proteins. TURNIP did not scale well to larger problems, evaluated only a simple Coulombic electrostatic model, and did not consider geometric fit apart from constraining the closest distance between protein molecules. DOT addresses these issues. The electrostatic model used by DOT requires the calculation of a potential field once during the calculation, but only once: the use of full Poisson-Boltzmann models, therefore, does not significantly affect the cost of the computation. The method for detection of overlap between two molecules used by DOT can also count the number of atoms close to contact, giving a direct estimate of the buried surface area.

DOT drastically reduces the computational cost of the problem. It has been
written to support parallel computation in the PVM environment
[2] on systems
ranging from heterogeneous workstation networks through a DEC Alpha farm to an
Intel Paragon. In the test reported here, DOT completely mapped the interactions between plastocyanin and cytochrome *c* in about eight minutes on 256 nodes of the Intel Paragon at the San Diego Supercomputer Center. Plastocyanin was held still and cytochrome *c* was compared in 2.6 billion different configurations. A similar calculation on a Convex C2 minisupercomputer took 30 hours to make far fewer energy evaluations -- by a factor of 780.

DOT achieves its computational speed by casting the problem in terms of a
molecule moving inside the potential field of another molecule. The first
molecule is held still while the second is moved through all six degrees of
freedom on a fine mesh. This permits evaluation of the energies
as *correlations,* which can in turn be evaluated using fast Fourier
transforms (FFTs). Harrison *et al.* [3] have shown how potentials and energies can be computed using the convolution theorem. DOT does not currently
use these methods for calculation of potentials, but does use them for evaluation of energies.

The electrostatic energy of a system consisting of a charge distribution
*Q*(**r**) in a potential field *V*(**r**) is given by

If the charge distribution is rotated by an angle theta and displaced from the
origin by a vector **r**-nought, the energy of the system is given by

Mathematically this is the correlation of the potential field *V*(**r**) and the rotated charge distribution *Q*(**r**).

The treatment of geometric fit is analogous to the treatment of
electrostatics. Let *G*(**r**) be a step-function potential such that

where *M *is a number greater than the number of atoms in molecule 2,
"inside" is defined with reference to the solvent-accessible surface of
molecule 1, and the "surface layer" consists of grid points within an atomic
radius or so of the solvent-accessible surface. Let *A*(**r**) be the set of
delta functions at atomic centers of molecule 2. The fit between molecule 1
and molecule 2, translated and rotated as above, is

where *j *is the number of atoms from molecule 2 that are inside molecule
1 (which cannot exceed *M*), and *k* is the number of atoms from
molecule 2 which are in surface contact with molecule 1. This formulation
of the overlap computation is also
a correlation. This method for computing overlap is essentially identical with
that described by Katchalski-Katzir et al. [4] and expanded by Vakser and Aflalo [5].
Harrison et al. [3] describe an alternative method for evaluating the non-bonded potentials as convolutions, which is probably more accurate but which requires separate potential functions for the attractive and repulsive van der Waals terms. This significantly increases the storage requirements, and does not seem necessary for our present purposes. We plan to use the Harrison model in further, more detailed work in the future.

Correlations can be calculated very efficiently through the use of the
Convolution Theorem. Given two functions, *X*(**r**) and *Y*(**r**), and
their Fourier transforms

(where *K* is a normalization constant dependent on the dimensionality of
the problem), the convolution of the functions can be calculated as

Correlation and convolution are very closely related. If the function *Y*(**r**) is inverted through the origin, the convolution becomes a correlation. This is very easy to do with Fourier transforms, since

where the asterisk denotes the complex conjugate. The desired correlation is thus

If the functions *X* and *Y* are defined on a grid of *N*
points, the correlation calculation has the discrete analog

Evaluating the correlation directly from the definition for *N* points
requires *N* x *N *multiplications. Evaluating the correlation using
FFTs costs three FFTs, each proportional to *N*log*N*, and *N*
multiplications, so the computational cost is proportional to
*N(*3 log *N* + 1*).*

The cost can be further reduced by noting that where multiple correlations are required and only one of the functions changes, one of the FFTs can be omitted from all but the first calculation. This reduces the cost multiplier from 3 to 2.

The algorithm DOT employs lends itself to both parallelization and
vectorization, which permits huge increases in computational speed. Not only
can orientations be farmed out to various processors, but also the convolutions
themselves can be easily vectorized. The ratio of computational costs between
TURNIP and DOT contains complex scale factors because the cost of TURNIP
depends on the number of atoms while the cost of DOT depends on the number of grid
points. The number of grid points required is, however, proportional to the
number of atoms. TURNIP thus scales as *N*N* and DOT scales as *N* log *N*. The speed increase gives DOT the potential to be used interactively, and a graphical interface is under development that will permit such use across
many platforms.

DOT operates in a master/slave mode. The Fourier transforms of the potential grid and the van der Waals exclusion mask of the stationary molecule are precalculated. The Fourier transforms are distributed to the slave processors, along with the coordinates of the moving molecule. Each slave processor is then given a list of Eulerian angles to apply. The slave processors each rotate their copy of the moving molecule and place the charges of the rotated atoms on a grid. The Fourier-transform of the charge grid is calculated, multiplied by the Fourier transform of the electrostatic potential, and inverse-transformed. The result is the electrostatic energy of interaction through space for the given rotation. A second convolution is performed using the rotated atoms; this convolution gives the steric information. For each slave processor the answers are accumulated on another grid. For each point in the grid the best answer that is not excluded by steric collision is saved, along with the orientation that produced it.

Load balancing is achieved by giving each slave a relatively small
number of orientations to process at a time. When a slave finishes its list,
it notifies the master, which records the corresponding orientations as
complete and gives the slave more work. When no more work remains to be done,
the slave is ordered to report its results back to the master. If two slaves
are eligible to report at the same time, they are ordered to merge their results
first. The effect of this strategy is cost proportional to log *N* for gathering of results in a homogeneous situation, as on the Intel Paragon or the DEC Alpha farm. This method also provides effective load balancing and merging for heterogeneous workstations connected by an Ethernet. An added benefit of this
approach is that the master processor can time the slaves to detect relative
speeds and hung processors. The system is thus robust against failures of
slaves with little increase in overhead for the homogeneous situation. It is not robust against failure of the master processor.

The amount of data communication required is low except at the beginning
and the end of the computation, at which points large arrays must be
communicated. The communication costs are dependent on the message passing architecture. The amount of data to be passed among processors at both beginning and end of the calculation is *2MN,* where *M* is the number of processors and *N* is the number of grid points. When message passing can be done in parallel, this can be accomplished in log *M* time. The initial data is invariant for all processors, which means that a broadcast function can be used if such a function is
available. DOT uses broadcast on the Intel Paragon for a significant saving in startup time. DOT implements a log *M* method in gathering the results on all platforms.

DOT implements a tcl/tk (X-window-based) graphical user interface that makes all manipulable objects and information accessible. A data interface translates external input data into DOT format. The computational module manages the molecular surface generation and evaluation of the grid. Grid potential generation will be incorporated in a future release; it is presently generated using either the University of Houston's UHBD [6] code or Biosym Technologies' DelPhi [7,8] code, both of which employ comprehensive Poisson-Boltzmann calculations. The number of data points generated in the calculation is so large that at present only a list of most favorable orientations and interaction energies for each position in space can be stored and maintained for subsequent visualization.

DOT requires a potential file (joules per coulomb) and a molecular volume file
for the stationary molecule (molecule 1), and an atomic coordinate/charge file
for the moving molecule (molecule 2). DOT currently supports a *.grd* input
format for the potential, obtainable from *DelPhi*; corrections must be made to
match *DelPhi* and DOT grid sizes by shifting the grid center appropriately.
Molecular volume files are generated by *vol,* a module based on the program
*Surf* [9] by Amitah Varshney of the University of North Carolina. *Surf*
generates a triangulated description of a molecular surface extremely rapidly.
In *vol* the code goes on to generate a list of all gridpoints inside the
molecular surface, with user-defined resolution, and a list of coordinates in Cartesian form is output for use by DOT.

The atomic coordinate information for molecule 2 can be supplied in a variety of
file formats. DOT will support *.dot, .car,* and *.atm* formats. The *.dot* format
is *x y z c,* where *x y z* are the spatial coordinates of the atom and *c* is the partial charge. The *.car* format is that used by software from Biosym Technologies, Inc. A Protein Data Bank (*.pdb*) file does not contain charge information, so a program like Insight (Biosym Technologies) must be used to generate a *.car* or *.atm* file containing the charge information.

The grid is a 3D lattice of points with user-defined resolution on which the potential in joules/coulomb of the stationary molecule (molecule 1) is carried (as generated at present by UHBD [6] or DelPhi [7,8]).

The three rotational degrees of freedom are defined relative to the geometric center of the moving molecule and the three translational degrees about the geometric center of the stationary molecule. The convolution algorithm expedites the translational component of the calculation. Instead of moving molecule 2 through all defined 3D space for each orientation, the convolution computes the 3 translational degrees of freedom by multiplying the Fourier transforms of their values together. This multiplication step is easily vectorized for substantial increases in speed, while each orientation can be parallelized, as explained above.

The program then screens for overlap by an implementation of the step-function
potential (*G*(**r**)) method described above. The energies associated with
overlap conditions are labeled invalid.

The results for each orientation are merged locally. Each processor maintains a list of the best valid energies and orientations at each grid point. At present only one value is retained but additional values are potentially useful in examining the correlation of energy and orientations at neighboring grid points.

As mentioned earlier, it is impractical to store all the data points generated
for each orientation. A list of the best energies (user-defined cutoff) for a
given orientation is maintained for visualization. A DOT module called *colvol*
displays 3D scalar data coded such that the most favorable interaction
energies are color-coded at the blue end of the spectrum and the less favorable
energies run toward the red. A DOT module called *vis* displays the surface
generated by *Surf* in *vol*.

` `

Test
The visualizations of our preliminary studies were produced using AVS. The problem is the same one found in ref. [1], the interaction of plastocyanin and cytochrome *c*.

As explained in detail in ref. [1], the plastocyanin/cytochrome *c* interaction is
faster than expected from random collision, but too slow for study by molecular dynamics techniques. Significant electrostatic potentials are concentrated on
one-fourth (969 square Ångstroms) of the plastocyanin surface, with the
greatest negative potential centered on the tyrosine 83 hydroxyl within the acidic
patch, and on one-eighth (632 square Ångstroms) of the cytochrome *c* surface,
with the greatest positive potential centered near the exposed heme edge.
Coherent electrostatic fields occur only over these regions, suggesting that
local, rather than global, charge complementarity controls productive
recognition. Using TURNIP, Roberts et al. found that the three energetically
favored families of precollision orientations all directed the positive region
surrounding the heme edge of cytochrome *c* toward the acidic patch of
plastocyanin but differed in heme plane orientation. Analysis of electrostatic
fields, electrostatic energies of precollision orientations (with 12- and
6-Ångstrom separation distances), and surface topographies suggested that the
favored orientations should converge to productive complexes promoting a single
electron-transfer pathway from the cytochrome *c* heme edge to tyrosine 83 of
plastocyanin. Direct interactions of the exposed Cu ligand in plastocyanin with
the cytochrome *c* heme edge were not unfavorable sterically or
electrostatically, but the analysis suggested they should occur no faster than
randomly, indicating that such an interaction is not the primary pathway for
electron transfer.

The TURNIP calculations took about 30 hours on a Convex C2 at the Scripps Research Institute, but the number of interactions studied was smaller by a factor of 780 than the 2.6 billion energy evaluations made with DOT in just 8.7 minutes on the SDSC Intel Paragon.

Visualizing the results of 2.6 billion energy evaluations is a demanding task. It is simple enough to apply volume visualization techniques to the merged grid, but there is much more information present than the simple energy minima. Coherence of orientation at nearby minima is also important. AVS (Advanced Visual Systems, Inc.) was used to display the results of the calculations. AVS allows composition of visualizations which include both the scalar information from the numerical grids and the geometric information of the molecular structures.

Figure 1 shows two representations of the proteins analyzed by DOT.
Plastocyanin (the stationary molecule) is on the left, cytochrome *c* on the
right. The lower portion of the image shows the alpha carbon backbone of each
protein and gives a good feel for the proteins' general shapes. The top part
of the image shows a potential isosurface for each protein, displaying charge
characteristics, with red negative and blue positive. Plastocyanin
has a large negative lobe centered about tyrosine 83 (the pale red ring) and cytochrome *c* has a large positive lobe. The
problem is to determine the exact relative orientation the two molecules have
when docked.

DOT calculated the energies of about 2.6 billion different configurations of cytochrome *c* positioned about plastocyanin (the stationary molecule).
Figure 2 shows plastocyanin (white tubes with gold sphere for the bound copper ion) surrounded by spheres representing centers of the 2,000 most favorable configurations of cytochrome *c* found by DOT.
In this energy evaluation, only the configuration with the best energy for each grid point was saved. A blue sphere indicates a more favorable energy (larger negative number) than a red one.
A single cluster of spheres is seen,
agreeing with the results from TURNIP, in which cytochrome *c* was found to approach the face of plastocyanin that contains the tyrosine 83 side chain (red ring). The volume of space (a 128 Ångstrom cube) surveyed around the plastocyanin molecule is shown in Figure 2b.

Figure 3 shows the full backbone representation of the cytochrome *c* molecule in the configuration found as the second most favorable energy.
DOT retains the rotation of the moving molecule, as well as the energy, so that the complete molecular complex can be rapidly regenerated for viewing and analysis.

In Figure 4, the spheres are colored by energy as in Figure 2, but the size of the spheres is scaled by the similarity of orientation at neighboring grid points. This gives a visual correlation of the relationship between two distinct properties of the data. Here, similar orientations are concentrated in the middle of the cluster of most favorable energies, indicating the importance of electrostatic steering in this system.

Figure 5 gives the performance on a Digital Equipment Corporation Alpha Farm at the San Diego Supercomputer Center. This machine contains nine DEC 3000/400 systems each of which has a performance of 70.8 Mflops on the 64-bit Linpack 1000x1000 benchmark and 26.4 Mflops on the corresponding 100x100 benchmark. Figure 5a shows the real time speedup; Figure 5b plots the same data on a log-log scale to show the linearity of time and processors. The speed is perfectly linear in the number of processors up to the size of the machine.

Figure 6 shows the performance on the Intel Paragon XP/S. This system contains 400 GP nodes, each of which performs at 34 Mflops on the Linpack 1000x1000 benchmark and 10 Mflops on the Linpack 100x100 benchmark. Only 256 of the 400 GP nodes have 32MB of memory, which is preferred for running DOT. The larger number of nodes finally showed where the scalability of the calculation starts to break down. The speedup is completely linear up to 100 processors. At 256 processors each node has only 40 orientations to compute, so the time is dominated by the communications cost of merging the answers. However, the whole calculation takes only 8.7 minutes.

This research was supported by grant BIR 9223760 from the National Science Foundation and used the facilities of the San Diego Supercomputer Center. Merry Maisel, Gina Caputo, and Gail Bamber of SDSC supplied major assistance in the preparation of this article.

[1] Roberts, V.A., H.C. Freeman, A.J. Olson, J.A. Tainer, E.D. Getzoff (1991):
Electrostatic orientation of the electron-transfer complex between plastocyanin
and cytochrome *c*, *Journal of Biological Chemistry, *vol. 266, p.
13431.

[2] Geist, G.A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V.S. Sunderam (1994): *PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing,* MIT Press, Cambridge. Additional information can be found at the PVM Home Page (URL is http://www.epm.ornl.gov/pvm/pvm_home.html).

[3] Harrison, R.W., I.V. Kourinov, L.C. Andrews (1994): Fourier-Green's function and the rapid evaluation of molecular potentials, *Protein Engineering, *vol. 7, p. 359.

[4] Katchalski-Katzir, E., I. Shariv, M. Eisenstein, et al. (1992): Molecular
surface recognition--Determination of geometric fit between proteins and their
ligands by correlation techniques, *Proceedings of the National Academy of
Sciences *vol. 89, p. 2195.

[5] Vakser, I.A., C. Aflalo (1994): Hydrophobic Docking: A proposed enhancement to molecular recognition techniques, *PROTEINS: Structure, Function, and Genetics, *vol. 20, p. 320.

[6] Gilson, M.K., M.E. Davis, B.A. Luty, J.A. McCammon (1993): Computation of electrostatic forces on solvated molecules using the Poisson-Boltzmann equation, *Journal of Physical Chemistry, *vol. 97, p. 3591.

[7] Gilson, M.K., K. Sharp, B. Honig (1987): Calculating the electrostatic potential of molecules in solution, *Journal of Computational Chemistry, *vol. 9, p. 327.

[8] Gilson, M.K., B. Honig (1988): Energetics of charge-charge interactions in proteins, *Proteins, *vol. 3, p.32.

[9] Varshney, A., F.P. Brooks, W.V. Wright (1994): Computing smooth molecular
surfaces, *IEEE Computer Graphics and Applications,* v. 14, p. 19.