Legion of Computers
Proteins are synthesized within cells as strings of amino acid molecules arrayed along a "backbone" of carbon atoms. But just as soon as a complete protein is manufactured, it folds into a unique three-dimensional shape, and that shape determines how the protein will function. One of the leading problems in molecular biology is to achieve an understanding of protein folding. An NPACI alpha project is dedicated to exploring the how and why of protein folding by distributing large-scale protein-folding simulations across the Grid. The project recently reached a scientific and computational milestone by simulating the folding of a protein called Protein L using six different computers, some thousands of miles apart. According to project leader Charles L. Brooks, III, "We compressed a month’s worth of computing on our own systems into just 36 hours by using distributed computing."
Figure 1. Protein L
Left: Protein L unfolded. Right: Protein L folded according to CHARMM computation.
The NPACI alpha project, Protein Folding in a Distributed Computing Environment, is led by Brooks, a professor of molecular biology from The Scripps Research Institute (TSRI) in La Jolla, and Andrew Grimshaw, professor of computer science at the University of Virginia. Their collaborators include research scientist Michael Crowley of TSRI and computer scientist Anand Natrajan of Grimshaw’s group at Virginia.
Crowley worked closely with the Virginia team members to adapt the molecular dynamics program CHARMM for each platform involved in the runs. CHARMM, originally developed in the laboratory of Martin Karplus at Harvard University, is under continual development by a consortium of scientists worldwide, with oversight and coordination by Brooks, Karplus, and Bernard R. Brooks (who is Chief of the Computational Biophysics Section of NIH/NHLBI/LBC). CHARMM supplies a general simulation environment for studies of the motions and mechanics of bio-macromolecular systems.
For the distributed computing test, CHARMM was used to compute the folding free-energy landscape of Protein L, a small (62 residues) protein with 585 atoms. The available processors for each run were divided into "gangs" of 16 processors that performed tightly coupled parallel calculations, with each gang exploring approximately 200 distinct regions of conformational space. The loose folds represented in each region proceeded toward the native folded state during the computation.
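The gang arrangement described above amounts to simple integer division of a machine's available processors. The following is a minimal sketch of that arithmetic only, not the project's actual dispatch script; the function name `split_into_gangs` is hypothetical, and the processor counts in the example are used purely for illustration, since the article does not say how many processors each machine contributed to a given run.

```python
GANG_SIZE = 16  # processors per gang, as stated in the article


def split_into_gangs(available, gang_size=GANG_SIZE):
    """Return (full_gangs, leftover) for a pool of available processors.

    Each gang runs one tightly coupled parallel calculation, so only
    complete gangs are useful; leftover processors stay idle.
    """
    if available < 0 or gang_size <= 0:
        raise ValueError("available must be >= 0 and gang_size must be > 0")
    return divmod(available, gang_size)


# Illustrative pool sizes only (assumed, not the allocations actually used).
for name, processors in [("pool A", 128), ("pool B", 56), ("pool C", 32)]:
    gangs, idle = split_into_gangs(processors)
    print(f"{name}: {gangs} gangs of {GANG_SIZE}, {idle} processors left over")
```

A pool whose size is not a multiple of 16 leaves some processors unused, which is one reason a scheduler might prefer to match gang counts to machine sizes.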
"We are still analyzing the calculations," said Charles Brooks, "and we hope to confirm experimental observations made by other groups that reveal a very specific order of folding for Protein L." The way in which Protein L folds differs significantly from the way in which a very similar protein, called Protein G, is known to fold. Both proteins contain a large "alpha helix" structure laid upon several "beta strands" (relatively flat ribbons), and their amino acid sequences are nearly identical. "The fact that the sites of nucleation or condensation differ may indicate the importance of very small differences in the sequence. If so, we will be closer to a deepened understanding of the protein folding problem," Brooks said.
Legion of Computers
The key to the recent successful simulations, according to Brooks, was the use of Legion, a grid operating system developed by Grimshaw and colleagues with funding from NPACI and various agencies. "With Legion," said Grimshaw, "all the scientist needs are compiled codes for each platform that may be used, a script for dispatching jobs, and another for keeping track of the results. Legion does everything else, either through an easy-to-use Web interface or a more traditional Unix command-line interface."
Grimshaw explained that Legion manages queues, accounting, security, job submission, recovery from errors of all kinds, status reporting, and the job of returning the output to the scientist when the calculations are done. Any computing system registered in the Legion network may participate in the calculation if compiled code for it exists. Legion inventories the available resources and schedules the job to take best advantage of them.
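The eligibility rule Grimshaw describes (a registered system can take part only if compiled code exists for it) can be illustrated with a toy model. This sketch is not Legion's interface or its scheduling algorithm, neither of which the article details; the `Host` class, the platform labels, and the "most free processors first" policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str             # hypothetical host label
    platform: str         # hypothetical platform label, e.g. "ibm-sp"
    free_processors: int  # processors currently available


def eligible_hosts(hosts, compiled_platforms, needed):
    """Hosts that have a compiled binary and enough free processors."""
    return [
        h for h in hosts
        if h.platform in compiled_platforms and h.free_processors >= needed
    ]


def place_job(hosts, compiled_platforms, needed):
    """Place one job on the eligible host with the most free processors.

    Returns the chosen host's name, or None if no host qualifies.
    The greedy policy here is an assumption, not Legion's scheduler.
    """
    candidates = eligible_hosts(hosts, compiled_platforms, needed)
    if not candidates:
        return None
    best = max(candidates, key=lambda h: h.free_processors)
    best.free_processors -= needed  # reserve the processors for this job
    return best.name


# Example: binaries exist for two of the three registered platforms.
registered = [
    Host("host-1", "ibm-sp", 64),
    Host("host-2", "sun", 32),
    Host("host-3", "other", 512),  # no compiled code, so never chosen
]
print(place_job(registered, {"ibm-sp", "sun"}, needed=16))
```

In this model the largest machine is skipped because no binary was built for its platform, which mirrors the point that the scientist's main obligation is supplying compiled code for each system.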
The NPACI systems used in the project included Blue Horizon, a 1.7-teraflops IBM SP at SDSC; a 32-processor Sun Enterprise 10000 at SDSC; a 128-processor HP V2500 at Caltech; a 32-processor Centurion Alpha cluster at Virginia; and IBM systems at the universities of Michigan and Texas totaling 56 processors. "There were no Legion run-time failures," said Natrajan. "We plan to add functionality in the form of archival or mass storage systems." He credited work done by NPACI computer scientists led by Nancy Wilkins-Diehr for the successful runs. –MM