Russ Altman
Stanford University

Reagan Moore
Chaitan Baru, Phil Bourne, Jon Genetti, Michael Gribskov, Greg Johnson, Bertram Ludäscher, John Moreland, Dave Nadeau, Bernard Pailthorpe, Arcot Rajasekar, Ilya Shindyalov

Hector Garcia-Molina, Andreas Paepcke, Allison Waugh, Glenn Williams
Stanford University

Andrew Grimshaw, Katherine Holcomb
University of Virginia
Lennart Johnsson, Montgomery Pettitt
University of Houston
David States
Washington University
Roy Williams


Waugh, A., G.A. Williams, L. Wei, and R.B. Altman. 2001. Using Metacomputing Tools to Facilitate Large-Scale Analyses of Biological Databases. In Pacific Symposium on Biocomputing 2001, 360–371. Singapore: World Scientific.

Wei, L., and R.B. Altman. 1997. Recognizing Protein Binding Sites Using Statistical Descriptions of their 3D Environments. In Pacific Symposium on Biocomputing 1998, ed. R.B. Altman, A.K. Dunker, L. Hunter, and T.E. Klein, 497–508. Singapore: World Scientific.

Bioinformatics Infrastructure for Large-Scale Analyses

The amount of data in biological collections is growing at a phenomenal rate. Analytical tools are necessary to make sense of the information in databases such as the Protein Data Bank (PDB), a worldwide repository of 3-D protein structures. Such tools need to be scalable, to deal with the heterogeneous nature of the data collections; accessible, to allow researchers to easily find the data they need; and ubiquitous, to move large data sets across the grid. This is where bioinformatics steps in. NPACI alpha project co-leader Russ Altman and his group at Stanford University have developed a way to scan the entire PDB to search for a specific feature of proteins. This was the first major test of the Bioinformatics Infrastructure alpha project's capabilities.

In the last three years, the SDSC-hosted PDB has increased from 5,811 released atomic coordinate entries to 13,447 in October 2000. The growth of the PDB is typical for many other biological databases. Given the high rate at which biological data are being collected and made public, computational tools must be developed that can efficiently access and analyze these data. Altman's group has used the Legion metasystem to enable large-scale computations on the PDB. In particular, it has used the Feature program to scan all of the protein structures in the PDB in a search for unrecognized potential binding sites for cations (positively charged ions), such as calcium, an important signaling ion within the cell.

"As part of large-scale structural genomics initiatives, automated annotation tools may be invaluable for providing a first assessment of the functional capabilities of molecules," Altman said.

The Bioinformatics alpha project focuses on using data manipulation, analysis, and visualization infrastructure to integrate data from molecular structure data resources such as the PDB. Legion is then used for performing large-scale computations across the databases. Together, these components create a discovery environment that will eventually be applied to many disciplines.


Altman leads the Helix Group at the Stanford Medical Informatics laboratory, where Feature was developed. Feature works in two phases. The first phase takes examples of known binding sites within proteins for an element such as calcium, along with a set of sites that do not bind calcium. It uses these two sets to build a statistical model that describes the features that distinguish sites from non-sites. In the second phase, Feature superimposes a 3-D grid on a protein. At each grid point, the system scores the degree of match between the calcium site model and the local environment around that point. High-scoring regions are the most likely calcium-binding sites.
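
The two phases can be illustrated with a toy sketch. This is not the Feature program: it assumes each local environment is summarized by a short vector of numeric properties, models sites and non-sites with independent per-property Gaussians, and scores a grid point by a log-likelihood ratio. The function names, the `describe` descriptor, and all numbers are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

def fit_model(site_vectors, nonsite_vectors):
    """Phase 1: per-property mean and variance for sites and non-sites."""
    def stats(vectors):
        n = len(vectors)
        dims = len(vectors[0])
        means = [sum(v[d] for v in vectors) / n for d in range(dims)]
        # A small floor keeps every variance strictly positive.
        variances = [
            max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-6)
            for d in range(dims)
        ]
        return means, variances
    return stats(site_vectors), stats(nonsite_vectors)

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def score(model, vector):
    """Log-likelihood ratio: positive means 'looks more like a site'."""
    (s_mean, s_var), (n_mean, n_var) = model
    return sum(
        log_gauss(x, s_mean[d], s_var[d]) - log_gauss(x, n_mean[d], n_var[d])
        for d, x in enumerate(vector)
    )

def scan_grid(model, describe, lower, upper, spacing):
    """Phase 2: score every point of a regular 3-D grid, best first."""
    axes = []
    for lo, hi in zip(lower, upper):
        steps = int(round((hi - lo) / spacing))
        axes.append([lo + i * spacing for i in range(steps + 1)])
    hits = [(score(model, describe(p)), p) for p in product(*axes)]
    return sorted(hits, reverse=True)

# Made-up training data: two numeric properties per environment.
sites = [(6.0, 1.0), (5.5, 1.2), (6.5, 0.8)]
nonsites = [(1.0, 3.0), (0.5, 2.5), (1.5, 3.5)]
model = fit_model(sites, nonsites)

# Hypothetical descriptor: environments near (2, 2, 2) resemble a site.
def describe(point):
    dist = math.dist(point, (2.0, 2.0, 2.0))
    return (6.0 - dist, 1.0 + dist)

ranked = scan_grid(model, describe, (0.0, 0.0, 0.0), (4.0, 4.0, 4.0), 1.0)
best_score, best_point = ranked[0]
```

In this toy setup the top-ranked grid point is the one whose environment most resembles the training sites. A real scan would replace `describe` with a function that measures actual properties of the protein around each grid point.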

"We built the calcium model previously and published it on a relatively small set of proteins. Now, we can take the model and run it on the entire PDB," Altman said. "This took about seven hours on the Centurion cluster at the University of Virginia, using about 50 processors at any given moment." The high-parallelism of Legion allows complete scans of the entire PDB in less than a day, compared to several days to a week of run time necessary when making a sequential scan.


In the analysis component of the alpha project, Feature is just one of several molecular scanning and comparison algorithms applied to various collections. Legion, a metasystem developed by Andrew Grimshaw and his team at the University of Virginia, recruits the computing resources required for large-scale analyses, such as "linear scans" (which are required by Feature) through the databases and "all-versus-all" comparisons across databases (which are required when clustering structures or sequences), as well as analyses that identify the geometric and thermodynamic properties of molecules.
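
The difference in scale between the two access patterns is easy to see by counting tasks. The sketch below is illustrative only: the helper names are hypothetical, and it assumes an all-versus-all comparison visits each unordered pair of distinct entries once.

```python
from itertools import combinations

def linear_scan_tasks(entries):
    """One task per entry."""
    return [(e,) for e in entries]

def all_versus_all_tasks(entries):
    """One task per unordered pair of distinct entries."""
    return list(combinations(entries, 2))

def pair_count(n):
    """Number of unordered pairs among n entries: n * (n - 1) / 2."""
    return n * (n - 1) // 2

entries = ["A", "B", "C", "D"]
n_linear = len(linear_scan_tasks(entries))    # grows linearly with n
n_pairs = len(all_versus_all_tasks(entries))  # grows quadratically with n

# The same count for a collection the size of the October 2000 PDB.
pdb_pairs = pair_count(13447)
```

For 13,447 entries, a linear scan is 13,447 tasks, while an all-versus-all comparison under this assumption is about 90 million.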

Altman's team performed a full scan of the PDB using Legion. Feature was registered in the Legion object space, Altman said, and then used to scan for calcium-binding sites. "By doing the scan, we discovered some things regarding the PDB," Altman added. "For example, we found irregularities--that is, some calcium-binding sites that showed up on the scan were not accounted for in the PDB structures, and some calcium-binding sites that were noted in the PDB could not be confirmed."

The grid-based infrastructure that allows Feature computations using Legion was demonstrated by alpha project participant Glenn Williams of Stanford at NPACI's research exhibit at SC2000 in Dallas, Texas, November 6-9.

"I think these preliminary results begin to vindicate the alpha project approach we have adopted in NPACI, that is, focusing effort on projects that involve both applications and technologies with a common, well-defined goal," Altman said. "We needed Legion for this project to work. The Legion team came to Stanford regularly, and we were in constant contact with them--helping to debug elements of their system, while they helped us learn some novel biology. If the project consisted simply of Feature users, it would not have worked." --AV *