

ALPHA PROJECTS | Contents | Next

Bioinformatics Infrastructure for Large-Scale Analyses

Russ Altman, Stanford University,
NPACI Molecular Science thrust area leader

Reagan Moore, SDSC, NPACI Data-Intensive Computing thrust area leader


Allison Waugh, Liping Wei, Glenn Williams, Catherine Ying, Stanford University
Chaitan Baru, Phil Bourne, Arcot Rajasekar, Ilya Shindyalov, SDSC
Montgomery Pettitt, University of Houston
David States, Washington University
Fran Berman, Yannis Papakonstantinou, UC San Diego
Andrew Grimshaw, Katherine Holcomb, University of Virginia
David Hillis, University of Texas
Arthur Olson, The Scripps Research Institute

Traditionally, biology research begins with a hypothesis. A biologist then collects experimental data and analyzes them to support or disprove the hypothesis. However, information technology is changing this sequence of events. Today, large-scale exploratory experiments are gathering as much data as possible; for example, the Human Genome Project is enumerating the three billion base pairs in the human genetic blueprint. So now when a biologist forms a hypothesis, the data may already be in such a collection, just a computer search away. An NPACI alpha project led by Russ Altman of Stanford University is developing new tools and techniques to extract insights from growing biology data collections.

Alongside these collections, a new discipline has emerged, called bioinformatics. Bioinformatics researchers develop and apply computing tools to extract the secrets of the life and death of organisms from the genetic blueprints and molecular structures stored in digital collections. However, while large-scale activities are amassing vast quantities of data, the techniques for routinely analyzing these collections have not kept pace.

"Because of activities like the Human Genome Project, exploratory data collection has come to be seen as more cost effective," said Altman, associate professor of medicine in the section on Medical Informatics at Stanford University Medical Center and leader of NPACI's Molecular Science thrust area. "The collectors do not necessarily know why they are collecting. They figure that out later. This project will develop the infrastructure that will allow biologists to answer questions using these data collections."

For example, a biologist might hypothesize that a protein with certain features--such as a particular sequence or structure--must exist to carry out some task in a cell. A time-consuming approach is to devise an experimental technique to search for the hypothesized protein. However, with bioinformatics techniques, the biologist can scan a collection such as the Protein Data Bank (PDB) for proteins that meet the hypothesis criteria.








The Bioinformatics Infrastructure alpha project is making a three-pronged assault on the problem: analysis, access, and movement of data. In the analysis component, participants are developing molecular scanning and comparison algorithms for various collections, and the Legion metasystem will recruit the computing resources required for large-scale analyses.

"Legion's architecture and past work on other biological codes will allow us to make progress quickly," said Andrew Grimshaw, the leader of the Legion effort and NPACI's Metasystems thrust area leader. "We already have a framework for similar database analyses, which can easily take advantage of Legion's built-in fault tolerance."

Altman's group, including Liping Wei, Allison Waugh, and Catherine Ying, has developed a scanning code that examines a PDB entry for active sites for which models exist, such as that for calcium. NPACI support has contributed to code improvements. Glenn Williams at Stanford is now working with the Legion team at the University of Virginia to modify the code to scan the entire PDB.

"Our code takes a PDB entry and derives information about potential calcium binding sites," Altman said. "Any similar code could be used, and we want to make tools like these more readily available." The project also includes a molecular comparison algorithm for the PDB called combinatorial extension by Phil Bourne and Ilya Shindyalov at SDSC. An algorithm for phylogenetic analysis from David Hillis at the University of Texas and a genetic sequence comparison algorithm by David States at Washington University in St. Louis both run on sequence databases such as SWISS-PROT and GenBank.

To run a scanning algorithm against every record in the PDB takes time proportional to the number of PDB entries. For example, Altman's code takes about 0.01 seconds per atom to scan a 3-D structure. With more than 10 million atoms in the PDB, the code takes days to scan the PDB on a workstation.
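The quoted figures imply a concrete estimate; a back-of-the-envelope sketch in Python (the per-atom cost and atom count are the article's round numbers, not measurements):

```python
# Estimate a full-PDB scan time from the article's round numbers:
# ~0.01 seconds per atom, ~10 million atoms across all PDB entries.
SECONDS_PER_ATOM = 0.01
TOTAL_ATOMS = 10_000_000

seconds = SECONDS_PER_ATOM * TOTAL_ATOMS  # total serial scan time
days = seconds / 86_400                   # 86,400 seconds per day

print(f"{seconds:.0f} s, about {days:.1f} days on one workstation")
# → 100000 s, about 1.2 days on one workstation
```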

All-versus-all comparisons, in contrast, take time proportional to the square of the number of entries. Last year, Bourne and Shindyalov used their combinatorial extension code, which finds similarities between structures, to compare more than 8,000 proteins in the PDB against every other protein. The effort consumed more than 24,000 processor hours on the Cray T3E at SDSC. On a workstation, the effort would have taken almost three years.
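The quadratic cost is easy to see by counting pairs; a sketch using the article's own figures (8,000 structures, 24,000 processor hours):

```python
# All-versus-all comparison cost: the number of unordered pairs
# among n structures grows as n*(n-1)/2, i.e. quadratically in n.
n = 8_000
pairs = n * (n - 1) // 2  # ~32 million pairwise comparisons

# The article's 24,000 processor hours, serialized on a single CPU:
t3e_processor_hours = 24_000
workstation_years = t3e_processor_hours / (24 * 365)

print(f"{pairs:,} pairs; about {workstation_years:.1f} years on one workstation")
```

The result, roughly 2.7 years, matches the article's "almost three years."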

Top | Contents | Next

Figure 1. Growth of Biological Data
The Bioinformatics Infrastructure alpha project is developing the tools and techniques for analyzing the growing amount of biological data in databases. Left: The DNA base pairs stored in GenBank through August 1999. Right: The structures stored and deposited in the Protein Data Bank through October 1999.


In the data access component, participants from the Molecular Science and Data-Intensive Computing Environments (DICE) thrust areas are working to provide seamless access to the information in different molecular data collections. When complete, the project will permit analyses to span not just one collection but several.

"The information access challenge is to take these data collections and compare them on a high-performance system such as the Centurion cluster at the University of Virginia," said Reagan Moore, associate director of SDSC and leader of the DICE thrust area. "To do that, we must have definitions of how the collections are organized, and we have to understand the structure of objects in the collection."

Operated by NPACI partners Rutgers University, SDSC, and the National Institute of Standards and Technology, the PDB is the premier worldwide repository of 3-D protein structures, with 11,000 such structures today. Bourne leads the SDSC team. The Institute for Biomedical Computing at Washington University, directed by States, mirrors several protein and genetic databases. These databases include GenBank, which holds 3.4 billion bases in 4.6 million entries from 47,000 species, and SWISS-PROT, with 29 million amino acids in 80,000 entries (Figure 1).

The collection that promises to seriously tax the data-handling infrastructure, however, is being created by Montgomery Pettitt, director of the Institute for Molecular Design at the University of Houston. Pettitt is creating a database of molecular dynamics trajectories, time-slice snapshots from molecular dynamics simulations. A one-microsecond simulation might, for example, produce a time slice every picosecond--or one million database entries from a single simulation.
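The entry count is simply the ratio of simulation length to sampling interval; a minimal sketch (note that one million entries corresponds to sampling once per picosecond):

```python
# The number of stored time slices is the ratio of simulation
# length to sampling interval (both in seconds).
simulation_length = 1e-6   # one microsecond of simulated time
snapshot_interval = 1e-12  # one stored time slice per picosecond

entries = round(simulation_length / snapshot_interval)
print(entries)  # → 1000000 entries from a single simulation
```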

This diverse set of collections will be integrated by applying technologies developed in the Mediation of Information using XML (MIX) project, led by Chaitan Baru at SDSC and Yannis Papakonstantinou, director of UC San Diego's Database Laboratory (see enVision, July-September 1999). The MIX software acts as a mediator between data sources with different structures, such as the collections in the alpha project. The MIX tools will use infrastructure-independent XML representations being developed by the collection maintainers to allow biologists to query, access, and analyze data from any or all of the collections from a single application or Web interface.
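The mediation idea can be sketched in a few lines: source-specific records are mapped into one common XML view that a single query can search. This sketch uses Python's standard ElementTree; the element names and record shapes are hypothetical illustrations, not the project's actual schemas.

```python
import xml.etree.ElementTree as ET

# Two sources with different record shapes (illustrative, not real schemas).
pdb_record = {"id": "1CLL", "title": "Calmodulin", "atoms": 1166}
swissprot_record = {"ac": "P62158", "name": "Calmodulin", "length": 149}

def mediate(source, record):
    """Map a source-specific record into a common <molecule> view."""
    mol = ET.Element("molecule", source=source)
    if source == "PDB":
        ET.SubElement(mol, "name").text = record["title"]
        ET.SubElement(mol, "identifier").text = record["id"]
    elif source == "SWISS-PROT":
        ET.SubElement(mol, "name").text = record["name"]
        ET.SubElement(mol, "identifier").text = record["ac"]
    return mol

# The mediated view spans both collections.
view = ET.Element("collection")
view.append(mediate("PDB", pdb_record))
view.append(mediate("SWISS-PROT", swissprot_record))

# One query now searches both sources at once.
names = [m.findtext("name") for m in view.iter("molecule")]
print(names)  # → ['Calmodulin', 'Calmodulin']
```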



The data movement component joins the access and analysis components and is essential to the ultimate goal of the alpha project--large-scale analyses across more than one collection at a time. Initially, the Legion-enhanced codes will access data directly from the data collections, but the longer-term goal of moving data across the Grid is being addressed by efforts to integrate the SDSC Storage Resource Broker (SRB), the cornerstone of NPACI's data-handling infrastructure, with the Legion architecture.

"The only challenge for Legion will be the amount of data the codes will produce," Grimshaw said. "We don't have enough disk space, so we'll need some way to manage the storage." In addition, the code results could potentially exceed Legion's current 4-GB file size limit, and even when Legion allows larger files, the question of where to put the data remains.

The SDSC SRB provides an interface to distributed, heterogeneous data resources. When requesting data through the SRB, applications and users need not know how or where a data collection is stored. With an interface between Legion and the SRB, an algorithm will be able to marshal both computing processors and data collections across the network.

"Traditionally, data movement is handled by copying files via FTP to local disk or into memory," Moore said. "We need an architecture to stream the data from collections to an application on an as-needed basis. That's where the SRB fits in." The DICE activities in this project build not only on NPACI activities, but also on externally funded projects with the NASA Information Power Grid, the Department of Energy's Visualization Interaction Environments for Weapons Simulation (VIEWS), and the National Archives and Records Administration.



While the project will require many advances in the underlying technologies, its success will be measured in part by how well that infrastructure permits biology researchers to make new discoveries from the collections. For these users, the alpha project will provide a number of new resources and capabilities.

First, the test codes run by the participants will provide both benchmarks and biological results. Second, the experiences of the participants will be captured in templates for performing similar analyses. Timings on resources, formulas for extrapolating from workstation runs to Legion runs, and other information will help determine the resources needed to answer a hypothesis and hence develop an NPACI allocation request. Eventually, the test codes and algorithms may be made available for execution, possibly via the Web.
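A sketch of what such an extrapolation template might look like: scale a small workstation benchmark up to a full collection run on many processors. The function, its parameters, and the numbers are illustrative assumptions, not the project's actual formulas.

```python
def estimate_hours(bench_seconds, bench_entries, total_entries,
                   n_processors, efficiency=0.8):
    """Extrapolate a timed sample run to the full collection,
    assuming constant per-entry cost and parallel efficiency < 1.
    (Hypothetical helper for illustration.)"""
    per_entry = bench_seconds / bench_entries
    serial_hours = per_entry * total_entries / 3600
    return serial_hours / (n_processors * efficiency)

# e.g. 100 entries scanned in 500 s on a workstation,
# extrapolated to ~11,000 PDB entries on 64 processors:
hours = estimate_hours(500, 100, 11_000, n_processors=64)
print(round(hours, 2))  # → 0.3
```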

"In some ways, the biology community is depending on projects like this alpha project to keep up with the rapid growth of biology data collections," Altman said. "Using the Grid to marshal the resources necessary to scan databases such as the PDB and GenBank, this project will make this kind of bioinformatics capability routine." --DH *
