ALPHA PROJECTS
Bioinformatics Infrastructure for Large-Scale Analyses
PROJECT LEADERS: Russ Altman, Stanford University, NPACI Molecular Science thrust area leader; Reagan Moore, SDSC, NPACI Data-Intensive Computing thrust area leader
PROJECT MANAGER
PARTICIPANTS: Allison Waugh, Liping Wei, Glenn Williams, Catherine Ying, Stanford University; Chaitan Baru, Phil Bourne, Arcot Rajasekar, Ilya Shindyalov, SDSC; Montgomery Pettitt, University of Houston; David States, Washington University; Fran Berman, Yannis Papakonstantinou, UC San Diego; Andrew Grimshaw, Katherine Holcomb, University of Virginia; David Hillis, University of Texas; Arthur Olson, The Scripps Research Institute
REFERENCE: Grimshaw, A.S., A. Ferrari, F.C. Knabe, and M. Humphrey (1999): Wide-area computing: Resource sharing on a large scale, IEEE Computer 32, 29-37.
LEGION-SCALE ANALYSIS

The Bioinformatics Infrastructure alpha project is making a three-pronged assault on the problem: analysis, access, and movement of data. In the analysis component, participants are developing molecular scanning and comparison algorithms for various collections, and the Legion metasystem will recruit the computing resources required for large-scale analyses.

"Legion's architecture and past work on other biological codes will allow us to make progress quickly," said Andrew Grimshaw, the leader of the Legion effort and NPACI's Metasystems thrust area leader. "We already have a framework for similar database analyses, which can easily take advantage of Legion's built-in fault tolerance."

Altman's group, including Liping Wei, Allison Waugh, and Catherine Ying, has developed one such scanning algorithm. "Our code takes a PDB entry and derives information about potential calcium binding sites," Altman said. "Any similar code could be used, and we want to make tools like these more readily available."

The project also includes a molecular comparison algorithm for the PDB, called combinatorial extension, by Phil Bourne and Ilya Shindyalov at SDSC. An algorithm for phylogenetic analysis from David Hillis at the University of Texas and a genetic sequence comparison algorithm by David States at Washington University in St. Louis both run on sequence databases such as SWISS-PROT and GenBank.

Running a scanning algorithm against every record in the PDB takes time proportional to the number of PDB entries. For example, Altman's code takes about 0.01 seconds per atom to scan a 3-D structure; with more than 10 million atoms in the PDB, the code takes days to scan the full archive on a workstation. All-versus-all comparisons, in contrast, take time proportional to the square of the number of entries. Last year, Bourne and Shindyalov used their combinatorial extension code, which finds similarities between structures, to compare each of the more than 8,000 proteins in the PDB against every other protein. The effort consumed more than 24,000 processor hours on the Cray T3E at SDSC. On a workstation, it would have taken almost three years.
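The scaling argument above can be put in back-of-envelope form. The sketch below uses only the figures quoted in the text (0.01 seconds per atom, 10 million atoms, 8,000 structures); it is an illustration of why scans grow linearly and all-versus-all comparisons quadratically, not a timing of the actual codes.

```python
# Cost model for the two analysis patterns described in the text.

SECONDS_PER_ATOM = 0.01      # Altman's scan: ~0.01 s per atom (quoted figure)
ATOMS_IN_PDB = 10_000_000    # more than 10 million atoms in the 1999 PDB
PDB_PROTEINS = 8_000         # structures in the all-versus-all comparison

def scan_days(atoms, sec_per_atom=SECONDS_PER_ATOM):
    """Linear scan: runtime grows with the size of the archive."""
    return atoms * sec_per_atom / 86_400  # seconds -> days

def all_vs_all_pairs(n):
    """Pairwise comparison: work grows with the square of the entry count."""
    return n * (n - 1) // 2

print(f"Full-PDB scan: ~{scan_days(ATOMS_IN_PDB):.1f} days on one workstation")
print(f"All-versus-all over {PDB_PROTEINS:,} proteins: "
      f"{all_vs_all_pairs(PDB_PROTEINS):,} comparisons")
```

Doubling the archive doubles the scan but quadruples the comparison work, which is why the all-versus-all run needed a parallel machine.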
Figure 1. Growth of Biological Data. The Bioinformatics Infrastructure alpha project is developing the tools and techniques for analyzing the growing amount of biological data in databases. Left: the DNA base pairs stored in GenBank through August 1999. Right: the structures stored and deposited in the Protein Data Bank through October 1999.
MIXED INFORMATION ACCESS

In the data access component, participants from the Molecular Science and Data-Intensive Computing Environments (DICE) thrust areas are addressing the challenge of seamlessly accessing the information in different molecular data collections. When complete, the project will permit analyses to span not just one collection but several.

"The information access challenge is to take these data collections and compare them on a high-performance system such as the Centurion cluster at the University of Virginia," said Reagan Moore, associate director of SDSC and leader of the DICE thrust area. "To do that, we must have definitions of how the collections are organized, and we have to understand the structure of objects in the collection."

Operated by NPACI partners Rutgers University, SDSC, and the National Institute of Standards and Technology, the PDB is the premier worldwide repository of 3-D protein structures, holding 11,000 such structures today. Bourne leads the SDSC team. The Institute for Biomedical Computing at Washington University, directed by States, mirrors several protein and genetic databases, including GenBank, which holds 3.4 billion bases in 4.6 million entries from 47,000 species, and SWISS-PROT, with 29 million amino acids in 80,000 entries (Figure 1).

The collection that promises to seriously tax the data-handling infrastructure, however, is being created by Montgomery Pettitt, director of the Institute for Molecular Design at the University of Houston. Pettitt is building a database of molecular dynamics trajectories: time-slice snapshots from molecular dynamics simulations. A one-microsecond simulation that records a time slice every picosecond, for example, produces one million database entries from a single simulation.
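The trajectory-database arithmetic is worth making explicit: the number of stored entries is simply the simulation length divided by the sampling interval, and one million entries corresponds to picosecond sampling over a microsecond run.

```python
# Entry count for a trajectory database: simulation length / sampling interval.

MICROSECOND = 1e-6   # simulation length, in seconds
PICOSECOND = 1e-12   # sampling interval, in seconds

def n_slices(sim_length_s, interval_s):
    """Number of time-slice snapshots stored for one simulation."""
    return round(sim_length_s / interval_s)

print(n_slices(MICROSECOND, PICOSECOND))  # one million entries from one run
```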
This diverse set of collections will be integrated by applying technologies developed in the Mediation of Information using XML (MIX) project, led by Chaitan Baru at SDSC and Yannis Papakonstantinou, director of UC San Diego's Database Laboratory (see enVision, July-September 1999). The MIX software acts as a mediator between data sources with different structures, such as the collections in the alpha project. The MIX tools will use infrastructure-independent XML representations being developed by the collection maintainers to allow biologists to query, access, and analyze data from any or all of the collections from a single application or Web interface.
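The mediation idea can be sketched in a few lines. The element names and record shapes below are hypothetical, not the project's actual XML schemas; the point is that a mediator maps differently structured sources onto one common record shape so that a single query spans both collections.

```python
# A minimal sketch of XML mediation in the spirit of MIX (schemas invented).
import xml.etree.ElementTree as ET

pdb_like = '<entry id="1ABC"><molecule name="calmodulin"/></entry>'
genbank_like = ('<seq-entry><definition>calmodulin mRNA</definition>'
                '<accession>X12345</accession></seq-entry>')

def mediate_pdb(xml_text):
    """Map a structure-style record onto the common shape."""
    root = ET.fromstring(xml_text)
    return {"source": "PDB", "id": root.get("id"),
            "name": root.find("molecule").get("name")}

def mediate_genbank(xml_text):
    """Map a sequence-style record onto the same common shape."""
    root = ET.fromstring(xml_text)
    return {"source": "GenBank", "id": root.findtext("accession"),
            "name": root.findtext("definition")}

# A cross-collection query is then just a filter over the unified records.
records = [mediate_pdb(pdb_like), mediate_genbank(genbank_like)]
hits = [r for r in records if "calmodulin" in r["name"]]
print(hits)  # one hit from each collection
```

The biologist's application sees only the unified records; the per-source structure is the mediator's concern.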
STREAMLINED DATA MOVEMENT

The data movement component joins the access and analysis components and is essential to the ultimate goal of the alpha project: large-scale analyses across more than one collection at a time. Initially, the Legion-enhanced codes will access data directly from the data collections, but the longer-term goal of moving data across the Grid is being addressed by efforts to integrate the SDSC Storage Resource Broker (SRB), the cornerstone of NPACI's data-handling infrastructure, with the Legion architecture.

"The only challenge for Legion will be the amount of data the codes will produce," Grimshaw said. "We don't have enough disk space, so we'll need some way to manage the storage." In addition, the code results could potentially exceed Legion's current 4-GB file size limit, and even when Legion allows larger files, the question of where to put the data remains.

The SDSC SRB provides an interface to distributed, heterogeneous data resources. When requesting data through the SRB, applications and users need not know how or where a data collection is stored. With an interface between Legion and the SRB, an algorithm will be able to marshal both computing processors and data collections across the network.

"Traditionally, data movement is handled by copying files via FTP to local disk or into memory," Moore said. "We need an architecture to stream the data from collections to an application on an as-needed basis. That's where the SRB fits in."

The DICE activities in this project build not only on NPACI activities but also on externally funded projects with the NASA Information Power Grid, the Department of Energy's Visualization Interaction Environments for Weapons Simulation (VIEWS) program, and the National Archives and Records Administration.
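The contrast Moore draws between FTP-style copying and on-demand streaming can be sketched as two access patterns. The broker interface below is a hypothetical stand-in, not the SRB API; the point is that the streaming consumer never stages the whole collection locally.

```python
# Two data-movement patterns: stage-then-process vs. stream-on-demand.
import io

def copy_then_process(remote):
    """FTP style: pull the entire collection locally, then iterate."""
    local = remote.read()              # whole collection lands in local storage
    return local.splitlines()

def stream_records(remote, chunk_lines=2):
    """Broker style: yield records as needed; nothing is staged in full."""
    batch = []
    for line in remote:
        batch.append(line.rstrip("\n"))
        if len(batch) == chunk_lines:
            yield from batch            # hand records to the application
            batch = []
    yield from batch                    # flush the final partial chunk

collection = io.StringIO("entry1\nentry2\nentry3\n")  # stand-in for a remote source
print(list(stream_records(collection)))
```

For a collection like Pettitt's million-entry trajectory database, the first pattern requires local disk for the whole collection up front; the second needs only a chunk at a time.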
BIOLOGICAL DISCOVERY

While the project will require many new developments in the underlying technologies, its success will be measured in part by how well the underlying infrastructure permits biology researchers to make new discoveries from the collections. For these users, the alpha project will provide a number of new resources and capabilities.

First, the test codes run by the participants will provide both benchmarks and biological results. Second, the experiences of the participants will be captured in templates for performing similar analyses. Timings on resources, formulas for extrapolating from workstation runs to Legion runs, and other information will help researchers determine the resources needed to answer a hypothesis and hence develop an NPACI allocation request. Eventually, the test codes and algorithms may be made available for execution, possibly via the Web.

"In some ways, the biology community is depending on projects like this alpha project to keep up with the rapid growth of biology data collections," Altman said. "Using the Grid to marshal the resources necessary to scan databases such as the PDB and GenBank, this project will make this kind of bioinformatics capability routine." --DH
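An extrapolation formula of the kind described might look like the simple strong-scaling estimate below. The 0.8 parallel efficiency is an assumed illustrative value, not a project benchmark; the 24,000 processor-hour figure is the all-versus-all PDB comparison quoted earlier.

```python
# A hypothetical workstation-to-parallel-run extrapolation (illustrative only).

def parallel_hours(total_cpu_hours, n_procs, efficiency=0.8):
    """Estimate wall-clock hours on n_procs, assuming fixed parallel efficiency."""
    return total_cpu_hours / (n_procs * efficiency)

# The all-versus-all PDB comparison represented ~24,000 processor-hours of work.
print(f"~{parallel_hours(24_000, 256):.0f} wall-clock hours on 256 processors")
```

Estimates like this, backed by measured timings from the project's benchmark runs, are what would go into an NPACI allocation request.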