As NPACI works to build the computing infrastructure to link nearly 40 partner sites--a testbed for a national computing environment for the 21st century--one of the first tasks is to enable partners to exchange data among a diverse collection of tape archives, file systems, databases, and digital libraries.
At the 10 partner sites where NPACI will be installing data caches, for example, data will be stored on tape archives, in DB2, Oracle, Informix, Sybase, Illustra, Objectivity and other databases, and in Unix file systems controlled by IBM, Sun, SGI, and Digital platforms. The goal is to allow applications running on NPACI computing platforms to access data from any of these storage locations.
To "glue" together the data-handling environment, NPACI is installing the Storage Resource Broker (SRB) at the data cache sites. The SRB was originally developed as part of SDSC's Massive Data Analysis System project and is supporting the ongoing Distributed Object Computation Testbed (DOCT) collaboration. As such, the SRB is already installed at three NPACI sites--UC San Diego, UC Berkeley, and Caltech--and two DOCT sites, NCSA and SAIC.
"The SRB middleware allows a client application to access and manage data on distributed storage resources," said Chaitanya Baru, leader of the SRB project at SDSC. "The SRB server handles the complex tasks of interpreting a variety of storage protocols, interfacing to many systems through proprietary interfaces, and dealing with multiple architectures."
Baru, Michael Wan, Arcot Rajasekar, Wayne Schroeder, and colleagues in SDSC's Enabling Technologies group recently released SRB version 1.1, and NPACI is in the process of upgrading the existing sites and installing the SRB at additional data cache sites. The SRB client API and server software, the middleware that holds together the NPACI computing environment, are available for AIX, SunOS, Solaris, SGI, Cray UNICOS, and Digital OSF operating systems. The SRB currently provides a uniform interface to High Performance Storage System (HPSS) archives, DB2, Oracle, and Illustra databases, FTP archives, and Unix file systems via a set of resource-specific "drivers."
As part of DOCT, for example, NCSA developed an Oracle driver for the SRB. Drivers for the Sybase database and the ADSM archival storage system are scheduled to be completed by the middle of 1998. Once these drivers are completed, NPACI researchers will be able to access data at any of NPACI's data cache sites. NPACI data-intensive computing partners also include UC Santa Cruz, which has an existing data cache for the REINAS project, and Stanford and Oregon State universities, which are providing software technologies.
In practice, the SRB server takes requests from applications through an application program interface, queries a metadata catalog for the physical location of the requested data, and accesses the data using the appropriate protocol via a resource-specific driver (Figure 1).
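The request path described above (application request, catalog lookup, driver dispatch) can be sketched in a few lines of Python. This is an illustration only, not the SRB's actual API: the names `Broker`, `CatalogEntry`, `register_driver`, and `register_data`, the resource type strings, and the sample data are all hypothetical, and the "drivers" are in-memory stand-ins rather than real storage protocols.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record: where a logical data set physically lives."""
    resource_type: str  # which kind of store holds the data, e.g. "unix_fs"
    location: str       # driver-specific address inside that store


class Broker:
    """Toy broker: a metadata catalog lookup followed by driver dispatch."""

    def __init__(self) -> None:
        self._catalog: Dict[str, CatalogEntry] = {}
        self._drivers: Dict[str, Callable[[str], bytes]] = {}

    def register_driver(self, resource_type: str,
                        read_fn: Callable[[str], bytes]) -> None:
        # One driver per kind of storage resource.
        self._drivers[resource_type] = read_fn

    def register_data(self, logical_name: str, entry: CatalogEntry) -> None:
        # Record the physical location of a logical data set.
        self._catalog[logical_name] = entry

    def read(self, logical_name: str) -> bytes:
        # Step 1: ask the catalog where the data lives.
        if logical_name not in self._catalog:
            raise LookupError(f"no catalog entry for {logical_name!r}")
        entry = self._catalog[logical_name]
        # Step 2: pick the driver that speaks that resource's protocol.
        if entry.resource_type not in self._drivers:
            raise LookupError(f"no driver for {entry.resource_type!r}")
        driver = self._drivers[entry.resource_type]
        # Step 3: let the driver fetch the bytes.
        return driver(entry.location)


# In-memory stand-ins for two very different stores (made-up sample data).
fake_file_system = {"/cache/slice-001.img": b"pixels"}
fake_archive = {"tape42:block7": b"survey"}

broker = Broker()
broker.register_driver("unix_fs", lambda path: fake_file_system[path])
broker.register_driver("archive", lambda addr: fake_archive[addr])
broker.register_data("brain/slice-001",
                     CatalogEntry("unix_fs", "/cache/slice-001.img"))
broker.register_data("sky/survey-a",
                     CatalogEntry("archive", "tape42:block7"))

# The client names data logically and never sees where or how it is stored.
slice_bytes = broker.read("brain/slice-001")
survey_bytes = broker.read("sky/survey-a")
```

In this sketch, moving a data set to a different store means changing only its catalog entry; client calls to `read` stay the same, which is the property the article attributes to the broker.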
The SRB hides the low-level details of accessing each store, and the SRB file API provides a common Unix file-like interface to storage regardless of the underlying storage system, medium, or location. Version 1.1 of the SRB also incorporates the SDSC Encryption and Authentication (SEA) system for controlling access to data.
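To make the idea of a common file-like interface concrete, the following Python sketch puts the same `read`/`seek` handle in front of two differently organized stores. It is a minimal illustration under stated assumptions, not the SRB file API: `BrokeredFile`, `StorageDriver`, `MemoryFileDriver`, `ChunkedTableDriver`, and `first_and_rest` are hypothetical names, and both stores are simulated in memory.

```python
from typing import Dict, List, Protocol, Tuple


class StorageDriver(Protocol):
    """What the file-like handle needs from any store (hypothetical)."""
    def size(self, location: str) -> int: ...
    def read_at(self, location: str, offset: int, length: int) -> bytes: ...


class MemoryFileDriver:
    """Stand-in for a file system: each location is one byte string."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)

    def size(self, location: str) -> int:
        return len(self._files[location])

    def read_at(self, location: str, offset: int, length: int) -> bytes:
        return self._files[location][offset:offset + length]


class ChunkedTableDriver:
    """Stand-in for a database that stores an object as a list of rows."""

    def __init__(self, tables: Dict[str, List[bytes]]) -> None:
        self._tables = {name: list(rows) for name, rows in tables.items()}

    def size(self, location: str) -> int:
        return sum(len(row) for row in self._tables[location])

    def read_at(self, location: str, offset: int, length: int) -> bytes:
        # Reassemble the rows, then slice; fine for a small illustration.
        data = b"".join(self._tables[location])
        return data[offset:offset + length]


class BrokeredFile:
    """File-like handle that behaves the same over any StorageDriver."""

    def __init__(self, driver: StorageDriver, location: str) -> None:
        self._driver = driver
        self._location = location
        self._pos = 0
        self._closed = False

    def read(self, length: int = -1) -> bytes:
        if self._closed:
            raise ValueError("read on closed file")
        remaining = max(self._driver.size(self._location) - self._pos, 0)
        if length < 0 or length > remaining:
            length = remaining  # read to the end, like a Unix file
        data = self._driver.read_at(self._location, self._pos, length)
        self._pos += len(data)
        return data

    def seek(self, offset: int) -> int:
        if offset < 0:
            raise ValueError("negative seek position")
        self._pos = offset
        return self._pos

    def tell(self) -> int:
        return self._pos

    def close(self) -> None:
        self._closed = True

    def __enter__(self) -> "BrokeredFile":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


def first_and_rest(handle: BrokeredFile) -> Tuple[bytes, bytes]:
    """Client code that only sees the file-like interface."""
    header = handle.read(4)
    body = handle.read()
    return header, body


fs_driver = MemoryFileDriver({"/data/a": b"HEADpayload"})
db_driver = ChunkedTableDriver({"obj1": [b"HEA", b"Dpay", b"load"]})

with BrokeredFile(fs_driver, "/data/a") as f:
    from_file_system = first_and_rest(f)
with BrokeredFile(db_driver, "obj1") as f:
    from_database = first_and_rest(f)
```

The client function `first_and_rest` is identical for both stores; only the driver passed to the handle differs.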
"Future versions of the SRB will expand the functionality of the metadata catalog to support heterogeneous, distributed catalogs, including LDAP directories," Baru said. "For example, the SRB will be able to leverage off the NPACI LDAP directories, which contain information about NPACI users." Currently, the metadata catalog is implemented as either a DB2 or Oracle database.
Several NPACI projects will take advantage of the SRB to exchange data among partner sites. In recent years, neuroscientists have accumulated staggering amounts of volumetric brain mapping data from anatomical, PET, and MRI studies of the human brain and also of model systems, for example, the laboratory rat and cricket. NPACI researchers will acquire, store, and provide access to gigabytes of data from individual experiments, often conducted at different sites and described in data formats that include volumetric structural data, reconstructions of the cortical surface, functional imaging data, and connectional data (Figure 2).
Researchers from the Neuroscience thrust area will use the SRB to link brain mapping data caches at UC San Diego, UCLA, and Washington University in St. Louis. Assuming high-speed network connections are completed in time, the data caches and SRB servers will be installed at UCLA and Washington University by mid-1998. By September of this year, neuroscientists expect to be able to demonstrate an exchange of images among the three sites using the SRB. And in the Metasystems thrust area, the SRB will provide the interface to allow both the Legion and Globus systems to access the NPACI storage resources.
Within the Data-intensive Computing thrust area itself, the SRB will enable communication between the various digital libraries across the partnership. NPACI's digital library systems include an IBM commercial system at SDSC, the Alexandria Digital Library at UC Santa Barbara, the Digital Library Project at UC Berkeley, the Digital Sky Project at Caltech, the University of Michigan Digital Library, an Earth systems data repository at the University of Maryland, and Stanford's Digital Libraries Project. The UC Berkeley Digital Library project is currently using the SRB software to store data at its SDSC mirror site; the Alexandria Digital Library is in the process of doing the same.
Neuroscientists at UCLA, UC San Diego, and Washington University are integrating data caches with the SRB to create a federated database of brain-mapping images. Arthur Toga and colleagues at UCLA, for example, collect from 60 gigabytes to 800 gigabytes of data per brain. Staining the tissue augments the initial tomographic survey to provide even more detail. Here, an individual brain slice is digitally captured in 2048 x 2048 pixel resolution and 24-bit color. The three panels show increasing zoom to appreciate cellular groupings.
The SRB is already installed at three of NPACI's data cache sites--UC San Diego, UC Berkeley, and Caltech. Current plans call for running the SRB at all of the data cache sites and several other partner sites by the end of 1998. In March and April, the SRB will be installed at the universities of Michigan, Maryland, Houston, and Texas at Austin, and at Oregon State University. By mid-1998, UC Davis, UCLA, UC Santa Barbara, and Washington University will be running the SRB. Later in the year, Stanford, UC Santa Cruz, and Montana State University will join the list of SRB sites.
Other sites running the SRB include NCSA, as part of the DOCT project, and in the future the Naval Oceanographic Office, with whom SDSC is partnered through the Department of Defense HPC Modernization program. "We are concentrating first on high-bandwidth sites," said Reagan Moore, the leader of NPACI's Data-intensive Computing thrust area. "These have the most urgent need and ability to move large amounts of data."
Software development is also continuing, expanding the number of components that the SRB can glue together. The SRB integration with Globus will occur by June 1998; the DOCT project has already accomplished preliminary SRB and Legion integration. New drivers that will enable the SRB to serve data from Sybase and Informix databases and ADSM archival storage systems are slated for the latter half of 1998. By late 1998, the metadata catalog will also be integrated via a proxy into the Stanford InfoBus system.
"Digital data collections are becoming indispensable to the advancement of science," Moore said. "With the SRB, NPACI is taking the first steps to integrate many independent efforts and create a scalable, partnership-wide production data-handling environment." --DH