In studying the Earth, scientists are collecting volumes of data from remote-sensing satellites, field surveys, climate models, and ground-based sensors. Each source, however, presents only a fragment of the whole picture. To integrate these measurements into more accurate estimates of, for example, the spread of deserts, the loss of rain forests, and coastal pollution levels, researchers must be able to discover relevant data sets from both their own collections and the large volume of published data collected by the Earth systems science community.
For publishing, discovering, and analyzing discipline-wide stores of scientific data, NPACI's Data-intensive Computing thrust area is developing and integrating digital library technology as part of the infrastructure. This effort, led by Terence Smith of UC Santa Barbara, involves Digital Library Initiative projects at four NPACI sites, integration technologies, and applications researchers poised to apply digital libraries in their fields.
"Digital libraries have the potential to change the way research is conducted," Smith said. "We're developing the tools for describing, publishing, and analyzing large data sets within NPACI's data-handling environment."
Figure 1: Alexandria Digital Library Interface
In general, digital libraries are storehouses of information available through the Internet that provide ways to collect, store, and organize data and make it available for searching, retrieval, and processing. To advance these capabilities, the National Science Foundation (NSF), the Department of Defense Advanced Research Projects Agency (DARPA), and NASA have jointly established six digital library projects, four of which are participating in NPACI.
The Alexandria Digital Library, led by Smith at UC Santa Barbara, is providing easy access to large and diverse collections of maps, images and pictorial materials (Figure 1). Currently, Alexandria has more than 2.8 terabytes of geographically referenced data, including a large collection of maps and photos from the U.S. Geological Survey. In mid-1998, a mirror of the Alexandria library will be established at SDSC.
The UC Berkeley Digital Library, led by NPACI partner Robert Wilensky, is collecting diverse information about the environment to be used for the preparation and evaluation of environmental data, impact reports, and related materials. To this end, the UC Berkeley library is working with the State of California Environmental Resource Evaluation System (CERES). The more than 300 gigabytes of data in the UC Berkeley library is currently mirrored at SDSC.
At the University of Michigan's Digital Library project, NPACI researcher Randy Frank is working to integrate the Michigan system with SDSC's Storage Resource Broker (SRB). The Michigan project, led by Daniel Atkins, is concentrating on the Earth and space sciences. In an early application, the JSTOR project has made the archives of more than 40 scientific journals available for searching and browsing.
At Stanford University, NPACI researcher Hector Garcia-Molina is leading the Stanford Integrated Digital Library Project to create a shared environment that links everything from personal information collections, to collections found today in conventional libraries, to large data collections shared by scientists. A key development from the Stanford project, which the NPACI effort will use to integrate the various components, is the Digital Library Interoperability Protocol (DLIOP).
Figure 2: Combining Geographically Referenced Data Sets
This view of Santa Barbara, California, composites spatial data sets from the Alexandria Digital Library, including satellite imagery from SPOT and the Landsat Thematic Mapper, a digital elevation model from the U.S. Geological Survey, and geologic features such as faults and folds. Downtown Santa Barbara and Santa Barbara harbor are present in the black-and-white SPOT image. Digital libraries will help researchers search for and integrate such diverse data sources.
The DLIOP is one of several interfaces that NPACI researchers will use to link the four digital library projects and a commercial digital library from IBM that is planned for installation at SDSC. The DLIOP allows flexible searches via a uniform interface across any library that supports it. Also within NPACI's digital library efforts, Cherri Pancake at Oregon State University will lead the development of a Web-based query interface that uses the DLIOP to search multiple repositories.
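The idea of one uniform interface searched across several libraries can be sketched in a few lines of Python. This is only an illustration of the pattern, not the DLIOP itself: the `Searchable` interface, the `InMemoryLibrary` class, the `search_all` function, and the sample titles are all hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    library: str   # which repository returned the record
    title: str     # human-readable title of the data set


class Searchable(Protocol):
    """Hypothetical uniform interface; not the actual DLIOP specification."""
    name: str

    def search(self, keyword: str) -> list[Hit]: ...


class InMemoryLibrary:
    """Toy stand-in for a remote digital library (assumption for illustration)."""

    def __init__(self, name: str, titles: list[str]) -> None:
        self.name = name
        self._titles = titles

    def search(self, keyword: str) -> list[Hit]:
        # Case-insensitive substring match on the title.
        needle = keyword.lower()
        return [Hit(self.name, t) for t in self._titles if needle in t.lower()]


def search_all(libraries: list[Searchable], keyword: str) -> list[Hit]:
    """Send the same query to every library and merge the results in order."""
    hits: list[Hit] = []
    for lib in libraries:
        hits.extend(lib.search(keyword))
    return hits


# Invented sample collections; the titles are placeholders, not real holdings.
libraries = [
    InMemoryLibrary("alexandria", ["Santa Barbara SPOT image", "USGS elevation model"]),
    InMemoryLibrary("berkeley", ["California wetlands report", "Santa Barbara coastal survey"]),
]
results = search_all(libraries, "santa barbara")
```

Because every library exposes the same `search` method, a client such as a Web-based query page needs no library-specific code; adding a repository means adding one more object to the list.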
At the resource level, SDSC's SRB will provide the interface to the variety of storage devices that make up the NPACI data-handling environment, including HPSS and ADSM archives and the hardware of the NPACI data caches. The SRB accepts queries about data sets in the form of attribute-value pairs and provides the physical location of the data on the appropriate storage device.
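A query expressed as attribute-value pairs and resolved to a physical location might look like the following minimal sketch. It does not use the real SRB interface: the `CatalogEntry` class, the `locate` function, the attribute names, and the location strings are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    attributes: dict   # descriptive metadata as attribute-value pairs
    location: str      # invented physical-location string


# Hypothetical catalog contents; names and paths are invented for illustration.
CATALOG = [
    CatalogEntry({"sensor": "SPOT", "region": "Santa Barbara"}, "hpss:/archive/spot/sb_001"),
    CatalogEntry({"sensor": "Landsat TM", "region": "Santa Barbara"}, "cache:/disk3/tm/sb_014"),
    CatalogEntry({"sensor": "SPOT", "region": "San Diego"}, "adsm:/tape7/spot/sd_002"),
]


def locate(query: dict, catalog=CATALOG) -> list:
    """Return locations of entries whose attributes match every pair in the query."""
    return [
        entry.location
        for entry in catalog
        if all(entry.attributes.get(key) == value for key, value in query.items())
    ]


locations = locate({"sensor": "SPOT", "region": "Santa Barbara"})
```

The caller describes what data it wants and never names a storage device; the catalog lookup alone decides whether the answer points at an archive or a disk cache.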
Finally, to ensure that applications and future technologies will be able to interact with digital libraries, the DLIOP is implemented with proxies based on the Common Object Request Broker Architecture (CORBA). Researchers from NPACI's Metasystems and Interaction Environments thrusts will also use the CORBA objects that implement the DLIOP to access digital libraries in their respective projects.
By the end of 1998, the digital library efforts plan to have integrated Michigan's ADSM archive via the SRB interface, demonstrated searches across multiple repositories, and begun efforts to link interaction environments with the digital libraries.
In concert with the digital library efforts, researchers from NPACI's applications thrust areas will apply the technologies to address pressing discipline-specific needs. Integrated digital libraries will open up new opportunities not only to make scientific discoveries and foster new intellectual collaborations, but also to knit together existing research communities more tightly.
For example, neuroscientists will be able to seek out, assimilate, analyze, and visualize widely distributed data sets. Through metadata catalogs that describe scientific data sets housed in multiple data collections at Washington University in St. Louis, UCLA, and UC San Diego, neuroscientists will be able to identify relevant brain images for comparisons.
In astronomy, NPACI's Digital Sky Project, led by Tom Prince at Caltech, will be an early user of the NPACI data-handling environment. "This project also illustrates how data-handling tools can draw a community of researchers closer together," Prince said. The project will begin with the Digital Palomar Observatory Sky Survey, the 2-Micron All Sky Survey, and the Sloan Digital Sky Survey and publish the surveys for astronomers, providing access to the catalogs and images together with sufficient computing power to allow detailed correlated studies across the entire data set. The sheer volume of records--eventually about one billion records, each perhaps only 500 bytes long--poses challenges for the visualization and analysis of the data.
In contrast to the billion records in the Digital Sky project, researchers from the Earth systems sciences are collecting much larger data sets, including continuous data streams (Figure 2). To permit useful analyses over time and over the surface of the Earth, the repositories comprise extensive geographically referenced data sets. Joseph JáJá, director of the University of Maryland's Institute for Advanced Computer Studies, is leading an NPACI collaboration to develop a database for remote-sensing images.
At UC Santa Cruz, Patrick Mantey will work to integrate elements of the Real-time Environmental Information Network and Analysis System (REINAS), a distributed measurement-gathering environment to support both real-time and retrospective regional scale environmental science. REINAS allows researchers to collect and store real-time data from dispersed sensors in a physically distributed database.
"We expect these early prototypes to spark interest," Smith said, "in similar data resources by other scientific disciplines, museums, government agencies, and other information providers." --DH