07/05/2006
Data Grid Replicates and Shares Growing Data Archives for Enhanced Collaboration
SDSC SRB Manages Data for the National Optical Astronomy Observatory
The NOAO Science Archive, based on the SDSC Storage Resource Broker (SRB), manages some 14.5 terabytes of astronomical data in more than 800,000 files. This mosaic of images of the Large Magellanic Cloud, part of the MCELS study of the glowing interstellar medium -- the breeding ground for new stars and cemetery for dead stars -- will form part of the archive. Image credit: C. Smith, S. Points, the MCELS Team and NOAO/AURA/NSF.
by Paul Tooby, SDSC senior science writer
The traditional picture people have of an astronomer standing at a telescope and taking photographs of heavenly objects has been dramatically transformed by advancing technologies. Observers now use a wide range of instruments that gather immense volumes of data in digital form, not only from visible light but also in wavelengths from very short gamma rays to long radio waves, opening multiple new windows on the Universe.
Because observing time on telescopes is scarce and the Universe is continually evolving, the hard-won data astronomers gather is a valuable and irreplaceable asset. In addition, opportunities for fundamentally new avenues of science are emerging as researchers develop tools to mine and compare growing archives of observational data. For example, the availability of infrared data led to the discovery of hidden active galactic nuclei potentially holding black holes, as well as star-forming regions unsuspected from looking at visible images alone. And being able to compare images of the same region of the sky over time has enabled detailed study of events such as exploding supernovae, leading to the discovery that the expansion of the Universe is actually accelerating.
All of these factors are driving the need for astronomers to be able to manage, integrate, share, and archive their growing data collections. To handle key parts of this challenge, researchers at the National Optical Astronomy Observatory (NOAO) have turned to the Storage Resource Broker (SRB), a powerful system for end-to-end data management developed at the San Diego Supercomputer Center (SDSC) at UC San Diego.
"Astronomers are beginning to realize how important it is to have an archiving system with a global catalog for all their data that offers uniform search and access, no matter where the data is or what instrument produced it, so that other scientists can discover and make use of it," said Irene Barg, data archive specialist with the National Optical Astronomy Observatory Data Products Program at NOAO headquarters in Tucson, Arizona. "Previously, when an astronomer walked out the door of one of our observatories, they took their data with them." While there was a tape backup, in practice it was not accessible to other astronomers, and the only way to share data was to ask that observer to make his or her data available, a lengthy process involving manual copying of data, possibly incompatible formats, and other challenges.
The NOAO Science Archive
To introduce modern data management practices, Barg and her colleagues working in the NOAO Science Archive have designed and built a system based on the SDSC Storage Resource Broker client-server system. The SRB provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets. In conjunction with the Metadata Catalog (MCAT), the SRB provides a way to access data sets and resources based on their attributes rather than their names or physical locations. Barg presented a paper on "Managing the NOAO Distributed Archive Using the SRB" at a user workshop on "Applications of the SDSC Storage Resource Broker" held at SDSC. For more information and the workshop proceedings see http://www.sdsc.edu/srb/Workshop.
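The attribute-based access described above can be illustrated with a small sketch. This is a toy model only, not the SRB client API or the MCAT schema: the class names, file paths, attribute keys, and site names are all hypothetical, chosen to show how a catalog can map a logical name to descriptive attributes and to one or more physical replicas.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One logical entry in a toy catalog (all names here are hypothetical)."""
    logical_name: str
    attributes: dict
    replicas: list = field(default_factory=list)  # physical locations


class ToyCatalog:
    """Minimal stand-in for a metadata catalog; not the MCAT schema."""

    def __init__(self):
        self._objects = {}

    def register(self, obj):
        self._objects[obj.logical_name] = obj

    def find(self, **wanted):
        """Return objects whose attributes match every requested key/value."""
        return [
            obj for obj in self._objects.values()
            if all(obj.attributes.get(k) == v for k, v in wanted.items())
        ]


catalog = ToyCatalog()
catalog.register(DataObject(
    "/noao/survey-a/img001.fits",
    {"instrument": "mosaic", "band": "visible"},
    ["tucson:/disk1/a/img001.fits", "laserena:/arch/a/img001.fits"],
))
catalog.register(DataObject(
    "/noao/survey-b/img107.fits",
    {"instrument": "ir-camera", "band": "infrared"},
    ["tucson:/disk4/b/img107.fits"],
))

# The caller asks by attribute and never needs to know a physical path.
matches = catalog.find(band="infrared")
for obj in matches:
    print(obj.logical_name, "->", obj.replicas[0])
```

Because callers work only with logical names and attributes, the list of replicas behind an entry can change without any change to the calling code.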
"The NOAO is a good example of the power of the federation mechanisms of the Storage Resource Broker data grid," said Reagan Moore, Distinguished Scientist and director of SDSC's Data-Intensive Computing Environments group, which developed the SRB system. "Their experience reflects how a growing number of projects are able to create both national scale and international scale shared collections that are assembled by federation of local data management systems."
The NOAO Science Archive is managing data for the National Optical Astronomy Observatory, which operates an international network of ground-based observatories at the Kitt Peak National Observatory in Arizona and the Cerro Tololo Inter-American Observatory in Chile. Organized in 1982, the NOAO is a national facility that provides peer-reviewed access for the nation's astronomers to leading-edge astronomical telescopes. The NOAO is operated by the Association of Universities for Research in Astronomy and funded by the National Science Foundation.
With hundreds of observers gathering data at each NOAO site every year, the data is flooding in. At the Kitt Peak National Observatory alone, which has produced some eight terabytes of data in the last two years, the data rate will soon increase by at least an additional 12 terabytes per year. And new instruments continue to come online.
"In our NOAO system the data comes to us in three ways," said Barg. "The first is raw data directly from the instrument, the second is from the post-processing pipeline, and the third is survey data where scientists carry out further analysis and then store the results back into the system." And all of the NOAO data, no matter where it comes from, is replicated to three centers: one in Tucson, Arizona, one in La Serena, Chile, and one at the National Center for Supercomputing Applications (NCSA) in Illinois.
Data Virtualization and Distributed File Management
"All these steps may seem trivial but they're not," said Barg. "The real beauty of the SRB, which makes all this possible, is distributed file management and virtualization of data." The SRB-based NOAO Data Transport System was commissioned in June 2004. Since then the system has moved and replicated more than 468 collections totaling 895,167 files and 14.5 terabytes of astronomical data.
"That's a lot of collections, but thanks to the SRB, when a user searches for data objects, it doesn't matter which archive they're in." Even if data objects are from three different surveys at three different sites, the SRB lets users find and access the data they need.
The SRB also helps manage the flow of data as it becomes older. "We can't store everything forever, so we migrate data out to lower-use sites as that data ages," said Barg. "Every three or four months, after confirmation that the data is safely in the three data centers, the files are removed from the mountain observatory 'caches,' and the data collections are kept in all three data centers for one to two years. Then they're retired from the Tucson and La Serena centers, leaving the safe long term storage and retrieval to the tape archives in Illinois."
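The aging policy Barg describes can be written down as a simple rule set. The sketch below is an illustration under stated assumptions, not NOAO's actual code: the site labels are hypothetical, and the thresholds (90 days for clearing the mountain caches, 730 days for retirement) are assumed values picked from within the "three or four months" and "one to two years" ranges quoted above.

```python
# Hypothetical site labels for the three data centers and the mountain cache.
DATA_CENTERS = {"tucson", "la_serena", "ncsa"}
CACHE = "mountain_cache"


def sites_to_keep(age_days, current_sites, cache_clear_days=90, retire_days=730):
    """Return the set of sites that should still hold a file of the given age.

    Thresholds are assumed values for illustration only.
    """
    sites = set(current_sites)

    # Clear the observatory cache only after all three centers hold a copy.
    fully_replicated = DATA_CENTERS <= sites
    if age_days >= cache_clear_days and fully_replicated:
        sites.discard(CACHE)

    # Retire the Tucson and La Serena copies once the file is old enough,
    # but only if the long-term archive copy exists.
    if age_days >= retire_days and "ncsa" in sites:
        sites -= {"tucson", "la_serena"}

    return sites


everywhere = DATA_CENTERS | {CACHE}
print(sites_to_keep(30, everywhere))    # young file: nothing is removed
print(sites_to_keep(120, everywhere))   # cache copy is dropped
print(sites_to_keep(800, DATA_CENTERS)) # only the long-term archive remains
```

Note that the cache copy is never dropped on age alone; the replication check comes first, mirroring the "after confirmation" step in the quote.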
Barg feels that while researchers appreciate the new capabilities of data access and sharing, they don't always realize just how much the SRB makes the job of data managers more efficient. "By not having to worry about where data collections are stored it makes designing our code so much easier -- since we use the logical file system we never have to change our code even when data is moved, migrated to new servers, or merged with other collections."
The SRB zone feature, in which each zone has its own replicated data collections and Metadata Catalog or MCAT, is also proving very important. Because the NOAO systems are live all the time, if any node goes down, the different zones allow the other archives to keep working so that the data collections are always available.
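The availability benefit of independent zones can be sketched as a simple failover loop. This is a toy model, not SRB's federation mechanism: the `Zone` class, the zone names, and the file contents are hypothetical, and a real zone would consult its own catalog rather than an in-memory dictionary.

```python
class ZoneUnavailable(Exception):
    """Raised by the toy model when a zone is down."""


class Zone:
    """Hypothetical zone holding its own copy of some logical files."""

    def __init__(self, name, holdings, up=True):
        self.name = name
        self.holdings = dict(holdings)
        self.up = up

    def fetch(self, logical_name):
        if not self.up:
            raise ZoneUnavailable(self.name)
        return self.holdings[logical_name]


def fetch_from_any(zones, logical_name):
    """Try each zone in order; return (zone name, data) from the first that works."""
    for zone in zones:
        try:
            return zone.name, zone.fetch(logical_name)
        except (ZoneUnavailable, KeyError):
            continue  # zone is down or lacks the file; try the next one
    raise LookupError(f"{logical_name} not available in any zone")


data = {"/noao/run42/frame.fits": b"pixels"}
zones = [
    Zone("tucson", data, up=False),  # simulate an outage
    Zone("la_serena", data),
    Zone("ncsa", data),
]

served_by, payload = fetch_from_any(zones, "/noao/run42/frame.fits")
print(served_by)
```

With the first zone marked as down, the request is served by the next zone that holds a replica.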
"The flexibility, location transparency, and robustness of the SRB give us a reliable system," Barg explained. "If we have an observer coming off a three day run, and he just found out that he can't read his second tape, he calls and asks if we got the data, and I say not only did we get it but it's replicated in three places -- they're very happy to hear this."
The NOAO Portal
To enable easier user access to the growing data collections, the NOAO team is developing a portal. Support for portal access to NOAO holdings managed by the SRB-based Data Transport System is planned for the next release in July 2006. One challenge the researchers face is how to integrate legacy data from older instruments, in which different terminology was used, and make it searchable along with modern data.
For broader usefulness, the NOAO researchers are working to ensure that their portal interoperates with the National Virtual Observatory (NVO) environment, so that astronomers and others can access not only NOAO collections but other collections in the NVO through the same portal. The NVO is part of a broad initiative to create a clearinghouse that brings together all of the catalogs and astronomical data of U.S. observatories, and eventually those worldwide, into one unified virtual observatory.
The NOAO portal has tools such as Open SkyNode that let astronomers perform searches across multiple catalogs. For example, if scientists want to find data from different wavelengths in the same area of sky, they can cross search various catalogs and find the known objects in that area along with observations that may be available in infrared, visible, or other wavelengths. This facilitates important new science based on comparing different kinds of data. The portal can be found at http://nvo.noao.edu.
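A cross search of this kind comes down to matching catalog entries by position on the sky. The sketch below is a minimal illustration and does not use the Open SkyNode interface: the two catalogs, their identifiers, and the coordinates are made-up values, and the brute-force loop is chosen for clarity rather than speed. Separation is computed with the standard haversine formula for great-circle distance.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(catalog_a, catalog_b, radius_deg):
    """Return (id_a, id_b) pairs whose positions lie within radius_deg."""
    pairs = []
    for a in catalog_a:
        for b in catalog_b:
            sep = angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"])
            if sep <= radius_deg:
                pairs.append((a["id"], b["id"]))
    return pairs


# Made-up entries for illustration; not real catalog data.
visible = [
    {"id": "V1", "ra": 80.8940, "dec": -69.7560},
    {"id": "V2", "ra": 10.0, "dec": 20.0},
]
infrared = [
    {"id": "IR7", "ra": 80.8942, "dec": -69.7561},
    {"id": "IR9", "ra": 150.0, "dec": 2.0},
]

two_arcsec = 2 / 3600  # match radius in degrees
matches = cross_match(visible, infrared, two_arcsec)
print(matches)
```

For large catalogs a spatial index would replace the nested loop, but the matching criterion stays the same.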