SDSC and the National Center for Atmospheric Research Enter Partnership to Preserve Critical Data

Collaboration will ensure reliable, long-term preservation of vital digital assets

Published 08/08/2006

The San Diego Supercomputer Center (SDSC) and the National Center for Atmospheric Research (NCAR) have entered a new collaboration for data preservation. The Memorandum of Understanding (MOU), signed in June, recognizes an existing deep partnership in data sharing, data preservation, and collaborative efforts using the SDSC Storage Resource Broker (SRB), and provides a basis for expanded interactions.

A key feature of the collaboration is a mechanism by which the two sites will provide archival storage space to back up each other's critical data. The geographical replication of data across different sites ensures the preservation of these digital assets.

"Many of our scientific and societal datasets have a very high value and some are simply irreplaceable. This will benefit both centers by letting us store critical datasets offsite," said Richard Moore, Director of Production Systems at SDSC. "If there's an unexpected disaster at a center, vital data from that center's archival storage system will be preserved at the other location."

"By leveraging the existing storage facilities at each center," adds Moore, "this arrangement provides a cost-effective and mutually beneficial solution for what's known as geographical data replication: the duplication of data at separate physical sites."

Initially, SDSC and NCAR will each make available 100 terabytes (trillion bytes) of archival storage for replication of each other's data. The amount of data storage available at each site will increase each year by 50 terabytes, reaching 300 terabytes by 2010, and the amount can be further increased by mutual agreement. The interface between the two archival systems will be managed with the Storage Resource Broker software developed at SDSC. The replicated data from each center will be held in "dark" archives, which are protected by being inaccessible to the public and accessed only for restoration of lost data.

Both centers have massive data storage systems with tape silos that house vital research data. SDSC's tape archive has a current capacity of 18 petabytes (one petabyte is one million gigabytes, equivalent to the storage capacity of about 10,000 desktop computers). It holds important collections from hundreds of different projects, ranging from biological data in the Protein Data Bank to massive astronomy surveys in the National Virtual Observatory, as well as data for the computational scientists and other researchers who use SDSC's resources. The Mass Storage System (MSS) at NCAR holds digitally archived data on some 45,000 tape cartridges, with data collections that are of vital importance to atmospheric scientists and other geoscientists from around the world.

Since NCAR has recently joined the NSF TeraGrid, which has nine sites including SDSC that form the world's most comprehensive distributed cyberinfrastructure for open research, SDSC and NCAR will use the TeraGrid 10-gigabit network to transfer data back and forth efficiently.

The two institutions will also work together to provide access to storage, implement networking procedures, share software tools, create documentation, and conduct random tests of data retrieval from both sites, as well as develop mechanisms to track and catalog data. SDSC and NCAR will also jointly sponsor and participate in a workshop on data integrity and security.