Title
Usage Scenarios for Information Sharing in a Data Grid
Conference
The 7th Annual International Conference on Digital Government Research,
San Diego, May 21-24, 2006.
Tutorial Presenters
Reagan W. Moore, Richard Marciano, Arcot Rajasekar
San Diego Supercomputer Center
University of California at San Diego
{moore, marciano, sekar}@sdsc.edu
Short Description
Data Grids support shared collections that may be distributed across multiple institutions. Data Grids decouple the management of the shared collections from the storage systems, making it possible to logically organize existing data into a new collection. Data Grids support seamless access to the information but, unlike the web, provide certificate-based authentication, authorization through access controls and tickets, and audit trails to keep track of usage. Discovery of relevant files is facilitated through associated metadata. Data Grids provide rich formats for organizing metadata, including attribute-value pairs, relational schemas, and semi-structured XML-based metadata. One can also store system-level metadata and provenance metadata to keep track of the evolution and by-products of the collections that one is sharing.
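As an illustration of the decoupling described above, the following minimal Python sketch models a logical collection whose names are independent of where the data physically resides, with attribute-value metadata used for discovery. It is a toy model only, not the Storage Resource Broker API: the class names, the `register`, `locations`, and `find` methods, and the location strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One registered digital entity: a logical name mapped to physical replicas."""
    logical_name: str
    replicas: list = field(default_factory=list)   # hypothetical physical locations
    metadata: dict = field(default_factory=dict)   # attribute-value pairs


class LogicalCollection:
    """Toy catalog in which the collection, not the storage system, owns the names."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, physical_location, **metadata):
        """Register a physical copy under a logical name and attach metadata."""
        entry = self._entries.setdefault(logical_name, Entry(logical_name))
        entry.replicas.append(physical_location)
        entry.metadata.update(metadata)

    def locations(self, logical_name):
        """Return every physical location recorded for a logical name."""
        return list(self._entries[logical_name].replicas)

    def find(self, **query):
        """Return logical names whose metadata matches every attribute-value pair."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if all(entry.metadata.get(key) == value for key, value in query.items())
        )
```

In this sketch, registering the same logical name twice with different locations records two replicas, and `find(domain="seismology")` returns matching logical names without exposing which storage system holds them.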
Target and Goals
Any group that manages distributed data will profit from the tutorial, which will provide the basic information needed to design and implement a shared collection that spans multiple storage systems and administrative domains. The shared collections can be used to implement data grids for sharing data, digital libraries for publishing data, and persistent archives for preserving data.
In our tutorial, we will detail real-life use case scenarios for information sharing. The usage scenarios are based on applications of the Storage Resource Broker, premier data grid software developed at the San Diego Supercomputer Center. We will discuss information sharing for the National Archives (NARA), distributed state archives (Persistent Archive Testbed project with NHPRC), the Worldwide Universities Network data grid (shared collections spanning three continents), biomedical networks (NIH Bio-medical Informatics Research Network), astronomy collections (NSF National Virtual Observatory), seismic digital libraries (NSF Southern California Earthquake Center), and real-time sensor data (NSF Real-time Observatories, Applications and Data management Network), among others. Each of these use cases provides unique insights into problems and solutions for data sharing. For each case, we will outline the aims, the problems encountered, the solutions adopted, and the user experiences.
Solutions to common problems that are encountered in the formation of shared collections will be presented. The tutorial will cover the following topics, at about 30 minutes per topic:
- Support for digital entities. Data Grids support registration of files, SQL command strings, URLs, database tables, file system directories, binary large objects in databases, and sensor data streams.
- Authentication and authorization. The approach that works best is based on trust virtualization, in which the shared collection owns the data, authenticates user access, and manages access controls for each registered object.
- Support for legacy data. This is handled through bulk registration of existing files into the logical name space. The existing directory structure and naming convention can be replicated into the logical collection.
- Support for bulk operations. Many collections are assembled from small files (less than 10 MBytes in size). Data grids support bulk manipulation of thousands of small files at a time: the small files are packed into a large buffer before sending and unpacked at the receiving site. Alternatively, small files can be aggregated into containers before storage in archives.
- Support for extensible schema. Each registered digital entity may be assigned unique metadata attributes, which may be structured in a snowflake schema (no cycles across the tables).
- Support for user-preferred access mechanisms. Data Grids separate the interfaces for access mechanisms from the protocols required to interact with storage systems. This makes it possible to port very sophisticated interfaces on top of the shared collection. Examples include the DSpace and Fedora digital libraries, the Open Archives Initiative Protocol for Metadata Harvesting, WSDL services, web browser interfaces, etc.
- Federation of independent data grids. The logical name spaces used by each data grid can be shared or synchronized, under administrative control. Files and metadata can be copied between data grids. One can build peer-to-peer data grids, in which only public data is shared. One can build a central archive that holds copies of data from peer satellite data grids, or build a master-slave system, in which data is copied to remote data grids from a master data grid. Federations are used to improve interactivity in global data grids, provide high availability, and provide disaster recovery.
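The bulk-operation topic above (packing many small files into one buffer before sending, then unpacking at the receiving site) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the mechanism used by the Storage Resource Broker: the tar format stands in for the container, and the function names `pack` and `unpack` are hypothetical.

```python
import io
import tarfile


def pack(files):
    """Pack a mapping of name -> bytes into one in-memory container.

    Assumption: a tar archive is used as the container format for illustration.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)              # tar needs the size up front
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unpack(container):
    """Recover the individual files from a container at the receiving site."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as tar:
        for member in tar.getmembers():
            extracted = tar.extractfile(member)
            if extracted is not None:          # skip directories and special entries
                files[member.name] = extracted.read()
    return files
```

Sending one container instead of thousands of individual files replaces many small transfers, each with its own per-request overhead, with a single large one; the same packed form could also be written to an archive as a single object.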
Material describing the Storage Resource Broker is available at the URL http://www.sdsc.edu/srb/
Relevant papers include:
- Moore, R., M. Wan, A. Rajasekar, "Storage Resource Broker: Generic Software Infrastructure for Managing Globally Distributed Data", Proceedings of IEEE Conference on Globally Distributed Data, Sardinia, Italy, June 28, 2005.
- Moore, R., R. Marciano, "Technologies for Preservation", book chapter in "Managing Electronic Records", edited by Julie McLeod and Catherine Hare, Facet Publishing, UK, October 2005.
- Rajasekar, A., M. Wan, R. Moore, W. Schroeder, "Data Grid Federation", PDPTA 2004 - Special Session on New Trends in Distributed Data Access, June 2004.
Bios of Presenters
Reagan Moore:
Dr. Reagan W. Moore is Director for Data Intensive Computing at the San Diego Supercomputer Center. He coordinates research efforts in the development of massive data analysis systems, data grids, digital libraries, and persistent archives. Moore is the principal investigator for the development of the Storage Resource Broker data grid technology, which is used to support international shared collections. Moore leads SDSC involvement in projects with NARA on persistent archives, NSF on NSDL persistent archives, NASA on the Information Power Grid, NIH on the Biomedical Informatics Research Network, and the Library of Congress on data management. His email address is moore@sdsc.edu.
Richard Marciano:
Dr. Richard Marciano heads the Sustainable Archives and Libraries Technology Group at the San Diego Supercomputer Center. He is the principal investigator on preservation projects including the NHPRC Persistent Archives Testbed and the NHPRC California Geospatial Records Preservation Grant, and collaborates on the NARA research prototype persistent archive. His research interests include mapping of collections to Geographic Information Systems, development of preservation environments, and analysis of preservation environment consistency properties. His email address is marciano@sdsc.edu.
Arcot Rajasekar:
Dr. Arcot K. Rajasekar heads the Data Grid Technologies Group at the San Diego Supercomputer Center (SDSC). His major research interests include research and development of technologies for data grids, digital library systems, and persistent archives. His current research activities at SDSC include development of the Storage Resource Broker for integrating distributed data repositories and digital library systems, and development of a metadata catalog system for handling system-level and domain-specific metadata. His email address is sekar@sdsc.edu.