MDAS - Massive Data Analysis System

a metadata-based computing environment

Enabling Technologies Group
San Diego Supercomputer Center

MDAS is a DARPA-funded project to provide a software infrastructure for resource discovery and scheduling of computations in a heterogeneous, distributed system. The creation of a Massive Data Analysis System (MDAS) will enable new modes of science through improved management of scientific data sets. This requires a semantically enabled, scalable software infrastructure that can manage petabytes of data, support rapid access to selected data sets, and support subsequent computationally intensive analyses. To accomplish this, MDAS provides a metadata-enabled computational framework that can interoperate with diverse resources, applications, methods and datasets. The system is designed with a comprehensive ontology for storing core system-level metadata, which allows it to handle a variety of entities. Moreover, it provides the means to seamlessly integrate application- and resource-specific metadata that can be used to deal with the peculiarities of individual systems. The system is also being designed with a build environment that allows it to be ported to systems ranging from supercomputers to workstations.

MDAS is also designed to collect metadata descriptions of resources, methods and datasets automatically through metadata collection agents. We contend that MDAS provides a pioneering framework for a semantically enabled, distributed and heterogeneous global computing system, with "semantic" capabilities that provide an interoperating framework for existing legacy and emerging operating systems. The goal of the "semantic" infrastructure is to provide a uniform "syntactic" interface to the resources, methods, and datasets in the global computational system.

MDAS stores metadata about resources (such as hardware systems, including computing platforms, communications networks, storage systems and peripherals; system software, including DBMSs, file systems, operating systems and schedulers; and application systems, including digital libraries and search engines); methods (such as access methods for using standardized and non-standard APIs, system and user-defined functions for manipulating datasets, and format-conversion routines); datasets (such as environmental datasets, patent datasets, genome sequences, and user- and system-generated data and information); and the users and groups who can access the resources, methods and datasets through the MDAS framework. MDAS acts as a metadata repository for system-level and domain-level metadata and provides a framework for integrating the two types of metadata.
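
As an illustration only, the four kinds of catalog entries and the split between system-level and domain-level metadata might be modeled as in the following Python sketch. It is not the MDAS schema: the names EntityKind, CatalogEntry, MetadataCatalog, and the sample entries are all hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityKind(Enum):
    """The four kinds of entities the text says MDAS describes."""
    RESOURCE = "resource"
    METHOD = "method"
    DATASET = "dataset"
    USER = "user"


@dataclass
class CatalogEntry:
    name: str
    kind: EntityKind
    core: dict = field(default_factory=dict)    # system-level metadata
    domain: dict = field(default_factory=dict)  # application- or resource-specific metadata


class MetadataCatalog:
    """Hypothetical in-memory stand-in for a metadata catalog."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Entries are keyed by kind and name, so a dataset and a
        # resource may share a name without colliding.
        self._entries[(entry.kind, entry.name)] = entry

    def find(self, kind, **core_filters):
        # Return entries of the given kind whose core metadata
        # matches every supplied key/value pair.
        return [
            e for e in self._entries.values()
            if e.kind is kind
            and all(e.core.get(k) == v for k, v in core_filters.items())
        ]


catalog = MetadataCatalog()
catalog.register(CatalogEntry("archive-1", EntityKind.RESOURCE,
                              core={"type": "archival storage"}))
catalog.register(CatalogEntry("unix-fs", EntityKind.RESOURCE,
                              core={"type": "file system"}))
catalog.register(CatalogEntry("genome-seqs", EntityKind.DATASET,
                              core={"format": "text"},
                              domain={"organism": "E. coli"}))

file_systems = catalog.find(EntityKind.RESOURCE, type="file system")
```

Keeping the core and domain dictionaries separate mirrors the distinction drawn above: queries over core metadata work the same way for every entry, while domain metadata can vary freely per application.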

The metadata in MDAS provides a software infrastructure for transparent access and computation in a heterogeneous and distributed environment. The metadata catalog of MDAS provides information that can be utilized to provide access to datasets stored in a variety of formats, locations and storage media ranging from file systems to archival storage, and to provide computational and communication transparency for distributed platforms. The metadata also provides a means to store and retrieve security-related information for encryption, authentication and validation mechanisms. Moreover, the catalog can handle metadata about resources and datasets that are replicated and can provide information to fabricate, synchronize and schedule compound methods to achieve a given purpose.

We describe below three applications of MDAS that are currently under development.

1. Extended Access - Storage Resource Broker

MDAS provides a uniform framework for distributed data handling. The aim of the Storage Resource Broker (SRB) is to complement MDAS by providing a uniform access mechanism to diverse and distributed storage systems. The SRB at SDSC uses the MDAS metadata catalog to obtain information about the distributed storage resources (the SRB Storage Vault) that it brokers. The SRB handles the protocol conversions needed to interface to heterogeneous storage systems through standardized or proprietary interfaces. The SRB interfaces with the MDAS metadata catalog to mediate between user access requests and the storage systems, and to access metadata about individual files and objects. SRB provides a uniform API to access stream data, tabular data and structured data residing in stable storage systems such as databases, archival storage systems and file systems. We have identified a set of API functions for accessing large streamable objects from file systems, archival storage systems and databases storing large objects. We are planning to abstract the APIs of CORBA and ODBC into the SRB framework to provide a higher-level API that can be used to access other types of data, such as tabular data stored in relational databases and structured data stored in object bases, and possibly structured data stored in file systems. We also plan to include stream-level access to resources accessible through an httpd daemon, which will give programs transparent access to HTML documents and other files and executables on the World Wide Web.
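
The brokering idea described above can be sketched roughly as follows. This Python example is an assumption-laden illustration, not the SRB API: StorageDriver, FileSystemDriver, InMemoryDriver and Broker are hypothetical names, and a dictionary stands in for the catalog lookup that maps an object to the storage resource holding it.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical per-system adapter hiding each storage protocol."""

    @abstractmethod
    def write(self, path, data): ...

    @abstractmethod
    def read(self, path): ...


class FileSystemDriver(StorageDriver):
    def __init__(self, root):
        self.root = root

    def write(self, path, data):
        with open(os.path.join(self.root, path), "wb") as f:
            f.write(data)

    def read(self, path):
        with open(os.path.join(self.root, path), "rb") as f:
            return f.read()


class InMemoryDriver(StorageDriver):
    """Stand-in for a database or archival storage system."""

    def __init__(self):
        self._blobs = {}

    def write(self, path, data):
        self._blobs[path] = bytes(data)

    def read(self, path):
        return self._blobs[path]


class Broker:
    """Routes requests by object name; callers never see the driver."""

    def __init__(self):
        self._drivers = {}    # resource name -> driver
        self._locations = {}  # object name -> (resource name, path)

    def add_resource(self, name, driver):
        self._drivers[name] = driver

    def put(self, obj, resource, path, data):
        self._drivers[resource].write(path, data)
        self._locations[obj] = (resource, path)

    def get(self, obj):
        # The location lookup plays the role of a catalog query.
        resource, path = self._locations[obj]
        return self._drivers[resource].read(path)


broker = Broker()
broker.add_resource("local-fs", FileSystemDriver(tempfile.mkdtemp()))
broker.add_resource("archive", InMemoryDriver())
broker.put("report", "local-fs", "report.txt", b"hello")
broker.put("image", "archive", "img/0001", b"\x00\x01")
```

A caller retrieves either object with the same call, broker.get(name), regardless of which kind of storage system holds it.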

2. DOCT - A Distributed Object Computation Testbed

The Distributed Object Computation Testbed (DOCT) is a broad research and development effort to unify data and computation access across multiple sites. More specifically, this environment supports the collection, organization, manipulation, analysis, and maintenance of terabyte-sized data collections. It uses an object-oriented approach, emphasizing retention and tracking of all data sets in the environment, to manage complex documents composed of text, images, and multimedia files. This approach is critical to managing complex, dynamic documents effectively.

The Distributed Object Computation Testbed (DOCT) will create an environment for handling complex documents on geographically distributed data archives and computing platforms. A persistent object representation based on the Legion computation environment will be used to integrate text search systems, document handling systems, and intelligent agents that support electronic submission of documents. The documents will be stored on distributed database systems that are integrated with archival storage systems. Research activities include development of a distributed scheduling system and development of the requirements for supporting electronic filing of documents.

In DOCT, MDAS is used in two ways: as a facility for storing information and as a repository of meta-information that can be used by the various subsystems of DOCT. In particular, the Legion subsystem, a distributed object handling subsystem of DOCT, will use MDAS and SRB to define intelligent storage vaults for storing Legion objects. MDAS will also be used to store meta-information that can be used to provide DOCT functionality such as security, scheduling and audit trails.

The DOCT project, sponsored by the Defense Advanced Research Projects Agency and the U.S. Patent and Trademark Office, teams the San Diego Supercomputer Center, as the lead organization, with nine other academic and industrial organizations: the California Institute of Technology, the National Center for Supercomputing Applications, Old Dominion University, Open Text Corporation, Science Applications International Corporation, the University of California, San Diego, the University of Virginia, Information Assets, Inc., and Jon Roberts & Associates. We expect this system to demonstrate applicability to a broad range of needs.

3. Archiving Digital Libraries - Integration of Digital Libraries with Archival Storage Technology

The Digital Library Initiative (DLI) sponsored by NSF, DARPA and NASA focuses on technology to "dramatically advance the means to collect, store, and organize information in digital forms, and to make it available for searching, retrieval and processing via communication networks." The digital library collaborations at SDSC are motivated by two concerns: first, to provide archival storage capability for digital library technology, and second, to provide interoperability between diverse digital library technology. A requirement of the technology is to provide rapid access to published data and to provide fault tolerance capability through replication of documents and repositories.

Systems are being designed that will be scalable to petabytes of data. This requires that the collections be stored in multi-level storage systems with varying latency and cost. Moreover, through document replication, the user will have fault tolerance and greater access throughput, taking advantage of the nearness and ease of access of different storage media. But even in the face of such complications, user access should be transparent with respect to the underlying storage hierarchy and should be unaffected by the movement of datasets between locations and storage media. Also, the replication of documents should be handled in such a way that user access is always consistent and the integrity of the served data is maintained.

We plan to provide the above functionality through the use of the MDAS and SRB systems discussed above. The MDAS metadata catalog holds information about different repositories and their locations, access protocols and transport protocols. Moreover, the catalog also holds information about replicas of individual documents and their update characteristics. The SRB will be used to provide uniform access to different storage media such as file systems, databases, object stores and archival storage systems. With this setup, legacy digital library software need not be aware of the interfaces of the underlying storage systems and can use a uniform API for data handling.
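
To make the role of replica information concrete, the following Python sketch shows one way a catalog's replica records could drive a transparent choice of storage location. It is a minimal sketch under stated assumptions: the Replica record, its latency_s and online fields, and the lowest-latency selection rule are all hypothetical and are not taken from the MDAS design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    resource: str         # name of the storage resource holding the copy
    latency_s: float      # assumed typical access latency in seconds
    online: bool = True   # whether the resource is currently reachable


def choose_replica(replicas):
    """Pick the reachable replica with the lowest assumed latency."""
    candidates = [r for r in replicas if r.online]
    if not candidates:
        raise LookupError("no replica is currently reachable")
    return min(candidates, key=lambda r: r.latency_s)


# Hypothetical replica records for one document.
replicas = [
    Replica("tape-archive", latency_s=90.0),
    Replica("disk-cache", latency_s=0.02, online=False),
    Replica("remote-fs", latency_s=0.5),
]

best = choose_replica(replicas)
```

Here the fastest copy is unreachable, so the selection falls back to the next fastest one. This is how replication can supply both fault tolerance and throughput without the caller naming a storage system.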

The MDAS and SRB combination also provides a powerful framework for interoperating digital libraries. The connectivity information, capabilities and interfaces of different digital libraries can be registered with the MDAS metadata catalogs, and a uniform access protocol can be established that abstracts access to the underlying digital libraries. At this level, one needs semantic interoperability, and we contend that the domain-specific metadata capability of MDAS can be used effectively to provide the semantic integration. We will be a mirror site for the University of California at Berkeley Digital Library and the Alexandria Digital Library of the University of California at Santa Barbara, and we plan to provide an integrating facility through the MDAS and SRB framework.

More details about MDAS can be found at the MDAS web site. Extensive discussions of the MDAS system design and metadata schema design can be found in a technical report at that web site. A poster at the web site also shows various aspects of the system.

Submitted to the Metadata Registry Workshop.
For further information contact: Arcot Rajasekar (