Intelligent Metacomputing Testbed

(Distributed Object Computation Testbed (DOCT))

 

(NARA Supplement on Persistent Digital Archives)

 

San Diego Supercomputer Center

Reagan Moore, Principal Investigator

 

 

QUARTERLY SCIENTIFIC & TECHNICAL REPORT

July 1997 - September 1997

 

 

 

Sponsored by:

Advanced Research Projects Agency/ITO

 

ARPA Order No. D570

Issued by ESC/DIB under contract F19628-97-c-0060

 

Disclaimer: "The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the US Government."

Table of Contents

 

1. Task Objectives

2. Technical Problems

3. General Methodology

3.1 Technical Methodology

3.2 Management Methodology

4. Technical Roadmap

4.1 Testbed implementation status

5. Special Comments

1. Task Objectives

The National Archives and Records Administration (NARA) has funded a supplement to the Distributed Object Computation Testbed (DOCT) to examine application of supercomputing technology to the formation of a persistent digital archive. The DOCT project provides hardware and software systems on which to demonstrate advanced supercomputing technology that manipulates data on the scale required by NARA. The challenges facing NARA can only be solved through use of advanced technology:

As technology evolves, every component of the persistent archive will have to change to avoid obsolescence and loss of functionality. The activities in this project are categorized under three major task headings: data ingestion, persistent storage, and archiving metadata.

For each task, the goal is to identify the appropriate information management strategies that are needed to assure the ability to persistently store, manage, and retrieve electronic records. The strategies will be evaluated in terms of commercially available technologies. Prototypes will be assembled to demonstrate the ability to accession at least 1 million records per year, and manage multiple terabytes of data. The major challenge will be to show that the prototype can be designed such that any component is replaceable with new technology, and that any new mechanisms for indexing or retrieving the electronic records can be accommodated.

The architecture that is needed to support the persistent storage of archived records is represented in Figure 1, categorized by the levels of organization used to manage data. For each level, languages can be defined to specify the information management capabilities. The associated data flow and data control mechanisms can then be implemented, using the proposed formatting standards. The goal of the persistent storage project is to identify where current technology is adequate, and where new technology is needed to build persistent archives. Standards are needed to explicitly define interactions at each level of this hierarchy to make it possible to migrate the underlying hardware and software systems onto new technology.

Columns: Infrastructure, Language, Data Flow, Data Control, Implementation, Format

Presentation; DSSSL
Schema: Ontology Definition; Schema Manipulation; Access; Identification; XML-QL
Metadata: Metadata Definition; Metadata Manipulation; Database; Handling; XML
Archive: Collection Layout; Storage Management; Media; Storage systems; AIP

Figure 1. Information Management Architecture for a Persistent Archive

 

The tasks listed in this report are defined in a Research and Development Plan and Schedule, accessible at the URL:

http://www.sdsc.edu/DOCT/Publications/nara-rdps.doc

2. Technical Problems

The immediate high-priority task is the development of suitable test data collections for demonstrating the ability of the DOCT testbed to handle large data ingestion rates. These data collections are being defined jointly with the National Archives.

The tasks that pose the highest technical risk are related to presentation of documents for which the original formatting software is no longer available. This project will demonstrate one technique for guaranteeing the ability to present archived documents through the migration of well-defined data to new formatting standards. As a test case, the USPTO patent database will be migrated from the proprietary Greenbook format to the emerging XML formatting standard.
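As a minimal sketch of this kind of format migration, the Python fragment below converts a tagged, line-oriented record into XML using only the standard library. The record layout (a short tag followed by a value on each line) and the tag names TTL and INVT are illustrative assumptions, not the actual Greenbook specification, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def tagged_to_xml(text, root_name="record"):
    """Convert 'TAG value' lines (an assumed, simplified layout) into an XML string."""
    root = ET.Element(root_name)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between fields
        # Split on the first space: the tag comes first, the rest is the value.
        tag, _, value = line.partition(" ")
        child = ET.SubElement(root, tag.lower())
        child.text = value.strip()
    # ElementTree escapes characters such as '&' when serializing.
    return ET.tostring(root, encoding="unicode")

sample = "TTL Widget fastener\nINVT Doe; Jane"
print(tagged_to_xml(sample))
```

Once the content is held in a well-defined markup, presentation can be regenerated with whatever formatting software is current, which is the point of migrating away from a proprietary format.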

3. General Methodology

3.1 Technical Methodology

The software infrastructure is being assembled through the integration of digital library, information discovery, data handling, data processing, and archival storage systems. Many of these systems are already integrated as part of the DOCT testbed. When possible, commercial systems will be identified that can support the persistent data archive.

 

3.2 Management Methodology

Management will be achieved through development of a Research and Development Plan and Schedule. The tasks specified by this plan will be developed in collaboration with the National Archives.

Demonstrations of the technology will be arranged with the National Archives as results are achieved. The first demonstration is expected in February 1999 and will show an ingestion rate of 1 million records per year.
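To put this target in perspective, the sustained rate implied by 1 million records per year can be worked out directly. The sketch below assumes ingestion is spread evenly over a 365-day year, which is a simplifying assumption; real loads would likely arrive in bursts. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def records_per_day(records_per_year, days_per_year=365):
    """Average daily rate, assuming an even spread over the year."""
    return records_per_year / days_per_year

def records_per_second(records_per_year, days_per_year=365):
    """Average per-second rate, assuming round-the-clock ingestion."""
    return records_per_day(records_per_year, days_per_year) / SECONDS_PER_DAY

print(f"{records_per_day(1_000_000):.0f} records/day")
print(f"{records_per_second(1_000_000):.4f} records/s")
```

Under these assumptions the target works out to roughly 2,740 records per day, or about one record every 32 seconds.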

 

4. Technical Roadmap

Excerpts from the Research and Development Plan and Schedule (RDPS) are listed to define the technical agenda for this project. The RDPS defines the schedule for accomplishing the tasks and lists the issues, dependencies between tasks, risks, expected products, and technology transfer mechanisms.

Task I: Data Ingestion

SDSC will demonstrate data ingestion for several different document collections, each with a large number of documents, and quantify the issues involved in ingesting these collections into a long-term archival storage system. The collections considered include electronic mail, word processing documents, a database consisting of service tickets, a patent document collection, and selected heterogeneous documents acquired from Web sites.

For each corpus, the ingestion tasks include:

  1. forming the corpus for the test demonstrations.
  2. ingesting the data in "raw" form into the archive. The separation of data load from the data accession steps will allow the construction of scalable systems.
  3. validating the "raw" input data. This will require processing the loaded data on a parallel computer. A typical approach will be to compare each loaded document against a DTD.
  4. creating an error database, if necessary. This will require developing appropriate metadata to describe the degraded data, such that retrieval of specific records will still be possible.
  5. classifying each validated document. This will require consensus on a persistent archive schema. We propose integrating Dublin Core provenance information with the SDSC IVcore image metadata, and additional metadata needed to migrate the data management software forward in time. To ensure the ability of the system to evolve, an extensible schema will be used.
  6. indexing, cataloging, and archiving validated documents. This will require encapsulating each electronic record as an Archival Information Package. We will explore the use of XML for annotating each record with the required metadata, such as access control information.
  7. performing a price/performance analysis for the hardware, software, and labor involved, including time taken.
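The validation, error-database, and encapsulation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: a well-formedness check plus a list of required elements stands in for full DTD validation (which the Python standard library does not provide), and the `aip` layout, the `REQUIRED` list, and the "public" access value are all hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

REQUIRED = ("title", "body")  # hypothetical stand-in for DTD constraints

def validate(raw):
    """Return (root, None) if the document parses and has the required
    children, otherwise (None, reason)."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        return None, f"not well-formed: {exc}"
    missing = [name for name in REQUIRED if root.find(name) is None]
    if missing:
        return None, "missing elements: " + ", ".join(missing)
    return root, None

def make_aip(record_id, root, raw):
    """Wrap a validated record with metadata (hypothetical AIP layout)."""
    aip = ET.Element("aip", id=record_id)
    meta = ET.SubElement(aip, "metadata")
    ET.SubElement(meta, "title").text = root.findtext("title")
    ET.SubElement(meta, "sha256").text = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    ET.SubElement(meta, "access").text = "public"  # placeholder access control
    aip.append(root)
    return aip

def ingest(raw_documents):
    """Split loaded documents into packaged records and an error list.
    Failed records keep their id and raw content so they stay retrievable."""
    archive, errors = [], []
    for record_id, raw in raw_documents.items():
        root, reason = validate(raw)
        if root is None:
            errors.append({"id": record_id, "reason": reason, "raw": raw})
        else:
            archive.append(make_aip(record_id, root, raw))
    return archive, errors

docs = {
    "r1": "<doc><title>Memo</title><body>Hello</body></doc>",
    "r2": "<doc><title>Broken</title>",  # unclosed element, fails validation
}
archive, errors = ingest(docs)
print(len(archive), "packaged;", len(errors), "in error database")
```

Because each document is validated independently of the others, this loop is the kind of work that could be distributed across the nodes of a parallel computer after the raw load has completed.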

Where applicable, alternate technologies will be studied to support ingestion of different types of document collections. A possible technology that will be considered for data accessioning is the IBM research prototype, Grand Central Station, which has the capability to parse the structure of various types of documents and generate indexes based on the known structure.

The data ingestion tasks are:

Task II: Persistent Storage

This task involves technology evaluations of persistent storage systems to verify that they can provide the performance necessary for long-term archiving of very large numbers of electronic records. The task also addresses how to deal with technology evolution over the long term.

The tasks related to persistent storage are:

 

Task III: Archiving Metadata

This task involves studying issues related to handling changes in metadata over time. Even if the technology does not evolve, the metadata related to individual documents, or to the corpus as a whole, can be expected to change. Long-term archiving must account for such evolution by archiving metadata definitions along with the metadata itself.
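One way to picture archiving a definition together with its metadata is to store each record's values next to the version of the definition that gives them meaning. The sketch below is an illustration only; the field names, the versioning scheme, and the JSON encoding are assumptions made for the example, not choices stated in the plan.

```python
import json

def archive_metadata(record_id, values, definition):
    """Bundle metadata values with the definition they were written against."""
    unknown = sorted(set(values) - set(definition["fields"]))
    if unknown:
        raise ValueError(f"fields not in definition: {unknown}")
    return json.dumps(
        {"record": record_id, "definition": definition, "values": values},
        sort_keys=True,
    )

# Two generations of a hypothetical metadata definition.
definition_v1 = {"version": 1, "fields": {"creator": "string", "date": "ISO 8601 date"}}
definition_v2 = {"version": 2, "fields": {**definition_v1["fields"], "format": "string"}}

old = archive_metadata("r1", {"creator": "J. Doe", "date": "1997-09-30"}, definition_v1)
new = archive_metadata("r2", {"creator": "J. Doe", "format": "XML"}, definition_v2)
print(json.loads(old)["definition"]["version"], json.loads(new)["definition"]["version"])
```

Records written under the first definition remain interpretable after the second is introduced, because each carries the definition it was created with.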

The tasks related to metadata management are:

 

4.1 Testbed implementation status

 

The DOCT testbed has been upgraded to provide access to the latest version of HPSS (version 3.2), along with the most recent version of the associated IBM Universal Database. The combined product is in test, with the expectation that the accession of the 1 million records can be managed by UDB while the archive holds the electronic records.

 

A two-stage process is being used: electronic records are first loaded, then categorized and stored in the combined UDB/HPSS product. Migration of electronic records to new media will be managed using HPSS technology. Migration of electronic records to new formatting or presentation standards will require processing of the records, which will be coordinated through use of the SRB data handling system.
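The idea of migrating records to new media without changing their content can be illustrated with a small sketch. The `DictStore` class below is an in-memory stand-in with a hypothetical put/get/keys interface; it does not represent the HPSS or SRB programming interfaces. The checksum comparison after each copy is an added assumption about how a migration might be verified.

```python
import hashlib

class DictStore:
    """In-memory stand-in for a storage system (hypothetical interface)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def keys(self):
        return list(self._objects)

def migrate(source, target):
    """Copy every object to the target store and verify it after the copy."""
    migrated = []
    for key in source.keys():
        data = source.get(key)
        target.put(key, data)
        # Read the object back and compare digests to detect a corrupted copy.
        if hashlib.sha256(target.get(key)).digest() != hashlib.sha256(data).digest():
            raise IOError(f"checksum mismatch for {key}")
        migrated.append(key)
    return migrated

old_media, new_media = DictStore(), DictStore()
old_media.put("r1", b"electronic record")
print(migrate(old_media, new_media))
```

Any store that offers the same three operations could replace either side, which mirrors the requirement that each component of the archive be replaceable with new technology.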

 

5. Special Comments

We are maintaining a Research and Development Plan & Schedule (RDPS) document for this project. The RDPS is a detailed, comprehensive overview of all NARA tasks, including all deliverables (reports and demonstrations). This document is the authoritative source for our project goals, tasks, deliverables, schedule, and expected work loads. The RDPS is available via the World Wide Web at the URL:

http://www.sdsc.edu/DOCT/Publications/nara-rdps.doc

Many of the concepts being explored within this project are also supported through funding provided by other federal agencies. These efforts are documented at the URL:

http://www.npaci.edu/DICE/

 
This URL provides references to technical reports and presentations that have been made on information management technology.