Intelligent Metacomputing Testbed
(Distributed Object Computational Testbed (DOCT))
(NARA Supplement on Persistent Digital Archives)
San Diego Supercomputer Center
Reagan Moore, Principal Investigator
QUARTERLY SCIENTIFIC & TECHNICAL REPORT
July 1997 - September 1997
Sponsored by:
Advanced Research Projects Agency/ITO
ARPA Order No. D570
Issued by ESC/DIB under contract F19628-97-c-0060
Disclaimer: "The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the US Government."
Table of Contents
1. Task Objectives
2. Technical Problems
3. General Methodology
3.1 Technical Methodology
3.2 Management Methodology
4. Technical Roadmap
4.1 Testbed implementation status
5. Special Comments
The National Archives and Records Administration (NARA) has funded a supplement to the Distributed Object Computation Testbed (DOCT) to examine application of supercomputing technology to the formation of a persistent digital archive. The DOCT project provides hardware and software systems on which to demonstrate advanced supercomputing technology that manipulates data on the scale required by NARA. The challenges facing NARA can only be solved through use of advanced technology:
As technology evolves, every component of the persistent archive will have to change to avoid obsolescence and loss of functionality. The activities in this project are categorized under three major task headings: data ingestion, persistent storage, and metadata management.
For each task, the goal is to identify the appropriate information management strategies that are needed to assure the ability to persistently store, manage, and retrieve electronic records. The strategies will be evaluated in terms of commercially available technologies. Prototypes will be assembled to demonstrate the ability to accession at least 1 million records per year, and manage multiple terabytes of data. The major challenge will be to show that the prototype can be designed such that any component is replaceable with new technology, and that any new mechanisms for indexing or retrieving the electronic records can be accommodated.
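To give a sense of scale, the accession target can be converted into an average sustained rate. The following is a minimal sketch that assumes ingestion is spread evenly around the clock over a 365-day year; that assumption, and the function name `sustained_rate`, are illustrative and do not come from the project plan.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: non-leap year, ingestion running every day


def sustained_rate(records_per_year: int) -> dict:
    """Convert a yearly accession target into average sustained rates."""
    per_day = records_per_year / DAYS_PER_YEAR
    per_hour = per_day / 24
    per_second = per_day / SECONDS_PER_DAY
    return {
        "per_day": per_day,
        "per_hour": per_hour,
        "per_second": per_second,
        "seconds_per_record": 1 / per_second,
    }


rates = sustained_rate(1_000_000)
print(f"{rates['per_day']:.0f} records per day")
print(f"{rates['per_hour']:.0f} records per hour")
print(f"one record every {rates['seconds_per_record']:.1f} seconds")
```

Under these assumptions the target works out to roughly 2,740 records per day, or about one record every 31.5 seconds. A real ingestion run would more likely proceed in bursts, so peak rates would need to be higher than this average.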
The architecture needed to support the persistent storage of archived records is represented in Figure 1, categorized by the levels of organization used to manage data. For each level, a language can be defined to express the information management capabilities. The associated data flow and data control mechanisms can then be implemented, using the proposed formatting standards. The goal of the persistent storage project is to identify where current technology is adequate, and where new technology is needed to build persistent archives. Standards are needed to explicitly define interactions at each level of this hierarchy to make it possible to migrate the underlying hardware and software systems onto new technology.
| Infrastructure | Language            | Data Flow             | Data Control | Implementation |
| Format         | Presentation        | DSSSL                 |              |                |
| Schema         | Ontology Definition | Schema Manipulation   |              |                |
| Access         | Identification      | XML-QL                |              |                |
| Metadata       | Metadata Definition | Metadata Manipulation |              |                |
| Database       | Handling            | XML                   |              |                |
| Archive        | Collection Layout   | Storage Management    |              |                |
| Media          | Storage systems     | AIP                   |              |                |
Figure 1. Information Management Architecture for a Persistent Archive
The tasks listed in this report are defined in a Research and Development Plan and Schedule, accessible at the URL:
http://www.sdsc.edu/DOCT/Publications/nara-rdps.doc
The immediate high-priority task is the development of suitable test data collections for demonstrating the ability of the DOCT testbed to handle large data ingestion rates. These data collections are being defined jointly with the National Archives.
The tasks that pose the highest technical risk are related to presentation of documents for which the original formatting software is no longer available. This project will demonstrate one technique for guaranteeing the ability to present archived documents through the migration of well-defined data to new formatting standards. As a test case, the USPTO patent database will be migrated from the proprietary Greenbook format to the emerging XML formatting standard.
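The migration idea can be illustrated with a small sketch that converts a tagged, line-oriented record into XML. The sample record, its four-character tags, and the tag-to-element mapping are all hypothetical and are not taken from the Greenbook specification; the code uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged record: a short tag, a space, then the field value.
# Tags and values are invented for illustration only.
SAMPLE_RECORD = """\
DOCN 0000001
TITL Example widget
INVT Jane Doe
INVT John Roe
ABST A widget that illustrates format migration.
"""

# Hypothetical mapping from source tags to XML element names.
TAG_TO_ELEMENT = {
    "DOCN": "document-number",
    "TITL": "title",
    "INVT": "inventor",
    "ABST": "abstract",
}


def tagged_to_xml(text: str, root_name: str = "record") -> ET.Element:
    """Build an XML element tree from 'TAG value' lines."""
    root = ET.Element(root_name)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, _, value = line.partition(" ")
        name = TAG_TO_ELEMENT.get(tag)
        if name is None:
            # Keep unmapped fields instead of dropping them, so no
            # information is lost during migration.
            child = ET.SubElement(root, "unmapped", {"source-tag": tag})
        else:
            child = ET.SubElement(root, name)
        child.text = value.strip()
    return root


xml_text = ET.tostring(tagged_to_xml(SAMPLE_RECORD), encoding="unicode")
print(xml_text)
```

Retaining unmapped fields under a generic element is one possible design choice: it lets the migration be repeated later with a more complete mapping without returning to the original format.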
The software infrastructure is being assembled through the integration of digital library, information discovery, data handling, data processing, and archival storage systems. Many of these systems are already integrated as part of the DOCT testbed. When possible, commercial systems will be identified that can support the persistent data archive.
Management Methodology
Management will be achieved through development of a Research and Development Plan and Schedule. The tasks specified by this plan will be developed in collaboration with the National Archives.
Demonstrations of the technology will be arranged with the National Archives as results are achieved. The first demonstration is expected in February 1999 and will show an ingestion rate of 1 million records per year.
Excerpts from the Research and Development Plan and Schedule are listed to define the technical agenda for this project. The RDPS defines the time schedule for the accomplishment of the tasks, lists issues, dependencies between tasks, risks, expected products, and technology transfer mechanisms.
SDSC will demonstrate data ingestion for several different document collections, each with a large number of documents, and quantify the issues involved in ingesting these collections into a long-term archival storage system. The collections considered include electronic mail, word processing documents, a database consisting of service tickets, a patent document collection, and selected heterogeneous documents acquired from Web sites.
For each corpus, the ingestion tasks include:
Where applicable, alternative technologies will be studied to support ingestion of different types of document collections. One candidate technology for data accessioning is the IBM research prototype Grand Central Station, which can parse the structure of various types of documents and generate indexes based on the known structure.
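Indexing by known document structure can be sketched for the electronic mail collection. This example is not Grand Central Station and does not reflect its interfaces; it is a stand-in that uses Python's standard `email` parser, with invented sample messages and a hypothetical choice of headers to index.

```python
from collections import defaultdict
from email import message_from_string

# Invented sample messages, keyed by a hypothetical record identifier.
SAMPLE_MESSAGES = {
    "msg-001": "From: alice@example.org\nTo: bob@example.org\n"
               "Subject: Accession schedule\n\nThe first batch is ready.\n",
    "msg-002": "From: bob@example.org\nTo: alice@example.org\n"
               "Subject: Re: Accession schedule\n\nThanks.\n",
    "msg-003": "From: alice@example.org\nTo: carol@example.org\n"
               "Subject: Storage media\n\nTape order placed.\n",
}

# Assumption: these are the structural fields worth indexing.
INDEXED_HEADERS = ("From", "To", "Subject")


def build_index(messages: dict) -> dict:
    """Map header name -> normalized header value -> set of record ids."""
    index = defaultdict(lambda: defaultdict(set))
    for record_id, raw in messages.items():
        msg = message_from_string(raw)
        for header in INDEXED_HEADERS:
            value = msg[header]  # None when the header is absent
            if value is not None:
                index[header][value.strip().lower()].add(record_id)
    return index


index = build_index(SAMPLE_MESSAGES)
print(sorted(index["From"]["alice@example.org"]))
```

Because the index is driven by the structure of the document type rather than by its content, the same pattern could in principle be repeated for other collections by swapping in a different parser and field list.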
The data ingestion tasks are:
This task involves technology evaluations of persistent storage systems to verify that they can provide the performance needed for long-term archiving of very large numbers of electronic records. The task also addresses issues in dealing with technology evolution over the long term.
The tasks related to persistent storage are:
This task involves studying issues related to handling changes in metadata over time. Even in the absence of technology evolution, the metadata related to individual documents, or to the corpus as a whole, can be expected to evolve. Long-term archiving must account for such evolution by archiving metadata definitions along with the metadata itself.
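One way to picture archiving metadata definitions together with the metadata is to store versioned definitions next to the records and have each record name the version it was written under. The sketch below is illustrative only; the class names, field names, and JSON layout are assumptions, not the project's design.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MetadataDefinition:
    version: int
    fields: dict  # field name -> human-readable meaning


@dataclass
class ArchivedRecord:
    record_id: str
    definition_version: int  # which definition this metadata follows
    metadata: dict


class MetadataArchive:
    """Keeps every definition version so old records stay interpretable."""

    def __init__(self):
        self.definitions = {}
        self.records = []

    def add_definition(self, definition: MetadataDefinition) -> None:
        if definition.version in self.definitions:
            raise ValueError("definition versions are never overwritten")
        self.definitions[definition.version] = definition

    def add_record(self, record: ArchivedRecord) -> None:
        definition = self.definitions[record.definition_version]
        unknown = set(record.metadata) - set(definition.fields)
        if unknown:
            raise ValueError(f"fields not in definition: {sorted(unknown)}")
        self.records.append(record)

    def describe(self, record: ArchivedRecord) -> dict:
        """Pair each metadata value with the meaning recorded at ingest time."""
        definition = self.definitions[record.definition_version]
        return {k: (definition.fields[k], v) for k, v in record.metadata.items()}

    def to_json(self) -> str:
        # Definitions and records are serialized together in one document.
        return json.dumps(
            {
                "definitions": [asdict(d) for d in self.definitions.values()],
                "records": [asdict(r) for r in self.records],
            },
            indent=2,
            sort_keys=True,
        )


archive = MetadataArchive()
archive.add_definition(MetadataDefinition(1, {"author": "Person who wrote the record"}))
archive.add_definition(MetadataDefinition(2, {"creator": "Person or office of origin",
                                              "series": "Records series name"}))
old = ArchivedRecord("rec-1", 1, {"author": "J. Smith"})
new = ArchivedRecord("rec-2", 2, {"creator": "Office A", "series": "Correspondence"})
archive.add_record(old)
archive.add_record(new)
print(archive.describe(old))
```

Records written under version 1 remain interpretable after version 2 is introduced, because the earlier definition is retained rather than replaced.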
The tasks related to metadata management are:
Testbed implementation status
The DOCT testbed has been upgraded to provide access to the latest version of HPSS (version 3.2), along with the most recent version of the associated IBM Universal Database (UDB). The combined product is under test, with the expectation that UDB can manage the accession of the 1 million records while the archive holds the electronic records.
Electronic records are handled in a two-stage process: they are first loaded, then categorized and stored in the combined UDB/HPSS product. Migration of electronic records to new media will be managed using HPSS technology. Migration of electronic records to new formatting or presentation standards will require processing of the records, which will be coordinated through the SRB data handling system.
We are maintaining a Research and Development Plan & Schedule (RDPS) document for this project. The RDPS is a detailed, comprehensive overview of all NARA tasks, including all deliverables (reports and demonstrations). This document is the authoritative source for our project goals, tasks, deliverables, schedule, and expected work loads. The RDPS is available via the World Wide Web at the URL:
http://www.sdsc.edu/DOCT/Publications/nara-rdps.doc
Many of the concepts being explored within this project are also supported through funding provided by other federal agencies. These efforts are documented under the URL:
http://www.npaci.edu/DICE/
This URL provides references to technical reports and presentations that have been made on information management technology.