January 11, 1999
To: ARPA Agent
Charles Shank
ESC/DIB
Bldg. 1704 Arlington, VA 22203-1714
Hanscom AFB, MA 01731-2116
ESC/PKRD
Attn: Ms Carole Stephan
104 Barksdale St., Bldg 1520
Hanscom AFB, MA 01731-1806
(Letter of Transmittal Only)
National Archives and Records Administration / NWME
Attn: Robert Chadduck
8601 Adelphi Rd.
College Park, MD 20740
Re: R&D Status Report
National Archives and Records Administration
Distributed Object Computation Testbed
(Formerly Intelligent Metacomputing Testbed)
F19628-96-C-0020
San Diego Supercomputer Center (SDSC)
To Program Manager:
During the period October – December 1998, the National Archives and Records Administration (NARA) extension to the Distributed Object Computation Testbed has focused on initial loading of representative data collections and development of the basic software infrastructure needed to support collection based persistent archives. A Research and Development Plan and Schedule has been negotiated with NARA and is published on the DOCT web page.
NARA Project Status:
The NARA extension to the DOCT project is designed to explore the application of high performance supercomputing technologies to support persistent object collections. The challenges are:
The San Diego Supercomputer Center is identifying the information architecture required to build a persistent archive. The approach is based upon the concept that both the original digital objects and the information required to assemble the digital objects into a data collection must be archived. Digital objects are not archived as stand-alone entities, but instead are archived as members of a digital data collection. Persistence is achieved through identification of meta-data for all attributes related to digital object properties and collection organization. Persistence is demonstrated by dynamically building the data collection from the individual data objects stored in the archive, dynamically creating the queries needed to discover information within the data collection, and dynamically constructing the presentation interface for the digital objects discovered through a query.
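The collection-based approach described above can be illustrated with a minimal Python sketch. It is not the SDSC software: the class names, attribute names, and sample records are hypothetical and chosen only to show the idea that the collection is rebuilt from archived objects plus archived collection meta-data, and that queries are generated against the rebuilt collection.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedObject:
    """A digital object stored together with its descriptive meta-data."""
    object_id: str
    content: bytes
    attributes: dict


@dataclass
class Archive:
    """Holds the raw objects plus the meta-data that organizes the collection."""
    collection_schema: list  # attribute names that define the collection (hypothetical)
    objects: dict = field(default_factory=dict)

    def store(self, obj: ArchivedObject) -> None:
        self.objects[obj.object_id] = obj

    def build_collection(self) -> list:
        """Reassemble the collection as rows driven by the archived schema."""
        rows = []
        for obj in self.objects.values():
            row = {"object_id": obj.object_id}
            for name in self.collection_schema:
                row[name] = obj.attributes.get(name)
            rows.append(row)
        return rows


def query(collection: list, **criteria) -> list:
    """Return the ids of rows whose attributes match every criterion."""
    return [row["object_id"] for row in collection
            if all(row.get(key) == value for key, value in criteria.items())]


# Hypothetical sample data, for illustration only.
archive = Archive(collection_schema=["newsgroup", "year"])
archive.store(ArchivedObject("msg-1", b"first body",
                             {"newsgroup": "comp.lang.perl", "year": 1998}))
archive.store(ArchivedObject("msg-2", b"second body",
                             {"newsgroup": "sci.physics", "year": 1998}))

collection = archive.build_collection()
matches = query(collection, newsgroup="sci.physics")
```

Because the schema is stored with the objects, the collection can be rebuilt without any information held outside the archive, which is the property the approach relies on for persistence.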
The software systems required to dynamically assemble a data collection, load the digital objects, and support arbitrary queries against the assembled collection have been developed at SDSC as part of a Meta-data Catalog system (MCAT). The software is used with the SDSC Storage Resource Broker to support the creation of scientific data collections. Software is under development to characterize presentation interfaces as semi-structured data organized through the use of XSL style sheets. This latter system supports mediation of information for XML tagged data sets (MIX).
The approach that is being followed is broken down into four areas corresponding to the automated loading of data and generation of meta-data, the long-term storage of digital objects, support for information discovery against the archived data, and the presentation of the discovered data objects. Each area has multiple components:
SDSC is using the following data collections to demonstrate the generation of a persistent archive:
Each collection provides a different test of the functionality or the performance of the system. The ingestion of a variety of data collections into the SDSC High Performance Storage System (HPSS) helps us understand the issues associated with data accession. It also demonstrates the feasibility of archiving data collections as well as digital objects, and demonstrates the level of performance that can be expected (number of records accessioned per year).
Nearly all data (apart from a 4mm cartridge) have been ingested in raw format and loaded onto disk or, in the case of the census data, into HPSS. For some sources (i.e., the newsgroup and census data collections), Perl scripts were written and used to dynamically assemble the data collections.
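The kind of meta-data extraction such scripts perform on a newsgroup message can be sketched as follows. The project scripts were written in Perl and are not reproduced here; this is a Python stand-in using the standard `email` module, and the sample message and the choice of header fields are assumptions made for illustration.

```python
from email import message_from_string

# Hypothetical Usenet-style message; not taken from the project data.
RAW_MESSAGE = """\
From: someone@example.org
Newsgroups: comp.archives
Subject: Test posting
Date: Mon, 11 Jan 1999 10:00:00 -0800
Message-ID: <123@example.org>

Body of the posting.
"""

# Header fields treated as collection meta-data (an assumed selection).
META_FIELDS = ("Message-ID", "From", "Newsgroups", "Subject", "Date")


def extract_metadata(raw: str) -> dict:
    """Parse the message headers and keep only the selected fields."""
    msg = message_from_string(raw)
    return {name: msg.get(name) for name in META_FIELDS}


record = extract_metadata(RAW_MESSAGE)
```

Each resulting record could then be loaded into a catalog as one row of the collection, with the raw message stored separately as the digital object.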
SDSC will use technology originally acquired as part of the DOCT project to support this effort. The components include:
Current Activities:
The data ingestion phase of the project is nearly complete. By January 20, we expect to have assembled a one million record data collection of Usenet messages. The collection will be used in February to demonstrate the ingestion, meta-data extraction, archival storage, dynamic assembly of the digital objects into a collection, and retrieval of individual objects from the collection through a dynamically generated query. Current timing estimates indicate that the entire process can be completed within 24 hours for a million record collection.
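The 24-hour estimate for a million record collection implies a sustained end-to-end rate that is easy to work out. The short calculation below uses only those two figures; the yearly number is an extrapolation that assumes the same rate could be sustained continuously, which the estimate itself does not claim.

```python
RECORDS = 1_000_000
WINDOW_SECONDS = 24 * 60 * 60  # 86,400 seconds in 24 hours

# Average rate needed to process the whole collection within the window.
records_per_second = RECORDS / WINDOW_SECONDS

# Average time budget per record, in milliseconds.
ms_per_record = 1000 * WINDOW_SECONDS / RECORDS

# Extrapolation only: assumes the daily rate holds for 365 days.
records_per_year = RECORDS * 365

print(f"{records_per_second:.1f} records/s, {ms_per_record:.1f} ms per record")
```

This works out to roughly 11.6 records per second, or about 86 milliseconds per record across all processing steps combined.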
The archival storage phase of the project requires a demonstration of the minimal hardware components that are needed to support the persistent archive. Two reports have been written that characterize the workload currently supported by the SDSC archival storage system, and the hardware requirements needed to keep the system functioning smoothly. The reports will be used as the basis for scaling the hardware capacity to the expected data management requirements of NARA.
The information management software that is needed to dynamically assemble a data collection from archived digital objects has been prototyped. We will test the system against a selected subset of the data collections that have been loaded into HPSS.
The remaining effort is to develop the ability to dynamically generate the GUI used to access the data collection. The basic concepts have been developed for the software infrastructure. We expect to define a DTD to represent the GUI data layout. The MIX technology will be used to support the mapping of the database attributes to the GUI DTD. A Blended Browsing and Query (BBQ) system is being developed to support dynamic creation of the GUI. A prototype of this software system is expected in the second quarter of 1999.
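The general idea of deriving a GUI layout document from database attributes can be sketched as below. The GUI DTD has not yet been defined, so the element names (`gui`, `field`), the attribute descriptions, and the widget types are all hypothetical; the sketch does not represent the MIX or BBQ software.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptions of database attributes to expose in the GUI.
ATTRIBUTES = [
    {"name": "subject", "label": "Subject", "widget": "text"},
    {"name": "newsgroup", "label": "Newsgroup", "widget": "choice"},
]


def build_layout(attributes):
    """Map each database attribute to one <field> element of a layout document."""
    root = ET.Element("gui")
    for attr in attributes:
        field = ET.SubElement(
            root, "field", {"name": attr["name"], "widget": attr["widget"]}
        )
        field.text = attr["label"]
    return root


layout = build_layout(ATTRIBUTES)
xml_text = ET.tostring(layout, encoding="unicode")
```

A presentation layer could then read such a document and render one input control per `field` element, so that a change in the collection attributes would change the interface without hand-written GUI code.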
Publications:
On-line information about the DOCT project is available at the URLs:
http://www.sdsc.edu/DOCT and http://www.npaci.edu/DICE. NARA-related publications are included on the DOCT web page.
The status of the data collection ingestion process is provided in a separate technical report available on the DOCT web page:
R. Moore, "Status of Collection Based Persistent Archives"
Modeling of the expected performance of the High Performance Storage System is described in two reports:
R. Moore, J. Lopez, C. Lofton, W. Schroeder, G. Kremenek, "Configuring and Tuning Archival Storage Systems", to be published in the Proceedings of the 16th IEEE Symposium on Mass Storage Systems and the 7th NASA Goddard Space Flight Center Conference on Mass Storage, San Diego, Mar 1999.
W. Schroeder, R. Marciano, J. Lopez, M. Gleicher, G. Kremenek, C. Baru, R. Moore, "Analysis of HPSS Performance Based on Per-file Transfer Logs", to be published in the Proceedings of the 16th IEEE Symposium on Mass Storage Systems and the 7th NASA Goddard Space Flight Center Conference on Mass Storage, San Diego, Mar 1999.
The technology that is used to implement a Collection Based Persistent Archive is described in the report:
A. Rajasekar, R. Moore, "Collection Based Persistent Archives", to be published in the Proceedings of the 16th IEEE Symposium on Mass Storage Systems and the 7th NASA Goddard Space Flight Center Conference on Mass Storage, San Diego, Mar 1999.
Sincerely yours,
Reagan Moore
SDSC
Attachments: Financial Status Table
Copies: ARPA Agent (2 copies)
NARA (2 copies)
ESC/PKRD (Transmittal Letter only)
DCAA San Diego North County office
Dr. Reagan Moore, SDSC
Ms. Bettye Washington, General Atomics
R&D STATUS REPORT
Program Financial Status
CUMULATIVE THROUGH December, 1998
$(000)
| WORK BREAKDOWN STRUCTURE | PLANNED EXPEND | ACTUAL EXPEND | Remaining Funding | PERCENT COMPLETE | Budget Completion |
|---|---|---|---|---|---|
| NARA extension to DOCT | $300 | $64 | $236 | 16% | 21% |
| TOTAL | $300 | $64 | $236 | 16% | 21% |
To: ESC/PKRD January 11, 1999
Attn: Ms Carole Stephan
104 Barksdale St., Bldg 1520
Hanscom AFB, MA 01731-1806
Subject: 1st Quarter FY99 Reports, F19628-96-C-0020
Intelligent Metacomputing Testbed: Distributed Object Computation Testbed
NARA Supplement San Diego Supercomputer Center (SDSC)
Dear Ms Stephan:
The quarterly status report for the period October – December 1998 for the National Archives and Records Administration extension to the Distributed Object Computation Testbed (DOCT) has been sent to the following individuals:
ARPA Agent
Charles Shank
ESC/DIB
Bldg. 1703, Rm. 114
5 Eglin Street
Hanscom AFB, MA 01731-2116
National Archives and Records Administration / NWME
Attn: Robert Chadduck
8601 Adelphi Rd.
College Park, MD 20740
Thank you,
Reagan W. Moore
San Diego Supercomputer Center