Report on Collection Based Persistent Archives

April 1999

The San Diego Supercomputer Center is identifying the information architecture required to build a persistent archive. The approach is based upon the concept that both the original digital objects and the information required to assemble the digital objects into a data collection must be archived. Digital objects are not archived as stand-alone entities, but instead are archived as members of a digital data collection. Persistence is achieved through identification of meta-data for all attributes related to digital object properties and collection organization. Persistence is demonstrated by dynamically building the data collection from the individual data objects stored in the archive, dynamically creating the queries needed to discover information within the data collection, and dynamically constructing the presentation interface for the digital objects discovered through a query.
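The idea of rebuilding a collection from its members plus collection-level meta-data can be illustrated with a small sketch. This is not the MCAT schema; every name, attribute, and record layout below is invented purely for illustration.

```python
# Hypothetical sketch: reassembling a data collection from archived digital
# objects plus the collection-level meta-data that records its organization.
# Object layout, attribute names, and the meta-data schema are all invented.

archived_objects = [
    {"id": "obj-1", "attributes": {"state": "CA", "year": "1992"}},
    {"id": "obj-2", "attributes": {"state": "NY", "year": "1992"}},
]

collection_metadata = {"name": "TIGER/Line 1992", "organize_by": "state"}

def rebuild_collection(objects, metadata):
    """Group archived objects by the attribute named in the collection meta-data."""
    collection = {}
    for obj in objects:
        key = obj["attributes"][metadata["organize_by"]]
        collection.setdefault(key, []).append(obj["id"])
    return collection

def query(collection, key):
    """Discover member objects through the organizing attribute."""
    return collection.get(key, [])

rebuilt = rebuild_collection(archived_objects, collection_metadata)
print(query(rebuilt, "CA"))  # ['obj-1']
```

Because the organizing attribute lives in the archived meta-data rather than in code, the same routine can rebuild any collection whose organization was captured at ingestion time.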

The software systems required to dynamically assemble a data collection, load the digital objects, and support arbitrary queries against the assembled collection have been developed at SDSC as part of a Meta-data Catalog system (MCAT). The software is used with the SDSC Storage Resource Broker to support the creation of scientific data collections. Software is under development to characterize presentation interfaces as semi-structured data organized through the use of XSL style sheets. This latter system supports mediated information exchange (MIX) for XML tagged data sets.
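The notion of a presentation interface characterized as semi-structured data can be sketched as follows. This is only an analogue of an XSL transform, written in Python with invented tag names and templates; it is not the MIX system.

```python
# Illustrative only: a style table mapping XML tags to presentation markup,
# standing in for an XSL style sheet. Tags and templates are invented here.
import xml.etree.ElementTree as ET

record = ET.fromstring(
    "<patent><title>Widget</title><inventor>Doe</inventor></patent>"
)

# Style-sheet analogue: each element type gets a presentation template.
style = {"title": "<h1>{}</h1>", "inventor": "<p>Inventor: {}</p>"}

html = "".join(style[child.tag].format(child.text) for child in record)
print(html)  # <h1>Widget</h1><p>Inventor: Doe</p>
```

The point of the separation is the same as in the report: the presentation rules are data, so the interface can be regenerated or restyled without touching the archived objects.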

The approach is broken down into four areas: automated loading of data and generation of meta-data, long-term storage of digital objects, support for information discovery against the archived data, and presentation of the discovered data objects. Each area has multiple components.

SDSC is using the following data collections to demonstrate the generation of a persistent archive:
  1. Newsgroups
  2. TIGER/Line 1992 (Bureau of the Census)
  3. 104th Congress
  4. Electronic Access Project (EAP) -- updated
  5. Vote Archive Demo 1997 (VAD)
  6. Combat Area Casualties Current File (CACCF) -- updated
  7. Patent Data (USPTO)
  8. Joint Interoperability Test Command (JITC) -- new
  9. Image and Metadata collection (AMICO) -- new
Each collection provides a different test of the functionality or the performance of the system. The ingestion of a variety of data collections into the SDSC High Performance Storage System (HPSS) helps us understand the issues associated with data accession, demonstrates the feasibility of archiving data collections as well as digital objects, and demonstrates the level of performance that can be expected (number of records accessioned per year).

Nearly all of the data (apart from one 4mm cartridge) has been ingested in raw format and loaded onto disk or into HPSS. For ingestion from some sources (e.g., the newsgroup and census data collections), Perl scripts were written (see below) and used to dynamically assemble the data collections.
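The ingestion scripts mentioned above can be sketched roughly as follows. The sketch is in Python rather than Perl, and the message text and record layout are invented; real newsgroup sources carry many more headers.

```python
# Hypothetical sketch (Python standing in for the Perl ingestion scripts):
# parse a raw newsgroup message into a record of header meta-data plus body.
# The sample message and the record layout are invented for illustration.

RAW_MESSAGE = """\
From: author@example.com
Newsgroups: sci.archives
Subject: Test posting

Body of the message.
"""

def parse_message(text):
    """Split a raw message into header meta-data and a body payload."""
    head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        name, _, value = line.partition(": ")
        headers[name] = value
    return {"metadata": headers, "body": body.strip()}

record = parse_message(RAW_MESSAGE)
print(record["metadata"]["Subject"])  # Test posting
```

Records in this form carry their own attribute meta-data, which is what allows the collection to be reassembled dynamically after archiving.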

Preliminary structural and meta-data information was obtained from the collections, mainly by inspecting the sources manually and, to a limited extent, by using small Perl scripts to extract attribute names. The reporting format followed for each data collection is described below.
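The second technique, a small script that pulls candidate attribute names out of a source file, can be sketched as below. The sample header line and delimiter are assumptions; the actual formats varied by collection.

```python
# Minimal sketch of attribute-name extraction from a delimited source file.
# The sample header and the "|" delimiter are invented for this example.

SAMPLE_HEADER = "RECORD_ID|STATE|COUNTY|NAME"

def extract_attribute_names(header_line, delimiter="|"):
    """Treat the first line of a delimited file as a list of attribute names."""
    return [field.strip() for field in header_line.split(delimiter)]

print(extract_attribute_names(SAMPLE_HEADER))
# ['RECORD_ID', 'STATE', 'COUNTY', 'NAME']
```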

Results of the ingestion process are given for the first five collections. The sixth collection is being ingested, while the last two data collections will require the development of additional scripts to manage their very large size.

The USPTO patent collection will be used to populate a new linked database/archival storage system based on the IBM Universal Database and HPSS. The expectation is that this system can store individual objects directly within HPSS, without having to concatenate digital objects through the use of TAR. The patent collection will also be used to benchmark the effort required to migrate a collection to a new meta-data markup language; scripts are being tested that convert the current SGML-based version to an XML-based version.

The image collection contains about 10,000 art images comprising about 25 GB of data. The challenge with this collection is the development of presentation interfaces that can provide the multiple views needed into different resolutions of the images. The image collection will be ingested into the archive at SDSC during January.
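One step of the SGML-to-XML migration mentioned above can be sketched as follows. SGML permits markup that XML forbids, such as upper-case tag names and elements left unclosed; the sketch normalizes both. The tag set and sample text are invented, and real patent SGML is far richer than this.

```python
# Hedged sketch of one SGML-to-XML normalization step: lower-case the tag
# names and self-close known empty elements. The EMPTY_TAGS set and the
# sample markup are invented; they do not reflect the actual USPTO DTD.
import re

EMPTY_TAGS = {"pdat"}  # hypothetical SGML elements with no content

def sgml_to_xml(text):
    """Lower-case tag names and self-close known empty elements."""
    text = re.sub(r"</?([A-Za-z0-9-]+)", lambda m: m.group(0).lower(), text)
    for tag in EMPTY_TAGS:
        text = re.sub(rf"<{tag}>", f"<{tag}/>", text)
    return text

print(sgml_to_xml("<PATENT><TITLE>Widget</TITLE><PDAT></PATENT>"))
# <patent><title>Widget</title><pdat/></patent>
```

A real migration would be driven by the DTD rather than a hand-written tag list, but the benchmark question is the same: how much per-collection scripting effort a markup migration demands.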