Report on Collection Based Persistent Archives
April 1999
The San Diego Supercomputer Center is identifying the information
architecture required to build a persistent archive. The approach is
based upon the concept that both the original digital objects and the
information required to assemble the digital objects into a data
collection must be archived. Digital objects are not archived as
stand-alone entities, but instead are archived as members of a digital
data collection. Persistence is achieved through identification of
meta-data for all attributes related to digital object properties and
collection organization. Persistence is demonstrated by dynamically
building the data collection from the individual data objects stored
in the archive, dynamically creating the queries needed to discover
information within the data collection, and dynamically constructing
the presentation interface for the digital objects discovered through
a query.
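As an illustration of what recording meta-data for object attributes and collection membership can look like, the following is a minimal Python sketch that wraps the attributes of one digital object in an XML record and reads them back. The element names (`object`, `attribute`) and the sample attribute names are hypothetical and chosen only for this example; they are not taken from the MOA II DTD or from the MCAT schema.

```python
import xml.etree.ElementTree as ET


def annotate_object(object_id, collection_name, attributes):
    """Wrap the meta-data for one digital object in an XML record.

    The tag and attribute names used here are illustrative assumptions,
    not part of any published DTD.
    """
    record = ET.Element("object", {"id": object_id, "collection": collection_name})
    for name, value in attributes.items():
        attr = ET.SubElement(record, "attribute", {"name": name})
        attr.text = value
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(record, encoding="unicode")


def read_attributes(xml_text):
    """Recover the attribute dictionary from an archived XML record."""
    record = ET.fromstring(xml_text)
    return {a.get("name"): a.text for a in record.findall("attribute")}
```

Because the record carries both the object identifier and the name of its collection, the same text can in principle be used later to place the object back into a rebuilt collection.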
The software systems required to dynamically assemble a data
collection, load the digital objects, and support arbitrary queries
against the assembled collection have been developed at SDSC as part
of a Meta-data Catalog system (MCAT). The software is used with the
SDSC Storage Resource Broker to support the creation of scientific
data collections. Software is under development to characterize
presentation interfaces as semi-structured data organized through the
use of XSL style sheets. This latter system supports mediated
information exchange (MIX) for XML tagged data sets.
The approach is divided into four areas: the automated loading of
data and generation of meta-data (ingestion), the long-term storage
of digital objects (archival storage), support for information
discovery against the archived data, and the presentation of the
discovered data objects. Each area has multiple components:
- Ingestion
  - Creation of an image of the original contents of the source
  - Decomposition of the data collection into individual data objects
  - Characterization of the attributes used to describe each data object
  - Characterization of the data collection organization
  - Identification of unique features associated with each
    collection, including software tools and ingestion
    performance
  - Standardization of the mark-up language used to annotate the
    digital objects with their associated meta-data, based upon XML
    and the MOA II DTD
- Archival storage
  - Standardization of the archive format for storing a digital
    object, based upon OAIS
  - Storage of the collection into the archive
  - Persistence of archival storage meta-data needed to
    guarantee the ability to retrieve the data collection
  - Benchmark of the achievable performance for archiving
    data
- Information discovery
  - Dynamic reconstruction of the data collection through use of
    object-relational database technology
  - Dynamic generation of a user interface to support queries
    against the data
  - Dynamic generation of the SQL required to execute a
    query
- Presentation
  - Standardization of the mark-up language used to define the
    presentation layout
  - Support for retrieval of the original data object
  - Support for retrieval of meta-data used to characterize
    information about the object or the associated data
    collection
  - Dynamic creation of the presentation interface for each
    digital object
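The information discovery steps listed above (rebuilding the collection in a database and generating the SQL for a query) can be sketched as follows. This is a minimal Python illustration, not the MCAT implementation: it uses the standard-library `sqlite3` module as a stand-in for an object-relational database, and the `COLLECTION` description, its attribute names, and the helper function names are all assumptions made for the example.

```python
import sqlite3

# Hypothetical collection-level meta-data: the table name and the
# attributes (name, SQL type) that describe each digital object.
COLLECTION = {
    "name": "newsgroup_articles",
    "attributes": [
        ("message_id", "TEXT"),
        ("newsgroup", "TEXT"),
        ("subject", "TEXT"),
        ("posted", "TEXT"),
    ],
}


def create_table_sql(collection):
    """Build the CREATE TABLE statement from the archived description."""
    columns = ", ".join(f'"{name}" {sqltype}' for name, sqltype in collection["attributes"])
    return f'CREATE TABLE "{collection["name"]}" ({columns})'


def insert_sql(collection):
    """Build a parameterized INSERT with one placeholder per attribute."""
    placeholders = ", ".join("?" for _ in collection["attributes"])
    return f'INSERT INTO "{collection["name"]}" VALUES ({placeholders})'


def select_sql(collection, criteria):
    """Build a parameterized SELECT from attribute/value criteria.

    Only attributes named in the collection meta-data are accepted.
    """
    known = {name for name, _ in collection["attributes"]}
    unknown = set(criteria) - known
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    sql = f'SELECT * FROM "{collection["name"]}"'
    if criteria:
        sql += " WHERE " + " AND ".join(f'"{name}" = ?' for name in criteria)
    return sql, list(criteria.values())


def rebuild_and_query(collection, objects, criteria):
    """Recreate the collection in an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_table_sql(collection))
        conn.executemany(insert_sql(collection), objects)
        sql, params = select_sql(collection, criteria)
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()
```

The point of the sketch is that nothing about the table or the query is hard-coded: both are derived from the collection description, which is the property a persistent archive needs if the database software is replaced over time.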
SDSC is using the following data collections to demonstrate the
generation of a persistent archive:
- Newsgroups
- TIGER/Line 1992 (Bureau of the Census)
- 104th Congress
- Electronic Access Project (EAP) -- updated
- Vote Archive Demo 1997 (VAD)
- Combat Area Casualties Current File (CACCF) -- updated
- Patent Data (USPTO) -- 2do (who?)
- Joint Interoperability Test Command (JITC) -- new
- Image and Metadata collection (AMICO) -- new
Each collection provides a different test of the functionality or the
performance of the system. The ingestion of a variety of data
collections into the SDSC High Performance Storage System (HPSS) helps
us understand the issues associated with data accession, demonstrates
the feasibility of archiving data collections as well as digital
objects, and demonstrates the level of performance that can be
expected (number of records accessioned per year).
Nearly all data (apart from a 4mm cartridge) has been ingested in
raw format and loaded to disk or HPSS. For the ingestion from some
sources (i.e., the newsgroup and census data collections), Perl
scripts were written (see below) and used to dynamically assemble the
data collections.
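To make the decomposition step concrete, the following minimal sketch splits a concatenated newsgroup dump into individual articles and lists the attribute names found in their headers. It is written in Python rather than Perl and is not the script used at SDSC. It assumes, purely for illustration, that articles are separated by mbox-style lines beginning with `From `; the function names are hypothetical.

```python
from email import message_from_string


def split_articles(raw_text):
    """Split a concatenated dump into articles.

    Assumption: each article is preceded by a separator line that starts
    with "From " (mbox style). Body lines that start with "From " would
    be misread as separators in this simplified version.
    """
    articles, current = [], []
    for line in raw_text.splitlines(keepends=True):
        if line.startswith("From "):
            # A separator closes the article collected so far, if any.
            if current:
                articles.append("".join(current))
                current = []
            continue
        current.append(line)
    if current:
        articles.append("".join(current))
    return articles


def describe_article(article_text):
    """Return the header attributes and body of one article."""
    msg = message_from_string(article_text)
    return {
        "attributes": {name.lower(): value for name, value in msg.items()},
        "body": msg.get_payload(),
    }


def attribute_names(articles):
    """Collect the set of header names used anywhere in the collection."""
    names = set()
    for article in articles:
        names.update(describe_article(article)["attributes"])
    return sorted(names)
```

The sorted list returned by `attribute_names` is the kind of object-level attribute inventory that would feed the characterization step for a collection.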
Preliminary structural and meta-data information was obtained from
the collections, mainly by inspecting the sources manually and, to a
very limited extent, by using small Perl scripts to extract
attribute names. The reporting format we follow for each data
collection is:
- Description of the type of collection
- Physical Source
- Collection Level Structure / Meta-data used for logical organization of
  the data objects
- Object Level Structure / Meta-data used for description of each object
- Miscellaneous - performance, automation, and problem handling
Results of the ingestion process are given for the first five
collections. The sixth collection is being ingested, while the
last two data collections will require development of additional
scripts to manage their very large size. The USPTO patent
collection will be used to populate a new linked database/archival
storage system based on the IBM Universal database and HPSS. The
expectation is that the system can be used to directly store
individual objects within HPSS, without having to concatenate
digital objects together through the use of TAR. The patent
collection will also be used to benchmark the effort required to
migrate a collection to a new meta-data markup language. Scripts
are being tested for converting from the current SGML-based version
to an XML-based version. The Image collection contains about 10,000
art images comprising about 25 GB of data. The challenge with this
collection is the development of presentation interfaces that can
provide the multiple views needed into different resolutions of
the images. The image collection will be ingested into the archive
at SDSC during January.
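For readers unfamiliar with the TAR concatenation mentioned above, the following minimal Python sketch shows the general idea using the standard-library `tarfile` module: several digital objects are packed into one container, and retrieving a single object means opening the container and locating the member by name. The object names and function names are hypothetical, and the container is kept in memory only to keep the example self-contained; it does not model HPSS.

```python
import io
import tarfile


def pack_objects(objects):
    """Concatenate named digital objects (name -> bytes) into one TAR container."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in objects.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # size must be set before addfile
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def extract_object(container, name):
    """Retrieve one object from the container by its member name."""
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as tar:
        member = tar.extractfile(name)  # raises KeyError if name is absent
        return member.read()
```

Storing objects individually would remove this extra packing and lookup layer, which is the motivation given above for the linked database/archival storage system.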