CHALLENGES AND OPPORTUNITIES
KNOWLEDGE-BASED PERSISTENT ARCHIVES
BUILDING SMARTER GRIDS
A glut of government records being created with rapidly changing information
technologies is a daunting problem for federal archivists who must
preserve the data for centuries. How can archivists store electronic
records so that we will be able to access data sets in the future
as easily as we can now read the words of the 214-year-old U.S.
Constitution? Music lovers with LP records from the 1960s can appreciate
the problem of accessing yesterday's information with today's
technology. The National Archives and Records Administration (NARA)
is charged with managing federal government records from their
creation through storage to use by researchers. Now, archiving technologies
developed by SDSC's Data and Knowledge Systems (DAKS) researchers
are playing a pivotal role in NARA's efforts to provide future
access to authentic government electronic records. These technologies
are also central to the next-generation computing system known as
the TeraGrid.
[Figure: Long-Term Data Archive Components]
Information technologies
transform and enrich society, but they also create mountains of
electronic documents, which are sometimes called software-dependent
data objects. These objects are at risk of being orphaned
by changing technologies. The translucent fossil resin amber preserves
fragile insects for millions of years, and archivists need an equivalent
to preserve ephemeral electronic documents.
In collaboration with NARA and other partners, SDSC researchers
are developing a comprehensive solution for storing electronic records
in a type of electronic amber called knowledge-based persistent
digital archives. "We're on the verge of a major technological
breakthrough for the long-term preservation of computer-generated
records of the federal government," John Carlin, chief archivist
of the United States, said in Congressional testimony last year.
"Research and development work done for us by the San Diego
Supercomputer Center indicates that a practical electronic records
archives may be in sight."
CHALLENGES AND OPPORTUNITIES
To realize
the promise of electronic records storage, DAKS researchers are
confronting not only difficulties posed by the rapid obsolescence
of information technologies, but other complexities as well. Unlike
words and images on a printed page, digital information is stored
on a disk drive in one format and presented on a screen in a
radically different format. The technology incorporates great
flexibility and many intervening steps of transformation. "While
this freedom offers great benefits, it also presents difficulties
in faithfully preserving the look and feel of information as presented
to the original readers," said Reagan Moore, SDSC Distinguished
Scientist, leader of NPACI's Data-Intensive Computing Environments
(DICE) thrust area, and adjunct professor in UCSD's Computer
Science and Engineering Department. "This freedom also makes
it hard to ensure the authenticity of electronic records."
Electronic records are more than just volatile, less trustworthy
counterparts of their hard copy predecessors. "In studying
the challenges of how to store authentic electronic records over
the long term, we've also found new opportunities to extract
information that's simply not available from traditional
hard copy documents," said DAKS researcher Richard Marciano.
In a significant achievement in the NARA research, work with a
test collection of Senate documents demonstrated that these data
could be put into a preservation format that is searchable and
provides added value. By converting, or "wrapping," the
Microsoft Word files into the standard exchange format XML, Marciano,
fellow DAKS researcher Amarnath Gupta, and UCSD graduate student
Liying Sui were able to preserve the files and tease out previously
inaccessible, embedded relationships and knowledge.
The change-tracking feature in an MS Word document provides a
series of snapshots of its creation. After the researchers converted
the documents to XML format, they were able to separately access
each update and search the document changes by person, time, or
content. Instead of having only the hard copy document itself,
this approach produces a complete, searchable history of how the
document was prepared, including who changed it and when.
"Imagine being able to view the history of an important piece
of legislation or a treaty," said Gupta. "Our approach
lets you automatically retrieve the document as it was on the
date you specify."
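To make the idea concrete, here is a minimal Python sketch of searching change history and rebuilding a document as of a given date. The XML layout (a `change` element with `id`, `author`, `date`, `type`, and `ref` attributes) and the function names are invented for illustration; the article does not describe the actual format the DAKS researchers produced.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical XML wrapping of a change-tracked document (illustrative only).
SAMPLE = """
<document title="Draft bill">
  <change id="c1" author="Staffer A" date="2000-03-01" type="insert">The committee shall meet monthly.</change>
  <change id="c2" author="Staffer B" date="2000-03-05" type="insert">Minutes shall be published.</change>
  <change id="c3" author="Staffer A" date="2000-03-09" type="delete" ref="c1"/>
</document>
"""

def load_changes(xml_text):
    """Parse every <change> element into a dict, ordered by date."""
    root = ET.fromstring(xml_text)
    changes = []
    for el in root.iter("change"):
        changes.append({
            "id": el.get("id"),
            "author": el.get("author"),
            "date": date.fromisoformat(el.get("date")),
            "type": el.get("type"),
            "ref": el.get("ref"),          # passage removed by a delete
            "text": (el.text or "").strip(),
        })
    return sorted(changes, key=lambda c: c["date"])

def changes_by(changes, author):
    """Search the change history by person."""
    return [c for c in changes if c["author"] == author]

def version_as_of(changes, cutoff):
    """Replay changes up to and including `cutoff` to rebuild the text."""
    passages = {}                          # dicts keep insertion order
    for c in changes:
        if c["date"] > cutoff:
            break
        if c["type"] == "insert":
            passages[c["id"]] = c["text"]
        elif c["type"] == "delete":
            passages.pop(c["ref"], None)
    return " ".join(passages.values())
```

Calling `version_as_of(load_changes(SAMPLE), date(2000, 3, 6))` replays only the first two changes, so the passage deleted on March 9 is still present in the reconstructed text.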
KNOWLEDGE-BASED PERSISTENT ARCHIVES
To
build a digital preservation system, DAKS researchers are collaborating
in research that is defining an architecture for the long-term
storage of electronic records, the "amber" of the Information
Age.
Creating knowledge-based persistent digital archives involves
three stages (Figure 1). For example, when a president leaves
office, White House records and other Executive Office
records are transferred to NARA. In the first stage, ingestion,
electronic records from separate and sometimes dissimilar sources
undergo transformations to put them into a standard persistent
form. They are then entered into a unified collection.
In the second stage, collection management, data are migrated
forward onto new hardware and software systems to avoid technological
obsolescence. The third stage, instantiation, occurs when someone
wants to access the stored presidential data and searches through
the archives, navigating to the information of interest. The desired
records are then unpacked (loaded into a database system that
can be queried) and then validated and displayed. A major
advantage of the DAKS approach is that it will allow future generations
of researchers to retrieve archived records using the technologies
current in their own time, rather than obsolete ones.
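The three stages can be sketched as a toy pipeline in Python. Everything here is an assumption made for illustration: the function names, the use of JSON as the "standard persistent form," SHA-256 digests for validation, and an in-memory SQLite database as the queryable system. None of it reflects the actual DAKS software.

```python
import hashlib
import json
import sqlite3

def ingest(raw_records):
    """Stage 1: put records from dissimilar sources into one standard form."""
    collection = []
    for source, payload in raw_records:
        text = payload.decode("utf-8") if isinstance(payload, bytes) else str(payload)
        collection.append({
            "source": source,
            "text": text,
            # Digest recorded at ingestion, used later to validate the record.
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        })
    return collection

def migrate(collection):
    """Stage 2: carry the collection forward to a new storage encoding.
    In this sketch the 'new system' is simply a JSON string."""
    return json.dumps(collection)

def instantiate(archived, keyword):
    """Stage 3: unpack into a queryable database, query it, validate hits."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE records (source TEXT, text TEXT, sha256 TEXT)")
    db.executemany(
        "INSERT INTO records VALUES (:source, :text, :sha256)",
        json.loads(archived),
    )
    rows = db.execute(
        "SELECT source, text, sha256 FROM records WHERE text LIKE ?",
        (f"%{keyword}%",),
    ).fetchall()
    db.close()
    # Return only records whose content still matches the ingestion digest.
    return [
        (source, text)
        for source, text, digest in rows
        if hashlib.sha256(text.encode("utf-8")).hexdigest() == digest
    ]
```

The point of the design is that `instantiate` depends only on the archived form, not on whatever software originally produced the records, so the database in stage 3 can be replaced as technology changes.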
"It turns out that the initial ingestion step is crucial
to the success of the later steps of management and access,"
said DAKS researcher Bertram Ludäscher. If sufficient metadata (descriptive
information about the underlying data) is archived along with
the data, it enables the stored data to be more easily managed
and migrated forward, despite inevitable changes in technology.
Metadata includes descriptive information such as author and creation
date, as well as system information such as file types and sizes. For
maximum robustness, the researchers also archive the time history
of the data, that is, the transformations the data sets undergo
in being archived.
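A minimal Python sketch of this kind of record keeping follows. The `ArchivedObject` class, its field names, and the use of SHA-256 digests to document each transformation are assumptions made for illustration, not a description of the researchers' schema.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ArchivedObject:
    """A data object stored together with its metadata (hypothetical layout)."""
    content: bytes
    author: str                 # descriptive metadata
    created: str                # descriptive metadata, ISO date string
    file_type: str              # system metadata
    history: list = field(default_factory=list)   # time history of transformations

    @property
    def size(self):
        """System metadata derived from the current content."""
        return len(self.content)

    def transform(self, name, func, new_file_type):
        """Apply a transformation and record it in the object's history."""
        before = hashlib.sha256(self.content).hexdigest()
        self.content = func(self.content)
        self.file_type = new_file_type
        after = hashlib.sha256(self.content).hexdigest()
        self.history.append({"step": name, "before": before, "after": after})
```

For example, wrapping a plain-text record in an XML element with `obj.transform("wrap-xml", lambda b: b"<doc>" + b + b"</doc>", "xml")` leaves a history entry holding the digests of the content before and after the step, which a later archivist could use to check the chain of transformations.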
"In addition to archiving descriptive metadata, we've
also found that archiving a higher-level layer of more complex
'knowledge structures' that are superimposed on the
collections provides an even more powerful tool for navigating
and validating the archived data," said Ludäscher. Knowledge
structures, which consist of relationships between relevant concepts
and the data, can be represented declaratively as logic rules.
"We are working to implement these knowledge representation
techniques based on W3C [World Wide Web Consortium] standards such
as Topic Maps," said Marciano. These rules can provide context
and constraints, and thus a richer semantic "amber" around the collections
being archived.
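As a rough illustration of relationships expressed as rules, the Python sketch below stores facts as (subject, relation, object) triples, applies two simple inference rules until nothing new can be derived, and checks one constraint. The relation names, the rules, and the record identifiers are hypothetical; this is not the Topic Maps syntax or the rule language used in the project.

```python
def infer(facts):
    """Derive new facts by applying two rules until a fixed point is reached.

    Rule 1: part_of is transitive.
    Rule 2: a record about X is also about anything X is part_of.
    """
    facts = set(facts)
    while True:
        new = set()
        for (a, r1, b) in facts:
            for (c, r2, d) in facts:
                if b != c or r2 != "part_of":
                    continue
                if r1 == "part_of":
                    new.add((a, "part_of", d))
                elif r1 == "about":
                    new.add((a, "about", d))
        if new <= facts:          # nothing new was derived
            return facts
        facts |= new

def violations(facts, records):
    """Constraint: every record must have a created_by fact."""
    documented = {s for (s, r, _) in facts if r == "created_by"}
    return sorted(set(records) - documented)
```

Given the facts that a memo is about a program and that the program is part of science policy, `infer` concludes that the memo is about science policy even though the memo never uses those words, while `violations` flags records that lack required context.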
"Suppose it's 2020 and you want to do research on White
House science policy in the late 20th century using records stored
electronically," said Ludäscher. The usual methods of
finding information online or in libraries are ineffective in
these archives because the records were generally not created
for publication, but simply made in the course of carrying out
specific government functions. The intended audience was very
small, with a high degree of shared knowledge among those creating
and receiving the records. Typical keyword and even full-text
searches will often fail to find the relevant records because
the search terms may not appear within a record of interest but
only in related records, or only in the implicit knowledge of
those who created and received the records. Thus, archiving additional
context and knowledge, from the specifics of a scientific
program design to broad policy statements, can be vital in
helping future users find and interpret historic records on science
policy (Figure 2).
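The search problem described here can be shown with a small Python sketch. The records, the `LINKS` and `CONTEXT` tables, and both search functions are made up for illustration; they stand in for whatever context an archive might actually store.

```python
# Hypothetical archived records: r1 is a cover note that never names its topic.
RECORDS = {
    "r1": "Please approve the attached schedule by Friday.",
    "r2": "Funding plan for the fusion research program.",
}
LINKS = {"r1": ["r2"]}                  # r1 was sent as a cover note for r2
CONTEXT = {"r2": ["science policy"]}    # context archived alongside r2

def keyword_search(term):
    """Plain full-text search over record contents only."""
    term = term.lower()
    return sorted(rid for rid, text in RECORDS.items() if term in text.lower())

def contextual_search(term):
    """Full-text search extended with archived context and record links."""
    lowered = term.lower()
    hits = set(keyword_search(term))
    # Records whose archived context mentions the term.
    hits |= {
        rid for rid, notes in CONTEXT.items()
        if any(lowered in note.lower() for note in notes)
    }
    # Records linked to any record found so far.
    hits |= {rid for rid, linked in LINKS.items() if hits & set(linked)}
    return sorted(hits)
```

Here `keyword_search("science policy")` finds nothing because neither record contains the phrase, while `contextual_search("science policy")` returns both records: one through its archived context and the other through its link to the first.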
BUILDING SMARTER GRIDS
[Figure: Interactive Map of Agent Orange Use in Vietnam]
"It's important
to realize that our work for NARA on persistent archives is not
being done in isolation," said Chaitan Baru, co-director
of the DAKS program. The DAKS archiving research is building on
technologies that come from synergies among the worlds of supercomputing,
digital libraries, and data grids. For example, the problems that
must be solved for ingesting diverse, far-flung data into a unified
collection are not unlike those involved in establishing
grid connectivity or interoperability. "All the technologies
we're developing, whether for data collections, digital libraries,
data grids, or persistent archives, are closely related," said
Baru. "And they're also playing a central role in the
TeraGrid." By merging knowledge-based technologies into data
grids, DAKS researchers are helping develop a new generation of
"smarter" grids.
Based on the successful DAKS research, NARA's Electronic
Records Archives Program is expanding research and development
efforts with SDSC and NPACI. In addition, NARA is leveraging NSF's
investment in NPACI. "Thanks to the expertise, talent, and
dedication of those in the scientific and engineering communities,
I can see the archives of the future taking shape right now,"
said Carlin. "I can't wait to see how it turns out."
"The coming year will also be an important transition period for
this work," said SDSC's Moore. "We will pursue further
research that will mark a significant step toward having these
technologies become more widely implemented, not only within NARA,
but also internationally," he said.
PT
PROJECT LEADER
Reagan Moore
SDSC
PARTICIPANTS
Amarnath Gupta,
Bertram Ludäscher,
Richard Marciano
SDSC
www.npaci.edu/DICE
www.sdsc.edu/NARA