
Home > News Center > Publications > EnVision



Preserving Electronic Records for Posterity


A glut of government records being created with rapidly changing information technologies is a daunting problem for federal archivists who must preserve the data for centuries. How can archivists store electronic records so that we will be able to access data sets in the future as easily as we can now read the words of the 214-year-old U.S. Constitution? Music lovers with LP records from the 1960s can appreciate the problem of accessing yesterday’s information with today’s technology. The National Archives and Records Administration (NARA) is charged with managing federal government records from their creation through storage to their use by researchers. Now, archiving technologies developed by SDSC’s Data and Knowledge Systems (DAKS) researchers are playing a pivotal role in NARA’s efforts to provide future access to authentic government electronic records. These technologies are also central to the next-generation computing system known as the TeraGrid.

Long-Term Data Archive Components

Information technologies transform and enrich society, but they also create mountains of electronic documents, which are sometimes called “software-dependent data objects.” These objects are at risk of being orphaned by changing technologies. The translucent fossil resin amber preserves fragile insects for millions of years, and archivists need an equivalent to preserve ephemeral electronic documents.

In collaboration with NARA and other partners, SDSC researchers are developing a comprehensive solution for storing electronic records in a type of electronic amber called knowledge-based persistent digital archives. “We’re on the verge of a major technological breakthrough for the long-term preservation of computer-generated records of the federal government,” John Carlin, chief archivist of the United States, said in Congressional testimony last year. “Research and development work done for us by the San Diego Supercomputer Center indicates that a practical electronic records archives may be in sight.”


In order to realize the promise of electronic records storage, DAKS researchers are confronting not only difficulties posed by the rapid obsolescence of information technologies, but other complexities as well. Unlike words and images on a printed page, digital information is stored on a disk drive in one format, and presented on a screen in a radically different format. The technology incorporates great flexibility and many intervening steps of transformation. “While this freedom offers great benefits, it also presents difficulties in faithfully preserving the look and feel of information as presented to the original readers,” said Reagan Moore, SDSC Distinguished Scientist, leader of NPACI’s Data-Intensive Computing Environments (DICE) thrust area, and adjunct professor in UCSD’s Computer Science and Engineering Department. “This freedom also makes it hard to ensure the authenticity of electronic records.”

Electronic records are more than just volatile, less trustworthy counterparts of their hard copy predecessors. “In studying the challenges of how to store authentic electronic records over the long term, we’ve also found new opportunities to extract information that’s simply not available from traditional hard copy documents,” said DAKS researcher Richard Marciano.

In a significant achievement in the NARA research, work with a test collection of Senate documents demonstrated that this data could be put into a preservation format that is searchable and provides added value. By converting or “wrapping” the Microsoft Word files to the standard exchange format XML, Marciano, fellow DAKS researcher Amarnath Gupta, and UCSD graduate student Liying Sui were able to preserve the files and tease out previously inaccessible, embedded relationships and knowledge.

The change-tracking feature in an MS Word document provides a series of snapshots of its creation. After the researchers converted the documents to XML format, they were able to separately access each update and search the document changes by person, time, or content. Instead of having only the hard copy document itself, this approach produces a complete, searchable history of how the document was prepared—including who changed it and when. “Imagine being able to view the history of an important piece of legislation or a treaty,” said Gupta. “Our approach lets you automatically retrieve the document as it was on the date you specify.”
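
To make the idea concrete, here is a minimal Python sketch of searching and replaying tracked changes once they are held in XML. The element and attribute names (`revision`, `change`, `op`, `para`) and the sample text are invented for illustration; the format actually produced in the NARA research is not described in this article.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical wrapper format: every element and attribute name below is an
# assumption made for this example, not the schema used by the researchers.
SAMPLE_XML = """\
<document title="Draft bill">
  <revision author="Staffer A" date="2001-03-01">
    <change op="add" para="p1">The committee shall report annually.</change>
    <change op="add" para="p2">Funding is capped at $1 million.</change>
  </revision>
  <revision author="Staffer B" date="2001-03-15">
    <change op="replace" para="p2">Funding is capped at $2 million.</change>
  </revision>
  <revision author="Staffer A" date="2001-04-02">
    <change op="delete" para="p1"/>
  </revision>
</document>
"""


def revisions(root):
    """Return revision elements sorted by ISO date, oldest first."""
    return sorted(root.findall("revision"), key=lambda r: r.get("date"))


def changes_by_author(root, author):
    """List (date, operation, paragraph id) for every change by one person."""
    return [
        (rev.get("date"), ch.get("op"), ch.get("para"))
        for rev in revisions(root) if rev.get("author") == author
        for ch in rev.findall("change")
    ]


def document_as_of(root, when):
    """Replay changes dated on or before `when`; return paragraph texts."""
    paragraphs = {}  # dicts preserve insertion order, so paragraphs stay in order
    for rev in revisions(root):
        if date.fromisoformat(rev.get("date")) > when:
            break
        for ch in rev.findall("change"):
            if ch.get("op") == "delete":
                paragraphs.pop(ch.get("para"), None)
            else:  # "add" or "replace"
                paragraphs[ch.get("para")] = ch.text
    return list(paragraphs.values())


root = ET.fromstring(SAMPLE_XML)
print(document_as_of(root, date(2001, 3, 20)))
print(changes_by_author(root, "Staffer A"))
```

Because each revision carries its own author and date, the same structure answers both kinds of question: who changed what, and what the text looked like on a given day.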


To build a digital preservation system, DAKS researchers are collaborating in research that is defining an architecture for the long-term storage of electronic records, the amber of the Information Age.
Creating knowledge-based persistent digital archives involves three stages (Figure 1). For example, when a president leaves office, records from the White House and other Executive Office records are transferred to NARA. In the first stage, ingestion, electronic records from separate and sometimes dissimilar sources undergo transformations to put them into a standard persistent form. They are then entered into a unified collection.

In the second stage, collection management, data are migrated forward onto new hardware and software systems to avoid technological obsolescence. The third stage, instantiation, occurs when someone wants to access the stored presidential data and searches through the archives, navigating to the information of interest. The desired records are then unpacked—loaded into a database system that can be queried—and then validated and displayed. A major advantage of the DAKS approach is that it will allow future generations of researchers to retrieve archived records using current technologies, rather than obsolete ones.
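
The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the JSON persistent form, the sample records, and the use of an in-memory SQLite database are all choices made for this example, not the DAKS architecture itself, and the validation step is left out.

```python
import json
import sqlite3


def ingest(sources):
    """Stage 1: put records from dissimilar sources into one standard form."""
    collection = []
    for source_name, records in sources.items():
        for rec in records:
            # The sources use different field names; map them to one layout.
            collection.append(json.dumps({
                "source": source_name,
                "title": rec.get("title") or rec.get("subject", ""),
                "body": rec.get("body") or rec.get("text", ""),
            }, sort_keys=True))
    return collection


def migrate(collection, new_store):
    """Stage 2: copy the persistent form, unchanged, onto a new store."""
    new_store.extend(collection)
    return new_store


def instantiate(store):
    """Stage 3: unpack the records into a database that can be queried."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE records (source TEXT, title TEXT, body TEXT)")
    for line in store:
        rec = json.loads(line)
        db.execute("INSERT INTO records VALUES (?, ?, ?)",
                   (rec["source"], rec["title"], rec["body"]))
    return db


# Hypothetical sample data with two differently shaped sources.
sources = {
    "white_house_email": [
        {"subject": "Science budget", "text": "Draft figures attached."}],
    "executive_office_memos": [
        {"title": "Policy memo", "body": "Review of research priorities."}],
}
db = instantiate(migrate(ingest(sources), new_store=[]))
rows = db.execute("SELECT source, title FROM records ORDER BY title").fetchall()
print(rows)
```

The point of the design is that the query in the last lines runs against whatever database technology is current at access time; only the persistent form has to survive across the years.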

“It turns out that the initial ingestion step is crucial to the success of the later steps of management and access,” said DAKS researcher Bertram Ludäscher. If sufficient metadata (descriptive information about the underlying data) is archived along with the data, the stored data can be more easily managed and migrated forward despite inevitable changes in technology. Metadata includes descriptive information such as author and creation date, as well as system information such as file types and sizes. For maximum robustness, the researchers also archive the time history of the data, that is, the transformations the data sets undergo in being archived.
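
A record that carries its own metadata and transformation history might look like the following Python sketch. The class name, field names, sample bytes, and the choice of SHA-256 checksums are assumptions for illustration; the article does not say how the researchers actually record this history.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ArchivedRecord:
    """Hypothetical archived object: data plus metadata plus its history."""
    data: bytes
    author: str
    creation_date: str  # ISO 8601 date string
    file_type: str
    history: list = field(default_factory=list)  # one entry per transformation

    @property
    def size(self):
        return len(self.data)

    def checksum(self):
        return hashlib.sha256(self.data).hexdigest()

    def transform(self, name, func):
        """Apply func to the data and log the checksums before and after."""
        before = self.checksum()
        self.data = func(self.data)
        self.history.append(
            {"step": name, "before": before, "after": self.checksum()})


def history_is_consistent(record):
    """Check that each step starts from what the previous step produced."""
    steps = record.history
    links_ok = all(a["after"] == b["before"] for a, b in zip(steps, steps[1:]))
    ends_ok = not steps or steps[-1]["after"] == record.checksum()
    return links_ok and ends_ok


rec = ArchivedRecord(b"Caf\xe9 memo\r\nLine two\r\n",
                     author="J. Doe", creation_date="2001-03-01",
                     file_type="text/plain")
rec.transform("normalize-line-endings", lambda d: d.replace(b"\r\n", b"\n"))
rec.transform("latin1-to-utf8", lambda d: d.decode("latin-1").encode("utf-8"))
print([step["step"] for step in rec.history], history_is_consistent(rec))
```

Logging a checksum on each side of every transformation gives a later archivist a way to confirm that the stored bytes are the result of the documented steps and nothing else.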

“In addition to archiving descriptive metadata, we’ve also found that archiving a higher level layer of more complex ‘knowledge structures’ that are superimposed on the collections provides an even more powerful tool for navigating and validating the archived data,” said Ludäscher. Knowledge structures, which consist of relationships between relevant concepts and the data, can be represented declaratively as logic rules. “We are working to implement these knowledge representation techniques based on W3C [World Wide Web Consortium] standards such as Topic Maps,” said Marciano. These rules can provide context and constraints, and thus a richer semantic amber around the collections being archived.

“Suppose it’s 2020 and you want to do research on White House science policy in the late 20th century using records stored electronically,” said Ludäscher. The usual methods of finding information online or in libraries are ineffective in these archives because the records were generally not created for publication, but simply made in the course of carrying out specific government functions. The intended audience was very small, with a high degree of shared knowledge among those creating and receiving the records. Typical keyword and even full-text searches will often fail to find the relevant records because the search terms may not appear within a record of interest but only in related records, or only in the implicit knowledge of those who created and received the records. Thus, archiving additional context and knowledge—from the specifics of a scientific program design to broad policy statements—can be vital in helping future users find and interpret historic records on science policy (Figure 2).
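
A toy Python example shows why archived context matters for search. The records, the internal project name, and the `RELATED_TERMS` table are all invented for this sketch; it stands in for the much richer knowledge structures described above and is not the researchers' method.

```python
# Hypothetical records. Only r2 uses the public term; r1 uses an internal
# working name that outsiders would not know to search for.
RECORDS = {
    "r1": "Memo re: Project Alpha staffing for next quarter.",
    "r2": "Project Alpha is the working name for the climate modeling initiative.",
    "r3": "Cafeteria hours will change in June.",
}

# Hypothetical archived context: shared knowledge of the original authors,
# written down as a relationship between a concept and the terms they used.
RELATED_TERMS = {"climate modeling": {"project alpha"}}


def keyword_search(term):
    """Plain search: match only records that contain the term itself."""
    term = term.lower()
    return sorted(rid for rid, text in RECORDS.items() if term in text.lower())


def context_search(term):
    """Search using the term plus any related terms from the archived context."""
    terms = {term.lower()} | RELATED_TERMS.get(term.lower(), set())
    return sorted(rid for rid, text in RECORDS.items()
                  if any(t in text.lower() for t in terms))


print(keyword_search("climate modeling"))
print(context_search("climate modeling"))
```

The plain search misses the staffing memo because the memo never uses the public name of the program; the context-aware search finds it through the archived relationship.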


Interactive Map of Agent Orange Use in Vietnam

“It’s important to realize that our work for NARA on persistent archives is not being done in isolation,” said Chaitan Baru, co-director of the DAKS program. The DAKS archiving research is building on technologies that come from synergies among the worlds of supercomputing, digital libraries, and data grids. For example, the problems that must be solved to ingest diverse, far-flung data into a unified collection resemble those that arise in establishing grid connectivity and interoperability. “All the technologies we’re developing, whether for data collections, digital libraries, data grids, or persistent archives, are very related,” said Baru. “And they’re also playing a central role in the TeraGrid.” By merging knowledge-based technologies into data grids, DAKS researchers are helping develop a new generation of smarter grids.

Based on the successful DAKS research, NARA’s Electronic Records Archives Program is expanding research and development efforts with SDSC and NPACI. In addition, NARA is leveraging NSF’s investment in NPACI. “Thanks to the expertise, talent, and dedication of those in the scientific and engineering communities, I can see the archives of the future taking shape right now,” said Carlin. “I can’t wait to see how it turns out.”

The coming year will also be an important transition period for this work, said SDSC’s Moore. “We will pursue further research that will mark a significant step toward having these technologies become more widely implemented, not only within NARA, but also internationally,” he said.

Reagan Moore

Amarnath Gupta, Bertram Ludäscher, Richard Marciano