Digging Into Data: Q&A with Reagan Moore
Reagan Moore is Director of the Knowledge and SRB Lab at the San Diego Supercomputer Center. He coordinates research efforts in the development of massive data analysis systems, scientific data publication systems, and persistent archives.
Q. How did you get started in science and what led you to concentrate on data?
Moore: I attended the Thacher Summer Science Program as a junior in high school, where we determined the orbital parameters of an asteroid. I then majored in physics at Caltech and earned a PhD in plasma physics from UCSD. I worked on the equilibrium and stability of toroidal fusion devices for 10 years, culminating in a joint patent on elongated toroidal fusion devices. I developed stability codes for analyzing pressure-gradient-driven instabilities (ballooning modes) that limit the maximum achievable pressure within a tokamak.
The analyses we conducted at General Atomics showed that stable plasmas had a maximum value of beta poloidal (ratio of plasma pressure to poloidal magnetic field pressure). An independent group on the East Coast believed that a second stability region existed at large values of beta poloidal. We were unable to get copies of the plasma equilibria on which this claim was based. The plasma community spent several years attempting to realize high beta poloidal equilibria. We now know that the perceived second stability regime was an artifact. If we had been able to analyze the equilibria with the stability codes developed at General Atomics, we could have shown this result much sooner. It was obvious that a better way to share data was needed that would work across independent administrative domains.
I joined the SDSC staff when the center started, and managed production systems for 8 years. I was then asked to develop a research group, pick the research topic, and acquire the funding support. I recognized that automation of data management was a critical need, and focused the new research group on developing generic data virtualization software, the basis of the Storage Resource Broker.
Q. What sparked your interest in data preservation?
Moore: One of the early projects that funded SRB development was the Distributed Object Computation Testbed (1998-1999). This was a DARPA-USPTO funded project which resulted in the creation of a US Patent digital library, with the holdings replicated across multiple sites using the SRB technology. We were approached by Robert Chadduck of NARA to apply the technology in preservation environments.
We recognized that the data virtualization mechanisms within the SRB (basically the ability to manage the properties of a shared collection independently of the distributed remote storage systems) could also be used to manage technology evolution for persistent archives. The technology that manages heterogeneity in space can also manage heterogeneity over time.
We were funded by NARA in 1999 to build a prototype persistent archive. We demonstrated the preservation of a collection of 1 million e-mail messages using workstation technology. The prototype persistent archive now manages over 1 terabyte of data, including records from web crawls, records for the EAP collection, and records from presidential libraries. It consists of three independent SRB data grids at NARA, SDSC, and the University of Maryland, which are federated to share resources and user/file name spaces. The concepts behind the prototype persistent archive were used to define the solicitation for the Electronic Records Archive, which is being implemented by Lockheed Martin.
We now support multiple preservation projects, including:
- NARA research prototype persistent archive
- NSF National Science Digital Library persistent archive
- NHPRC Persistent Archive Testbed (state archives)
- Library of Congress NDIIPP project with California Digital Library - Digital Preservation Repository
- NSF/Library of Congress Digital Archiving project to preserve UCTV video for the "Conversations with History" presentations
- UCSD Libraries image archive
- InterPARES VanMap GIS preservation environment
Q. Tell us more about the creation of SRB.
Moore: The SRB technology has been driven by the desire to automate all aspects of data manipulation, including discovery, access, retrieval, management, replication, archiving, and analysis. The initial funding for the research came from DARPA, for a "Massive Data Analysis System". The project was viewed as highly risky research. We spent 18 months thinking about the design of a system that would achieve the above goals in a distributed environment where the data resided on multiple types of storage systems. That initial design has been proved out over the succeeding 9 years of development effort. The original version supported containers, replication, metadata for each file, virtualization of the name spaces for resources, users, and files, a peer-to-peer server architecture, and separation of interactions with storage systems from interactions by access methods. It stored all state information in a relational database (MCAT).
Aside from the original DARPA project, the funding sources for development were all collaborations on application of the SRB technology and the development of specific capabilities required by that application. Thus the set of features provided by the SRB has been driven by real use cases.
The second major release of the SRB included support for parallel I/O. We were driven by the need to support both bulk loading of small files and rapid movement of large files.
The third major release of the SRB included support for federation of independent data grids. This feature is now widely used, and enables the creation of international shared collections. The SRB technology is used by a wide range of projects within the US, from bioinformatics to oceanography, seismology, and education. The software is used internationally in over 15 countries. Both the UK and Australia are investing heavily in the SRB and are building their data management infrastructure on top of the SRB technology. In aggregate, there is over a petabyte of data managed by SRB shared collections around the world.
Q. What's been rewarding about working with the Library of Congress and NARA?
Moore: The collaborations with NARA have shown that generic data management infrastructure can be used to support not only data sharing, but also data preservation. We also collaborate with the digital library community. The integration of the SRB technology with the DSpace digital library and the Fedora digital library is just as important. The digital libraries gain the ability to manage collections that exceed the size of the local file system, gain support for replication, and gain the ability to federate with other digital libraries. These efforts are showing that generic data virtualization infrastructure can be used by all of the data management mechanisms whether for data sharing, data publication, or data preservation.
Q. Tell us about your genealogy hobby.
Moore: I have been working on my family genealogy for the last 15 years in my spare time. This evolved into an effort to understand the properties of genealogies in terms of completeness, closure, extent, and coverage. My grandmother had paid for the construction of a family genealogy that showed her ancestors had fought in the Revolutionary War, which qualified her for membership in the Daughters of the American Revolution. The ancestors included a William Wentworth who emigrated to the American colonies about 1635. I found a resource from the Wentworth family that traced their ancestry back to England, with a lineage through William Marbury, to the Blount family, back to William the Conqueror.
I ended up tracking the genealogies of the royal lines of Europe to understand how large a genealogy was feasible, how quickly lineages coalesce (multiple children from the same family are ancestors), and how far back in time a genealogy is realistic. The genealogy eventually included all historical cultural groups in Europe, over 2,000 kings, the standard apocryphal links to the Davidic King list back to Adam and Eve, the Heimskringla saga (links to Odin), and apocryphal links to the Pharaohs of Egypt. The current state of the genealogy is:
- 123,700 persons
- over 70,000 relatives, including over 10,000 family ancestors
- ancestors of the Kings and Queens of England, Norway, Sweden, Denmark, Belgium, Netherlands, and Spain. The number of their ancestors varies from 14,400 to 21,000. It turns out that the number of ancestors in common across the royal lines is over 14,000.
- links to 31 US presidents and 24 signers of the Magna Charta
- apocryphal lineages that go back 172 generations
- realistic lineages that go back to Charlemagne (about 50 generations)
I wrote a genealogy analysis program to calculate the number of ascents, descents, lineage of maximal ascent, and closure properties. For example:
- the genealogy contains over 50,000 descendants of Charlemagne
- the genealogy contains over 21,000 ancestors of Prince William
- the number of persons who are both ancestors of Prince William and descendants of Charlemagne is 8,735. Between these persons there were 3,146 marriages.
- the number of descents from Charlemagne to Prince William can then be shown to exceed 1 billion.
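A descent can be read as one distinct parent-to-child path through the pedigree, so counting descents amounts to counting paths in a directed acyclic graph. The following is a minimal sketch of that idea, not the analysis program described above. The toy pedigree `PARENTS` and the function name `count_descents` are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Hypothetical toy pedigree (not real data): each person maps to known parents.
PARENTS = {
    "D": ["B", "C"],
    "B": ["A", "X"],
    "C": ["A", "Y"],
    "E": ["D", "C"],
}


def count_descents(ancestor, person, parents=PARENTS):
    """Count the distinct lines of descent from `ancestor` down to `person`."""

    @lru_cache(maxsize=None)
    def paths(p):
        # Reaching the target ancestor closes exactly one line of descent.
        if p == ancestor:
            return 1
        # Otherwise add up the lines that run through each known parent.
        # A person with no recorded parents contributes zero.
        return sum(paths(q) for q in parents.get(p, ()))

    return paths(person)


print(count_descents("A", "D"))  # 2: one line through B, one through C
print(count_descents("A", "E"))  # 3: two lines through D, one through C
```

Because each person's count is cached, the work grows with the number of parent links rather than with the number of paths. That matters when the path count itself runs into the billions: the count doubles each time both parents of an intermediate person descend from the same ancestor, and 30 such doublings already exceed 1 billion.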
Since the number of persons alive in Europe at the time of Charlemagne was on the order of 30 million, a high degree of coalescence of lineages has to occur. If you calculate the ratio of the number of lineages that coalesce (lineages that have common parents) to the number of lineages that have not coalesced, you find that for Prince William, this ratio exceeds 1 for 30 generations starting at the ninth generation.
The concept of extent is the comparison of the number of lineages that exist within the genealogy at each generation with the number of persons alive in Europe at that time. For Prince William, this ratio exceeds 1 starting with the 37th generation and remains greater than 1 through the 172nd generation.
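The arithmetic behind the need for coalescence can be checked directly. The sketch below compares the theoretical number of ancestor slots, which doubles every generation, with the 30 million population figure quoted above. It is not the closure ratio itself, which depends on the actual pedigree. The function names are hypothetical, the figure of 50 generations is the approximate distance to Charlemagne given earlier, and spreading the slots evenly over the whole population is a simplifying assumption.

```python
POPULATION = 30_000_000  # order-of-magnitude figure quoted in the text


def first_generation_exceeding(population):
    """Smallest generation g at which 2**g ancestor slots exceed `population`."""
    g = 0
    while 2 ** g <= population:
        g += 1
    return g


def average_slots_per_person(generation, population):
    """Average number of ancestor slots each living person would have to fill
    if the slots were spread evenly over the whole population."""
    return 2 ** generation / population


print(first_generation_exceeding(POPULATION))    # 25
print(average_slots_per_person(50, POPULATION))  # more than 37 million
```

By the 25th generation there are already more ancestor slots than people, so the same individuals must appear many times over well before reaching Charlemagne's era.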
Thus properties that a "complete" genealogy should have include:
- at least half as many ancestors as a royal genealogy
- at least 10 generations during which the closure is greater than one
- at least 100,000 descents from Charlemagne
- an extent that exceeds 1 for all generations after some starting point
- links to apocryphal lineages that go back roughly 4000 years
My family genealogy is close to "complete". However, the last 10% will require doubling the genealogy size to about 250,000 persons. Maybe in another 12 years?