
Putting Government Information at Citizens' Fingertips

SDSC RESEARCH
Chaitan Baru

Amarnath Gupta

Yannis Papakonstantinou
UC San Diego

Robert Hollebeek
University of Pennsylvania

David Featherman
University of Michigan

Since time immemorial, governments have compiled tax, census, cartographic, and other information in forms ranging from clay tablets to disk drives as technologies progressed. In recent years, much of this data has moved from the file drawer to the Web, but it still remains difficult to locate and access. Now, researchers in SDSC's Data-Intensive Computing Environments group are building the first-ever Digital Government Information Integration Testbed, providing powerful new Web-based tools that let researchers and citizens seamlessly access, integrate, analyze, and display the vast storehouse of government information right from their desktops.




Figure 1. Information Flow from Data Sources through Mediators to User Portals
Information flows from government databases (bottom), through Mediator software based on MIX technology (middle). The Mediator wraps or translates data from distributed, heterogeneous government databases, integrates the data, conflates or resolves geospatial issues through the Spatial Mediator, and performs statistical and other analysis, visualization, and further processing with plug-ins. Finally, the Mediator passes the resulting new data set to Web-based user portals (top) for purposes ranging from simple archive access to data mining and specialized research tools like the Sociology Workbench.
As more and more government records are made available on the Web, the explosion of raw data has brought tantalizingly close a whole new world of rapid, universal access to undreamed-of quantities and kinds of government information. "However, for all but the simplest queries, we now face a new series of challenges related to how to find and assemble the right data to answer your questions, whether for research or decision-making," says Chaitan Baru, SDSC senior principal scientist and leader of the Digital Government Information Integration Testbed (I2T) project. "In most cases, the information to answer your questions will be distributed among several databases at different government agencies, which are running different hardware and operating systems and using different data formats."
The I2T researchers are using a Web-based mediation approach, developing tools based on the Data-Intensive Computing Environments (DICE) group's Mediation of Information using XML (MIX) technology. These tools can find the right information wherever it is, "wrapping" or translating the different formats into the Extensible Markup Language (XML), integrating and analyzing this data, and then displaying it in ways that help the user gain the most insight into this information. To develop tools capable of doing all this, DICE researchers Amarnath Gupta, Ilya Zaslavsky, and Bertram Ludäscher are collaborating with NPACI partners at the UC San Diego Database Lab, the University of Pennsylvania, and the University of Michigan. "We're pooling our experience in related projects to create, adapt, and demonstrate the feasibility of the core technologies, the building blocks if you will, that will allow the vast and growing array of different government data to be seamlessly integrated and queried," Baru says.
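The "wrapping" step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual MIX wrappers: two hypothetical sources, one CSV-like and one a keyed record, are translated into a common XML shape so a mediator can treat them uniformly. All source formats and field names here are invented.

```python
import xml.etree.ElementTree as ET

def wrap_csv_row(header, row, record_tag="record"):
    """Translate one CSV-style row into an XML element."""
    elem = ET.Element(record_tag)
    for name, value in zip(header, row):
        child = ET.SubElement(elem, name)
        child.text = value
    return elem

def wrap_dict(record, record_tag="record"):
    """Translate a key/value record into the same XML shape."""
    elem = ET.Element(record_tag)
    for name, value in record.items():
        child = ET.SubElement(elem, name)
        child.text = str(value)
    return elem

# Two sources with different native formats, one common XML view.
census = wrap_csv_row(["county", "population"], ["San Diego", "2813833"])
sandag = wrap_dict({"county": "San Diego", "green_space_acres": 316000})

print(ET.tostring(census).decode())
# <record><county>San Diego</county><population>2813833</population></record>
```

Once every source speaks the same XML dialect, the mediator can integrate and query records without caring what format each agency stores them in.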
In addition to the above research partners, government participants in this project include the U.S. Census Bureau, the Bureau of Labor Statistics, the State of Pennsylvania's Department of Community and Economic Development and Department of Labor and Industry, the San Diego Association of Governments (SANDAG), the National Archives and Records Administration, and the U.S. Geological Survey. "We wanted to include a cross-section or representative sample from federal, state, and local government levels to be sure that the unique problems of each level were represented in the testbed," Baru says.


As an example of how these tools work in practice, a researcher might seek information about land use to create a report on the rate of change of green space in relation to population growth in San Diego County, California, over the last five years. Sitting at a desktop computer, the researcher uses a single Web interface to submit queries about green space in San Diego County. Invisibly to the researcher, the Web-based software tools analyze the query, and using a large, high-speed catalog, identify which databases have the information sought. The information, it turns out, is in several places. The population information is in federal census databases (also distributed among several sites). The green space information is in both state-maintained databases and local government databases maintained by SANDAG, a regional organization of local governments.

As the researcher works, the mediation software accesses the databases on the different sites, translates the various data formats into the common data model of XML, and retrieves the subsets of data needed. It then combines the information and performs analysis to determine how fast population and green space are changing at the various locations over the five-year period. Finally, it lets the researcher choose how to visualize the information and display it using geographic information system (GIS) tools that show both population and green space overlaid on a topographic map of San Diego County so that the information is viewed in ways that give the most insight.
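The query flow just described can be sketched as follows. A catalog maps query topics to data sources, the mediator fetches only the subsets it needs, merges them by year, and computes the rate of change. The source names and all figures below are invented stand-ins; the real testbed queries live government databases.

```python
# Hypothetical catalog: which source holds which topic.
CATALOG = {
    "population": "census_db",
    "green_space": "sandag_db",
}

# Hypothetical source contents, keyed by topic then year.
SOURCES = {
    "census_db": {"population": {1995: 2_600_000, 2000: 2_813_000}},
    "sandag_db": {"green_space": {1995: 300_000, 2000: 316_000}},
}

def mediate(topics):
    """Route each topic to its source and merge the results by year."""
    merged = {}
    for topic in topics:
        source = SOURCES[CATALOG[topic]]
        for year, value in source[topic].items():
            merged.setdefault(year, {})[topic] = value
    return merged

view = mediate(["population", "green_space"])
years = sorted(view)
# Fractional growth of each quantity over the period.
growth = {t: view[years[-1]][t] / view[years[0]][t] - 1
          for t in ("population", "green_space")}
```

The integrated `view` is exactly the kind of new data set the mediator hands to a visualization portal; neither source alone could answer the green-space-versus-population question.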

The system allows the researcher to adjust and refine the queries and analysis until the right information is collected and displayed in the most useful way. In just a short while, the researcher has learned that green space has grown during the research period, but much more unevenly than previously believed. The seamless access to information provided greater understanding of population growth and land use in far shorter time than ever before.



The principal technologies that DICE researchers are using to build this digital government information integration testbed come from the DICE group's MIX project, developed in collaboration with UC San Diego researchers. The MIX mediator software produces integrated views of information: it collects, filters, and fuses data gathered from multiple sources to give end users a data product that meets their information needs, organized and presented in a way that facilitates research or decision-making.

"What we're constructing is a testbed for studying sophisticated database techniques for integrating information from multiple distributed sources. In particular, we are focusing on statistical and geospatial information sources, since governments at all levels routinely deal with such information," Baru says.

DICE researchers are using an "on-demand" approach in which data is fetched from individual sources only when requested by the client, rather than being assembled into an integrated data warehouse in advance. Because the user always draws on the original sources at query time, the results are always current.

Major research goals of the I2T proposal include extending the MIX system to support statistical and geospatial information services; designing data transfer protocols to support queries from both mobile clients and large-scale bulk, distributed analysis applications; and investigating new applications of the I2T testbed environment, including researcher interfaces such as the Sociology Workbench and distributed decision-support applications that perform data analysis and data mining operations on integrated views provided by MIX. To support these capabilities, DICE researchers will explore "plug-in" support for external operations such as access to databases requiring special utilities. An additional part of the I2T project is to evaluate the Census Bureau's prototype Federal Electronic Research and Review Extraction Tool (FERRET) and explore how MIX technology might be employed in FERRET.
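The "plug-in" support mentioned above can be illustrated with a simple registry: external operations sign themselves up under a name, and the mediator invokes them on an integrated view without knowing their internals. The registry, decorator, and plug-in below are illustrative assumptions, not the I2T design.

```python
# Registry of external operations the mediator can call by name.
PLUGINS = {}

def plugin(name):
    """Decorator that registers an external operation under a name."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register

@plugin("mean")
def mean(values):
    """A stand-in statistical operation supplied as a plug-in."""
    return sum(values) / len(values)

def apply_plugin(name, data):
    """The mediator invokes a plug-in without knowing its internals."""
    return PLUGINS[name](data)

result = apply_plugin("mean", [2, 4, 6])  # -> 4.0
```

The same registry pattern would let a plug-in hide access to a database that requires special utilities, exposing only a named operation to the mediator.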



Transparency is a major goal, and DICE researchers are working to make it possible for a researcher or ordinary citizen to have a single point of entry and a uniform Web-based interface to seamlessly access the entire universe of government data right from their desktop and use it in novel ways. This requires the capability to integrate fundamentally different types of data. For example, a demographic information source may contain income data, while another database may contain epidemiological data on disease incidence. A mediated view across these two data sources that lets researchers integrate both information types can provide new insights into the relationship between income distribution and disease incidence patterns.
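The mediated view across the two sources just mentioned amounts to a join on a shared key. In this sketch, income data from one hypothetical source and disease-incidence data from another are combined per region, producing an integrated record neither source holds alone. All figures, keys, and field names are invented.

```python
# Hypothetical sources keyed by region (here, ZIP code).
income = {"92101": 38_000, "92103": 52_000, "92113": 29_000}
incidence = {"92101": 6.1, "92103": 4.2, "92113": 8.7}  # cases per 1,000

def mediated_view(a, b, a_name, b_name):
    """Join two keyed sources into one integrated record per shared key."""
    shared = sorted(set(a) & set(b))
    return [{"region": k, a_name: a[k], b_name: b[k]} for k in shared]

view = mediated_view(income, incidence, "median_income", "incidence_rate")
```

A researcher querying `view` can now correlate income with disease incidence directly, which is exactly the kind of new insight the integrated view is meant to enable.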

At first glance, integrating the data types in the above problem or the green space example may not seem too difficult, just a matter of going to the different databases and assembling the data. But upon closer inspection, there are major challenges facing DICE researchers as they attempt to build a truly seamless system.

In addition to the problem of mediating or translating all the data formats into the common XML language, it turns out that in the green space example different government agencies use slightly different definitions of land boundaries. Some specify location in latitude and longitude, while others use other coordinate systems; agencies also differ on what counts as green space. So the software must find ways to reconcile all these differences in granularity, resolution, and data definitions, which make the data what researchers call semi-structured. "The strength of MIX and XML technology is that it can deal with semi-structured data, which is something very difficult to do in standard relational database systems," Baru says.
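The reconciliation step can be sketched concretely. Suppose one source reports locations in latitude/longitude while another uses a local grid, and each uses its own label for green space; before integration, the mediator normalizes both. The grid origin, conversion factor, and label map below are deliberately crude placeholders, nothing like a real spatial mediator's conflation logic.

```python
GRID_ORIGIN = (32.5, -117.3)   # assumed origin of the local grid (lat, lon)
DEG_PER_KM = 1 / 111.0         # rough degrees-per-kilometer conversion

# Map each agency's terminology onto one common category name.
LABEL_MAP = {"open space": "green_space", "parkland": "green_space",
             "green space": "green_space"}

def normalize(record):
    """Bring a source record into common coordinates and category names."""
    out = dict(record)
    if "grid_xy_km" in out:    # translate local grid -> lat/lon
        x, y = out.pop("grid_xy_km")
        out["lat"] = GRID_ORIGIN[0] + y * DEG_PER_KM
        out["lon"] = GRID_ORIGIN[1] + x * DEG_PER_KM
    out["category"] = LABEL_MAP[out["category"].lower()]
    return out

state_rec = normalize({"lat": 32.8, "lon": -117.1, "category": "Open Space"})
local_rec = normalize({"grid_xy_km": (11.1, 22.2), "category": "parkland"})
```

After normalization, both records can be compared and overlaid on the same map, which is the job the Spatial Mediator performs for real geospatial sources.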

Another challenge is data quality. "You can't just ask, 'Where is the data?' You also need to ask, 'What is the quality of the data?'" Baru says. "Especially for researchers, they need to preserve information about the origin of each part of this integrated data, what is called metadata, and keep track of the accuracy or error bars and how they are modified through all the operations performed on the data as it is integrated and analyzed."
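The metadata bookkeeping Baru describes can be sketched with a value type that carries its origin and an error bar through every operation, so the integrated result still records where its parts came from and how accurate they are. The class, the sources, and the error-combination rule (a conservative sum of absolute errors) are all illustrative assumptions.

```python
class Measured:
    """A value bundled with provenance metadata and an error bar."""
    def __init__(self, value, error, sources):
        self.value = value
        self.error = error
        self.sources = sources   # set of origin identifiers

    def __add__(self, other):
        # Propagate both provenance and a conservative error bound.
        return Measured(self.value + other.value,
                        self.error + other.error,
                        self.sources | other.sources)

pop_a = Measured(1_200_000, 5_000, {"census_tract_file"})
pop_b = Measured(1_613_000, 8_000, {"state_estimate_db"})
total = pop_a + pop_b   # provenance and error bars survive the operation
```

Each analysis step would extend the same idea: any operation on the data also updates the metadata, so a researcher can always ask where a number came from and how much to trust it.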

"There are some definite challenges we're facing," says Baru, "but as we extend these technologies with our partners, we're on the threshold of a whole new way of interacting with government and other data." --PT *
