    DATA-INTENSIVE COMPUTING ENVIRONMENTS

    A Recipe for MIXing and Searching Online Information

    PROJECT LEADERS
    Chaitan Baru, SDSC
    Yannis Papakonstantinou, UC San Diego

    PARTICIPANTS
    Amarnath Gupta, Bertram Ludäscher, Yuri Kuzminsky, Richard Marciano, Kevin Munroe, Paul Nguyen, Ilya Zaslavsky, SDSC
    Michail Petropoulos, Pavel Velikhov, Victor Vianu, UC San Diego

    The Web lets anyone access an almost limitless supply of data from businesses, government agencies, and other organizations. As more and more data come on line, the problem becomes finding the right information to answer a question. A homebuyer, for example, might look for houses based not only on the number of bedrooms, but also on the quality of the local schools and neighborhood crime statistics. Today, the homebuyer has to dig up this information from real estate listings, the school district, and law enforcement agencies. But a new project led by SDSC and the UC San Diego Database Lab is making it possible to create intelligent "hyper-Web" sites that pull together the information from such diverse sources.

    MIX AND MATCH COLLECTIONS

    The project is called MIX, short for Mediation of Information using XML. The Database Lab and SDSC's Data-Intensive Computing Environments (DICE) group have joined forces in MIX to advance the state of the art in data mediation using the eXtensible Markup Language (XML) and to apply the results to real problems, including those faced by NPACI partners.

    "XML's ability to model semistructured data is the key," said Chaitanya Baru, manager of the DICE group at SDSC. "It's very simple, but the software built around it enables many things--in much the same way that the Web is enabled by the simple HTTP standard."

    As its name suggests, the MIX software acts as a mediator between data sources that have different structures. To mix in a new data source, an existing data collection does not have to be converted to XML format. Instead, a MIX developer writes a wrapper, which translates between the data's native format and an XML view. Once all the data sources have been wrapped, the mediator provides an integrated XML view by selecting, restructuring, and merging the views from the wrapped data sources.
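
    In rough outline, a wrapper looks something like the sketch below. This is hypothetical code, not the MIX implementation, and the file and element names are invented: it exposes a CSV file of real-estate listings as an XML view that a mediator could consume.

        # A minimal, hypothetical wrapper sketch -- not the MIX code itself.
        # It translates a CSV source into an XML view, one element per row.
        # Assumes the CSV column names are valid XML tag names.
        import csv
        import xml.etree.ElementTree as ET

        def wrap_csv_as_xml(path, root_tag="listings", record_tag="listing"):
            """Expose each CSV row as an XML element under a single root."""
            root = ET.Element(root_tag)
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    record = ET.SubElement(root, record_tag)
                    for column, value in row.items():
                        ET.SubElement(record, column).text = value
            return root

        # The mediator would merge such views from many wrapped sources.
        view = wrap_csv_as_xml("listings.csv")   # hypothetical file
        print(ET.tostring(view, encoding="unicode"))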

    Both views and user queries are specified in an XML-based query language called XML Matching and Structuring (XMAS), developed by Bertram Ludäscher of SDSC and by Yannis Papakonstantinou, Pavel Velikhov, and Victor Vianu of UC San Diego. The MIX mediator translates XMAS queries on the integrated view into queries directed to the individual information sources. The MIX wrappers developed for each source translate between XMAS and the native format of the data source.
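
    In spirit, the mediation step resembles the toy sketch below (illustrative only; actual XMAS queries and the MIX mediator are far more sophisticated): a condition posed against the integrated view is pushed down to the records of each wrapped source, and the matches are merged.

        # A toy mediator -- not the MIX mediator or XMAS syntax. It pushes a
        # predicate down to each wrapped source view and merges the matches.
        import xml.etree.ElementTree as ET

        def mediate(predicate, source_views, root_tag="integrated-view"):
            """Merge the records of every source view that satisfy the
            predicate into one integrated XML view."""
            merged = ET.Element(root_tag)
            for view in source_views:        # each view is a wrapped XML tree
                for record in view:          # each child element is a record
                    if predicate(record):
                        merged.append(record)
            return merged

        # Two invented source views; find listings with at least 3 bedrooms.
        mls = ET.fromstring("<mls><listing><bedrooms>4</bedrooms></listing></mls>")
        fsbo = ET.fromstring("<fsbo><listing><bedrooms>2</bedrooms></listing></fsbo>")
        result = mediate(lambda r: int(r.findtext("bedrooms", "0")) >= 3, [mls, fsbo])
        print(ET.tostring(result, encoding="unicode"))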

    The Database Lab and SDSC are also developing the Blended Browsing and Querying (BBQ) interface, which allows less sophisticated users to generate XMAS queries intuitively, without knowledge of XMAS--or any query language for that matter (Figure 1). The BBQ interface also supports browsing and further querying based on the returned results.
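
    The query-by-example idea behind BBQ can be sketched roughly as follows (hypothetical code, not the BBQ implementation): the user fills in fields of a template document, and only the filled fields become query conditions.

        # A rough sketch of query by example -- not the BBQ implementation.
        # The user fills in a form over the XML view's structure; blank
        # fields are ignored, so no query language is ever typed.
        def form_to_predicate(form):
            """Turn the filled-in template fields into a record predicate
            usable with a mediator like the sketch above."""
            def predicate(record):
                return all(record.findtext(field) == value
                           for field, value in form.items() if value)
            return predicate

        # Only city and bedrooms constrain the results; pool is left blank.
        predicate = form_to_predicate(
            {"city": "San Diego", "bedrooms": "3", "pool": ""})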

    The ability to integrate distinct Web sites has obvious benefits for information seekers. A MIX-enhanced site could provide a single interface that would automatically query and extract data from relevant databases, HTML pages, geographic information systems, or even legacy data systems. The homebuyer in the example would see a single page with all the information intelligently displayed and could then pose questions that combine pertinent home, school, and crime information (Figure 2).

    The MIX project is rooted in research that Database Lab members conducted over the past six years. Their core thesis has been that most Web information--and much other data--can be considered semistructured. Semistructured data do not subscribe to a rigid schema as in relational databases, yet they may have various degrees of structure. Information exchange and mediation systems gain flexibility by working with semistructured data models; one such model is OEM, introduced in 1994 by Papakonstantinou with Hector Garcia-Molina and Jennifer Widom of Stanford University and used by multiple academic and industrial research projects since. The emergence of XML as the de facto standard for semistructured data has further validated this thesis and brought mediation to the point where it is a practical approach to solving real problems.
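
    Two invented records illustrate the point: in a semistructured collection, documents may legitimately differ in which fields they carry, and queries simply supply defaults for the gaps.

        # Invented records, for illustration only: both listings belong to
        # the same collection even though their fields differ.
        import xml.etree.ElementTree as ET

        collection = ET.fromstring("""
        <listings>
          <listing>
            <address>123 Elm St</address>
            <bedrooms>3</bedrooms>
            <school-district>Poway Unified</school-district>
          </listing>
          <listing>
            <address>456 Oak Ave</address>
            <price>250000</price>
          </listing>
        </listings>
        """)

        # Queries tolerate the irregular structure by supplying defaults.
        for listing in collection:
            print(listing.findtext("address"), listing.findtext("bedrooms", "n/a"))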


    Figure 1. Prototype BBQ Interface
    The Blended Browsing and Querying (BBQ) interface allows less-experienced users to generate queries for the MIX system.

    Figure 2. Information at a Homebuyer's Fingertips
    This example of an end-user interface, for individuals seeking to buy a home in San Diego County, integrates information from a variety of Web sources using MIX technology.


    ART, HISTORY, AND GOVERNMENT

    The MIX approach is being used in NPACI's digital library efforts and in projects with government agencies, including the U.S. Census Bureau. In one digital library prototype, MIX technology works with the SDSC Storage Resource Broker (SRB) to provide access to an art image collection (see p. 10).

    In the area of government, MIX techniques will help create virtual agencies from on-line data resources. A pilot project with the U.S. Census Bureau has developed a single XML view over what are in fact many Web sources. Beyond the pilot effort, the advance that makes much of this possible is the conversion to XML format of geographic information system (GIS) objects--the polygons and lines that represent geographic locations.
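
    A small sketch conveys the flavor of such a conversion (the element names are invented, not the project's actual GIS schema): a polygon boundary becomes an ordinary XML document that a mediator can treat like any other source.

        # Hypothetical element names -- a sketch of encoding a GIS polygon
        # as XML so geographic data can flow through an XML mediator.
        import xml.etree.ElementTree as ET

        def polygon_to_xml(name, vertices):
            """Encode a named polygon as XML, one <point> per vertex."""
            polygon = ET.Element("polygon", {"name": name})
            for lon, lat in vertices:
                ET.SubElement(polygon, "point",
                              {"lon": str(lon), "lat": str(lat)})
            return polygon

        tract = polygon_to_xml("census-tract-83.03",
                               [(-117.16, 32.71), (-117.15, 32.71),
                                (-117.15, 32.72)])
        print(ET.tostring(tract, encoding="unicode"))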

    By their very nature, the activities of many government agencies are defined by geographic boundaries--local precincts and development zones, counties, congressional districts, national forests, and many others. Local, state, and federal agencies track similar data, but at different granularities. For example, the San Diego Association of Governments (SANDAG) has data similar to that of the Census Bureau, but at higher resolution. If all of these agencies provided their data in a common language, such as XML, the government could offer citizens a single launching point from which to gather the relevant information from each agency.


    THE LONGEVITY PROBLEM

    In addition to combining previously unrelated Web sites, the XML framework offers other possibilities. As already noted, the data sources need not be limited to Web sites, nor even to today's data. A major goal of NPACI's DICE thrust area is to store collections of digital objects in long-term archives. This goal has coincided with the pressing needs of one of the MIX project's early supporters, the National Archives and Records Administration (NARA), and has spawned the XArchive project.

    "The strength of MIX is that all the technology keys off XML, which also makes it a potential solution for the longevity problem," Baru said. "NARA likes the fact that XML is the only standard we're banking on." The longevity problem is a paradox of the Information Age. As it becomes possible to generate more and more data, the technology for storing that data becomes obsolete and inaccessible at a faster rate. Anyone with a stack of 5.25-inch floppy disks tucked in a closet has experienced this firsthand.

    The problem for NARA is many times more complex. NARA must archive all information that any government agency deems of consequence, in whatever form the agency provides, and keep it accessible for 400 years. Chief Archivist John Carlin last year estimated that NARA had taken in a total of 90,000 files since 1972, but that now the Treasury Department alone generates 960,000 e-mail files per year.

    NARA partnered with SDSC to produce a system that could ingest 1 million electronic documents per year, for starters. NARA provided a test set of 1 million e-mails, and using the XML-based information model, the MIX team was able to translate the million messages into XML form and store them in two days.
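
    The flavor of that translation can be sketched as follows (a hypothetical schema, not NARA's actual information model): each message's headers and body map onto elements of an XML document suitable for archival storage.

        # Hypothetical archival schema -- not NARA's information model.
        # Maps an RFC 822 e-mail message onto an XML document.
        import email
        import xml.etree.ElementTree as ET

        def email_to_xml(raw_message):
            """Translate one e-mail message into an XML document."""
            message = email.message_from_string(raw_message)
            doc = ET.Element("email")
            for header in ("From", "To", "Date", "Subject"):
                ET.SubElement(doc, header.lower()).text = message.get(header, "")
            ET.SubElement(doc, "body").text = message.get_payload()
            return doc

        doc = email_to_xml("From: clerk@example.gov\n"
                           "To: records@example.gov\n"
                           "Subject: Quarterly report\n\n"
                           "Please archive the attached report.")
        print(ET.tostring(doc, encoding="unicode"))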

    The elegance of XArchive is that software 400 years in the future will need only knowledge of the XML standard to view and search the e-mail collection. In some sense, XML is the "paper" for storing information in digital archives. Regardless of content, documents printed on paper can be viewed and searched in the same way whether they are five years old or 200. Translating digital objects to XML serves much the same function: no matter which software created the objects or which medium stores them, future software that understands XML can extract the relevant information.

    But even XML will not solve the full scope of the problem. "Standards change," Papakonstantinou said. "It's a natural process. Collections may have objects that subscribe to older standards, some to newer standards. Even as XML standards emerge, users will still need the ability to transform, translate, and evolve data according to their needs. And this is very much a database research question." --DH
