A NEW TOOL FOR GIS
FREEDOM OF INFORMATION: MULTIVALENT DOCUMENTS
PEER REVIEW THROUGH COLLABORATIVE FILTERING
The traditional system for handling scholarly information is on a
collision course with the exponential growth of this information.
Even spending 15% more each year, university libraries are simply
unable to keep up as this information doubles about every four
years, overwhelming both the technical and financial models that
have worked until now. Both to overcome these "growing pains"
and to explore the promise of fundamental improvements that modern
information technologies might provide, UC Berkeley researcher
Robert Wilensky and colleagues are rethinking the entire life
cycle of scholarly information, from creation to collaboration,
review, dissemination, and archiving, as part of the NSF Digital
Libraries Initiative, Phase 2 (DLI2).

Figure 1. GIS Viewer 3.0 Wide Zoom
GIS Viewer 3.0 has a wide zoom range using many data
types, from large-scale images to 25.6 km maps down to high
resolution 200 m maps and 100 m resolution images. As the
user zooms in or out, GIS Viewer performs transparent reprojection
between latitude/longitude, Albers, and UTM projections
while automatically using the appropriate data set.
Information technologies have had major impacts
on scholarly communication in areas from research and publishing
to libraries. However, many applications have involved improvements
in traditional ways of doing things rather than qualitatively
new approaches. Is the academic community overlooking important
new possibilities in digital technologies that could radically
enhance the scholarly information life cycle? Wilensky and colleagues
in the UC Berkeley Digital Library project think so, and they
are pursuing research on a number of fronts that shows promise
of fundamentally transforming scholarly communication.

"We envision wide applications of new information
paradigms across many areas of society, but initially we're focusing
on scholarly communication, both because of the need in light
of steeply rising costs, and because there are fewer barriers
to change since academic researchers are naturally interested
in wide circulation of the knowledge they create, with fewer impediments
like copyright concerns," said Robert Wilensky, principal investigator
of the UC Berkeley Digital Library project and Professor, Computer
Science Division and School of Information Management and Systems.

As part of DLI2, the UC Berkeley Digital Library
project is addressing an array of issues in scholarly communication
involving both software and data collections, some in the InterLib
collaboration between UC Berkeley, UC Santa Barbara, and Stanford
University that is demonstrating technologies in the California
Digital Library (CDL) as well as in a testbed developed by SDSC.

A NEW TOOL FOR GIS
In developing a general document model, Wilensky
and colleagues wanted to include information types beyond text,
enabling researchers to seamlessly federate and interact with
any kind of information.

"Because most people don't consider geographic
data to be documents (things like maps, spatial data, digital
raster graphics, or rectified images), working with Geographic Information
System (GIS) data allowed us to broaden the definition of what
people think of as documents," said Wilensky.

In addition to being able to access a wide
variety of data formats, Wilensky and researchers including Loretta
Willis, Jeff Anderson-Lee, and Howard Foster of the UC Berkeley
Digital Library project developed GIS Viewer 3.0 as a Java applet
that supports a number of powerful capabilities, including displaying
and turning on and off multiple layers of geopositioned data;
displaying vector, raster, and point data; smooth zooming and
panning; hyperlinking points and regions; interacting with other
online systems through server-side, Common Gateway Interface (CGI)
scripts; and finally, distributed end-user annotation and saving.
Figure 1 shows the application's wide zoom capability.

The project has evolved into a substantial
tool that researchers are beginning to use in numerous applications.
"For example, the GIS Viewer interoperates with all the data on
the TerraServer, which contains the U.S. Geological Survey aerial
photos and high resolution graphic map images for the entire United
States. That's a terabyte-sized data set, and we're pleased that
the GIS Viewer is a serious tool able to work at that scale,"
said Wilensky.

While these capabilities are interesting in
themselves, the GIS Viewer is a special case of a more general
and powerful approach the researchers have been working on for
several years, including what they call multivalent documents,
which enable completely new modes of scholarly communication.

FREEDOM OF INFORMATION: MULTIVALENT DOCUMENTS

Figure 2. Multivalent Document "Lens" Function for Scanned Text
The multivalent document model enables seamless access to different
data types, including scanned text. Shown is the "lens"
function, capable of applying optical character recognition
and magnification to user-selected areas. In a scanned page
"enlivened" as a multivalent document, it is also
possible to select and paste text, highlight matching search
terms, and perform other manipulations such as sorting a table.

Imagine that it is possible to seamlessly
open and work with HTML, PDF, or any word processing file, as
well as scanned or legacy documents. Imagine it is possible to
compose these different document types, overlaying them with data
retrieved from distant databases. Imagine it's possible to "semantically
align" these documents, as well as alternately hide, display,
or modify them in different ways the user selects. Imagine it's
also possible to perform complex operations including sophisticated
searches, analyses, and visualization options, along with more
traditional word processing and editing operations. Then, imagine
that these documents can be annotated, in a new document, as easily
and flexibly as writing notes on a paper page. And if the software
lacks a needed capability, it can be added by those who need it.

To achieve these capabilities, the researchers
have used a middleware model with structured document layers and
functions, providing an "anytime, anywhere, any type, every way
user-improvable digital document platform." This approach provides
freedom from traditional barriers of different file formats and
data types. In addition to reducing fragmentation in data, the
model is also designed to help overcome functional fragmentation
or limitations in applications. Traditionally, new functions can
be added only by the developer, and are neither consistently implemented
nor uniformly available across different applications. The approach
Wilensky and colleagues have taken is to invert traditional application
design, specifying only a minimum set of protocols, with all major
functions moved into extensions.

"The protocols we've specified may seem pretty
conventional, and indeed that's the point. Because they're so
minimal, special case support isn't buried in the infrastructure,
and developers must instead rely on what we call behaviors to
extend the protocols for all interesting functionality," said
Wilensky. In an open source environment, this approach allows
for very flexible extension, and the researchers have already
been able to implement a number of advanced capabilities, including
such non-obvious ones as "lenses," which generate alternate views
of selected areas, including OCR and magnification (Figure 2).
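
The inverted design described above, a minimal core with all interesting functionality supplied as behaviors, can be illustrated with a small sketch. This is not the Multivalent Documents API: the class `Document`, its methods, and the toy behaviors `magnify` and `highlight` are hypothetical names chosen for illustration, and upper-casing merely stands in for a real magnifying lens.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical minimal core: it only stores named layers and
    dispatches to behaviors. It has no built-in features of its own."""
    layers: Dict[str, List[str]] = field(default_factory=dict)
    behaviors: Dict[str, Callable[["Document", str], List[str]]] = field(
        default_factory=dict
    )

    def add_layer(self, name: str, lines: List[str]) -> None:
        self.layers[name] = list(lines)

    def register(self, name: str,
                 behavior: Callable[["Document", str], List[str]]) -> None:
        # Anyone can add a capability without touching the core.
        self.behaviors[name] = behavior

    def apply(self, behavior_name: str, layer_name: str) -> List[str]:
        # Behaviors return an alternate view; the stored layer is unchanged.
        return self.behaviors[behavior_name](self, layer_name)


def magnify(doc: Document, layer: str) -> List[str]:
    """Toy 'lens': upper-casing stands in for magnification."""
    return [line.upper() for line in doc.layers[layer]]


def highlight(term: str) -> Callable[[Document, str], List[str]]:
    """Build a behavior that marks every occurrence of a search term."""
    def behavior(doc: Document, layer: str) -> List[str]:
        return [line.replace(term, f"[{term}]") for line in doc.layers[layer]]
    return behavior


doc = Document()
doc.add_layer("text", ["multivalent documents", "scanned page"])
doc.register("magnify", magnify)
doc.register("highlight-scanned", highlight("scanned"))

print(doc.apply("magnify", "text"))
print(doc.apply("highlight-scanned", "text"))
```

The point of the sketch is the division of labor: the core never changes when a new function is needed, so a lens or a search highlighter is just another registered behavior.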

Wilensky and UC Berkeley computer science
post-doc Thomas Phelps have developed a number of other behaviors
including distributed annotation with copy editor marks, notes,
highlighting, and hyperlinks, in which it is possible to add any
of these annotations to someone else's web page in a new document,
and then share this result with others, all without any special
server support.

"You can much more fluidly move through and interact
with both your own information and other people's information.
The boundaries are in some sense greyer, and collaborative work
can emerge more spontaneously," said Wilensky.

The open-ended or "messy" nature of the Web
with its dangling pointers, which many initially saw as a major
potential problem, has opened opportunities for broader collaborations,
which Wilensky wants to harness through this new model of multivalent
documents. But these new freedoms bring new challenges. A central
issue is how to make the new composited information in distributed
annotations persistent, since it now inherently involves links
to distributed information that the author does not control and
that changes chaotically. The researchers are pursuing successful
approaches to this robust hyperlink problem and the related subdocument
reference problem, as well as issues of security and quality control
or peer review.

PEER REVIEW THROUGH COLLABORATIVE FILTERING
The primary paradigm shift Wilensky believes
digital technologies promise for scholarly communication is to
move from a centralized, discrete publishing model, to a continuous,
distributed, self-publishing model. Rather than submit papers
to conventional journals and buy back the edited results at ever-increasing
cost, scholars can self-publish on local resources, and then engage
in a continuous, collaborative peer-review system.

"Academic communication is gated by technology
in the sense that in the paper world, paper is expensive, so it
pays to filter first. But in the electronic world there's no initial
cost, so you can distribute first and filter later," said Wilensky.
At first it might seem that peer review could be lost in such
a model. But Wilensky and graduate student Tracy Riggs are finding
that digital technologies offer novel opportunities for enhanced
collaboration and peer review.

"Rather than being eliminated, scholarly review
can take the form of distributed annotations and the use of reviewers
or recommender systems," said Wilensky. Not only will academics
easily be able to access primary resources, but they can create
new composite documents containing annotations on resources, which
they then rapidly share with others. Thus, the distributed digital
library becomes a medium for continuous scholarly collaboration.

But how do you know what is good research
amid the vast array of instantly published work? "You'd like to
look at those articles that are given good reviews by good reviewers, those
who have given useful and reliable reviews in the past as indicated
by agreement with other reviews, and maybe looking backwards with
the citation index. We've used a hubs and authorities type of
algorithm to establish credentials," explained Wilensky.

The researchers are using this approach to
build a collaborative filtering system, and in this way multivalent
documents offer "something much finer grained than collaborative
peer review or filtering that just makes an overall judgment,"
said Wilensky.

Beyond the GIS Viewer and multivalent documents,
the researchers are investigating a wide array of other approaches,
including such technologies as robust references; image retrieval
by image content; document image analysis; distributed search;
and natural language processing for information access. These
technologies promise to extend the digital library of the future,
enhancing the scholarly information life cycle. -PT 
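
The "hubs and authorities type of algorithm" Wilensky mentions for establishing reviewer credentials can be sketched as follows. This is an illustration under stated assumptions, not the researchers' implementation: it treats reviewers as hubs and papers as authorities, uses only favourable reviews as links, and ignores the agreement-with-other-reviews and citation signals mentioned in the article. The function name `hits` and the sample reviewers and papers are hypothetical.

```python
import math
from typing import Dict, Set, Tuple


def hits(endorsements: Dict[str, Set[str]],
         iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Hubs-and-authorities iteration on a reviewer -> paper graph.

    endorsements maps each reviewer to the set of papers that reviewer
    rated favourably (an assumed, simplified input format).
    Returns (reviewer_scores, paper_scores).
    """
    reviewers = sorted(endorsements)
    papers = sorted({p for ps in endorsements.values() for p in ps})
    hub = {r: 1.0 for r in reviewers}
    auth = {p: 1.0 for p in papers}

    for _ in range(iterations):
        # A paper scores well if well-scored reviewers endorse it.
        auth = {p: sum(hub[r] for r in reviewers if p in endorsements[r])
                for p in papers}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # A reviewer scores well if the papers they endorse score well.
        hub = {r: sum(auth[p] for p in endorsements[r]) for r in reviewers}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {r: v / norm for r, v in hub.items()}

    return hub, auth


# Hypothetical data: two reviewers agree on two papers, a third stands alone.
sample = {
    "alice": {"p1", "p2"},
    "bob": {"p1", "p2"},
    "carol": {"p3"},
}
hub, auth = hits(sample)

for name in sorted(auth, key=auth.get, reverse=True):
    print(name, round(auth[name], 3))
```

In this toy data the papers endorsed by mutually agreeing reviewers rise to the top, and the isolated reviewer's endorsement carries little weight, which is the mutual reinforcement the quote describes.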
Project Leader
Robert Wilensky, UC Berkeley
Participants
Howard Foster, Thomas Phelps, Jeff Anderson-Lee, Tracy Riggs, Loretta Willis, UC Berkeley
REFERENCES
Thomas A. Phelps and Robert Wilensky. June 2000. Multivalent Documents. Communications of the ACM 43(6): 83.
Thomas A. Phelps and Robert Wilensky. July/August 2000. Robust Hyperlinks and Locations. D-Lib Magazine 6(7/8).
dlp.cs.berkeley.edu