New information technologies are flooding scientists with data that
not only provide fresh insights into the natural world, but also
threaten to overwhelm existing data-handling capabilities. Researchers
need a tool that can efficiently move and reorganize their data
sets as they are gathered, stored, analyzed, and published. SDSC's
Data and Knowledge Systems (DAKS) program has developed the Storage
Resource Broker (SRB) for just this purpose. Initially released
four years ago, the middleware has more than 200 registered users
at some 50 sites. And in its largest deployment yet, United Kingdom
researchers have chosen the SRB as a core component of a data grid
architecture that will connect users at more than 10 sites.
Figure 1: Vegetation Index Data
"This shows the value
of the SRB system as well as the growth of data grids worldwide,"
said Chaitan Baru, co-director of the DAKS program, one of SDSC's
five strategic program areas. Just as the Internet has enabled people
around the globe to connect to diverse information sources, supercomputer
centers around the world are linking into computational and data
grids to deliver capabilities of unprecedented power and flexibility.
In the process, researchers are integrating new data-handling capabilities,
such as remote-data access, more deeply into the practice of science
(Figure 1).
In simple terms, a grid is a mechanism for connecting multiple,
separate computing and data storage systems. "The central challenge
is how to make all the parts interoperate or work together so that
the systems are no longer stand-alone but form a single, virtual
system," said Reagan Moore, SDSC distinguished scientist and
adjunct professor in UCSD's Computer Science and Engineering
Department.
To support these capabilities, computer scientists are designing
a new generation of data management infrastructure, the plumbing
of networks, hardware, and software to connect everything. As this
infrastructure matures, it is opening novel avenues of scientific
research by providing vastly more data and computing power, and
broader, more rapid collaborations with colleagues.
GETTING A GRIP ON DATA
In this grid infrastructure,
the ability to manage massive data sets plays a crucial role.
"Digital data collections are indispensable to the advance
of science," said Moore.
For example, researchers are using the SRB data-handling system
to organize data archived at the Scripps Institution of Oceanography
(SIO) from shipboard expeditions covering the world's oceans.
Without the SRB, a researcher must pick the geographic location
for study and then laboriously produce a list of the available
data for that area. For regions of the high seas, even only a
few degrees square, the search can turn up thousands of data files
from various vessels and cruises, which may be stored anywhere
among the million or so files and 400 gigabytes of archives. It
can take days at SIO to find the archived media and access the
appropriate files.
In contrast, with the help of the SRB, researchers, teachers,
and students will soon be able to use an Internet browser on a
desktop computer to search the SRB's central metadata catalog
for useful data by geographical area, ship, and many other criteria.
Upon finding data of interest, users can immediately view or download
them into new collections. "This is a much faster, more flexible
and powerful way of managing data," said SDSC researcher
John Helly, who is collaborating in establishing SRB archives
for oceanographic data at SIO.
The SRB lets users create, share, search, and replicate their
data collections. It works with either client software on the
user's system or through MySRB, a personalized, Web-based
version of the SRB. Both the client software and MySRB operate
in conjunction with a metadata catalog (MCAT) that holds helpful
descriptive information (metadata) about the data sets. In addition
to system-level metadata describing the data location, information
about the digital objects can be queried and users can add metadata
on the fly. "This collection-based design gives users a logical
view of data sets so that they no longer have to know such things
as file name, machine location, and protocol for the data they're
seeking," said Arcot Rajasekar of SDSC's DAKS program
and leader of the SRB project. "This speeds up research by
eliminating manual data management and frees researchers to concentrate
on their science."
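The logical view described above can be pictured with a small sketch. It is an illustration only, not the SRB client interface or the MCAT schema: the class names, fields, host, and path below are all hypothetical. The point is simply that a user asks for a logical name and the catalog supplies the host, path, and protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a data set (all fields are illustrative)."""
    host: str
    path: str
    protocol: str


@dataclass
class CatalogEntry:
    """Physical locations plus descriptive metadata for one logical name."""
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Hypothetical stand-in for a metadata catalog; not the MCAT schema."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, replica):
        """Record where a copy of the named data set physically lives."""
        self._entries.setdefault(logical_name, CatalogEntry()).replicas.append(replica)

    def annotate(self, logical_name, key, value):
        """Attach user-supplied metadata to an already registered name."""
        self._entries[logical_name].metadata[key] = value

    def resolve(self, logical_name):
        """Return the first registered replica for a logical name."""
        return self._entries[logical_name].replicas[0]


catalog = ToyCatalog()
name = "/demo/cruise-001/bathymetry"  # made-up logical name
catalog.register(name, Replica("archive.example.org", "/vault/0017/f3.dat", "tape"))
catalog.annotate(name, "ship", "Example Vessel")

copy = catalog.resolve(name)
print(f"{name} -> {copy.protocol}://{copy.host}{copy.path}")
```

In this toy version the caller never handles the host or path directly, and metadata can be attached at any time after registration, which mirrors the "on the fly" annotation the article mentions.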
Beyond storing and moving data, the SRB and MCAT also provide
a powerful data discovery mechanism that lets users find resources
and data sets of interest by searching on the rich, descriptive
metadata that the MCAT maintains. "The SRB lets researchers
organize distributed information into coherent collections, and
then explore and manipulate the organization of this information,
independent of how or where it is stored," said Moore.
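As a rough illustration of discovery by descriptive metadata, the sketch below filters a handful of records by geographic bounding box and, optionally, by ship, the two search criteria mentioned for the oceanographic archive. The record fields, the `search` function, and the sample values are assumptions made for this example; they do not reflect how the MCAT stores or queries its metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Descriptive metadata for one archived file (illustrative fields)."""
    logical_name: str
    ship: str
    lat: float  # degrees north
    lon: float  # degrees east


def search(records, *, south, north, west, east, ship=None):
    """Return logical names inside the box, optionally limited to one ship.

    The box test is deliberately simple and does not handle regions that
    cross the 180-degree meridian.
    """
    hits = []
    for rec in records:
        in_box = south <= rec.lat <= north and west <= rec.lon <= east
        if in_box and (ship is None or rec.ship == ship):
            hits.append(rec.logical_name)
    return hits


# Made-up sample records, not real cruise data.
RECORDS = [
    Record("/demo/cruise-a/track-01", "Vessel A", 32.5, -120.0),
    Record("/demo/cruise-b/track-07", "Vessel B", 33.1, -119.2),
    Record("/demo/cruise-a/track-02", "Vessel A", 10.0, -150.0),
]

nearby = search(RECORDS, south=30, north=35, west=-122, east=-118)
nearby_a = search(RECORDS, south=30, north=35, west=-122, east=-118, ship="Vessel A")
print(nearby)
print(nearby_a)
```

A production catalog would presumably run such queries in a database rather than scanning a list, but the shape of the question is the same: criteria in, logical names out.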
UK'S DATA GRID
Grid computing is coming
to the UK, and the Central Laboratory of the Research Councils
(CLRC) for the UK has established the e-Science Centre to make
its facilities grid-aware. The Centre is drawing together
experts from many science departments to work on an integrated
architecture for all UK facilities (Figure 2). "The UK e-Science
Grid will be as much about data as computation," said Tony
Hey, e-Science core technology director. "The SRB is useful
for our grid because it offers support for database users as well
as files, which we see as a key area for grid middleware in the
future."
In addition to the SRB, the software selected for the initial
UK data grid effort includes Globus grid middleware for underlying
grid services, including job management and security, and Condor
for local resource management and task farming for high-throughput
distributed computing.
As part of the e-Science Centre, the UK Grid Support Centre has
been established in cooperation with Manchester Computing and
the Edinburgh Parallel Computing Centre. The Grid Support Centre
is distributing a CD-ROM with a Grid Starter Kit for the UK academic
community that contains quick guides and installation software
for basic deployment of Globus, Condor, and the SRB. "The
choice to include the SRB in the UK Grid Starter Kit is the clearest
indicator to date that the SRB has moved from research prototype
to production tool," said SDSC's Rajasekar.
Initial use of the SRB in the UK will be in a test bed for the
Earth sciences community. One national and eight regional e-Science
Centres have been set up with grid test bed projects such as high-energy
physics, structure-property mapping using combinatorial chemistry,
and bioscience microarray data. "The aim of the e-Science
effort is to integrate all our experimental, computational, and
data resources, connect them to other sites, and make them easily
available to our user community," said Kerstin Kleese van
Dam, of the CLRC-Daresbury Laboratory e-Science Centre.
In addition to distributing the Grid Starter Kit, the UK Grid
Support Centre will also provide support for SRB users throughout
the UK data grid community. "This is a first," said
Rajasekar. "It's an important step to enable large-scale
production use. Previously, all SRB support was provided by our
SDSC staff."
THE SRB IN DATA GRIDS
Figure 2: United Kingdom CLRC Data Grid
The SRB supports a
wider range of capabilities than other data grid environments.
It provides an effective approach for making grids globally consistent,
which is the key to making an interconnected grid environment
work. "By handling all negotiations and tasks, such as access
and protocol conversion, the SRB lets users seamlessly move or
access data anywhere in a uniform grid or SRB space," said
Rajasekar.
The SRB also provides multiple ways to access data in a grid environment.
For example, in the UK data grid, researchers can connect directly
to the SRB, or connect using GridFTP to local data, or connect to
the SRB through Condor. Another issue in grid data access is overcoming
the latency or delays caused by data sets that are physically
separated over wide-area networks. This can be addressed by data-caching,
mirroring or replicating data at different locations, data streaming,
and using remote proxy operations to extract smaller data subsets
at the source for faster network access. The SRB supports all
of these methods.
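Two of the techniques listed above, replication and caching, can be combined in a few lines. The sketch below is a toy model under stated assumptions: the site names, latency figures, `CachingReader` class, and `fake_fetch` function are invented for illustration and say nothing about how the SRB itself chooses replicas or manages its caches.

```python
class CachingReader:
    """Read from the lowest-latency replica and keep a local copy."""

    def __init__(self, latency_ms, fetch):
        self._latency_ms = latency_ms  # site name -> estimated latency
        self._fetch = fetch            # callable(site, name) -> bytes
        self._cache = {}
        self.remote_reads = 0

    def read(self, name):
        # Serve repeat requests locally instead of crossing the network.
        if name in self._cache:
            return self._cache[name]
        # Otherwise pick the replica site with the smallest estimated latency.
        site = min(self._latency_ms, key=self._latency_ms.get)
        data = self._fetch(site, name)
        self.remote_reads += 1
        self._cache[name] = data
        return data


contacted = []  # records which site each remote read went to


def fake_fetch(site, name):
    """Stand-in for a network transfer; returns placeholder bytes."""
    contacted.append(site)
    return f"{name}@{site}".encode()


# Hypothetical sites and latencies, chosen only for illustration.
reader = CachingReader({"site-far": 140.0, "site-near": 12.0}, fake_fetch)
first = reader.read("/demo/survey/grid.dat")
second = reader.read("/demo/survey/grid.dat")
print(contacted, reader.remote_reads)
```

Here the second read never touches the network, and the first goes to the nearer of the two replicas. Data streaming and remote subsetting at the source are not modeled.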
The SRB also includes support for security and flexible access
control for sharing data with colleagues. The SRB system runs
on UNIX; Windows 98, NT, Me, and 2000; Red Hat Linux (6.2); and
Macintosh OS X platforms. Fully functioning SRB demo versions
for Windows 98, NT, and 2000 are available on a free CD-ROM, which
may be obtained by sending e-mail to srb@sdsc.edu.
With the SRB, the size of data collections and the number of files
can grow very large. As the number of files increases into the
millions, they are organized into containers of files, with related
files stored in the same container for faster access. "As
data grows, users can scale up their SRB collections all the way
to the multi-terabyte data sets found on today's largest
production data grids," said Phil Andrews, co-director of
SDSC's DAKS program. PT
PROJECT LEADERS
Arcot Rajasekar,
Reagan Moore,
Michael Wan
SDSC
PARTICIPANTS
Tony Hey
Engineering and Physical Sciences Research Council
Paul Durham, David Boyd
e-Science Centre at CLRC
Kerstin Kleese van Dam
CLRC-Daresbury Laboratory, e-Science Centre
www.e-science.clrc.ac.uk
www.grid-support.ac.uk
www.npaci.edu/DICE/SRB