

DATA STORAGE

Exporting an SDSC Data Tool to the United Kingdom



New information technologies are flooding scientists with data that not only provide fresh insights into the natural world, but also threaten to overwhelm existing data-handling capabilities. Researchers need a tool that can efficiently move and reorganize their data sets as they are gathered, stored, analyzed, and published. SDSC’s Data and Knowledge Systems (DAKS) program has developed the Storage Resource Broker (SRB) for just this purpose. Initially released four years ago, the middleware has more than 200 registered users at some 50 sites. And in its largest deployment yet, United Kingdom researchers have chosen the SRB as a core component of a data grid architecture that will connect users at more than 10 sites.

Figure 1: Vegetation Index Data


“This shows the value of the SRB system as well as the growth of data grids worldwide,” said Chaitan Baru, co-director of the DAKS program, one of SDSC’s five strategic program areas. Just as the Internet has enabled people around the globe to connect to diverse information sources, supercomputer centers around the world are linking into computational and data grids to deliver capabilities of unprecedented power and flexibility. In the process, researchers are integrating new data-handling capabilities, such as remote-data access, more deeply into the practice of science (Figure 1).

In simple terms, a grid is a mechanism for connecting multiple, separate computing and data storage systems. “The central challenge is how to make all the parts interoperate or work together so that the systems are no longer stand-alone but form a single, virtual system,” said Reagan Moore, SDSC distinguished scientist and adjunct professor in UCSD’s Computer Science and Engineering Department.

To support these capabilities, computer scientists are designing a new generation of data management infrastructure—the “plumbing” of networks, hardware, and software to connect everything. As this infrastructure matures, it is opening novel avenues of scientific research by providing vastly more data and computing power, and broader, more rapid collaborations with colleagues.

GETTING A GRIP ON DATA

In this grid infrastructure, the ability to manage massive data sets plays a crucial role. “Digital data collections are indispensable to the advance of science,” said Moore.

For example, researchers are using the SRB data-handling system to organize data archived at the Scripps Institution of Oceanography (SIO) from shipboard expeditions covering the world’s oceans. Without the SRB, a researcher must pick the geographic location for study and then laboriously produce a list of the available data for that area. Even for a region of the high seas only a few degrees square, the search can turn up thousands of data files from various vessels and cruises, which may be stored anywhere among the million or so files and 400 gigabytes of archives. It can take days at SIO to find the archived media and access the appropriate files.

In contrast, with the help of the SRB, researchers, teachers, and students will soon be able to use an Internet browser on a desktop computer to search the SRB’s central metadata catalog for useful data by geographical area, ship, and many other criteria. Upon finding data of interest, users can immediately view or download them into new collections. “This is a much faster, more flexible and powerful way of managing data,” said SDSC researcher John Helly, who is collaborating in establishing SRB archives for oceanographic data at SIO.

The SRB lets users create, share, search, and replicate their data collections. It works with either client software on the user’s system or through MySRB, a personalized, Web-based version of the SRB. Both the client software and MySRB operate in conjunction with a metadata catalog (MCAT) that holds helpful descriptive information (metadata) about the data sets. In addition to system-level metadata describing the data location, information about the digital objects can be queried and users can add metadata on the fly. “This collection-based design gives users a logical view of data sets so that they no longer have to know such things as file name, machine location, and protocol for the data they’re seeking,” said Arcot Rajasekar of SDSC’s DAKS program and leader of the SRB project. “This speeds up research by eliminating manual data management and frees researchers to concentrate on their science.”

Beyond storing and moving data, the SRB and MCAT also provide a powerful data discovery mechanism that lets users find resources and data sets of interest by searching on the rich, descriptive metadata that the MCAT maintains. “The SRB lets researchers organize distributed information into coherent collections, and then explore and manipulate the organization of this information, independent of how or where it is stored,” said Moore.
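The collection-based idea can be made concrete with a small sketch. The Python below is not the SRB client API or the MCAT schema; it is a toy catalog, written only for illustration, in which every class name, host, path, and metadata value is hypothetical. It shows how a logical name can be kept separate from the physical replicas behind it, and how descriptive metadata (including attributes added on the fly) can drive discovery.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # Physical details a user should not need to know (all values hypothetical).
    host: str
    protocol: str
    path: str


@dataclass
class CatalogEntry:
    logical_name: str
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Illustrative stand-in for a metadata catalog; not the MCAT itself."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, replica, **metadata):
        # Record where a copy lives and attach descriptive attributes.
        entry = self._entries.setdefault(logical_name, CatalogEntry(logical_name))
        entry.replicas.append(replica)
        entry.metadata.update(metadata)

    def annotate(self, logical_name, **metadata):
        # Users can add metadata after the data set is registered.
        self._entries[logical_name].metadata.update(metadata)

    def search(self, **criteria):
        # Discovery by attribute values, with no reference to hosts or paths.
        return sorted(
            name
            for name, entry in self._entries.items()
            if all(entry.metadata.get(k) == v for k, v in criteria.items())
        )

    def locate(self, logical_name):
        # Resolve a logical name to its physical replicas.
        return list(self._entries[logical_name].replicas)


catalog = ToyCatalog()
catalog.register(
    "/ocean/cruise-01/bathymetry.dat",
    Replica("archive.example.org", "tape", "/vol7/a0001"),
    ship="ShipA", region="North Pacific",
)
catalog.register(
    "/ocean/cruise-02/bathymetry.dat",
    Replica("disk.example.org", "posix", "/data/b0422"),
    ship="ShipB", region="North Pacific",
)
catalog.annotate("/ocean/cruise-02/bathymetry.dat", quality="reviewed")

hits = catalog.search(region="North Pacific", ship="ShipB")
```

In this sketch a search by region and ship returns logical names only; the host, protocol, and path are consulted through `locate` when the data are actually retrieved.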

UK'S DATA GRID

Grid computing is coming to the UK, and the Central Laboratory of the Research Councils (CLRC) for the UK has established the e-Science Centre to make its facilities “grid aware.” The Centre is drawing together experts from many science departments to work on an integrated architecture for all UK facilities (Figure 2). “The UK e-Science Grid will be as much about data as computation,” said Tony Hey, e-Science core technology director. “The SRB is useful for our grid because it offers support for database users as well as files, which we see as a key area for grid middleware in the future.”

In addition to the SRB, the software selected for the initial UK data grid effort includes Globus grid middleware for underlying grid services, including job management and security, and Condor for local resource management and task farming for high-throughput distributed computing.

As part of the e-Science Centre, the UK Grid Support Centre has been established in cooperation with Manchester Computing and the Edinburgh Parallel Computing Centre. The Grid Support Centre is distributing a CD-ROM with a Grid Starter Kit for the UK academic community that contains quick guides and installation software for basic deployment of Globus, Condor, and the SRB. “The choice to include the SRB in the UK Grid Starter Kit is the clearest indicator to date that the SRB has moved from research prototype to production tool,” said SDSC’s Rajasekar.

Initial use of the SRB in the UK will be in a test bed for the Earth sciences community. One national and eight regional e-Science Centres have been set up with grid test bed projects such as high-energy physics, structure-property mapping using combinatorial chemistry, and bioscience microarray data. “The aim of the e-Science effort is to integrate all our experimental, computational, and data resources, connect them to other sites, and make them easily available to our user community,” said Kerstin Kleese van Dam, of the CLRC-Daresbury Laboratory, e-Science Centre.

In addition to distributing the Grid Starter Kit, the UK Grid Support Centre will also provide support for SRB users throughout the UK data grid community. “This is a first,” said Rajasekar. “It’s an important step to enable large-scale production use. Previously, all SRB support was provided by our SDSC staff.”

THE SRB IN DATA GRIDS

Figure 2: United Kingdom CLRC Data Grid

The SRB supports a wider range of capabilities than other data grid environments. It provides an effective approach for making grids globally consistent, which is the key to making an interconnected grid environment work. “By handling all negotiations and tasks, such as access and protocol conversion, the SRB lets users seamlessly move or access data anywhere in a uniform grid or SRB space,” said Rajasekar.

The SRB also provides multiple ways to access data in a grid environment. For example, in the UK data grid, researchers can connect directly to the SRB, use GridFTP to reach local data, or connect to the SRB through Condor. Another issue in grid data access is overcoming the latency caused by data sets that are physically separated over wide-area networks. This can be addressed by caching data, mirroring or replicating data at different locations, streaming data, and using remote proxy operations to extract smaller data subsets at the source for faster network access. The SRB supports all of these methods.
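Two of these latency remedies, replication and caching, are easy to illustrate. The Python sketch below is an assumption-laden toy rather than SRB code: the site names, latency figures, and the `fake_fetch` function are all hypothetical. It picks the replica site with the lowest estimated latency and keeps a local copy so that a repeated read does not cross the wide-area network again.

```python
class CachingReplicaReader:
    """Reads a named data set from the lowest-latency site and caches it locally."""

    def __init__(self, latency_ms, fetch):
        # latency_ms: hypothetical map of site name -> estimated round trip (ms).
        # fetch: callable (site, name) -> bytes, standing in for a network read.
        self._latency_ms = dict(latency_ms)
        self._fetch = fetch
        self._cache = {}
        self.remote_reads = 0

    def nearest_site(self):
        # Replica selection: choose the site with the smallest estimated latency.
        return min(self._latency_ms, key=self._latency_ms.get)

    def read(self, name):
        # Caching: only the first read of a name goes to a remote site.
        if name not in self._cache:
            self._cache[name] = self._fetch(self.nearest_site(), name)
            self.remote_reads += 1
        return self._cache[name]


def fake_fetch(site, name):
    # Stand-in for a wide-area transfer; returns bytes tagged with the site used.
    return f"{name} from {site}".encode()


reader = CachingReplicaReader({"site-a": 180.0, "site-b": 12.5}, fake_fetch)
first = reader.read("vegetation-index.dat")
second = reader.read("vegetation-index.dat")  # served from the local cache
```

A production system would also have to decide when a cached copy is stale and how replicas are kept consistent, which this sketch deliberately leaves out.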

The SRB also includes support for security and flexible access control for sharing data with colleagues. The SRB system runs on UNIX; Windows 98, NT, Me, and 2000; Red Hat Linux (6.2); and Macintosh OS X platforms. Fully functioning SRB demo versions for Windows 98, NT, and 2000 are available on a free CD-ROM, which may be obtained by sending e-mail to srb@sdsc.edu.

With the SRB the size of data collections and number of files can grow very large. As the number of files increases into the millions, they are organized into containers of files, with related files stored in the same container for faster access. “As data grows, users can scale up their SRB collections all the way to the multi-terabyte data sets found on today’s largest production data grids,” said Phil Andrews, co-director of SDSC’s DAKS program. —PT


PROJECT LEADERS
Arcot Rajasekar,
Reagan Moore,
Michael Wan
SDSC

PARTICIPANTS
Tony Hey
Engineering and Physical Sciences Research Council


Paul Durham, David Boyd
e-Science Centre at CLRC


Kerstin Kleese van Dam
CLRC-Daresbury Laboratory, e-Science Centre


www.e-science.clrc.ac.uk

www.grid-support.ac.uk

www.npaci.edu/DICE/SRB