News Archive

National Science Data Fabric Catalog Grows toward AI-Integrated Scientific Innovation

Published June 27, 2023

The NSDF Catalog encompasses 68 repositories with over 75 petabytes of data.

By Kimberly Mann Bruch

Since its inception in September 2021, the National Science Data Fabric (NSDF) Initiative—a pilot project that democratizes data-driven scientific discovery across an open network of institutions via data delivery, shared storage, computing and more—has made significant progress. Its latest development is the growing NSDF Catalog, which currently houses nearly 70 repositories ranging from geosciences databases to NASA imagery datasets.

The San Diego Supercomputer Center (SDSC) at UC San Diego, in collaboration with the University of Utah, has played a key role in the catalog’s development, which is led by a team at the University of Tennessee, Knoxville. The catalog project is headed by Dr. Jakob Luettgau, a research assistant professor, and Dr. Michela Taufer, the Dongarra Professor in UT’s Min H. Kao Department of Electrical Engineering and Computer Science.

“We’ve been working on this multi-federation catalog that houses more than 1.5 billion records from 68 community repositories,” said SDSC Director Frank Würthwein, who is a co-principal investigator on the NSDF initiative. “This means that we have over 75 petabytes of data cataloged and that requires a robust infrastructure as well as solid software.”

For the infrastructure, NSDF uses DoubleCloud, an industry partner that provides free hosting for the catalog. With DoubleCloud’s fully managed, tightly integrated open-source technologies, software developers can build a modern data stack and sub-second analytics for their portals.

“This catalog is likely to increase tenfold over the upcoming year,” said Christine Kirkpatrick, SDSC’s Research Services Division director. “We’re housing everything from geosciences databases to hefty NASA imagery datasets—and growing by the month.”

Würthwein said that NSDF is one of a collection of projects at SDSC aimed at establishing an open national data and knowledge infrastructure for all of open science. “SDSC’s involvement with projects in this area covers the entire vertical stack, from hardware such as the Open Science Data Federation, Open Storage Network and National Research Platform to higher-level federation of data catalogs like NSDF and DeCODER, or the Democratized Cyberinfrastructure for Open Discovery to Enable Research.”

An important focus for NSDF has been ensuring that the repositories are findable, accessible, interoperable and reusable (FAIR). Kirkpatrick said this ties in with several projects in which she is involved, including one for which she serves as principal investigator: FARR (FAIR in ML, AI Readiness, and AI Reproducibility). She also heads the GO FAIR U.S. Office; GO FAIR is an international effort to share, discuss and advance global collaboration in this arena, encompassing worldwide entities dedicated to open science and shared data.

“SDSC has been a leader in federated community data and knowledge infrastructure for over two decades,” said Ilkay Altintas, chief data science officer at SDSC and director of its Cyberinfrastructure and Convergence Research and Education (CICORE) Division, which currently works on a number of data hub initiatives including the Quantum Data Hub and the Wildfire and Landscape Resilience Data Hub.

“The next frontier for us is to create equitable and open services as an ecosystem for AI-integrated scientific innovations and solutions to societal challenges,” Altintas said. “NSDF is a key foundational infrastructure toward this goal.”

NSDF is supported by the National Science Foundation (grant no. 2138811).