Faculty and industry leaders discussed the latest research, services, and education development in big data. (L to R) Michael Zeller, founder/CEO of San Diego-based Zementis; SDSC Director Michael Norman; SDSC’s PACE Director Natasha Balac; Larry Smarr, founding director of CalIT2; and Stefan Savage, professor, Computer Science & Engineering, UC San Diego. Image: Erik Jepsen, UC San Diego Publications.
Never in human history has there been such a sustained period of explosive growth in data generated from basic science, medical advances, climate research, astronomical observations, and even retail transactions. Add to that the steep rise in other kinds of data generated online and via social media, such as emails, Google searches, Facebook posts, and YouTube videos.
Big data – defined as the gathering, storage, and analysis of massive amounts of computerized data – is also big business: a market growing at 45 percent annually that will reach $25 billion by 2015.¹
Harnessing big data – and having the ability to extract meaningful information from it to advance scientific discovery – was a key topic during a recent symposium called “Big Data at Work: A Conversation with the Experts,” held earlier this month at Atkinson Hall at UC San Diego’s Qualcomm Institute.
Many compare the growth of big data to the overwhelming force of a tsunami. Michael Norman, director of the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, suggests an elevator as a more appropriate analogy, since big data “keeps rising and rising” at almost unimaginable rates.
When SDSC opened its doors in 1985, supercomputing was synonymous with modeling and simulation. Today, that kind of research is itself a source of massive data sets. For example, a computer simulation of a magnitude 8.0 earthquake that originated in Northern California, rumbled down to Southern California, and finally ended up in Palm Springs generated 700 trillion bytes of data. That’s equivalent to 14,000 50-gigabyte Blu-ray discs.
“What’s different about big data in science now is the challenge of taking trillions of bytes of data that come from somewhere else, reading them into a supercomputer, and doing something with that data that contributes to new science and new knowledge,” Norman told symposium attendees.
The increased importance of digital data to the science and research community has prompted SDSC to transform itself in the past decade into a big data center, not only in terms of its data-intensive resources such as Gordon – the first supercomputer to use flash-based memory that makes it up to 100 times faster than other clusters – but for its expertise in all aspects of data management.
Processing speeds such as Gordon’s are needed to handle the “3 Vs” that characterize big data: Volume, Velocity, and Variety.
“Volume refers to the amount of data, velocity is the rate at which it streams across the network into databases, and variety is the different kinds of data that have to be put together in order to create insights,” explained Norman.
Where does all this data come from? From everywhere and from everyone. Emails alone constitute the single largest generator of big data on the planet at almost 3,000 petabytes (PB) per year, according to a 2013 Wired magazine article. Next come Facebook (183 PB), the Google search index (98 PB), and YouTube video uploads per year (15 PB). Compare that to the data generated by the Large Hadron Collider (15 PB), the Library of Congress digital collection (5 PB), or the NASDAQ stock market database (3 PB), according to that same report.
An unforeseen challenge involving big data occurs when universities or industries acquire instruments, such as genome sequencers, that generate terabytes of data per day without giving thought to where that data will be stored, and ultimately how the data will be analyzed.
“In academia, the challenge is not just in creating infrastructures for big data, but also the transformation of what we teach,” said SDSC’s Norman. “How we teach students new tools, new methodologies, and new kinds of questions to ask.”
San Diego is home to two industries – biotechnology and wireless communications – that are coming together in powerful ways in terms of big data, particularly in the area of personal health monitoring. In response, universities such as UC San Diego are creating data science programs to prepare students for this and other career opportunities of tomorrow.
“Under discussion is a master’s in advanced study in data science and engineering,” said Norman. “This is a joint effort between SDSC and the Department of Computer Science and Engineering at UC San Diego.”
The term supercomputing, once considered a hyped buzzword, gave rise to the academic field of study known as computational science. “In the same way, big data will give rise to an academic discipline known as data science,” said Norman.
¹ CRISIL GR&A (Credit Rating Information Services of India Limited, Global Research & Analytics)
As an Organized Research Unit of UC San Diego, SDSC is considered a leader in data-intensive computing and cyberinfrastructure, providing resources, services, and expertise to the national research community, including industry and academia. Cyberinfrastructure refers to an accessible, integrated network of computer-based resources and expertise, focused on accelerating scientific inquiry and discovery. SDSC supports hundreds of multidisciplinary programs spanning a wide variety of domains, from earth sciences and biology to astrophysics, bioinformatics, and health IT. With its two newest supercomputers, Trestles and Gordon, and a new system called Comet to be deployed in early 2015, SDSC is a partner in XSEDE (Extreme Science and Engineering Discovery Environment), the most advanced collection of integrated digital resources and services in the world.
Jan Zverina, SDSC Communications
858 534-5111 or firstname.lastname@example.org
Warren R. Froelich, SDSC Communications
858 822-3622 or email@example.com