SDSC IPP Webinar Series





Contact Info

For questions, please call 858.534.8321.



SDSC Industry Partners Program (IPP)

Webinar Series

Crafting Benchmarks for Big Data: The Big Data Top 100

Wednesday, March 12, 2014
11:00 AM – Noon PST / 2:00 – 3:00 PM EST

Chaitan Baru, Ph.D.
Associate Director, Data Initiatives, SDSC
Director, Center for Large-scale Data Systems Research (CLDS), SDSC

This webinar has passed.

View webinar slides.

View recording.

Webinar Abstract

Since early 2012, the Center for Large-scale Data Systems Research (CLDS) at the San Diego Supercomputer Center has fostered a community activity in big data benchmarking via the Workshops on Big Data Benchmarking (WBDB). These workshops have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios, involving pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and subsequent analysis, including visualization, data mining, predictive analytics, and, eventually, decision making. One outcome of the WBDB workshops has been the formation of a Transaction Processing Performance Council (TPC) subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPC-HS, based on TeraSort. TPC-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems.

Other proposals are also actively under development, including BigBench, which extends the TPC-H benchmark for big data scenarios; the Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics at different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop and MapReduce style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems.
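The pipeline benchmarks described above can be illustrated with a minimal sketch in Python. This is not any of the benchmarks named in the abstract; it is a hypothetical three-stage pipeline (clean, aggregate, analyze) that times each stage separately, which is the essential idea: per-stage measurements reveal that early scan-heavy steps and late analytics steps stress different system resources. All function names here are invented for illustration.

```python
import random
import string
import time


def generate_records(n, width=32):
    """Synthesize raw input records (a stand-in for large volumes of log data)."""
    rng = random.Random(42)  # fixed seed so runs are repeatable
    return ["".join(rng.choices(string.ascii_lowercase, k=width)) for _ in range(n)]


def clean(records):
    """Early step: scan-heavy normalization; the kind of work that suits
    a Hadoop/MapReduce style of data-parallel execution."""
    return [r.strip().lower() for r in records if r]


def aggregate(records):
    """Middle step: grouping and counting (here, by first character);
    shuffle-heavy in a distributed setting."""
    counts = {}
    for r in records:
        key = r[0]
        counts[key] = counts.get(key, 0) + 1
    return counts


def analyze(counts):
    """Late step: operates on small, structured data; the kind of work
    that favors SMP-style or large-memory systems."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5]


def run_pipeline(n=100_000):
    """Run the full pipeline, timing each stage separately."""
    data = generate_records(n)
    timings = {}

    t0 = time.perf_counter()
    data = clean(data)
    timings["clean"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    counts = aggregate(data)
    timings["aggregate"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    top = analyze(counts)
    timings["analyze"] = time.perf_counter() - t0

    return top, timings
```

A real pipeline benchmark would report each stage's elapsed time (and resource profile) as its own metric rather than a single end-to-end number, since different stages may be best served by different architectures.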

This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.


About the Speaker

Chaitan Baru is a Distinguished Scientist and research staff member at the San Diego Supercomputer Center. He has played a leadership role in a number of national-scale cyberinfrastructure R&D initiatives across a wide range of science disciplines, from earth sciences to ecology, biomedical informatics, and healthcare. One of his current initiatives is an industry-academia effort to define big data benchmarks and establish a BigData Top100 List. He also coordinates the SDSC Data Science Institute initiative for education and training in data science. Prior to joining SDSC in 1996, Baru was involved in the development of IBM's early UNIX-based shared-nothing database systems (DB2 Parallel Edition), where he also led a team that produced the industry's first result for a decision support benchmark (TPC-D). Baru has also served on the faculty of Computer Science and Engineering at the University of Michigan, Ann Arbor. He holds a B.Tech. from the Indian Institute of Technology, Madras, and an M.E. and a Ph.D. from the University of Florida, all in Electrical Engineering.