Predicting queue times on space-sharing parallel computers

by Allen B. Downey

This paper appeared at the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997 (IPPS '97).



I recently presented this paper in a talk for the Parallel Computing Lab at UCSD. Here are the slides I used (gzipped postscript).


Abstract

We present statistical techniques for predicting the queue times experienced by jobs submitted to a space-sharing parallel machine with first-come-first-served (FCFS) scheduling. We apply these techniques to trace data from the Intel Paragon at the San Diego Supercomputer Center and the IBM SP2 at the Cornell Theory Center. We show that it is possible to predict queue times with accuracy that is acceptable for several intended applications. The coefficient of correlation between our predicted queue times and the actual values from the simulated schedules is between 0.65 and 0.72.

Introduction

On space-sharing parallel machines, it is useful to be able to predict how long a submitted job will be queued before processors are allocated to it. These predictions have several applications.

This paper presents a model of the workload on a parallel machine and shows how to use this model to predict queue times. We present observations of the workload on the Intel Paragon at the San Diego Supercomputer Center (SDSC) and the IBM SP2 at the Cornell Theory Center (CTC), and show that they fit the proposed model well. We use these workloads and a trace-driven simulator to evaluate the proposed techniques for predicting queue times. We conclude that our techniques can predict queue times on real machines with accuracy sufficient for the intended applications.
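To make the evaluation setup concrete, here is a minimal sketch of how a trace-driven simulation of strict FCFS space sharing could compute queue times, together with a correlation measure for comparing predicted and actual values. It is an illustration only, not the simulator used in the paper: the function names `simulate_fcfs` and `pearson`, the trace format (arrival time, processors requested, run time), and the choice of the Pearson coefficient are all assumptions made for this example.

```python
import heapq
from math import sqrt


def simulate_fcfs(jobs, total_procs):
    """Return the queue time of each job under strict FCFS space sharing.

    jobs: list of (arrival_time, procs, runtime) tuples sorted by arrival
          time (a hypothetical trace format chosen for this sketch).
    total_procs: number of processors on the machine.
    """
    running = []          # min-heap of (finish_time, procs) for running jobs
    free = total_procs    # processors currently unallocated
    prev_start = 0.0      # start time of the previous job in the queue
    queue_times = []

    for arrival, procs, runtime in jobs:
        if procs > total_procs:
            raise ValueError("job requests more processors than the machine has")

        # Strict FCFS: a job cannot start before it arrives, nor before
        # the job ahead of it in the queue has started.
        start = max(arrival, prev_start)

        # Reclaim processors from jobs that have finished by that time.
        while running and running[0][0] <= start:
            _, released = heapq.heappop(running)
            free += released

        # If there are still too few processors, wait for running jobs
        # to finish, earliest first.
        while free < procs:
            finish, released = heapq.heappop(running)
            free += released
            start = max(start, finish)

        free -= procs
        heapq.heappush(running, (start + runtime, procs))
        queue_times.append(start - arrival)
        prev_start = start

    return queue_times


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

With a list of predicted queue times in hand, `pearson(predicted, simulate_fcfs(trace, total_procs))` would give a single number summarizing how closely the predictions track the simulated schedule. The sketch assumes no backfilling or priorities, so a large job at the head of the queue delays every job behind it.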

Conclusions


downey@sdsc.edu