Predicting queue times on space-sharing parallel computers
by
Allen B. Downey
This paper appeared at the 11th International
Parallel Processing Symposium, Geneva, Switzerland, April 1997
(IPPS '97).
Here is the version that appeared
(gzipped postscript).
Click here to see the html version.
I recently presented this paper in a talk for the Parallel Computing
Lab at UCSD. Here are the slides I used
(gzipped postscript).
Abstract
We present statistical techniques for predicting the queue times
experienced by jobs submitted to a space-sharing parallel machine with
first-come-first-served (FCFS) scheduling. We apply these techniques to
trace data from the Intel Paragon at the San Diego Supercomputer
Center and the IBM SP2 at the Cornell Theory Center. We show that it
is possible to predict queue times with accuracy that is acceptable
for several intended applications. The coefficient of correlation
between our predicted queue times and the actual values from the simulated
schedules is between 0.65 and 0.72.
Introduction
On space-sharing parallel machines, it is useful to be able to predict
how long a submitted job will be queued before processors are
allocated to it. Some of the applications of these predictions are:
- Load metrics: They provide a measure of load that is more
concrete than abstractions such as load average, allowing users to
make decisions about what jobs to run, where to run them and what size
problems they can solve in an allotted time.
- Internal resource selection: They allow malleable parallel jobs
(jobs that are not limited to a specific cluster size, but can run on
any number of processors) to choose a cluster size that will minimize
their turnaround time (the sum of queue time and run time).
- External resource selection: They allow distributed jobs to
choose among various computing resources on a network, based
on the quality of service they expect to receive at each host.
This paper presents a model of the workload on a parallel machine
and shows how to use this model to predict queue times. We present
observations of the workload on the Intel Paragon at the San Diego
Supercomputer Center (SDSC) and the IBM SP2 at the Cornell Theory
Center (CTC), and show that they fit the proposed model well. We use
these workloads and a trace-driven simulator to evaluate the proposed
techniques for predicting queue times. We conclude that our techniques
can predict queue times on real machines with accuracy sufficient for
the proposed application.
Conclusions
- We have proposed a workload model for batch jobs on
space-sharing parallel computers, based on a uniform-log
distribution of lifetimes. The proposed model captures the behavior
of real workloads in two different environments. This model can be
used to generate synthetic workloads for simulating and evaluating
scheduling policies for similar environments.
- We have used this workload model to develop statistical
techniques for predicting queue times for incoming jobs. The proposed
techniques worked well in simulations of both of the environments we
tested. These techniques should be useful for several applications,
especially selection of cluster sizes on space-sharing parallel
computers.
- Many scheduling policies for supercomputing environments depend
on information provided by users about the resource requirements of their
jobs. We observe that this information is often unreliable, but show that
the proposed techniques are able to distill this information in a useful
way.
downey@sdsc.edu