A parallel workload model and its implications for processor allocation

by Allen B. Downey

This paper will appear at the 6th IEEE International Symposium on High Performance Distributed Computing (HPDC 97). The postscript of the final version is here.

I recently presented this paper at AT&T Research. Here are the slides I used.

The submitted version of this paper was been published as U.C. Berkeley Technical Report CSD-96-922.

The abstract, introduction, and conclusions are below.


Abstract

We develop a workload model based on the observed behavior of parallel computers at the San Diego Supercomputer Center and the Cornell Theory Center. This model gives us insight into the performance of strategies for scheduling malleable jobs on space-sharing parallel computers. We find that Adaptive Static Partitioning (ASP), which has been reported to work well for other workloads, is inferior to some FIFO strategies that adapt better to system load. The best of the strategies we consider is one that explicitly restricts cluster sizes when load is high (a variation of Sevcik's A+ strategy).

Introduction

Space-sharing, distributed-memory multiprocessors, like the Intel Paragon, the Cray T3E and the IBM SP2, are often used in supercomputing environments to support scientific applications. These environments typically have the following characteristics:

In current systems, users choose cluster sizes for their jobs by hand, and the system does not have the option of allocating more or fewer than the requested number of processors. The factors that should influence the choice of a cluster size include the characteristics of the application (resource requirements), the load on the system (resource availability) and the performance requirements of the user (deadlines, desired throughput, etc.). But users generally do not have the information, tools, or inclination to weigh all of these factors accurately. Allowing the system to make this decision has the potential to improve system utilization and reduce users' wait times.

Toward this end, prior studies have proposed allocation strategies that choose automatically how many processors to allocate to each job. Most of these studies evaluate the proposed strategies with analysis and simulation based on hypothetical workloads. Each of the strategies we consider in this paper has been evaluated, under different workload assumptions, in at least three prior studies. One of these compares the performance of these strategies over a wide range of workload parameters, and argue that the discrepancies among various studies are due to differences in the hypothesized workloads.

The goal of this paper is to focus this debate by constructing a new workload model based on observations of space-sharing systems running scientific workloads. We intend this model to cover the range of workload parameters most relevant to current workloads and hence most applicable to real systems. By narrowing the range of workload parameters, we are able to examine the proposed strategies in more detail and gain insight into the reasons for their success or failure.

Conclusions


downey@sdsc.edu