Featured Story CornerBatch Queue Prediction Tool—Daniel Nurmi, UCSBBatch scheduling for large-scale parallel machines such as DataStar and TeraGrid allows parallel programs to obtain exclusive access to the processors that they use when they execute. By dedicating a set of nodes or processors to each job for its execution duration, the scheduler eliminates the execution overhead that would otherwise be caused by time sharing. To ensure that the largest number of jobs can run simultaneously, each job carries with it a node count and a maximum run time so that the scheduler can pack the machine as efficiently as possible. Thus jobs that fit together on the machine at the same time share the processor "space" and hence the term "space sharing" is often used to refer to this scheduling discipline. Scheduling Tools at SDSC
If there are not enough free nodes available to execute a job when it is submitted to the scheduler, the job is queued until enough nodes are freed by terminating programs to fulfill the job's node count request. However, to keep machine utilization as high as possible, modern batch schedulers such as Catalina, Maui, and PBS use sophisticated packing algorithms that try to balance the efficiency benefit of scheduling shorter, smaller jobs (since finding a fit is often easier) against the possibility that many small jobs will starve large jobs of the nodes that they require. However, predicting the amount of time a job of any size, either large or small, will wait in queue is remarkably difficult. While a number of research efforts have attempted to develop statistical models that accurately predict individual job waiting times, the prevalent strategy in use today simply simulates the scheduling algorithm in faster-than-real time based on the node counts and maximum requested execution times of all jobs in queue. Since new jobs submitted may change the queuing order, and executing jobs often use less than their maximum requested time, these simulation methods often fail to produce results that are accurate enough to satisfy most user communities. At the University of California, Santa Barbara (UCSB), we have been studying this problem for SDSC with the goal of developing a new prediction capability for SDSC and ETF (Extensible Terascale Facility or TeraGrid) machines. While we cannot predict the exact delay for each job, using a new set of statistical methods, we have developed a software infrastructure and a set of tools that correctly predict the bounds on the delay each job will experience when it is submitted. With this infrastructure — either at the command line or via the web — users can determine the time bound on the queuing delay that their jobs will experience at a specified level of confidence. How Hard Could this Be?One very tangible way to see why predicting queuing delay is a complex problem is to simply inspect empirical job wait times on a typical HPC machine over a short time period. In the graph (which is typical of production workloads), each data point represents how long its corresponding job waited in the "normal" queue on DataStar from July 11th to July 18th, 2006. As the figure shows,the variance in the wait times is quite high (there are six orders of magnitude between the shortest and longest wait time), and the behavior of collections of jobs does not seem to remain stable over time.
At any point in time, making an exact prediction for the next job to be submitted is clearly a daunting task. Even when jobs are separated into classes based on their requested node counts and execution times, the actual delay experienced by each job remains very difficult to predict. Bound PredictionsOur work, which is based on a new statistical approach, results in predictions of the bounds on individual job wait times. The methodology combines a technique for change-point detection with a clustering method (to automatically identify job classes experiencing similar delays) and a method for estimating bound from each cluster. Change-point detection is necessary to cope with hidden events that cause the system response to change suddenly. For example, job delays may be quite different after a system crash or software upgrade than before, or after a user with a large allocation suddenly initiates a large workload. To make accurate predictions, our method trims away previous history that it identifies as being from a period of time characterized by different statistical properties than the current period. Because the scheduler considers both node count and requested execution time when deciding which jobs to release for execution, we also incorporate a clustering mechanism that identifies job classes based on these parameters. Intuitively, users sense that small, short jobs can be more readily scheduled (and hence experience shorter delays) than large, long ones. Our method automatically identifies the size categories that experience different delays, and makes predictions from each category separately. Finally, the methodology is dynamic in that it constantly reconfigures the historical data it considers, and the job categorization as new job delay data becomes available. To do so, it monitors the queue submissions for each queue — using Network Weather Service queue sensors — and adjusts its internal state as each job is allowed to begin executing. Bounds predictions allow users to specify a level of confidence and to get a prediction for the bound on the queuing delay corresponding to that level. For example, a user might want to know what the bound on delay is with 95% confidence which is to say that there is at worst only a 5% chance that job he or she submits will wait longer than the reported bound. Notice that a 75% bound is likely to be significantly less conservative since the user is willing to accept a 25% chance of the bound being exceeded. ToolsIn order to make bounds predictions available to users in real-time, we have developed a number of tools based on the Network Weather Service monitoring and forecasting infrastructure. Users can access real-time predictions for a number of HPC machines via an online prediction website http://nws.cs.ucsb.edu/batchq where we show the 95%, 75% and 50% (median) job wait time predictions for a number of machine/queue/nodes/walltime combinations over time. A second tool reports essentially the same data that is being displayed on the Web site via a UNIX command-line tool. With this tool, which has been installed on SDSC resources, a user can query up-to-date bound predictions whenever they like. By default, the tool (which can be executed by running 'bqp' from a UNIX shell on your login nodes) outputs a matrix of 95% upper bound predictions for various job types. See below for a typical default output from the command.
In the table presented by bqp, we can determine the maximum number of seconds that out job of certain node/walltime dimensions should wait if we submitted it shortly after running the tool. For instance, if we have a job which requests 2 nodes for 5 minutes, we consult the table and find that the job should wait no longer than 69458 seconds, 95% of the time. To get an even better picture of the job wait time, we can inspect other bounds.
Above we seen that by specifying 75% and 50% using the '-p' flag to bqp, we can see that our same 2 node, 5 minute job will wait no longer than 27575 seconds 75% of the time and no longer than 7715 seconds 50% of the time. With these bound values in hand, we have a fairly complete idea of what to expect in terms of wait time at the time when we submit out job. It is interesting to note that we have verified that these bounds are extremely accurate, and when the tool says that a job will wait no more than N seconds 95% of the time, then in almost every data set we've analyzed (12 sites over a 9 year period) that prediction is correct. In addition to the prediction matrix, the command-line tool also features the ability to show the job dimensions which result in the 'best' current bound predictions (use the '-B' command-line option to bqp). This way, a user with flexible resource requirements can find the node number and request time combination which will help the job make it through the queue the faster than other combinations. Finally, the command-line tool has been instrumented to help users with deadlines determine the probability of their job getting through the queue on time. In the screenshot below, a user has a job requesting 4 nodes for 3600 seconds (one hour) and needs the job to begin execution no later than 6 hours from the time they ran the tool. We can see that when the bqp was executed, the job had a 40% chance of making it through the queue in 6 hours or less.
Bounds Prediction as Statistical QoSIn some sense, while the prediction method we have developed cannot make an exact prediction for individual user jobs, it does provide a type of statistical Quality of Service (QoS) guarantee. Each job submitted to the system will experience a delay no larger than the specified bound with the percentage certainty specified. We believe that this type of prediction is beneficial to the SDSC user community for two reasons. First, the upper bound on queuing delay is usually the critical number to estimate. Most users (although not all) are happy if their respective jobs execute early (before the prediction), rather than late. Secondly, unlike previous methods, the predictions come with a quantitative certainty level that is rigorous and constantly verified. Thus users have a mathematical way of reasoning about the potential error in predictions for their specific jobs. Acknowledgement:We would like to thank a number of sites and projects for supporting this project. Special thanks go to SDSC for driving feature development, supporting infrastructure deployment, and providing essential data and technical support. This project is supported by the VGrADS project (NSF grant CCF-0331654) as well as NSF grants CCF-0526005 and NGS-0305390. This work is being conducted at the University of California, Santa Barbara. ReferencesFor more information, please visit the online prediction website and see the following research papers. For information and a full list of command line options, see "Predeicting Batch Queue Times on DataStar and the IA-64 Cluser Using BQP"
J. Brevik, D. Nurmi, R. Wolski, Predicting Bounds on Queuing Delay for
Batch-scheduled Parallel Machines, Proceedings of ACM Principles and
Practices of Parallel Programming (PPoPP) 2006, March, 2006, New York,
NY http://pompone.cs.ucsb.edu/~nurmi/ppopp06.pdf
|