SDSC Thread, Issue 6, June 2006






User Services Director:
Anke Kamrath

Editor:
Subhashini Sivagnanam

Graphics Designer:
Diana Diehl

Application Designer:
Fariba Fana


Help Desk: User Questions

Frequently asked questions from our users

—Lani Nguyen

Dear SDSC Consulting,
I have a question regarding job priorities. I noticed that in addition to the normal, express, and high queues, there is also a job priority, set externally, that dictates how soon a job starts. It appears that jobs with higher node requirements (>100) have higher priority than jobs with smaller node requirements. On what basis is the priority determined?

Answer:
At SDSC, we encourage users to run high node count jobs that cannot be run elsewhere on our machines; the larger the node count, the higher the priority. In certain situations we can also help users with smaller node counts if they need to meet a particular deadline, and jobs waiting in the queue for more than four days receive a higher priority.

Dear SDSC Consulting,
I have roughly 30 jobs sitting at the top of the queue, each requesting four 8-way nodes for 18 hours, and none have been dispatched to run for some time now. What is the most useful way to use this cluster with respect to submitting jobs and reducing the lag time between submission and running? Any help would be greatly appreciated.

Answer:
When a job is scheduled to run, the requested number of nodes must be in an "idle" (free of running jobs) state before the job can start. For example, to run a 168-node (1344-CPU) job, the system must wait until 168 nodes are idle, which means that many jobs will have to finish before a large job can start. Since SDSC currently allows batch jobs to run up to 18 hours, a very substantial amount of CPU time can go unused while the system waits for jobs to complete, unless there is a supply of short-time-limit jobs that can "fit in" and finish before a large job is scheduled to start. The scheduler attempts to do this automatically, but if most users specify the maximum 18-hour limit there will be no suitable candidates.
Consider the worst case: a 256-node job is scheduled to run, but an 18-hour, 16-node job is running while all the other processors sit idle. With 265 nodes total, the 256-node job cannot start until the 16-node job ends, so (265-16)*8*18 = 35,856 CPU hours would be completely lost. This is greater than some entire annual allocations!
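
To spell out the arithmetic, here is a quick shell check using the figures above (265 total nodes, a 16-node running job, 8 CPUs per node, 18 hours):

echo $(( (265 - 16) * 8 * 18 ))    # 249 idle nodes * 8 CPUs * 18 hours = 35856 CPU hours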
If you can submit a job that fits within the time and processor counts reported by the show_bf command, your job will run right away, and everyone will benefit from more effective utilization of the machine.
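
As an illustration, here is a sketch of a LoadLeveler job command file for the kind of short, small job that can slip into a backfill window. The class name, node count, tasks per node, and wall clock limit are assumptions for an 8-way-node system, and my_program is a placeholder; pick values that fit within whatever show_bf reports, and note that your site may require additional keywords (for example, network settings).

#!/bin/ksh
# Small, short job intended to backfill while larger jobs wait
#@ job_type         = parallel
#@ class            = normal
#@ node             = 2
#@ tasks_per_node   = 8
#@ wall_clock_limit = 02:00:00
#@ output           = backfill_test.$(jobid).out
#@ error            = backfill_test.$(jobid).err
#@ notification     = never
#@ queue

poe ./my_program

Submit the file with llsubmit, and the job should start as soon as a matching backfill window opens.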

Dear SDSC Consulting,
I understand the need for limiting the number of jobs that can run at once, but I do not have any jobs actually running right now. All of my jobs are sitting directly below the running jobs and appear to be next in line to run, yet none have run in a couple of days. Is there some limit to how many jobs I can have queued? Also, a couple of jobs terminated with an error a few days ago; could this have prevented the queue from running any more of my jobs?

Answer:
A user can have at most six jobs eligible to run in the queue at a time. If you submit more than six jobs, the additional jobs are held in the MAXJOBQUEUEDPERUSERPOLICY state. As your first six jobs complete, the next six are released from the waiting list and become eligible to run.
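
If you want to see how the scheduler classifies your jobs while they wait, the standard LoadLeveler client commands below should help; username and job_id are placeholders.

llq -u username      # list your jobs and the state each one is in
llq -s job_id        # report why a particular job step is still waiting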

Lani Nguyen is reachable via e-mail at halianka@sdsc.edu

Did you know ..?

Always use the MP_INFOLEVEL environment variable or the -infolevel option when you invoke POE to help troubleshoot abnormal job termination problems, for example:
cp: cannot stat `/dsgpfs/username/dir1/program': A file or directory in the path name does not exist.
ERROR: 0031-250 task 160: Terminated
Setting either of these to 6 gives you the maximum number of diagnostic messages when you run your program. - Eva Hocks.
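
For reference, a minimal sketch of both approaches; the program name and the -procs value are placeholders for your own POE invocation.

# Request the maximum POE diagnostic output via the environment variable
export MP_INFOLEVEL=6
poe ./my_program -procs 32

# Or pass the level as a command line option instead
poe ./my_program -procs 32 -infolevel 6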