Help Desk: User Questions
Frequently asked questions from our users
—Lani Nguyen
Dear SDSC Consulting,
I have a question regarding job priorities. I noticed that in addition to
the normal, express, and high queues, there is also an externally set job
priority that dictates how soon a job starts. It appears that jobs with higher node requirements (>100 nodes) have higher priority than jobs with smaller requirements. On what basis is the priority determined?
Answer:
At SDSC, we encourage running high node count jobs on our machines that cannot be run elsewhere.
The larger the node count, the higher the priority. In certain situations, we can also help users with smaller node counts if they need to meet a particular deadline. In addition, jobs waiting in the queue for more than 4 days receive a higher priority.
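The policy above can be sketched as a simple priority function. This is a hypothetical illustration of the rules described (node count dominates, with an aging boost after 4 days), not SDSC's actual scheduler code:

```python
from datetime import timedelta

# Jobs queued longer than this receive an aging boost.
AGING_THRESHOLD = timedelta(days=4)

def job_priority(node_count: int, time_in_queue: timedelta) -> int:
    """Hypothetical priority: larger jobs rank higher; old jobs get a boost."""
    priority = node_count                # base priority scales with node count
    if node_count > 100:
        priority += 1000                 # large (>100 node) jobs jump ahead
    if time_in_queue > AGING_THRESHOLD:  # waiting > 4 days raises priority
        priority += 500
    return priority
```

Under this sketch, a 168-node job submitted yesterday outranks a 16-node job submitted at the same time, and a 16-node job that has waited 5 days outranks one that has waited only a day.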
Dear SDSC Consulting,
I have roughly 30 jobs sitting at the top of the
queue, each requesting four 8-way nodes for 18 hours, and none have
started running for some time now. What is the most effective way to use
this cluster with respect to submitting jobs and reducing the lag time
between submission and running? Any help would be greatly appreciated.
Answer:
When a job is scheduled to run, the requested number of nodes must be in an "idle"
(free of running jobs) state before the job can start. For example, to run a 168
node (1344 CPU) job the system must wait until 168 nodes are idle, and this means
that many jobs will have to finish before a large job can start. Since SDSC
currently allows batch jobs that can run up to 18 hours, there can be a very substantial amount of CPU time unused while the system is waiting for jobs to complete, unless there is a supply of short time limit jobs which can "fit in" and finish before a large job is scheduled to start. The scheduler attempts to do this automatically, but if most users specify the maximum 18 hour job limit there will be no suitable candidates.
Consider the worst case situation: a 256 node job is scheduled to run, but an 18 hour 16 node job is running with all the other processors idle! With 265 nodes total,
the 256 node job cannot start until the 16 node job ends. In this situation,
(265-16)*8*18 = 35,856 CPU hours would be completely lost, which is greater than some
entire annual allocations!
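The arithmetic behind that figure works out as follows:

```python
# Worst case from the example above: 265 nodes total, one 16-node job
# running for 18 hours, 8 CPUs per node, everything else idle while
# the 256-node job waits.
total_nodes = 265
busy_nodes = 16
cpus_per_node = 8
hours = 18

idle_nodes = total_nodes - busy_nodes               # 249 nodes sit idle
wasted_cpu_hours = idle_nodes * cpus_per_node * hours
print(wasted_cpu_hours)  # 35856
```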
If you can submit a job that fits within the time and processor counts reported by
the "show_bf" command, then your job will run right away, and everyone will benefit from
more effective utilization of the machine.
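In other words, backfilling amounts to a simple fit test. The function and parameter names below are hypothetical; "show_bf" reports the actual idle-node counts and time windows on the system:

```python
def fits_backfill(req_nodes: int, req_hours: float,
                  idle_nodes: int, window_hours: float) -> bool:
    """Return True if a job can start immediately in the backfill window,
    i.e. it needs no more nodes than are idle and will finish before the
    next large job is scheduled to start."""
    return req_nodes <= idle_nodes and req_hours <= window_hours

# Suppose show_bf reports 40 idle nodes free for the next 6 hours:
print(fits_backfill(32, 4, 40, 6))   # True:  runs right away
print(fits_backfill(32, 12, 40, 6))  # False: would delay the large job
```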
Dear SDSC Consulting,
I understand the need for limiting the number of jobs that can run at
once on the queue but I do not have any jobs actually running right
now. All of my jobs are sitting directly below the running jobs and
look to be the next jobs to be submitted to run but none have run in a
couple of days. Is there some limit to how many jobs I can have
queued? Also, a couple of my jobs terminated with an error a few
days ago; could this have prompted the queue to stop running my
jobs?
Answer:
A user can have up to six jobs eligible in the queue at once. If you submit more than six jobs, the excess jobs are held in the MAXJOBQUEUEDPERUSERPOLICY state. Once your first six jobs complete, the next six jobs become eligible and will be scheduled to run. Jobs that terminated with an error do not prevent your remaining jobs from running.
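As a hypothetical sketch of how this per-user limit behaves (the real policy is enforced by the scheduler, not user code):

```python
MAX_QUEUED_PER_USER = 6

def split_by_queue_limit(user_jobs: list) -> tuple:
    """Split a user's submitted jobs into the jobs eligible for
    scheduling now and the jobs deferred until slots free up."""
    return user_jobs[:MAX_QUEUED_PER_USER], user_jobs[MAX_QUEUED_PER_USER:]

# With 10 submitted jobs, 6 are eligible and 4 wait in the deferred list.
eligible, deferred = split_by_queue_limit([f"job{i}" for i in range(10)])
print(len(eligible), len(deferred))  # 6 4
```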
Lani Nguyen is reachable via e-mail at halianka@sdsc.edu