Running Jobs on DataStar
Overview
Users can run compute-intensive jobs either interactively or as batch jobs as follows (see the diagram below for a summary):
- dslogin.sdsc.edu (dslogin): Submit your batch jobs using the llsubmit command (P655 or P690 nodes)
- dsdirect.sdsc.edu (dsdirect): Run serial jobs directly from command line
- dspoe.sdsc.edu (dspoe): Run parallel jobs from the command line or via interactive queue using the Parallel Operating Environment (POE) and Loadleveler scripts
While both p690 and p655 batch jobs can be submitted via Loadleveler using the llsubmit command from dslogin, interactive jobs can be run in a number of ways:
- P655 express and interactive jobs
- Log in to dspoe and run Parallel Operating Environment (poe) commands from the command line This capability is meant for debugging and has a 30 minute run time limit.
- Log in to dspoe and use llsubmit to run immediately in the interactive queue (shared access) or express queue (sole access). These capabilities are meant for debugging and timing.
- P690 interactive jobs
- Log in to dsdirect; serial jobs can be run from the command line.
- Log in to dsdirect; parallel jobs can be run with the poe32 command.
- In addition, dsdirect can be used for longer pre and post processing runs. See the documentation below for information on how to run in all of the above ways.
Please note: Compute-intensive jobs should not be run directly on dslogin.sdsc.edu, as it interferes with other users' work as well as a number of important system functions. dslogin is best used only for non-intensive tasks such as editing, compiling and submitting programs, and transferring data.
Running Interactive Jobs
Serial jobs from the command line on dsdirect.sdsc.edu
Users can log on to dsdirect.sdsc.edu and run serial jobs interactively from the command line. This resource is for uses such as debugging, visualization, and post-processing data. Charging will be based on CPU and memory use. For example, the charge for a job that uses all of a node's memory will be equivalent to a job that uses all 32 CPUs of the node. (See more info on How Accounts Are Charged.)
Parallel jobs using loadleveler and POE on dspoe.sdsc.edu & dsdirect.sdsc.edu
There are three p655 nodes set up for interactive use, which can be accessed only from dspoe.sdsc.edu. To run an interactive MPI job on the p655 nodes use the following code (on one line):
% poe executable your_job_args -nodes n -tasks_per_node m -rmpool 1 -euilib us -euidevice sn_all
To run an interactive MPI job on a p690 node log on to dsdirect.sdsc.edu and use the poe32 command as follows:
% poe32 executable your_job_args -nodes 1 -tasks_per_node m
Please note that you can use environmental variables to simplify the above commands. This is illustrated for the dspoe.sdsc.edu (p655) case as follows:
% setenv MP_NODES 2
% setenv MP_TASKS_PER_NODE 4
% setenv MP_RMPOOL 1
% setenv MP_EUILIB us
% poe a.out your_job_args
Please note: Compute-intensive jobs should not be run directly on dslogin.sdsc.edu, as it interferes with other users' work as well as a number of important system functions. dslogin is best used only for non-intensive tasks such as editing, compiling and submitting programs, and transferring data.
Running Batch Jobs Using LoadLeveler
- The batch queues should be used for production runs.
- The queues can be accessed only through Loadleveler.
- All job submissions must be made from dslogin.sdsc.edu.
- The available p655 and p690 queues are detailed below.
- LoadLeveler offers both a command line interface (llsubmit) and an X-Windows based GUI called xloadl
LoadLeveler Commands
| LoadLeveler Example | Function |
| llsubmit your_job_script | Submit jobscript to queue |
| llcancel your_jobID | Cancel your job |
| showq | Shows jobs in queue and the order in which they will run |
| llq -s jobID | Obtain detailed status of a job |
| llq | Shows the status of all the jobs in the Loadleveler queue. Please note that this will not show the order in which the jobs will run. |
More information about LoadLeveler can be found in the IBM
LoadLeveler Manual (PDF, 506 pp., 4 MB) .
p655 (8-way) nodes
A Simple LoadLeveler Script
The example Loadleveler script for p655 nodes below will run with 8 tasks per node on n nodes. You can modify it for your needs as required. The best way to test the script is to use the interactive queue on dspoe.sdsc.edu. In this case the class should be set to "interactive".
A Simple LoadLeveler Script
For a detailed explanation of the script below, view the Annotated Loadleveler Script.
#!/usr/bin/ksh
#@environment = COPY_ALL;\
#AIXTHREAD_SCOPE=S;\
#MP_ADAPTER_USE=dedicated;\
#MP_CPU_USE=unique;\
#MP_CSS_INTERRUPT=no;\
#MP_EAGER_LIMIT=64K;\
#MP_EUIDEVELOP=min;\
#MP_LABELIO=yes;\
#MP_POLLING_INTERVAL=100000;\
#MP_PULSE=0;\
#MP_SHARED_MEMORY=yes;\
#MP_SINGLE_THREAD=yes;\
#RT_GRQ=ON;\
#SPINLOOPTIME=0;\
#YIELDLOOPTIME=0
#@account_no = your_account
#@class = normal
#@node = n
#@tasks_per_node = 8
#@wall_clock_limit = 04:00:00
#@node_usage = not_shared
#@network.MPI = sn_all, shared, US
#@job_type = parallel
#@job_name= job.$(jobid)
#@output = LL_out.$(jobid)
#@error = LL_err.$(jobid)
#@notification = always
#@notify_user = your_email
#@initialdir = /gpfs/your_username
#@queue
poe your_executable executable_args
Save this script with the desired name. Run using llsubmit:
%llsubmit your_script
p690 (32-way) nodes
The key issue to be considered in the scripts for p690 nodes is that the nodes are shared among users. Therefore, node_usage keywords should be set to "shared" as in the example below. Since these nodes are shared, the users need to specify the amount of resources, i.e., the number of CPUs and memory that their program needs. Loadleveler will not start the job until the specified resources are available on the node(s). The resources can be specified as follows:
#@node = 1
#@tasks_per_node = 2
#@resources = ConsumableCpus(8) ConsumableMemory(2gb)
The #@resources specification defines maximal resources required for each MPI task. Hence, in the above case resource requests are as follows:
- Number of CPUs requested per node = (tasks_per_node)*(ConsumableCpus) = 2*8 =16
- Amount of memory requested per node = (tasks_per_node)*(ConsumableMemory) = 2*2 = 4gb
Sample resource request lines for MPI, OPENMP, and Mixed (OPENMP+MPI) programs are tabulated below.
| Pure MPI programs | Set the ConsumableCpus to 1 (it is the default), and define the number of MPI tasks through #@node and #@tasks_per_node ConsumableMemory should be set to the maximum amount of memory your program will use per MPI task. | #@node = 2 #@tasks_per_node = 8 #@resources = ConsumableCpus(1) ConsumableMemory(10gb) The specifications above will give you 8 tasks per node and a total of 80GB of memory. The total number of MPI tasks will be (node)*(tasks_per_node) = 2*8 = 16 |
| Simple Multithreaded
programs (OPENMP or PTHREADS) |
The ConsumableCpus should be set to the maximum number of threads (processors) your program will use. The number of nodes (#@node) and tasks_per_node (#@tasks_per_node) should be set to 1. ConsumableMemory should be set to the maximum total amount of memory your program will use. | #@node = 1 #@tasks_per_node = 1 #@resources = ConsumableCpus(8) ConsumableMemory(25gb)
The above specification will allow you to run a program with 8 threads and a total of 25GB of memory. |
| Hybrid MPI/Multithreaded programs | Set the ConsumableCpus to the maximum number of threads each MPI task will
spawn. Set ConsumableMemory to the maximum total amount
of memory each MPI task will use. As before, set the
number of MPI tasks in #@node and #@tasks_per_node.
|
#@node = 1 #@tasks_per_node = 4 #@resources = ConsumableCpus(8) ConsumableMemory(25gb)
The above specification will allow you to run a program with 4 MPI tasks with 8 threads each (total of 32 processors) and a total of 100GB of memory . |
The example below is for running an OpenMP program on 16 threads (i.e., 16 processors out of 32) on a p690 node, using 2 GB of memory total.
When using the p690s you should find out your code's memory high-water mark using hpmcount and specify the memory requirements in the submit script. In your scripts, please specify the memory amount somewhat higher than your best estimate to allow for system processes. If your job requires more than 128 GB of memory, you should use the ConsumableMemory setting to request the high memory (256 GB) P690 node (see table above for examples).
#!/usr/bin/ksh
#@environment = COPY_ALL;\
#AIXTHREAD_SCOPE=S;\
#MP_ADAPTER_USE=dedicated;\
#MP_CPU_USE=unique;\
#MP_CSS_INTERRUPT=no;\
#MP_EAGER_LIMIT=64K;\
#MP_EUIDEVELOP=min;\
#MP_LABELIO=yes;\
#MP_POLLING_INTERVAL=100000;\
#MP_PULSE=0;\
#MP_SHARED_MEMORY=yes;\
#MP_SINGLE_THREAD=no;\
#RT_GRQ=ON ;\
#XLSMPOPTS="stack=67108864"
#@account_no = your_account
#@class = normal32
#@node = 1
#@tasks_per_node = 1
#@resources = ConsumableCpus(16) ConsumableMemory(2gb)
#@wall_clock_limit = 00:10:00
#@node_usage = shared
#@network.MPI = sn_all, shared, US
#@job_type = parallel
#@job_name= job.$(jobid)
#@output = LL_out.$(jobid)
#@error = LL_err.$(jobid)
#@notification = always
#@notify_user = your_email
#@initialdir = /gpfs/your_username
#@queue
export MEMORY_AFFINITY=MCM
export MP_TASK_AFFINITY=MCM
## And add the following export lines for OpenMP/Multithreaded jobs.
export MALLOCMULTIHEAP=heaps:16
export OMP_NUM_THREADS=16
poe executable_OMP
Be sure not to make a common mistake of setting both #@tasks_per_node and ConsumableCpus to values greater than 1 if all you need is to run a simple MPI program. The total number of processors you are requesting is the number of MPI tasks times the number of consumable cpus. Also if the product is greater than the total number of processors available on the node (32) the job will sit in the queue indefinitely and the start time will be shown as “none” in the showq output. The same applies to memory requirements.
See the Loadleveler Topics page for a more full-featured Loadleveler scripts for codes that use mixed MPI/OpenMP, Pthreads, as well as for a complete description of Loadleveler tags.
Job Scheduling and Priorities
On DataStar, the batch queue system uses the Catalina scheduler to manage job submissions. The job scripts are submitted to batch queues via LoadLeveler, assigned priority by Catalina, and then released to execute as system resources become available. The LoadLeveler scheduler attempts to make the best use of resources by filling the available processors.
Currently, the Catalina scheduler gives preference to jobs with higher node requirements and to users with larger allocations. Large jobs are given greater priority because they have greater difficulty accumulating all the nodes they need, resulting in long queue wait times for large jobs. In addition, larger size jobs cannot be run elsewhere on smaller clusters. Users are encouraged to contact SDSC Consulting at consult@sdsc.edu for help with scaling up their codes and take advantage of the opportunity to run large jobs on the SDSC clusters. Under special circumstances such as impending paper deadlines, class project deadlines, etc., the job priorities can be modified if required.
Batch Queues
The following is the algorithm for charging Service Units (SUs) from a user's allocation; where P is priority depending on the job class (see the table below).
p655 (8-way) nodes are charged as follows:
SUs = P x Wallclock_Hours x Num_Nodes x 8
p690 (32-way) nodes are charged as follows:
SUs = P x Wallclock_Hours x Num_Processors
Queues for DataStar Allocations:
The queues below are available for DataStar-only allocations. You may use reslist from the command line to list the CPUs available for your account. (Possible CPU names may be SSTR, SNP690, SNP655, or SSTRNP.) See the Accounting section of the DataStar Getting Started page ("How Accounts Are Charged" and "How to Check Your Remaining Allocation") for more info on reslist.
DataStar has 272 (8-way) P655+ and 7 (32-way) P690 compute nodes. The 1.5 GHz 8-way nodes (176 in number) have 16 GB, the 1.7 GHz 8-way nodes (96 in number) have 32 GB, while the 32-way nodes have 128 GB of memory. The 1.5 GHz node queues are named normal, high, and express.
| Queue Name |
Node Type and Memory Limit | Max Wallclock_Hours |
Max Num_Nodes | P |
| express * | p655 nodes (8-CPU), 16 GB | 2 | 4 | 1.8 |
| high | p655 nodes (8-CPU), 1.5 GHz, 16 GB 1.7 GHz, 32 GB |
18 | 265 | 2 |
| normal | p655 nodes (8-CPU), 1.5 GHz, 16 GB 1.7 GHz, 32 GB |
18 | 265 | 1 |
| high32 | p690 nodes (32-CPU), 128 GB | 18 | 5 | 2 |
| normal32 | p690 nodes (32-CPU), 128 GB | 18 | 5 | 1 |
* Please note that the jobs requesting the express queue must be submitted from dspoe.sdsc.edu.
Backfill Opportunities
When a batch job starts execution on DataStar, it will normally run to completion, or the expiration of its wall-clock time limit without any interruptions from the operating system (unless an error condition is encountered, such as a programming error or system problem that causes the job to end prematurely). Although it is technically possible to suspend and resume MPP batch jobs on some operating systems, and even to roll out their memory image out to disk in order to free processors for higher priority work, SDSC's policy is not to interrupt running jobs or to suspend jobs because of the large amount of disk that is often required to do so.
Consequently, when a large node count job is scheduled to run, the requested number of nodes must be in an "idle" (free of running jobs) state before the job can start. For example, to run a 168 node (1344 CPU) job the system must wait until 168 nodes are idle, and this means that many jobs will have to finish before the large job can start. Since SDSC currently allows batch jobs that run up to 18 hours, there can be a very substantial amount of CPU time unused while the system is waiting for jobs to complete, unless there is a supply of short time limit jobs which can "fit in" and finish before a large job is scheduled to start. The LoadLeveler scheduler attempts to do this automatically, but if most users specify the maximum 18 hour job limit there will be no suitable candidates.
In the figure below, a 4 node system is running two 2 node jobs (A& B) when a 4 node job (C) is scheduled. Unless there is a 1 or 2 node job which can run completely in the time between the end of Job A and the end of Job B, there will be idle time and no other work will be scheduled until Job C completes.
| ... Job A |<-----idle----->| Job C | ... Job A |<-----idle----->| Job C | ... Job B | Job C | ... Job B | Job C |
Consider the worst case situation: a 168 node job is scheduled to run,
but a 18 hour 16 node job is running with all the other processors idle!
With 183 nodes total, the 168 node job cannot start until the 16 node
job ends. In this situation, (183-16)*8*18 = 24,048 CPU hours would be
completely lost. This is greater than some entire annual allocations!
This "backfill" problem presents an opportunity for users who have work which can be performed in shorter runs and perhaps with variable processor count. If a user job is submitted which fits in the available idle processors, and can complete before the next scheduled job, it can significantly improve the total throughput of the system.
Other useful commands:
show_q |
Displays the current state of the running and queued jobs. If the number of nodes in use is significantly less than the total number of nodes (171), then an opportunity for backfill jobs is present. The column "REMAINING" in the ACTIVE JOBS section shows the wall clock time remaining until a running job ends. The column "RES_START" in the IDLE JOBS section shows the estimated time until a queued job will begin execution. |
show_bf |
Indicates how many nodes are free and for how long until a scheduled job will start. |
If you can submit a job which fits within the time and processor counts
given by either the "show_bf" or "backfill" commands,
then your job will run right away, and everyone will benefit from more
effective utilization of the machine.
Debugging Programs
Compile your program with the -g flag to produce an object file with symbol table references that can be translated by the debugger. It is also advisable to turn off all optimization (-O type) flags during compilation for debugging. Programs may be debugged using dbx, pdbx, or TotalView.
TotalView Parallel Debugger
To run TotalView debugger interactively on p655 nodes:
- Log in to dspoe.sdsc.edu
- Use the tvpoe wrapper script:
dspoe% tvpoe a.out -nodes 1 -tasks_per_node 4 -rmpool 1
To run TotalView debugger on the interactive p690 node:
- Use the tvpoe32 wrapper script:
dspoe% tvpoe32 a.out -nodes 1 -tasks_per_node 4 -rmpool 1
To begin a TotalView session:
- Click "Go" in the TotalView poe window.
- Wait for the job to proceed through the LoadLeveler scheduler.
- Click "Yes" on the window that will pop up within a few minutes.
- Now you should be able to see your source code, set breakpoints, and do all other debug operations.
Be sure to run your interactive and debugging jobs only on the nodes of the type allowed by your allocation; otherwise your jobs will be rejected. Core dumping is limited to 32 MB per process by default. Users may request a higher limit from SDSC Consulting.
For more information on Totalview, read the Etnus TotalView Guide, Etnus Web site, or LLNL's TotalView Tutorial.





