Running & Debugging Jobs on the IA-64 Linux Cluster
Running: Batch and Interactive Jobs
Running Jobs in Batch - PBS
The SDSC TeraGrid IA-64 cluster uses the Portable Batch System (PBS) to define and manage batch queues. PBS allows the user to submit one or more jobs for execution, with job parameters specified in a job script.
Batch Queues
Currently, only one queue, "dque", is normally available for all batch jobs. The queue has an 18-hour wallclock limit. Users may submit as many batch jobs as they wish, but currently at most 6 batch jobs can be enqueued (scheduled) at a time.
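As a rough sketch, the 18-hour limit can be checked before a job is submitted. The helper names walltime_to_seconds and walltime_ok are hypothetical (they are not PBS commands), and the sketch assumes the walltime is written in HH:MM:SS form:

```shell
#!/bin/bash
# Convert a PBS walltime string (HH:MM:SS) to seconds.
walltime_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that fields such as "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed if the walltime fits within the queue limit (default: 18 hours).
walltime_ok() {
    local limit_hours=${2:-18}
    [ "$(walltime_to_seconds "$1")" -le $(( limit_hours * 3600 )) ]
}
```

For example, "walltime_ok 00:20:00" succeeds, while "walltime_ok 19:00:00" fails.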
To reduce email load on the mailservers, please specify an email address in your PBS script, for example:
#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -M your_username@sdsc.edu
[Currently, mail from PBS batch jobs is disabled. We hope to have this function restored soon.]
qsub
To see a full list of mail options, type man qsub at the command prompt. Details on the -m and -M options follow:
-m mail_options
Defines the set of conditions under which the execution server will send a mail message
about the job. The mail_options argument is a string which consists of either the single
character "n", or one or more of the characters "a", "b", and "e".
If the character "n" is specified, no mail will be sent.
For the letters "a", "b", and "e":
a mail is sent when the job is aborted by the batch system.
b mail is sent when the job begins execution.
e mail is sent when the job terminates.
If the -m option is not specified, mail will be sent if the job is aborted.
-M user_list
Declares the list of users to whom mail is sent by the execution server when it sends
mail about the job.
The user_list argument is of the form:
user[@host][,user[@host],...]
If unset, the list defaults to the submitting user at the qsub host, i.e. the job owner.
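The rule for the -m argument described above can be checked with a few lines of shell. This is only a sketch; valid_mail_options is a hypothetical helper name, not something provided by PBS:

```shell
#!/bin/bash
# Succeed if the argument is a valid qsub -m string:
# either exactly "n", or one or more of the letters a, b and e.
valid_mail_options() {
    case "$1" in
        n) return 0 ;;
        "" | *[!abe]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

With this helper, "abe" and "n" are accepted, while "na" and the empty string are rejected.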
The qsub "-W" option allows qsub to be used from within an executing script to submit a command/executable to the PBS batch queue.
Viewing Output of Running Batch Jobs
Users can monitor output from running jobs. To do this, create a .pbs_spool directory (note the leading dot) in your home directory, with permissions set to 755. When the job runs, you should see two files in this directory, corresponding to UNIX stdout and stderr, which can be viewed using "tail -f filename".
NOTE: use the numerical portion of your job id from PBS when using PBS commands. The alternative is to use qstat -f to obtain the full job id. Please note that qstat -a and qstat print out a limited number of characters in the job id field. This can result in a job id string that is invalid.
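A minimal sketch of extracting the numerical portion from a full job id, assuming the full id has the form "number.hostname". The helper name job_number and the host name used in the usage line are made up for illustration:

```shell
#!/bin/bash
# Print the numeric portion of a PBS job id, i.e. everything before the first dot.
job_number() {
    echo "${1%%.*}"
}
```

For example, job_number 12345.tg-master.example.org prints 12345, which can then be passed to commands such as qdel.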
Stdout/stderr files are temporarily stored in your $HOME/.pbs_spool directory while your job is running. Therefore, your home directory and the .pbs_spool subdirectory must have execute permissions for other users:
chmod o+x $HOME $HOME/.pbs_spool
Please do not change permissions in your .pbs_spool directory after your jobs are in the queue.
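The directory setup described above can be sketched as a single shell function. The name setup_pbs_spool is hypothetical, and the optional argument (a home directory other than $HOME) exists only so the function can be tried out safely in a scratch directory:

```shell
#!/bin/bash
# Create the spool directory under the given home directory (default: $HOME)
# and apply the permissions described above.
setup_pbs_spool() {
    local home_dir=${1:-$HOME}
    mkdir -p "$home_dir/.pbs_spool" || return 1
    chmod 755 "$home_dir/.pbs_spool" || return 1
    # Other users also need execute (search) permission on the home directory.
    chmod o+x "$home_dir"
}
```

Run it once, before submitting any jobs, so that permissions are not changed while jobs are in the queue.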
Some PBS commands and their functions are as follows:
| Function | PBS example |
| submit a batch job to a queue | qsub [list of qsub options] script_name (see "man qsub" for more options) |
| create your own interactive nodes | "qsub -I -V -l walltime=00:30:00 -l nodes=4:ppn=2" will put you on one of the compute nodes. Upon your exit of the node, or the wall time limit of 30 minutes in this example, the interactive nodes will expire. |
| display the status of PBS batch jobs | qstat -a (see "man qstat" for more options) |
| delete (cancel) a queued job | qdel PBS_JOBID |
| show all running jobs on system | qstat -r |
| show detailed information of the specified job | qstat -f PBS_JOBID |
| show all queues on system | qstat -q |
| show queue limits for all queues | qstat -Q |
| show quick information of the server | qstat -B |
| show node status | pbsnodes -a |
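As an illustration of working with "qstat -f" output, the sketch below pulls out the full job id and the job state. It assumes the usual layout of that output (a "Job Id:" line followed by indented "name = value" lines); the function name job_id_and_state is hypothetical:

```shell
#!/bin/bash
# Read "qstat -f" output on stdin and print "<full job id> <state>".
job_id_and_state() {
    awk '
        /^Job Id:/         { id = $3 }
        $1 == "job_state"  { state = $3 }
        END                { if (id != "") print id, state }
    '
}
```

Typical use would be "qstat -f PBS_JOBID | job_id_and_state".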
The following is an example of a PBS batch script for running an MPI job. The script is shown first, followed by an explanation of each line:
#!/bin/csh
#PBS -q dque
#PBS -N my_job
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o file.out
#PBS -e file.err
#PBS -V
#PBS -A your-account-number
cd /work/username
mpirun -v -machinefile $PBS_NODEFILE -np 20 ./a.out
| Script line | Explanation |
| #!/bin/csh | run the script with csh |
| #PBS -q dque | use queue called "dque" |
| #PBS -N my_job | current job name is "my_job" |
| #PBS -l nodes=10:ppn=2 | request 10 nodes and 2 processors per node |
| #PBS -l walltime=0:50:00 | reserve the requested nodes for 50 minutes |
| #PBS -o file.out | standard output to a file called "file.out" |
| #PBS -e file.err | standard error to a file called "file.err" |
| #PBS -V | export all my environment variables to the job |
| #PBS -A your-account-number | specifies which account the job is to be charged to. This is OPTIONAL if you have only one account. If you have more than one account and OMIT this line, the job will be charged to your default account. |
| cd /work/username | change to my working directory |
| mpirun -v -machinefile $PBS_NODEFILE -np 20 ./a.out | run my parallel job |
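In the script above, the value given to -np (20) is the product of the nodes and ppn values (10 x 2). A small sketch of that arithmetic follows; np_from_nodes_spec is a hypothetical helper, and it assumes the resource string contains both a nodes= and a ppn= field:

```shell
#!/bin/bash
# Compute the MPI process count from a PBS resource string "nodes=N:ppn=P".
np_from_nodes_spec() {
    local spec=$1 nodes ppn
    nodes=${spec#nodes=}   # drop the leading "nodes="
    nodes=${nodes%%:*}     # keep only the node count
    ppn=${spec##*ppn=}     # keep only the processors-per-node count
    echo $(( nodes * ppn ))
}
```

For example, np_from_nodes_spec nodes=10:ppn=2 prints 20.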
Sometimes, output from a user job may be so large that it exceeds the quota limit in the user's home directory. This means that spooled output may be lost (output will still be written to stdout/stderr). To handle such a situation, users can create a symbolic link from ~/.pbs_spool to a directory under /gpfs. For example:
tg-login1 /users/user-name> mkdir /gpfs/user-name/spool
tg-login1 /users/user-name> ln -s /gpfs/user-name/spool .pbs_spool
tg-login1 /users/user-name> ls -l .pbs_spool/
total 0
tg-login1 /users/user-name> ls -l .pbs_spool
lrwxrwxrwx 1 user-name user-group 17 2005-08-30 14:00 .pbs_spool -> /gpfs/user-name/spool
This spools the .ER and .OU files to /gpfs/user-name/spool. We hope that PBS will eventually implement a feature that allows the user to specify the default spool directory.
Using PBS Job Dependencies to submit multiple jobs
Sometimes users wish to submit multiple jobs but want the jobs to execute in a particular order - "job 2" shouldn't execute before "job 1" finishes. One way to handle this is to use the "qsub -W" option to define job dependencies. In the following example, I wish to submit 3 jobs, with job 2 to be scheduled for execution after job 1, and job 3 to be scheduled after job 2. In this example, the script "run-1-2-3" submits the jobs and looks like:
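#!/bin/bash
# Submit job 1 and capture the job id that qsub prints
FIRST=`qsub /gpfs/projects/frederik/Qsub-Tests/TG/PBS_script-1`
echo $FIRST
# Job 2 becomes eligible only after job 1 finishes without error
SECOND=`qsub -W depend=afterok:$FIRST /gpfs/projects/frederik/Qsub-Tests/TG/PBS_script-2`
echo $SECOND
# Job 3 becomes eligible only after job 2 finishes without error
THIRD=`qsub -W depend=afterok:$SECOND /gpfs/projects/frederik/Qsub-Tests/TG/PBS_script-3`
echo $THIRD
exit 0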
The files: PBS_script-1, PBS_script-2, and PBS_script-3 are "normal" PBS scripts. For example, PBS_script-1 looks like:
#!/bin/csh
#PBS -q dque
#PBS -N Stommel1
#PBS -l nodes=2:ppn=2
#PBS -l walltime=0:05:00
#PBS -o stommel1.out
#PBS -e stommel1.err
#PBS -V
#PBS -M frederik@sdsc.edu
#PBS -m abe
/bin/csh
cd /gpfs/projects/frederik/Qsub-Tests/TG
mpirun -v -machinefile $PBS_NODEFILE -np 4 ./stc_01 < st.in
In the example, all three jobs are submitted to the PBS batch queue. Initially, only job PBS_script-1 will be scheduled; jobs PBS_script-2 and PBS_script-3 will be put on "hold" status. Once job PBS_script-1 finishes, job PBS_script-2 will be scheduled, with job PBS_script-3 still on "hold". Finally, when job PBS_script-2 finishes, job PBS_script-3 will be scheduled.
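The same pattern can be generalized to any number of scripts. The following is a sketch only: submit_chain is a hypothetical function name, and it assumes that qsub prints the new job id on standard output, as the run-1-2-3 script relies on:

```shell
#!/bin/bash
# Submit each script so that it becomes eligible only after the previous
# one has finished without error. Prints one job id per submitted script.
submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(qsub "$script") || return 1
        else
            jobid=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$jobid"
        prev=$jobid
    done
}
```

For example, "submit_chain PBS_script-1 PBS_script-2 PBS_script-3" submits the three scripts in order and stops at the first submission that fails.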
Please direct any questions you have about using PBS to help@teragrid.org.
Interactive
DO NOT run interactive jobs on the login nodes, as this may adversely affect system performance and other users.
At SDSC, there are two ways to run interactive jobs - using the dedicated interactive nodes (Shared Interactive Nodes), or using the batch nodes. Using the Shared Interactive Nodes has the advantage of being "on demand", with the following two disadvantages:
- nodes are shared - at any one time, there may be one or more other users running on the same nodes
- maximum number of nodes that are allocated for this type of use is 4
Interactive use via batch nodes has the following advantages:
- Nodes are dedicated to one user at a time
- Users can run on as many nodes as are in the batch pool
and the following disadvantage:
- users will have to wait until the request can be honored by the scheduler
Using Dedicated Interactive Nodes (Shared Interactive Nodes)
There are 4 nodes (two CPUs in each node) that are dedicated to shared interactive use. These nodes provide quick access to interactivity. However, the nodes are shared - at any given time, there may be one or more processes running from other users, and performance may be low.
- Running serial - log onto one of the nodes and enter the name of your executable, for example, a.out
- Running parallel jobs with MPI:
define a file, which in this case is named "hostfile", to pass to the machinefile argument for mpirun:
mpirun -machinefile hostfile -np 4 ./myscript.pl
and hostfile looks like this:
tg-c130
tg-c130
tg-c128
tg-c128
The node names for the Shared Interactive Nodes:
tg-c127, tg-c128, tg-c129, tg-c130.
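A hostfile like the one above (each node name listed once per CPU) can be generated rather than typed by hand. This is a sketch; make_hostfile is a hypothetical helper name:

```shell
#!/bin/bash
# Print a machinefile that lists each node once per CPU.
# Usage: make_hostfile CPUS_PER_NODE NODE [NODE...]
make_hostfile() {
    local cpus=$1 node i
    shift
    for node in "$@"; do
        for (( i = 0; i < cpus; i++ )); do
            echo "$node"
        done
    done
}
```

For example, "make_hostfile 2 tg-c130 tg-c128 > hostfile" reproduces the four-line hostfile shown above.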
Limitations on Using Shared Interactive Pool
Do not use these nodes for production runs - they are for testing and debugging purposes only. If you need interactive access for longer periods, try using nodes from the batch pool (see "Interactive Use of Batch Nodes").
Interactive Use of Batch Nodes
Interactive use in this manner is through the PBS batch system and Catalina scheduler. The following section describes how to request nodes for interactive use.
- Request nodes for interactive use:
qsub -I -V -l walltime=00:30:00 -l nodes=4:ppn=2
This requests 4 nodes for interactive use (using 2 cpus/node) for a maximum time of 30 minutes.
The system will respond:
qsub: waiting for "jobname" to start
- When the node(s) are ready to be used, the system will respond:
qsub: "jobname" ready
and will give the name(s) of the assigned node(s).
- The user can now run any interactive command. For example, to run an MPI program, parallel-test, on the 4 nodes (8 cpus):
mpirun -np 8 -machinefile $PBS_NODEFILE parallel-test
- When the requested wall clock time expires, the system responds with qsub. Please note the available time begins as soon as the scheduler tells you your slot is available, and ends according to wall clock time, not when you start your job.
- Use the command show_bf to see how many interactive nodes are available or when the interactive nodes will be available.
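Once the interactive session has started, it can be useful to confirm what was actually assigned. The sketch below summarizes a node file; it assumes that $PBS_NODEFILE lists one host name per allocated CPU (so a node with ppn=2 appears twice), and nodefile_summary is a hypothetical helper name:

```shell
#!/bin/bash
# Summarize a PBS node file: number of distinct nodes and total CPU slots.
nodefile_summary() {
    local file=${1:-$PBS_NODEFILE} nodes cpus
    nodes=$(sort -u "$file" | grep -c .)
    cpus=$(grep -c . "$file")
    echo "$nodes nodes, $cpus cpus"
}
```

Under that assumption, the cpus figure is the value to pass to mpirun -np.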
Running different jobs concurrently
There are several ways to concurrently run different executables with the same or different data. One way to accomplish this is to use a perl script in conjunction with MPI. The method described in the following uses a simple perl script and a small MPI program to execute the contents ("commands") of a text file (called "input-file" in the following text) that is input for the perl script. Process 0 executes line 1, Process 1 executes line 2, etc., so the number of MPI processes must match the number of lines of commands in "input-file". Since each line may refer to an entire command script or program, the method provides some flexibility.
The perl script looks like:
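#!/usr/bin/perl
#
# This script executes a command from a list of commands
# based on the current MPI id.
#
# Last modified: Mar/11/2005
#
# call getid to get the MPI id number and the number of processes
($myid,$numprocs) = split(/\s+/,`./getid`);
# process 0 runs line 1, process 1 runs line 2, etc.
$file_id = $myid;
$file_id++;
# open file and execute appropriate command
$file_to_use = $ARGV[0];
open (INPUT_FILE, $file_to_use) or &showhelp;
for ($i = 1; $i <= $file_id; $i++)
{
    $buf = <INPUT_FILE>;
}
close INPUT_FILE;
# stop if the file has fewer lines than there are MPI processes
defined($buf) or die "No command on line $file_id of $file_to_use\n";
chomp($buf);
system("$buf");
# print a usage message and exit (called when the input file cannot be opened)
sub showhelp
{
    print "\nUsage: myscript.pl <filename>\n\n";
    print "<filename> should contain a list of executables, one-per-line, including the path.\n\n";
    exit 1;
}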
The MPI program looks like:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Print this process's MPI id (rank) and the total number of processes. */
int main(int argc, char **argv)
{
    int numprocs, my_id;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    printf("%d %d\n", my_id, numprocs);
    MPI_Finalize();
    return 0;
} /* end main() */
- Create a copy of the perl script in your home directory
- Compile MPI program - mpicc -o getid getid.c
- Create a PBS batch script that looks like the following (in this case, we're running on 1 node for a total of 2 MPI processes, and "input-file" is in the same directory as myscript.pl and the getid executable):
#!/bin/sh
#
#PBS -q dque
#PBS -N your-job-name
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:05:00
#PBS -o outfile
#PBS -e errfile
#
## Export all my environment variables to the job
#PBS -V
#
#
## Run my parallel job
mpirun -machinefile $PBS_NODEFILE -np 2 ./myscript.pl input-file
In the following example, "input-file" contained the lines:
ls
ls -l
and the output looks like:
Tst-pbsdsh file.err file.out getid getid.c getid.o input-file myscript.pl run-test test-dsh
total 692
drwxr-xr-x 2 frederik use300 4096 2005-04-25 08:47 Tst-pbsdsh
-rw-r--r-- 1 frederik use300 0 2004-11-24 13:12 file.err
-rw-r--r-- 1 frederik use300 903 2004-11-24 13:12 file.out
-rwxr-xr-x 1 frederik use300 670092 2005-07-13 15:15 getid
-rw-r--r-- 1 frederik use300 326 2005-07-13 15:14 getid.c
-rw-r--r-- 1 frederik use300 2856 2005-07-13 15:15 getid.o
-rw-r--r-- 1 frederik use300 9 2005-07-13 15:12 input-file
-rwxr--r-- 1 frederik use300 648 2005-07-13 15:16 myscript.pl
-rw-r--r-- 1 frederik use300 308 2005-07-13 15:29 run-test
-rwxr--r-- 1 frederik use300 202 2004-11-24 13:03 test-dsh
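Because the number of MPI processes must match the number of lines in "input-file", it may help to derive the -np value from the file and to preview which command each process would run. The following shell sketch mirrors the line-selection rule of the perl script; the function names command_for_rank and count_commands are hypothetical:

```shell
#!/bin/bash
# Print the command that the process with the given MPI id would run:
# id 0 gets line 1 of the file, id 1 gets line 2, and so on.
command_for_rank() {
    local rank=$1 file=$2
    sed -n "$(( rank + 1 ))p" "$file"
}

# Print the number of lines in the file, i.e. the value to pass to mpirun -np.
count_commands() {
    grep -c '' "$1"
}
```

With the two-line "input-file" above, count_commands prints 2 and command_for_rank 1 input-file prints "ls -l".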
Catalina Commands
Catalina is the name of the batch scheduler that runs in conjunction with PBS to manage batch jobs. Using a combination of job resource request parameters (job runtime, number of nodes, etc.), Catalina determines the order in which batch jobs will begin executing.
show_q
This displays information on jobs in the queue. The first section displays running jobs, ordered by TimeRemaining. The second section shows Eligible jobs, ordered by priority. The last section shows Ineligible jobs. Options:
| Option | Description |
| [--?] | Display available options |
| [--help] | Display available options |
| [--full] | Display the full set of default info. |
| [--job] | Display job IDs |
| [--state] | Display job state |
| [--class] | Display job class |
| [--limit] | Display wall clock limit |
| [--remaining] | Display time remaining for Running job |
| [--startt] | Display start time for Running job |
| [--resstartt] | Display start time for reserved jobs |
| [--qos] | Display QOS |
| [--user] | Display owner of job |
| [--group] | Display group of job |
| [--account] | Display account of job |
| [--nodes] | Display number of nodes requested |
| [--taskmap] | Display tasks for each node of job |
| [--systemqt] | Display system queue time for job |
| [--submitt] | Display submit time for job |
| [--reason] | Display ineligible reason |
show_res
This displays information on currently active reservations, ordered by start time. Options:
| Option | Description |
| [--?] | Display available options |
| [--help] | Display available options |
| [--full] | Display the full set of default info. |
| [--overlap] | Display reservations that overlap other reservations. |
| [--nodegrep] | Display all reservations on each node. |
| [--readable] | Display dates in a readable form instead of epoch time. |
| [--res=<reservation id>] | Display the information for a specific reservation id. |
| [--start] | Display start date. |
| [--end] | Display end date. |
| [--relstart] | Display start time, relative to now, in hours. |
| [--relend] | Display end time, relative to now, in hours. |
| [--duration] | Display duration, in hours. |
| [--nodes] | Display number of nodes reserved. |
| [--purpose] | Display the type of reservation, job, standing, etc. |
| [--comment] | Display any comment associated with reservation. |
| [--node_list] | Display list of nodes reserved. |
| [--job_rest] | Display Python code for allowed jobs. |
Without options, show_res defaults to the following:
show_res --readable --relstart --relend --duration --nodes --job --purpose
Options not described here:
[--affinity_calculation]
Monitoring Batch Queues
Users can monitor PBS batch queues using the PBS utility qstat: "qstat -r" shows all running jobs. In addition, SDSC has created local utilities that provide more information about scheduled jobs:
/usr/local/apps/catalina_wrappers/bin/showq
will show time remaining for jobs that are currently running as well as estimated start times for jobs that are waiting on nodes to become available.
/usr/local/apps/catalina_wrappers/bin/show_bf
will give information on available time slots. It shows open "slots" (nodes and time), which may be useful to users who are trying to choose job parameters that allow their jobs to start sooner.
Debugging Programs
The TotalView (TV) debugger is now available for serial and parallel code debugging on this system. The current version is 7.0. TV is designed for interactive use and provides both command-line and GUI-based debugging. The TV license allows a maximum of 16 cpus. When running on more than one cpu, change the string "C%" to "/usr/bin/ssh" in the "Launch Strings" preference box, in order for TV to access batch nodes. Please see the link given below for complete TotalView documentation on how to use this utility.
The gnu debugger, gdb, may also be used, but on serial programs only. See "man gdb" for complete information on how to use gdb.
To compile your program for using gdb or the TotalView debugger, use the -g compile line option. For example:
mpicc -g do_mpi.c -o do_mpi
To run TotalView (TV) in interactive mode using the TotalView X-Windows based GUI:
- Ensure TV is in your search path: enter "which totalview"; the response should be "/usr/local/bin/totalview"
- Set up your local X Server to allow a remote window from SDSC
- Set TV to launch with ssh
- Invoke TV
- Select "PREFERENCES" from "FILE" tab
- In "PREFERENCES" change "C%" to /usr/bin/ssh in the "Launch Strings" preference box.
- Select "OK" in the "PREFERENCES" window
- Quit TV
- Request N nodes for interactive use (in this example, I'm requesting 1 node for 10 minutes):
qsub -I -V -l walltime=00:10:00 -l nodes=1:ppn=2
- Launch program and bring up TV GUI for control:
mpirun -tv -machinefile $PBS_NODEFILE -np 2 ./pi.exe




