• Comparison between SLURM and Torque (PBS)
    
    A Torque queue corresponds to a Slurm partition.
    
    
    Comparison of some common commands in SLURM and in Torque (PBS) and Maui. 
    
    
    Task                            Torque/PBS                    SLURM
    -------------------------------------------------------------------------------------
    Submit a job                    qsub myjob.sh                 sbatch myjob.sh
    Delete a job                    qdel 123                      scancel 123
    Show job status                 qstat                         squeue
    Show expected job start time    - (showstart in Maui/Moab)    squeue --start
    Show queue info                 qstat -q                      sinfo
    Show queue details              qstat -Q -f                   scontrol show partition
                                    (mdiag -c in Maui/Moab)
    Show job details                qstat -f 123                  scontrol show job 123
    Show node details               pbsnodes n0000                scontrol show node n0000
    Show QoS details                - (mdiag -q in Maui/Moab)     sacctmgr show qos
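
    When porting wrapper scripts, the table can be encoded as a small lookup. The sketch below is a hypothetical helper (the name pbs2slurm is an assumption, not part of either scheduler) that prints the Slurm counterpart of a few Torque commands from the table. It only prints the command; it does not run anything.

```shell
#!/bin/sh
# pbs2slurm: print the Slurm equivalent of a Torque/PBS command line.
# Hypothetical helper; it covers only some rows of the comparison table.
pbs2slurm() {
    cmd=$1
    [ $# -gt 0 ] && shift
    case "$cmd" in
        qsub)     set -- sbatch "$@" ;;
        qdel)     set -- scancel "$@" ;;
        pbsnodes) set -- scontrol show node "$@" ;;
        qstat)
            case "$1" in
                -q) set -- sinfo ;;
                -Q) set -- scontrol show partition ;;      # qstat -Q -f
                -f) shift; set -- scontrol show job "$@" ;;
                *)  set -- squeue "$@" ;;
            esac ;;
        *)  echo "pbs2slurm: no mapping for '$cmd'" >&2
            return 1 ;;
    esac
    echo "$*"
}
```

    For example, "pbs2slurm qstat -f 123" prints "scontrol show job 123".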
    
    
    

    Job control

    Action                           Slurm                                               Torque/PBS          Maui                  SGE
    ----------------------------------------------------------------------------------------------------------------------------------
    Get information about the job    scontrol show job "jobid"                           qstat -f "jobid"    checkjob              -
    Display the queue info           squeue                                              qstat               showq                 qstat
    Delete a job                     scancel "jobid"                                     qdel                -                     qdel
    Clean up leftover job            -                                                   momctl -c JobID     -                     -
    Submit a job                     srun/sbatch/salloc testjob                          qsub testjob        msub                  qsub
    Submit an interactive job        salloc -N 4 -p active sh                            qsub -I             -                     qlogin
    Display all job info             squeue -al                                          qstat -f            -                     -
                                     scontrol show job
    Display job                      scontrol show job "jobID"                           qstat -f "jobID"    -                     -
    Display free processors          srun --test-only -p normal -n 1 -t 10:00 sh         -                   showbf                -
    Display the expected start time  squeue --start -j "jobid"                           -                   showstart "jobid"     -
    Display blocked jobs             squeue --start                                      -                   mdiag -b / showq -b   -
    Display queues/partitions        scontrol show partition                             qstat -Qf           mdiag -c              -
    Display queue                    sinfo -h                                            qstat -q            -                     -
                                     sinfo -o "%P %l %c %D "
    Start job                        scontrol update JobId=1234 StartTime=now            qrun                runjob                -
    Hold job                         scontrol update JobId=1234 StartTime=now+30days     qhold               sethold               -
    Release held job                 scontrol update JobId=1234 StartTime=now            qrls                releasehold           -
    Pending job                      scontrol requeue 1234                               -                   -                     -
    Graphical frontend               sview                                               xpbs                -                     qmon
    Set priority                     scontrol update JobId=592 nice=-10000               -                   setspri 10000 592     -
    Preempt job                      scontrol requeue $jobid                             -                   mjobctl -R $jobid     -
    Suspend job                      scontrol suspend $jobid                             -                   mjobctl -s $jobid     -
    Resume job                       scontrol resume $jobid                              -                   mjobctl -r $jobid     -
    List job resources               sacct -j xxx -l                                     tracejob xxx        -                     -
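
    The hold and release entries work by moving the job's StartTime. A minimal sketch of wrapping them in qhold/qrls-style helpers follows; the function names and the DRY_RUN switch are assumptions made for illustration. Slurm also offers "scontrol hold" and "scontrol release", which may be the simpler choice on a given site.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrappers around the scontrol calls listed in the table.
job_hold()    { run scontrol update JobId="$1" StartTime=now+30days; }
job_release() { run scontrol update JobId="$1" StartTime=now; }
job_requeue() { run scontrol requeue "$1"; }
```

    With DRY_RUN=1 set, "job_hold 1234" prints the scontrol command instead of running it, which is handy for checking a batch of job IDs first.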

    Node control

    Action                        Slurm                                   Torque/PBS                        Maui
    ---------------------------------------------------------------------------------------------------------------------
    Display node info             scontrol show node="node"               pbsnodes "node"                   checknode "node"
                                  squeue -w "node"
    List features and resources   sinfo -o "%15N %10c %10m %25f %10G"     pbsnodes                          -
    Drain node                    scontrol update NodeName=gpu-1-4 \      pbsnodes -oN "Timeout" tcc-1-4    -
                                    State=DRAIN Reason=Timeout
    Clear node                    scontrol update NodeName=gpu-1-4 \      pbsnodes -cN "" tcc-1-4           -
                                    State=RESUME
    List down nodes               sinfo --list-reasons                    pbsnodes -ln                      -
                                  sinfo --long -R
                                  sinfo -RlN
    List reservations             scontrol show res                       -                                 showres
                                  sinfo -T

    Set reservation
        Slurm:  scontrol create res starttime=now duration=04:00:00 Nodecnt=1 Users=hocks Flags=IGNORE
        Maui:   setres -n tideker -s 00:00:01_06/11/2014 -d 24:00:00 -f tideker-node TASKS==256

    Slurm equivalents of Torque/Maui commands
        pbsnodes -l all:  alias pn='sinfo --format="%25N %.3D %9P %11T %.4c %14C %.8z %.8m %.4d %.8w %10f %20E"'
        pbsnodes -ln:     alias pl='sinfo --states=down,drain,fail,no_respond,maint,unk --format="%12n %20f %20H %12u %32E"'
        showstart:        alias lj='sacct -o user,jobid,jobname,state,node,start'
        list allocated/idle/other/total:  sinfo --long -o %F

    Configuration control

    Action                Slurm                   Torque/PBS    Maui
    ------------------------------------------------------------------------
    Show configuration    scontrol show config    -             showconfig
  • Job State
    
           CA  CANCELLED       Job was explicitly cancelled by the user or system administrator.  The job may or
                               may not have been initiated.
    
           CD  COMPLETED       Job has terminated all processes on all nodes.
    
           CF  CONFIGURING     Job has been allocated resources, but is waiting for them to become ready for
                               use (e.g. booting).
    
           CG  COMPLETING      Job is in the process of completing. Some processes on some nodes may still be
                               active.
    
           F   FAILED          Job terminated with non-zero exit code or other failure condition.
    
           NF  NODE_FAIL       Job terminated due to failure of one or more allocated nodes.
    
           PD  PENDING         Job is awaiting resource allocation.
    
           PR  PREEMPTED       Job terminated due to preemption.
    
           R   RUNNING         Job currently has an allocation.
    
           S   SUSPENDED       Job has an allocation, but execution has been suspended.
    
           TO  TIMEOUT         Job terminated upon reaching its time limit.
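
    Scripts that parse compact squeue output often need the long state name. A minimal sketch of such a lookup, limited to the codes listed above (the function name is an assumption):

```shell
#!/bin/sh
# slurm_state_name CODE: print the long name of a compact job state code.
# Only the codes from the list above are covered.
slurm_state_name() {
    case "$1" in
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CF) echo CONFIGURING ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        NF) echo NODE_FAIL ;;
        PD) echo PENDING ;;
        PR) echo PREEMPTED ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        TO) echo TIMEOUT ;;
        *)  echo "unknown state code: $1" >&2
            return 1 ;;
    esac
}
```

    A typical use would be: slurm_state_name "$(squeue -h -o %t -j "$jobid")", where %t is the compact job state field of squeue.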
    
    
    
  • sinfo format options (-o / --format)
    The field specifications available include:
    
                  %a  State/availability of a partition
    
                  %A  Number of nodes by state in the format "allocated/idle".
                      Do not use this with a node state option ("%t" or "%T") or
                      the different node states will be placed on separate lines.
    
                  %c  Number of CPUs per node
    
                  %d  Size of temporary disk space per node in megabytes
    
                  %D  Number of nodes
    
                  %f  Features associated with the nodes
    
                  %F  Number of nodes by state in the format
                      "allocated/idle/other/total". Do not use this with a node state
                      option ("%t" or "%T") or the different node states will be
                      placed on separate lines.
    
                  %g  Groups which may use the nodes
    
                  %h  Jobs may share nodes, "yes", "no", or "force"
    
                  %l  Maximum time for any job in the format
                      "days-hours:minutes:seconds"
    
                  %m  Size of memory per node in megabytes
    
                  %N  List of node names
    
                  %P  Partition name
    
                  %r  Only user root may initiate jobs, "yes" or "no"
    
                  %R  The reason a node is unavailable (down, drained, or draining
                      states)
    
                  %s  Maximum job size in nodes
    
                  %t  State of nodes, compact form
    
                  %T  State of nodes, extended form
    
                  %w  Scheduling weight of the nodes
    
                  %.<*>
                      right justification of the field
    
                  %<*>
                      size of field
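
    Each field is written as %[[.]size]type, so a format string can be assembled from type/width pairs. The helper below is a hypothetical sketch (sinfo_fmt is not a Slurm tool): every argument is TYPE or TYPE:WIDTH, and a leading "." in WIDTH requests right justification.

```shell
#!/bin/sh
# sinfo_fmt TYPE[:WIDTH] ...: build a format string for "sinfo -o".
# Example argument "D:.5" becomes "%.5D" (right-justified, 5 columns wide).
sinfo_fmt() {
    out=""
    for spec in "$@"; do
        field=${spec%%:*}
        width=""
        case "$spec" in
            *:*) width=${spec#*:} ;;
        esac
        out="${out:+$out }%${width}${field}"
    done
    echo "$out"
}
```

    Usage on a cluster would look like: sinfo -o "$(sinfo_fmt N:15 c:10 m:10 f:25)".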
    
    
  • Job variables
    
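    Environment Variable    Torque/PBS                       SLURM
    -----------------------------------------------------------------------------------------
    JOB ID                  PBS_JOBID                        SLURM_JOB_ID / SLURM_JOBID
    JOB NAME                PBS_JOBNAME                      SLURM_JOB_NAME
    NODE LIST               PBS_NODEFILE                     SLURM_JOB_NODELIST / SLURM_NODELIST
    JOB SUBMIT DIRECTORY    PBS_O_WORKDIR                    SLURM_SUBMIT_DIR
    JOB ARRAY ID (INDEX)    PBS_ARRAYID / PBS_ARRAY_INDEX    SLURM_ARRAY_TASK_ID
    USER                    PBS_O_LOGNAME                    SLURM_JOB_USER

    PBS_NODEFILE names a file that lists the allocated nodes, whereas SLURM_JOB_NODELIST holds the node list itself.
    PBS_ARRAYID is the Torque name; PBS_ARRAY_INDEX is used by PBS Pro.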
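
    A job script that has to run under both schedulers can read whichever variable is set. The sketch below uses only variables from the table; the function names are assumptions, and the Slurm value wins when both are present.

```shell
#!/bin/sh
# Scheduler-neutral accessors: prefer the Slurm variable, fall back to Torque/PBS.
job_id()      { echo "${SLURM_JOB_ID:-${PBS_JOBID:-}}"; }
job_name()    { echo "${SLURM_JOB_NAME:-${PBS_JOBNAME:-}}"; }
# Outside a batch job, fall back to the current directory.
job_workdir() { echo "${SLURM_SUBMIT_DIR:-${PBS_O_WORKDIR:-$PWD}}"; }
```

    Inside a job script, cd "$(job_workdir)" then behaves the same under either scheduler.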
    
  • prolog/epilogue
  • Environment variables per Job
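  • Slurm script emulating Maui "showq" on the cluster "comet"
    Running jobs are recognised by a node list containing "comet"; every other job is counted as blocked.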
    
    #!/bin/ksh
    
    if [ $# != 1 ] ; then
       Flag="all"
    else
       Flag=$1
    fi
    
    NORM="\033[0m"
    RED_F="\033[31m"; RED_B="\033[41m"
    BLINK="\033[5m"
    GREEN_F="\033[32m"
    
    if [[ $Flag == "-h" ]] ; then
     echo "Flags : -r running , -b blocked, -h help , no flag shows all jobs"
    fi
    
    if [[ $Flag == "-r" || $Flag == "all" ]] ; then
    echo "ACTIVE JOBS--------------------"
    echo "             JOBID PARTITION NAME      USER     ST                        TIME  NODES CPU NODELIST"
     /usr/bin/squeue -h -l | grep comet
     WC1=$(/usr/bin/squeue -h| grep comet|wc -l)
    echo -e "${RED_F}\n active jobs $WC1 "
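     # Sum field 8 (NODES in the long listing) over the running jobs in partition "compute".
     /usr/bin/squeue -h -l -p compute |grep comet | awk 'BEGIN{s=0};{s=s+$8};END{print " " s " exclusive allocated Nodes" }'
     # Sum field 8 (%C, the CPU count, in this custom format) over all running jobs.
     /usr/bin/squeue -h -l -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.6C %R"| grep comet | awk 'BEGIN{s=0};{s=s+$8};END{print " " s " allocated CPUs" }'
    echo -e "${NORM}"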
    fi
    
    if [[ $Flag == "-b" || $Flag == "all" ]] ; then
    echo "BLOCKED JOBS----------------"
    echo "            JOBID  PARTITION NAME         USER ST          START_TIME  NODES  SCHEDNODES          REASON"
     /usr/bin/squeue -h -l --start  | grep -v None
     WC2=$(/usr/bin/squeue -h | grep -v comet|wc -l)
    echo -e "${RED_F} \n blocked jobs: $WC2 "
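     # Sum field 8 (NODES in the long listing) over the jobs in "compute" that are not running.
     /usr/bin/squeue -h -l -p compute| grep -v comet | awk 'BEGIN{s=0};{s=s+$8};END{print " " s " Partition COMPUTE requested Nodes " }'
     # Sum field 8 (%C, the CPU count, in this custom format) over the jobs in "shared" that are not running.
     /usr/bin/squeue -h -l -p shared -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.6C %R"| grep -v comet | awk 'BEGIN{s=0};{s=s+$8};END{print " " s " Partition SHARED requested CPUs " }'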
    echo -e "${NORM}"
    
    
    fi
    if [[ $Flag == "all" ]] ; then
    echo -e "${RED_F} Total jobs: $(expr $WC1 + $WC2) ${NORM}" 
    fi
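
    The node and CPU totals in the script come from adding up one column of squeue output with awk. A minimal sketch of that pattern as a reusable function (sum_column is a hypothetical name), which can be tried on plain text without a scheduler:

```shell
#!/bin/sh
# sum_column N: add up whitespace-separated field N over all lines on stdin.
# "s + 0" makes awk print 0 rather than an empty string for empty input.
sum_column() {
    awk -v col="$1" '{ s += $col } END { print s + 0 }'
}
```

    On a cluster, something like squeue -h -t R -o "%D %C" | sum_column 2 would total the CPUs of running jobs (%D is the node count, %C the CPU count).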
    
    
    
  • colour prompt
    
    
    red # export PS1='\[\e[1;31m\][\u@\h \W]\$\[\e[0m\] '
    yellow # export PS1='\[\e[1;33m\][\u@\h \W]\$\[\e[0m\] '
    green # export PS1='\[\e[1;32m\][\u@\h \W]\$\[\e[0m\] '
    blue # export PS1='\[\e[1;34m\][\u@\h \W]\$\[\e[0m\] '
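
    The prompts differ only in the ANSI foreground code (31 red, 32 green, 33 yellow, 34 blue). A small sketch that builds the bash prompt string from a colour name; the function name ps1_colour is an assumption:

```shell
#!/bin/sh
# ps1_colour NAME: print a bold, coloured bash PS1 string of the form [user@host dir]$
ps1_colour() {
    case "$1" in
        red)    code=31 ;;
        green)  code=32 ;;
        yellow) code=33 ;;
        blue)   code=34 ;;
        *)      echo "unknown colour: $1" >&2
                return 1 ;;
    esac
    # %s arguments are printed literally, so the backslash escapes reach bash unchanged.
    printf '%s%s%s\n' '\[\e[1;' "$code" 'm\][\u@\h \W]\$\[\e[0m\] '
}
```

    In ~/.bashrc this could be used as: PS1=$(ps1_colour red), for example to mark root shells or production hosts.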