TSCC Quick Start Guide

Technical Summary

The Triton Shared Computing Cluster (TSCC) provides central colocation and systems administration for your purchased cluster nodes ("condo cluster"), as well as a "hotel" service for those with temporary or bursty high-performance computing (HPC) needs. TSCC offers several kinds of compute nodes, including General Computing Nodes and Graphics Processing Unit (GPU) Nodes.

TSCC is now open to new purchases with refreshed technology and special vendor pricing.

TSCC free trial

For a free trial, email tscc-info@ucsd.edu and provide your:

  • Name
  • Department
  • UCSD affiliation (grad student, post-doc, faculty, etc.)

Trial accounts include 250 core-hours and are valid for 90 days.

Quick Start Guide to Running Jobs on TSCC

System Access - Logging In

To log in to TSCC, use the following hostname:

tscc-login.sdsc.edu

The following are examples of Secure Shell (ssh) commands that may be used to log in to TSCC:

ssh <your_username>@tscc-login.sdsc.edu
ssh -l <your_username> tscc-login.sdsc.edu

More information about Secure Shell may be found in the New User guide. SDSC security policy may be found at the SDSC Security site. Download the TSCC Quick Reference Guide [PDF].

Important Guidelines for Running Jobs

  • Please do not write job output to your home directory (/home/$USER). NFS filesystems have a single server that handles all metadata and storage requests, so a job writing from multiple compute nodes and cores concentrates its load on that one server.
  • The Lustre parallel filesystem (/oasis/tscc/scratch) is optimized for efficient handling of large files; however, it does not work nearly as well when writing many small files. We recommend using this filesystem only if your metadata load is modest, i.e., you have O(10)-O(200) files open simultaneously.
  • Use local scratch (/state/partition1/$USER/$PBS_JOBID) if your job writes many files from each task. The local scratch filesystem is purged at the end of each job, so copy out any files you want to retain before the job completes (see the sketch after this list).
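
A minimal sketch of this staging pattern inside a batch script (the program and file names are placeholders; the hotel queue is used only as an example):

#!/bin/bash
#PBS -q hotel
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
# Work in the per-job local scratch directory (created here in case it does not already exist)
mkdir -p /state/partition1/$USER/$PBS_JOBID
cd /state/partition1/$USER/$PBS_JOBID
# Stage input from Lustre scratch, run the program, then copy results back before the job ends
cp /oasis/tscc/scratch/$USER/<input files> .
<your_program>
cp <output files> /oasis/tscc/scratch/$USER/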

Running Jobs with TORQUE

TSCC uses the TORQUE Resource Manager (also known by its historical name Portable Batch System, or PBS) with the Maui Cluster Scheduler to define and manage job queues. TORQUE allows the user to submit one or more jobs for execution, using parameters specified in a job script.

Job Queue Summary Descriptions

The intended uses for the submit queues are as follows:

  • hotel The hotel queue supports all non-contributor users of TSCC. Jobs submitted to this queue will use only the nodes purchased for the IDI project shared cluster. As such, the total number of cores running all hotel jobs is limited to the total number of cores in that cluster (currently 640).
  • home This is a routing queue intended for all submissions to group-specific clusters; if you intend for your job to run only on the nodes your group has contributed, submit to this queue. Some users belong to more than one home group; in that case a default group is in effect, and a non-default group must be specified explicitly in the job submission (see the submission examples after this list).
  • condo The condo queue is exclusive to contributors, but allows jobs to run on nodes in addition to those purchased. This means that more cores can be in use than were contributed by the project, but it also limits the run time to eight hours to allow the node owners to have access per their contracted agreement.
  • glean The glean queue will allow jobs to run free of charge on any idle condo nodes. These jobs will be terminated whenever the other queues receive job submissions that can use the idle nodes. This queue is exclusive to condo participants.
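
For illustration, typical submissions to these queues look like the following. The batch script name is a placeholder, and TORQUE's -W group_list option is shown as one possible way to select a non-default home group; your group's exact routing setup may differ, so check with tscc-support@ucsd.edu if unsure.

qsub -q hotel <batch_script>
qsub -q home <batch_script>
qsub -q home -W group_list=<non_default_group> <batch_script>
qsub -q condo <batch_script>
qsub -q glean <batch_script>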

Job Queue Characteristics

Default Walltimes Changed

The default walltime for all queues is now one hour. Maximum core counts have also been updated on some queues. Maximum walltimes remain in force per the list below.

Queue Limits
Queue       Max Walltime    Default Walltime    Max User Cores
condo       8 hours         1 hour              512
gpu-condo   8 hours         -                   84
hotel       168 hours       1 hour              varies
gpu-hotel   168 hours       -                   unlimited
pdafm       72 hours        1 hour              96
home        unlimited       1 hour              unlimited
glean       -               1 hour              1024

For the hotel, condo, pdafm, and home queues, job charges are based on the number of cores allocated. Memory is allocated in proportion to the number of cores on the node.

Memory per Allocated Core by Queue Type
Queue     # Cores    Memory (GB)    GB Memory per Core
hotel     16         64             4
condo     16         64 or 128      4 or 8
pdafm     32         512            16
home      16         64 or 128      4 or 8
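
For example, on a hotel node with 4GB of memory per core, a job that needs roughly 16GB of memory should request at least four cores:

#PBS -q hotel
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00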

Queue Usage Policies and Restrictions

All nodes in the system are shared, and up to 16 jobs can run on each.

Anyone can submit to the hotel queue. The total number of processors in use by all jobs running via this queue is capped at 640. Jobs submitted to this queue run only on machines with 64GB of memory. (Some nodes have 128GB; hotel jobs do not run there.)

The home queue is available to groups that have contributed nodes to the TSCC. Usage limits for those queues are equal to the number of cores contributed. Similarly, the condo queue is also restricted to contributors, so that sharing access to nodes in this queue becomes a benefit of contributing nodes to the cluster.

The glean queue is available only to node contributors of the condo cluster. Jobs are not charged but must run on idle cores and will be canceled immediately when the core is needed for a regular condo job.

Only members of Unix groups defined for node contributors are allowed to submit to the home queue. The home queue routes jobs to specific queues based on the submitter's group membership, so the specific queue name is not used in the job submission. The total number of processors in use by all jobs running via each contributor's home queue is equal to the number of cores they contributed to the condo cluster.

Only members of Unix home groups are allowed to submit to condo (i.e., no hotel users). There is no total processor limit for the condo queue. If the system is busy enough that all available processors are in use and both the hotel and condo queues have jobs waiting, the hotel jobs will run first as long as the total number of processors used by hotel jobs does not exceed the 640-processor limit. Condo jobs do not run on hotel nodes.

Note!

In practice, all TSCC nodes have slightly less than the nominal amount of memory available, due to system overhead. Jobs that attempt to use more than the specified proportion of memory will be killed.

To submit a job for the PDAFM nodes, specify the pdafm queue. For example:

#PBS -q pdafm
#PBS -l nodes=2:ppn=20

To reduce email load on the mailservers, please specify an email address in your TORQUE script. For example:

#PBS -l walltime=00:20:00
#PBS -M <your_username@ucsd.edu>
#PBS -m mail_options

or using the command line:

qsub -m mail_options -M <your_username@ucsd.edu>

These mail_options are available:

n  no mail is sent
a  mail is sent when the job is aborted by the batch system
b  mail is sent when the job begins execution
e  mail is sent when the job terminates
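
For example, to receive mail when the job aborts, begins, and ends:

#PBS -m abe
#PBS -M <your_username@ucsd.edu>

or, equivalently, on the command line:

qsub -m abe -M <your_username@ucsd.edu> <batch_script>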

Submitting a Job

Submitting with a Job Script

To submit a script to TORQUE:

qsub <batch_script>

The following is an example of a TORQUE batch script for running an MPI job. The script lines are discussed in the comments that follow.

#!/bin/csh
#PBS -q <queue name>
#PBS -N <job name>
#PBS -l nodes=10:ppn=2
#PBS -l walltime=0:50:00
#PBS -o <output file>
#PBS -e <error file>
#PBS -V
#PBS -M <email address list>
#PBS -m abe
#PBS -A <account>
cd /oasis/tscc/scratch/<user name>
mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>

Comments for the above script:

#PBS -q <queue name>

Specify queue to which job is being submitted, one of:

  • hotel
  • gpu-hotel
  • condo
  • gpu-condo
  • pdafm
  • glean
#PBS -N <job name>

Specify the name of the job.

#PBS -l nodes=10:ppn=2

Request 10 nodes and 2 processors per node.

#PBS -l walltime=0:50:00

Reserve the requested nodes for 50 minutes.

#PBS -o <output file>

Redirect standard output to a file.

#PBS -e <error file>

Redirect standard error to a file.

#PBS -V

Export all user environment variables to the job.

#PBS -M <email address list>

List users, separated by commas, to receive email from this job.

#PBS -m abe

Set of conditions under which the execution server will send email about the job: (a)bort, (b)egin, (e)nd.

#PBS -A <account>

Specify the account to be charged for running the job; optional if the user has only one account. If more than one account is available and this line is omitted, the job will be charged to the default account.

To ensure the correct account is charged, it is recommended that the -A option always be used.
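
The account can also be given on the command line at submission time:

qsub -A <account> <batch_script>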

cd /oasis/tscc/scratch/<user name>

Change to user's working directory in the Lustre filesystem.

mpirun -v -machinefile $PBS_NODEFILE -np 20 <./mpi.out>

Run as a parallel job, in verbose output mode, using 20 processors, on the nodes listed in the file referenced by $PBS_NODEFILE; <./mpi.out> is a placeholder for the MPI executable to run (here, mpi.out in the current working directory).

TORQUE Commands

Common TORQUE Commands
Command                 Description
qstat -a                Display the status of batch jobs
qdel <pbs_jobid>        Delete (cancel) a queued job
qstat -r                Show all running jobs on the system
qstat -f <pbs_jobid>    Show detailed information for the specified job
qstat -q                Show all queues on the system
qstat -Q                Show queue limits for all queues
qstat -B                Show summary information for the server
pbsnodes -a             Show node status

See the qstat manpage for more options.
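
For example, to list only your own jobs, use the -u option:

qstat -u <your_username>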

GPU Queue Details

Because keeping the hotel general computing nodes and GPU nodes in a single pool caused confusion (the GPU nodes have no IB connection, so multi-node jobs can fail unless you are careful with the submission), the GPU nodes have their own queues, named gpu-hotel and gpu-condo. To run on a hotel GPU node, use a command similar to:

qsub -I -q gpu-hotel -l nodes=1:ppn=3

This command will run your job on one of the three hotel GPU nodes and allocate a GPU to it. Because the GPU nodes contain 12 cores and 4 GPUs each, one GPU is allocated to the job for every three cores requested. Allocated GPUs are referenced by the CUDA_VISIBLE_DEVICES environment variable; applications using the CUDA libraries will discover their GPU allocations through that variable.
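
As an illustration, a minimal gpu-hotel batch script might look like the following (the executable name is a placeholder; the application will see only the GPU(s) listed in CUDA_VISIBLE_DEVICES):

#!/bin/bash
#PBS -q gpu-hotel
#PBS -l nodes=1:ppn=3
#PBS -l walltime=01:00:00
cd /oasis/tscc/scratch/$USER
# The scheduler sets CUDA_VISIBLE_DEVICES to the GPU(s) allocated to this job
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"
<your_cuda_program>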

Similarly, condo-based GPUs are accessible through the gpu-condo queue, which provides users who have contributed GPU nodes to the cluster with access to each other's nodes. Like the general computing condo queue, jobs submitted to this queue have an 8-hour time limit.

Condo owners can glean cycles on condo GPU nodes via the general glean queue. To do so, just add :gpu to the node resource specification. For example:

qsub -I -q glean -l nodes=1:ppn=3:gpu

Submitting an Interactive Job

The following is an example of a TORQUE command for running an interactive job.

qsub -I -l nodes=10:ppn=2 -l walltime=0:50:00

The standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running.

Monitoring the Batch Job Queues

Users can monitor batch queues using these commands:

qstat

The command output shows the job IDs and queues, for example:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
90.tscc-46                PBStest          hocks                  0 R hotel
91.tscc-46                PBStest          hocks                  0 Q hotel
92.tscc-46                PBStest          hocks                  0 Q hotel

showq

This command shows the jobs running, queued and blocked:

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
94                    hocks    Running     8    00:09:53  Fri Apr  3 13:40:43
1 active job               8 of 16 processors in use by local jobs (50.00%)
                            8 of 8 nodes active      (100.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT              QUEUETIME
95                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:04
96                    hocks       Idle     8    00:10:00  Fri Apr  3  13:40:05
2 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT             QUEUETIME
0 blocked jobs
Total jobs:  3

showbf

This command gives information on available time slots:

Partition     Tasks  Nodes      Duration   StartOffset       StartDate
---------     -----  -----  ------------  ------------  --------------
ALL               8      8      INFINITY      00:00:00  13:45:30_04/03

Users who are trying to choose parameters that allow their jobs to run more quickly may find this a convenient way to find open nodes and time slots.
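
For example, to check whether resources are free for a two-hour window, use Maui's duration flag (support for individual flags may vary with the installed Maui version):

showbf -d 2:00:00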

Usage Monitoring

The following commands will report the current account balances:

gbalance -u <user_name>

or

gbalance -p <group_name>

The following commands will generate a summary of all activity on the account:

gstatement -u <user_name>

gstatement -p <group_name>

Obtaining Support for TSCC Jobs

For any questions, please send email to tscc-support@ucsd.edu.

TSCC Software

Installed and Supported Software

The TSCC runs CentOS version 6.3. PGI and Intel compilers are available, as are mvapich2 and openmpi. More than 50 additional software applications and libraries are installed on the system, and system administrators regularly work with researchers to extend this set as time and costs allow.

Applications Software

This list is subject to change. Specifics of installed location, version and other details may change as the packages are updated. Please contact tscc-support@ucsd.edu for details.

Package | Topic Area | License Type | Package Home Page | Install Location | Installed on: (L)ogin, (C)ompute, or (B)oth
bbcp Data Transfer Private License bbcp Home Page /opt/bbcp B
ATLAS (Automatically Tuned Linear Algebra Software) Mathematics BSD ATLAS Home Page /opt/atlas B
AMOS (A Modular, Open-Source whole genome assembler) Genomics OSI AMOS Home Page /opt/amos B
Amber (Assisted Model Building with Energy Refinement) Molecular Dynamics Amber 12 Software License Amber Home Page /opt/amber C
ABySS (Assembly by Short Sequences) Genomics BC Cancer Agency academic ABySS Home Page /opt/abyss B
APBS (Adaptive Poisson-Boltzmann Solver) Bioinformatics BSD, MIT APBS Home Page /opt/apbs C
BEAST Bioinformatics, Phylogenetics GNU LGPL BEAST Home Page /opt/beast C
BLAT Bioinformatics, Genetics Free Non-commercial BLAT User Guide /opt/biotools/blat B
bbFTP Large Parallel File Transfer GNU GPL bbFTP Home Page /opt/bbftp B
Boost C++ Library Boost Software License Boost Home Page /opt/boost B
Bowtie Short Read Aligner Bioinformatics GNU GPL Bowtie Home Page /opt/biotools/bowtie B
Burrows-Wheeler Aligner (BWA) Bioinformatics GNU GPL BWA Home Page /opt/biotools/bwa B
Cilk Parallel Programming GNU GPL Cilk Home Page /opt/cilk B
CPMD Molecular Dynamics Free for noncommercial research CPMD Home Page /opt/cpmd/bin B
DDT Graphical Parallel Debugger Licensed DDT Home Page /opt/ddt B
FFTW General GNU GPL FFTW Home Page /opt/fftw B
FPMPI MPI Programming Licensed FPMPI Home Page /opt/mvapich2
–and–
/opt/openmpi
B
FSA (Fast Statistical Alignment) Genetics Licensed FSA Home Page /opt/fsa TBD
GAMESS Chemistry No-cost Site License GAMESS Home Page /opt/gamess B
Gaussian Structure Modeling Commercial Gaussian Home Page /opt/gaussian B
Genome Analysis Toolkit (GATK) Bioinformatics BSD Open Source GATK Home Page /opt/biotools/GenomeAnalysisTK B
GROMACS (Groningen Machine for Chemical Simulations) Molecular Dynamics GNU GPL Gromacs Home Page /opt/gromacs B
GSL (GNU Scientific Library) C/C++ Library GNU GPL GSL Home Page /opt/gsl B
IDL Visualization Licensed IDL Home Page /opt/idl B
IPython Parallel Computing BSD IPython Home Page /opt/ipython B
JAGS (Just Another Gibbs Sampler) Statistical Analysis GNU GPL, MIT JAGS Home Page /opt/jags B
LAPACK (Linear Algebra PACKage) Mathematics BSD LAPACK Home Page /opt/lapack B
matplotlib Python Graphing Library PSF (Python Software Foundation) matplotlib Home Page /opt/scipy/lib64/python2.4/site-packages B
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) Molecular Dynamics Simulator GPL LAMMPS Home Page /opt/lammps C
MATLAB Parallel Development Environment Licensed MATLAB Home Page /opt/matlab.2011a
–and–
/opt/matlab_server_2012b
B
Mono Development Framework MIT, GNU GPL Mono Home Page /opt/mono B
MUMmer Bioinformatics Artistic License MUMmer Home Page /opt/mummer B
mxml (Mini-XML) Bioinformatics GNU GPL Mini-XML Home Page /opt/mxml B
Octave Numerical Computation GNU GPL Octave Home Page /opt/octave B
openmpi Parallel Library Generic Open MPI Home Page /opt/openmpi/intel/mx
–or–
/opt/openmpi/pgi/mx
–or–
/opt/openmpi/gnu/mx
B
NAMD Molecular Dynamics, BioInformatics Non-Exclusive, Non-Commercial Use NAMD Home Page /opt/namd C
NCO NetCDF Support Generic NCO Home Page /opt/nco/intel
–or–
/opt/nco/pgi
–or–
/opt/nco/gnu
B
NetCDF General Licensed (free) NetCDF Home Environment Module B
NoSE (Network Simulation Environment) Networking GNU GPL NoSE Home Perl Module B
NumPy (Numerical Python) Scientific Calculation BSD NumPy Home /opt/scipy/lib64/python2.4/site-packages B
NWChem Chemistry EMSL (free) NWChem Home Page /opt/nwchem B
ParMETIS (Parallel Graph Partitioning and Fill-reducing Matrix Ordering) Numerical Computation Open ParMETIS Home Page /opt/parmetis B
PDT (Program Database Toolkit) Software Engineering Open PDT Home Page /opt/pdt B
PETSc (Portable, Extensible Toolkit for Scientific Computation) Mathematics Open PETSc Home Page /opt/petsc B
PyFITS Astrophysics BSD PyFITS Home Page /opt/scipy/lib64/python2.4/site-packages B
Python General Scripting BSD Python Home Page /opt/python/bin B
pytz Python TimeZone Module MIT PyTZ Home Page /opt/scipy/lib64/python2.4/site-packages B
R Statistical Computing and Graphics GNU GPL R Home Page /opt/R B
RapidMiner Data Mining Open Source under AGPL3 RapidMiner Home Page /opt/rapidminer B
SAMTools (Sequence Alignment/Map) Bioinformatics BSD, MIT SAMtools Home Page /opt/biotools/samtools B
ScaLAPACK (Scalable Linear Algebra PACKage) Mathematics modified BSD ScaLAPACK Home Page /opt/scalapack B
SciPy (Scientific Python) Scientific Computing BSD SciPy Home Page /opt/scipy B
SIESTA Molecular Dynamics SIESTA LICENCE for COMPUTER CENTRES SIESTA Home Page /opt/siesta B
SOAPdenovo (Short Oligonucleotide Analysis Package) Bioinformatics GNU GPLv3 SOAPdenovo Home Page /opt/biotools/soapdenovo B
SPRNG (The Scalable Parallel Random Number Generators Library) General Scientific None SPRNG Home Page /opt/sprng B
SuperLU Mathematics Regents of the University of California SuperLU Home Page /opt/superlu B
TAU Tuning and Analysis Utilities GNU GPL TAU Home Page /opt/tau/intel
–or–
/opt/tau/pgi
–or–
/opt/tau/gnu
B
Tecplot Simulation Analytics Tecplot License Tecplot Home Page /opt/tecplot B
Trilinos Software Engineering BSD and LGPL Trilinos Home Page /opt/trilinos B
VASP (Vienna Ab initio Simulation Package) Molecular Dynamics University of Vienna VASP Home Page /opt/vasp B
Velvet (Short read de novo assembler using de Bruijn graphs) Bioinformatics GNU GPL Velvet Home Page /opt/biotools/velvet B
WEKA (Waikato Environment for Knowledge Analysis) Data Mining GNU GPL WEKA Home Page /opt/weka B

System Software

Package | Topic Area | License Type | Package Home Page | Install Location | Installed on: (L)ogin, (C)ompute, or (B)oth
CentOS Operating System Open Source CentOS Home Page N/A B
Ganglia N/A Open Source Ganglia Home Page /opt/ganglia B
Gold Allocation Manager Open Source Gold Home Page /opt/gold L
Hadoop Distributed Processing Framework Apache License Hadoop Home Page /opt/hadoop B
HDF4 (Hierarchical Data Format) Data Management BSD License HDF4 Home Page /opt/hdf4 B
HDF5 (Hierarchical Data Format) Data Management BSD License HDF5 Home Page /opt/hdf5 B
IPM (Integrated Performance Modeling) Profiling Free IPM Home Page /opt/ipm B
Lustre Scalable File System GNU GPL Lustre Home Page /opt/lustre B
Maui Workload Scheduler GNU Lesser GPL Maui Scheduler SourceForge Page /opt/maui L
Modules Environment Variable Management GNU GPL Modules Home Page /opt/modules B
mvapich2 Message Passing Interface Open Source MVAPICH2 Home Page /opt/mvapich2/gnu/ib/bin
–or–
/opt/mvapich2/intel/ib/bin
–or–
/opt/mvapich2/pgi/ib/bin
B
myHadoop System Administration BSD myHadoop Home Page /opt/myhadoop B
Nagios System Monitor GNU GPL Nagios Home Page /opt/nagios Management Node Only
Nagios Plugins System Monitor GNU GPL Nagios Plugins Home Page /opt/nagios Management Node Only
NSCA (Nagios Service Check Acceptor) System Monitor GNU GPL NSCA Home Page /opt/nagios B
PAPI (Performance Application Programming Interface) Performance Monitor Unknown PAPI Home Page /opt/papi B
TORQUE Resource Manager Open Source TORQUE Home Page /opt/torque B

Compilers

Package | Topic Area | License Type | Package Home Page | Install Location | Installed on: (L)ogin, (C)ompute, or (B)oth
CMake Cross Platform Makefile Generator Open CMake Home Page /opt/cmake B
GNU Compilers C and Fortran Compilers CentOS Core License GNU Compiler Collection Home Page /usr/bin/gcc
–and–
/usr/bin/gfortran
B
Intel Compilers C and Fortran Compilers Licensed (flexlm) Intel Compilers Home Page /opt/intel B
Java Compiler Open Java Home Page /usr/bin/javac B
PGI Compilers C and Fortran Compilers Licensed (flexlm) PGI Compilers Home Page /opt/pgi B
UPC (Unified Parallel C) Compiler Parallel Computing BSD UPC Home Page /opt/upc B

Requesting Additional Software

Users can install software in their home directories. If interest is shared with other users, requested installations can become part of the core software repository. Please submit new software requests to tscc-support@ucsd.edu.

System Information

Hardware Specifications

There are three kinds of compute nodes in the cluster: General Computing Nodes, GPU Nodes, and Petascale Data Analysis Facility (PDAF) Nodes. The current specifications for each type of node are as follows:

General Computing Nodes
Processors Dual-socket, 8-core, 2.6GHz Intel Xeon E5-2670 (Sandy Bridge)
Memory 64GB (4GB/core) (128GB memory optional)
Network 10GbE (QDR InfiniBand optional)
Hard Drive 500GB onboard (second hard drive or SSD optional)
Warranty 3 years
GPU Nodes
Host Processors Dual-socket, 6-core, 2.3GHz Intel Xeon E5-2630 (Sandy Bridge)
GPUs 4 NVIDIA GeForce GTX 680 (GTX Titan upgrade available)
Memory 32GB (64GB/128GB memory optional)
Network 10GbE (QDR InfiniBand optional)
Hard Drive 500GB + 240GB SSD
Warranty 3 years
PDAF (shared resource; pay-as-you-go only)
Processors 8-socket, 4-core AMD Shanghai Opteron
Memory 512 GB
Network 10 GbE

(IDI will annually update the hardware choices for general computing and GPU condo purchasers, to stay abreast of technology/cost advances.)

Network

TSCC nodes with the QDR InfiniBand (IB) option connect to 32-port IB switches, allowing up to 512 cores to communicate at full bisection bandwidth for low-latency parallel computing.

Storage

TSCC users receive 100GB of backed-up home file storage and shared access to the 200+ TB Data Oasis Lustre-based high-performance parallel file system. (There is a 90-day purge policy on Data Oasis, and this storage is not backed up.)

Note

Additional persistent storage can be mounted from lab file servers over the campus network or purchased from SDSC.

For more information, contact tscc-info@ucsd.edu.

TSCC New Purchases

For information on joining the Triton Shared Computing Cluster, please see the Purchase Info page.
