Gordon User Guide

Technical Summary

Gordon is a dedicated XSEDE cluster designed by Appro and SDSC consisting of 1024 compute nodes and 64 I/O nodes. Each compute node contains two 8-core 2.6 GHz Intel EM64T Xeon E5 (Sandy Bridge) processors and 64 GB of DDR3-1333 memory. The I/O nodes each contain two 6-core 2.67 GHz Intel X5650 (Westmere) processors, 48 GB of DDR3-1333 memory, and sixteen 300 GB Intel 710 solid state drives. The network topology is a 4x4x4 3D torus with adjacent switches connected by three 4x QDR InfiniBand links (120 Gb/s). Compute nodes (16 per switch) and I/O nodes (1 per switch) are connected to the switches by 4x QDR (40 Gb/s). The theoretical peak performance of Gordon is 341 TFlop/s.

Transition to Comet

Please read this important information for users transitioning from Gordon. The SDSC Gordon cluster is scheduled to be decommissioned at the end of March, 2017. To minimize the impact of this change and allow you to continue computing without interruption, we will be transitioning Gordon users to Comet. More details are provided in the Transitioning from Gordon section of the Comet User Guide.

Technical Details

System Component | Configuration
Intel EM64T Xeon E5 Compute Nodes
  Sockets | 2
  Cores | 16
  Clock speed | 2.6 GHz
  Flop speed | 333 Gflop/s
  Memory capacity | 64 GB
  Memory bandwidth | 85 GB/s
  STREAM Triad bandwidth | 60 GB/s
I/O Nodes
  Sockets | 2
  Cores | 12
  Clock speed | 2.67 GHz
  Memory capacity | 48 GB
  Memory bandwidth | 64 GB/s
  Flash memory | 4.8 TB
Full System
  Total compute nodes | 1024
  Total compute cores | 16384
  Peak performance | 341 Tflop/s
  Total memory | 64 TB
  Total memory bandwidth | 87 TB/s
  Total flash memory | 300 TB
QDR InfiniBand Interconnect
  Topology | 3D Torus
  Link bandwidth | 8 GB/s (bidirectional)
  Peak bisection bandwidth | TB/s (bidirectional)
  MPI latency | 1.3 µs
Disk I/O Subsystem
  File Systems | NFS, Lustre
  Storage capacity (usable) | 1.5 PB
  I/O bandwidth | 100 GB/s

Gordon supports the XSEDE core software stack, which includes remote login, remote computation, data movement, science workflow support, and science gateway support toolkits.

Systems Software Environment

Software Function | Description
Cluster Management | Rocks
Operating System | CentOS
File Systems | NFS, Lustre
Scheduler and Resource Manager | Catalina, TORQUE
XSEDE Software | CTSS
User Environment | Modules
Compilers | Intel and PGI Fortran, C, C++
Message Passing | Intel MPI, MVAPICH, Open MPI
Debugger | DDT
Performance | IPM, mpiP, PAPI, TAU

System Access

As an XSEDE computing resource, Gordon is accessible to XSEDE users who are given time on the system. In order to get an account, users will need to submit a proposal through the XSEDE Allocation Request System.

Interested users may contact SDSC User Support for assistance with applying for time on Gordon (see sidebar for contact information).

Logging in to Gordon

Gordon supports Single Sign On through the XSEDE User Portal and from the command line using an XSEDE-wide password. To login to Gordon from the command line, use the hostname:

gordon.sdsc.edu 

The following are examples of Secure Shell (ssh) commands that may be used to log in to Gordon:

ssh <your_username>@gordon.sdsc.edu
ssh -l <your_username> gordon.sdsc.edu 

Notes and hints

  • When you log in to gordon.sdsc.edu, you will be assigned one of the four login nodes gordon-ln[1-4].sdsc.edu. These nodes are identical in both architecture and software environment. Users should normally log in to gordon.sdsc.edu, but can access one of the four login nodes directly if they see poor performance on the node they were assigned.
  • Please feel free to append your public RSA key to your ~/.ssh/authorized_keys file to enable access from authorized hosts without having to enter your password. Make sure you have a password on the private key on your local machine. You can use ssh-agent or keychain to avoid repeatedly typing the private key password.
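    For example, a typical key setup sequence might look like the following (file names and the username are placeholders):

    ssh-keygen -t rsa
    # Append the public key to authorized_keys on Gordon
    cat ~/.ssh/id_rsa.pub | ssh <your_username>@gordon.sdsc.edu "cat >> ~/.ssh/authorized_keys"
    # Start an agent and add the key once per local session
    eval `ssh-agent`
    ssh-add ~/.ssh/id_rsa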

Do not use the login nodes for computationally intensive processes.  These nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be submitted and run through the batch queuing system.

Modules

The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.

For example, if the Intel module and mvapich2_ib module are loaded and the user compiles with mpif90, the generated code is compiled with the Intel Fortran 90 compiler and linked with the mvapich2_ib MPI libraries.

Several modules that determine the default Gordon environment are loaded at login time. These include the MVAPICH implementation of the MPI library and the Intel compilers. We strongly suggest that you use this combination whenever possible to get the best performance.

Useful Modules Commands

Here are some common module commands and their descriptions:

Command | Description
module list | List the modules that are currently loaded
module avail | List the modules that are available
module display <module_name> | Show the environment variables used by <module_name> and how they are affected
module unload <module_name> | Remove <module_name> from the environment
module load <module_name> | Load <module_name> into the environment
module swap <module_one> <module_two> | Replace <module_one> with <module_two> in the environment

Loading and unloading modules

You must remove some modules before loading others.

Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich2_ib are both loaded, running the command module unload intel will automatically unload mvapich2_ib. Subsequently issuing the module load intel command does not automatically reload mvapich2_ib.

If you find yourself regularly using a set of module commands, you may want to add these to your configuration files (.bashrc for bash users, .cshrc for C shell users). Complete documentation is available in the module(1) and modulefile(4) manpages.
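For example, bash users might add something like the following to ~/.bashrc (the modules shown are illustrative; load whichever combination you actually use):

# Load the suggested compiler and MPI stack at login
module load intel mvapich2_ib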

Module: command not found

The error message module: command not found is sometimes encountered when switching from one shell to another or attempting to run the module command from within a shell script or batch job. The reason the module command may not be inherited as expected is that it is defined as a function for your login shell. If you encounter this error, execute the following from the command line (for interactive shells) or add it to your shell script (including TORQUE batch scripts):

source /etc/profile.d/modules.sh 

Software Packages

Sample scripts using some common applications and libraries are provided in /home/diag/opt/sdsc/scripts. If you have any questions regarding setting up run scripts for your application, please email help@xsede.org.

Software Package | Compiler Suites | Parallel Interface
AMBER: Assisted Model Building with Energy Refinement | intel | mvapich2_ib
APBS: Adaptive Poisson-Boltzmann Solver | intel | mvapich2_ib
Car-Parrinello 2000 (CP2K) | intel | mvapich2_ib
DDT | | 
FFTW: Fastest Fourier Transform in the West | intel, pgi, gnu | mvapich2_ib
GAMESS: General Atomic and Molecular Electronic Structure System | intel | native: sockets, IP over IB; vsmp: ScaleMP MPICH2
GAUSSIAN | pgi | single node, shared memory
GROMACS: GROningen MAchine for Chemical Simulations | intel | mvapich2_ib
HDF4/HDF5: Hierarchical Data Format | intel, pgi, gnu | mvapich2_ib (HDF5 only)
LAMMPS: Large-scale Atomic/Molecular Massively Parallel Simulator | intel | mvapich2_ib
NAMD: NAnoscale Molecular Dynamics | intel | mvapich2_ib
NCO: netCDF Operators | intel, pgi, gnu | none
netCDF: Network Common Data Format | intel, pgi, gnu | none
Python modules (scipy, etc.) | gnu: ipython, nose, pytz; intel: matplotlib, numpy, scipy, pyfits | none
P3DFFT: Parallel 3-D Fast Fourier Transforms | intel, pgi, gnu | mvapich2_ib
Singularity: User Defined Images | none | none
VisIt Visualization Package | intel | openmpi

Software Package Descriptions

AMBER

AMBER is a package of molecular simulation programs that includes SANDER (Simulated Annealing with NMR-Derived Energy Restraints) and PMEMD (Particle Mesh Ewald Molecular Dynamics), a modified version of SANDER that is faster and more scalable.

APBS

APBS evaluates the electrostatic properties of solvated biomolecular systems.

APBS documentation

CP2K

CP2K is a program to perform simulations of molecular systems. It provides a general framework for different methods such as Density Functional Theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials.

CP2K documentation

DDT

DDT is a debugging tool for scalar, multithreaded and parallel applications.

DDT Debugging Guide from TACC

FFTW

FFTW is a library for computing the discrete Fourier transform in one or more dimensions, of arbitrary input size, and of both real and complex data.

FFTW documentation

GAMESS

GAMESS is a program for ab initio quantum chemistry. GAMESS can compute SCF wavefunctions and correlation corrections to these wavefunctions, as well as Density Functional Theory calculations.

GAMESS documentation, examples, etc.

Gaussian 09

Gaussian 09 provides state-of-the-art capabilities for electronic structure modeling.

Gaussian 09 User's Reference

GROMACS

GROMACS is a versatile molecular dynamics package, primarily designed for biochemical molecules like proteins, lipids and nucleic acids.

GROMACS Online Manual

HDF

HDF is a collection of utilities, applications and libraries for manipulating, viewing, and analyzing data in HDF format.

HDF 5 Resources

LAMMPS

LAMMPS is a classical molecular dynamics simulation code.

LAMMPS User Manual

NAMD

NAMD is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.

NAMD User's Guide

NCO

NCO operates on netCDF input files (e.g. derive new data, average, print, hyperslab, manipulate metadata) and outputs results to screen or text, binary, or netCDF file formats.

NCO documentation on SourceForge

netCDF

netCDF is a set of libraries that support the creation, access, and sharing of array-oriented scientific data using machine-independent data formats.

netCDF documentation on UCAR's Unidata Program Center

Python Modules (scipy etc.)

The Python modules under /opt/scipy consist of nose, numpy, scipy, matplotlib, pyfits, ipython, and pytz.

Video tutorial from a TACC workshop on Python

Python videos from Khan Academy

The HPC University Python resources

P3DFFT

Parallel Three-Dimensional Fast Fourier Transforms is a library for large-scale computer simulations on parallel platforms. 3D FFT is an important algorithm for simulations in a wide range of fields, including studies of turbulence, climatology, astrophysics and material science.

Read the User Guide on GitHub.

Singularity: User Defined Images

Singularity is a platform to support users who have different environment needs than what is provided by the resource or service provider. While other container solutions appear at a high level to fill this niche well, their current implementations focus on network service virtualization rather than the application-level virtualization relevant to HPC. Because of this, Singularity leverages a workflow and security model that makes it a very reasonable candidate for shared or multi-tenant HPC resources like Gordon without requiring any modifications to the scheduler or system architecture. Additionally, all typical HPC functions can be leveraged within a Singularity container (e.g. InfiniBand, high performance file systems, GPUs, etc.). While Singularity supports MPI running in a hybrid model, where MPI is invoked outside the container and the MPI programs run inside the container, we have not yet tested this.

" Examples for various modes of usage are available in /home/diag/opt/scripts/Singularity. Please email help@xsede.org (reference Gordon as the machine, and SDSC as the site) if you have any further questions about usage and configuration. Details on the Signularity project are available at http://singularity.lbl.gov/#home.

VisIt Visualization Package

The VisIt visualization package supports remote submission of parallel jobs and includes a Python interface that provides bindings to all of its plots and operators so they may be controlled by scripting.

Getting Started With VisIt tutorial

Managing Your Accounts

The show_accounts command lists the accounts that you are authorized to use, together with a summary of the used and remaining time.

[user@gordon-ln ~]$ show_accounts
ID name   project   used          available    used_by_proj
---------------------------------------------------------------
<user> <project> <SUs by user> <SUs available> <SUs by proj> 

To charge your job to one of these projects, replace << project >> with one of the project names from the list and put this PBS directive in your job script:

  #PBS -A << project >>

Many users will have access to multiple accounts (e.g. an allocation for a research project and a separate allocation for classroom or educational use). On some systems a default account is assumed, but please get in the habit of explicitly setting an account for all batch jobs. Awards are normally made for a specific purpose and should not be used for other projects.

Adding users to an account

Project PIs and co-PIs can add or remove users from an account. To do this, log in to your XSEDE portal account and go to the add user page.

Charging

The charge unit for all SDSC machines, including Gordon, is the SU (service unit) and corresponds to the use of one compute core for one hour. Keep in mind that your charges are based on the resources that are tied up by your job and don’t necessarily reflect how the resources are used.

Unlike some of SDSC’s other major compute resources, Gordon does not provide for shared use of a compute node. A serial job that requests a compute node for one hour will be charged 16 SUs (16 cores x 1 hour), regardless of how the processors-per-node parameter is set in the batch script.
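For example, a job that uses four compute nodes for two hours is charged 4 nodes x 16 cores x 2 hours = 128 SUs.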

The large memory vSMP nodes, on the other hand, will be charged according to the number of cores requested. Be sure to request as many cores as your job needs, but be aware that if you request all 256 cores within a vSMP node, your job will be charged at the rate of 256 SUs per hour.

Compiling

Gordon provides the Intel, Portland Group (PGI), and GNU compilers along with multiple MPI implementations (MVAPICH2, MPICH2, OpenMPI). Most applications will achieve the best performance on Gordon using the Intel compilers and MVAPICH2 and the majority of libraries installed on Gordon have been built using this combination. Although other compilers and MPI implementations are available, we suggest using these only for compatibility purposes.

All three compilers now support the Advanced Vector Extensions (AVX). Using AVX, up to eight floating point operations can be executed per cycle per core, potentially doubling the performance relative to non-AVX processors running at the same clock speed. Note that AVX support is not enabled by default and compiler flags must be set as described below.

Using the Intel compilers (Default/Suggested)

The Intel compilers and the MVAPICH2 MPI implementation are loaded by default. If you have modified your environment, you can restore the defaults by executing the following commands at the Linux prompt or placing them in your startup file (~/.cshrc or ~/.bashrc):

module purge
module load intel mvapich2_ib

For AVX support, compile with the -xHOST option. Note that -xHOST alone does not enable aggressive optimization, so compilation with -O3 is also suggested. The -fast flag invokes -xHOST, but should be avoided since it also turns on interprocedural optimization (-ipo), which may cause problems in some instances.
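For example (source and executable names are placeholders):

ifort -O3 -xHOST -o my_app my_app.f90
mpif90 -O3 -xHOST -o my_mpi_app my_mpi_app.f90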

The Intel MKL libraries are available as part of the "intel" modules on Gordon. Once this module is loaded, the environment variable MKL_ROOT points to the location of the MKL libraries. The Intel MKL Link Line Advisor can be used to determine the appropriate link line (substitute MKL_ROOT as appropriate).

For example, to compile a C program statically linking the 64-bit ScaLAPACK libraries on Gordon:

module unload mvapich2_ib
module unload pgi
module load intel
module load mvapich2_ib
mpicc -o pdpttr.exe pdpttr.c \
    -I$MKL_ROOT/include ${MKL_ROOT}/lib/intel64/libmkl_scalapack_lp64.a \
    -Wl,--start-group ${MKL_ROOT}/lib/intel64/libmkl_intel_lp64.a \
    ${MKL_ROOT}/lib/intel64/libmkl_core.a ${MKL_ROOT}/lib/intel64/libmkl_sequential.a \
    -Wl,--end-group ${MKL_ROOT}/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm

For more information on the Intel compilers: [ifort | icc | icpc] -help

          Serial   MPI      OpenMP           MPI+OpenMP
Fortran   ifort    mpif90   ifort -openmp    mpif90 -openmp
C         icc      mpicc    icc -openmp      mpicc -openmp
C++       icpc     mpicxx   icpc -openmp     mpicxx -openmp

Note for vSMP users on Gordon – MPI applications intended for use on the large memory vSMP nodes should use MPICH2 instead of MVAPICH2. This version of MPICH2 has been specially tuned for optimal vSMP message passing performance.

Note for C/C++ users: you may see the compiler warning "feupdateenv is not implemented and will always fail". For most users, this warning can safely be ignored. By default, the Intel C/C++ compilers only link against Intel's optimized version of the C standard math library (libimf). The warning stems from the fact that several of the newer C99 library functions related to floating point rounding and exception handling have not been implemented.

Using the PGI compilers

The PGI compilers can be loaded by executing the following commands at the Linux prompt or placing them in your startup file (~/.cshrc or ~/.bashrc):

module purge
module load pgi mvapich2_ib

For AVX support, compile with -fast
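For example (source and executable names are placeholders):

pgf90 -fast -o my_app my_app.f90
mpicc -fast -o my_mpi_app my_mpi_app.c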

For more information on the PGI compilers: man [pgf90 | pgcc | pgCC]

          Serial   MPI      OpenMP      MPI+OpenMP
Fortran   pgf90    mpif90   pgf90 -mp   mpif90 -mp
C         pgcc     mpicc    pgcc -mp    mpicc -mp
C++       pgCC     mpicxx   pgCC -mp    mpicxx -mp

Using the GNU compilers

The GNU compilers can be loaded by executing the following commands at the Linux prompt or placing them in your startup files (~/.cshrc or ~/.bashrc):

module purge
module load gnu openmpi

For AVX support, compile with -mavx. Note that AVX support is only available in version 4.6 or later, so it is necessary to explicitly load the gnu/4.6.1 module until it becomes the default.
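For example (file names are placeholders; the module swap is only needed while a pre-4.6 gnu module is the default):

module swap gnu gnu/4.6.1
gcc -O3 -mavx -o my_app my_app.c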

For more information on the GNU compilers: man [gfortran | gcc | g++]

          Serial     MPI      OpenMP              MPI+OpenMP
Fortran   gfortran   mpif90   gfortran -fopenmp   mpif90 -fopenmp
C         gcc        mpicc    gcc -fopenmp        mpicc -fopenmp
C++       g++        mpicxx   g++ -fopenmp        mpicxx -fopenmp

Notes and Hints

  • The mpif90, mpicc, and mpicxx commands are actually wrappers that call the appropriate serial compilers and load the correct MPI libraries. While the same names are used for the Intel, PGI and GNU compilers, keep in mind that these are completely independent scripts.
  • If you use the PGI or GNU compilers or switch between compilers for different applications, make sure that you load the appropriate modules before running your executables.
  • When building OpenMP applications and moving between different compilers, one of the most common errors is to use the wrong flag to enable handling of OpenMP directives. Note that Intel, PGI, and GNU compilers use the -openmp, -mp, and -fopenmp flags, respectively.
  • Explicitly set the optimization level in your makefiles or compilation scripts. Most well written codes can safely use the highest optimization level (-O3), but many compilers set lower default levels (e.g. GNU compilers use the default -O0, which turns off all optimizations).
  • Turn off debugging, profiling, and bounds checking when building executables intended for production runs as these can seriously impact performance. These options are all disabled by default. The flag used for bounds checking is compiler dependent, but the debugging (-g) and profiling (-pg) flags tend to be the same for all major compilers.

Running Jobs on Regular Compute Nodes

Gordon uses the TORQUE Resource Manager, together with the Catalina Scheduler, to manage user jobs. If you are familiar with PBS, note that TORQUE is based on the original PBS project and shares most of its syntax and user interface. Whether you run in batch mode or interactively, you will access the compute nodes using the qsub command as described below. Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes. Gordon has two queues available:

Queue Name | Max Walltime | Max Nodes | Comments
normal     | 48 hrs       | 64        | Used for exclusive access to regular (non-vSMP) compute nodes
vsmp       | 48 hrs       | 1         | Used for shared access to the vSMP node

Submitting jobs

A job can be submitted using the qsub command, with the job parameters either specified on the command line or in a batch script. Except for simple interactive jobs, most users will find it more convenient to use batch scripts.  

$ qsub -l nodes=2:ppn=16:native,walltime=1:00:00 -q normal ./my_job.sh
$ qsub ./my_batch_script.sh

Batch script basics

TORQUE batch scripts consist of two main sections. The top section specifies the job parameters (e.g. resources being requested, notification details) while the bottom section contains user commands. A sample job script that can serve as a starting point for most users is shown below. Content that should be modified as necessary is between angled brackets (the brackets are not part of the code).

#!/bin/bash
#PBS -q normal
#PBS -l nodes=<2>:ppn=<16>:native
#PBS -l walltime=<1:00:00>
#PBS -N <jobname>
#PBS -o <my.out>
#PBS -e <my.err>
#PBS -A <abc123>
#PBS -M <email_address>
#PBS -m abe
#PBS -V
# Start of user commands - comments start with a hash sign (#)
cd $PBS_O_WORKDIR
<usercommands>

The first line indicates that the file is a bash script and can therefore contain any valid bash code. The lines starting with "#PBS" are special comments that are interpreted by TORQUE and must appear before any user commands.

The second line states that we want to run in the queue named "normal". Lines three and four define the resources being requested: 2 nodes with 16 processors per node, for one hour (1:00:00). The next three lines (5-7) are not essential, but using them will make it easier for you to monitor your job and keep track of your output. In this case, the job will appear as "jobname" in the queue; stdout and stderr will be directed to "my.out" and "my.err", respectively. The next line specifies that the usage should be charged to account abc123. Lines 9 and 10 control email notification: notices will be sent to the specified email address when the job aborts (a), begins (b), or ends (e).

Finally, "#PBS –V" specifies that your current environment variables should be exported to the job. For example, if the path to your executable is found in your PATH variable and your script contains the line "#PBS –V", then the path will also be known to the batch job.

The statement “cd $PBS_O_WORKDIR” changes the working directory to the directory where the job was submitted. This should always be done unless you provide full paths to all executables and input files. The remainder of the script is normally used to run your application.

Interactive jobs

There are several small but important differences between running batch and interactive jobs:

  1. Interactive jobs require the -I flag
  2. All node resource requests (specified using -l) must appear as a single, comma-separated list
  3. To ensure obtaining nodes with flash storage, the ":flash" specifier must be included in the node properties

The following command shows how to get interactive use of two nodes, with flash, for 30 minutes (note the comma between "flash" and "walltime"):

qsub -I -q normal -l nodes=2:ppn=16:native:flash,walltime=30:00 -A abc123

If your interactive job does not need the flash filesystem, you may also explicitly request the "noflash" property to utilize the subset of Gordon nodes without an SSD:

qsub -I -q normal -l nodes=2:ppn=16:native:noflash,walltime=30:00 -A abc123

If neither :flash nor :noflash is requested, the interactive job may land on a mix of nodes with and without SSDs. Note that this applies only to interactive jobs; jobs submitted via a batch script automatically receive the :flash property.

Monitoring and deleting jobs

Use the qstat command to monitor your jobs and qdel to delete a job. Some useful options are described below. For a more detailed understanding of the queues see the User Guide section Torque in Depth.
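For example (the job ID shown is illustrative):

  $ qstat -a -u $USER    # list the status of your jobs
  $ qdel 362256          # delete a job using its numeric job ID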

Running MPI jobs – regular compute nodes

MPI jobs are run using the mpirun_rsh command. When your job starts, TORQUE will create a PBS_NODEFILE listing the nodes assigned to your job, with each node name replicated ppn times. Typically ppn will be set equal to the number of physical cores on a node (16 for Gordon) and the number of MPI processes will be set equal to nodes x ppn. Relevant lines from a batch script are shown below.

#PBS -l nodes=2:ppn=16:native
cd $PBS_O_WORKDIR
mpirun_rsh -np 32 -hostfile $PBS_NODEFILE ./my_mpi_app  [command line args]

Sometimes you may want to run fewer than 16 MPI processes per node; for instance, if the per-process memory footprint is too large (> 4 GB on Gordon) to execute one process per core, or if you are running a hybrid application developed using MPI and OpenMP. In these cases you should use the ibrun command with the -npernode option instead:

# Running 16 MPI processes across two nodes
#PBS -l nodes=2:ppn=16:native
cd $PBS_O_WORKDIR
ibrun -npernode 8 ./my_mpi_app [command line args]

The ibrun command will automatically detect that your job requested nodes=2:ppn=16 but only place 8 processes on each of those two nodes.

Running OpenMP jobs – regular compute nodes

For an OpenMP application, set and export the number of threads. Relevant lines from a batch script are shown below.

#PBS -l nodes=1:ppn=16:native
cd /oasis/scratch/$USER/temp_project
export OMP_NUM_THREADS=16
./my_openmp_app [command line args]

Running hybrid (MPI + OpenMP) jobs – regular compute nodes

For hybrid parallel applications (MPI+OpenMP), use ibrun to launch the job and specify both the number of MPI processes and the number of threads per process. To pass environment variables through ibrun, list the key/value pairs before the ibrun command. Ideally, the product of the MPI process count and the threads per process should equal the number of physical cores on the nodes being used.

Hybrid jobs will typically use one MPI process per node with the threads per node equal to the number of physical cores per node, but as the examples below show this is not required.

# 2 MPI processes x 16 threads/process = 2 nodes x 16 cores/node = 32
#PBS -l nodes=2:ppn=1:native
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16
ibrun -npernode 1 -tpp 16 ./my_hybrid_app [command line args]

# 8 MPI processes x 4 threads/process = 2 nodes x 16 cores/node = 32
#PBS -l nodes=2:ppn=16:native
export OMP_NUM_THREADS=4
cd /oasis/scratch/$USER/temp_project
ibrun -npernode 4 -tpp 4 ./my_hybrid_app [command line args]

Network topology and communication sensitive jobs

The Gordon network topology is a 4x4x4 3D torus of switches with 16 compute nodes attached to each switch. For applications where the locality of the compute nodes is important, the user can control the layout of the job. Note that requesting a smaller number of maximum switch hops rather than relying on the default behavior may increase the length of time that a job waits in the queue. The maxhops option is specified in the job script as follows:

#PBS -v Catalina_maxhops=HOPS

Where HOPS is set to the integer values 0-3 or the string None. The behavior is described below:

   0: only nodes connected to the same switch will be used
   1: only nodes connected by at most one inter-switch link will be used
   2: only nodes connected by at most two inter-switch links will be used
   3: only nodes connected by at most three inter-switch links will be used
   None: any nodes may be used, regardless of switch proximity

To get nodes spanning the minimum number of hops for jobs of different sizes, use the following guidelines:

   <=16 nodes: Catalina_maxhops=0 (1 switch, default for jobs up to 16 nodes)
   <=32 nodes: Catalina_maxhops=1 (2 neighbor switches)
   <=64 nodes: Catalina_maxhops=2 (2x2 mesh of switches)
   <=128 nodes: Catalina_maxhops=3 (2x2x2 grid)

If you do not specify a Catalina_maxhops value for jobs larger than 16 nodes, your job will use the default of Catalina_maxhops=None and scatter your job across all of Gordon's 4x4x4 torus. The maximum distance between any two node pairs is 6 hops.
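For example, a 32-node job that should stay within two neighboring switches might include directives like these (queue and walltime values are illustrative):

#PBS -q normal
#PBS -l nodes=32:ppn=16:native
#PBS -l walltime=2:00:00
#PBS -v Catalina_maxhops=1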

Notes and hints

Try to provide a realistic upper estimate for the wall time required by your job. This will often improve your turnaround time, especially on a heavily loaded machine.

Storage Overview

SSD Scratch Space

Both the native (non-vSMP) compute nodes and the vSMP nodes on Gordon have access to fast flash storage. Most of the native compute nodes mount a single 300 GB SSD (280 GB usable space), but some of the native nodes, labeled with the "bigflash" property, have access to a 4.4 TB SSD filesystem. The vSMP nodes have access to one or more RAID 0 devices each built from sixteen 300 GB SSDs (4.4 TB usable space).  

The latency to the SSDs is several orders of magnitude lower than that of spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution, under the following directories:

Regular native compute nodes:

/scratch/$USER/$PBS_JOBID

Note that the regular compute nodes have 280 GB of usable space in the scratch SSD location. For jobs needing larger scratch space, the options are the "bigflash" compute nodes and the vSMP nodes.

Bigflash native compute nodes:

/scratch/$USER/$PBS_JOBID

Additionally, users will have to change the PBS node request line:

#PBS -l nodes=1:ppn=16:native:bigflash

If more than one bigflash node is needed for a single job, the following line also needs to be added since not all bigflash nodes share the same switch:

#PBS -v Catalina_maxhops=None

vSMP nodes:

/scratch1/$USER/$PBS_JOBID (16-way and larger)

/scratch2/$USER/$PBS_JOBID (32-way and larger)

For the vSMP nodes, /scratch is a symbolic link to /scratch1 and is provided to maintain a consistent interface to the SSDs across all SDSC resources that contain flash drives (Gordon and Trestles).

These directories are purged immediately after the job terminates, therefore any desired files must be copied to the parallel file system before the end of the job.
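For example, a job script might end with something like this (the results directory name is illustrative):

# Copy results from node-local SSD scratch to Lustre before the job ends
cp -r /scratch/$USER/$PBS_JOBID/results /oasis/scratch/$USER/temp_project/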

Noflash Nodes:

A small subset of Gordon nodes do not have a flash filesystem. If your job does not require the use of SSDs, you may wish to explicitly request the "noflash" property when submitting your job (e.g., #PBS -l nodes=1:ppn=16:native:noflash) to prevent it from having to wait for flash-equipped compute nodes to become available. By default, all non-interactive jobs go to nodes with SSD scratch space, and the noflash nodes are only allocated when explicitly requested.

Parallel Lustre File System

SDSC provides several Lustre-based parallel file systems. These include a shared Projects Storage area that is mounted on both Gordon and Trestles plus separate scratch file systems. Collectively known as Data Oasis, these file systems are accessed through 64 Object Storage Servers (OSSs) and have a capacity of 4 PB.

Lustre Project Space

SDSC Projects Storage has 1.4 PB of capacity and is available to all allocated users of Trestles and Gordon. It is accessible at:

/oasis/projects/nsf/<allocation>/<user>

where <allocation> is your 6-character allocation name, found by running show_accounts. The default allocation is 500 GB of Project Storage to be shared among all users of a project. Projects that require more than 500 GB of Project Storage must request additional space via the XSEDE POPS system in the form of a storage allocation request (for new allocations) or supplement (for existing allocations).

Project Storage is not backed up and users should ensure that critical data are duplicated elsewhere. Space is provided on a per-project basis and is available for the duration of the associated compute allocation period. Data will be retained for 3 months beyond the end of the project, by which time data must be migrated elsewhere.

Lustre Scratch Space

The Gordon scratch space has a capacity of 1.6 PB and is currently configured as follows:

/oasis/scratch/$USER/$PBS_JOBID
/oasis/scratch/$USER/temp_project

Users cannot write directly to the /oasis/scratch/$USER directory and must instead write to one of the two subdirectories listed above.

The former is created at the start of a job and should be used for applications that require a shared scratch space or need more storage than can be provided by the flash disks. Because users can only access this scratch space after the job starts, executables and other input data must be copied here from the user's home directory from within a job's batch script. Unlike SSD scratch storage, the data stored in Lustre scratch is not purged immediately after job completion. Instead, users will have time after the job completes to copy back data they wish to retain to their projects directories, home directories, or to their home institution.

The latter is intended for medium-term storage outside of running jobs, but is subject to purge with a minimum of five days notice if it begins to approach capacity.

Note that both directories ($PBS_JOBID and temp_project) are part of the same file system and therefore served by the same set of OSSs. User jobs can read or write to either location with the same performance. To avoid the overhead of unnecessary data movement, read directly from temp_project rather than copying to the $PBS_JOBID directory. Similarly, files that should be retained after completion of a job should be written directly to temp_project.
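For example, the relevant lines of a batch script might look like this (application and file names are illustrative):

cd /oasis/scratch/$USER/$PBS_JOBID
cp $HOME/my_app .     # stage the executable from the home directory
# Read input from, and write retained output to, temp_project directly
./my_app /oasis/scratch/$USER/temp_project/input.dat \
         /oasis/scratch/$USER/temp_project/output.dat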

Home File System

After logging in, users are placed in their home directory, /home/$USER, also referenced by the environment variable $HOME. The home directory is limited in space and should be used only for source code storage. Jobs should never be run from the home file system, as it is not set up for high performance throughput. Users should keep usage on $HOME under 100 GB. Backups are retained on a rolling 8-week cycle. In case of file corruption or data loss, please contact us at help@xsede.org to retrieve the requested files.
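To check your current usage, you can run, for example:

  $ du -sh $HOME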

Using the vSMP Nodes

Most applications will be run on the regular compute nodes, but in certain instances you will want to use the large-memory vSMP nodes:

  • Serial or threaded applications requiring more than 64 GB of memory
  • MPI applications requiring more than 64 GB per process
  • Applications that can take advantage of a RAM file system (ramfs)

Before reading this section, we suggest that you first become familiar with the user guide sections that cover compiling and running jobs.

Compilation instructions are the same for the regular compute nodes and the vSMP nodes, although ScaleMP’s specially tuned version of MPICH2 should be used when building MPI applications.

Jobs are submitted to the vSMP nodes using TORQUE and, with the exception of specifying a different queue name, all of the instructions given in the user guide sections describing the batch submission system still apply. In the remainder of this section, we describe the additional steps needed to make effective use of the vSMP nodes. Note that we do not provide complete scripts below, but rather just the critical content needed to run under vSMP.

Memory and core usage

Since multiple user jobs can be run simultaneously on a single vSMP node, there has to be a mechanism in place to ensure that jobs do not use more than their fair share of the memory. In addition, jobs should run on distinct sets of physical nodes to prevent contention for resources. Our approach is to implement the following policies:

  • Cores are requested in units of 16
  • Memory is allocated in proportion to the number of cores requested

Each physical node has 64 GB of memory, but after accounting for the vSMP overhead the amount available for user jobs is closer to 60 GB. A serial job requiring 120 GB of memory should request 32 cores.

  #PBS -l nodes=1:ppn=32:vsmp

OpenMP jobs

The example below shows a partial batch script for an OpenMP job that will use 16 threads. Note that the queue name is set to vsmp.

  #PBS -q vsmp 
  #PBS -l nodes=1:ppn=16:vsmp
  export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
  export LD_PRELOAD=/usr/lib/libhoard.so:$LD_PRELOAD
  export PATH=/opt/ScaleMP/numabind/bin:$PATH
  export OMP_NUM_THREADS=16
  export KMP_AFFINITY=compact,verbose,0,`numabind --offset 16`
  ./openmp-app

The first export statement preloads the vSMP libraries needed for an application to run on the vSMP nodes. The second export statement is not strictly required, but is suggested since using the Hoard library can lead to improved performance particularly when multiple threads participate in dynamic memory management. The third export statement prepends the PATH variable with the location of the numabind command, while the fourth sets the number of OpenMP threads.

The final export statement requires a more in depth explanation and contains the “magic” for running on a vSMP node. Setting the KMP_AFFINITY variable determines how the threads will be mapped to compute cores. Note that it is used only for OpenMP jobs and will not affect the behavior of pThreads codes.

The assignment maps the threads compactly to cores, as opposed to spreading out across cores, provides verbose output (will appear in the job’s stderr file), and starts the mapping with the first core in the set. The numabind statement looks for an optimal set of 16 contiguous cores, where optimal typically means that the cores span the smallest number of physical nodes and have the lightest load. Enclosing within the back ticks causes the numabind output to appear in place as the final argument in the definition.

For most users, this KMP_AFFINITY assignment will be adequate, but be aware that there are many options for controlling the placement of OpenMP threads. For example, to run on 16 cores evenly spaced and alternating over 32 contiguous cores, set the ppn and offset values to 32, the number of threads to 16, and the affinity type to scatter. Relevant lines are shown below.

  #PBS -l nodes=1:ppn=32:vsmp   
  export OMP_NUM_THREADS=16  
  export KMP_AFFINITY=scatter,verbose,0,`numabind --offset 32`

pThreads jobs

As mentioned in the previous section, setting the KMP_AFFINITY variable has no effect on pThreads jobs. Instead you will need to create an additional configuration file with all of the contents appearing on a single line. It is absolutely critical that the executable name assigned to the pattern is the same as that used in the script.

  $ cat config  

(output shown on two lines for clarity)
  name=myconfig pattern=pthreads-app verbose=0 process_allocation=multi   
                  task_affinity=cpu rule=RULE-procgroup.so flags=ignore_idle  

The sleep statement in the following batch script ensures that enough time has elapsed to create all of the threads before they are bound to cores by numabind.

  #PBS -q vsmp  
  #PBS -l nodes=1:ppn=16:vsmp
  export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
  export LD_PRELOAD=/usr/lib/libhoard.so:$LD_PRELOAD
  export PATH=/opt/ScaleMP/numabind/bin:$PATH
  ./pthreads-app &
  sleep 10
  numabind --config config 

Serial jobs

Serial job submission is very similar to that for OpenMP jobs, except that the taskset command is used together with numabind to bind the process to a core. Even though a single core will be used for the computations, remember to request a sufficient number of cores to obtain the memory required (~60 GB per 16 cores).
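For example, a serial job that needs roughly 200 GB of memory should request 64 cores (4 x 16 cores, giving access to about 240 GB).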

  #PBS -q vsmp  
  #PBS -l nodes=1:ppn=16:vsmp 
  export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
  export PATH=/opt/ScaleMP/numabind/bin:$PATH
  taskset -c `numabind --offset 1` ./scalar-app

MPI jobs

If you require less than 64 GB of memory per MPI process, we suggest that you run on the regular compute nodes. While ScaleMP does provide a specially tuned version of the MPICH2 library designed to achieve optimal performance under vSMP, the vSMP foundation software does add a small amount of unavoidable communications latency (applies not just to vSMP, but any hypervisor layer).

Before compiling your MPI executable to run under vSMP, swap out the mvapich2_ib module for mpich2_ib:

  $ module swap mvapich2_ib mpich2_ib
 

To run an MPI job, where the number of processes equals the number of compute cores requested, use the following options. Note that VSMP_PLACEMENT is set to PACKED, indicating that the MPI processes will be mapped to a set of contiguous compute cores.

  #PBS -q vsmp  
  #PBS -l nodes=1:ppn=32:vsmp
  export PATH=/opt/ScaleMP/mpich2/1.3.2/bin:$PATH
  export VSMP_PLACEMENT=PACKED 
  export VSMP_VERBOSE=YES 
  export VSMP_MEM_PIN=YES 
  vsmputil --unpinall 
  time mpirun -np 32 ./mpitest 

The situation is slightly more complicated if the number of MPI processes is smaller than the number of requested cores. Typically you will want to spread the MPI processes evenly across the requested cores (e.g. when running 2 MPI processes across 64 cores, the processes will be mapped to cores 1 and 33). In the example below the placement has been changed from “PACKED” to “SPREAD^2^32”. The SPREAD keyword indicates that the processes should be spread across the cores and “^2^32” is a user defined topology of 2 nodes with 32 cores per node.

  #PBS -q vsmp  
  #PBS -l nodes=1:ppn=64:vsmp
  export PATH=/opt/ScaleMP/mpich2/1.3.2/bin:$PATH
  export VSMP_PLACEMENT=SPREAD^2^32 
  export VSMP_VERBOSE=YES 
  export VSMP_MEM_PIN=YES 
  vsmputil --unpinall 
  mpirun -np 2 ./mpitest 
 

Note, hints, and additional resources

We have only touched on the basics required for job submission. For additional information see the following external resources:

Thread Affinity Interface (KMP_AFFINITY)

The Hoard Memory Allocator

Torque in Depth

This section takes a closer look at some of the more useful TORQUE commands. Basic information on job submission can be found in the User Guide section Running Jobs. For a comprehensive treatment, please see the official TORQUE documentation and the man pages for qstat, qmgr, qalter, and pbsnodes. We also describe several commands that are specific to the Catalina scheduler and not part of the TORQUE distribution.

Listing jobs (qstat -a)

Running the qstat command without any arguments shows the status of all batch jobs. The -a flag is suggested since this provides additional information such as the number of nodes being used and the required time. The following output from qstat -a has been edited slightly for clarity.

  $ qstat -a
                                         Req'd  Req'd  Elap
  Job ID   Username Queue   Jobname NDS  Memory Time S Time
  -------- -------- ------  ------- --- ------ ----- - -----
  1.gordon user1    normal  g09       1    1gb 26:00 R 23:07
  2.gordon user2    normal  task1    32    --  48:00 R 32:15
  3.gordon user2    normal  task2     2    1gb 24:00 H   --
  4.gordon user3    normal  exp17     1    --  12:00 C 01:32
  5.gordon user3    normal  stats1    8    --  12:00 Q   --
  6.gordon user3    normal  stats2    8    --  12:00 E 15:27

The output is mostly self-explanatory, but a few points are worth mentioning. The Job ID listed in the first column will be needed if you want to alter, delete, or obtain more information about a job.  On SDSC resources, only the numeric portion of the Job ID is needed. The queue, number of nodes, wall time, and required memory specified in your batch script are also listed. For jobs that have started running, the elapsed time is shown. The column labeled "S" lists the job status.

R = running
Q = queued
H = held
C = completed after having run
E = exiting after having run

Jobs can be put into a held state for a number of reasons including job dependencies (e.g. task2 cannot start until task1 completes) or a user exceeding the number of jobs that can be in a queued state.

On a busy system, the qstat output can get to be quite long. To limit the output to just your own jobs, use the -u $USER option:

  $ qstat -a -u $USER
                                         Req'd  Req'd  Elap
  Job ID   Username Queue   Jobname NDS  Memory Time S Time
  -------- -------- ------  ------- --- ------ ----- - -----
  2.gordon user2    normal  task1    32    --  48:00 R 32:15
  3.gordon user2    normal  task2     2    1gb 24:00 H   --

Detailed information for a job (qstat -f jobid)

Running qstat -f jobid provides the full status for a job. In addition to the basic information listed by qstat -a, this includes the job’s start time, compute nodes being used, CPU and memory usage, and account being charged.

Nodes allocated to a job (qstat -n jobid)

To see the list of nodes allocated to a job, use qstat -n. Note that this output doesn’t reflect actual usage, but rather the resources that had been requested. Knowing where your job is running is valuable information since you’ll be able to access those nodes for the duration of your job to monitor processes, threads, and resource utilization.

Altering job properties (qalter)

The qalter command can be used to modify the properties of a job. Note that the modifiable attributes will depend on the job state (e.g. number of nodes requested cannot be changed after a job starts running). See the qalter man page for more details.

  $ qstat -a 8
                                         Req'd  Req'd  Elap
  Job ID   Username Queue   Jobname NDS  Memory Time S Time
  -------- -------- ------  ------- --- ------ ----- - -----
  8.gordon user4    normal  task1    32    --  10:00 R  6:15
  
  $ qalter -l walltime=9:00 8
  $ qstat -a 8
                                         Req'd  Req'd  Elap
  Job ID   Username Queue   Jobname NDS  Memory Time S Time
  -------- -------- ------  ------- --- ------ ----- - -----
  8.gordon user2    normal  task1    32    --  09:00 R  6:15

Specifying job dependencies

In some cases, you may wish to submit multiple jobs that have dependencies. For example, a number of mutually independent tasks must be completed before a final application is run. This can be easily accomplished using the depend attribute for qsub. In the example below, the job described in pbs_script_d is not allowed to start until jobs 273, 274 and 275 have terminated, with or without errors.

qsub -W depend=afterany:273:274:275 pbs_script_d

While this is useful, it requires the user to manually submit the jobs and construct the job list. To easily automate this process, capture the jobids using back ticks when the earlier jobs are submitted (recall that qsub writes the jobid to stdout). Using the previous example and assuming that the earlier jobs were launched using the scripts pbs_script_[abc], we can do the following:

JOBA=`qsub pbs_script_a`
JOBB=`qsub pbs_script_b`
JOBC=`qsub pbs_script_c`
qsub -W depend=afterany:$JOBA:$JOBB:$JOBC pbs_script_d

This is only a very brief introduction to controlling job dependencies. The depend attribute provides a rich variety of options to launch jobs before or after other jobs have started or terminated in a specific way. For more details, see the additional attributes section of the qsub User Guide.

Node attributes (pbsnodes -a)

The full set of attributes for the nodes can be listed using pbsnodes -a. To limit output to a single node, provide the node name.

Node states (pbsnodes -l)

Running pbsnodes -l lists nodes that are unavailable to the batch system (e.g. have a status of “down”, “offline”, or “unknown”). To see the status of all nodes, use pbsnodes -l all. Note that nodes with a “free” status are not necessarily available to run jobs. This status applies both to nodes that are idle and to nodes that are running jobs using a ppn value smaller than the number of physical cores.

Monitoring Your Job

In this section, we describe some standard tools that you can use to monitor your batch jobs. We suggest that you at least familiarize yourself with the section of the user guide that deals with running jobs to get a deeper understanding of the batch queuing system before starting this section.

Figuring out where your job is running

Using the qstat -n jobid command, you’ll be able to see a list of nodes allocated to your job. Note that this output doesn’t reflect actual usage, but rather the resources that had been requested. Knowing where your job is running is valuable information since you’ll be able to access those nodes for the duration of your job to monitor processes, threads, and resource utilization.

The output from qstat -n shown below lists the nodes that have been allocated for job 362256. Note that we have 16 replicates of the node gcn-5-22, each followed by a number corresponding to a compute core on the node.

[sinkovit@gordon-ln1 ~]$ qstat -n 362256
gcn-5-22/15+gcn-5-22/14+gcn-5-22/13+gcn-5-22/12
+gcn-5-22/11+gcn-5-22/10+gcn-5-22/9+gcn-5-22/8
+gcn-5-22/7+gcn-5-22/6+gcn-5-22/5+gcn-5-22/4
+gcn-5-22/3+gcn-5-22/2+gcn-5-22/1+gcn-5-22/0

Connecting to a compute node

Under normal circumstances, users cannot login directly to the compute nodes. This is done to prevent users from bypassing the batch scheduler and possibly interfering with other jobs that may be running on the nodes.

  $ ssh gcn-5-22
  Connection closed by 10.1.254.187

Access is controlled by the contents of the /etc/security/access.conf file, which for idle nodes denies access to everyone except root and other administrative users.

  $ cat /etc/security/access.conf 
  -:ALL EXCEPT root diag :ALL

The situation is different though once a batch job transitions from the queued state to the run state. A special script known as the prologue makes a number of changes to the compute environment, including altering the access.conf file so that the owner of the job running on the node can login.

  $ cat /etc/security/access.conf 
  -:ALL EXCEPT root diag username:ALL 

Once the job terminates, a corresponding script known as the epilogue undoes the changes made by the prologue. If you happen to be logged in to a compute node when your job terminates, your session will be terminated.

Static view of threads and processes with ps

Once you log in to a compute node, the ps command can provide information about current processes. This is useful if you want to confirm that you are running the expected number of MPI processes, that all MPI processes are in the run state, and that processes are consuming relatively equal amounts of CPU time. The ps command has a large number of options, but two that are suggested are -l (long output) and -u user (restrict output to processes owned by the given username or user ID).

In the following example, we see that the user is running 16 MPI processes, named pmemd.MPI in the CMD column, on the compute node gcn-5-22. The TIME column shows that the processes are all using very nearly the same amount of CPU time and the S (state) column shows that these are all in the run (R) state. Not all applications will exhibit such ideal load balancing and processes may sometimes go into a sleep state (D or S) while performing I/O or waiting for an event to complete.

  [diag@gcn-5-22 ~]$ ps -lu username
  F S   UID   PID  PPID  C PRI  NI ADDR SZ  WCHAN  TTY    TIME CMD
  5 S 503412 23686 23684  0 76   0 - 23296  stext  ?  00:00:00 sshd
  0 S 503412 23689 23686  0 78   0 - 16473  wait   ?  00:00:00 bash
  0 S 503412 23732 23689  0 75   0 -  6912  stext  ?  00:00:00 mpispawn
  0 R 503412 23735 23732 99 85   0 - 297520 stext  ?  12:40:50 pmemd.MPI
  0 R 503412 23736 23732 99 85   0 - 213440 stext  ?  12:40:55 pmemd.MPI
  0 R 503412 23737 23732 99 85   0 - 223276 stext  ?  12:40:54 pmemd.MPI
  0 R 503412 23738 23732 99 85   0 - 255149 stext  ?  12:40:55 pmemd.MPI
  [ --- 12 additional lines very similar to previous line not shown --- ]
 

The ps command with the m (not -m) option gives thread-level information, with the summed CPU usage for the threads in each process followed by per-thread usage. In the examples below, we first run ps to see the five MPI processes, then use the m option to see the six threads per process.

  [diag@gcn-4-28 ~]$ ps -lu cipres # Processes only
  F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY    TIME CMD
  0 R 503187 22355 22354 99 85   0 - 245402 stext ?  06:05:46 raxmlHPC
  0 R 503187 22356 22354 99 85   0 - 246114 stext ?  06:05:47 raxmlHPC
  0 R 503187 22357 22354 99 85   0 - 245325 stext ?  06:05:47 raxmlHPC
  0 R 503187 22358 22354 99 85   0 - 222089 stext ?  06:05:47 raxmlHPC
  0 R 503187 22359 22354 99 85   0 - 245630 stext ?  06:05:47 raxmlHPC
 
  [diag@gcn-4-28 ~]$ ps m -lu cipres # Processes and threads
  F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY    TIME CMD
  0 - 503187 22355 22354 99  -   - - 245402 -     ?    402:38 raxmlHPC
  0 R 503187    -     - 99  85   0 -     - stext  -     67:06 -
  1 R 503187    -     - 99  85   0 -     - stext  -     67:06 -
  1 R 503187    -     - 99  85   0 -     - stext  -     67:06 -
  1 R 503187    -     - 99  85   0 -     - stext  -     67:06 -
  1 R 503187    -     - 99  85   0 -     - stext  -     67:06 -
  1 R 503187    -     - 99  85   0 -     - stext  -     67:06 -
  [ --- similar output for other processes/threads not shown --- ]
 

Dynamic view of threads and processes with top

While ps gives a static view of the processes, top can be used to provide a dynamic view. By default the output of top is updated every three seconds and lists processes in order of decreasing CPU utilization. Another advantage of top is that it displays the per-process memory usage thereby providing a very easy way to confirm that your job is not exceeding the available physical memory on a node.

In the first example below, the CPU usage is around 600% for all five processes, indicating that each process has spawned multiple threads. We can confirm this by enabling top to display thread-level information with the -H option. (This option can either be specified on the top command line or toggled by pressing H while top is running.)

  [diag@gcn-4-28 ~]$ top # Processes only
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  22356 cipres    25   0  961m 251m 2316 R 600.0  0.4 481:32.53 raxmlHPC
  22359 cipres    25   0  959m 252m 2312 R 600.0  0.4 481:31.90 raxmlHPC                                                               
  22355 cipres    25   0  958m 256m 7448 R 599.6  0.4 481:31.55 raxmlHPC                                                            
  22357 cipres    25   0  958m 249m 2300 R 599.6  0.4 481:32.17 raxmlHPC                                                            
  22358 cipres    25   0  867m 249m 2312 R 599.6  0.4 481:32.42 raxmlHPC
  [ --- Additional lines not shown ---]

  [diag@gcn-4-28 ~]$ top -H # Processes and threads
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  22355 cipres    25   0  958m 256m 7448 R 100.1  0.4  82:35.68 raxmlHPC                                                              
  22387 cipres    25   0  958m 256m 7448 R 100.1  0.4  82:35.70 raxmlHPC                                                              
  22388 cipres    25   0  958m 256m 7448 R 100.1  0.4  82:35.93 raxmlHPC                                                               
  22375 cipres    25   0  961m 252m 2316 R 100.1  0.4  82:36.03 raxmlHPC                                                               
  22377 cipres    25   0  961m 252m 2316 R 100.1  0.4  82:36.04 raxmlHPC 
  [ --- output for other 9 threads and other lines not shown ---] 
 

Additional Resources

We have not covered all of the options and capabilities of qstat, ps, and top. For more information consult the man pages for these commands.

Dedicated I/O Nodes (Gordon ION)

A limited number of Gordon’s 64 flash-based I/O nodes are available as dedicated resources. The PI and designated users of a dedicated I/O node have exclusive use of the resource at any time for the duration of the award without having to go through the batch scheduler. Any data stored on the nodes is persistent (but not backed up) and can only be removed by the users of the project. These nodes are referred to as the "Gordon ION" and must be requested separately from the regular compute cluster.

Each I/O node contains two 6-core 2.67 GHz Intel Westmere processors, 48 GB of DDR3-1333 memory, and sixteen 300 GB Intel 710-series solid-state drives. The drives will typically be configured as a single RAID 0 device with 4.4 TB of usable space. However, this default configuration can be changed based on the needs of the project and in collaboration with SDSC staff. Each of the I/O nodes can mount both the NFS home directories and Data Oasis, SDSC’s 4 PB Lustre-based parallel file system, via two 10 GbE connections. The aggregate sequential bandwidth for the 16 SSDs within an I/O node was measured to be 4.3 GB/s and 3.4 GB/s for reads and writes, respectively. The corresponding random performance is 600 KIOPS for reads and 37 KIOPS for writes.

Recommended Use

Gordon ION awards are intended for projects that can specifically benefit from persistent, dedicated access to the flash storage. Applications that only need temporary access to the flash storage (e.g. fast scratch space, staging of data sets that will be accessed multiple times within a batch job) can do so via a regular Gordon compute cluster award.

Gordon ION is particularly well suited to database and data-mining applications where high levels of I/O concurrency exist, or where the I/O is dominated by random access data patterns. The resource should be of interest to those who, for example, want to provide a high-performance query engine for scientific or other community databases. Consequently, Gordon ION allocations are for long-term, exclusive, and dedicated use of the awardee.

SDSC staff can help you configure the I/O node based on your requirements. This includes the installation of relational databases (MySQL, PostgreSQL, DB2) and Apache web services.

Allocations

Dedicated Gordon I/O nodes are only available through the XSEDE startup allocations process. Normally a single I/O node will be awarded, but two I/O nodes can be awarded for exceptionally well-justified requests. Projects that also have accompanying heavy computational requirements can ask for up to 16 compute nodes for each I/O node.

Successful allocation requests must describe how you will make use of the I/O nodes. This should include relevant benchmarks on spinning disks, with projections of how the applications will scale when using flash drives. Additionally, the request should include a strong justification for why these should be provided as a dedicated resource: for example, providing long-term access to data for a large community via a Science Gateway.

Gordon’s flash-based I/O nodes are a unique resource within XSEDE, and we encourage you to contact SDSC staff to discuss the use of this resource before you submit your allocation request. SDSC applications staff will be able to help you determine whether this resource is a good match for your project and, subsequently, provide assistance in getting started with your project.