Prerequisites
- The local schedulers must support user-settable reservations
- The user must have a login at each machine
- The user's home directory must be shared among the login and compute nodes of each cluster
- MPICH must be installed on each cluster, built with the ch_p4 device
Usage
gur.py --submit --jobfile=<jobfile>
gur.py --cancel --metajobfile=<metajobfile>
gur.py --run --metajobfile=<metajobfile>
Flow
- Administrator configures GUR to talk to each cluster (see config file)
- User generates job file describing how the grid job should be run (see job file)
- User submits GUR job
- GUR sets up user's ssh, if necessary
- GUR sets up user's ssh access to each cluster, if necessary
- GUR creates synchronized user reservations at all clusters
- GUR stages executable
  - binary copy
  - remote compile
  - leave it to user
- GUR creates placeholder batch jobs to run within reservations
  - sleep
  - MPICH serv_p4 (if serv_p4)
- GUR binds placeholder job to user reservation
- GUR binds reservation to placeholder job
- GUR retrieves node list for each user reservation
- User runs GUR job
- GUR sets up MPICH
  - GUR creates procgroup file from reservation node lists
  - GUR adds the executable path to the ~/.server_apps file on each cluster (if serv_p4)
- GUR sleeps until run time
- GUR runs a test MPICH job
- GUR runs real MPICH job
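
The procgroup step in the flow above can be sketched as follows. This is a minimal illustration, not GUR's actual code: the function name make_procgroup, the dictionary layout, and the host names and paths are all assumptions. It relies only on the ch_p4 procgroup layout, where the first line is "local N" and each following line is "host nprocs /full/path/to/executable". A host that appears more than once in a node list is assumed to contribute one process per appearance.

```python
def make_procgroup(cluster_nodes, cluster_exe, local_extra=0):
    """Build ch_p4 procgroup text from per-cluster reservation node lists.

    cluster_nodes: dict mapping a cluster name to its list of node host names
                   (hypothetical layout; a repeated host means one process each time)
    cluster_exe:   dict mapping a cluster name to the staged executable path
    local_extra:   extra processes to start on the launching host ("local N")
    """
    lines = [f"local {local_extra}"]
    for cluster, nodes in cluster_nodes.items():
        # Count processes per host while keeping first-seen order.
        counts = {}
        for host in nodes:
            counts[host] = counts.get(host, 0) + 1
        for host, nprocs in counts.items():
            lines.append(f"{host} {nprocs} {cluster_exe[cluster]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical clusters, nodes, and paths, for illustration only.
    nodes = {"clusterA": ["a-n1", "a-n1", "a-n2"], "clusterB": ["b-n7"]}
    exes = {"clusterA": "/home/user/bin/app", "clusterB": "/home/user/bin/app"}
    print(make_procgroup(nodes, exes), end="")
```

The sketch assumes the job is launched from a host outside the listed nodes. If the launching host is also one of the reserved nodes, its process count would need adjusting, since the "local" line already starts one process there.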
Examples
Demo runs
Submit output
Filtered Submit output
Run output
Filtered Run output
Job with existing reservation
Issues
- Integration with parallel environment
- MPICH can be compiled to use a wrapper around ssh. The wrapper can be told to run ssh in BatchMode and to skip StrictHostKeyChecking.
- MPICH built with -with-comm=shared is broken on our Linux kernel (2.4.18-17.7.x)
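
The ssh wrapper mentioned in the issues above might look roughly like the sketch below. It is an assumption about the wrapper's shape, not the wrapper GUR actually uses: the names SSH_OPTIONS, build_ssh_argv, and run_ssh are made up for illustration. BatchMode and StrictHostKeyChecking are standard OpenSSH client options. How MPICH is pointed at the wrapper is not shown here.

```python
import os
import sys

# OpenSSH client options: never prompt for a password or passphrase, and
# do not stop to ask about unknown host keys.
SSH_OPTIONS = ["-o", "BatchMode=yes", "-o", "StrictHostKeyChecking=no"]


def build_ssh_argv(args, ssh="ssh"):
    """Return the ssh command line with the non-interactive options prepended."""
    return [ssh] + SSH_OPTIONS + list(args)


def run_ssh(args):
    """Replace the current process with ssh, passing through the caller's arguments."""
    argv = build_ssh_argv(args)
    os.execvp(argv[0], argv)


# Only hand off to ssh when arguments were given; with none there is nothing to run.
if __name__ == "__main__" and len(sys.argv) > 1:
    run_ssh(sys.argv[1:])
```

Skipping StrictHostKeyChecking avoids interactive prompts on first contact with a node, at the cost of not verifying host keys, so this suits a trusted cluster network only.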
To Do
- For serv_p4 jobs, an entry for the client workstation needs to be added to .rhosts