DEVELOPERS' CORNER draft

Parallel Processing Expectations on the new Cray T3E machine (or some other title...)

As SDSC prepares to replace its current Cray T3D parallel supercomputer with a more powerful Cray T3E that has twice as many processors (256 processors), higher I/O capacity, and more memory and disk space, we would like to take this opportunity to reflect on some of the lessons learned.

A recent computational study comparing a genetic variation simulation on the Cray C90 and the Cray T3D reveals some useful trends and rules of thumb that should carry over to the T3E in an even more dramatic fashion.

When trying to choose "an optimal" high-performance architecture to solve one's scientific problem, many factors usually come into play. Cost, turnaround time (including time spent waiting in queues before executing), and the amount of effort required to parallelize a piece of code for a new architecture are usually foremost on a researcher's mind. Evaluating the expected tradeoffs is never a trivial task and, more often than not, a luxury that is difficult to afford. Consulting computer benchmarks can give a reasonable estimate of computing speed, but benchmarks cannot reflect the load at a particular site.

The following paragraphs discuss computing expansion factors for the genetic variation simulation to better help with architecture evaluation. The expansion factor compares the actual turnaround time to that which would be expected on a dedicated single CPU Cray system:

EF = (Queue time + Running time) / (User time + System time + I/O time)

where:
	Queue time   = time spent in a queue
	Running time = time spent in run state
	User time    = time spent performing numerical calculations
	System time  = time spent processing system calls
	CPU time     = User time + System time
	I/O time     = time spent waiting for I/O completion

C90 runs

In the case of our simulation program, a single-CPU run on the Cray C90 in the small 2 MWords memory queue produced the following results (I/O time was negligible):

Queue name  Queue time  Running time  CPU time   EF   Cost (miniSUs)
----------  ----------  ------------  ---------  ---  --------------
low         19.9 hours  10.3 hours    9.5 hours  3.2   4.25
medium       7.3 hours  10.5 hours    9.5 hours  1.9   8.49
high         0.8 hours  10.2 hours    9.5 hours  1.2  15.34

An expansion factor of 3.2 means, roughly, that a job needing 10 CPU hours would take 32 wall-clock hours to complete. With an EF of 1.2, the same job would need only 12 wall-clock hours, but we would be charged 3.6 times as much.
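
As a quick check of the arithmetic, the short Python sketch below recomputes the expansion factors and the high/low cost ratio from the C90 table. It is only an illustration: the function name `expansion_factor` and the variable names are ours, and I/O time is assumed to be zero, as stated for these runs.

```python
def expansion_factor(queue_h, running_h, cpu_h, io_h=0.0):
    """EF = (Queue time + Running time) / (CPU time + I/O time).

    All arguments must be in the same unit (hours here).
    CPU time stands for User time + System time.
    """
    return (queue_h + running_h) / (cpu_h + io_h)


# (queue name, queue hours, running hours, CPU hours, cost in miniSUs),
# copied from the C90 table; I/O time is assumed negligible (0.0).
c90_runs = [
    ("low", 19.9, 10.3, 9.5, 4.25),
    ("medium", 7.3, 10.5, 9.5, 8.49),
    ("high", 0.8, 10.2, 9.5, 15.34),
]

efs = {run[0]: expansion_factor(run[1], run[2], run[3]) for run in c90_runs}
costs = {run[0]: run[4] for run in c90_runs}

for name in efs:
    print(f"{name:6s}  EF = {efs[name]:.1f}  cost = {costs[name]:.2f} miniSUs")

# How much more the high priority queue costs than the low priority queue.
cost_ratio = costs["high"] / costs["low"]
print(f"high/low cost ratio = {cost_ratio:.1f}")
```

The same function can be applied to one's own job accounting figures to compare queues before committing to the more expensive one.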

Note that an earlier run of the same problem, made when many more jobs had been submitted to the small 2 MWords memory queue, produced a Queue time for the low priority queue of 53 hours, which gave us an EF of 32!

T3D runs

Hard-to-predict queue time variations such as this one are why researchers who are pressed for time will frequently run in higher-priority, more costly queues. They are also a good reason to explore reducing expansion factors by adapting simulation codes to run on parallel platforms.

This is what we did next: the Fortran 77 code was slightly modified for the Cray T3D using the CRAFT (Cray Research Adaptive Fortran) programming model, an extension of Fortran 77 with several Fortran 90 features for designing parallel programs. Using Fortran 90 directly, PVM, or even MPI would have been equivalent and equally straightforward, since the simulation package we ported contains highly parallelizable, independent loops that require minimal interprocessor synchronization.

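On parallel systems, expansion factors of less than 1 are possible, because the wall-clock running time shrinks as processors are added while the CPU time it is compared against does not. The results below are for the small queue. Queue times were negligible and are therefore not reported.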

# Procs   CPU time  Running time  EF      Cost (miniSUs)
--------  --------  ------------  ------  --------------
    2     20.9 min  10.1 min      0.5     0.52
    4     20.8 min   5.2 min      0.25    0.52
    8     22.9 min   2.9 min      0.13    0.57
   16     24.7 min   1.5 min      0.07    0.62
   32     24.9 min   0.8 min      0.04    0.62
   64     33.7 min   0.5 min      0.02    0.84
  128     35.3 min   0.3 min      0.01    0.88

Costs in miniSUs were computed as:

miniSU = (Running time * # Procs) / 40

The factor of 40 is a fairly generous CPU speed conversion ratio between the C90 and the T3D, used to normalize charges across machines.
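
The charging rule can be sketched in a few lines of Python. This is an illustration under one assumption of ours: the "CPU time" column of the T3D table is taken to be Running time * # Procs in minutes before rounding (the rounded running times in the table would give slightly different costs, for example 10.1 * 2 / 40 = 0.505). The names `t3d_cost_minisu` and `C90_TO_T3D_RATIO` are hypothetical.

```python
# Conversion ratio between C90 and T3D CPU speed quoted in the text.
C90_TO_T3D_RATIO = 40


def t3d_cost_minisu(charged_pe_minutes, ratio=C90_TO_T3D_RATIO):
    """Cost in miniSUs for a given number of charged PE-minutes.

    charged_pe_minutes is assumed to be Running time * # Procs, which we
    take to be what the "CPU time" column of the T3D table reports.
    """
    return charged_pe_minutes / ratio


# number of processors -> (CPU time column in minutes, cost column in miniSUs)
t3d_runs = {
    2: (20.9, 0.52),
    4: (20.8, 0.52),
    8: (22.9, 0.57),
    16: (24.7, 0.62),
    32: (24.9, 0.62),
    64: (33.7, 0.84),
    128: (35.3, 0.88),
}

for procs, (cpu_min, table_cost) in t3d_runs.items():
    cost = t3d_cost_minisu(cpu_min)
    print(f"{procs:4d} PEs  computed = {cost:.4f}  table = {table_cost:.2f}")
```

Under that assumption, the computed costs agree with the table to within rounding.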

Relative speedup and cost

The following figure compares T3D performance with C90 performance, as well as relative cost. In the figure, "elapsed time" refers to Running time, "queue" refers to Queue time, and "cost" refers to normalized miniSUs.


Figure: Performance and cost comparisons

On the T3D, once a cluster of processors is allocated to a job, it is devoted entirely to that job, with no time-sharing among users. Consequently, one is charged for the entire time a processing element (PE) is held, multiplied by the number of PEs. The implication is that the slowest PE dictates the cost of the run. The following table compares the CPU time charged with the CPU time actually used:

# Procs   CPU time  Actual CPU time used  % difference
--------  --------  --------------------  ------------
    2     20.9 min  20.9 min               0 %
    4     20.8 min  20.5 min               1 %
    8     22.9 min  21.8 min               5 %
   16     24.7 min  21.6 min              14 %
   32     24.9 min  22.0 min              13 %
   64     33.7 min  22.5 min              49 %
  128     35.3 min  22.3 min              58 %
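
The "% difference" column can be reproduced, to within rounding, with the small Python sketch below. We assume here that the difference is expressed relative to the CPU time actually used. That assumption matches the table values but is not stated explicitly above, and the function name `charged_overhead_pct` is ours.

```python
def charged_overhead_pct(charged_min, used_min):
    """Percentage by which charged CPU time exceeds CPU time actually used.

    Assumption: the difference is taken relative to the time actually used.
    """
    return 100.0 * (charged_min - used_min) / used_min


# number of processors -> (charged CPU min, actually used CPU min, table % difference)
t3d_usage = {
    2: (20.9, 20.9, 0),
    4: (20.8, 20.5, 1),
    8: (22.9, 21.8, 5),
    16: (24.7, 21.6, 14),
    32: (24.9, 22.0, 13),
    64: (33.7, 22.5, 49),
    128: (35.3, 22.3, 58),
}

for procs, (charged, used, table_pct) in t3d_usage.items():
    pct = charged_overhead_pct(charged, used)
    print(f"{procs:4d} PEs  overhead = {pct:5.1f} %  (table: {table_pct} %)")
```

The overhead is the CPU time charged for PEs that sit idle while waiting for the slowest PE to finish.
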
CONCLUDING COMMENTS NEEDED... more later!