A recent study comparing the Cray C90 and the Cray T3D on a genetic variation simulation reveals some useful trends and rules of thumb that should carry over to the T3E, even more dramatically.
When trying to choose an "optimal" high-performance architecture for a scientific problem, many factors come into play. Cost, turnaround time (including time spent waiting in queues before executing), and the effort required to parallelize a code for a new architecture are usually foremost on a researcher's mind. Evaluating the expected tradeoffs is never trivial, and more often than not it is a luxury that is difficult to afford. Published benchmarks can give a reasonable estimate of computing speed, but they say nothing about the load at a particular site.
The following paragraphs discuss expansion factors computed for the genetic variation simulation, as an aid to architecture evaluation. The expansion factor (EF) compares the actual turnaround time to the time that would be expected on a dedicated single-CPU Cray system:
EF = (Queue time + Running time) / CPU time
Where:
Queue time = time spent in a queue
Running time = time spent in run state
User time = time spent performing numerical calculations
System time = time spent processing system calls
CPU time = User time + System time
I/O time = time spent waiting for I/O completion.
Queue name   Queue time   Running time   CPU time    EF    Cost (miniSUs)
----------   ----------   ------------   ---------   ---   --------------
low          19.9 hours   10.3 hours     9.5 hours   3.2    4.25
medium        7.3 hours   10.5 hours     9.5 hours   1.9    8.49
high          0.8 hours   10.2 hours     9.5 hours   1.2   15.34
An expansion factor of 3.2 means, roughly, that a job needing 10 hours of CPU time takes 32 wall-clock hours from submission to completion. With an EF of 1.2 the same job would need only 12 wall-clock hours, but it would be charged 3.6 times as much (15.34 versus 4.25 miniSUs).
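The expansion factors in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes EF = (Queue time + Running time) / CPU time, the form that matches the tabulated values. The function name `expansion_factor` and the `c90_runs` dictionary are illustrative and not part of the original study.

```python
def expansion_factor(queue_time, running_time, cpu_time):
    """Turnaround time relative to the CPU time of a dedicated single-CPU run.

    Assumed form: EF = (queue_time + running_time) / cpu_time.
    All three arguments must use the same unit (hours here).
    """
    if cpu_time <= 0:
        raise ValueError("cpu_time must be positive")
    return (queue_time + running_time) / cpu_time


# C90 measurements from the table, in hours:
# queue name -> (queue time, running time, CPU time, cost in miniSUs)
c90_runs = {
    "low": (19.9, 10.3, 9.5, 4.25),
    "medium": (7.3, 10.5, 9.5, 8.49),
    "high": (0.8, 10.2, 9.5, 15.34),
}

for name, (queue, running, cpu, cost) in c90_runs.items():
    ef = expansion_factor(queue, running, cpu)
    print(f"{name:<6} EF = {ef:.1f}  cost = {cost:.2f} miniSUs")

# Price of the faster turnaround: high-priority cost relative to low-priority.
cost_ratio = c90_runs["high"][3] / c90_runs["low"][3]
print(f"high/low cost ratio = {cost_ratio:.1f}")
```

Keeping queue time as a separate argument makes it easy to see how sensitive the EF is to site load: the running and CPU times barely change between queues, so almost all of the variation comes from the wait.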
Note that an earlier run of the same problem, made when many more jobs had been submitted to the small 2 MWords memory queue, produced a Queue time for the low priority queue of 53 hours which gave us an EF of 32!
The simulation was then ported to the Cray T3D. The Fortran 77 code was slightly modified using the CRAFT (Cray Research Adaptive ForTran) programming model, an extension of Fortran 77 with several Fortran 90 features for designing parallel programs. Using Fortran 90 directly, PVM, or even MPI would have been equivalent and equally straightforward, since the simulation package we ported contains highly parallelizable, independent loops that require minimal interprocessor synchronization.
Expansion factors of less than 1 can be found for parallel systems. The results below are for the small queue. Queue times were negligible and so are not reported.
# Procs CPU time Running time EF Cost (miniSUs)
-------- -------- ------------ ------ --------------
2 20.9 min 10.1 min 0.5 0.52
4 20.8 min 5.2 min 0.25 0.52
8 22.9 min 2.9 min 0.13 0.57
16 24.7 min 1.5 min 0.07 0.62
32 24.9 min 0.8 min 0.04 0.62
64 33.7 min 0.5 min 0.02 0.84
128 35.3 min 0.3 min 0.01 0.88
miniSU costs were computed as:
miniSU = (Running time * #Procs) / 40
The factor of 40 was a fairly generous CPU speed conversion ratio between the C90 and the T3D, used to normalize charging across machines.
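As a quick check of the charging rule, the sketch below applies the formula to the T3D table. The function name `t3d_minisus` and the `ratio` parameter are illustrative. Because the tabulated running times are rounded to a tenth of a minute, the sketch feeds in the CPU time column (in minutes), on the assumption that this column is the Running time * #Procs product. Dividing that column by 40 reproduces the Cost column.

```python
def t3d_minisus(pe_minutes, ratio=40.0):
    """Cost in miniSUs for a T3D run.

    pe_minutes is Running time * #Procs, in minutes.
    ratio is the C90-to-T3D conversion factor (40 in the text).
    """
    return pe_minutes / ratio


# #Procs -> (CPU time column in minutes, tabulated cost in miniSUs)
t3d_table = {
    2: (20.9, 0.52),
    4: (20.8, 0.52),
    8: (22.9, 0.57),
    16: (24.7, 0.62),
    32: (24.9, 0.62),
    64: (33.7, 0.84),
    128: (35.3, 0.88),
}

for procs, (pe_minutes, tabulated) in t3d_table.items():
    cost = t3d_minisus(pe_minutes)
    print(f"{procs:>4} PEs: {cost:.2f} miniSUs (table: {tabulated:.2f})")
```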
On the T3D, once a cluster of processors (PEs) is allocated to a job, it is devoted entirely to that job, with no time-sharing among users. Consequently, one is charged for the full elapsed time multiplied by the number of PEs, so the slowest PE dictates the cost of the run. The table below compares the CPU time charged with the CPU time actually used:
# Procs CPU time Actual CPU time used % difference
-------- -------- -------------------- ---------------
2 20.9 min 20.9 min 0 %
4 20.8 min 20.5 min 1 %
8 22.9 min 21.8 min 5 %
16 24.7 min 21.6 min 14 %
32 24.9 min 22.0 min 13 %
64 33.7 min 22.5 min 49 %
128 35.3 min 22.3 min 58 %
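The last column can be recomputed from the other two. The sketch below assumes that % difference means (charged - actual) / actual, truncated to a whole percent, which reproduces every row of the table. The helper names are illustrative. It also shows, with made-up per-PE times, why a single slow PE raises the charge for the whole partition.

```python
def overhead_percent(charged, actual):
    """Charged CPU time in excess of the CPU time actually used, in percent."""
    return 100.0 * (charged - actual) / actual


def charged_pe_time(per_pe_times):
    """Charge for a dedicated partition: every PE is billed until the slowest one finishes."""
    return max(per_pe_times) * len(per_pe_times)


# #Procs -> (charged CPU time, actual CPU time used), both in minutes
t3d_usage = {
    2: (20.9, 20.9),
    4: (20.8, 20.5),
    8: (22.9, 21.8),
    16: (24.7, 21.6),
    32: (24.9, 22.0),
    64: (33.7, 22.5),
    128: (35.3, 22.3),
}

for procs, (charged, actual) in t3d_usage.items():
    # int() truncates, matching the whole-percent figures in the table.
    print(f"{procs:>4} PEs: {int(overhead_percent(charged, actual))} % overhead")

# Hypothetical per-PE busy times (minutes) for a 4-PE run; not measured data.
busy = [5.0, 5.0, 5.0, 6.0]
print("charged PE-minutes:", charged_pe_time(busy))
print("PE-minutes of work:", sum(busy))
```

In the hypothetical 4-PE case, three PEs sit idle for a minute each while the fourth finishes, and those idle minutes are billed like working ones.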
CONCLUDING COMMENTS NEEDED... more later!