FAST3D Final Report

The FAST3D PET project was a straightforward review of the performance of a beta version of the FAST3D package. A test run of the code was profiled to identify those portions of the code where most of the time is spent. The LCPFCT routines were optimized and the performance limitations of the code studied. No parallel optimization was performed, but a parallel transpose routine is included for further consideration by the authors of FAST3D. This report is organized as follows:
 
  1. Scientific background and nature of the code
  2. Profile of execution
  3. Single processor performance and optimization of LCPFCT
    1. The technique of counting cycles
    2. Alpha 21164 (CRAY T3E) architecture and performance
      1. Floating point performance
      2. Memory system performance
    3. Power2 Super Chip (IBM SP2) architecture and performance
      1. Floating point performance
      2. Memory system performance
    4. Single precision on the IBM SP2
    5. Cycle costs of MAX and MIN
    6. Alternatives to MAX and MIN for processors without conditional move instructions
    7. Cycle costs of ABS and SIGN
    8. Performance evaluation of the original code: T3E
    9. Performance evaluation of the original code: SP2
    10. Performance evaluation of revised code: T3E
    11. Performance evaluation of revised code: SP2
  4. Other optimization prospects
  5. Inter-procedural analysis
  6. Array declarations and COMMON
  7. Parallel transposes
  8. Conclusions
  9. References

Scientific background and nature of the code

The FAST3D program is basically a wrapper for the LCPFCT algorithm and subroutines. LCPFCT (Laboratory for Computational Physics Flux-Corrected Transport) is an algorithm for simulating conservation laws in one dimension with extreme changes in the scalar fields (e.g., density) being modeled. Higher dimensional problems are solved by a sequence of 1-dimensional applications of the LCPFCT algorithm and (parallel) transpositions of the simulation data.

Problems  FAST3D can be used to solve arise in the context of fluid dynamics in High-Reynolds number flow, and the example problems provided with the FAST3D code solve blast and muzzle flash problems, as well as air flow around various bodies at high Mach number.

Profile of execution

A four processor run of the bluntbody test case with negligible parallel overhead results in the following execution profile on the IBM SP2 at SDSC:

       cumulative     self              self    total
 %time    seconds  seconds    calls  ms/call  ms/call  name

  41.4     547.29   547.29 21077280     0.03     0.03  .lcpfct [5]
   8.7     662.45   115.16  2119680     0.05     0.05  .put_pvs [8]
   8.7     776.83   114.38  4215456     0.03     0.03  .velocity [9]
   8.0     882.25   105.42  2107728     0.05     0.41  .gasdyn [4]
   5.9     960.84    78.59  2377728     0.03     0.03  .get_pvs [10]
   4.3    1017.42    56.58  2107728     0.03     0.03  .cnvfct [11]
   3.7    1065.75    48.33  2142720     0.02     0.02  .get_gvs [12]
   2.8    1102.70    36.95                             .__mcount [13]
   2.0    1129.32    26.62  8430912     0.00     0.00  .sources [14]
   1.8    1153.35    24.03  2115684     0.01     0.01  .copygrid [15]
   1.7    1175.96    22.61      612    36.94  1935.58  .ijk_fct [3]
   1.6    1196.57    20.61                             .readsocket [16]
   1.5    1216.90    20.33  2115072     0.01     0.07  .get_ijk_cols [6]
   1.1    1231.42    14.52  2115072     0.01     0.01  .cell_va [17]

The subroutines lcpfct, velocity, gasdyn, cnvfct and sources account for 64.4% of the execution time in this run. lcpfct is in fact the "computational kernel" of the code and the subroutines gasdyn,velocity, cnvfct and sources support the general 1-d flux-corrected transport algorithm in lcpfct.f. We will assume this profile roughly holds for both target architectures we will be studying below. These target architectures are the Cray T3E and the IBM SP2. We could in principle profile the code on the Cray T3E, but at the time of writing the code had not been ported to that platform. The T3E and SP2 are MPP platforms. The T3E uses the Compaq Alpha 21164 processor and the IBM SP2 utilizes the IBM Power2 Super Chip (P2SC) processors. For the platforms at SDSC, the T3E processors run at 300 MHz, and the SP2 at 166 MHz. The profile above contains less than 2% parallel overhead (the readsocket calls), so that we are essentially seeing a profile of single processor performance. Note that the __mcount calls are due to profiling. We concentrate on single processor performance, and leave parallel tuning comments for the section on parallel transposes below.

Single processor performance and optimization of LCPFCT

The technique of counting cycles

While there are a number of approaches to tuning code, most are some form of black magic. While it may be possible to acquire general coding habits for high performance, new architectures will always be able to surprise the programmer. The best way to write code for performance is to understand the architecture and program for it. Although one may hope that compilers are able to tune a code to a high level of performance without intervention, this almost always turns out not to be the case. Unfortunately, it becomes difficult to maintain a single code with good performance on all platforms. However, the approach of this report is to carefully analyze performance so that the reader can take these ideas to other architectures as well. Most of the features of the Alpha and P2SC are typical of RISC processors and similar concerns will exist. The author hopes it is interesting to see both what is common in tuning for the T3E and SP2 and what is different.

The only way to compare the performance of a code to what we expect the performance to be for a given computer is to count cycles. This allows us to assess the performance of the code independent of the programming language or any effects other than the architecture on which the code is running. The technique of counting cycles comes down to several assumptions and steps:

  1. Assume that for a given loop those values which can remain in registers will (compilers are good at this). This includes reused elements of vectors and any scalars which are found in the loop. In general, you may want to assume that data that has ascended the memory hierarchy to a given level will remain at that level,
  2. Assume that a loop can be unrolled to yield enough independent operations to keep the pipelines busy,
  3. Assume an infinite number of registers so that step 2 can take place,
  4. Assume load/stores and arithmetic can overlap in time, as indicated by the architecture sections above,
  5. Assume common subexpressions will not be recomputed,
  6. Ignore the expense of integer computations implicit in the code, like computing addresses, and loop index operations,
  7. Unless it is known that data will be in memory or in cache, separately compute cycles required both for data in memory and in cache. For example, if data are reused a large number of times, it is reasonable to desire that those data be in cache during the reuse. It may be some challenge to organize the data and processing so that the reused data remain in cache, but at least it is a noble cause. On the other hand, large vectors of data which are accessed sequentially will not be in cache, and it will be useful to see exactly what the performance penalties are.
  8. Assume there are no pathological data alignments in cache which cause excess memory-cache traffic and increase cycles required for loads and stores.

In the process of tuning the code,several of the above assumptions can and will fail. Trying to understand which assumptions fail and generate poor code performance is a large part of the tuning process. Typical examples of this are:

  1. dependent operations which cause poor use of the pipelines,
  2. cache conflicts which cause apparent reduced memory bandwidth,
  3. excess work in a loop causing register spill and pipelining problems.

The required cycles for most operations in which we are interested are given in the section below on the architectures. We have a good grasp of cycles required to move data from memory to cache, from cache to register, and how many cycles are required to do the floating point computations. In the loops to be analyzed, we will be placing the operations and cycle predictions in a table with three sections. We briefly describe the tables using the SP2 as an example.
 

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          2           0         2
*                                2          1           2         1
+                                7          5           7         5
ABS                              3          3           3         3
CVLS(=)                                                 1         1

Predicted cycles[FPOPs/2]:      6         5.5        6.5        6

This floating point section for the IBM SP specifies the number and type of operations. The predicted cycles entries are based on  the cycle costs from the above architecture descriptions and the operation counts, and all formulas for cycle predictions are given in square brackets. All cycle predictions are per loop iteration. Double and single precision entries are given for the SP, and we assume throughout that no rearrangements of operations are allowed which could change the numerical results. A column is given including the FMA operation also assuming that no changes in numerical results are allowed. This strict adherence to numerical results is guaranteed on the SP by compiling with the -O3 -qstrict options, or with -O2. Note that compiling with -O3 -qstrict does not mean that all intermediate results are converted to single precision; only an equals sign results in a conversion back to single precision. The most important quantities are highlighted.

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                1                      1
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               3*1.3+1*2.3=           3*0.7+1*1.3=
Predicted cycles:                   6.2                    3.4

The memory-to-cache section specifies the number of loads and stores required by the loop in the infinite loop iteration limit, where all loads and stores must go to main memory. Again, the important cycle predictions are highlighted. The memory-to-cache predictions assume that there are no pathological cache conflicts due to data alignment in cache.

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                1

Predicted cycles(w/o quad)           2  [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)           1  [(LOADs + STOREs)/4]

The cache-to-register section predicts cycles required for data in cache. The highlighted predictions are shown for both quad loads and double loads, which support 4 and 2 double precision loads per cycle, respectively. For single precision loads, the quad loads are unavailable and the w/o quad predictions should be used.

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4               6
Data in cache, REAL*8               5.5
Data in memory, REAL*4              6
Data in memory, REAL*8              6.2
 

Finally, the first three tables are used to predict overall per iteration cycle cost, assuming load/stores and calculations overlap. For data in memory, the maximum of the MEMORY<->CACHE predicted cycles and floating point predicted cycles is used. For data in cache, the maximum of the CACHE<->REGISTER predicted cycles and floating point predicted cycles is used. Sometimes loops will be referred to as load-store bound for obvious reasons. What we really want are CPU (or floating point) bound loops that are doing work and hiding things we don't care about behind this work. Most loops have been tested and run on the platform in question and an observed cycle count is given for the same run parameters (single precision, double precision, etc). All observed cycle counts are from code which isolates the loop in question and also prevents any cache alignment problems. The observed cycle counts should be interpreted as representative of the best performance of the processor, given that it is running assembly generated by a particular compiler. Actual performance may vary.

Below we examine a number of loops in the original and revised code and find that most code exhibits a large degree of predictability. However, there are a number of operations unique to FAST3D which require some extra analysis first: MAX, MIN, ABS and SIGN, which are covered after we review the SP2 and T3E processor characteristics.

Alpha 21164 (CRAY T3E) architecture and performance

Since we want to predict performance by calculating how many cycles will be required to perform a sequence of operations, we need to know:

1) given a  FORTRAN or C loop, what the hardware should actually be doing,
2) how long the desired operations should take.

For the T3E, this typically comes down to understanding how floating point data moves through the following functional diagram:


 

Not shown are the integer pipelines called E0 and E1, which do non-floating point work and access their own set of 32 integer registers. The integer pipelines also execute all (including floating point) loads and stores. The Alpha issues an instruction to each of the pipelines FM (floating point multiply), FA (add), E0 and E1 every cycle. Loads and stores can occur while floating point operations are being done.

Floating point performance

The Alpha 21164 is capable of executing two floating point instructions per clock cycle, provided these operations do not both fall in the same pipeline. The 21164 has a floating point multiply pipeline (FM) and a floating point add pipeline(FA), both of which have a pipeline length of 5 cycles. This means that to achieve peak performance, the body of a loop should make available 10 independent operations (5 add and 5 multiply) per iteration; dependent floating point operations take 5 cycles, not 1, since they cannot be overlapped. All floating point operations execute in either the FM or FA pipeline, and we will note where important operations other than arithmetic execute. A full list of the pipelines in which Alpha instructions execute is given in the Alpha Architecture Handbook.

In summary, we can write:

Predicted floating point cycles (T3E) = max(FM ops, FA ops),

where FM ops occur in the FM pipeline and FA ops occur in the FA pipeline.
 

Memory system Performance

The memory system on the T3E refers to the hardware which moves data from main memory to cache and from cache to register. Although the T3E has two levels of cache, we will gloss over that fact in an effort to simplify our cycle counting and tuning. All of the timings below have been verified with simple test loops and so should be fairly good estimates of peak performance of the memory system.

Operation                   Latency(CP/64 byte line)    Bandwidth(CP/8 byte word)
Stream buffered load/store           24                          3
Load                                 85                          12
Store                                85                          7

The figures for stream buffered bandwidth have been coerced into one value. In fact, loads are listed as 20% faster than stores, with 3 cycles per word an average and adequate for our analysis below, since (as we are about to see) half the memory access time is latency. These latencies and bandwidths can be used to compute an expected per word access time. We concentrate on memory access via the hardware streams. For regular access of memory, we can expect the hardware to prefetch the data using 6 hardware streams which lie between memory and cache.

If we access memory at stride 1 (as is the case in the loops we will tune in FAST3D), then every 8 loads we cause a cache miss and incur the latency in the table. So per required load or store we get 3  + (24/8)  cycles. In other words, 6  cycles per word. This 6 cycle cost is confirmed by code experiments, although stores alone require about 9 cycles per word. This is probably because of the write-allocate nature of the cache: a write which misses in cache causes the data in memory to be loaded into cache before being written back out again when the cache line is needed by other data. In fact, one would expect stores on average to take about twice as long as loads. This is the case for the SP (see below), but the T3E typically needs about the same time for loads and stores.

Cache-to-register load bandwidths are listed in the T3E Fortran Optimization Guide as two 8-byte words per cycle, and one 8 byte word per cycle on store. Latencies of dubious meaning are also given in the Guide. While we may eventually figure out the details, it is sufficient to note that the peak transfer rate is only occasionally attained, and that cache-to-register times are as much as twice what the peak rate would predict. A typical value from testing would indicate that cache-to-register cycle cost predictions based on the published values above should be multiplied by about 1.6.

A NOTE ON STREAM BUFFERS: The stream buffers are an off-chip mechanism which prefetch same stride accesses to main memory after the second miss in cache. These buffers are usually controlled by either or both the system administrator and the user. Some systems may have the stream buffers deactivated for hardware and software problems present in early releases of the system. One can check by running memory-bound code after alternately executing the cshell commands setenv SCACHE_D_STREAMS 1 (streams on) and setenv SCACHE_D_STREAMS 0 (streams off). If there is no change in performance, then the system administrator has prohibited streams.

In the diagram, we have left out the 8 Kbyte cache (the data cache) between the cache shown (the secondary cache) and the registers. It is not particularly interesting for the analysis below. The prime reason to recall its existence is to prevent any catastrophic misalignments of data; the data cache is 1024 eight byte words long, so accessing data 1024 elements apart will cause excess data movement between cache levels.
 

Power 2 Super Chip (P2SC) (IBM SP2) architecture and performance

CPU Performance

As was the case for the T3E, we are generally trying to predict how floating point data moves through the following functional diagram of the P2SC:

Many details of the chip are not shown. Not shown are two fixed-point (integer) units (FXUs) which perform integer calculations on their own set of registers. The FXUs are also responsible for executing loads and stores, so that calculations and memory operations can overlap.

Floating Point Performance


The P2SC has replicated floating point units that perform the same tasks. So 2 multiplies, adds, or multiply-adds can take place in the same cycle. A multiply-add or FMA is a single instruction that can be written a=b*c+d. Since the floating point pipelines are 2 to 3 cycles long (FMAs require 3 cycles), we need 4 to 6 independent floating point operations (*,+ or FMA) to keep the floating point units busy on every cycle. The floating point units perform only 64 bit IEEE arithmetic, with conversion from single precision on load and a pre-store conversion instruction, discussed in detail below.

Summing this up, we can write:

Predicted floating point cycles (SP2) = FPOPs/2,

where an FPOP is any of FMA, *,+, etc.

Memory system performance

The P2SC processors have a very straightforward memory architecture. There is a single level of cache, and no stream buffers as on the T3E. The cache is in a four way set-associative arrangement, and is very fast. Four way set associative cache divides the cache into four sections which map to main memory in the same way. So when up to four words align such that the same cache line is used, the cache can still hold all four lines simultaneously. The cache miss latency is documented by IBM to be 8-12 cycles, and direct evidence of bandwidth is pretty hard to find, although 860 Mbytes/s is believed to the right number number for the SP2 nodes at SDSC. The hardware can in fact do better than this on loads, and a pretty good model of the memory to cache performance can be obtained by some simple load-store code tests. For j loads and k stores of 8 byte words, we require 1.3 j + 2.3 k cycles (at stride 1).  For 4 byte words, 0.7 j + 1.3 k cycles per word are required, almost twice as fast. Note that an average (double precision) load/store rate of 1.8 cycles per load corresponds to 740 Mbytes/sec for the 166 MHz P2SC nodes.

The store on the P2SC can also be thought of as a load and then a store-after-load. Since the cache is a write-allocate cache, a store to data not already in cache involves loading the data into cache, followed by a write back to memory when the cache line is needed for other data. So we can think of a store-after-load as taking 1.0 cycles for a total of 2.3 cycles for a store as shown in the above formula.

Due to the nature of cache buffering, note how these performance figures apply in general to data movement between memory and cache. We know that a cache line is 32 words. If we access a particular word of memory, then the entire cache line surrounding that word must be loaded into cache as well. So if actually use every word, the above calculation predicts how many cycles are required for every load and store. However, if we access every other word, then for every load we see in our code, we are actually loading two words - one is used, the other simply comes along for the ride. So stride two access of j loads and k stores requires 2.6 j + 4.6 k cycles. The worst case is for strides greater than 31, where only 1 word in each cache line is used. In this case, the formula indicates that j loads and k stores would require 41.6 j + 73.6 k cycles. In practice, the P2SC does better than this for large strides, requiring only about 44 cycles for one load and one store at stride 32. It would appear that the P2SC is able to overlap loads and stores very well for large stride access.

Cache-to-register performance is very good relative to processor speed on the P2SC. There are two options for loading from cache. There are single (4 bytes or single precision) and double word load instructions, which can load a single or double precision word every cycle, and also quad load instructions, which can load two consecutive double words per cycle into consecutive registers. While the second option is faster, it makes it more difficult for the compiler to schedule a loop, since every load consumes two registers. Also, since there are two fixed point units, loads and stores between cache and register require either 1/2 or 1/4 (for quad loads of double precision data) cycle. IBM compilers will only generate quad load instructions given the appropriate compiler options. In particular, the options -qarch=p2sc and -qtune=p2sc. It is also possible to use pwr2 in place of p2sc.
 

Single Precision on the IBM SP2

Single precision calculations (4 byte floating point operands declared with FORTRAN REAL) are not native calculations on the P2SC. All processor registers and floating point operations are 8 byte, and assembly language conversion instructions are supplied to generate real single precision results. There are 3 types of conversion instructions. There is a load floating point single instruction, a store floating point single instruction, and a convert floating point double to single instruction. When a single precision floating point equation is to be evaluated, the single precision values are moved from cache to register and converted to double precision simultaneously. The right hand side of the equation completes entirely in double precision, and the single precision left hand side is generated by a CVLS instruction which also performs any IEEE exception handling. The value is then written to cache using the store floating point single instruction. From the point of view of optimization, this introduces additional cycles in the floating point units. Essentially, each equals sign generates a CVLS - an additional floating point operation. In the single precision cycle cost estimates below, we introduce an extra floating point operation for every equals sign. Also, there are no quad word single precision load and store instructions, so cache to register bandwidth is lower than the highest double precision bandwidth. Single precision also requires more registers to do the same work in a loop as double precision, since the CVLS instructions require source and destination registers.

The advantage of using single precision is that the cache to memory time is cut nearly in half, and of course we can store twice as many numbers in available memory. The single precision bandwidths are listed in the previous section on memory system performance for the IBM SP2. (The CRAY FORTRAN compiler for the T3E does not support 8 byte floating point arithmetic, so we do not discuss single precision performance on the T3E.)

MAX  and MIN

The LCPFCT algorithm is rather distinct in the floating point operations required. For example,  the lcpfct subroutine contains the following loop, which is at the core of the unique mathematical properties of the algorithm:

          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0

Let us concentrate first on the max and min functions (the generic version of AMAX1 and AMIN1). This sequence of floating point operations

max(0.0,min(a,b,c))

or:

max(0.0,min(a,min(b,c)))
 

Cray T3E

On the DEC Alpha 21164 (the T3E node processors), and any processor which implements conditional move instructions, a single evaluation of min(a,b) in assembly language looks like:

inst.   operands       pipeline      function
-----   --------       --------      --------
ldt     f6,-16(r1)     E0 or E1      load operand a into register f6
ldt     f3,-16(r4)     E0 or E1      load operand b into register f3
cpys    f6,f6,f13      FM or FA      copy f6 into f13
cmptle  f6,f3,f2       FA            compare f6 to f3
fcmoveq f2,f3,f13      FA            move f3 to f13 if f3<f6
stt     f13            E0 or E1      store the result

This is a sequence of two loads, a copy, a compare, a conditional move and a final store of the result. The pipeline in which the instruction executes is given in the last column; E0 and E1 are the two integer pipelines, FM is the floating point multiply unit, and FA is the floating point add pipeline. These are given in order to help us count the cycles required for the min/max operation. If there were no other operations to be performed, then this sequence of operations would require the sum of all the operation pipeline lengths to complete, or one less than the pipeline depth due to forwarding. (One less than the pipeline depth is called the pipeline latency). For the above operations, the latencies are 2 cycles for loads (the integer pipeline depth is 3) and 4 cycles for the FM or FA operations. The store operation produces no result and no latency is present. With independent min evaluations, say due to unrolling and clever scheduling, we can overlap operations in different pipelines and also keep the pipelines full. The two loads can overlap and execute in one cycle. This loading can also overlap with copy operations in the FM pipeline which can also overlap with operations in the FA pipeline. The longest chain of operations is in fact in the FA pipeline, where the cmptle and fcmoveq operations must complete in sequence; they cannot overlap.  As an example of this overlapping, let us consider a loop consisting only of min evaluations. If we unroll the loop by 4, then we can keep the FA pipeline filled and generate a cmptle and fcmoveq result every cycle. A picture of what we hope the Alpha 21164 would be doing is as follows, where the pipelines have been drawn out to show how resource conflicts control the sequence and timing of execution:

                .
                .
                .
                .
                .
-        12345 FM                 |
FCMOVEQ4 12345 FA                 |
-        123   E0             Iteration I-1
-        123   E1                 |
CYCLE:   .....12345678....
 ------------------------------------------
 CPYS1    12345 FM                |
 CMPTLE1  12345 FA                |
 LD5      123   E0                |
 LD6      123   E1                |
  CPYS2    12345 FM               |
  CMPTLE2  12345 FA               |
  LD7      123   E0               |
  LD8      123   E1               |
   CPYS3    12345 FM              |
   CMPTLE3  12345 FA              |
   ST1      123   E0              |
   ST2      123   E1              |
    CPYS4    12345 FM             |
    CMPTLE4  12345 FA             |
    ST3      123   E0             |
    ST4      123   E1         Iteration I
     -        12345 FM            |
     FCMOVEQ1 12345 FA            |
     -        123   E0            |
     -        123   E1            |
      -        12345 FM           |
      FCMOVEQ2 12345 FA           |
      -        123   E0           |
      -        123   E1           |
       -        12345 FM          |
       FCMOVEQ3 12345 FA          |
       LD1      123   E0          |
       LD2      123   E1          |
        -        12345 FM         |
        FCMOVEQ4 12345 FA         |
        LD3      123   E0         |
        LD4      123   E1         |

CYCLE:   .....12345678....
----------------------------------------
         CPYS1    12345 FM        |
         CMPTLE1  12345 FA  Iteration I+1
         LD1      123   E0        |
         LD2      123   E1        |
                .
                .
                .
                .
                .

Note that the stores in iteration I are being performed for iteration I-1,  while loads are being performed for iteration I+1. Forwarding is assumed for every pipeline, so that we need not wait for the last stage to write the register file. In the eight cycles shown, four evaluations of a min are computed, completing on cycles 5 through 8, yielding a performance of 2 cycles per evaluation. For the purpose of counting floating point cycles, we can think of a min as two floating point adds and a multiply. Assuming sufficient unrolling, we can assume overlapping and ignore the dependencies. Even if the other instructions in a loop iteration are not other mins, we still tie up the FA unit for two cycles. Also note that the FA pipeline is the only pipeline whose resources are busy every cycle. Both the integer and floating-point multiply pipelines are unused for four complete cycles.

Interestingly enough, the T3E does not manage 2 cycles per iteration for a loop composed entirely of min evaluations. Instead, 3.3 cycles are required. However, for the loop composed of

max(0.0,min(a,b,c))

evaluations, the T3E can complete each iteration of this loop unrolled by 3 in 7.2 cycles, or 2.6 cycles per min/max evaluation. In the author's experience with counting cycles on the T3E, getting only 70% of the predicted performance is not atypical, so these timings are reasonable.
 

IBM SP

The IBM SP implements the above operation differently. A single evaluation of c=min(a,b) in SP assembly looks like:

inst.   operands                functional unit   
----    --------                ---------------   
lfd     fp4=a(gr3,-7984)        FXU
lfd     fp2=b(gr3,-11984)       FXU
fcmpu   cr4=fp2,fp4             FPU
bc      CL.242,cr4,0x20/flt     BPU
fmr     fp4=fp2                 FPU
   CL.242:
stfd    c(gr3,16)=fp4           FXU

The quad loads present in the actual code have been eliminated for simplicity, but for many iterations of the loop, the compiler has managed to schedule quad loads and stores which reduce the time spent in each double precision load or store to essentially 1/4 cycle. Scheduling quad loads and stores can difficult since you have to load and store to adjacent registers and adjacent memory locations. There are three processing units in the Power2SC architecture involved: the fixed-point unit (FXU), the floating-point unit(FPU), and the branch-processing unit(BPU). There are two FXUs, two FPUs and one BPU, so given some loop unrolling, we can expect quite a lot of overlapping of work, even though a single iteration of min in a loop results in the dependent sequence of operations shown in the assembly fragment. There is really no indication in the IBM documentation about predicting required cycles in such a sequence of instructions. Instead, the documentation states that there are two sources of delay. The first is the unavoidable delay of the fcmpu in the floating point pipeline, which has a latency of 1-2 cycles. The second delay comes about in the branch instruction as the BPU decides what to do next. The fact that it is unknown which instruction will be executed after the branch instruction makes the pipelining of instructions more difficult (or perhaps impossible), so the exact effect of the branch instruction is unclear. The best we can do is compare to run times we predict without the conditionals and compare and complain. Several tests have been done to determine the effect of conditional branches. First, a loop composed entirely of min evaluations has been unrolled and timed. The best performance is 2.5 cycles per iteration, after manually unrolling by 8. A quick inspection of the above SP assembly indicates that there are 1.5 cycles of work per iteration in an FXU, and 1-2 cycles in an FPU, depending on whether or not the branch is taken. Assuming that both the 2 FXUs and the 2 FPUs can be utilized, and random data are used, this is and average of 3/4 cycle of work per iteration. So the presence of the conditional branch has significantly reduced the performance from what we expect from a fully pipelined sequence of instructions. A second test is performed by timing a loop consisting of max(0.0,min(a,min(b,c))) evaluations, which requires about 6 cycles per iteration after some experimentation with unrolling (see VDMAXMIN below). There are 3 loads and 1 store, 3-6 floating point operations(depending on how many branches are taken), and 3 branch instructions per iteration. If the branch instructions were not costing us anything, we would expect this to require 2.25 cycles per iteration on average. A fairly consistent explanation of these timings is a branch delay of 2-2.5 cycles during which the BPU is busy but the other functional units are not. While we also suspect that pipelining is disrupted by the presence of the branch instructions, this is a difficult effect to quantify. Removing the branch instructions entirely is a possible modification (see the next section ).

Since there are unused cycles in the FPU and FXUs while the BPU is busy, it is (in principle) possible to hide instructions independent of the branch outcome behind the BPU delay (see the IBM Optimization and Tuning Guide for Fortran, C, and C++). As in the case of SIGN evaluations (see VDSGN and VDSIGNDIFF below), one might wonder how many of the idle FPU and FXU cycles can be reclaimed by adding work to a loop containing only max(0.0,min(a,min(b,c))).  No option that the author has tried has managed to hide any additional work in this loop that is appropriate for the lcpfct subroutine. In fact, adding work can result in serious reductions in overall performance, as will be demonstrated in performance evaluation of code revisions below. We have had marginally better luck with the SIGN evaluations.

The subroutine VDMAXMIN:

      subroutine vdmaxmin(n,term,fabs,terp,flxh)

      implicit none
      real*8 term(*),fabs(*),terp(*),flxh(*)
      integer n8,n,i

      n8=rshift(n,3)*8

      do i=1,n8,8
            flxh(i) =  max ( 0.0d0,
     &            min ( term(i), fabs(i), terp(i) ) )
            flxh(i+1) =  max ( 0.0d0,
     &            min ( term(i+1), fabs(i+1), terp(i+1) ) )
            flxh(i+2) =  max ( 0.0d0,
     &            min ( term(i+2), fabs(i+2), terp(i+2) ) )
            flxh(i+3) =  max ( 0.0d0,
     &            min ( term(i+3), fabs(i+3), terp(i+3) ) )
            flxh(i+4) =  max ( 0.0d0,
     &            min ( term(i+4), fabs(i+4), terp(i+4) ) )
            flxh(i+5) =  max ( 0.0d0,
     &            min ( term(i+5), fabs(i+5), terp(i+5) ) )
            flxh(i+6) =  max ( 0.0d0,
     &            min ( term(i+6), fabs(i+6), terp(i+6) ) )
            flxh(i+7) =  max ( 0.0d0,
     &            min ( term(i+7), fabs(i+7), terp(i+7) ) )
      end do
      do i=n8+1,n
            flxh(i) =  max ( 0.0d0,
     &            min ( term(i), fabs(i), terp(i) ) )
      end do

      return
      end
 

when compiled with

xlf90 -qfixed -O3 -qtune=p2sc -qarch=p2sc -c vdmaxmin.f

can return a max(0.0,min(a,min(b,c))) evaluation in as little as 6.1 cycles:

vector length        cycles per evaluation
                     real*4    real*8
-------------        ----------------
64                    6.8       6.7
128                   6.6       6.4
256                   6.4       6.2
512                   6.3       6.1

These tests assume that the data are in cache.

Alternatives to MAX and MIN for processors without conditional move instructions


Although we have seen in the previous section that we can in some cases evaluate independent max and min instructions fairly quickly, we will find in practice that the branch instructions play havoc with scheduling of instructions in a loop. In fact, the loops below with the worst performance will be those containing conditional branches. Frequently, the expected execution time is doubled or tripled over what we might surmise from the above analysis of performance. The above analysis essentially says that if you assume the best case of the conditional branch expense being completely hidden, then you should expect a cycle or two to be consumed in each max or min. The existence of the conditional move instruction entails no particular problem in scheduling, because the operations complete in the floating point pipelines just as multiplies and adds. One might ask if there is a way to avoid the branch instructions the compiler generates for its MAX and MIN instructions. The answer is yes, although the alternate method does not necessarily yield the same floating point results.

Consider the two identities:

2 max(A,B) = A+B+abs(A-B)
2 min(A,B) = A+B-abs(A-B)

where abs is the absolute value. Ignoring for the moment that these are not floating point identities, we can rewrite our required expression

max(0.0,min(A,B,C)) = .125*[A+B-abs(A-B)+2C - abs(A+B-abs(A-B)-2C)
                       +abs(A+B-abs(A-B)+2C - abs(A+B-abs(A-B)-2C))]

Since the operations involved are now addition, multiplication and absolute value, we can count the cycles required for this alternate version of the computation quite easily. Let us assume that we have enough independent evaluations and unrolling of a loop has exposed these to the compiler. We now count operations and cycles:

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          2           0         2
*                                2          1           2         1
+                                7          5           7         5
ABS                              3          3           3         3
CVLS(=)                                                 1         1

Predicted cycles:                6         5.5         6.5        6
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                1                      1
STORE after LOAD

Predicted cycles:               3*1.3+1*2.3=           3*0.7+1*1.3=
Predicted cycles:                   6.2                    3.4
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                1

Predicted cycles(w/o quad)           2
Predicted cycles(w/  quad)           1

Overall cycles per iteration     predicted
----------------------------     ---------
Data in cache, REAL*4               6
Data in cache, REAL*8               5.5
Data in memory, REAL*4              6
Data in memory, REAL*8              6.2
 

We have used  values for the loads and stores assuming that all three operands are loaded from vectors and the result is stored back to a vector. Even in the worst case the computation is only marginally memory bound; for in-cache data, this evaluation is certainly CPU bound for double precision data. For single precision calculations, the computation will certainly be CPU bound (this is good).

Note that the floating point differences between these two methods of computing max and min can be quite large. The max and min calculations have essentially the same problems as floating point adds and subtracts:

A=1.0,      B=.12345678:  min(A,B)=          0.1234567836
                          (A+B-abs(A-B))/2=  0.1234567761

A=10000.0,  B=.12345678:  min(A,B)=          0.1234567836
                          (A+B-abs(A-B))/2=  0.1232518256

A=1.0*10^35,B=.12345678:  min(A,B)=          0.1234567836
                          (A+B-abs(A-B))/2=  0.0000000000E+00
 

These comparisons are for single precision arithmetic.

We do briefly discuss the use of this alternate form of max and min in the performance section below. However, the reader may wish to discount this method entirely due to its finite precision effects on the accuracy of calculations. In the end, we are able to get fairly good performance without resorting to this alternative. We comment on this more in the section on performance analysis of the revised code below.
 
 

ABS and SIGN


There are several other operations present in the LCPFCT kernel whose cycle costs are not immediately obvious. These are ABS, the absolute value, and SIGN, a two-input, single-output function which returns a number whose absolute value is the first argument, and with the same sign as the second argument. A quick look at the assembly reveals how these operations are accomplished.
 

IBM SP

The absolute value is a assembly instruction on the IBM SP.  For example, the loop

        do i=1,n
        c(i)=abs(a(i))
        end do

results in a loop containing the assembly instruction:
 

     inst     operands                 functional unit
     -----    -----------              ---------------
     fabs     fp3=fp2                  FPU

Therefore, we expect each floating point unit to be able to compute an absolute value every cycle.

The SIGN function is succinctly defined by:

c=sign(a,b): if b < 0, then c=-abs(a)
             else, c=abs(a).

Like MIN and MAX, the SP implements this function as a conditional.  Consider the loop

        do i=1,n
        flxh(i) = sign(cst,term(i))
        end do

The following evaluates a single sign function. In lcpfct, the first argument is a constant and the positive and negative absolute values can be stored outside the loop. The following fragment of a the loop body assumes that fp0(floating point register 0)=0.0, fp1=abs(a), and fp2=-abs(a). Each component of term is loaded into floating point register 5.

     inst     operands                 functional unit
     -----    -----------              ---------------
     .
     .
     .
     fmr      fp8,fp1                  FPU
     lfq      fp5,-11992(r3)           FXU
     fcmpu    7,fp0,fp5                FPU
     bc       B0_IF_NOT,CR7_GT,__L460  BPU
     fmr      fp8,fp2                  FPU
__L460:
     stfdu    fp8,8(r3)                FXU
     .
     .
     .

Again, the effect of the conditional branch is difficult to assess a priori in terms of cycles, but we do know how to count the remainder of the operations: there are 2 to 3 (depending on the sign of the operand in fp5) cycles spent in a floating point unit, and one cycle in a fixed point unit for the load and store. So if the branch instruction cost is completely hidden, and both floating point units can be completely utilized, we would expect each sign evaluation to require 2.5/2=1.25 cycles on average. For the above loop consisting only of sign evaluations, unrolling by 8 yields a code which runs in 2.5 cycles per iteration (see VDSGN below), confirming the branch penalty of 2-2.5 cycles mentioned in the previous section. So the floating point units have an extra 1.25 cycles per iteration in which they are idle. In practice, we are able to reclaim an additional half cycle of work per iteration by computing the differences of the input vector, a task which has to be done in actual lcpfct algorithm (see VDSGNDIFF below). Experimentation with reclaiming the remaining .75 cycles has not been successful.  In fact, scheduling more work (that is appropriate for lcpfct) per iteration than is specified in VDSGNDIFF causes the additional work to take longer than if it were put in a separate loop which could then be efficiently pipelined without the disruption of scheduling caused by branch instructions. Note that the performance of VDSGN compares favorably to the vector MASS version of the SIGN function (which takes two vector operands and therefore is not appropriate), which is listed as 3.5 cycles per iteration in the MASSVP2 library (see http://www.rs6000.ibm.com/resource/technology/MASS/ ). The vectorized version of SIGN is the subroutine VDSGN:

      subroutine vdsgn(n,scalar,in,out)

      implicit none
      real*8 scalar,in(*),out(*)
      integer n8,n,i

      n8=rshift(n,3)*8

      do i=1,n8,8
          out(i) = sign(scalar,in(i))
          out(i+1) = sign(scalar,in(i+1))
          out(i+2) = sign(scalar,in(i+2))
          out(i+3) = sign(scalar,in(i+3))
          out(i+4) = sign(scalar,in(i+4))
          out(i+5) = sign(scalar,in(i+5))
          out(i+6) = sign(scalar,in(i+6))
          out(i+7) = sign(scalar,in(i+7))
      end do
      do i=n8+1,n
          out(i)=sign(scalar,in(i))
      end do

      return
      end

When compiled with

xlf90 -qfixed -O3 -qtune=p2sc -qarch=p2sc -c vdsgn.f

this subroutine can return a SIGN evaluation in as little as 2.5 cycles:

vector length        cycles per SIGN
-------------        ---------------
64                   3.1
128                  2.8
256                  2.7
512                  2.5

These tests assume that the data are in cache. Single precision cycle costs are slightly higher for long vectors at 2.6 cycles per iteration.

The subroutine VDSGNDIFF reclaims some of the idle FPU cycles and requires the same time per iteration as VDSGN:

      subroutine vdsgndiff(n,scalar,in,out)

      implicit none
      real*8 scalar,in(*),out(*)
      integer n8,n,i

      n8=rshift(n-1,3)*8

      do i=1,n8,8
          out(i+1) = sign(scalar,(in(i+1)-in(i)))
          out(i+2) = sign(scalar,(in(i+2)-in(i+1)))
          out(i+3) = sign(scalar,(in(i+3)-in(i+2)))
          out(i+4) = sign(scalar,(in(i+4)-in(i+3)))
          out(i+5) = sign(scalar,(in(i+5)-in(i+4)))
          out(i+6) = sign(scalar,(in(i+6)-in(i+5)))
          out(i+7) = sign(scalar,(in(i+7)-in(i+6)))
          out(i+8) = sign(scalar,(in(i+8)-in(i+7)))
      end do
      do i=n8+1,n-1
          out(i+1)=sign(scalar,in(i+1)-in(i))
      end do

      return
      end

However, for single precision calculations (i.e., VSSGNDIFF), the code is much slower at almost 4 cycles per iteration. The extra CVLS instruction and the subtract are unable to hide behind the branches.

CRAY T3E

The T3E implements ABS (absolute value) using the CPYS (copy sign) instruction. For example the FORTRAN loop:

        do i=1,n
           flxh(i) = abs(fsgn(i))
        end do

becomes:

$LL00009:                                     ; Codeblock 10
        subq      r2,  1,         r2          ; Ln 28 0x00000160 (In 107)
        cpys      f31, f2,        f0          ; Ln 29 0x00000164 (In 102)
        ldt       f2,  8(r1)                  ; Ln 29 0x00000168 BLOAD (In 271)
        addq      r1,  8,         r1          ; Ln 29 0x0000016c (In 105)
        stt       f0,  (r3)                   ; Ln 29 0x00000170 (In 104)
        addq      r3,  8,         r3          ; Ln 29 0x00000174 (In 106)
        bgt       r2,  $LL00009               ; Ln 30 0x00000178 (In 108)
 

In this case, f0 holds values of flxh(i), and f2 holds fsgn(i). CPYS evaluates the absolute value by reading floating point register 31 (f31), a register which always returns a floating point zero when read. In other words, the sign of zero is assigned to the value in f2. CPYS executes in either the FM or FA pipeline, so we can expect up to two of these to complete in a single cycle.

SIGN is also accomplished as a single call to CPYS. For example, the FORTRAN loop:

         do i=1,n
           flxh(i) = sign(fabs(i),fsgn(i))
         end do

in Alpha assembly is very simply:
 

$LL00005:                                     ; Codeblock 6
        cpys      f3,  f2,        f0          ; Ln 29 0x000001e0 (In 129)
        ldt       f3,  8(r2)                  ; Ln 29 0x000001e4 BLOAD (In 306)
        subq      r1,  1,         r1          ; Ln 28 0x000001e8 (In 135)
        ldt       f2,  8(r3)                  ; Ln 29 0x000001ec BLOAD (In 304)
        addq      r2,  8,         r2          ; Ln 29 0x000001f0 (In 132)
        stt       f0,  (r4)                   ; Ln 29 0x000001f4 (In 131)
        addq      r3,  8,         r3          ; Ln 29 0x000001f8 (In 133)
        addq      r4,  8,         r4          ; Ln 29 0x000001fc (In 134)
        bgt       r1,  $LL00005               ; Ln 30 0x00000200 (In 136)
 

so that we again have as many as two SIGN evaluations per cycle, in terms of floating point unit work.
 

Performance evaluation of the original code

The usefulness of FAST3D to the user community partly stems from its modularity. It is probably true that one could rewrite the code in such a way to destroy much of this modularity and the usefulness of the code, with a benefit in performance. We comment on this below in the Other prospects section. We will not pursue any such modifications in detail. We will concentrate on modifications within subroutines, and specifically within the most time consuming routine, lcpfct.
 

A detailed analysis of lcpfct


lcpfct is a sequence of 5 loops. These are presented below with the extraneous intervening instructions removed, except for a (commented out) subroutine call:
 

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)

c             Call SML_CELL ( i1, iN )

          Do 3 I = I1P, IN
             FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
    3        DIFF(I) = RHOTD(I) - RHOTD(I-1)

          Do 4 I = I1, IN
             FABS(I+1) = ABS ( FLXH(I+1) )
             FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
             TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
    4        TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)

          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0

We want to estimate how long we expect these loops to take on both of our target architectures. But first, let's have a look at the loops and try to understand what work is actually necessary. If we look through the remainder of the LCPFCT subroutines, we note that many of the vector variables in the above 5 loops are internal to lcpfct. These variables are passed through COMMON and are defined in the file FCT_vars.h, the appropriate sections of which are included below:
 

C /FCT_VELO/ Holds velocity-dependent flux coefficients
      Real      HADUDTH(Mpt),  NULH(Mpt),     MULH(Mpt)
      Real      EPSH(Mpt),     VDTODR(Mpt)

      Save      /FCT_VELO/
      Common    /FCT_VELO/ HADUDTH, NULH, MULH, EPSH, VDTODR

C /FCT_MISC/ Holds the source array and diffusion coefficient
      Real      SOURCE(Mpt), DIFF1

      Save      /FCT_MISC/
      Common    /FCT_MISC/ SOURCE, DIFF1

C /FCT_SCRH/ Holds scratch arrays for use by LCPFCT and CNVFCT. FCTSC1,
C FCTSC2 are called SCRH, SCR1 in the published version.
      Real      FCTSC1(Mpt),   FCTSC2(Mpt),   DIFF(Mpt)
      Real      FLXH(Mpt),     FABS(Mpt),     FSGN(Mpt)
      Real      TERM(Mpt),     TERP(Mpt),     LNRHOT(Mpt)
      Real      LORHOT(Mpt),   RHOT(Mpt),     RHOTD(Mpt)

      Save      /FCT_SCRH/
      Common    /FCT_SCRH/
     &          FCTSC1, FCTSC2, DIFF, FLXH, FABS, FSGN, TERM, TERP,
     &          LNRHOT, LORHOT, RHOT, RHOTD

Note that the subroutine SML_CELL, in the source code file vce.f, takes vector variables rhotd (or lnrhot) and ln as inputs and returns a smoothed rhotd (or lnrhot) vector. In the interest of maintaining the generality of the code, we will assume that rhotd is needed at the point in the code where SML_CELL might be called. Otherwise, it is a temporary vector variable. An inspection of the rest of the LCPFCT algorithm shows the following for the routine lcpfct:

inputs    outputs      local vector workspace
------    -------      ----------------------
RHOO      RHON         FLXH
SOURCES   RHOTD        DIFF
HADUDTH                LORHOT
NULH                   LNRHOT
LO                     RHOT
RLN                    FABS
LN                     FSIGN
MULH                   TERM
                       TERP

This table reveals that there as many vector temporaries as there are input and outputs. We should be concerned about this because the loads and stores associated with using these local workspaces may add considerable cycles to the execution of the algorithm. Let us begin by counting cycles in the original 5 loops. Since the two target platforms have different processor architectures, we have collected the operation counts for each loop in several ways. We include the loops to make the operation counts easier to understand. All operation and cycle counts are per iteration. We cover the T3E first, and then the SP2.
 
 

Performance evaluation of the original code: T3E

An analysis of the performance of the original code on the T3E will reveal the important properties of the lcpfct algorithm. Since CRAY FORTRAN on the T3E does not support REAL*4 arithmetic, we are constrained to testing with REAL*8 data. We analyze the code loop-by-loop, both predicting performance based on our knowledge of the architecture and measuring the performance using carefully devised test codes. These test codes have all been compiled with

f90 -O2 -Ounroll2 -Opipeline2

unless otherwise noted in the tables or loop comments. These compiler options have been used to insure that floating point results are not changed by the compiler, and for performance of the resulting code. Other compiler options (e.g. -O3) make little difference in the performance, with -Ounroll2 being the most important options for performance. Although the -Opipeline2 option does not generally improve performance for load-store bound code, we have included it for consistency; it will help in the revised code. The performance of the loops in the original code only varies slightly when -Opipeline2 is omitted; some loops run slightly faster, some loops run slightly slower. All of the test codes have the vector quantities declared in common and alignment carefully adjusted to prevent the pathological alignment of vector elements in cache.
 

Loop1:

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )
 

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    2
+                                    2
Predicted cycles:                   2 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 3
STORE                                2

Predicted cycles:                 3*6+2*6=
Predicted cycles:                  30    [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles                   3.5  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              3.5                    5.5
Data in memory,REAL*8              30                     30

Loop1 has independent iterations and adequate unrolling should provide plenty of independent calculations for pipelining. For both in-cache and in-memory data, the loop is load-store bound.

Loop 2:

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    3
+                                    5
Predicted cycles:                   5 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 6
STORE                                4

Predicted cycles:                 6*6+4*6=
Predicted cycles:                   60  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                4

Predicted cycles                   7.0  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              7.0                    18.5
Data in memory,REAL*8              60                    117

Loop 2 is also a load-store bound loop, regardless of whether the data are in memory or cache. For data in memory, the already expensive set of ten loads and stores is even more costly as the Alpha runs out of stream buffers. As noted in the architecture section, these stream buffers have an enormous impact on cache-to-memory bandwidth and latency. The temporary vectors LORHOT and LNRHOT are included only for generality in the algorithm; these vectors are not used in this version of the lcpfct subroutine. Finally, this is also a loop with independent iterations.
 
 

Loop3:
          Do 3 I = I1P, IN
             FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
    3        DIFF(I) = RHOTD(I) - RHOTD(I-1)

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    1
+                                    2
Predicted cycles:                   2 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 3
STORE                                2

Predicted cycles:                 3*6+2*6=
Predicted cycles:                   30  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles                   3.5  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              3.5                    5.9
Data in memory,REAL*8              30                     30

Loop 3 writes to 2 vectors (FLXH and DIFF) which are completely local to loops 3,4 and 5. Note that this is also a load-store bound loop.
 

Loop4:
          Do 4 I = I1, IN
             FABS(I+1) = ABS ( FLXH(I+1) )
             FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
             TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
    4        TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    4
+                                    0
ABS  [1 FM or FA]                    1
SIGN [1 FM or FA]                    1
Predicted cycles:                   4 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 3
STORE                                4

Predicted cycles:                 3*6+4*6=
Predicted cycles:                   42  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                4

Predicted cycles                   5.5  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              5.5                    9.7
Data in memory,REAL*8              42                     66
 

Loop 4 is different from the first 3 loops, since the usage of FSGN introduces a dependency. Specifically, iteration I is dependent on iteration I-1 because of the 2nd and 4th statements of the do loop body. However, each iteration does have independent operations which can be evaluated while we are waiting on the evaluation of the SIGN function.Since the SIGN function (CPYS instruction) latency is 4 cycles, we need 4 cycles of independent work to prevent the SIGN evaluation from becoming a bottleneck. Fortunately, there are 4 multiplies (i.e., 4 cycles) we can be working on while the next element of FSGN is being computed. Therefore the dependency should not impose a performance penalty.

Again, loop 4 is load-store bound, and since there are 7 loads and stores, the memory-to-cache bandwidth is again reduced by the limit of 6 stream buffers, although this effect is not as severe as loop2.

Loop5:
          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0
 

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    2
+                                    2
MAX/MIN [2 FA, 1 FM or FA]           3
Predicted cycles:                   8 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 6
STORE                                3

Predicted cycles:                 6*6+3*6=
Predicted cycles:                   54  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                3

Predicted cycles                    6  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              8                    19.4
Data in memory,REAL*8              54                    101
Data in cache, REAL*8,unroll 3     8                    13.0
 

Loop 5 has a dependency introduced by the usage of FLXH; iteration I is dependent on the results of iteration I-1. However, there are more than 6 cycles of pipelined work (and therefore more than the pipeline latency of 4 cycles) involved in computing the next element of FLXH so we do not expect the dependency to limit performance. The assumes the loop can be unrolled sufficiently without running out of floating point registers.

We have also included a timing for a manually unrolled loop. The FORTRAN compiler default unroll count, set by compiling with -Ounroll2 unrolls too many times and software pipelining is disabled due to register overflow. Manually unrolling by 3 (using the !dir$ unroll 3 directive just before the do loop to be unrolled)increases performance dramatically. The fact that performance is potentially degraded is easily checked by looking at compiler output. In this case, the listing file executable.lst generated by compiling with the flag -r3 contained the optimization messages with regard to loop 5 (which began at line 60):

19) cf90-6005,Scalar,Line=60 A loop starting at line 60 was unrolled 4 times.
20) cf90-9870,Scalar,Line=60 Pipelining of loop disabled because of high pressure: ipressure=2, fpressure=35.

Note that the memory bandwidth is again reduced because 9 streams into memory are needed and only six are available. Finally, this is the only loop in the 5 loops in the original code which is not load-store bound for data in cache.
 
 

Conclusions:

Original code performance summary (loops 1-5):

Cycles per iteration:
                         predicted floating point     predicted     observed
                         ------------------------     ---------     --------
Data in cache, REAL*8              21                    27.5         59.6
Data in memory, REAL*8             21                    216          344

Performance evaluation of the original code: SP2

The SP2 FORTRAN compiler supports both REAL*4 and REAL*8 arithmetic, so we review both cases for each loop in the original code. For every loop, we have generated a test code compiled with

xlf90 -O3 -qstrict

or

xlf90 -O3 -qstrict -qtune=p2sc -qarch=p2sc.

The tables note which options are appropriate for each loop timing in the observed results. As in the T3E case, the array variables have been declared in common and their alignment in memory adjusted to prevent pathological alignment problems in cache.
 

Loop1:

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          0           0         0
*                                2          2           2         2
+                                2          2           2         2
CVLS(=)                          0          0           1         1

Predicted cycles:[FPOPs/2]       2         2          2.5      2.5
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                2                      2
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               3*1.3+2*2.3=           3*0.7+2*1.3=
Predicted cycles:                  8.5                   4.7
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles(w/o quad)         2.5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         1.25    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4              2.5                    4.2
Data in cache, REAL*8              2.5                    3.7
Data in cache,p2sc, REAL*4         2.5                    4.2
Data in cache,p2sc, REAL*8         2.0                    2.2
Data in memory,REAL*4              4.7                    6.3
Data in memory,REAL*8              8.5                   10.0

Loop 1 is marginally load-store bound for REAL*4, in-cache data. For REAL*8 data, quad loads make the loop FPU bound and much higher performance is obtained. As one would expect, this loop is almost twice as fast for REAL*4 data when the data is in memory.

Loop 2:

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA         &nb