FAST3D Final Report

The FAST3D PET project was a straightforward review of the performance of a beta version of the FAST3D package. A test run of the code was profiled to identify those portions of the code where most of the time is spent. The LCPFCT routines were optimized and the performance limitations of the code studied. No parallel optimization was performed, but a parallel transpose routine is included for further consideration by the authors of FAST3D. This report is organized as follows:
 
  1. Scientific background and nature of the code
  2. Profile of execution
  3. Single processor performance and optimization of LCPFCT
    1. The technique of counting cycles
    2. Alpha 21164 (CRAY T3E) architecture and performance
      1. Floating point performance
      2. Memory system performance
    3. Power2 Super Chip (IBM SP2) architecture and performance
      1. Floating point performance
      2. Memory system performance
    4. Single precision on the IBM SP2
    5. Cycle costs of MAX and MIN
    6. Alternatives to MAX and MIN for processors without conditional move instructions
    7. Cycle costs of ABS and SIGN
    8. Performance evaluation of the original code: T3E
    9. Performance evaluation of the original code: SP2
    10. Performance evaluation of revised code: T3E
    11. Performance evaluation of revised code: SP2
  4. Other optimization prospects
  5. Inter-procedural analysis
  6. Array declarations and COMMON
  7. Parallel transposes
  8. Conclusions
  9. References

Scientific background and nature of the code

The FAST3D program is basically a wrapper for the LCPFCT algorithm and subroutines. LCPFCT (Laboratory for Computational Physics Flux-Corrected Transport) is an algorithm for simulating conservation laws in one dimension with extreme changes in the scalar fields (e.g., density) being modeled. Higher dimensional problems are solved by a sequence of 1-dimensional applications of the LCPFCT algorithm and (parallel) transpositions of the simulation data.

Problems  FAST3D can be used to solve arise in the context of fluid dynamics in High-Reynolds number flow, and the example problems provided with the FAST3D code solve blast and muzzle flash problems, as well as air flow around various bodies at high Mach number.

Profile of execution

A four processor run of the bluntbody test case with negligible parallel overhead results in the following execution profile on the IBM SP2 at SDSC:

       cumulative     self              self    total
 %time    seconds  seconds    calls  ms/call  ms/call  name

  41.4     547.29   547.29 21077280     0.03     0.03  .lcpfct [5]
   8.7     662.45   115.16  2119680     0.05     0.05  .put_pvs [8]
   8.7     776.83   114.38  4215456     0.03     0.03  .velocity [9]
   8.0     882.25   105.42  2107728     0.05     0.41  .gasdyn [4]
   5.9     960.84    78.59  2377728     0.03     0.03  .get_pvs [10]
   4.3    1017.42    56.58  2107728     0.03     0.03  .cnvfct [11]
   3.7    1065.75    48.33  2142720     0.02     0.02  .get_gvs [12]
   2.8    1102.70    36.95                             .__mcount [13]
   2.0    1129.32    26.62  8430912     0.00     0.00  .sources [14]
   1.8    1153.35    24.03  2115684     0.01     0.01  .copygrid [15]
   1.7    1175.96    22.61      612    36.94  1935.58  .ijk_fct [3]
   1.6    1196.57    20.61                             .readsocket [16]
   1.5    1216.90    20.33  2115072     0.01     0.07  .get_ijk_cols [6]
   1.1    1231.42    14.52  2115072     0.01     0.01  .cell_va [17]

The subroutines lcpfct, velocity, gasdyn, cnvfct and sources account for 64.4% of the execution time in this run. lcpfct is in fact the "computational kernel" of the code and the subroutines gasdyn,velocity, cnvfct and sources support the general 1-d flux-corrected transport algorithm in lcpfct.f. We will assume this profile roughly holds for both target architectures we will be studying below. These target architectures are the Cray T3E and the IBM SP2. We could in principle profile the code on the Cray T3E, but at the time of writing the code had not been ported to that platform. The T3E and SP2 are MPP platforms. The T3E uses the Compaq Alpha 21164 processor and the IBM SP2 utilizes the IBM Power2 Super Chip (P2SC) processors. For the platforms at SDSC, the T3E processors run at 300 MHz, and the SP2 at 166 MHz. The profile above contains less than 2% parallel overhead (the readsocket calls), so that we are essentially seeing a profile of single processor performance. Note that the __mcount calls are due to profiling. We concentrate on single processor performance, and leave parallel tuning comments for the section on parallel transposes below.

Single processor performance and optimization of LCPFCT

The technique of counting cycles

While there are a number of approaches to tuning code, most are some form of black magic. While it may be possible to acquire general coding habits for high performance, new architectures will always be able to surprise the programmer. The best way to write code for performance is to understand the architecture and program for it. Although one may hope that compilers are able to tune a code to a high level of performance without intervention, this almost always turns out not to be the case. Unfortunately, it becomes difficult to maintain a single code with good performance on all platforms. However, the approach of this report is to carefully analyze performance so that the reader can take these ideas to other architectures as well. Most of the features of the Alpha and P2SC are typical of RISC processors and similar concerns will exist. The author hopes it is interesting to see both what is common in tuning for the T3E and SP2 and what is different.

The only way to compare the performance of a code to what we expect the performance to be for a given computer is to count cycles. This allows us to assess the performance of the code independent of the programming language or any effects other than the architecture on which the code is running. The technique of counting cycles comes down to several assumptions and steps:

  1. Assume that for a given loop those values which can remain in registers will (compilers are good at this). This includes reused elements of vectors and any scalars which are found in the loop. In general, you may want to assume that data that has ascended the memory hierarchy to a given level will remain at that level,
  2. Assume that a loop can be unrolled to yield enough independent operations to keep the pipelines busy,
  3. Assume an infinite number of registers so that step 2 can take place,
  4. Assume load/stores and arithmetic can overlap in time, as indicated by the architecture sections above,
  5. Assume common subexpressions will not be recomputed,
  6. Ignore the expense of integer computations implicit in the code, like computing addresses, and loop index operations,
  7. Unless it is known that data will be in memory or in cache, separately compute cycles required both for data in memory and in cache. For example, if data are reused a large number of times, it is reasonable to desire that those data be in cache during the reuse. It may be some challenge to organize the data and processing so that the reused data remain in cache, but at least it is a noble cause. On the other hand, large vectors of data which are accessed sequentially will not be in cache, and it will be useful to see exactly what the performance penalties are.
  8. Assume there are no pathological data alignments in cache which cause excess memory-cache traffic and increase cycles required for loads and stores.

In the process of tuning the code,several of the above assumptions can and will fail. Trying to understand which assumptions fail and generate poor code performance is a large part of the tuning process. Typical examples of this are:

  1. dependent operations which cause poor use of the pipelines,
  2. cache conflicts which cause apparent reduced memory bandwidth,
  3. excess work in a loop causing register spill and pipelining problems.

The required cycles for most operations in which we are interested are given in the section below on the architectures. We have a good grasp of cycles required to move data from memory to cache, from cache to register, and how many cycles are required to do the floating point computations. In the loops to be analyzed, we will be placing the operations and cycle predictions in a table with three sections. We briefly describe the tables using the SP2 as an example.
 

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          2           0         2
*                                2          1           2         1
+                                7          5           7         5
ABS                              3          3           3         3
CVLS(=)                                                 1         1

Predicted cycles[FPOPs/2]:      6         5.5        6.5        6

This floating point section for the IBM SP specifies the number and type of operations. The predicted cycles entries are based on  the cycle costs from the above architecture descriptions and the operation counts, and all formulas for cycle predictions are given in square brackets. All cycle predictions are per loop iteration. Double and single precision entries are given for the SP, and we assume throughout that no rearrangements of operations are allowed which could change the numerical results. A column is given including the FMA operation also assuming that no changes in numerical results are allowed. This strict adherence to numerical results is guaranteed on the SP by compiling with the -O3 -qstrict options, or with -O2. Note that compiling with -O3 -qstrict does not mean that all intermediate results are converted to single precision; only an equals sign results in a conversion back to single precision. The most important quantities are highlighted.

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                1                      1
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               3*1.3+1*2.3=           3*0.7+1*1.3=
Predicted cycles:                   6.2                    3.4

The memory-to-cache section specifies the number of loads and stores required by the loop in the infinite loop iteration limit, where all loads and stores must go to main memory. Again, the important cycle predictions are highlighted. The memory-to-cache predictions assume that there are no pathological cache conflicts due to data alignment in cache.

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                1

Predicted cycles(w/o quad)           2  [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)           1  [(LOADs + STOREs)/4]

The cache-to-register section predicts cycles required for data in cache. The highlighted predictions are shown for both quad loads and double loads, which support 4 and 2 double precision loads per cycle, respectively. For single precision loads, the quad loads are unavailable and the w/o quad predictions should be used.

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4               6
Data in cache, REAL*8               5.5
Data in memory, REAL*4              6
Data in memory, REAL*8              6.2
 

Finally, the first three tables are used to predict overall per iteration cycle cost, assuming load/stores and calculations overlap. For data in memory, the maximum of the MEMORY<->CACHE predicted cycles and floating point predicted cycles is used. For data in cache, the maximum of the CACHE<->REGISTER predicted cycles and floating point predicted cycles is used. Sometimes loops will be referred to as load-store bound for obvious reasons. What we really want are CPU (or floating point) bound loops that are doing work and hiding things we don't care about behind this work. Most loops have been tested and run on the platform in question and an observed cycle count is given for the same run parameters (single precision, double precision, etc). All observed cycle counts are from code which isolates the loop in question and also prevents any cache alignment problems. The observed cycle counts should be interpreted as representative of the best performance of the processor, given that it is running assembly generated by a particular compiler. Actual performance may vary.

Below we examine a number of loops in the original and revised code and find that most code exhibits a large degree of predictability. However, there are a number of operations unique to FAST3D which require some extra analysis first: MAX, MIN, ABS and SIGN, which are covered after we review the SP2 and T3E processor characteristics.

Alpha 21164 (CRAY T3E) architecture and performance

Since we want to predict performance by calculating how many cycles will be required to perform a sequence of operations, we need to know:

1) given a  FORTRAN or C loop, what the hardware should actually be doing,
2) how long the desired operations should take.

For the T3E, this typically comes down to understanding how floating point data moves through the following functional diagram:


 

Not shown are the integer pipelines called E0 and E1, which do non-floating point work and access their own set of 32 integer registers. The integer pipelines also execute all (including floating point) loads and stores. The Alpha issues an instruction to each of the pipelines FM (floating point multiply), FA (add), E0 and E1 every cycle. Loads and stores can occur while floating point operations are being done.

Floating point performance

The Alpha 21164 is capable of executing two floating point instructions per clock cycle, provided these operations do not both fall in the same pipeline. The 21164 has a floating point multiply pipeline (FM) and a floating point add pipeline(FA), both of which have a pipeline length of 5 cycles. This means that to achieve peak performance, the body of a loop should make available 10 independent operations (5 add and 5 multiply) per iteration; dependent floating point operations take 5 cycles, not 1, since they cannot be overlapped. All floating point operations execute in either the FM or FA pipeline, and we will note where important operations other than arithmetic execute. A full list of the pipelines in which Alpha instructions execute is given in the Alpha Architecture Handbook.

In summary, we can write:

Predicted floating point cycles (T3E) = max(FM ops, FA ops),

where FM ops occur in the FM pipeline and FA ops occur in the FA pipeline.
 

Memory system Performance

The memory system on the T3E refers to the hardware which moves data from main memory to cache and from cache to register. Although the T3E has two levels of cache, we will gloss over that fact in an effort to simplify our cycle counting and tuning. All of the timings below have been verified with simple test loops and so should be fairly good estimates of peak performance of the memory system.

Operation                   Latency(CP/64 byte line)    Bandwidth(CP/8 byte word)
Stream buffered load/store           24                          3
Load                                 85                          12
Store                                85                          7

The figures for stream buffered bandwidth have been coerced into one value. In fact, loads are listed as 20% faster than stores, with 3 cycles per word an average and adequate for our analysis below, since (as we are about to see) half the memory access time is latency. These latencies and bandwidths can be used to compute an expected per word access time. We concentrate on memory access via the hardware streams. For regular access of memory, we can expect the hardware to prefetch the data using 6 hardware streams which lie between memory and cache.

If we access memory at stride 1 (as is the case in the loops we will tune in FAST3D), then every 8 loads we cause a cache miss and incur the latency in the table. So per required load or store we get 3  + (24/8)  cycles. In other words, 6  cycles per word. This 6 cycle cost is confirmed by code experiments, although stores alone require about 9 cycles per word. This is probably because of the write-allocate nature of the cache: a write which misses in cache causes the data in memory to be loaded into cache before being written back out again when the cache line is needed by other data. In fact, one would expect stores on average to take about twice as long as loads. This is the case for the SP (see below), but the T3E typically needs about the same time for loads and stores.

Cache-to-register load bandwidths are listed in the T3E Fortran Optimization Guide as two 8-byte words per cycle, and one 8 byte word per cycle on store. Latencies of dubious meaning are also given in the Guide. While we may eventually figure out the details, it is sufficient to note that the peak transfer rate is only occasionally attained, and that cache-to-register times are as much as twice what the peak rate would predict. A typical value from testing would indicate that cache-to-register cycle cost predictions based on the published values above should be multiplied by about 1.6.

A NOTE ON STREAM BUFFERS: The stream buffers are an off-chip mechanism which prefetch same stride accesses to main memory after the second miss in cache. These buffers are usually controlled by either or both the system administrator and the user. Some systems may have the stream buffers deactivated for hardware and software problems present in early releases of the system. One can check by running memory-bound code after alternately executing the cshell commands setenv SCACHE_D_STREAMS 1 (streams on) and setenv SCACHE_D_STREAMS 0 (streams off). If there is no change in performance, then the system administrator has prohibited streams.

In the diagram, we have left out the 8 Kbyte cache (the data cache) between the cache shown (the secondary cache) and the registers. It is not particularly interesting for the analysis below. The prime reason to recall its existence is to prevent any catastrophic misalignments of data; the data cache is 1024 eight byte words long, so accessing data 1024 elements apart will cause excess data movement between cache levels.
 

Power 2 Super Chip (P2SC) (IBM SP2) architecture and performance

CPU Performance

As was the case for the T3E, we are generally trying to predict how floating point data moves through the following functional diagram of the P2SC:

Many details of the chip are not shown. Not shown are two fixed-point (integer) units (FXUs) which perform integer calculations on their own set of registers. The FXUs are also responsible for executing loads and stores, so that calculations and memory operations can overlap.

Floating Point Performance


The P2SC has replicated floating point units that perform the same tasks. So 2 multiplies, adds, or multiply-adds can take place in the same cycle. A multiply-add or FMA is a single instruction that can be written a=b*c+d. Since the floating point pipelines are 2 to 3 cycles long (FMAs require 3 cycles), we need 4 to 6 independent floating point operations (*,+ or FMA) to keep the floating point units busy on every cycle. The floating point units perform only 64 bit IEEE arithmetic, with conversion from single precision on load and a pre-store conversion instruction, discussed in detail below.

Summing this up, we can write:

Predicted floating point cycles (SP2) = FPOPs/2,

where an FPOP is any of FMA, *,+, etc.

Memory system performance

The P2SC processors have a very straightforward memory architecture. There is a single level of cache, and no stream buffers as on the T3E. The cache is in a four way set-associative arrangement, and is very fast. Four way set associative cache divides the cache into four sections which map to main memory in the same way. So when up to four words align such that the same cache line is used, the cache can still hold all four lines simultaneously. The cache miss latency is documented by IBM to be 8-12 cycles, and direct evidence of bandwidth is pretty hard to find, although 860 Mbytes/s is believed to the right number number for the SP2 nodes at SDSC. The hardware can in fact do better than this on loads, and a pretty good model of the memory to cache performance can be obtained by some simple load-store code tests. For j loads and k stores of 8 byte words, we require 1.3 j + 2.3 k cycles (at stride 1).  For 4 byte words, 0.7 j + 1.3 k cycles per word are required, almost twice as fast. Note that an average (double precision) load/store rate of 1.8 cycles per load corresponds to 740 Mbytes/sec for the 166 MHz P2SC nodes.

The store on the P2SC can also be thought of as a load and then a store-after-load. Since the cache is a write-allocate cache, a store to data not already in cache involves loading the data into cache, followed by a write back to memory when the cache line is needed for other data. So we can think of a store-after-load as taking 1.0 cycles for a total of 2.3 cycles for a store as shown in the above formula.

Due to the nature of cache buffering, note how these performance figures apply in general to data movement between memory and cache. We know that a cache line is 32 words. If we access a particular word of memory, then the entire cache line surrounding that word must be loaded into cache as well. So if actually use every word, the above calculation predicts how many cycles are required for every load and store. However, if we access every other word, then for every load we see in our code, we are actually loading two words - one is used, the other simply comes along for the ride. So stride two access of j loads and k stores requires 2.6 j + 4.6 k cycles. The worst case is for strides greater than 31, where only 1 word in each cache line is used. In this case, the formula indicates that j loads and k stores would require 41.6 j + 73.6 k cycles. In practice, the P2SC does better than this for large strides, requiring only about 44 cycles for one load and one store at stride 32. It would appear that the P2SC is able to overlap loads and stores very well for large stride access.

Cache-to-register performance is very good relative to processor speed on the P2SC. There are two options for loading from cache. There are single (4 bytes or single precision) and double word load instructions, which can load a single or double precision word every cycle, and also quad load instructions, which can load two consecutive double words per cycle into consecutive registers. While the second option is faster, it makes it more difficult for the compiler to schedule a loop, since every load consumes two registers. Also, since there are two fixed point units, loads and stores between cache and register require either 1/2 or 1/4 (for quad loads of double precision data) cycle. IBM compilers will only generate quad load instructions given the appropriate compiler options. In particular, the options -qarch=p2sc and -qtune=p2sc. It is also possible to use pwr2 in place of p2sc.
 

Single Precision on the IBM SP2

Single precision calculations (4 byte floating point operands declared with FORTRAN REAL) are not native calculations on the P2SC. All processor registers and floating point operations are 8 byte, and assembly language conversion instructions are supplied to generate real single precision results. There are 3 types of conversion instructions. There is a load floating point single instruction, a store floating point single instruction, and a convert floating point double to single instruction. When a single precision floating point equation is to be evaluated, the single precision values are moved from cache to register and converted to double precision simultaneously. The right hand side of the equation completes entirely in double precision, and the single precision left hand side is generated by a CVLS instruction which also performs any IEEE exception handling. The value is then written to cache using the store floating point single instruction. From the point of view of optimization, this introduces additional cycles in the floating point units. Essentially, each equals sign generates a CVLS - an additional floating point operation. In the single precision cycle cost estimates below, we introduce an extra floating point operation for every equals sign. Also, there are no quad word single precision load and store instructions, so cache to register bandwidth is lower than the highest double precision bandwidth. Single precision also requires more registers to do the same work in a loop as double precision, since the CVLS instructions require source and destination registers.

The advantage of using single precision is that the cache to memory time is cut nearly in half, and of course we can store twice as many numbers in available memory. The single precision bandwidths are listed in the previous section on memory system performance for the IBM SP2. (The CRAY FORTRAN compiler for the T3E does not support 8 byte floating point arithmetic, so we do not discuss single precision performance on the T3E.)

MAX  and MIN

The LCPFCT algorithm is rather distinct in the floating point operations required. For example,  the lcpfct subroutine contains the following loop, which is at the core of the unique mathematical properties of the algorithm:

          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0

Let us concentrate first on the max and min functions (the generic version of AMAX1 and AMIN1). This sequence of floating point operations

max(0.0,min(a,b,c))

or:

max(0.0,min(a,min(b,c)))
 

Cray T3E

On the DEC Alpha 21164 (the T3E node processors), and any processor which implements conditional move instructions, a single evaluation of min(a,b) in assembly language looks like:

inst.   operands       pipeline      function
-----   --------       --------      --------
ldt     f6,-16(r1)     E0 or E1      load operand a into register f6
ldt     f3,-16(r4)     E0 or E1      load operand b into register f3
cpys    f6,f6,f13      FM or FA      copy f6 into f13
cmptle  f6,f3,f2       FA            compare f6 to f3
fcmoveq f2,f3,f13      FA            move f3 to f13 if f3<f6
stt     f13            E0 or E1      store the result

This is a sequence of two loads, a copy, a compare, a conditional move and a final store of the result. The pipeline in which the instruction executes is given in the last column; E0 and E1 are the two integer pipelines, FM is the floating point multiply unit, and FA is the floating point add pipeline. These are given in order to help us count the cycles required for the min/max operation. If there were no other operations to be performed, then this sequence of operations would require the sum of all the operation pipeline lengths to complete, or one less than the pipeline depth due to forwarding. (One less than the pipeline depth is called the pipeline latency). For the above operations, the latencies are 2 cycles for loads (the integer pipeline depth is 3) and 4 cycles for the FM or FA operations. The store operation produces no result and no latency is present. With independent min evaluations, say due to unrolling and clever scheduling, we can overlap operations in different pipelines and also keep the pipelines full. The two loads can overlap and execute in one cycle. This loading can also overlap with copy operations in the FM pipeline which can also overlap with operations in the FA pipeline. The longest chain of operations is in fact in the FA pipeline, where the cmptle and fcmoveq operations must complete in sequence; they cannot overlap.  As an example of this overlapping, let us consider a loop consisting only of min evaluations. If we unroll the loop by 4, then we can keep the FA pipeline filled and generate a cmptle and fcmoveq result every cycle. A picture of what we hope the Alpha 21164 would be doing is as follows, where the pipelines have been drawn out to show how resource conflicts control the sequence and timing of execution:

                .
                .
                .
                .
                .
-        12345 FM                 |
FCMOVEQ4 12345 FA                 |
-        123   E0             Iteration I-1
-        123   E1                 |
CYCLE:   .....12345678....
 ------------------------------------------
 CPYS1    12345 FM                |
 CMPTLE1  12345 FA                |
 LD5      123   E0                |
 LD6      123   E1                |
  CPYS2    12345 FM               |
  CMPTLE2  12345 FA               |
  LD7      123   E0               |
  LD8      123   E1               |
   CPYS3    12345 FM              |
   CMPTLE3  12345 FA              |
   ST1      123   E0              |
   ST2      123   E1              |
    CPYS4    12345 FM             |
    CMPTLE4  12345 FA             |
    ST3      123   E0             |
    ST4      123   E1         Iteration I
     -        12345 FM            |
     FCMOVEQ1 12345 FA            |
     -        123   E0            |
     -        123   E1            |
      -        12345 FM           |
      FCMOVEQ2 12345 FA           |
      -        123   E0           |
      -        123   E1           |
       -        12345 FM          |
       FCMOVEQ3 12345 FA          |
       LD1      123   E0          |
       LD2      123   E1          |
        -        12345 FM         |
        FCMOVEQ4 12345 FA         |
        LD3      123   E0         |
        LD4      123   E1         |

CYCLE:   .....12345678....
----------------------------------------
         CPYS1    12345 FM        |
         CMPTLE1  12345 FA  Iteration I+1
         LD1      123   E0        |
         LD2      123   E1        |
                .
                .
                .
                .
                .

Note that the stores in iteration I are being performed for iteration I-1,  while loads are being performed for iteration I+1. Forwarding is assumed for every pipeline, so that we need not wait for the last stage to write the register file. In the eight cycles shown, four evaluations of a min are computed, completing on cycles 5 through 8, yielding a performance of 2 cycles per evaluation. For the purpose of counting floating point cycles, we can think of a min as two floating point adds and a multiply. Assuming sufficient unrolling, we can assume overlapping and ignore the dependencies. Even if the other instructions in a loop iteration are not other mins, we still tie up the FA unit for two cycles. Also note that the FA pipeline is the only pipeline whose resources are busy every cycle. Both the integer and floating-point multiply pipelines are unused for four complete cycles.

Interestingly enough, the T3E does not manage 2 cycles per iteration for a loop composed entirely of min evaluations. Instead, 3.3 cycles are required. However, for the loop composed of

max(0.0,min(a,b,c))

evaluations, the T3E can complete each iteration of this loop unrolled by 3 in 7.2 cycles, or 2.6 cycles per min/max evaluation. In the author's experience with counting cycles on the T3E, getting only 70% of the predicted performance is not atypical, so these timings are reasonable.
 

IBM SP

The IBM SP implements the above operation differently. A single evaluation of c=min(a,b) in SP assembly looks like:

inst.   operands                functional unit   
----    --------                ---------------   
lfd     fp4=a(gr3,-7984)        FXU
lfd     fp2=b(gr3,-11984)       FXU
fcmpu   cr4=fp2,fp4             FPU
bc      CL.242,cr4,0x20/flt     BPU
fmr     fp4=fp2                 FPU
   CL.242:
stfd    c(gr3,16)=fp4           FXU

The quad loads present in the actual code have been eliminated for simplicity, but for many iterations of the loop, the compiler has managed to schedule quad loads and stores which reduce the time spent in each double precision load or store to essentially 1/4 cycle. Scheduling quad loads and stores can difficult since you have to load and store to adjacent registers and adjacent memory locations. There are three processing units in the Power2SC architecture involved: the fixed-point unit (FXU), the floating-point unit(FPU), and the branch-processing unit(BPU). There are two FXUs, two FPUs and one BPU, so given some loop unrolling, we can expect quite a lot of overlapping of work, even though a single iteration of min in a loop results in the dependent sequence of operations shown in the assembly fragment. There is really no indication in the IBM documentation about predicting required cycles in such a sequence of instructions. Instead, the documentation states that there are two sources of delay. The first is the unavoidable delay of the fcmpu in the floating point pipeline, which has a latency of 1-2 cycles. The second delay comes about in the branch instruction as the BPU decides what to do next. The fact that it is unknown which instruction will be executed after the branch instruction makes the pipelining of instructions more difficult (or perhaps impossible), so the exact effect of the branch instruction is unclear. The best we can do is compare to run times we predict without the conditionals and compare and complain. Several tests have been done to determine the effect of conditional branches. First, a loop composed entirely of min evaluations has been unrolled and timed. The best performance is 2.5 cycles per iteration, after manually unrolling by 8. A quick inspection of the above SP assembly indicates that there are 1.5 cycles of work per iteration in an FXU, and 1-2 cycles in an FPU, depending on whether or not the branch is taken. Assuming that both the 2 FXUs and the 2 FPUs can be utilized, and random data are used, this is and average of 3/4 cycle of work per iteration. So the presence of the conditional branch has significantly reduced the performance from what we expect from a fully pipelined sequence of instructions. A second test is performed by timing a loop consisting of max(0.0,min(a,min(b,c))) evaluations, which requires about 6 cycles per iteration after some experimentation with unrolling (see VDMAXMIN below). There are 3 loads and 1 store, 3-6 floating point operations(depending on how many branches are taken), and 3 branch instructions per iteration. If the branch instructions were not costing us anything, we would expect this to require 2.25 cycles per iteration on average. A fairly consistent explanation of these timings is a branch delay of 2-2.5 cycles during which the BPU is busy but the other functional units are not. While we also suspect that pipelining is disrupted by the presence of the branch instructions, this is a difficult effect to quantify. Removing the branch instructions entirely is a possible modification (see the next section ).

Since there are unused cycles in the FPU and FXUs while the BPU is busy, it is (in principle) possible to hide instructions independent of the branch outcome behind the BPU delay (see the IBM Optimization and Tuning Guide for Fortran, C, and C++). As in the case of SIGN evaluations (see VDSGN and VDSIGNDIFF below), one might wonder how many of the idle FPU and FXU cycles can be reclaimed by adding work to a loop containing only max(0.0,min(a,min(b,c))).  No option that the author has tried has managed to hide any additional work in this loop that is appropriate for the lcpfct subroutine. In fact, adding work can result in serious reductions in overall performance, as will be demonstrated in performance evaluation of code revisions below. We have had marginally better luck with the SIGN evaluations.

The subroutine VDMAXMIN:

      subroutine vdmaxmin(n,term,fabs,terp,flxh)

      implicit none
      real*8 term(*),fabs(*),terp(*),flxh(*)
      integer n8,n,i

      n8=rshift(n,3)*8

      do i=1,n8,8
            flxh(i) =  max ( 0.0d0,
     &            min ( term(i), fabs(i), terp(i) ) )
            flxh(i+1) =  max ( 0.0d0,
     &            min ( term(i+1), fabs(i+1), terp(i+1) ) )
            flxh(i+2) =  max ( 0.0d0,
     &            min ( term(i+2), fabs(i+2), terp(i+2) ) )
            flxh(i+3) =  max ( 0.0d0,
     &            min ( term(i+3), fabs(i+3), terp(i+3) ) )
            flxh(i+4) =  max ( 0.0d0,
     &            min ( term(i+4), fabs(i+4), terp(i+4) ) )
            flxh(i+5) =  max ( 0.0d0,
     &            min ( term(i+5), fabs(i+5), terp(i+5) ) )
            flxh(i+6) =  max ( 0.0d0,
     &            min ( term(i+6), fabs(i+6), terp(i+6) ) )
            flxh(i+7) =  max ( 0.0d0,
     &            min ( term(i+7), fabs(i+7), terp(i+7) ) )
      end do
      do i=n8+1,n
            flxh(i) =  max ( 0.0d0,
     &            min ( term(i), fabs(i), terp(i) ) )
      end do

      return
      end
 

when compiled with

xlf90 -qfixed -O3 -qtune=p2sc -qarch=p2sc -c vdmaxmin.f

can return a max(0.0,min(a,min(b,c))) evaluation in as little as 6.1 cycles:

vector length        cycles per evaluation
                     real*4    real*8
-------------        ----------------
64                    6.8       6.7
128                   6.6       6.4
256                   6.4       6.2
512                   6.3       6.1

These tests assume that the data are in cache.

Alternatives to MAX and MIN for processors without conditional move instructions


Although we have seen in the previous section that we can in some cases evaluate independent max and min instructions fairly quickly, we will find in practice that the branch instructions play havoc with scheduling of instructions in a loop. In fact, the loops below with the worst performance will be those containing conditional branches. Frequently, the expected execution time is doubled or tripled over what we might surmise from the above analysis of performance. The above analysis essentially says that if you assume the best case of the conditional branch expense being completely hidden, then you should expect a cycle or two to be consumed in each max or min. The existence of the conditional move instruction entails no particular problem in scheduling, because the operations complete in the floating point pipelines just as multiplies and adds. One might ask if there is a way to avoid the branch instructions the compiler generates for its MAX and MIN instructions. The answer is yes, although the alternate method does not necessarily yield the same floating point results.

Consider the two identities:

2 max(A,B) = A+B+abs(A-B)
2 min(A,B) = A+B-abs(A-B)

where abs is the absolute value. Ignoring for the moment that these are not floating point identities, we can rewrite our required expression

max(0.0,min(A,B,C)) = .125*[A+B-abs(A-B)+2C - abs(A+B-abs(A-B)-2C)
                       +abs(A+B-abs(A-B)+2C - abs(A+B-abs(A-B)-2C))]

Since the operations involved are now addition, multiplication and absolute value, we can count the cycles required for this alternate version of the computation quite easily. Let us assume that we have enough independent evaluations and unrolling of a loop has exposed these to the compiler. We now count operations and cycles:

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          2           0         2
*                                2          1           2         1
+                                7          5           7         5
ABS                              3          3           3         3
CVLS(=)                                                 1         1

Predicted cycles:                6         5.5         6.5        6
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                1                      1
STORE after LOAD

Predicted cycles:               3*1.3+1*2.3=           3*0.7+1*1.3=
Predicted cycles:                   6.2                    3.4
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                1

Predicted cycles(w/o quad)           2
Predicted cycles(w/  quad)           1

Overall cycles per iteration     predicted
----------------------------     ---------
Data in cache, REAL*4               6
Data in cache, REAL*8               5.5
Data in memory, REAL*4              6
Data in memory, REAL*8              6.2
 

We have used  values for the loads and stores assuming that all three operands are loaded from vectors and the result is stored back to a vector. Even in the worst case the computation is only marginally memory bound; for in-cache data, this evaluation is certainly CPU bound for double precision data. For single precision calculations, the computation will certainly be CPU bound (this is good).

Note that the floating point differences between these two methods of computing max and min can be quite large. The max and min calculations have essentially the same problems as floating point adds and subtracts:

A=1.0,      B=.12345678:  min(A,B)=          0.1234567836
                          (A+B-abs(A-B))/2=  0.1234567761

A=10000.0,  B=.12345678:  min(A,B)=          0.1234567836
                          (A+B-abs(A-B))/2=  0.1232518256

A=1.0*10^35,B=.12345678:  min(A,B)=          0.1234567836
                          (A+B-abs(A-B))/2=  0.0000000000E+00
 

These comparisons are for single precision arithmetic.

We do briefly discuss the use of this alternate form of max and min in the performance section below. However, the reader may wish to discount this method entirely due to its finite precision effects on the accuracy of calculations. In the end, we are able to get fairly good performance without resorting to this alternative. We comment on this more in the section on performance analysis of the revised code below.
 
 

ABS and SIGN


There are several other operations present in the LCPFCT kernel whose cycle costs are not immediately obvious. These are ABS, the absolute value, and SIGN, a two-input, single-output function which returns a number whose absolute value is the first argument, and with the same sign as the second argument. A quick look at the assembly reveals how these operations are accomplished.
 

IBM SP

The absolute value is a assembly instruction on the IBM SP.  For example, the loop

        do i=1,n
        c(i)=abs(a(i))
        end do

results in a loop containing the assembly instruction:
 

     inst     operands                 functional unit
     -----    -----------              ---------------
     fabs     fp3=fp2                  FPU

Therefore, we expect each floating point unit to be able to compute an absolute value every cycle.

The SIGN function is succinctly defined by:

c=sign(a,b): if b < 0, then c=-abs(a)
             else, c=abs(a).

Like MIN and MAX, the SP implements this function as a conditional.  Consider the loop

        do i=1,n
        flxh(i) = sign(cst,term(i))
        end do

The following evaluates a single sign function. In lcpfct, the first argument is a constant and the positive and negative absolute values can be stored outside the loop. The following fragment of a the loop body assumes that fp0(floating point register 0)=0.0, fp1=abs(a), and fp2=-abs(a). Each component of term is loaded into floating point register 5.

     inst     operands                 functional unit
     -----    -----------              ---------------
     .
     .
     .
     fmr      fp8,fp1                  FPU
     lfq      fp5,-11992(r3)           FXU
     fcmpu    7,fp0,fp5                FPU
     bc       B0_IF_NOT,CR7_GT,__L460  BPU
     fmr      fp8,fp2                  FPU
__L460:
     stfdu    fp8,8(r3)                FXU
     .
     .
     .

Again, the effect of the conditional branch is difficult to assess a priori in terms of cycles, but we do know how to count the remainder of the operations: there are 2 to 3 (depending on the sign of the operand in fp5) cycles spent in a floating point unit, and one cycle in a fixed point unit for the load and store. So if the branch instruction cost is completely hidden, and both floating point units can be completely utilized, we would expect each sign evaluation to require 2.5/2=1.25 cycles on average. For the above loop consisting only of sign evaluations, unrolling by 8 yields a code which runs in 2.5 cycles per iteration (see VDSGN below), confirming the branch penalty of 2-2.5 cycles mentioned in the previous section. So the floating point units have an extra 1.25 cycles per iteration in which they are idle. In practice, we are able to reclaim an additional half cycle of work per iteration by computing the differences of the input vector, a task which has to be done in actual lcpfct algorithm (see VDSGNDIFF below). Experimentation with reclaiming the remaining .75 cycles has not been successful.  In fact, scheduling more work (that is appropriate for lcpfct) per iteration than is specified in VDSGNDIFF causes the additional work to take longer than if it were put in a separate loop which could then be efficiently pipelined without the disruption of scheduling caused by branch instructions. Note that the performance of VDSGN compares favorably to the vector MASS version of the SIGN function (which takes two vector operands and therefore is not appropriate), which is listed as 3.5 cycles per iteration in the MASSVP2 library (see http://www.rs6000.ibm.com/resource/technology/MASS/ ). The vectorized version of SIGN is the subroutine VDSGN:

      subroutine vdsgn(n,scalar,in,out)

      implicit none
      real*8 scalar,in(*),out(*)
      integer n8,n,i

      n8=rshift(n,3)*8

      do i=1,n8,8
          out(i) = sign(scalar,in(i))
          out(i+1) = sign(scalar,in(i+1))
          out(i+2) = sign(scalar,in(i+2))
          out(i+3) = sign(scalar,in(i+3))
          out(i+4) = sign(scalar,in(i+4))
          out(i+5) = sign(scalar,in(i+5))
          out(i+6) = sign(scalar,in(i+6))
          out(i+7) = sign(scalar,in(i+7))
      end do
      do i=n8+1,n
          out(i)=sign(scalar,in(i))
      end do

      return
      end

When compiled with

xlf90 -qfixed -O3 -qtune=p2sc -qarch=p2sc -c vdsgn.f

this subroutine can return a SIGN evaluation in as little as 2.5 cycles:

vector length        cycles per SIGN
-------------        ---------------
64                   3.1
128                  2.8
256                  2.7
512                  2.5

These tests assume that the data are in cache. Single precision cycle costs are slightly higher for long vectors at 2.6 cycles per iteration.

The subroutine VDSGNDIFF reclaims some of the idle FPU cycles and requires the same time per iteration as VDSGN:

      subroutine vdsgndiff(n,scalar,in,out)

      implicit none
      real*8 scalar,in(*),out(*)
      integer n8,n,i

      n8=rshift(n-1,3)*8

      do i=1,n8,8
          out(i+1) = sign(scalar,(in(i+1)-in(i)))
          out(i+2) = sign(scalar,(in(i+2)-in(i+1)))
          out(i+3) = sign(scalar,(in(i+3)-in(i+2)))
          out(i+4) = sign(scalar,(in(i+4)-in(i+3)))
          out(i+5) = sign(scalar,(in(i+5)-in(i+4)))
          out(i+6) = sign(scalar,(in(i+6)-in(i+5)))
          out(i+7) = sign(scalar,(in(i+7)-in(i+6)))
          out(i+8) = sign(scalar,(in(i+8)-in(i+7)))
      end do
      do i=n8+1,n-1
          out(i+1)=sign(scalar,in(i+1)-in(i))
      end do

      return
      end

However, for single precision calculations (i.e., VSSGNDIFF), the code is much slower at almost 4 cycles per iteration. The extra CVLS instruction and the subtract are unable to hide behind the branches.

CRAY T3E

The T3E implements ABS (absolute value) using the CPYS (copy sign) instruction. For example the FORTRAN loop:

        do i=1,n
           flxh(i) = abs(fsgn(i))
        end do

becomes:

$LL00009:                                     ; Codeblock 10
        subq      r2,  1,         r2          ; Ln 28 0x00000160 (In 107)
        cpys      f31, f2,        f0          ; Ln 29 0x00000164 (In 102)
        ldt       f2,  8(r1)                  ; Ln 29 0x00000168 BLOAD (In 271)
        addq      r1,  8,         r1          ; Ln 29 0x0000016c (In 105)
        stt       f0,  (r3)                   ; Ln 29 0x00000170 (In 104)
        addq      r3,  8,         r3          ; Ln 29 0x00000174 (In 106)
        bgt       r2,  $LL00009               ; Ln 30 0x00000178 (In 108)
 

In this case, f0 holds values of flxh(i), and f2 holds fsgn(i). CPYS evaluates the absolute value by reading floating point register 31 (f31), a register which always returns a floating point zero when read. In other words, the sign of zero is assigned to the value in f2. CPYS executes in either the FM or FA pipeline, so we can expect up to two of these to complete in a single cycle.

SIGN is also accomplished as a single call to CPYS. For example, the FORTRAN loop:

         do i=1,n
           flxh(i) = sign(fabs(i),fsgn(i))
         end do

in Alpha assembly is very simply:
 

$LL00005:                                     ; Codeblock 6
        cpys      f3,  f2,        f0          ; Ln 29 0x000001e0 (In 129)
        ldt       f3,  8(r2)                  ; Ln 29 0x000001e4 BLOAD (In 306)
        subq      r1,  1,         r1          ; Ln 28 0x000001e8 (In 135)
        ldt       f2,  8(r3)                  ; Ln 29 0x000001ec BLOAD (In 304)
        addq      r2,  8,         r2          ; Ln 29 0x000001f0 (In 132)
        stt       f0,  (r4)                   ; Ln 29 0x000001f4 (In 131)
        addq      r3,  8,         r3          ; Ln 29 0x000001f8 (In 133)
        addq      r4,  8,         r4          ; Ln 29 0x000001fc (In 134)
        bgt       r1,  $LL00005               ; Ln 30 0x00000200 (In 136)
 

so that we again have as many as two SIGN evaluations per cycle, in terms of floating point unit work.
 

Performance evaluation of the original code

The usefulness of FAST3D to the user community partly stems from its modularity. It is probably true that one could rewrite the code in such a way to destroy much of this modularity and the usefulness of the code, with a benefit in performance. We comment on this below in the Other prospects section. We will not pursue any such modifications in detail. We will concentrate on modifications within subroutines, and specifically within the most time consuming routine, lcpfct.
 

A detailed analysis of lcpfct


lcpfct is a sequence of 5 loops. These are presented below with the extraneous intervening instructions removed, except for a (commented out) subroutine call:
 

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)

c             Call SML_CELL ( i1, iN )

          Do 3 I = I1P, IN
             FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
    3        DIFF(I) = RHOTD(I) - RHOTD(I-1)

          Do 4 I = I1, IN
             FABS(I+1) = ABS ( FLXH(I+1) )
             FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
             TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
    4        TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)

          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0

We want to estimate how long we expect these loops to take on both of our target architectures. But first, let's have a look at the loops and try to understand what work is actually necessary. If we look through the remainder of the LCPFCT subroutines, we note that many of the vector variables in the above 5 loops are internal to lcpfct. These variables are passed through COMMON and are defined in the file FCT_vars.h, the appropriate sections of which are included below:
 

C /FCT_VELO/ Holds velocity-dependent flux coefficients
      Real      HADUDTH(Mpt),  NULH(Mpt),     MULH(Mpt)
      Real      EPSH(Mpt),     VDTODR(Mpt)

      Save      /FCT_VELO/
      Common    /FCT_VELO/ HADUDTH, NULH, MULH, EPSH, VDTODR

C /FCT_MISC/ Holds the source array and diffusion coefficient
      Real      SOURCE(Mpt), DIFF1

      Save      /FCT_MISC/
      Common    /FCT_MISC/ SOURCE, DIFF1

C /FCT_SCRH/ Holds scratch arrays for use by LCPFCT and CNVFCT. FCTSC1,
C FCTSC2 are called SCRH, SCR1 in the published version.
      Real      FCTSC1(Mpt),   FCTSC2(Mpt),   DIFF(Mpt)
      Real      FLXH(Mpt),     FABS(Mpt),     FSGN(Mpt)
      Real      TERM(Mpt),     TERP(Mpt),     LNRHOT(Mpt)
      Real      LORHOT(Mpt),   RHOT(Mpt),     RHOTD(Mpt)

      Save      /FCT_SCRH/
      Common    /FCT_SCRH/
     &          FCTSC1, FCTSC2, DIFF, FLXH, FABS, FSGN, TERM, TERP,
     &          LNRHOT, LORHOT, RHOT, RHOTD

Note that the subroutine SML_CELL, in the source code file vce.f, takes vector variables rhotd (or lnrhot) and ln as inputs and returns a smoothed rhotd (or lnrhot) vector. In the interest of maintaining the generality of the code, we will assume that rhotd is needed at the point in the code where SML_CELL might be called. Otherwise, it is a temporary vector variable. An inspection of the rest of the LCPFCT algorithm shows the following for the routine lcpfct:

inputs    outputs      local vector workspace
------    -------      ----------------------
RHOO      RHON         FLXH
SOURCES   RHOTD        DIFF
HADUDTH                LORHOT
NULH                   LNRHOT
LO                     RHOT
RLN                    FABS
LN                     FSIGN
MULH                   TERM
                       TERP

This table reveals that there as many vector temporaries as there are input and outputs. We should be concerned about this because the loads and stores associated with using these local workspaces may add considerable cycles to the execution of the algorithm. Let us begin by counting cycles in the original 5 loops. Since the two target platforms have different processor architectures, we have collected the operation counts for each loop in several ways. We include the loops to make the operation counts easier to understand. All operation and cycle counts are per iteration. We cover the T3E first, and then the SP2.
 
 

Performance evaluation of the original code: T3E

An analysis of the performance of the original code on the T3E will reveal the important properties of the lcpfct algorithm. Since CRAY FORTRAN on the T3E does not support REAL*4 arithmetic, we are constrained to testing with REAL*8 data. We analyze the code loop-by-loop, both predicting performance based on our knowledge of the architecture and measuring the performance using carefully devised test codes. These test codes have all been compiled with

f90 -O2 -Ounroll2 -Opipeline2

unless otherwise noted in the tables or loop comments. These compiler options have been used to insure that floating point results are not changed by the compiler, and for performance of the resulting code. Other compiler options (e.g. -O3) make little difference in the performance, with -Ounroll2 being the most important options for performance. Although the -Opipeline2 option does not generally improve performance for load-store bound code, we have included it for consistency; it will help in the revised code. The performance of the loops in the original code only varies slightly when -Opipeline2 is omitted; some loops run slightly faster, some loops run slightly slower. All of the test codes have the vector quantities declared in common and alignment carefully adjusted to prevent the pathological alignment of vector elements in cache.
 

Loop1:

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )
 

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    2
+                                    2
Predicted cycles:                   2 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 3
STORE                                2

Predicted cycles:                 3*6+2*6=
Predicted cycles:                  30    [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles                   3.5  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              3.5                    5.5
Data in memory,REAL*8              30                     30

Loop1 has independent iterations and adequate unrolling should provide plenty of independent calculations for pipelining. For both in-cache and in-memory data, the loop is load-store bound.

Loop 2:

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    3
+                                    5
Predicted cycles:                   5 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 6
STORE                                4

Predicted cycles:                 6*6+4*6=
Predicted cycles:                   60  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                4

Predicted cycles                   7.0  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              7.0                    18.5
Data in memory,REAL*8              60                    117

Loop 2 is also a load-store bound loop, regardless of whether the data are in memory or cache. For data in memory, the already expensive set of ten loads and stores is even more costly as the Alpha runs out of stream buffers. As noted in the architecture section, these stream buffers have an enormous impact on cache-to-memory bandwidth and latency. The temporary vectors LORHOT and LNRHOT are included only for generality in the algorithm; these vectors are not used in this version of the lcpfct subroutine. Finally, this is also a loop with independent iterations.
 
 

Loop3:
          Do 3 I = I1P, IN
             FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
    3        DIFF(I) = RHOTD(I) - RHOTD(I-1)

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    1
+                                    2
Predicted cycles:                   2 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 3
STORE                                2

Predicted cycles:                 3*6+2*6=
Predicted cycles:                   30  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles                   3.5  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              3.5                    5.9
Data in memory,REAL*8              30                     30

Loop 3 writes to 2 vectors (FLXH and DIFF) which are completely local to loops 3,4 and 5. Note that this is also a load-store bound loop.
 

Loop4:
          Do 4 I = I1, IN
             FABS(I+1) = ABS ( FLXH(I+1) )
             FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
             TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
    4        TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    4
+                                    0
ABS  [1 FM or FA]                    1
SIGN [1 FM or FA]                    1
Predicted cycles:                   4 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 3
STORE                                4

Predicted cycles:                 3*6+4*6=
Predicted cycles:                   42  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                4

Predicted cycles                   5.5  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              5.5                    9.7
Data in memory,REAL*8              42                     66
 

Loop 4 is different from the first 3 loops, since the usage of FSGN introduces a dependency. Specifically, iteration I is dependent on iteration I-1 because of the 2nd and 4th statements of the do loop body. However, each iteration does have independent operations which can be evaluated while we are waiting on the evaluation of the SIGN function.Since the SIGN function (CPYS instruction) latency is 4 cycles, we need 4 cycles of independent work to prevent the SIGN evaluation from becoming a bottleneck. Fortunately, there are 4 multiplies (i.e., 4 cycles) we can be working on while the next element of FSGN is being computed. Therefore the dependency should not impose a performance penalty.

Again, loop 4 is load-store bound, and since there are 7 loads and stores, the memory-to-cache bandwidth is again reduced by the limit of 6 stream buffers, although this effect is not as severe as loop2.

Loop5:
          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0
 

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    2
+                                    2
MAX/MIN [2 FA, 1 FM or FA]           3
Predicted cycles:                   8 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 6
STORE                                3

Predicted cycles:                 6*6+3*6=
Predicted cycles:                   54  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                3

Predicted cycles                    6  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              8                    19.4
Data in memory,REAL*8              54                    101
Data in cache, REAL*8,unroll 3     8                    13.0
 

Loop 5 has a dependency introduced by the usage of FLXH; iteration I is dependent on the results of iteration I-1. However, there are more than 6 cycles of pipelined work (and therefore more than the pipeline latency of 4 cycles) involved in computing the next element of FLXH so we do not expect the dependency to limit performance. The assumes the loop can be unrolled sufficiently without running out of floating point registers.

We have also included a timing for a manually unrolled loop. The FORTRAN compiler default unroll count, set by compiling with -Ounroll2 unrolls too many times and software pipelining is disabled due to register overflow. Manually unrolling by 3 (using the !dir$ unroll 3 directive just before the do loop to be unrolled)increases performance dramatically. The fact that performance is potentially degraded is easily checked by looking at compiler output. In this case, the listing file executable.lst generated by compiling with the flag -r3 contained the optimization messages with regard to loop 5 (which began at line 60):

19) cf90-6005,Scalar,Line=60 A loop starting at line 60 was unrolled 4 times.
20) cf90-9870,Scalar,Line=60 Pipelining of loop disabled because of high pressure: ipressure=2, fpressure=35.

Note that the memory bandwidth is again reduced because 9 streams into memory are needed and only six are available. Finally, this is the only loop in the 5 loops in the original code which is not load-store bound for data in cache.
 
 

Conclusions:

Original code performance summary (loops 1-5):

Cycles per iteration:
                         predicted floating point     predicted     observed
                         ------------------------     ---------     --------
Data in cache, REAL*8              21                    27.5         59.6
Data in memory, REAL*8             21                    216          344

Performance evaluation of the original code: SP2

The SP2 FORTRAN compiler supports both REAL*4 and REAL*8 arithmetic, so we review both cases for each loop in the original code. For every loop, we have generated a test code compiled with

xlf90 -O3 -qstrict

or

xlf90 -O3 -qstrict -qtune=p2sc -qarch=p2sc.

The tables note which options are appropriate for each loop timing in the observed results. As in the T3E case, the array variables have been declared in common and their alignment in memory adjusted to prevent pathological alignment problems in cache.
 

Loop1:

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          0           0         0
*                                2          2           2         2
+                                2          2           2         2
CVLS(=)                          0          0           1         1

Predicted cycles:[FPOPs/2]       2         2          2.5      2.5
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                2                      2
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               3*1.3+2*2.3=           3*0.7+2*1.3=
Predicted cycles:                  8.5                   4.7
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles(w/o quad)         2.5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         1.25    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4              2.5                    4.2
Data in cache, REAL*8              2.5                    3.7
Data in cache,p2sc, REAL*4         2.5                    4.2
Data in cache,p2sc, REAL*8         2.0                    2.2
Data in memory,REAL*4              4.7                    6.3
Data in memory,REAL*8              8.5                   10.0

Loop 1 is marginally load-store bound for REAL*4, in-cache data. For REAL*8 data, quad loads make the loop FPU bound and much higher performance is obtained. As one would expect, this loop is almost twice as fast for REAL*4 data when the data is in memory.

Loop 2:

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          1           0         1
*                                3          2           3         2
+                                5          4           5         4
CVLS(=)                          0          0           4         4

Predicted cycles:[FPOPs/2]       4        3.5          6       5.5
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 6                      6
STORE                                4                      4
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               6*1.3+4*2.3=           6*0.7+4*1.3=
Predicted cycles:                  17                     9.4
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                4

Predicted cycles(w/o quad)          5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         2.5    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4              5.5                    6.3
Data in cache, REAL*8              5.0                    8.8
Data in cache,p2sc, REAL*4         5.5                    7.0
Data in cache,p2sc, REAL*8         3.5                    4.5
Data in memory,REAL*4              9.4                    11.1
Data in memory,REAL*8             17.0                    17.0
 

Loop 2 is FPU bound for in-cache REAL*4 data and REAL*8 data with quad loads. Note that unlike the T3E, the SP2 has no hardware stream buffers and in-memory performance is as expected.
 
 

Loop3:
          Do 3 I = I1P, IN
             FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
    3        DIFF(I) = RHOTD(I) - RHOTD(I-1)

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          0           0         0
*                                1          1           1         1
+                                2          2           2         2
CVLS(=)                          0          0           2         2

Predicted cycles:[FPOPs/2]      1.5       1.5         2.5      2.5
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                2                      2
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               3*1.3+2*2.3=           3*0.7+2*1.3=
Predicted cycles:                  8.5                    4.7
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles(w/o quad)         2.5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         1.25    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4              2.5                    3.0
Data in cache, REAL*8              2.5                    3.25
Data in cache,p2sc, REAL*4         2.5                    3.0
Data in cache,p2sc, REAL*8         1.5                    1.85
Data in memory,REAL*4              4.7                    4.9
Data in memory,REAL*8              8.5                    8.3
 

Loop 3 is again marginally load-store bound for REAL*4 data or REAL*8 data without quad loads.

Loop4:
          Do 4 I = I1, IN
             FABS(I+1) = ABS ( FLXH(I+1) )
             FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
             TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
    4        TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          0           0         0
*                                4          4           4         4
+                                0          0           0         0
ABS                              1          1           1         1
SIGN*                            1          1           1         1
CVLS(=)                          0          0           4         4

Predicted cycles:[FPOPs/2]      5.0       5.0         6.0      6.0
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                4                      4
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               3*1.3+4*2.3=           3*0.7+4*1.3=
Predicted cycles:                 13.1                    7.3
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                4

Predicted cycles(w/o quad)         3.5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         1.75    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4              6.0                    8.5
Data in cache, REAL*8              5.0                    6.8
Data in cache,p2sc, REAL*4         6.0                    8.5
Data in cache,p2sc, REAL*8         5.0                    7.0
Data in memory,REAL*4              7.3                    9.5
Data in memory,REAL*8             13.1                    13.3
 
 

Loop4, as stated in the T3E section, has a loop-carried dependency through FSGN. There are plenty of calculations per iteration to prevent any pipelining problems with the short 2-3 cycle pipelines of the P2SC. Note that we have used the peak observed performance of SIGN from the introductory review of this function to predict performance at 2.5 cycles per SIGN. Loop 4 is FPU bound for in-cache data.

Loop5:
          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          1           0         1
*                                2          1           2         1
+                                2          1           2         1
MAXMIN*                          1          1           1         1
CVLS(=)                          0          0           2         2

Predicted cycles:[FPOPs/2]      8.1       7.6         8.6      8.1
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 6                      6
STORE                                3                      3
STORE after LOAD

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:               6*1.3+3*2.3=           6*0.7+3*1.3=
Predicted cycles:                 14.7                   8.1
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                3

Predicted cycles(w/o quad)         4.5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         2.25    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4              8.1                    15.9
Data in cache, REAL*8              7.6                    14.2
Data in cache,p2sc, REAL*4         8.1                    15.9
Data in cache,p2sc, REAL*8         7.6                    13.8
Data in memory,REAL*4              8.1                    16.0
Data in memory,REAL*8             14.7                    16.7

Loop 5 performs particularly poorly on the P2SC. There is too much work in this loop. We experiment with this in detail in the revised code section. Note we again use observed maximum performance of the MAX and MIN calculations from previous sections of the report to estimate floating point cycles for these functions.
 
 
 
 

Conclusions

Original code performance summary (loops 1-5):

Cycles per iteration:

                              predicted floating point     predicted   observed
                              ------------------------     ---------   --------
Data in cache, REAL*4                  24.6                   24.6       37.9
Data in cache, REAL*8                  19.6                   22.6       36.75
Data in cache,p2sc, REAL*4             24.6                   24.6       38.6
Data in cache,p2sc, REAL*8             19.6                   19.6       29.35
Data in memory, REAL*4                 24.6                   34.2       47.8
Data in memory, REAL*8                 19.6                   61.8       65.3

 

Performance evaluation of revised code: T3E


The major conclusions for code optimization from the performance evaluation of the original code for the T3E are is that the original code is load-store bound and the number of loads and stores should be reduced by fusing loops and eliminating temporary vectors. We therefore begin by fusing the loops in lcpfct into two loops, one on either side of the call to SML_CELL to retain the modularity of the code. Fusing the loops is made slightly more complex by the conditions imposed on either end of the interval before each loop. Loops 1 and 2 can be rewritten as loop 1.1. The original code for these loops is:
 

          DIFF(I1) = NULH(I1) * ( RHOO(I1) - RHO1M )
          FLXH(I1) = HADUDTH(I1) * ( RHOO(I1) + RHO1M )

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )

          DIFF(INP) = NULH(INP) * ( RHONP - RHOO(IN) )
          FLXH(INP) = HADUDTH(INP) * ( RHONP + RHOO(IN) )

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)
 

These two loops can be fused into loop 1.1:

Loop 1.1:
        rhot(i1)=rln(i1)*(lo(i1)*rhoo(i1)+source(i1)
     &                    +hadudth(i1)*(rhoo(i1)+rho1m)
     &                    -hadudth(i1+1)*(rhoo(i1+1)+rhoo(i1)))
        rhotd(i1)=rhot(i1)+
     &           rln(i1)*(nulh(i1+1)*(rhoo(i1+1)-rhoo(i1))
     &                    -nulh(i1)*(rhoo(i1)-rho1m))

        do i=i1+1,in-1
           rhot(i)=rln(i)*(lo(i)*rhoo(i)
     &                     +source(i)
     &                     +(hadudth(i)*(rhoo(i)+rhoo(i-1))
     &                       -hadudth(i+1)*(rhoo(i+1)+rhoo(i))))
           rhotd(i)=rhot(i)+rln(i)*
     &                     (nulh(i+1)*(rhoo(i+1)-rhoo(i))
     &                     -nulh(i)*(rhoo(i)-rhoo(i-1)))
           source(i)=0.0d0
        end do

        rhot(in)=rln(in)*(lo(in)*rhoo(in)+source(in)
     &                    +hadudth(in)*(rhoo(in)+rhoo(in-1))
     &                    -hadudth(in+1)*(rhonp+rhoo(in)))
        rhotd(in)=rhot(in)+
     &           rln(in)*(nulh(in+1)*(rhonp-rhoo(in))
     &                    -nulh(in)*(rhoo(in)-rhoo(in-1)))

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    5
+                                    7
Predicted cycles:                   7 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 6
STORE                                3

Predicted cycles:                 6*6+3*6=
Predicted cycles:                   54  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                3

Predicted cycles                    6  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              7                     11.7
Data in memory,REAL*8              54                    70.2
 

A more direct translation of the original loop would be:

Loop 1.1x:

        do i=i1+1,in-1
           tmp=lo(i)*rhoo(i)
     &                     +source(i)
     &                     +(hadudth(i)*(rhoo(i)+rhoo(i-1))
     &                       -hadudth(i+1)*(rhoo(i+1)+rhoo(i)))
           rhotd(i)=rln(i)*(tmp+
     &                     (nulh(i+1)*(rhoo(i+1)-rhoo(i))
     &                     -nulh(i)*(rhoo(i)-rhoo(i-1)))))
           rhot(i)=rln(i)*tmp
           source(i)=0.0d0
        end do

Loop 1.1x requires more registers than 1.1 and software pipelining is disabled by the compiler. Reducing the unroll factor eventually results in software pipelining of the loop but at lower performance than an unrolled version without software pipelining. In any case, loop 1.1x performs the same as loop 1.1 for data in memory, but for data in cache runs at 14.7 cycles per iteration. This is 3 cycles per iteration slower than loop 1.1. We will stick with loop 1.1 for the modified code, but if the numerical differences between loop 1.1 and loop 1.1x are important to the authors of FAST3D, then loop 1.1x should be used.

We have assumed that our compiler is smart enough to see common subexpressions (we can always explicitly define temporaries). A quick analysis of loop 1.1 reveals the advantages of this revision over the original loops 1 and 2. We summarize the difference in performance at the end of this section.

Note that we have put the zeroing of the SOURCE vector in this loop for a specific reason. Since the typical cache is write-allocate, we must always load a value into cache before writing it back out. Since the original 5 loops load SOURCE into memory twice (once in loop 2 and once in loop 5) it will be more efficient to only load the vector once and zero in the same loop. Unfortunately, this causes our tuned(fused) loop 1.1 to exceed six streams to memory. The reduction in performance(from 54 to 70 cycles per iteration) is not as serious as we have found in other loops, but we briefly demonstrate how one could avoid this.

Revision 1.2 splits the work in loop 1.1 and prevents stream buffer thrashing:

Loop 1.2a:

        do i=i1+1,in-1
           rhot(i)=rln(i)*(lo(i)*rhoo(i)
     &                     +source(i)
     &                     +(hadudth(i)*(rhoo(i)+rhoo(i-1))
     &                       -hadudth(i+1)*(rhoo(i+1)+rhoo(i))))
        end do

Loop 1.2b:

        do i=i1+1,in-1
           rhotd(i)=rhot(i)+rln(i)*
     &                     (nulh(i+1)*(rhoo(i+1)-rhoo(i))
     &                     -nulh(i)*(rhoo(i)-rhoo(i-1)))
           source(i)=0.0
        end do

Loop 1.2a:

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    3
+                                    4
Predicted cycles:                   4 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 5
STORE                                1

Predicted cycles:                 5*6+1*6=
Predicted cycles:                   36  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 5
STORE                                1

Predicted cycles                   3.5  [LOADS/2 + STORES]
 

Loop 1.2b:

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    2
+                                    3
Predicted cycles:                   3 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 4
STORE                                2

Predicted cycles:                 4*6+2*6=
Predicted cycles:                   36  [6*(LOADS+STORES)]

CACHE<->REGISTERS:

Operations
----------
LOAD                                 4
STORE                                2

Predicted cycles                   4.0  [LOADS/2 + STORES]
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              8                    14.4
Data in memory,REAL*8              72                    66
 

Note, however that we have increased the number of temporary vectors over revision 1.1 and paid for our loop splitting by loading and storing these temporary vectors. A third revision blocks the above two loops to eliminate redundant loads of variables reused between the two loops, and runs somewhat faster:
 

Loop 1.3:

       do j=2,n-1,250
        do i=j,j+249
           rhot(i)=rln(i)*(lo(i)*rhoo(i)
     &                     +source(i)
     &                     +(hadudth(i)*(rhoo(i)+rhoo(i-1))
     &                       -hadudth(i+1)*(rhoo(i+1)+rhoo(i))))
        end do
        do i=j,j+249
           rhotd(i)=rhot(i)+rln(i)*
     &                     (nulh(i+1)*(rhoo(i+1)-rhoo(i))
     &                     -nulh(i)*(rhoo(i)-rhoo(i-1)))
           source(i)=0.0
        end do
        end do

Loop 1.3a:

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    3
+                                    4
Predicted cycles:                   4 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 5
STORE                                1

Predicted cycles:                 5*6+1*6=
Predicted cycles:                   36  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 5
STORE                                1

Predicted cycles                   3.5  [LOADS/2 + STORES]
 

Loop 1.3b:

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    2
+                                    3
Predicted cycles:                   3 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                 1
STORE                                2

Predicted cycles:                 1*6+2*6=
Predicted cycles:                   18  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 4
STORE                                2

Predicted cycles                   4.0  [LOADS/2 + STORES]
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              8                     14.6
Data in memory,REAL*8              54                    56

Revision 1.3 is probably not worth the trouble, but demonstrates how the code could be blocked to reduce the loading and storing of temporary vectors and ameliorate problems due to larger grids running out of space in cache. We have paid a penalty for in-cache performance, but gained in out-of-cache performance. Note also that we must in practice introduce clean-up loops for vectors whose length is not an integer multiple of the block size (250 in this case).

In other words, for the T3E, it can be difficult to write a single code which performs well for long and short vectors. This is due to the presence of a limited number of stream buffers. An option is to check the length of the vectors when in the subroutine and dispatch to two different loops, but this assumes we have a good handle on the size and alignment of all of our temporary vectors and can block properly. This is addressed in the other optimization prospects section. Additionally, it was found that significantly better performance could be obtained from loop 1.3 by unrolling the first loop(1.3a) by 3 and the second loop(1.3b) by 4, using the !dir$ unroll directive.

Due to the evident complications arising with revised loops 1.2 and 1.3, we will use loop 1.1 as our final tuned loop.
 
 

Loops 3,4 and 5 in the original code can be rewritten as one loop as well. The original loops are:

          If ( PBC ) Then
             RHOT1M = RHOT(IN)
             RHOTNP = RHOT(I1)
             RHOTD1M = RHOTD(IN)
             RHOTDNP = RHOTD(I1)
          Else
             RHOT1M = SRHO1*RHOT(I1) + VRHO1
             RHOTNP = SRHON*RHOT(IN) + VRHON
             RHOTD1M = SRHO1*RHOTD(I1) + VRHO1
             RHOTDNP = SRHON*RHOTD(IN) + VRHON
          End If

c     Calculate the transported antidiffusive fluxes and transported
c     and diffused density differences . . .
C-----------------------------------------------------------------------
          FLXH(I1) = MULH(I1) * ( RHOT(I1) - RHOT1M )
          DIFF(I1) = RHOTD(I1) - RHOTD1M
          FABS(I1) = ABS ( FLXH(I1) )
          FSGN(I1) = SIGN ( DIFF1, DIFF(I1) )

          Do 3 I = I1P, IN
             FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
    3        DIFF(I) = RHOTD(I) - RHOTD(I-1)

          FLXH(INP) = MULH(INP) * ( RHOTNP - RHOT(IN) )
          DIFF(INP) = RHOTDNP - RHOTD(IN)

c     Calculate the magnitude & sign of the antidiffusive flux followed
c     by the flux-limiting changes on the right and left . . .
C-----------------------------------------------------------------------
          Do 4 I = I1, IN
             FABS(I+1) = ABS ( FLXH(I+1) )
             FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
             TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
    4        TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)

          If ( PBC ) Then
             TERP(INP) = TERP(I1)
             TERM(I1) = TERM(INP)
          Else
             TERP(INP) = BIGNUM
             TERM(I1) = BIGNUM
          End If

c     Correct the transported fluxes completely and then calculate the
c     new Flux-Corrected Transport densities . . .
C-----------------------------------------------------------------------
          FLXH(I1) = FSGN(I1) * AMAX1 ( 0.0,
     &                  AMIN1 ( TERM(I1), FABS(I1), TERP(I1) ) )

          Do 5 I = I1, IN
             FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
     &                AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
             RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
    5        SOURCE(I) = 0.0

These three loops impose some additional work to handle the boundary conditions, but the new loop body is quite compact:

Loop 2.1:

      do i=i1+1,in-2
           tmpi=tmpip1
           fabsip1=abs(mulh(i+1)*(rhot(i+1)-rhot(i)))
           fsignip1=sign(diff1,rhotd(i+1)-rhotd(i))
           termip1=fsignip1*ln(i)*(rhotd(i)-rhotd(i-1))
           terpip1=fsignip1*ln(i+1)*(rhotd(i+2)-rhotd(i+1))
           tmpip1=fsignip1*amax1(0.0,amin1(termip1,fabsip1,terpip1))
           rhon(i)=rhotd(i)-rln(i)*(tmpip1-tmpi)
       end do

FLOATING POINT:

Operations                         REAL*8
----------                         ------
*                                    7
+                                    4
ABS  [1 FM or FA]                    1
SIGN [1 FM or FA]                    1
MAX/MIN [2 FA, 1 FM or FA]           3
Predicted cycles:                   10 [MAX(FM ops, FA ops)]
 
 

MEMORY<->CACHE:

Operations                         REAL*8
----------                         ------
LOAD                                5
STORE                                1

Predicted cycles:                 5*6+1*6=
Predicted cycles:                    36  [6*(LOADS+STORES)]
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 5
STORE                                1

Predicted cycles                    3.5  [LOADS/2 + STORES]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*8              10                      20
Data in memory,REAL*8              36                     36.7
 
 

The poor in-cache performance of loop 2.1 results from excessive register requirements in the loop. This can be seen by inspecting the compiler output from the test code for loop 2.1:

15) cf90-9870,Scalar,Line=39 Pipelining of loop disabled because of high pressure: ipressure=2, fpressure=34.

This occurs even when the loop is unrolled various amounts. This loop should be split, blocked and unrolled just like loop 1.3 above. On the other hand, we would pay for improved in-cache performance with additional load-store instructions, poor out-of-cache performance and additional code complexity. We might also need to experiment with manual unrolling of each of the subloops. We shall therefore remain satisfied with 20 cycles per iteration. Note that we should only expect to do as well as about 15 cycles for in-cache data based on experience with tuning loops for the T3E and the floating point operation count.

Conclusion:

We include the original code performance from above.

Original code performance summary (loops 1-5):

Cycles per iteration:
                        predicted floating point    predicted    observed
                        ------------------------    ---------    --------
Data in cache, REAL*8              21                 27.5         59.6
Data in memory, REAL*8             21                216          344

Revised code performance summary (loops 1.1 and 2.1):

Cycles per iteration:
                        predicted floating point    predicted    observed
                        ------------------------    ---------    --------
Data in cache, REAL*8               17                17           31.7
Data in memory, REAL*8              17                90          106.9

Performance evaluation of revised code: SP2

We concentrate on two optimizations for the SP. First, increase the ratio of floating point operations to load -store operations by eliminating temporary vectors. Second, improve the scheduling of the loop containing the MAX and MIN operations by experimenting with loop fusion and splitting.


As for the T3E, we begin by fusing the loops in lcpfct into two loops, one on either side of the call to SML_CELL to retain the modularity of the code. Fusing the loops is made slightly more complex by the conditions imposed on either end of the interval before each loop. Loops 1 and 2 can be rewritten as loop 1.1. The original code for these loops is:
 

          DIFF(I1) = NULH(I1) * ( RHOO(I1) - RHO1M )
          FLXH(I1) = HADUDTH(I1) * ( RHOO(I1) + RHO1M )

          Do 1 I = I1P, IN
             FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
    1        DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )

          DIFF(INP) = NULH(INP) * ( RHONP - RHOO(IN) )
          FLXH(INP) = HADUDTH(INP) * ( RHONP + RHOO(IN) )

          Do 2 I = I1, IN
             LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
             LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
             RHOT(I)  = LORHOT(I)*RLN(I)
    2        RHOTD(I) = LNRHOT(I)*RLN(I)
 

These two loops can be fused into loop 1.1:

Loop 1.1:
        rhot(i1)=rln(i1)*(lo(i1)*rhoo(i1)+source(i1)
     &                    +hadudth(i1)*(rhoo(i1)+rho1m)
     &                    -hadudth(i1+1)*(rhoo(i1+1)+rhoo(i1)))
        rhotd(i1)=rhot(i1)+
     &           rln(i1)*(nulh(i1+1)*(rhoo(i1+1)-rhoo(i1))
     &                    -nulh(i1)*(rhoo(i1)-rho1m))

        do i=i1+1,in-1
           rhot(i)=rln(i)*(lo(i)*rhoo(i)
     &                     +source(i)
     &                     +(hadudth(i)*(rhoo(i)+rhoo(i-1))
     &                       -hadudth(i+1)*(rhoo(i+1)+rhoo(i))))
           rhotd(i)=rhot(i)+rln(i)*
     &                     (nulh(i+1)*(rhoo(i+1)-rhoo(i))
     &                     -nulh(i)*(rhoo(i)-rhoo(i-1)))
           source(i)=0.0d0
        end do

        rhot(in)=rln(in)*(lo(in)*rhoo(in)+source(in)
     &                    +hadudth(in)*(rhoo(in)+rhoo(in-1))
     &                    -hadudth(in+1)*(rhonp+rhoo(in)))
        rhotd(in)=rhot(in)+
     &           rln(in)*(nulh(in+1)*(rhonp-rhoo(in))
     &                    -nulh(in)*(rhoo(in)-rhoo(in-1)))

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          4           0         4
*                                7          3           7         3
+                                9          5           9         5
CVLS(=)                          0          0           2         2

Predicted cycles:[FPOPs/2]       8                   9        7
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 6                      6
STORE                                2                      2
STORE after LOAD                     1                      1

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:           6*1.3+2*2.3+1*1.0=     6*0.7+3*1.3+1*0.6=
Predicted cycles:                 13.4                    8.7
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 6
STORE                                3

Predicted cycles(w/o quad)         4.5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         2.25    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4              7                      7.8
Data in cache, REAL*8              6                      6.7
Data in cache,p2sc, REAL*4         7                      7.8
Data in cache,p2sc, REAL*8         6                      6.8
Data in memory,REAL*4              8.7                    9.6
Data in memory,REAL*8             13.4                   14.2

Loop 1.1 is quite satisfactory for the P2SC processor. Performance for every case is at least 90% of predicted performance and the loop is CPU bound. We have only two temporary vectors(RHOT and RHOTD)in addition to the input vectors and blocking of this loop for out-of-cache data is far less important. We summarize the performance of this loop at the end of this section.

Again, a more direct translation of the original loop would be:

Loop 1.1x:

        do i=i1+1,in-1
           tmp=lo(i)*rhoo(i)
     &                     +source(i)
     &                     +(hadudth(i)*(rhoo(i)+rhoo(i-1))
     &                       -hadudth(i+1)*(rhoo(i+1)+rhoo(i)))
           rhotd(i)=rln(i)*(tmp+
     &                     (nulh(i+1)*(rhoo(i+1)-rhoo(i))
     &                     -nulh(i)*(rhoo(i)-rhoo(i-1)))))
           rhot(i)=rln(i)*tmp
           source(i)=0.0d0
        end do

Loop 1.1x requires more registers than 1.1 and runs somewhat slower than loop 1.1 on the SP2. For REAL*4 in-cache data, loop 1.1x requires 9.3 cycles per iteration, compared to 7.8 cycles per iteration for loop 1.1. For REAL*8 in-cache data, 7.2 cycles per iteration are required as compared to 6.8 for loop 1.1. We will stick with loop 1.1 for the modified code, but if the numerical differences between loop 1.1 and loop 1.1x are important to the authors of FAST3D, then loop 1.1x should be used. The performance difference between the two is not large.
 

Loops 3,4 and 5 in the original code can be rewritten as one loop as well. Again, we arrive at loop 2.1:

Loop 2.1:

      do i=i1+1,in-2
           tmpi=tmpip1
           fabsip1=abs(mulh(i+1)*(rhot(i+1)-rhot(i)))
           fsignip1=sign(diff1,rhotd(i+1)-rhotd(i))
           termip1=fsignip1*ln(i)*(rhotd(i)-rhotd(i-1))
           terpip1=fsignip1*ln(i+1)*(rhotd(i+2)-rhotd(i+1))
           tmpip1=fsignip1*amax1(0.0,amin1(termip1,fabsip1,terpip1))
           rhon(i)=rhotd(i)-rln(i)*(tmpip1-tmpi)
       end do

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          1           0         1
*                                7          6           7         6
+                                6          5           6         5
ABS                              1          1           1         1
MAXMIN*                          1          1           1         1
SIGN*                            1          1           1         1
CVLS(=)                          0          0           6         6

Predicted cycles:[FPOPs/2]     15.6      15.1        17.6     17.1
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 5                      5
STORE                                1                      1
STORE after LOAD                     0                      0

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:           5*1.3+1*2.3+0*1.0=     5*0.7+1*1.3+0*0.6=
Predicted cycles:                  8.8                    4.8
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 5
STORE                                1

Predicted cycles(w/o quad)          3     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         1.5    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4             17.1                    45.3
Data in cache, REAL*8             15.1                    41.4
Data in cache,p2sc, REAL*4        17.1                    45.3
Data in cache,p2sc, REAL*8        15.1                    41.3
Data in memory,REAL*4             17.1                    45.3
Data in memory,REAL*8             15.1                    45.6
 

* We are computing cycles required for evaluations of MAXMIN(6.1 cycles) and SIGN(2.5 cycles) based on the cycle costs of the vector subroutines VDMAXMIN and VDSGN above at large vector lengths. Also, since VSMAXMIN and VSSGN require about the same number of cycles as their double precision counterparts, a CVLS is absorbed into each call.
 

The performance of this loop is obviously very poor. Part of the problem is the large number of registers required to do the work in the loop body. There are several options to improving the performance of this loop on the SP. First, after some experimentation, it is clear that there are too many operations in the loop. For example, the loop

      do i=i1+1,in-2
           tmpi=tmpip1
           fabsip1=abs(mulh(i+1)*(rhot(i+1)-rhot(i)))
           fsignip1=diff1*(rhotd(i+1)-rhotd(i))
           termip1=fsignip1*ln(i)*(rhotd(i)-rhotd(i-1))
           terpip1=fsignip1*ln(i+1)*(rhotd(i+2)-rhotd(i+1))
           tmpip1=fsignip1*amax1(0.0,amin1(termip1,fabsip1,terpip1))
           rhon(i)=rhotd(i)-rln(i)*(tmpip1-tmpi)
       end do
 

completes in about 19 cycles per iteration for in-cache data (REAL*8), more than twice as fast as version 2.1. The only modification in in the definition of fsignip1 to remove the SIGN evaluation. This suggests that the loop needs to be split, since gains in processor time can cover the increased time storing and loading temporary vectors. The simplest split which actually runs faster than loop 2.1 for both REAL*8 and REAL*4 data is loop 2.2. (If we only split out the SIGN evaluation in order to minimize the number of temporary vectors, we find that the resulting loop still schedules poorly for REAL*4 data.)

Loop 2.2:

        do i=2,n-2
           fabs(i)=abs(mulh(i+1)*(rhot(i+1)-rhot(i)))
           fsign(i)=sign(diff1,rhotd(i+1)-rhotd(i))
        end do
        do i=2,n-2
           tmpi=tmpip1
           termip1=fsign(i)*ln(i)*(rhotd(i)-rhotd(i-1))
           terpip1=fsign(i)*ln(i+1)*(rhotd(i+2)-rhotd(i+1))
           tmpip1=fsign(i)*max(0.d0,min(termip1,fabs(i),terpip1))
           rhon(i)=rhotd(i)-rln(i)*(tmpip1-tmpi)
        end do

Loop 2.2a:

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          0           0         0
*                                1          1           1         1
+                                2          2           2         2
ABS                              1          1           1         1
SIGN                             1          1           1         1
CVLS(=)                          0          0           2         2

Predicted cycles:[FPOPs/2]      4.5       4.5         5.5      5.5
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 3                      3
STORE                                2                      2
STORE after LOAD                     0                      0

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:           3*1.3+2*2.3+0*1.0=     3*0.7+2*1.3+0*0.6=
Predicted cycles:                  8.5                    4.7
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 3
STORE                                2

Predicted cycles(w/o quad)         2.5     [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         1.25    [(LOADs + STOREs)/4]

Loop 2.2b:

FLOATING POINT:
                                   REAL*8                REAL*4
Operations                    w/o FMA    w/FMA      w/o FMA    w/FMA
----------                    -------    -----      -------    -----
FMA                              0          1           0         1
*                                6          5           6         5
+                                3          2           3         2
MAXMIN                           1          1           1         1
CVLS(=)                          0          0           4         4

Predicted cycles:[FPOPs/2]     10.6      10.1        12.6     12.1
 
 

MEMORY<->CACHE:

Operations                         REAL*8                 REAL*4
----------                         ------                 ------
LOAD                                 5                      5
STORE                                1                      1
STORE after LOAD                     0                      0

Predicted cycles:          [L*1.3+S*2.3+SaL*1.0]  [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:           5*1.3+1*2.3+0*1.0=     5*0.7+1*1.3+0*0.6=
Predicted cycles:                  8.8                   4.8
 
 

CACHE<->REGISTERS:

Operations
----------
LOAD                                 5
STORE                                1

Predicted cycles(w/o quad)              [(LOADs + STOREs)/2]
Predicted cycles(w/  quad)         1.5    [(LOADs + STOREs)/4]
 
 

Overall cycles per iteration     predicted               observed
----------------------------     ---------               --------
Data in cache, REAL*4             17.6                    21.0
Data in cache, REAL*8             14.6                    18.2
Data in cache,p2sc, REAL*4        17.6                    21.0
Data in cache,p2sc, REAL*8        14.6                    17.4
Data in memory,REAL*4             17.6                    23.7
Data in memory,REAL*8             18.6                    25.1
 

Loop 2.2 runs fairly well, but we are still a little short of the predicted performance. For in-cache data, we are at 84% of predicted performance, and at 74% predicted performance for data in memory. Evidently it is hard to schedule instructions when so many branches are present. Note that we have not stored the value of rhotd(i+1)-rhotd(i) computed in the first loop and used again in the second. This would introduce another temporary vector which will reduce out-of-cache performance and only increases total in-cache execution time (potentially) by 1/2 cycle.

We have still not tried the option of "vectorizing" the evaluations of SIGN and the sequence of MAXs and MINs in the evaluation of tmpip1. As stated in a previous section, we are able to unroll evaluations of both of these functions and attain a SIGN every 2.5 cycles, and the sequence MAX(0.0,MIN(A,B,C)) can complete in just over 6 cycles. Using the subroutines defined above, VDSGN and VDMAXMIN,  we rewrite the above loop as a sequence of loops:
 

Loop 2.3:

       do i=2,n-2
          diff(i)=rhotd(i)-rhotd(i-1)
       end do

       call vdsgn(n-3,diff1,diff(i+2),fsgn(i+2))

       do i=2,n-2
         term(i)=fsgn(i)*ln(i)*diff(i)
         terp(i)=fsgn(i)*ln(i+1)*diff(i+2)
         fabs(i)=abs(mulh(i+1)*(rhot(i+1)-rhot(i)))
       end do

       call vdmaxmin(n-3,term(2),fabs(2),terp(2),diff(2))

       do i=2,n-2
          rhon(i)=rhotd(i)-rln(i)*(fsgn(i+1)*diff(i+1)-fsgn(i)*diff(i))
       end do

Summary of 5 loop performance:
                                                   observed, vector length =
Overall cycles per iteration     predicted    64    128    256    512    out of cache
----------------------------     ---------    ----  ----   ----   ----   ------------
Data in cache, REAL*4              17.1      20.6  19.9   19.5   19.3
Data in cache, REAL*8              16.1      19.9  19.1   18.9   18.6
Data in cache,p2sc, REAL*4         17.1      20.6  19.8   19.8   19.1
Data in cache,p2sc, REAL*8         14.1      17.4  16.5   16.1   15.5
Data in memory,REAL*4              22.1       -     -      -      -       25.7
Data in memory,REAL*8              35.3       -     -      -      -       34.9

We have included the cycle costs as a function of loop iteration count to check the latency of starting up so many loops (and calling subroutines). At the cost of re-introducing temporary variables and thereby reducing out-of cache performance, we have managed to schedule in-cache operations more effectively. Our in-cache performance is at 90% or more of the predicted performance, and out-of-cache performance is above 86% of predicted performance. Loop 2.3 is 1.9 cycles per iteration faster than loop 2.2 for in cache data.

We abandon this fast in-cache loop because of its behavior for out-of-cache data. Of course, we could always block the set of loops so that temporary vectors would remain in cache, but this seems like a lot of trouble for only a little gain in performance over loop 2.2.

Another option not examined in any detail is the replacement of the MAX and MIN function with the functions described in a previous section which are based on sums, differences and ABS evaluations. This approach utilizes many registers, still causes the loops to be split and in the end requires about the same time to evaluate. An example would be loop 2.1NC:

Loop 2.1NC:

      do i=i1+1,in-2
          tmpi=tmpip1
          fabsip1=abs(mulh(i+1)*(rhot(i+1)-rhot(i)))
          fsignip1=sign(diff1,rhotd(i+1)-rhotd(i))
          termip1=fsignip1*ln(i)*(rhotd(i)-rhotd(i-1))
          terpip1=fsignip1*ln(i+1)*(rhotd(i+2)-rhotd(i+1))
          tmp1=termip1+fabsip1-abs(termip1-fabsip1)
          tmp2=2.0*terpip1+tmp1-abs(2.0*terpip1-tmp1)
          tmpip1=.125*fsignip1*(tmp2+abs(tmp2))
          rhon(i)=rhotd(i)-rln(i)*(tmpip1-tmpi)
       end do

Which executes in 35 cycles for data in cache, and 39 cycles for data not in cache. This is indeed faster than loop 2.1. However, loop 2.2NC:

Loop 2.2NC:

      do i=i1+1,in-2
          fabs(i)=abs(mulh(i+1)*(rhot(i+1)-rhot(i)))
          fsign(i)=sign(diff1,rhotd(i+1)-rhotd(i))
      end do
      do i=i1+1,in-2
          tmpi=tmpip1
          termip1=fsign(i)*ln(i)*(rhotd(i)-rhotd(i-1))
          terpip1=fsign(i)*ln(i+1)*(rhotd(i+2)-rhotd(i+1))
          tmp1=termip1+fabs(i)-abs(termip1-fabs(i))
          tmp2=2.0*terpip1+tmp1-abs(2.0*terpip1-tmp1)
          tmpip1=.125*fsign(i)*(tmp2+abs(tmp2))
          rhon(i)=rhotd(i)-rln(i)*(tmpip1-tmpi)
       end do

performs about 0.8 cycles per iteration faster than loop 2.2 for in-cache REAL*8 data, but requires over 30 cycles per iteration for REAL*4 data. Again, too many registers are required, and the CVLS instructions for REAL*4 arithmetic are the straws that break the camel's back. In general, we can expect the loops without conditionals to schedule better than those with conditionals, in the presence of other operations. However, the version without conditionals offers no great performance increase after requiring the same loop-splitting to be implemented.

Conclusion:

We include the original code performance from above.

Original code performance summary (loops 1-5):

Cycles per iteration:
                            predicted floating point    predicted   observed
                            ------------------------    ---------   --------
Data in cache, REAL*4                24.6                 24.6        37.9
Data in cache, REAL*8                19.6                 22.6        36.75
Data in cache,p2sc, REAL*4           24.6                 24.6        38.6
Data in cache,p2sc, REAL*8           19.6                 19.6        29.35
Data in memory, REAL*4               24.6                 34.2        47.8
Data in memory, REAL*8               19.6                 61.8        65.3

Revised code performance summary (loops 1.1 and 2.2):

Cycles per iteration:
                            predicted floating point    predicted   observed
                            ------------------------    ---------   --------
Data in cache, REAL*4               24.6                  24.6        28.8
Data in cache, REAL*8               20.6                  20.6        24.9
Data in cache,p2sc, REAL*4          24.6                  24.6        28.8
Data in cache,p2sc, REAL*8          20.6                  20.6        24.2
Data in memory, REAL*4              24.6                  26.3        33.3
Data in memory, REAL*8              20.6                  32.0        39.3

Other Optimization Techniques

Inter-procedural Analysis

The most efficient implementation of FAST3D on cache-based computers will reuse data as much as possible once loaded into cache. One aspect of the LCPFCT algorithm is that a number of vector temporaries are created in a sequence of separate loops which may even be in separate subroutines. Values are assigned in one loop and used again in a later loop. This can have two impacts on performance:

  1. If the accessed length of temporary vectors is short enough(see below), and no bad alignments of data in cache occur, then the main impact of vector temporaries is to increase the number of cache-to-register loads and stores over the absolute minimum. This absolute minimum of cache-to-register traffic occurs when temporary values reside only in registers, which is the case with temporary values in loops when register spill does not occur. If the number of floating point operations in each of the sequence of loops is sufficient, then there may be no reduction of performance over the best attainable performance. As we have seen above, some loops do not schedule well and it is best to split them, generating vector temporaries. In that case, the best performance is attained by the use of vector temporaries which, however, should remain in cache.
  2. If the accessed length of temporary vectors is too long, then executing the sequence of loops will cause other temporary vector data to be displaced from cache. This will reduce performance  since those values will have to be written back to memory then loaded into cache again in a later loop.

In general, it is very difficult to plan cache usage across multiple subroutines and loops. For example, let us look at the sequence of subroutines and loops which are executed for every one-dimensional data set pulled from main storage and passed through the LCPFCT algorithm. With reference to the code itself (especially gasdyn.f and lcpfct.f) and the profile above, the sequence of events is 1) a call to makegrid, which sets vectors LO,LN and RLN, 2) repeated calls to VELOCITY with input vector UH which sets vectors NULH,HADUDTH and MULH, and utilizes vectors AH, ADUGTH, EPSH, RLH, FCTSC1, DIFF, VDTODR, RNH and ROH,  3) multiple calls to SOURCES with different input vectors C and D, output vector SOURCE and utilizing vectors FCTSC1, DIFF, LO, and INDICES, 4) multiple calls to LCPFCT(and CNVFCT, which is very similar to LCPFCT) with different input vectors (RHOO) and outputs(RHON), and additional inputs LO,LN,RLN,NULH,HADUDTH,MULH and SOURCE. As we have seen, LCPFCT itself utilizes 10 local vector variables in the original version( RHOT, RHOTD, DIFF, FSGN,TERM,TERP, FABS, FLXH, LNRHOT,LORHOT), 2 local variables for the T3E revision (RHOT, RHOTD) and 4 local variables for the SP revision (RHOT,RHOTD,FSGN,FABS).

Certainly we have many vector temporaries which we would prefer remain in cache as LCPFCT and its support routines are repeatedly called. LCPFCT alone uses from 9 to 17 vectors, and since SOURCES is called 4/10 (for 3 dimensions) as often as LCPFCT, we should add in the workspace used by SOURCES; SOURCES utilizes an additional 6 to 7 vectors. While most of these are declared in named COMMON, the inputs to LCPFCT (rvold,rvnew,cvold,cvnew,rold,rnew,eold,enew) and SOURCES (pintm,grv,src) are not and may be aligned randomly with respect to the other data. These other inputs are in reality 13 vectors approximately of length Mpt (although they are declared differently). In total, we would really like to hold 28 to 37 vectors of length Mpt in cache. With  96K(T3E) to 128 K(SP2) on-chip caches, we can be sure of significant degradation in performance at grid dimensions of about 800 (for REAL*4 data) or about 400 (for REAL*8 data). This assumes that none of these vectors has alignment problems in cache. Given that three of these vectors can be aligned randomly with respect to the others (those not in common), we can expect serious problems at shorter vector lengths without further tuning.

There are three things to pursue in further tuning the computational kernel of FAST3D (i.e., SOURCES and LCPFCT). They are listed in order of difficulty:

  1. SOURCES (and VELOCITY) should be tuned to remove as many temporary vectors as possible. If this means fusing loops, then this should probably be done, unless the fused loop contains a conditional (such as MAX or MIN). In that case, extra vector temporaries may be necessary for a tuned SP version.
  2. ALL of the active variables in the computational kernel should be moved to a single common block and declared with vector lengths which are not an integer multiple of a power of 2. This includes the variables not declared in common in gasdyn.f: rvold,rvnew,cvold,cvnew,rold,rnew,eold,enew and pintm,grv,src. This block of variables will assure that excessive cache conflicts do not occur, and will provide the best performance for grids small enough for one-dimensional problems to fit entirely in cache.
  3. Reducing the dependence of performance on cache size can only be accomplished by blocking the entire sequence of computations (i.e., loops) into subloops which fit into cache. While this is not difficult for a sequence of loops in the same subroutine with the same limits and positive stride (the author notes that loops in the SOURCES subroutine have negative stride, dependent iterations), it is a difficult undertaking for a complex code designed to be modular and flexible. Alternatively, if variable alignment is always bad (even for small problems), then the impact of cache size will be reduced.

Array declarations and COMMON

The working arrays in the LCPFCT algorithm (declared in the include file FCT_vars.h) are all declared in named COMMON areas and their sizes controlled by the integer variable Mpt. In general this is good practice since variables in common will be consistently aligned with each other and with 128,64 and 32 bit boundaries. If these common areas contained a mixture of array variables of various element sizes, then they should be organized from largest to smallest in element size. Also, the Mpt variable should be declared so as not to promote alignment of variables with the same line in cache. The best way to do this is to not set Mpt to large powers of 2, or even as an integer multiple of a large power of 2.

Since variables in different named common areas (see the file FCT_vars.h) are used in the same program, it is suggested that all the named COMMON blocks be combined into a single COMMON block. Otherwise, common blocks of sufficient size may be aligned to page boundaries (as is done on the IBM SP), and subsets of the variables may therefore align on large powers of 2. For example, both HADUDTH (first in named common FCT_VELO) and SOURCE (first in named common FCT_MISC) may map to the same line of cache. As stated in the previous section, ALL of the active variables in the computational kernel should be moved to a single common block and declared with vector lengths which are not an integer multiple of a power of 2.

Parallel Transposes

The parallel portion of the FAST3D code is entirely contained in a single subroutine (blk_tp) which performs successive transposes of data in a Cartesian grid. While the details of this transpose have not been explored in any detail, we offer some general comments on parallel transposes.

There are at least three things to consider in implementing a parallel transpose algorithm:

  1. the amount of buffer space required by the algorithm over and above the data grid,
  2. the size of messages transferred between processors, the processor grid layout and scalability.

A parallel transpose code moves quite a lot of data around. Let us look at an example. Data is stored in a P by Q processor grid so that the initial 3-D data set A(1:L,1:M,1:N) is stored as A(1:L,p*M/P+1:(p+1)M/P,q*N/P+1:(q+1)*N/Q), where (p,q) is the processor coordinate in the P x Q processor grid. p and q are numbered so that 0<p<P and 0<q<Q. For simplicity, we have assumed that M and N are multiples of P and Q. In other words, the initial state has the first index of A all on each processor, and the last two indices distributed across the processor grid. To rearrange the data so that the second index of A is all on each processor (call this a left transpose), we want to redistribute A as A(p*L/P+1:(p+1)L/P,1:M,q*N/P+1:(q+1)*N/Q). Note that the third index range is unchanged by the transpose. This means that processors in different columns in the processor grid (i.e., different q) do not need to communicate. Equivalently, data is only sent along process rows. This indicates that scalability should be better than one might have imagined, since all processors do not need to communicate with all other processors to accomplish the transpose.

The amount of data movement that is needed is clear from the appearance of A before and after the transpose. It is clear that each processor has L*(M/P)*(N/Q) data before the transpose. It is also clear that only (L/P)*(M/P)*(N/Q) of these same data are needed after the transpose. This means that L(1-1/P)*(M/P)*(N/Q) data must be sent (and received) by each processor in the processor grid row. The data communicated by each processor is therefore 2*(1-1/P)*L*M*N/(P*Q). In other words, for large P almost all of the data must be moved, since the total data set consumes L*M*N/(P*Q) data on each processor.

In principle, all of the data to be moved can be collected and an MPI_ALLTOALL used to pass this data around. In practice, this involves generating as much buffer space as the original data grid consumed. For applications where the transposed data is a small fraction of the total data stored (unlike FAST3D), this type of transpose can be accomplished as shown in transposes.f, which makes use of MPI cartesian communicators and the FORTRAN 90 RESHAPE statement. The comments in the code explain the functionality in detail. This code is primarily included as a example of the use of MPI cartesian communicators. Although parallel performance should be relatively high for collective communications due to the use of MPI_ALLTOALL calls, we do not vouch for the efficiency of the FORTRAN 90 RESHAPE call, which may vary with the platform FORTRAN implementation.
 

Conclusions

The original FAST3D and LCPFCT code is written in such a way that many vector temporaries are defined and used. The reason for this is code flexibility, modularity, understandability and ease of modification. Unfortunately, this reduces performance relative to code with fewer vector temporaries on RISC systems with memory hierarchies. For the SP2, an architecture with a high performance memory system relative to processor performance, the original code can be tuned to increase performance in the kernel subroutine by a factor of 1.3 to 1.7, depending upon whether or not the data are in cache; for large problems, data will not be in cache, and the speedup factor of 1.7 will apply. This tuned performance is in excess of 85% of the peak performance the P2SC processors for the work in the kernel subroutine. For the T3E-600, an architecture with lower memory system performance relative to processor speed, the performance of the kernel subroutine can be increased by a factor of 1.9 to 3.3, again depending upon whether or not the data are in cache; a factor of 3.3 can be expected for large problems. This tuned performance is 77% or more of the expected performance of the kernel subroutine running on the Alpha processor. Since computers in the foreseeable future will have even greater disparities between processor and memory speed, the need for tuning the LCPFCT code will become even greater.

Two limitations of the LCPFCT code for performance are the use of the parallel transpose and a 1-D solution algorithm. These are limiting factors for parallel execution on RISC-based nodes since the parallel transpose has to communicate a large amount of data and the 1-D solution algorithm has low data reuse. Data reuse is essential to make the best use of cache. A 3-D finite difference algorithm can be very efficient on parallel RISC platforms for the same reasons. For example, a finite difference stencil which uses 6 nearest neighbors reuses the data 6(or 7) times. On the other hand, flux-corrected transport involves fairly expensive floating point operations (MAX,MIN and SIGN) and it is possible to fuse (and possibly block) the loops in the computational kernel of LCPFCT to reduce the expense of loading and storing data relative to floating point work. For the SP2, the SIGN and MAX/MIN calculations consume up to 40% of the time in the tuned kernel code. The important performance considerations of this tuning process were examined and further suggestions for tuning made in this report.

The principal objective of tuning FAST3D should be to complete the process partially undertaken in this report. First, loops should be rewritten in such a way to minimize the number of temporary vector variables. In fact, we have seen that this is a somewhat delicate process; as loops are fused, lower performance may eventually result for in-cache data as the number of registers required to process the fused loops exceeds the number of registers available. Second, the vector variables utilized in the computational kernel (lcpfct and its support routines) should be declared in such a way that cache conflicts are minimized. The best way to do this is to declare all of these variables in a single common block and assure their lengths do not generate cache conflicts. A minimal solution is to declare vector lengths as something other than integer multiples of an integer power of 2. Third, the authors of FAST3D should confirm that FAST3D cannot be rewritten in a more cache-friendly manner. It would be very nice if, regardless of the grid dimensions, the variables needed for the 1-D algorithm could be loaded into cache only once. If the LCPFCT 1-D algorithm were a simple sequence of unidirectional (stride +1) loads and stores of a set of vectors, then one way to accomplish this would be to block the entire sequence of loops (across subroutines, etc) such that the entire set of vectors remains in cache throughout the loop sequence. Given that data alignment problems are under control, this should already be happening for small grids.

There are serious problems with this sort of tuning for FAST3D, which is intended to solve a variety of problems. Code readability and modifiability suffers greatly. The worst offender is loop tuning; the process of controlling data alignment should be much more user-transparent to implement. A desirable solution for the loop tuning problem (but probably not the data alignment problem) is to find a code preprocessor which could do the loop tuning. Perhaps some time should be invested in the future to testing such tools. In general, results obtained by using high levels of optimization (i.e., which reorder and rewrite loops) provided by FORTRAN compilers are not very promising.

References

The Benchmarker's Guide to Single-processor Optimization for CRAY T3E Systems

Alpha Architecture Handbook, EC-QD2KC-TE (October 1998) - [PDF/5384 Kb]

CRAY T3E Fortran Optimization Guide, SG2518, 3.0

Optimization and Tuning Guide for Fortran, C, and C++, SC09-1705-01 (IBM)

RS/6000 Scientific and Technical Computing: POWER3 Introduction and Tuning Guide

Patterson, D., and Hennessey, J., Computer Architecture: A quantitative approach, Morgan Kaufman, San Francisco, 1996.