Problems FAST3D can be used to solve arise in the context of fluid dynamics in High-Reynolds number flow, and the example problems provided with the FAST3D code solve blast and muzzle flash problems, as well as air flow around various bodies at high Mach number.
cumulative
self
self total
%time seconds seconds
calls ms/call ms/call name
41.4 547.29 547.29 21077280
0.03 0.03 .lcpfct [5]
8.7 662.45 115.16
2119680 0.05 0.05
.put_pvs [8]
8.7 776.83 114.38
4215456 0.03 0.03
.velocity [9]
8.0 882.25 105.42
2107728 0.05 0.41
.gasdyn [4]
5.9 960.84
78.59 2377728 0.03
0.03 .get_pvs [10]
4.3 1017.42 56.58
2107728 0.03 0.03
.cnvfct [11]
3.7 1065.75 48.33
2142720 0.02 0.02
.get_gvs [12]
2.8 1102.70 36.95
.__mcount [13]
2.0 1129.32 26.62
8430912 0.00 0.00
.sources [14]
1.8 1153.35 24.03
2115684 0.01 0.01
.copygrid [15]
1.7 1175.96 22.61
612 36.94 1935.58 .ijk_fct [3]
1.6 1196.57 20.61
.readsocket [16]
1.5 1216.90 20.33
2115072 0.01 0.07
.get_ijk_cols [6]
1.1 1231.42 14.52
2115072 0.01 0.01
.cell_va [17]
The subroutines lcpfct, velocity, gasdyn, cnvfct and sources account for 64.4% of the execution time in this run. lcpfct is in fact the "computational kernel" of the code and the subroutines gasdyn,velocity, cnvfct and sources support the general 1-d flux-corrected transport algorithm in lcpfct.f. We will assume this profile roughly holds for both target architectures we will be studying below. These target architectures are the Cray T3E and the IBM SP2. We could in principle profile the code on the Cray T3E, but at the time of writing the code had not been ported to that platform. The T3E and SP2 are MPP platforms. The T3E uses the Compaq Alpha 21164 processor and the IBM SP2 utilizes the IBM Power2 Super Chip (P2SC) processors. For the platforms at SDSC, the T3E processors run at 300 MHz, and the SP2 at 166 MHz. The profile above contains less than 2% parallel overhead (the readsocket calls), so that we are essentially seeing a profile of single processor performance. Note that the __mcount calls are due to profiling. We concentrate on single processor performance, and leave parallel tuning comments for the section on parallel transposes below.
The only way to compare the performance of a code to what we expect the performance to be for a given computer is to count cycles. This allows us to assess the performance of the code independent of the programming language or any effects other than the architecture on which the code is running. The technique of counting cycles comes down to several assumptions and steps:
In the process of tuning the code,several of the above assumptions can and will fail. Trying to understand which assumptions fail and generate poor code performance is a large part of the tuning process. Typical examples of this are:
The required cycles for most operations in which we are interested are
given in the section below on the architectures. We have a good grasp of
cycles required to move data from memory to cache, from cache to register,
and how many cycles are required to do the floating point computations.
In the loops to be analyzed, we will be placing the operations and cycle
predictions in a table with three sections. We briefly describe the tables
using the SP2 as an example.
FLOATING POINT:
REAL*8
REAL*4
Operations
w/o FMA w/FMA w/o FMA
w/FMA
----------
------- ----- -------
-----
FMA
0
2
0 2
*
2
1
2 1
+
7
5
7 5
ABS
3
3
3 3
CVLS(=)
1 1
Predicted cycles[FPOPs/2]: 6 5.5 6.5 6
This floating point section for the IBM SP specifies the number and type of operations. The predicted cycles entries are based on the cycle costs from the above architecture descriptions and the operation counts, and all formulas for cycle predictions are given in square brackets. All cycle predictions are per loop iteration. Double and single precision entries are given for the SP, and we assume throughout that no rearrangements of operations are allowed which could change the numerical results. A column is given including the FMA operation also assuming that no changes in numerical results are allowed. This strict adherence to numerical results is guaranteed on the SP by compiling with the -O3 -qstrict options, or with -O2. Note that compiling with -O3 -qstrict does not mean that all intermediate results are converted to single precision; only an equals sign results in a conversion back to single precision. The most important quantities are highlighted.
MEMORY<->CACHE:
Operations
REAL*8
REAL*4
----------
------
------
LOAD
3
3
STORE
1
1
STORE after LOAD
Predicted cycles:
[L*1.3+S*2.3+SaL*1.0] [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:
3*1.3+1*2.3=
3*0.7+1*1.3=
Predicted
cycles:
6.2
3.4
The memory-to-cache section specifies the number of loads and stores required by the loop in the infinite loop iteration limit, where all loads and stores must go to main memory. Again, the important cycle predictions are highlighted. The memory-to-cache predictions assume that there are no pathological cache conflicts due to data alignment in cache.
CACHE<->REGISTERS:
Operations
----------
LOAD
3
STORE
1
Predicted cycles(w/o quad)
2 [(LOADs + STOREs)/2]
Predicted cycles(w/ quad)
1 [(LOADs + STOREs)/4]
The cache-to-register section predicts cycles required for data in cache. The highlighted predictions are shown for both quad loads and double loads, which support 4 and 2 double precision loads per cycle, respectively. For single precision loads, the quad loads are unavailable and the w/o quad predictions should be used.
Overall cycles per iteration
predicted
observed
----------------------------
---------
--------
Data in cache,
REAL*4
6
Data in cache,
REAL*8
5.5
Data in memory, REAL*4
6
Data in memory, REAL*8
6.2
Finally, the first three tables are used to predict overall per iteration cycle cost, assuming load/stores and calculations overlap. For data in memory, the maximum of the MEMORY<->CACHE predicted cycles and floating point predicted cycles is used. For data in cache, the maximum of the CACHE<->REGISTER predicted cycles and floating point predicted cycles is used. Sometimes loops will be referred to as load-store bound for obvious reasons. What we really want are CPU (or floating point) bound loops that are doing work and hiding things we don't care about behind this work. Most loops have been tested and run on the platform in question and an observed cycle count is given for the same run parameters (single precision, double precision, etc). All observed cycle counts are from code which isolates the loop in question and also prevents any cache alignment problems. The observed cycle counts should be interpreted as representative of the best performance of the processor, given that it is running assembly generated by a particular compiler. Actual performance may vary.
Below we examine a number of loops in the original and revised code and find that most code exhibits a large degree of predictability. However, there are a number of operations unique to FAST3D which require some extra analysis first: MAX, MIN, ABS and SIGN, which are covered after we review the SP2 and T3E processor characteristics.
1) given a FORTRAN or C loop, what the hardware should actually be
doing,
2) how long the desired operations should take.
For the T3E, this typically comes down to understanding how floating point data moves through the following functional diagram:
Not shown are the integer pipelines called E0 and E1, which do non-floating point work and access their own set of 32 integer registers. The integer pipelines also execute all (including floating point) loads and stores. The Alpha issues an instruction to each of the pipelines FM (floating point multiply), FA (add), E0 and E1 every cycle. Loads and stores can occur while floating point operations are being done.
In summary, we can write:
Predicted floating point cycles (T3E) = max(FM ops, FA ops),
where FM ops occur in the FM pipeline and FA ops occur in the FA pipeline.
Operation
Latency(CP/64 byte line) Bandwidth(CP/8 byte word)
Stream buffered load/store
24
3
Load
85
12
Store
85
7
The figures for stream buffered bandwidth have been coerced into one value. In fact, loads are listed as 20% faster than stores, with 3 cycles per word an average and adequate for our analysis below, since (as we are about to see) half the memory access time is latency. These latencies and bandwidths can be used to compute an expected per word access time. We concentrate on memory access via the hardware streams. For regular access of memory, we can expect the hardware to prefetch the data using 6 hardware streams which lie between memory and cache.
If we access memory at stride 1 (as is the case in the loops we will tune in FAST3D), then every 8 loads we cause a cache miss and incur the latency in the table. So per required load or store we get 3 + (24/8) cycles. In other words, 6 cycles per word. This 6 cycle cost is confirmed by code experiments, although stores alone require about 9 cycles per word. This is probably because of the write-allocate nature of the cache: a write which misses in cache causes the data in memory to be loaded into cache before being written back out again when the cache line is needed by other data. In fact, one would expect stores on average to take about twice as long as loads. This is the case for the SP (see below), but the T3E typically needs about the same time for loads and stores.
Cache-to-register load bandwidths are listed in the T3E Fortran Optimization Guide as two 8-byte words per cycle, and one 8 byte word per cycle on store. Latencies of dubious meaning are also given in the Guide. While we may eventually figure out the details, it is sufficient to note that the peak transfer rate is only occasionally attained, and that cache-to-register times are as much as twice what the peak rate would predict. A typical value from testing would indicate that cache-to-register cycle cost predictions based on the published values above should be multiplied by about 1.6.
A NOTE ON STREAM BUFFERS: The stream buffers are an off-chip mechanism which prefetch same stride accesses to main memory after the second miss in cache. These buffers are usually controlled by either or both the system administrator and the user. Some systems may have the stream buffers deactivated for hardware and software problems present in early releases of the system. One can check by running memory-bound code after alternately executing the cshell commands setenv SCACHE_D_STREAMS 1 (streams on) and setenv SCACHE_D_STREAMS 0 (streams off). If there is no change in performance, then the system administrator has prohibited streams.
In the diagram, we have left out the 8 Kbyte cache (the data cache) between
the cache shown (the secondary cache) and the registers. It is not particularly
interesting for the analysis below. The prime reason to recall its existence
is to prevent any catastrophic misalignments of data; the data cache is
1024 eight byte words long, so accessing data 1024 elements apart will
cause excess data movement between cache levels.
As was the case for the T3E, we are generally trying to predict how floating point data moves through the following functional diagram of the P2SC:
Many details of the chip are not shown. Not shown are two fixed-point (integer) units (FXUs) which perform integer calculations on their own set of registers. The FXUs are also responsible for executing loads and stores, so that calculations and memory operations can overlap.
The P2SC has replicated floating point units that perform the same
tasks. So 2 multiplies, adds, or multiply-adds can take place in the same
cycle. A multiply-add or FMA is a single instruction that can be written
a=b*c+d. Since the floating point pipelines are 2 to 3 cycles long (FMAs
require 3 cycles), we need 4 to 6 independent floating point operations
(*,+ or FMA) to keep the floating point units busy on every cycle. The
floating point units perform only 64 bit IEEE arithmetic, with conversion
from single precision on load and a pre-store conversion instruction, discussed
in detail below.
Summing this up, we can write:
Predicted floating point cycles (SP2) = FPOPs/2,
where an FPOP is any of FMA, *,+, etc.
The store on the P2SC can also be thought of as a load and then a store-after-load. Since the cache is a write-allocate cache, a store to data not already in cache involves loading the data into cache, followed by a write back to memory when the cache line is needed for other data. So we can think of a store-after-load as taking 1.0 cycles for a total of 2.3 cycles for a store as shown in the above formula.
Due to the nature of cache buffering, note how these performance figures apply in general to data movement between memory and cache. We know that a cache line is 32 words. If we access a particular word of memory, then the entire cache line surrounding that word must be loaded into cache as well. So if actually use every word, the above calculation predicts how many cycles are required for every load and store. However, if we access every other word, then for every load we see in our code, we are actually loading two words - one is used, the other simply comes along for the ride. So stride two access of j loads and k stores requires 2.6 j + 4.6 k cycles. The worst case is for strides greater than 31, where only 1 word in each cache line is used. In this case, the formula indicates that j loads and k stores would require 41.6 j + 73.6 k cycles. In practice, the P2SC does better than this for large strides, requiring only about 44 cycles for one load and one store at stride 32. It would appear that the P2SC is able to overlap loads and stores very well for large stride access.
Cache-to-register performance is very good relative to processor speed
on the P2SC. There are two options for loading from cache. There are single
(4 bytes or single precision) and double word load instructions, which
can load a single or double precision word every cycle, and also quad load
instructions, which can load two consecutive double words per cycle into
consecutive registers. While the second option is faster, it makes it more
difficult for the compiler to schedule a loop, since every load consumes
two registers. Also, since there are two fixed point units, loads and stores
between cache and register require either 1/2 or 1/4 (for quad loads of
double precision data) cycle. IBM compilers will only generate quad load
instructions given the appropriate compiler options. In particular, the
options -qarch=p2sc and -qtune=p2sc. It is also possible
to use pwr2 in place of p2sc.
The advantage of using single precision is that the cache to memory time is cut nearly in half, and of course we can store twice as many numbers in available memory. The single precision bandwidths are listed in the previous section on memory system performance for the IBM SP2. (The CRAY FORTRAN compiler for the T3E does not support 8 byte floating point arithmetic, so we do not discuss single precision performance on the T3E.)
Do 5 I =
I1, IN
FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
&
AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
5
SOURCE(I) = 0.0
Let us concentrate first on the max and min functions (the generic version of AMAX1 and AMIN1). This sequence of floating point operations
max(0.0,min(a,b,c))
or:
max(0.0,min(a,min(b,c)))
inst. operands pipeline
function
----- --------
-------- --------
ldt f6,-16(r1)
E0 or E1 load operand a into register f6
ldt f3,-16(r4)
E0 or E1 load operand b into register f3
cpys f6,f6,f13
FM or FA copy f6 into f13
cmptle f6,f3,f2 FA
compare f6 to f3
fcmoveq f2,f3,f13 FA
move f3 to f13 if f3<f6
stt f13
E0 or E1 store the result
This is a sequence of two loads, a copy, a compare, a conditional move and a final store of the result. The pipeline in which the instruction executes is given in the last column; E0 and E1 are the two integer pipelines, FM is the floating point multiply unit, and FA is the floating point add pipeline. These are given in order to help us count the cycles required for the min/max operation. If there were no other operations to be performed, then this sequence of operations would require the sum of all the operation pipeline lengths to complete, or one less than the pipeline depth due to forwarding. (One less than the pipeline depth is called the pipeline latency). For the above operations, the latencies are 2 cycles for loads (the integer pipeline depth is 3) and 4 cycles for the FM or FA operations. The store operation produces no result and no latency is present. With independent min evaluations, say due to unrolling and clever scheduling, we can overlap operations in different pipelines and also keep the pipelines full. The two loads can overlap and execute in one cycle. This loading can also overlap with copy operations in the FM pipeline which can also overlap with operations in the FA pipeline. The longest chain of operations is in fact in the FA pipeline, where the cmptle and fcmoveq operations must complete in sequence; they cannot overlap. As an example of this overlapping, let us consider a loop consisting only of min evaluations. If we unroll the loop by 4, then we can keep the FA pipeline filled and generate a cmptle and fcmoveq result every cycle. A picture of what we hope the Alpha 21164 would be doing is as follows, where the pipelines have been drawn out to show how resource conflicts control the sequence and timing of execution:
.
.
.
.
.
- 12345 FM
|
FCMOVEQ4 12345 FA
|
- 123 E0
Iteration I-1
- 123 E1
|
CYCLE: .....12345678....
------------------------------------------
CPYS1 12345 FM
|
CMPTLE1 12345 FA
|
LD5 123 E0
|
LD6 123 E1
|
CPYS2 12345 FM
|
CMPTLE2 12345 FA
|
LD7 123 E0
|
LD8 123 E1
|
CPYS3 12345 FM
|
CMPTLE3 12345 FA
|
ST1 123
E0
|
ST2 123
E1
|
CPYS4 12345 FM
|
CMPTLE4 12345 FA
|
ST3 123
E0
|
ST4 123
E1 Iteration I
-
12345 FM
|
FCMOVEQ1 12345 FA
|
-
123 E0
|
-
123 E1
|
-
12345 FM |
FCMOVEQ2 12345 FA
|
-
123 E0
|
-
123 E1
|
-
12345 FM |
FCMOVEQ3 12345 FA
|
LD1
123 E0
|
LD2
123 E1
|
-
12345 FM |
FCMOVEQ4 12345 FA
|
LD3
123 E0 |
LD4
123 E1 |
CYCLE: .....12345678....
----------------------------------------
CPYS1
12345 FM |
CMPTLE1
12345 FA Iteration I+1
LD1
123 E0 |
LD2
123 E1 |
.
.
.
.
.
Note that the stores in iteration I are being performed for iteration I-1, while loads are being performed for iteration I+1. Forwarding is assumed for every pipeline, so that we need not wait for the last stage to write the register file. In the eight cycles shown, four evaluations of a min are computed, completing on cycles 5 through 8, yielding a performance of 2 cycles per evaluation. For the purpose of counting floating point cycles, we can think of a min as two floating point adds and a multiply. Assuming sufficient unrolling, we can assume overlapping and ignore the dependencies. Even if the other instructions in a loop iteration are not other mins, we still tie up the FA unit for two cycles. Also note that the FA pipeline is the only pipeline whose resources are busy every cycle. Both the integer and floating-point multiply pipelines are unused for four complete cycles.
Interestingly enough, the T3E does not manage 2 cycles per iteration for a loop composed entirely of min evaluations. Instead, 3.3 cycles are required. However, for the loop composed of
max(0.0,min(a,b,c))
evaluations, the T3E can complete each iteration of this loop
unrolled by 3 in 7.2 cycles, or 2.6 cycles per min/max evaluation. In the
author's experience with counting cycles on the T3E, getting only 70% of
the predicted performance is not atypical, so these timings are reasonable.
inst. operands
functional unit
---- --------
---------------
lfd fp4=a(gr3,-7984)
FXU
lfd fp2=b(gr3,-11984)
FXU
fcmpu cr4=fp2,fp4
FPU
bc CL.242,cr4,0x20/flt
BPU
fmr fp4=fp2
FPU
CL.242:
stfd c(gr3,16)=fp4
FXU
The quad loads present in the actual code have been eliminated for simplicity, but for many iterations of the loop, the compiler has managed to schedule quad loads and stores which reduce the time spent in each double precision load or store to essentially 1/4 cycle. Scheduling quad loads and stores can difficult since you have to load and store to adjacent registers and adjacent memory locations. There are three processing units in the Power2SC architecture involved: the fixed-point unit (FXU), the floating-point unit(FPU), and the branch-processing unit(BPU). There are two FXUs, two FPUs and one BPU, so given some loop unrolling, we can expect quite a lot of overlapping of work, even though a single iteration of min in a loop results in the dependent sequence of operations shown in the assembly fragment. There is really no indication in the IBM documentation about predicting required cycles in such a sequence of instructions. Instead, the documentation states that there are two sources of delay. The first is the unavoidable delay of the fcmpu in the floating point pipeline, which has a latency of 1-2 cycles. The second delay comes about in the branch instruction as the BPU decides what to do next. The fact that it is unknown which instruction will be executed after the branch instruction makes the pipelining of instructions more difficult (or perhaps impossible), so the exact effect of the branch instruction is unclear. The best we can do is compare to run times we predict without the conditionals and compare and complain. Several tests have been done to determine the effect of conditional branches. First, a loop composed entirely of min evaluations has been unrolled and timed. The best performance is 2.5 cycles per iteration, after manually unrolling by 8. A quick inspection of the above SP assembly indicates that there are 1.5 cycles of work per iteration in an FXU, and 1-2 cycles in an FPU, depending on whether or not the branch is taken. Assuming that both the 2 FXUs and the 2 FPUs can be utilized, and random data are used, this is and average of 3/4 cycle of work per iteration. So the presence of the conditional branch has significantly reduced the performance from what we expect from a fully pipelined sequence of instructions. A second test is performed by timing a loop consisting of max(0.0,min(a,min(b,c))) evaluations, which requires about 6 cycles per iteration after some experimentation with unrolling (see VDMAXMIN below). There are 3 loads and 1 store, 3-6 floating point operations(depending on how many branches are taken), and 3 branch instructions per iteration. If the branch instructions were not costing us anything, we would expect this to require 2.25 cycles per iteration on average. A fairly consistent explanation of these timings is a branch delay of 2-2.5 cycles during which the BPU is busy but the other functional units are not. While we also suspect that pipelining is disrupted by the presence of the branch instructions, this is a difficult effect to quantify. Removing the branch instructions entirely is a possible modification (see the next section ).
Since there are unused cycles in the FPU and FXUs while the BPU is busy, it is (in principle) possible to hide instructions independent of the branch outcome behind the BPU delay (see the IBM Optimization and Tuning Guide for Fortran, C, and C++). As in the case of SIGN evaluations (see VDSGN and VDSIGNDIFF below), one might wonder how many of the idle FPU and FXU cycles can be reclaimed by adding work to a loop containing only max(0.0,min(a,min(b,c))). No option that the author has tried has managed to hide any additional work in this loop that is appropriate for the lcpfct subroutine. In fact, adding work can result in serious reductions in overall performance, as will be demonstrated in performance evaluation of code revisions below. We have had marginally better luck with the SIGN evaluations.
The subroutine VDMAXMIN:
subroutine vdmaxmin(n,term,fabs,terp,flxh)
implicit none
real*8 term(*),fabs(*),terp(*),flxh(*)
integer n8,n,i
n8=rshift(n,3)*8
do i=1,n8,8
flxh(i) = max ( 0.0d0,
&
min ( term(i), fabs(i), terp(i) ) )
flxh(i+1) = max ( 0.0d0,
&
min ( term(i+1), fabs(i+1), terp(i+1) ) )
flxh(i+2) = max ( 0.0d0,
&
min ( term(i+2), fabs(i+2), terp(i+2) ) )
flxh(i+3) = max ( 0.0d0,
&
min ( term(i+3), fabs(i+3), terp(i+3) ) )
flxh(i+4) = max ( 0.0d0,
&
min ( term(i+4), fabs(i+4), terp(i+4) ) )
flxh(i+5) = max ( 0.0d0,
&
min ( term(i+5), fabs(i+5), terp(i+5) ) )
flxh(i+6) = max ( 0.0d0,
&
min ( term(i+6), fabs(i+6), terp(i+6) ) )
flxh(i+7) = max ( 0.0d0,
&
min ( term(i+7), fabs(i+7), terp(i+7) ) )
end do
do i=n8+1,n
flxh(i) = max ( 0.0d0,
&
min ( term(i), fabs(i), terp(i) ) )
end do
return
end
when compiled with
xlf90 -qfixed -O3 -qtune=p2sc -qarch=p2sc -c vdmaxmin.f
can return a max(0.0,min(a,min(b,c))) evaluation in as little as 6.1 cycles:
vector length cycles per
evaluation
real*4 real*8
------------- ----------------
64
6.8 6.7
128
6.6 6.4
256
6.4 6.2
512
6.3 6.1
These tests assume that the data are in cache.
Although we have seen in the previous section that we can in some
cases evaluate independent max and min instructions fairly quickly, we
will find in practice that the branch instructions play havoc with scheduling
of instructions in a loop. In fact, the loops below with the worst performance
will be those containing conditional branches. Frequently, the expected
execution time is doubled or tripled over what we might surmise from the
above analysis of performance. The above analysis essentially says that
if you assume the best case of the conditional branch expense being completely
hidden, then you should expect a cycle or two to be consumed in each max
or min. The existence of the conditional move instruction entails no particular
problem in scheduling, because the operations complete in the floating
point pipelines just as multiplies and adds. One might ask if there is
a way to avoid the branch instructions the compiler generates for its MAX
and MIN instructions. The answer is yes, although the alternate method
does not necessarily yield the same floating point results.
Consider the two identities:
2 max(A,B) = A+B+abs(A-B)
2 min(A,B) = A+B-abs(A-B)
where abs is the absolute value. Ignoring for the moment that these are not floating point identities, we can rewrite our required expression
max(0.0,min(A,B,C)) = .125*[A+B-abs(A-B)+2C - abs(A+B-abs(A-B)-2C)
+abs(A+B-abs(A-B)+2C - abs(A+B-abs(A-B)-2C))]
Since the operations involved are now addition, multiplication and absolute value, we can count the cycles required for this alternate version of the computation quite easily. Let us assume that we have enough independent evaluations and unrolling of a loop has exposed these to the compiler. We now count operations and cycles:
FLOATING POINT:
REAL*8
REAL*4
Operations
w/o FMA w/FMA w/o FMA
w/FMA
----------
------- ----- -------
-----
FMA
0 2
0 2
*
2 1
2 1
+
7 5
7 5
ABS
3 3
3 3
CVLS(=)
1 1
Predicted cycles:
6 5.5
6.5 6
MEMORY<->CACHE:
Operations
REAL*8
REAL*4
----------
------
------
LOAD
3
3
STORE
1
1
STORE after LOAD
Predicted cycles:
3*1.3+1*2.3=
3*0.7+1*1.3=
Predicted cycles:
6.2
3.4
CACHE<->REGISTERS:
Operations
----------
LOAD
3
STORE
1
Predicted cycles(w/o quad)
2
Predicted cycles(w/ quad)
1
Overall cycles per iteration predicted
---------------------------- ---------
Data in cache, REAL*4
6
Data in cache, REAL*8
5.5
Data in memory, REAL*4
6
Data in memory, REAL*8
6.2
We have used values for the loads and stores assuming that all three operands are loaded from vectors and the result is stored back to a vector. Even in the worst case the computation is only marginally memory bound; for in-cache data, this evaluation is certainly CPU bound for double precision data. For single precision calculations, the computation will certainly be CPU bound (this is good).
Note that the floating point differences between these two methods of computing max and min can be quite large. The max and min calculations have essentially the same problems as floating point adds and subtracts:
A=1.0, B=.12345678: min(A,B)=
0.1234567836
(A+B-abs(A-B))/2= 0.1234567761
A=10000.0, B=.12345678: min(A,B)=
0.1234567836
(A+B-abs(A-B))/2= 0.1232518256
A=1.0*10^35,B=.12345678: min(A,B)=
0.1234567836
(A+B-abs(A-B))/2= 0.0000000000E+00
These comparisons are for single precision arithmetic.
We do briefly discuss the use of this alternate form of max and min
in the performance section below. However, the reader may wish to discount
this method entirely due to its finite precision effects on the accuracy
of calculations. In the end, we are able to get fairly good performance
without resorting to this alternative. We comment on this more in the
section on performance analysis of the revised code below.
There are several other operations present in the LCPFCT kernel
whose cycle costs are not immediately obvious. These are ABS, the absolute
value, and SIGN, a two-input, single-output function which returns a number
whose absolute value is the first argument, and with the same sign as the
second argument. A quick look at the assembly reveals how these operations
are accomplished.
do i=1,n
c(i)=abs(a(i))
end do
results in a loop containing the assembly instruction:
inst operands
functional unit
----- -----------
---------------
fabs fp3=fp2
FPU
Therefore, we expect each floating point unit to be able to compute an absolute value every cycle.
The SIGN function is succinctly defined by:
c=sign(a,b): if b < 0, then c=-abs(a)
else, c=abs(a).
Like MIN and MAX, the SP implements this function as a conditional. Consider the loop
do i=1,n
flxh(i) = sign(cst,term(i))
end do
The following evaluates a single sign function. In lcpfct, the first argument is a constant and the positive and negative absolute values can be stored outside the loop. The following fragment of a the loop body assumes that fp0(floating point register 0)=0.0, fp1=abs(a), and fp2=-abs(a). Each component of term is loaded into floating point register 5.
inst operands
functional unit
----- -----------
---------------
.
.
.
fmr fp8,fp1
FPU
lfq fp5,-11992(r3)
FXU
fcmpu 7,fp0,fp5
FPU
bc
B0_IF_NOT,CR7_GT,__L460 BPU
fmr fp8,fp2
FPU
__L460:
stfdu fp8,8(r3)
FXU
.
.
.
Again, the effect of the conditional branch is difficult to assess a priori in terms of cycles, but we do know how to count the remainder of the operations: there are 2 to 3 (depending on the sign of the operand in fp5) cycles spent in a floating point unit, and one cycle in a fixed point unit for the load and store. So if the branch instruction cost is completely hidden, and both floating point units can be completely utilized, we would expect each sign evaluation to require 2.5/2=1.25 cycles on average. For the above loop consisting only of sign evaluations, unrolling by 8 yields a code which runs in 2.5 cycles per iteration (see VDSGN below), confirming the branch penalty of 2-2.5 cycles mentioned in the previous section. So the floating point units have an extra 1.25 cycles per iteration in which they are idle. In practice, we are able to reclaim an additional half cycle of work per iteration by computing the differences of the input vector, a task which has to be done in actual lcpfct algorithm (see VDSGNDIFF below). Experimentation with reclaiming the remaining .75 cycles has not been successful. In fact, scheduling more work (that is appropriate for lcpfct) per iteration than is specified in VDSGNDIFF causes the additional work to take longer than if it were put in a separate loop which could then be efficiently pipelined without the disruption of scheduling caused by branch instructions. Note that the performance of VDSGN compares favorably to the vector MASS version of the SIGN function (which takes two vector operands and therefore is not appropriate), which is listed as 3.5 cycles per iteration in the MASSVP2 library (see http://www.rs6000.ibm.com/resource/technology/MASS/ ). The vectorized version of SIGN is the subroutine VDSGN:
subroutine vdsgn(n,scalar,in,out)
implicit none
real*8 scalar,in(*),out(*)
integer n8,n,i
n8=rshift(n,3)*8
do i=1,n8,8
out(i) =
sign(scalar,in(i))
out(i+1)
= sign(scalar,in(i+1))
out(i+2)
= sign(scalar,in(i+2))
out(i+3)
= sign(scalar,in(i+3))
out(i+4)
= sign(scalar,in(i+4))
out(i+5)
= sign(scalar,in(i+5))
out(i+6)
= sign(scalar,in(i+6))
out(i+7)
= sign(scalar,in(i+7))
end do
do i=n8+1,n
out(i)=sign(scalar,in(i))
end do
return
end
When compiled with
xlf90 -qfixed -O3 -qtune=p2sc -qarch=p2sc -c vdsgn.f
this subroutine can return a SIGN evaluation in as little as 2.5 cycles:
vector length cycles per
SIGN
------------- ---------------
64
3.1
128
2.8
256
2.7
512
2.5
These tests assume that the data are in cache. Single precision cycle costs are slightly higher for long vectors at 2.6 cycles per iteration.
The subroutine VDSGNDIFF reclaims some of the idle FPU cycles and requires the same time per iteration as VDSGN:
subroutine vdsgndiff(n,scalar,in,out)
implicit none
real*8 scalar,in(*),out(*)
integer n8,n,i
n8=rshift(n-1,3)*8
do i=1,n8,8
out(i+1)
= sign(scalar,(in(i+1)-in(i)))
out(i+2)
= sign(scalar,(in(i+2)-in(i+1)))
out(i+3)
= sign(scalar,(in(i+3)-in(i+2)))
out(i+4)
= sign(scalar,(in(i+4)-in(i+3)))
out(i+5)
= sign(scalar,(in(i+5)-in(i+4)))
out(i+6)
= sign(scalar,(in(i+6)-in(i+5)))
out(i+7)
= sign(scalar,(in(i+7)-in(i+6)))
out(i+8)
= sign(scalar,(in(i+8)-in(i+7)))
end do
do i=n8+1,n-1
out(i+1)=sign(scalar,in(i+1)-in(i))
end do
return
end
However, for single precision calculations (i.e., VSSGNDIFF), the code is much slower at almost 4 cycles per iteration. The extra CVLS instruction and the subtract are unable to hide behind the branches.
do i=1,n
flxh(i)
= abs(fsgn(i))
end do
becomes:
$LL00009:
; Codeblock 10
subq
r2, 1, r2
; Ln 28 0x00000160 (In 107)
cpys
f31, f2, f0
; Ln 29 0x00000164 (In 102)
ldt
f2, 8(r1)
; Ln 29 0x00000168 BLOAD (In 271)
addq
r1, 8, r1
; Ln 29 0x0000016c (In 105)
stt
f0, (r3)
; Ln 29 0x00000170 (In 104)
addq
r3, 8, r3
; Ln 29 0x00000174 (In 106)
bgt
r2, $LL00009
; Ln 30 0x00000178 (In 108)
In this case, f0 holds values of flxh(i), and f2 holds fsgn(i). CPYS evaluates the absolute value by reading floating point register 31 (f31), a register which always returns a floating point zero when read. In other words, the sign of zero is assigned to the value in f2. CPYS executes in either the FM or FA pipeline, so we can expect up to two of these to complete in a single cycle.
SIGN is also accomplished as a single call to CPYS. For example, the FORTRAN loop:
do i=1,n
flxh(i)
= sign(fabs(i),fsgn(i))
end do
in Alpha assembly is very simply:
$LL00005:
; Codeblock 6
cpys
f3, f2, f0
; Ln 29 0x000001e0 (In 129)
ldt
f3, 8(r2)
; Ln 29 0x000001e4 BLOAD (In 306)
subq
r1, 1, r1
; Ln 28 0x000001e8 (In 135)
ldt
f2, 8(r3)
; Ln 29 0x000001ec BLOAD (In 304)
addq
r2, 8, r2
; Ln 29 0x000001f0 (In 132)
stt
f0, (r4)
; Ln 29 0x000001f4 (In 131)
addq
r3, 8, r3
; Ln 29 0x000001f8 (In 133)
addq
r4, 8, r4
; Ln 29 0x000001fc (In 134)
bgt
r1, $LL00005
; Ln 30 0x00000200 (In 136)
so that we again have as many as two SIGN evaluations per cycle, in
terms of floating point unit work.
lcpfct is a sequence of 5 loops. These are presented below
with the extraneous intervening instructions removed, except for a (commented
out) subroutine call:
Do 1 I =
I1P, IN
FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
1
DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )
Do 2 I =
I1, IN
LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
RHOT(I) = LORHOT(I)*RLN(I)
2
RHOTD(I) = LNRHOT(I)*RLN(I)
c Call SML_CELL ( i1, iN )
Do 3 I =
I1P, IN
FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
3
DIFF(I) = RHOTD(I) - RHOTD(I-1)
Do 4 I =
I1, IN
FABS(I+1) = ABS ( FLXH(I+1) )
FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
4
TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)
Do 5 I =
I1, IN
FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
&
AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
5
SOURCE(I) = 0.0
We want to estimate how long we expect these loops to take on both of
our target architectures. But first, let's have a look at the loops and
try to understand what work is actually necessary. If we look through the
remainder of the LCPFCT subroutines, we note that many of the vector variables
in the above 5 loops are internal to lcpfct. These variables are
passed through COMMON and are defined in the file FCT_vars.h, the appropriate
sections of which are included below:
C /FCT_VELO/ Holds velocity-dependent flux coefficients
Real
HADUDTH(Mpt), NULH(Mpt), MULH(Mpt)
Real
EPSH(Mpt), VDTODR(Mpt)
Save
/FCT_VELO/
Common /FCT_VELO/
HADUDTH, NULH, MULH, EPSH, VDTODR
C /FCT_MISC/ Holds the source array and diffusion coefficient
Real
SOURCE(Mpt), DIFF1
Save
/FCT_MISC/
Common /FCT_MISC/
SOURCE, DIFF1
C /FCT_SCRH/ Holds scratch arrays for use by LCPFCT and CNVFCT.
FCTSC1,
C FCTSC2 are called SCRH, SCR1 in the published version.
Real
FCTSC1(Mpt), FCTSC2(Mpt), DIFF(Mpt)
Real
FLXH(Mpt), FABS(Mpt), FSGN(Mpt)
Real
TERM(Mpt), TERP(Mpt), LNRHOT(Mpt)
Real
LORHOT(Mpt), RHOT(Mpt), RHOTD(Mpt)
Save
/FCT_SCRH/
Common /FCT_SCRH/
&
FCTSC1, FCTSC2, DIFF, FLXH, FABS, FSGN, TERM, TERP,
&
LNRHOT, LORHOT, RHOT, RHOTD
Note that the subroutine SML_CELL, in the source code file vce.f, takes vector variables rhotd (or lnrhot) and ln as inputs and returns a smoothed rhotd (or lnrhot) vector. In the interest of maintaining the generality of the code, we will assume that rhotd is needed at the point in the code where SML_CELL might be called. Otherwise, it is a temporary vector variable. An inspection of the rest of the LCPFCT algorithm shows the following for the routine lcpfct:
inputs outputs local
vector workspace
------ -------
----------------------
RHOO RHON
FLXH
SOURCES RHOTD
DIFF
HADUDTH
LORHOT
NULH
LNRHOT
LO
RHOT
RLN
FABS
LN
FSIGN
MULH
TERM
TERP
This table reveals that there as many vector temporaries as there are
input and outputs. We should be concerned about this because the loads and
stores associated with using these local workspaces may add considerable
cycles to the execution of the algorithm. Let us begin by counting cycles
in the original 5 loops. Since the two target platforms have different
processor architectures, we have collected the operation counts for each
loop in several ways. We include the loops to make the operation counts
easier to understand. All operation and cycle counts are per iteration.
We cover the T3E first, and then the SP2.
f90 -O2 -Ounroll2 -Opipeline2
unless otherwise noted in the tables or loop comments. These compiler
options have been used to insure that floating point results are not changed
by the compiler, and for performance of the resulting code. Other compiler
options (e.g. -O3) make little difference in the performance, with -Ounroll2
being the most important options for performance. Although the -Opipeline2
option does not generally improve performance for load-store bound code,
we have included it for consistency; it will help in the revised code.
The performance of the loops in the original code only varies slightly
when -Opipeline2 is omitted; some loops run slightly faster, some
loops run slightly slower. All of the test codes have the vector quantities
declared in common and alignment carefully adjusted to prevent the pathological
alignment of vector elements in cache.
Loop1:
Do 1 I =
I1P, IN
FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
1
DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )
FLOATING POINT:
Operations
REAL*8
----------
------
*
2
+
2
Predicted cycles:
2 [MAX(FM ops, FA ops)]
MEMORY<->CACHE:
Operations
REAL*8
----------
------
LOAD
3
STORE
2
Predicted cycles:
3*6+2*6=
Predicted cycles:
30 [6*(LOADS+STORES)]
CACHE<->REGISTERS:
Operations
----------
LOAD
3
STORE
2
Predicted cycles
3.5 [LOADS/2 + STORES]
Overall cycles per iteration predicted
observed
---------------------------- ---------
--------
Data in cache, REAL*8
3.5
5.5
Data in memory,REAL*8
30
30
Loop1 has independent iterations and adequate unrolling should provide plenty of independent calculations for pipelining. For both in-cache and in-memory data, the loop is load-store bound.
Loop 2:
Do 2 I =
I1, IN
LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
RHOT(I) = LORHOT(I)*RLN(I)
2
RHOTD(I) = LNRHOT(I)*RLN(I)
FLOATING POINT:
Operations
REAL*8
----------
------
*
3
+
5
Predicted cycles:
5 [MAX(FM ops, FA ops)]
MEMORY<->CACHE:
Operations
REAL*8
----------
------
LOAD
6
STORE
4
Predicted cycles:
6*6+4*6=
Predicted cycles:
60 [6*(LOADS+STORES)]
CACHE<->REGISTERS:
Operations
----------
LOAD
6
STORE
4
Predicted cycles
7.0 [LOADS/2 + STORES]
Overall cycles per iteration predicted
observed
---------------------------- ---------
--------
Data in cache, REAL*8
7.0
18.5
Data in memory,REAL*8
60
117
Loop 2 is also a load-store bound loop, regardless of whether the data
are in memory or cache. For data in memory, the already expensive set of
ten loads and stores is even more costly as the Alpha runs out of stream
buffers. As noted in the architecture section, these stream buffers have
an enormous impact on cache-to-memory bandwidth and latency. The temporary
vectors LORHOT and LNRHOT are included only for generality in the algorithm;
these vectors are not used in this version of the lcpfct subroutine. Finally,
this is also a loop with independent iterations.
Loop3:
Do 3 I =
I1P, IN
FLXH(I) = MULH(I) * ( RHOT(I) - RHOT(I-1) )
3
DIFF(I) = RHOTD(I) - RHOTD(I-1)
FLOATING POINT:
Operations
REAL*8
----------
------
*
1
+
2
Predicted cycles:
2 [MAX(FM ops, FA ops)]
MEMORY<->CACHE:
Operations
REAL*8
----------
------
LOAD
3
STORE
2
Predicted cycles:
3*6+2*6=
Predicted cycles:
30 [6*(LOADS+STORES)]
CACHE<->REGISTERS:
Operations
----------
LOAD
3
STORE
2
Predicted cycles
3.5 [LOADS/2 + STORES]
Overall cycles per iteration predicted
observed
---------------------------- ---------
--------
Data in cache, REAL*8
3.5
5.9
Data in memory,REAL*8
30
30
Loop 3 writes to 2 vectors (FLXH and DIFF) which are completely local
to loops 3,4 and 5. Note that this is also a load-store bound loop.
Loop4:
Do 4 I =
I1, IN
FABS(I+1) = ABS ( FLXH(I+1) )
FSGN(I+1) = SIGN ( DIFF1, DIFF(I+1) )
TERM(I+1) = FSGN(I+1)*LN(I)*DIFF(I)
4
TERP(I) = FSGN(I)*LN(I)*DIFF(I+1)
FLOATING POINT:
Operations
REAL*8
----------
------
*
4
+
0
ABS [1 FM or FA]
1
SIGN [1 FM or FA]
1
Predicted cycles:
4 [MAX(FM ops, FA ops)]
MEMORY<->CACHE:
Operations
REAL*8
----------
------
LOAD
3
STORE
4
Predicted cycles:
3*6+4*6=
Predicted cycles:
42 [6*(LOADS+STORES)]
CACHE<->REGISTERS:
Operations
----------
LOAD
3
STORE
4
Predicted cycles
5.5 [LOADS/2 + STORES]
Overall cycles per iteration predicted
observed
---------------------------- ---------
--------
Data in cache, REAL*8
5.5
9.7
Data in memory,REAL*8
42
66
Loop 4 is different from the first 3 loops, since the usage of FSGN introduces a dependency. Specifically, iteration I is dependent on iteration I-1 because of the 2nd and 4th statements of the do loop body. However, each iteration does have independent operations which can be evaluated while we are waiting on the evaluation of the SIGN function.Since the SIGN function (CPYS instruction) latency is 4 cycles, we need 4 cycles of independent work to prevent the SIGN evaluation from becoming a bottleneck. Fortunately, there are 4 multiplies (i.e., 4 cycles) we can be working on while the next element of FSGN is being computed. Therefore the dependency should not impose a performance penalty.
Again, loop 4 is load-store bound, and since there are 7 loads and stores, the memory-to-cache bandwidth is again reduced by the limit of 6 stream buffers, although this effect is not as severe as loop2.
Loop5:
Do 5 I =
I1, IN
FLXH(I+1) = FSGN(I+1) * AMAX1 ( 0.0,
&
AMIN1 ( TERM(I+1), FABS(I+1), TERP(I+1) ) )
RHON(I) = RHOTD(I) - RLN(I)*(FLXH(I+1) - FLXH(I))
5
SOURCE(I) = 0.0
FLOATING POINT:
Operations
REAL*8
----------
------
*
2
+
2
MAX/MIN [2 FA, 1 FM or FA]
3
Predicted cycles:
8 [MAX(FM ops, FA ops)]
MEMORY<->CACHE:
Operations
REAL*8
----------
------
LOAD
6
STORE
3
Predicted cycles:
6*6+3*6=
Predicted cycles:
54 [6*(LOADS+STORES)]
CACHE<->REGISTERS:
Operations
----------
LOAD
6
STORE
3
Predicted cycles
6 [LOADS/2 + STORES]
Overall cycles per iteration predicted
observed
---------------------------- ---------
--------
Data in cache, REAL*8
8
19.4
Data in memory,REAL*8
54
101
Data in cache, REAL*8,unroll 3 8
13.0
Loop 5 has a dependency introduced by the usage of FLXH; iteration I is dependent on the results of iteration I-1. However, there are more than 6 cycles of pipelined work (and therefore more than the pipeline latency of 4 cycles) involved in computing the next element of FLXH so we do not expect the dependency to limit performance. The assumes the loop can be unrolled sufficiently without running out of floating point registers.
We have also included a timing for a manually unrolled loop. The FORTRAN compiler default unroll count, set by compiling with -Ounroll2 unrolls too many times and software pipelining is disabled due to register overflow. Manually unrolling by 3 (using the !dir$ unroll 3 directive just before the do loop to be unrolled)increases performance dramatically. The fact that performance is potentially degraded is easily checked by looking at compiler output. In this case, the listing file executable.lst generated by compiling with the flag -r3 contained the optimization messages with regard to loop 5 (which began at line 60):
19) cf90-6005,Scalar,Line=60 A loop starting at line 60 was unrolled
4 times.
20) cf90-9870,Scalar,Line=60 Pipelining of loop disabled because
of high pressure: ipressure=2, fpressure=35.
Note that the memory bandwidth is again reduced because
9 streams into memory are needed and only six are available.
Finally, this is the only loop in the 5 loops in the original code which
is not load-store bound for data in cache.
Cycles per iteration:
predicted floating point predicted
observed
------------------------ ---------
--------
Data in cache, REAL*8
21
27.5 59.6
Data in memory, REAL*8
21
216 344
xlf90 -O3 -qstrict
or
xlf90 -O3 -qstrict -qtune=p2sc -qarch=p2sc.
The tables note
which options are appropriate for each loop timing in the observed results. As
in the T3E case, the array variables have been declared in common and
their alignment in memory adjusted to prevent pathological alignment problems
in cache.
Loop1:
Do 1 I =
I1P, IN
FLXH(I) = HADUDTH(I) * ( RHOO(I) + RHOO(I-1) )
1
DIFF(I) = NULH(I) * ( RHOO(I) - RHOO(I-1) )
FLOATING POINT:
REAL*8
REAL*4
Operations
w/o FMA w/FMA w/o FMA
w/FMA
----------
------- ----- -------
-----
FMA
0 0
0 0
*
2 2
2 2
+
2 2
2 2
CVLS(=)
0 0
1 1
Predicted cycles:[FPOPs/2] 2
2 2.5
2.5
MEMORY<->CACHE:
Operations
REAL*8
REAL*4
----------
------
------
LOAD
3
3
STORE
2
2
STORE after LOAD
Predicted cycles:
[L*1.3+S*2.3+SaL*1.0] [L*0.7+S*1.3+SaL*0.6]
Predicted cycles:
3*1.3+2*2.3=
3*0.7+2*1.3=
Predicted cycles:
8.5
4.7
CACHE<->REGISTERS:
Operations
----------
LOAD
3
STORE
2
Predicted cycles(w/o quad)
2.5 [(LOADs + STOREs)/2]
Predicted cycles(w/ quad)
1.25 [(LOADs + STOREs)/4]
Overall cycles per iteration predicted
observed
---------------------------- ---------
--------
Data in cache, REAL*4
2.5
4.2
Data in cache, REAL*8
2.5
3.7
Data in cache,p2sc, REAL*4
2.5
4.2
Data in cache,p2sc, REAL*8
2.0
2.2
Data in memory,REAL*4
4.7
6.3
Data in memory,REAL*8
8.5
10.0
Loop 1 is marginally load-store bound for REAL*4, in-cache data. For REAL*8 data, quad loads make the loop FPU bound and much higher performance is obtained. As one would expect, this loop is almost twice as fast for REAL*4 data when the data is in memory.
Loop 2:
Do 2 I =
I1, IN
LORHOT(I) = LO(I)*RHOO(I) + SOURCE(I) + (FLXH(I)-FLXH(I+1))
LNRHOT(I) = LORHOT(I) + (DIFF(I+1) - DIFF(I))
RHOT(I) = LORHOT(I)*RLN(I)
2
RHOTD(I) = LNRHOT(I)*RLN(I)
FLOATING POINT:
REAL*8
REAL*4
Operations
w/o FMA w/FMA w/o FMA
w/FMA
----------
------- ----- -------
-----
FMA &nb