Many options are available to a programmer when attempting to optimize a code. These can vary significantly in both difficulty and effectiveness. Fortunately, some of the easiest optimizations to apply also yield the biggest payoff. A good example is the replacement of a hand-coded linear algebra solver with a libsci, ESSL, IMSL, or other library routine. A related optimization that most users are not aware of is the use of vendor-supplied vector intrinsic routines. These vector routines have been optimized so as to make best use of the arithmetic functional units, thereby maximizing throughput. For codes that spend a lot of time evaluating commonly used functions such as square roots, exponentials, and logarithms, the performance benefits of using the vector versions of the routines can be tremendous.
There is good reason why users are unaware of these vector intrinsics routines. Unlike the libraries mentioned above, the vector intrinsics are not well advertised and in some cases are even hidden from the user. On the CRAY T3E, for example, there are no man pages describing the vector operations. The Cray online documentation (dynaweb) provides a brief mention of the vector routines in Section 4.6 of the Cray T3E Fortran Optimization Guide, but the existence of the vector routines is not immediately obvious. Fortunately, the Cray T3E does make the vector intrinsics readily available to the programmer, often without his or her knowledge.
As an example, consider a code containing the following fragment:
do i=1,n
y(i) = sqrt(x(i))
enddo
The code was compiled on the CRAY T3E using f90 with the options -O3 -r2. The -O3 option invokes the highest level of general optimization, and -r2 specifies that a listing file describing actions taken by the compiler should be produced.
An examination of the listing file shows the following:
20 1 -------- do i=1,n
21 1 y(i) = sqrt(x(i))
22 1 -------> enddo
...
4) An intrinsic math function call
starting at line 21 was vectorized.
What exactly has the compiler done here? It turns out that the loop-over calls to the scalar (one-at-a-time) square root function have been replaced with a call to a vector version of the intrinsic. To confirm that this substitution really happened, we can take a look at the assembly language for this code. (Compiling with the -eS option specifies that assembly code should be generated). The following excerpt shows calls to _SQRT_V, the vector version of the square root:
laum r9, _SQRT_V(r31) ; Ln 21 0x000001b0 (In 94)
sll r9, 32, r9 ; Ln 21 0x000001b4 (In 95)
lalm r9, _SQRT_V(r9) ; Ln 21 0x000001b8 (In 96)
lal r9, _SQRT_V(r9) ; Ln 21 0x000001bc (In 97)
jsr r26, (r9), _SQRT_V ; Ln 21 0x000001c0 (In 98)
...
.extern _SQRT_V
The convention used in Cray assembly language is that the vector intrinsics are listed as _FUNCTIONNAME_V, while the scalar versions of the routines are simply listed as _FUNCTIONNAME.
If a sufficiently high level of optimization is specified -- either -O3 or -Ovector3 -- the compiler will automatically replace repeated calls to the scalar routine with a call to the corresponding vector routine.
So what's the advantage of using the vector intrinsics? The timings below were obtained for 1,000,000 evaluations of the common intrinsics. To disable vectorization where the scalar routines are desired, the !DIR$ NOVECTOR directive is inserted into the code after the timings on the vector routines are completed.
function t(vector) t(scalar) speedup SQRT 0.08213 0.29229 3.55892 RSQRT 0.08174 0.29249 3.57809 EXP 0.10689 0.30254 2.83038 LOG 0.11753 0.56410 4.79939 SIN 0.42646 0.48205 1.13036 COS 0.39388 0.48559 1.23283 TAN 0.74235 0.74243 1.00011 REAL-TO-REAL 2.74571 2.77406 1.01033 REAL-TO-INT 0.52196 0.52105 0.99825
RSQRT is the reciprocal square root function that is automatically called when 1.0/sqrt(x) is encountered. REAL-TO-REAL and REAL-TO-INT are the power operations for a real number raised to real and integer powers, respectively.
The vector intrinsic versions of the routines are almost always faster. In the case of the square root, reciprocal square root, and exponential and logarithmic functions, the speedup is significant enough to have a large impact on the overall performance of certain codes.
A few points are worth mentioning regarding the vector routines on the CRAY T3E.
Although the performance benefits of the vector intrinsics over the scalar versions of the routines are already impressive, it is possible to do even better. The CRAY T3E supports fast versions of the vector routines that are loaded by specifying -lmfastv on the compile line. The timings below were obtained using these fast vector intrinsics:
function t(vector) t(scalar) speedup SQRT 0.06725 0.29118 4.32994 RSQRT 0.07902 0.29136 3.68706 EXP 0.10613 0.30102 2.83639 LOG 0.09841 0.56275 5.71838 SIN 0.15109 0.41704 2.76014 COS 0.15106 0.39045 2.58469 TAN 0.74126 0.74126 1.00000 REAL-TO-REAL 0.33507 2.67083 7.97104 REAL-TO-INT 0.52012 0.51858 0.99703
While the fast square root, reciprocal square root, and log functions all perform better than their default vector counterparts, the most profound gains occur for the sin, cos, and real-to-real functions. (We are currently investigating why linking in the fast vector libraries reduced the time spent in the scalar sin and cos functions.)
If the fast routines are so much better than the default vector routines, why not just use them all the time? One of the reasons why the fast routines have such good performance is that they sacrifice 1-2 bits of accuracy in the results for the sake of speed. Since all numerical algorithms based on floating point arithmetic are inherently approximations, this slight error should be negligible in most cases. The suggestion is to run the code using both the default and vector libraries and look for any noticeable changes in the results. In the remainder of this article, we will use exclusively the fast versions of the routines.
We have already looked at the performance benefits of the vector routines in the limit of very long vectors. What happens, though, as we change the length of the vector? The table below shows that the vector version of exp is actually a little slower than the scalar routine. As the vector length is increased, the performance gains become evident and asymptote toward speedups of ~2.83.
array size t(scalar) t(vector) speedup
10 0.0000190 0.0000204 0.9341730
100 0.0000467 0.0000311 1.5005266
1000 0.0003183 0.0001278 2.4900804
10000 0.0030349 0.0010902 2.7838593
100000 0.0301330 0.0106554 2.8279671
1000000 0.3011246 0.1062833 2.8332250
These timings were obtained for operands that were not in cache (to ensure this, the random_number intrinsic was called with a very large number of array elements before each timing). When in-cache data was used and the loop count was very short, as in the following code fragment with n=10, the vector routine showed even worse performance relative to the scalar routine.
call random_number(x(1:n))
do i=1,n
x(i) = exp(y(i))
enddo
In these cases where it has been confirmed that the vector routine performs poorly, it may be advantageous to temporarily disable vectorization.
!DIR$ NOVECTOR
call random_number(x(1:n))
do i=1,n
x(i) = exp(y(i))
enddo
!DIR$ VECTOR
In many cases involving exponentiation, all elements of an array are raised to the same power. Taking advantage of the algebraic identity
x**y = exp(log(x)*y)
it is possible to replace the expensive real-to-real operation with calls to the exponential and logarithm functions as shown below.
c Original code for evaluating real-to-real
do i=1,n
y(i) = x(i)**q
enddo
c New code based on exp and log functions
do i=1,n
z(i) = log(x(i))
enddo
do i=1,n
y(i) = exp(z(i)*q)
enddo
The first block of code took approximately 25% longer to run than the second block. If the fast vector routines were not used, the difference was factor of ten!
The technique described above is excellent for the general case of an array of real values raised to a given power. It is possible, though, to do even better for special cases involving 1/2, 1/4, and even 1/8 integral powers by taking advantage of the speed of the square root operation.
For example, the following blocks of code were used to evaulate the special cases of raising an array to the powers 1.5, 0.25, 0.125, and 1.25
do i=1,n
y(i) = sqrt(x(i))
enddo
do i=1,n
y(i) = x(i)*y(i)
enddo
|
do i=1,n
y(i) = sqrt(x(i))
enddo
do i=1,n
y(i) = sqrt(y(i))
enddo
|
do i=1,n
y(i) = sqrt(x(i))
enddo
do i=1,n
y(i) = sqrt(y(i))
enddo
do i=1,n
y(i) = sqrt(y(i))
enddo
|
do i=1,n
y(i) = sqrt(x(i))
enddo
do i=1,n
y(i) = sqrt(y(i))
enddo
do i=1,n
y(i) = y(i)*x(i)
enddo
|
It is a common enough occurrence that both the sin and cos need to be evaluated for a given argument, and a special vector intrinsic has been created for this case. The following loop:
do i=1,n
x(i) = cos(z(i))
y(i) = sin(z(i))
enddo
is automatically replaced by a call to the vector intrinsic labeled as _COSS_V in the assembly output. Compared to the version of the code in which the sin and cos were evaluated in separate loops, _COSS_V required only about 75% of the time.
In the previous sections, we saw that the compiler would automatically replace a repeated loop-over call to an intrinsic function with a call to the vector version of the intrinsic. The compiler can actually take this one step further and split loops as needed so that the intrinsic operation can be isolated. For example, in the following code, the compiler splits the loop and the vector square root function is called as indicated by the assembly listing:
do i=1,n
r(i) = sqrt(x(i)*x(i)+y(i)*y(i)+z(i)*z(i))
enddo
In some cases, the loops are too compicated for the compiler to isolate the call to the intrinsic. In the trivial case below, the scalar version of the square root is called:
c Original version of code - unable to take advantage of vector intrinsics
do i=1,n
r2 = x(i)*x(i)+y(i)*y(i)+z(i)*z(i)
if(r2.lt.1.0) then
r(i) = sqrt(r2)*2.0
else
r(i) = sqrt(r2)*3.0
endif
enddo
Rewriting the code so that the call to sqrt is taken outside of the IF-ELSE-ENDIF test allows the vector square root to be used instead:
c Code rewritten to take advantage of vector intrinsic
do i=1,n
r(i) = sqrt(x(i)*x(i)+y(i)*y(i)+z(i)*z(i))
enddo
do i=1,n
if(r(i).lt.1.0) then
r(i) = r(i)*2.0
else
r(i) = r(i)*3.0
endif
enddo
In more realistic cases, modifying the code to use the vector intrinsics may result in extra arithmetic operations or loads being performed. The suggestion is to simply time multiple versions of the code segment to determine whether the expense of these extra operations are offset by the benefits of the vector intrinsics. In the code fragments below, the compiler was unable to use the vector inverse square root intrinsic:
do i=1,n
r2temp = x(i)*x(i) + y(i)*y(i)
if(r2temp.lt.1.0) then
ri = 1.0/sqrt(r2temp)
fx(i) = x(i)*ri
fy(i) = y(i)*ri
endif
enddo
Modifying the code as shown below requires additional load/store operations (certain elements of x and y are loaded twice, rinv and indx are each loaded and stored) and arithmetic operations (incrementing icount). Still, the benefits of using the vector intrinsic outweigh these costs and result in a code that runs in only about 70% of the time:
icount = 0
do i=1,n
r2temp = x(i)*x(i) + y(i)*y(i)
if(r2temp.lt.1.0) then
icount = icount + 1
rinv(icount) = r2temp
indx(icount) = i
endif
enddo
do j=1,icount
rinv(j) = 1.0/sqrt(rinv(j))
enddo
do j=1,icount
i = indx(j)
fx(i) = x(i)*rinv(j)
fy(i) = y(i)*rinv(j)
enddo
The Cray approach to the vector intrinsics is to hide them from the programmer and have the compiler substitute the vector functions for the scalar functions as appropriate. While this has the effect of reducing the burden on the programmer, it does have some drawbacks. For users who are unaware of the existence of the vector functions, this can result in mysterious and drastic changes in run times as seemingly small changes are made to the source code. In addition, the programmer will be unsure as to whether the substitution has been made unless the time is taken to examine the assembly code or listing file. As we will see in the next part of this series on vector intrinsics, IBM takes the exact opposite approach and requires the programmer to explicitly code the calls to the vector routines.
In summary, the following points should be kept in mind regarding the vector intrinsic routines on the CRAY T3E:
The following codes are available to demonstrate the use of the vector intrinsic functions on the CRAY T3E:
| Compares performance of the scalar and vector versions of the common intrinsic functions.
| |
| Looks at effect of vector length on relative performance of scalar and vector routines.
| |
| Evaluates array of real numbers raised to a common exponent using various techniques.
| |
| Illustrates benefits of using the combined sin/cos intrinsic function.
| |
| Illustrates loop splitting by the compiler in order to use vector intrinsic.
| |
| Simple case of code modification to allow use of vector intrinsic.
| |
| Complex case of code modification to allow use of vector intrinsic.
| |
| Makefile for creating listing file, assembly code, and executable from each source.
| |
On the SDSC CRAY T3E, programs must be run on at least two processors in order to gain access to the application nodes. When running on a single processor, results will be unreliable, since the executable will be competing with other command node processes. All source code is written using a simple MPI wrapper in which only one processor actually performs the calculations. Codes should be run using the following command:
mpprun -n 2 executable
The optimizing compilers on the CRAY T3E do an excellent job of eliminating operations that produce results that are never used. For example, in the program below, the loop-over calls to sqrt(x(i)) will be optimized away:
program test1
real, dimension(100) :: x,y
call random_number(x)
do i=1,100
y(i) = sqrt(x(i))
enddo
end
To prevent this, the possibility of using the results must exist. By adding the following line after the loop
if(y(100).eq.-100.0) write(10,*) y
we take advantage of the fact that the compiler understands the syntax of the program, but not the semantics (for example, that the sqrt function cannot return a negative value). This technique is used throughout the examples and must be kept in mind if the programs are modified.