[an error occurred while processing this directive]

Scientific Computing at NPACI (SCAN)

Vector intrinsic functions I: CRAY T3E

Many options are available to a programmer when attempting to optimize a code. These can vary significantly in both difficulty and effectiveness. Fortunately, some of the easiest optimizations to apply also yield the biggest payoff. A good example is the replacement of a hand-coded linear algebra solver with a libsci, ESSL, IMSL, or other library routine. A related optimization that most users are not aware of is the use of vendor-supplied vector intrinsic routines. These vector routines have been optimized so as to make best use of the arithmetic functional units, thereby maximizing throughput. For codes that spend a lot of time evaluating commonly used functions such as square roots, exponentials, and logarithms, the performance benefits of using the vector versions of the routines can be tremendous.

There is good reason why users are unaware of these vector intrinsics routines. Unlike the libraries mentioned above, the vector intrinsics are not well advertised and in some cases are even hidden from the user. On the CRAY T3E, for example, there are no man pages describing the vector operations. The Cray online documentation (dynaweb) provides a brief mention of the vector routines in Section 4.6 of the Cray T3E Fortran Optimization Guide, but the existence of the vector routines is not immediately obvious. Fortunately, the Cray T3E does make the vector intrinsics readily available to the programmer, often without his or her knowledge.

The Cray approach to vector intrinsics

As an example, consider a code containing the following fragment:

         do i=1,n
            y(i) = sqrt(x(i))
         enddo

The code was compiled on the CRAY T3E using f90 with the options -O3 -r2. The -O3 option invokes the highest level of general optimization, and -r2 specifies that a listing file describing actions taken by the compiler should be produced.

An examination of the listing file shows the following:

     20               1 --------            do i=1,n
     21               1                        y(i) = sqrt(x(i))
     22               1 ------->            enddo

              ...

   4)  An intrinsic math function call
       starting at line 21 was vectorized.

What exactly has the compiler done here? It turns out that the loop-over calls to the scalar (one-at-a-time) square root function have been replaced with a call to a vector version of the intrinsic. To confirm that this substitution really happened, we can take a look at the assembly language for this code. (Compiling with the -eS option specifies that assembly code should be generated). The following excerpt shows calls to _SQRT_V, the vector version of the square root:

        laum      r9,  _SQRT_V(r31)           ; Ln 21 0x000001b0 (In 94)
        sll       r9,  32,        r9          ; Ln 21 0x000001b4 (In 95)
        lalm      r9,  _SQRT_V(r9)            ; Ln 21 0x000001b8 (In 96)
        lal       r9,  _SQRT_V(r9)            ; Ln 21 0x000001bc (In 97)
        jsr       r26, (r9),      _SQRT_V     ; Ln 21 0x000001c0 (In 98)

	...

        .extern   _SQRT_V

The convention used in Cray assembly language is that the vector intrinsics are listed as _FUNCTIONNAME_V, while the scalar versions of the routines are simply listed as _FUNCTIONNAME.

If a sufficiently high level of optimization is specified -- either -O3 or -Ovector3 -- the compiler will automatically replace repeated calls to the scalar routine with a call to the corresponding vector routine.

Performance advantage of the vector intrinsics

So what's the advantage of using the vector intrinsics? The timings below were obtained for 1,000,000 evaluations of the common intrinsics. To disable vectorization where the scalar routines are desired, the !DIR$ NOVECTOR directive is inserted into the code after the timings on the vector routines are completed.

function            t(vector)   t(scalar)   speedup
SQRT                0.08213     0.29229     3.55892
RSQRT               0.08174     0.29249     3.57809
EXP                 0.10689     0.30254     2.83038
LOG                 0.11753     0.56410     4.79939
SIN                 0.42646     0.48205     1.13036
COS                 0.39388     0.48559     1.23283
TAN                 0.74235     0.74243     1.00011
REAL-TO-REAL        2.74571     2.77406     1.01033
REAL-TO-INT         0.52196     0.52105     0.99825

RSQRT is the reciprocal square root function that is automatically called when 1.0/sqrt(x) is encountered. REAL-TO-REAL and REAL-TO-INT are the power operations for a real number raised to real and integer powers, respectively.

The vector intrinsic versions of the routines are almost always faster. In the case of the square root, reciprocal square root, and exponential and logarithmic functions, the speedup is significant enough to have a large impact on the overall performance of certain codes.

A few points are worth mentioning regarding the vector routines on the CRAY T3E.

The fast vector intrinsics

Although the performance benefits of the vector intrinsics over the scalar versions of the routines are already impressive, it is possible to do even better. The CRAY T3E supports fast versions of the vector routines that are loaded by specifying -lmfastv on the compile line. The timings below were obtained using these fast vector intrinsics:

function            t(vector)   t(scalar)   speedup
SQRT                0.06725     0.29118     4.32994
RSQRT               0.07902     0.29136     3.68706
EXP                 0.10613     0.30102     2.83639
LOG                 0.09841     0.56275     5.71838
SIN                 0.15109     0.41704     2.76014
COS                 0.15106     0.39045     2.58469
TAN                 0.74126     0.74126     1.00000
REAL-TO-REAL        0.33507     2.67083     7.97104
REAL-TO-INT         0.52012     0.51858     0.99703

While the fast square root, reciprocal square root, and log functions all perform better than their default vector counterparts, the most profound gains occur for the sin, cos, and real-to-real functions. (We are currently investigating why linking in the fast vector libraries reduced the time spent in the scalar sin and cos functions.)

If the fast routines are so much better than the default vector routines, why not just use them all the time? One of the reasons why the fast routines have such good performance is that they sacrifice 1-2 bits of accuracy in the results for the sake of speed. Since all numerical algorithms based on floating point arithmetic are inherently approximations, this slight error should be negligible in most cases. The suggestion is to run the code using both the default and vector libraries and look for any noticeable changes in the results. In the remainder of this article, we will use exclusively the fast versions of the routines.

Effect of vector length

We have already looked at the performance benefits of the vector routines in the limit of very long vectors. What happens, though, as we change the length of the vector? The table below shows that the vector version of exp is actually a little slower than the scalar routine. As the vector length is increased, the performance gains become evident and asymptote toward speedups of ~2.83.

array size     t(scalar)     t(vector)     speedup
        10     0.0000190     0.0000204     0.9341730
       100     0.0000467     0.0000311     1.5005266
      1000     0.0003183     0.0001278     2.4900804
     10000     0.0030349     0.0010902     2.7838593
    100000     0.0301330     0.0106554     2.8279671
   1000000     0.3011246     0.1062833     2.8332250

These timings were obtained for operands that were not in cache (to ensure this, the random_number intrinsic was called with a very large number of array elements before each timing). When in-cache data was used and the loop count was very short, as in the following code fragment with n=10, the vector routine showed even worse performance relative to the scalar routine.

      call random_number(x(1:n))
      do i=1,n
         x(i) = exp(y(i))
      enddo

In these cases where it has been confirmed that the vector routine performs poorly, it may be advantageous to temporarily disable vectorization.

!DIR$ NOVECTOR
      call random_number(x(1:n))
      do i=1,n
         x(i) = exp(y(i))
      enddo
!DIR$ VECTOR

Advanced use of the vector intrinsics

Fast evaluation of real-to-real operations

In many cases involving exponentiation, all elements of an array are raised to the same power. Taking advantage of the algebraic identity

     x**y = exp(log(x)*y)

it is possible to replace the expensive real-to-real operation with calls to the exponential and logarithm functions as shown below.

c      Original code for evaluating real-to-real
         do i=1,n
            y(i) = x(i)**q
         enddo

c      New code based on exp and log functions
         do i=1,n
            z(i) = log(x(i))
         enddo
         do i=1,n
            y(i) = exp(z(i)*q)
         enddo

The first block of code took approximately 25% longer to run than the second block. If the fast vector routines were not used, the difference was factor of ten!

Exponentiation to special powers

The technique described above is excellent for the general case of an array of real values raised to a given power. It is possible, though, to do even better for special cases involving 1/2, 1/4, and even 1/8 integral powers by taking advantage of the speed of the square root operation.

For example, the following blocks of code were used to evaulate the special cases of raising an array to the powers 1.5, 0.25, 0.125, and 1.25

         do i=1,n
            y(i) = sqrt(x(i))
         enddo
         do i=1,n
            y(i) = x(i)*y(i)
         enddo
         do i=1,n
            y(i) = sqrt(x(i))
         enddo
         do i=1,n
            y(i) = sqrt(y(i))
         enddo
         do i=1,n
            y(i) = sqrt(x(i))
         enddo
         do i=1,n
            y(i) = sqrt(y(i))
         enddo
         do i=1,n
            y(i) = sqrt(y(i))
         enddo
         do i=1,n
            y(i) = sqrt(x(i))
         enddo
         do i=1,n
            y(i) = sqrt(y(i))
         enddo
         do i=1,n
            y(i) = y(i)*x(i)
         enddo

Combined evaluation of SIN and COS functions

It is a common enough occurrence that both the sin and cos need to be evaluated for a given argument, and a special vector intrinsic has been created for this case. The following loop:

       do i=1,n
          x(i) = cos(z(i))
          y(i) = sin(z(i))
       enddo

is automatically replaced by a call to the vector intrinsic labeled as _COSS_V in the assembly output. Compared to the version of the code in which the sin and cos were evaluated in separate loops, _COSS_V required only about 75% of the time.

Helping the compiler to use vector intrinsics

In the previous sections, we saw that the compiler would automatically replace a repeated loop-over call to an intrinsic function with a call to the vector version of the intrinsic. The compiler can actually take this one step further and split loops as needed so that the intrinsic operation can be isolated. For example, in the following code, the compiler splits the loop and the vector square root function is called as indicated by the assembly listing:

         do i=1,n
            r(i) = sqrt(x(i)*x(i)+y(i)*y(i)+z(i)*z(i))
         enddo

In some cases, the loops are too compicated for the compiler to isolate the call to the intrinsic. In the trivial case below, the scalar version of the square root is called:

c      Original version of code - unable to take advantage of vector intrinsics
         do i=1,n
            r2 = x(i)*x(i)+y(i)*y(i)+z(i)*z(i)
            if(r2.lt.1.0) then
               r(i) = sqrt(r2)*2.0
            else
               r(i) = sqrt(r2)*3.0
            endif
         enddo

Rewriting the code so that the call to sqrt is taken outside of the IF-ELSE-ENDIF test allows the vector square root to be used instead:

 
c     Code rewritten to take advantage of vector intrinsic
         do i=1,n
            r(i) = sqrt(x(i)*x(i)+y(i)*y(i)+z(i)*z(i))
         enddo
         do i=1,n
            if(r(i).lt.1.0) then
               r(i) = r(i)*2.0
            else
               r(i) = r(i)*3.0
            endif
         enddo

In more realistic cases, modifying the code to use the vector intrinsics may result in extra arithmetic operations or loads being performed. The suggestion is to simply time multiple versions of the code segment to determine whether the expense of these extra operations are offset by the benefits of the vector intrinsics. In the code fragments below, the compiler was unable to use the vector inverse square root intrinsic:

         do i=1,n
            r2temp = x(i)*x(i) + y(i)*y(i)
            if(r2temp.lt.1.0) then
               ri = 1.0/sqrt(r2temp)
               fx(i) = x(i)*ri
               fy(i) = y(i)*ri
            endif
         enddo

Modifying the code as shown below requires additional load/store operations (certain elements of x and y are loaded twice, rinv and indx are each loaded and stored) and arithmetic operations (incrementing icount). Still, the benefits of using the vector intrinsic outweigh these costs and result in a code that runs in only about 70% of the time:

         icount = 0
         do i=1,n
            r2temp = x(i)*x(i) + y(i)*y(i)
            if(r2temp.lt.1.0) then
               icount = icount + 1
               rinv(icount) = r2temp
               indx(icount) = i
            endif
         enddo

         do j=1,icount
            rinv(j) = 1.0/sqrt(rinv(j))
         enddo

         do j=1,icount
            i = indx(j)
            fx(i) = x(i)*rinv(j)
            fy(i) = y(i)*rinv(j)
         enddo

Conclusions

The Cray approach to the vector intrinsics is to hide them from the programmer and have the compiler substitute the vector functions for the scalar functions as appropriate. While this has the effect of reducing the burden on the programmer, it does have some drawbacks. For users who are unaware of the existence of the vector functions, this can result in mysterious and drastic changes in run times as seemingly small changes are made to the source code. In addition, the programmer will be unsure as to whether the substitution has been made unless the time is taken to examine the assembly code or listing file. As we will see in the next part of this series on vector intrinsics, IBM takes the exact opposite approach and requires the programmer to explicitly code the calls to the vector routines.

In summary, the following points should be kept in mind regarding the vector intrinsic routines on the CRAY T3E:

Appendix: Sample programs

The following codes are available to demonstrate the use of the vector intrinsic functions on the CRAY T3E:

Note on sample programs

On the SDSC CRAY T3E, programs must be run on at least two processors in order to gain access to the application nodes. When running on a single processor, results will be unreliable, since the executable will be competing with other command node processes. All source code is written using a simple MPI wrapper in which only one processor actually performs the calculations. Codes should be run using the following command:

mpprun -n 2 executable

The optimizing compilers on the CRAY T3E do an excellent job of eliminating operations that produce results that are never used. For example, in the program below, the loop-over calls to sqrt(x(i)) will be optimized away:

      program test1
      real, dimension(100) :: x,y
      call random_number(x)
      do i=1,100
         y(i) = sqrt(x(i))
      enddo
      end

To prevent this, the possibility of using the results must exist. By adding the following line after the loop

      if(y(100).eq.-100.0) write(10,*) y

we take advantage of the fact that the compiler understands the syntax of the program, but not the semantics (for example, that the sqrt function cannot return a negative value). This technique is used throughout the examples and must be kept in mind if the programs are modified.