
Vector intrinsic functions III: Sun HPC Servers

This is the third in a series of articles covering the use and performance benefits of the vector intrinsic functions. Readers unfamiliar with this topic should at least read the first of these articles, Vector Intrinsics I: CRAY T3E, and, optionally, Vector Intrinsics II: IBM SP for background information.

As we have seen in the earlier articles, for programs that spend a significant fraction of their execution time evaluating commonly used functions, such as square roots, exponentials, logarithms, sines, and cosines, the performance benefit of using the vendor-supplied vector versions of the routines is well worth the effort.

Cray and IBM take different approaches to the vector intrinsics. On the CRAY T3E, the compiler is responsible for identifying opportunities to replace loops over calls to intrinsic functions with calls to a vector version of the function. While the programmer can help by writing the code in such a way that the compiler can recognize opportunities for the substitutions, the burden is placed primarily on the compiler. On the IBM SP, by contrast, the programmer must explicitly code calls to the vector routines. In the remainder of this article, we'll focus on issues specific to the Sun HPC Servers.

Note - The results presented in this article were obtained using Sun's Workshop 6.0 Fortran 90 compiler on SDSC's Sun HPC 10000 with Ultra II (400 MHz) processors. Future hardware and software releases may affect results.

The Sun approach to vector intrinsics

The Sun approach is exactly the same as that taken by Cray; the responsibility for using the vector versions of the intrinsics falls squarely on the compiler. As an example, the Sun f90 compiler will replace the following loop with a call to the vector version of the exp function:

         do i=1,n
            y(i) = exp(x(i))
         enddo

In order for the compiler to make this substitution, the -xvector flag must be used. Consistent with the Cray approach, the Sun compiler does not allow the programmer to explicitly code for the vector intrinsics at the source code level. (Advanced programmers can always modify the assembly code, but this is generally discouraged.)

How do we know that the substitution was made? Unfortunately, the listing file (obtained with the -Xlist flag) doesn't provide this information. At present, the only way to be certain is to examine the assembly code (generated using the -S option). The scalar version of the exp function appears as call __exp, while the vector version appears as call __vexp_.

Compilation of a simple code containing the common intrinsics shows that vector versions of the functions exist only for exp, log, pow (the real-to-real power function), sin, and cos. Following the naming convention used for the exp function, the vector versions of these intrinsics appear as call __vfunctionname_ in the assembly listing.

The Sun f90 compiler is also capable of performing more complex optimizations related to the use of vector functions. For example, the compiler splits the following loop so that the intermediate result x(i)*x(i)+y(i)*y(i)+z(i)*z(i) is calculated first and then passed to the vector exponential function.

         do i=1,n
            r(i) = exp(x(i)*x(i)+y(i)*y(i)+z(i)*z(i))
         enddo
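
Written out by hand, the split that the compiler performs would look roughly like the sketch below. The temporary array t, the reference array r_ref, and the program name are hypothetical, introduced only for illustration; the compiler's internal transformation may differ in detail. The second loop of the split form is a plain loop over exp, which is the pattern that can be replaced by a call to the vector routine.

```fortran
      program split_sketch
      implicit none
      integer, parameter :: n = 1000
      real :: x(n), y(n), z(n), r(n), t(n), r_ref(n)
      integer :: i

      call random_number(x)
      call random_number(y)
      call random_number(z)

!     Original form: one loop, exp applied to an expression.
      do i=1,n
         r_ref(i) = exp(x(i)*x(i)+y(i)*y(i)+z(i)*z(i))
      enddo

!     Split form, loop 1: store the intermediate result in t
!     (t is a hypothetical temporary, not a compiler name).
      do i=1,n
         t(i) = x(i)*x(i)+y(i)*y(i)+z(i)*z(i)
      enddo

!     Split form, loop 2: a simple loop over exp.
      do i=1,n
         r(i) = exp(t(i))
      enddo

!     The two forms should agree to within rounding.
      print *, 'max abs difference: ', maxval(abs(r-r_ref))
      end program split_sketch
```

The cost of the split is the extra storage for the intermediate array, which is the price paid for presenting exp with a contiguous vector of arguments.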

Missing vector intrinsics

In this section, we look briefly at omissions in the set of Sun vector intrinsic functions. In addition to the functions described below, it should be noted that there are no vector versions of the real-to-integer power or tan functions. Neither of these omissions is serious. In most cases where exponentiation to an integer power is required, the exponent is known at compilation time and the operation can be replaced by a sequence of multiplications.
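
As a minimal sketch of that replacement, assuming an exponent of 4 that is known at compilation time, the power can be built from two multiplications. The program and variable names are illustrative only. Many compilers perform this rewrite themselves for a literal integer exponent, so the hand-written form matters mainly when the compiler cannot see the exponent as a constant.

```fortran
      program int_power_sketch
      implicit none
      integer, parameter :: n = 8
      real :: x(n), y_pow(n), y_mul(n), x2
      integer :: i

      call random_number(x)

      do i=1,n
!        Real-to-integer power written with the ** operator.
         y_pow(i) = x(i)**4
!        Same value built from multiplications: (x*x)*(x*x).
         x2 = x(i)*x(i)
         y_mul(i) = x2*x2
      enddo

!     The two results should agree to within rounding.
      print *, 'max abs difference: ', maxval(abs(y_pow-y_mul))
      end program int_power_sketch
```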

Combined sine and cosine

Sun does not provide a vector version of the combined sin/cos function. Examination of the assembly output from a code compiled with the -xvector flag and containing the following loop:

      do i=1,n
         y(i) = sin(x(i))
         z(i) = cos(x(i))
      enddo

shows that separate calls are made to __vsin_ and __vcos_. Sun does provide a scalar version of the combined function, appearing as call __d_sincos in the assembly listing, but there is actually a slight disadvantage in using it.

Inverse square root

On both the IBM SP and the Cray T3E, there is an advantage to combining square root and inverse operations whenever possible. These vendors provide a vector inverse square root function that is either just as fast or only marginally slower than the vector square root operation.

Unfortunately, Sun does not currently support either a scalar or vector version of the inverse square root. An examination of the assembly listing shows that the following loop involves separate square root and inverse operations:

         do i=1,n
            y(i) = 1.0/sqrt(x(i))
         enddo

Intrinsic function performance

There are a number of factors to consider when trying to get the best performance out of intrinsic functions on the Sun. The table below shows timings obtained using four different sets of compiler options for simple loops over 10,000,000 calls to the intrinsic functions.

Execution time (seconds) for loops over
10,000,000 evaluations of intrinsic functions

compiler flags      -fast          -fast          -fast          -fast
                    -xnolibmopt                   -xnolibmopt    -xvector
                                                  -xvector

effect of flags     slow scalar    fast scalar    slow scalar    fast scalar
                    no vector      no vector      vector         vector
--------------------------------------------------------------------------
SQRT                1.09           1.08           1.04           1.02
1.0/SQRT            1.65           1.64           1.63           1.64
EXP                 3.46           2.89           1.95           1.97
LOG                 3.23           3.08           2.20           2.35
SIN                 3.74           2.41           3.01           2.94
COS                 3.88           2.66           2.95           2.90
TAN                 4.38           3.14           4.38           3.17
REAL-TO-REAL power  9.82           6.72           6.60           6.61
REAL-TO-INT power   3.29           3.25           3.27           3.29

Before interpreting these results, it would be useful to take a brief look at the compiler flags that were used.

 

-fast invokes a selection of flags and is provided as a convenience. In f90 version 6.0, these include the following:

         -native -O4 -libmil -fsimple=1 -dalign -xlibmopt -depend -fns -ftrap=%none

-xnolibmopt overrides the -xlibmopt option, which links in the optimized math libraries. These optimized libraries trade off a slight amount of accuracy, usually in the last bit, for speed.

-xvector enables generation of calls to vector library functions.

The optimized math libraries, obtained using either the -fast or -xlibmopt option, always result in scalar performance equal to or better than that of the unoptimized libraries. Except for those cases where the last binary digit of accuracy is required, the optimized libraries should always be used.

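The vector libraries, on the other hand, yield some inconsistent results. The vector exp and log functions are noticeably faster than their optimized scalar counterparts (1.97 versus 2.89 seconds for exp and 2.35 versus 3.08 seconds for log), but the vector sin and cos functions are actually slower than the optimized scalar versions (2.94 versus 2.41 seconds for sin and 2.90 versus 2.66 seconds for cos).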
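
To put numbers on this comparison, the ratio of the -fast time to the -fast -xvector time can be computed for each function directly from the table above; a ratio above 1 means the vector routine is faster. The short program below is only a sketch of that arithmetic, with the timings copied from the table and the program and array names chosen for illustration.

```fortran
      program speedup_sketch
      implicit none
      integer, parameter :: nf = 4
      character(len=3) :: name(nf)
      real :: t_scalar(nf), t_vector(nf)
      integer :: i

!     Timings in seconds, copied from the table:
!     t_scalar is the -fast column, t_vector is -fast -xvector.
      data name     / 'EXP', 'LOG', 'SIN', 'COS' /
      data t_scalar / 2.89, 3.08, 2.41, 2.66 /
      data t_vector / 1.97, 2.35, 2.94, 2.90 /

!     Ratio > 1: vector faster.  Ratio < 1: scalar faster.
      do i=1,nf
         write(*,'(a3,2x,f5.2)') name(i), t_scalar(i)/t_vector(i)
      enddo
      end program speedup_sketch
```

The ratios come out at roughly 1.47 for exp and 1.31 for log, against 0.82 for sin and 0.92 for cos.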

Conclusions

Although there are some advantages to using the fast and vector versions of the intrinsic functions on the Sun HPC Servers, they do not provide the same level of performance improvement as was seen on the Cray T3E and IBM SP. Nonetheless, several points should be kept in mind when compiling and running codes on the Sun:

Appendix: Sample programs

The following codes are available to demonstrate the use of the vector intrinsic functions on the Sun HPC Servers:

Note on sample programs

Many optimizing compilers do an excellent job of eliminating operations that produce results that are never used. For example, in the program below, the loop over calls to sqrt(x(i)) will be optimized away.

      program test1
      real, dimension(100) :: x,y
      call random_number(x)
      do i=1,100
         y(i) = sqrt(x(i))
      enddo
      end

To prevent this, the possibility of using the results must exist. By adding the following line after the loop

      if(y(100).eq.-100.0) write(10,*) y

we take advantage of the fact that the compiler understands the syntax of the program, but not the semantics (for example, that the sqrt function cannot return a negative value). This technique is used throughout the examples and must be kept in mind if the programs are modified.
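
Putting the pieces together, a timing loop with the guard in place might look like the sketch below. It is not one of the original sample programs: the array size, the use of the standard system_clock intrinsic for timing, and the program name are assumptions made for illustration.

```fortran
      program time_sqrt_sketch
      implicit none
!     Array size is an arbitrary choice for this sketch.
      integer, parameter :: n = 1000000
      real, allocatable :: x(:), y(:)
      integer :: i, c0, c1, rate

      allocate(x(n), y(n))
      call random_number(x)

!     Time the loop over the intrinsic function.
      call system_clock(c0, rate)
      do i=1,n
         y(i) = sqrt(x(i))
      enddo
      call system_clock(c1)

!     Guard: never true at run time, but the compiler cannot
!     prove that, so the loop above must be kept.
      if(y(n).eq.-100.0) write(10,*) y

      print *, 'seconds: ', real(c1-c0)/real(rate)
      end program time_sqrt_sketch
```

If the guard line is removed, an optimizing compiler is free to delete the loop, and the reported time then says nothing about the cost of sqrt.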

© 2000 Online: News about the NPACI and SDSC Community