
Scientific Computing at NPACI (SCAN)

Vector intrinsic functions II: IBM SP

In a previous SCAN article, we discussed the optimizations that can be performed by using the vector intrinsic functions on the CRAY T3E (see Vector intrinsic functions I: CRAY T3E). The general approach is the same on both the CRAY T3E and the IBM SP (isolate loops over calls to the scalar intrinsic functions so that they can be replaced with calls to the vector routines), but the details of implementation differ considerably. As we saw on the CRAY T3E, the user cannot manually code the calls to the vector intrinsics. Instead, the code must be written in such a way that the compiler can recognize the opportunity to substitute a vector routine for a loop over calls to the scalar functions. On the IBM machines, the opposite approach is taken: the user must explicitly code the calls to the vector routines.

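In the remainder of this article, we will discuss inserting calls to vector MASS, the performance advantage of the vector intrinsics, the effect of -qarch=pwr2, the scalar MASS library, the effect of vector length, portability, and memory issues.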

Inserting calls to vector MASS

Consider a code containing the following fragment:


         do i=1,n
            y(i) = exp(x(i))
         enddo

As we've seen previously, simply compiling this code on the T3E with either -O3 or -Ovector3 will result in the substitution of a vector routine for this loop. That the substitution occurred can be verified by inspecting either the assembly code or the listing file. On the IBM SP systems, the loop must instead be rewritten by hand as


         call vexp(y,x,n)

The calling syntax for the majority of the vector MASS routines is


         call vfunc(target,source,length)

where the names of the vector functions are obtained by simply prepending "v" to the standard Fortran intrinsics: vsqrt, vexp, vlog, vsin, vcos, and vtan. The vector MASS library also contains reciprocal (vrec) and inverse square root (vrsqrt) functions that follow this same syntax.

A small number of the routines are called with additional arguments:


         call vsincos(target1,target2,source,length)
         call vdiv(target,source1,source2,length)
         call vatan2(target,source1,source2,length)
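To make the argument order concrete, the sketch below writes out scalar loops that compute what each of these three calls is expected to compute. It assumes that vsincos places the sine in the first target and the cosine in the second, and that vdiv and vatan2 use source1 as the numerator and first argument, respectively; check the mass man page before relying on this. The ref_ names are hypothetical, chosen so that the routines do not clash with the MASS entry points.

```fortran
!     Scalar reference loops for the three-operand routines.
!     The ref_ names are hypothetical; the argument order is an
!     assumption that mirrors the calling syntax shown above.
      subroutine ref_vsincos(s, c, x, n)
      implicit none
      integer n, i
      double precision s(n), c(n), x(n)
      do i = 1, n
         s(i) = sin(x(i))
         c(i) = cos(x(i))
      enddo
      end

      subroutine ref_vdiv(q, a, b, n)
      implicit none
      integer n, i
      double precision q(n), a(n), b(n)
      do i = 1, n
         q(i) = a(i)/b(i)
      enddo
      end

      subroutine ref_vatan2(t, a, b, n)
      implicit none
      integer n, i
      double precision t(n), a(n), b(n)
      do i = 1, n
         t(i) = atan2(a(i), b(i))
      enddo
      end
```

Comparing the output of these loops against the corresponding MASS calls on a small test vector is a quick way to confirm the argument order on a given installation.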

The vector MASS routines are part of a separate library that must be explicitly loaded as shown below:


         xlf90 myprog.f -L/usr/local/apps/mass/lib -lmassv
         xlf90 myprog.f -L/usr/local/apps/mass/lib -lmassvp2

Linking with -lmassv uses the generic vector MASS library, while -lmassvp2 loads a version of the library that has been specifically tuned for the POWER2 architecture. NPACI's IBM Teraflops system, scheduled for delivery in the latter half of 1999, is based on the POWER3 architecture. At the time of writing, a version of vector MASS tuned for this processor is not yet available, but one will most likely be developed in the future. It is suggested that the architecture-specific library be used whenever possible.

Performance advantage of the vector intrinsics

The performance of the scalar and vector intrinsics was compared for code compiled using -O3 -L/usr/local/apps/mass/lib -lmassvp2. The timings below were obtained for 1,000,000 evaluations of the common intrinsic functions using 64-bit real values. (NOTE: the default REAL size on the IBM is 32-bit. For codes which declare variables to be of type REAL, the -qautodbl=dbl flag should be used.) RSQRT is the inverse square root function; the compiler automatically replaces 1.0/SQRT() with a call to this function. SINCOS involves separate calls to SIN and COS (scalar) or a single call to VSINCOS (vector). The speedup column is the ratio t(scalar)/t(vector).


function            t(vector)   t(scalar)   speedup
SQRT                0.05373     0.26619     4.95443
RSQRT               0.05078     0.35213     6.93409
EXP                 0.05251     0.33875     6.45155
LOG                 0.05943     0.34537     5.81170
SIN                 0.03959     0.21468     5.42296
COS                 0.03866     0.22530     5.82737
TAN                 0.08948     0.42596     4.76022
SINCOS              0.06710     0.51386     5.74256

Effect of -qarch=pwr2

The code used to obtain the timings for the vector and scalar intrinsics was recompiled as above, but with the addition of the -qarch=pwr2 flag. This enables the compiler to generate code that is specific to the POWER2 architecture and, in particular, to use the hardware square root instruction. This affects only the scalar SQRT and RSQRT operations, for which new timings are shown below.


function            t(vector)   t(scalar)   speedup
SQRT                0.05140     0.08166     1.58876
RSQRT               0.05059     0.13539     2.67612

The (scalar) MASS library

The purpose of this article is to illustrate the use of the vector intrinsic routines, but it is worthwhile to take a short digression to discuss the scalar MASS (or simply MASS) library. We recompiled the code, but this time with -O3 -qarch=pwr2 -lmass -lmassvp2. The results are shown below.


function            t(vector)   t(scalar)   speedup
SQRT                0.05227     0.08228     1.57407
RSQRT               0.05055     0.13545     2.67965
EXP                 0.05138     0.14488     2.81985
LOG                 0.05808     0.22671     3.90322
SIN                 0.03975     0.11235     2.82651
COS                 0.04115     0.11239     2.73112
TAN                 0.09023     0.23476     2.60191
SINCOS              0.06658     0.25029     2.77396

The MASS library, which must be loaded separately from the vector MASS library, contains optimized versions of the scalar intrinsics. The improved performance is obtained at the expense of a slight loss of precision. For most applications, this difference should not be noticeable. For particularly sensitive applications or for cases where high precision is needed, it is strongly suggested that the results obtained with and without the MASS libraries be compared.

In contrast to the vector library, which requires that calls to the vector functions be explicitly coded, linking the MASS library results in an automatic substitution of the scalar MASS routines for the standard intrinsics. All timings presented throughout the remainder of this article are based on code compiled with -qarch=pwr2 -lmass -lmassvp2.

Effect of vector length

Just as on the CRAY T3E, the speedup obtained from using the vector intrinsic depends on the length of the vector. The table below illustrates how the speedup of vexp() levels off as the vector becomes long:


array size     t(scalar)     t(vector)     speedup
        10     0.0000044     0.0000044     0.9972900
       100     0.0000176     0.0000090     1.9536424
      1000     0.0001494     0.0000650     2.2986060
     10000     0.0014261     0.0005642     2.5277748
    100000     0.0139536     0.0053033     2.6311172
   1000000     0.1387933     0.0535633     2.5912005

If you expect that your vector lengths will be small, it is best to time the code using both the scalar and vector routines from the MASS library. In some cases, a vector intrinsic called with a very short vector can be slower than the scalar routine.
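One way to act on this is to choose between the two paths at run time. The sketch below is only an illustration: exp_auto and its threshold argument nmin are hypothetical names, and the vector routine is passed in as an argument so that the same source compiles whether or not MASS is available. A suitable value of nmin has to come from timings on the target machine; the table above only suggests that the crossover for vexp lies somewhere between 10 and 100 elements.

```fortran
!     Use the vector routine only when the vector is long enough.
!     exp_auto and nmin are hypothetical names; vroutine is any
!     routine with the (target,source,length) calling syntax.
      subroutine exp_auto(y, x, n, nmin, vroutine)
      implicit none
      integer n, nmin, i
      double precision y(n), x(n)
      external vroutine
      if (n .ge. nmin) then
!        Long vector: hand the whole array to the vector routine
         call vroutine(y, x, n)
      else
!        Short vector: stay with the scalar intrinsic
         do i = 1, n
            y(i) = exp(x(i))
         enddo
      endif
      end
```

On the IBM SP the caller would declare vexp as external and write, for example, call exp_auto(y,x,n,100,vexp).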

Portability

A main drawback of using the vector MASS routines is that they reduce the portability of the code. Unless you plan on writing a code that runs only on the IBM SP, one of the following two solutions is recommended:

Use the preprocessor directives

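The following code fragment will take advantage of the vector intrinsics on both the CRAY T3E and the IBM SP, provided that the source is run through the preprocessor and that the macro MASS is defined only when compiling on the IBM SP: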


#ifdef MASS
      call vexp(y,x,n)
#else
      do i=1,n
         y(i) = exp(x(i))
      enddo
#endif

Write a function wrapper

The code is simply written as if it were to be run only on the IBM SP, but a separate file is compiled and linked for other machines. For example, a file named vecint.f containing the following code could be created:


      subroutine vexp(y,x,n)
      integer n,i
      real*8 x(n),y(n)
      do i=1,n
         y(i) = exp(x(i))
      enddo
      end

      subroutine vlog(y,x,n)
      integer n,i
      real*8 x(n),y(n)
      do i=1,n
         y(i) = log(x(i))
      enddo
      end

      ...

Machine-specific makefiles could then be created with the following entries:


# For the IBM SP2
my_executable: $(OBJS)
	$(LDR) -o $@ $(OBJS) -L/usr/local/apps/mass/lib -lmass -lmassvp2

# For other machines
my_executable: $(OBJS) vecint.o
	$(LDR) -o $@ $(OBJS) vecint.o

Memory issues

For most codes run on NPACI's machines, memory limitations are not an issue. In the example below, the original loop is rewritten to use the vector function, but the rewritten version requires memory to store an extra vector of length n.


c     Original loop
      do i=1,n
         rinv(i) = 1.0/sqrt(x(i)*x(i) + y(i)*y(i))
      enddo

c     Transformed loop
      do i=1,n
         temp(i) = x(i)*x(i) + y(i)*y(i)
      enddo
      call vrsqrt(rinv,temp,n)

If memory limitations are an issue, the loop can be stripmined so that only a relatively small temporary array (in this case 1024 elements) need be allocated. An additional benefit of stripmining is that memory traffic will be reduced, since the entire temporary array will fit in cache, possibly resulting in even better performance.


c     Stripmined loop
      do i1=1,n,1024
         i2 = min(i1+1023,n)
         do i=i1,i2
            temp(i-i1+1) = x(i)*x(i) + y(i)*y(i)
         enddo
         call vrsqrt(rinv(i1),temp,i2-i1+1)
      enddo
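The stripmined loop can be packaged as a subroutine in which the temporary array is a fixed-size local, which makes it explicit that the extra storage no longer grows with n. This is a sketch under stated assumptions: rinv_strip and nblk are hypothetical names, and the vector routine is passed as an argument (vrsqrt on the IBM SP, or a portable wrapper elsewhere) so that the example does not depend on MASS being installed.

```fortran
!     Stripmined 1/sqrt(x*x + y*y) with a fixed-size work array.
!     rinv_strip and nblk are hypothetical names; vroutine is any
!     routine with the (target,source,length) calling syntax.
      subroutine rinv_strip(rinv, x, y, n, vroutine)
      implicit none
      integer nblk
      parameter (nblk = 1024)
      integer n, i, i1, i2
      double precision rinv(n), x(n), y(n)
!     Work space is nblk elements regardless of n
      double precision temp(nblk)
      external vroutine
      do i1 = 1, n, nblk
!        The last strip may be shorter than nblk
         i2 = min(i1 + nblk - 1, n)
         do i = i1, i2
            temp(i - i1 + 1) = x(i)*x(i) + y(i)*y(i)
         enddo
         call vroutine(rinv(i1), temp, i2 - i1 + 1)
      enddo
      end
```

The strip length of 1024 is taken from the example above; a different value may suit other cache sizes, and the vector-length timings suggest that strips much shorter than a few hundred elements would give up part of the speedup.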

Conclusions

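The IBM approach to the vector intrinsics requires the programmer to explicitly code the calls to the vector routines. While this shifts the burden onto the programmer, it does have the advantage that the programmer always knows when a vector version of the routine is being used. The points made above should be kept in mind when optimizing for the intrinsic functions on the IBM SP: use the architecture-specific library whenever possible, compare results with and without the MASS libraries when precision matters, time both the scalar and vector routines when vectors are short, isolate the vector calls to preserve portability, and stripmine the loop if memory is limited.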

For more information on the MASS library, see the mass man page on the SDSC IBM SP or visit www.rs6000.ibm.com/resource/technology/MASS.

Appendix: Sample programs

The following codes are available to demonstrate the use of the vector intrinsic functions on the IBM SP:

Note on sample programs

The SDSC IBM SP is currently configured with 66 MHz nodes for interactive jobs and 160 MHz nodes for batch jobs. In order to get accurate timings that reflect the performance that can be expected for real applications, jobs should be submitted via LoadLeveler. Although serial (single processor) jobs are allowed on SDSC's SP, they are highly discouraged except for short benchmark runs.

The optimizing compilers on the IBM SP do an excellent job of eliminating operations that produce results that are never used. For example, in the program below, the loop over calls to sqrt(x(i)) will be optimized away.


      program test1
      real, dimension(100) :: x,y
      call random_number(x)
      do i=1,100
         y(i) = sqrt(x(i))
      enddo
      end

To prevent this, the possibility of using the results must exist. By adding the following line after the loop


      if(y(100).eq.-100.0) write(10,*) y

we take advantage of the fact that the compiler understands the syntax of the program, but not the semantics (for example, that the sqrt function cannot return a negative value). This technique is used throughout the examples and must be kept in mind if the programs are modified.
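As an illustration of how the guard fits into a timing run, the sketch below times a scalar sqrt loop with the standard system_clock intrinsic and keeps the result live with the same kind of test. It is not one of the article's sample programs: time_sqrt is a hypothetical name, the routine assumes n is at least 1, and the resolution of system_clock varies between systems.

```fortran
!     Time a scalar sqrt loop and keep its result live.
!     time_sqrt is a hypothetical name; n must be at least 1.
      subroutine time_sqrt(x, y, n, seconds)
      implicit none
      integer n, i, c0, c1, rate
      double precision x(n), y(n), seconds
      call system_clock(c0, rate)
      do i = 1, n
         y(i) = sqrt(x(i))
      enddo
      call system_clock(c1)
      if (rate .gt. 0) then
         seconds = dble(c1 - c0)/dble(rate)
      else
!        No clock available on this system
         seconds = -1.0d0
      endif
!     Guard: never true for a square root, but the compiler
!     cannot prove it, so the loop above must be kept.
      if (y(n) .eq. -100.0d0) write(10,*) y
      end
```

The same pattern applies to the vector version: replace the loop with the call to the vector routine and leave the guard in place.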