In a previous SCAN article, we discussed the optimizations that can be performed by using the vector intrinsic functions on the CRAY T3E (see Vector intrinsic functions I: CRAY T3E). Although the general approach is the same on both the CRAY T3E and IBM SP - isolate loops over calls to the scalar intrinsic functions so that they can be replaced with calls to the vector routines - there is a big difference in the details of implementation. As we saw on the CRAY T3E, the user cannot manually code the calls to the vector intrinsics. Instead, the code must be written in such a way that the compiler can recognize the opportunity to substitute a vector routine for the loop-over calls to the scalar functions. On the IBM machines, the opposite approach is taken in that the user must explicitly code the calls to vector routines.
In the remainder of this article, we will discuss
Consider a code containing the following fragment:
do i=1,n
y(i) = exp(x(i))
enddo
As we've seen previously, simply compiling this code on the T3E with either -O3 or -Ovector3 will result in the substitution of a vector routine for this loop. That this substitution occurred can be verified by inspecting either assembly code or the listing file. On the IBM SP systems, this loop can manually be rewritten as
call vexp(y,x,n)
The calling syntax for the majority of the vector MASS routines is
call vfunc(target,source,length)
where the names of the vector functions are obtained by simply prepending "v" to the standard Fortran intrinsics: vsqrt, vexp, vlog, vsin, vcos, and vtan. The vector MASS library also contains reciprocal (vrec) and inverse square root (vrsqrt) functions that follow this same syntax.
A small number of the routines are called with additional arguments:
call vsincos(target1,target2,source,length)
call vdiv(target,source1,source2,length)
call vatan2(target,source1,source2,length)
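As a minimal sketch of how one of these multi-argument calls might look in a complete program, the example below calls vsincos on a short array. It assumes that the first target receives the sines and the second the cosines (check the mass man page to confirm). The program name, array names, and the stand-in subroutine at the end are hypothetical; the stand-in only exists so that the sketch can be built on machines without MASS.

```fortran
      program sincos_demo
      implicit none
      integer n, i
      parameter (n=4)
      real*8 x(n), s(n), c(n)
      do i=1,n
         x(i) = 0.25d0*i
      enddo
! One call fills both targets (assumed order: sines, then cosines)
      call vsincos(s,c,x,n)
! The last column should be very close to 1.0
      do i=1,n
         write(*,*) x(i), s(i), c(i), s(i)*s(i)+c(i)*c(i)
      enddo
      end

! Hypothetical stand-in for machines without MASS.
! Remove it when linking with -lmassv or -lmassvp2.
      subroutine vsincos(s,c,x,n)
      implicit none
      integer n, i
      real*8 s(n), c(n), x(n)
      do i=1,n
         s(i) = sin(x(i))
         c(i) = cos(x(i))
      enddo
      end
```

Note that the arrays are declared real*8, matching the 64-bit values used for the timings in this article.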
The vector MASS routines are part of a separate library that must be explicitly loaded as shown below:
xlf90 myprog.f -L/usr/local/apps/mass/lib -lmassv
xlf90 myprog.f -L/usr/local/apps/mass/lib -lmassvp2
Linking with -lmassv uses the generic vector MASS library, while -lmassvp2 loads a version of the library that has been specifically tuned for the POWER2 architecture. NPACI's IBM Teraflops system scheduled for delivery in the latter half of 1999 is based on the POWER3 architecture. At the time of writing, a version of vector MASS tuned for this processor is not yet available, but will most likely be developed in the future. It is suggested that the architecture-specific library be used whenever possible.
The performance of the scalar and vector intrinsics was compared for code compiled using -O3 -L/usr/local/apps/mass/lib -lmassvp2. The timings below were obtained for 1,000,000 evaluations of the common intrinsic functions using 64-bit real values. (NOTE: the default REAL size on the IBM is 32 bits. For codes that declare variables to be of type REAL, the -qautodbl=dbl flag should be used.) RSQRT is the inverse square root function; the compiler automatically replaces 1.0/SQRT() with a call to this function. SINCOS involves separate calls to SIN and COS (scalar) or a single call to VSINCOS (vector).
function t(vector) t(scalar) speedup
SQRT 0.05373 0.26619 4.95443
RSQRT 0.05078 0.35213 6.93409
EXP 0.05251 0.33875 6.45155
LOG 0.05943 0.34537 5.81170
SIN 0.03959 0.21468 5.42296
COS 0.03866 0.22530 5.82737
TAN 0.08948 0.42596 4.76022
SINCOS 0.06710 0.51386 5.74256
The code used to obtain the timings on the vector and scalar intrinsics was recompiled as above, but with the addition of the -qarch=pwr2 flag. This enables the compiler to generate code that is specific to the POWER2 architecture and, in particular, to use the hardware square root instruction. This affects only the scalar SQRT and RSQRT operations, for which new timings are shown below:
function t(vector) t(scalar) speedup
SQRT 0.05140 0.08166 1.58876
RSQRT 0.05059 0.13539 2.67612
The purpose of this article is to illustrate the use of the vector intrinsic routines, but it is worthwhile to take a short digression to discuss the scalar MASS (or simply MASS) library. We recompiled the code, but this time with -O3 -qarch=pwr2 -lmass -lmassvp2. The results are shown below.
function t(vector) t(scalar) speedup
SQRT 0.05227 0.08228 1.57407
RSQRT 0.05055 0.13545 2.67965
EXP 0.05138 0.14488 2.81985
LOG 0.05808 0.22671 3.90322
SIN 0.03975 0.11235 2.82651
COS 0.04115 0.11239 2.73112
TAN 0.09023 0.23476 2.60191
SINCOS 0.06658 0.25029 2.77396
The MASS library, which must be loaded separately from the vector MASS library, contains optimized versions of the scalar intrinsics. The improved performance is obtained at the expense of a slight degree of precision. For most applications, this difference should not be noticeable. For particularly sensitive applications or for cases where high precision is needed, it is strongly suggested that the results obtained with and without the MASS libraries be compared.
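One way such a comparison might be set up is sketched below for the vector exponential: the output of vexp is checked element by element against the scalar EXP intrinsic and the largest relative difference is reported. The program and variable names are hypothetical. To compare against the standard intrinsic rather than the scalar MASS version, the sketch assumes the program is linked with the vector library only (no -lmass). The stand-in vexp at the end is there only so that the sketch builds on machines without MASS.

```fortran
      program cmp_exp
      implicit none
      integer n, i
      parameter (n=1000)
      real*8 x(n), yv(n), ys, relerr, maxerr
      call random_number(x)
! Vector routine under test
      call vexp(yv,x,n)
! Compare each element with the scalar intrinsic
      maxerr = 0.0d0
      do i=1,n
         ys = exp(x(i))
         relerr = abs(yv(i)-ys)/abs(ys)
         if (relerr.gt.maxerr) maxerr = relerr
      enddo
      write(*,*) 'max relative difference: ', maxerr
      end

! Hypothetical stand-in for machines without MASS.
! Remove it when linking with -lmassv or -lmassvp2.
      subroutine vexp(y,x,n)
      implicit none
      integer n, i
      real*8 x(n), y(n)
      do i=1,n
         y(i) = exp(x(i))
      enddo
      end
```

The inputs here are uniform random numbers between 0 and 1; a real check should use arguments drawn from the range that the application actually produces.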
In contrast to the vector library, which requires that calls to the vector functions be explicitly coded, linking the MASS library results in an automatic substitution of the scalar MASS routines for the standard intrinsics. All timings presented throughout the remainder of this article are based on code compiled with -qarch=pwr2 -lmass -lmassvp2.
Just as on the CRAY T3E, the speedup obtained from using the vector intrinsic depends on the length of the vector. The table below illustrates how the performance of vexp() asymptotes as the length of the vector becomes large:
array size t(scalar) t(vector) speedup
10 0.0000044 0.0000044 0.9972900
100 0.0000176 0.0000090 1.9536424
1000 0.0001494 0.0000650 2.2986060
10000 0.0014261 0.0005642 2.5277748
100000 0.0139536 0.0053033 2.6311172
1000000 0.1387933 0.0535633 2.5912005
If you expect your vector lengths to be small, it is best to time the code using both the scalar and vector routines from the MASS library. For very short vectors, the vector intrinsic can be slower than the scalar routine.
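A timing comparison of this kind could be sketched as follows. This is not the code used to produce the table above; the program name, the repetition count nrep, and the use of the Fortran 90 system_clock routine are assumptions made for illustration. Each vector length is timed over nrep repetitions, and a test that is never true at run time is placed inside the repetition loops to discourage the compiler from discarding the work.

```fortran
      program time_vexp
      implicit none
      integer nmax, nrep
      parameter (nmax=100000, nrep=1000)
      integer n, i, k, c0, c1, c2, rate
      real*8 x(nmax), ys(nmax), yv(nmax), ts, tv
      call random_number(x)
      n = 10
      do while (n.le.nmax)
         call system_clock(c0,rate)
! Scalar version: loop over calls to the intrinsic
         do k=1,nrep
            do i=1,n
               ys(i) = exp(x(i))
            enddo
            if (ys(n).eq.-100.0d0) write(10,*) ys
         enddo
         call system_clock(c1)
! Vector version: one call per repetition
         do k=1,nrep
            call vexp(yv,x,n)
            if (yv(n).eq.-100.0d0) write(10,*) yv
         enddo
         call system_clock(c2)
! Average time per repetition, in seconds
         ts = dble(c1-c0)/dble(rate)/nrep
         tv = dble(c2-c1)/dble(rate)/nrep
         write(*,*) n, ts, tv
         n = n*10
      enddo
      end

! Hypothetical stand-in for machines without MASS.
! Remove it when linking with -lmassv or -lmassvp2.
      subroutine vexp(y,x,n)
      implicit none
      integer n, i
      real*8 x(n), y(n)
      do i=1,n
         y(i) = exp(x(i))
      enddo
      end
```

If the clock resolution is too coarse for the shortest vectors, the reported times may be zero; increasing nrep should help in that case.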
A main drawback of using the vector MASS routines is that they reduce the portability of the code. Unless you plan on writing a code that runs only on the IBM SP, one of the following two solutions is recommended:
The following code fragment will take advantage of the vector intrinsics on both the CRAY T3E and the IBM SP:
#ifdef MASS
call vexp(y,x,n)
#else
do i=1,n
y(i) = exp(x(i))
enddo
#endif
The code is simply written as if it were to be run only on the IBM SP, but a separate file is compiled and linked for other machines. For example, a file named vecint.f containing the following code could be created:
subroutine vexp(y,x,n)
integer n
real*8 x(n),y(n)
do i=1,n
y(i) = exp(x(i))
enddo
end
subroutine vlog(y,x,n)
integer n
real*8 x(n),y(n)
do i=1,n
y(i) = log(x(i))
enddo
end
...
Machine specific makefiles could then be created with the following entries
# For the IBM SP2
my_executable: $(OBJS)
$(LDR) -o $@ $(OBJS) -L/usr/local/apps/mass/lib -lmass -lmassvp2
# For other machines
my_executable: $(OBJS) vecint.o
$(LDR) -o $@ $(OBJS) vecint.o
For most codes run on NPACI's machines, memory limitations are not an issue. In the example below, the original loop is rewritten to use the vector function, but the transformed version requires memory to store an extra vector of length n.
c Original loop
do i=1,n
rinv(i) = 1.0/sqrt(x(i)*x(i) + y(i)*y(i))
enddo
c Transformed loop
do i=1,n
temp(i) = x(i)*x(i) + y(i)*y(i)
enddo
call vrsqrt(rinv,temp,n)
If memory limitations are an issue, the loop can be stripmined so that only a relatively small temporary array (in this case of length 1024) needs to be allocated. An additional benefit of stripmining is that memory traffic is reduced, since the entire temporary array fits in cache, possibly resulting in even better performance.
c Stripmined loop
do i1=1,n,1024
i2 = min(i1+1023,n)
do i=i1,i2
temp(i-i1+1) = x(i)*x(i) + y(i)*y(i)
enddo
call vrsqrt(rinv(i1),temp,i2-i1+1)
enddo
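When stripmining by hand it is easy to get the bounds of the last, partial block wrong, so it may be worth checking the result against the original loop. The sketch below does this for a vector length that is not a multiple of the block size. The program name, the parameter nb, and the stand-in vrsqrt are hypothetical and serve only to make the example self-contained.

```fortran
      program stripmine
      implicit none
      integer n, nb
      parameter (n=5000, nb=1024)
      integer i, i1, i2
      real*8 x(n), y(n), rinv(n), temp(nb), ref, maxerr
      call random_number(x)
      call random_number(y)
! Shift x away from zero so the argument is never zero
      do i=1,n
         x(i) = x(i) + 1.0d0
      enddo
! Stripmined loop: temp holds at most nb elements
      do i1=1,n,nb
         i2 = min(i1+nb-1,n)
         do i=i1,i2
            temp(i-i1+1) = x(i)*x(i) + y(i)*y(i)
         enddo
         call vrsqrt(rinv(i1),temp,i2-i1+1)
      enddo
! Check against the original scalar loop
      maxerr = 0.0d0
      do i=1,n
         ref = 1.0d0/sqrt(x(i)*x(i) + y(i)*y(i))
         maxerr = max(maxerr, abs(rinv(i)-ref)/ref)
      enddo
      write(*,*) 'max relative difference: ', maxerr
      end

! Hypothetical stand-in for machines without MASS.
! Remove it when linking with -lmassv or -lmassvp2.
      subroutine vrsqrt(r,s,n)
      implicit none
      integer n, i
      real*8 r(n), s(n)
      do i=1,n
         r(i) = 1.0d0/sqrt(s(i))
      enddo
      end
```

With n=5000 and nb=1024, the final pass through the outer loop handles only 904 elements, which exercises the min() in the upper bound.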
The IBM approach to the vector intrinsics requires the programmer to explicitly code the calls to the vector routines. While this shifts the burden onto the programmer, it does have the advantage that the programmer always knows when a vector version of the routine is being used. The following points should be kept in mind when optimizing for the intrinsic functions on the IBM SP:
For more information on the MASS library, see either the mass man page on the SDSC IBM SP or visit www.rs6000.ibm.com/resource/technology/MASS.
The following codes are available to demonstrate the use of the vector intrinsic functions on the IBM SP:
The SDSC IBM SP is currently configured with 66 MHz nodes for interactive jobs and 160 MHz nodes for batch jobs. To get accurate timings that reflect the performance that can be expected for real applications, jobs should be submitted via LoadLeveler. Although serial (single processor) jobs are allowed on SDSC's SP, they are highly discouraged except for short benchmark runs.
The optimizing compilers on the IBM SP do an excellent job of eliminating operations that produce results that are never used. For example, in the program below, the loop-over calls to sqrt(x(i)) will be optimized away.
program test1
real, dimension(100) :: x,y
call random_number(x)
do i=1,100
y(i) = sqrt(x(i))
enddo
end
To prevent this, the possibility of using the results must exist. By adding the following line after the loop
if(y(100).eq.-100.0) write(10,*) y
we take advantage of the fact that the compiler understands the syntax of the program, but not the semantics (for example, that the sqrt function cannot return a negative value). This technique is used throughout the examples and must be kept in mind if the programs are modified.
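Putting the two pieces together, the modified test program might look like the sketch below. The only change from test1 above is the added line after the loop, shown with an explanatory comment.

```fortran
      program test1
      real, dimension(100) :: x,y
      call random_number(x)
      do i=1,100
         y(i) = sqrt(x(i))
      enddo
! Never true at run time, since sqrt cannot return a negative
! value, but the compiler is not expected to know that, so the
! loop above should be kept.
      if(y(100).eq.-100.0) write(10,*) y
      end
```

Because the test is never satisfied, nothing should be written to unit 10 and the added line costs only a single comparison.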