[an error occurred while processing this directive]  

Scientific Computing at NPACI (SCAN)

Shared memory programming on the IBM POWER3 nodes I: Parallelization of simple loops

In the earlier years of parallel computing, there was a clear distinction between shared memory and distributed memory machines. The shared memory machines, also known as symmetric multiprocessors (SMPs), were characterized by the fact that each CPU had uniform access to all memory locations. Although these machines were straightforward to program, technological limitations made it difficult to scale them beyond a modest number of processors. As a result, distributed memory machines such as the CRAY T3E and IBM SP became the dominant architectures in high performance computing.

In recent years, a number of hardware vendors have recognized the benefits of combining the two types of parallelism in a single machine. Platforms such as the HP Exemplar and SGI/CRAY Origin2000 are essentially collections of SMP nodes connected through a high-speed network. Although some of these machines provide additional hardware to give the user the illusion of a large flat memory extending over several nodes, the best performance is generally obtained when they are explicitly programmed at two levels - shared memory parallelism within each SMP node and message passing between nodes.

The new IBM Teraflops system, scheduled to arrive at SDSC later this year, will consist of 144 8-way SMP nodes based on the new POWER3 processor. The experiences of users at sites such as Lawrence Livermore National Lab (LLNL), which already have similar systems, indicate that a combination of shared memory parallelism and message passing will deliver the best performance on SDSC's Teralop machine.

In this article, we will focus on shared memory parallelism within the IBM POWER3 SMP node. Although we will provide some examples involving the use of OpenMP directives, this is not a tutorial on OpenMP. The reader is referred to www.openmp.org for more details. The main topics to be covered in the remainder of this article are:

Autoparallelization features of the IBM compilers

The IBM compilers for the POWER3-based systems have the capability to parallelize simple loops. While there is much more to shared memory programming than loop-level parallelism, for most applications that will be run on the Teraflops system this is sufficient, since coarse-grained parallelism will be obtained by using message passing between nodes. When deciding whether or not to parallelize a loop, the compilers apply two broad classes of tests:

Parallelism analysis is applied first, since it makes no sense to do a cost-based analysis if it is usafe to parallelize the loop. Once a loop has been determined to be safe for parallelization, cost-based analysis is applied to decide whether it is worthwhile to parallelize the loop.

Codes that are to be run on the POWER3 nodes should compiled using the "_r" versions of the compilers: xlf90_r, mpxlf90_r, cc_r, etc. The "_r" indicates that the code will automatically be linked with the thread-safe libraries. The Fortran examples presented below were all compiled using the following set of flags:

xlf_r -O3 -qsmp -qreport=smplist -qsource

The -qsmp flag is required and specifies that the code should be run in parallel. If using a makefile, this flag should be included on both the compile and load lines. The next set of options, -qreport=smplist -qsource, direct the compiler to generate a listing file containing the original source code plus information on the success or failure of the parallelization of each loop.

As a simple example, consider a code containing the following loop:

      do i=1,n
         z(i) = x(i) + y(i)
      enddo

The SMP Parallelization Report section of the listing file generated for this code contains the following message:

C 1585-501  Original Source Line 27
       AUTO PARALLEL DO CLLIV_44=1,10000000,1
         i = CLLIV_44
         z(i) = x(i) + y(i)
       end do

The AUTO PARALLEL message indicates that the compiler was able to successfully parallelize this loop without programmer intervention. Many of the simple loops in a typical code will be auto parallelized, eliminating the need for the programmer to use OpenMP directives or explicitly program using threads. Unfortunately, there are many loops that the compiler cannot parallelize. In the next two sections, we'll look at some of these loops and show how, if appropriate, to override the compiler's decisions.

Parallelizing loops with OpenMP

Certain loops can be safely parallelized by the programmer, but will not be automatically parallelized by the compiler. We'll look at some of these loops in this section.

Indirect addressing

In the example below, the compiler chose not to parallelize the loop, since it couldn't determine whether or not any data dependencies exist:

      do i=1,n
         y(index(i)) = z(i)
      enddo

C 1585-501  Original Source Line 34
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-112  Dependence information is not precise.
C 1585-113  Data dependence prevents parallelization.
       do i=1,10000000,1
         y(index(i)) = z(i)
       end do

This can be seen by explicitly writing out a few iterations of the loop (for simplicity, let's assume a total of six iterations split across two processors):

          CPU 0                    CPU 1

       y(index(1)) = z(1)       y(index(4)) = z(4)
       y(index(2)) = z(2)       y(index(5)) = z(5)
       y(index(3)) = z(3)       y(index(6)) = z(6)

If the array index does not have any repeated values, then the loop can safely be parallelized. On the other hand, if there are repeated values in index, then a race condition may occur as shown below:

Let index = /1,3,2,4,1,5/

          CPU 0             CPU 1

       y(1) = z(1)       y(4) = z(4)
       y(3) = z(2)       y(1) = z(5)
       y(2) = z(3)       y(5) = z(6)

In this case, the value assigned to element y(1) will depend on which CPU executes the assignment last. If the programmer knows for sure that there are no repeated values in index (for example, index contains a permutation of integers), then the compiler can be overridden by inserting the !$OMP PARALLEL DO directive before the loop. This is an example where the compiler makes the conservative decision, but the programmer takes advantage of extra information that is not available to the compiler. The new listing file will contain the message:

C 1585-501  Original Source Line 33
       PARALLEL DO i=1,10000000,1
         y(index(i)) = z(i)
       end do

Function and subroutine calls

The following three loops contain further examples of operations that prevent autoparallelization. In each case, the compiler was not able to determine if there were any side effects of the function or subroutine calls.

c     Loop 1
      do i=1,n
         call sub1(y,x,z,i,n)
      enddo

C 1585-501  Original Source Line 40
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-111  Side effects of procedure call(s) cannot be determined.
       do i=1,10000000,1
         T_40 = 10000000
         call sub1(y,x,z,i,T_40)
       end do

c     Loop 2
      do i=1,n
         call sub2(y(i),x(i),z(i))
      enddo

C 1585-501  Original Source Line 46
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-111  Side effects of procedure call(s) cannot be determined.
       do i=1,10000000,1
         call sub2(y,x,z)
       end do

c     Loop 3
      do i=1,n
         y(i) = func(x(i),z(i))
      enddo

C 1585-501  Original Source Line 52
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-111  Side effects of procedure call(s) cannot be determined.
       do i=1,10000000,1
         y(i) = func(x,z)
       end do

Even in the second and third examples where individual array elements are passed as arguments, the compiler could not be certain that the function or subroutine did not alter other elements of the arrays. This follows from the fact that Fortran passes arguments by reference rather than value. If we know that there are no unintended side effects, the compiler's decisions can once again be overridden using an OpenMP directive.

Reduction operations

Loops containing reduction operations will automatically be parallelized by the compiler unless the -qstrict flag is used. This option prevents the compiler from performing any optimizations that might alter the numerical results. Since parallelization of a loop modifies the order in which elements of an array are collected, numerical differences can arise. The listing below was obtained with -qstrict, but no OpenMP directives:

      do i=1,n
         s = s + y(i)
      enddo

C 1585-501  Original Source Line 59
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-112  Dependence information is not precise.
C 1585-113  Data dependence prevents parallelization.
       do i=1,10000000,1
         s = s + y(i)
       end do

To override the compiler when -qstrict is used, the directive !$OMP PARALLEL DO REDUCTION(+:s) should be inserted before the loop.

Short loops

Certain loops that are safe candidates for parallelism don't pass the cost-based analysis tests. In particular, short loops where the interation count is known at compile time will not be parallelized:

      do i=1,10
         y(i) = sqrt(x(i))**sqrt(z(i))
      enddo

C 1585-501  Original Source Line 86
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-109  Granularity of computation is relatively small.
       do i=1,10,1
         y(i) = DSQRT(x(i)) ** DSQRT(z(i))
       end do

In this case, the overhead associated with thread management may cause the loop to execute slower in parallel. It should be noted that the compiler makes its analysis based solely on the iteration count, regardless of the amount of work that is done on each iteration. The current version of the xlf compiler uses a cutoff of ~100, but this may potentially change in future releases. If the programmer feels that the loop should be parallelized anyway, the compiler can be overridden by using the !$OMP PARALLEL DO directive.

Loops that shouldn't be parallelized

There are several types of loops that should not be parallelized. These include loops containing control dependencies and data dependencies.

Control dependence

A control dependency arises when one statement determines whether or not a following statement will be executed. In the example below, the test on y(i) determines whether the assignments will be made to elements i+1 through n. The compiler will not parallelize these types of loops and will return an error if an OpenMP directive is used to force parallelization:

      do i=1,n
         y(i) = x(i) * z(i)
         if(y(i).gt.0.75) go to 999
      enddo
 999  continue

C 1585-501  Original Source Line 65
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-110  Loop has loop carried control dependence.
       do i=1,10000000,1
         y(i) = x(i) * z(i)
C 1585-501  Original Source Line 67
         if ((y(i) .gt. 0.75d0) .eq. 0) goto 90007
C 1585-501  Original Source Line 67
         goto 90008
 90007   continue
       end do
 90008 continue

Flow dependence

In a loop containing a flow dependence, the result of one statment is required by a following statement. In the example below, y(3) depends on y(2), y(4) depends on y(3), and so on.

      do i=2,n
         y(i) = y(i-1) + x(i)
      enddo

C 1585-501  Original Source Line 74
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-113  Data dependence prevents parallelization.
       do i=2,10000000,1
         ScRep_59 = y(i - 1) + x(i)
         y(i) = ScRep_59
       end do

The reason for not parallelizing this loop can be seen easily by explicitly writing out the loop for a small number of iterations (we'll assume six iterations divided across two processors):

             CPU 0                  CPU 1

        y(2) = y(1) + x(2)     y(5) = y(4) + x(5)
        y(3) = y(2) + x(3)     y(6) = y(5) + x(6)
        y(4) = y(3) + x(4)     y(7) = y(6) + x(7)

The assignment made to y(4) in the last iteration of CPU 0 is required by CPU 1 in order to get the same answer as would be obtained by the serial code. Placing the !$OMP PARALLEL DO directive before this loop will force parallelization, but the results will most likely be incorrect.

Anti-dependence

The following loop which contains an anti-dependence was not automatically parallelized by the compiler:

      do i=2,n
         x(i-1) = x(i) + z(i)
      enddo

C 1585-501  Original Source Line 80
C 1585-108  SMP: Did not parallelize this loop potentially because:
C 1585-113  Data dependence prevents parallelization.
       do i=2,10000000,1
         ScRep_60 = x(i) + z(i)
         x(i - 1) = ScRep_60
       end do

The reason why it is not safe to parallelize this loop is a little more difficult to see, but becomes apparent when the loop is explicitly written out for a small number of iterations. As in the previous examples, we'll look at six iterations split across two CPUs.

             CPU 0                  CPU 1

        x(1) = x(2) + z(2)     x(4) = x(5) + z(5)
        x(2) = x(3) + z(3)     x(5) = x(6) + z(6)
        x(3) = x(4) + z(4)     x(6) = x(7) + z(7)

The value assigned to x(3) depends on x(4). Different results will be obtained based on whether or not CPU 1 executes its first iteration before CPU 0 executes its last iteration. An easy way to avoid this problem is to copy array x into a temporary array which is used on the right-hand side (rhs) of the expression. This modification to the code guarantees that the "old" values of x are used on the rhs:

      do i=2,n
         t(i) = x(i)
      enddo

      do i=2,n
         x(i-1) = t(i) + z(i)
      enddo

It should be noted that the use of the !$OMP PARALLEL DO directive before the original loop would have overridden the compiler's decision not to parallelize, but the results would most likely be wrong.

Suggested strategies for shared memory parallelism

As we have seen, one way to classify loops compiled on the IBM POWER3 SMP nodes is to divide them into those which auto parallelize and those that don't. This second category can be further divided based on the reason why auto parallelization failed:

  1. Loops which can safely be parallelized based on additional knowledge available to the programmer
  2. Loops which would parallelize except that they fail the cost-benefit analysis
  3. Loops that should not be parallelized

The safest and easiest approach is to first compile the original code and examine the parallelization report. For those loops that auto parallelize, no additional work is required. For those that fail to parallelize, the listing file can then be examined to determine whether or not it is safe to insert OpenMP directive to force parallelization.

In the first category of loops listed above, the programmer applies his or her knowledge of the semantics of the code. For example, that there are no repeated indices on the left-hand side of assignments or that the functions and subroutines have no unintended side effects.

For the second category of loops, the programmer is responsible for making an estimate of whether or not the amount of work carried out by the loops is sufficient to justify the loop overhead. This will probably involve a certain amount of trial and error; we'll investigate this in more detail in future articles.

The third category of loops presents the biggest challenge to the programmer. Parallelization of these loops requires that the programmer transform the loops or change the logic of the code so as to avoid flow or control dependencies.

Summary

In this article, we have explored the issues involved in the parallelization of simple loops on the IBM POWER3 SMP nodes. Programmers can take advantage of the auto parallelization capabilties of the IBM compilers and focus their efforts on just those loops which require additional intervention. In future SCAN articles, we'll turn our attention to more advanced loop transformations that allow parallelization and the use of the various clauses of OpenMP directives to tune the performance of parallel codes.