Parallel dieCAST Program
Serial Code Performance
|
Routine |
Mflops |
|
Main |
1 |
|
Fs |
660 |
|
Range |
758 |
|
Initfs |
41 |
Parallel Program Structure
The original serial code is parallelized. For this parallelization, whole domain is 3 dimensionally decomposed. Following subroutines are added to prepare the parallelization:
Module Index: Define parameter and global variable
Module allocvar: Define allocable variable
Subroutine Mpisetup: Decompose domain and define its variable. Create communication data structure.
Subroutine Mpi_3d_sndrcv: execute communication between neighbor surfaces of local array blocks.
Preparation Subroutines
3-Dimensional Decomposition (subroutine mpisetup)
The domain is decomposed in 3-dimension as shown in the bellow figure.
In MPI, the following lines are preparing this type of domain decomposition:
iproc=N
jproc=N
kproc=N
dims(1)= iproc
dims(2)= jproc
dims(3)= kproc
periods(1) = .false.
periods(2) = .false.
periods(3) = .false.
call MPI_Cart_create(MPI_COMM_WORLD,3,dims,periods,.true.,
& mpi_comm_cart,ierr)
call MPI_Cart_coords(mpi_comm_cart,irank,3,mpi_coords,ierr)
ipid = mpi_coords(1)+1
jpid = mpi_coords(2)+1
kpid = mpi_coords(3)+1
Local Array Structure (subroutine mpisetup)
Each local array is allocated as A(IL0:IR0,JL0:JR0,KL0:KR0). IL0 and IR0 are shown in the bellow figure.
IS1 and I0 are the beginning and end points of the real local variable initially decomposed. Two pads on each side are the receiving pads from the neighbor points. The same works for y and z directions.
IS2, IS3 and IS4 are the same with IS1 if the node of the array block is not the first one in the direction. If the node of the array block is the first one in the direction, IS1=1, IS2=2, IS3=3, and IS4=4.
I1, I2, I3, I4 are the same with I0 if the node of the array block is not the last one in the direction. If the node of the array block is the last one in the direction, I1=I0-1, I2=I0-2, and I3=I0-3.
IP1 and IP2 are I0+1 and I0+2 respectively (or I0 only at the last node in the direction). IN1 and IN2 are Also, IN1 and IN2 are I0-1 and I0-2 respectively.
ISP1 and ISP2 are IS1+1 and IS2+2 respectively. Also, ISN1 and ISN2 are IS1-1 and IS1-2 respectively (or IS1 only at the first node in the direction).
Data Type for Communication
I-Surface
We needed to create data types for the surfaces of the local array block.
ipface is for the surface in the i-direction, jpface for j-direction, and kpface for k-direction.
From the above local array block, the I-surface and points are shown in the following figure:
The data type 1 (imface) will have the stride of Ni and the number of Nj which can be composed in the following mpi call:
call MPI_Type_vector(jmsize,1,ispan,MPI_REAL,imface,ierr)
Where jmsize is Nj and ispan is Ni.
The data type 1 will be repeated by the number of Nk on the other columns with each displacements of Nj*Ni. Thus the general type for the I-surface can be expressed by the following mpi call:
do i=1,kmsize
iblock(i)=1
idisp(i)=jspan*ispan*(i-1)
itype(i)=imface
enddo
call MPI_Type_struct(kmsize,iblock,idisp,itype,ipface,ierr)
In reality we have 4 pad layers which has to be displaced, thus the actual data type of the I-surface will be constructed by
do i=1,kmsize
iblock(i)=1
idisp(i)=jspan*ispan*4*(i-1)
itype(i)=imface
enddo
call MPI_Type_struct(kmsize,iblock,idisp,itype,
+ipface,ierr)
J-Surface
The j-surface and its points from the local array block are shown in the following figure:
The points have the number of Nk, the size of Ni and the stride of Ni*Nj.
This data type will be constructed by the following mpi call:
call MPI_TYPE_vector(kmsize,imsize,ispan*jspan,MPI_REAL,jpface,ierr)
K-Surface
The k-surface and its points from the local array block are shown in the following figure:
The points have the number of Nj, the size of Ni and the stride of Ni.
This data type will be constructed by the following mpi call:
call MPI_Type_vector(jmsize,imsize,ispan,MPI_REAL,kpface,ierr)
Communication (subroutine mpi_3d_sndrcv)
Two pads, P1 and P2, on the communication direction were added on each local array. The communication C1 will send the layer 1 at the PE=i to the pad 1 at the PE=i+1.
A part of the communication program shown bellow do the above communication in the positive x-direction (in this program , it is noted with i-prefix). If mim=2, both C1 and C2 communication will occur. If mim=1, only C1 communication will occur. If mim=0, no communication will occur in this direction.
The same logic works for the negative x-direction with mip, the positive and negative y-direction with mjm and mjp, and the positive and negative z-direction with mkm and mkp.
subroutine mpi_3d_sndrcv(A,mim,mip,mjm,mjp,mkm,mkp,itag)
use index
include 'mpif.h'
dimension A(il0:ir0,jl0:jr0,kl0:kr0)
integer request,status(MPI_STATUS_SIZE)
if(mim.gt.0) then
do i=1,mim
c <- rcv imface from ipface
if(ipid.ne.1) then
ii1=is1-i
ij1=js1
ik1=ks1
call MPI_irecv(A(ii1,ij1,ik1),1,ipface,
& ipid-2,itag,mpi_comm_i,request,ierr)
call mpi_wait(request,status,ierr)
endif
c -> send ipface to imface (mim=1: one layer, mim=2: two layers)
if(ipid.ne.iproc) then
ii1=I0-i+1
ij1=js1
ik1=ks1
call MPI_send(A(ii1,ij1,ik1),1,ipface,
& ipid,itag,mpi_comm_i,ierr)
endif
enddo
endif
Parallelizing Existing Subroutines
Input Broadcasting (subroutine initfs)
To avoid the congestion for every nodes to read an input file simultaneously, one node (IPID=JPID=KPID=1) will read the input file. The node will broadcast to other nodes or decompose the global variable to the local variable to each node.
if(ipid.eq.1 .and. jpid.eq.1. and. kpid.eq.1) then
OPEN(9,file='PREP/DATA/RUNDATA',form='unformatted')
REWIND 9
READ(9) CASE,DSCRIB,DT,TLZ,B,G,DM0,DE0,DMZ0,GEV,GHV,F,
1 TANPHI,RXYMX,RZMX,TAU,TAUN,DRAG,FLTW,DAODT,X0DEG,Y0DEG,DXDEG,
2 X1DEG,Y1DEG,ODT,PRN,ORXYMX,ORZMX,OFLTW,Y,YV,YVDEG,YDEG,
3 CS,CSV,OCS,OCSV,DX,DXV,ODX,ODXV,DY,DYV,ODY,ODYV,OM5,OM8,
4 KTRM,LRSTRT,MXIT,ISAV,MXSAV,M1,M2,M3,M4,M5,M5M,M6,M7,M8,
5 M8M,LOPEN,LSURF,LTURB,LWIND,LHEAT,LSOL,LSAVE,LMOVI,MVI
c
endif
c
c broad casting
c
call MPI_bcast(F,NJ2,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(TANPHI,NJ2,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(Y,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(YV,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(YVDEG,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(YDEG,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(CS,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(CSV,NJ1,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(OCS,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(OCSV,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(DX,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(DXV,NJ1,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(ODX,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(ODXV,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(DY,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(DYV,NJ1,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(ODY,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(ODYV,NJ0,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(GEV,NI2*NJ2*NK2,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_bcast(GHV,NI2*NJ2*NK2,MPI_REAL,0,MPI_COMM_WORLD,ierr)
……
c
c Decompose the variable
c
do i=is1,i2
do j=js1,j2
do k=ks1,k2
EV(i,j,k)=gev(i,j,k)
HV(i,j,k)=ghv(i,j,k)
enddo
enddo
enddo
Data Parallelization Type 1 (simple data parallelization)
Original code:
DO 98 K=1,K1
DO 98 J=2,J1
DO 98 I=2,I1
98 RHO(I,J,K)=RHO(I,J,K)-TEMP
Parallel code:
DO 98 K=ks1,K1
DO 98 J=js2,J1
DO 98 I=is2,I1
98 RHO(I,J,K)=RHO(I,J,K)-TEMP
The idex K1, J1 and I1 in the original code are renamed as NK1, NJ1 and NI1 in the parallel code. The index shown in the parallel code are as defined in the "Local Array Structure."
Data Parallelization Type 2 (more computation, less communication)
Original code:
DO 99 K=1,K2
L=K+1
TMP=.5*G/ODZW(L)
DO 99 J=2,J1
DO 99 I=2,I1
99 P(I,J,L)=P(I,J,K)+TMP*(RHO(I,J,K)+RHO(I,J,L))
Parallel code 1:
call mpi_3d_sndrcv(RHO,0,0,0,0,0,1,1100)
DO 99 K=ks1,K2
L=K+1
TMP=.5*G/ODZW(L)
DO 99 J=js2,J1
DO 99 I=is2,I1
call mpi_3d_update(P,0,0,0,0,0,1,1200)
Parallel code 2:
call mpi_3d_sndrcv(RHO,0,0,0,0,1,0,1100)
call mpi_3d_sndrcv(P,0,0,0,0,1,0,1200)
DO 99 K=ks1,k1
L=K-1
TMP=.5*G/ODZW(L)
DO 99 J=js2,J1
DO 99 I=is2,I1
99 P(I,J,K)=P(I,J,L)+TMP*(RHO(I,J,L)+RHO(I,J,K))
The k index of the array P in the original code is from ks1+1 to k1+1. It needs to send the P(K1+1) to the right side node and receive P(ks1) from the left side node. The parallel code 1 receives the RHO from the right side neighbor (higher node in the direction) and update P to the right neighbor node.
The parallel code 2 shifts the whole range of the loop to fit the final P array into the local array. Now P is from ks1 to k1, but needs RHO and P from the left side neighbor. We preferred the type 2 parallelization because we can avoid any update routine after computation.
Another example shows clearly how we can save the communication time by this type of parallelization.
Original code:
DO 500 K=1,K1
DO 300 J=2,J1
DO 300 I=1,I1
DO 304 I=3,I2
304 PX(I,J)=PX(I,J)+IN(I-1,J,K)*IN(I,J,K)*IN(I+1,J,K)*
1 (UX(I,J-1)+UX(I-1,J-1)-UX(I+1,J-1)-UX(I-2,J-1))
Parallel code 1:
Call mpi_3d_sndrcv(P,0,1,0,0,0,0,1300)
DO 500 K=ks1,K1
DO 300 J=js2,J1
DO 300 I=is1,I1
Call mpi_3d_sndrcv(UX,2,1,0,0,0,0,1100)
DO 304 I=is3,I2
304 PX(I,J)=PX(I,J)+IN(I-1,J,K)*IN(I,J,K)*IN(I+1,J,K)*
1 (UX(I,J-1)+UX(I-1,J-1)-UX(I+1,J-1)-UX(I-2,J-1))
Parallel code 2:
Call mpi_3d_sndrcv(P,1,2,0,0,0,0,1300)
DO 500 K=ks1,K1
DO 300 J=js2,J1
DO 300 I=isn2,IP1
DO 304 I=is3,I2
304 PX(I,J)=PX(I,J)+IN(I-1,J,K)*IN(I,J,K)*IN(I+1,J,K)*
1 (UX(I,J-1)+UX(I-1,J-1)-UX(I+1,J-1)-UX(I-2,J-1))
Bottom Stress
The bottom point should be checked if it belongs to the current node.
Original code:
DO 600 J=2,J1
DO 600 I=2,I1
K=KB(I,J)
DRG=IN(I,J,K)*TMP*ODZ(K)*SQRT(U1(I,J,K)**2+V1(I,J,K)**2)
U2(I,J,K)=U2(I,J,K)-DRG*U1(I,J,K)
600 V2(I,J,K)=V2(I,J,K)-DRG*V1(I,J,K)
Parallel code:
DO 600 J=js2,J1
DO 600 I=is2,I1
K=KB(I,J)
if(k .gt. ks1 .and. k .le. k0) then
DRG=IN(I,J,K)*TMP*ODZ(K)*SQRT(U1(I,J,K)**2+V1(I,J,K)**2)
U2(I,J,K)=U2(I,J,K)-DRG*U1(I,J,K)
V2(I,J,K)=V2(I,J,K)-DRG*V1(I,J,K)
endif
600 continue
Reduction
Serialization
Future Work
1. Go through the program with the PI and find any logical mistakes.
2. Optimization (Cache, loop, etc.)
3. Parallel scalability analysis and more optimization.
3. Postprocessing