CS596 Assignment 1.1: Single Processor Performance
Due date 3/8/01 at 5 pm in class.
Compile all codes without any optimization, i.e., for Fortran
compile as:
f90 code.f -o code
and for C compile as:
cc code.c -o code
Keep your code simple, i.e., you don't need to add any more lines than
needed to solve the problems. The SUNHPC10000 web page is at
http://www.npaci.edu/HPC10000
1. We are going to look at the cache performance of simple loops.
Write a code (in either C or Fortran) that has four matrices,
say w, x, y, and z.
Make the dimensions of the matrices 1024 x 1024.
Fill up all the elements of y and z with some floating point
values.
Start your timer here.
Next, write a loop nest over both i and j that does the following,
and execute it 40 times:
x(i,j) = y(i,j)/z(i,j) (for C, use x[i][j] = y[i][j]/z[i][j])
End the loop.
Next, write a second loop nest over both i and j that does the
following, and execute it 40 times:
w(i,j) = x(i,j) + z(i,j) (do the equivalent for a C program)
End the loop.
Stop your timer here.
Print out a few values of x and w.
Print out the total time in seconds that the two loops took.
Hand in the code and the timing.
2. Use the code that you wrote above, but reverse the way
you access the matrices, i.e., if you were accessing them
row-wise in problem 1, now access them column-wise.
Hand in the code and the timing, and comment on the difference (if any)
in timing from the code in problem 1.
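
As a hedged illustration of what "column-wise" means in C, the sketch below
swaps the loop order of the division loop so that i is the inner index. The
name divide_colwise is an assumption. Note that C stores two-dimensional
arrays in row-major order (x[i][j] and x[i][j+1] are adjacent in memory),
while Fortran stores them in column-major order (x(i,j) and x(i+1,j) are
adjacent), so the same loop order walks memory differently in the two
languages.

```c
#define N 1024   /* matrix dimension from the problem statement */

static double x[N][N], y[N][N], z[N][N];

/* Column-wise traversal: j is the outer index and i the inner one, so in C
   consecutive inner iterations touch elements N doubles apart in memory. */
static void divide_colwise(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = y[i][j] / z[i][j];
}
```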
3. Use the same code from problem 1 or 2 (whichever took less time)
and fuse the loops and execute the fused loops 40 times. Hand in
the code and the timing, and comment on how the timing compares to
before (i.e., to the timing of either problem 1 or 2).
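
Loop fusion here means computing both statements in the body of a single
i,j loop nest instead of two separate ones. A minimal sketch, assuming the
row-wise order and the hypothetical names REPS and fused_loops:

```c
#define N    1024   /* matrix dimension from the problem statement */
#define REPS 40     /* the fused loop nest is executed 40 times */

static double w[N][N], x[N][N], y[N][N], z[N][N];

/* Fused version: x[i][j] is produced and consumed in the same iteration,
   so each element of x and z is touched once per pass instead of twice. */
static void fused_loops(void)
{
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                x[i][j] = y[i][j] / z[i][j];
                w[i][j] = x[i][j] + z[i][j];
            }
}
```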
4. Write a code that does matrix multiply, i.e., if a, b, and c are three
matrices, write a code to calculate:
a(i,j) = a(i,j) + b(i,k)*c(k,j)
Make the matrices of dimension 1024 x 1024.
Make sure that you fill up the matrices b and c with some floating
point numbers and initialize matrix a to zero. At the end of your
code, print out the time it takes just to do the matrix multiply loop
(i.e., do not include initialization of a, b, and c in the timing). Also
print out a few elements of matrix a. Hand in the code and the timing.
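
The update a(i,j) = a(i,j) + b(i,k)*c(k,j) is applied for every i, j, and k,
which gives a triple loop. The sketch below is one way to write it in C. The
function name matmul, the flat row-major storage, and the i, j, k loop order
are assumptions, not requirements of the problem. Passing the dimension n as
a parameter makes it easy to check the code on a tiny case before running
with n = 1024.

```c
/* a += b * c for n x n matrices stored row-major in flat arrays,
   so element (i,j) lives at index i*n + j. */
static void matmul(int n, double *a, const double *b, const double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                a[i*n + j] += b[i*n + k] * c[k*n + j];
}
```

For example, with n = 2, b = {1, 2, 3, 4}, c = {5, 6, 7, 8}, and a
initialized to zero, the call matmul(2, a, b, c) leaves a = {19, 22, 43, 50}.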
5. Modify the code from problem 4 to include blocking of the matrix
multiply loop. Block all three loops of the matrix multiply.
Run the code using block sizes of 8, 16, 32, 64, and 128, and record the
timing for each case. Print out the same few elements of matrix a to check
correctness against problem 4. Hand in the code and the timing results
for the different block sizes.
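
Blocking all three loops could look roughly like the sketch below: three
outer loops step over blocks, and three inner loops work inside one block.
The names matmul_blocked and bs and the flat row-major storage are
assumptions. The sketch also assumes that the block size divides n evenly,
which holds for n = 1024 and the block sizes listed above.

```c
#include <assert.h>

/* a += b * c for n x n row-major matrices, processed in bs x bs blocks. */
static void matmul_blocked(int n, int bs, double *a,
                           const double *b, const double *c)
{
    assert(bs > 0 && n % bs == 0);   /* simplifying assumption */

    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs)
            for (int kk = 0; kk < n; kk += bs)
                /* work on one block of a, b, and c at a time */
                for (int i = ii; i < ii + bs; i++)
                    for (int j = jj; j < jj + bs; j++) {
                        double sum = a[i*n + j];
                        for (int k = kk; k < kk + bs; k++)
                            sum += b[i*n + k] * c[k*n + j];
                        a[i*n + j] = sum;
                    }
}
```

Calling the function with the same inputs as the unblocked version and
comparing a few entries of a is a quick correctness check.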
6. A machine has a cache miss penalty of 50 clock cycles. All instructions
normally take 2.0 clock cycles (ignoring memory stalls). Assume the miss
rate is 2% and there is an average of 1.33 memory references per
instruction. What is the impact on performance when the behaviour of the
cache is included?
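
A common textbook model for this kind of question adds the average memory
stall cycles per instruction to the base cycles per instruction (CPI):
effective CPI = base CPI + (memory references per instruction) x (miss rate)
x (miss penalty). The helper below is a sketch of that model. The function
name effective_cpi is an assumption, and the numbers in the usage note are
made up and differ from those in the problem.

```c
/* Effective cycles per instruction once cache-miss stalls are included.
   miss_rate is a fraction (e.g. 0.04 for 4%), miss_penalty is in cycles. */
static double effective_cpi(double base_cpi, double refs_per_instr,
                            double miss_rate, double miss_penalty)
{
    double stall_cycles = refs_per_instr * miss_rate * miss_penalty;
    return base_cpi + stall_cycles;
}
```

For instance, with a base CPI of 1.0, 1.5 references per instruction, a 4%
miss rate, and a 100-cycle penalty, effective_cpi(1.0, 1.5, 0.04, 100.0)
gives about 7.0, so memory stalls add about 6 cycles per instruction. The
ratio of effective CPI to base CPI is the slowdown caused by the cache.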
--------------------------------------------
Here is an example of how you would use a C timing routine:
#include <stdio.h>
#include <sys/time.h>
int main(void) {
    struct timeval start_time, end_time;
    double exe_time;
    gettimeofday(&start_time, NULL);
    /* This is the loop which you want to time */
    gettimeofday(&end_time, NULL);
    /* add the microsecond field; tv_sec alone only resolves whole seconds */
    exe_time = (end_time.tv_sec - start_time.tv_sec)
             + (end_time.tv_usec - start_time.tv_usec) / 1.0e6;
    printf("Time = %f seconds\n", exe_time);
    return 0;
}
--------------------------------------------
Here is an example of how you would use an f90 timing routine:
      program timing
      implicit none
      integer count_start, count_stop, count_rate
! count_rate is the number of clock ticks per second
      call system_clock(count_rate=count_rate)
      call system_clock(count=count_start)
! This is the loop which you want to time
      call system_clock(count=count_stop)
      write(6,100) real(count_stop-count_start)/real(count_rate)
 100  format('Time= ',f12.2,' seconds')
      end program timing