CS596 Assignment 1.1: Single Processor Performance

Due date 3/8/01 at 5 pm in class.

Compile all codes without any optimization, i.e., for Fortran
compile as:
f90 code.f -o code
and for C compile as:
cc code.c -o code

Keep your code simple, i.e., you don't need to add any more lines than
are needed to solve the problems. The SUNHPC10000 web page is at
http://www.npaci.edu/HPC10000

1. We are going to look at the cache performance of simple loops.
   Write a code (in either C or Fortran) that has four matrices,
   say w, x, y, and z.

   Make each matrix 1024 x 1024.
   Fill all the elements of y and z with some floating point
   values.

 Start your timer here.
   Next write a loop nest over both i and j that does the following,
   and execute that loop nest 40 times:

   x(i,j) = y(i,j)/z(i,j)   (in C: x[i][j] = y[i][j]/z[i][j])

   End the loop.

   
   Next write a second loop nest over both i and j that does the
   following, and execute that loop nest 40 times:

   w(i,j) = x(i,j) + z(i,j)   (in C: w[i][j] = x[i][j] + z[i][j])

   End the loop.
 End your timer here.

   Print out a few values of x and w.
   Print out the total time in seconds that the two loops took.

   Hand in the code and the timing.

2. Use the code that you wrote above, but reverse the way
   you access the matrices, i.e., if you were accessing them
   row-wise in problem 1, now access them column-wise.
   Hand in the code and the timing, and comment on the difference
   (if any) in timing from the code in problem 1.
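
   As a hedged illustration of the two access orders, the sketch below
   shows both traversals of the division loop in C. C stores arrays in
   row-major order, so an inner loop over j touches consecutive memory;
   Fortran stores arrays in column-major order, so there the roles of
   i and j are reversed. The size N = 4 and the function names are
   assumptions chosen for illustration; they are not part of the
   assignment.

```c
#include <stddef.h>

#define N 4   /* illustrative size (assumption); the assignment uses 1024 */

/* Row-wise sweep: the inner loop runs over j, the contiguous index in C. */
void divide_rowwise(double x[N][N], double y[N][N], double z[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = y[i][j] / z[i][j];
}

/* Column-wise sweep: the inner loop runs over i, so consecutive
   accesses are N doubles apart in memory. */
void divide_colwise(double x[N][N], double y[N][N], double z[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            x[i][j] = y[i][j] / z[i][j];
}
```

   Both functions compute the same x; only the order in which memory
   is visited differs, which is what the timing comparison is meant
   to expose.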

3. Use the same code from problem 1 or 2 (whichever took less time),
   fuse the two loops, and execute the fused loop 40 times. Hand in
   the code and the timing, and comment on how the timing compares to
   before (i.e., to the timing from either problem 1 or 2).
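
   A minimal sketch of what loop fusion means here, assuming a small
   illustrative size N = 4 and hypothetical function names (neither is
   part of the assignment). The unfused version sweeps the arrays
   twice; the fused version does both assignments in one sweep, so
   each x[i][j] is reused right after it is computed.

```c
#include <stddef.h>

#define N 4   /* illustrative size (assumption); the assignment uses 1024 */

/* Unfused: x is written in the first sweep and read back in the second. */
void two_sweeps(double w[N][N], double x[N][N],
                double y[N][N], double z[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = y[i][j] / z[i][j];

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            w[i][j] = x[i][j] + z[i][j];
}

/* Fused: both statements share one loop nest. */
void fused_sweep(double w[N][N], double x[N][N],
                 double y[N][N], double z[N][N])
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            x[i][j] = y[i][j] / z[i][j];
            w[i][j] = x[i][j] + z[i][j];
        }
    }
}
```

   Fusion is legal here because w(i,j) depends only on the x(i,j)
   computed in the same iteration, so both versions produce the same
   x and w.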

4. Write a code that does a matrix multiply, i.e., if a, b, and c are
   three matrices, write a code to calculate:

   a(i,j) = a(i,j) + b(i,k)*c(k,j)

   Make the matrices of dimension 1024 x 1024.
   Make sure that you fill the matrices b and c with some floating
   point numbers and initialize matrix a to zero. At the end of your
   code print out the time it takes to do just the matrix multiply loop
   (i.e., do not include the initialization of a, b, and c in the
   timing). Also print out a few elements of matrix a. Hand in the code
   and the timing.

5. Modify the code from problem 4 to include blocking of the matrix
   multiply loop. Block all three loops of the matrix multiply.
   Run the code using block sizes of 8, 16, 32, 64, and 128, and record
   the timing for each case. Print out the same few elements of matrix a
   to check correctness against problem 4. Hand in the code and the
   timing results for the different block sizes.
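
   One possible shape of a blocked matrix multiply, sketched under
   stated assumptions: the sizes N = 8 and BS = 4, the helper min_sz,
   and the function names are all illustrative and not part of the
   assignment. The three outer loops step over blocks; the three inner
   loops do the usual update inside one block.

```c
#include <stddef.h>

#define N  8   /* illustrative matrix size (assumption); assignment uses 1024 */
#define BS 4   /* illustrative block size (assumption) */

/* Hypothetical helper: clamps a block's upper bound to the matrix size. */
static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: plain triple loop, a must start at zero. */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                a[i][j] += b[i][k] * c[k][j];
}

/* Blocked version: all three loops are blocked with block size BS. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t jj = 0; jj < N; jj += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t i = ii; i < min_sz(ii + BS, N); i++)
                    for (size_t j = jj; j < min_sz(jj + BS, N); j++)
                        for (size_t k = kk; k < min_sz(kk + BS, N); k++)
                            a[i][j] += b[i][k] * c[k][j];
}
```

   The blocked version performs the same multiply-adds as the plain
   version, only in a different order, so the printed elements of a
   should agree up to floating point rounding.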

6. A machine has a cache miss penalty of 50 clock cycles. All
   instructions normally take 2.0 clock cycles (ignoring memory stalls).
   Assume the miss rate is 2% and there is an average of 1.33 memory
   references per instruction. What is the impact on performance when
   the behaviour of the cache is included?
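
   One common way to set up this kind of estimate is sketched below.
   It assumes that every miss pays the full penalty and that the
   stated miss rate applies to all memory references; both are
   simplifying assumptions, not statements from the problem.

```latex
\begin{align*}
\text{CPI}_{\text{eff}}
  &= \text{CPI}_{\text{base}}
   + \frac{\text{memory references}}{\text{instruction}}
     \times \text{miss rate} \times \text{miss penalty} \\
  &= 2.0 + 1.33 \times 0.02 \times 50 \\
  &= 2.0 + 1.33 = 3.33
\end{align*}
```

   Under these assumptions the machine takes about 3.33/2.0, or
   roughly 1.67 times, as many cycles per instruction as it would
   with a cache that never misses.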
--------------------------------------------
Here is an example of how to use the C timing routine:

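#include <stdio.h>
#include <sys/time.h>
#include <math.h>

int main(void) {
  struct timeval start_time, end_time;
  double exe_time;

  gettimeofday(&start_time, 0);

/* This is the loop which you want to time */

  gettimeofday(&end_time, 0);
  /* Combine whole seconds and microseconds; using tv_sec alone
     would truncate the result to whole seconds. */
  exe_time = (end_time.tv_sec - start_time.tv_sec)
           + (end_time.tv_usec - start_time.tv_usec) / 1.0e6;

  printf("Time = %f seconds\n", exe_time);
  return 0;
}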

--------------------------------------------
Here is an example of how to use the f90 timing routine:

      integer count_start, count_stop, count_rate
! determine the number of clock ticks per second
         call system_clock(count_rate=count_rate)

         call system_clock(count=count_start)
! This is the loop which you want to time
         call system_clock(count=count_stop)

         write(6,100) real(count_stop-count_start)/real(count_rate)
 100     format('Time = ',f12.2,' seconds')