Homework 2: Measuring Software

This homework is due on Tuesday, February 25, at 11:59:59 PM (Eastern standard time). You must use submit to turn in your homework like so:
submit cs411_jtang hw2 hw2.c hw2_asm.S The grader will use the supplied Makefile to compile your work. In addition, each submitted file must have a file header comment, as described on the coding conventions page.

You can only complete this assignment within an ARMv8-A development environment, and thus you must have completed the first homework before attempting this assignment.

In this assignment, you will observe the type of assembly instructions a compiler generates. You will measure how how long different software instructions take to execute in your ARMv8-A environment. As an experiment, you will implement an algorithm and then predict how long it will execute.

Part 1: Simple Time Measurement

To begin, create a directory for your assignment and download the following files into that directory via wget:

http://www.csee.umbc.edu/~jtang/cs411.s20/homework/hw2/hw2.c
Skeleton code for this assignment.
http://www.csee.umbc.edu/~jtang/cs411.s20/homework/hw2/hw2_asm.S
Skeleton code for your assembly code.
http://www.csee.umbc.edu/~jtang/cs411.s20/homework/hw2/Makefile
Builds the code for this assignment, by simply running make. Also included is a clean target to remove all built objects. You do not need to modify this file, nor should you submit it with your work.

Now run make to compile everything. You will get some warnings about unused symbols; by the end of this assignment you will have used all of them. You should now have the executable hw2. Execute it in your terminal. It should display something like this:

$ ./hw2
do nothing: execution time = 24 ns
Need at least one argument to continue
  
So far, this program calculates how long it takes to execute a function. Specifically, it measures the time to execute this function:
static int __attribute__((optimize("O0"))) do_nothing(int val)
{
	return val;
}
In this sample run, it took about 24 nanoseconds to call the function. This includes the time to set up the stack, execute the function (which consists of exactly 1 line), and then restore the stack. Use the utility aarch64-linux-gnu-objdump to disassemble the do_nothing() function:
<do_nothing>:
    sub     sp, sp, #0x10
    str     w0, [sp, #12]
    ldr     w0, [sp, #12]
    add     sp, sp, #0x10
    ret
The compiler uses a lot of instruction to effectively do nothing. Notice the part __attribute__((optimize("O0")))? This is a function attribute. It tells the compiler, gcc, to do something special about that function. In this case, it explicitly disables optimizations ("O0") when generating the assembly for do_nothing().

As an experiment, change the optimization level from 0 to 2. Recompile, and re-run the program several times. The execution time probably will not change much, because precise timing is not possible under an emulated environment with the version of QEMU for this class. However, the generated assembly instructions will differ a bit.

Examine the disassembly for do_nothing(). In a comment box at the top of hw2.c, describe the differences between the unoptimized and optimized versions. More specifically, describe why some unoptimized assembly instruction may be safely eliminated, without changing the overall behavior of the C function.

Part 2: Simple Timing Predictor

Now that you have observed the difference between unoptimized and optimized code, disassemble the functions fibonacci_unoptimized() and fibonacci_optimized(). These function calculate the nth Fibonacci number recursively. The functions are identical, but the former function is compiled with "O0" while the latter has "O2". Run the program with steadily increasing target values:

$ ./hw2 10
do nothing: execution time = 25 ns
fibonacci (unoptimized): result = 55, execution time = 3094 ns
fibonacci (optimized): result = 55, execution time = 1855 ns
Need at least four arguments to continue
$ ./hw2 11
do nothing: execution time = 30 ns
fibonacci (unoptimized): result = 1, execution time = 6123 ns
fibonacci (optimized): result = 1, execution time = 3486 ns
Need at least four arguments to continue
$ ./hw2 12
do nothing: execution time = 57 ns
fibonacci (unoptimized): result = 2, execution time = 8689 ns
fibonacci (optimized): result = 2, execution time = 4491 ns
Need at least four arguments to continue
Observe that the tiny fractions of CPU time saved by enabling compilation optimization accumulate, especially with a CPU intensive algorithm like calculating Fibonacci values.

The computational complexity of this algorithm is about O(1.618n). What that means is if it takes X seconds to find Fibonacci number n, then it will take about 1.618X seconds to find Fibonacci number n+1.

Modify the main() function. Predict how long it will take to find the next 5 Fibonacci numbers (i.e., target+1, target+2, etc.) using both the unoptimized and optimized algorithms. Then measure the exact times. Display the predicted and actual times, for both unoptimized and optimized functions.

Part 3: Another Timing Predictor

The next step is to implement another trivial algorithm. Take a look at the comments to tribonacci_optimized(). Implement that function. The complexity of this algorithm is about O(1.839n).

In main(), measure the time to find the fifth Tribonacci value. Then predict the actual time to find the 5+target value, given the earlier measurement. Measure the actual execution time, then display both values.

Part 4: Estimating Instruction Counts

This last part will estimate how many instructions are executed, based upon execution time. Take a look at the comments to asm_equation4(). Within hw2_asm.S, implement the function asm_equation4(). This assembly routine is to be exactly 20 operations: 4 ldr, 8 add, 7 sub, and then 1 ret. Using the function measure_function4(), measure how long this function takes to execute, passing it the array part4_args and integer part4_len. Assume, for now, that every instruction takes the same amount of time to execute. Divide the actual execution time of asm_equation4() by 20; store the quotient in a double for extra precision.

Finally implement num_divisible_by_7() in hw2.c. This function analyzes each of the elements within the given array. For each element, determine if that value is divisible by seven. This function returns the number of elements that are evenly divisible by seven. Then hand-write a similar function asm_num_divisible_by_7() in hw2_asm.S. SPECIAL RESTRICTION: For asm_num_divisible_by_7(), this function may not use multiplication, division, or modulus instructions; it must use a loop to determine if a value is divisible by seven. Measure execution times for both function given the same input array part4_args.

Based upon the actual execution times of asm_equation4(), num_divisible_by_7(), and asm_num_divisible_by_7(), estimate how many instructions were executed by num_divisible_by_7() and asm_num_divisible_by_7(). Display those estimates.

Don't be concerned if the numbers seem odd. Recall that precise measurements within an emulator is very difficult. As a hint, these ARMv8-a instructions may be useful for your assembly code:

Disassemble your implementation of num_divisble_by_7(). Compare the compiler generated instructions versus your hand-written assembly routine. Add a comment block at the top of hw2.c describing your findings.

Part 5: Required Documentation

Add a comment block to the top of hw2.c file that answers these questions:

  1. From Part 1, how did enabling optimizations change the assembly instructions for do_nothing()?
  2. From Part 4, the C function num_divisible_by_7() probably ran significantly faster than asm_num_divisible_by_7(), even though both functions make the same calculations. Why are the execution times so different? How could asm_num_divisible_by_7() be better optimized?

Sample Output

Here is a sample output from running the program. Your results will obviously differ, based upon your computer setup. Although the sample predicted and actual execution times differ, they should be within a few percentages. Furthermore, your assembly implementations may require more or fewer instructions than this sample implementation.

$ ./hw2 21 1 77 0
do_nothing: execution time = 94 ns
fibonacci (unoptimized): result = 10946, execution time = 607180 ns
fibonacci (optimized): result = 10946, execution time = 325416 ns
measuring Fibonacci number #22
  unoptimized: predicted = 982417 ns, actual = 971825 ns
  optimized: predicted = 526523 ns, actual = 584490 ns
measuring Fibonacci number #23
  unoptimized: predicted = 1589550 ns, actual = 1588223 ns
  optimized: predicted = 851914 ns, actual = 872076 ns
measuring Fibonacci number #24
  unoptimized: predicted = 2571891 ns, actual = 2606122 ns
  optimized: predicted = 1378396 ns, actual = 1432893 ns
measuring Fibonacci number #25
  unoptimized: predicted = 4161319 ns, actual = 4242304 ns
  optimized: predicted = 2230244 ns, actual = 2277234 ns
measuring Fibonacci number #26
  unoptimized: predicted = 6733014 ns, actual = 6732937 ns
  optimized: predicted = 3608534 ns, actual = 3705530 ns
tribonacci #5: result = 11, execution time = 173 ns
measuring Tribonacci number #26
  optimized: predicted 62267641 ns, actual = 60562501 ns
asm_equation4: result = 105, execution time = 25 ns
  each instruction takes about 1.25 ns to execute
num_divisible_by_7: result = 3, execution time = 43 ns
asm_num_divisible_by_7: result = 3, execution time = 91 ns
estimated number of instructions for num_divisible_by_7 = 34
estimated number of instructions for asm_num_divisible_by_7 = 72

Other Hints and Notes

Extra Credit

Sorry, there is no extra credit available for this assignment.