Homework 2: Software Performance

This homework is due on Monday, February 27, at 11:59:59 PM (Eastern Standard Time). You must use submit to turn in your homework like so:

submit cs411_jtang hw2 hw2.c hw2_asm.S

The grader will use the supplied Makefile to compile your work. In addition, each submitted file must have a file header comment, as described on the coding conventions page.

This assignment can only be completed within an ARMv8-A development environment; thus, you must have completed the first homework before attempting this one.

In this assignment, you will observe the type of assembly instructions a compiler generates. You will measure how long different software algorithms take to execute in your ARMv8-A environment. You will then implement different optimization techniques to decrease execution time.

Part 1: Simple Time Measurement

To begin, create a directory for your assignment and download the following files into that directory via wget:

https://www.csee.umbc.edu/~jtang/cs411.s23/homework/hw2/hw2.c
Skeleton code for this assignment.
https://www.csee.umbc.edu/~jtang/cs411.s23/homework/hw2/hw2_asm.S
Skeleton code for your assembly code.
https://www.csee.umbc.edu/~jtang/cs411.s23/homework/hw2/Makefile
Builds the code for this assignment when you simply run make. It also includes a clean target to remove all built objects. You do not need to modify this file, nor should you submit it with your work.

Now run make to compile everything to get the executable hw2. Run the program in your terminal. It should display something like this:

$ ./hw2
do nothing: execution time = 8.000000 ns
Need at least one argument to continue
  
So far, this program calculates how long it takes to execute a function. Specifically, it measures the time to execute this unoptimized function:
static int do_nothing(int val)
{
	return val;
}
This function is also marked with a function attribute. Function attributes tell the compiler to do something special when compiling a particular function. In this case, the attribute disables optimizations for this one function: with gcc, the function is compiled as if the -O0 flag were given; with clang, all optimizations are disabled via optnone.
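In the skeleton, UNOPTIMIZED is presumably a macro wrapping one of these compiler-specific attributes. A sketch of how such a macro might be defined (the exact definitions in the skeleton may differ):

```c
/* Sketch of a per-function optimization toggle. The macro names
 * mirror the assignment's UNOPTIMIZED/OPTIMIZED, but these exact
 * definitions are an assumption, not the skeleton's code. */
#ifdef __clang__
#define UNOPTIMIZED __attribute__((optnone))
#else /* gcc */
#define UNOPTIMIZED __attribute__((optimize("O0")))
#endif
#define OPTIMIZED /* use the command-line optimization level */

static UNOPTIMIZED int do_nothing(int val)
{
	return val;
}
```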

In this sample run, it took 8 nanoseconds to call the function. This includes the time to set up the stack, execute the function (which consists of exactly 1 line), and then restore the stack. Use the utility aarch64-linux-gnu-objdump (if running within a virtual machine) or objdump (for native ARMv8-A) to disassemble the do_nothing() function:

<do_nothing>:
    sub     sp, sp, #0x10
    str     w0, [sp, #12]
    ldr     w0, [sp, #12]
    add     sp, sp, #0x10
    ret
The compiler uses a lot of instructions to effectively do nothing. As an experiment, enable optimizations for this function by replacing UNOPTIMIZED with OPTIMIZED. Recompile, and re-run the program several times. The execution time probably will not change much, because precise timing is not possible when running so few instructions. However, the generated assembly instructions will differ a bit.

Examine the disassembly for do_nothing(). In a comment box at the top of hw2.c, describe the differences between the unoptimized and optimized versions. More specifically, describe why some unoptimized assembly instructions may be safely eliminated without changing the overall behavior of the C function. (Unfortunately, clang users are out of luck here, as clang will generate the same instructions either way. For this part of the homework, you should still describe which lines can be safely removed.)

Part 2: Simple Timing Predictor

Now that you have observed the difference between unoptimized and optimized code, disassemble the functions fibonacci_unoptimized() and fibonacci_optimized(). These functions calculate the nth Fibonacci number recursively. The two functions' code is identical, but the former is compiled without optimizations while the latter is optimized. Run the program with steadily increasing target values:

$ ./hw2 10
<snip>
PART 2:
fibonacci (unoptimized): result = 55, execution time = 3322.000000 ns
fibonacci (optimized): result = 55, execution time = 1548.000000 ns

$ ./hw2 11
<snip>
PART 2:
fibonacci (unoptimized): result = 89, execution time = 5352.000000 ns
fibonacci (optimized): result = 89, execution time = 2651.000000 ns

$ ./hw2 12
<snip>
PART 2:
fibonacci (unoptimized): result = 144, execution time = 8668.000000 ns
fibonacci (optimized): result = 144, execution time = 4333.000000 ns
Observe that the tiny fractions of CPU time saved by enabling optimization accumulate, especially with a CPU-intensive algorithm like calculating Fibonacci values.
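For reference, the recursion both Fibonacci functions presumably share is the classic two-branch version. A sketch, with base cases consistent with the sample output above (the 10th Fibonacci number is 55):

```c
/* Naive recursive Fibonacci: fib(1) = fib(2) = 1, so fib(10) = 55,
 * fib(11) = 89, and fib(12) = 144, matching the sample runs.
 * Each call spawns two more, giving roughly O(1.618^n) running time. */
static unsigned long fibonacci(unsigned int n)
{
	if (n <= 2)
		return n ? 1 : 0;
	return fibonacci(n - 1) + fibonacci(n - 2);
}
```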

The computational complexity of this algorithm is about O(1.618^n). That means if it takes X seconds to find Fibonacci number n, then it will take about 1.618X seconds to find Fibonacci number n+1.

Modify the main() function. Predict how long it will take to find the next 5 Fibonacci numbers (i.e., target+1, target+2, etc.) using both the unoptimized and optimized algorithms. Then measure the exact times. Display the predicted and actual times, for both unoptimized and optimized functions.
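One way to compute those predictions is repeated multiplication by the 1.618 growth factor; a sketch (the function name and signature are illustrative, not from the skeleton):

```c
/* Fill out[0..4] with predicted execution times for target+1 through
 * target+5, assuming each successive Fibonacci number costs about
 * 1.618 times the previous one. measured_ns is the time you measured
 * for the original target. */
static void predict_next_five(double measured_ns, double out[5])
{
	double predicted = measured_ns;

	for (int k = 0; k < 5; k++) {
		predicted *= 1.618;	/* one more step up the recursion */
		out[k] = predicted;
	}
}
```

Run this once with the unoptimized measurement and once with the optimized one, then print each prediction next to the corresponding actual measurement.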

Part 3: Algorithm Choices

Next you will implement the same algorithm three different ways, then measure their performances. Look at the comments for fizzbuzz_iterative(). This function calculates the number of fizzes and buzzes for a target number. Implement that function using only a loop and without any multiplication or division operators. Then implement fizzbuzz_recursive(); you must write this as a recursive procedure. Recompile your code, and then test it by passing it two arguments; main() will pass the second argument into your functions. Test it like so:

$ ./hw2 12 65536
<snip>
target fizzbuzz value is 65536
  iterative: num fizz: 21845, num buzz: 13107, execution time = 1221251.000000 ns
  recursive: num fizz: 21845, num buzz: 13107, execution time = 2016063.000000 ns

Disassemble your hw2.o to see how the compiler generated the assembly instructions for fizzbuzz_iterative(). Your next job is to hand-write the function fizzbuzz_asm() within hw2_asm.S. Like the C implementations, this code may not use multiplication or division operators. You can write the code using either iteration or recursion. Afterwards, recompile and compare the execution times of your different implementations. You should find that the pure recursive algorithm is much slower than the iterative one, and that your hand-written assembly is fastest.

ARMv8-A does not have a built-in assembly instruction that computes a modulo (remainder). You will need to come up with your own implementation that does not involve multiplication or division.
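One common trick is a counter that wraps around at the divisor, standing in for the remainder. A sketch of the idea in C, here counting only the multiples of 3 (the fizz side; a second counter wrapping at 5 would handle buzz the same way):

```c
/* Count the multiples of 3 in 1..target without using % or /.
 * The counter mod3 walks 1, 2, 3, 1, 2, 3, ...; every time it
 * reaches 3 we have passed another multiple, so reset it and
 * bump the count. For target = 65536 this yields 21845, matching
 * the "num fizz" value in the sample output. */
static unsigned long count_multiples_of_3(unsigned long target)
{
	unsigned long count = 0, mod3 = 0;

	for (unsigned long i = 1; i <= target; i++) {
		if (++mod3 == 3) {
			mod3 = 0;
			count++;
		}
	}
	return count;
}
```

The same wrap-around counters translate directly into assembly using only add, compare, and branch instructions.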

Part 4: Estimating Recursion Costs

In this part, you will estimate the execution time of fizzbuzz_recursive() given a new target value. In main(), calculate how long a single recursion of fizzbuzz_recursive() takes to execute. Because the variable time_fizzbuzz_recursive recorded the actual execution time for target, the time per call is time_fizzbuzz_recursive divided by target. Add code to display the time per function call.
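That arithmetic can be sketched as follows (the parameter names mirror the description above; the skeleton's actual variable names and types may differ):

```c
/* Estimate the per-call cost of the recursive fizzbuzz, then
 * predict the total time for a set of additional target values.
 * time_fizzbuzz_recursive and target stand in for the skeleton's
 * variables of (roughly) those names. */
static double predict_total_ns(double time_fizzbuzz_recursive,
			       unsigned long target,
			       const unsigned long *args, int nargs)
{
	double per_call = time_fizzbuzz_recursive / (double)target;
	double predicted = 0.0;

	/* Each new target t costs about per_call * t, since the
	 * recursion makes one call per value counted down. */
	for (int i = 0; i < nargs; i++)
		predicted += per_call * (double)args[i];
	return predicted;
}
```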

The array part4_args holds three more input values for your recursive function. Predict the sum of their execution times. Display your estimate and then measure the actual execution time. Here is an example output for this part of the assignment:

$ ./hw2 12 65536 1234 2345 3456
<snip>
PART 4:
time per function call: 31.741699 ns
  predicted = 223302.000000 ns
  actual = 227623.000000 ns

Part 5: Optimizing Memory Copies

For the final part of this assignment, you will implement the iconic function memcpy(). A simplistic version, memcpy_8bit(), is given to you. This version copies 8 bits (1 byte) at a time. Look at its disassembled version to see how inefficient it is. Since ARMv8-A has 64-bit registers, it is easy to make it run faster. Examine memcpy_64bit() and its disassembly. Although the code is similar, it runs faster than the 8-bit version. Your job is to implement asm_memcpy_256bit() in assembly. This code is to copy data 256 bits (32 bytes) at a time. Here is an example output for this part of the assignment:

$ ./hw2 12 65536 1234 2345 3456 655360
<snip>
PART 5:
memcpy (built-in): execution time = 122120.000000 ns
memcpy (8 bits at a time): execution time = 3594330.000000 ns
memcpy (64 bits at a time): execution time = 481018.000000 ns
memcpy (256 bits at a time): execution time = 144225.000000 ns
Observe that the built-in memcpy() runs extremely fast compared to the 8-bit and 64-bit versions. That is because it uses many interesting assembly tricks.
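For reference, the two C versions presumably look something like the following sketch (not the skeleton's exact code). The only difference is the width of each load and store, which is why the 64-bit version makes one-eighth as many trips through the loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy one byte per loop iteration. */
static void *memcpy_8bit(void *dest, const void *src, size_t n)
{
	uint8_t *d = dest;
	const uint8_t *s = src;

	while (n--)
		*d++ = *s++;
	return dest;
}

/* Copy eight bytes per loop iteration. This sketch assumes n is a
 * multiple of 8 and that both pointers are suitably aligned; a real
 * implementation must also handle any leftover tail bytes. */
static void *memcpy_64bit(void *dest, const void *src, size_t n)
{
	uint64_t *d = dest;
	const uint64_t *s = src;

	for (size_t i = 0; i < n / 8; i++)
		*d++ = *s++;
	return dest;
}
```

Your 256-bit assembly version follows the same pattern, moving 32 bytes per iteration, which further divides the number of loop trips by four.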

Sample Output

Here is a sample output from running the program. Your results will obviously differ, based upon your computer setup. Although the sample's predicted and actual execution times differ, they should be within a few percent of each other.

$ ./hw2 12 65536 1234 2345 3456 655360
PART 1:
do_nothing: execution time = 9.000000 ns

PART 2:
fibonacci (unoptimized): result = 144, execution time = 9631.000000 ns
fibonacci (optimized): result = 144, execution time = 4743.000000 ns
measuring Fibonacci number #13
  unoptimized: predicted = 15582.958000 ns, actual = 14136.000000 ns
  optimized: predicted = 7674.174000 ns, actual = 6922.000000 ns
measuring Fibonacci number #14
  unoptimized: predicted = 25213.226044 ns, actual = 24779.000000 ns
  optimized: predicted = 12416.813532 ns, actual = 12250.000000 ns
measuring Fibonacci number #15
  unoptimized: predicted = 40794.999739 ns, actual = 40420.000000 ns
  optimized: predicted = 20090.404295 ns, actual = 19779.000000 ns
measuring Fibonacci number #16
  unoptimized: predicted = 66006.309578 ns, actual = 66611.000000 ns
  optimized: predicted = 32506.274149 ns, actual = 28514.000000 ns
measuring Fibonacci number #17
  unoptimized: predicted = 106798.208897 ns, actual = 104815.000000 ns
  optimized: predicted = 52595.151573 ns, actual = 52016.000000 ns

PART 3:
target fizzbuzz value is 65536
  iterative: num fizz: 21845, num buzz: 13107, execution time = 1191877.000000 ns
  recursive: num fizz: 21845, num buzz: 13107, execution time = 2023277.000000 ns
   assembly: num fizz: 21845, num buzz: 13107, execution time = 544058.000000 ns

PART 4:
time per function call: 30.872757 ns
  predicted = 217189.000000 ns
  actual = 200683.000000 ns

PART 5:
memcpy (built-in): execution time = 132266.000000 ns
memcpy (8 bits at a time): execution time = 4310706.000000 ns
memcpy (64 bits at a time): execution time = 585617.000000 ns
memcpy (256 bits at a time): execution time = 154298.000000 ns

Other Hints and Notes

Extra Credit

Sorry, there is no extra credit available for this assignment.