Homework 2: Software Performance

This homework is due on Monday, February 27, at 11:59:59 PM (Eastern standard time). You must use submit to turn in your homework like so:
submit cs411_jtang hw2 hw2.c hw2_asm.S The grader will use the supplied Makefile to compile your work. In addition, each submitted file must have a file header comment, as described on the coding conventions page.

You can only complete this assignment within an ARMv8-A development environment, and thus you must have completed the first homework before attempting this assignment.

In this assignment, you will observe the type of assembly instructions a compiler generates. You will measure how long different software algorithms take to execute in your ARMv8-A environment. You will then implement different optimization techniques to decrease execution time.

Part 1: Simple Time Measurement

To begin, create a directory for your assignment and download the following files into that directory via wget:

https://www.csee.umbc.edu/~jtang/cs411.s24/homework/hw2/hw2.c: Skeleton code for this assignment.
https://www.csee.umbc.edu/~jtang/cs411.s24/homework/hw2/hw2_asm.S: Skeleton code for your assembly code.
https://www.csee.umbc.edu/~jtang/cs411.s24/homework/hw2/Makefile: Builds the code for this assignment, by simply running make. Also included is a clean target to remove all built objects. You do not need to modify this file, nor should you submit it with your work.

Now run make to compile everything to get the executable hw2. Run the program in your terminal. It should display something like this:

$ ./hw2
do nothing: execution time = 8.000000 ns
Need at least one argument to continue

So far, this program calculates how long it takes to execute a function. Specifically, it measures the time to execute this unoptimized function:

static int do_nothing(int val)
{
	return val;
}

This function is also marked with a function attribute. Function attributes are used to tell the compiler to do something special when compiling that particular function. In this case, the attribute tells the compiler to disable optimizations (if using gcc then compile as if -O0 flag; if using clang then disable all optimizations via optnone).

In this sample run, it took 8 nanoseconds to call the function. This includes the time to set up the stack, execute the function (which consists of exactly 1 line), and then restore the stack. Use the utility aarch64-linux-gnu-objdump (if running within a virtual machine) or objdump (for native ARMv8-A) to disassemble the do_nothing() function:

<do_nothing>:
    sub     sp, sp, #0x10
    str     w0, [sp, #12]
    ldr     w0, [sp, #12]
    add     sp, sp, #0x10
    ret

The compiler uses a lot of instruction to effectively do nothing. As an experiment, enable optimizations for this function, by replacing UNOPTIMIZED with OPTIMIZED. Recompile, and re-run the program several times. The execution time probably will not change much, because precise timing is not possible when running so few instructions. However, the generated assembly instructions will differ a bit.

Examine the disassembly for do_nothing(). In a comment box at the top of hw2.c, describe the differences between the unoptimized and optimized versions. More specifically, describe why some unoptimized assembly instruction may be safely eliminated, without changing the overall behavior of the C function. (Unfortunately, clang users are out of luck here, as that clang will generate the same instructions. For this part of the homework, you should still describe which lines can be safely removed.)

Part 2: Simple Timing Predictor

Now that you have observed the difference between unoptimized and optimized code, disassemble the functions fibonacci_unoptimized() and fibonacci_optimized(). These function calculate the nth Fibonacci number recursively. Both functions' C code are identical, but the former function is compiled without optimizations while the latter is optimized. Run the program with steadily increasing target values:

$ ./hw2 10
<snip>
PART 2:
fibonacci (unoptimized): result = 55, execution time = 3322.000000 ns
fibonacci (optimized): result = 55, execution time = 1548.000000 ns

$ ./hw2 11
<snip>
PART 2:
fibonacci (unoptimized): result = 89, execution time = 5352.000000 ns
fibonacci (optimized): result = 89, execution time = 2651.000000 ns

$ ./hw2 12
<snip>
PART 2:
fibonacci (unoptimized): result = 144, execution time = 8668.000000 ns
fibonacci (optimized): result = 144, execution time = 4333.000000 ns

Observe that the tiny fractions of CPU time saved by enabling optimization accumulate, especially with a CPU intensive algorithm like calculating Fibonacci values.

The computational complexity of this algorithm is about O(1.618ⁿ). What that means is if it takes X seconds to find Fibonacci number n, then it will take about 1.618X seconds to find Fibonacci number n+1.

Modify the main() function. Predict how long it will take to find the next 5 Fibonacci numbers (i.e., target+1, target+2, etc.) using both the unoptimized and optimized algorithms. Then measure the exact times. Display the predicted and actual times, for both unoptimized and optimized functions.

Part 3: Algorithm Choices

Next you will implement the same algorithm in multiple ways, then measure their performances. Look at the comments to is_divisible_by_iterative(). This function takes unsigned two values, a dividend and a divisor. The function determines if the dividend is evenly divisible by the divisor. If this code were to be implemented in C, check if dividend % divisor == 0. Implement that function only using a loop and without any multiplication, division, or modulo divison operators. Then implement is_divisible_by_recursive(), and you must write this as a recursive procedure. Compile your code thus far.

Disassemble your hw2.o to see how the compiler generated the assembly instructions for is_divisible_by_iterative() and is_divisible_by_recursive(). Your next job is to hand-write the function is_divisble_by_assembly() within hw2_asm.S. Like the C implementations, this code may not use multiplication, division, nor modulo division operators. You can write the code using either iteration or recursion. Afterwards, recompile and compare execution times for your different algroithms. You should find that the pure recursive algorithm is much slower than an iterative call.

Compile your C and assembly implementations. Then test them by passing two more command line arguments; main() will pass those arguments into your functions. Test it like so:

$ ./hw2 12 60 7
<snip>
PART 3:
60 is not divisible divisible by 7 (iterative)
  execution time = 98.000000 ns
60 is not divisible divisible by 7 (recursive)
  execution time = 222.000000 ns
60 is not divisible divisible by 7 (assembly)
  execution time = 82.000000 ns

ARMv8-A does not have a built-in assembly instruction to perform modulo division. You will need to come up with your own implementation that does not involve multiplication nor division.

Part 4: Preducting Recursion Costs

In this part, you will estimate the execution time of is_divisble_by_recursive() given a new target dividend. Observe how the main() function previously saved into the variable time_div_recur the execution time of is_divisible_by_recursive() for a specific pairing of dividend and divisor. Execution time should grow linearly as dividend changes.

Add code to main(). Treat the next three command-line arguments as new dividends to test. Using the same divisor from Part 2, predict the execution time for each of these new dividends. For example, if the new dividend is twice as large as the original divider, the function's execution time should double. Display your estimates and then measure the actual execution time. Here is an example output for this part of the assignment:

$ ./hw2 12 60 7 49 8897 1178
<snip>
PART 4:
predicted time for 49: 181.300000 ns
  actual = 206.000000 ns
predicted time for 8897: 32918.900000 ns
  actual = 23695.000000 ns
predicted time for 1178: 4358.600000 ns
  actual = 3187.000000 ns

For this part of the assignment, you may use multipliction and division operations.

Part 5: Optimizing Memory Loops

For the final part of this assignment, you will implement the iconic function memcmp(). This function takes two buffers and a length. It determines if both buffers have the same values. A simplistic version, memccmp_8bit() is given to you. This version copies compares 8 bits (1 byte) at a time. Look at its disassembled version to see how inefficient it is. As that ARMv8-A has 64-bit registers, it is easy to make it run faster. Implement the C functions memcmp_32bit() and memcmp_64bit(). These versions compare 32 or 64 bits at a time, respectively. Although the code is similar, they should run faster than the 8-bit version. Finally implement in assembly memcmp_256bit_assembly. This code compares 256 bits (32 bytes) at a time. Here is an example output for this part of the assignment:

$ ./hw2 12 60 7 49 8897 1178 65536
<snip>
PART 5:
memcmp (built-in): execution time = 84471.000000 ns
memcmp (8 bits at a time): execution time = 582433.000000 ns
memcmp (32 bits at a time): execution time = 130328.000000 ns
memcmp (64 bits at a time): execution time = 67683.000000 ns
memcmp (256 bits at a time): execution time = 26091.000000 ns

Observe that the built-in memcmp() runs extremely fast as compared to the 8 and 32 bit versions. That is because it uses many interesting assembly tricks.

Sample Output

Here is a sample output from running the program within a Linux virtual machine. Your results will obviously differ, based upon your computer setup. Although the sample predicted and actual execution times differ, they should be within a few percentages.

$ ./hw2 12 60 7 49 8897 1178 65536
PART 1:
do_nothing: execution time = 21.000000 ns

PART 2:
fibonacci (unoptimized): result = 144, execution time = 8576.000000 ns
fibonacci (optimized): result = 144, execution time = 4309.000000 ns
measuring Fibonacci number #13
  unoptimized: predicted = 13875.968000 ns, actual = 13878.000000 ns
  optimized: predicted = 6971.962000 ns, actual = 6750.000000 ns
measuring Fibonacci number #14
  unoptimized: predicted = 22451.316224 ns, actual = 22416.000000 ns
  optimized: predicted = 11280.634516 ns, actual = 10949.000000 ns
measuring Fibonacci number #15
  unoptimized: predicted = 36326.229650 ns, actual = 40530.000000 ns
  optimized: predicted = 18252.066647 ns, actual = 17712.000000 ns
measuring Fibonacci number #16
  unoptimized: predicted = 58775.839574 ns, actual = 65640.000000 ns
  optimized: predicted = 29531.843835 ns, actual = 32258.000000 ns
measuring Fibonacci number #17
  unoptimized: predicted = 95099.308431 ns, actual = 107436.000000 ns
  optimized: predicted = 47782.523324 ns, actual = 52439.000000 ns

PART 3:
60 is not divisible divisible by 7 (iterative)
  execution time = 98.000000 ns
60 is not divisible divisible by 7 (recursive)
  execution time = 222.000000 ns
60 is not divisible divisible by 7 (assembly)
  execution time = 82.000000 ns

PART 4:
predicted time for 49: 181.300000 ns
  actual = 206.000000 ns
predicted time for 8897: 32918.900000 ns
  actual = 23695.000000 ns
predicted time for 1178: 4358.600000 ns
  actual = 3187.000000 ns

PART 5:
memcmp (built-in): execution time = 84471.000000 ns
memcmp (8 bits at a time): execution time = 582433.000000 ns
memcmp (32 bits at a time): execution time = 130328.000000 ns
memcmp (64 bits at a time): execution time = 67683.000000 ns
memcmp (256 bits at a time): execution time = 26091.000000 ns

Other Hints and Notes

Ask plenty of questions on the Blackboard discussion board.
At the top of your submission, list any help you received as well as web pages you consulted. Please do not use any URL shorteners, such as goo.gl or TinyURL. Also, do not cite shared data services, such as Pastebin, Dropbox, or Google Drive.
Don't forget to add a comment block at the top of hw2.c that answers the questions from Part 1.
As a hint, these ARMv8-A instructions may be useful for your assembly code:

add

Add a value to a register.

sub

Subtract a value from a register.

mov

Move a value into a register.

cmp

Compare a register to a value.

ldr

Load from a memory into a register, at address calculated from a base register and an immediate offset. Be aware that if the destination address for ldr is a 64-bit register, the source location has to be 64-bit aligned. Otherwise, use a 32-bit register to store the value.

b

Unconditionally branch to a PC-relative immediate address (i.e., a label).

b.XX

Conditionally branch to a label if the previous comparison is true.
In ARMv8-A, the stack must always be aligned on a 16-byte boundary. Thus, the following instruction is illegal: sub sp, sp, #8. This is true even if you have less than 16 bytes to push to the stack; the unused stack space is simply wasted.

Extra Credit

Sorry, there is no extra credit available for this assignment.

CMSC 411:

Computer Architecture