Homework 2: Software Performance
This homework is due on Monday, February 27, at 11:59:59 PM
(Eastern standard time). You must use submit to
turn in your homework like so:
submit cs411_jtang hw2 hw2.c hw2_asm.S
The grader will use the supplied Makefile to compile your
work. In addition, each submitted file must have a file header
comment, as described on the coding
conventions page.
You can only complete this assignment within an ARMv8-A development environment, and thus you must have completed the first homework before attempting this assignment.
In this assignment, you will observe the type of assembly instructions a compiler generates. You will measure how long different software algorithms take to execute in your ARMv8-A environment. You will then implement different optimization techniques to decrease execution time.
Part 1: Simple Time Measurement
To begin, create a directory for your assignment and download the following files into that directory via wget:
- https://www.csee.umbc.edu/~jtang/cs411.s24/homework/hw2/hw2.c
- Skeleton code for this assignment.
- https://www.csee.umbc.edu/~jtang/cs411.s24/homework/hw2/hw2_asm.S
- Skeleton code for your assembly code.
- https://www.csee.umbc.edu/~jtang/cs411.s24/homework/hw2/Makefile
- Builds the code for this assignment, by simply running make. Also included is a clean target to remove all built objects. You do not need to modify this file, nor should you submit it with your work.
Now run make to compile everything to get the executable hw2. Run the program in your terminal. It should display something like this:
$ ./hw2 do nothing: execution time = 8.000000 ns Need at least one argument to continueSo far, this program calculates how long it takes to execute a function. Specifically, it measures the time to execute this unoptimized function:
static int do_nothing(int val) { return val; }This function is also marked with a function attribute. Function attributes are used to tell the compiler to do something special when compiling that particular function. In this case, the attribute tells the compiler to disable optimizations (if using gcc then compile as if -O0 flag; if using clang then disable all optimizations via optnone).
In this sample run, it took 8 nanoseconds to call the function. This
includes the time to set up the stack, execute the function (which
consists of exactly 1 line), and then restore the stack. Use the
utility aarch64-linux-gnu-objdump (if running within a
virtual machine) or objdump (for native ARMv8-A) to
disassemble the do_nothing()
function:
<do_nothing>: sub sp, sp, #0x10 str w0, [sp, #12] ldr w0, [sp, #12] add sp, sp, #0x10 retThe compiler uses a lot of instruction to effectively do nothing. As an experiment, enable optimizations for this function, by replacing
UNOPTIMIZED
with OPTIMIZED
. Recompile, and re-run the program
several times. The execution time probably will not change much,
because precise timing is not possible when running so few
instructions. However, the generated assembly instructions will
differ a bit.
Examine the disassembly for do_nothing()
. In a
comment box at the top of hw2.c, describe the
differences between the unoptimized and optimized versions. More
specifically, describe why some unoptimized assembly instruction may
be safely eliminated, without changing the overall behavior of the C
function. (Unfortunately, clang users are out of luck here,
as that clang will generate the same instructions. For this
part of the homework, you should still describe which lines can be
safely removed.)
Part 2: Simple Timing Predictor
Now that you have observed the difference between unoptimized and
optimized code, disassemble the
functions fibonacci_unoptimized()
and fibonacci_optimized()
. These
function calculate
the nth Fibonacci number recursively. Both functions'
C code are identical, but the former function is compiled without
optimizations while the latter is optimized. Run the program with
steadily increasing target values:
$ ./hw2 10 <snip> PART 2: fibonacci (unoptimized): result = 55, execution time = 3322.000000 ns fibonacci (optimized): result = 55, execution time = 1548.000000 ns $ ./hw2 11 <snip> PART 2: fibonacci (unoptimized): result = 89, execution time = 5352.000000 ns fibonacci (optimized): result = 89, execution time = 2651.000000 ns $ ./hw2 12 <snip> PART 2: fibonacci (unoptimized): result = 144, execution time = 8668.000000 ns fibonacci (optimized): result = 144, execution time = 4333.000000 nsObserve that the tiny fractions of CPU time saved by enabling optimization accumulate, especially with a CPU intensive algorithm like calculating Fibonacci values.
The computational complexity of this algorithm is about O(1.618n). What that means is if it takes X seconds to find Fibonacci number n, then it will take about 1.618X seconds to find Fibonacci number n+1.
Modify the main()
function. Predict how long it will
take to find the next 5 Fibonacci numbers
(i.e., target+1
, target+2
, etc.) using
both the unoptimized and optimized algorithms. Then measure the
exact times. Display the predicted and actual times, for both
unoptimized and optimized functions.
Part 3: Algorithm Choices
Next you will implement the same algorithm in multiple ways,
then measure their performances. Look at the comments
to is_divisible_by_iterative()
. This function takes unsigned two
values, a dividend and a divisor. The function
determines if the dividend is evenly divisible by the divisor. If
this code were to be implemented in C, check if dividend %
divisor == 0
. Implement
that function only using a loop and without
any multiplication, division, or modulo divison operators. Then
implement is_divisible_by_recursive()
, and
you must write this as a recursive
procedure. Compile your code thus far.
Disassemble your hw2.o to see how the compiler generated
the assembly instructions
for is_divisible_by_iterative()
and is_divisible_by_recursive()
. Your next job is
to hand-write the
function is_divisble_by_assembly()
within hw2_asm.S. Like the C implementations, this
code may not use multiplication, division, nor modulo
division operators. You can write the code using either
iteration or recursion. Afterwards, recompile and compare execution
times for your different algroithms. You should find that the pure
recursive algorithm is much slower than an iterative call.
Compile your C and assembly implementations. Then test them by
passing two more command line arguments; main()
will
pass those arguments into your functions. Test it like so:
$ ./hw2 12 60 7 <snip> PART 3: 60 is not divisible divisible by 7 (iterative) execution time = 98.000000 ns 60 is not divisible divisible by 7 (recursive) execution time = 222.000000 ns 60 is not divisible divisible by 7 (assembly) execution time = 82.000000 ns
ARMv8-A does not have a built-in assembly instruction to perform modulo division. You will need to come up with your own implementation that does not involve multiplication nor division.
Part 4: Preducting Recursion Costs
In this part, you will estimate the execution time
of is_divisble_by_recursive()
given a new target
dividend. Observe how the main()
function previously
saved into the variable time_div_recur
the execution
time of is_divisible_by_recursive()
for a specific pairing
of dividend
and divisor
. Execution time
should grow linearly as dividend
changes.
Add code to main()
. Treat the next three command-line
arguments as new dividends to test. Using the same divisor from Part
2, predict the execution time for each of these new dividends. For
example, if the new dividend is twice as large as the original
divider, the function's execution time should double. Display your
estimates and then measure the actual execution time. Here is an
example output for this part of the assignment:
$ ./hw2 12 60 7 49 8897 1178 <snip> PART 4: predicted time for 49: 181.300000 ns actual = 206.000000 ns predicted time for 8897: 32918.900000 ns actual = 23695.000000 ns predicted time for 1178: 4358.600000 ns actual = 3187.000000 ns
For this part of the assignment, you may use multipliction and division operations.
Part 5: Optimizing Memory Loops
For the final part of this assignment, you will implement the iconic
function memcmp()
. This function takes two buffers and
a length. It determines
if both buffers have the same values. A simplistic
version, memccmp_8bit()
is given to you. This version
copies compares 8 bits (1 byte) at a time. Look at its disassembled
version to see how inefficient it is. As that ARMv8-A has 64-bit
registers, it is easy to make it run faster. Implement the C
functions memcmp_32bit()
and memcmp_64bit()
. These versions compare 32 or 64
bits at a time, respectively. Although the code is similar, they
should run faster than the 8-bit version. Finally implement in
assembly memcmp_256bit_assembly
. This code compares 256
bits (32 bytes) at a time. Here is an example output for this part
of the assignment:
$ ./hw2 12 60 7 49 8897 1178 65536 <snip> PART 5: memcmp (built-in): execution time = 84471.000000 ns memcmp (8 bits at a time): execution time = 582433.000000 ns memcmp (32 bits at a time): execution time = 130328.000000 ns memcmp (64 bits at a time): execution time = 67683.000000 ns memcmp (256 bits at a time): execution time = 26091.000000 nsObserve that the built-in
memcmp()
runs extremely fast
as compared to the 8 and 32 bit versions. That is because it uses
many interesting
assembly tricks.
Sample Output
Here is a sample output from running the program within a Linux virtual machine. Your results will obviously differ, based upon your computer setup. Although the sample predicted and actual execution times differ, they should be within a few percentages.
$ ./hw2 12 60 7 49 8897 1178 65536 PART 1: do_nothing: execution time = 21.000000 ns PART 2: fibonacci (unoptimized): result = 144, execution time = 8576.000000 ns fibonacci (optimized): result = 144, execution time = 4309.000000 ns measuring Fibonacci number #13 unoptimized: predicted = 13875.968000 ns, actual = 13878.000000 ns optimized: predicted = 6971.962000 ns, actual = 6750.000000 ns measuring Fibonacci number #14 unoptimized: predicted = 22451.316224 ns, actual = 22416.000000 ns optimized: predicted = 11280.634516 ns, actual = 10949.000000 ns measuring Fibonacci number #15 unoptimized: predicted = 36326.229650 ns, actual = 40530.000000 ns optimized: predicted = 18252.066647 ns, actual = 17712.000000 ns measuring Fibonacci number #16 unoptimized: predicted = 58775.839574 ns, actual = 65640.000000 ns optimized: predicted = 29531.843835 ns, actual = 32258.000000 ns measuring Fibonacci number #17 unoptimized: predicted = 95099.308431 ns, actual = 107436.000000 ns optimized: predicted = 47782.523324 ns, actual = 52439.000000 ns PART 3: 60 is not divisible divisible by 7 (iterative) execution time = 98.000000 ns 60 is not divisible divisible by 7 (recursive) execution time = 222.000000 ns 60 is not divisible divisible by 7 (assembly) execution time = 82.000000 ns PART 4: predicted time for 49: 181.300000 ns actual = 206.000000 ns predicted time for 8897: 32918.900000 ns actual = 23695.000000 ns predicted time for 1178: 4358.600000 ns actual = 3187.000000 ns PART 5: memcmp (built-in): execution time = 84471.000000 ns memcmp (8 bits at a time): execution time = 582433.000000 ns memcmp (32 bits at a time): execution time = 130328.000000 ns memcmp (64 bits at a time): execution time = 67683.000000 ns memcmp (256 bits at a time): execution time = 26091.000000 ns
Other Hints and Notes
- Ask plenty of questions on the Blackboard discussion board.
- At the top of your submission, list any help you received as well as web pages you consulted. Please do not use any URL shorteners, such as goo.gl or TinyURL. Also, do not cite shared data services, such as Pastebin, Dropbox, or Google Drive.
- Don't forget to add a comment block at the top of hw2.c that answers the questions from Part 1.
-
As a hint, these ARMv8-A instructions may be useful for your
assembly code:
add
- Add a value to a register.
sub
- Subtract a value from a register.
mov
- Move a value into a register.
cmp
- Compare a register to a value.
ldr
-
Load from a memory into a register, at address calculated
from a base register and an immediate offset. Be aware that
if the destination address for
ldr
is a 64-bit register, the source location has to be 64-bit aligned. Otherwise, use a 32-bit register to store the value. b
- Unconditionally branch to a PC-relative immediate address (i.e., a label).
b.XX
- Conditionally branch to a label if the previous comparison is true.
-
In ARMv8-A, the stack must always be aligned on a 16-byte
boundary. Thus, the following instruction is illegal:
sub sp, sp, #8
. This is true even if you have less than 16 bytes to push to the stack; the unused stack space is simply wasted.
Extra Credit
Sorry, there is no extra credit available for this assignment.