Homework 2: Software Performance
This homework is due on Monday, February 27, at 11:59:59 PM
(Eastern standard time). You must use submit to
turn in your homework like so:
submit cs411_jtang hw2 hw2.c hw2_asm.S
The grader will use the supplied Makefile to compile your
work. In addition, each submitted file must have a file header
comment, as described on the coding
conventions page.
You can only complete this assignment within an ARMv8-A development environment, and thus you must have completed the first homework before attempting this assignment.
In this assignment, you will observe the type of assembly instructions a compiler generates. You will measure how long different software algorithms take to execute in your ARMv8-A environment. You will then implement different optimization techniques to decrease execution time.
Part 1: Simple Time Measurement
To begin, create a directory for your assignment and download the following files into that directory via wget:
- https://www.csee.umbc.edu/~jtang/cs411.s23/homework/hw2/hw2.c - Skeleton code for this assignment.
- https://www.csee.umbc.edu/~jtang/cs411.s23/homework/hw2/hw2_asm.S - Skeleton code for your assembly code.
- https://www.csee.umbc.edu/~jtang/cs411.s23/homework/hw2/Makefile - Builds the code for this assignment by simply running make. Also included is a clean target to remove all built objects. You do not need to modify this file, nor should you submit it with your work.
Now run make to compile everything to get the executable hw2. Run the program in your terminal. It should display something like this:
$ ./hw2
do nothing: execution time = 8.000000 ns
Need at least one argument to continue

So far, this program calculates how long it takes to execute a function. Specifically, it measures the time to execute this unoptimized function:
static int do_nothing(int val) { return val; }

This function is also marked with a function attribute. Function attributes tell the compiler to do something special when compiling that particular function. In this case, the attribute tells the compiler to disable optimizations (if using gcc, compile this function as if the -O0 flag were given; if using clang, disable all optimizations via optnone).
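The exact macro definitions live in the provided hw2.c, but as a rough illustration (assuming the standard attribute names each compiler documents; your skeleton's definitions may differ), the UNOPTIMIZED and OPTIMIZED markers could be defined along these lines:

/* Hypothetical sketch only -- see the real macros in the provided hw2.c.
 * gcc honors the optimize("...") function attribute; clang honors optnone. */
#ifdef __clang__
#define UNOPTIMIZED __attribute__((optnone))
#define OPTIMIZED   /* clang has no per-function "optimize harder" attribute */
#else
#define UNOPTIMIZED __attribute__((optimize("O0")))
#define OPTIMIZED   __attribute__((optimize("O2")))
#endif

static UNOPTIMIZED int do_nothing(int val)
{
        return val;
}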
In this sample run, it took 8 nanoseconds to call the function. This
includes the time to set up the stack, execute the function (which
consists of exactly 1 line), and then restore the stack. Use the
utility aarch64-linux-gnu-objdump (if running within a
virtual machine) or objdump (for native ARMv8-A) to
disassemble the do_nothing()
function:
<do_nothing>:
    sub  sp, sp, #0x10
    str  w0, [sp, #12]
    ldr  w0, [sp, #12]
    add  sp, sp, #0x10
    ret

The compiler uses a lot of instructions to effectively do nothing. As an experiment, enable optimizations for this function by replacing UNOPTIMIZED with OPTIMIZED. Recompile, and re-run the program several times. The execution time probably will not change much, because precise timing is not possible when running so few instructions. However, the generated assembly instructions will differ a bit.
Examine the disassembly for do_nothing(). In a comment block at the top of hw2.c, describe the differences between the unoptimized and optimized versions. More specifically, describe why some unoptimized assembly instructions may be safely eliminated without changing the overall behavior of the C function. (Unfortunately, clang users are out of luck here, as clang will generate the same instructions. For this part of the homework, you should still describe which lines can be safely removed.)
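As background, short intervals like these are typically measured by reading a monotonic clock immediately before and after the call and taking the difference. The snippet below is only a sketch of that idea and assumes it lives alongside do_nothing() in hw2.c; the function name is made up, and the timing code already present in the skeleton may do things differently:

#include <stdio.h>
#include <time.h>

/* Illustrative only: time a single call to do_nothing() in nanoseconds.
 * hw2.c already contains its own timing code; this just shows the idea. */
static double time_one_call(void)
{
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        (void) do_nothing(42);
        clock_gettime(CLOCK_MONOTONIC, &end);

        return (end.tv_sec - start.tv_sec) * 1e9
                + (end.tv_nsec - start.tv_nsec);
}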
Part 2: Simple Timing Predictor
Now that you have observed the difference between unoptimized and optimized code, disassemble the functions fibonacci_unoptimized() and fibonacci_optimized(). These functions calculate the nth Fibonacci number recursively. Both functions' code is identical, but the former is compiled without optimizations while the latter is optimized. Run the program with steadily increasing target values:
$ ./hw2 10
<snip>
PART 2:
fibonacci (unoptimized): result = 55, execution time = 3322.000000 ns
fibonacci (optimized): result = 55, execution time = 1548.000000 ns
$ ./hw2 11
<snip>
PART 2:
fibonacci (unoptimized): result = 89, execution time = 5352.000000 ns
fibonacci (optimized): result = 89, execution time = 2651.000000 ns
$ ./hw2 12
<snip>
PART 2:
fibonacci (unoptimized): result = 144, execution time = 8668.000000 ns
fibonacci (optimized): result = 144, execution time = 4333.000000 ns

Observe that the tiny fractions of CPU time saved by enabling optimization accumulate, especially with a CPU-intensive algorithm like calculating Fibonacci values.
The computational complexity of this algorithm is about O(1.618^n). What that means is if it takes X seconds to find Fibonacci number n, then it will take about 1.618X seconds to find Fibonacci number n+1.
Modify the main() function. Predict how long it will take to find the next 5 Fibonacci numbers (i.e., target+1, target+2, etc.) using both the unoptimized and optimized algorithms. Then measure the exact times. Display the predicted and actual times for both the unoptimized and optimized functions.
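Turning the 1.618 growth factor into a prediction is just repeated multiplication. As a minimal sketch (the function name and parameters below are made up for illustration; how you organize this inside main() is up to you):

#include <math.h>
#include <stdio.h>

/* Illustrative only: given one measured time (in ns) for Fibonacci
 * number n, print predicted times for the next five numbers. */
static void predict_next_five(int n, double measured_ns)
{
        for (int k = 1; k <= 5; k++)
                printf("predicted time for Fibonacci number #%d: %f ns\n",
                       n + k, measured_ns * pow(1.618, k));
}

You can equally well avoid pow() by multiplying the running estimate by 1.618 inside the loop.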
Part 3: Algorithm Choices
Next you will implement the same algorithm three different ways, then measure their performance. Look at the comments for fizzbuzz_iterative(). This function calculates the number of fizzes and buzzes for a target number. Implement that function using only a loop and without any multiplication or division operators. Then implement fizzbuzz_recursive(), which you must write as a recursive procedure. Recompile your code, and then test it by passing it two arguments; main() will pass the second argument into your functions. Test it like so:
$ ./hw2 12 65536
<snip>
target fizzbuzz value is 65536
iterative: num fizz: 21845, num buzz: 13107, execution time = 1221251.000000 ns
recursive: num fizz: 21845, num buzz: 13107, execution time = 2016063.000000 ns
Disassemble your hw2.o to see how the compiler generated the assembly instructions for fizzbuzz_iterative(). Your next job is to hand-write the function fizzbuzz_asm() within hw2_asm.S. Like the C implementations, this code may not use multiplication or division operators. You can write the code using either iteration or recursion. Afterwards, recompile and compare execution times for your different algorithms. You should find that the pure recursive algorithm is much slower than the iterative one. Your hand-written assembly should be fastest.
ARMv8-A does not have a built-in assembly instruction to perform modulo division. You will need to come up with your own implementation that does not involve multiplication or division, as sketched below.
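One common way around this restriction is repeated subtraction. The sketch below only conveys the idea in C (the helper name is made up); you will still need to adapt it to your own C and assembly code:

/* Illustrative sketch: compute n modulo d without '*', '/', or '%'.
 * Assumes d > 0. The function name is hypothetical; adapt the idea freely. */
static unsigned int mod_by_subtraction(unsigned int n, unsigned int d)
{
        while (n >= d)
                n -= d;
        return n;
}

Repeated subtraction is not the fastest possible technique, but it obeys the no-multiplication, no-division rule. Another option is to keep counters that reset every 3 and every 5 iterations, which avoids the modulo operation entirely.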
Part 4: Estimating Recursion Costs
In this part, you will estimate the execution time of fizzbuzz_recursive() given a new target value. In main(), calculate how long a single recursion of fizzbuzz_recursive() takes to execute. Because the variable time_fizzbuzz_recursive recorded the actual execution time for target, the time per call is time_fizzbuzz_recursive divided by target. Add code to display the time per function call.
The array part4_args
holds three more input values for
your recursive function. Predict the sum of their execution times. Display
your estimate and then measure the actual execution time. Here is an
example output for this part of the assignment:
$ ./hw2 12 65536 1234 2345 3456
<snip>
PART 4:
time per function call: 31.741699 ns
predicted = 223302.000000 ns
actual = 227623.000000 ns
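The arithmetic involved is small enough to sketch directly. In the fragment below, time_fizzbuzz_recursive, target, and part4_args are the values described above, while the function name and parameter types are hypothetical stand-ins for what main() already has:

#include <stdio.h>

/* Illustrative sketch of the Part 4 arithmetic only. */
static double predict_part4_ns(double time_fizzbuzz_recursive, int target,
                               const int part4_args[3])
{
        double time_per_call = time_fizzbuzz_recursive / target;

        printf("time per function call: %f ns\n", time_per_call);
        return time_per_call *
               (part4_args[0] + part4_args[1] + part4_args[2]);
}

Your main() would then run fizzbuzz_recursive() on each part4_args value, sum the measured times, and print that actual total next to the prediction.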
Part 5: Optimizing Memory Copies
For the final part of this assignment, you will implement the iconic
function memcpy()
. A simplistic
version, memcpy_8bit()
is given to you. This version
copies 8 bits (1 byte) at a time. Look at its disassembled version
to see how inefficient it is. As that ARMv8-A has 64-bit registers,
it is easy to make it run faster. Examine
memcpy_64bit()
and its disassembly. Although the code
is similar, it runs faster than the 8 bit version. Your job is to
implement in assembly asm_memcpy_256bit
. This code is
to copy data 256 bits (32 bytes) at a time. Here is an example
output for this part of the assignment:
$ ./hw2 12 65536 1234 2345 3456 655360
<snip>
PART 5:
memcpy (built-in): execution time = 122120.000000 ns
memcpy (8 bits at a time): execution time = 3594330.000000 ns
memcpy (64 bits at a time): execution time = 481018.000000 ns
memcpy (256 bits at a time): execution time = 144225.000000 ns

Observe that the built-in memcpy() runs extremely fast compared to the 8- and 64-bit versions. That is because it uses many interesting assembly tricks.
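To make the idea concrete, here is a rough C sketch of a word-at-a-time copy. It only conveys the concept (the provided memcpy_64bit() in hw2.c may be written differently), and it ignores buffers whose length is not a multiple of 8:

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of copying 64 bits per loop iteration.
 * Assumes len is a multiple of 8 and both buffers are suitably aligned. */
static void sketch_memcpy_64bit(void *dst, const void *src, size_t len)
{
        uint64_t *d = dst;
        const uint64_t *s = src;

        for (size_t i = 0; i < len / 8; i++)
                d[i] = s[i];
}

Your asm_memcpy_256bit applies the same idea but moves 32 bytes per loop iteration, for instance with four 64-bit loads and stores per pass.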
Sample Output
Here is a sample output from running the program. Your results will obviously differ, based upon your computer setup. Although the sample predicted and actual execution times differ, they should be within a few percentage points.
$ ./hw2 12 65536 1234 2345 3456 655360
PART 1:
do_nothing: execution time = 9.000000 ns
PART 2:
fibonacci (unoptimized): result = 144, execution time = 9631.000000 ns
fibonacci (optimized): result = 144, execution time = 4743.000000 ns
measuring Fibonacci number #13
unoptimized: predicted = 15582.958000 ns, actual = 14136.000000 ns
optimized: predicted = 7674.174000 ns, actual = 6922.000000 ns
measuring Fibonacci number #14
unoptimized: predicted = 25213.226044 ns, actual = 24779.000000 ns
optimized: predicted = 12416.813532 ns, actual = 12250.000000 ns
measuring Fibonacci number #15
unoptimized: predicted = 40794.999739 ns, actual = 40420.000000 ns
optimized: predicted = 20090.404295 ns, actual = 19779.000000 ns
measuring Fibonacci number #16
unoptimized: predicted = 66006.309578 ns, actual = 66611.000000 ns
optimized: predicted = 32506.274149 ns, actual = 28514.000000 ns
measuring Fibonacci number #17
unoptimized: predicted = 106798.208897 ns, actual = 104815.000000 ns
optimized: predicted = 52595.151573 ns, actual = 52016.000000 ns
PART 3:
target fizzbuzz value is 65536
iterative: num fizz: 21845, num buzz: 13107, execution time = 1191877.000000 ns
recursive: num fizz: 21845, num buzz: 13107, execution time = 2023277.000000 ns
assembly: num fizz: 21845, num buzz: 13107, execution time = 544058.000000 ns
PART 4:
time per function call: 30.872757 ns
predicted = 217189.000000 ns
actual = 200683.000000 ns
PART 5:
memcpy (built-in): execution time = 132266.000000 ns
memcpy (8 bits at a time): execution time = 4310706.000000 ns
memcpy (64 bits at a time): execution time = 585617.000000 ns
memcpy (256 bits at a time): execution time = 154298.000000 ns
Other Hints and Notes
- Ask plenty of questions on the Blackboard discussion board.
- At the top of your submission, list any help you received as well as web pages you consulted. Please do not use any URL shorteners, such as goo.gl or TinyURL. Also, do not cite shared data services, such as Pastebin, Dropbox, or Google Drive.
- Don't forget to add a comment block at the top of hw2.c that answers the questions from Part 1.
- As a hint, these ARMv8-A instructions may be useful for your assembly code:
  - add - Add a value to a register.
  - sub - Subtract a value from a register.
  - mov - Move a value into a register.
  - cmp - Compare a register to a value.
  - ldr - Load from memory into a register, at an address calculated from a base register and an immediate offset. Be aware that if the destination of ldr is a 64-bit register, the source location has to be 64-bit aligned. Otherwise, use a 32-bit register to store the value.
  - str - Store a register value into memory. Similar to ldr, this stores either 32 or 64 bits, depending on whether the source operand uses 32-bit or 64-bit notation.
  - b - Unconditionally branch to a PC-relative immediate address (i.e., a label).
  - b.XX - Conditionally branch to a label if the previous comparison is true.
- In ARMv8-A, the stack must always be aligned on a 16-byte boundary. Thus, the following instruction is illegal: sub sp, sp, #8. This is true even if you have less than 16 bytes to push to the stack; the unused stack space is simply wasted.
Extra Credit
Sorry, there is no extra credit available for this assignment.