Homework 2: Measuring Software
This homework is due on Tuesday, February 25, at 11:59:59 PM
(Eastern standard time). You must use submit to
turn in your homework like so:
submit cs411_jtang hw2 hw2.c hw2_asm.S
The grader will use the supplied Makefile to compile your
work. In addition, each submitted file must have a file header
comment, as described on the coding
conventions page.
You can only complete this assignment within an ARMv8-A development environment, and thus you must have completed the first homework before attempting this assignment.
In this assignment, you will observe the type of assembly instructions a compiler generates. You will measure how how long different software instructions take to execute in your ARMv8-A environment. As an experiment, you will implement an algorithm and then predict how long it will execute.
Part 1: Simple Time Measurement
To begin, create a directory for your assignment and download the following files into that directory via wget:
- http://www.csee.umbc.edu/~jtang/cs411.s20/homework/hw2/hw2.c
- Skeleton code for this assignment.
- http://www.csee.umbc.edu/~jtang/cs411.s20/homework/hw2/hw2_asm.S
- Skeleton code for your assembly code.
- http://www.csee.umbc.edu/~jtang/cs411.s20/homework/hw2/Makefile
- Builds the code for this assignment, by simply running make. Also included is a clean target to remove all built objects. You do not need to modify this file, nor should you submit it with your work.
Now run make to compile everything. You will get some warnings about unused symbols; by the end of this assignment you will have used all of them. You should now have the executable hw2. Execute it in your terminal. It should display something like this:
$ ./hw2 do nothing: execution time = 24 ns Need at least one argument to continueSo far, this program calculates how long it takes to execute a function. Specifically, it measures the time to execute this function:
static int __attribute__((optimize("O0"))) do_nothing(int val) { return val; }In this sample run, it took about 24 nanoseconds to call the function. This includes the time to set up the stack, execute the function (which consists of exactly 1 line), and then restore the stack. Use the utility aarch64-linux-gnu-objdump to disassemble the
do_nothing()
function:
<do_nothing>: sub sp, sp, #0x10 str w0, [sp, #12] ldr w0, [sp, #12] add sp, sp, #0x10 retThe compiler uses a lot of instruction to effectively do nothing. Notice the part
__attribute__((optimize("O0")))
? This is
a function
attribute. It tells the compiler, gcc, to do something
special about that function. In this case, it explicitly disables
optimizations ("O0"
) when generating the
assembly for do_nothing()
.
As an experiment, change the optimization level from 0
to 2
. Recompile, and re-run the program several
times. The execution time probably will not change much, because
precise timing is not possible under an emulated
environment with
the version of QEMU for this class. However, the generated
assembly instructions will differ a bit.
Examine the disassembly for do_nothing()
. In a
comment box at the top of hw2.c, describe the
differences between the unoptimized and optimized versions. More
specifically, describe why some unoptimized assembly instruction may
be safely eliminated, without changing the overall behavior of the C
function.
Part 2: Simple Timing Predictor
Now that you have observed the difference between unoptimized and
optimized code, disassemble the
functions fibonacci_unoptimized()
and fibonacci_optimized()
. These
function calculate
the nth Fibonacci number recursively. The functions are
identical, but the former function is compiled
with "O0"
while the latter
has "O2"
. Run the program with steadily
increasing target values:
$ ./hw2 10 do nothing: execution time = 25 ns fibonacci (unoptimized): result = 55, execution time = 3094 ns fibonacci (optimized): result = 55, execution time = 1855 ns Need at least four arguments to continue $ ./hw2 11 do nothing: execution time = 30 ns fibonacci (unoptimized): result = 1, execution time = 6123 ns fibonacci (optimized): result = 1, execution time = 3486 ns Need at least four arguments to continue $ ./hw2 12 do nothing: execution time = 57 ns fibonacci (unoptimized): result = 2, execution time = 8689 ns fibonacci (optimized): result = 2, execution time = 4491 ns Need at least four arguments to continueObserve that the tiny fractions of CPU time saved by enabling compilation optimization accumulate, especially with a CPU intensive algorithm like calculating Fibonacci values.
The computational complexity of this algorithm is about O(1.618n). What that means is if it takes X seconds to find Fibonacci number n, then it will take about 1.618X seconds to find Fibonacci number n+1.
Modify the main()
function. Predict how long it will
take to find the next 5 Fibonacci numbers
(i.e., target+1
, target+2
, etc.) using
both the unoptimized and optimized algorithms. Then measure the
exact times. Display the predicted and actual times, for both
unoptimized and optimized functions.
Part 3: Another Timing Predictor
The next step is to implement another trivial algorithm. Take a look
at the comments to tribonacci_optimized()
. Implement
that function. The complexity of this algorithm is
about O(1.839n).
In main()
, measure the time to find the fifth
Tribonacci value. Then predict the actual time to find
the 5+target
value, given the earlier
measurement. Measure the actual execution time, then display both
values.
Part 4: Estimating Instruction Counts
This last part will estimate how many instructions are executed,
based upon execution time. Take a look at the comments
to asm_equation4()
. Within hw2_asm.S,
implement the function asm_equation4()
. This assembly
routine is to be exactly 20 operations: 4 ldr
,
8 add
, 7 sub
, and then
1 ret
. Using the
function measure_function4()
, measure how long this
function takes to execute, passing it the
array part4_args
and
integer part4_len
. Assume, for now, that every
instruction takes the same amount of time to execute. Divide the
actual execution time of asm_equation4()
by 20; store
the quotient in a double
for extra precision.
Finally implement num_divisible_by_7()
in hw2.c. This function analyzes each of the elements
within the given array. For each element, determine if that value is
divisible by seven. This function returns the number of elements
that are evenly divisible by seven. Then hand-write a similar
function asm_num_divisible_by_7()
in hw2_asm.S. SPECIAL RESTRICTION:
For asm_num_divisible_by_7()
, this function may not use
multiplication, division, or modulus instructions; it must use a
loop to determine if a value is divisible by seven. Measure
execution times for both function given the same input
array part4_args
.
Based upon the actual execution times
of asm_equation4()
, num_divisible_by_7()
,
and asm_num_divisible_by_7()
, estimate how many
instructions were executed by num_divisible_by_7()
and asm_num_divisible_by_7()
. Display those estimates.
Don't be concerned if the numbers seem odd. Recall that precise measurements within an emulator is very difficult. As a hint, these ARMv8-a instructions may be useful for your assembly code:
-
add
: Add a value to a register. -
sub
: Subtract a value from a register. -
mov
: Move a value into a register. -
cmp
: Compare a register to a value. -
ldr
: Load from a memory into a register, at address calculated from a base register and an immediate offset. Be aware that if the destination address forldr
is a 64-bit register, the source location has to be 64-bit aligned. Otherwise, use a 32-bit register to store the value. -
b
: Unconditionally branch to a PC-relative immediate address (i.e., a label). -
b.XX
: Conditionally branch to a label if the previous comparison is true.
Disassemble your implementation
of num_divisble_by_7()
. Compare the compiler generated
instructions versus your hand-written assembly routine. Add a
comment block at the top of hw2.c describing your findings.
Part 5: Required Documentation
Add a comment block to the top of hw2.c file that answers these questions:
-
From Part 1, how did enabling optimizations change the assembly
instructions for
do_nothing()
? -
From Part 4, the C function
num_divisible_by_7()
probably ran significantly faster thanasm_num_divisible_by_7()
, even though both functions make the same calculations. Why are the execution times so different? How couldasm_num_divisible_by_7()
be better optimized?
Sample Output
Here is a sample output from running the program. Your results will obviously differ, based upon your computer setup. Although the sample predicted and actual execution times differ, they should be within a few percentages. Furthermore, your assembly implementations may require more or fewer instructions than this sample implementation.
$ ./hw2 21 1 77 0 do_nothing: execution time = 94 ns fibonacci (unoptimized): result = 10946, execution time = 607180 ns fibonacci (optimized): result = 10946, execution time = 325416 ns measuring Fibonacci number #22 unoptimized: predicted = 982417 ns, actual = 971825 ns optimized: predicted = 526523 ns, actual = 584490 ns measuring Fibonacci number #23 unoptimized: predicted = 1589550 ns, actual = 1588223 ns optimized: predicted = 851914 ns, actual = 872076 ns measuring Fibonacci number #24 unoptimized: predicted = 2571891 ns, actual = 2606122 ns optimized: predicted = 1378396 ns, actual = 1432893 ns measuring Fibonacci number #25 unoptimized: predicted = 4161319 ns, actual = 4242304 ns optimized: predicted = 2230244 ns, actual = 2277234 ns measuring Fibonacci number #26 unoptimized: predicted = 6733014 ns, actual = 6732937 ns optimized: predicted = 3608534 ns, actual = 3705530 ns tribonacci #5: result = 11, execution time = 173 ns measuring Tribonacci number #26 optimized: predicted 62267641 ns, actual = 60562501 ns asm_equation4: result = 105, execution time = 25 ns each instruction takes about 1.25 ns to execute num_divisible_by_7: result = 3, execution time = 43 ns asm_num_divisible_by_7: result = 3, execution time = 91 ns estimated number of instructions for num_divisible_by_7 = 34 estimated number of instructions for asm_num_divisible_by_7 = 72
Other Hints and Notes
- Ask plenty of questions on the Blackboard discussion board.
- At the top of your submission, list any help you received as well as web pages you consulted. Please do not use any URL shorteners, such as goo.gl or TinyURL. Also, do not cite shared data services, such as Pastebin, Dropbox, or Google Drive.
- The integer zero is divisble by all other integers, except for zero itself. This is the definition of divisibility..
Extra Credit
Sorry, there is no extra credit available for this assignment.