Introduction:
Hello, my name is Jon Squire and I have been using computers
to solve numerical problems since 1959.
Overview:
Read the syllabus.
This lecture will provide the definitions and the intuition for
absolute error
relative error
round off error
truncation error
Then it shows how to call intrinsic functions and elementary functions.
Some terms will be used without comment in the rest of the course.
Learn them now. You should run the sample code to increase your
understanding and belief.
"Absolute error" A positive number, the difference between the
exact (analytic) answer and the computed, approximate, answer.
"Relative error" A positive number, the Absolute Error divided
by the exact answer. (Not defined if the exact answer is zero.)
Given the exact answer is 1000 and the computed answer is 1001:
The absolute error is 1
The relative error is 0.001
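A minimal check of the two definitions, sketched in C using the
numbers from the example above (any of the course languages works
the same way):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
      double exact    = 1000.0;
      double computed = 1001.0;
      double abs_err  = fabs(computed - exact);   /* 1.0   */
      double rel_err  = abs_err / fabs(exact);    /* 0.001 */
      printf("absolute error = %g\n", abs_err);
      printf("relative error = %g\n", rel_err);
      return 0;
    }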
For a set of computed numbers there are three common measures of absolute error:
"Maximum Error" the largest error in the set.
"Average Error" the sum of the absolute errors divided by the number
in the set.
"RMS Error" the root mean square of the absolute errors.
sqrt(sum_over_set(absolute_error^2)/number_in_set)
Given that the exact answer is 1.0 for each of 100 values, and we
computed 99 answers of 1.0 and one answer of 101.0:
The maximum error is 100.0   (101.0 - 1.0)
The average error is 1.0     (100.0/100)
The RMS error is     10.0    (sqrt(100.0*100.0/100))
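The three measures can be checked with a short C sketch of the worked
example above (99 answers of 1.0, one answer of 101.0):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
      double computed[100];
      double exact = 1.0, err, max_err = 0.0, sum_err = 0.0, sum_sq = 0.0;
      int i, n = 100;
      for (i = 0; i < n; i++) computed[i] = 1.0;
      computed[n-1] = 101.0;                     /* the one bad answer */
      for (i = 0; i < n; i++) {
        err = fabs(computed[i] - exact);         /* absolute error */
        if (err > max_err) max_err = err;
        sum_err += err;
        sum_sq  += err * err;
      }
      printf("maximum error = %g\n", max_err);           /* 100 */
      printf("average error = %g\n", sum_err / n);       /* 1   */
      printf("RMS error     = %g\n", sqrt(sum_sq / n));  /* 10  */
      return 0;
    }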
In some problems the main concern is the maximum error, yet
the RMS error is often the best intuitive measure of error.
Almost all numerical computation arithmetic is performed using the
IEEE 754-1985 Standard for Binary Floating-Point Arithmetic.
The two formats that we deal with in practice are the 32 bit and
64 bit formats. You need to know how to get the format you desire
in the language you are programming. Complex numbers use two such
values, one for the real part and one for the imaginary part.
              C       Java    Fortran 95        older       Ada 95        MATLAB
                                                Fortran
              ------  ------  ----------------  ----------  ------------  ---------
      32 bit  float   float   real              real        float         N/A
      64 bit  double  double  double precision  real*8      long_float    'default'
    complex
      32 bit  'none'  'none'  complex           complex     complex       N/A
      64 bit  'none'  'none'  double complex    complex*16  long_complex  'default'

    'none' means not provided by the language (it may be available as a library)
    N/A    means not available; you get the default
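For example, in C the two formats are selected by the type name alone;
a minimal sketch you can compile to check the sizes on your machine:

    #include <stdio.h>

    int main(void) {
      float  s = 1.0f;   /* 32 bit single precision */
      double d = 1.0;    /* 64 bit double precision */
      printf("float  uses %d bytes\n", (int)sizeof s);
      printf("double uses %d bytes\n", (int)sizeof d);
      return 0;
    }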
IEEE Floating-Point numbers are stored as follows:
The single format 32 bit has
1 bit for sign, 8 bits for exponent, 23 bits for fraction
The double format 64 bit has
1 bit for sign, 11 bits for exponent, 52 bits for fraction
Some example numbers and their bit patterns:

Single format: sign is bit 31, exponent is bits 30..23, fraction is
bits 22..0. The leading "1." of the significand is not stored.

 decimal
 stored hexadecimal   sign  exponent  fraction                 significand
 1.0
 3F 80 00 00           0    01111111  00000000000000000000000  1.0 * 2^(127-127)
 0.5
 3F 00 00 00           0    01111110  00000000000000000000000  1.0 * 2^(126-127)
 0.75
 3F 40 00 00           0    01111110  10000000000000000000000  1.1 * 2^(126-127)
 0.9999995
 3F 7F FF FF           0    01111110  11111111111111111111111  1.1111... * 2^(126-127)
 0.1
 3D CC CC CD           0    01111011  10011001100110011001101  1.1001... * 2^(123-127)

Double format: sign is bit 63, exponent is bits 62..52, fraction is
bits 51..0.

 decimal
 stored hexadecimal        sign  exponent     fraction        significand
 1.0
 3F F0 00 00 00 00 00 00    0    01111111111  000 ... 000     1.0 * 2^(1023-1023)
 0.5
 3F E0 00 00 00 00 00 00    0    01111111110  000 ... 000     1.0 * 2^(1022-1023)
 0.75
 3F E8 00 00 00 00 00 00    0    01111111110  100 ... 000     1.1 * 2^(1022-1023)
 0.9999999999999995
 3F EF FF FF FF FF FF FF    0    01111111110  111 ... 111     1.11111... * 2^(1022-1023)
 0.1
 3F B9 99 99 99 99 99 9A    0    01111111011  10011 ... 1010  1.10011... * 2^(1019-1023)

The exponent is stored with a bias added, 127 for single and 1023 for
double; the bias is subtracted when the number is used, as shown in
the powers of two above.
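You can print these bit patterns yourself; a sketch in C, assuming the
usual case that unsigned int is 32 bits and unsigned long long is 64
bits:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
      float  f = 0.1f;
      double d = 0.1;
      unsigned int       uf;                     /* assumes 32 bits */
      unsigned long long ud;                     /* assumes 64 bits */
      memcpy(&uf, &f, sizeof uf);     /* copy raw bits, no conversion */
      memcpy(&ud, &d, sizeof ud);
      printf("0.1f stored as %08X\n",    uf);    /* 3DCCCCCD         */
      printf("0.1  stored as %016llX\n", ud);    /* 3FB999999999999A */
      return 0;
    }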
Note that an integer in the range 0 to 2^23 -1 may be represented exactly.
Any power of two in the range -126 to +127 times such an integer may also
be represented exactly. Numbers such as 0.1, 0.3, 1.0/5.0, 1.0/9.0 are
represented approximately. 0.75 is 3/4 which is exact.
Some languages are careful to represent approximated numbers
accurate to plus or minus the least significant bit.
Other languages may be less accurate.
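A short C demonstration of exact versus approximate values; printing
extra digits exposes the rounding:

    #include <stdio.h>

    int main(void) {
      printf("2^23-1 as float: %.1f\n", 8388607.0f);  /* exact       */
      printf("0.75 as float: %.12f\n",  0.75f);       /* 3/4, exact  */
      printf("0.1  as float: %.12f\n",  0.1f);        /* approximate */
      printf("0.3  as float: %.12f\n",  0.3f);        /* approximate */
      return 0;
    }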
Now for some experiments for you to run on the computer of your choice
in the language of your choice.
In single precision floating point print:
10^10 * sum( 1.0, 10^-7, -1.0)
10^10 * sum( 1.0, 0.5*10^-7, -1.0)
the expression 10^10 * ( 1.0 + 10^-7 - 1.0)
The order of addition is important.
Adding a small number to 1.0 may not change the value.
(A short sketch of this effect follows the list of demo files below.)
error_demo1.adb source code
error_demo1_ada.out output
error_demo1.c source code
error_demo1_c.out output
error_demo1.f90 source code
error_demo1_f90.out output
error_demo1.java source code
error_demo1_java.out output
error_demo1.m source code
error_demo1_m.out output
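The heart of these demos, sketched in C; this is not the linked
error_demo1.c, just the order-of-addition effect by itself:

    #include <stdio.h>

    int main(void) {
      float a = 1.0f, b = 1.0e-7f, c = -1.0f;
      /* same three numbers, two orders of addition */
      printf("10^10 * ((a+b)+c) = %g\n", 1.0e10f * ((a + b) + c));
      printf("10^10 * ((a+c)+b) = %g\n", 1.0e10f * ((a + c) + b));
      return 0;
    }

On typical IEEE hardware the two printed values differ, roughly 1192
versus 1000, because 10^-7 is near the least significant bit of 1.0f.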
We will cover, in the course, methods of reducing error in these areas:
  statistics                   sum x^2 - (sum x)^2/n  vs  sum (x-mean)^2
  polynomial definition, evaluation   Horner's rule  vs  powers x^N
  approximation, e.g. sin(x)   truncation error  vs  roundoff error
  derivatives
  indefinite integral, definite integral, area
  partial differential equations
and show sample code in many languages:
  Ada 95, C, Fortran 95, Java and MATLAB
Error accumulation when computing standard deviation.
Subtracting nearly equal large numbers loses significant digits.
  123456 - 123455 = 1.00000         only 1 significant digit
With only 6 digits representable, 123455.99 cannot be stored exactly:
  123456.00 - 123455.99 = 1.00000   yet should be 0.010000
Two ways of computing the standard deviation are shown, with the
errors indicated for various sets of data (a sketch of the two
formulas follows the file list below):
Cases:
stddev(1, 2,..., 100)
stddev(10, 20,..., 1,000) should just scale
stddev(100, 200,..., 10,000)
stddev(1,000, 2,000,..., 100,000)
stddev(10,000, 20,000,..., 1,000,000)
stddev(10,001, 20,002,..., 1,000,100) should be same as first
sigma_error.c
sigma_error_c.out
sigma_error.f90
sigma_error_f90.out
sigma_error.adb
sigma_error_ada.out
sigma_error.java
sigma_error_java.out
sigma_error.m
sigma_error_m.out
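The two formulas, sketched in C in single precision; this is not the
linked sigma_error.c, which runs all the cases above, just the two
computations on one of the harder cases:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
      float x, sum = 0.0f, sumsq = 0.0f, mean, diff, sumdev = 0.0f;
      int i, n = 100;
      for (i = 1; i <= n; i++) {
        x = 10000.0f * (float)i;    /* 10,000, 20,000, ..., 1,000,000 */
        sum   += x;
        sumsq += x * x;             /* grows large, loses digits */
      }
      mean = sum / (float)n;
      for (i = 1; i <= n; i++) {
        diff = 10000.0f * (float)i - mean;
        sumdev += diff * diff;      /* stays small, keeps digits */
      }
      /* one pass:  sqrt((sum x^2 - (sum x)^2 / n) / n) */
      printf("one pass stddev = %.1f\n", sqrt((sumsq - sum * (sum / n)) / n));
      /* two pass:  sqrt(sum (x - mean)^2 / n)          */
      printf("two pass stddev = %.1f\n", sqrt(sumdev / n));
      return 0;
    }

The one pass formula subtracts two large, nearly equal sums, so its
trailing digits are typically wrong, exactly as in the 123456 example.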
The elementary functions are sin, cos, tan, exp,
and inverse forms asin, acos, atan, log
and reciprocal forms cosecant, secant, cotangent,
and hyperbolic forms sinh, cosh, tanh,
and inverse hyperbolic forms asinh, acosh, atanh, ...
The intrinsic functions are built into the language itself, e.g. abs (in some languages).
Note that Fortran 95 and Java need no extra information to get the
real valued elementary functions. C needs #include <math.h>
while Ada 95 needs 'with' and 'use' Ada.Numerics.Elementary_Functions.
(A short C sketch follows the list below.)
Note that Ada 95 overloads the function names and provides
single precision as 'float' and double precision as 'long_float'.
ef_ada.adb Ada 95, float and long float
Note that Fortran 95 overloads the function names and provides
single precision as 'real' and double precision as 'double precision'.
ef_f95.f90 Fortran 95, real and double
Note that Java provides only double precision as 'double'.
ef_java.java Java, only double
Note that C only provides double precision as 'double'.
ef_c.c C, only double is available
Note that MATLAB has the most functions and all functions are
automatically double or double complex as needed.
ef_matlab.m MATLAB everything
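A sketch of the C case, which needs the include line and, on many
systems, linking with -lm:

    #include <stdio.h>
    #include <math.h>   /* declares the elementary functions */

    int main(void) {
      double x = 0.5;
      /* C's math library works in double precision */
      printf("sin(x)  = %.15f\n", sin(x));
      printf("asin()  = %.15f\n", asin(sin(x)));  /* recovers x */
      printf("exp(x)  = %.15f\n", exp(x));
      printf("log()   = %.15f\n", log(exp(x)));   /* recovers x */
      printf("sinh(x) = %.15f\n", sinh(x));
      return 0;
    }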
Just for fun: Power of 2 "tree"
A small program that prints epsilon, 'eps' and the largest and
smallest floating point numbers, run on one machine, shows:
float_sig.c running, may die
type float has about six significant digits
eps is the smallest positive number added to 1.0f that changes 1.0f
type float eps= 5.960464478E-08
type float small= 1.401298464E-45
without denormalized numbers this would be about 1.0E-38; the value above shows good IEEE floating point
type float large= 1.701411835E+38
type double has about fifteen significant digits
type double eps= 1.11022302462515654E-16
type double small=4.940656458E-324
without denormalized numbers this would be about 1.0E-308; the value above shows good IEEE floating point
type double large=8.988465674E+307
The program is float_sig.c
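A sketch of the usual eps search; float_sig.c above is the full
program. Note the values found can differ from the output quoted
above, depending on the hardware's use of extended precision
registers and on compiler options:

    #include <stdio.h>

    int main(void) {
      float  feps = 1.0f;
      double deps = 1.0;
      /* halve until adding half of eps no longer changes 1.0 */
      while ((float)(1.0f + feps / 2.0f) != 1.0f) feps /= 2.0f;
      while (1.0 + deps / 2.0 != 1.0)             deps /= 2.0;
      printf("type float  eps = %.9e\n",  feps);
      printf("type double eps = %.17e\n", deps);
      return 0;
    }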
Using 2^n ≈ 10^(n/3.32), since log2(10) ≈ 3.32: n bits has a largest
number 2^n - 1 and k digits has a largest number 10^k - 1, so n bits
correspond to about n/3.32 decimal digits. 24 bits would be about 7
digits. 53 bits would be about 16 digits, the more optimistic count
of significant digits for IEEE floating point.
Notes: I have chosen to keep most WEB pages plain text so that
they may be read by all browsers. Some referenced pages may
be .pdf, .jpg, .gif, .ps or other file types.
Math books use many Greek letters. You may want to refer
to various alphabets.