Almost all numerical computation arithmetic is performed using the
IEEE 754-1985 Standard for Binary Floating-Point Arithmetic.
The two formats that we deal with in practice are the 32-bit and
64-bit formats. You need to know how to get the format you want
in the language you are programming. Complex numbers use two
floating point values, a real part and an imaginary part.
                                            older
         C       Java    Fortran 95        Fortran     Ada 95        MATLAB
         ------  ------  ----------------  ----------  ------------  ---------
 32 bit  float   float   real              real        float         N/A
 64 bit  double  double  double precision  real*8      long_float    'default'

 complex
 32 bit  'none'  'none'  complex           complex     complex       N/A
 64 bit  'none'  'none'  double complex    complex*16  long_complex  'default'

'none' means not provided by the language (it may be available as a library).
N/A means not available; you get the default.
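In C, a quick sanity check of the two sizes (a minimal sketch, assuming
a typical platform where float is the 32-bit and double is the 64-bit
IEEE format):

  /* sizes.c -- check the floating point sizes (typical IEEE platforms) */
  #include <stdio.h>

  int main(void)
  {
      printf("float : %zu bits\n", 8 * sizeof(float));   /* usually 32 */
      printf("double: %zu bits\n", 8 * sizeof(double));  /* usually 64 */
      return 0;
  }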
IEEE floating point numbers are stored as follows:
  The single (32-bit) format has
    1 bit for sign, 8 bits for exponent, 23 bits for fraction.
  The double (64-bit) format has
    1 bit for sign, 11 bits for exponent, 52 bits for fraction.
There is an implied '1' bit (the 24th or 53rd bit) just to the left
of the stored fraction that is not stored. The fraction including
the non-stored bit is called the significand.
The exponent is stored as a biased value, not a signed value:
the 8-bit exponent has 127 added before storing, the 11-bit has 1023 added.
A few values of the exponent are "stolen" for
special values: +/- infinity, NaN (not a number), etc.
Floating point numbers are sign-magnitude: invert the sign bit to negate.
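For example, a minimal sketch (my own, not from the notes) that negates
a float by inverting bit 31:

  /* negate.c -- negate a float by inverting the sign bit */
  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  int main(void)
  {
      float x = 1.5f;
      uint32_t bits;
      memcpy(&bits, &x, sizeof bits);  /* view the raw IEEE bit pattern */
      bits ^= 0x80000000u;             /* invert bit 31, the sign bit   */
      memcpy(&x, &bits, sizeof x);
      printf("%g\n", x);               /* prints -1.5 */
      return 0;
  }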
Some example numbers and their bit patterns
(the leading "1" of the significand is not stored;
 the stored exponent is biased):

decimal value
  stored hexadecimal  sign  exponent  fraction                 significand
                      bit   (binary)  (binary)
                      31    30....23  22....................0
1.0
3F 80 00 00 0 01111111 00000000000000000000000 1.0 * 2^(127-127)
0.5
3F 00 00 00 0 01111110 00000000000000000000000 1.0 * 2^(126-127)
0.75
3F 40 00 00 0 01111110 10000000000000000000000 1.1 * 2^(126-127)
0.9999995
3F 7F FF FF 0 01111110 11111111111111111111111 1.1111* 2^(126-127)
0.1
3D CC CC CD 0 01111011 10011001100110011001101 1.1001* 2^(123-127)
The same values in the double (64-bit) format:
                      63    62......52   51 ..... 0
1.0
3F F0 00 00 00 00 00 00 0 01111111111 000 ... 000 1.0 * 2^(1023-1023)
0.5
3F E0 00 00 00 00 00 00 0 01111111110 000 ... 000 1.0 * 2^(1022-1023)
0.75
3F E8 00 00 00 00 00 00 0 01111111110 100 ... 000 1.1 * 2^(1022-1023)
0.9999999999999995
3F EF FF FF FF FF FF FF 0 01111111110 111 ... 1.11111* 2^(1022-1023)
0.1
3F B9 99 99 99 99 99 9A 0 01111111011 10011..1010 1.10011* 2^(1019-1023)
(sign bit, biased exponent, fraction; subtract the bias from the stored
 exponent to recover the true exponent, as shown in the significand column)
Note that any integer in the range 0 to 2^24 may be represented exactly
in the single format (24 significand bits), and 0 to 2^53 in the double
format. Any power of two in the range -126 to +127 times such an integer
may also be represented exactly, as long as the result stays in range.
Numbers such as 0.1, 0.3, 1.0/5.0, 1.0/9.0 can only be
represented approximately. 0.75 is 3/4, which is exact.
Some languages are careful to round approximated numbers
to within plus or minus the least significant bit.
Other languages may be less accurate.
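A small check (mine, not from the notes) that 0.75 is exact while 0.1
is not:

  /* exact.c -- exact vs. approximate decimal constants */
  #include <stdio.h>

  int main(void)
  {
      printf("%d\n", 0.75f == 0.75);  /* 1: 3/4 is exact in both formats   */
      printf("%d\n", 0.1f  == 0.1);   /* 0: 0.1 rounds differently in each */
      printf("%.17g\n", 0.1);         /* 0.10000000000000001, approximate  */
      return 0;
  }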
  /* flt.c  just to look at the .o file with hdump */
  void flt() /* look at IEEE floating point */
  {
    float  x1 = 1.0f;
    float  x2 = 0.5f;
    float  x3 = 0.75f;
    float  x4 = 0.99999f;
    float  x5 = 0.1f;
    double d1 = 1.0;
    double d2 = 0.5;
    double d3 = 0.75;
    double d4 = 0.99999999;
    double d5 = 0.1;
  }

The constants above are stored as (again, the "1" of the significand
is not stored and the exponent is biased):

                          31  30....23  22....................0
3F 80 00 00               0   01111111  00000000000000000000000  1.0    * 2^(127-127)
3F 00 00 00               0   01111110  00000000000000000000000  1.0    * 2^(126-127)
3F 40 00 00               0   01111110  10000000000000000000000  1.1    * 2^(126-127)
3F 7F FF 58               0   01111110  11111111111111101011000  1.1111 * 2^(126-127)
3D CC CC CD               0   01111011  10011001100110011001101  1.1001 * 2^(123-127)

                          63  62......52   51 ..... 0
3F F0 00 00 00 00 00 00   0   01111111111  000 ... 000   1.0     * 2^(1023-1023)
3F E0 00 00 00 00 00 00   0   01111111110  000 ... 000   1.0     * 2^(1022-1023)
3F E8 00 00 00 00 00 00   0   01111111110  100 ... 000   1.1     * 2^(1022-1023)
3F EF FF FF FA A1 9C 47   0   01111111110  111 ...       1.11111 * 2^(1022-1023)
3F B9 99 99 99 99 99 9A   0   01111111011  1001 .. 1010  1.10011 * 2^(1019-1023)

(sign bit, biased exponent, fraction; subtract the bias to recover
 the true exponent)
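To see these bit patterns at run time rather than with hdump, here is
a minimal sketch (showbits.c, my own file, not the course's) that uses
memcpy to reinterpret the stored bytes:

  /* showbits.c -- print IEEE bit patterns by reinterpreting the bytes */
  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  static void show_float(float f)
  {
      uint32_t b;
      memcpy(&b, &f, sizeof b);              /* copy bits, no conversion */
      printf("%-12g %08X  sign=%u exp=%u frac=0x%06X\n",
             f, (unsigned)b, (unsigned)(b >> 31),
             (unsigned)((b >> 23) & 0xFF), (unsigned)(b & 0x7FFFFF));
  }

  static void show_double(double d)
  {
      uint64_t b;
      memcpy(&b, &d, sizeof b);
      printf("%-12g %016llX  sign=%u exp=%u frac=0x%013llX\n",
             d, (unsigned long long)b, (unsigned)(b >> 63),
             (unsigned)((b >> 52) & 0x7FF),
             (unsigned long long)(b & 0xFFFFFFFFFFFFFULL));
  }

  int main(void)
  {
      show_float(1.0f);   show_float(0.5f);   show_float(0.75f);
      show_float(0.99999f);   show_float(0.1f);
      show_double(1.0);   show_double(0.5);   show_double(0.75);
      show_double(0.99999999);   show_double(0.1);
      return 0;
  }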
Now, all of the above is the memory (RAM) format.
Upon a load operation of either a float or a double into one of the
floating point registers, the value in the register is extended to a
precision greater than double (for example, the 80-bit extended format
on classic x86 floating point hardware). All floating point arithmetic
is performed at this greater precision. Upon a store operation, the
greater precision is reduced to the memory format, possibly with rounding.
From a programming viewpoint, always use double.
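One reason to always use double: the rounding error of float
accumulates quickly. A small demonstration (mine, not from the notes):

  /* sumdemo.c -- accumulate 0.1 one million times in float vs. double */
  #include <stdio.h>

  int main(void)
  {
      float  fs = 0.0f;
      double ds = 0.0;
      for (int i = 0; i < 1000000; i++) {
          fs += 0.1f;   /* each add rounds to 24 significand bits */
          ds += 0.1;    /* each add rounds to 53 significand bits */
      }
      printf("float  sum = %f\n", fs);  /* visibly far from 100000 */
      printf("double sum = %f\n", ds);  /* very close to 100000    */
      return 0;
  }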
Exponents must be the same for add and subtract!

A = 3.5 * 10^6        a = 11.1 * 2^6  =  1.11 * 2^7
B = 2.5 * 10^5        b = 10.1 * 2^5  =  1.01 * 2^6

A+B   3.50 * 10^6     a+b    11.10 * 2^6       1.110 * 2^7
    + 0.25 * 10^6          +  1.01 * 2^6     + 0.101 * 2^7
    _____________          _____________     _____________
      3.75 * 10^6          100.11 * 2^6      10.011 * 2^7
                     normalize (IEEE)      1.0011  * 2^8
                     normalize (fraction)  0.10011 * 2^9

A-B   3.50 * 10^6
    - 0.25 * 10^6
    _____________
      3.25 * 10^6

A*B   3.50 * 10^6
    *  2.5 * 10^5
    _____________
      8.75 * 10^11

A/B   3.5 * 10^6 / 2.5 * 10^5  =  1.4 * 10^1
The mathematical basis for floating point is simple algebra.
The common uses are in computer arithmetic and scientific notation.
Given a number x1 expressed as 10^e1 * f1,
then 10 is the base, e1 is the exponent and f1 is the fraction.
Example: x1 = 10^3 * .1234 means x1 = 123.4, or .1234*10^3,
or in computer notation 0.1234E3.
In computers the base is chosen to be 2, i.e. binary notation.
For x1 = 2^e1 * f1 where e1 = 3 and f1 = .1011 (binary),
x1 = 101.1 base 2 or, converting to decimal, x1 = 5.5 base 10.
Computers store the sign bit (1 = negative), the exponent and the
fraction in a floating point word that may be 32 or 64 bits.
The operations of add, subtract, multiply and divide are defined as:
Given x1 = 2^e1 * f1
x2 = 2^e2 * f2 and e2 <= e1
x1 + x2 = 2^e1 *(f1 + 2^-(e1-e2) * f2) f2 is shifted then added to f1
x1 - x2 = 2^e1 *(f1 - 2^-(e1-e2) * f2) f2 is shifted then subtracted from f1
x1 * x2 = 2^(e1+e2) * f1 * f2
x1 / x2 = 2^(e1-e2) * (f1 / f2)
An additional operation, normalization, is usually needed.
If the resulting "fraction" has digits to the left of the binary
point, then the fraction is shifted right and one is added to
the exponent for each bit shifted, until the result is a fraction.
We will use fraction normalization (0.5 <= fraction < 1.0),
not IEEE normalization (1.0 <= significand < 2.0):
if the resulting "fraction" has zeros immediately to the right of
the binary point, then the fraction is shifted left and one is
subtracted from the exponent for each bit shifted, until there
is a nonzero digit immediately to the right of the binary point.
Numeric examples using the equations:
(exponents are decimal integers, fractions are decimal)
(normalized numbers have 0.5 <= fraction < 1.0, that is, the fraction
 is strictly less than 1.0 and greater than or equal to 0.5)
x1 = 2^4 * 0.5 or x1 = 8.0
x2 = 2^2 * 0.5 or x2 = 2.0
x1 + x2 = 2^4 * (.5 + 2^-(4-2) * .5) = 2^4 * (.5 + .125) = 2^4 * .625
x1 - x2 = 2^4 * (.5 - 2^-(4-2) * .5) = 2^4 * (.5 - .125) = 2^4 * .375
not normalized, multiply fraction by 2, subtract 1 from exponent
= 2^3 * .75
x1 * x2 = 2^(4+2) * (.5*.5) = 2^6 * .25 not normalized
= 2^5 * .5 normalized
x1 / x2 = 2^(4-2) * (.5/.5) = 2^2 * 1.0 not normalized
= 2^3 * .5 normalized
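A minimal sketch in C (the struct and function names are mine, not part
of the notes) implementing the four equations plus fraction
normalization; it reproduces the results above:

  /* toyfp.c -- toy floating point with fraction normalization */
  #include <stdio.h>
  #include <math.h>

  struct fp { int e; double f; };                 /* value = 2^e * f */

  static struct fp norm(struct fp x)              /* fraction normalization */
  {
      if (x.f == 0.0) return x;
      while (fabs(x.f) >= 1.0) { x.f /= 2.0; x.e++; } /* digits left of point */
      while (fabs(x.f) <  0.5) { x.f *= 2.0; x.e--; } /* leading zeros        */
      return x;
  }

  static struct fp add(struct fp a, struct fp b)  /* requires e2 <= e1 */
  {
      if (b.e > a.e) { struct fp t = a; a = b; b = t; }
      struct fp r = { a.e, a.f + ldexp(b.f, b.e - a.e) }; /* shift f2, add */
      return norm(r);
  }

  static struct fp sub(struct fp a, struct fp b) { b.f = -b.f; return add(a, b); }
  static struct fp mul(struct fp a, struct fp b)
      { struct fp r = { a.e + b.e, a.f * b.f }; return norm(r); }
  static struct fp dvd(struct fp a, struct fp b)
      { struct fp r = { a.e - b.e, a.f / b.f }; return norm(r); }

  static void show(const char *s, struct fp x)
      { printf("%s = 2^%d * %g = %g\n", s, x.e, x.f, ldexp(x.f, x.e)); }

  int main(void)                                  /* the examples above */
  {
      struct fp x1 = { 4, 0.5 }, x2 = { 2, 0.5 }; /* 8.0 and 2.0 */
      show("x1+x2", add(x1, x2));                 /* 2^4 * 0.625 */
      show("x1-x2", sub(x1, x2));                 /* 2^3 * 0.75  */
      show("x1*x2", mul(x1, x2));                 /* 2^5 * 0.5   */
      show("x1/x2", dvd(x1, x2));                 /* 2^3 * 0.5   */
      return 0;
  }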
Numeric examples, people friendly:
(exponents are decimal integers, fractions are decimal)
(normalized numbers have 1.0 > fraction >= 0.5)
x1 = 0.5 * 2^4
x2 = 0.5 * 2^2
x1 + x2 = 0.500 * 2^4
+ 0.125 * 2^4 unnormalize to make exponents equal
-----------
0.625 * 2^4 result is normalized, done.
x1 - x2 = 0.500 * 2^4
- 0.125 * 2^4 unnormalize to make exponents equal
-----------
0.375 * 2^4 result is not normalized
0.750 * 2^3 double the fraction, subtract one from the exponent
x1 * x2 = 0.5 * 0.5 * 2^4 * 2^2 = 0.25 * 2^6  not normalized
        = 0.5 * 2^5                           normalized
x1 / x2 = (.5/.5) * 2^4/2^2 = 1.0 * 2^2  not normalized
        = 0.5 * 2^3                      normalized
          (halve the fraction, add one to the exponent)
IEEE 754 Floating Point Standard
The standard has a few minor problems. For example, the square roots
of all complex numbers lie in the right half of the complex plane,
so the real part of a square root should never be negative. Yet, as a
concession to early hardware, the standard defines sqrt(-0) to be -0
rather than +0. In several places the standard uses the word "should";
when a standard actually specifies something, the word "shall" is
typically used.
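A quick check of this corner case (assuming a C99 math library):

  /* negzero.c -- IEEE 754 defines sqrt(-0) as -0 */
  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      double z = sqrt(-0.0);
      printf("sqrt(-0.0) = %g, sign bit = %d\n", z, signbit(z) != 0);
      /* prints: sqrt(-0.0) = -0, sign bit = 1 */
      return 0;
  }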
Basic decisions and operations for floating point add and subtract:
  [figure: flowchart of the add/subtract decisions]
The decisions indicated above could be used to design the control
component shown in the data path diagram:
  [figure: floating point add/subtract data path]
A hint on normalization, using computer scientific notation:
1.0E-8 == 10.0E-9 == 0.01E-6 == 0.00000001 == 10ns == 0.01 microseconds
1.0E8 == 0.1E9 == 100.0E6 == 100,000,000 == 100MHz == 0.1 GHz
1.0/1.0GHz = 1ns clock period
Some graphics boards have large computing capacity, and some vendors
are releasing the specifications so that programmers can use that
capacity.
  nVidia example, 2007
  nVidia example, 2011
Programming 512 cores with CUDA or OpenCL is quite a challenge.
New languages are coming, but they are not yet well optimized.
Fortunately, CMSC 411 does not require VHDL for floating point,
just the ability to do floating point add, subtract, multiply
and divide by hand. (Examples above and in class on the board.)