Floating Point Numbers

Scientific Notation

Scientific notation allows us to represent very large and very small numbers using a compact notation:

Avogadro's Number = A = 6.023 x 10^23 = 602,300,000,000,000,000,000,000 = M x B^E

Planck's constant = 6.626068 x 10^-34 = 0.0000000000000000000000000000000006626068 = M x B^E

where:

M = Mantissa

B = Base

E = Exponent

Notice that the representation isn't unique. For example:

A = 60.23 x 10^22

When we require exactly one nonzero digit to the left of the decimal point, the result is called normalized scientific notation. For a nonzero number the normalized form is unique.
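The normalization rule above can be sketched in Java. This is a minimal illustration, assuming a nonzero decimal literal as input; the class name SciNotation and the method normalize are hypothetical names chosen for this example, and java.math.BigDecimal is used only to keep the decimal digits exact.

```java
import java.math.BigDecimal;

class SciNotation {
    // Rewrites a nonzero decimal literal such as "60.23E22" in normalized
    // scientific notation, e.g. "6.023 x 10^23". (Hypothetical helper.)
    static String normalize(String literal) {
        BigDecimal value = new BigDecimal(literal).stripTrailingZeros();
        // value = unscaledValue x 10^(-scale), and precision() counts the
        // digits of unscaledValue, so this is the power of ten of the
        // leading digit.
        int exponent = value.precision() - value.scale() - 1;
        // Move the decimal point so exactly one digit remains to its left.
        BigDecimal mantissa = value.movePointLeft(exponent);
        return mantissa.toPlainString() + " x 10^" + exponent;
    }

    public static void main(String[] args) {
        System.out.println(normalize("60.23E22"));      // 6.023 x 10^23
        System.out.println(normalize("6.626068E-34"));  // 6.626068 x 10^-34
    }
}
```

Both 60.23 x 10^22 and 6.023 x 10^23 normalize to the same string, which is the point of the normalized form.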

In general, any number can be written as a sum of digits multiplied by powers of 10, where negative exponents are allowed:

6.023 = 6 x 10^0 + 0 x 10^-1 + 2 x 10^-2 + 3 x 10^-3

Base 2 Scientific Notation

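Base 2 scientific notation follows the same pattern. Appending a 0 to a binary numeral (shifting left) multiplies it by 2, and dropping a trailing 0 (shifting right) divides it by 2. Equivalently, each time the number is halved the mantissa stays the same and the exponent drops by 1.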

Example

[42]_2 = 101010.0000 = 1.0101 x 2^5

[21]_2 = 1.0101 x 2^4

[10.5]_2 = 1.0101 x 2^3

[5.25]_2 = 1.0101 x 2^2

[2.625]_2 = 1.0101 x 2^1

[1.3125]_2 = 1.0101 x 2^0

[0.65625]_2 = 1.0101 x 2^-1

Example

32 = 1.00000 x 2^5

16 = 1.00000 x 2^4

8 = 1.00000 x 2^3

4 = 1.00000 x 2^2

2 = 1.00000 x 2^1

1 = 1.00000 x 2^0

.5 = 1.00000 x 2^-1

.25 = 1.00000 x 2^-2
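The two tables above can be reproduced with a short Java sketch. It assumes a positive, normal (not zero, infinite, or NaN) input; BinarySci and toBinaryScientific are hypothetical names for this illustration. Math.getExponent and Math.scalb are standard library methods that read the power of 2 and rescale by a power of 2 exactly.

```java
class BinarySci {
    // Formats a positive, normal double in base 2 scientific notation,
    // e.g. 42.0 -> "1.0101 x 2^5". (Hypothetical helper.)
    static String toBinaryScientific(double x) {
        int exponent = Math.getExponent(x);        // power of 2 of the leading bit
        double mantissa = Math.scalb(x, -exponent); // x / 2^exponent, in [1, 2)
        StringBuilder digits = new StringBuilder("1.");
        double fraction = mantissa - 1.0;
        if (fraction == 0.0) {
            digits.append('0');
        }
        // Peel off one binary digit at a time by doubling the fraction.
        while (fraction != 0.0) {
            fraction *= 2.0;
            if (fraction >= 1.0) {
                digits.append('1');
                fraction -= 1.0;
            } else {
                digits.append('0');
            }
        }
        return digits + " x 2^" + exponent;
    }

    public static void main(String[] args) {
        // Halving the value keeps the mantissa and lowers the exponent by 1.
        for (double x = 42.0; x >= 0.65625; x /= 2.0) {
            System.out.println(x + " = " + toBinaryScientific(x));
        }
    }
}
```

The loop always terminates because a double has a finite number of mantissa bits.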

IEEE Standards

The Java virtual machine has two floating point types: float and double.

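Java floats are represented using the 32-bit IEEE 754-1985 single-precision format, which splits the 32 bits into three fields, from most significant to least significant:

sign | exponent | mantissa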

Where:

Sign = 1 bit = 0 (positive) or 1 (negative)

Exponent = 8-bit biased integer = actual exponent + 127

Mantissa = 23-bit fraction following the implicit leading 1 (the 1 itself is not stored)

So the conversion formula is:

[F]_2 = (-1)^Sign x 1.Mantissa x 2^(Exponent - 127)

Java doubles follow the same pattern in 64 bits: 1 sign bit, an 11-bit exponent with a bias of 1023, and a 52-bit mantissa.
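As a rough check of the 64-bit layout, the fields of a double can be pulled apart with shifts and masks. This sketch assumes the standard IEEE 754 double layout of 1 sign bit, 11 exponent bits (bias 1023), and 52 mantissa bits; DoubleFields and its method names are hypothetical, while Double.doubleToLongBits is the standard library call.

```java
class DoubleFields {
    // Top bit: 0 for positive, 1 for negative.
    static int sign(double d) {
        return (int) (Double.doubleToLongBits(d) >>> 63);
    }

    // Next 11 bits: actual exponent + 1023.
    static int biasedExponent(double d) {
        return (int) ((Double.doubleToLongBits(d) >>> 52) & 0x7FF);
    }

    // Low 52 bits: the fraction following the implicit leading 1.
    static long mantissa(double d) {
        return Double.doubleToLongBits(d) & 0xFFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        double d = -42.0; // -1.0101 x 2^5 in base 2
        System.out.println("sign = " + sign(d));
        System.out.println("biased exponent = " + biasedExponent(d)); // 5 + 1023
        System.out.println("mantissa = " + Long.toBinaryString(mantissa(d)));
    }
}
```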

Examples

[.25]_2 = (-1)^0 x 1.00000000000000000000000 x 2^(125 - 127) = 0,01111101,00000000000000000000000

[-42.0]_2 = (-1)^1 x 1.01010000000000000000000 x 2^(132 - 127) = 1,10000100,01010000000000000000000
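These two encodings can be checked in Java. The sketch below is illustrative only: FloatFields, fields, and decode are hypothetical names, decode handles only normalized values (not zero, infinities, or NaN), and Float.floatToIntBits is the standard library call that exposes the raw 32 bits.

```java
class FloatFields {
    // Returns "sign,exponent,mantissa" as binary strings of 1, 8 and 23 bits.
    static String fields(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = bits >>> 31;
        int exponent = (bits >>> 23) & 0xFF;
        int mantissa = bits & 0x7FFFFF;
        return sign + "," + pad(exponent, 8) + "," + pad(mantissa, 23);
    }

    // Left-pads the binary form of value with zeros to the given width.
    static String pad(int value, int width) {
        String s = Integer.toBinaryString(value);
        return String.format("%" + width + "s", s).replace(' ', '0');
    }

    // Applies (-1)^sign x 1.mantissa x 2^(exponent - 127) to normalized fields.
    static float decode(int sign, int exponent, int mantissa) {
        double significand = 1.0 + mantissa / (double) (1 << 23);
        double magnitude = Math.scalb(significand, exponent - 127);
        return (float) (sign == 0 ? magnitude : -magnitude);
    }

    public static void main(String[] args) {
        System.out.println(fields(0.25f));
        System.out.println(fields(-42.0f));
        System.out.println(decode(0, 125, 0));        // biased exponent 125
        System.out.println(decode(1, 132, 5 << 19));  // mantissa bits 0101 then zeros
    }
}
```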

There are a number of special cases. For example:

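[0.0]_2 = 00000000000000000000000000000000

Float.NaN = exponent bits all 1, mantissa nonzero

Float.POSITIVE_INFINITY = 0,11111111,00000000000000000000000

Float.NEGATIVE_INFINITY = 1,11111111,00000000000000000000000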
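The special values can be inspected the same way as ordinary floats. This is a small sketch; SpecialValues and hex are hypothetical names. Note that Float.floatToIntBits collapses every NaN to one canonical bit pattern, so other NaN encodings exist that this call will not show.

```java
class SpecialValues {
    // Raw bits of a float as an 8-digit hexadecimal string.
    static String hex(float f) {
        return String.format("0x%08X", Float.floatToIntBits(f));
    }

    public static void main(String[] args) {
        System.out.println(hex(0.0f));                    // all bits zero
        System.out.println(hex(-0.0f));                   // only the sign bit set
        System.out.println(hex(Float.POSITIVE_INFINITY)); // exponent all 1, mantissa 0
        System.out.println(hex(Float.NEGATIVE_INFINITY));
        System.out.println(hex(Float.NaN));               // exponent all 1, mantissa nonzero

        // NaN is the only value that is not equal to itself.
        System.out.println(Float.NaN == Float.NaN);       // false
        // Dividing a nonzero float by zero yields an infinity, not an exception.
        System.out.println(1.0f / 0.0f);
    }
}
```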

 

Here's a nice conversion tool:

http://www.h-schmidt.net/FloatApplet/IEEE754.html

Round-Off Errors

[1/3]_2 = 1/4 + [1/3 - 1/4]_2 = 1/4 + [1/12]_2 = 1/4 + 1/16 + [1/12 - 1/16]_2 = 1/4 + 1/16 + [1/48]_2 = 1/4 + 1/16 + 1/64 + ...

= 0.010101010101...

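This base 2 expansion goes on forever, so it has to be rounded to fit the 23 mantissa bits. IEEE 754 rounds to the nearest representable value, which here rounds the last bit up:

= (-1)^0 x 1.01010101010101010101011 x 2^(125 - 127) = 0,01111101,01010101010101010101011 ≈ 0.33333334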

These round-off errors can accumulate in a lengthy calculation.
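A small Java experiment illustrates the accumulation. Like 1/3, the decimal fraction 0.1 has no finite base 2 expansion, so each addition below contributes a little rounding error. RoundOff, sumOfTenths, and bitsOfOneThird are hypothetical names for this sketch.

```java
class RoundOff {
    // Adds 0.1f to itself n times using float arithmetic.
    static float sumOfTenths(int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += 0.1f;
        }
        return sum;
    }

    // Raw bits of the float nearest to 1/3, in hexadecimal.
    static String bitsOfOneThird() {
        return Integer.toHexString(Float.floatToIntBits(1.0f / 3.0f));
    }

    public static void main(String[] args) {
        System.out.println(bitsOfOneThird());         // 3eaaaaab
        System.out.println(sumOfTenths(10));          // 1.0000001, not 1.0
        System.out.println(sumOfTenths(10) == 1.0f);  // false
    }
}
```

For this reason floats are usually compared against a tolerance rather than with ==.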

Arithmetic

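Multiplying and dividing floats is relatively straightforward: multiply (or divide) the mantissas, add (or subtract) the exponents, then renormalize. Adding and subtracting are harder because the operands must first be shifted to a common exponent, and the low-order bits of the smaller operand can be lost in the process.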
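The cost of aligning exponents during addition shows up in two observable effects, sketched below in Java: a small operand can vanish entirely when added to a large one, and the order of additions can change the result. FloatArithmetic and its method names are hypothetical; the specific constants are chosen only because a float has 24 significant bits.

```java
class FloatArithmetic {
    // True when adding small to big leaves big unchanged.
    static boolean absorbs(float big, float small) {
        return big + small == big;
    }

    static float leftToRight(float a, float b, float c) {
        return (a + b) + c;
    }

    static float rightToLeft(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        // 2^24 + 1 needs 25 significant bits, so the 1 is rounded away.
        System.out.println(absorbs(16777216f, 1f));          // true
        // 2^23 + 1 still fits in 24 bits, so nothing is lost.
        System.out.println(absorbs(8388608f, 1f));           // false
        // Float addition is not associative.
        System.out.println(leftToRight(1e8f, -1e8f, 1f));    // 1.0
        System.out.println(rightToLeft(1e8f, -1e8f, 1f));    // 0.0
    }
}
```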