Floating Point Numbers

Scientific Notation

Scientific notation allows us to represent very large and very small numbers using a compact notation:

Avogadro's Number = A = 6.023 x 10²³ = 602,300,000,000,000,000,000,000 = M x B^E

Planck's constant = 6.626068 x 10^-34 = 0.0000000000000000000000000000000006626068 = M x B^E

where:

M = Mantissa

B = Base

E = Exponent

Notice that the representation isn't unique. For example:

A = 60.23 x 10²²

When we require exactly one nonzero digit to the left of the decimal point, so that 1 ≤ M < 10, the result is called normalized scientific notation. Every nonzero number has exactly one normalized representation.

In general, any number can be written as a sum of digits times powers of 10, where negative exponents are allowed for the digits to the right of the decimal point:

6.023 = 6 x 10⁰ + 0 x 10^-1 + 2 x 10^-2 + 3 x 10^-3

Base 2 Scientific Notation

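Base 2 scientific notation follows the same pattern with B = 2. Moving the binary point one place to the right (shifting the digits left, like appending a 0 to an integer) multiplies by 2, and moving it one place to the left (shifting the digits right) divides by 2. In normalized form the digit to the left of the binary point is always 1.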

Example

[42]₂ = 101010.0000 = 1.0101 x 2⁵

[21]₂ = 1.0101 x 2⁴

[10.5]₂ = 1.0101 x 2³

[5.25]₂ = 1.0101 x 2²

[2.625]₂ = 1.0101 x 2¹

[1.3125]₂ = 1.0101 x 2⁰

[0.65625]₂ = 1.0101 x 2^-1

Example

32 = 1.00000 x 2⁵

16 = 1.00000 x 2⁴

8 = 1.00000 x 2³

4 = 1.00000 x 2²

2 = 1.00000 x 2¹

1 = 1.00000 x 2⁰

.5 = 1.00000 x 2^-1

.25 = 1.00000 x 2^-2
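
The two tables above can be reproduced in Java. The following is a minimal sketch, assuming a hypothetical class name Base2Notation; it relies on the standard library methods Math.getExponent and Math.scalb and only handles normal, finite, nonzero values.

```java
class Base2Notation {
    // Exponent E such that 1 <= |x| / 2^E < 2 (normal, finite, nonzero x only).
    static int exponent(double x) {
        return Math.getExponent(x);
    }

    // Mantissa M in [1, 2). Scaling by a power of 2 is exact, so no bits are lost.
    static double mantissa(double x) {
        return Math.scalb(Math.abs(x), -Math.getExponent(x));
    }

    public static void main(String[] args) {
        double[] samples = {42, 21, 10.5, 0.65625, 32, 0.25};
        for (double x : samples) {
            // The mantissa is printed in decimal: 1.3125 is 1.0101 in base 2.
            System.out.println(x + " = " + mantissa(x) + " x 2^" + exponent(x));
        }
    }
}
```

Note that 42, 21, 10.5 and 0.65625 all share the mantissa 1.3125 (1.0101 in base 2) and differ only in the exponent.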

IEEE Standards

The Java virtual machine has two floating point types: float and double.

Java floats are represented using the 32 bit IEEE 754-1985 floating point standard, which packs three fields into one 32 bit word in this order: sign, exponent, mantissa.

Where:

sign = 1 bit = 0 (positive) or 1 (negative)

Exponent = 8 bit biased integer = actual exponent + 127

Mantissa = 23 bits holding the fraction digits that follow the leading "1." (the leading 1 is implied and is not stored)

So the conversion formula is:

[F]₂ = (-1)^sign x 1.Mantissa x 2^(Exponent – 127)

Java doubles are a 64 bit version of this pattern: 1 sign bit, an 11 bit exponent biased by 1023, and a 52 bit mantissa.

Examples

[.25]₂ = (-1)⁰ x 1.00000000000000000000000 x 2^(125 – 127) = 0,01111101,00000000000000000000000

[-42.0]₂ = (-1)¹ x 1.01010000000000000000000 x 2^(132 – 127) = 1,10000100,01010000000000000000000
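
The same decomposition can be checked in Java. This is a minimal sketch, assuming a hypothetical class name FloatFields; it uses Float.floatToIntBits from the standard library to obtain the raw 32 bits and then masks out the three fields.

```java
class FloatFields {
    // Bit 31: 0 for positive, 1 for negative.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Bits 30..23: the biased exponent (actual exponent + 127).
    static int exponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Bits 22..0: the fraction bits that follow the implied leading 1.
    static int mantissa(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Left-pad a binary string with zeros to a fixed width.
    static String pad(String bits, int width) {
        return String.format("%" + width + "s", bits).replace(' ', '0');
    }

    // Render as sign,exponent,mantissa in the style used in these notes.
    static String layout(float f) {
        return sign(f) + ","
             + pad(Integer.toBinaryString(exponent(f)), 8) + ","
             + pad(Integer.toBinaryString(mantissa(f)), 23);
    }

    public static void main(String[] args) {
        System.out.println(layout(0.25f));
        System.out.println(layout(-42.0f));
    }
}
```

The biased exponents come out as 125 for .25 and 132 for -42.0, which correspond to actual exponents of -2 and 5.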

There are a number of special cases. For example:

[0.0]₂ = 00000000000000000000000000000000

Float.NaN

Float.POSITIVE_INFINITY

Float.NEGATIVE_INFINITY
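
The special values can be inspected the same way. The sketch below assumes a hypothetical class name FloatSpecialValues; the bit patterns are the ones Float.floatToIntBits reports, shown in hexadecimal.

```java
class FloatSpecialValues {
    // Raw bits of a float as 8 hexadecimal digits.
    static String hex(float f) {
        return String.format("%08x", Float.floatToIntBits(f));
    }

    // NaN is the only float value that is not equal to itself.
    static boolean differsFromItself(float f) {
        return f != f;
    }

    // Float division by zero does not throw; it yields an infinity (or NaN for 0/0).
    static float divide(float a, float b) {
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println("0.0       " + hex(0.0f));
        System.out.println("+Infinity " + hex(Float.POSITIVE_INFINITY));
        System.out.println("-Infinity " + hex(Float.NEGATIVE_INFINITY));
        System.out.println("NaN       " + hex(Float.NaN));
        System.out.println("1/0 = " + divide(1.0f, 0.0f));
        System.out.println("0/0 = " + divide(0.0f, 0.0f));
        System.out.println("NaN != NaN: " + differsFromItself(Float.NaN));
    }
}
```

Both infinities and NaN use the all-ones exponent 11111111; the infinities have a zero mantissa and NaN has a nonzero one. Because NaN compares unequal to everything, including itself, Float.isNaN is the reliable way to test for it.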

Here's a nice conversion tool:

http://www.h-schmidt.net/FloatApplet/IEEE754.html

Round-Off Errors

[1/3]₂ = 1/4 + [1/3 – 1/4]₂ = 1/4 + [1/12]₂ = 1/4 + 1/16 + [1/12 – 1/16]₂ = 1/4 + 1/16 + [1/48]₂ = 1/4 + 1/16 + 1/64 + ...

= 0.010101010101...

This base 2 expansion goes on forever, so we just round it off:

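≈ (-1)⁰ x 1.01010101010101010101011 x 2^(125 – 127) = 0,01111101,01010101010101010101011 ≈ 0.33333334 in decimal

The last mantissa bit is 1 rather than 0 because the default IEEE mode rounds to the nearest representable value, and the discarded tail (1010...) is more than half a unit in the last place. The stored float is therefore slightly larger than 1/3.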

These round-off errors can accumulate in a lengthy calculation.
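
Accumulation is easy to observe with 0.1, which, like 1/3, has an infinite base 2 expansion. The following is a minimal sketch, assuming a hypothetical class name RoundOffDemo: it adds 0.1f to itself repeatedly and compares the result with the exact answer.

```java
class RoundOffDemo {
    // Add 0.1f to a running float total n times; each addition rounds to 24 significant bits.
    static float sumOfTenths(int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += 0.1f;
        }
        return sum;
    }

    public static void main(String[] args) {
        float ten = sumOfTenths(10);
        System.out.println("10 x 0.1f   = " + ten);
        System.out.println("equals 1.0f? " + (ten == 1.0f));

        // With more additions the total drifts further from the exact value n / 10.
        int n = 1_000_000;
        System.out.println(n + " x 0.1f = " + sumOfTenths(n) + " (exact: " + (n / 10) + ")");
    }
}
```

For this reason floats are usually compared against a small tolerance rather than tested for exact equality.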

Arithmetic

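Multiplying and dividing floats is comparatively simple: multiply (or divide) the mantissas, add (or subtract) the exponents, then renormalize and round. Adding and subtracting is harder. The operand with the smaller exponent must first be shifted right until the two exponents match, which can discard its low-order bits, and subtracting two nearly equal numbers cancels the leading bits and leaves mostly round-off error.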
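
Two consequences of float addition can be demonstrated directly. This is a minimal sketch, assuming a hypothetical class name FloatAddition: a small addend can vanish entirely next to a large one, and the order of additions can change the result.

```java
class FloatAddition {
    // A float carries 24 significant bits, so 2^24 + 1 is not representable.
    static float addOne(float x) {
        return x + 1.0f;
    }

    static float leftToRight(float a, float b, float c) {
        return (a + b) + c;
    }

    static float rightToLeft(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        float big = 16777216.0f; // 2^24
        System.out.println("2^24 + 1 == 2^24 ? " + (addOne(big) == big));

        // Same three numbers, different grouping, different answers.
        System.out.println("(1e8 + -1e8) + 1 = " + leftToRight(1e8f, -1e8f, 1.0f));
        System.out.println("1e8 + (-1e8 + 1) = " + rightToLeft(1e8f, -1e8f, 1.0f));
    }
}
```

In the second grouping the 1 is lost when it is added to -1e8, because adjacent floats near 1e8 are 8 apart. Multiplication has no alignment step, so it does not suffer from this kind of absorption.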