Scientific notation lets us represent very large and very small numbers compactly:

Avogadro's Number = A = 6.023 x 10^{23} = 602,300,000,000,000,000,000,000 = M x B^{E}

Planck's constant = 6.626068 x 10^{-34} = 0.0000000000000000000000000000000006626068 = M x B^{E}

where:

M = Mantissa

B = Base

E = Exponent

Notice that the representation isn't unique. For example:

A = 60.23 x 10^{22}

When we require exactly one nonzero digit to the left of the decimal point, the result is called normalized scientific notation.
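As a rough illustration of normalization, the sketch below rewrites a decimal value as M x 10^{E} with exactly one nonzero digit before the decimal point. The class and method names (Normalize, exponent, mantissa) are made up for this example, and the code assumes a nonzero input.

```java
import java.math.BigDecimal;

class Normalize {
    // Decimal exponent E such that value = M x 10^E with 1 <= |M| < 10.
    // Assumes value is nonzero.
    static int exponent(BigDecimal value) {
        return value.precision() - value.scale() - 1;
    }

    // Mantissa M: the same digits with the decimal point moved E places.
    static BigDecimal mantissa(BigDecimal value) {
        return value.movePointLeft(exponent(value));
    }

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("60.23E22"); // the non-normalized form of A
        // prints 6.023 x 10^23
        System.out.println(mantissa(a).toPlainString() + " x 10^" + exponent(a));
    }
}
```

BigDecimal is used here only so that the decimal digits stay exact; it is not how floats are stored.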

In general, any number can be written as a sum of powers of 10, where negative exponents are allowed:

6.023 = 6 x 10^{0} + 0 x 10^{-1} + 2 x 10^{-2} + 3 x 10^{-3}

Base 2 scientific notation follows the same pattern. Note that shifting the digits one place to the left (appending a 0) multiplies by 2, and shifting them one place to the right (dropping a trailing 0) divides by 2.

[42]_{2} = 101010.0000 = 1.0101 x 2^{5}

[21]_{2} = 1.0101 x 2^{4}

[10.5]_{2} = 1.0101 x 2^{3}

[5.25]_{2} = 1.0101 x 2^{2}

[2.625]_{2} = 1.0101 x 2^{1}

[1.3125]_{2} = 1.0101 x 2^{0}

[0.65625]_{2} = 1.0101 x 2^{-1}

32 = 1.00000 x 2^{5}

16 = 1.00000 x 2^{4}

8 = 1.00000 x 2^{3}

4 = 1.00000 x 2^{2}

2 = 1.00000 x 2^{1}

1 = 1.00000 x 2^{0}

.5 = 1.00000 x 2^{-1}

.25 = 1.00000 x 2^{-2}
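The tables above can be checked with a small sketch that splits a float into a significand in [1, 2) and a power of 2. The class and method names are hypothetical; Math.getExponent and Math.scalb are standard library calls. The code assumes finite, normal, nonzero inputs.

```java
class BinaryNormalize {
    // Unbiased base 2 exponent E of a finite, normal float.
    static int exponent(float x) {
        return Math.getExponent(x);
    }

    // Significand M in [1, 2) such that |x| = M x 2^E.
    static float significand(float x) {
        return Math.abs(x) / Math.scalb(1.0f, exponent(x));
    }

    public static void main(String[] args) {
        float[] samples = {42f, 21f, 10.5f, 5.25f, 2.625f, 1.3125f, 0.65625f};
        for (float x : samples) {
            // Every sample has significand 1.3125, which is 1.0101 in base 2.
            System.out.println(x + " = " + significand(x) + " x 2^" + exponent(x));
        }
    }
}
```

Each halving of the input leaves the significand alone and lowers the exponent by one, which is the shifting rule described above.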

The Java virtual machine has two floating point types: float and double.

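Java floats are represented using the 32 bit IEEE 754-1985 floating point standard. Each float is stored as three bit fields, from the most significant bit to the least significant:

sign,Exponent,Mantissa

Where: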

sign = 1 bit = 0 (positive) or 1 (negative)

Exponent = 8 bit biased integer = actual exponent + 127

Mantissa = 23 bit unsigned integer = the fraction bits that follow the implicit leading 1.

So the conversion formula is:

[F]_{2} = (-1)^{sign} x 1.Mantissa x 2^{Exponent – 127}
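The conversion formula can be tried out directly. The following is a minimal sketch, not a full decoder: it pulls the three fields out of a float with Float.floatToIntBits and rebuilds the value from them. The class and method names are invented for illustration, and value() is only valid for normal numbers (biased exponent between 1 and 254).

```java
class FloatDecode {
    static int sign(float f)     { return Float.floatToIntBits(f) >>> 31; }
    static int exponent(float f) { return (Float.floatToIntBits(f) >>> 23) & 0xFF; }
    static int mantissa(float f) { return Float.floatToIntBits(f) & 0x7FFFFF; }

    // (-1)^sign x 1.Mantissa x 2^(Exponent - 127), for normal numbers only.
    static double value(int sign, int exponent, int mantissa) {
        double significand = 1.0 + mantissa / (double) (1 << 23); // 1.Mantissa
        double magnitude = Math.scalb(significand, exponent - 127);
        return sign == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        float f = -42.0f;
        // prints 1 132 2621440
        System.out.println(sign(f) + " " + exponent(f) + " " + mantissa(f));
        // prints -42.0
        System.out.println(value(sign(f), exponent(f), mantissa(f)));
    }
}
```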

Java doubles are a 64 bit version of this pattern: 1 sign bit, an 11 bit exponent biased by 1023, and a 52 bit mantissa.
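For comparison, here is a sketch of the same field extraction for doubles, assuming the standard IEEE 754 double layout of 1 sign bit, 11 exponent bits (bias 1023), and 52 mantissa bits. The class and method names are hypothetical.

```java
class DoubleFields {
    static long sign(double d)     { return Double.doubleToLongBits(d) >>> 63; }
    static long exponent(double d) { return (Double.doubleToLongBits(d) >>> 52) & 0x7FFL; }
    static long mantissa(double d) { return Double.doubleToLongBits(d) & 0xFFFFFFFFFFFFFL; }

    public static void main(String[] args) {
        double d = -42.0;
        // Biased exponent is 5 + 1023 = 1028; the mantissa starts with bits 0101.
        System.out.println(sign(d) + " " + exponent(d) + " "
                + Long.toHexString(mantissa(d)));
    }
}
```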

[.25]_{2} = (-1)^{0} x 1.00000000000000000000000 x 2^{125 – 127} = 0,01111101,00000000000000000000000

[-42.0]_{2} = (-1)^{1} x 1.01010000000000000000000 x 2^{132 – 127} = 1,10000100,01010000000000000000000
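These bit patterns can be reproduced with a short sketch that prints a float in the same sign,Exponent,Mantissa layout used here. The helper name fields is an assumption made for this example.

```java
class FloatBits {
    // Returns the 32 bits of f as "s,eeeeeeee,mmmmmmmmmmmmmmmmmmmmmmm".
    static String fields(float f) {
        int bits = Float.floatToIntBits(f);
        // toBinaryString drops leading zeros, so pad back to 32 characters.
        String padded = String.format("%32s", Integer.toBinaryString(bits))
                .replace(' ', '0');
        return padded.substring(0, 1) + ","
                + padded.substring(1, 9) + ","
                + padded.substring(9);
    }

    public static void main(String[] args) {
        System.out.println(fields(0.25f));  // 0,01111101,00000000000000000000000
        System.out.println(fields(-42.0f)); // 1,10000100,01010000000000000000000
    }
}
```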

There are a number of special cases. For example:

[0.0]_{2} = 00000000000000000000000000000000

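Float.NaN = exponent bits all 1, mantissa nonzero (the result of invalid operations such as 0.0/0.0)

Float.POSITIVE_INFINITY = 0,11111111,00000000000000000000000

Float.NEGATIVE_INFINITY = 1,11111111,00000000000000000000000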
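A small sketch of how these special values look and behave in Java. The helper name hex is made up for this example; the hexadecimal strings are what Float.floatToIntBits returns for each constant.

```java
class SpecialValues {
    // The 32 bits of f as 8 hexadecimal digits.
    static String hex(float f) {
        return String.format("%08x", Float.floatToIntBits(f));
    }

    public static void main(String[] args) {
        System.out.println(hex(0.0f));                    // 00000000
        System.out.println(hex(Float.POSITIVE_INFINITY)); // 7f800000
        System.out.println(hex(Float.NEGATIVE_INFINITY)); // ff800000
        System.out.println(hex(Float.NaN));               // 7fc00000

        float zero = 0.0f;
        System.out.println(1.0f / zero);               // Infinity
        System.out.println(Float.isNaN(zero / zero));  // true
        float nan = Float.NaN;
        System.out.println(nan == nan);                // false: NaN equals nothing
    }
}
```

Because NaN compares unequal even to itself, Float.isNaN is the reliable way to test for it.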

Here's a nice conversion tool:

http://www.h-schmidt.net/FloatApplet/IEEE754.html

Some fractions, such as 1/3, have no finite base 2 representation:

[1/3]_{2} = 1/4 + [1/3 – 1/4]_{2} = 1/4 + [1/12]_{2} = 1/4 + 1/16 + [1/12 – 1/16]_{2} = 1/4 + 1/16 + [1/48]_{2} = 1/4 + 1/16 + 1/64 + ...

= 0.010101010101...
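The expansion can also be generated mechanically. The sketch below uses repeated doubling of the remainder instead of the subtraction of powers of 2 shown above; the two approaches give the same digits. The class and method names are hypothetical, and the code assumes 0 <= numerator < denominator.

```java
class BinaryFraction {
    // First `digits` base 2 digits of numerator/denominator,
    // assuming 0 <= numerator < denominator.
    static String expand(long numerator, long denominator, int digits) {
        StringBuilder sb = new StringBuilder("0.");
        long remainder = numerator;
        for (int i = 0; i < digits; i++) {
            remainder *= 2; // shift one binary place to the left
            if (remainder >= denominator) {
                sb.append('1');
                remainder -= denominator;
            } else {
                sb.append('0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(1, 3, 12)); // 0.010101010101
        System.out.println(expand(1, 4, 4));  // 0.0100
    }
}
```

For 1/3 the remainder cycles between 1 and 2 and never reaches 0, which is why the digits repeat forever.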

This base 2 expansion goes on forever, so we have to round it off to 23 mantissa bits (rounding to the nearest value sets the last bit to 1):

= (-1)^{0} x 1.01010101010101010101011 x 2^{125 – 127} = 0,01111101,01010101010101010101011 = [0.33333334]_{2}
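To see what Java actually stores for 1/3, a quick sketch (the class and method names are made up for this example). Java rounds float results to the nearest representable value, so the last mantissa bit comes out as 1 and the stored value is slightly larger than 1/3.

```java
import java.math.BigDecimal;

class OneThird {
    // Raw bits of the float closest to 1/3.
    static int bits() {
        return Float.floatToIntBits(1.0f / 3.0f);
    }

    public static void main(String[] args) {
        float third = 1.0f / 3.0f;
        System.out.println(Integer.toHexString(bits())); // 3eaaaaab
        System.out.println(third);                       // 0.33333334
        // The exact decimal value of the stored float, digit for digit.
        System.out.println(new BigDecimal(third));
    }
}
```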

These round-off errors can accumulate in a lengthy calculation.
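A sketch of how such errors build up: 0.1 has no finite base 2 representation, so every addition below rounds a little, and with a float accumulator the rounding gets coarser as the sum grows. The class and method names are hypothetical, and the double version is included only for comparison.

```java
class Accumulate {
    // Adds 0.1f to a float accumulator n times.
    static float sumFloat(int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += 0.1f;
        }
        return sum;
    }

    // Same loop with a double accumulator.
    static double sumDouble(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += 0.1;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // the exact answer would be 100000
        System.out.println("float:  " + sumFloat(n));
        System.out.println("double: " + sumDouble(n));
    }
}
```

The float total ends up off by far more than any single rounding error, while the double total stays very close to 100000.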

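Multiplying and dividing floats isn't too bad: multiply or divide the mantissas, add or subtract the exponents, and renormalize. Adding and subtracting are harder, because the exponents must first be made equal by shifting one mantissa, and low-order bits can be lost in the shift.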
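One visible consequence of aligning exponents before adding is that a small addend can vanish entirely. The sketch below is illustrative only; the class and method names are assumptions.

```java
class Absorb {
    static float add(float a, float b) { return a + b; }

    // (a + b) - a: mathematically equal to b.
    static float addThenSubtract(float a, float b) { return (a + b) - a; }

    public static void main(String[] args) {
        float big = 16777216f; // 2^24: a 24 bit significand cannot hold 2^24 + 1
        System.out.println(add(big, 1f) == big);       // true
        System.out.println(addThenSubtract(1e8f, 1f)); // 0.0
        System.out.println(1f + (1e8f - 1e8f));        // 1.0
    }
}
```

The last two lines compute the same sum in a different order and get different answers, so float addition is not associative.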