Floating Point Numbers
Why does 0.1 + 0.2 not equal 0.3, but instead 0.30000000000000004?
Like integers, fractional numbers are stored in binary. Binary can exactly represent fractions whose denominator is a power of 2. For example, 1/2 is represented as 0.1 in base 2.
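We can see this in JavaScript, where toString with a radix of 2 prints a number's binary expansion:

(0.5).toString(2);  // "0.1"
(0.75).toString(2); // "0.11" (1/2 + 1/4)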
Rounding Errors
However, when the denominator is not a power of 2, rounding occurs. 1/10 cannot be exactly represented in base 2: its binary expansion is 0.000110011001100... with the 0011 recurring. To solve the problem of having finite memory, the computer rounds; to 5 fractional digits, for example, this would be 0.00011. The problem is that this rounding makes our value less accurate. 0.00011 in base 2 is actually 1 * 2^-4 + 1 * 2^-5 = 0.09375 in base 10, not 0.1.
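JavaScript can show us both sides of this. toString with a radix of 2 prints the (already rounded) binary expansion that is actually stored, and toFixed can print its exact decimal value:

(0.1).toString(2);
// "0.0001100110011001100110011001100110011001100110011001101"
(0.1).toFixed(55);
// "0.1000000000000000055511151231257827021181583404541015625"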
Rounding errors can occur when the denominator is not a power of 2, when we are dealing with recurring fractions (e.g. 1/3), or when we are dealing with irrational numbers (e.g. pi). They can also occur when a number simply needs more significant digits than the format provides.
In lower-level languages you can often control the rounding mode. C++11, for example, provides the <cfenv> header, whose FE_TONEAREST, FE_UPWARD, FE_DOWNWARD and FE_TOWARDZERO macros can be passed to std::fesetround.
What Degree of Accuracy is Necessary?
Computer memory is limited and does not allow us to store numbers to infinite precision, so at some point a number has to be cut off. For a given use case it is worth asking how much accuracy is actually needed. How many integer digits and how many fractional digits are necessary?
To a fund manager operating on a fund in GBP (£), a pound divides into exactly 100 pence, so fractional digits beyond 2 (in base 10) are unnecessary.
To someone designing a microchip, 0.0001 meters (1/10 of a millimetre) makes a huge difference due to the scale of chips. However, that person would never deal with a distance larger than 1cm (0.01 meters).
To a physicist who needs to use the speed of light (~300000000 in SI units) and Newton's gravitational constant (~0.0000000000667 in SI units) together in the same calculation, both very large and very small magnitudes matter at once.
To the fund manager and the microchip designer, only accuracy within a fixed, known range is needed. The physicist, however, requires calculations that mix very different magnitudes.
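For cases like the fund manager's, a fixed-point approach is often sufficient: hold amounts as integer pence and convert only for display. A minimal sketch (the variable names are illustrative):

const priceInPence = 1999;              // £19.99 held as an integer
const totalInPence = priceInPence * 3;  // 5997, exact integer arithmetic
`£${(totalInPence / 100).toFixed(2)}`;  // "£59.97"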
How Floating Point Numbers Work
Having a fixed number of integer and fractional digits is therefore not flexible enough for every use case. This is why the floating point format exists: it allows us to represent numbers at a wide range of magnitudes.
In this format, the value is composed of 2 numbers:
- The significand, which contains the number's digits.
- The exponent, which states where the radix point (binary or decimal) is placed within those digits.
For example in base 10, if the significand is 5 and the exponent is 3, then the value is 5 * 10^3 = 5000. If the exponent is -3, then the value is 5 * 10^-3 = 0.005.
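The same rule is a one-liner in code (fromParts is just an illustrative name):

const fromParts = (significand, exponent) => significand * 10 ** exponent;
fromParts(5, 3);  // 5000
fromParts(5, -3); // 0.005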
The IEEE 754 Standard
Almost all hardware and programming languages implement floating point numbers according to the IEEE 754 standard. The two most common formats are:
- Single precision (32-bit) consists of 1 sign bit, 23 significand bits and 8 exponent bits.
- Double precision (64-bit) consists of 1 sign bit, 52 significand bits and 11 exponent bits.
Deciding between the two depends on the magnitude and accuracy your calculations require, as well as on how much memory is available and how much would be consumed.
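To make the double-precision layout concrete, here is a sketch that splits a 64-bit float into its three fields using JavaScript's DataView (the decompose name is ours, and subnormals, NaN and Infinity are ignored for simplicity):

function decompose(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value);      // big-endian by default
  const high = view.getUint32(0); // sign, exponent and top 20 significand bits
  const low = view.getUint32(4);  // remaining 32 significand bits
  return {
    sign: high >>> 31,                        // 1 bit
    exponent: ((high >>> 20) & 0x7ff) - 1023, // 11 bits, stored with a bias of 1023
    significand: (BigInt(high & 0xfffff) << 32n) | BigInt(low), // 52 bits
  };
}

decompose(0.1);
// { sign: 0, exponent: -4, significand: 2702159776422298n }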
Epsilon
Epsilon is a value provided by many languages: the difference between 1 and the next value representable in its floating point type (or a chosen floating point type). It allows us to write expressions that tolerate an acceptable margin of error. In JavaScript this value is 2^-52 (about 2.22 * 10^-16).
For example, in JavaScript we are no longer surprised by the output of this expression now that we know about floating point maths:
0.1 + 0.2 === 0.3; // false
Instead we can write an expression like this:
const a = 0.1 + 0.2; // 0.30000000000000004
const b = 0.3;
Math.abs(a - b) < Number.EPSILON; // true
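One caveat: Number.EPSILON is only a sensible tolerance for values near 1, because the gap between adjacent floats grows with magnitude. For larger values the tolerance should be scaled. A sketch (nearlyEqual is a hypothetical helper, not a built-in):

const nearlyEqual = (a, b) =>
  Math.abs(a - b) < Number.EPSILON * Math.max(Math.abs(a), Math.abs(b), 1);

nearlyEqual(0.1 + 0.2, 0.3);          // true
nearlyEqual(1000.1 + 1000.2, 2000.3); // true, where the unscaled test above returns false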