Floating Point Numbers
In most programming languages, `0.1 + 0.2` is not equal to `0.3`, but instead evaluates to `0.30000000000000004`. Why does this happen?
Like integers, fractional numbers are stored in binary. They can accurately represent fractions whose denominator is a power of 2. For example, `1/2` is represented as `0.1` in base 2.
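A quick check, runnable in any JavaScript console, shows that sums of such fractions carry no rounding error:

```js
// Denominators that are powers of 2 (1/2, 1/4, 1/8, ...) are stored
// exactly in binary, so these comparisons hold with no rounding error.
console.log(0.5 + 0.25 === 0.75); // true (1/2 + 1/4 = 3/4)
console.log(0.125 * 2 === 0.25);  // true (1/8 * 2 = 1/4)
```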
However, when the denominator is not a power of 2, rounding occurs. `1/10` cannot be accurately represented in base 2: its expansion is `0.000110011...` with `0011` recurring forever. To cope with finite memory, the computer rounds; to 5 fractional digits this would be `0.00011`. The problem is that this rounding makes the value less accurate: `0.00011` in base 2 is actually `1 * 2^-4 + 1 * 2^-5 = 0.09375` in base 10, not `0.1`.
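You can inspect this expansion directly in JavaScript, since `Number.prototype.toString` accepts a radix; this prints the binary digits of the double that actually stores `0.1`:

```js
// The double stored for 0.1 is the recurring expansion
// 0.000110011... rounded to 53 significant bits.
console.log((0.1).toString(2)); // begins "0.000110011..."
```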
Rounding errors can occur when the denominator is large, when we are dealing with recurring fractions (e.g. `1/3`), or when we are dealing with irrational numbers.
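These per-operation errors also accumulate. A classic demonstration in JavaScript:

```js
// Each addition rounds the inexact binary value of 0.1,
// so ten additions drift visibly away from 1.
let sum = 0;
for (let i = 0; i < 10; i++) {
  sum += 0.1;
}
console.log(sum === 1); // false
console.log(sum);       // 0.9999999999999999
```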
In lower-level languages, you are able to set the rounding mode. For example, C++11's `<cfenv>` header defines rounding-mode macros such as `FE_TONEAREST` and `FE_UPWARD`, which can be selected with `std::fesetround`.
What Degree of Accuracy is Necessary?
Computer memory is limited and does not allow us to store numbers with infinite precision, so at some point the number needs to be cut off. For a given use case, it's worth asking how much accuracy is actually needed: how many integer digits and how many fractional digits?
To a fund manager operating a fund in GBP (£), the currency only subdivides into pence, from 0 to 99. At 100 pence we carry over into a new pound (a new integer digit), so fractional digits beyond 2 (in base 10) are unnecessary.
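For the fund manager's case, a common workaround is to avoid fractional digits entirely and store amounts as integer pence (a sketch; the helper names `toPence` and `toPounds` are our own, not a standard API):

```js
// Integers up to 2^53 are exact in doubles, so integer pence
// arithmetic carries no rounding error.
const toPence = (pounds) => Math.round(pounds * 100);
const toPounds = (pence) => (pence / 100).toFixed(2);

const total = toPence(0.1) + toPence(0.2); // 10 + 20 = 30 pence, exact
console.log(toPounds(total)); // "0.30"
```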
To someone designing a microchip, 0.0001 meters (1/10 of a millimeter) makes a huge difference due to the scale of chips. However, that person would never deal with a distance larger than 1cm (0.01 meters).
To a physicist who needs to use the speed of light (~300000000 in SI units) and Newton's gravitational constant (~0.0000000000667 in SI units) together in the same calculation, both very large and very small magnitudes matter at once.

To the fund manager and the microchip designer, the range of magnitudes is fixed and known in advance. The physicist, however, requires calculations spanning wildly different magnitudes.
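The physicist's difficulty is easy to demonstrate: once a double's 53 significand bits are consumed by the large magnitude, a small addend is rounded away entirely:

```js
// Above 2^53 the gap between consecutive doubles exceeds 1,
// so adding 1 changes nothing.
console.log(2 ** 53 === 2 ** 53 + 1); // true: the + 1 is rounded away
console.log(1e16 + 1 === 1e16);       // true for the same reason
```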
How Floating Point Numbers Work
So having a fixed number of integer and fractional digits is not flexible enough for every use case. As a result, the floating point format exists, which allows us to represent numbers at various magnitudes:
In this format, the value is composed of 2 numbers:
- The significand, which contains the number's digits.
- The exponent, which states where the binary point is placed relative to those digits.
For example in base 10, if the significand is `5` and the exponent is `3`, then the value would be `5000`. If the exponent is `-3`, then the value would be `0.005`.
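JavaScript's scientific-notation literals encode exactly this significand/exponent pair:

```js
// 5e3 means significand 5 with decimal exponent 3, i.e. 5 * 10^3.
console.log(5e3 === 5000); // true
// 5e-3 shifts the decimal point three places the other way.
console.log(5e-3);         // 0.005
```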
The IEEE 754 Standard
Almost all hardware and programming languages implement the floating point format with the IEEE 754 Standard. The formats are as follows:
- Single precision (32-bit) consists of 1 sign bit, 23 significand bits and 8 exponent bits.
- Double precision (64-bit) consists of 1 sign bit, 52 significand bits and 11 exponent bits.
Deciding which to use depends on the magnitude and accuracy your calculations require, as well as how much memory is available and will be consumed.
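JavaScript numbers are always double precision, but `Math.fround` rounds a value to the nearest single-precision float, which makes the accuracy gap between the two formats visible:

```js
// Single precision keeps only 24 significant bits, so the
// rounded value drifts further from 0.1 than the double does.
console.log(Math.fround(0.1));         // 0.10000000149011612
console.log(Math.fround(0.1) === 0.1); // false
```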
This brings us back to the opening example. In JavaScript:

```js
0.1 + 0.2 === 0.3; // false
```
Instead we can write an expression like this:

```js
const a = 0.1 + 0.2; // 0.30000000000000004
const b = 0.3;
Math.abs(a - b) < Number.EPSILON; // true
```
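One caveat worth noting: `Number.EPSILON` is the gap between 1 and the next representable double, so it is only a sensible tolerance for values near 1. For general comparisons a relative tolerance is safer; a minimal sketch (the helper `approxEqual` is our own, not a built-in):

```js
// Scale the tolerance by the inputs' magnitude so the comparison
// behaves sensibly for both large and small values.
function approxEqual(a, b, relTol = Number.EPSILON) {
  return Math.abs(a - b) <= relTol * Math.max(Math.abs(a), Math.abs(b));
}

console.log(approxEqual(0.1 + 0.2, 0.3)); // true
console.log(approxEqual(0.3, 0.31));      // false
```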