# Floating Point Numbers

Why does `0.1 + 0.2`

not equal `0.3`

, but instead `0.30000000000000004`

?

Like integers, fractional numbers are stored in binary. They can accurately represent fractional numbers where the denominator is a power of 2. For example `1/2`

is represented as `0.1`

in base 2.

## Rounding Errors

However when the denominator is not a power of 2, then rounding occurs.

`1/10`

cannot be accurately represented in base 2. In base 2 it is `0.0001100110011`

recurring so to solve the problem of having finite memory, the computer rounds. For example to 5 digits this would be `0.00011`

. The problem however is that this rounding causes our value to be less accurate. `0.00011`

in base 10 is actually `1 * 2^-4 + 1 * 2^-5 = 0.09375`

, not `0.1`

.

Rounding errors can occur when either the denominator is large, we are dealing with recurring fractions (e.g. `1/3`

) or when we are dealing with non-rational numbers.

In lower level languages, you are able to set the rounding mode. C++11 has macros

## What Degree of Accuracy is Necessary?

So with computer memory being limited and not allowing us to store numbers of infinite precision, then at some point the number needs to be cut off. But in a certain use-case, it’s worth asking how much accuracy is needed. How many integer digits and fractional digits are necessary?

To a fund manager operating on a fund in GBP (£), the currency only has pence up until 99. 100 pence equals a new integer digit so it is unnecessary for us to have fractional digits exceeding 2 (in base 10).

To someone designing a microchip, 0.0001 meters (^{1}⁄_{10} of a milimeter) makes a huge difference due to the scale of chips. However that person would never deal with a distance larger than 1cm (0.1 meters).

To a physicist who needs to use the speed of light (300000000 in SI units) and Newton’s gravitational constant (~ 0.0000000000667 in SI units) together in the same calculation.

To the fund manager and microchip designer, only relative accuracy is needed. However the physicist the requires calculations with very different magnitudes.

## How Floating Point Numbers Work

So having a fixed number of integer and fractional digits is not very flexible in each use case. As a result the floating point format exists which allows us to represent numbers at various magnitudes:

In this format, the value is composed of 2 numbers:

- The
**significand**contains the numbers digits. - The
**exponent**which states where the binary point is placed.

For example in base 10, if the significand is `5`

, and exponent is `3`

, then the value would be `5000`

. If the exponent is `-3`

, then the value would be `0.005`

.

## The IEEE 754 Standard

Almost all hardware and programming languages implement the floating point format with the IEEE 754 Standard. The formats are as follows:

- Single precision (32-bit) consists of 1 sign bit, 23 significand bits and 8 exponent bits.
- Double precision (64-bit) consists of 1 sign bit, 52 significand bits and 11 exponent bits.

Deciding between which to use will depend on magnitude and accuracy calculated, as well as consideration for how much memory is available and would be consumed.

## Epsilon

Epsilon is a number provided by languages that is the difference between 1 and and next value represented using it’s floating point type (or chosen floating point type). It allows us to write expressions that determine an acceptable margin of error. In Javascript this value is approximately `2^-52`

.

For example in Javascript we are not surprised by the output of this expression now we know about floating point maths:

```
0.1 + 0.2 === 0.3; // false
```

Instead we can write an expression like this:

```
const a = 0.1 + 0.2; // 0.30000000000000004
const b = 0.3;
Math.abs(a - b) < Number.EPSILON; // true
```