Why 0.1 + 0.2 != 0.3: Understanding Floating Point Arithmetic in Computers

Quame Jnr - Sep 9 - Dev Community

Table of Contents

  1. Introduction
  2. Normalization and Representation
  3. Converting Floating Point Decimal to Binary
  4. Conclusion

Introduction

If you have worked with floating-point numbers in computers, you may have noticed that they sometimes exhibit weird behavior. For example, if you type 0.1 + 0.2 in the Python console, you'll get 0.30000000000000004 instead of 0.3. This behavior is mainly due to how computers store floating-point numbers.
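You can see this for yourself in any Python console:

>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False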

Normalization and Representation

For us, 0.1 is just 0.1, but that is in decimal, and computers know nothing about decimals. All they know is binary. Computers store floating-point numbers by first converting them from floating-point decimal into binary, which we'll take a look at shortly. Then they do what we call normalization.

The idea behind normalization is to create some form of standard, because floating-point numbers can be represented in different ways. Take, for example, 0.123. We can represent this as $1.23 \times 10^{-1}$ or $12.3 \times 10^{-2}$. These are all valid ways of representing floating-point numbers. To make it standardized, we first came up with what is now known as explicit normalization.

Explicit Normalization

With explicit normalization, we move the radix point of a floating-point binary number to the left-hand side of its most significant 1. For example, given the binary number 10.100, we'll move the radix point to the left-hand side of the most significant 1, giving 0.10100. Since we moved the radix point 2 places to the left, we multiply by $2^2$, thus $10.100 = 0.10100 \times 2^2$.

This is binary, thus we're multiplying by 2 and not 10

This allows us to save only the fractional part, 10100, which is also known as the mantissa, and the exponent, 2. The values are laid out in memory in the form:

________________________
|sign|exponent|mantissa|
------------------------
sign: represents the sign of the floating-point number, with 0 being positive and 1 being negative
exponent: represents the exponent of 2 after normalization
mantissa: represents the fractional value after the radix point

For simplicity's sake, let's assume we have an 8-bit computer, so we're going to store our floating-point number in 8 bits. We use 1 bit to represent our sign, 4 bits to represent our exponent, and 3 bits to represent our mantissa.

The sign is 0 since the value is positive, and the mantissa will be 10100. The exponent is a bit trickier, as we can't just convert 2 into binary and save it. This is because the exponent can be a negative number, thus we need to add a bias. We get the bias using $2^{k-1} - 1$, where k is the number of bits representing the exponent, which for our 8-bit computer is 4. Thus the bias is $2^{4-1} - 1 = 7$, so our stored exponent will be $2 + 7 = 9$, whose binary is 1001. A representation of 0.10100 in an 8-bit computer will be:

____________
|0|1001|101|
------------

We only store 101 instead of 10100 because we only have 3 bits to save our number.
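Here is a small Python sketch of this encoding step for our toy 8-bit format (the helper name pack_explicit and the exact output layout are just illustrative choices for this article, not any real API):

# A toy sketch of explicit normalization for our 8-bit format:
# 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
# pack_explicit and its layout are illustrative, not a real API.

BIAS = 2 ** (4 - 1) - 1  # 7

def pack_explicit(bits: str) -> str:
    """Encode a positive binary value like '10.100' as sign|exponent|mantissa."""
    digits = bits.replace(".", "")
    point = bits.index(".")                # position of the radix point
    msb = digits.index("1")                # index of the most significant 1
    exponent = point - msb                 # how far the point moves to the left
    mantissa = (digits[msb:] + "000")[:3]  # keep only 3 bits, the rest is lost
    return f"0 {exponent + BIAS:04b} {mantissa}"

print(pack_explicit("10.100"))  # 0 1001 101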

We can then convert this back with the formula:

$v = (-1)^s \times 0.M \times 2^e$

  1. $(-1)^s$ gives us the right sign of the value. s is the sign bit, so if s is 0 the expression evaluates to $(-1)^0 = 1$, but if s is 1 it evaluates to $(-1)^1 = -1$.

  2. 0.M evaluates to 0.101, as M represents the mantissa.

  3. $2^e$ gives us the power of 2, but since we added a bias before saving the exponent, we need to subtract it back out: $e = \text{exponent} - \text{bias} = 9 - 7 = 2$. Thus, the formula becomes

    $v = (-1)^s \times 0.M \times 2^e = 1 \times 0.101 \times 2^2 = 0.101 \times 2^2 = 10.1$
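And a matching sketch of the decoding side (again, unpack_explicit is just an illustrative name for our toy format):

# Decoding our explicit 8-bit layout back to a value, following
# v = (-1)^s * 0.M * 2^(e - bias). unpack_explicit is an illustrative name.

BIAS = 7

def unpack_explicit(sign: int, exponent: int, mantissa: str) -> float:
    frac = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(mantissa))  # 0.M
    return (-1) ** sign * frac * 2 ** (exponent - BIAS)

print(unpack_explicit(0, 0b1001, "101"))  # 2.5, which is 10.1 in binary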

Implicit Normalization

The problem with explicit normalization is the same one that leads to the sum of 0.1 and 0.2 not being exactly 0.3: precision. With floating-point numbers, the more bits we can store, the more precise our representation. In our 8-bit computer described earlier, we had 3 bits for the mantissa, so if a number needed more than 3 mantissa bits, as ours did, we could only keep 3 of them and the remaining bits were lost. However, if instead of moving the radix point to the left-hand side of the most significant 1 bit we move it to the right-hand side of it, we can save an extra fractional bit.

10.100 will become $1.0100 \times 2^1$ and will be represented as:

____________
|0|1000|010|
------------

We can then, during conversion, change the expression 0.M to 1.M.

$v = (-1)^s \times 1.M \times 2^e = 1 \times 1.010 \times 2^1 = 1.010 \times 2^1 = 10.10$

The last bit is 0, so it's irrelevant in this case, but it would matter if it were 1.

Thus, implicit normalization allows us to imply there is a 1 and saves us an extra bit. One bit may not seem like a lot, but it's the difference between ASCII and UTF-8, the difference between 128 and 256 values, so yeah, for computers, it's a lot.
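Putting implicit normalization into the same kind of sketch (pack_implicit and unpack_implicit are, again, just illustrative helpers for our toy 8-bit format):

# The same sketch with implicit normalization: the leading 1 is implied,
# so all 3 stored mantissa bits describe the fraction after it.
# pack_implicit / unpack_implicit are illustrative names only.

BIAS = 7

def pack_implicit(bits: str) -> str:
    digits = bits.replace(".", "")
    point = bits.index(".")
    msb = digits.index("1")
    exponent = point - msb - 1                 # point ends up right of the leading 1
    mantissa = (digits[msb + 1:] + "000")[:3]  # bits after the implied 1
    return f"0 {exponent + BIAS:04b} {mantissa}"

def unpack_implicit(sign: int, exponent: int, mantissa: str) -> float:
    frac = 1 + sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(mantissa))  # 1.M
    return (-1) ** sign * frac * 2 ** (exponent - BIAS)

print(pack_implicit("10.100"))            # 0 1000 010
print(unpack_implicit(0, 0b1000, "010"))  # 2.5, which is 10.10 in binary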

Converting Floating Point Decimal to Binary

Now that we have some basic understanding of how computers store floating-point numbers, let's try to understand why 0.1 + 0.2 is not exactly equal to 0.3. First, let's convert 0.1 and 0.2 to binary. To convert a floating-point decimal number to binary, we first convert the part before the radix point to binary, then repeatedly multiply the fractional part by 2, keeping the integer digit that appears before the radix point each time, until the fractional part becomes 0 (or starts repeating). Yeah, that may not have been the best explanation, so let's see an example.

Let's say we want to convert 2.25 to binary:

  • Convert the integer before the radix point, which is 2, to binary: 10.
  • Convert the fractional part by multiplying it by 2 and keeping the integer digit that appears before the radix point each time, until the fractional part becomes 0:
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1

Then put the integer digits you got, from top to bottom, behind the radix point. The floating-point number 2.25 in binary is 10.01.
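A small Python sketch of this multiply-by-2 procedure (to_binary_fraction is just a name picked for illustration, and it stops after a fixed number of bits since some fractions never terminate):

# A sketch of the multiply-by-2 method. to_binary_fraction is an illustrative
# name; it stops after max_bits digits because some fractions never terminate.

def to_binary_fraction(x: float, max_bits: int = 10) -> str:
    integer, fraction = divmod(x, 1)
    bits = bin(int(integer))[2:] + "."
    for _ in range(max_bits):
        fraction *= 2
        bits += str(int(fraction))   # the digit that lands before the radix point
        fraction -= int(fraction)    # keep only the fractional part
        if fraction == 0:
            break
    return bits

print(to_binary_fraction(2.25))  # 10.01
print(to_binary_fraction(0.1))   # 0.0001100110 (and it would keep going)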

Now let's convert 0.1:

  • The integer before the radix point is 0, which is the same in binary
  • Convert the remaining fraction:
0.1 × 2 = 0.2 → 0
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
...

We can see this goes on and on. This is a recurring binary number, thus 0.1 in binary is 0.0001100110011....

We can also do the same for 0.2:

0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
...

Thus, 0.2 in binary is 0.001100110011....
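If we want to double-check these repeating patterns with exact arithmetic instead of pen and paper, Python's fractions module avoids any float rounding along the way (binary_digits is just an illustrative helper):

# Double-checking the repeating patterns with exact arithmetic: Fraction(1, 10)
# is exactly one tenth, unlike the float 0.1, so no rounding sneaks in.
from fractions import Fraction

def binary_digits(value: Fraction, n: int) -> str:
    """First n binary digits after the radix point of a fraction in [0, 1)."""
    digits = ""
    for _ in range(n):
        value *= 2
        digit = value // 1        # the integer part, 0 or 1
        digits += str(digit)
        value -= digit
    return digits

print(binary_digits(Fraction(1, 10), 12))  # 000110011001
print(binary_digits(Fraction(2, 10), 12))  # 001100110011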

Now before adding these two numbers, we're first going to store them in our 8-bit computer:

  1. Normalize using implicit normalization:
    $0.1 = 0.000110011... = 1.100110011... \times 2^{-4}$
    $0.2 = 0.00110011... = 1.100110011... \times 2^{-3}$
  2. Add bias to our exponents:
    • for 0.1: -4 + 7 = 3
    • for 0.2: -3 + 7 = 4
  3. Store in our 8-bit computer. We only have 3 bits for the fractional part, so we can only store 3 values after the radix point, meaning we lose all the remaining bits:
For 0.1
____________
|0|0011|100|
------------

For 0.2
____________
|0|0100|100|
------------
  4. Convert them back to their floating-point representations using our earlier formula $v = (-1)^s \times 1.M \times 2^e$:
    • 0.1: $1.100 \times 2^{-4} = 0.0001100$
    • 0.2: $1.100 \times 2^{-3} = 0.001100$
  5. Add them:
  0.0001100
+ 0.0011000
-----------
  0.0100100
  6. Convert back to decimal. Digits before the radix point are converted as normal, while digits after the radix point are weighted by decreasing negative powers of 2, i.e. $2^{-1}, 2^{-2}$, etc.:
    $0.0100100 = (0 \times 2^0) + (0 \times 2^{-1}) + (1 \times 2^{-2}) + (0 \times 2^{-3}) + (0 \times 2^{-4}) + (1 \times 2^{-5}) + 0 = 0 + 0 + 0.25 + 0 + 0 + 0.03125 = 0.28125$

    In our 8-bit computer, 0.1 + 0.2 = 0.28125.
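Here's an end-to-end sketch of this whole walkthrough in Python, using the same toy format of 1 sign bit, a 4-bit biased exponent, and 3 mantissa bits (encode and decode are illustrative helpers, not a real API):

# End-to-end sketch of the toy 8-bit walkthrough: encode 0.1 and 0.2 with an
# implied leading 1, a 4-bit biased exponent and 3 mantissa bits, decode them
# back, and add. encode/decode are illustrative helpers, not a real API.

BIAS = 7
MANTISSA_BITS = 3

def encode(x: float) -> tuple:
    """Return (biased exponent, mantissa bits) for a small positive x < 1."""
    exponent = 0
    while x < 1:                    # normalize to 1.xxxx * 2^exponent
        x *= 2
        exponent -= 1
    x -= 1                          # drop the implied leading 1
    mantissa = ""
    for _ in range(MANTISSA_BITS):  # truncate to the 3 bits we can store
        x *= 2
        mantissa += str(int(x))
        x -= int(x)
    return exponent + BIAS, mantissa

def decode(biased_exponent: int, mantissa: str) -> float:
    frac = 1 + sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(mantissa))
    return frac * 2 ** (biased_exponent - BIAS)

print(encode(0.1))                                   # (3, '100')
print(encode(0.2))                                   # (4, '100')
print(decode(*encode(0.1)) + decode(*encode(0.2)))   # 0.28125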

Conclusion

We started this article trying to understand why 0.1 + 0.2 sometimes does not give us 0.3. Using our 8-bit computer for illustration, we've now seen why. This weird behavior is a result of how computers normalize and represent floating-point numbers, which causes bits to be lost, ergo precision to be lost.

Luckily for us, we now have 32-bit and 64-bit floating-point formats. The Institute of Electrical and Electronics Engineers (IEEE) has a standard, IEEE 754, for how these representations should be done. The 32-bit format allots 8 bits for the exponent and 23 bits for the mantissa, while the 64-bit format uses 11 bits for the exponent and 52 bits for the mantissa. Both still use 1 bit for the sign. This means the 64-bit format has more bits to store the fractional part, hence more precision. This is why the 64-bit format is known as double precision and the 32-bit format as single precision. The more bits we're able to store, the closer our approximation gets to the actual value.
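You can see the effect of those extra mantissa bits directly in Python: Decimal shows the exact value the 64-bit double closest to 0.1 actually stores, and struct lets us round-trip the same value through a 32-bit float:

import struct
from decimal import Decimal

# The exact value the 64-bit double closest to 0.1 actually stores:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Round-tripping the same value through a 32-bit float loses even more bits:
print(struct.unpack('f', struct.pack('f', 0.1))[0])
# 0.10000000149011612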
