
## Floating Point Numbers

We know \(\pi=3.14159265358979\cdots\), where the decimal digits never terminate. For numerical calculations, we consider
only a finite number of digits of a number. A \(t\)-digit *floating point number* of base \(10\) is of the form
\[\pm 0.a_1a_2\ldots a_t\cdot 10^e,\]
where \(0.a_1a_2\ldots a_t\) is called the *mantissa* and \(e\) is called the *exponent*. Usually the mantissa
\(0.a_1a_2\ldots a_t\) is normalized, i.e., \(a_1\neq 0\). For example, the normalized \(15\)-digit floating point number of
\(\pi\) is \[fl(\pi)=0.314159265358979\cdot 10^1.\]
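
As an illustration, the normalized form can be computed with Python's `decimal` module. This is a minimal sketch, not a standard routine: the helper name `fl`, its return convention, and the choice of passing the number as a decimal string are assumptions made here. The `rounding` argument selects whether the digits beyond the \(t\)-th are simply dropped (`ROUND_DOWN`) or the last kept digit is rounded (`ROUND_HALF_UP`).

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl(x: str, t: int, rounding: str = ROUND_HALF_UP):
    """Return (sign, digits, e) with x approximately sign * 0.digits * 10**e.

    Hypothetical helper for illustration; assumes 1 <= t <= 28 so that the
    default decimal context precision is sufficient.
    """
    d = Decimal(x)
    if d == 0:
        return 1, "0" * t, 0              # zero has no normalized form
    sign = -1 if d < 0 else 1
    d = abs(d)
    e = d.adjusted() + 1                  # chosen so that 0.1 <= d / 10**e < 1
    mantissa = d.scaleb(-e).quantize(Decimal(1).scaleb(-t), rounding=rounding)
    if mantissa == 1:                     # rounding carried over, e.g. 0.9996 -> 1.000
        return sign, "1" + "0" * (t - 1), e + 1
    return sign, str(mantissa)[2:], e     # drop the leading "0."

pi_digits = "3.14159265358979323846"
pi_15 = fl(pi_digits, 15)                 # 15-digit form of pi
pi_4_chopped = fl(pi_digits, 4, ROUND_DOWN)
pi_4_rounded = fl(pi_digits, 4)
print(pi_15, pi_4_chopped, pi_4_rounded)
```

With four digits, dropping the extra digits gives \(0.3141\cdot 10^1\), while rounding the last digit gives \(0.3142\cdot 10^1\).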

Note that floating point numbers are approximations of the exact numbers, obtained by either **chopping** or **rounding**
the digits. The error in calculations caused by the use of floating point numbers is called **roundoff error**.
For example, a computer may calculate the following
\[2-(\sqrt{2})^2=-0.444089209850063\cdot 10^{-15},\]
which is just a roundoff error. Note that since floating point numbers are rational numbers, a computer cannot express
any irrational number without errors. Also note that almost all computers use binary floating point numbers.
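
The calculation above can be reproduced in Python, whose `float` type is a binary double-precision number on essentially all current platforms (an assumption of this sketch). The second part uses `fractions.Fraction` to show that the stored value of \(\sqrt{2}\) is an exact rational number, which is why its square cannot equal \(2\).

```python
import math
from fractions import Fraction

r = math.sqrt(2)          # nearest double to the irrational number sqrt(2)
err = 2 - r * r           # roundoff error; exactly 0 in exact arithmetic
print(err)

# Every finite float is a rational number with a power-of-two denominator.
exact = Fraction(r)
print(exact)
print(exact ** 2 == 2)    # the stored value is not a true square root of 2
```

On IEEE 754 double-precision hardware, `err` is a tiny nonzero number of order \(10^{-16}\), matching the value shown above.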

Most computers follow the IEEE 754-2008 standard, which defines the following format for 64-bit binary floating point numbers
(double-precision floating point format):
\[\begin{array}{ccc}
s& x& f\\
\text{(1-bit sign)} & \text{(11-bit exponent)} & \text{(52-bit fraction)}
\end{array}\]
A bit pattern in this format, with fraction bits \(f=f_1f_2\ldots f_{52}\), converts to the following decimal number
\[(-1)^s2^{x_{10}-1023}\left(1+(f_1\cdot 2^{-1}+f_2\cdot 2^{-2}+\cdots+f_{52}\cdot 2^{-52})\right),\]
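where \(x_{10}\) is the decimal value of the exponent bits \(x\) (the all-zeros and all-ones exponents are reserved for zero, subnormal numbers, infinities, and NaN, and do not follow this formula). For example, \(1\;\; 10000000010\;\; 11000000\cdots 0\)
converts to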
\[(-1)^12^{1026-1023}\left(1+(1\cdot 2^{-1}+1\cdot 2^{-2}+0\cdot 2^{-3}+\cdots+0\cdot 2^{-52})\right)=-2^3\left(1+\frac{1}{2}+\frac{1}{4}\right)=-14.\]
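
The conversion formula can be checked with a short Python sketch. The function name `decode_double` is hypothetical, and the function only handles normal numbers, i.e. exponent bits that are neither all zeros nor all ones. The result is cross-checked against the standard `struct` module, which reinterprets the same 64 bits as a double.

```python
import struct

def decode_double(bits: str) -> float:
    """Apply (-1)^s * 2^(x - 1023) * (1 + sum f_i * 2^-i) to a 64-character bit string."""
    assert len(bits) == 64 and set(bits) <= {"0", "1"}
    s, x, f = bits[0], bits[1:12], bits[12:]
    sign = (-1) ** int(s)
    exponent = int(x, 2) - 1023            # x_10 - 1023
    fraction = sum(int(b) * 2.0 ** -i for i, b in enumerate(f, start=1))
    return sign * 2.0 ** exponent * (1 + fraction)

bits = "1" + "10000000010" + "11" + "0" * 50   # the example from the text
value = decode_double(bits)

# Cross-check: pack the same 64 bits as an integer, unpack them as a double.
(check,) = struct.unpack(">d", struct.pack(">Q", int(bits, 2)))
print(value, check)
```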
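In this format the largest finite magnitude is \((2-2^{-52})\cdot 2^{1023}\approx 1.797693134862316\cdot 10^{308}\), so the representable finite numbers range from the negative of this value to this value.
If the result of a calculation has magnitude bigger than that, it cannot be represented as a finite number in this format; this is called an *overflow*, and IEEE 754 arithmetic then returns \(\pm\infty\) by default.
Note that the single-precision floating point format uses 32 bits: 1 sign bit, 8 exponent bits, and 23 fraction bits.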
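
The following Python sketch illustrates these limits, assuming Python's `float` is an IEEE 754 double. It compares `sys.float_info.max` with the largest value obtainable from the conversion formula, shows what multiplication does on overflow, and uses the `struct` format code `"f"` to round a value to single precision. Python has no built-in single-precision type, so the `struct` round trip is a stand-in chosen here for illustration.

```python
import struct
import sys

largest = sys.float_info.max                       # largest finite double
same_as_formula = largest == (2 - 2.0 ** -52) * 2.0 ** 1023
overflowed = largest * 2                           # exceeds the range of the format
print(largest, same_as_formula, overflowed)

# Round 0.1 to single precision (23 fraction bits) and back to a double.
single = struct.unpack(">f", struct.pack(">f", 0.1))[0]
print(single, single == 0.1)

# A value beyond the single-precision range cannot be packed as a 32-bit float.
try:
    struct.pack(">f", 1e39)
    single_overflow = False
except OverflowError:
    single_overflow = True
print(single_overflow)
```

In Python, float multiplication that overflows returns `inf` rather than stopping the program, while `struct.pack` reports the single-precision overflow by raising `OverflowError`.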
