
Floating Point Numbers


We know \(\pi=3.14159265358979\cdots\), where the decimal digits never terminate. For numerical calculations, we consider only a finite number of digits of a number. A \(t\)-digit floating point number of base \(10\) is of the form \[\pm 0.a_1a_2\ldots a_t\cdot 10^e,\] where \(0.a_1a_2\ldots a_t\) is called the mantissa and \(e\) is called the exponent. Usually the mantissa \(0.a_1a_2\ldots a_t\) is normalized, i.e., \(a_1\neq 0\). For example, the normalized \(15\)-digit floating point number of \(\pi\) is \[fl(\pi)=0.314159265358979\cdot 10^1.\]
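The map \(x\mapsto fl(x)\) can be sketched in code. The following is a minimal sketch in Python, assuming rounding (rather than chopping) of the mantissa; the helper name `fl` mirrors the notation above and is not a library function:

```python
import math

def fl(x, t):
    """Normalized t-digit base-10 floating point approximation of x,
    written as 0.a1 a2 ... at * 10**e with a1 != 0 (here by rounding)."""
    if x == 0:
        return 0.0
    # Choose the exponent e so that the mantissa x / 10**e lies in [0.1, 1).
    e = math.floor(math.log10(abs(x))) + 1
    mantissa = round(x / 10**e, t)
    return mantissa * 10**e

print(fl(math.pi, 5))   # approximately 0.31416 * 10**1 = 3.1416
```

Note that the result itself is stored as a binary floating point number, so it only approximates the decimal value \(0.31416\cdot 10^1\).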

Note that floating point numbers are approximations of exact numbers obtained by either chopping or rounding the digits. The error in calculations caused by the use of floating point numbers is called roundoff error. For example, a computer may calculate the following \[2-(\sqrt{2})^2=-0.444089209850063\cdot 10^{-15},\] which is just a roundoff error. Note that since floating point numbers are rational numbers, a computer cannot express any irrational number without error. Also note that almost all computers use binary floating point numbers.
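This roundoff error is easy to reproduce in Python, whose `float` type is a 64-bit binary floating point number:

```python
import math

# sqrt(2) is irrational, so math.sqrt(2) can only return the nearest
# 64-bit binary floating point number; squaring it does not give exactly 2.
r = math.sqrt(2)
print(r)          # 1.4142135623730951
print(2 - r * r)  # -4.440892098500626e-16, i.e. -0.444089209850063e-15
```

The error here is exactly \(-2^{-51}\), one unit in the last place of the number \(2\).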

A 64-bit computer uses the IEEE 754-2008 standard which defines the following format for 64-bit binary floating point numbers (double-precision floating point format): \[\begin{array}{ccc} s& x& f\\ (1\text{-bit sign}) & (11\text{-bit exponent}) & (52\text{-bit fraction}) \end{array}\] which converts to the decimal number \[(-1)^s2^{x_{10}-1023}\left(1+(f_1\cdot 2^{-1}+f_2\cdot 2^{-2}+\cdots+f_{52}\cdot 2^{-52})\right),\] where \(x_{10}\) is the decimal value of \(x\). For example, \(1\;\; 10000000010\;\; 11000000\cdots 0\) converts to \[(-1)^12^{1026-1023}\left(1+(1\cdot 2^{-1}+1\cdot 2^{-2}+0\cdot 2^{-3}+\cdots+0\cdot 2^{-52})\right)=-2^3\left(1+\frac{1}{2}+\frac{1}{4}\right)=-14.\] In this format the largest representable magnitude is about \(1.7976931348623157\cdot 10^{308}\), and the smallest normalized positive magnitude is about \(2.2250738585072014\cdot 10^{-308}\). If a calculation produces a number of magnitude larger than the largest representable number, it is called an overflow; in IEEE 754 arithmetic the result is returned as \(\pm\infty\). Note that the single-precision floating point format uses 32 bits, including a 23-bit fraction.
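The conversion formula above can be checked directly. The sketch below decodes a 64-bit bit string by the formula; it handles only normalized numbers (it ignores the special biased exponents \(0\) and \(2047\) used for subnormals, \(\pm\infty\), and NaN), and the helper name `decode_double` is mine:

```python
import sys

def decode_double(bits):
    """Decode a 64-bit IEEE 754 bit string (sign | exponent | fraction)
    to a float, assuming a normalized number."""
    s = int(bits[0])                 # 1-bit sign
    x = int(bits[1:12], 2)           # 11-bit biased exponent, as a decimal number
    frac = bits[12:]                 # 52-bit fraction f1 f2 ... f52
    f = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(frac))
    return (-1) ** s * 2 ** (x - 1023) * (1 + f)

bits = "1" + "10000000010" + "11" + "0" * 50   # the example from the text
print(decode_double(bits))   # -14.0
print(sys.float_info.max)    # 1.7976931348623157e+308, the largest double
```

Here `sys.float_info.max` equals \((2-2^{-52})\cdot 2^{1023}\), the largest normalized double: all \(52\) fraction bits set, with the largest ordinary biased exponent.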
