
STEP 3

ERRORS 3

Error propagation and generation

It has been noted that a number is to be represented by a finite
number of digits, and hence often by an approximation. It is to be
expected that the result of any arithmetic procedure (any algorithm)
involving a set of numbers will have an implicit error relating to the
error of the original numbers. One says that the initial errors
propagate through the computation. In addition, errors may be
generated at each step in an algorithm, and one speaks of the total
cumulative error at any step as the accumulated error.

Since one aims to produce results within some chosen limit of error, it is
useful to consider error propagation. Roughly speaking, based on
experience, the propagated error depends on the mathematical
algorithm chosen, whereas the generated error is more sensitive to
the actual ordering of the computational steps. It is possible to be more
precise, as described below.

Absolute error

The absolute error is the absolute difference between the exact
number x and the approximate number X, i.e.,

|x - X|.

A number correct to n decimal places has

|x - X| ≤ 0.5 x 10^-n.

We expect that the absolute error involved in any approximate number
is no more than five units at the first neglected digit.

Relative error

This error is the ratio of the absolute error to the absolute exact
number, i.e.,

r = |x - X| / |x|.

(Note that an upper bound follows from the triangle inequality; writing
d = |x - X|, one has |x| ≥ |X| - d, so that r ≤ d / (|X| - d).)

A decimal number correct to n significant digits has

r ≤ 5 x 10^-n.

Error propagation

Consider two numbers X1 and X2, approximating x1 and x2 with absolute
errors e1 = x1 - X1 and e2 = x2 - X2. Under the operations of addition
or subtraction,

|(x1 ± x2) - (X1 ± X2)| = |e1 ± e2| ≤ |e1| + |e2|.

The magnitude of the propagated error is therefore not more than the
sum of the initial absolute errors; of course, it may be zero.

Under the operation of multiplication:

x1 x2 - X1 X2 = X2 e1 + X1 e2 + e1 e2 ≈ X1 X2 (e1/X1 + e2/X2),

so that the relative error of the product is approximately (e1/X1) + (e2/X2).
The maximum relative error propagated is approximately the sum of
the initial relative errors. The same result is obtained when the
operation is division.

Error generation

Often (for example, in a computer) an operation ∘ is approximated by
an operation ∘*, say. Consequently, x ∘ y is represented by X ∘* Y.
Indeed, one has

|x ∘ y - X ∘* Y| ≤ |x ∘ y - X ∘ Y| + |X ∘ Y - X ∘* Y|,

so that the accumulated error does not exceed the sum of the
propagated error (the first term) and the generated error (the second
term). Examples may be found in Step 4.

Example

Evaluate (as accurately as possible) the expressions:

1. 3.45 + 4.87 - 5.16
2. 3.55 x 2.73

There are two methods which the student may consider. The first is to
invoke the concepts of absolute and relative error as defined above.
Thus, the result for 1. is 3.16 +/- 0.015, since the maximum absolute
error is 0.005 + 0.005 + 0.005 = 0.015. We conclude that the answer is
3 (to 1S), for the number certainly lies between 3.145 and 3.175. In 2.,
the product 9.6915 is subject to the maximum relative error

0.005/3.55 + 0.005/2.73,

whence the maximum (absolute) error ≈ (2.73 + 3.55) x 0.005 ≈ 0.03, so
that the answer is 9.7.

A second approach is interval arithmetic. Thus, the approximate
number 3.45 represents a number in the interval (3.445, 3.455), etc.
Consequently, the result for 1. lies in the interval bounded below by

3.445 + 4.865 - 5.165 = 3.145

and above by

3.455 + 4.875 - 5.155 = 3.175.

Similarly, in 2., the result lies in the interval bounded below by

3.545 x 2.725 = 9.660125

and above by

3.555 x 2.735 = 9.722925,

whence once again the approximate numbers 3 and 9.7 correctly
represent the respective results to 1. and 2.
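This interval computation is easily mechanized. The following is a
minimal Python sketch (our own illustration; the function names are
not from the text): each approximate number is replaced by the
interval of true values it may represent, and the intervals are
combined under the arithmetic operations.

def interval(d, unit=0.005):
    # Interval of true values represented by the approximation d.
    return (d - unit, d + unit)

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def mul(a, b):
    # Take the extremes over all endpoint products (safe for any signs).
    p = [x * y for x in a for y in b]
    return (min(p), max(p))

# 1. 3.45 + 4.87 - 5.16 lies in (3.145, 3.175), so the answer is 3 to 1S.
print(sub(add(interval(3.45), interval(4.87)), interval(5.16)))

# 2. 3.55 x 2.73 lies in (9.660125, 9.722925), so the answer is 9.7.
print(mul(interval(3.55), interval(2.73)))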

Checkpoint

1. What distinguishes propagated and generated errors?
2. How is the propagated error determined for the operations addition
(subtraction) and multiplication (division)?

EXERCISES
Evaluate the following operations as accurately as possible, assuming
all values to the number of digits given:

1. 8.24 + 5.33.
2. 124.53 - 124.52.
3. 4.27 x 3.13.
4. 9.48 x 0.513 - 6.72.
5. 0.25 x 2.84/0.64.
6. 1.73 - 2.16 + 0.08 + 1.00 - 2.23 - 0.97 + 3.02.

STEP 4

ERRORS 4

Floating point arithmetic

In STEP 2, floating point representation was introduced as a convenient
way of dealing with large or small numbers. Since most scientific
computations involve such numbers, many students will be familiar
with floating point arithmetic and will appreciate the way in which it
facilitates calculations involving multiplication or division.

In order to investigate the implications of finite number
representation, one must examine the way in which arithmetic is
carried out with floating point numbers. The following specifications
apply to most computers which round, and are easily adapted to those
which chop. For the sake of simplicity in the examples, we will use a
three-digit decimal mantissa normalized to lie in the range [1, 10)
(most computers use binary representation, and the mantissa is
commonly normalized to lie in the range [1/2, 1)). Note that up to six
digits are used for intermediate results, but the final result of each
operation is a normalized three-digit decimal floating point number.

Addition and subtraction

Mantissae are added or subtracted (after shifting the mantissa and
increasing the exponent of the smaller number, if necessary, to make
the exponents agree); the final normalized result is obtained by
rounding (after shifting the mantissa and adjusting the exponent, if
necessary). Thus:

3.12 x 10^1 + 4.26 x 10^1 = 7.38 x 10^1,
2.77 x 10^2 + 7.55 x 10^2 = 10.32 x 10^2 → 1.03 x 10^3,
6.18 x 10^1 + 1.84 x 10^-1 = 6.18 x 10^1 + 0.0184 x 10^1 = 6.1984 x 10^1 → 6.20 x 10^1,
3.65 x 10^-1 - 2.78 x 10^-1 = 0.87 x 10^-1 → 8.70 x 10^-2.

Multiplication

The exponents are added and the mantissae are multiplied; the final
result is obtained by rounding (after shifting the mantissa right and
increasing the exponent by 1, if necessary). Thus:

(4.27 x 10^1) x (3.68 x 10^1) = 15.7136 x 10^2 → 1.57 x 10^3,
(2.73 x 10^2) x (-3.64 x 10^-2) = -9.9372 x 10^0 → -9.94 x 10^0.

Division

The exponents are subtracted and the mantissae are divided; the
final result is obtained by rounding (after shifting the mantissa
left and reducing the exponent by 1, if necessary). Thus:

(5.43 x 10^1) / (4.55 x 10^2) = 1.19340... x 10^-1 → 1.19 x 10^-1,
(-2.75 x 10^2) / (9.87 x 10^-2) = -0.278622... x 10^4 → -2.79 x 10^3.
Expressions

The order of evaluation is determined in a standard way and the
result of each operation is a normalized floating point number.
Thus:

(6.18 x 10^1 + 1.84 x 10^-1) / ((4.27 x 10^1) x (3.68 x 10^1))
→ (6.20 x 10^1) / (1.57 x 10^3) = 3.94904... x 10^-2 → 3.95 x 10^-2.
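These rules are easy to simulate. Below is a minimal Python sketch
(our own illustration, not the book's pseudo-code): every intermediate
result is rounded to a normalized three-digit decimal mantissa, which
reproduces the examples above.

def fl3(x):
    # Round x to three significant decimal digits: two digits after the
    # point in scientific notation give a three-digit mantissa.
    return float(f"{x:.2e}")

# Addition and subtraction
print(fl3(2.77e2 + 7.55e2))    # 1.03e3
print(fl3(6.18e1 + 1.84e-1))   # 6.20e1
print(fl3(3.65e-1 - 2.78e-1))  # 8.70e-2

# Multiplication and division
print(fl3(4.27e1 * 3.68e1))    # 1.57e3
print(fl3(5.43e1 / 4.55e2))    # 1.19e-1

# Expression: round after each operation
print(fl3(fl3(6.18e1 + 1.84e-1) / fl3(4.27e1 * 3.68e1)))  # 3.95e-2

(Note that fl3 rounds a binary double to three decimal digits, so in
rare borderline cases its rounding may differ from true decimal
arithmetic.)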

Generated error

Note that all the above examples (except the subtraction and the
first addition) involve generated errors which are relatively large
due to the small number of digits in the mantissae. Thus the
generated error in

2.77 x 10^2 + 7.55 x 10^2 = 10.32 x 10^2 → 1.03 x 10^3

is 0.002 x 10^3. Since the propagated error in this example may be
as large as 0.01 x 10^2 (assuming the operands are correct to 3S),
one can use the result given in Error propagation to deduce that
the accumulated error cannot exceed 0.002 x 10^3 + 0.01 x 10^2 =
0.003 x 10^3.

Consequences

The peculiarities of floating point arithmetic lead to some
unexpected and unfortunate consequences, including the
following:

1. Addition or subtraction of a small (but nonzero) number
may have no effect; for example,

5.18 x 10^2 + 4.37 x 10^-1 = 5.18 x 10^2 + 0.00437 x 10^2 = 5.18437 x 10^2 → 5.18 x 10^2,

whence the additive identity is not unique.

2. Frequently, the result of a x (1/a) is not 1; for example, if
a = 3.00 x 10^0, then

1/a is 3.33 x 10^-1

and

a x (1/a) is 9.99 x 10^-1,

whence the multiplicative inverse may not exist.

3. The result of (a + b) + c is not always the same as the
result of a + (b + c); for example, if

a = 6.31 x 10^1, b = 4.24 x 10^0, c = 2.47 x 10^-1,

then

(a + b) + c = (6.31 x 10^1 + 0.424 x 10^1) + 2.47 x 10^-1 → 6.73 x 10^1 + 0.0247 x 10^1 → 6.75 x 10^1,

whereas

a + (b + c) = 6.31 x 10^1 + (4.24 x 10^0 + 2.47 x 10^-1) → 6.31 x 10^1 + 4.49 x 10^0 = 6.31 x 10^1 + 0.449 x 10^1 → 6.76 x 10^1,

whence the associative law for addition does not always apply.

Examples involving adding many numbers of varying size
indicate that adding in order of increasing magnitude is
preferable to adding in the reverse order, as the sketch below
illustrates.
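The effect of the summation order is easily observed with the same
three-digit simulation. In the following minimal Python sketch (our
own; the summands are an assumed illustration), adding in decreasing
order of magnitude loses each small term, whereas adding in
increasing order lets the small terms accumulate first.

def fl3(x):
    # Round to three significant decimal digits (cf. the sketch above).
    return float(f"{x:.2e}")

def fl3_sum(xs):
    total = 0.0
    for x in xs:
        total = fl3(total + x)  # round after every addition
    return total

values = [1.00e2] + [4.00e-1] * 10   # exact sum is 104

print(fl3_sum(values))           # decreasing order: 100.0 (every 0.4 is lost)
print(fl3_sum(sorted(values)))   # increasing order: 104.0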

4. Subtracting a number from another nearly equal number
may result in loss of significance or cancellation
errors. In order to illustrate this loss of accuracy, suppose
that we evaluate f(x) = 1 - cos x for x = 0.05, using three-digit,
decimal, normalized, floating point arithmetic with rounding. Then

1 - cos(0.05) = 1 - 0.99875 → 1.00 x 10^0 - 0.999 x 10^0 → 1.00 x 10^-3.

Although the value of 1 is exact and cos(0.05) is correct to
3S when expressed as a three-digit floating point number,
their computed difference is correct to only 1S! (The two
zeros after the decimal point in 1.00 x 10^-3 pad the number.)

The approximation 0.999 ≈ cos(0.05) has a relative error of
about 2.5 x 10^-4. By comparison, the relative error of 1.00 x 10^-3
as an approximation to 1 - cos(0.05) is about 0.2, i.e., it is much
larger. Thus, subtraction of two nearly equal numbers should be
avoided whenever possible.

In the case of f(x) = 1 - cos x, one can avoid loss of
significant digits by writing

f(x) = 2 sin^2(x/2).

This last formula is more suitable for calculations when x is
close to 0. It can be verified that the more accurate
approximation of 1.25 x 10^-3 is obtained for 1 - cos(0.05)
when three-digit floating point arithmetic is used.
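The same cancellation can be demonstrated in ordinary double
precision for smaller x, where the rewritten form keeps its relative
accuracy while the literal form loses all significant digits. A minimal
Python sketch (our own):

import math

for x in (0.05, 1e-4, 1e-8):
    naive = 1.0 - math.cos(x)            # subtracts nearly equal numbers
    stable = 2.0 * math.sin(x / 2) ** 2  # algebraically identical form
    print(f"x = {x:g}: 1 - cos x = {naive:.6e}, 2 sin^2(x/2) = {stable:.6e}")

For x = 1e-8 the first form returns exactly 0, while the second
returns the correct value of about 5e-17.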

Checkpoint

1. Why is it sometimes necessary to shift the mantissa and
adjust the exponent of a floating point number?
2. Does floating point arithmetic obey the usual laws of
arithmetic?
3. Why should the subtraction of two nearly equal
numbers be avoided?

EXERCISES

Evaluate the following expressions, using three-digit
decimal normalized floating point arithmetic with rounding:

1. 6.19 x 10^2 + 5.82 x 10^2,
2. 6.19 x 10^2 + 3.61 x 10^1,
3. 6.19 x 10^2 - 5.82 x 10^2,
4. 6.19 x 10^2 - 3.61 x 10^1,
5. (3.60 x 10^3) x (1.01 x 10^-1),
6. (-7.50 x 10^-1) x (-4.44 x 10^1),
7. (6.45 x 10^2) / (5.16 x 10^-1),
8. (-2.86 x 10^-2) / (3.29 x 10^3).
9. Estimate the accumulated errors in the results of
Exercise 1, assuming that all values are correct to 3S.
10. Evaluate the following expressions, using four-digit
decimal normalized floating point arithmetic with
rounding, then recalculate them, carrying all decimal
places, and estimate the propagated error.

1. Given a = 6.842 x 10^-1, b = 5.685 x 10^1, c = 5.641 x 10^1,
find a(b - c) and ab - ac.
2. Given a = 9.812 x 10^1, b = 4.631 x 10^-1, c = 8.340 x 10^-1,
find (a + b) + c and a + (b + c).
3. Use four-digit decimal normalized floating point
arithmetic with rounding to calculate f(x) =
tan x - sin x for x = 0.1.

Since

tan x - sin x = tan x (1 - cos x) = 2 tan x sin^2(x/2),

f(x) may be written as f(x) = 2 tan x sin^2(x/2).
Repeat the calculation using this alternative
expression. Which of the two values is more
accurate?

STEP 5

ERRORS 5

Approximation to functions

An important procedure in Analysis is to represent a given function as
an infinite series of terms involving simpler or otherwise more
appropriate functions. Thus, if f is the given function, it may be
represented as the series expansion

f(x) = a0 φ0(x) + a1 φ1(x) + a2 φ2(x) + . . . ,

involving the set of functions {φi}. Mathematicians have spent a lot of
effort in discussing the convergence of series, i.e., on defining
conditions for which the partial sum

fn(x) = a0 φ0(x) + a1 φ1(x) + . . . + an φn(x)

approximates the function value f(x) ever more closely as n increases.
In Numerical Analysis, we are primarily concerned with such
convergent series; computation of the sequence of partial sums is an
approximation process in which the truncation error may be made as
small as we please by taking a sufficient number of terms into account.

1. Taylor series

The most important expansion to represent a function is the
Taylor series. If f is suitably smooth in the neighbourhood of
some chosen point x0,

f(x0 + h) = f(x0) + h f'(x0) + (h^2/2!) f''(x0) + . . . + (h^n/n!) f^(n)(x0) + Rn,

where h = x - x0 denotes the displacement from x0 to the point x in
the neighbourhood, and the remainder term is

Rn = (h^(n+1)/(n+1)!) f^(n+1)(ξ)

for some point ξ between x0 and x. (This is known as the
Lagrange form of the remainder; its derivation may be found
in Section 8.7 of Thomas and Finney (1992), cited in the
Bibliography.) Note that the ξ in this expression for Rn may be
written as

ξ = x0 + θh, where 0 < θ < 1.

The Taylor expansion converges for x within some range
including the point x0, a range which lies within the
neighbourhood of x0 mentioned above. Within this range of
convergence, the truncation error due to discarding terms
after the h^n term (equal to the value of Rn at the point x) can be
made smaller in magnitude than any positive constant by
choosing n sufficiently large. In other words, by using Rn to decide
how many terms are needed, one may evaluate a function at any
point in the range of convergence as accurately as the
accumulation of round-off error permits.

From the point of view of the numerical analyst, it is most
important that the convergence be fast enough. For example, if
we consider f(x) = sin x, we have

f'(x) = cos x, f''(x) = -sin x, f'''(x) = -cos x, . . . ,

and the expansion (about x0 = 0) for n = 2k - 1 is given by

sin x = x - x^3/3! + x^5/5! - . . . + (-1)^(k-1) x^(2k-1)/(2k-1)! + R2k-1,

with

R2k-1 = (-1)^k (x^(2k+1)/(2k+1)!) cos ξ.

Note that this expansion has only odd-powered terms; although
the polynomial approximation is of degree (2k - 1), it has only k
terms. Moreover, the absence of even-powered terms means that
the same polynomial approximation is obtained with n = 2k, and
hence R2k-1 = R2k; the remainder term R2k-1 given above is actually
the expression for R2k. Since then

|R2k| ≤ |x|^(2k+1)/(2k+1)!,

if 5D accuracy is required, it follows that we need only take k = 2
at x = 0.1, and k = 4 at x = 1 (since 9! = 362 880). On the other
hand, the expansion for the natural (base e) logarithm,

ln(1 + x) = x - x^2/2 + x^3/3 - . . . + (-1)^(n-1) x^n/n + Rn,

where

|Rn| ≤ x^(n+1)/(n+1) for 0 ≤ x ≤ 1,

is less suitable. Although only n = 4 terms are needed to give 5D
accuracy at x = 0.1, n = 13 is required for 5D accuracy at x = 0.5,
and n = 19 gives just 1D accuracy at x = 1!
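The contrast between the two expansions can be checked numerically.
The following minimal Python sketch (our own) counts the number of
terms needed before the truncation error bound falls below the 5D
tolerance of 5 x 10^-6 used above.

def sin_terms(x, tol=5e-6):
    # Terms of x - x^3/3! + x^5/5! - ... until the next term is below tol.
    term, k, n = x, 1, 0
    while abs(term) >= tol:
        n += 1
        term *= -x * x / ((2 * k) * (2 * k + 1))  # next odd-power term
        k += 1
    return n

def log_terms(x, tol=5e-6):
    # Terms of x - x^2/2 + x^3/3 - ... for ln(1 + x), using the
    # remainder bound |Rn| <= x^(n+1)/(n+1).
    n = 0
    while x ** (n + 1) / (n + 1) >= tol:
        n += 1
    return n

print(sin_terms(0.1), sin_terms(1.0))   # 2 and 4 terms (k = 2 and k = 4)
print(log_terms(0.1), log_terms(0.5))   # 4 and 13 terms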

Moreover, we remark that the Taylor series is not only used
extensively to represent functions numerically, but also to
analyse the errors involved in various algorithms (for example,
STEPS 8, 9, 10, 30, and 31).

Polynomial approximation

The Taylor series provides a simple method of polynomial
approximation (of chosen degree n),

f(x) ≈ f(x0) + h f'(x0) + (h^2/2!) f''(x0) + . . . + (h^n/n!) f^(n)(x0),

which is basic to the later discussion of various elementary
numerical procedures. Because f is often complicated, one may
prefer to execute operations such as differentiation and
integration on a polynomial approximation. Interpolation
formulae in STEPS 22 and 23 may also be used to construct
polynomial approximations.

2. Other series expansions

There are many other series expansions, such as the Fourier
series (in terms of sines and cosines), or those involving various
orthogonal functions (Legendre polynomials, Chebyshev
polynomials, Bessel functions, etc.). From the numerical
standpoint, truncated Fourier series and Chebyshev polynomial
series have proven to be most useful. Fourier series are
appropriate in treating functions with natural periodicities,
while Chebyshev series provide the most rapid convergence
of all known polynomial-based approximations.

Occasionally, it is possible to represent a function adequately
(from the numerical standpoint) by truncating a series which
does not converge in the mathematical sense. For example,
solutions are sometimes obtained in the form of asymptotic
series with leading terms which provide sufficiently accurate
numerical results. While we confine our attention in this book to
truncated Taylor series, the interested reader should be aware
that such alternative expansions exist (see, for example, Burden
and Faires (1993)).

3. Recursive procedures

While a truncated series with few terms may be a practical way
to compute values of a function, a number of arithmetic
operations is involved, and it may be preferable to employ
available recursive procedures which reduce the arithmetic
required. For example, the values of the polynomial

P(x) = a0 x^n + a1 x^(n-1) + . . . + a(n-1) x + an

and its derivative

P'(x) = n a0 x^(n-1) + (n-1) a1 x^(n-2) + . . . + a(n-1)

for x = c may be generated recursively by Horner's scheme:

b0 = a0, bk = bk-1 c + ak, k = 1, 2, . . . , n,
c0 = b0, ck = ck-1 c + bk, k = 1, 2, . . . , n - 1.

Thus, for successive values of k, one has P(c) = bn and P'(c) = cn-1.

The technique just described is also known as nested
multiplication. (Perhaps the student may want to suggest a
recursive procedure for higher derivatives of P.)
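A minimal Python version of the nested multiplication scheme (our
own sketch) evaluates P(c) and P'(c) in a single pass over the
coefficients:

def horner(coeffs, c):
    # coeffs = [a0, a1, ..., an], highest power first.
    # Returns (P(c), P'(c)) by the recurrences b_k and c_k above.
    p = 0.0   # running b_k, ends as P(c)
    dp = 0.0  # running c_k, ends as P'(c)
    for a in coeffs:
        dp = dp * c + p
        p = p * c + a
    return p, dp

# Exercise 5 below: P(x) = x^3 - 2x^2 + 2x + 3 at x = 3.1
print(horner([1, -2, 2, 3], 3.1))   # approximately (19.771, 18.43)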

Finally, it should be noted that it is common to generate
members of a set of orthogonal functions recursively.

Checkpoint

1. How do numerical analysts use the remainder term Rn in Taylor
series?
2. Why is the rate of convergence so important from the
numerical standpoint?
3. From the numerical standpoint, is it essential for a series
representation to converge in the mathematical sense?

EXERCISES

1. Find the Taylor series expansions about x = 0 for each of the
functions:

1. cos x.
2. 1/(1 - x).
3. e^x.

Determine also for each series a general remainder term.


2. For each of the functions in 1., evaluate f(0.5), using a calculator
and the first four terms of the Taylor expansions.
3. Use the remainder term found in 1.3 to find the value of n
required in the Taylor series for f(x) = e^x about x = 0 to give 5D
accuracy for all x between 0 and 1.
4. Truncate the Taylor series found in 1.3 to give linear, quadratic,
and cubic polynomial approximations for f(x) = e^x in the
neighbourhood of x = 0. Use the remainder term to estimate (to
the nearest 0.1) the range over which each polynomial
approximation yields results correct to 2D.
5. Evaluate P(3.1) and P'(3.1), where P(x) = x^3 - 2x^2 + 2x + 3, using the
technique of nested multiplication.
6. Evaluate P(2.6) and P'(2.6), where P(x) = 2x^4 - x^3 + 3x^2 + 5, using
nested multiplication. Check your values using a calculator.

STEP 6

NONLINEAR EQUATIONS 1

Non-linear algebraic and transcendental equations

The first non-linear equation encountered in algebra courses is usually
the quadratic equation

ax^2 + bx + c = 0,

and all students will be familiar with the formula for its roots:

x = (-b ± √(b^2 - 4ac)) / (2a).

The formula for the roots of a general cubic is somewhat more
complicated, and that for a general quartic usually takes several
pages to describe! We are spared further effort by a theorem which
states that there is no such formula for general polynomials of
degree higher than four. Accordingly, except in special cases (for
example, when factorization is easy), we prefer in practice to use a
numerical method to solve polynomial equations of degree higher than
two.

Another class of nonlinear equations consists of those which involve
transcendental functions such as e^x, ln x, sin x, and cos x. Useful
analytic solutions of such equations are rare, so that we are
usually forced to use numerical methods.

1. A transcendental equation

We shall use a simple mathematical problem to show that
transcendental equations do arise quite naturally. Consider the
height of a liquid in a cylindrical tank of radius r and horizontal
axis, when the tank is a quarter full (see Figure 2). Denote the
height of the liquid by h (DB in the diagram). The condition to be
satisfied is that the area of the segment ABC should be one quarter
of the area of the circle. With θ denoting the angle AOB, this task
reduces to

2(S - T) = (1/4) π r^2,

where S = (1/2) r^2 θ is the area of the sector OAB, and
T = (1/2) r^2 sin θ cos θ that of the triangle OAD. Hence

θ - sin θ cos θ = π/4.

When we have solved the transcendental equation

θ - sin θ cos θ = π/4,

we obtain h from

h = r(1 - cos θ).

FIGURE 2.
Cylindrical tank (cross-section).

2. Locating roots

Let it be required to find some or all of the roots of the nonlinear
equation f(x) = 0. Before we use a numerical method (STEPS 7, 8,
9 and 10), we should have some idea about the number,
nature and approximate location of the roots. The usual approach
involves the construction of graphs and perhaps a table of
values of the function f, in order to confirm the information
obtained from the graph.

We will now illustrate this approach by a few examples.

a)

If we do not have a calculator or computer available to
immediately plot the graph of

f(x) = sin x - x + 0.5,

we can separate f into two parts, sketch two curves on a single
set of axes, and find out whether they intersect. Thus we sketch
y = sin x and y = x - 0.5. Since |sin x| ≤ 1, we are only interested in the
interval -0.5 ≤ x ≤ 1.5 (outside which |x - 0.5| > 1). Thus we deduce
from Fig. 3 that the equation has only one real root, near x = 1.5,
and tabulate as follows:

f(1.49) = 0.0067, f(1.50) = -0.0025.

We now know that the root lies between 1.49 and 1.50, and we
can use a numerical method to obtain a more accurate answer, as
is discussed in later Steps.

b)

Consider the equation e^(-x/5) = x(x - 2)(x - 3). Again, we sketch two
curves:

y = e^(-x/5) and y = x(x - 2)(x - 3).

In order to sketch the second curve, we use the three obvious
zeros at x = 0, 2, and 3, as well as the knowledge that x(x - 2)(x - 3)
is negative for x < 0 and 2 < x < 3, but positive and increasing
steadily for x > 3. We deduce from the graph (Fig. 4) that there
are three real roots, near x = 0.2, 1.8, and 3.1, and tabulate as
follows (with f(x) = e^(-x/5) - x(x - 2)(x - 3)):

We conclude that the roots lie between 0.15 and 0.2, 1.6 and 1.8,
and 3.1 and 3.2, respectively. Note that the values in the table
were calculated to an accuracy of at least 5S. For example,
working to 5S accuracy, we have f(0.15) = 0.97045 - 0.79088 = 0.17957,
which is then rounded to 0.1796. Thus the entry in the table for
f(0.15) is 0.1796 and not 0.1795, as one might expect from
calculating 0.9704 - 0.7909.

Checkpoint

1. Why are numerical methods used in solving nonlinear
equations?
2. How does a transcendental equation differ from an
algebraic equation?
3. What kind of information is used when sketching curves
for the location of roots?

EXERCISES

1. Locate the roots of the equation x + cos x = 0.
2. Use curve sketching to roughly locate all the roots of the
equations:

a) x + 2 cos x = 0.
b) x + e^x = 0.
c) x(x - 1) - e^x = 0.
d) x(x - 1) - sin x = 0.

STEP 7

NONLINEAR EQUATIONS 2

The bisection method

The bisection method, suitable for implementation on a computer (cf.
the pseudo-code), allows one to find a root of the equation f(x) = 0,
based on the following theorem:

Theorem: If f is continuous for x between a and b, and if f(a) and f(b)
have opposite signs, then there exists at least one real root of f(x) = 0
between a and b.

1. Procedure: Suppose that a continuous function f is negative at
x = a and positive at x = b, so that there is at least one real root
between a and b. (As a rule, a and b may be found from a graph of f.)
If we calculate f((a + b)/2), which is the function value at the point of
bisection of the interval a < x < b, there are three possibilities:

1. f((a + b)/2) = 0, in which case (a + b)/2 is the root;
2. f((a + b)/2) < 0, in which case the root lies between (a + b)/2 and b;
3. f((a + b)/2) > 0, in which case the root lies between a and (a + b)/2.

Presuming there is just one root, in Case 1 the process is
terminated. In either Case 2 or Case 3, the process of bisection
of the interval containing the root can be repeated until the root
is obtained to the desired accuracy. In Figure 5, the successive
points of bisection are denoted by x1, x2, and x3.
2. Effectiveness: The bisection method is almost certain to give a
root, provided the conditions of the above theorem hold; it can
only fail if the accumulated error in the calculation of f at a
bisection point gives it a small negative value when actually it
should have a small positive value (or vice versa); the interval
subsequently chosen would then be wrong. This can be
overcome by working to sufficient accuracy, and this almost-
assured convergence is not true of many other methods of
finding roots.

One drawback of the bisection method is that it applies only to
roots of f about which f(x) changes sign. In particular, double
roots can be overlooked; one should be careful to examine f(x) in
any range where it is small, so that repeated roots about which
f(x) does not change sign are otherwise evaluated (for example,
see Steps 9 and 10). Of course, such a close examination also
avoids another nearby root being overlooked.

Finally, note that bisection is rather slow; after n iterations the
interval containing the root is of length (b - a)/2^n. However,
provided values of f can be generated readily, as when a
computer is used, the rather large number of iterations which
can be involved in the application of bisection is of relatively
little consequence.

3. Example

Solve 3xe^x = 1 to three decimal places by the bisection method.

Consider f(x) = 3x - e^(-x), which changes sign in the interval
0.25 < x < 0.27; one tabulates (working to 4D) as follows:

f(0.25) = -0.0288, f(0.27) = 0.0466.

(Ascertain graphically that there is just one root!)

Denote the lower and upper endpoints of the interval bracketing
the root at the n-th iteration by an and bn, respectively (with
a1 = 0.25 and b1 = 0.27). Then the approximation to the root at the
n-th iteration is given by xn = (an + bn)/2. Since the root is either in
[an, xn] or [xn, bn] and both intervals are of length (bn - an)/2, we
see that xn will be accurate to three decimal places when
(bn - an)/2 < 5 x 10^-4. Proceeding to bisection:

(Note that the values in the table are displayed to only 4D.)
Hence the root accurate to three decimal places is 0.258.
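A minimal Python sketch of the procedure (our own; the book's
pseudo-code is not reproduced here), applied to f(x) = 3x - e^(-x):

import math

def bisect(f, a, b, tol=5e-4):
    # Assumes f(a) and f(b) have opposite signs.
    fa = f(a)
    while (b - a) / 2 >= tol:      # midpoint not yet accurate enough
        x = (a + b) / 2
        fx = f(x)
        if fx == 0:
            return x
        if fa * fx < 0:            # root lies in [a, x]
            b = x
        else:                      # root lies in [x, b]
            a, fa = x, fx
    return (a + b) / 2

f = lambda x: 3 * x - math.exp(-x)
print(round(bisect(f, 0.25, 0.27), 3))   # 0.258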

Checkpoint

1. When may the bisection method be used to find a root of the
equation f(x) = 0?
2. What are the three possible choices after a bisection value is
calculated?
3. What is the maximum error after n iterations of the bisection
method?

EXERCISES

1. Use the bisection method to find the root of the equation

x + cos x = 0

correct to two decimal places (2D).

2. Use the bisection method to find to 3D the positive root of the
equation

x - 0.2 sin x - 0.5 = 0.

3. Each equation in Exercises 2(a)-2(c) of Step 6 has only one root.
For each equation, use the bisection method to find the root
correct to 2D.

STEP 8

NONLINEAR EQUATIONS 3

Method of false position

As mentioned in the Prologue, the method of false position dates back
to the ancient Egyptians. It remains an effective alternative to the
bisection method for solving the equation f(x) = 0 for a real root
between a and b, given that f(x) is continuous and f(a) and f(b) have
opposite signs. The algorithm is suitable for automatic computation
(pseudo-code).

1. PROCEDURE

The curve y = f(x) is not generally a straight line. However, one may
join the points (a, f(a)) and (b, f(b)) by the straight line

y - f(a) = [(f(b) - f(a)) / (b - a)] (x - a).

This straight line cuts the x-axis at (X, 0), where

-f(a) = [(f(b) - f(a)) / (b - a)] (X - a),

so that

X = a - f(a)(b - a)/(f(b) - f(a)) = (a f(b) - b f(a)) / (f(b) - f(a)).

Suppose that f(a) is negative and f(b) is positive. As in the bisection
method, there are three possibilities:

1. f(X) = 0, in which case X is the root;
2. f(X) < 0, in which case the root lies between X and b;
3. f(X) > 0, in which case the root lies between X and a.

Again, in Case 1 the process is terminated; in either Case 2 or Case 3,
the process can be repeated until the root is obtained to the desired
accuracy. In Fig. 6, the successive points where the straight lines cut
the axis are denoted by x1, x2, x3.
2. EFFECTIVENESS AND THE SECANT METHOD

Like the bisection method, the method of false position has
almost assured convergence, and it may converge to a root faster.
However, it may happen that most or all the calculated values of X are
on the same side of the root, in which case convergence may be slow
(Fig. 7). This is avoided in the secant method, which resembles the
method of false position except that no attempt is made to ensure that
the root is enclosed; starting with two approximations to the root (x0,
x1), further approximations x2, x3, . . . are computed from

xn+1 = xn - f(xn)(xn - xn-1) / (f(xn) - f(xn-1)).

Convergence is no longer assured, but the process is simpler (the sign
of f(xn+1) is not tested) and often converges faster.

With respect to speed of convergence of the secant method, writing
en = xn - ξ for the error at the n-th step (where ξ denotes the root),
one has at the (n+1)-th step:

en+1 = en - f(xn)(en - en-1)/(f(xn) - f(xn-1)).

Hence, expanding f(xn) = f(ξ + en) and f(xn-1) = f(ξ + en-1) in terms of
the Taylor series,

en+1 ≈ en en-1 f''(ξ) / (2 f'(ξ)),

where we have used the fact that f(ξ) = 0. Thus we see that en+1 is
proportional to en en-1, which may be expressed in mathematical
notation as

|en+1| ∝ |en| |en-1|.

We seek k such that

|en+1| ∝ |en|^k, whence k^2 = k + 1 and k = (1 + √5)/2 ≈ 1.618.

Hence the speed of convergence is faster than linear (k = 1), but slower
than quadratic (k = 2). This rate of convergence is sometimes referred
to as superlinear convergence.

3. EXAMPLE

Use the method of false position to solve

3xe^x = 1, i.e., f(x) = 3x - e^(-x) = 0,

again taking a = 0.25 and b = 0.27, so that (working to 6D)
f(a) = -0.028801 and f(b) = 0.046621. Then

x1 = (a f(b) - b f(a)) / (f(b) - f(a)) = 0.257637,

and

f(x1) = f(0.257637) = 3 x 0.257637 - 0.772875 = 0.772912 - 0.772875 = 0.000036.

The student may verify that doing one more iteration of the method of
false position yields an estimate x2 = 0.257628, for which the function
value is less than 5 x 10^-6. Since x1 and x2 agree to 4D, we conclude that
the root is 0.2576, correct to 4D.
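A minimal Python sketch of the method (our own), again applied to
f(x) = 3x - e^(-x); removing the sign test and simply keeping the two
most recent points would turn it into the secant method.

import math

def false_position(f, a, b, tol=5e-6, max_iter=50):
    # Keep the root bracketed between a and b.
    fa, fb = f(a), f(b)
    x = a
    for _ in range(max_iter):
        x = (a * fb - b * fa) / (fb - fa)   # where the chord cuts the axis
        fx = f(x)
        if abs(fx) < tol:
            break
        if fa * fx < 0:                     # root between a and x
            b, fb = x, fx
        else:                               # root between x and b
            a, fa = x, fx
    return x

f = lambda x: 3 * x - math.exp(-x)
print(round(false_position(f, 0.25, 0.27), 4))   # 0.2576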

Checkpoint

1. When may the method of false position be used to find a root
of the equation f(x) = 0?
2. On what geometric construction is the method of false
position based?

EXERCISES

1. Use the method of false position to find the smallest root of the
equation f(x) = 2 sin x + x - 2 = 0, stopping when

2. Compare the results obtained when you use

a) the bisection method,
b) the method of false position, and
c) the secant method

with starting values 0.7 and 0.9 to solve the equation

3 sin x = x + 1/x.

3. Use the method of false position to find the root of the
equation

4. Each equation in Exercises 2(a)-2(c) of Step 6 has only one
root. Use the method of false position to find each root,
stopping when

STEP 9

NONLINEAR EQUATIONS 4

The method of simple iteration

The method of simple iteration involves writing the equation f(x) = 0 in
a form x = φ(x), suitable for the construction of a sequence of
approximations to some root in a repetitive fashion.

1. Procedure

The iteration procedure follows: In some way, we obtain a rough
approximation x0 of the desired root, which may then be substituted
into the right-hand side to give a new approximation, x1 = φ(x0). The new
approximation is again substituted into the right-hand side to give a
further approximation x2 = φ(x1), etc., until (hopefully) a sufficiently
accurate approximation to the root is obtained. This repetitive process,
based on xn+1 = φ(xn), is called simple iteration; provided that |xn+1 - xn|
decreases as n increases, the process tends to ξ = φ(ξ), where ξ denotes
the root.

2. Example

Use simple iteration to find the root of the equation

3xe^x = 1

to an accuracy of 4D.

One first writes

x = e^(-x)/3 = φ(x).

Assuming x0 = 1, successive iterations yield

x1 = 0.12263, x2 = 0.29486,
x3 = 0.24821, x4 = 0.26007,
x5 = 0.25700, x6 = 0.25779,
x7 = 0.25759, x8 = 0.25764.

Thus, we see that after eight iterations the root is 0.2576 to 4D. A
graphical interpretation of the first three iterations is shown in Fig. 8.
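A minimal Python sketch of simple iteration (our own; the tolerance is
chosen to match the 4D accuracy asked for above):

import math

def fixed_point(phi, x0, tol=1e-4, max_iter=100):
    # Iterate x_(n+1) = phi(x_n) until successive values agree within tol.
    x = x0
    for n in range(1, max_iter + 1):
        x_new = phi(x)
        if abs(x_new - x) < tol:
            return x_new, n
        x = x_new
    return x, max_iter

phi = lambda x: math.exp(-x) / 3   # from 3x e^x = 1
root, n = fixed_point(phi, 1.0)
print(round(root, 4), n)           # 0.2576 after 8 iterations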

3. Convergence

Whether an iteration procedure converges rapidly, slowly, or indeed at
all, depends on the choice of the function φ(x), as well as on the
starting value x0. For example, the equation x^2 = 3 has the two real
roots x = ±√3 = ±1.732051 (to 6D). It can be given the form

x = 3/x = φ(x),

which suggests the iteration

xn+1 = 3/xn.

However, if the starting value x0 = 1 is used, the successive iterations
yield

x1 = 3, x2 = 1, x3 = 3, x4 = 1, . . . ,

i.e., the sequence oscillates and never converges.

We can examine the convergence of the iteration process

xn+1 = φ(xn) to ξ = φ(ξ)

with the aid of the Taylor series

φ(xk) = φ(ξ) + (xk - ξ) φ'(ηk),

where ηk is a point between the root ξ and the approximation xk. We have

x1 - ξ = φ(x0) - φ(ξ) = (x0 - ξ) φ'(η0),
x2 - ξ = φ(x1) - φ(ξ) = (x1 - ξ) φ'(η1),
. . .
xn+1 - ξ = φ(xn) - φ(ξ) = (xn - ξ) φ'(ηn).

Multiplying the n + 1 rows together and cancelling the common factors
x1 - ξ, x2 - ξ, . . . , xn - ξ leaves

xn+1 - ξ = (x0 - ξ) φ'(η0) φ'(η1) . . . φ'(ηn),

so that the absolute error |xn+1 - ξ| can be made as small as we please by
sufficient iteration if |φ'| < 1 in the neighbourhood of the root.

Note that φ(x) = 3/x has derivative φ'(x) = -3/x^2, so that |φ'(x)| > 1
for |x| < √3.

Checkpoint

1. Assuming x0 = 1, show by simple iteration that one root of the
equation 2x - 1 - 2 sin x = 0 is 1.4973.
2. Use simple iteration to find (to 4D) the root of the equation
x + cos x = 0.

STEP 10

NONLINEAR EQUATIONS 5

The Newton-Raphson iterative method

The Newton-Raphson method is suitable for implementation on a
computer (pseudo-code). It is a process for the determination of a real
root of an equation f(x) = 0, given just one point close to the desired
root. It can be viewed as a limiting case of the secant method (Step 8),
or as a special case of simple iteration (Step 9).

1. Procedure

Let x0 denote the known approximate value of the root ξ of f(x) = 0, and
h the difference between the true value and the approximate value, i.e.,

ξ = x0 + h.

The second degree, terminated Taylor expansion (STEP 5) about x0 is

f(ξ) = f(x0 + h) = f(x0) + h f'(x0) + (h^2/2) f''(η),

where η lies between x0 and ξ. Ignoring the remainder term and noting
that f(ξ) = 0, we have

0 ≈ f(x0) + h f'(x0),

whence

h ≈ -f(x0)/f'(x0),

and, consequently,

x1 = x0 - f(x0)/f'(x0)

should be a better estimate of the root than x0. Even better
approximations may be obtained by repetition (iteration) of the
process, which then becomes

xn+1 = xn - f(xn)/f'(xn).

Note that if f is a polynomial, you can use the recursive procedure of
STEP 5 to compute f(xn) and f'(xn).

The geometrical interpretation is that each iteration provides the point
at which the tangent at the original point cuts the x-axis (Figure 9).
Thus the equation of the tangent at (x0, f(x0)) is

y - f(x0) = f'(x0)(x - x0),

so that (x1, 0) corresponds to

-f(x0) = f'(x0)(x1 - x0),

whence x1 = x0 - f(x0)/f'(x0).

2. Example

We will use the Newton-Raphson method to find the positive root of
the equation sin x = x^2, correct to 3D.

It will be convenient to use the method of false position to obtain
an initial approximation. Tabulation of f(x) = sin x - x^2 yields

f(0.75) = 0.1191, f(1) = -0.1585.

With numbers displayed to 4D, we see that there is a root in the
interval 0.75 < x < 1 at approximately

x0 = (0.75 x (-0.1585) - 1 x 0.1191) / (-0.1585 - 0.1191) = 0.857.

Next, we will use the Newton-Raphson method:

f(x) = sin x - x^2 and f'(x) = cos x - 2x,

yielding

f(x0) = 0.0214 and f'(x0) = -1.0593.

Consequently, a better approximation is

x1 = x0 - f(x0)/f'(x0) = 0.857 + 0.0214/1.0593 = 0.8772.

Repeating this step, we obtain

f(x1) = -0.0005 and f'(x1) = -1.1151,

so that

x2 = x1 - f(x1)/f'(x1) = 0.8772 - 0.0005 = 0.8767.

Since f(x2) = 0.0000, we conclude that the root is 0.877 to 3D.
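A minimal Python sketch of the iteration (our own), for sin x = x^2
with the starting value found above:

import math

def newton(f, df, x0, tol=5e-7, max_iter=20):
    # Newton-Raphson: x_(n+1) = x_n - f(x_n)/f'(x_n).
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

f = lambda x: math.sin(x) - x * x
df = lambda x: math.cos(x) - 2 * x
print(round(newton(f, df, 0.857), 3))   # 0.877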

3. Convergence

If we write

φ(x) = x - f(x)/f'(x),

the Newton-Raphson iteration expression

xn+1 = xn - f(xn)/f'(xn)

may be rewritten

xn+1 = φ(xn).

We have observed (STEP 9) that, in general, the iteration method
converges when |φ'(x)| < 1 near the root. In the case of Newton-Raphson,
we have

φ'(x) = f(x) f''(x) / [f'(x)]^2,

so that the criterion for convergence is

|f(x) f''(x)| < [f'(x)]^2,

i.e., convergence is not as assured as, say, for the bisection method.

4. Rate of convergence

Denoting the error at the n-th iteration by en = ξ - xn, the second degree
terminated Taylor expansion about xn is

0 = f(ξ) = f(xn + en) = f(xn) + en f'(xn) + (en^2/2) f''(ηn),

where ηn lies between xn and ξ. Since f(ξ) = 0, we find

f(xn) = -en f'(xn) - (en^2/2) f''(ηn).

But, by the Newton-Raphson formula,

xn+1 = xn - f(xn)/f'(xn),

whence the error at the (n + 1)-th iteration is

en+1 = ξ - xn+1 = en + f(xn)/f'(xn) = -(en^2/2) f''(ηn)/f'(xn) ≈ -(en^2/2) f''(ξ)/f'(ξ),

provided en is sufficiently small.

This result states that the error at the (n + 1)-th iteration is proportional
to the square of the error at the n-th iteration; hence, if the factor
f''/2f' is of order unity, an answer correct to one decimal place at one
iteration should be accurate to two places at the next iteration, four at
the next, eight at the next, etc. This quadratic (second-order)
convergence outstrips the rate of convergence of the methods of
bisection and false position!

In relatively little-used computer programs, it may be wise to prefer the
methods of bisection or false position, since their convergence is
virtually assured. However, for hand calculations or for computer
routines in constant use, the Newton-Raphson method is usually
preferred.

5. The square root

One application of the Newton-Raphson method is in the
computation of square roots. Finding √a is equivalent to finding the
positive root of x^2 = a, i.e., of

f(x) = x^2 - a = 0.

Since f'(x) = 2x, we have the Newton-Raphson iteration formula:

xn+1 = xn - (xn^2 - a)/(2xn) = (1/2)(xn + a/xn),

a formula known to the ancient Greeks. Thus, if a = 16 and x0 = 5, we
find to 3D

x1 = (5 + 3.2)/2 = 4.1, x2 = (4.1 + 3.9024)/2 = 4.0012, and
x3 = (4.0012 + 3.9988)/2 = 4.0000.

Checkpoint

1. What is the geometrical interpretation of the Newton-Raphson
iterative procedure?
2. What is the convergence criterion for the Newton-Raphson
method?
3. What major advantage has the Newton-Raphson method over
some other methods?

EXERCISES

1. Use the Newton-Raphson method to find to 4S the (positive) root of

3xe^x = 1.

Note that tables of natural (Naperian) logarithms are more readily
available than tables of the exponential, so that one might prefer to
solve the equivalent equation f(x) = loge 3x + x = loge 3 + loge x + x = 0.

2. Derive the Newton-Raphson iteration formula

xn+1 = xn - (xn^k - a)/(k xn^(k-1))

for finding the k-th root of a.

3. Compute the square root of 10 to 5 significant digits from an initial
guess.

4. Use the Newton-Raphson method to find to 4D the root of the
equation

x + cos x = 0.

5. Use the Newton-Raphson method to find to 4D the root of each
equation in Exercises 1-3.

STEP 11

SYSTEMS OF LINEAR EQUATIONS 1

Solution by elimination

Many physical systems can be modelled by a set of linear equations
which describe relationships between system variables. In simple
cases, there are two or three variables; in complex systems (for
example, in a linear model of the economy of a country) there may be
several hundred variables. Linear systems also arise in connection with
many problems of numerical analysis. Examples of these are the
solution of partial differential equations by finite difference methods,
statistical regression analysis, and the solution of eigenvalue problems
(see, for example, Gerald and Wheatley (1994)). A brief introduction to
this last topic is given in STEP 17.

Hence there arises a need for rapid and accurate methods for solving
systems of linear equations. The student will already be familiar
with solving systems of equations in two or three variables by
elimination. This Step presents a formal description of the Gauss
elimination method for n-variable systems and discusses certain
errors which might arise in their solutions. Partial pivoting, a
technique which enhances the accuracy of this method, is discussed
in the next Step.

1. Notation and definitions

We first consider an example in three variables: a set of three linear
equations in the three variables or unknowns x, y, z. In solving such a
system, we determine a set of values for x, y and z which satisfies each
of the equations. In other words, if values (X, Y, Z) satisfy all equations
simultaneously, then they constitute a solution of the system.

Consider now the general system of n equations in n variables:

a11 x1 + a12 x2 + . . . + a1n xn = b1,
a21 x1 + a22 x2 + . . . + a2n xn = b2,
. . .
an1 x1 + an2 x2 + . . . + ann xn = bn.

Obviously, the dots indicate similar terms in the variables and the
remaining equations. In this notation, the variables are denoted by
x1, x2, . . . , xn; they are sometimes referred to as xi, i = 1, 2, . . . , n.
The coefficients of the variables may be detached and written in a
coefficient matrix A. The notation aij is used to denote the coefficient
of xj in the i-th equation; note that aij occurs in the i-th row and j-th
column of the matrix.

The numbers on the right-hand side of the equations are called
constants; they may be written in a column vector b. The coefficient
matrix may be combined with the constant vector to form the
augmented matrix [A | b]. As a rule, one works in the elimination
method directly with the augmented matrix.

2. The existence of solutions

For any particular system of n linear equations, there may be a single
solution (X1, X2, . . . , Xn), or no solution, or an infinity of solutions. In the
theory of linear algebra, theorems are given and conditions stated
which allow one to decide into which category a given system falls. We
shall not treat the question of existence of solutions in this book, but
for the benefit of students familiar with matrices and determinants, we
state the theorem:

Theorem: A linear system of n equations in n variables with coefficient
matrix A and non-zero constant vector b has a unique solution if and
only if the determinant of A is not zero.

If b = 0, the system has the trivial solution x = 0. It has no other
solution unless the determinant of A is zero, when it has an infinite
number of solutions.

If the determinant of A is non-zero, there exists an n x n matrix, called
the inverse of A (denoted by A^-1), such that the matrix product of A^-1
and A is equal to the n x n identity or unit matrix I. The elements of
the identity matrix are 1 on the main diagonal and 0 elsewhere. Its
algebraic properties include Ix = x for any n x 1 vector x, and IM = MI =
M for any n x n matrix M. For example, the 3 x 3 identity matrix is

1 0 0
0 1 0
0 0 1

Multiplication of the equation Ax = b from the left by the inverse matrix
A^-1 yields A^-1 A x = A^-1 b, whence the unique solution is x = A^-1 b
(since A^-1 A = I and Ix = x). Thus, in principle, a linear system with a
unique solution may be solved by first evaluating A^-1 and then A^-1 b.
This approach is discussed in more detail in the optional Step 14. The
Gauss elimination method is a more general and efficient direct
procedure for solving systems of linear equations.

3. Gauss elimination method

In Gauss elimination, the given system of equations is transformed into
an equivalent system which has upper triangular form; this form is
readily solved by a process called back-substitution. We shall
demonstrate the process by solving the example of Section 1.

Transformation into upper triangular form:

First stage: Eliminate x from equations R2 and R3 using equation R1.

Second stage: Eliminate y from R3' using R2'.

The system is now in upper triangular form.

Solution by back-substitution:

The system in upper triangular form is easily solved by obtaining z
from R3'', then y from R2'', and finally x from R1''. This procedure is
called back-substitution.

4. The transformation operations

During transformation of a system to upper triangular form, one or
more of the following elementary operations are used at every step:

1. multiplication of an equation by a constant;
2. subtraction of a multiple of one equation from another equation;
3. interchange of two equations.

Mathematically speaking, it should be clear to the student that
performing elementary operations on a system of linear equations
leads to equivalent systems with the same solutions. This statement
requires a proof, which may be found as a theorem in books on linear
algebra (cf., for example, Anton (1993)). It forms the basis of all
elimination methods for solving systems of linear equations.

5. General treatment of the elimination process

We will now describe the application of the elimination process to a
general n x n linear system, written in general notation, which is
suitable for implementation on a computer (pseudo-code). Before
considering the general system, the process will be described by
means of a system of three equations. We begin with the augmented
matrix, and display in the column (headed by m) the multipliers
required for the transformations.

First stage: Eliminate the coefficients a21 and a31, using row R1 and the
multipliers m21 = a21/a11 and m31 = a31/a11.

Second stage: Eliminate the coefficient a32, using row R2' and the
multiplier m32 = a32/a22.

The matrix is now in the form which permits back-substitution; the
full system of equations at this stage is equivalent to the original
system. Hence follows the solution by back-substitution.

We now display the process for the general n x n system, omitting the
primes (') for convenience. Recall that the original augmented matrix
was [A | b].

First stage: Eliminate the coefficients a21, a31, . . . , an1 by calculating the
multipliers

mi1 = ai1/a11, i = 2, 3, . . . , n,

and then replacing

Ri by Ri - mi1 R1, i = 2, 3, . . . , n.

This leads to the modified augmented system.

Second stage: Eliminate the coefficients a32, a42, . . . , an2 by calculating
the multipliers

mi2 = ai2/a22, i = 3, 4, . . . , n,

and then replacing

Ri by Ri - mi2 R2, i = 3, 4, . . . , n.

We continue to eliminate unknowns, going on to columns 3, 4, . . . , so
that by the beginning of the k-th stage the first k - 1 columns have been
reduced.

k-th stage: Eliminate ak+1,k, ak+2,k, . . . , an,k by calculating the multipliers

mik = aik/akk, i = k + 1, . . . , n,

and then replacing

Ri by Ri - mik Rk, i = k + 1, . . . , n.

Continuing in this way, we obtain after n - 1 stages an augmented
matrix in which the original coefficient matrix has been transformed
into upper triangular form.

We now back-substitute: We find xn = bn/ann, and then

xi = (bi - ai,i+1 xi+1 - . . . - ain xn)/aii, for i = n - 1, n - 2, . . . , 1.
Notes

1. The diagonal elements akk, used at the k-th stage of the successive
elimination, are referred to as pivot elements.

2. In order to proceed from one stage to the next, the pivot elements
must be non-zero, since they are used as divisors in the multipliers
and in the final solution. If at any stage a pivot element vanishes,
rearrange the remaining rows of the matrix in order to obtain a non-
zero pivot; if this is not possible, then the system of linear equations
does not have a unique solution.

3. If a pivot element is small compared with the elements in its
column which have to be eliminated, the corresponding multipliers
used at that stage will be larger than one in magnitude. The use of
large multipliers in elimination and back-substitution leads to
magnification of round-off errors; this can be avoided by using the
partial pivoting described in Step 12.

4. During the elimination method, an extra check column,
containing the sums of the rows, should be computed at each stage (cf.
the following example). Its elements are treated in exactly the same
way as the equation coefficients. After each stage is completed, the
new row sums should equal the new check column elements (within
round-off error).
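The whole procedure fits in a few lines of Python. A minimal sketch
(our own; it assumes that every pivot is non-zero, i.e., no row
interchanges are needed; see Step 12 for pivoting):

def gauss_solve(A, b):
    # Solve Ax = b by Gauss elimination and back-substitution.
    # A is a list of n rows of n coefficients, b a list of n constants.
    n = len(A)
    A = [row[:] for row in A]       # work on copies
    b = b[:]
    for k in range(n - 1):          # forward elimination, n - 1 stages
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]   # multiplier m_ik
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n                   # back-substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Exercise a) below: solution is (2, 1, 3).
print(gauss_solve([[1, 1, -1], [2, -1, 1], [3, 2, -4]], [0, 6, -4]))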

6. A numerical example with check column

Consider the system of equations:

0.34 x1 - 0.58 x2 + 0.94 x3 = 2.0,
0.27 x1 + 0.42 x2 + 0.134 x3 = 1.5,
0.20 x1 - 0.51 x2 + 0.54 x3 = 0.8.

The manipulations leading to the solution are set out in tabular form
below. For the purpose of illustration, the calculations have been
executed in three-digit floating point arithmetic. For example, at the
first stage, the multiplier 0.794 arises as follows:

0.27/0.34 = 0.794 (to 3S),

while the value -0.0900 arises from the sequence of operations

1.5 - 0.794 x 2.0 → 1.5 - 1.59 = -0.0900.

Work with so few significant digits leads to errors in the solution, as is
shown below by an examination of the so-called residuals.
Next, we back-substitute:

x3 = 3.10, x2 = 2.1, x1 = 0.89.

As a check, we sum the original three equations to obtain
0.81 x1 - 0.67 x2 + 1.61 x3 = 4.3. Inserting the solution yields
0.81 x 0.89 - 0.67 x 2.1 + 1.61 x 3.10 = 4.3049 ≈ 4.3.

In order to assess the accuracy of the solution, we insert the solution
into the left-hand side of each of the original equations and compare
the results with the right-hand side constants, which yields the
residuals.

It seems reasonable to conclude that, if the residuals are small, the
solution is a good one. As a rule, this is true. However, at times, small
residuals do not indicate that a solution is acceptable. This aspect is
discussed under the heading ill-conditioning in Step 12.

Checkpoint

1. What types of operations are permissible during the
transformation of the augmented matrix?
2. What is the final form of the coefficient matrix, before
back-substitution begins?
3. What are pivot elements? Why must small pivot elements be
avoided if possible?

EXERCISES

Solve the following systems by Gauss elimination:

a)
x1 + x2 - x3 = 0,
2x1 - x2 + x3 = 6,
3x1 + 2x2 - 4x3 = -4.

b)
5.6x + 3.8y + 1.2z = 1.4,
3.1x + 7.1y - 4.7z = 5.1,
1.4x - 3.4y + 8.3z = 2.4.

c)
2x + 6y + 4z = 5,
6x + 19y + 12z = 6,
2x + 8y + 14z = 7.

d)
1.3x + 4.6y + 3.1z = -1,
5.6x + 5.8y + 7.9z = 2,
4.2x + 3.2y + 4.5z = -3.

STEP 12

SYSTEMS OF LINEAR EQUATIONS 2

Errors and ill-conditioning

For any system of linear equations, the question concerning the errors
in a solution obtained by a numerical method is not readily answered. A
general discussion of the problems it raises is beyond the scope of this
book. However, some sources of errors will be indicated below.

1. Errors in the coefficients and constants

In many practical cases, the coefficients of the variables and the
constants on the right-hand sides of the equations are obtained from
observations of experiments or from other numerical calculations. They
will have errors, and therefore, once the solution of a system has been
found, it too will contain errors. In order to show how this kind of error
is carried through calculations, we shall solve a simple example in two
variables, assuming that the constants have errors at most as large as
+/- 0.01. Consider the system:

2x + y = 4 + e1,
x - y = -1 + e2, where |e1| ≤ 0.01 and |e2| ≤ 0.01.

Solving this system by Gauss elimination and back-substitution
yields:

(3/2) y = 3 - e2 + e1/2,

whence (3/2) y lies between 2.985 and 3.015, i.e., y lies between 1.990
and 2.010. From the first equation, we now obtain

x = (4 + e1 - y)/2,

so that x lies between 0.99 and 1.01.

If the coefficients and constants of the system were exact, its exact
solution would be x = 1, y = 2. However, since the constants are not
known exactly, it does not make sense to talk of an exact solution; all
one can say is that 0.99 ≤ x ≤ 1.01 and 1.99 ≤ y ≤ 2.01.

In this example, the error in the solution is of the same order as that in
the constants. Yet, in general, the errors in the solutions are greater
than those in the constants.

2. Round-off errors and numbers of operations

Numerical methods for solving systems of linear equations involve
large numbers of arithmetic operations. For example, the Gauss
elimination of Step 11, according to Atkinson (1993), involves
(n^3 + 3n^2 - n)/3 multiplications/divisions and (2n^3 + 3n^2 - 5n)/6
additions/subtractions in the case of a system with n unknowns.

Since round-off errors are propagated at each step of an algorithm, the
growth of round-off errors can be such that, when n is large, a solution
differs greatly from the true one.

3. Partial pivoting

In Gauss elimination, the build-up of round-off errors may be reduced by
rearranging the equations so that the use of large multipliers in the
elimination operations is avoided. The corresponding procedure is
known as partial pivoting (or pivotal condensation). The general
rule to follow is: At each elimination stage, rearrange the rows of
the augmented matrix so that the new pivot element is larger in
absolute value than (or equal to) any element beneath it in its column.

Use of this rule ensures that the magnitudes of the multipliers
employed at each stage are less than or equal to unity. A simple example
in three-digit floating point arithmetic follows.

In the following tabular solution, the pivot elements are in bold print.
(Note that the magnitude of all the multipliers is less than unity.)

If no pivoting is done, it may be verified that using three-digit floating
point arithmetic yields the solution z = 2.93, y = 2.00 and x = 1.30. Since
the true solution to 3S is z = 2.93, y = 1.96, and x = 1.36, the solution
obtained by partial pivoting is better than the one obtained without
pivoting.
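In code, partial pivoting adds a single step to the elimination loop of
the Step 11 sketch: before eliminating in column k, swap into row k
whichever remaining row has the largest entry in that column. A
minimal Python sketch (our own):

def gauss_solve_pivot(A, b):
    # Gauss elimination with partial pivoting (cf. the Step 11 sketch).
    n = len(A)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n - 1):
        # Bring the largest |a_ik| (i >= k) into the pivot position.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[p][k] == 0:
            raise ValueError("no non-zero pivot: no unique solution")
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]   # |m| <= 1 by construction
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x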

4. Ill-conditioning

Certain systems of linear equations are such that their solutions are
very sensitive to small changes (and therefore to errors) in their
coefficients and constants. We give an example below in which 1%
changes in two coefficients change the solution by a factor of 10 or
more. Such systems are said to be ill-conditioned. If a system is ill-
conditioned, a solution obtained by a numerical method may differ
greatly from the exact solution, even though great care is taken to
keep round-off and other errors very small.

As an example, consider the system of equations:

which has the exact solution x = 1, y = 2. Changing coefficients of the
second equation by 1% and the constant of the first equation by 5%
yields the system:

It is easily verified that the exact solution of this system is x = 11,
y = -18.2. This solution differs greatly from the solution of the first
system. Both these systems are said to be ill-conditioned.

If a system is ill-conditioned, then the usual procedure of checking a
numerical solution by calculation of the residuals may not be valid. In
order to see why this is so, suppose we have an approximation X to the
true solution x. The vector of residuals r is then given by r = b - AX
= A(x - X). Thus e = x - X satisfies the linear system Ae = r. In
general, r will be a vector with small components. However, in an ill-
conditioned system, even if the components of r are small, so that it
is 'close' to 0, the solution of the linear system Ae = r could differ
greatly from the solution of the system Ae = 0, namely 0. It then
follows that X may be a poor approximation to x despite the residuals
in r being small.

Obtaining accurate solutions to ill-conditioned linear systems can be
difficult, and many tests have been proposed for determining whether
or not a system is ill-conditioned. A simple introduction to this topic is
given in the optional STEP 16.

Checkpoint

1. Describe the types of error which may affect the solution of a
system of linear equations.
2. How can partial pivoting contribute to a reduction of errors?
3. Is it true to say that an ill-conditioned system does not have
an exact solution?

EXERCISES

1. Find the range of solutions for the following system, assuming that
the maximum errors in the constants are as shown:

2. Solve the following systems by Gauss elimination:

a)
b)
c)

3. Use 4D normalized floating point arithmetic to solve the
following system with and without partial pivoting. Compare your
answers with the exact answer x = 1.000 x 10^0, y = 5.000 x 10^-1:

4. Show that for a linear system of three unknowns, Gauss
elimination requires three divisions, eight multiplications, and eight
subtractions for the triangularization, and a further three divisions,
three multiplications and three additions/subtractions for the
back-substitution.

5. Derive the general formulae in Section 2 for the numbers of required
arithmetic operations.

6. Study the ill-conditioning example of Section 4 in the following
ways:

a) Plot the lines of the first system on graph paper; then describe ill-
conditioning in geometrical terms when only two unknowns are
involved.

b) Insert the solution of the first system into the left-hand side of the
second system. Does x = 1, y = 2 look like a good solution to the second
system? Comment!

c) Insert the solution of the second system into the left-hand side of
the first system. Comment!

d) The system

is an example of ill-conditioning, due to T. S. Wilson. Insert the
'solution' (6.0, -7.2, 2.9, -0.1) into the left-hand side. Would you claim
this solution to be a good one? Next insert the solution (1.0, 1.0, 1.0, 1.0).
Comment on the dangers of laying claims!

STEP 13

SYSTEMS OF LINEAR EQUATIONS 3

The Gauss-Seidel iterative method

The methods used in the previous Steps for solving systems of linear
equations are termed direct methods. If a direct method is used, and
round-off and other errors do not arise, an exact solution is reached
after a finite number of arithmetic operations. In general, of course,
round-off errors do arise; and when large systems are being solved by
direct methods, the growing errors can become so large as to render
the results obtained quite unacceptable.

1. Iterative methods

Iterative methods provide an alternative approach. Recall that an
iterative method starts with an approximate solution and uses it, by
means of a recurrence formula, to provide another approximate
solution; by repeated application of the formula, a sequence of
solutions is obtained which (under suitable conditions) converges to
the exact solution. Iterative methods have the advantages of simplicity
of operation and ease of implementation on computers, and they are
relatively insensitive to the propagation of errors; they would be used in
preference to direct methods for solving linear systems involving
several hundred variables, particularly if many of the coefficients were
zero. Systems of over 100 000 variables have been successfully solved
on computers by iterative methods, whereas systems of 10 000 or more
variables are difficult or impossible to solve by direct methods.

2. The Gauss-Seidel method

This text will only present one iterative method for linear equations,
due to Gauss and improved by Seidel. We shall use this method to
solve the system

x1 + 0.2 x2 + 0.1 x3 = 1.3,
0.2 x1 + x2 + 0.1 x3 = 1.3,
0.2 x1 + 0.1 x2 + x3 = 1.3.

It is suitable for implementation on computers (pseudo-code).

The first step is to solve the first equation for x1, the second for x2, and
the third for x3, when the system becomes:

x1 = 1.3 - 0.2 x2 - 0.1 x3,    (1)
x2 = 1.3 - 0.2 x1 - 0.1 x3,    (2)
x3 = 1.3 - 0.2 x1 - 0.1 x2.    (3)

An initial solution is now assumed; we shall use x1 = 0, x2 = 0 and x3 = 0.
Inserting these values into the right-hand side of Equation (1) yields
x1 = 1.3. This value for x1 is used immediately, together with the
remainder of the initial solution (i.e., x2 = 0 and x3 = 0), in the
right-hand side of Equation (2), and yields x2 = 1.3 - 0.2 x 1.3 - 0 = 1.04.
Finally, the values x1 = 1.3 and x2 = 1.04 are inserted into Equation (3)
to yield x3 = 0.936. This second approximate solution (1.3, 1.04, 0.936)
completes the first iteration.

Beginning with this second approximation, we repeat the process to obtain a third approximation, etc. Under certain conditions relating to the coefficients of the system, this sequence will converge to the exact solution.

We can set up recurrence relations which show clearly how the iterative process proceeds. Denoting the k-th and (k+1)-th approximations by (x1(k), x2(k), x3(k)) and (x1(k+1), x2(k+1), x3(k+1)), respectively, we find

We begin with the starting vector x(0) = (x1(0), x2(0), x3(0)), all components of which are 0, and then apply these relations repeatedly in the order (1)', (2)' and (3)'. Note that, when we insert values for x1, x2 and x3 into the right-hand sides, we always use the most recent estimates found for each unknown.
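In code, one sweep of these relations is only a few lines. The following is a minimal Python sketch, ours rather than the text's own program: the function name gauss_seidel is our choice, and the matrix below is an illustrative diagonally dominant system with exact solution (1, 1, 1), not the system used in this Step, whose coefficients are not reproduced here.

import numpy as np

def gauss_seidel(A, b, x0=None, tol=1e-6, max_iter=100):
    # Solve Ax = b by Gauss-Seidel iteration; each sweep updates
    # x[i] from the most recent values of the other unknowns,
    # exactly as in relations (1)', (2)', (3)'.
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    for k in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x_old[i+1:]
            x[i] = (b[i] - s) / A[i, i]
        # one common stopping rule: largest change in any component
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x

A = np.array([[10.0, 2.0, 1.0],
              [ 1.0, 5.0, 1.0],
              [ 2.0, 3.0, 10.0]])
b = np.array([13.0, 7.0, 15.0])
print(gauss_seidel(A, b))   # converges rapidly to (1, 1, 1)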

3. Convergence

The sequence of solutions produced by the iterative process for the above numerical example is shown in the table:

The student should check that the exact solution for this system is (1, 1, 1). It is seen that the Gauss-Seidel solutions are rapidly approaching these values; in other words, the method is converging.

Naturally, in practice, the exact solution is unknown. It is customary to end the iterative procedure as soon as the differences between the x(k+1) and x(k) values are suitably small. One stopping rule is to end the iteration when

becomes less than a prescribed small number (usually chosen according to the accuracy of the machine on which the calculations are carried out).

The question of convergence with a given system of equations is crucial. As in the above example, the Gauss-Seidel method may
quickly lead to a solution very close to the exact one; on the other
hand, it may converge too slowly to be of practical use, or it may
produce a sequence which diverges from the exact solution. The reader
is referred to more advanced texts (such as Conte and de Boor ( 1980))
for treatments of this question.

In order to improve the chance (and rate) of convergence, the system of equations should be rearranged before applying the iterative method, so that, as far as possible, each leading-diagonal coefficient is larger (in absolute value) than any other in its row.

Checkpoint

1. What is the essential difference between a direct method and an iterative method?
2. Give some advantages of the use of iterative methods
compared with that of direct methods.
3. How can you improve the chance of success with the Gauss-Seidel method?

EXERCISES

1. For the example treated above, compute the value of S3, the
quantity used in the suggested stopping rule after the third
iteration.
2. Use the Gauss-Seidel method to solve the following systems to 5D accuracy (remembering to rearrange the equations if appropriate). Compute the value of Sk (to 6D) after each iteration.

a)

x - y + z = -7,
20x + 3y - 2z = 51,
2x + 8y + 4z = 25.

Remember to rearrange! Compute the value of Sk to 5D after each iteration.

b)

10x - y = 1,
-x + 10y - z = 1,
-y + 10z - w = 1,
-z + 10w = 1.

STEP 14

SYSTEMS OF LINEAR EQUATIONS 4*

Matrix inversion*

The general system of linear equations in n variables (STEP 11, Section 1) can be written in matrix form Ax = b, when a vector x is sought which satisfies this equation. We will now use the inverse matrix A-1 to find this vector.

1. The inverse matrix

In STEP 11, we have observed that, if the determinant of A is non-zero, A has an inverse matrix A-1, and the solution of the linear system can be written as x = A-1 b; i.e., the solution of the system of linear equations can be obtained by first finding the inverse of the coefficient matrix A, and then forming the product A-1 b.

However, as a rule, this approach is not adopted in practice. The problem of finding the inverse matrix is itself a numerical task, which generally requires for its solution many more operations (and therefore involves more round-off and other errors) than any of the methods described in the preceding Steps. If, on the other hand, the inverse matrix is required for some additional reason, it is sensible to compute it first. For example, the inverse may contain theoretical or statistical information, or be of use in some other formula or calculation.

2. Method for inverting a matrix

There are many numerical methods for finding the inverse of a matrix.
We shall describe one which uses Gauss elimination and back-
substitution procedures. It is simple to apply and is computationally
efficient. We shall illustrate the method by application to a 2 x 2 matrix
and a 3 x 3 matrix; it should then be clear to the reader how the
method may be extended for use with n x n matrices.

As the 2 x 2 example, consider

and seek the inverse matrix

such that

This is equivalent to solving the two systems

Thus:
a. Form the augmented matrix

b. Apply elementary row operations to the augmented matrix such that A is transformed into an upper triangular matrix (STEP 11, Section 5):

c. Solve the two systems:

using back-substitution. Note how the systems have been constructed, using the triangularized matrix together with the two columns of the transformed right-hand part. From the first system, 3v1 = -2, so v1 = -2/3, and 2u1 + v1 = 1, whence 2u1 = 1 + 2/3, u1 = 5/6. From the second system, 3v2 = 1, v2 = 1/3 and 2u2 + v2 = 0, whence 2u2 = -1/3, u2 = -1/6. Thus the required inverse matrix is:

d. Check: AA-1 should be equal to I. Multiplication yields:

so that A-1 is correct.
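The elimination-and-back-substitution procedure above is easily programmed. Here is a short Python sketch of ours, assuming no pivoting, as in the worked example; the 2 x 2 matrix shown is inferred from the arithmetic in step c (the printed matrices are not reproduced here).

import numpy as np

def invert(A):
    # Gauss elimination on the augmented matrix [A | I], then
    # back-substitution for each column of the identity.
    n = A.shape[0]
    aug = np.hstack([A.astype(float), np.eye(n)])
    for k in range(n - 1):                    # triangularization
        for i in range(k + 1, n):
            m = aug[i, k] / aug[k, k]         # multiplier
            aug[i, k:] -= m * aug[k, k:]
    inv = np.zeros((n, n))
    for j in range(n):                        # one column at a time
        for i in range(n - 1, -1, -1):
            s = aug[i, i+1:n] @ inv[i+1:, j]
            inv[i, j] = (aug[i, n + j] - s) / aug[i, i]
    return inv

A = np.array([[2.0, 1.0],
              [4.0, 5.0]])     # consistent with 3v1 = -2 and 3v2 = 1 above
print(invert(A))               # [[ 5/6, -1/6], [-2/3, 1/3]]
print(A @ invert(A))           # approximately the identity matrix I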

In this simple example, one can work with fractions, so that no round-off errors occur and the resulting inverse matrix is exact. In general, during hand calculations, the final result should be checked by computing AA-1, which should be approximately equal to the identity matrix I. As the 3 x 3 example, consider

In order to demonstrate the effects of errors, compute A-1 to 3S. The results of the calculations in tabular form are:

As an example of back-substitution, the triangularized matrix, taken with the second column of the transformed right-hand part, yields the second column of A-1. Thus:

yields w2 = -20.0, v2 = 38.3 and u2 = -34.0, determined in this order.

Check by multiplication that AA-1 is

which is approximately equal to I. The noticeable inaccuracy is due to carrying out the calculation of the elements of A-1 to 3S only.

3. Solution of linear systems using the inverse matrix

As previously noted, the unique solution of a linear system Ax = b is x = A-1 b, provided the coefficient matrix A has an inverse A-1. We shall illustrate this by using the inverse A-1 obtained in Section 2 to compute the solution of the linear system:

The coefficient matrix is

We can use A-1, calculated in the preceding section, as follows:

Thus, we arrive at the solution x = -13, y = 16 and z = -2.5 to 2S. We check the solution by adding the three equations:

Inserting this solution in the left-hand side yields to 2S:

Checkpoint
1. In the method for finding the inverse of A, what is the final form
of A after the elementary row operations have been carried out?
2. Is the solution of the system Mx = d, x = dM-1 or x = M-1 d (or
neither)?
3. Give a condition for a matrix not to have an inverse.

EXERCISES

1. Find the inverses of the following matrices, using elimination and back-substitution:

1. 2. 3.

2. Solve the systems of equations (each for two right-hand side vectors):

a.

b.

c.
STEP 15

SYSTEMS OF LINEAR EQUATIONS 5*

Use of LU decomposition*

We have shown in STEP 11 how to solve a linear system Ax = b by Gauss elimination, applied to the augmented matrix. In the preceding Step, we have extended the elimination process to calculate the inverse A-1 of the coefficient matrix A, assuming that it exists.

Another general approach to solving Ax = b is known as the method of LU decomposition, which provides new insights into matrix algebra and has many theoretical and practical uses. It yields efficient computer algorithms for handling practical problems.

The symbols L and U denote lower and upper triangular matrices, respectively. Examples of lower triangular matrices are

Note that in such a matrix all elements above the leading diagonal are
zero. Examples of upper triangular matrices are:

where all elements below the leading diagonal are zero. The product of
L1 and U1 is
1. Procedure

Suppose we have to solve a linear system Ax = b and that we can express the coefficient matrix A in the form of the so-called LU decomposition A = LU. Then we may solve the linear system as follows:

Stage 1:

Write Ax = LUx = b.

Stage 2:

Set y = Ux, so that Ax = Ly = b. Use forward substitution with Ly = b to find y1, y2, . . ., yn in that order, i.e., assume that the augmented matrix for the system Ly = b is:

Then forward substitution yields y1 = b1/l11, and, subsequently,

Note that the value of yi depends on the values y1, y2, . . , yi-1, which
have already been calculated.

Stage 3:

Finally, use back-substitution with Ux = y to find xn, . . ., x1 in that order.
Later on we shall outline a general method for finding LU
decompositions of square matrices. The following example
demonstrates this method, involving the matrix A = L1 U1 above. If we
wish to solve Ax = b for a number of different vectors b, then this
method is more efficient than Gauss elimination. Once we have
found an LU decomposition of A, we need only forward and
backward substitute to solve the system for any b.
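Stages 2 and 3 amount to two triangular solves, and translate directly into code. The Python sketch below is ours, not the text's; the pair L1, U1 shown is illustrative (its lower triangle matches the multipliers used later in this Step, but the text's actual matrices are not reproduced here).

import numpy as np

def lu_solve(L, U, b):
    # Stage 2: forward substitution for Ly = b (y1, ..., yn in order);
    # Stage 3: back-substitution for Ux = y (xn, ..., x1 in order).
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

L1 = np.array([[1.0,  0.0, 0.0],
               [1.0,  1.0, 0.0],
               [2.0, -3.0, 1.0]])
U1 = np.array([[1.0, 2.0, 3.0],
               [0.0, 1.0, 2.0],
               [0.0, 0.0, 1.0]])
b = np.array([6.0, 11.0, 7.0])
x = lu_solve(L1, U1, b)
print(x, (L1 @ U1) @ x)   # the second output reproduces b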

Example

We shall solve the system

Stage 1:

An LU decomposition of the system is

Stage 2:

Set y = U1 x and then solve the system L1 y = b, i.e.,

Using forward substitution, we obtain:


Stage 3:

Solve

Back-substitution yields:

Thus, the solution of Ax = b is:

which you may check, using the original equations. We turn now to the
problem of finding an LU decomposition of a given square matrix
A.

2. Realizing an LU decomposition

For an LU decomposition of a given n x n matrix A, we seek a lower triangular matrix L and an upper triangular matrix U (both of order n x n) such that A = LU. The matrix U may be taken to be the upper triangular matrix resulting from Gauss elimination without partial pivoting (Sections 3 and 5 of STEP 11), and the matrix L may be taken to be the lower triangular matrix which has diagonal elements 1 and which has as its (i, k) element the multiplier mik. This multiplier is calculated at the k-th stage of Gauss elimination and is required to transform the current value of aik into 0. In the notation of STEP 11, these multipliers were given by mik = aik/akk, i = k+1, k+2, . . ., n.
An example will help to clarify this procedure. From Step 11 , we recall
that Gauss elimination, applied to the system

yielded the upper triangular matrix:

Also, we saw that in the first stage we calculated the multipliers m21 = a21/a11 = 1/1 = 1 and m31 = a31/a11 = 2/1 = 2, while, in the second stage, we calculated the multiplier m32 = a32/a22 = -3/1 = -3. Thus

It is readily verified that LU equals the coefficient matrix of the original system:
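The multipliers can be recorded while eliminating, which gives the decomposition directly. The Python sketch below is ours; the matrix A shown is illustrative, constructed so that it reproduces the multipliers m21 = 1, m31 = 2, m32 = -3 quoted above (the original coefficient matrix is not reprinted here).

import numpy as np

def lu_decompose(A):
    # Gauss elimination without pivoting: U is the triangularized
    # matrix; L holds the multipliers mik below a unit diagonal.
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]      # multiplier m_ik
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

A = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 2.0],
              [2.0, 1.0, 4.0]])
L, U = lu_decompose(A)
print(L)                        # unit lower triangle of multipliers
print(np.allclose(L @ U, A))    # True: LU reproduces A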

Another technique which may be used to find an LU decomposition of an n x n matrix is direct decomposition. In order to illustrate this process, let it be required to find an LU decomposition for the 3 x 3 coefficient matrix of the system above. Then the required L and U are of the form
Note that the total number of unknowns in L and U is 12, whereas there are only 9 elements in the 3 x 3 coefficient matrix A. To ensure that L and U are unique, we need to impose 12 - 9 = 3 extra conditions on the elements of these two triangular matrices. (In the general n x n case, n extra conditions are required.) One common choice is to require all the diagonal elements of L to be 1; the resulting method is known as Doolittle's method. Another choice is to make the diagonal elements of U equal to 1; this is Crout's method. Since Doolittle's method would give the same result as the direct LU decomposition of A obtained above, we shall use Crout's method to illustrate the decomposition procedure.

We then require that

Multiplication of L and U yields:

It is clear that this construction by Crout's method yields triangular matrices L and U for which A = LU.
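Crout's construction can also be written down directly in code: columns of L and rows of U are obtained alternately by equating elements of LU with those of A. A brief Python sketch of ours, under the stated convention of a unit diagonal on U; the matrix reuses the illustrative A from the previous sketch:

import numpy as np

def crout(A):
    # Crout decomposition A = LU, with unit diagonal on U.
    n = A.shape[0]
    L = np.zeros((n, n))
    U = np.eye(n)
    for j in range(n):
        for i in range(j, n):                # column j of L
            L[i, j] = A[i, j] - L[i, :j] @ U[:j, j]
        for k in range(j + 1, n):            # row j of U
            U[j, k] = (A[j, k] - L[j, :j] @ U[:j, k]) / L[j, j]
    return L, U

A = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 2.0],
              [2.0, 1.0, 4.0]])
L, U = crout(A)
print(np.allclose(L @ U, A))    # True: A = LU with U having unit diagonal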

Checkpoint

1. What constitutes an LU decomposition of a matrix A?
2. How is a decomposition A = LU used to solve a linear system Ax = b?
3. How may an LU decomposition be obtained by Gauss elimination?

EXERCISES
1. Find an LU decomposition of the matrix

where

2. Solve each of the following systems (Exercises 1 and 3 of STEP 11) by first finding an LU decomposition of the coefficient matrix and then using forward and backward substitutions.
a.

b.

STEP 16

SYSTEMS OF LINEAR EQUATIONS 6*

Tests for ill-conditioning*

We recall from Section 4 of STEP 12 that the solutions of ill-conditioned systems of linear equations are very sensitive to small changes in their coefficients and constants. We will now show how one may test a system for ill-conditioning.

1. Norms
One of the most common tests for ill-conditioning of a linear system
involves the norm of the coefficient matrix. In order to define this
quantity, we must consider first the concept of the norm of a vector
or matrix, which in some way assesses the size of their elements.

Let x and y be vectors. Then a vector norm ||·|| is a real number with the properties:

1. || x || ≥ 0, and || x || = 0 if and only if x is a vector all components of which are zero;
2. || cx || = | c | || x || for any real number c;
3. || x + y || ≤ || x || + || y || (triangle inequality).

There are many possible ways to choose a vector norm with these properties. One vector norm which is probably familiar to the student is the Euclidean or 2-norm. Thus, if x is an n x 1 vector, then the 2-norm is denoted and defined by

|| x ||2 = (x1^2 + x2^2 + . . . + xn^2)^(1/2).

As an example, if x is the 5 x 1 vector with the components x1 = 1, x2 = -3, x3 = 4, x4 = -6, x5 = 2, then || x ||2 = (1 + 9 + 16 + 36 + 4)^(1/2) = sqrt(66), which is approximately 8.12.

Another possible choice of norm, which is more suitable for our purposes here, is the infinity norm, defined by

|| x ||∞ = max |xi|, the maximum being taken over i = 1, 2, . . ., n.

Thus, we find for the vector in the previous example || x ||∞ = |-6| = 6. It is easily verified that the infinity norm has the three properties above. For the 2-norm, the first two properties are readily verified; the triangle inequality is somewhat more difficult and requires the use of the so-called Cauchy-Schwarz inequality (for example, cf. Cheney and Kincaid (1994)).

The defining properties of a matrix norm are similar, except that there is an additional property. Let A and M be matrices. Then a matrix norm ||·|| is a real number with the properties:

1. || A || ≥ 0, and || A || = 0 if and only if A is a matrix with all elements zero;
2. || cA || = | c | || A || for any real number c;
3. || A + M || ≤ || A || + || M ||;
4. || AM || ≤ || A || || M ||.

As for vector norms, there are many ways of choosing matrix norms with the four properties above, but we will consider only the infinity norm. If A is an n x n matrix, then the infinity norm is defined by

|| A ||∞ = the maximum over i of |ai1| + |ai2| + . . . + |ain|.

By this definition, this norm is the maximum of the sums obtained by adding the absolute values of the elements in each row, whence it is commonly referred to as the maximum row sum norm. As an example, let

Then
so that

A useful property interrelating the matrix and vector infinity norms is

|| Ax ||∞ ≤ || A ||∞ || x ||∞,

which follows from Property 4 of the matrix norms above.
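Both norms are one line each in code. A small Python sketch of ours, reusing the 5 x 1 vector from the example above:

import numpy as np

def vec_inf_norm(x):
    # infinity norm of a vector: largest absolute component
    return np.max(np.abs(x))

def mat_inf_norm(A):
    # infinity norm of a matrix: maximum row sum of absolute values
    return np.max(np.sum(np.abs(A), axis=1))

x = np.array([1.0, -3.0, 4.0, -6.0, 2.0])
print(vec_inf_norm(x))       # 6.0, as found above
print(np.linalg.norm(x))     # the 2-norm, sqrt(66) = 8.124...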

2. Ill-conditioning tests

We will now find out whether or not a system is ill-conditioned by using the condition number of the coefficient matrix. If A is an n x n matrix and A-1 is its inverse, then the condition number of A is defined by

cond(A) = || A || || A-1 ||.

It is bounded below by 1, since || I || = 1 and

1 = || I || = || AA-1 || ≤ || A || || A-1 || = cond(A),

where we have used the matrix norm Property 4 given in the preceding section. Large values of the condition number usually indicate ill-conditioning. As a justification for this last statement, we state and prove the theorem:

Theorem: Suppose x satisfies the linear system Ax = b and x̃ satisfies the perturbed linear system Ax̃ = b̃. Then

|| x - x̃ || / || x || ≤ cond(A) || b - b̃ || / || b ||.

Proof: We have A(x - x̃) = b - b̃, so that x - x̃ = A-1(b - b̃). Since

|| x - x̃ || = || A-1(b - b̃) || ≤ || A-1 || || b - b̃ ||,

we see that the error in x is at most || A-1 || times the error in b. However, since b = Ax, we have || b || ≤ || A || || x ||, or

1 / || x || ≤ || A || / || b ||.

It then follows that

|| x - x̃ || / || x || ≤ || A || || A-1 || || b - b̃ || / || b ||,

from which follows the result above.

It is seen from the theorem that, even if the difference between b and b̃ is small, the change in the solution, as measured by the `relative error' || x - x̃ || / || x ||, may be large when the condition number is large. It follows that a large condition number is an indication of possible ill-conditioning of a system. A similar theorem for the case when there are small changes to the coefficients of A may be found in more advanced texts such as Atkinson (1993). Such a theorem also shows that a large condition number is an indicator of ill-conditioning.

There arises the question: How large does the condition number have to be for ill-conditioning to become a problem? Roughly speaking, if the condition number is 10^m and the machine in use to solve a linear system has kD accuracy, then the solution of the linear system will be accurate to only about k - m decimal digits.
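In code, the condition number in the infinity norm is just the product of two row-sum norms. The sketch below is ours and uses an illustrative nearly singular 2 x 2 matrix, not the one from STEP 12:

import numpy as np

def cond_inf(A):
    # cond(A) = ||A|| ||A^-1|| in the infinity (maximum row sum) norm.
    # Software packages usually estimate ||A^-1|| rather than forming
    # the inverse explicitly, as remarked at the end of this Step.
    row_sum = lambda M: np.max(np.sum(np.abs(M), axis=1))
    return row_sum(A) * row_sum(np.linalg.inv(A))

A = np.array([[1.00, 1.00],
              [1.00, 1.01]])          # nearly linearly dependent rows
print(cond_inf(A))                    # 404.01, i.e. about 4 x 10^2
print(np.linalg.cond(A, np.inf))      # library equivalent, for comparison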

In STEP 12, we had the coefficient matrix

which was associated with an ill-conditioned system. Then

and

This suggests that a numerical solution would not be very accurate if only 2D accuracy were used in the calculations. Indeed, if the components of A were rounded to two decimal digits, the two rows of A would be identical. Then the determinant of A would be zero, and it follows from the theorem in STEP 11 that this system would not have a unique solution.

We recall that the definition of the condition number requires A-1, the computation of which is expensive. Moreover, even if the inverse were calculated, this approximation might not be very accurate if the system is ill-conditioned. It is therefore common in software packages to estimate the condition number by obtaining an estimate of || A-1 ||, without explicitly finding A-1.

Checkpoint

1. What are the three properties of a vector norm?
2. What are the four properties of a matrix norm?
3. How is the condition number of a matrix defined, and how is it used as a test for ill-conditioning?

EXERCISES
1. For the 5 x 1 vector with elements x1 = 4, x2 = -6, x3 = -5, x4 = 1, and x5 = -1, calculate || x ||2 and || x ||∞.

2. For each of the matrices in Exercise 1 of Step 14, compute

a. the infinity norm,

b. the condition number.

STEP 17

THE EIGEN-VALUE PROBLEM

The power method

Let A be an nn matrix. If there exists a number and a non-zero


vector x such that , then is said to be an eigen- value of A,
and x the corresponding eigen- vector. The evaluation of eigen-values
and eigen-vectors of matriccs is a problem that arises in a variety of
contexts. Note that if we have an eigen-value and an eigen-vector x,
then x, where is any real number, is also an eigen-vector, since

This shows that an eigen-vector is not unique and may be scaled, if desired (for instance, one might want the sum of the components of the eigen-vector to be unity).

Writing Ax = λx as

(A - λI)x = 0,

we conclude from the theorem in STEP 11 that this equation can have a non-zero solution only if the determinant |A - λI| is zero. If we expand this determinant, then we get an n-th degree polynomial in λ, known as the characteristic polynomial of A. Thus, one way to find the eigen-values of A is to obtain its characteristic polynomial and then to find the n zeros of this polynomial (some of them may be complex!).
For example, let

Then

This last matrix has the determinant

The zeros of this quadratic yield the eigen-values.

Although the characteristic polynomial is easy to work out in this simple 2 x 2 example, as n increases, it becomes more complicated (and, of course, it has a correspondingly higher degree). Moreover, even if we can find the characteristic polynomial, the analytic formulae for the roots of a cubic or quartic are somewhat inconvenient to use, and in any case, we must use numerical methods (cf. STEPs 7-10) to find the roots of polynomials of degree n > 4. It is therefore common to use alternative, direct numerical methods to find eigen-values and eigen-vectors of a matrix.

If we are only interested in the eigen-value of largest magnitude, then a popular approach to its evaluation is the power method. We shall discuss later on how this method may be modified to find the eigen-value of smallest magnitude. Methods for finding all the eigen-values are beyond the scope of this book. (One such method, called the QR method, is based on QR factorization.)

1. The power method

Suppose that the n eigen-values of A are λ1, λ2, . . ., λn, and that they are ordered in such a way that

|λ1| > |λ2| ≥ . . . ≥ |λn|.

Then the method, which is readily implemented on a computer, can be used to find λ1. We begin with a starting vector w(0) and compute the vectors

w(j) = A w(j-1)

for j = 1, 2, . . ., so that induction yields

w(j) = A^j w(0),

where A^j is A multiplied by itself j times. Thus w(j) is the product of w(0) and the j-th power of A, which explains why this approach is called the power method.

It turns out that at the j-th iteration an approximation to the eigen-vector x associated with λ1 is given by w(j), and an approximation to λ1 is given by the ratio of the k-th components of w(j) and w(j-1), respectively, i.e.,

λ1 ≈ wk(j) / wk(j-1),

for any k in {1, 2, . . ., n}. Although there are n possible choices for k, it is usual to choose it so that wk(j) is the component of w(j) with the largest magnitude.
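A direct transcription of the method takes only a few lines of Python. The sketch is ours; the 2 x 2 matrix is illustrative (its true eigen-values are 3 and 1), since the 3 x 3 matrix of the example below is not reproduced here.

import numpy as np

def power_method(A, w0, iters=8):
    # w(j) = A w(j-1); the eigen-value estimate is the ratio of the
    # k-th components of w(j) and w(j-1), with k the index of the
    # largest-magnitude component of w(j).
    w = np.array(w0, dtype=float)
    est = None
    for _ in range(iters):
        w_new = A @ w
        k = np.argmax(np.abs(w_new))
        est = w_new[k] / w[k]
        w = w_new
    return est, w

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
est, w = power_method(A, [1.0, 0.0])
print(est)    # approaches the dominant eigen-value 3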

2. Example

We will now use the power method to find the largest eigen-value λ1 of the matrix

As a starting vector we take

Since the second component of w(1) has the largest magnitude, we take k = 2, so that the first approximation to λ1 is

Subsequent iterations of the power method yield


These calculations allow us to conclude that the largest eigen-value is about 2.6.

3. Variants

In the preceding example, the reader will have noted that the components of w(j) were growing in size as j increased. Overflow problems would arise if this growth were to continue; hence, in practice, it is usual to use the scaled power method instead. This is identical to the power method except that the vectors w(j) are scaled at each iteration. Thus, suppose w(0) is given and set y(0) = w(0). Then we carry out, for j = 1, 2, . . ., the steps:

a. Calculate w(j) = Ay(j-1);
b. find p such that |wp(j)| is the largest of |w1(j)|, . . ., |wn(j)|, so that the p-th component of w(j) has the largest magnitude;
c. evaluate an approximation to λ1, i.e., λ1 ≈ wk(j) / yk(j-1), for some k;
d. calculate y(j) = w(j) / wp(j).

In Step c, there are n choices for k. As for the unscaled power method, k is usually chosen to be the value for which wk(j) has the largest magnitude, i.e., k is taken to be the value p obtained in Step b. Another option is to choose k to be the value of p from the previous iteration, although often this results in the same value. The effect of Step d is to produce a vector y(j) with components of magnitude not more than unity.
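The scaled variant differs only by the normalization in Step d. A Python sketch of ours, with k taken equal to p as recommended above, and the same illustrative matrix as before:

import numpy as np

def scaled_power_method(A, w0, iters=8):
    y = np.array(w0, dtype=float)
    est = None
    for _ in range(iters):
        w = A @ y                      # step a
        p = np.argmax(np.abs(w))       # step b
        est = w[p] / y[p]              # step c, with k = p
        y = w / w[p]                   # step d: components now at most 1
    return est, y

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(scaled_power_method(A, [1.0, 0.0])[0])   # again approaches 3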

As an example, we apply the scaled power method to the matrix in the preceding section. We take the value of k in each iteration to be p. The starting vector w(0) is the same as before, and y(0) = w(0). Then the first four iterations of the scaled power method yield:

Note that the eigen-value estimates are the same as before, and the w(j) are just multiples of those obtained in the preceding section.

We shall now discuss the use of the power method to find the eigen-value of the smallest magnitude. If A has an inverse, then Ax = λx may be written as

A-1 x = (1/λ) x.

It follows that the smallest eigen-value of A may be found by finding the largest eigen-value of A-1 and then taking the reciprocal. Thus, if the unscaled power method were used, we would calculate the vectors

w(j) = A-1 w(j-1).

In general, it is more efficient to solve the linear system Aw(j) = w(j-1) at each iteration than to find the inverse of A (STEP 14).

4. Other aspects

It may be shown that the convergence rate of the power method is linear and that, under suitable conditions, the error after j iterations is bounded by

C |λ2/λ1|^j,

where C is some positive constant. Thus, the bigger the gap between |λ1| and |λ2|, the faster the rate of convergence. Since the power method is an iterative method, one has to stop at some stage. It is usual to carry on the process until successive estimates of the eigen-value agree to a certain tolerance or, alternatively, until a given maximum number of iterations is exceeded.

Difficulties with the power method usually arise when our assumptions about the eigen-values are not valid. For instance, if |λ1| = |λ2|, then the sequence of estimates for λ1 may not converge. Even if the sequence does converge, one may not be able to get an approximation to the eigen-vector associated with λ1. A short discussion of such difficulties may be found in Conte and de Boor (1980).

Checkpoint

1. How are the eigen-values and eigen-vectors of a matrix defined?
2. What is the power method for finding the eigen-value of largest
magnitude?
3. What advantage does the scaled power method have over the
power method?

EXERCISES

1. For the 2 x 2 matrix

apply eight iterations of the power method. Find the characteristic polynomial, and hence find the two true eigen-values of the matrix. Verify that the approximations are converging to the eigen-value with the larger magnitude.
2. Apply five iterations of the normal and scaled power
methods to the 3 x 3 matrix

STEP 18

FINITE DIFFERENCES 1

Tables

Historically speaking, numerical analysts have always been concerned with tables of numbers, and many techniques have been developed for dealing with mathematical functions represented in this way. For example, the value of the function at an untabulated point may be required, so that interpolation is necessary. It is also possible to estimate the derivative or the definite integral of a tabulated function, using some finite processes to approximate the corresponding (infinitesimal) limiting procedures of calculus. In each case, it has been traditional to use finite differences. Another application of finite differences, which is outside the scope of this book, is the numerical solution of partial differential equations.

1. Tables of values

Many books contain tables of mathematical functions. One of the most comprehensive is the Handbook of Mathematical Functions, edited by Abramowitz and Stegun (see the Bibliography for publication details), which also contains useful information about numerical methods.
Although most tables use constant argument intervals, some functions do change rapidly in value in particular regions of their argument, and hence may best be tabulated using intervals varying according to the local behaviour of the function. Tables with varying argument intervals are more difficult to work with, however, and it is common to adopt uniform argument intervals wherever possible. As a simple example, consider the 6S table of the exponential function over 0.10 (0.01) 0.18 (a notation which specifies the domain 0.10 ≤ x ≤ 0.18 and the tabular interval 0.01):

It is extremely important that the interval between successive values is small enough to display the variation of the tabulated function, because usually the value of the function will be needed at some argument value between the values specified (for example, from the above table). If the table is constructed in this manner, we can obtain such intermediate values to a reasonable accuracy by using a polynomial representation (hopefully, of low degree) of the function f.

2. Finite differences

Since Newton, finite differences have been used extensively. The construction of a table of finite differences for a tabulated function is simple: one obtains first differences by subtracting each value from the succeeding value in the table, second differences by repeating this operation on the first differences, and so on for higher order differences. From the above table of the exponential function, one has the following difference table (note the standard layout, with decimal points and leading zeros omitted from the differences):

(In this case, the differences must be multiplied by 10^-5 for comparison with the function values.)
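Building such a table is a short recursion on successive subtractions. A Python sketch of ours; the 6S values of e^x below are recomputed here and agree with the tabulation of Section 1.

import numpy as np

def difference_table(f_values):
    # each column of differences is obtained by subtracting each
    # entry from the succeeding one in the previous column
    cols = [np.array(f_values, dtype=float)]
    while len(cols[-1]) > 1:
        cols.append(np.diff(cols[-1]))
    return cols

fx = [1.10517, 1.11628, 1.12750, 1.13883, 1.15027,
      1.16183, 1.17351, 1.18530, 1.19722]    # exp(x), x = 0.10(0.01)0.18
for order, col in enumerate(difference_table(fx)[:5]):
    # print in units of 10^-5, matching the standard layout above
    print(order, np.round(col * 1e5).astype(int))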

3. Influence of round- off errors

Consider the difference table given below for f(x) = e^x, x = 0.1(0.05)0.5, to 6S, constructed as in Section 2. As before, differences of increasing order decrease rapidly in magnitude, but the third differences are irregular. This is largely a consequence of round-off errors, as tabulation of the function to 7S and differencing to fourth order illustrates (compare Exercise 3). Although the round-off errors in f should be less than 1/2 in the last significant place, they may accumulate; the greatest error that can be obtained corresponds to:

A rough working criterion for the expected fluctuations (noise level) due to round-off error is shown in the table:

Checkpoint

1. What factors determine the intervals of tabulation of a function?
2. What is the name of the procedure to determine a value of a tabulated function at an intermediate point?
3. What may be the cause of irregularity in the highest order differences in a difference table?

EXERCISES
1. Construct the difference table for the function f(x) = x^3 for x = 0(1)6.

2. Construct difference tables for each of the polynomials:

a. 2x - 1 for x = 0(1)3.
b. 3x^2 + 2x - 4 for x = 0(1)4.
c. 2x^3 + 3x - 3 for x = 0(1)5.

Study your resulting tables carefully; note what happens in the final few columns of each table. Suggest a general result for polynomials of degree n, and compare your answer with the theorem in STEP 20.

3. Construct a difference table for the function f(x) = e^x, given to 7D for x = 0.1(0.05)0.5.

STEP 19

FINITE DIFFERENCES 2

Forward, backward, central difference notations

There are several different notations for the single set of finite differences described in the preceding Step. We introduce each of these three notations in terms of the so-called shift operator, which we define first.

1. The shift operator E

Let fj = f(xj), j = 0, 1, 2, . . ., be a set of values of the function f(x), tabulated at the equidistant arguments xj = x0 + jh. The shift operator E is defined by:

E fj = fj+1.

Consequently,

E^2 fj = E(E fj) = E fj+1 = fj+2,

and so on, i.e.,

E^k fj = fj+k,

where k is any positive integer. Moreover, the last formula can be extended to negative integers, and indeed to all real values of j and k, so that, for example,

E^-1 fj = fj-1

and

E^(1/2) fj = fj+1/2.

2. The forward difference operator Δ

If we define the forward difference operator Δ by

Δ = E - 1,

then

Δfj = (E - 1) fj = fj+1 - fj,

which is the first-order forward difference at xj. Similarly, we find that

Δ²fj = Δ(fj+1 - fj) = fj+2 - 2fj+1 + fj

is the second-order forward difference at xj, and so on. The forward difference of order k is

Δ^k fj = Δ^(k-1) fj+1 - Δ^(k-1) fj,

where k is any integer.

3. The backward difference operator ∇

If we define the backward difference operator ∇ by

∇ = 1 - E^-1,

then

∇fj = (1 - E^-1) fj = fj - fj-1,

which is the first-order backward difference at xj. Similarly,

∇²fj = fj - 2fj-1 + fj-2

is the second-order backward difference at xj, etc. The backward difference of order k is

∇^k fj = ∇^(k-1) fj - ∇^(k-1) fj-1,

where k is any integer. Note that ∇fj = Δfj-1.

4. The central difference operator δ

If we define the central difference operator δ by

δ = E^(1/2) - E^(-1/2),

then

δfj = fj+1/2 - fj-1/2,

which is the first-order central difference at xj. Similarly,

δ²fj = fj+1 - 2fj + fj-1

is the second-order central difference at xj, etc. The central difference of order k is

δ^k fj = δ^(k-1) fj+1/2 - δ^(k-1) fj-1/2,

where k is any integer. Note that δfj = Δfj-1/2 = ∇fj+1/2.

5. Differences display

The roles of the forward, central, and backward differences are displayed by the difference table:

Although forward, central, and backward differences represent precisely the same data (the sketch after the following list picks the same numbers out in code):

1. Forward differences are useful near the start of a table, since they only involve tabulated function values below xj;
2. Central differences are useful away from the ends of a table, where there are available tabulated function values above and below xj;
3. Backward differences are useful near the end of a table, since they only involve tabulated function values above xj.
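The three notations are merely different labellings of one set of numbers, as this small Python sketch of ours makes explicit on an illustrative tabulation of x^2:

import numpy as np

x = np.arange(5, dtype=float)       # x = 0(1)4, an illustrative table
f = x ** 2

first = np.diff(f)                  # the single column of first differences
# Read as forward differences, first[j] belongs to x_j
# (delta f_j = f_{j+1} - f_j); read as backward differences, the very
# same number belongs to x_{j+1} (nabla f_{j+1} = f_{j+1} - f_j).
second = np.diff(f, n=2)            # f_{j+2} - 2 f_{j+1} + f_j
print(first)                        # [1. 3. 5. 7.]
print(second)                       # [2. 2. 2.]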
Checkpoint

1. What is the definition of the shift operator?
2. How are the forward, backward, and central difference operators defined?
3. When are the forward, backward, and central difference notations likely to be of special use?

EXERCISES

1. Construct a table of differences for the polynomial

for x = 0(1)4. Use the table to obtain the values of:

1. ;
2. ;

3. .
2. For the difference table in Section 3 of STEP 18 of f(x) = e^x for x = 0.1(0.05)0.5, determine to six significant digits the quantities (taking x0 = 0.1):

1. ;

2. ;
3. ;
4. ;

5. ;
3. Prove the statements:
1. ;

2. ;
3. ;
4. .

STEP 20

FINITE DIFFERENCES 3

Polynomials

Since polynomial approximations are used in many areas of Numerical Analysis, it is important to investigate what happens when polynomials are differenced.

1. Finite differences of a polynomial

Consider the finite differences of an n-th degree polynomial

pn(x) = a0 + a1x + . . . + anx^n,

tabulated for equidistant points at the tabular interval h.

Theorem: The n-th difference of a polynomial of degree n is a constant, equal to n! an h^n, and higher order differences are zero.

Proof: For any positive integer k, the binomial expansion

yields

Omitting the subscript of x, we find

In passing, the student may recall that in the Differential Calculus the increment Δf is related to the derivative of f(x) at the point x.

2. Example

Construct for f(x) = x^3 with x = 5.0(0.1)5.5 the difference table:

Since in this case n = 3, an = 1, h = 0.1, we find that the third differences all equal 3! x 1 x (0.1)^3 = 0.006.

Note that round-off error noise may occur; for example, consider the tabulation of f(x) = x^3 for x = 5.0(0.1)5.5, rounded to two decimal places:

3. Approximation of a function by a polynomial

Whenever the higher differences of a table become small (allowing for round-off noise), the function represented may be approximated well by a polynomial. For example, reconsider the difference table of 6D for f(x) = e^x with x = 0.1(0.05)0.5:

Since the fluctuations in the third differences lie within the round-off error estimate (cf. the table in STEP 12), we say that the third differences are constant within round-off error, and deduce that a cubic approximation is appropriate for e^x over the range 0.1 < x < 0.5. An example in which polynomial approximation is inappropriate occurs when f(x) = 10^x for x = 0(1)4, as is shown by the next table:

Although the function f(x) = 10^x is `smooth', the large tabular interval (h = 1) produces large higher order finite differences.
It should also be understood that there exist functions that cannot usefully be tabulated at all, at least in certain neighbourhoods; for example, f(x) = sin(1/x) near the origin x = 0. Nevertheless, these are fairly exceptional cases.

Finally, we remark that the approximation of a function by a polynomial is fundamental to the widespread use of finite difference methods.

Checkpoint

1. What may be said about the higher order (exact) differences of a polynomial?

2. What is the effect of round-off error on the higher order differences of a polynomial?

3. When may a function be approximated by a polynomial?

EXERCISES

1. Construct a difference table for the polynomial f(x) = x^4 for x = 0(0.1)1 when

a. the values of f are exact;
b. the values of f have been rounded to 3D.

Compare the fourth difference round-off errors with the estimate +/-6.
2. Find the degree of the polynomial which fits the data in the table:
