M.A. Wongsam, R.W. Chantrell
School of Electronic Engineering and Computer Systems
University of Wales, Dean Street, Bangor, Gwynedd, LL57 1UT
Contents

2.1 The computer representation of numbers
2.2 Errors due to the computer representation of numbers

3 Lecture 2 - Errors arising from arithmetical operations
3.1 Summary from lecture 1
3.2 Error accumulation
3.3 Loss of significance

4 Lecture 3 - Solving Non-Linear Equations
4.1 Fixed-point iteration
4.2 Criteria for convergence
4.3 The Newton-Raphson method

5.1 Introduction
5.1.1 Solvability of systems of linear equations
5.2 Direct methods
5.2.1 Definitions
5.2.2 The backsubstitution algorithm
5.3 Gaussian elimination
5.3.1 The reduction algorithm
5.3.2 Failure of the reduction method
5.4 Iterative methods
5.4.1 Jacobi's method
5.4.2 The Gauss-Seidel method

6.1 The Algebraic eigenvalue problem
6.2 The power method
6.3 Similarity and the QR method
6.3.1 Householder tridiagonalisation
6.3.2 QR Factorisation

7.1 Introduction
7.2 Single-step methods
7.2.1 The Euler method
7.2.2 The improved Euler (predictor-corrector) method
7.2.3 Runge-Kutta methods
7.2.4 4-th order Runge-Kutta algorithm

8.2 The Adams-Moulton predictor-corrector method
8.2.1 The Adams-Bashford method
8.2.2 The Adams-Moulton method
\exists \ldots \forall \ldots   there exists ... such that for all ... we have
\in                             is a member of the set
\subseteq                       is a subset of
\implies                        implies that
Z                               the set of non-negative integers
R                               the set of all real numbers
Integers: Many real numbers can be represented by finite strings of digits, but most real numbers require infinite strings of digits. Since for practical computations only finite strings of digits can be used, a truncation error is entailed when real numbers are represented in calculations. For example, most computer subroutines for evaluating sin x, cos x, etc. first subtract or add multiples of 2\pi to the argument in order to obtain an argument in the range [-\pi, +\pi]. But since \pi requires an infinite number of digits for its representation, the subtraction entails an error which changes the result. A non-negative integer N with n digits may be represented according to the following polynomial scheme

N = \sum_{i=0}^{n-1} a_{n-i-1}\, 10^{n-i-1}, \qquad 0 \le a_k \le 9, \quad a_k \in Z
so that the number 257 in this denary representation is

257 = 2 \times 10^2 + 5 \times 10^1 + 7 \times 10^0

Since computers store numbers as states of electronic components, they often use a binary representation, so that a non-negative integer with n digits may be represented according to

N = \sum_{i=0}^{n-1} a_{n-i-1}\, 2^{n-i-1}, \qquad 0 \le a_k \le 1, \quad a_k \in Z

For example, the binary representation of the denary number 13 is (1101)_2, which has polynomial representation

1101 = 1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0
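The polynomial scheme above can be checked mechanically. The following sketch (the helper name `to_digits` is ours, not from the notes) extracts the digits a_{n-1}, ..., a_0 of a non-negative integer in a given base and reconstructs N from the polynomial:

```python
def to_digits(n, base):
    """Return the digit list (a_{n-1}, ..., a_0) of a non-negative
    integer in the given base, per the polynomial scheme above."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        digits.append(n % base)   # least significant digit first
        n //= base
    return digits[::-1]           # most significant digit first

# 257 in denary: 2*10^2 + 5*10^1 + 7*10^0
assert to_digits(257, 10) == [2, 5, 7]
# 13 in binary: 1*2^3 + 1*2^2 + 0*2^1 + 1*2^0
assert to_digits(13, 2) == [1, 1, 0, 1]
# Reconstruct N from its digits to confirm the polynomial identity
digits = to_digits(257, 10)
n = len(digits)
assert sum(d * 10**(n - 1 - i) for i, d in enumerate(digits)) == 257
```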
integral part x_I, which is the largest integer less than or equal to x, while its fractional part is x_F = x - x_I. x_F can have polynomial denary representation

x_F = \sum_{i=1}^{\infty} b_i\, 10^{-i}, \qquad 0 \le b_k \le 9, \quad b_k \in Z

If k, i \in Z, then the fraction x_F is said to terminate if and only if \exists k\ \forall i\ b_{k+i} = 0. If no such k exists, then x_F requires an infinite number of digits to represent it exactly. Now, if x_I = (a_n a_{n-1} \ldots a_0) and x_F = (b_1 b_2 \ldots), then we can write x = (a_n a_{n-1} \ldots a_0 . b_1 b_2 \ldots), where the decimal point separates the a_k from the b_k. In the binary polynomial representation

x_F = \sum_{i=1}^{\infty} b_i\, 2^{-i}
When we use computers for numerical calculations we have to convert between the denary and binary representations of numbers. It so happens that not every number which is terminating in denary representation is also terminating in binary representation. Therefore, even when a fraction can be exactly represented by a finite number of digits in denary representation, it may still entail a truncation error when converted to binary. Computers use a normalised floating point t-digit representation

x = (.b_1 b_2 \ldots b_t) \times \beta^e

where (.b_1 b_2 \ldots b_t) is called the mantissa and e is called the exponent. The precision of the mantissa is determined by the word length of the computer, and the exponent is bounded within some range m \le e \le M, where m is a negative integer and M is some positive integer. Therefore, all the numbers which can be stored in a computer and used in calculations are contained within the set F(t, m, M). Any number x that we use in a computation that is not in this set must be approximated by a number which is a member of the set, i.e.

x \approx fl(x) \in F(t, m, M)

A number x is converted to normalised floating point representation fl(x) either by rounding (choosing the nearest floating point number to x, and applying some rule such as rounding up in the case of a tie) or by chopping (choosing the nearest floating point number smaller in magnitude than x), provided that e lies within the range m \le e \le M^1. The difference |x - fl(x)| is called the absolute error.
^1 If e < m one has an underflow condition, and if e > M one has an overflow condition.
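A toy model of the normalised t-digit denary representation makes chopping and rounding concrete. This is an illustrative sketch (the helper `fl` and its interface are ours, not from the notes); it handles positive x only and inherits Python's round-half-even behaviour as its tie rule:

```python
import math

def fl(x, t, mode="round"):
    """Map a positive number x to its t-digit normalised floating point
    value (.b1 b2 ... bt) * 10^e, by chopping or rounding.
    Illustrative helper; positive x only."""
    assert x > 0
    e = math.floor(math.log10(x)) + 1     # choose e so that 0.1 <= x/10^e < 1
    scaled = x * 10**(t - e)              # shift the first t digits left of the point
    m = math.floor(scaled) if mode == "chop" else round(scaled)
    return m * 10.0**(e - t)

x = 3.1415927
print(fl(x, 4, "chop"))               # 0.3141 * 10^1
print(fl(x, 4, "round"))              # 0.3142 * 10^1
print(abs(x - fl(x, 4, "round")))     # absolute error, at most (1/2) * 10^{e-t}
```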
The distribution of floating point numbers is very uneven^2, so that the error x - fl(x) depends strongly on x. It is always possible to write any nonzero decimal number x in the form

x = u \times 10^e + v \times 10^{e-t}, \qquad \tfrac{1}{10} \le |u| < 1, \quad 0 \le |v| < 1

For example, with t = 4,

12.3456 = .1234 \times 10^2 + .56 \times 10^{-2}
-.0123456 = -.1234 \times 10^{-1} + (-.56) \times 10^{-5}
3.1415927\ldots = .3141 \times 10^1 + .5927\ldots \times 10^{-3}

In the same way, any nonzero number in base \beta can be written

x = u \times \beta^e + v \times \beta^{e-t}, \qquad \tfrac{1}{\beta} \le |u| < 1, \quad 0 \le |v| < 1

Now, the first term gives the chopped value x_C = u \times \beta^e, whereas the rounded value x_R is given by

x_R = x_C \text{ if } |v| < 1/2, \qquad x_R = x_C + \mathrm{sgn}(x)\, \beta^{e-t} \text{ if } |v| \ge 1/2

whereas the absolute errors E_C = x - x_C and E_R = x - x_R for chopping and rounding respectively are bounded by

|E_C| = |v|\, \beta^{e-t} < \beta^{e-t}, \qquad |E_R| \le \tfrac{1}{2}\beta^{e-t}

and hence, since |x| \ge \beta^{e-1}, the upper bounds on the relative errors \epsilon_C = |E_C|/|x| and \epsilon_R = |E_R|/|x| are

\epsilon_C = \beta^{1-t}    (1)
\epsilon_R = \tfrac{1}{2}\beta^{1-t}    (2)

^2 See problem 1.
Let \circ denote any of the binary arithmetical operations (+, -, \times, /). Also, let x, y \in F(t, m, M) and denote by fl[x \circ y] \in F(t, m, M) the computed floating point value of x \circ y using the system F(t, m, M). Then \delta_C = (x_C - x)/x allows one to write x_C = x(1 + \delta_C) as a relation between the exact and chopped values, where |\delta_C| \le \epsilon_C from Eq. 1, and where a similar relation holds for x_R in terms of \epsilon_R. Then, one has

fl[x \circ y] = (x \circ y)(1 + \delta), \qquad |\delta| \le \epsilon    (3)

Consider the computed sum s = x_1 + x_2 + x_3 + x_4 + x_5, which is evaluated as

fl(s) = fl[fl[fl[fl[x_1 + x_2] + x_3] + x_4] + x_5]

On using Eq. 3 with + for \circ at each of the four stages,

fl(s) = ((((x_1 + x_2)(1 + \delta_1) + x_3)(1 + \delta_2) + x_4)(1 + \delta_3) + x_5)(1 + \delta_4)

Multiplying out and discarding terms of second and higher order in the \delta_k, one then arrives at an expression for the bound on the error of the computed sum

|fl(s) - s| \lesssim \epsilon \left( 4|x_1| + 4|x_2| + 3|x_3| + 2|x_4| + |x_5| \right)

since x_1 and x_2 pass through all four additions, x_3 through three, x_4 through two and x_5 through one.
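The accumulation of error in a computed sum can be observed by simulating t-digit chopped arithmetic. The helper `chop` is illustrative, not part of the notes; the bound applies the relative-error factor 10^{1-t} once for each addition a term passes through, so the first two terms of a five-term sum carry the factor 4 and the last term the factor 1:

```python
import math

def chop(x, t=4):
    """Chop x to a t-digit denary mantissa (illustrative helper)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    m = math.trunc(x * 10**(t - e))          # trunc chops toward zero
    return m * 10.0**(e - t)

xs = [0.5055, 0.4433, 0.1122, 0.6677, 0.8899]
s_exact = sum(xs)
# Left-to-right floating sum: every partial result is chopped, so x1 and
# x2 pass through all four additions, x5 through only one.
s_fl = xs[0]
for x in xs[1:]:
    s_fl = chop(s_fl + x)
err = abs(s_fl - s_exact)
bound = 10.0**(1 - 4) * (4*abs(xs[0]) + 4*abs(xs[1])
                         + 3*abs(xs[2]) + 2*abs(xs[3]) + abs(xs[4]))
assert err <= bound
print(err, bound)
```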
The errors discussed so far arise as a result of the machine representation of numbers, and the consequences which follow when arithmetical operations are carried out. They are present in numerical calculations because machine storage of numbers is discrete, whereas the numbers we want to use are part of a continuum. Given a particular machine environment, there is nothing that we can do to avoid such errors. However, there are errors which occur when carrying out arithmetic operations which can be avoided. Consider the floating point subtraction in the representation F(t, m, M) of two numbers which are very close in numerical value. We cannot specify the numbers to more than t-digit accuracy, so each may carry a relative error of up to \tfrac{1}{2}\beta^{1-t}. If the result of the subtraction is of the same order as the t-th digits of the original numbers, the error in the result will be of the same order as the result itself. For instance, consider f(x) = 1 - \cos x near to x = 0. From Eq. 2 the bound on the error in calculating \cos x close to x = 0 is \tfrac{1}{2}\beta^{1-t} |\cos x| \approx \tfrac{1}{2}\beta^{1-t}, which may be as large as or larger than f(x) itself. Whenever such a condition could appear, one can circumvent the problem by a judicious choice of alternative expression^6.

^6 See problem 3.
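One standard alternative expression for the example just given is the identity 1 - \cos x = 2\sin^2(x/2), which avoids subtracting two nearly equal numbers (the specific identity is a standard choice, not named in the notes). A short demonstration:

```python
import math

x = 1e-8
naive = 1.0 - math.cos(x)          # catastrophic cancellation near x = 0
stable = 2.0 * math.sin(x / 2)**2  # equivalent form with no subtraction of close numbers
# The exact value is ~x^2/2 = 5e-17; in double precision the naive form
# returns 0.0 here (a 100% relative error), while the stable form is accurate.
print(naive, stable)
```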
f(x) = x^2 - x - 2 = 0    (9)

One can use a g(x) in any of the forms

(1) g(x) = x^2 - 2
(2) g(x) = \sqrt{2 + x}
(3) g(x) = 1 + 2/x
(4) g(x) = x - (x^2 - x - 2)/m, \quad \forall m \in Z, \ m \ne 0

One then starts the calculation with an initial value x_0 which is intuitively close to the desired solution, and uses Eq. 8 to generate the iteration sequence.
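The iteration sequence can be sketched as follows; the function `fixed_point` and its tolerance parameters are our illustrative choices, and form (2), g(x) = \sqrt{2 + x}, is used because its iterates converge for this example:

```python
import math

def fixed_point(g, x0, tol=1e-10, max_iter=100):
    """Generate x_{n+1} = g(x_n) until successive iterates agree to tol.
    Illustrative implementation of a fixed-point iteration sequence."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("iteration failed to converge")

# Form (2): g(x) = sqrt(2 + x) converges to the root x = 2 of x^2 - x - 2
root = fixed_point(lambda x: math.sqrt(2.0 + x), x0=0.5)
print(root)   # ≈ 2.0
```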
It sometimes happens that not all iteration sequences converge^10. Also, the function g has to be chosen such that it is possible to generate the iteration sequence. For example, with g(x) = -\sqrt{x}, x_0 necessarily has to be positive, but then that would necessarily make x_1 = g(x_0) negative. Therefore, x_2 cannot be calculated. However, if it is the case that

\exists I = [a, b] \subseteq R \ \forall x \in I \ g(x) \in I

and that g(x) is continuous in I, then it is guaranteed that there is a fixed point of g in I. This is so because g(a) \ge a and g(b) \le b, so that h(x) = g(x) - x satisfies h(a) \ge 0 and h(b) \le 0. If g(x) is continuous in I then so is h(x), and so h(x) vanishes somewhere in I by the intermediate value theorem. This zero of h is a fixed point of g. In general, the fixed point \tilde{x} is located at the intersection of the curves y = x and y = g(x). Now, suppose that
\exists K > 0 \ \forall x \in I \ |g'(x)| \le K    (10)

Then the errors e_n = x_n - \tilde{x} of successive iterates satisfy

|e_n| \le K |e_{n-1}|    (11)
^10 See problem 4.
the criterion for a convergent iteration sequence is that the iteration function shall have a derivative with a maximum absolute value less than unity. Therefore, in the example Eq. 9, if we choose g(x) = x^2 - 2 as iteration function, then for x > 1/2 we have g'(x) > 1, whereas if we choose g(x) = \sqrt{2 + x} as iteration function, then 0 < g'(x) \le 1/\sqrt{8} for x \ge 0.
Eq. 11 shows that iteration functions with derivatives of smaller absolute value converge faster than those with larger derivatives. For fast convergence, therefore, we should choose a form of the iteration function whose derivative is minimised at \tilde{x}. If f is differentiable, then a fixed point form with the desirable features is derived by expanding f in a Taylor series and retaining only the linear term, that is

f(x + h) = f(x) + f'(x) h

and solving this linearised equation, that is solving f(x + h) = 0 with f(x + h) in the above linear approximation, instead of f(x) = 0 with all terms present. This amounts to replacing f(x) by a linear function l(x) tangent to f at x_0, whose zero is

x = x_0 - f(x_0)/f'(x_0)    (12)

In other words, we have a fixed point iteration with iteration function

g(x) = x - f(x)/f'(x)    (13)

The derivative of this iteration function at the fixed point is

g'(\tilde{x}) = 1 - \frac{(f'(\tilde{x}))^2 - f(\tilde{x}) f''(\tilde{x})}{(f'(\tilde{x}))^2}

which is zero since f(\tilde{x}) = 0. The method which uses the iteration function according to Eq. 13 is called the Newton-Raphson method. It should be noted that a solution depends upon the condition f'(x) \ne 0. If f'(x) is very small in the neighbourhood of a solution, then the solution determined by the Newton-Raphson method will be very sensitive to changes in the initial data. For instance, if f(x) = x^5 + 10^{-4} x = 0, then \tilde{x} = 0 is a solution. However, f'(0) = 10^{-4}, which is small. At x = 0.1, we have f(0.1) = 2 \times 10^{-5}. Therefore, if the precision with which we determine that a solution has been found according to f(x) = 0 is of this order, we can get a solution of x = 0.1, even though the error in the solution |x - \tilde{x}| = 0.1 is about 5000 times larger than the precision of f. Such a problem is said to be ill-conditioned.
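The Newton-Raphson iteration can be sketched as below. The stopping test |f(x)| < tol is one common choice of our own; as the ill-conditioning example shows, a small |f(x)| does not by itself guarantee a correspondingly small error in x:

```python
def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n), the iteration function of Eq. 13."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:        # caution: a small |f(x)| may hide an
            return x             # ill-conditioned problem (small f'(x))
        x = x - fx / fprime(x)
    return x

# Root of f(x) = x^2 - x - 2 starting near x = 3
root = newton_raphson(lambda x: x*x - x - 2, lambda x: 2*x - 1, x0=3.0)
print(root)   # ≈ 2.0
```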
5.1 Introduction
A system of m linear equations in n unknowns is usually expressed in the following form

a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = b_1
a_{21} x_1 + a_{22} x_2 + \ldots + a_{2n} x_n = b_2
  \vdots
a_{m1} x_1 + a_{m2} x_2 + \ldots + a_{mn} x_n = b_m

where the a_{ij} and b_i are given floating point numbers and the x_i are to be determined so as to satisfy the system. It is convenient to express the system in matrix form

A x = b    (14)

where A is the m \times n matrix

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}    (15)

and x and b are the column vectors

x = (x_1, x_2, \ldots, x_n)^T, \qquad b = (b_1, b_2, \ldots, b_m)^T    (16)

Matrices associated with linear systems can be classified as either dense or sparse. Dense matrices have very few zero valued elements and tend to be relatively small, whereas sparse matrices have very few nonzero valued elements and tend to be relatively large. Sparse matrices usually arise from attempts to solve differential equations by finite difference or finite element methods.
Eq. 14 falls into one of three categories:

1 if the system has no solution, the equations are said to be inconsistent, as for example
  2x_1 + 3x_2 = 5
  4x_1 + 6x_2 = 1

2 if the system has many solutions, the system is said to be dependent, as for example
  2x_1 + 3x_2 = 5
  4x_1 + 6x_2 = 10

3 if the system has exactly one solution for every b, it is said to be nonsingular. In this case we usually speak of the non-singularity of the coefficient matrix A. If this is the case, then the homogeneous equation Ax = 0 has only the trivial solution^11 x = 0.

Here, we will be concerned only with problems which fall into the third category, nonsingular systems. For such systems, it is necessary that the number of equations equals the number of unknowns, i.e., in Eq. 15, m = n, so that A is an n \times n square matrix and A is invertible. A frequently quoted test for invertibility is based upon whether det(A) \ne 0. If this is the case^12, then it is possible to express the solution in terms of determinants by using Cramer's rule^13. However, for problems in which n is large, Cramer's rule is not of practical interest, since the calculation of determinants is in general of the same order of difficulty as solving a linear system itself. Numerical methods for solving linear systems may be divided into two categories:

direct: these yield the exact solution in a finite number of elementary arithmetic operations, subject to round-off and other errors^14;

iterative: these start with an initial approximation, and by applying a suitable algorithm, successively better approximations are generated.

Direct methods are usually better suited to handling problems characterised by dense coefficient matrices, whereas iterative methods are more suitable for dealing with problems characterised by sparse coefficient matrices.
^11 This will be important later when the algebraic eigenvalue problem is discussed in section 6.1.

^12 If det(A) is small, i.e. if the problem is nearly singular, then one has an ill-conditioned problem. For example, the equations x - y = 1 and x - 1.0001y = 0 have solution x = 10 001, y = 10 000. However, the equations x - y = 1 and x - 0.9999y = 0 have solution x = -9999, y = -10 000! Yet, the coefficients in the two sets differ by at most two units in the fourth decimal place.

^13 Cramer's rule states that the solution to an n \times n linear system can be written down in terms of the determinants D, D_1, D_2, etc., where D = det(A), and the D_i are defined as the determinants of the matrices obtained by replacing the i-th column of A by the column vector b.

^14 In practice, because computers use finite length floating point representations, extremely poor results can be obtained.
if all of the a_{ii} are nonzero. A slightly more general form consists in A being upper-triangular. Then the system of equations is

a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = b_1
            a_{22} x_2 + \ldots + a_{2n} x_n = b_2
              \vdots
a_{n-1,n-1} x_{n-1} + a_{n-1,n} x_n = b_{n-1}
                          a_{nn} x_n = b_n    (17)

The last row involves only x_n, so that

x_n = \frac{b_n}{a_{nn}}    (18)

The penultimate, or (n-1)-th, row is

b_{n-1} = a_{n-1,n-1} x_{n-1} + a_{n-1,n} x_n \implies x_{n-1} = \frac{b_{n-1} - a_{n-1,n} x_n}{a_{n-1,n-1}}

where x_n on the right hand side is known from Eq. 18. Now x_n and x_{n-1}, which are now known, can be used in the (n-2)-th row to determine x_{n-2}, etc. This procedure for solving an upper triangular system of linear equations is called the backsubstitution algorithm.
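The backsubstitution algorithm can be sketched as follows (illustrative Python, with a small worked upper triangular system):

```python
def backsubstitute(a, b):
    """Solve an upper triangular system a x = b by backsubstitution:
    x_n = b_n / a_nn, then work upwards row by row."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # subtract the already-known terms a_ij * x_j for j > i
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

a = [[2.0, 1.0, -1.0],
     [0.0, 3.0,  2.0],
     [0.0, 0.0,  4.0]]
b = [2.0, 10.0, 8.0]
print(backsubstitute(a, b))   # [1.0, 2.0, 2.0]
```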
Most of the problems encountered in solving systems of linear equations do not present themselves in the upper triangular form. If there is a way of transforming a general square matrix into upper triangular form, then one is in a position to implement the backsubstitution algorithm and solve the problem. The most frequently used method is Gaussian elimination. We first note that the solution to the system of linear equations is invariant with respect to the elementary row operations:

1 multiplication of one of the equations by a non-zero constant
2 replacing an equation by the sum of that equation and any other equation in the system
3 interchanging any two equations
4 a combination of 1 and 2 above.

One can now see that repeated application of elementary row operations, under which the solution of the system is invariant, can transform the original system into upper-triangular form. This is achieved in the way outlined below.
First reduction step: We first eliminate all coefficients of x_1 apart from that occurring in the first row:

1.1 move to each row in turn, starting from row 2
1.2 at row i, add the constant m_{i1} = -a_{i1}/a_{11} multiplied by the corresponding coefficient of equation 1
1.3 do the same to the right hand side of row i^{15}.

After application of the above procedure, the system of equations looks like

a_{11} x_1 + a_{12} x_2 + a_{13} x_3 + \ldots + a_{1n} x_n = b_1
             a_{22} x_2 + a_{23} x_3 + \ldots + a_{2n} x_n = b_2
             a_{32} x_2 + a_{33} x_3 + \ldots + a_{3n} x_n = b_3
               \vdots
             a_{n2} x_2 + a_{n3} x_3 + \ldots + a_{nn} x_n = b_n

where of course, the coefficients in rows 2 to n are linear combinations of the original coefficients.

^{15} The effect of 1.2 and 1.3 together is to multiply the whole of row 1 by m_{i1} and add it to equation i, i.e., elementary row operation 4 above, but with the effect of eliminating a_{i1}.
Second reduction step: We next eliminate all coefficients of x_2 below the second row:

2.1 move to each row in turn, starting from row 3
2.2 at row i, add the constant m_{i2} = -a_{i2}/a_{22} multiplied by the corresponding coefficient of equation 2
2.3 do the same to the right hand side of row i.

After application of the above procedure, the system of equations looks like

a_{11} x_1 + a_{12} x_2 + a_{13} x_3 + \ldots + a_{1n} x_n = b_1
             a_{22} x_2 + a_{23} x_3 + \ldots + a_{2n} x_n = b_2
                          a_{33} x_3 + \ldots + a_{3n} x_n = b_3
                            \vdots
                          a_{n3} x_3 + \ldots + a_{nn} x_n = b_n

where again, the coefficients in rows 3 to n are linear combinations of the coefficients resulting after application of the first reduction step. This procedure is now repeated for third, fourth, etc. reduction steps until, after n - 1 steps, the resulting coefficient matrix is in upper-triangular form. Throughout, only elementary row operations have been performed, under which the solution remains invariant. The rules involved in the k-th reduction step can be illustrated by the flow diagram in Fig. 1. An algorithm for reduction to upper triangular form by Gaussian elimination consists in a series of such reduction steps from k = 1 to k = n - 1, after which the continually updated coefficients will be in the required form. This form can then be used as the input to the backsubstitution algorithm. A flow diagram for the Gaussian elimination procedure is outlined in Fig. 2. Stage 2 of the algorithm for performing the k-th reduction step in Fig. 1 requires that a_{kk} \ne 0. Whenever a_{kk} = 0 occurs, there are two possibilities:

1 \exists l > k such that a_{lk} \ne 0, i.e. there is a row which has not yet been visited that has a non-zero coefficient for x_k
2 \forall l > k, a_{lk} = 0, i.e. all succeeding coefficients of x_k are zero.

If the first possibility is the case, then one can simply interchange row k and row l, since row interchange is one of the elementary row operations under which the solution is invariant. If this can be done whenever a_{kk} = 0 occurs in the k-th reduction step, then the system is nonsingular. If the second possibility is the case, then either there are an infinite number of possible solutions (the system is dependent, see section 5.1.1), or there is no solution (the system is inconsistent). In either case, the system is definitely singular.
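The reduction procedure, including the row interchange used when a pivot a_{kk} vanishes (possibility 1 above), can be sketched as follows. This is an illustrative implementation, not the notes' own flow diagram; it modifies the coefficient arrays in place:

```python
def gauss_eliminate(a, b):
    """Reduce a x = b to upper triangular form by elementary row
    operations, interchanging rows when a pivot a_kk is zero."""
    n = len(b)
    for k in range(n - 1):
        if a[k][k] == 0.0:                      # pivot failure: look below
            l = next(i for i in range(k + 1, n) if a[i][k] != 0.0)
            a[k], a[l] = a[l], a[k]             # elementary row operation 3
            b[k], b[l] = b[l], b[k]
        for i in range(k + 1, n):
            m = -a[i][k] / a[k][k]              # multiplier m_ik = -a_ik / a_kk
            for j in range(k, n):
                a[i][j] += m * a[k][j]
            b[i] += m * b[k]
    return a, b

a = [[0.0,  1.0,  1.0],
     [2.0,  1.0, -1.0],
     [1.0, -1.0,  1.0]]
b = [4.0, 1.0, 1.0]
gauss_eliminate(a, b)   # a is now upper triangular, ready for backsubstitution
```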
[Flow diagram omitted. For each row i = k+1, ..., n: compute m_{ik} = -a_{ik}/a_{kk}; do j = k, n: a_{ij} = a_{ij} + m_{ik} a_{kj}; end do; then b_i = b_i + m_{ik} b_k; end do; stop.]

Figure 1: The sequence of operations in making the k-th reduction step to upper-triangular form. Notice that in stage 2 above, it is necessary that a_{kk} \ne 0. See section 5.3.2.
[Flow diagram omitted. do k = 1, n - 1: perform the k-th reduction as outlined in Fig. 1; end do; then output the new coefficients and right-hand sides to the backsubstitution algorithm to obtain the final result.]

Figure 2: Flow diagram for reduction of a square n \times n matrix to upper triangular form using Gaussian elimination. The procedure outputs the transformed a_{ij} and b_i, which can become the input for the backsubstitution algorithm.
The last section discussed the Gaussian elimination algorithm, which is useful for moderately sized systems of linear equations. This is what was called a direct method, which in principle yields an exact result, notwithstanding errors due to round-off and computer arithmetic. Certain problems, however, involve the inversion of very large matrices. Since matrix inversion involves of order n^3 floating point operations, where n is the dimension of the matrix, the inversion of systems of dimension 10^4 would require 10^{12} floating point operations. Since such problems usually arise from the solution of differential equations, the solution usually has to be marched across some parameter subspace^{16}. In such circumstances, one then applies approximate methods for finding the solution. These invariably involve some iterative procedure.
As with all iterative procedures, one starts with an assumed trial solution x^{(0)}. Then the simplest algorithm proceeds by solving the first equation for x_1 in terms of b_1 and the trial components x^{(0)}_{i \ne 1}, solving the second for x_2 in terms of b_2 and the trial components x^{(0)}_{i \ne 2}, etc. The obtained solution then becomes the first approximation, \forall i \ x_i = x^{(1)}_i. Then, after the k-th iteration, the i-th equation takes the form^{17}

x_i^{(k)} = \frac{1}{a_{ii}} \left[ b_i - a_{i1} x_1^{(k-1)} - a_{i2} x_2^{(k-1)} - \ldots - a_{i,i-1} x_{i-1}^{(k-1)} - a_{i,i+1} x_{i+1}^{(k-1)} - \ldots - a_{in} x_n^{(k-1)} \right]    (19)

In an application for which there are many zeros (i.e., the coefficient matrix is sparse) which occur in regular patterns, this can be utilised in order to avoid making needless multiplications a_{ij} x_j^{(k)} where a_{ij} = 0. Such a strategy then reduces the number of floating point operations which have to be performed, and hence also the accumulated error (recall lecture 2). The procedure outlined above is known as the Jacobi algorithm, and is represented in flow diagram form in Fig. 3.
In evaluating x_i^{(k)} in Eq. 19, it can be noticed that all x_j^{(k)} with j < i have previously been determined. A faster convergence can be obtained if these values are used in the right hand side of Eq. 19 instead of the old values x_j^{(k-1)} with j < i. This is the basis of the Gauss-Seidel method, and the evaluation stage is given by the equation

x_i^{(k)} = \frac{1}{a_{ii}} \left[ b_i - a_{i1} x_1^{(k)} - a_{i2} x_2^{(k)} - \ldots - a_{i,i-1} x_{i-1}^{(k)} - a_{i,i+1} x_{i+1}^{(k-1)} - \ldots - a_{in} x_n^{(k-1)} \right]    (20)
^{16} For instance, in the solution of the Landau-Lifshitz equation (see assignment 2) for many-degree-of-freedom systems, the solution has to be determined over a given range of time and external control field. If the time and control field parameters are discretised into say N timesteps and M field steps, then the inversion has to be done N \times M times. Therefore, the entire simulation would take of order n^3 N M floating point operations!

^{17} Obviously, a_{ii} must be non-zero. For a non-singular system, one can always achieve this by interchanging rows, which is an elementary row operation under which the solution remains invariant.
[Flow diagram omitted. Start procedure: supply initial data. Main iteration loop: apply equation 19; test whether x^{(k)} agrees with x^{(k-1)} to the required accuracy. If no, the updated solution becomes the old solution and the loop repeats; if yes, terminate and output the solution. If more than N_max iterations are made, report ERROR: the algorithm has failed to converge.]

Figure 3: Flow diagram for the solution of a system of linear equations using the Jacobi algorithm. N_max is the maximum number of iterations after which it is deemed that the solution has not converged. At the stage where the x_i^{(k)} are generated from the x_i^{(k-1)}, the occurrence of zeros in definite patterns in the matrix of coefficients should be used to avoid unnecessary multiplications.
An algorithm for the implementation of the Gauss-Seidel method will be virtually identical to that for the Jacobi method, except that Eq. 20 is used in place of Eq. 19.
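Both iterations can be sketched as follows. As an illustrative simplification, the convergence test of the flow diagram is replaced here by a fixed number of sweeps; the example system is diagonally dominant, for which both methods converge:

```python
def jacobi(a, b, x0, iters=50):
    """Jacobi sweeps, Eq. 19: every new component is built from the
    previous iterate only."""
    n = len(b)
    x = list(x0)
    for _ in range(iters):
        x = [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
             for i in range(n)]
    return x

def gauss_seidel(a, b, x0, iters=50):
    """Gauss-Seidel sweeps, Eq. 20: components already updated in this
    sweep are used immediately on the right hand side."""
    n = len(b)
    x = list(x0)
    for _ in range(iters):
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / a[i][i]
    return x

# Diagonally dominant system with solution x = (1, 2, -1)
a = [[4.0, 1.0, 1.0],
     [1.0, 5.0, 1.0],
     [1.0, 1.0, 3.0]]
b = [5.0, 10.0, 0.0]
print(jacobi(a, b, [0.0, 0.0, 0.0]))
print(gauss_seidel(a, b, [0.0, 0.0, 0.0]))
```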
Eigenvalues are of great importance in many physical problems, and so it is necessary to have at hand robust ways of systematically computing eigenvalues and their associated eigenvectors. The algebraic eigenvalue problem is a vast and important subject, and as such, many methods have been developed to obtain solutions under a variety of conditions. The selection of method depends on what type of matrix is involved, and what information is required. Here, we describe probably the simplest method of locating a given eigenvalue by an iterative procedure, the so-called power method. Later, we describe an algorithm which is used to compute all eigenvalues and eigenvectors of a real symmetric matrix, the method of Householder tridiagonalisation and QR factorisation. Mathematically, the algebraic eigenvalue problem arises whenever the right hand side of the system of linear equations, Eq. 14, is a multiple of the vector of unknowns, that is, whenever b = \lambda x, so that

(A - \lambda I) x = 0    (21)

where I is the unit n \times n matrix and \lambda is a scalar quantity. But from the discussion of non-singular matrices in section 5.1.1, if the matrix A - \lambda I is non-singular, only the trivial solution x = 0 exists. Therefore, for non-trivial solutions of Eq. 21 to exist, the matrix A - \lambda I must be singular, i.e. non-invertible.
The power method generates the sequence of vectors

x_1 = A x_0, \quad x_2 = A x_1, \quad \ldots, \quad x_k = A x_{k-1}    (22)

from some initial vector x_0. For a real symmetric matrix A, an estimate q of an eigenvalue is obtained from the scalar products

m_0 = x_{k-1}^T x_{k-1}, \qquad m_1 = x_{k-1}^T x_k, \qquad m_2 = x_k^T x_k, \qquad q = \frac{m_1}{m_0}    (23)

and the error of the estimate is bounded by \delta, where

\delta^2 = \frac{m_2}{m_0} - q^2    (24)

that is, there is an eigenvalue \lambda_c of A with |\lambda_c - q| \le \delta. To see this, note that the normalised eigenvectors z_i, i = 1, \ldots, n of the real symmetric matrix A form a complete orthonormal set which spans the vector space V_n = \{y = (y_1, y_2, \ldots, y_n)\}, so that we can write

x_{k-1} = \sum_{i=1}^n c_i z_i, \qquad m_0 = \sum_{i=1}^n c_i^2, \qquad x_k = A x_{k-1} = \sum_{i=1}^n c_i A z_i = \sum_{i=1}^n c_i \lambda_i z_i    (25)

Then

x_k - q x_{k-1} = \sum_{i=1}^n c_i (\lambda_i - q) z_i

and hence

|x_k - q x_{k-1}|^2 = m_2 - 2 q m_1 + q^2 m_0 = m_2 - q^2 m_0 = \delta^2 m_0

where Eq. 22 has been used in the second step and Eq. 23 has been used in the final step. On the other hand, by orthonormality of the z_i,

|x_k - q x_{k-1}|^2 = \sum_{i=1}^n c_i^2 (\lambda_i - q)^2 \ge (\lambda_c - q)^2 \sum_{i=1}^n c_i^2 = (\lambda_c - q)^2 m_0

where \lambda_c is the eigenvalue closest to q, and where the second of Eq. 25 has been used. Dividing throughout by m_0, we arrive at (\lambda_c - q)^2 \le \delta^2, i.e. the bound of Eq. 24.
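The power method with the estimate q = m_1/m_0 of Eq. 23 can be sketched as follows. Normalising each iterate is an implementation detail of ours (it avoids overflow of the components of x_k and does not change q):

```python
import math

def power_method(a, x0, iters=100):
    """Power iteration x_k = A x_{k-1} (Eq. 22) with the eigenvalue
    estimate q = m1/m0 (Eq. 23)."""
    x = list(x0)
    n = len(x)
    for _ in range(iters):
        y = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]  # y = A x
        m0 = sum(xi * xi for xi in x)              # m0 = x^T x
        m1 = sum(xi * yi for xi, yi in zip(x, y))  # m1 = x^T (A x)
        q = m1 / m0
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]                # normalised next iterate
    return q, x

a = [[2.0, 1.0],
     [1.0, 2.0]]   # symmetric; eigenvalues 3 and 1
q, v = power_method(a, [1.0, 0.0])
print(q)   # ≈ 3.0, the dominant eigenvalue
```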
We now turn to the problem of finding all the eigenvalues and associated eigenvectors for a given real symmetric matrix, that is, a matrix which has A = A^T. Two matrices \hat{A} and A are said to be similar if

\hat{A} = T^{-1} A T    (26)

for some non-singular n \times n matrix T. Eq. 26 is called a similarity transformation. These are important because they preserve eigenvalues: the eigenvalues of A and \hat{A} are the same. Moreover, if x is an eigenvector of A, then \hat{x} = T^{-1} x is an eigenvector of \hat{A} belonging to the same eigenvalue, since from Eq. 21 we have

A x = \lambda x \implies \hat{A} \hat{x} = T^{-1} A T T^{-1} x = T^{-1} A x = \lambda T^{-1} x = \lambda \hat{x}
this strategy is Householder's method. This employs n - 2 successive similarity transformations in order to reduce A to tridiagonal form^{18}. The matrices under which these similarity transformations are carried out, denoted by P_1, P_2, ..., P_{n-2}, are orthogonal and symmetric, and hence^{19} P_i^{-1} = P_i^T = P_i. These similarity transformations generate a sequence of matrices A_0 = A, A_1, A_2, ..., A_{n-3} given by

A_1 = P_1 A_0 P_1
A_2 = P_2 A_1 P_2
  \vdots
\hat{A} = P_{n-2} A_{n-3} P_{n-2}    (27)

The idea is that these transformations create the necessary zeros in row 1, column 1 in the first step, row 2, column 2 in the second step, etc. These steps are illustrated for a 5 \times 5 matrix in Fig. 4. The result is that \hat{A} is tridiagonal.

[Illustration omitted: First step A_1 = P_1 A_0 P_1; Second step A_2 = P_2 A_1 P_2; Third step A_3 = P_3 A_2 P_3.]

Figure 4: Illustration of Householder's method for a 5 \times 5 matrix. The positions left blank are the zeros created by the transformations.
^{18} A tridiagonal matrix has non-zero elements on the diagonal, superdiagonal and subdiagonal only; that is to say, only the elements a_{ii}, a_{i,i+1} and a_{i+1,i} can be different from zero.

^{19} Recall that an orthogonal matrix is a real n \times n matrix whose column vectors e_i form an orthonormal system, so that e_i^T e_j = \delta_{ij}, where \delta_{ij} is the Kronecker delta, equal to 1 when i = j and 0 when i \ne j. This means that, if E is an orthogonal matrix with column vectors e_i, then E^T E, which has elements e_i^T e_j, is equal to the unit matrix, which implies that E^T = E^{-1}.
The transformation matrices have the form^{20}

P_k = I - 2 v_k v_k^T    (28)

where the v_k are unit vectors whose first k components are zero:

v_1 = (0, v_{21}, v_{31}, \ldots, v_{n1})^T, \quad v_2 = (0, 0, v_{32}, \ldots, v_{n2})^T, \quad \ldots, \quad v_{n-2} = (0, \ldots, 0, v_{n-1,n-2}, v_{n,n-2})^T

The sequence of calculations that give rise to the v_k is depicted in Fig. 5.
6.3.2 QR Factorisation
Having transformed the original matrix A into tridiagonal form using successive Householder transformations, under which the eigenvalues are invariant, we now have to solve the eigenvalue problem for the transformed matrix B_0 = A_{n-2}. This is accomplished by the so-called QR method as follows.

1 Factor B_0 = Q_0 R_0, where Q_0 is orthogonal and R_0 is an upper triangular matrix. Then compute

B_1 = R_0 Q_0

2 Factor

B_1 = Q_1 R_1

then compute

B_2 = R_1 Q_1

and so on. In the general step s, one factors

B_s = Q_s R_s

^{20} From Eq. 28, P_k^T = (I - 2 v_k v_k^T)^T = I - 2 v_k v_k^T = P_k, so P_k is symmetric, and P_k^2 = (I - 2 v_k v_k^T)^2 = I - 4 v_k v_k^T + 4 v_k (v_k^T v_k) v_k^T = I - 4 v_k v_k^T + 4 v_k v_k^T = I, where the fact that v_k^T v_k = 1 has been used in the penultimate step. Hence P_k^{-1} = P_k = P_k^T, i.e. P_k is both symmetric and orthogonal.

^{21} The asterisks denote components other than zeros.
[Figure content omitted: the explicit construction of v_1 (first step) and v_2 (second step) from |a_{21}|, etc.]

Figure 5: Definition of the unit vectors v_k, showing explicitly the first (v_1) and second (v_2) steps. The components v_{ij} refer to the i-th component of unit vector v_j. a_{ij}^{(k)} refers to elements of A_k, and a_{ij} are the elements of the original matrix A. In the k-th step, sgn(a_{k+1,k}^{(k-1)}) = +1 when a_{k+1,k}^{(k-1)} \ge 0 and sgn(a_{k+1,k}^{(k-1)}) = -1 when a_{k+1,k}^{(k-1)} < 0. After v_1 is computed, P_1 is determined from Eq. 28 and then A_1 from Eq. 27. Step 2 is the same as step 1 with all subscripts increased by 1 and a_{ij} replaced by a_{ij}^{(1)}, which has just been computed. Thus, we obtain v_2 and hence P_2 according to Eq. 28. In turn, A_2 can then be computed from Eq. 27, upon which v_3 can be determined, etc.
followed by computation of

B_{s+1} = R_s Q_s

If the eigenvalues^{22} of B_0 are all different in absolute value, then we have that

\lim_{s \to \infty} B_s = D

where D is a diagonal matrix with the required eigenvalues \lambda_i as its diagonal elements^{23}.

^{22} R_s = Q_s^{-1} B_s from the factorisation in the general step s. Therefore, B_{s+1} = R_s Q_s = Q_s^{-1} B_s Q_s, so that B_{s+1} is similar to B_s and hence, by induction, to B_0 and A.

^{23} A proof of this assertion can be found in Wilkinson, J.H., The Algebraic Eigenvalue Problem. Oxford: Clarendon, 1965.
The factorisation itself is obtained in the following way. The tridiagonal matrix B_0 has n - 1 generally non-zero elements b_21, b_32, ..., b_{n,n-1} below the main diagonal. Suppose that B_0 is multiplied on the left by a matrix C_2 such that the result C_2 B_0 has b_21 = 0. Then the result is multiplied on the left by a matrix C_3 such that C_3 C_2 B_0 has b_32 = 0, and so on. After n - 1 such multiplications one is left with an upper triangular matrix R_0 = C_n C_{n-1} ... C_3 C_2 B_0. The C_j are orthogonal plane rotations; that is, they contain the 2 x 2 submatrix

    [  cos theta_j   sin theta_j ]
    [ -sin theta_j   cos theta_j ]

in rows j - 1, j and columns j - 1, j, with 1's on all other main diagonal elements and zeros everywhere else. Since the C_j are orthogonal, so is their product, and so is its inverse; this is Q_0 = [C_n C_{n-1} ... C_3 C_2]^{-1}. Therefore, according to the scheme^{24},

    B_1 = R_0 Q_0 = R_0 C_2^T C_3^T ... C_{n-1}^T C_n^T

Now, in the first operation C_2 B_0, requiring that the (2,1) element of the product vanish gives

    cos theta_2 = 1 / sqrt(1 + (b_21/b_11)^2),    sin theta_2 = (b_21/b_11) / sqrt(1 + (b_21/b_11)^2)        (29)

and similarly for theta_3, theta_4, etc.

^{24} Notice that we do not need Q_0 explicitly, since to get B_1 we compute R_0 C_2^T, then R_0 C_2^T C_3^T, and so on.
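The rotation angles of Eq. 29 can be sketched in a few lines (illustrative code, not from the notes; it assumes B is tridiagonal, stored as a list of lists, so that one rotation per subdiagonal element suffices):

```python
import math

def givens_zero_subdiagonal(B):
    """Annihilate the subdiagonal elements of a tridiagonal matrix by
    successive plane rotations, as in Eq. 29; returns the upper
    triangular factor R (B is modified in place)."""
    n = len(B)
    for j in range(1, n):            # rotation C_{j+1} zeroes B[j][j-1]
        p, q = B[j - 1][j - 1], B[j][j - 1]
        r = math.hypot(p, q)
        if r == 0.0:
            continue
        c, s = p / r, q / r          # cos(theta), sin(theta), cf. Eq. 29
        for col in range(n):         # apply the 2x2 rotation to rows j-1, j
            u, v = B[j - 1][col], B[j][col]
            B[j - 1][col] = c * u + s * v
            B[j][col] = -s * u + c * v
    return B

R = givens_zero_subdiagonal([[2.0, 1.0, 0.0],
                             [1.0, 2.0, 1.0],
                             [0.0, 1.0, 2.0]])
# R[1][0] and R[2][1] are now zero to rounding error
```

Note that c = p/r and s = q/r are algebraically identical to Eq. 29 after dividing numerator and denominator by b_11, but avoid overflow when b_21/b_11 is large.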
7.1 Introduction
Consider the n-th order ordinary differential equation

    y^{(n)}(x) = f(x, y(x), y'(x), ..., y^{(n-1)}(x))        (30)

where

    y'(x) = dy/dx,  y''(x) = d^2 y/dx^2,  ...,  y^{(k)}(x) = d^k y/dx^k.

Since the highest order derivative is y^{(n)}(x), the solution y(x) will involve n constants of integration, and will constitute an n-parameter family of functions. A unique solution therefore requires n auxiliary conditions to be satisfied. If the solution is required on x in [a, b], and these auxiliary conditions are specified by

    y(a) = alpha_1,  y'(a) = alpha_2,  ...,  y^{(n-1)}(a) = alpha_n

then the problem is referred to as an n-th order initial value problem. When this is the case, one may introduce new variables

    z_1(x) = y(x),  z_2(x) = y'(x),  ...,  z_n(x) = y^{(n-1)}(x).

In terms of the z_k(x), Eq. 30 becomes

    z_n'(x) = f(x, z_1(x), z_2(x), ..., z_n(x)),    z_n(a) = alpha_n

which together with the derivatives of the z_k(x),

    z_1'(x) = z_2(x),            z_1(a) = alpha_1
    z_2'(x) = z_3(x),            z_2(a) = alpha_2
        ...
    z_{n-1}'(x) = z_n(x),        z_{n-1}(a) = alpha_{n-1}

comprise a system of n coupled first order differential equations in the n unknown functions z_k(x), 1 <= k <= n. Therefore, an n-th order initial value problem in one variable x can always be treated as a system of n coupled first order initial value problems. Numerical methods for solving a first order initial value problem are easily generalised to a system of coupled first order initial value problems, and so it is sufficient to study the first order initial value problem

    y'(x) = f(x, y(x)),    y(a) = alpha.
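As a concrete illustration (my own example, not from the notes), the second order problem y'' = -y with y(0) = 1, y'(0) = 0 becomes the system z_1' = z_2, z_2' = -z_1:

```python
def second_order_as_system(x, z):
    """y'' = -y rewritten with z1 = y, z2 = y'; returns (z1', z2')."""
    z1, z2 = z
    return (z2, -z1)

# One Euler step of the coupled system, h = 0.1, from (z1, z2) = (1, 0):
h = 0.1
z = (1.0, 0.0)
dz = second_order_as_system(0.0, z)
z = (z[0] + h * dz[0], z[1] + h * dz[1])
# z is now approximately (1.0, -0.1): y barely changes, y' starts to fall
```

Any single-equation method below applies componentwise to such a system.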
The most basic numerical algorithms for solving an initial value problem of the form

    y'(x) = f(x, y),    y(x_0) = y_0        (31)

for x in [x_0, X] are the single-step methods. These proceed from y(x_0) = y_0, and advance across x in finite but small steps h, computing approximate values of the solution y(x_k) at the 'grid points'

    x_1 = x_0 + h,  x_2 = x_0 + 2h,  x_3 = x_0 + 3h,  etc.

The computation is achieved by expanding y(x) according to

    y(x + h) = y(x) + h y'(x) + (h^2/2) y''(x) + ...
             ~ y(x) + h y'(x) = y(x) + h f(x, y)        (32)

where the fact that h is small has been used in the second step, and where Eq. 31 has been used in the last step. The fact that only the first order term proportional to h has been retained in the expansion^{25} leads to the designation of methods based on this approximation as first order methods. Neglecting the higher order terms h^2, etc. naturally causes a truncation error. If one chooses h to be as small as possible, the truncation error will be minimised; however, this will be at the expense of the need to make excessively many steps to cover [x_0, X], with the consequent accumulation of round-off error^{26}. Now, evaluating Eq. 32 about x_0,

    y(x_1) = y_0 + h f(x_0, y_0).

Writing y(x_1) ~ y_1, this can be substituted into Eq. 32 evaluated about x_1,

    y_2 = y_1 + h f(x_1, y_1)

etc., so that the general step will be

    y_{k+1} = y_k + h f(x_k, y_k).

This process of advancing across x continues until x_k >= X, at which point the calculation is complete. Obviously, a truncation error is entailed at each step because only the first order term has been retained, and so after many steps one expects that the error in the solution will have accumulated considerably.
^{25} Therefore, the truncation error of a single step is of the order of h^2.

^{26} There will also be an error generated as a result of the fact that f is evaluated at (x_n, y_n) instead of (x_n, y(x_n)). If f depends very strongly upon y, i.e. f varies rapidly as y varies (df/dy large in comparison to y(x_n) - y_n), this error could be large. This consideration will force the use of a very small value of h.
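The Euler recurrence above can be sketched as follows (illustrative code, not from the notes; the small tolerance in the loop test guards against round-off in the accumulated x):

```python
def euler(f, x0, y0, h, X):
    """Advance y' = f(x, y), y(x0) = y0 across [x0, X] in steps h
    using y_{k+1} = y_k + h f(x_k, y_k)."""
    x, y = x0, y0
    while x + h <= X + 1e-12:      # tolerance absorbs round-off in x
        y = y + h * f(x, y)
        x = x + h
    return x, y

# y' = y, y(0) = 1: the exact solution is e^x, so y(1) should be near e
x, y = euler(lambda x, y: y, 0.0, 1.0, 0.001, 1.0)
# the global error is O(h), here roughly 1.4e-3
```

Halving h roughly halves the final error, confirming the first order behaviour.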
Now, instead of computing the partial derivatives of f and applying the chain rule, one approximates

    df(x, y(x))/dx ~ [f(x + h, y(x + h)) - f(x, y(x))] / h

and substituting in Eq. 34,

    y_{k+1} = y_k + h f(x_k, y_k) + (1/2) h^2 [f(x_{k+1}, y_{k+1}) - f(x_k, y_k)] / h
            = y_k + (1/2) h [f(x_k, y_k) + f(x_{k+1}, y_{k+1})]        (35)

In this equation, y_{k+1} occurs on both sides. In order to obtain an explicit formula for y_{k+1}, consider

    f(x_{k+1}, y_{k+1}) ~ f(x_{k+1}, y_k + h y_k')
^{27} It can be shown that the truncation error involved in the improved Euler method is of the order of h^3. The improved Euler method is therefore a second order method.

^{28} Fourth order because it can be shown that the truncation error is of the order of h^5. Generally, if the truncation error is of order h^{k+1}, the method is classed as a k-th order method.
which is obtained by expanding y_{k+1} to first order. Substituting this into Eq. 35 one obtains

    y_{k+1} = y_k + (1/2) h [f(x_k, y_k) + f(x_{k+1}, y_k + h f(x_k, y_k))]

This is the recurrence formula for the second order Runge-Kutta method^{27}. Like the modified Euler method, this has truncation error proportional to h^3.
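A minimal sketch of this second order recurrence (my own code; the test problem y' = y is an arbitrary choice):

```python
def rk2(f, x0, y0, h, X):
    """Second order Runge-Kutta: y_{k+1} = y_k
    + (h/2)[f(x_k, y_k) + f(x_{k+1}, y_k + h f(x_k, y_k))]."""
    x, y = x0, y0
    while x + h <= X + 1e-12:
        k1 = f(x, y)
        k2 = f(x + h, y + h * k1)      # f evaluated at the Euler predictor
        y = y + 0.5 * h * (k1 + k2)
        x = x + h
    return x, y

x, y = rk2(lambda x, y: y, 0.0, 1.0, 0.01, 1.0)
# global error is O(h^2), far smaller than Euler's for the same h
```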
The classical fourth order^{28} Runge-Kutta method advances each step using the four evaluations

    a_1 = h f(x_k, y_k)
    a_2 = h f(x_k + h/2, y_k + a_1/2)
    a_3 = h f(x_k + h/2, y_k + a_2/2)
    a_4 = h f(x_k + h, y_k + a_3)

which are combined to give y_{k+1} = y_k + (a_1 + 2 a_2 + 2 a_3 + a_4)/6.
A flow diagram for the implementation^{29} of this scheme is depicted in Fig. 6. The major drawback of the Runge-Kutta scheme is the number of function evaluations per step: in the fourth order Runge-Kutta algorithm, f(x, y) has to be evaluated four times for every timestep. In a simulation with perhaps thousands of variables, which has to run over perhaps tens of thousands of timesteps, this can be prohibitive. Lecture 11 will introduce methods which require fewer function evaluations and exhibit better stability characteristics.
^{29} In an actual implementation of this algorithm, one might want to include step-size control, that is, methods of optimising the performance by adjusting h so that accuracy is maintained, while excessive computation due to h being smaller than necessary is minimised.
    input {x0, X, y0, h}
    x = x0, y = y0
    do while x + h <= X
        a1 = h f(x, y)
        a2 = h f(x + 0.5h, y + 0.5 a1)
        a3 = h f(x + 0.5h, y + 0.5 a2)
        a4 = h f(x + h, y + a3)
        y = y + (1/6)(a1 + 2 a2 + 2 a3 + a4)
        x = x + h
    end do
    output {x, y}
    stop

Figure 6: Flow diagram for the fourth order Runge-Kutta scheme.
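The flow diagram of Fig. 6 translates almost line for line into code (an illustrative sketch, not part of the notes):

```python
def rk4(f, x0, y0, h, X):
    """Classical fourth order Runge-Kutta, following the flow diagram
    of Fig. 6."""
    x, y = x0, y0
    while x + h <= X + 1e-12:
        a1 = h * f(x, y)
        a2 = h * f(x + 0.5 * h, y + 0.5 * a1)
        a3 = h * f(x + 0.5 * h, y + 0.5 * a2)
        a4 = h * f(x + h, y + a3)
        y = y + (a1 + 2.0 * a2 + 2.0 * a3 + a4) / 6.0
        x = x + h
    return x, y

x, y = rk4(lambda x, y: y, 0.0, 1.0, 0.1, 1.0)
# even at the coarse step h = 0.1, y agrees with e to about five
# decimal places, illustrating the O(h^4) global error
```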
Integrating the differential equation Eq. 31 gives

    y(x + t) = y(x) + int_x^{x+t} f(s, y(s)) ds        (36)

where a <= x < x + t <= b and the solution is required across x in [a, b]. One can now replace the integral by a numerical integration formula.
Suppose that one integrates from x to x + 2h using Simpson's rule,

    int_x^{x+2h} f(s, y(s)) ds ~ (h/3) [f(x, y(x)) + 4 f(x + h, y(x + h)) + f(x + 2h, y(x + 2h))].

Then Eq. 36 becomes

    y(x + 2h) = y(x) + (h/3) [f(x, y(x)) + 4 f(x + h, y(x + h)) + f(x + 2h, y(x + 2h))]

which can be written as a difference equation,

    y_{k+2} = y_k + (h/3) [f(x_k, y_k) + 4 f(x_{k+1}, y_{k+1}) + f(x_{k+2}, y_{k+2})].

This formula, called Simpson's method or Milne's method, is implicit, in that y_{k+2} occurs on both sides. Similarly, any integration formula that uses the value of the integrand at the upper limit will lead to an implicit formula. If f(x, y(x)) is a nonlinear function, in general these implicit equations cannot be solved exactly. However, one can attempt to solve them by means of iteration. Therefore, in the equation above, if y_{k+1} is known, one can obtain a first approximation to y_{k+2} by using the Euler formula

    y_{k+2}^{(0)} = y_{k+1} + h f(x_{k+1}, y_{k+1}).

One can then evaluate f(x_{k+2}, y_{k+2}^{(0)}), which is substituted into the difference equation,

    y_{k+2}^{(1)} = y_k + (h/3) [f(x_k, y_k) + 4 f(x_{k+1}, y_{k+1}) + f(x_{k+2}, y_{k+2}^{(0)})].

One can then evaluate f(x_{k+2}, y_{k+2}^{(1)}), which is substituted in turn,

    y_{k+2}^{(2)} = y_k + (h/3) [f(x_k, y_k) + 4 f(x_{k+1}, y_{k+1}) + f(x_{k+2}, y_{k+2}^{(1)})].

The general step in the iteration procedure is given by

    y_{k+2}^{(m+1)} = y_k + (h/3) [f(x_k, y_k) + 4 f(x_{k+1}, y_{k+1}) + f(x_{k+2}, y_{k+2}^{(m)})].

This procedure is continued until two successive iterates agree to the desired accuracy. Obviously, such a method is expensive computationally, depending upon how fast the iteration procedure converges. However, for stiff differential equations or systems of equations^{30}, it provides the most stable option available.
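One double step of this iteration can be sketched as follows (illustrative code; a fixed number of correction sweeps replaces the agreement test, and the starting value y_{k+1} must be supplied):

```python
def milne_step(f, xk, yk, yk1, h, sweeps=10):
    """Advance Simpson's (Milne's) implicit rule by one double step:
    iterate y_{k+2} = y_k + (h/3)[f_k + 4 f_{k+1} + f_{k+2}],
    starting from an Euler predictor for y_{k+2}."""
    xk1, xk2 = xk + h, xk + 2.0 * h
    y = yk1 + h * f(xk1, yk1)             # Euler predictor
    for _ in range(sweeps):               # fixed-point correction sweeps
        y = yk + (h / 3.0) * (f(xk, yk) + 4.0 * f(xk1, yk1) + f(xk2, y))
    return y

# y' = y, y(0) = 1, h = 0.1, with y_1 taken from the exact solution:
import math
y2 = milne_step(lambda x, y: y, 0.0, 1.0, math.exp(0.1), 0.1)
# y2 approximates e^0.2 = 1.2214... very closely
```

The sweeps converge because each one multiplies the error by roughly h/3; for small h a handful of sweeps suffices.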
Alternatively, one can use the so-called 'open type' Newton-Cotes integration formulas of the form

    int_x^{x+3h} f(s, y(s)) ds ~ (3h/2) [f(x + h, y(x + h)) + f(x + 2h, y(x + 2h))]

    int_x^{x+4h} f(s, y(s)) ds ~ (4h/3) [2 f(x + h, y(x + h)) - f(x + 2h, y(x + 2h)) + 2 f(x + 3h, y(x + 3h))]

    int_x^{x+5h} f(s, y(s)) ds ~ (5h/24) [11 f(x + h, y(x + h)) + f(x + 2h, y(x + 2h)) + f(x + 3h, y(x + 3h)) + 11 f(x + 4h, y(x + 4h))]

etc.
^{30} A stiff differential equation is especially difficult to solve because different processes in the system behave with significantly different time scales. Most of the basic methods which are described here exhibit extreme instability when applied to problems of this type. However, implicit methods such as that described above prove to be the most stable methods for this type of problem.
which lead to difference equations of the form

    y_{k+3} = y_k + (3h/2) [f(x_{k+1}, y_{k+1}) + f(x_{k+2}, y_{k+2})]

    y_{k+4} = y_k + (4h/3) [2 f(x_{k+1}, y_{k+1}) - f(x_{k+2}, y_{k+2}) + 2 f(x_{k+3}, y_{k+3})]

    y_{k+5} = y_k + (5h/24) [11 f(x_{k+1}, y_{k+1}) + f(x_{k+2}, y_{k+2}) + f(x_{k+3}, y_{k+3}) + 11 f(x_{k+4}, y_{k+4})]

etc. These are all explicit formulas, where the current step is the left hand side, expressed in terms of previous evaluations appearing on the right hand side. For instance, the first of these is a 3-step method, which uses the three previous values at x_k, x_{k+1} and x_{k+2} in order to evaluate y_{k+3}. Similarly, the second and third of these are respectively four- and five-step formulas.
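Multistep formulas need starting values from some other method. The sketch below (my own code, not from the notes) implements the 3-step formula, generating the two extra starting values with Euler steps, which is a cruder choice than the formula itself:

```python
def open_nc_3step(f, x0, y0, h, X):
    """Explicit 3-step method y_{k+3} = y_k
    + (3h/2)[f(x_{k+1}, y_{k+1}) + f(x_{k+2}, y_{k+2})].
    Startup values y_1 and y_2 are generated with Euler steps."""
    xs = [x0, x0 + h, x0 + 2.0 * h]
    ys = [y0]
    for i in range(2):                    # crude startup for y_1, y_2
        ys.append(ys[-1] + h * f(xs[i], ys[-1]))
    k = 0
    while xs[-1] + h <= X + 1e-12:
        ynew = ys[k] + 1.5 * h * (f(xs[k + 1], ys[k + 1])
                                  + f(xs[k + 2], ys[k + 2]))
        ys.append(ynew)
        xs.append(xs[-1] + h)
        k += 1
    return xs[-1], ys[-1]

x, y = open_nc_3step(lambda x, y: y, 0.0, 1.0, 0.01, 1.0)
# y approximates e; note only ONE new f evaluation per step
```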
As an alternative to using the open-type Newton-Cotes integration formulas, suppose that f is replaced by an interpolating polynomial p_3(x) of third degree. For p_3(x) we take the polynomial that takes the values f_k, f_{k-1}, f_{k-2} and f_{k-3} at x_k, x_{k-1}, x_{k-2} and x_{k-3}.
p_3(x) is obtained from the Newton backward difference formula

    p_3(x) = f_k + r nabla f_k + (1/2) r(r + 1) nabla^2 f_k + (1/6) r(r + 1)(r + 2) nabla^3 f_k

where r = (x - x_k)/h, and the backward differences are nabla f_k = f_k - f_{k-1}, nabla^2 f_k = nabla f_k - nabla f_{k-1} and nabla^3 f_k = nabla^2 f_k - nabla^2 f_{k-1}. Integrating over one step,

    int_{x_k}^{x_{k+1}} p_3 dx = h int_0^1 p_3(x_k + rh) dr = h [f_k + (1/2) nabla f_k + (5/12) nabla^2 f_k + (3/8) nabla^3 f_k].
Since this is an approximation to the integral of f(x, y(x)) between x = x_k and x = x_{k+1}, Eq. 36 becomes

    y_{k+1} = y_k + (h/24) (55 f_k - 59 f_{k-1} + 37 f_{k-2} - 9 f_{k-3})        (37)

This is a 4-step formula that uses the value of f computed using the previous four approximations y_k, y_{k-1}, y_{k-2} and y_{k-3} in order to calculate y_{k+1}. The method is called the Adams-Bashforth method.
One can use Eq. 37 as a predictor for y_{k+1}, and obtain a corrector in the same way by integrating another interpolating polynomial p̃_3(x), which equals f_{k+1}, f_k, f_{k-1} and f_{k-2} at x_{k+1}, x_k, x_{k-1} and x_{k-2}.
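The Adams-Bashforth formula of Eq. 37 can be sketched as follows (illustrative code; the choice of fourth order Runge-Kutta for the three startup values is mine, made so that the startup does not spoil the method's fourth order accuracy):

```python
def adams_bashforth4(f, x0, y0, h, X):
    """4-step Adams-Bashforth method, Eq. 37: y_{k+1} = y_k
    + (h/24)(55 f_k - 59 f_{k-1} + 37 f_{k-2} - 9 f_{k-3}).
    Startup values are generated with fourth order Runge-Kutta."""
    def rk4_step(x, y):
        a1 = h * f(x, y)
        a2 = h * f(x + 0.5 * h, y + 0.5 * a1)
        a3 = h * f(x + 0.5 * h, y + 0.5 * a2)
        a4 = h * f(x + h, y + a3)
        return y + (a1 + 2.0 * a2 + 2.0 * a3 + a4) / 6.0

    xs, ys = [x0], [y0]
    for _ in range(3):                    # y_1, y_2, y_3 by Runge-Kutta
        xs.append(xs[-1] + h)
        ys.append(rk4_step(xs[-2], ys[-1]))
    fs = [f(xv, yv) for xv, yv in zip(xs, ys)]
    while xs[-1] + h <= X + 1e-12:
        ynew = ys[-1] + (h / 24.0) * (55.0 * fs[-1] - 59.0 * fs[-2]
                                      + 37.0 * fs[-3] - 9.0 * fs[-4])
        xs.append(xs[-1] + h)
        ys.append(ynew)
        fs.append(f(xs[-1], ynew))        # one new evaluation per step
    return xs[-1], ys[-1]

x, y = adams_bashforth4(lambda x, y: y, 0.0, 1.0, 0.05, 1.0)
# fourth order accuracy for only one f evaluation per step,
# versus Runge-Kutta's four
```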
9 Problems
1. Determine all the floating point numbers 0 <= fl(x) <= 3 for the system fl(x) in F(3, 2, -1, 2) and plot them on a real number line.

2. Determine how many digits a computer using a binary number representation system would need to match the accuracy of an 8 digit denary floating point calculation.

3. Determine alternative expressions for

   a) f(x) = 1 - cos x near x = 0

   b) f(x) = (-b + sqrt(b^2 - 4ac)) / (2a) for 4ac << b^2

   which avoid loss of significant digits in floating point calculation.

4. Solve the equation f(x) = 0, where f(x) = x + e^x - 2, by successive iteration.
10 Answers to Problems
Problem 1

    (.000)_2 x 2^0  = 0
    (.100)_2 x 2^-1 = (1/2) (1/2)             = 1/4
    (.101)_2 x 2^-1 = (1/2 + 1/8) (1/2)       = 5/16
    (.110)_2 x 2^-1 = (1/2 + 1/4) (1/2)       = 3/8
    (.111)_2 x 2^-1 = (1/2 + 1/4 + 1/8) (1/2) = 7/16
    (.100)_2 x 2^0  = 1/2
    (.101)_2 x 2^0  = 5/8
    (.110)_2 x 2^0  = 3/4
    (.111)_2 x 2^0  = 7/8
    (.100)_2 x 2^1  = 1
    (.101)_2 x 2^1  = 5/4
    (.110)_2 x 2^1  = 3/2
    (.111)_2 x 2^1  = 7/4
    (.100)_2 x 2^2  = 2
    (.101)_2 x 2^2  = 5/2
    (.110)_2 x 2^2  = 3
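The list above can also be generated programmatically (a sketch of my own; here F(t, beta, L, U) is read as t mantissa digits in base beta with exponents from L to U):

```python
from fractions import Fraction

def floating_point_numbers(t=3, beta=2, emin=-1, emax=2, upper=3):
    """Enumerate 0 and all normalised numbers (.d1 d2 d3)_2 x 2^e
    up to `upper` for the system F(3, 2, -1, 2), exactly."""
    values = {Fraction(0)}
    for e in range(emin, emax + 1):
        # normalised mantissas run from (.100)_2 to (.111)_2
        for m in range(beta ** (t - 1), beta ** t):
            x = Fraction(m, beta ** t) * Fraction(beta) ** e
            if x <= upper:
                values.add(x)
    return sorted(values)

nums = floating_point_numbers()
# 0, 1/4, 5/16, 3/8, 7/16, 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4, 2, 5/2, 3
```

Exact rational arithmetic via Fraction avoids any rounding in the enumeration itself.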
Figure 7: The diagram shows the discrete and non-uniform nature of finite digit floating point number representations.

Problem 2

It is necessary to compare 10^{1-t} with 2^{1-s}, where t and s are the numbers of digits in the denary and binary representations respectively. The steps are

    10^{1-t} = 2^{1-s}
    log_10 10^{1-t} = log_10 2^{1-s}
    (1 - t) log_10 10 = (1 - s) log_10 2
    (t - 1) = (s - 1) x 0.30103
If t = 8 then s = 1 + (t - 1)/0.30103 ~ 24.25. Therefore, 24 binary digit calculations are not quite as accurate as 8 digit denary calculations, while 25 binary digit calculations are slightly more accurate.

Problem 3

a) Multiplying by (1 + cos x)/(1 + cos x),

    1 - cos x = (1 - cos^2 x)/(1 + cos x) = sin^2 x / (1 + cos x)

b) Multiplying numerator and denominator by the conjugate -b - sqrt(b^2 - 4ac),

    (-b + sqrt(b^2 - 4ac)) / (2a) = -2c / (b + sqrt(b^2 - 4ac))

Both right hand forms avoid the subtraction of nearly equal quantities.
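The cancellation in Problem 3a is easy to demonstrate numerically (my own illustration):

```python
import math

def one_minus_cos_naive(x):
    """Direct form: subtracts two nearly equal numbers near x = 0."""
    return 1.0 - math.cos(x)

def one_minus_cos_stable(x):
    """Rewritten form from Problem 3a: no cancellation."""
    return math.sin(x) ** 2 / (1.0 + math.cos(x))

x = 1.0e-9
# True value is x^2/2 = 5e-19.  In double precision cos(1e-9)
# rounds to exactly 1.0, so the naive form returns 0.0, losing
# ALL significant digits; the stable form returns about 5e-19.
```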
Problem 4

Writing f(x) = 0 as

    x = 2 - e^x        (39)

and iterating x_{k+1} = 2 - e^{x_k} from x_0 = 0 gives

    x_1 =  1.00000
    x_2 = -0.71828
    x_3 =  1.51241
    x_4 = -2.53766
    x_5 =  1.92095
    x_6 = -4.82747

Clearly, this sequence does not converge. However, by taking natural logarithms of Eq. 39 one obtains

    x = log(2 - x)

Now, with x_0 = 0 one obtains

    x_1 = 0.69315
    x_2 = 0.26763
    x_3 = 0.54949
    x_4 = 0.37192
    x_5 = 0.48740
    x_6 = 0.41383
    ...

approaching the root x ~ 0.44285. It should be noticed that both sequences oscillate, but the first sequence diverges while the second converges.
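The successive iteration above can be sketched generically (my own code; the tolerance and iteration cap are arbitrary choices):

```python
import math

def fixed_point(g, x0, tol=1e-6, maxit=100):
    """Iterate x_{k+1} = g(x_k) until successive iterates agree to
    tol; returns (x, converged)."""
    x = x0
    for _ in range(maxit):
        xnew = g(x)
        if abs(xnew - x) < tol:
            return xnew, True
        x = xnew
    return x, False

# Convergent rearrangement x = log(2 - x) from Problem 4:
root, ok = fixed_point(lambda x: math.log(2.0 - x), 0.0)
# ok is True; root is approximately 0.44285.
# The divergent form lambda x: 2.0 - math.exp(x) would instead
# exhaust maxit and return converged = False.
```

The difference is that |d/dx log(2 - x)| < 1 near the root while |d/dx (2 - e^x)| > 1 there, which is precisely the convergence criterion of Lecture 3.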
11 Lecture programme
    Date     Session   Lecture   Topic
    Feb.  4     1         1      Introduction, Computing with numbers
          6     2                Practical 1
         11     3         2      Errors in computer arithmetic
         13     4                Practical 2
         18     5         3      Solving algebraic equations
         20     6                Practical 3
         25     7         4      Fitting experimental data
         27     8         5      Interpolation
    Mar.  4     9                Practical 4
          6    10                Assignment 1†
         11    11         6      Linear algebra 1
         13    12         7      Linear algebra 2
         18    13         8      Optimisation
         20    14         9      Numerical quadrature
         25    15        10      Ordinary differential equations 1
         27    16        11      Ordinary differential equations 2
    Apr.  1    17                Assignment 2‡
          3    18                Assignment 2
         29    19        12      Ordinary differential equations 3
    May   1    20        13      Partial differential equations 1
          6    21        14      Partial differential equations 2
          8    22        15      Partial differential equations 3
         13    23        16      Tutorial
         15    24        17      Tutorial

Table 1: 1997/98 lecture programme.
† Differentiating experimental data.
‡ Integrating the Landau-Lifshitz equation.