Complex Wavelet Based Image Analysis and PHD

www.GetPedia.
com
* The Ebook starts from the next page : Enjoy !

Complex Wavelet Based
Image Analysis and Synthesis
This dissertation is submitted for the degree of Doctor of Philosophy

Peter de Rivaz Trinity College October 2000
University of Cambridge Department of Engineering
de Rivaz, Peter F. C. PhD thesis, University of Cambridge, October 2000.
Complex Wavelet Based Image Analysis and Synthesis
Key Words
Complex wavelets, multi-scale, texture segmentation, texture synthesis, in-
terpolation, deconvolution.
Copyright P.F.C.
c de Rivaz, 2000.
All rights reserved. No part of this work may be reproduced, stored in a

retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording, or otherwise, without prior permission.
All statements in this work are believed to be true and accurate at the time of
its production but neither the author nor the University of Cambridge offer
any warranties or representations, nor can they accept any legal liability for
errors or omissions.
P.F.C. de Rivaz
Signal Processing and Communications Laboratory
Department of Engineering
Trumpington Street
Cambridge, CB2 1PZ, U.K.
To Jenny
Summary
This dissertation investigates the use of complex wavelets in image processing.

The limitations of standard real wavelet methods are explained with emphasis on the
problem of shift dependence.
Complex wavelets can be used for both Bayesian and non-Bayesian processing. The
complex wavelets are first used to perform some non-Bayesian processing. We describe
how to extract features to characterise textured images and test this characterisation
by resynthesizing textures with matching features. We use these features for image
segmentation and show how it is possible to extend the feature set to model longer-
range correlations in images for better texture synthesis.
Second we describe a number of image models from within a common Bayesian frame-
work. This framework reveals the theoretical relations between wavelet and alternative
methods. We place complex wavelets into this framework and use the model to address
the problems of interpolation and approximation. Finally we show how the model can
be extended to cover blurred images and thus perform Bayesian wavelet based image
deconvolution.
Theoretical results are developed that justify the methods used and show the connections
between these methods and alternative techniques.
Numerical experiments on the test problems demonstrate the usefulness of the proposed
methods, and give examples of the superiority of complex wavelets over the standard
forms of both decimated and non-decimated real wavelets.
Declaration
The research described in this dissertation was carried out between October 1997 and
September 2000. Except where indicated in the text, this dissertation is the result of my
own work and includes nothing which is the outcome of work done in collaboration. No
part of this dissertation has been submitted to any other university. The dissertation
does not exceed 65000 words and does not contain more than 150 figures.
Acknowledgements
I would like to thank my supervisor, Dr. Nick Kingsbury, for suggesting this topic and
for his guidance during the research. Thanks to my parents for encouraging my curiosity
and to my wife for keeping me calm. This work was made possible by an EPSRC grant.
Contents
1 Introduction 3
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Justification for the research . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Bayesian and Non-Bayesian Approaches . . . . . . . . . . . . . . . . . . . 5
1.4 Original contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Most important contributions . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Medium importance contributions . . . . . . . . . . . . . . . . . . . 7
1.4.3 Least important contributions . . . . . . . . . . . . . . . . . . . . . 8
1.4.4 Contributions based largely on previous work . . . . . . . . . . . . 9
1.5 Organisation of the dissertation . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Wavelet transforms 13
2.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 The Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Filter design and the product filter . . . . . . . . . . . . . . . . . . 17
2.3.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Single tree complex wavelets . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Directionality and Ridgelets . . . . . . . . . . . . . . . . . . . . . . 23
2.3.5 Shift invariance and the Harmonic wavelet . . . . . . . . . . . . . . 24
2.3.6 Non-redundant, directionally selective, complex wavelets . . . . . . 27
2.3.7 Prolate spheroidal sequences . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Redundant complex wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Dual tree complex wavelet transform . . . . . . . . . . . . . . . . . 31
2.4.2 Q-shift Dual tree complex wavelets . . . . . . . . . . . . . . . . . . 33
7
2.4.3 Steerable transforms . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.4 Multiwavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5 Noise amplification theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.2 Theoretical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.3 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3 Previous applications 45
3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Complex wavelet texture features 53

4.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Pyramid-based texture analysis/synthesis . . . . . . . . . . . . . . . . . . . 54
4.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Results and applications . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Gabor based texture synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.1 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.2 Extracted features . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Texture features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Texture segmentation 67
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Original Classification Method . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4 Training simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5 Detailed description of method . . . . . . . . . . . . . . . . . . . . . . . . 71
5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.7 Discussion of relative performance . . . . . . . . . . . . . . . . . . . . . . . 77
5.8 Discussion of DT-CWT performance . . . . . . . . . . . . . . . . . . . . . 83
5.9 Multiscale Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.9.1 Texture features for multiscale segmentation . . . . . . . . . . . . . 86
5.9.2 Multiscale classification . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.9.3 Choice of parameter values . . . . . . . . . . . . . . . . . . . . . . . 88
5.9.4 Multiscale results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6 Correlation modelling 93
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Auto-correlation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3 Auto-correlation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5 Cross-correlation method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 Cross-correlation results and discussion . . . . . . . . . . . . . . . . . . . . 100
6.7 Large feature set segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7 Bayesian modelling in the wavelet domain 107

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 Introduction to Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 Bayesian image modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3.1 Filter model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3.2 Fourier model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3.3 Wavelet direct specification . . . . . . . . . . . . . . . . . . . . . . 113
7.3.4 Wavelet generative specification . . . . . . . . . . . . . . . . . . . . 114
7.4 Choice of wavelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4.1 Possible Basis functions . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.2 Shift Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4.3 One dimensional approximation . . . . . . . . . . . . . . . . . . . . 119
7.4.4 Two-dimensional shift dependence . . . . . . . . . . . . . . . . . . . 120
7.4.5 Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.4.6 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8 Interpolation and Approximation 129

8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.3 Posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.4 Approximation techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.4.1 Kriging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.4.2 Radial Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . . 134
8.4.3 Bandlimited interpolation . . . . . . . . . . . . . . . . . . . . . . . 135
8.4.4 Minimum smoothness norm interpolation . . . . . . . . . . . . . . . 136
8.4.5 Large spatial prediction for nonstationary random fields . . . . . . . 140
8.4.6 Comparison with Spline methods . . . . . . . . . . . . . . . . . . . 140
8.5 Wavelet posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.6 Method for wavelet approximation/interpolation . . . . . . . . . . . . . . . 145
8.6.1 Estimating scale energies . . . . . . . . . . . . . . . . . . . . . . . . 146
8.6.2 Important wavelet coefficients . . . . . . . . . . . . . . . . . . . . . 146
8.6.3 Impulse responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.6.4 Solving for wavelet coefficients . . . . . . . . . . . . . . . . . . . . . 147
8.6.5 Reconstruct image estimate . . . . . . . . . . . . . . . . . . . . . . 147
8.7 Choice of wavelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.7.1 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.7.2 Shift Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.7.3 Experiments on shift dependence . . . . . . . . . . . . . . . . . . . 152
8.7.4 Discussion of the significance of shift dependence . . . . . . . . . . 153
8.8 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.8.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.8.2 Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.8.3 Trading accuracy for speed . . . . . . . . . . . . . . . . . . . . . . . 159
8.9 Discussion of model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9 Deconvolution 165
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.1.1 Bayesian framework . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.1.2 Summary of review . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.1.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.2 Image model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.3 Iterative Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.3.1 Variance estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
9.3.2 Choice of search direction . . . . . . . . . . . . . . . . . . . . . . . 178
9.3.3 One dimensional search . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.4 Convergence experiments and discussion . . . . . . . . . . . . . . . . . . . 182
9.5 Comparison experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10 Discussion and Conclusions 201

10.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
10.2 Implications of the research . . . . . . . . . . . . . . . . . . . . . . . . . . 203
11 Future work 205

11.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
11.2 Texture synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.3 Deconvolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.4 Other possibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
A Balance and noise gain 209

A.1 Amplification of white noise by a transform . . . . . . . . . . . . . . . . . 209
A.2 Definition of d1 , ...dN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
A.3 Determining frame bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
A.4 Signal energy gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
A.5 Reconstruction noise gain bound . . . . . . . . . . . . . . . . . . . . . . . . 213
A.6 Consequences of a tight frame . . . . . . . . . . . . . . . . . . . . . . . . . 214
A.7 Relation between noise gain and unbalance . . . . . . . . . . . . . . . . . . 214
B Useful results 217

B.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
B.2 Bayesian point inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
C Review of deconvolution techniques 223

C.1 CLEAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
C.2 Maximum Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
C.3 Projection methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
C.4 Wiener filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
C.5 Iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
C.6 Constrained Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
C.7 Total variation and Markov Random Fields . . . . . . . . . . . . . . . . . . 234
C.8 Minimax wavelet deconvolution . . . . . . . . . . . . . . . . . . . . . . . . 235
C.9 Mirror wavelet deconvolution . . . . . . . . . . . . . . . . . . . . . . . . . 237
C.9.1 Mirror Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . 238
C.9.2 Deconvolution Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 239
C.10 Wavelet-based smoothness priors . . . . . . . . . . . . . . . . . . . . . . . 241
List of Figures
1.1 Guide to the dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Building block for wavelet transform . . . . . . . . . . . . . . . . . . . . . 14

2.2 Subband decomposition tree for a 4 level wavelet transform . . . . . . . . . 15
2.3 Building block for inverting the wavelet transform . . . . . . . . . . . . . . 15
2.4 Alternative structure for a subband decomposition tree . . . . . . . . . . . 17
2.5 Contours of half-peak magnitude of filters at scales 3 and 4 . . . . . . . . . 30
2.6 Contours of 70% peak magnitude of filters at scales 3 and 4 . . . . . . . . 31
2.7 The complex wavelet dual tree structure. This figure was provided by Dr
N. Kingsbury. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 The Q-shift dual tree structure. This figure was provided by Dr N. Kingsbury. 34
2.9 Model of wavelet domain processing. . . . . . . . . . . . . . . . . . . . . . 38
2.10 Comparison of noise gain for different transforms . . . . . . . . . . . . . . 42
3.1 Top left: Original image. Top right: Noisy image. Bottom left: DWT
results. Bottom middle: DT-CWT results. Bottom right: NDWT results. . 49
3.2 PSNRs in dB of images denoised with the HMT acting on different wavelet
transforms. The results in normal type are published in the literature [24]
while the results in bold come from our replication of the same experiment. 50
4.1 Results of using histogram/energy synthesis . . . . . . . . . . . . . . . . . 62

4.2 Results of using energy synthesis . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Results of using histogram/energy synthesis on a wood grain texture. . . . 64
4.4 Results of using different methods on a strongly diagonal texture . . . . . . 65
4.5 Energy before rescaling for different subbands during the histogram/energy
synthesis algorithm. Horizontal lines represent the target energy values. . . 66
13
5.1 Mosaics tested . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Comparison of segmentation results for different transforms . . . . . . . . . 75
5.3 Percentage errors for (DWT,NDWT,DT-CWT) . . . . . . . . . . . . . . . 76
5.4 Performance measure for different methods . . . . . . . . . . . . . . . . . . 78
5.5 Sine wave input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6 Nondecimated scale 4 wavelet coefficients . . . . . . . . . . . . . . . . . . . 80
5.7 Rectified nondecimated scale 4 wavelet coefficients . . . . . . . . . . . . . . 80
5.8 Rectified decimated scale 4 wavelet coefficients . . . . . . . . . . . . . . . . 82
5.9 Comparison of segmentation results for altered DT-CWT . . . . . . . . . . 83
5.10 Percentage errors for (HalfCWT,RealCWT,DT-CWT) . . . . . . . . . . . 84
5.11 Segmentation results for mosaic “f” using the DT-CWT . . . . . . . . . . . 85
5.12 Comparison of segmentation results for multiscale methods . . . . . . . . . 89
5.13 Percentage errors for single scale DT-CWT, multiscale DT-CWT, multiscale
DWT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.14 Segmentation results for mosaic “f” using the multiscale DT-CWT . . . . . 91
6.1 Results of original energy matching synthesis . . . . . . . . . . . . . . . . . 96

6.2 Results of matching 3 by 3 raw auto-correlation values . . . . . . . . . . . 97
6.3 Results of matching 5 by 5 raw auto-correlation values . . . . . . . . . . . 97
6.4 Results of matching 5 by 5 magnitude autocorrelation values . . . . . . . . 98
6.5 Results of matching 5 by 5 raw autocorrelation values . . . . . . . . . . . . 99
6.6 Comparison of different synthesis methods. . . . . . . . . . . . . . . . . . . 101
6.7 2-D frequency responses for the four subbands derived from the level 2 45◦
subband. Contours are plotted at 90%, 75%, 50%, 25% of the peak ampli-
tude. A dashed contour at the 25% peak level for the original 45◦ scale 2
subband is also shown in each plot. . . . . . . . . . . . . . . . . . . . . . . 103
6.8 Comparison of segmentation results for different transforms . . . . . . . . . 105
7.1 Sequence of operations to reconstruct using a Gaussian Pyramid . . . . . . 117

7.2 One dimensional approximation results for different origin positions. Crosses
show location and values of measured data points. . . . . . . . . . . . . . . 120
7.3 Shift dependence for different scales/dB. . . . . . . . . . . . . . . . . . . . 122
7.4 Covariance structure for an orthogonal real wavelet. . . . . . . . . . . . . . 124
7.5 Covariance structure for a nondecimated real wavelet. . . . . . . . . . . . . 125
7.6 Covariance structure for the Gaussian pyramid. . . . . . . . . . . . . . . . 125
7.7 Covariance structure for the W transform. . . . . . . . . . . . . . . . . . . 126
7.8 Covariance structure for the DT-CWT . . . . . . . . . . . . . . . . . . . . 126
7.9 Covariance structure for a translated orthogonal real wavelet. . . . . . . . . 127
7.10 Summary of properties for different transforms . . . . . . . . . . . . . . . . 128
8.1 Count of important coefficients for different transforms . . . . . . . . . . . 148

8.2 Shift dependence for different scales. . . . . . . . . . . . . . . . . . . . . . 151
8.3 Aesthetic quality for DWT(o) and DT-CWT(x) /dB . . . . . . . . . . . . 154
8.4 Relative statistical quality for DWT(o) and DT-CWT(x) /dB . . . . . . . 154
8.5 Computation time versus SNR (128 measurements). . . . . . . . . . . . . . 160
8.6 Computation time versus SNR (256 measurements). . . . . . . . . . . . . . 160
9.1 Prior cost function f (x) expressions for standard deconvolution techniques. 169
9.2 Flow diagram for the proposed wavelet deconvolution method. . . . . . . . 173
9.3 Block diagram of deconvolution estimation process. . . . . . . . . . . . . . 175
9.4 Performance of different search directions using the steepest descent (x) or
the conjugate gradient algorithm (o). . . . . . . . . . . . . . . . . . . . . . 184
9.5 Performance of different search directions using the steepest descent (x) or
the conjugate gradient algorithm (o) starting from a WaRD intialisation. . 186
9.6 Performance of different search directions over 100 iterations . . . . . . . . 187
9.7 Value of the energy function over 100 iterations . . . . . . . . . . . . . . . 187
9.8 Test images used in the experiments. . . . . . . . . . . . . . . . . . . . . . 189
9.9 Alternative PSF used in experiments. . . . . . . . . . . . . . . . . . . . . . 190
9.10 Comparison of ISNR for different algorithms and images /dB . . . . . . . . 191
9.11 Comparison of different published ISNR results for a 9 by 9 uniform blur
applied to the Cameraman image with 40dB BSNR. . . . . . . . . . . . . . 193
9.12 Deconvolution results for a 9 by 9 uniform blur applied to the Cameraman
image with 40dB BSNR using the PRECGDT-CWT method with WaRD
initialisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9.13 Comparison of different published ISNR results for a Gaussian blur applied
to the Mandrill image with 30dB BSNR. . . . . . . . . . . . . . . . . . . . 195
9.14 Deconvolution results for a Gaussian blur applied to the Mandrill image with
30dB BSNR using the PRECGDT-CWT method with WaRD initialisation. 196
9.15 Comparison of the PRECGDT-CWT and Wiener filtering with published
results of Modified Hopfield Neural Network algorithms for a 3 × 3 uniform
blur applied to the Lenna image with 40dB BSNR. . . . . . . . . . . . . . 197
10.1 Summary of useful properties for different applications . . . . . . . . . . . 203
C.1 Effective assumption about SNR levels for Van Cittert restoration (K =
3,α = 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
C.2 Effective assumption about SNR levels for Landweber restoration (K =
3,α = 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
C.3 The mirror wavelet tree structure . . . . . . . . . . . . . . . . . . . . . . . 238
C.4 2D frequency responses of the mirror wavelet subbands shown as contours
at 75% peak energy amplitude. . . . . . . . . . . . . . . . . . . . . . . . . 240
Abbreviations and notation
BSNR Blurred Signal to Noise Ratio.

BW Bandwidth.
dB decibels.
DT-CWT Q-shift Dual Tree Complex Wavelet Transform.
DTFT Discrete Time Fourier Transform.
DWT Discrete Wavelet Transform.
FFT Fast Fourier Transform.
FIR Finite Impulse Response.
GPT Gaussian Pyramid Transform.
IIR Infinite Impulse Response.
ISNR Improved Signal to Noise Ratio.
NDWT Non-decimated Discrete Wavelet Transform.
pdf probability density function.
PR Perfect Reconstruction.
PSD Power Spectral Density.
SNR Signal to Noise Ratio.
WWT W-(Wavelet)Transform.
∀ for all.
↓k downsampling by a factor of k.
↑k upsampling by a factor of k.
a∈S a belongs to a set S.
a the largest integer not greater than a.
a the smallest integer not less than a.
[a]i the ith element of vector a.
argmaxa∈S f (a) the value for a within set S that maximises f (a).
argmina∈S f (a) the value for a within set S that minimises f (a).
A matrix.
T
A transpose of A.
∗
A conjugate of A.
AH Hermitian transpose of A.
A−1 inverse of A.
|a| absolute value of a scalar a.
|A| determinant of a matrix A.
a
Frobenius norm of a.
diag {a} diagonal matrix containing elements of a along the diagonal.
D a diagonal matrix normally containing weights for wavelet coefficients.
ei a vector containing zeros everywhere except for a 1 in position i.
E {X} the expected value of X.
G0 (z) Z-transform of a wavelet lowpass synthesis filter.
G1 (z) Z-transform of a wavelet highpass synthesis filter.
H0 (z) Z-transform of a wavelet lowpass analysis filter.
H1 (z) Z-transform of a wavelet highpass analysis filter.
{a} the imaginary part of a.
Ik k-dimensional identity matrix.
√
j −1.
M number of wavelet and scaling coefficients produced by a transform.
N number of input samples in a data set.
N (µ, C) multivariate Gaussian distribution with mean µ and covariance C.
P N × M matrix representing an inverse wavelet transform.
P (z) product filter.
p(θ) joint pdf for the elements of θ.
p(θ|φ) pdf for θ, conditioned on φ.
{a} the real part of a.
RN vector space of N × 1 real vectors.
sup S the supremum of set S.
tr(A) the trace of matrix A.
W M × N matrix representing a wavelet transform.
w M × 1 vector containing wavelet coefficients.
x N × 1 vector containing all the input samples.
Z a vector of random variables.
2
Chapter 1
Introduction
1.1 Overview
This dissertation investigates the use of complex wavelets in image processing. Traditional
formulations of complex wavelets are seldom used because they generally suffer from ei-
ther lack of speed or poor inversion properties. A recently developed dual-tree complex
wavelet transform (DT-CWT) has solved these two fundamental problems while retaining
the properties of shift invariance and additional directionality that complex wavelets pro-
vide. The aim of this dissertation is to discover when the DT-CWT is a useful tool by
developing complex wavelet methods to address a number of image processing problems
and performing experiments comparing methods based on the DT-CWT to alternative
methods. In particular, we aim to compare complex wavelets with standard decimated
and non-decimated real wavelet transforms.
Complex wavelets can be used for both Bayesian and non-Bayesian image processing
and we have applied complex wavelet methods to a wide range of problems including
image database retrieval [33], fast adaptive contour segmentation [34], and edge detection.
In this dissertation we restrict our attention to four main examples of particular interest,
two Bayesian and two non-Bayesian.
We first consider non-Bayesian applications and explain how to use the DT-CWT to
generate texture features. A qualitative feel for the power of the description is given by
texture synthesis experiments. The features are then experimentally tested by addressing
the problem of image segmentation to show the performance relative to many other feature
sets. We also consider images for which the simple model is inadequate. We show how
3
4 CHAPTER 1. INTRODUCTION
the simple model can be enhanced to handle longer-range correlations in images for better
texture synthesis. This is of interest as it provides information about the kind of images
that are well modelled by complex wavelets, and the kind that are too complicated.
Second we demonstrate how complex wavelets can be used for Bayesian image process-
ing. The wavelets are used to define a prior probability distribution for images and solution
methods are developed based on these models.
We use this Bayesian model to deal with irregularly sampled data points. In this case we
assume that the data is a realisation of a stationary Gaussian process. This is of interest
for a number of different reasons. First, it demonstrates how a wavelet method can be
much more efficient than standard methods. Second, we can develop theory that relates
the wavelet methods to a variety of alternative techniques such as Kriging, splines, and
radial basis functions. Third, it provides information about the kind of images that are
too simple for the complex wavelet model – in the sense that although complex wavelets
provide a good answer there is a more basic technique that is significantly faster.
The final extension is concerned with image deconvolution. In this case we develop an
enhanced non-stationary wavelet model. This is of interest because it demonstrates how
wavelet methods give better results than standard approaches such as Wiener filtering and
how a Bayesian approach to inference can further increase the accuracy.
The following sections explain the justification for the research, give an introduction
to the difference between Bayesian and non-Bayesian approaches, describe the original
contribution of the dissertation, and explain the organisation of the work.
1.2 Justification for the research

The wavelet transform has become a widely used technique but the fastest and most
popular formulation (the fully decimated wavelet transform) suffers from aliasing problems.
The usual solution to the aliasing is to eliminate all subsampling from the transform (to
give the non-decimated wavelet transform) but this greatly increases the computation and
amount of storage required (especially in multiple dimensions).
It has been claimed that a recently developed complex wavelet transform gives a useful
compromise solution that removes almost all of the aliasing problems for only a slight in-
crease in computational cost. Previous complex wavelet transforms have generally suffered
from problems with efficiency and reconstruction. It has also been claimed that the new
1.3. BAYESIAN AND NON-BAYESIAN APPROACHES 5
transform solves both these problems. However, currently the new transform is not gen-
erally used: this may be due to doubts about the importance of the differences, concerns
about the complexity of methods based on complex wavelets, or simply because the new
transform is relatively unknown. By testing these claims on a range of practical tasks we
discover the significance of the differences and demonstrate the simplicity of designing and
implementing methods based on complex wavelets.
1.3 Bayesian and Non-Bayesian Approaches

The Bayesian approach to a problem involves constructing a model and then performing
inference using the model. The model consists of specifying firstly the prior probability
distributions for the model parameters and secondly the conditional probability distribu-
tion for the data given the parameters (also called the likelihood). (Chapter 7 explains the
terminology used in Bayesian signal processing.) Inference is performed by the application
of Bayes’ theorem in order to find the posterior distribution for the parameters of the model
conditional on the observed data. If the model correctly describes the data then the best
possible answers can be calculated from the posterior distribution. In this context best is
defined in terms of a cost function that measures the penalty for errors in the estimate,
the best answer is the one that minimises the expected value of the cost function. This is
a very powerful way for performing inference even for complicated models.
There are two main problems with this approach. The first problem is that it is often
hard to construct an appropriate model for the data. The second problem is that the
inference is usually impossible to perform analytically. Instead numerical techniques such
as MCMC (Markov Chain Monte Carlo) must be used. It is usually easy to find a slow
technique that will solve the problem but much more difficult to find a fast technique.
In a trivial sense non-Bayesian approaches are all other approaches. To sharpen the
discussion consider restoring an image that has been corrupted in some way. The Bayesian
approach is to try and construct an accurate prior probability distribution for images that
are likely to occur plus a model for the type of image degradation that is believed to have
occurred. Armed with this model an estimate for the original image can be calculated
from an inferred posterior distribution. A typical non-Bayesian approach might be to
apply a median filter1 . We now list a number of features of the median filter that are often
1
A median filter replaces every pixel by the median of a set of pixels from the local neighbourhood.
characteristic of non-Bayesian approaches:
1. It is very fast.
2. It requires no knowledge about the type of degradation.
3. It is not obvious what assumptions are implicitly made about the expected structure
of images or the degradation.
4. In practice it is often effective.
5. For images with certain structures, such as line drawings, it performs very badly (if
the size of the median filter is large compared to the width of the line then every
pixel will be set to the background colour).
The fact that the approach works without needing the problem to be accurately specified
is the source of both strength and weakness. The strength is that the method can work
reasonably even for the very complicated images that are seen in the real world. The
weakness is that it is also possible that the method will be totally inappropriate (for
instance, the degradation might be that the image is upside down). It is therefore crucial
that the effectiveness of the method is experimentally determined.
Currently Bayesian methods are not often used in real-world applications of image
processing because of their lack of speed and the problem of modelling images. For these
reasons, and to facilitate comparison with alternative transforms, we select a non-Bayesian
approach for segmentation (chapter 5). A Bayesian approach to segmentation is made
difficult by the need to specify a prior for the shapes of segments. This is a large research
topic by itself and there are encouraging results based on complex wavelets in the literature
[59].
However, for interpolation and image deconvolution we attempt a Bayesian approach.
This requires simplifying assumptions to be made about the problem but the benefits of
the Bayesian approach can be seen. The benefits consist not only of improved experimental
results but additionally the mathematical framework permits a theoretical comparison of
many Bayesian and non-Bayesian techniques.
1.4. ORIGINAL CONTRIBUTIONS 7
1.4 Original contributions

This section describes the contributions to learning made by this dissertation from most to
least important. The references are to the corresponding sections in the dissertation. This
classification is, of course, subjective and merely represents the author’s current opinion.
1.4.1 Most important contributions

These are the results of the most general interest.
1. We experimentally compare feature sets for segmentation (5.6,5.9.4).
2. We develop theoretical links between a number of interpolation and approximation

techniques for irregularly sampled data (8.4).
3. We describe how the DT-CWT can be used for image deconvolution(9) and provide
experimental comparisons with other techniques (9.5).
1.4.2 Medium importance contributions

These are the results of interest principally to researchers in a specific area of image pro-
cessing.
1. We explain why it is impossible to have a useful single-tree complex wavelet or a

shift-invariant non-redundant wavelet transform based on short support filters (2.3).
2. We derive an expression for the noise gain of a transform in terms of the unbalance
between analysis and reconstruction (2.5).
3. We describe shift invariant wavelet models in terms of a Gaussian random process

(7.3.3,7.3.4) and identify the auto-correlation of the associated process.
4. We describe how Bayesian interpolation can be implemented with wavelet methods

(8.5).
5. We calculate theoretical predictions of aesthetic and statistical quality of solutions to

interpolation problems (8.7.2) and measure the accuracy of these predictions (8.2).
6. We explain a method for fast conditional simulation for problems of interpolation

and approximation (8.8.2).
7. We show how the speed of wavelet interpolation can be significantly increased by

allowing a small amount of error (8.8.3).
1.4.3 Least important contributions

These are specific experimental or theoretical results of less general interest but which give
supporting evidence for the thesis.
1. We derive connections between the singular values of a wavelet transform matrix,

the frame bounds for the transform, and the noise gain in reconstruction (2.5.2).
2. We perform experiments to compare the noise gain performance for certain complex
Daubechies wavelets and the DT-CWT (2.5.3).
3. We perform segmentation experiments to measure the effect of some additional auto-

correlation based features (6.7).
4. We characterise minimum smoothness norm interpolations and prove that shift de-
pendence is always expected to decrease the quality (8.4.4).
5. We experimentally compare the shift dependence of interpolation methods based on

the DT-CWT and alternative wavelet transforms (8.7.3).
6. We propose and compare a number of methods for calculating search directions within
the deconvolution method (9.3.2).
7. We review the main deconvolution techniques from a Bayesian perspective (appendix

C).
8. We use Fourier transforms to explain why Van Cittert and Landweber methods (with-
out a positivity constraint) will always perform worse than oracle Wiener filtering
(C.5).
1.5. ORGANISATION OF THE DISSERTATION 9
1.4.4 Contributions based largely on previous work

These are straightforward extensions of existing work to use the complex wavelet transform.
1. We perform some experiments using the Hidden Markov Tree for image denoising
(3.4).
2. We show how the DT-CWT can be used for texture synthesis (4.4.3) and display
some experimental results (4.8).
3. We develop pixel by pixel (5.5) and multiscale segmentation methods (5.9) based on
complex wavelets.
4. We explain how the autocorrelation of DT-CWT subbands can be used to improve

the quality of texture synthesis (6.2) and display some experimental results (6.3).
The author has chosen to use the pronoun “we” to represent himself in order to avoid both
the jarring effect of “I” and the awkwardness of the passive tense. Nevertheless, all the
original research presented here is the work of the sole author.
A technical report [35] has been published containing the results of a collaboration
with an expert in seismic surveying. During the collaboration we applied the results of
chapter 8 to the problem of using seismic measurements to determine subsurface structure.
The report itself contains contributions from the expert but all the material and results of
chapter 8 are the work of the sole author.
1.5 Organisation of the dissertation

Figure 1.1 illustrates the organisation. The purpose of the first two chapters is mainly to
provide background information that will be useful in the subsequent chapters. Chapter
2 reviews the principles of standard wavelet transforms from a filter design viewpoint
and describes the properties of complex wavelet transforms. Chapter 3 reviews some
relevant work done in motion estimation, image classification, and image denoising that
uses complex wavelets.
The next three chapters propose and test a number of non-Bayesian complex wavelet
image processing methods. Chapter 4 proposes texture features based on complex wavelets

2: Wavelet 3: Previous
Background
transforms applications

4: DT-CWT
Non-Bayesian processing
texture features

6: Correlation
Example applications 5: Segmentation
modelling

7: Bayesian
Bayesian processing
modelling

8: Interpolation 9: Image
Example applications
and approximation deconvolution

11: Further
Final remarks 10: Conclusions
possibilities
Figure 1.1: Guide to the dissertation

1.5. ORGANISATION OF THE DISSERTATION 11
and examines their properties by means of texture synthesis experiments. Chapter 5 com-
pares a DT-CWT segmentation method with the results from alternative schemes for a
variety of image mosaics. Chapter 6 extends the texture set to include longer-range corre-
lations.
The following three chapters propose and test Bayesian approaches to image processing.
Chapter 7 introduces a Bayesian framework for image modelling. Chapter 8 describes the
application of Bayesian methods to approximation and interpolation. Chapter 9 uses a
similar Bayesian method to address the problem of deconvolution.
The final two chapters summarise the findings and discuss future possibilities. Chap-
ter 10 discusses the impact of the research and summarises the main conclusions of the
dissertation. Chapter 11 suggests directions for future research.
At the end of the dissertation are the references and appendices.
Chapter 2
Wavelet transforms
2.1 Summary
The purpose of this chapter is to introduce and motivate the use of a complex wavelet
transform. The chapter first gives a short introduction to the terminology and construction
of real wavelet transforms and then a review of a number of complex wavelet transforms.
We explain why useful single-tree complex wavelets will suffer from poor balance (we
define balance as a measure of similarity between the filters in the forward and inverse
transforms) and why most non-redundant transforms will suffer from aliasing.
The main original contribution of the chapter is an equation relating the balance of a
transform to the amount of noise amplification during reconstruction which shows why a
balanced complex wavelet transform is preferred. This importance of this result is demon-
strated by an experiment comparing a single-tree complex wavelet with a dual-tree complex
wavelet.
The description of wavelets and complex wavelet systems is based on the material
referenced, but the equation 2.15 mentioned above is original.
2.2 Introduction
The concepts behind wavelets have been independently discovered in many fields including
engineering, physics, and mathematics. This dissertation is concerned with the application
of wavelets to image processing problems making the engineering perspective the most
useful. The principal sources for this chapter are books by Daubechies [30], Mallat [75],
13
14 CHAPTER 2. WAVELET TRANSFORMS
Strang and Nguyen [113], and Vetterli and Kovac̆ević [122]. We assume familiarity with
the Z-transform. We follow the notation of Vetterli [121].
2.3 The Wavelet Transform

We will first describe the one dimensional dyadic discrete time wavelet transform. This is a
transform similar to the discrete Fourier transform in that the input is a signal containing
N numbers, say, and the output is a series of M numbers that describe the time-frequency
content of the signal. The Fourier transform uses each output number to describe the
content of the signal at one particular frequency, averaged over all time. In contrast, the
outputs of the wavelet transform are localised in both time and frequency.
The wavelet transform is based upon the building block shown in figure 2.1. This block

is crucial for both understanding and implementing the wavelet transform.

- H0(z) - ↓ 2 - y0

x
- H1(z) - -
↓2 y1
Figure 2.1: Building block for wavelet transform
This diagram is to be understood as representing the following sequence of operations:
1. Filter an input signal (whose value at time n is x(n)) with the filter whose Z-transform
is H0 (z).
2. Downsample the filter output by 2 to give output coefficients y0 (n).
3. Filter the input signal x(n) with the filter whose Z-transform is H1 (z).
4. Downsample the filter output by 2 to give output coefficients y1 (n).
Downsampling by k is a common operation in subband filtering. It is represented by the

notation ↓k . This operation converts a sequence of kM coefficients a(0), a(1), . . . , a(kM −1)
to a sequence of M coefficients b(0), b(1), . . . , b(M − 1) by retaining only one of every k
coefficients
b(n) = a(kn). (2.1)

2.3. THE WAVELET TRANSFORM 15
For wavelet transforms H0 (z) will be a lowpass filter and H1 (z) will be a highpass filter.
The idea is to split the original signal into two parts; the general trends given by y0 , and
the fine details given by y1 . The downsampling is a way of preventing redundancy in the
outputs. The detail coefficients (highpass filtered) in y1 are known as wavelet coefficients,
and the lowpass filtered coefficients in y0 are known as scaling coefficients.
A full wavelet transform is constructed by repeating this operation a few times. Each
time the basic splitting operation is applied to the scaling coefficients just generated. Figure
2.2 shows an example of a 4-level subband decomposition tree. This represents the forward

wavelet transform.
Level 4

Level 3
-
y000

- ↓ 2 - y0000

H0
Level 2 y00
- - ↓2

H0
Level 1 -
y0
- ↓2 - - -

↓2 y0001

H0 H1
- - - - -

↓2 ↓2

H0 H1 y001
x - - - ↓2

H1 y01
- H1 - -↓2 y1
Figure 2.2: Subband decomposition tree for a 4 level wavelet transform
The filters are carefully designed in order that the wavelet transform can be inverted.

Figure 2.3 shows the building block for the reconstruction. This block represents the
y0

- -

↑2
?- G0
z
6
+
y1 - - ↑2 G1
Figure 2.3: Building block for inverting the wavelet transform
following operations:
1. Upsample the lowpass coefficients y0 by 2.
2. Filter the upsampled signal with G0 .
3. Upsample the highpass coefficients y1 by 2.

4. Filter the upsampled signal with G1 .
5. Add the two filtered signals together.
Upsampling by k is represented by the notation ↑k . This operation converts a sequence of M

coefficients b(0), b(1), . . . , b(M −1) to a sequence of kM coefficients a(0), a(1), . . . , a(kM −
1) by inserting k − 1 zeros after every coefficient. More precisely: if n is a multiple of k
then a(n) = b(n/k), otherwise a(n) = 0.
The filters are usually designed to ensure that when this reconstruction block is applied
to the outputs of the analysis block, the output sequence z(n) is identical to the input
sequence x(n). This is known as perfect reconstruction (PR).
We can use this block repeatedly in order to recover the original sequence from the
wavelet transform coefficients:
1. Use the block to reconstruct y000 from y0000 and y0001 .
2. Reconstruct y00 from y000 and y001 .
3. Reconstruct y0 from y00 and y01 .
4. Reconstruct x from y0 and y1 .
The structure described above is an efficient way of implementing a wavelet transform.

We now describe an alternative structure that is less efficient but often more convenient for
theoretical results. The alternative relies on the equivalence of the following two transforms:
1. Downsample by k then filter by H(z).
2. Filter by H(z k ) then downsample by k.
This equivalence allows us to move the downsampling steps in figure 2.2 past the filters to
produce the equivalent structure shown in 2.4 in which all the downsampling operations
have been moved to the right. Each subband is now produced by a sequence of filters
followed by a single downsampling step. The impulse responses of these combined filters
are called analysis wavelets (for any combination including a high-pass filter) or scaling
functions (for a combination of only lowpass filters). For example, the y001 coefficients are
produced by filtering with W001 (z) = H0 (z)H0 (z 2 )H1 (z 4 ) followed by downsampling by 8.
2.3. THE WAVELET TRANSFORM

17
Level 4
-
Level 2

Level 3 -H0(z8) - ↓ 16 - y0000

H0 (z 4 )
Level 1 -

- -

H0 (z 2 )- H1 (z 8 ) ↓ 16 y0001
- H0(z)

- - -
-

- -
2
↓4
H1 (z 4 ) ↓8 y001

x(n) H1 (z ) y01
- H1(z) - -
↓2 y1
Figure 2.4: Alternative structure for a subband decomposition tree
The impulse response of W001 (z) is called the analysis wavelet for scale 3 - or simply the
scale 3 wavelet.
In a similar way the reconstruction can be represented as an upsampling step followed
by a single filtering step. The impulse response in this case is called the reconstruction or
synthesis wavelet.
2.3.1 Filter design and the product filter

Vetterli showed that only FIR (Finite Impulse Response) analysis and synthesis filters lead
to perfect reconstruction without implicit pole/zero cancellation [121]. Suppose that we
want to design analysis and reconstruction FIR filters that can perfectly reconstruct the
original sequence. It can be shown [121] that a necessary and sufficient condition for perfect
reconstruction is that
2 = H0 (z)G0 (z) + H1 (z)G1 (z) (2.2)

0 = H0 (−z)G0 (z) + H1 (−z)G1 (z) (2.3)
(Sometimes the perfect reconstruction condition is relaxed to mean that the reconstructed
signal is identical to a shifted version of the original. In this case the LHS of the first
equation is of the form 2z −k for some integer k.)
It can be shown that solutions are given by any FIR filters that satisfy the following
equations
H1 (z) = z −1 G0 (−z) (2.4)

G1 (z) = zH0 (−z) (2.5)

P (z) + P (−z) = 2 (2.6)
where P (z) is known as the product filter and is defined as P (z) = H0 (z)G0 (z).
If we have a solution then we can produce additional solutions by
1. either multiplying all the coefficients in an analysis filter by r and all the coefficients
in the corresponding reconstruction filter by 1/r,
2. or adding a time delay to the analysis filters and a time advance to the reconstruction
filters (or the other way around).
These simple changes do not change the wavelet transform in any significant way. If we
ignore such trivial changes then it can also be shown that any FIR filters that achieve
perfect reconstruction must also satisfy the design equations given above1 .
2.3.2 Terminology
This section defines a number of terms that will be used in the following discussion. For
convenience the important definitions from the previous sections are repeated here.
Perfect reconstruction (PR) A system has the perfect reconstruction property if the
combination of a forward and reverse wavelet transform leaves any signal unchanged.
Analysis filters The filters used in the forward wavelet transform (H0 (z) and H1 (z)).
Reconstruction filters The filters used in the reverse wavelet transform (G0 (z) and
G1 (z)).
Balanced For real filters we define the system to be balanced if G0 (z) = H0 (z −1 ) and
G1 (z) = H1 (z −1 ). (For complex filters balance requires equivalence of the recon-
struction filters with the conjugate time reverse of the analysis filters.) Balanced
filters will therefore have equal magnitude frequency responses |Ga (ejθ )| = |Ha (ejθ )|.
1
If the equation 2.2 is true then it is clear that H0 (z) and H1 (z) cannot share any non-trivial zeros (we
call zeros at the origin trivial). If equation 2.3 is true then any non-trivial zeros of H0 (−z) must therefore
belong to G1 (z). A similar argument shows that any non-trivial zeros of G1 (z) belong to H0 (−z) and
hence G1 (z) = rz k H0 (−z) where k is an integer delay and r is a scaling factor. Finally, substituting this
into equation 2.3 shows that H1 (z) = −(1/r)(−z)−k G0 (−z).
This definition is not normally used in wavelet analysis because for critically sam-
pled systems it is equivalent to orthogonality. In fact, the term balance has been
used for other purposes within the wavelet literature (Lebrun and Vetterli use it to
measure the preservation of a polynomial signal in the scaling coefficients during re-
construction [70]) and the reader should be aware that our usage is not standard.
Nevertheless, the concept is crucial and within this dissertation we will exclusively
use the definition given here.
Near balanced The system is near balanced if the analysis filters are close to the conju-
gate time-reverse of the reconstruction filters.
Redundancy The redundancy of the transform is the ratio of the number of outputs to
the number of inputs. A complex coefficient is counted as two outputs.
Non-redundant A transform is non-redundant if the redundancy is 1.
Orthogonal A PR system is orthogonal if the transform is non-redundant and balanced.
Bi-Orthogonal A PR system is bi-orthogonal if the transform is non-redundant but not

balanced.
Product filter The product filter P (z) is defined as the product of the lowpass analysis
and reconstruction filters P (z) = H0 (z)G0 (z).
Symmetric We say that an odd length filter with Z-transform H(z) is symmetric with
even symmetry if H(z) = H(z −1 ), or symmetric with odd symmetry if H(z) =
−H(z −1 ). Note that these definitions are for both real and complex signals, and
in particular that there is no conjugation. For even length filters we also allow a
time delay; even symmetry means H(z) = z −1 H(z −1 ), odd symmetry means H(z) =
−z −1 H(z −1 ). We will use antisymmetric as another way of saying symmetric with
odd symmetry.
Ideal filter We say a filter is ideal if its frequency response H(f ) takes the value 1 on a
set of frequencies, and 0 on all other frequencies.
Shift invariant We call a method shift invariant if the results of the method are not
affected by the absolute location of data within an image. In other words, a method
that gives the answer b when applied to data a is called shift invariant if it gives a
translated version of b when applied to a translated version of a.
We will call a transform shift invariant if it produces subbands such that the total
energy of the coefficients in any subband is unaffected by translations applied to the
original image.
Shift dependent We call a method shift dependent if the results of the method are
affected by the absolute location of data within an image.
We call a transform shift dependent if it produces a subband such that translations

of an image can alter the total energy in the subband.
It is often useful in developing theoretical results to use vector and matrix notation to
describe the transform. Let N be the number of input samples and M the number of
output coefficients. We will use x to be a N × 1 column vector containing all the input
signal values. Let w denote a M × 1 column vector containing the wavelet coefficients and
let W denote a M × N matrix representing the wavelet transform such that
w = W x.
We also define a N × M matrix P to represent the inverse wavelet transform such that
(for perfect reconstruction wavelet systems)
x = P w.
Matrix multiplication is a very inefficient way of calculating wavelet coefficients and such
an operation should always be implemented using the filter bank form described earlier.
As it is often convenient to express algorithms using this matrix notation we adopt the
convention that whenever such a multiplication has to be numerically evaluated it is tacitly
assumed that a fast filterbank implementation is used.
For complex wavelet transforms we will sometimes use complex coefficients but other
times it is more useful to consider the real and imaginary parts separately. When confusion
is possible we will use the subscripts R and C to denote the separate and complex forms.
For example, treating the output as complex coefficients we can write
wC = WC x
and then use these complex coefficients to define the separated form

{wC }
wR =
{wC }
or equivalently we can calculate the separated form directly by
wR = WR x.
In actual wavelet implementations the data sets are finite and care must be taken
when processing coefficients near the edges of the data set. The problems occur when a
filter requires samples outside the defined range. A natural approach is to assume that such
values are zero (known as zero extension) but this will not produce a perfect reconstruction
system (except for filters with very short support such as the Haar filters). The easiest way
to treat the edges and preserve the PR property is to assume that the signal is periodic
(known as periodic extension). In other words, a sample from just before the beginning of
the data set is assumed to have the same value as a sample just before the end. This has
the drawback that discontinuities are normally created at the edges of the dataset. A third
method (known as symmetric extension) that avoids discontinuities at the edge is based
on reflections of the original data. A sample from just before the beginning is assumed to
have the same value as a sample just after the beginning. Methods differ in whether the
edge samples are doubled up. With a careful design symmetric extension can also result
in a perfect reconstruction system. Any of these edge treatments still results in an overall
linear transform. It is this complete transform (including edge effects) that is represented
by the matrices W and P .
The explanation has so far been restricted to the wavelet transform of one dimensional
signals but we will use the same definitions and notation for two dimensional wavelet
transforms of images. In particular, it is convenient to retain the same matrix and vector
notation so that a N ×1 vector x represents an image containing N pixels and W x represents
computing the two dimensional wavelet transform of x. The efficient implementation of
such a two dimensional wavelet transform requires the alternation of row and column
filtering steps [113].
2.3.3 Single tree complex wavelets

Complex wavelets can be produced by using filters with complex coefficients. This allows
greater freedom in the design. For example, if we want to construct an orthogonal wavelet
transform with symmetric wavelets then the only possible choice with real coefficients is

the Haar wavelet (the Haar wavelet has very simple analysis filters H0 (z) = (1/ (2))(1 +

z −1 ), H1 (z) = (1/ (2))(1 − z −1 )) . However, if complex coefficients are allowed then many
possible solutions are allowed, such as certain cases of the complex Daubechies wavelets[73].
It is of particular interest to construct a (necessarily complex) wavelet transform that
is able to distinguish positive and negative frequencies. There are two main reasons for
this:
1. When images are analysed complex filters can separate the information in the first
and second quadrants (of 2D frequency space). This permits methods to distinguish
features near 45◦ from those near −45◦ .
2. Wavelet methods are often shift dependent due to aliasing caused by the downsam-
pling. Real filters have both negative and positive frequency passbands and usually
an aliased version of the positive passband will have a significant overlap with the
negative passband. By removing the negative passband the aliasing can be greatly
reduced [63, 64].
The first reason only applies to multi-dimensional data sets, but the second reason is always
important.
We now explain why it is impossible to get a useful single tree complex wavelet transform
(with either orthogonal or biorthogonal filters) that will both be able to distinguish positive
and negative frequencies and have good reconstruction properties. Any symmetric wavelet
(with either even or odd symmetry) will have an equal magnitude response to positive
and negative frequencies. Now consider the asymmetric case. We define the passband
as the frequencies for which the magnitude of the frequency response is above 1. From
the equation P (z) + P (−z) = 2 we see that any frequency must be contained in the
passband of either P (z) or P (−z) and that therefore the passband of P (z) must cover
at least half the spectrum. For a useful transform we want P (z) to be within the low
frequency half of the spectrum (more on this assumption later) and hence the passband
of P (z) must cover all of the low frequencies, both positive and negative. Therefore if
H(z) is biased towards positive frequencies then G(z) must be biased towards negative
frequencies. Unfortunately this leads to very bad noise amplification: small changes made
to the wavelet coefficients result in large changes in the reconstructed signal. Section 2.5
contains theoretical results linking noise amplification to the balance between analysis and
reconstruction filters. We will show that the frequency responses of H0 (z) and G0 (z) must
be close for the wavelet transform to achieve low noise amplification. We conclude that
a complex wavelet transform based on a single dyadic tree cannot simultaneously possess
the four following properties:
1. Perfect reconstruction.
2. The ability to distinguish positive and negative frequencies.
3. Balanced filters leading to low noise amplification during reconstruction.
4. A lowpass product filter.
The last property merits a little further discussion. By, for example, applying a phase
rotation to the filter coefficients of an orthogonal real wavelet transform
hk → hk exp {jθk}
(where θ is a real number but not a multiple of π) it is certainly possible to construct a

complex wavelet transform with balanced filters and perfect reconstruction (this operation
corresponds to frequency shifting all the filter frequency responses and hence the product
filter is no longer lowpass). However, such an operation will have the scaling coefficients
tuned to some non-zero frequency and while this may be appropriate for some specialised
application (for example, the wavelet transform in section 2.3.6) it is not generally useful.
More importantly, the system will still possess the same aliasing problems of the original
real wavelet. Complex wavelets only reduce shift dependence problems when the overlap
between aliased passbands is reduced [63]. The fundamental conflict is that we require
narrow passbands in order to reduce shift dependence, but that the product filter passband
for a single tree complex wavelet transform must necessarily cover half the spectrum.
2.3.4 Directionality and Ridgelets

The previous section attempted to motivate the use of complex wavelets by their ability to
give shift invariance and better directionality properties. Real filters must respond equally
to positive and negative frequencies and hence transforms based on separable filtering
with real filters will not differentiate between adjacent quadrants in the 2D frequency
plane. Separable filtering means that we can apply a 2D filter by first filtering all the
rows with one 1D filter, and then filtering all the columns with a second 1D filter. This
leads to efficient computation of wavelet transforms but it is not the only possibility.
Bamberger and Smith proposed a directional filter bank [7] that generalises the notion of
separability by modulating, rotating, and skewing the frequency content of the image. This
results in a transform that splits the spectrum into a number of fan shaped portions but
is shift dependent and does not differentiate between different scales. Another important
alternative is known as a Ridgelet transform [19, 20].
Candès and Donoho have developed a mathematical framework for a rigorous treat-
ment of a continuous ridgelet transform[19] and notions of smoothness associated with
this transform[20] but we shall only discuss the discrete version. The ridgelet transform
acting on images can be implemented by a Radon transform followed by the application
of a one-dimensional wavelet transform to slices of the Radon output. A Radon transform
computes the projection of the image intensity along a radial line oriented at a specific
angle. For the ridgelet transform, angles of the form 2πl2−j are used where j and l are
integers.
A particular basis function of the ridgelet transform has a constant profile (equal to
a 1D wavelet) in some specific direction depending on the associated angle in the Radon
transform. The large range of angles used means that the ridgelet transform has good
directionality properties and is much better suited than the standard real wavelet for
analysing straight lines and edges of arbitrary orientations. Unfortunately, as the ridgelet
transform is based upon a 1D DWT, the transform naturally inherits the aliasing of the 1D
DWT. Later in this chapter we discuss examples of useful complex wavelets that do reduce
aliasing. It would be interesting to construct a complex ridgelet transform by replacing the
DWT as this might add shift invariance to the other properties of the ridgelet. However, in
this dissertation we have chosen to extensively test a single representative complex wavelet
transform rather than exploring the range of construction options.
2.3.5 Shift invariance and the Harmonic wavelet
The tree structure described for the DWT is non-redundant as it produces the same number
of output coefficients as input coefficients. This has a number of advantages including low
storage requirements and fast computation. However, there is one important disadvantage:
any non-redundant wavelet transform based on FIR filters will produce shift dependent
methods. The down-sampling introduces aliasing and so the results of processing will
depend upon the precise location of the origin. Essentially the problem is that the subbands
are critically sampled and hence there will always be aliasing unless the filters have an ideal
band pass response. However, FIR filters always have a non-ideal response.
It may be thought that by constructing some new tree system with carefully chosen
filters and degrees of downsampling that produce oversampling in some subbands and
undersampling in others it may be possible to get round this problem and produce a linear
transform with a negligible amount of aliasing while still using short support filters. We
now consider the performance of such a system.
Suppose we have some linear PR non-redundant transform represented by the matrices
W for the forward transform and P for the inverse transform. PR means that P W = IN .
As the transform is non-redundant both W and P are square matrices. Therefore P = W −1
and W P = IN . This means that a inverse transform followed by the forward transform
will give identical wavelet coefficients.
Now consider the elementary processing step that reconstructs from just the coefficients
in one subband. This is elementary both in the sense that it is simple and in the sense that
more complicated operations can often be viewed as a combination of such steps. If the
transform is to be shift invariant then this operation must represent a stationary filtering
operation.
Let T be a diagonal matrix whose diagonal entries select out a chosen subband. In
other words, if wi is an output coefficient in the chosen subband then Tii = 1, while all the
other entries in T are zero. The filtering can therefore be represented as
z = PTWx (2.7)
However, if we repeat the filtering the output is
PTWPTWx = PTTWx = PTWx (2.8)
and we conclude that repeating the filtering does not change the output. Any filter localised
in both space and frequency should continue to change a signal when repeated, and we
conclude that the transform either possesses ideal bandpass filters or that it results in shift
dependent processing.
This shows that no matter what games are played with sampling structures and filters
it is always impossible to avoid shift dependence (for a linear non-redundant PR transform)
without constructing ideal band pass filters.
In fact, this argument suggests the stronger result that the amount of shift dependence
is directly related to the amount the filters differ from an ideal bandpass response.
However, it is not true that significant aliasing necessarily leads to worse results. For
example, suppose we wish to implement a very simple lossy image coder by just transmit-
ting a few of the largest coefficients. The reconstructed image will be the z of equation
2.7 (where T is now defined to preserve the transmitted coefficients). If the reconstructed
(degraded) image is now coded again with the same lossy image coder then the result of
equation 2.8 holds and proves that no additional errors are introduced by the repeated
coding. Of course, in practice there will be quantisation errors and a more sophisticated
choice of which coefficients to keep but it is certainly feasible that the aliasing is beneficial.
Therefore one of the aims of the dissertation is to experimentally test the importance of
aliasing in different applications.
In the discussion above we have always needed to exclude filters with ideal responses. It
is impossible to strengthen the results because of three important counter examples. The
first example is a transform consisting of the filter H(z) = 1 that produces a single subband
(containing the original data). This results in shift invariant processing in a trivial sense
by means of doing nothing. Less trivially, the second example is the well-known Fourier
transform that also results in a shift invariant system. Thirdly and most interestingly is the
example of orthogonal harmonic wavelets. Harmonic wavelets were proposed by Newland
[85] and are particularly suitable for vibration and acoustic analysis [86]. In particular,
orthogonal harmonic wavelets provide a complete set of complex exponential functions
whose spectrum is confined to adjacent non-overlapping bands of frequency.
The easiest way to describe orthogonal harmonic wavelets is to give an account of an
efficient algorithm for their construction [86]:
1. Compute the N point FFT (Fast Fourier Transform) of the data x.
2. Subband k is formed from the mk point inverse FFT of mk consecutive Fourier

coefficients.
There is no restriction on the number mk of coefficients for each subband except that
together each coefficient must be associated with exactly one subband.
From this construction it is easy to construct a perfect reconstruction inverse by in-
verting each step. It is also clear that the filters corresponding to each subband have
an ideal bandpass response and hence result in a shift invariant system. The drawback
of an ideal bandpass response is that the associated wavelets have a poor localisation in
time. In practice, the box-car spectrum of the orthogonal harmonic wavelets is smoothed
to improve this localisation and the spectra of adjacent wavelet levels are overlapped to
give oversampling to improve the resolution of time frequency maps generated from the
wavelets. These more practical systems are known as harmonic wavelets.
From a computational perspective there is not much difference between this method
and a complex wavelet implemented with a tree structure. The complexity of the Fourier
transform is order N log N while a wavelet transform is order N but for signals of modest
length the harmonic wavelet may well be quicker to compute.
The design freedom of harmonic wavelets makes them well suited for analysis and a
careful design would also permit a stable reconstruction transform to be generated. The
resulting transform would be a redundant complex wavelet system. We have actually se-
lected a different form of complex wavelet transform as a representative for the experiments
but we would expect the results to be very similar for the harmonic wavelets.
2.3.6 Non-redundant, directionally selective, complex wavelets
Recently a new complex wavelet transform has been proposed [120] that applies filters
differentiating between positive and negative frequencies to the subbands from a standard
wavelet transform. The outputs of these filters are again subsampled so that the complete
complex transform is non-redundant. The multiple subbands produced by this complex
filtering can be recombined to give a perfect reconstruction. (Recall that we cannot hope
to reconstruct from just a single branch, such as the top-right quadrant, for the reasons
given in section 2.3.3.)
We proposed two main reasons for wanting to use complex wavelets; increased direction-
ality, and reduced shift dependence. This type of complex wavelet transform has increased
directionality, but as it is based on the shift dependent DWT subbands it naturally retains
the DWT shift dependence. Furthermore, the additional filtering discriminating between
the different quadrants of frequency space will cause additional shift dependence errors.
Using increased redundancy in this method could reduce this additional shift dependence
but will never remove the shift dependence caused by basing the transform on the output
of a standard decimated wavelet transform.
2.3.7 Prolate spheroidal sequences

Slepian introduced the use of the prolate spheroidal wavefunctions for signal processing
[109]. These functions are the solutions of a simple energy form of the uncertainty principle.
For practical application it is useful for the functions to be defined on finite 2-D lattices
[126]. Such functions are called finite prolate spheroidal sequences (FPSSs). A FPSS
can be defined as the eigenvectors of a linear transform that first bandlimits a signal
and then truncates the signal. The eigenvector corresponding to the largest eigenvalue
can then be translated and frequency shifted to construct a set of basis functions that
tile the time-frequency plane. We will call these basis wavefunctions the wavelets of this
transform. Wilson has examined the properties of a critically sampled FPSS [126]. In this
case critically sampled means that if we have an input containing N complex numbers,
then the output also contains N complex numbers. He defines a matrix Q = W H W where
W represents the transform (i.e. calculating the correlation of the signal with each wavelet)
and W H represents the Hermitian transpose of W . If the wavelets were orthogonal then Q
would be equal to the identity matrix. The condition number µ of this matrix is measured.
The condition number is defined as

µ(Q) = σ0 /σN −1 (2.9)
where σk is the (k + 1)th eigenvalue, in descending numerical order, of Q. Wilson found

condition numbers ranging between 1.348 and 1.406 [126]. This means that the ratio of
the eigenvalues varies by more than a factor of 1.8 (the square of the condition number).
As this is greater than 1 the reconstruction filters will not be balanced. The significance
of the condition number will be seen in section 2.5.2.
The main problems with the FPSS occur when we need to reconstruct the signal from
the transform coefficients. The only design criteria for the wavelets is good time-frequency
energy concentration. There is no direct consideration of the reconstruction filters or of
the effect of constructing a wavelet pyramid from the functions. For example, there is
no guarantee that it will be possible to have finite impulse response (FIR) reconstruction
filters. It is suggested [126] that truncating the inverse filters results in an almost perfect
reconstruction, but errors will be magnified by the tree structure or by iterative techniques.
The frequency responses of the functions tend to be Gaussian shaped and well separated.
Although this is good for reducing aliasing, it leads to problems during reconstruction.
For perfect reconstruction we need an overall frequency response that sums to unity for
2.4. REDUNDANT COMPLEX WAVELETS 29
all frequencies. This results in large amplification being necessary for some frequencies
and hence bad noise amplification properties due to the unbalance between analysis and
reconstruction filters.
Another problem occurs if we try and use the FPSS wavefunctions as the low and high
pass analysis filters of a wavelet transform. The low pass filter does not necessarily have
a zero at a frequency of π and therefore the coarse level scaling functions can develop
oscillations [30].
2.4 Redundant complex wavelets

We first describe the traditional formulation (called Gabor wavelets) and problems of
redundant complex wavelets and then explain recently developed solutions.
The term Gabor wavelets is used for several different systems that differ in the choice of
positions, orientations, and scales for the filters. They were originally proposed in 1980 by
Marcelja [77] for 1D and Daugman [31] for 2D in order to model the receptive field profiles
for simple cells in the visual cortex. This section describes the choice of Gabor wavelets
that Manjunath and Ma found to be best in their texture processing experiments [76]. A
two-dimensional Gabor function centred on the horizontal frequency axis can be written
as

1 1 x2 y2
g(x, y) = exp − + + 2πjW x (2.10)
2πσx σy 2 σx2 σy2
where σx and σy are the bandwidths of the filter and W is the central frequency. This
function can then be dilated and rotated to get a dictionary of filters by using the trans-
formation
gmn (x, y) = a−m g(x , y ) (2.11)

x = a−m (x cos θ + y sin θ) (2.12)
y = a−m (−x sin θ + y cos θ) (2.13)
where θ = nπ/K and K is the total number of orientations. Given a certain number of
scales and orientations, the scaling factor a and the bandwidths of the filters are chosen
to ensure that the half-peak magnitude support of the filter responses in the frequency
spectrum touch each other. Figure 2.5 shows these half-peak contours. Manjunath and
Ma found that a choice of 4 scales (with a scaling factor of a = 2) and 6 orientations at

each scale was best.
0.12
vertical frequency (/sample freq)

0.1
0.08
0.06
0.04
0.02
0
−0.02
−0.04
−0.1 −0.05 0 0.05 0.1
horizontal frequency (/sample freq)
Figure 2.5: Contours of half-peak magnitude of filters at scales 3 and 4
They can be implemented using the Fast Fourier Transform (FFT) to perform the
filtering. This requires one forward transform, and the same number of inverse transforms
as there are desired subbands in the image. This process gives a very high redundancy
(equal to the number of subbands) and is therefore slow to compute. The main advantage
is that the frequency responses can be chosen to achieve perfect reconstruction. Some
attempts have been made to reduce the amount of redundancy. For example Daugman
[32] uses a subsampled set of Gabor wavelets on a regular grid, while Pötzsch et al [95]
use sets of Gabor wavelets (that they call jets) centered on a small number of nodes. Such
methods have two main problems:
1. They are inefficient. The Gabor wavelet coefficients are found by calculating the full
transform and discarding the unwanted coefficients.
2. They are hard to reconstruct. Daugman achieves his reconstruction by a process

of gradient descent during analysis that finds weights for the Gabor wavelets to
allow simple reconstruction by filtering [32]. This is a slow process. The problem is
expressed in terms of a neural network that must be trained for each new image to
be processed.
Pötzsch et al propose an approximate reconstruction that ignores the interaction
between jets at different nodes [95]. Their exact reconstruction takes 900 times
longer than the approximate method.
An alternative approach was developed by Magarey [74]. He developed a complex

wavelet transform based on short 4-tap filters that had responses very close to Gabor
wavelets. This efficient system was successfully used for motion estimation (as described
in section 3.2) but did not possess a simple set of reconstruction filters.
2.4.1 Dual tree complex wavelet transform

Kingsbury’s complex wavelets [62, 63] have similar shapes to Gabor wavelets. The fre-
quency responses for the 2D transform are shown in figure 2.6. As for the Gabor wavelets,
there are 6 orientations at each of 4 scales (any number of scales can be used, but the
number of orientations is built into the method).
0.12
vertical frequency (/sample freq)
0.1
0.08
0.06
0.04
0.02
0
−0.02
−0.04
−0.1 −0.05 0 0.05 0.1
horizontal frequency (/sample freq)
Figure 2.6: Contours of 70% peak magnitude of filters at scales 3 and 4

The main advantages as compared to the DWT are that the complex wavelets are
approximately shift invariant and that the complex wavelets have separate subbands for
positive and negative orientations. Conventional separable real wavelets only have sub-
bands for three different orientations at each level and cannot distinguish lines near 45◦
from those near −45◦ .
The complex wavelet transform attains these properties by replacing the tree structure
of the conventional wavelet transform with a dual tree shown in figure 2.7. At each scale
one tree produces the real part of the complex wavelet coefficients, while the other produces
the imaginary parts. Note that all the filters in the dual tree are real. Complex coefficients
only appear when the two trees are combined.
The extra redundancy allows a significant reduction of aliasing terms and the complex
wavelets are approximately shift invariant [63]; translations cause large changes to the
phase of the wavelet coefficients, but the magnitude, and hence the energy, is much more
stable.
By using even and odd filters alternately in the trees it is possible to achieve overall
complex impulse responses with symmetric real parts and antisymmetric imaginary parts.
Level 3
Level 4
-H0000a - ↓ 2 - x0000a
x000a
- -

Level 2
x00a
↓2 even

H000a
-
Level 1- x0a
- - -

↓2 ↓2 x0001a

H00a odd H0001a
- - - - -

↓2 ↓2
even

H0a H001a x001a
- - -

Tree a ↓2

odd H01a x01a
- - -

↓2

H1a x1a
- - - x000b

↓2

x0000b

H0000b
- - x00b

↓2
x

H000b odd
- - x0b
- - -

↓2 even ↓2 x0001b

H00b H0001b
- - - - -

H0b ↓2 odd H001b ↓2 x001b
Tree b - - - ↓2

odd H01b x01b
- - - H1b ↓2

x1b 2-band reconstruction block
-

- - -

?-
x...
H...0 ↓2 ↑2 G...0
y...
6
+
- - - - H...1 ↓2 ↑2 G...1
Figure 2.7: The complex wavelet dual tree structure. This figure was provided by Dr N.
Kingsbury.
The filters are designed to give a number of desired properties including strong discrimina-
tion between positive and negative frequencies. Note that it is impossible to discriminate
positive and negative frequencies when using conventional real wavelets. This important
property means that in a 2D version of the dual tree separable filters can be used to filter
an image and still distinguish the information in the first and second quadrants of the
two-dimensional frequency response - information that allows us to distinguish features at
angles near 45◦ from those near −45◦ .
The filters are near-balanced and permit perfect reconstruction from either tree. The
results of inverting both trees are averaged as this achieves approximate shift invariance.
In d dimensions with N samples, the transform has a computational order of N2d . For
comparison the fully decimated transform has order N and the nondecimated wavelet
transform has order N((2d − 1)k + 1) where k is the number of scales.
2.4.2 Q-shift Dual tree complex wavelets

In each subband one tree produces the real part and the other the imaginary part of the
complex wavelet coefficient and so the filters in the two trees cannot be identical but must
be designed to produce responses that are out of phase. More precisely, a delay difference
1
of 2
sample is required between the outputs of the two trees. The main problems with the
odd/even filter approach to achieving this delay are that [65]:
1. The sub-sampling structure is not very symmetrical.
2. The two trees have slightly different frequency responses.
3. The filter sets must be biorthogonal because they are linear phase.
These drawbacks have been overcome with a more recent form of the dual tree known as a
Q-shift dual tree [65]. This tree is shown in figure 2.8. There are two sets of filters used,
the filters at level 1, and the filters at all higher levels. The filters beyond level 1 have
even length but are no longer strictly linear phase. Instead they are designed to have a
group delay of approximately 14 . The required delay difference of 1
2
sample is achieved by
using the time reverse of the tree a filters in tree b. The PR filters used are chosen to be
orthonormal, so that the reconstruction filters are just the time reverse of the equivalent
analysis filters. There are a number of choices of possible filter combinations. We have
Level 4
- H00a - ↓ 2 - x0000a
Level 3 x000a
- -
- - Level 2 x00a

↓2

H00a
-
Level 1 - x0a
-

↓2 ↓2 x0001a

H00a H01a
- - - - -

↓2 ↓2

H0a H01a x001a
- - -

Tree a ↓2

H01a x01a
- - -

↓2

H1a x1a
- - - x000b

↓2

x0000b

H00b
- - x00b

↓2

x H00b
- - x0b
- - -

↓2 ↓2 x0001b

H00b H01b
- - - - -

H0b ↓2 H01b ↓2 x001b
Tree b - - - ↓2

H01b x01b
- - - H1b ↓2 x1b
Figure 2.8: The Q-shift dual tree structure. This figure was provided by Dr N. Kingsbury.
chosen to use the (13-19)-tap near-orthogonal filters at level 1 together with the 14-tap
Q-shift filters at levels ≥ 2 [65].
The Q-shift transform retains the good shift invariance and directionality properties of
the original while also improving the sampling structure. When we talk about the complex
wavelet transform we shall always be referring to this Q-shift version unless explicitly stated
otherwise. We will often refer to this transform by the initials DT-CWT.
2.4.3 Steerable transforms

Simoncelli et al [107] highlighted the problem of translation invariance for orthogonal
wavelet transforms and developed a theory of shiftability. They developed a two-dimensional
pyramid transform that was shiftable in both orientation and position which decomposes
the image into several spatial frequency bands. In addition it divides each frequency band
into a set of four orientation bands.
The decomposition is called “steerable” because the response of a filter tuned to any
orientation at a particular level can be obtained through a linear combination of the four
computed responses at that level.
At each level four output subband images are produced by using high pass filters tuned
to different orientations. A radially symmetric low-pass filter is also applied and the output
is subsampled by a factor of two in each direction to produce the input image for the next
level. The steerable pyramid is self-inverting in that the reconstruction filters are the same
as the analysis filters. This decomposition has the disadvantages of non-separable filters,
non-perfect reconstruction and being an over-complete expansion. Note that there is no
sub-sampling in the high-pass channels and so a three level decomposition of an N by N
image will produce:
1. 4 subband images with N*N coefficients
2. 4 images with (N/2)*(N/2) coefficients
3. 4 images with (N/4)*(N/4) coefficients
4. and a final low pass image with (N/8)*(N/8) coefficients.
Hence we have gone from N 2 numbers to (4(1+1/4+1/16)+1/64)N 2 ≈ 5.25N 2 coefficients.

This transform has been used for a number of applications including stereo matching
[107], texture synthesis [45], and image denoising [107]. For some of these a quadrature pair
of steerable filters was used. Using quadrature filters makes the steerable transform almost
equivalent to the DT-CWT with the main differences being that the steerable transform
has increased redundancy and worse reconstruction performance.
2.4.4 Multiwavelets
An alternative way of avoiding the limitations of the standard wavelet transform is known
as multiwavelets [3, 46]. From the filterbank perspective the difference is that the signals
are now vector valued and the scalar coefficients in the filter banks are replaced by matrices.
The conversion of the original data signal to the vectorised stream is known as preprocessing
and there are a number of choices. The choice of preprocessing decides the redundancy of
the system and it is possible to have both critically sampled and redundant multiwavelet
systems. Experimental results in the literature indicate that the redundant systems usually
give better results for denoising [114] (but are less appropriate for coding applications).
Multiwavelets are closely related to the DT-CWT. If we combine the signals from tree
a and b to produce a single 2 dimensional signal then it is clear that the DT-CWT for
scales 2 and above is equivalent to a multiwavelet system of multiplicity 2. The equivalent
matrices in the multiwavelet filterbank are simply 2 by 2 diagonal matrices whose diagonal
entries are given by the corresponding coefficients from the DT-CWT filters.
The DT-CWT processing at scale 1 can be viewed in two ways. One way is to see
it as part of a multiwavelet structure that uses repeated row preprocessing (giving the
redundancy) and has different filters for the first scale. The other way is to interpret the
first scale as performing the multiwavelet preprocessing while preserving (in the scale 1
subbands) the parts of the signal that are filtered out.
In either case it is clear that the DT-CWT is a special case of a multiwavelet transform.
The advantages of this special case are:
1. The preprocessing and filters are carefully designed to allow an interpretation of the
output as complex coefficients produced by filters discriminating between positive
and negative frequencies. This leads to the good shift invariance properties.
2. The absence of signal paths between the two trees (reflected in the diagonal structure
of the matrices) leads to less computation.
2.5. NOISE AMPLIFICATION THEORY 37
2.5 Noise amplification theory

The use of redundant transforms gives a much greater design freedom but there are also
extra complications. This section considers the effect of a very simple model of wavelet
domain processing. The model and notation are described in section 2.5.1. If the wavelet
transform is associated with an orthogonal matrix then an error in the wavelet coefficients
translates directly into an error of the same energy in the signal. This will be the case
for (appropriately scaled) orthogonal wavelet transforms. However, for biorthogonal and
redundant transforms the energy of the error can change significantly during reconstruction.
We give results (proved in appendix A) that describe the noise gain during reconstruction
and in particular equation 2.15 that gives a formula for the noise gain in terms of the
redundancy and unbalance of the transform (the noise gain is a measure of the change in
error energy caused by the inverse wavelet transform).
2.5.1 Preliminaries
We consider a simple form of wavelet processing in which we have an observed image (or
signal) x ∈ R N that is processed by:
1. Calculate the wavelet transform w ∈ R M of x
w = Wx
This theory applies to both real and complex wavelet transforms. For complex trans-
forms we use the separated form in which w is still a real vector and consists of the
real parts of the complex coefficients followed by the imaginary parts.
2. Apply some wavelet domain processing to produce new wavelet coefficients v ∈ R M .
3. Invert the wavelet transform to recover a new signal y ∈ R N
y = P v.
In the mathematical analysis of this model we will model the wavelet domain processing
by adding independent white Gaussian noise of mean zero and variance σ 2 to the wavelet
coefficients.

v = N w, σ 2 IM
Wavelet Transform Inverse Wavelet Transform

w v
x W + P y
noise
Wavelet domain processing
Figure 2.9: Model of wavelet domain processing.
where σ 2 is the variance of the added noise. This model is illustrated in figure 2.9. The
total expected energy of the added noise is E {
v − w
2 } = Mσ 2 . The total expected
energy of the error after reconstruction is given by E {
y − x
2 }. We would like to define
the noise gain as the ratio of these two energies but there is a problem due to scaling.
Suppose we construct a new wavelet transform W = sW that simply scales the values of
all the wavelet coefficients by a factor of s and a new reconstruction matrix P = (1/s)P
that still achieves PR. During reconstruction all the coefficients are scaled down by a factor
of s and hence the noise energy after reconstruction is reduced by s2 . However, in almost all
practical applications a scaling factor of s will mean that the noise standard deviation σ is
also increased by the same factor2 . In order to get meaningful values for the noise gain we
adopt a convention that the scaling factor s is chosen such that the transform preserves the
energy of a white noise signal during the forward wavelet transform. In other words, if we
use the transform to analyse a signal containing independent white Gaussian noise of mean
0 and variance α2 (to give a total expected input energy of Nα2 ) then the total expected
energy of the wavelet coefficients will be equal to Nα2 . We shall prove (A.1) that this is
equivalent to the requirement that tr(W T W ) = N. We define a normalised transform to
be a transform scaled in this manner. We assume that all transforms mentioned in this
section satisfy this convention. The noise gain g is then defined as
E {
y − x
2 }
g= (2.14)
E {
v − w
2 }
A low noise gain means that small changes in the wavelet coefficients lead to small changes
in the reconstructed signal. In this case we will call the reconstruction robust.
2
One exception to this principle is when we use fixed precision numbers to store the coefficients. The
quantisation noise will remain at the same level for different scaling factors s and the argument is now a
valid argument for scaling the coefficients to use the full dynamic range.
We now attempt to motivate this model by giving two examples where it might be
appropriate:
1. In an image coding example the wavelet coefficients are commonly quantised to a

number of discrete values. The quantisation results in errors in the wavelet coefficients
and consequently errors in the decoded image. If these values are equally spaced by
a distance dQ then we could attempt to model the errors in the wavelet coefficients as
white Gaussian noise of variance d2Q /12. Better models certainly exist, for example
if dQ is very small then a uniform distribution of errors in the range −dQ /2 to dQ /2
is a closer approximation while for large dQ (that quantise most of the coefficients
to zero) a Laplacian distribution may give a better fit, but white Gaussian noise is
often a reasonable first approximation. In this case we would want a low noise gain
because in a codec we wish to minimise the error, x − y, in the decoded image for a
given level of quantisation error in the wavelet coefficients.
2. Consider an image restoration technique, such as denoising, for which the wavelet
coefficients are significantly changed. The wavelet domain processing is designed to
make the output coefficients v a reasonable estimate of the wavelet coefficients of
the original image. The purpose of the technique is to get an enhanced image and
initially it may seem inappropriate to use a transform that minimises the effect of
the change. However, if we now reinterpret x as representing the original image, and
w as the wavelet coefficients of this original image, then the same model can be used
to represent the belief that v will be noisy estimates of w. This reinterpretation
may seem a bit confusing but is worth understanding. The estimates v are in reality
produced by some estimation technique acting on the observed data. Nevertheless,
they are modelled based on the original image. The noise in the model represents
the wavelet coefficient estimation error. From this new perspective the noise gain of
the transform measures the relationship between the wavelet estimation error and
the final image error and it is clear that a low noise gain will be beneficial in order
to produce a low final image error.
The theory is equally valid for any linear transforms W and P provided that the system
achieves perfect reconstruction (P W = IN ) and the scaling convention (tr(W T W ) = N)
is observed.
2.5.2 Theoretical results

Consider a finite linear transform represented by the matrix W. The maximum robustness
of the transform W can be calculated from the numbers d1 , ..., dN (which are defined in
A.2 and are the squares of the singular values of W).
• The frame bounds are given by the largest and smallest of these numbers and so the
transform represents a wavelet frame if and only if the smallest is non-zero (see A.3
for proof).
• The average of these numbers gives the gain in energy when we transform white noise
signals and so the average is one for normalised transforms (see A.4 for proof).
• Any linear perfect reconstruction transform that is used to invert W has noise gain

i=N 1
bounded below by M1 i=1 di and this lower bound is achievable (see A.5 for proof).
• If the frame is tight then it can be inverted by the matrix W T , and this inversion
achieves the lower bound on noise gain (see A.6 for proof).
• The noise gain of any real linear perfect reconstruction transform, P , used to invert
W is given by
N +U
g= (2.15)
M

where U is a non-negative quantity given by U = tr (P − W T )(P T − W ) (see A.7
for proof). We will call U the unbalance between the analysis and reconstruction
transform.
Most of these results come from standard linear algebra and can be found in the literature.
We present them here for interest and as a route to the final simple equation for the noise
gain in terms of the unbalance. We are not aware of this final equation (2.15) appearing
in the literature.
Balanced wavelet transforms use the conjugate time-reverse of the analysis filters for
the reconstruction filters and therefore P T = W . (This result is true regardless of whether
we have a real or complex wavelet because we are using the expanded form of the complex
wavelets that treat the real and imaginary parts separately. The equivalent statement for
complex matrices is that PCH = WC .) This results in U taking its minimum value U = 0
and we deduce from the last result that balanced wavelet transforms will have the least
noise gain and hence the greatest robustness.
This can also be expressed in terms of the frequency responses. The frequency response
of an analysis filter will be (for a balanced transform) the conjugate of the frequency re-
sponse of the corresponding reconstruction filter. Note that this means that the magnitude
of the frequency responses will be equal. Therefore a necessary condition for low noise gain
is that the frequency responses of H0 (z) and G0 (z) must be close.
2.5.3 Numerical results

Section 2.3.3 argued that it was impossible to get a complex tree that differentiated positive
and negative frequencies while maintaining good noise reconstruction properties. The
purpose of this section is merely to illustrate the problems that can occur.
We tested the robustness of the DT-CWT and a variety of single tree complex wavelets.
Each transform was designed to produce a 6 level wavelet decomposition of a real signal3
of length 128 and all produced 128 (complex-valued) wavelet output coefficients, and so
all had a redundancy of 2. The transforms were normalised to preserve energy during the
forward transform by a single scaling applied to all output coefficients.
Daubechies wavelets of a certain length are designed to have the maximum number
of zeros in the product filter P (z) at z = −1 subject to the constraint of satisfying the
conditions necessary for perfect reconstruction [30]. The product filter is then factorised
into H0 (z) and G0 (z). Usually factors are allocated in conjugate pairs in order to produce
real filters but complex wavelets can be produced by alternative factorisations [73]. Each
factor a + bz −1 corresponds to a zero of the product filter at z = −b/a. For our experiment
we put all the factors corresponding to zeros with a positive imaginary part into H0 (z) and
all those corresponding to negative imaginary parts into G0 (z). The factors corresponding
to zeros on the real axis are split equally between H0 (z) and G0 (z). This choice gives the
greatest possible differentiation between positive and negative frequencies.
Each choice of filters corresponds to a linear transform that can be represented by
the matrix W . This matrix is found by transforming signals e1 , . . . , eN where ek is zero
everywhere except that at position k the value is 1. The wavelet transform of ek gives the
3
The single tree complex wavelets were restricted to produce a real output by taking the real part of
the reconstructed signal.
k th column of W . Similarly the columns of matrix P are found by reconstructing signals

from wavelet coefficients that are all zero except for a single coefficient. Recall that we are
treating the real and imaginary parts separately and hence W is of size 2N × N. There are
many ways of designing a perfect reconstruction inverse of W . The matrix P represents
one way (that can be efficiently implemented with a tree of filters) but other ways can give
a better noise gain. The theoretical minimum reconstruction noise gain would be given if
we reconstructed using the pseudo-inverse solution. The results of appendix A can be used
to calculate both the noise gain for P and the minimum noise gain. The numerical results
are tabulated in figure 2.10.
We characterise the single tree complex wavelets (STCWT) by the number of zeros the
H0 (z) filter has at z = −1. The greater this number the smoother the wavelets and so in
the table these wavelets are described by the acronym “STCWT” followed by the number
of zeros.
Transform Minimum Reconstruction

name noise gain noise gain
Original dual-tree 0.571 0.599
Q-shift dual-tree 0.50028 0.50032
STCWT 2 0.5 0.5
STCWT 3 0.6 1.5
STCWT 4 0.93 6.7
STCWT 5 3.2 211
STCWT 6 9.6 3430
STCWT 7 53 21000
STCWT 8 225 6.0 ∗ 106
STCWT 9 1570 4.7 ∗ 108
STCWT 10 8370 1.8 ∗ 1010
Figure 2.10: Comparison of noise gain for different transforms
2.5.4 Discussion
The transform labelled STCWT 2 is a special case because the product filter has no complex
zeros. The analysis and reconstruction filters are all real and this transform is identical to a
2.6. CONCLUSIONS 43
real (and orthogonal) Daubechies wavelet transform (of order 2). The transform is therefore
balanced and achieves low noise gain but has the problems of large shift dependence and
wavelets which are not very smooth. To allow direct comparison we still treat these real
wavelets as having complex outputs and so add complex noise to them. Half of this complex
noise is lost when we take the real part at the end of the transform which is why the noise
gain is 1/2. If we attempted to compute a wavelet transform with 1 zero in H0 (z) we would
obtain the Haar wavelet transform with a noise gain of 1/2 for the same reasons.
The DT-CWT achieves a very low noise gain and so will give robust reconstructions.
We note that the Q-shift tree has a lower noise gain than the original dual tree. This
is because of the better balanced filters in the Q-shift version. The single tree complex
wavelets, however, have a rapidly increasing noise gain for the longer (and smoother)
wavelets which is likely to make the wavelets useless. Note that the minimum noise gain
increases at a much slower rate suggesting that an alternative reconstruction transform
could be found with much less noise gain. However, even this alternative would not be of
much practical use as the minimum noise gain is still significant.
Note that this is a very unusual choice of complex Daubechies wavelet. The forms
more commonly used are much more balanced and do not suffer from these reconstruc-
tion problems but consequently have poor discrimination between positive and negative
frequencies.
2.6 Conclusions
The main aim of this chapter was to introduce the terminology and construction of wavelet
and complex wavelet systems. The secondary aim was to explain why we want to use
complex wavelets and what form of complex wavelet is appropriate. We now summarise
the principal points relating to this secondary aim:
1. We want to distinguish positive and negative frequencies in order to:
(a) Improve the directional frequency resolution of the 2D transform while still using
efficient separable filters.
(b) Reduce aliasing and produce shift invariant methods.
2. We define a transform to be balanced if the reconstruction filters are the conjugate

time reverse of the analysis filters. We show that the noise gain during reconstruction
is closely related to the balance of the transform and, in particular, that the best (i.e.
lowest) noise gains are given by balanced transforms and are equal to the reciprocal
of the redundancy.
3. We explain why linear PR complex wavelets based on a single standard tree cannot
simultaneously; give shift invariant methods, be balanced, and use short support
filters.
4. We illustrate numerically the problems of noise gain caused by lack of balance when
we use a single tree complex wavelet that strongly differentiates between positive and
negative frequencies.
5. Using an elementary processing example we explain more generally why any non-
redundant linear PR transform must use ideal filters in order to achieve shift invari-
ance.
6. We introduced several forms of redundant wavelet transforms including Gabor wavelets,

prolate spheroidal sequences, steerable transforms, harmonic wavelets, multiwavelets,
and dual tree complex wavelets. The purpose of this dissertation is not to compare
different types of complex wavelet, but rather to explore the potential of such a sys-
tem against more standard approaches. For this reason we have just selected one
transform, the Q-shift dual tree complex wavelet system, as a representative.
Chapter 3
Previous applications
3.1 Summary
This dissertation aims to explore the potential of the DT-CWT in image processing. The
purpose of this chapter is to describe applications for which the DT-CWT (or a similar
transform) has already been evaluated.
The phase of the complex coefficients is closely related to the position of features within
an image and this property can be utilised for motion estimation [74]. The properties of
the DT-CWT (in particular, its extra directional frequency resolution) make it appropri-
ate for texture classification [47, 43, 33] and gives methods that are efficient in terms of
computational speed and retrieval accuracy. The complex wavelets are also appropriate
for use in denoising images [62]. Previous work has shown that a nondecimated wavelet
transform [82] performs better than decimated transforms for denoising and the DT-CWT
is able to achieve a performance similar to the nondecimated transforms. Interestingly,
when the DT-CWT is used in more sophisticated denoising methods [24] it is found to
significantly outperform even the equivalent nondecimated methods.
The only original results in this chapter come from the replication of the denoising
experiments in section 3.4.
3.2 Motion Estimation

Magarey [74] developed a motion estimation algorithm based on a complex discrete wavelet
transform. The transform used short 4-tap complex filters but did not possess the PR
45
46 CHAPTER 3. PREVIOUS APPLICATIONS
property. In other words, the author had been unable to find simple FIR filters that could
be used to exactly reconstruct the original signal. The filter shapes were very close to
those used in the DT-CWT suggesting that the conclusions would also be valid for the
DT-CWT.
The task is to try and estimate the displacement field between successive frames of an
image sequence. The fundamental property of wavelets that makes this possible is that
translations of an image result in phase changes for the wavelet coefficients. By measuring
the phase changes it is possible to infer the motion of the image. A major obstacle in
motion estimation is that the reliability of motion estimates depends on image content.
For example, it is easy to detect the motion of a single dot in an image, but it is much harder
to detect the motion of a white piece of paper on a white background. Magarey developed
a method for incorporating the varying degrees of confidence in the different estimates, but
for the purposes of this dissertation we highlight just a couple of the conclusions.
“In tests on synthetic sequences the optimised CDWT-based algorithm showed

superior accuracy under simple perturbations such as additive noise and inten-
sity scaling between frames.”
“In addition, the efficiency of the CDWT structure minimises the usual dis-
advantage of phase-based schemes– their computational complexity. Detailed
analysis showed that the number of floating point operations required is com-
parable to or even less than that of standard intensity-based hierarchical algo-
rithms.”
Although not included in this dissertation, we have found such phase based computation
beneficial for constructing an adaptive contour segmentation algorithm based on the DT-
CWT [34].
3.3 Classification
Efficient texture representation is important for content based retrieval of image data. The
idea is to compute a small set of texture-describing features for each image in a database
in order to allow a search of the database for images containing a certain texture. The
DT-CWT has been found by a number of authors to be useful for classification [47, 43, 33].
Each uses the DT-CWT in different ways to compute texture features for an entire image:
3.4. DENOISING 47
1. de Rivaz and Kingsbury [33] compute features given by the logarithm of the energy
in each subband.
2. Hill, Bull, and Canagarajah [47] compute the energies of the subbands at each scale.
However, in order to produce rotationally invariant texture features, they use features
based on either the Fourier transform or the auto-correlation of the 6 energies at each
scale.
3. Hatipoglu, Mitra, and Kingsbury [43] use features of the mean and standard devia-
tions of complex wavelet subbands. However, instead of using the DT-CWT based on
a fixed tree structure, they use an adaptive decomposition that continues to decom-
pose subbands with energy greater than a given threshold. The aim is to adapt the
transform to have the greatest frequency resolution where there is greatest energy.
All authors report significant improvements in classification performance compared to a

standard real wavelet transform. The different authors used different databases so the
following results are not directly comparable:
1. de Rivaz and Kingsbury report an improvement from 58.8% for the DWT to 63.5%
for the DT-CWT on a database of 100 images [33].
2. Hill, Bull, and Canagarajah report an improvement from 87.35% for the DWT to
93.75% for the DT-CWT on a database of 16 images [47].
3. Hatipoglu, Mitra, and Kingsbury report an improvement from 69.64% for a real
wavelet (with an adaptive decomposition) to 79.73% for the DT-CWT (with the
adaptive decomposition) on a database of 116 images [43].
3.4 Denoising
In many signal or image processing applications the input data is corrupted by some noise
which we would like to remove or at least reduce.
Wavelet denoising techniques work by adjusting the wavelet coefficients of the signal in
such a way that the noise is reduced while the signal is preserved. There are many different
methods for adjusting the coefficients but the basic principle is to keep large coefficients
while reducing small coefficients. This adjustion is known as thresholding the coefficients.
One rationale for this approach is that often real signals can be represented by a few
large wavelet coefficients, while (for standard orthogonal wavelet transforms) white noise
signals are represented by white noise of the same variance in the wavelet coefficients.
Therefore the reconstruction of the signal from just the large coefficients will tend to
contain most of the signal energy but little of the noise energy.
An alternative rationale comes from considering the signal as being piecewise stationary.
For each piece the optimum denoising method is a Wiener filter whose frequency response
depends on the local power spectrum of the signal. Where the signal power is high, we keep
most of the power; where the signal power is low, we attenuate the signal. The size of each
wavelet coefficient can be interpreted as an estimate of the power in some time-frequency
bin and so again we decide to keep the large coefficients and set the small ones to zero in
order to approximate adaptive Wiener filtering.
The first wavelet transform proposed for denoising was the standard orthogonal trans-
form [39]. However, orthogonal wavelet transforms (DWT) produce results that substan-
tially vary even for small translations in the input [63] and so a second transform was
proposed, the nondecimated wavelet transform (NDWT) [82], which produced shift invari-
ant results by effectively averaging the results of a DWT-based method over all possible
positions for the origin [69, 26]. Experiments on test signals show that the NDWT is
superior to the DWT. The main disadvantage of the NDWT is that even an efficient im-
plementation takes longer to compute than the DWT, by a factor of the three times the
number of levels used in the decomposition.
Kingsbury has proposed the use of the DT-CWT for denoising [62] because this trans-
form not only reduces the amount of shift-variance but also may achieve better compaction
of signal energy due to its increased directionality. In other words, at a given scale an object
edge in an image may produce significant energy in 1 of the 3 standard wavelet subbands,
but only 1 of the 6 complex wavelet subbands.
The method is to attenuate the complex coefficients depending on their magnitude. As
for the standard techniques, large coefficients are kept while smaller ones are reduced. It
was found that this method produces similar results to the nondecimated wavelet method
while being much faster to compute.
Figure 3.1 shows the results when using a simple soft denoising gain rule. White noise
was added to a test image and the denoised rms error was measured for the different
techniques.
3.4. DENOISING 49
Method RMS error SNR improvement

No denoising 26.3 0dB
DWT 14.47 5.19dB
NDWT 12.66 6.35dB
DT-CWT 12.54 6.43dB
Figure 3.1: Top left: Original image. Top right: Noisy image. Bottom left: DWT results.
Bottom middle: DT-CWT results. Bottom right: NDWT results.
In this case we can see that the DT-CWT has a slightly better SNR than the NDWT
method but that the difference is not visually noticeable.
There are often significant correlations between the wavelet coefficients in the trans-
forms of real images. In particular, it is found that large coefficient values cascade along
the branches of the wavelet tree. This property is known as persistence [104]. A model
known as the Hidden Markov Tree (HMT) proposed by Crouse, Nowak, and Baraniuk
attempts to capture the key features of the joint statistics of the wavelet coefficients (in-
cluding persistence) [28]. This is achieved by means of hidden state variables that describe
the likely characteristics of the wavelet coefficients (e.g. whether they are likely to be
large or small). A Markov model is used to describe the relative probability of transitions
between different states along a branch of the wavelet tree (moving from coarse to fine
scales). The initial model was based on a decimated wavelet transform (DWT). Romberg,
Choi, and Baraniuk proposed a shift-invariant denoising model based on the HMT using
the nondecimated wavelet transform (NDWT) [104]. In experiments this HMT shift in-
variant denoising was shown to outperform a variety of other approaches including Wiener
filtering and the shift invariant hard thresholding method mentioned above [24]. Choi et
al have also tested the use of dual tree complex wavelets within the HMT framework [24].
The results were found to be consistently better than even the shift invariant HMT. Table
3.2 shows their published results for the HMT combined with either the DWT, NDWT,
or DT-CWT. Three test images (Boats, Lena, and Bridge) were used with two choices of
noise variance (σ = 10 or σ = 25.5). The table displays the peak signal to noise ratios
(PSNRs) of the denoised images. Romberg and Choi have made the source code for the
HMT available on the internet [103]. Combining this code with the DT-CWT we have
attempted to replicate the denoising results. The table also contains the results of our
experiments (shown in bold). Looking at the σ = 25.5 results we see that our results are
Image Boats Lena Bridge

σ 10 25.5 10 25.5 10 25.5
Noisy 28.1 20.2 28.1 20.0 28.2 20.2
HMT/DWT 33.5 28.6 33.9 29.3 29.7 25.4
HMT/NDWT 33.8 28.6 34.6 29.8 29.8 24.7
HMT/DT-CWT 34.4 29.3 34.9 29.9 30.8 25.8
New HMT/DT-CWT 34.4 29.7 34.8 30.7 28.6 24.7
Figure 3.2: PSNRs in dB of images denoised with the HMT acting on different wavelet
transforms. The results in normal type are published in the literature [24] while the results
in bold come from our replication of the same experiment.
similar for the boats image, better for the Lena image, and worse for the bridge image.
Differences are partly caused by different noise realisations, but repeating the experiments
produces very similar SNR levels. The main reason for the difference is that we are using
a slightly different version of the dual tree.
The main point to notice is that in this case the shift invariance given by the use of the
NDWT tends to give a small improvement in results, while the DT-CWT tends to give an
3.5. CONCLUSIONS 51
even larger improvement.
3.5 Conclusions
Treating the coefficients as complex numbers has the result of approximately decoupling
the shift dependent information from the shift invariant information. The phase gives a
measure of local translations and therefore permits phase-based motion estimation. The
magnitude is more robust to translations and hence applications like classification and
denoising (that should be shift invariant) process just the magnitude.
For simple thresholding denoising we see that the DT-CWT achieves a similar perfor-
mance to methods based on the NDWT. However, for a more complicated HMT denoising
the DT-CWT is significantly better than even the NDWT equivalent.
Chapter 4
Complex wavelet texture features
4.1 Summary
The purpose of this chapter is to describe how to use the DT-CWT to generate texture
features and to begin to explore the properties of these features. We describe texture by
computing features from the energy of the DT-CWT wavelet subbands.
One powerful method for evaluating a choice of texture model is to estimate the model
parameters for a given image and then attempt to resynthesize a similar texture based
on these parameters. This chapter reviews two texture synthesis methods based on filter
banks. This is of interest because it suggests how complex wavelet texture synthesis might
be done and because it illustrates the problems encountered when filter banks do not have
the perfect reconstruction properties of wavelets.
We adapt the first of these methods to demonstrate the difference between using real
or complex wavelets. Section 4.7 describes the texture synthesis algorithm and section 4.8
contains discussion of the experimental results. We find that the complex wavelets are
much better at representing diagonally orientated textures.
The main original contributions of this chapter are the synthesis results that give an
indication of when the DT-CWT texture features will be appropriate.
4.2 Introduction
There are many techniques for texture synthesis. For example, there are methods based
on auto-regressive filters or autocorrelation and histogram [10, 22]. Gradient algorithms
53
54 CHAPTER 4. COMPLEX WAVELET TEXTURE FEATURES
are used to simultaneously impose the autocorrelation function and the histogram. Proper
Bayesian inference usually requires extensive computation and is consequently extremely
slow but has been done, usually by means of Markov Random Fields (often at a variety
of scales) [91, 134]. Other techniques include models based on reaction-diffusion [118, 129]
frequency domain [72] or fractal techniques [40, 72]. A trivial technique which is practically
very useful is to simply copy the texture from a source image (used extensively in computer
games). This kind of approach can be powerfully extended by a stochastic selection of
appropriate sources [16].
Sections 4.3 and 4.4 describe methods that have been used to synthesize textures based
on the output of a filter bank. We will use (in the second half of the chapter) a technique
very similar to the one described in section 4.3, while the second method is of interest
mainly to show the difficulties posed by lack of perfect reconstruction in the filters.
While texture synthesis by itself is an important topic we also have an indirect moti-
vation for studying these methods. Our main goal in this chapter is to use the DT-CWT
to produce useful texture features. The indirect motivation is that texture synthesis pro-
vides an interesting way to demonstrate visually the relative advantages of different sets
of texture features.
4.3 Pyramid-based texture analysis/synthesis

Heeger and Bergen [45] describe an automatic method for synthesizing images with a
similar texture to a given example. They assume that textures are difficult to discrimi-
nate when they produce a similar distribution of responses in a bank of orientation and
spatial-frequency selective linear filters. Their method synthesizes textures by matching
the histograms of filter outputs. The method makes use of the steerable pyramid transform
[107] that was described in section 2.4.3.
4.3.1 Method
The method starts with an image x(0) of the desired size that is filled with white Gaussian
noise of variance 1 (the histogram matching step will mean that the results will be the
same whatever the choice of initial mean and variance). The steps for iteration n are as
follows:
4.3. PYRAMID-BASED TEXTURE ANALYSIS/SYNTHESIS 55
1. First the histogram is matched to the input texture. More precisely, we generate a
new image y(n) by applying a monotonically increasing transform to x(n−1) such that
the histogram of y(n) matches the histogram of the input texture.
2. Make pyramids from both y(n) and the input texture.
3. Alter the pyramid representation of y(n) in order that the histogram of each subband
matches the histogram of the corresponding subband in the pyramid transform of
the input texture.
4. Invert the pyramid transform to generate x(n) , the next image in the sequence.
In order to get both the pixel and pyramid histograms to match, these steps are iterated K
times. As the filters are not perfect, iterating too many times introduces artefacts due to
reconstruction error [45]. Stopping after about K = 5 iterations is suggested. The output
of the algorithm is the last image x(K) .
4.3.2 Results and applications
The algorithm is effective on “stochastic” textures (like granite or sand) but does not work
well on “deterministic” textures (like a tile floor or a brick wall). Igehy and Pereira [51]
describe an application of this algorithm to replacing part of an image with a synthesized
texture (this might be done to remove stains or scratches or other unsightly objects from
an image).
Their algorithm extends the original with the goal that the synthesized texture be-
comes an image which is a combination of an original image and a synthetic texture; the
combination is controlled by a mask. The algorithm remains the same as before, except
that at each iteration the original image is composited back into the synthesized texture
according to the mask using a multi-resolution compositing technique that avoids blurring
and aliasing [18].
The conclusion is that this texture synthesis can be useful in a variety of images which
need the replacement of large areas with stochastic textures but that the technique is
inappropriate for images that need the replacement of areas with structured texture[45].
4.4 Gabor based texture synthesis

Navarro and Portilla [83] propose a synthesis method based on sampling both the power
spectrum and the histogram of a textured image. The spectrum is sampled by measuring
the energy and equivalent bandwidths of 16 Gabor channels, plus a low pass residual (LPR).
The synthesis process consists of mixing 16 Gabor filtered independent noise signals (whose
energy and bandwidths are chosen to match the measured values) into a single image, whose
LPR power spectrum and histogram are also modified to match the original features.
4.4.1 Filters
The human visual system (HVS) is imitated by using a set of 4 ∗ 4 filters (four scales, four
orientations). Each filter is a separable Gabor function of the form

g(x, y)f,θ,α = exp[−πa2 x2 + y 2 ] exp[j2πf (x cos θ + y sin θ)]. (4.1)
where θ specifies the desired orientation, f gives the radial frequency and a is a weighting
factor that makes the function decay to zero as you move away from the origin. Only the
real part of this function is actually used.
The filtering is applied to shrunk versions of the input image. This means that only one
filter needs to be defined for each orientation. After filtering for the four highest frequency
channels the image is filtered with a low-pass filter and down-sampling by a factor of two
in both directions. Then the same procedure is repeated for each scale.
Demodulation is applied to the Gabor channels after filtering thus enabling a reduction
in the number of samples by a factor of two in each dimension. However, the resulting
channels become complex and so the effective compression ratio is 2. This means that
overall a four scale decomposition of a N ∗ N image produces:
1. 4 complex subimages of size (N/2) ∗ (N/2)

4.4. GABOR BASED TEXTURE SYNTHESIS 57
Therefore we have gone from N 2 numbers to 4 ∗ 2 ∗ N 2 (1/4 + 1/16 + 1/64 + 1/256) ≈

2.66N 2 coefficients. The lowest frequencies are extracted separately using the Discrete
Time Fourier Transform (DTFT).
4.4.2 Extracted features

The energy in each Gabor channel is computed to give information about the main direc-
tions in the texture.
Navarro et al report that the degree of spectral spreading of the spectrum at different
spectral locations is an essential feature to characterize the texture. Therefore for each
channel they compute the equivalent bandwidths along the u and v frequency axes. To
calculate the equivalent bandwidths they first calculate the 2-D power spectrum using the
DTFT and then convert this into a pair of 1-D normalized power spectra. This conversion
is achieved by integrating the power spectrum along the two frequency axes, and dividing
the results by their respective maxima. (Both integrations act on the original 2-D spectrum
and produce a 1-D spectrum.) The equivalent bandwidths are given by the areas of these
1-D spectra.
The histogram of the original image is also recorded with a resolution of 16 levels. It
is calculated by low-pass filtering and subsampling the full histogram.
Finally five parameters are extracted from the low frequency DTFT coefficients of the
original image. One is the DC component, two are the averages of the modulus of the
DTFT along the two frequency axes for low frequencies, and the final two are averages for
low oblique frequencies. These parameters are not particularly crucial but are reported to
be necessary for a complete visual description of many real textured images.
4.4.3 Method
The synthesis procedure consists of seven stages.
1. Noise Generation.
2. Gaussian filtering of the noise signals.
3. Weighting of the filtered noise signals.
4. Modulation of the weighted filtered noise signals to produce the synthetic channels.
5. Merging the synthetic channels.
6. Equalization of the LPR frequencies.
7. Adjustment of the histogram.
Noise generation
Sixteen independent signals of complex white Gaussian noise are generated.
Gaussian filtering
The noise signals are then convolved with separable elliptical Gaussian masks to provide
them with an elliptical Gaussian spectral shape. The filter function for the (p,q) synthetic
channel is:
gspq (x, y) = bupq bvpq exp[−π(b2upq x2 + b2vpq y 2)] (4.2)
The factors bupq and bvpq are chosen using an approximate formula so that when the Gabor
filtering scheme is applied to the synthetic texture, the resulting equivalent bandwidths of
its Gabor channels have equal values to those measured in the input image. The exact
computation is hard due to the overlapping between channels but it is reported that the
approximate scheme does not significantly affect the visual quality of the results.
Weighting
It is desired to weight the signals so that when the Gabor filters are applied to the synthe-
sized texture, the resulting energies will be equal to those measured in the original image.
This task is made hard due to the overlap between channels. Since the synthetic channels
are statistically independent the energy of a sum is equal to the sum of the energies and so
it is possible to calculate a matrix which describes the effect when the channels are added
together. When the matrix is multiplied by a vector of energies in the synthetic channels
the resulting vector contains the energies that would be observed using the Gabor filtering.
The inverse of this matrix can be precomputed and the appropriate weights are given by
multiplying this inverse matrix by the vector of the measured energies.
4.4. GABOR BASED TEXTURE SYNTHESIS 59
Modulation
The signals are then expanded by a factor of two in both spatial dimensions and modulated
by the appropriate central frequency.
Merging
A pyramid structure is then used to combine the synthetic channels. This means summing
the four lowest resolution synthetic channels, expanding the image by a factor of two in
both spatial dimensions, and then adding the result to the four synthetic channels of the
next resolution, and so on, until the highest frequency channels are added. The expansion
is performed by upsampling followed by a lowpass filter.
Equalization
The equalization is done in the frequency domain. First, the five average values of the LPR
frequency moduli obtained at the feature extraction stage are decompressed, by merely
replicating them in their respective spectral areas. The resulting square of the spectral
moduli is imposed on the lowest frequencies of the synthetic mix obtained before, keeping
the phase unchanged.
Histogram matching
The compressed original histogram is decompressed to its former size by expanding and
low-pass filtering it. Then a standard histogram matching algorithm is used to modify the
histogram of the synthesized texture to match the decompressed version of the original
one.
4.4.4 Results
The method described above achieves a good match in the histogram, LPR channel mod-
ulus and channel energies. The bandwidths of the channels are not so closely matched but
it is reported that the consequences of these inaccuracies are not significant compared to
the inaccuracies due to the limitations of the texture model.
Errors occur because the frequency content of each channel is always shifted to the
central location. This means that a texture with a well-defined orientation in the original
image will only be well reproduced if the orientation is one of the four orientations of the
Gabor filters.
4.5 Discussion
We have described two texture synthesis algorithms for producing synthetic textures with
certain feature values. The Gabor texture synthesis attempts to model the interaction
between signals inserted in different subbands. This method is fast and can be generalised
for alternative filters but only approximately achieves the goal of producing matching
feature values. The pyramid-based texture synthesis method achieves a better match for
the feature values but is iterative and only works for transforms that can be inverted, or
at least have a reasonable approximate inversion.
4.6 Texture features

We now propose a simple set of texture features based on the DT-CWT and then test
these features by synthesis experiments.
Let wk,b,s be the k th wavelet coefficient in subband b at scale s. We define a texture
feature fb,s for each subband as

fb,s = |wk,b,s|2 .
k
In other words, the wavelet texture features are given by the energies of the subbands.
We will also compare the results when we augment our feature set with the values of a
histogram of the original image. The histogram will be calculated using 256 equally spaced
bins across the range of intensity values in the image.
4.7 Algorithm
The structure of the algorithm is very similar to the method invented by Heeger and Bergen
that was explained in section 4.3. The principle differences are that we use the invertible
complex wavelet transform and that in the matching stage we match the energy of the
subbands rather than their histograms.
4.7. ALGORITHM 61
We compare texture synthesis using just the wavelet texture features (called energy
synthesis) with texture synthesis using the augmented feature set (called histogram/energy
synthesis). First we describe the algorithm for histogram/energy synthesis.
The input to the algorithm is an example texture. The texture features are measured
for this texture and then a new texture is synthesized from these features.
The synthesis starts with an image x(0) of the desired size that contains white Gaussian
noise of mean 0 and variance 1. The steps for iteration n are
1. Match the histogram of x(n−1) to the input texture. In other words, generate a new
image y(n) by applying a monotonically increasing function to x(n−1) in order that
the histogram of the new image matches the histogram of the input texture.
2. Use the complex wavelet transform to generate a multiresolution decomposition for

both y(n) (decomposition A) and the example texture image (decomposition B).
3. Scale the contents of each noise subband (those in decomposition A) so that the
resulting energy is equal to the corresponding energy for the texture subbands (those
in decomposition B). If the original energy is EA and the desired energy is EB then

the correct scaling factor is EB /EA .
4. Invert the complex wavelet transform of decomposition A to produce the next ap-
proximation x(n) to the synthesized texture.
These steps are then iterated K times where K is a positive integer. The output of the
method is the image x(K) .
The algorithm for energy synthesis is identical except that in step 1 y(n) is a direct
copy of x(n−1) .
Histogram matching is a relatively quick operation as it can be computed by means of
two lookup tables. The first lookup table is computed from the cumulative histogram of the
noise image and effectively gives a transform from pixel intensity to rank order. The second
lookup table is computed once for all iterations and is the inverse to the intensity to rank
transform for the example texture. Once the lookup tables have been constructed, each
pixel in the noise image is transformed once by the first lookup table to get its approximate
rank, and then by the second lookup table to discover the intensity in the example image
that should have that rank.
4.8 Results
This section contains a selection of textures synthesized by the algorithm. All the textures
other than the ones in figure 4.4 are taken from the Brodatz set and are all 128 by 128
pixels. 5 level transforms and K = 3 iterations are used in the algorithms.
Original sand texture Hist/Energy method
Original grass texture Hist/Energy method
Original water texture Hist/Energy method
Figure 4.1: Results of using histogram/energy synthesis
First some good results are shown in figure 4.1. The original textures are on the left and
the synthesized textures are on the right. The images are of grass, sand and water. The
method used was to match histograms in the image domain and energy in the transform
domain (the histogram/energy method). These images seem fairly well synthesized.
The same experiment was repeated for energy synthesis and the results are shown in
figure 4.2. The results appear just as good as for histogram/energy synthesis.
4.8. RESULTS 63
Original sand texture Synthesized sand texture
Original grass texture Synthesized grass texture
Original water texture Synthesized water texture
Figure 4.2: Results of using energy synthesis

Figure 4.3 shows an example where the results are not as good. The original texture
Original Texture Synthesized Texture
Figure 4.3: Results of using histogram/energy synthesis on a wood grain texture.
has a strong vertical orientation. Although the texture synthesized with histogram/energy
synthesis does seem to be biased towards the vertical, it looks very different because the
strong orientation has been lost.
Figure 4.4 shows an example of a texture where no performance is entirely satisfactory.
The texture consists of many diagonal blobs of the same intensity. We tested energy syn-
thesis, histogram/energy synthesis, and a version of histogram/energy synthesis based on
the DWT. Energy synthesis is a very bad choice for this texture as it makes the synthesized
image much less piece-wise smooth than the original and this difference is easily perceived.
Histogram/energy synthesis partly captures the diagonal stripes in the texture but the
variation in direction and size of the stripes again give rise to a quite noticeable difference.
However, for this texture the DT-CWT possesses a clear superiority over a comparable
algorithm based on a normal Discrete Wavelet Transform (DWT) as the DWT features
cannot discriminate between energy near +45◦ from energy near −45◦ .
Figure 4.5 demonstrates the good convergence of the histogram/energy synthesis al-
gorithm. The energy in each subband was measured just before the subband rescaling.
These measured energies are plotted against iteration number with one plot for each of the
subbands. The target texture in this case was the the water texture (bottom left plot of
figure 4.1). There is also a horizontal line in each plot corresponding to the target energy
value for the corresponding subband. For every subband the energies rapidly converge to
the target values. For energy synthesis the convergence is even more rapid.
4.9. CONCLUSIONS 65
Original Texture Histogram/Energy synthesis
Energy synthesis DWT Histogram/Energy synthesis
Figure 4.4: Results of using different methods on a strongly diagonal texture
4.9 Conclusions
The methods are not able to adequately synthesize images with either strong directional
components or with a regular pattern of elements but for textures without such problems
there is a clear order of performance; the DWT synthesis is worst, then energy synthe-
sis gives reasonable performance, and histogram/energy synthesis is best. However, the
differences are only noticeable in certain cases. The improvement of the DT-CWT rela-
tive to the DWT is seen when the texture has diagonal components. The DT-CWT can
separate features near 45◦ from those near −45◦ while the DWT combines such features.
The improvement of histogram/energy synthesis is seen when the histogram of the original
texture has strong peaks. This occurs if the image contains regions where the intensity is
constant.
Subband 1, Level 1 Subband 1, Level 2 Subband 1, Level 3 Subband 1, Level 4

5000 5000 5000 5000
0 Subband 2, Level 1 0 Subband 2, Level 2 0 Subband 2, Level 3 0 Subband 2, Level 4
5000 5000 5000 5000
5000 5000 5000 5000
5000 5000 5000 5000
5000 5000 5000 5000
5000 5000 5000 5000
0 0 0 0
Figure 4.5: Energy before rescaling for different subbands during the histogram/energy
synthesis algorithm. Horizontal lines represent the target energy values.
Chapter 5
Texture segmentation
5.1 Summary
The purpose of this chapter is to explore the performance of the DT-CWT features for non-
Bayesian texture segmentation. In a recent comparison of features for texture classification
[101] no clear winner was found, but the fast methods tended to give worse performance
and it was concluded that it was important to search for fast effective features.
The original scheme used a statistical classifier that required significant quantities of
reliable training data. This is unreasonable in practice and we propose an alternative
simpler method. This simple method gives poor experimental performance for the DWT
but is reasonable for the NDWT and when used with the DT-CWT the results are better
than any one of the schemes used in the original classification. We explain the reason for
the power of the DT-CWT in terms of its directionality and aliasing properties.
Finally we show how simple multiscale techniques yield a fast algorithm based on the
DT-CWT with even better results.
The original contribution of this chapter is principally the experimental comparison of

DT-CWT features with other features. The algorithms in this chapter are simple extensions
to existing methods. The chapter is of interest because it demonstrates that complex
wavelets are easy to use in existing algorithms and because the experimental results suggest
that the DT-CWT provides a powerful feature set for segmentation.
67
68 CHAPTER 5. TEXTURE SEGMENTATION
5.2 Introduction
Texture segmentation has been studied intensively and many features have been proposed.
Recently Randen and Husøy [101] performed an extensive comparison of different texture
feature sets within a common framework. Their study concluded that there was no clear
winner among the features (although some feature sets, decimated wavelet coefficients for
example, were clear losers) and that the best choice depended on the application. They
also claimed that computational complexity was one of the biggest problems with the more
successful approaches and that therefore research into efficient and powerful classification
would be very useful.
The comparison was performed on a supervised texture segmentation task. The input
to the algorithm is a set of training images each containing a single texture, and a mosaic
made up from sections of these textures. The problem is to assign each pixel in the
mosaic to the correct texture class. Randen and Husøy note that it is important to have
disjoint test and training data. This means that the textures present in the mosaic are
new examples of the same texture as the training data, rather than being direct copies of
a portion of the training texture.
This task is called supervised segmentation because of the availability of examples from
each texture class. Unsupervised segmentation attempts to solve the same problem without
these examples. An example of this is work by Kam and Fitzgerald who have produced
results of using DT-CWT features for unsupervised segmentation [59]. Care must be
taken when comparing results of supervised and unsupervised techniques because they
are actually attempting slightly different tasks. A supervised segmentation is considered
perfect when every pixel is classified into the right class while an unsupervised segmentation
is considered perfect when the boundaries are in the correct positions. This may seem very
similar but suppose that one class contains very bright images, while the other contains
very dark images. Now consider a test image whose every pixel intensity is an identical
medium brightness. The supervised method may have a 50% chance of classifying the
pixels correctly while the unsupervised method will classify every pixel into the same class
and hence be deemed to achieve a 100% accuracy.
First the original method used in the comparison is described in section 5.3. Section
5.4 describes how we simplify the training stage of the algorithm to give a more practically
useful scheme. Section 5.5 explains in detail the method tested. Section 5.6 contains the
5.3. ORIGINAL CLASSIFICATION METHOD 69
results of the proposed method and also the published results using the more advanced
training scheme. Sections 5.7 and 5.8 discuss the reasons for the relative performance.
The comparison used pixel by pixel classification in an attempt to isolate the effect of the
features. However, alternative classification schemes such as multiscale classification can
give faster and better results. In section 5.9 we propose and test a multiscale segmentation
algorithm in order to show that the benefits of the DT-CWT features are retained for these
more powerful techniques.
5.3 Original Classification Method

This section describes the method used in the comparison paper[101] and brief details of
the features tested. The main steps in the method are:
1. Filter the input image.
2. Square the filter outputs.
3. Smooth the resulting features.
4. Take the logarithm of the smoothed features.
5. Classify each pixel to the closest class.
The features were smoothed in step 3 using a Gaussian lowpass filter with σs = 8. We will
discuss the effect of this choice later in section 5.8.
The classification method used was “Type One Learning Vector Quantization” (LVQ)
[66]. This scheme results in a classifier that chooses the closest class where the distance is
defined as the standard Euclidean metric to the class centre. This scheme requires time
consuming training in order to select class centres that give good classification results on
the training data.
The principal difference between the compared methods was in the choice of the filters
used in the first step. Different filters result in different feature sets. Amongst the filters
examined were:
1. Laws filter masks.
2. Ring and Wedge filters.

3. Dyadic Gabor filter bank.
4. Wavelet transform, packets, and frames based on the Daubechies family of wavelets.
5. Discrete Cosine Transform.
6. Quadrature Mirror Filters.
7. Non-dyadic Gabor filter bank.
8. Eigenfilters derived from autocorrelation functions.
9. Prediction error filter.
10. Optimized representation Gabor filter bank.
11. Optimized FIR filter bank
Randen and Husøy also tried a few non-filtering approaches including:
1. Training neural networks using back propagation and median filtering the resulting
classification.
2. Statistical features (angular second moment, contrast, correlation, and entropy).
3. AR model-based features.
5.4 Training simplification

The LVQ training method used by Randen and Husøy results in a simple classification
scheme but makes use of the training data to improve the classification accuracy. The
training data is repeatedly classified and the parts incorrectly classified are used to update
the class centres to give better results. This is a very slow procedure and experiments
suggested that it gave only slightly improved results.
In the experiments we have instead used a simpler system that simply sets the class
centres to be the average over the training data of the feature vectors. This is clearly very
quick to compute and only a small amount of training data will be needed to obtain a
reasonable estimate.
5.5. DETAILED DESCRIPTION OF METHOD 71
The first goal of this chapter is to demonstrate the superiority of the DT-CWT features
and so we are allowed to alter the comparison technique only if, as is the case here, the
change will not make the results better.
5.5 Detailed description of method

We describe the method for the NDWT and then explain the necessary modifications for
the DWT and the DT-CWT. We first describe the generation of the feature vector and
then the classification scheme. Suppose that we wish to calculate the feature vectors for an
image of size M × N. We use X(x, y) to represent the image. For x ∈ {0, . . . , M − 1}, y ∈
{0, . . . , N − 1} X(x, y) is defined as the value of the pixel at position x, y in the image.
For all other values of x and y we define X(x, y) = 0 (this is called zero padding).
Let Ws (x, y) for s = 1, 2, . . . , S be the subbands produced by the NDWT acting on
X(x, y). As the transform is nondecimated each of these subbands will be the same size
as the original image. We define a smoothing filter h(x, y) for x ∈ {−K, . . . , K}, y ∈
{−K, . . . , K} as
2
1 x + y2
h(x, y) = exp − (5.1)
2πσs2 2σs2
where σs controls the amount of smoothing and K sets the point at which we truncate
the filter. As mentioned before a value of σs = 8 was used by Randen and Husøy and we
choose the same value. We use a value of K = 24.
Smoothed subbands Ws (x, y) are produced by convolving the rectified subbands with
this smoothing filter.

K
K
Ws (x, y) = h(u, v)|Ws(x − u, y − u)|2 (5.2)
u=−K v=−K
Finally a feature vector f(x, y) ∈ R S is defined for x ∈ {0, . . . , M − 1}, y ∈ {0, . . . , N −

1} as
fs (x, y) = log(Ws (x, y)) (5.3)
Now suppose that there are C classes and let f (c) (x, y) be the feature vectors calculated
from the training image for class c. The class centres µc are defined as
1 (c)
M −1 N −1
µc = f (x, y) (5.4)
NM x=0 y=0
The classification scheme has the following steps
1. Calculate the feature vectors f (x, y) for the test image.
2. For x ∈ {0, . . . , M − 1}, y ∈ {0, . . . , N − 1}, c ∈ {1, . . . , C} calculate the class

distances dc (x, y):
dc (x, y) =
f (x, y) − µc
2 (5.5)
3. Classify each pixel at position x, y as belonging to closest class r(x, y)
r(x, y) = argminc∈{1,... ,C} dc (x, y) (5.6)
For decimated transforms the subbands are reduced in size. The subbands at level k are
only of size M/2k ×N/2k . In order to apply the same method we first expand the subbands
until they are of size M × N. Let Ps (x, y) be a subband at level k of size M/2k × N/2k .
We define the expanded subband Ws (x, y) by
x y
Ws (x, y) = Ps ( , k ) (5.7)
2k 2
where z represents the largest integer not greater than z. The rest of the method is
identical. Note that this expansion of the DWT is not equivalent to the NDWT. The
expanded DWT subbands at level k are piecewise constant on squares of size 2k × 2k while
the NDWT subbands have no such restriction. This may seem a strange way of expanding
the subbands - usually some form of interpolation such as lowpass filtering is performed
during expansion operations. Two possible justifications for this approach are:
1. This expansion means that the value of Ws (x, y) is equal to the value of the wavelet
coefficient (of subband k) that is closest to the location x, y.
2. The feature values will be smoothed in subsequent processing.
These are not very compelling reasons. In fact, it is quite likely that an alternative expan-
sion will improve the performance. The main reason we use this very simple expansion is
in order to provide the fairest comparison with the original experiments. The results of
our experiments indicate that even this crude expansion allows the DT-CWT to perform
better than the alternative features. A better interpolation might improve the performance
5.6. EXPERIMENTS 73
of the DT-CWT method but could also provoke the criticism that the performance gain is
caused merely by the additional smoothing rather than the choice of wavelet transform.
We use 4 levels of decomposition of the biorthogonal (6,8) filters in the DWT and
NDWT [30]. The lowpass filter has 17 taps, and the highpass filter has 11 taps. Randen
and Husøy compared other wavelet types but found that the main difference was between
decimated and nondecimated wavelets rather than the filters used.
5.6 Experiments
The classification was tested on the same twelve test images and sets of training data as
used in the original comparison [101]. Figure 5.1 shows the different mosaics. We tested
features generated from the DWT, NDWT, and DT-CWT. The error in an image is defined
as the proportion of incorrectly classified pixels. The error therefore varies between 0 for a
perfect classification and 1 for a completely wrong classification. For a C class experiment,
random classification would get 1 in C pixels correct and would have an expected error of
1 − 1/C.
Table 5.2 contains the results of using the different features for the different mosaics.
These results are plotted in figure 5.3. The figure contains 4 bar chart plots comparing
the different feature sets. Each plot is dedicated to a mosaic with a particular number of
textures. Within each plot there is one cluster of bars for each of the mosaics. In each
cluster of bars the left bar shows the error for the DWT features, the centre bar shows the
error for the NDWT features, and the right bar shows the error for the DT-CWT features.
All errors are plotted as percentages.
Inspecting the results in figure 5.3 reveals that for almost all experiments the DT-CWT
does better than the NDWT, and the NDWT does better than the DWT.
In the published comparison there was no clear winner, different texture features per-
formed best on different textures. The classification errors for every mosaic and every
feature set were published and we summarise this information in two ways.
The first measure we extract is the average performance for each feature set averaged
over all mosaics. This measure is called the mean error rate. The problem with this
approach is that the average may be dominated by the performance on the mosaics with
a large number of classes (as these will have the largest errors).
The second measure is designed to be fairer but is slightly more complicated. For each
j k l
a b c
d e
f g
h i
Figure 5.1: Mosaics tested

5.6. EXPERIMENTS 75
Mosaic DWT error *100% NDWT error *100% DT-CWT error *100%
a 12.3 11.5 10.9
b 28.5 31 21.8
c 28.9 26.6 16.2
d 25.3 22 16.6
e 24.1 20.4 17.3
f 47.7 39 33.8
g 50.7 44.3 40.4
h 43.2 38.8 19.3
i 35.6 30.9 28.6
j 12.4 12.1 0.6
k 1.7 0.9 1.1
l 10.9 9.2 9.3
Figure 5.2: Comparison of segmentation results for different transforms
image we rank the different methods according to their performance (with a rank of 1 given
to the best), and then we average these ranks over all 12 mosaics. In addition to the three
new methods described above, the following methods from the published comparison are
used in the ranking:
1. Laws filters
2. Ring/wedge filters
3. Dyadic Gabor filter bank
4. Gabor filter bank
5. DCT
6. Critically decimated Daubechies 4 wavelet
7. Undecimated Daubechies 4 wavelet
8. Undecimated 16 tap quadrature mirror filters (QMF)
9. Co-occurrence features
Two textures Five textures

DWT
40 NDWT 40
DT−CWT
20 20
0 0
j k l a b c d e
Ten textures Sixteen textures
40 40
20 20
0 0
h i f g
Figure 5.3: Percentage errors for (DWT,NDWT,DT-CWT)
5.7. DISCUSSION OF RELATIVE PERFORMANCE 77
10. Autoregressive features
11. Eigenfilter
12. Prediction error filter
13. Optimized FIR filter
14. Back propagation neural network
We have omitted some of the badly performing methods and have taken just the best of
families of methods (such as all the different choices of wavelet transform). This gives a
total of 17 methods compared in the ranking. All the published results make use of the
LVQ training algorithm.
Table 5.4 tabulates the two measures of performance1 . The results in bold are original
while the others are taken from the published study [101]. Table 5.4 shows that the
NDWT features with simple training give a mean error of 23.9% while the non-decimated
QMF filters give a mean error of 20.8%. The biorthogonal filters we use in the NDWT
are similar to the quadrature mirror filters used in the comparison and therefore it seems
that the much simpler training method results in only a small decrease in performance.
Nevertheless, the DT-CWT features with an average error of 18% outperform all the other
methods despite the simpler training. The next best features are the non-decimated QMF
filters while the worst results are given by the neural network classifier.
5.7 Discussion of relative performance

The features are calculated from the energy in the wavelet subbands. For a shift dependent
transform (the DWT for example) a translation causes the energy to be redistributed
between subbands due to aliasing. This effectively contributes an extra noise source to the
feature values and hence increases the classification error.
There are two main reasons for the improved DT-CWT performance compared to the
NDWT. The first is the increased directionality of the DT-CWT. There are 6 subbands
1
The mean error rates were used as a measure of performance in the comparison [101]. However, the
values given here do not quite agree because there was an error in the published results for one mosaic.
The values given here are based on the corrected experimental results [100].
Method mean error *100% Average rank

Laws 28.3 9.42
Ring/wedge 30.6 11.5
Dyadic Gabor filter bank 27.8 8.75
Gabor filter bank 31.5 10.7
DCT 24.9 7.25
Critically decimated Daubechies 4 35.2 13.3
Nondecimated Daubechies 4 29.7 10.3
Nondecimated 16 tap QMF 20.8 3.5
Co-occurrence 29.1 9.17
Autoregressive 26.3 7.5
Eigenfilter 26.2 8
Prediction error filter 29.3 10.5
Optimized FIR filter 26.4 8.08
Backpropagation neural network 57.3 17
DWT with simple training 26.8 9.25
NDWT with simple training 23.9 6.17
DT-CWT with simple training 18.0 2.75
Figure 5.4: Performance measure for different methods
at each scale rather than 3 and the DT-CWT is able to distinguish between features near
45◦ from those near −45◦ . It is certainly plausible that the extra features from these extra
subbands should allow better classification.
The second reason relates to the smoothing step. The NDWT highpass subbands will
contain slowly oscillating values. Rectification will convert these to a series of bumps which
are finally smoothed. For coarse scales it is possible that the lowpass filter does not have a
narrow enough bandwidth to fully smooth these bumps and so some residual rectification
noise may remain. In contrast the magnitude of the DT-CWT coefficients is expected to
be fairly steady and we would expect much less rectification noise.
In 1D it is quite easy to see the effect of the aliasing and rectification problems. Consider
the sine wave shown in figure 5.5. We have chosen its frequency so that most of its energy is
0.8
0.6
0.4
0.2
−0.2
−0.4
−0.6
−0.8
−1
80 100 120 140 160 180
Figure 5.5: Sine wave input
contained in the scale 4 highpass subband. We compute the scale 4 highpass coefficients for
both the nondecimated real wavelet transform and a nondecimated version of the complex
wavelet transform. These coefficients are shown in figure 5.6 (the imaginary part of the
complex wavelet coefficients is plotted with a dashed line). Figure 5.7 shows the rectified
values (before smoothing) that are calculated by squaring the transform coefficients. The
rectified real wavelet values show a strong oscillation of 100% of the average value while
the rectified complex wavelet values have only a small variation of about 2.5%.
The very low variation of the complex wavelet coefficients is no accident. It occurs
Real wavelet Complex wavelet

1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
−0.2 −0.2
−0.4 −0.4
−0.6 −0.6
−0.8 −0.8
−1 −1
80 100 120 140 160 180 80 100 120 140 160 180
Figure 5.6: Nondecimated scale 4 wavelet coefficients

0.7 0.7
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
80 100 120 140 160 180 80 100 120 140 160 180
Figure 5.7: Rectified nondecimated scale 4 wavelet coefficients

because the complex wavelets have been designed to differentiate between positive and
negative frequencies. A sine wave can be represented as the combination of two complex
exponentials, one representing a positive frequency component, and one representing the
negative frequency component:
exp {jωt} + exp {−jωt}
cos(ωt) = (5.8)
2
Standard filter theory says that the output y(t) of a linear filter applied to this signal will
be
A B
y(t) = exp {jωt} + exp {−jωt} (5.9)
2 2
where A is the response of the filter to frequency ω and B is the response for −ω. If the
linear filter had zero response for negative frequencies then (assuming ω > 0) the output
would be simply
A
y(t) = exp {jωt} (5.10)
2
and hence the rectified signal would be constant and equal to
|A|2
|y(t)|2 = . (5.11)
4
The low variation for the DT-CWT means we can afford to decimate the output. Figure
5.8 plots every 16th sample from the nondecimated outputs. These plots correspond to the
rectified outputs for the decimated transforms (i.e. the DWT and the DT-CWT). There
is a huge variation in the DWT rectified outputs while the DT-CWT outputs are almost
√
constant. For σs = 8 the smoothing filter has a half peak width of σs 2 2 log 2 ≈ 19. This
should be about sufficient to remove the variation for the NDWT but is clearly insufficient
for the DWT. These graphs have been plotted for scale 4 coefficients. At finer scales the
coefficients will oscillate faster and we would therefore expect less rectification noise.
We have advanced two effects (directionality and rectification noise) to explain the
relative performance. It is natural to ask about the relative significance of these effects. To
answer this question we performed two further experiments both using a cut down version
of the DT-CWT:
HalfCWT In the first experiment we halved the size of the feature vector by combining the
energy in pairs of subbands. The 15◦ , 45◦ , and 75◦ subbands were paired respectively

0.7 0.7
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
80 100 120 140 160 180 80 100 120 140 160 180
Figure 5.8: Rectified decimated scale 4 wavelet coefficients
with the −15◦ , −45◦ , and −75◦ subbands. More precisely, equation 5.3 was altered
to
fs (x, y) = log(Wa (x, y) + Wb (x, y)) (5.12)
where a and b are the two subbands that are combined to give feature s. This reduced
the transform to only distinguishing 3 directions, like real wavelet transforms.
RealCWT In the second experiment we set the imaginary part of all wavelet coefficients
to zero before rectification. This should introduce rectification noise to the features.
Note that these two modifications are intended to be harmful to the performance of the
method and such transforms should never be used for a real application. The results for
these new experiments are in table 5.9 and shown in figure 5.10 (compared to the original
DT-CWT results).
We conclude that rectification noise is not too significant because the results for the
RealCWT are similar to those for the DT-CWT. However, the results for the HalfCWT
are significantly worse, demonstrating that the main reason for the improved DT-CWT
performance is its improved directional filtering.
5.8. DISCUSSION OF DT-CWT PERFORMANCE 83
Mosaic HalfCWT error *100% RealCWT error *100%

a 11 10.6
b 25.7 22.5
c 21.2 17
d 23.1 17.5
e 20.7 17.5
f 39.9 36.3
g 45.2 41.9
h 37.8 21.9
i 33.9 30
j 5.1 1
k 0.7 1.1
l 11.6 8.8
Figure 5.9: Comparison of segmentation results for altered DT-CWT
5.8 Discussion of DT-CWT performance

Although the DT-CWT gives better results than the other methods, it is not perfect. For
example on the 16 texture mosaic “f” there is still a 34% error. The output of the DT-
CWT method for this mosaic is shown in figure 5.11. The original is shown on the left,
the classification results on the right. Each colour represents a different class. Notice that
there is usually a border of misclassified pixels around each segment. This is not surprising
because the feature values near the borders will be unreliable for two reasons:
1. Near the border the smoothing filter will be averaging rectified coefficients from both
sides of the border to produce some intermediate value.
2. Near the border the impulse response for the coarser wavelets will straddle both sides
and hence be unreliable partly because the value will average the response from both
textures and partly because there will often be discontinuities at the border giving
extra energy to the highpass coefficients.
Notice also that there are often fairly small groups of pixels assigned to some class. These
two defects in the classification are closely related to the size of the smoothing filter and
place contradictory requirements on its size. In order to resolve the position of the bound-

HalfCWT
40 RealCWT 40
DT−CWT
20 20
0 0
j k l a b c d e
40 40
20 20
0 0
h i f g
Figure 5.10: Percentage errors for (HalfCWT,RealCWT,DT-CWT)
5.8. DISCUSSION OF DT-CWT PERFORMANCE 85
Figure 5.11: Segmentation results for mosaic “f” using the DT-CWT
aries accurately the smoothing filter should be small, but in order to accurately determine
the class the smoothing filter should be large to give reliable feature estimates.
This is a well-known problem in image segmentation [127]. There are several methods
that address the issue. The basic concept is to use additional information about the nature
of segments. For example, we may not expect to see very small separate segments, or we
may know that the segments should have smooth boundaries. Different methods make
use of different types of information. Active contour models [60] are useful for encoding
information about the smoothness of boundaries while multiscale methods [128] are useful
for describing expectations about the spatial extent of segments. Methods also differ in
whether the information is explicitly contained in a Bayesian image model (such as Markov
Random Field approaches [9]) or just implicitly used [128].
We expect the advantages gained by these methods to be complementary to the advan-

tages of the DT-CWT feature set. The final sections of this chapter test this prediction
for a multiscale segmentation method. The author has also used the DT-CWT for imple-
menting a level set version of an active contour model [34]. However, in the level set paper
the emphasis is on using the DT-CWT to describe contour shape rather than the texture.
5.9 Multiscale Segmentation

We extend a previous multiscale segmentation method [128] to use the DT-CWT features.
The basic idea is to first calculate an approximate segmentation at a coarse scale, and
then to refine the estimate of the boundary location at more detailed scales. During the
refinement the classification reduces the importance given to the wavelet coefficients that
will be significantly affected by the boundary. This can be thought of as an adaptive
smoothing filter whose size depends on the estimate of boundary position. Section 5.9.1
defines the features used and 5.9.2 describes the multiscale classification procedure.
This method implicitly assumes that regions will have a reasonably large spatial extent
and are unlikely to contain small interior regions of alternative textures. These assumptions
are true for the mosaics tested and consequently the method works well. Care must be
taken when applying such methods in the real world to ensure that appropriate assumptions
are made.
This multiscale segmentation is also faster than the pixel by pixel segmentation de-
scribed above because the size of the feature set decreases for the more detailed levels.
5.9.1 Texture features for multiscale segmentation

Different texture feature sets are used during the multiscale algorithm. To define these sets
it is useful to distinguish between the wavelet subbands at different scales. Let Ws,k (x, y)
be the sth subband (s ∈ {1, 2, 3, 4, 5, 6}) at scale k ∈ {1, 2, 3, 4} (scale 1 being the most
detailed scale, scale 4 the coarsest that we shall use). For an initial image of size N × M
the complex subbands at scale k will be of size N/2k × M/2k for the DT-CWT.
We form a quad-tree from each subband denoted qs,k (x, y, l), k ≤ l, where

|Ws (x, y)|2 l=k
qs,k (x, y, l) =

1
4 a∈{0,1} b∈{0,1} qs,k (2x + a, 2y + b, l − 1) l > k
Suppose the segmentation algorithm is operating at scale L. The corresponding feature set
for position x, y is denoted f (L) (x, y). It is useful to index this scale L feature set f (L) (x, y)
with indices s ∈ {1, 2, 3, 4, 5, 6} (for subband) and k ∈ {1, 2, . . . , L} (for feature scale).
The features are defined from the quadtrees by:
(L)
fs,k (x, y) = (L + 1 − k) log (qs,k (x, y, L)) (5.13)
5.9. MULTISCALE SEGMENTATION 87
Notice the scaling factor L + 1 − k. The values in the quad-tree can be considered to be
local estimates of the average energy in the wavelet subbands. Naturally, we would expect
an estimate formed by averaging many numbers to be more accurate. The scaling factor
provides a simple way of favouring the more reliable estimates. Alternative scaling factors
may well give better results but section 5.9.3 explains why we do not try and optimise
these factors.
For each class c ∈ {1, 2, . . . . , C} the features are calculated for the corresponding
(L)
training image and used to calculate feature means µc,s,k
M/2k −1 N/2k −1
(L) 22k (L)
µc,s,k = fs,k (x, y) (5.14)
MN x=0 y=0
5.9.2 Multiscale classification

(L)
Once the test image feature vectors fs,k (x, y) have been calculated the image is first clas-
sified at the coarsest level and then the segmentation is refined by reclassifying at more
detailed scales. The classification at scale L has the following steps
1. For x ∈ {0, . . . , M/2L − 1}, y ∈ {0, . . . , N/2L − 1}, c ∈ {1, . . . , C} calculate the class
(L)
distances dc (x, y):
1 (L) 2
6 L
(L)
dc(L) (x, y) = fs,k (x, y) − µc,s,k (5.15)
6L s=1 k=1
2. Classify each pixel at position x, y as belonging to closest class r (L) (x, y)
r (L) (x, y) = argminc∈{1,... ,C} dc(L) (x, y) − bc(L) (x, y) (5.16)
(L)
All that remains is to define bc (x, y). This represents the information (from higher scales
and notions of continuity of regions) about the probability that the scale L block at x, y
(4)
belongs to class c. For classification at the coarsest scale bc (x, y) = 0. At more detailed
(L)
scales a reasonable first approximation is ac (x, y) defined to be 1 if the corresponding
parent block at scale L + 1 belongs to class c or 0 otherwise.

1 r (L+1) (x/2 , y/2) = c
ac(L) (x, y) = (5.17)
0 r (L+1) (x/2 , y/2) = c
Near the boundaries we should be less confident in the class assignment and we soften the
function to reflect this uncertainty. The softening is done by smoothing with a Gaussian
filter with smoothing parameter λ = 3.5.

K
K
− (u2 + v 2 )
bc(L) (x, y) =α exp ac(L) (x − u, y − v) (5.18)
u=−K v=−K
2λ2
where α is a parameter that controls the amount of information incorporated from previous
scales. For our experiments we used α = 1/4.
5.9.3 Choice of parameter values

There are a number of parameters in the method, (such as the scaling factors, λ, and
α), that will affect the performance. The values were chosen during development of the
algorithm to give sensible results on some mosaics (but not the test mosaics used by Randen
and Husøy). These values are certainly not theoretically or experimentally optimal. An
analysis of the effect of the parameters would be interesting but is beyond the scope of
this dissertation. An approximate treatment of a similar quad-tree algorithm for white
Gaussian noise fields is given by Spann and Wilson [111]. There is also a danger in an
experimental optimisation that the method will work very well on the mosaics used during
the optimisation but poorly on alternative mosaics (of different shapes or texture content).
The aim of this section on multiscale segmentation is to demonstrate that the DT-CWT
features are also useful for the more advanced classification schemes. This aim is satisified
better by using a “typical” multiscale algorithm than by testing a version optimised for a
particular dataset.
5.9.4 Multiscale results

The multiscale algorithm was tested on the 12 test mosaics as before using both the DT-
CWT features as explained above and using features calculated in an analogous way from
the DWT.
The average error for the DT-CWT multiscale method is 9.4% as compared to 18.0%
for the non-multiscale DT-CWT method. The average for the multiscale DWT method
is 16.5%. On most of the test images the DT-CWT performs substantially better than
the DWT. There is just one case (image l) where the DWT gives better results than the
DT-CWT and even in this case the difference is only 0.5%.
5.10. CONCLUSIONS 89
Mosaic Multiscale DT-CWT error *100% Multiscale DWT error *100%

a 3.4 3.8
b 21.3 24.7
c 11 21.3
d 3.8 17.5
e 7.8 7.8
f 15.8 31.8
g 23.9 31.7
h 9.4 24.3
i 13.3 25.1
j 0.4 6
k 0.3 1.4
l 2.9 2.4
Figure 5.12: Comparison of segmentation results for multiscale methods
The difference in performance of the method can be clearly seen in the results for mosaic
“f” in figure 5.14. There are many fewer small segments and the boundary errors are
greatly reduced.
5.10 Conclusions
The experimental results clearly show that the NDWT features are better than the DWT
features, and that the DT-CWT features are better than the NDWT features. For the
pixel by pixel classification experiments the average error was 26.8% for DWT features,
23.9% for NDWT features, and 18.0% for the DT-CWT features. The main reason for the
DT-CWT outperforming the NDWT features is the increased number of subbands that
allow more accurate orientation discrimination.
A comparison with published results [101] reveals that the simpler training scheme gives
almost as good results as LVQ training and the DT-CWT features performed better than
any of the feature sets tested in the published study, despite the simpler training.
Tests on a multiscale algorithm indicated that the superior performance of the DT-
CWT features is preserved even for more sophisticated classification methods. For the test
mosaics used the multiscale classification reduced the average error to 9.4%.

DT−CWT
40 multiscale DT−CWT 40
multiscale DWT
20 20
0 0
j k l a b c d e
40 40
20 20
0 0
h i f g
Figure 5.13: Percentage errors for single scale DT-CWT, multiscale DT-CWT, multiscale
DWT.
Figure 5.14: Segmentation results for mosaic “f” using the multiscale DT-CWT
Chapter 6
Correlation modelling
6.1 Summary
The purpose of this chapter is to give an example of the use of the phase of the complex
coefficients. We described in chapter 4 a synthesis technique that generated textures with
matching subband energies. This method works well for many stochastic textures but fails
when there is more structure in the image such as lines or repeated patterns.
Simoncelli has demonstrated good performance with a similar synthesis technique when
more parameters are extracted from an image than merely the energy [108]. Simoncelli
used over 1000 parameters to describe his textures. The main parameters came from the
auto-correlations of the wavelet coefficients and the cross-correlations of the magnitudes
of subbands at different orientations and scales. We compare the relative effect of these
different parameters for the DT-CWT. The auto-correlation allows better texture synthesis
and experiments indicate that sometimes auto-correlation based features can also give
improved segmentation performance.
The original contributions of this chapter are the experimental synthesis and segmen-
tation results. The method described is substantially based on a previous algorithm [108]
and is not claimed as original.
6.2 Auto-correlation Method

The basic method is to repeatedly match both the image statistics and the transform
statistics by alternating matching statistics and transforming to and from wavelet space
93
94 CHAPTER 6. CORRELATION MODELLING
and image space. We start by measuring the parameters of a target image and generating
a random (white noise) image of the correct size.
Simoncelli measured the following image pixels’ statistics; mean, variance, skewness,
kurtosis, minimum and maximum values [108]. These 6 values capture the general shape
of the histogram and we would expect them to give results very similar to using the full
histogram. However, to avoid mixing changes caused by matching correlation with changes
caused by a new set of image statistics we choose, as in chapter 4, to simply measure and
match the image histogram. Simoncelli based his synthesis upon the oriented complex
filters of the steerable pyramid described in section 2.4.3. We use instead the DT-CWT.
The DT-CWT subbands contain complex coefficients. We denote wk (x, y) the subband
k wavelet coefficient1 at position x, y where x and y are integers. To make the equations
simpler it is also convenient to define wk (x, y) = 0 for any positions that lie outside the
subband. We have tested two method based on auto-correlation based statistics.
Raw Auto-correlation The method generates the statistics rk (δx, δy) for subband k
directly from the complex valued auto-correlation of the subband.

rk (δx, δy) = wk (x, y)∗wk (x + δx, y + δy)
x y
Magnitude Auto-correlation The second method reduces the size of the parameter set
by calculating real valued statistics rk (δx, δy) based on the auto-correlation of the
magnitude of the complex wavelet coefficients.

rk (δx, δy) = |wk (x, y)||wk (x + δx, y + δy)|
x y
In both cases we match the appropriate statistics in essentially the same way. We first
describe the raw auto-correlation matching method.
We solve for an appropriate filter to apply to the subbands that will change the auto-
correlation by roughly the required amount. Let H(ω) be the spectrum of the filter.
The spectrum is a function of both horizontal (ωx ) and vertical (ωy ) frequency. We use
ω = (ωx , ωy ) as shorthand for these two variables. Let Pim (ω) and Pref (ω) be the Fourier
1
In this section we use the more compact notation of a single number k to index subbands at different
scales and orientations. For example, k ∈ {1, 2, 3, 4, 5, 6} indexes the 6 orientated subbands at scale 1,
while k ∈ {7, 8, 9, 10, 11, 12} indexes the subbands at scale 2.
6.2. AUTO-CORRELATION METHOD 95
transforms of the auto-correlations respectively of a subband and the corresponding sub-

band from the transform of the target texture. Pim (ω) and Pref (ω) are known as the power
spectra of the subbands. Standard filter theory tells us that the power spectrum of the
filtered image is given by:
Pf ilt (ω) = |H(ω)|2 Pim (ω) (6.1)
We require this output spectrum to be close to the reference spectrum and so the natural
filter to use is given by:

|Pref (ω)|
H(ω) = (6.2)
|Pim (ω)| + δ
The definition of the power spectrum ensures that Pim (ω) and Pref (ω) are always real. To
avoid divisions by small numbers we increase the denominator by a small amount δ (in the
experiments we use δ = 0.01Pim(0)).
We only have the central samples of the auto-correlations and so we estimate the power
spectra by zero-padding the auto-correlation matrices to twice their size before taking
Fourier transforms. After using equation 6.2 to produce the filter spectrum we use an
inverse Fourier transform to produce the actual filter coefficients. Note that as we only
retain a few auto-correlation coefficients the Fourier transforms involved are small and
consequently fast. We then convolve the subband with the filter (this is actually done in
the Fourier domain by multiplication) in order to produce a new subband with, hopefully,
a better matched auto-correlation. Although we have not proved that this will always
improve the match, in practice we found that this scheme converged within a few iterations.
For the magnitude auto-correlation method we first compute the magnitude and phase
of each coefficient in the subband. Then the above matching method is applied to a
subband consisting of just the magnitudes. Finally the new magnitudes are recombined
with the original phases to produce the new complex subband.
Throughout this chapter we will always use a 5 scale DT-CWT decomposition. This
results in 5 ∗ 6 = 30 complex subbands plus a real lowpass subband of scaling coefficients.
We only impose correlations on the complex subbands: the scaling coefficients are not
changed during the matching of transform statistics. For counting purposes we will treat
the real and imaginary parts as separate parameters. The form of the auto-correlation is
such that r(x, y) = r(−x, −y)∗ and so for an auto-correlation of size K by K (for odd K)
there are only (K 2 + 1)/2 independent complex numbers. Moreover, the central sample
r(0, 0) is always real. We conclude that we need to record K 2 numbers to record the
raw auto-correlation for a single subband, and hence we have a total of 30K 2 parameters
describing the transform statistics. For the magnitude auto-correlation this is reduced
to 15(K 2 + 1). For comparison, the energy synthesis method of chapter 4 needs only 30
parameters to describe a texture.
6.3 Auto-correlation Results

One texture (of wood grain) on which the energy synthesis method performs poorly is
shown in figure 6.1. The problem is the highly correlated lines of grain that cross the
Matching size 1x1 mag autocorrelation
Target texture Processed picture after 3 iterations
Figure 6.1: Results of original energy matching synthesis
texture and, although the general diagonal orientation of the texture is reproduced, the
strong correlation is lost.
We first test the raw auto-correlation method. Recording and matching merely the
central 3 by 3 samples of the autocorrelation matrix results in the improved results shown
in figure 6.2. The diagonal lines are longer and the image is more tightly constrained to
one orientation. The improvement is even greater if we use the central 5 by 5 samples
as shown in figure 6.3 where the synthetic texture appears very similar in texture to the
original.
Next we test the magnitude auto-correlation method. Figure 6.4 shows the results of
using the central 5 by 5 samples of the magnitude auto-correlation (there is no noticeable
difference if we just match a 3 by 3 auto-correlation). These results are just as bad as the
6.3. AUTO-CORRELATION RESULTS 97
Matching size 3x3 raw autocorrelation
Figure 6.2: Results of matching 3 by 3 raw auto-correlation values
Matching size 5x5 raw autocorrelation
Figure 6.3: Results of matching 5 by 5 raw auto-correlation values

original energy synthesis. We have shown the results after 3 iterations as these methods
Matching size 5x5 mag autocorrelation
Figure 6.4: Results of matching 5 by 5 magnitude autocorrelation values
were found to converge very quickly. More iterations produced negligible changes to the
synthesized images.
6.4 Discussion
The raw auto-correlation matching gives a significant improvement and so is managing to
capture the correlation in the image, while magnitude matching fails to help. This means
that significant information is contained in the phases of the wavelet coefficients. We
described in chapter 3 how complex wavelets can be used for motion estimation because
the change in phase of a wavelet coefficient between frames is roughly proportional to the
shift of an underlying image. In a similar way, if a subimage is responding to lines in the
image then the phases of the auto-correlation coefficients encode the relative positions of
the line segments. Therefore when we match the raw auto-correlation we are ensuring that
the line segments will be correctly aligned.
There are some textures for which the auto-correlation does not work as well such
as the hessian pattern in figure 6.5. The original texture contains alternating stripes of
strongly orientated material and although the synthetic texture does contain some patches
of strongly orientated texture, it also contains several places where there seems to be a
more checkerboard appearance. This is because at these places there is energy both at an
orientation of 45◦ and of −45◦ . We added measures of cross-correlation between subbands
in an attempt to solve this problem as described in the following sections.
6.5. CROSS-CORRELATION METHOD 99
Figure 6.5: Results of matching 5 by 5 raw autocorrelation values
6.5 Cross-correlation method

We measure the cross-correlations between subbands at one scale, and between subbands
at one scale and those at the next scale. The single scale cross-correlation cjk between
subband j and subband k is given by

cjk = |wj (x, y)||wk (x, y)|.
x y
It may seem odd that we use the magnitudes when we have discovered the importance
of phase for auto-correlation matching. Unfortunately, there is not any significant raw
correlation between subbands. The problem is that the phase of wavelet coefficients gives
information about the location of features in a particular direction. Suppose that there is
a particular pattern that is repeated many times in a certain texture and suppose we break
the original image up into lots of smaller subimages containing examples of this pattern.
There is no reason for the pattern to have any particular alignment within the subimages
and we can interpret the subimages as being a single prototype image distorted by different
translations. The phase of coefficients in the 45◦ subband, say, will rapidly change if the
image is translated at an angle of −45◦ , while the phase of the coefficients in the −45◦
subband will be much less affected by such a translation [74]. The translations therefore
alter the phase relationship between subbands and hence when the cross-correlation is
averaged across the entire texture the individual correlations will tend to cancel out. This
was not a problem for auto-correlation as the coefficients within a single band all respond
in the same way to a translation and hence the relative phase contains useful information.
The between scales cross-correlation measures the correlation between a subband at one
scale and the subbands at the next coarser scale. We define bl,j,k (a, b) to be the correlation
between subband j at scale l and subband k at scale l + 1:

bl,j,k (a, b) = |wl (2x + a, 2y + b)||wl+1(x, y)|
x y
where a and b take values 0 or 1. Due to the down sampling in the tree structure each
position at scale l + 1 is effectively the parent of 4 positions at scale l. We use a and b
to measure a separate correlation for each of these 4 choices. Naturally, we do not use a
between scales correlation for the coarsest scale subbands as these wavelet coefficients have
no parents. Again we use a measure of magnitude correlations as the raw phases will tend
to cancel out2
We need 6 ∗ 5/2 = 15 parameters to describe the cross correlations at a single scale,
and 6 ∗ 6 ∗ 4 = 144 parameters to describe the cross correlations between subbands at
one scale and the next coarser scale. For the 5 scale decomposition this gives a total of
15∗5+144∗4 = 651 parameters to describe cross correlations in addition to the parameters
used to describe the image statistics and auto-correlations.
The matching procedure first matches the raw auto-correlation values in the way de-
scribed earlier and then attempts to match the cross-correlations. The details of the
cross-correlation matching method can be found in Simoncelli’s paper [108].
6.6 Cross-correlation results and discussion

We compare three methods in this section:
Energy The subband energy matching method from chapter 4 using 30 parameters to
describe the wavelet statistics.
Raw auto-correlation The 7 by 7 raw auto-correlation matching method from the start
of this chapter using 30 ∗ 7 ∗ 7 = 1470 parameters to describe the wavelet statistics.
2
A way has been proposed to avoid the cancellation when computing the cross-correlation between two
subbands at different scales but the same orientation. The phases at the coarser scale will change at half
the rate of the finer scale and so to make the comparison meaningful the coarser scale coefficients must
have their phase doubled. The experiments presented here do not make use of this modification but details
can be found elsewhere [96].
6.6. CROSS-CORRELATION RESULTS AND DISCUSSION 101
Cross-correlation The 7 by 7 raw auto-correlation matching together with cross-correlation

matching to give a total of 30 ∗ 7 ∗ 7 + 651 = 2121 parameters to describe the wavelet
statistics.
Figure 6.6 displays the results of the different methods applied to 4 test images. In-
Original Energy Raw Auto−correlation Cross−correlation
Figure 6.6: Comparison of different synthesis methods.
cluding the cross-correlation statistics leaves the first three textures essentially unaltered.
The last hessian texture may be considered to be slightly improved but the improvement is
certainly not very large. On all of these textures the auto-correlation method gives a clear
improvement compared to the original energy synthesis method but it has a significant
increase in the size of the parameter set. There are several penalties associated with the
increased feature set size. The most obvious drawbacks are that the storage and com-
putation requirements are increased, but there is also a possible decrease in performance.
For texture synthesis it is reasonable to expect an increase in quality for each new fea-
ture matched, but synthesis itself is only of secondary interest. The main interest is in
using the features for texture analysis, and for analysis applications extra features can be
a disadvantage. The principal problems with extra features are that:
1. large textural regions are needed to reliably estimate the feature values,
2. the features may have significant correlations – this causes problems if we want to
use simple metrics to compare feature sets,
3. the features may be modelling variation within a single texture class.
The next section examines the performance of a larger feature set for the segmentation
task.
6.7 Large feature set segmentation

For the reasons mentioned above it would be inappropriate to try and use the more than
2000 auto and cross correlation parameters for texture segmentation but it would be in-
teresting to see the effect of using features based on the 3 by 3 raw auto-correlation.
We cannot directly use the auto-correlation to provide features because we just get one
auto-correlation value for the entire subband, while for segmentation we clearly need local
estimates of feature values. Instead we consider a very simple extension of the DT-CWT
that performs extra filtering on each subband to determine four features per subband.
We first describe this algorithm and then explain why it is approximately equivalent to
calculating a local auto-correlation estimate.
We propose using a simple shift invariant extension to the DT-CWT based on the Haar
wavelet transform. Each subband Wk (u, v) is split into four by:
1. Filter the subband Wk (u, v) horizontally with the filter 1 + z −1 to produce Lk (u, v).
2. Filter the subband Wk (u, v) horizontally with the filter 1 − z −1 to produce Hk (u, v).
3. Filter the subband Lk (u, v) vertically with the filter 1 + z −1 to produce Ak (u, v).
4. Filter the subband Lk (u, v) vertically with the filter 1 − z −1 to produce Bk (u, v).
5. Filter the subband Hk (u, v) vertically with the filter 1 + z −1 to produce Ck (u, v).
6. Filter the subband Hk (u, v) vertically with the filter 1 − z −1 to produce Dk (u, v).
6.7. LARGE FEATURE SET SEGMENTATION 103
All the filtering operations are nondecimated and use symmetric edge extension. Edge
extension is important when filters overlap the edge of an image. For the very short filters
used here we merely need to repeat the last row (or column) of the image. The features are
then calculated as before but based on Ak , Bk , Ck , Dk rather than on the original subbands.
This leads to four times as many features.
The new subbands approximately quarter the frequency response of the original sub-
band. Figure 6.7 shows contours of the frequency responses for the four subbands derived
from the original 45◦ subband at scale 2.
0.5 0.5
0.4 0.4
Vertical Frequency
Vertical Frequency
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.1 0.2 0.3 0.4 0 0.1 0.2 0.3 0.4
Horizontal Frequency Horizontal Frequency
0.5 0.5
0.4 0.4
Vertical Frequency
Vertical Frequency
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.1 0.2 0.3 0.4 0 0.1 0.2 0.3 0.4
Horizontal Frequency Horizontal Frequency
Figure 6.7: 2-D frequency responses for the four subbands derived from the level 2 45◦
subband. Contours are plotted at 90%, 75%, 50%, 25% of the peak amplitude. A dashed
contour at the 25% peak level for the original 45◦ scale 2 subband is also shown in each
plot.
Now we explain why this is an appropriate measure of local auto-correlation. The

short filters mean that the features will produce local features, but the connection with
auto-correlation is not as clear. To motivate our choice we consider a one dimensional
example. Suppose we filter a signal containing white Gaussian noise of mean zero and
variance 1 whose Z transform is X(z) with a filter H(z) = 1 + az −1 where we assume
a is real. Let {yk } denote the samples of the filtered signal and {xk } the original white
noise samples. The output samples {yk } will contain coloured noise. In other words, the
filtering introduces correlations between the samples. More precisely, as yk = xk + axk−1
and yk+1 = xk+1 + axk we can calculate the following expected correlations:

E {yk yk } = E x2k + 2aE {xk xk−1 } + a2 E x2k−1
= 1 + a2

E {yk yk+1 } = E {xk xk+1 } + aE x2k + aE {xk−1 xk+1 } + a2 E {xk−1 xk }
= a
This means that the average auto-correlation for lag ±1 is a, the average auto-correlation
for lag 0 is 1 + a2 and it can easily be shown that the auto-correlation for any other lag is
0.
Now consider the expected energy after using the Haar filters. The output {fk } of using
the 1 + z −1 filter is equivalent to filtering the original white noise with a combined filter of
(1 + z −1 )H(z) = 1 + (1 + a)z −1 + az −2
This will therefore produce an output with expected energy

E |fk |2 = 1 + (1 + a)2 + a2 = 2(1 + a + a2 )
If instead we filter with 1 − z −1 to get {gk } we have an expected energy of

E |gk |2 = 1 + (−1 + a)2 + a2 = 2(1 − a + a2 )
The sum of these two energies is 4 times the average auto-correlation at lag 0, while the
difference is 4 times the average auto-correlation at lag 1. This illustrates the close link
between the extra filtering and auto-correlation and suggests why the filtering provides an
appropriate measure of local auto-correlation statistics.
Table 6.8 presents the results of using the enlarged feature set for the pixel by pixel
segmentation experiment described in section 5.6. Also tabulated are the results for the
DT-CWT repeated from table 5.2. The average of the errors for the extended feature
set is 17.7% compared to 18.0% for the DT-CWT. The extended feature set is better for 9
out of the 12 test mosaics but 5.5% worse for mosaic a. This agrees with the argument in
the previous section that although extra features can sometimes provide improvements in
segmentation, this gain is not automatic and great care must be taken in choosing features.
Mosaic DT-CWT error *100% Extended DT-CWT* 100%

a 10.9 16.4
b 21.8 21.5
c 16.2 14.6
d 16.6 15.8
e 17.3 19.4
f 33.8 34
g 40.4 39.1
h 19.3 18.4
i 28.6 24.1
j 0.6 0.4
k 1.1 0.4
l 9.3 8.2
Figure 6.8: Comparison of segmentation results for different transforms
6.8 Conclusions
The extra features generated by measuring the autocorrelation of the subbands are useful
for modelling longer-range correlations and allow good synthesis of strongly directional
textures. The phase is an important part of the correlation because matching features based
solely on the magnitude autocorrelation gave inferior results. Matching cross-correlation
only slightly changed the results. These conclusions are all based on the subjective quality
of synthesized textures. Numerical experiments using an extended feature set confirmed
that auto-correlation related features can sometimes increase segmentation performance
but that they can also decrease performance in other cases.
Chapter 7
Bayesian modelling in the wavelet

domain
7.1 Introduction
The previous chapters have been concerned with non-Bayesian image processing techniques.
The general approach has been to design what we hope to be an appropriate algorithm for
addressing a particular problem and then examine the experimental results. The remaining
chapters have a very different flavour.
We will now approach image processing from a Bayesian perspective. The aim of this
dissertation is to explore the use of complex wavelets in image processing. The previous
chapters have illustrated their use within certain non-Bayesian methods and now we wish
to explore the use within Bayesian processing. Note that we are not directly aiming to
compare Bayesian and non-Bayesian methodologies. Both approaches are commonly used
and both approaches have different advantages. The specific motivation for using the
Bayesian methodology to address the problems in the following chapters is to provide a
mathematical framework within which we can place and compare techniques.
The purpose of this chapter is to introduce a complex wavelet model and compare
this with alternative model formulations and alternative wavelet choices. The concepts of
probability distributions and Bayes’ theorem are briefly stated and then used to construct
a common framework for a range of different texture models. We consider a number of
different ways in which we can define a prior distribution for images and reexpress each
model in terms of a multivariate Gaussian. Within this context the choice of wavelet
107
108 CHAPTER 7. BAYESIAN MODELLING IN THE WAVELET DOMAIN
transform is considered in section 7.4.

The main original contributions are; the description of shift invariant wavelet models
in terms of a Gaussian random process, the identification of the auto-correlation of the
process in terms of wavelet smoothing, and the experimental measurements in 1D and 2D
of the effects of using a decimated transform.
7.2 Introduction to Bayesian inference

This section gives a brief introduction to the terminology used for Bayesian inference. Basic
familiarity with the concepts of probability theory and with the ideas of random processes
and their covariance and correlation functions [116] is assumed. We will use the terms pdf
or cdf to describe the distribution of a random variable. These terms are defined as:
Cumulative Distribution Function (cdf) The cdf for a random variable X is a func-
tion FX : R → R , defined by
FX (x) = P (X ≤ x).
Probability Density Function (pdf) The pdf for a random variable X is a function
fX : R → R , defined by
∂FX (x)
fX (x) =
∂x
where FX (x) is the cdf for the random variable.
Joint cdfs are defined as
FX,Y (x, y) = P (X ≤ x, Y ≤ y)
and joint pdfs as
∂ 2 FX,Y (x, y)
fX,Y (x, y) =
∂x∂y
Conditional pdfs are defined as
∂P (X ≤ x|Y = y)
fX|Y (x|y) =
∂x
7.3. BAYESIAN IMAGE MODELLING 109
Using these definitions Bayes’ theorem for pdfs can be written as

f (y|x)f (x)
f (x|y) = (7.1)
f (y)
where we have dropped the subscripts for clarity (this will usually be done within the
dissertation when there is little risk of confusion).
We will make use of Bayes’ theorem when we have a model that generates images based
on some unknown parameters. Bayes’ theorem provides a way of estimating the parameters
of the model (such as the clean image before noise was added) from the observed data. In
equation 7.1, x represents the unknown parameters and y the observed image data.
There are special words that refer to particular parts of the formula:
Prior The prior distribution is the name for f (x). This represents the information we
have about the model parameters before observing the image data. For example, a
simple prior for an image might be that all pixels have intensity values independently
and uniformly distributed between 0 and 1.
Posterior The posterior distribution is the name for f (x|y). This represents all the in-
formation we have about the model parameters after observing the image data.
Likelihood The likelihood is the name for f (y|x). This represents our knowledge about
the model. Generating a sample from f (y|x) is equivalent to using our model to
generate some typical noisy image data based on some particular values of the pa-
rameters x.
7.3 Bayesian image modelling

In this chapter we are only concerned with specifying the prior distribution for images.
(Later, in chapters 8 and 9, we explain how, once equipped with a suitable prior, we can
proceed to define the likelihood for a particular problem and finally infer estimates from
observed data.) In other words, we are trying to mathematically express the expectations
we have for the images that are likely to occur. Clearly this will be intimately related with
the precise application and the models we describe will have a number of parameters that
can be tuned depending on the circumstances.
For notational simplicity we represent images by placing the pixels into column vectors.
We will use upper case for vectors of random variables and lower case for observations of
such vectors. We shall make extensive use of the multivariate Gaussian distribution (also
known as a multivariate Normal distribution). We use the notation Z s N (µ, B) to
denote that the random variables contained in Z are drawn from a multivariate Gaussian
distribution with mean µ and covariance matrix B. The multivariate Gaussian is defined
to have the following pdf for Z s N (µ, B)

1 1
p(Z = z) = exp − (z − µ)T B −1 (z − µ) (7.2)
(2π)N/2 |B|1/2 2
where N is the length of the vector Z and |B| is the determinant of B. One useful standard
result is that if Z s N (µ, B) and Y = AZ where A is a real matrix with N columns and

S ≤ N rows then Y s N Aµ, ABAT .
Images generated with a multivariate Gaussian distribution are also known as realisa-
tions from a discrete1 Gaussian random process. We are particularly interested in wide
sense stationary processes. Let N be the number of locations within the image. Let xa ∈ R 2
be the position of the ath pixel. A process is defined to be stationary in the wide sense if
1. the expectation is independent of position: there exists a c ∈ R such that for all
a ∈ {1, . . . , N}
E {Za } = c
2. the correlation of two random variables Za and Zb is a function only of their relative
position: there exists a function R : R 2 → R such that for all a, b
E {Za Zb } = R(xa − xb ).
R is called the autocorrelation function for the random process.
The first condition means that µa = E {Za } = c and for simplicity we shall assume that
the data has been shifted to ensure c = 0. The second condition means that the covariance
matrix B has a specific structure:
E {Za Zb } = Ba,b = R(xa − xb )

1
The intensity values are given by continuous random variables. The process is called discrete because
values are only defined on the grid of pixel positions that cover the image. There is no assumption that
the intensity values must be, for example, integers.
The covariance function of the random process is defined to be E {(Za − E {Za })(Zb − E {Zb })}
and is equal to R(xa − xb ) because E {Za } = E {Zb } = c = 0.
There are two styles of distribution specification that are often encountered. Sometimes
a formula for the pdf is explicitly stated, we shall call this the direct specification. In other
cases a method for generating samples from the prior is given, we call this a generative
specification.
We now consider a number of standard prior models and convert each case to the
equivalent multivariate Gaussian form for the distribution.
7.3.1 Filter model

A simple example of the generative method is the filter model. Imagine that we have an
image filled with random white noise of variance 1 which is then filtered. This process
generates sample images from the prior.
Let R be a column vector containing all the white noise samples. Then let A be a
matrix that applies the filtering operation so Z = AR. The values Z in the filtered image
will be distributed as

Z s N 0, AAT .
Many models are special cases of this system including, for example, the ARMA (autore-
gressive moving average) model. The signals produced by such a system are examples of
wide sense stationary discrete Gaussian random processes. This relationship is well known
[116] and we highlight a couple of standard results (assuming that the filter is stationary,
i.e. that the same filter is used across the entire image):
1. The i, j entry in the covariance matrix is equal to the covariance function of the
random process evaluated for the difference in position between pixel i and pixel j.
2. The covariance function of the random process is equal to the autocorrelation function
of the filter impulse response.
7.3.2 Fourier model

The Fourier model is another generative specification closely related to the previous model.
We imagine generating complex white noise in the Fourier domain with variance depending
on the frequency according to some known law, then inverting the Fourier transform, and
finally taking the real part of the output to generate a sample image.
We define a matrix F to represent the Fourier transform:

1 xTa xb
Fa,b = √ exp −2πj √
M M
√ √
where we assume that the images are square of dimensions M by M . This definition
of the Fourier transform has the following properties:
1. The inverse is the Hermitian transpose of F .
F H F = IM
2. The energy of the Fourier coefficients is equal to the energy in the image. This is
known as Parseval’s theorem:

a
2 = aH a = aH F H F a =
F a
2
Let D be the (real) diagonal matrix that scales white noise of variance 1 to give white
noise in the Fourier components of the desired variances. The images generated by the

Fourier model can be written as F H D (RR + jRI ) where RR and RI are distributed
as:
RR s N (0, IM )
RI s N (0, IM )
Denote the real part of F by FR , and the imaginary part by FI . The images can be
expressed as

Z = F H D (RR + jRI )

= FRT − jFIT D (RR + jRI )
= FRT DRR + FIT DRI

R
R
= FRT D FIT D
RI
and we deduce that the prior distribution is

Z s N 0, FRT D 2 FR + FIT D 2 FI .
again a multivariate Gaussian. Let B = FRT D 2 FR + FIT D 2 FI be the covariance matrix.

The entry Bij can be written as
Bij = eTi Bej

= eTi FRT D 2 FR + FIT D 2 FI ej

= eTi F H D 2 (FR + jFI ) ej
where ei is the column vector containing all zeros except for a one in the ith place. (We
also assume that D is a real matrix.) This equation represents the following process:
1. Take the Fourier transform of an impulse at location xj .
2. Multiply the Fourier coefficients by the diagonal entries of D 2 .
3. Invert the Fourier transform of the scaled coefficients.
4. Extract the real part of the entry at location xi .
This process represents a simple blurring operation and we conclude that the covariance
function of the generated random process is given by such a blurred impulse.
7.3.3 Wavelet direct specification

One way of using a wavelet transform to define the prior is to specify a pdf defined on a
weighted sum of the squares of the wavelet coefficients of the transform of the image.
If we have a complex wavelet transform then it is convenient to treat the real and imag-
inary parts of the output coefficients as separate real outputs of the transform. If we let the
(real) matrix W represent the forward wavelet transform, P the reverse wavelet transform,
and D a diagonal matrix containing the weights we apply to the wavelet coefficients then

1 1 T T 2
p(Z = z) ∝ exp −
DW z
= exp − z W D W z
2
2 2
which we can recognise as the form of a multivariate Gaussian.

Z s N 0, (W T D 2 W )−1
The covariance matrix C = (W T D 2 W )−1 has a strange form. One way to understand the
covariance is via the equation
W T D 2 W C = W T D 2 W (W T D 2 W )−1 = IN
This means that the ith column of C is transformed by a wavelet sharpening process
W T D 2 W to become ei .
For a balanced wavelet the wavelet sharpening algorithm consists of the following steps:
1. Take the wavelet transform of the image.
2. Multiply the wavelet coefficients by the diagonal entries of D 2 .
3. Invert the wavelet transform.
If we assume that the same weighting is applied to all the coefficients in a particular sub-
band then (for a shift invariant transform) the prior will correspond to a stationary discrete
Gaussian random process with some covariance function. The mathematics translates to
saying that the shape of this covariance function is given by the inverse of the sharpening
algorithm applied to an impulse. In other words, the covariance function is such that if we
apply a wavelet sharpening process we produce an impulse.
If we are using an orthogonal wavelet transform (i.e. a non-redundant balanced trans-
form) then the inverse of a wavelet sharpening process will give the same results as a wavelet
smoothing process, but this is not necessarily true for a redundant balanced wavelet trans-
form.
For non-balanced wavelets W T is not the same as the reconstruction transform P .
However, there is a natural interpretation for W T in terms of the filter tree used to compute
wavelets. Recall that for a standard wavelet transform H0 (z) and H1 (z) define the analysis
filters while G0 (z) and G1 (z) define the reconstruction filters. We define a new set of
reconstruction filters G0 (z) = H0∗ (z −1 ), G1 (z) = H1∗ (z −1 ) where the conjugation operation
in these equations is applied only to the coefficients of z, but not to z itself. In other
words, we use reconstruction filters given by the conjugate time reverse of the analysis
filters. These may no longer correspond to a perfect reconstruction system but if we
nevertheless use the reconstruction filter tree with these new filters then we effectively
perform a multiplication by W T .
7.3.4 Wavelet generative specification

We can also generate sample images using wavelets by generating white noise samples of
variance 1 for each wavelet coefficient, scaling them by a diagonal matrix D and then
7.4. CHOICE OF WAVELET 115
inverting the wavelet transform. This is the most important model for our purposes as it
is the model that will be used in the next chapter.
We will assume that the same weighting is used for all the coefficients in a particular
subband and that the choice of wavelet transform is such that the sample images are
implicitly drawn from a stationary prior; this second assumption is discussed in section
7.4.2.
The images are generated by Z = P DR and the prior distribution will be given by

Z s N 0, P D 2P T .
The assumption that this prior is stationary means that the prior is a stationary discrete
Gaussian random process. For a balanced wavelet W = P T and, just as for the Fourier
method, the covariance function of this process is given by a smoothing procedure applied
to an impulse:
1. Forward wavelet transform the impulse.
2. Scale the wavelet coefficients by D 2 .
An alternative view of this method for a S subband wavelet transform is to consider the
images as being the sum of the S reconstructions, one from each subband. Each subband
has a common scaling applied to the wavelet coefficients and so can be viewed as a special
case of the filtering method with the filter being the corresponding reconstruction wavelet.
The covariance function for images generated from noise in a single subband is therefore
given by the autocovariance of this wavelet.
Lemma 1 in Appendix B shows that the autocovariance function for a sum of two
independent images is given by the sum of the individual covariance functions. Conse-
quently, the covariance of the sum of the reconstructions will be the sum of the individual
covariances because the scales all have independent noise sources.
7.4 Choice of wavelet

There are a number of factors to consider in our choice of wavelet. The next sections report
on two factors relating to the accuracy of the results:
Shift invariance the wavelet generative model is only appropriate for transforms with
low shift dependence.
Flexibility in prior model The covariance structure of the prior model is determined
partly by the choice of scaling factors and partly by the choice of wavelet. We
should choose a wavelet that allows us to generate the covariance structure of a given
application.
Further discussion more tightly linked to the nature of the application can be found in
section 8.7.
Section 7.4.1 proposes five possibilities for the choice and the following two sections
estimate the importance of the factors for each of these choices.
7.4.1 Possible Basis functions

We based our discussion of the wavelet generative model on wavelet transforms but the
discussion is also applicable for any set of filter coefficients used in the pyramid structure
even if the choices do not belong to a wavelet system.
We will consider five wavelet-like systems:
1. A real fully decimated wavelet transform (DWT) based on the Daubechies filters of
order 8.
2. A real non-decimated wavelet transform (NDWT) based on the Daubechies filters of

order 8.
3. The W-Transform (WWT).
4. The dual tree complex wavelet transform (DT-CWT).
5. The Gaussian pyramid transform (GPT).
The GPT is one of the oldest wavelet-like transforms. Adelson et al. [1] suggested using
either the Gaussian or Laplacian pyramid to analyse images. The analysis filters for the
Laplacian pyramid are (short FIR approximations to) the differences between Gaussians
of different widths. The analysis filters for the Gaussian pyramid are Gaussians. To
reconstruct an image from a Laplacian pyramid we use Gaussian reconstruction filters.
We will be using the pyramid to reconstruct surfaces and it can be implemented in the
same way as a normal dyadic wavelet transform by a succession of upsampling operations
and filtering operations. Figure 7.1 shows the sequence of operations involved in recon-
structing a surface from wavelet coefficients at 3 scales. Choosing G(z) to be a simple
Wavelet Coefficients
at scale 3
Upsample Filter Rows Upsample Filter Columns

Rows with G(z) Columns with G(z)
at scale 2

at scale 1

Output
Surface
Figure 7.1: Sequence of operations to reconstruct using a Gaussian Pyramid

5-tap filter
√
G(z) = z −2 + 3z −1 + 4 + 3z + z 2 /6 2 (7.3)
gives a close approximation to Gaussian shape and provides good results for very little
computation.
The WWT (W-wavelet transform) [67] is a bi-orthogonal wavelet transform. The anal-
ysis lowpass filter is
1
H0 (z) = √ −z −1 + 3 + 3z − z 2
2 2
and the analysis highpass filter is

1
H1 (z) = √ −z −1 + 3 − 3z + z −2
2 2
The reconstruction filters are also very simple. The lowpass filter is
1
G0 (z) = √ z −1 + 3 + 3z + z 2
2 2
and the reconstruction highpass filter is
1
G1 (z) = √ z −1 + 3 − 3z − z −2
2 2
Unlike a standard wavelet transform the wavelets at a particular scale are not orthogonal
to each other: the orthogonality is sacrificed in order to produce smoother reconstruction
wavelets. We shall use the WWT in a manner analogous to the pyramid transform with
G(z) = z −1 + 3 + 3z + z 2 .
7.4.2 Shift Invariance

The wavelet generative model produces a stationary random process if the aliasing is as-
sumed to be negligible. However, for many choices of wavelet the aliasing will be significant.
It is useful to consider when a transform is shift invariant.
The direct specification will have a shift invariant prior if the energy within each scale
is invariant to translations. The energy will be shift invariant if there is no aliasing during
the wavelet transform. For the original signal the Nyquist frequency is half the sampling
frequency. Each time we subsample the output of a wavelet filter we halve the Nyquist
frequency. For shift invariance we require the bandwidth of each wavelet filter to be less
than the Nyquist frequency for the corresponding subband [63].
Similarly for the generative specification there will be no aliasing and hence shift in-
variance as long as this is true.
The NDWT will be shift invariant because it has no subsampling. Standard wavelet
transforms (DWT) repeatedly split the spectrum into two halves and downsample by a
factor of two. The finite length of the filters means that the bandwidth of each channel
will be slightly greater than half the spectrum making these transforms shift dependent.
The real wavelets in the DWT have both positive and negative passbands. By discrimi-
nating between positive and negative frequencies the DT-CWT wavelets only have a single
passband. This reduces the bandwidth to within the Nyquist limit and hence allows the
reduction of aliasing [63]. The GPT and WWT reduce aliasing by decreasing the band-
width of the lowpass reconstruction filters at the cost of increasing the bandwidth of the
lowpass analysis filters. As the generative specification only uses the reconstruction filters
the increased analysis bandwidth does not matter.
We demonstrate the effect of shift invariance with two experiments. The first gives a
qualitative feel for the effect by using a simple one dimensional example. The second gives
a quantitative estimate of the amount of variation in the two-dimensional case.
7.4.3 One dimensional approximation

The purpose of this section is merely to give an illustration of the effect of shift dependence.
We make use of the approximation method that will be developed in chapter 8 and readers
unfamiliar with approximation can safely skip this section.
The first experiment compares approximation methods in one dimension using the
different transforms. We use a five scale decomposition for each transform. The variances
for the different scales were (from coarsest to finest) 4,2,1,0.5,0.2,0.1 and we used a variance
of 0.01 for the measurement noise2 .
We chose the positions and values of six points and then used both methods to ap-
proximate 128 regularly spaced values. These precise values are not critical because this
experiment is just meant to give a feel for the relative performance. We repeat the experi-
ment 8 times, the only difference between repetitions is that each time the origin is slightly
shifted. The method described in chapter 8 is used to perform the approximation.
Figure 7.2 shows the estimated values. The results for different origin positions are
stacked above one another, and we have marked the original positions and data values
with crosses. The NDWT, GPT, and the DT-CWT all appear very close to being shift
invariant. The WWT has a small amount of shift dependence, and the standard DWT has
a large amount of shift dependence. These are the qualitative results that we wanted to
demonstrate with this experiment.
The actual shape of the approximation is not significant as it is highly dependent on the
scaling factors chosen. The different transforms produce different approximations because
2
For the NDWT the variance for each scale was reduced by the amount of oversampling in order to
produce equivalent results.
Gaussian Pyramid W Transform
20 20
10 10
0 0
−10 −10
20 40 60 80 100 20 40 60 80 100
Dual tree complex wavelet transform Orthogonal real wavelet
20 20
10 10
0 0
−10 −10
20 40 60 80 100 20 40 60 80 100
Nondecimated real wavelet
20
10
−10
20 40 60 80 100
Figure 7.2: One dimensional approximation results for different origin positions. Crosses
show location and values of measured data points.
the associated covariance structures are different. The WWT and the GPT use lowpass
filters with a smaller bandwidth than is usually used for wavelets and therefore the results
are smoother. This is not an important difference because we could also make the DT-CWT
and NDWT results smoother by changing the scalings used.
7.4.4 Two-dimensional shift dependence

Section 7.3.4 described the link between a wavelet smoothed impulse and the covariance
function of surfaces produced by the wavelet generative model. The associated random
process will only be stationary if the result of smoothing is independent of location. This
section describes the results of an experiment to measure the variation in such a smoothed
impulse.
The basic idea is simple: shift dependence means that smoothing a translated impulse
will not be the same as translating a smoothed impulse. We measure the amount of shift
dependence by examining the energy of the difference.
Using the same notation as in section 7.3.4, the wavelet smoothing produces the output
z = P D2P T d
where d represents the input image (0 everywhere apart from a single 1 in the centre of
the image).
The amount of smoothing is determined by the diagonal entries of the matrix D. Sup-
pose that we have a K level transform and that at level k all the scaling factors for the
different subbands are equal to σk . This will give approximately circular priors (the quality
of the approximation is demonstrated in the next section). Define diagonal matrices Ek
whose entries are (Ek )ii = 1 if the ith wavelet coefficient belongs to a subband at level k
and zero otherwise. This allows us to decompose D as

K
D= Ek σk (7.4)
k=1
The output of the wavelet smoothing is
z = P D2P T d
K
= σk2 P Ek P T d
k=1
Now define S(x, y) to be a matrix that performs a translation to the data of x pixels
horizontally and y pixels vertically. The inverse of this transform is S(x, y)T (assuming
periodic extension at the edges). By averaging over all translations we can compute a shift
invariant estimate that we call zave
K −1 2K −1

K 2
2K
zave = 1/2 σk2 S(x, y)T P Ek P T S(x, y)d
k=1 x=0 y=0

K
= σk2 zk
k=1
where zk is the average result of reconstructing the data from just the scale k coefficients.
K −1 2K −1
2
2K
zk = 1/2 S(x, y)T P Ek P T S(x, y)d.
x=0 y=0
For a particular translation of the data of x, y pixels we define the energy, E(x, y), of the
error between the wavelet smoothed image and the shift invariant estimate as
2
K

E(x, y) = σk S(x, y) P Ek P S(x, y)d − zave
2 T T

k=1
2
K

= σk2 S(x, y)T P Ek P T S(x, y)d − zk

k=1

K
2
≈ σk4 S(x, y)T P Ek P T S(x, y)d − zk
k=1
where in the last step we have assumed that the errors from each scale will be approximately
uncorrelated. If we define ek (x, y) to be be the error at scale k due to shift invariance,
ek (x, y) = S(x, y)T P Ek P T S(x, y)d − zk
we can write the energy Eave of the error averaged over all translations as
K −1 2K −1

K 2
Eave ≈ σk4 1/22K
ek (x, y)
2.
k=1 x=0 y=0
It may seem strange to have the fourth power of σk . This is a consequence of weighting
by σk2 during the smoothing step. The error energy depends on the parameters σk and will
tend to be dominated by the level k with the largest σk . Different applications will have
different priors. To give a quantitative estimate of the importance of shift dependence for
different priors we carry out the following procedure for K varying between 1 and 4:
1. Set σk = 0 for k = K and σK = 1.
2. Evaluate zK , zave , and Eave using the above equations.
3. Measure the amount of shift dependence f = Eave /

zave
2 .
The results of this experiment for the different transforms are shown in table 7.3. The
results are converted to signal to noise ratios given by −10 log10 (f ). The NDWT has a
Transform K=1 K=2 K=3 K=4

DWT 6.8 6.8 6.8 6.8
NDWT ∞ ∞ ∞ ∞
WWT 13.6 17.4 18.4 18.7
DT-CWT ∞ 29.5 32.9 32.0
GPT 26.7 32.7 35.1 35.6
Figure 7.3: Shift dependence for different scales/dB.
SNR of ∞ because this transform has no down-sampling and is shift invariant. Similarly
the multiple trees mean that there is effectively no down-sampling in the first level of the
DT-CWT and it also has infinite SNR for K = 1. Both the higher scales of the DT-CWT
and the GPT have very low amounts of shift dependence with SNR levels around 30dB.
However, the WWT only manages about 18dB while the DWT has a very poor performance
with 7dB. 7dB corresponds to an shift dependence error energy of about 20% of the signal
energy.
Care must be taken when interpreting the SNR values tabulated. The measurements
describe the degree to which the wavelet generative model produces a stationary prior.
However, the final solution to a problem is based on the posterior density and the posterior
is a combination of the likelihood and the prior. In some circumstances the information
in the likelihood can counteract the deficiencies in the prior to produce a good quality
posterior. An quantitative example of the importance of this effect is given in section
8.7.2 which explains the significance of these measurements for a particular application (of
interpolation).
7.4.5 Flexibility
If a prior is required to be anisotropic (i.e. some directions are allowed more variation than
others) then we alter the model so that we can separately vary the scaling factors for each
subband. The increased directionality of the DT-CWT means that it is much more flexible
than any of the other methods for modelling anisotropic images. For example, none of
the other methods can produce priors that favour images containing correlations at angles
near 45◦ without also favouring correlations at angles near −45◦ .
However, in many applications it will be reasonable to assume that the prior for the
images is isotropic and so one way of testing the flexibility is to measure how close the
covariance function is to being circularly symmetric.
Each choice of scalings for the subbands implicitly defines the signal model as a sta-
tionary process with a certain covariance function. As in section 7.4.4 the covariance is
calculated by a wavelet smoothing method applied to an impulse. For each wavelet trans-
form the following process is used:
1. Generate a blank image of size 64 × 64. (Here we use blank to mean every pixel value
is 0.)
2. Set the pixel at position (32, 32) to have value 1.
3. Wavelet transform the image using 4 levels.
4. Scale the wavelet coefficients at level l by σl2 = 42l .
5. Scale the low pass coefficients by σ52 = 49 .
The final image produced is proportional to the covariance function. The exact values used
for the scaling factors in this experiment are not crucial and are just chosen to give results
large enough for the symmetry to be seen. (The Gaussian pyramid and W transform
methods have smoother lowpass filters and we change the scaling factors slightly in order
to give a similar shaped covariance.) The results are shown in figures 7.4,7.5,7.6,7.7, and
7.8.
40
35
60
30
25 50
20 40
15
30
10
5 20
0
10
−5
80
10 20 30 40 50 60
60 80
60
40
40
20
20
0 0
Figure 7.4: Covariance structure for an orthogonal real wavelet.
The Gaussian pyramid produces the most circularly symmetric covariance. The W
transform, DT-CWT, and the NDWT all produce reasonable approximations to circular
symmetry. The most important part of these diagrams is the section near the centre. For
large distances the contours are not circular but at such points the correlation is weak and
hence not as important. The DWT has a significantly non-circular covariance function.
35
30
60
25
50
20
40
15
10 30
5 20
0
10
−5
80
10 20 30 40 50 60
60 80
60
40
40
20
20
0 0
Figure 7.5: Covariance structure for a nondecimated real wavelet.
35
30 60
25
50
20
40
15
30
10
20
5
10
0
80
10 20 30 40 50 60
60 80
60
40
40
20
20
0 0
Figure 7.6: Covariance structure for the Gaussian pyramid.

50
60
40
50
30
40
20 30
10 20
10
0
80
10 20 30 40 50 60
60 80
60
40
40
20
20
0 0
Figure 7.7: Covariance structure for the W transform.
35
30
60
25
50
20
40
15
10 30
5 20
0
10
−5
80
10 20 30 40 50 60
60 80
60
40
40
20
20
0 0
Figure 7.8: Covariance structure for the DT-CWT .

Figure 7.9 shows the same experiment (still using the DWT) except that we change step 2
to act on the pixel at position (28, 28). The covariance changes to a different (non-circular)
25
20 60
15 50
40
10
30
5
20
0
10
−5
80
10 20 30 40 50 60
60 80
60
40
40
20
20
0 0
Figure 7.9: Covariance structure for a translated orthogonal real wavelet.
shape.
It is possible to improve the circularity of the wavelet transform results by adjusting
the scaling factors for the ±45◦ subbands. These subbands should be treated differently
because their frequency responses have centres (in 2D frequency space) further from the
origin than the centres of the other subbands at the same scale. However, these changes
will do nothing to alleviate the problems of shift dependence found for the DWT.
7.4.6 Summary of results

Table 7.10 summarises the flexibility and shift dependence properties of the different trans-
√
forms. A indicates good behaviour with respect to the property, a × indicates bad
behaviour, and a ? indicates intermediate behaviour.
7.5 Conclusions
The first part of the chapter is based on the assumption that the chosen wavelet transform
is shift invariant. Based on this assumption, we conclude that each of the four image models
Transform Shift Isotropic Anisotropic

invariant modelling modelling
DWT × × ×
√ √
NDWT ×
√
WWT ? ×
√ √ √
DT-CWT
√ √
GPT ×
Figure 7.10: Summary of properties for different transforms
discussed are equivalent to a wide sense stationary discrete Gaussian random process. In
particular we conclude that:
• The Filter model corresponds to a process with covariance function given by the
autocovariance function of the filter impulse response.
• The wavelet generative specification of the prior corresponds to a process with co-
variance function given by a wavelet smoothed impulse.
• The wavelet direct specification corresponds to a covariance function that transforms

to an impulse when a wavelet sharpening operation is applied.
For a shift dependent transform the wavelet generative prior model will be corrupted by
aliasing. The experiments in 1D and 2D suggest that these errors are relatively small
for the DT-CWT but large for the DWT. The NDWT, WWT, DT-CWT, and GPT all
possess reasonably isotropic covariance functions even without tuning the scaling for the
±45◦ subbands.
Chapter 8
Interpolation and Approximation
8.1 Summary
The purpose of this chapter is to explore the use of the DT-CWT for Bayesian approxi-
mation and interpolation in order to illustrate the kind of theoretical results that can be
obtained. We assume that a simple stationary process is an adequate prior model for the
data but observations are only available for a small number of positions.
After a brief description of the problem area we place a number of different interpola-
tion and approximation techniques into the Bayesian framework. This framework reveals
the implicit assumptions of the different methods. We propose an efficient wavelet approx-
imation scheme and discuss the effect of shift dependence on the results.
Finally we describe two refinements to the method; one that increases speed at the cost
of accuracy, and one that allows efficient Bayesian sampling of approximated surfaces from
the posterior distribution.
We originally developed these methods for the determination of subsurface structure
from a combination of seismic recordings and well logs. Further details about the solution
of this problem and the performance of the wavelet method can be found in [35].
The main original contributions are; the Bayesian interpretations of spline processing
and minimum smoothness norm solutions, the proposed wavelet approximation method,
the theoretical estimates for aesthetic and statistical quality, the experimental measures
of these qualities, the method for fast conditional simulation, and the method for trading
speed and accuracy.
129
130 CHAPTER 8. INTERPOLATION AND APPROXIMATION
8.2 Introduction
The task is to estimate the contents of an image from just a few noisy point samples. In
this chapter we assume that the image is a realisation of a 2D (wide sense) stationary
discrete Gaussian random process. This is an example of an approximation problem. The
approximation is called an interpolation in the special case when the estimated image is
constrained to precisely honour the sample values.
Using the same conventions as in chapter 7 we use Z to represent the (unknown)
contents of the image. The prior distribution is
Z s N (0, B)
where B is a N × N covariance matrix and, as before, we assume that the image is shifted
to have zero mean. The assumption that the process is stationary means that B can be
expressed in terms of a covariance function R as
Bab = R (xa − xb )
Suppose we have S observations at locations xs(1) , . . . , xs(S) . It is notationally con-

venient to reorder the locations so that s(a) = a. In other words, we assume we have
observations at the locations x1 , . . . , xS . There is no restriction on the relative positions
(e.g. we do not assume x1 must be next to x2 ) and so this reordering does not reduce the
generality of the analysis.
We also assume that the observations are all at distinct locations within the grid.
Section 8.9 discusses the implications of the assumptions in this model. Denote the S
observations by Y ∈ R S and define a S by N matrix T as

1 a = b and a ≤ S
Tab = (8.1)
0 otherwise
When applied to a vector representing an image, the matrix T selects out the values
corresponding to the observed locations. Let Q ∈ R S be a vector of random variables
representing the measurement noise:

Q s N 0, σ 2 IS (8.2)
where σ 2 is the variance of the measurement noise. The observations are given by
Y = TZ + Q (8.3)
8.3. POSTERIOR DISTRIBUTION 131
In this chapter we will assume that both the covariance structure of the process and the
variance σ 2 of the noise added to the samples are known. In practical applications it
is usually possible to estimate these from the sample values [110]. Strictly speaking our
methods should be described as empirical Bayes because the prior is based on estimated
values. A completely Bayesian approach would involve treating the parameters of the model
as random variables and setting priors for their distributions. However, this introduces
further complications during inference that would distract from the main aim of evaluating
complex wavelets. A description of a fully Bayesian approach to the problem can be found
in the literature [12].
We will define the basic problem (with known covariance) to be “stationary approx-
imation” (or “stationary interpolation” when σ = 0) but we will usually shorten this to
simply “approximation” (or “interpolation”). This problem has been extensively studied
and many possible interpolation methods have been proposed. There are several crude
methods such as nearest neighbour, linear triangulation, and inverse distance that work
reasonably when the surfaces are smooth and there is little noise but are inappropriate oth-
erwise. There are also more advanced methods such as Kriging [13], Radial Basis Functions
[97], Splines [119], and Wiener filtering.
This chapter describes a wavelet based Bayesian method for (approximately) solving the
stationary approximation problem and shows how a number of the alternative techniques
are solutions to particular cases of the problem.
The first part of this chapter discusses the alternative techniques from a Bayesian
perspective. The second part describes the wavelet method and experimental results. As
a first step towards relating the techniques section 8.3 describes the form of the posterior
distribution for the problem.
8.3 Posterior distribution

Suppose we wish to obtain a point estimate for the random variable Zk corresponding to
location xk (we assume that there is no available observation at this location). A reasonable
estimate is the mean of the posterior distribution. Appendix B proves that the posterior
distribution for such a point given observations Y = y is Gaussian and gives an expression
for the mean. Let C be the S by S leading submatrix of the covariance matrix B that
expresses the correlations between the observation locations. Let D ∈ R S be the vector of
covariances between the observation locations and location xk .

Da = E Zs(a) Zk
= Bs(a),k
= R(xs(a) − xk )
= R(xa − xk )
Appendix B shows that the estimated value Ẑk is given by
Ẑk = E {Zk |y}

−1
= DT σ 2 IS + C y (8.4)
If we define a vector λ ∈ R S as
−1
λ = σ 2 IS + C y
then we can express the estimate as
Ẑk = DT λ
S
= R(xa − xk )λa
a=1
N
= R(xa − xk )Λa
a=1
where Λ ∈ R N
is a vector whose first S elements are given by λ1 , . . . , λS and whose
other elements are all zero. Λ represents an image that is blank except at the observation
locations. The equation for the estimate can be interpreted as filtering the image Λ with
the filter h(x) = R(−x) and then extracting the value at location xk . We express the
estimate in this form because λ (and hence Λ) does not depend on the location being
estimated and therefore point estimates for every location are simultaneously generated by
the filtering of Λ.
8.4 Approximation techniques
8.4.1 Kriging
Kriging is a collection of general purpose approximation techniques for irregularly sampled
data points [13]. Its basic form, known as Simple Kriging, considers an estimator Kk for
8.4. APPROXIMATION TECHNIQUES 133
the random variable Zk that is a linear combination of the observed data values:
Kk = w T Y
where w is a S × 1 vector containing the coefficients of the linear combination associated

with position xk . The technique is based on the assumption that the mean and covariance
structure of the data is known. As before we suppose that the data has been preprocessed
so that the mean is zero. The covariance assumption means that we know E {Zk Ya }
and E {Ya Yb } for a, b ∈ {1, . . . , S}. However, this is all that is assumed about the prior
distribution of the data. In particular, there is no assumption that the data is necessarily
distributed according to a multivariate Gaussian.
It is impossible to calculate the posterior distribution because the precise prior distri-
bution is unknown. However, there is sufficient information to calculate the estimator with
the nicest properties among the resticted choice of purely linear estimators. More precisely,
w is chosen to achieve the minimum expected energy of the error between the estimate
and the true value. The expected energy of the error F is given by:

F = E (Kk − Zk )2
S S

S
= E wa Y a wb Yb − 2E wa Ya Zk + E {Zk Zk }
a=1 b=1 a=1

S
S
S
= wa E {Ya Yb } wb − 2 wa E {Ya Zk } + E {Zk Zk }
a=1 b=1 a=1
F is minimised by setting ∇w F = 0
∂F S
=2 E {Ya Yb} wb − 2E {Ya Zk } = 0 (8.5)
∂wa b=1
This gives a set of S linear equations that can be inverted to solve for w.
Now consider the approximation problem again. We can calculate
E {Zk Ya } = E {Zk (Za + Qa )}

= E {Zk Za } + E {Zk } E {Qa }
= E {Zk Za }
= Da
where we have used the fact that Zk and the measurement noise Qa are independent and
that the noise has zero mean. If a = b then similarly
E {Ya Yb } = E {(Za + Qa ) (Zb + Qb )}

= E {Za Zb }
= Ca,b
while if a = b
E {Ya Yb } = Ca,b + σ 2
Using these results we can rewrite equation 8.5 in matrix form as

2 C + σ 2 IS w − 2D = 0
and hence deduce that the solution is

−1
w = C + σ 2 IS D
The estimator corresponding to this minimising parameters is

−1
Kk = DT C + σ 2 IS y
This estimator is exactly the same as the Bayesian estimate based on a multivariate Gaus-
sian distribution.
We have shown the well-known [4] result that if the random process is a multivariate
Gaussian then the simple Kriging estimate is equal to the mean of the posterior distribution.
8.4.2 Radial Basis Functions

Recall that the Bayesian solution can be implemented by placing weighted impulses (in
an otherwise blank image) at the sample locations and then filtering this image with a
filter whose impulse response is given by the covariance function. Additionally, in noiseless
conditions (i.e. for interpolation) the S weights take precisely the values needed to honour
the S known data values (for simplicity we ignore the possibility of these equations being
degenerate).
The link with Radial Basis Functions (RBFs) is straightforward. An interpolated image
based on RBFs is assumed to be a linear combination of S functions where the functions
are all of the same shape and centered on the S known data points [97]. The weights in the
linear combination are chosen to honour the known values. In practice exactly the same
equations are used to solve RBF and Kriging problems and the equivalence between these
techniques is well-known.
8.4.3 Bandlimited interpolation

Another approach is to assume that the image is bandlimited (only contains low frequen-
cies) and then calculate the bandlimited image of minimum energy that goes through the
data points [37].
As before let F be a N by N matrix that represents the (2 dimensional) Fourier trans-
form. Define D to be a N by N diagonal matrix whose ath diagonal entry Daa is 1 if the
corresponding frequency is within the allowed band, but zero otherwise. Let Ẑ denote the
estimate produced by this bandlimited interpolation. Using this notation we can write:
Ẑ = argminZ∈Ω ZT Z (8.6)
where Ω is the space of images that are both band limited and honour the known obser-
vations.
Ω = {Z : T Z = y, (IN − D) F Z = 0}
In order to show the equivalence it is convenient to introduce two additional variables α

and σ that represent the degree to which the constraints are imposed. Assuming that the
limits are well behaved then equation 8.6 can be rewritten as
1 1
Ẑ = lim lim argminZ∈RN ZT Z + 2

T Z − y
2 + 2
(IN − D) F Z
2
α→0 σ→0 σ α
1 1
= lim lim argminZ∈RN Z Z + 2
T Z − y
+ 2 ZH F H (IN − D) (IN − D) F Z
T 2
α→0 σ→0 σ α
1 1 1
= lim lim argminZ∈RN Z Z + 2
T Z − y
+ 2
F Z
2 − 2
DF Z
2
T 2
α→0 σ→0
σ α α
1 1 1
= lim lim argminZ∈RN 1 + 2 ZT Z + 2
T Z − y
2 − 2
DF Z
2
α→0 σ→0 α σ α
where we have made use of the identity 2D − D 2 = D 2 . Now consider expanding the
following expression
2
2 1
(a(IN − D) + D) F Z = (IN − D) + D F Z
−1
a
1
=
(IN − D) F Z
2 +
DF Z
2
a2
1
= 2 ZH F H (IN − D) (IN − D) F Z +
DF Z
2
a
1 T 1
= 2 Z Z + 1 − 2
DF Z
2
a a

1 1
= 1 + 2 ZT Z − 2
DF Z
2
α α
√
where a = α/ 1 + α2 .
Finally consider the Fourier model from section 7.3.2 with a coefficient weighting matrix
Dα = a (IN − D) + D. If we define the prior pdf for Z with this Fourier model and assume
that we have white measurement noise of variance σ 2 then Bayes’ theorem can be used to
show that:
1
−2 log (p(Z|y)) + k(y) = 2

T Z − y
2 +
Dα F Z
2
σ
where k(y) is a function of y corresponding to a normalisation constant. The previous
algebra proves that the RHS of this equation is equal to the expression within the earlier
minimisation and we conclude that the estimate
−1 2

1 α
Ẑσ,α = argminZ∈RN 2
T Z − y
2 + √ (IN − D) + D F Z
σ 1 + α2
is equal to the MAP (Maximum A Posteriori) estimate using the Fourier model, which in
turn is equivalent to the multivariate Gaussian model. Additionally, for a Gaussian density
function the MAP estimate is equal to the posterior mean estimate. In the limit σ → 0 the
measurement noise is reduced to zero and Ẑσ,α becomes the interpolation solution. Finally,
in the limit as α → 0, the prior parameters for the Fourier model tend to Dα = D.
We conclude that the estimate produced by the bandlimited interpolation is equivalent
to interpolation for a particular Fourier model prior. Section 7.3.2 shows that this is
equivalent to the multivariate Gaussian with a covariance function given by a lowpass
filtered impulse (which will be oscillatory due to the rectangular frequency response).
8.4.4 Minimum smoothness norm interpolation

Choi and Baraniuk [23] have proposed a wavelet-based algorithm that finds the signal
that goes through the data points with minimum norm in Besov spaces. They write that
“the interpolated signal obtained by minimum-smoothness norm interpolation is difficult

to characterize, even if the noise in the signal samples is white Gaussian”. This section
briefly describes their algorithm and suggests a characterisation of the interpolated signal
as the sum of the posterior mean estimate for stationary interpolation (as described in
section 8.2) and an error term caused by shift dependence.
We give the definitions of Besov and Sobolev norms first in the notation of the paper
[23] and then rewrite the definitions in the notation of this dissertation. In the original
notation1 the Besov norm
f
Bqα (LP (I)) for a continuous-time signal f (t), t ∈ I = [0, 1] is
defined as
 q/p 1/q

f
Bqα (LP (I)) = 
uj0 ,k
qp + 2αjp 2j(p−2) |wj,k |p  (8.7)
j≥j0 k
where
"
1. The Lp (I) space is the set of all functions on I with bounded norm
f
pp = I
|f (t)|p dt.
2. The scale j0 represents the coarsest scale under consideration.
3. uj0 ,k is the value of the k th scaling coefficient.
4. wj,k is the value of the k th wavelet coefficient at scale j.
5. α, p, q are hyperparameters of the norm.
The paper concentrates on the special case of p = q = 2. In this case the Besov norm is
called the Sobolev norm and is written
1/2

f
W2α(L2 (I)) =
uj0,k
22 + |2αj wj,k |2 (8.8)
j≥j0 ,k
Let Z be a vector of signal samples from the (prefiltered) continuous-time signal f (t).
Prefiltering is used when converting from continuous-time to discrete-time. It is necessary
in order for the wavelet coefficients of Z to match the wavelet coefficients of f (t). We do
not consider prefiltering in this dissertation because we assume that data will always be
1
There are a number of different treatments of the scaling coefficients in Besov norm definitions. In
practical algorithms the choice is not important because the scaling coefficients are generally preserved.
We choose the given definition as being the most natural.
provided in sampled form. Further details can be found in the literature [113, 23]. As
before we define the wavelet coefficients (from a one dimensional wavelet transform) to be
given by w = W Z. We now define a diagonal matrix D to have diagonal entries Daa = 1
if wa is a scaling coefficient and Daa = 2αk if wa is a scale k wavelet coefficient. Armed
with this notation the Sobolev norm can be expressed as

f
W2α(L2 (I)) =
DW Z
(8.9)
Although we have defined the Sobolev norm in terms of the one dimensional transform,
the same equation describes the norm for two dimensional wavelet transforms if we use the
earlier notation for which Z is a vector representing an entire image, and W represents a
two-dimensional transform.
The algorithm [23] found (using a least squares calculation) the wavelet coefficients
with minimum Sobolev norm that interpolated the known points. Recall that the wavelet
direct specification defined the prior as

1
p(Z = z) ∝ exp −
DW z
2
2
The minimum smoothness norm solution is therefore equivalent to selecting the highest
probability image that honours the observations. We conclude that the solution is equal to
the MAP estimate using the wavelet direct specification to generate the prior. Section 7.3.3
showed that this is equivalent to a stationary Gaussian discrete random process assuming
that the wavelet transform is sufficiently shift invariant. We conclude that the minimum
smoothness norm interpolation is equivalent to solving the stationary interpolation problem
(with covariance function given by an inverse wavelet sharpened impulse) with the quality
of the solution determined by the amount of shift dependence.
The published paper made use of a fully decimated wavelet transform. Later we display
experiments comparing the performance of the DWT with alternative transforms to show
the considerable effect of shift dependence. Also recall from section 7.3.3 that for an
orthogonal transform the covariance function can also be expressed in terms of a wavelet
smoothed impulse.
We have described the minimum smoothness norm (MSN) solution as a Bayesian so-
lution to stationary interpolation plus an error caused by shift dependence. It may be
thought that this is unfair, that MSN should really be considered as performing non-
Gaussian interpolation and that what we have called the “error” due to shift dependence
is actually an additional term that makes the technique superior to standard methods. Our
defence to this criticism is that there is no prior information about the absolute location of
signal features and shifting the origin location should not affect the output. Therefore the
solution provided by this method should be considered as the average over all positions of
the origin, plus an error term due to the particular choice of origin position used in the
algorithm. This last statement may need a little further support as it could be argued that
the average always gives smooth answers while MSN will be able to model discontinuities
better. We finish this section by proving that the accusation will never hold.
More precisely, we show that when all the origin positions are considered the average
solution will always be a better estimate than using the basic MSN solutions. In this
context we measure the quality of a solution by means of the energy of the error. The
proof is straightforward. Let Z represent the true values of the signal (or image – this
proof is valid for both signals and images) and let Ẑi represent the MSN estimate for the
ith origin position (out of a total of NO possible positions). The average solution Ẑ0 is
defined as
1
NO
Ẑ0 = Ẑi
NO i=1
Then it is required to prove that the energy of the error for the average solution
Z − Ẑ0
2

O
is always less than the average energy for the individual solutions 1/NO N i=1
Z − Ẑi
.
2
The error for the individual solutions can be written as
1
NO

Z − Ẑi
2 =
NO i=1
1
NO

Z − Ẑ0 + Ẑ0 − Ẑi
2
NO i=1
NO T
1
=
Z − Ẑ0
+
Ẑ0 − Ẑi
+ 2 Z − Ẑ0
2 2
Ẑ0 − Ẑi
NO i=1
2 T
NO
1
NO
=
Z − Ẑ0
+2

Ẑ0 − Ẑi
+
2
Z − Ẑ0 Ẑ0 − Ẑi
NO i=1 NO i=1
1
NO
=
Z − Ẑ0
2 +
Ẑ0 − Ẑi
2
NO i=1
≥
Z − Ẑ0
2
where the inequality in the last line is strict unless the transform is shift invariant. We
have therefore shown that the energy of the error will always be greater (when averaged
over all origin positions) if the MSN solution is used rather than the smoothed solution.
Note that this proof is very general, in particular note that no assumption had to be made
about the true prior distribution of the data. The result is equally valid for good and bad
models. It does not claim that the average solution will be a good solution, but it does
show that it will always be better than the shift dependent solutions.
8.4.5 Large spatial prediction for nonstationary random fields
Nychka et al. [88] have proposed a method for dealing with nonstationarity by using
the W transform basis [67]. Nychka allows the weighting factors to vary within a single
scale, and this variation produces nonstationary surfaces. Equation 8.4 shows that the
2
solution involves inverting the matrix C + σM IS . This matrix can be very large and so
is hard to invert. However, the matrix C can be written as P D 2 P T (using the notation
of section 7.3.4) and so multiplication by C can be efficiently calculated using wavelet
transform techniques. Nychka makes use of this result by solving using a conjugate gradient
algorithm. Such an algorithm only uses forward multiplication by C and so can be much
more efficient than inverting C. If C is a square matrix of width S then in the worst
case the gradient algorithm takes S steps to converge to the solution, but usually the
convergence is much faster than this, especially if some preconditioning methods are used.
To generate conditional simulations of a surface they use the method described in sec-
tion 8.8.1 which requires a Kriging-style interpolation to be performed for each realisation.
If their method takes K iterations of the conjugate gradient algorithm to converge, then
to generate P realisations they will require 2KP wavelet transforms.
8.4.6 Comparison with Spline methods
This section discusses the link with B-spline methods. B-splines have been proposed for
solving both interpolation and approximation problems. We first give an overview of the
technique and then discuss the methods from a Bayesian perspective. The main source for
the description of splines is [119] while the Bayesian interpretation is original.
B-spline
A B-spline is a continuous piecewise polynomial function. From a signal processing point

of view B-splines can be produced by repeatedly filtering a centred normalized rectangular
pulse of width d with itself. More precisely, a B-spline of order n can be generated by
convolving this rectangular pulse with itself n times. A zero order spline will therefore be
the square pulse and a first order spline will be a triangular pulse. Higher order splines
converge to a Gaussian “bell” shape. For the uniform sampling case the width d is chosen
to be the distance between data points and the height so that the spline has total area
equal to 1.
The B-splines can be used for either interpolation or approximation. For interpolation
the problem is to choose the spline coefficents so that the reconstructed signal passes
through the data points. These coefficents can be calculated by applying a simple IIR filter
to the data. If the coefficients are represented as delta functions (of area equal to the value
of the coefficients) at the appropriate locations it is possible to construct the interpolation
by filtering with the B-spline function. For high order splines this combination of the IIR
filter with the B-spline filtering can be viewed as a single filtering operation applied to the
original data points (also represented as impulses). This combined filter is called a cardinal
spline (or sometimes the fundamental spline) of order n and converges to a sinc function
that effectively performs bandpass filtering of the signal to remove aliased components of
the signal.
There are two main techniques for approximation. The first is called smoothing splines
approximation and involves minimising the energy of the error in the approximation plus
an additional energy term. This second energy term is the energy of the r th derivative of
the approximation.
The second technique is a least squares approximation and is derived by restricting the
number of spline coefficients that are to be used to generate the approximation, and then
solving for the least energy of the error.
We discuss each of the three (one for interpolation, two for approximation) main tech-
niques from a Bayesian perspective. Each technique produces an estimate for the true
signal, and we attempt to describe the prior model for the signal that would produce the
same estimates.
Interpolation
The interpolation solution of order n consists of a spline with knot points at the data
points. The similarity to the Bayesian estimate described in section 8.3 is clear in that the
solution is produced by filtering impulses at the data points with the weights chosen by
the requirement of exactly fitting the data. The Bayesian interpretation is that the prior
for the signal is a zero mean wide sense stationary discrete Gaussian random process with
covariance function equal to the B-spline function of order n.
B-spline filtering of order n is given by smoothing with a unit pulse n + 1 times.
Therefore for high orders the high frequencies become less and less likely a priori and the
solution will use the lowest frequencies possible that satisfy the data points. This solution
will therefore converge to the band-limited signal and hence the cardinal splines will tend
to the sinc interpolator.
Smoothing splines
Given a set of discrete signal values {g(k)}, the smoothing spline ĝ(x) of order 2r − 1 is
defined as the function that minimizes

+∞ # +∞ 2
∂ r ĝ(x)

2S = (g(k) − ĝ(k)) + λ
2
dx (8.10)
−∞ ∂xr
k=−∞
where λ is a given positive parameter. Schoenberg has proved the result that the minimising
function (even for the general case of non-uniform sampling) is a spline of order n = 2r − 1
with simple knots at the data points[105]. By analogy with the interpolation case it
is tempting to think that smoothing splines will correspond to a random process prior
with covariance function equal to a 2r − 1 B-spline and measurement noise depending
on λ. However, while the analogy is reasonable when close to the sample points, it is
inappropriate at long distances. The problem is that the integral is zero for polynomials of
degree less than r and hence there is no prior information about the likely low order shape
of the signal. Therefore, except for very careful choices of the data values, the smoothing
spline estimate at long distances will tend to infinity.
Least squares
Least squares techniques are equivalent to the Bayesian maximum a posteriori (MAP)
estimate when we assume that there is a flat prior for each parameter. For the least
squares spline approach we can also roughly interpret the restricted choice of coefficients
as indicating that we know a priori that the original image should be smooth and contain
only low frequencies. The least squares approach can be thought of as calculating a MAP
estimate based on this prior and the observation model of additive white Gaussian noise.
There is an interesting way of seeing that such an estimate cannot be shift invariant.
The explanation is somewhat convoluted and not needed for the rest of this disseratation.
If the following is confusing then it can be safely ignored.
First note that the B-splines are linearly independent. This is clear because in any set
of B-splines we can always find an “end” B-spline whose support is not covered by the rest
of the splines. In other words, a B-spline at the end of the set will contain at least one
location k such that
1. this “end” B-spline is non-zero at location k,
2. all the other B-splines are zero at location k.
Now suppose we have the smallest non-empty set of B-splines that possesses a non-trivial
(i.e. not all coefficients equal to zero) linear combination that is equal to zero at all
locations. Clearly the coefficient of the “end” B-spline must be zero to avoid having a
non-zero value at location k and hence we can construct a smaller set by excluding the
“end” B-spline. This contradiction proves that the B-splines are linearly independent.
Next suppose that we have a discrete grid containing N locations. The consequence of
the linear independence of B-splines is that we must be able to model any image if we can
choose the values of all N B-spline coefficients.
The least squares approach models the data using a set of K spline coefficients where
K is less than the number of observations. In particular, note that the model includes the
case of an image generated from a single non-zero spline coefficient. If the prior is shift
invariant then the model must also include the cases of non-zero coefficients centered on
any location. However, if we are allowed to use all N B-spline coefficients then our previous
claims show that we can make any image we want, including an image that interpolates
all the observations.
We have argued that a shift invariant least squares method must interpolate the ob-
servations. The least squares spline approach is an approximation method and hence we
conclude that it must be shift dependent.
This argument is included for interest only and is not meant to be a rigorous mathe-
matical argument (for example, a rigorous treatment would need to consider edge effects).
8.5 Wavelet posterior distribution

We now change track and describe an approximation scheme based on the wavelet gener-
ative model for images (see 7.3.4). This section derives the posterior distribution for the
images using Bayes’ theorem. The following sections will describe an efficient solution,
discuss the choice of wavelet, and show results of some numerical experiments that test
our predictions.
Recall that the wavelet generative specification uses the prior

Z s N 0, P D 2P T
where P represents the wavelet reconstruction transform and D a diagonal weighting ma-
trix. Instead of modelling the values at every point on the surface, it is better to model the
wavelet coefficients directly, with the surface indirectly defined as the reconstruction from
these wavelet coefficients. In wavelet space the wavelet generative specification corresponds
to the prior

w s N 0, D 2
where w is a column vector containing all the wavelet coefficients (with the real and
imaginary parts treated as separate real coefficients).
Now we wish to derive the posterior distribution for the wavelet coefficients. Suppose
we have measurements y1 , y2 , . . . , yS which we stack into a column vector y and that the
measurement noise is independent and Gaussian of variance σ 2 and mean zero. Let T be
a matrix of ones and zeros that extracts the values at the S measurement locations.
We can now use Bayes’ theorem
p(y|w)p(w)
p(w|y) =
p(y)
∝ p(y|w)p(w)
The likelihood p(y|w) is the pdf that the measurement errors are y − T P w, and so we can
write the likelihood as

1
p(y|w) ∝ exp − 2 (y − T P w) (y − T P w)
T
2σ
8.6. METHOD FOR WAVELET APPROXIMATION/INTERPOLATION 145
The prior for the wavelet coefficients is a multivariate Gaussian distribution of mean zero
and variance D 2 and so the prior pdf can be written as

1 T −2
p(w) ∝ exp − w D w
2
We can then calculate the posterior and use lemma 3 of appendix B to simplify the equa-
tions

1 1 T −2
p(w|y) ∝ exp − 2 (y − T P w) (y − T P w) exp − w D w
T
2σ 2

1 T −2
∝ exp − w D + P T T P/σ w + w P T y/σ
T T 2 T T T 2
2

1
∝ exp − (w − a) A(w − a)
T
2
where
A = D −2 + P T T T T P/σ 2 (8.11)
a = A−1 P T T T y/σ 2 (8.12)
and so we have shown that the posterior distribution for the wavelet coefficients is a
multivariate Gaussian with mean a and variance A−1 .
8.6 Method for wavelet approximation/interpolation

This section describes the wavelet method for estimating an image given a set of data
points and estimates of the measurement noise and the covariance structure of the image.
The method uses the following five steps:
1. Estimate the amount of energy we expect within each scale.
2. Calculate which wavelet coefficients are important.
3. Calculate the responses at the measurement locations to impulses in the wavelet

coefficients (matrix T P ).
4. Solve for the posterior mean estimates of the wavelet coefficients.
5. Reconstruct an estimate for the entire image.
The following sections describe how each of these steps can be performed efficiently.
8.6.1 Estimating scale energies

We need to calculate the diagonal matrix D that defines our prior model for the image. The
choice will depend on the prior information we have available and will therefore depend
on the application. For example, suppose we have a prior estimate for the covariance
structure of the data. One very crude method of choosing D would be to generate images
according to the wavelet generative model for several choices of D, measure the covariance
for each image, and select the choice that did best. A better method uses a result from
section 7.3.4 that proves that the covariance will be given by a simple combination of the
covariances from each wavelet subband. The covariance for each subband can be calculated
from the autocorrelation of the corresponding reconstruction wavelet and a simple least
squares method will allow a good choice of scaling factors to be found. This process of
model fitting and more sophisticated approaches can be found in [110].
8.6.2 Important wavelet coefficients

It is clear that if the support of a wavelet coefficient does not overlap with any of the data
points then the wavelet coefficient does not affect the likelihood of the measured data and
so it will have a posterior distribution equal to its prior distribution, and in particular its
mean value will be zero.
We can greatly reduce the dimension of the problem by leaving out all such unimportant
wavelet coefficients.
We define an importance image to be an image that is zero except at the measurement
locations. One quick way to determine the important coefficients is to transform such
an importance image containing ones at the measurement locations and zeros elsewhere.
Non-zero coefficients mean that the coefficient is important. We will define the important
coefficients as those whose absolute value is greater than some threshold. This threshold
is initially 0 but section 8.8.3 describes how alternative values permit a trade between
accuracy and results.
If the wavelets used in this method have some negative values then it is possible for
this method to miss important coefficients if the responses from different points cancel out.
This is unlikely to happen but can be partially guarded against by transforming another
importance map containing random positive numbers between 1 and 2 at the measurement
locations. Any additional important coefficients found can be added to the list.
8.6.3 Impulse responses

The impulse responses for the reconstruction filters depend only on which subband is being
inverted. This means that we can generate a single lookup table for each subband which
allows us to generate the i, j element of T P by calculating the position of the ith data
point relative to the j th wavelet coefficient and accessing the lookup table at this relative
position. This gives a fast generation of T P .
8.6.4 Solving for wavelet coefficients

Using sparse matrix methods we can quickly generate the matrix A directly from equation
8.11 and then solve the equations
Aa = P T T T y/σ 2
using Gaussian elimination which is fast for sparse matrices [98].
8.6.5 Reconstruct image estimate

The reconstruction is a straightforward application of the inverse wavelet transform using
the values in a to determine the important coefficient values, and 0 in all the unimportant
coefficients.
8.7 Choice of wavelet

Section 7.4 discussed some general principles concerning the choice of transform in the
wavelet generative specification. This section examines the effect on the speed of the
method and investigates the significance of shift dependence. Section 8.7.2 contains a
theoretical discussion about the probable significance of shift dependence, while section
8.7.3 contains some experimental results that test the predictions of the theory. Section
8.7.4 discusses the results of these comparisons.
8.7.1 Speed
The main computational burden is the solution of the linear equations. The number of
equations is given by the number of measurement locations plus the number of important
wavelet coefficients. All the decimated systems will have a similar number of important
coefficients, but the non-decimated system will have very many more. To illustrate this
we generated 20 random sample locations for a 128 by 128 image and counted the number
of important coefficients for a 4 scale decomposition. Table 8.1 shows the results. Notice
that the lack of subsampling in the NDWT produces about ten times more important
coefficients than the DT-CWT and will therefore be much slower.
Transform number of important coefficients

DWT 1327
NDWT 40860
WWT 372
DT-CWT 4692
GPT 540
Figure 8.1: Count of important coefficients for different transforms
8.7.2 Shift Invariance

In this section we use simple approximations to predict the effect of shift dependence on
the quality of the interpolated results. It is important to distinguish two types of quality.
The first type, the aesthetic quality, measures quality relative to the shift invariant solution
produced with a nondecimated version of the wavelet transform. If SN DW T is the energy
of the shift invariant estimate (we asssume that the mean is 0) and Eshif t is the average
energy of the error for a shift dependent estimate (relative to the shift invariant estimate)
then the aesthetic quality is defined in decibels as
SN DW T
QA = 10 log10
Eshif t
We call this the aesthetic quality because the shift invariant methods tend to give the
nicest looking contours (of constant intensity). These contours are nice in the sense that
they do not have the artefacts (of arbitrary deviations in the contour) seen in figure 7.4
produced by aliasing.
The second type, the statistical quality, measures quality relative to the actual surface
being estimated. Even the shift invariant estimate will only approximate the unknown
surface and it may be the case that the errors due to this statistical uncertainty are much
larger than the errors due to shift dependence. If SN DW T is the energy of the shift invariant

estimate and EN DW T is the average energy of the error between the shift invariant estimate
and the surface being estimated then the statistical quality is defined in decibels as
SN DW T
QS = 10 log10
EN DW T + Eshif t
To predict these qualities we need to estimate a number of energies.
First consider the simple interpolation case when the mean of the data is 0 and the
correct value is known at a single point. We have already claimed that:
1. The (posterior mean) estimate will be a scaled version of the covariance function
(section 8.3).
2. The covariance function for the wavelet method will be given by smoothed impulses
(section 7.3.4).
It is therefore reasonable to expect the aesthetic quality (the degradation caused by shift
dependence) to be equal to the measured degradation for smoothed impulses given by the
values in table 7.3.
Now consider the multiple data point case. Widely spaced sample locations will natu-
rally lead to a proportionate increase in the shift dependence error energy. However, there
are two main reasons why this may not hold for closer points:
1. The errors may cancel out.
2. There is a correction applied to the size of the impulses so that the interpolated
image will honour the known values.
The first reason may apply when there are two points close together. The aliasing terms
could cancel out to give a lower error, but it is just as likely that they will reinforce each
other and give an even higher energy error than the sum. In general this effect is not
expected to greatly change the shift dependence error.
The second reason is more important. The correction ensures that there will be zero
error at the sample values. This zero error has two consequences; first, that there is
no uncertainty in the value at the point and, second, that naturally there is zero shift
dependence error at the point. This effect will reduce the amount of shift dependence
error as the density of data points increases. In the limit when we have samples at every
location then there will be no shift dependence error at all as the output is equal to the
input.
Although the precise amount of error will depend on the locations of the sample points
and covariance function it is still possible to obtain a rough estimate of the amount of
error that reveals the problem. Consider solving an interpolation problem with a standard
decimated wavelet transform for a grid of N by N pixels. At level k and above there
will be N 2 41−k wavelet coefficients. Now suppose that we have N 2 41−p sample points
spread roughly evenly across our grid (for some integer p ≥ 1). For a standard decaying
covariance function these points will define the coarse coefficients (those at scales k > p)
fairly accurately but provide only weak information about the more detailed coefficients.
The coefficients at scale k = p will have on average about one sample point per coefficient.
These coefficients will therefore tend to produce about the same amount of shift dependence
as in the single sample case weighted by the proportion of energy at scale p. The statistical
uncertainty in the estimates will be roughly the amount of energy that is expected to be
found in the coefficients at scale p and the more detailed scales.
Let Ep be the expected energy of the coefficients at scale p. Let E≤p be the total
expected energy of the coefficients at scales 1 to p. Let r be the ratio Ep /E≤p . This ratio
will be close to one for rapidly decaying covariance functions.
The discussion above suggests that, approximately, the statistical uncertainty will cor-
respond to a noise energy of E≤p , while the shift dependence will correspond to a noise
energy of f Ep where f is the measure of the amount of shift dependence for the transform.
An estimate for the aesthetic quality is therefore:
SN DW T
QA ≈ 10 log10
f Ep
SN DW T
= 10 log10
f rE≤p
≈ Q0 − 10 log10 f − 10 log10 r
where Q0 ≈ 10 log10 (SN DW T /E≤p ) is the expected statistical quality of the shift invariant
estimate.
The aesthetic quality is therefore predicted to be the values in table 7.3 with an offset
given by the statistical quality of the estimate plus a constant depending on r. The offset
is the same whatever the choice of transform and hence the different transforms should
maintain a constant relative aesthetic quality. For example, the table gives a value for
−10 log10 f of about 32dB for the DT-CWT, but only 6.8dB for the DWT and we would
therefore expect the aesthetic quality for the DT-CWT to be about 25dB better than for
the DWT.
As the density of points increases the statistical quality and hence the aesthetic quality
will also increase.
By using the approximation log(1 + x) ≈ x (valid for small |x|) we can also write a
simple approximation for the statistical quality:

SN DW T
QS = 10 log10
E≤p + f Ep

SN DW T E≤p
= 10 log10
E≤p E≤p + f rE≤p
= Q0 − 10 log10 (1 + f r)
10
≈ Q0 − fr
log 10
In order to judge the significance of this we must know values for f and r. The measure of
shift dependence f has been tabulated earlier converted to decibels. This is convenient for
the aesthetic quality formula, but for the statistical quality we need to know the precise
value of this factor. For convenience, the actual values are shown in table 8.2. Now
Transform K=1 K=2 K=3 K=4

DWT 0.21 0.21 0.21 0.21
NDWT 0 0 0 0
WWT 0.044 0.018 0.015 0.014
DT-CWT 0 0.0011 0.0005 0.0006
GPT 0.0021 0.0005 0.0003 0.0003
Figure 8.2: Shift dependence for different scales.
suppose that r = 1/2. This is roughly the value for the covariance function plotted in
figure 7.4 because at each coarser level there are four times fewer coefficients, but σl2 is
eight times larger. There is therefore approximately twice the energy at the next coarser
level than just the previous level, and hence approximately equal energy at the next coarser
level to all the previous levels.
Substituting for the values in equation 8.13 allows the prediction of the reduction in
statistical quality caused by shift dependence. For the DWT the predicted reduction is
−(10/ log(10))(0.21)(0.5) = −0.46dB while for the WWT it is about −0.03dB and for the
DT-CWT it is only about −0.001dB.
These estimates are not very trustworthy due to the large number of approximations
used to obtain them, but they do suggest that we should expect shift dependence to cause
a significant decrease in both statistical and aesthetic quality when the standard wavelet
transform is used.
8.7.3 Experiments on shift dependence

We measured the statistical and aesthetic qualities for the DT-CWT and the DWT for a
variety of sample densities.
The wavelet generative model (section 7.3.4) was used to generate the data. In order
to produce shift invariant surfaces we use the NDWT transform in the generation. We use
equation 7.4 to define the scaling factors with the same choice of σl values as in section
7.4.5. The sample locations were arranged in a grid with equal horizontal and vertical
spacing between samples. Let this spacing be s pixels. The sample locations were at the
points {(as + δx, bs + δy)} within the image where a, b ∈ Z. The constants δx, δy ∈ Z
effectively adjust the origin for the transforms. For each realisation of the surface these
values were chosen uniformly from the set {0, 1, . . . , 15}. To avoid possible edge effects we
measured energies averaged only over a square grid of size 3s by 3s centred on a square of
four data points away from the edges. For a range of spacings we performed the following
procedure:
1. For i ∈ {1, 2, . . . , 32}:
(a) Generate a random surface Zi of size 128 by 128.
(b) Generate random values for δx, δy.
(c) Sample the surface at the points {(as + δx, bs + δy) : a, b ∈ Z} that are within
the image.
(d) Interpolate the sampled values using the method described in section 8.6. The
interpolation is repeated for three different transforms; the NDWT, the DWT,
and the DT-CWT.
(e) Measure the energies needed to calculate the measures of quality:

• SN DW T,i the energy of the NDWT solution.

• EN DW T,i the energy of the error between the NDWT estimate and the orig-
inal image values.
• EDW T,i the energy of the error between the DWT estimate and the NDWT
estimate.
• EDT −CW T,i the energy of the error between the DT-CWT estimate and the
NDWT estimate.
2. Average the energies over all values of i. For example,

1
32
SN DW T = SN DW T,i
32 i=1
3. Calculate the relative statistical and aesthetic qualities based on the averaged ener-
gies.
The aesthetic quality for the DWT is calculated as QA = 10 log10 (SN DW T /EDW T ) and
similarly for the DT-CWT. In order to highlight the difference in the absolute statistical
quality we compute a relative quality measure RS defined as the difference between the
statistical quality for the shift dependent estimate and the statistical quality of the shift
invariant estimate Q0 :
RS = QS − Q0
SN DW T SN DW T
= 10 log10
− 10 log10
EN DW T + EDW T EN DW T

EN DW T
= 10 log10 .
EN DW T + EDW T
RS will be a negative quantity that measures the loss of statistical quality caused by shift
dependence. Figure 8.3 plots the aesthetic quality against the density of points. Figure
8.4 plots the relative statistical qualtity. In both figures a cross represents the DT-CWT
estimate while a circle represents the DWT estimate. Results are not shown for the NDWT
since the definitions ensure that this transform will always have an infinite aesthetic quality
and a zero relative statistical quality.
8.7.4 Discussion of the significance of shift dependence

The simple theory we proposed predicted (in section 8.7.2) that the aesthetic quality for the
DT-CWT would be about 25dB better than for the DWT. The experimental results shown
45
40
35
Aesthetic quality /dB 30
25
20
15
10
5
−4 −3 −2 −1 0
10 10 10 10 10
Samples per pixel
Figure 8.3: Aesthetic quality for DWT(o) and DT-CWT(x) /dB
−0.1
−0.2
Relative statistical quality/dB
−0.3
−0.4
−0.5
−0.6
−0.7
−4 −3 −2 −1 0
10 10 10 10 10
Samples per pixel
Figure 8.4: Relative statistical quality for DWT(o) and DT-CWT(x) /dB
in figure 8.3 suggest that the improvement in aesthetic quality is actually about 20dB
for the DT-CWT. Considering the large number of approximations made in predicting
the value this is a reasonable match. The absolute value of the aesthetic quality for the
DWT varies from about 20dB for high densities to 9dB for low sample densities. These
experimental results confirm that aesthetically the quality of the DWT is low, while the
DT-CWT gives a much improved quality.
The theory predicted that the relative statistical quality for the DWT would be about
−0.46dB (this corresponds to an error of about 11%). The results in figure 8.4 suggest
that the relative quality is only about −0.5dB for very high sample densities, but that
for lower densities the relative quality becomes much smaller (in magnitude). This is a
fairly poor match with the predicted value. Nevertheless, the experiments confirm the the
qualitative prediction that the DWT has an appreciably lower statistical quality (of around
0.25dB,6%) while the DT-CWT has almost negligible errors (less than 0.01dB,0.2%) caused
by shift dependence.
Finally we discuss the expected effect of some of the approximations on the discrepancy
between the predicted and observed results for the relative quality. Bear in mind that a
larger (in magnitude) relative quality means worse results.
1. We use 4 level transforms. The theory only applies when the critical level is close to
one of these levels. For densities lower than about 1 in 162 = 256 the sample positions
are so widely spaced that they will have little effect on each other. The estimates
will be very uninformative and the statistical error will be roughly constant (and
equal to the variance of the original image). However, the shift dependence energy
will be proportional to the density, and therefore for low densities the shift depen-
dence will be relatively insignificant and so the relative quality improves (decreases
in magnitude).
2. The estimate that r = 1/2 is very crude. For the most detailed scale E1 = E≤1 and
the ratio must be one. For the other scales the ratio will be somewhere between 1
and 1/2. The effect of this is to predict that the relative quality will actually be
worse (larger in magnitude) than −0.46dB for high sample densities.
3. We assume that the wavelet coefficients at scales coarser than p are accurately es-
timated. In practice there will still be some error in these. This effect will tend to
increase the statistical error and hence decrease the significance of the shift depen-
dence. Therefore this will produce a slight improvement (decrease in magnitude) in
the relative quality.
4. We assume that the wavelet coefficients at scales ≤ p are inaccurately estimated.

In practice there will still be some information in these. This effect will tend to de-
crease the statistical error and hence increase the significance of the shift dependence.
Therefore this will produce a worse (larger in magnitude) relative quality.
5. We assume that the level p coefficients will produce an expected shift dependent
energy of f Ep . Ep is the expected energy of the level p coefficients in the prior but
the actual energy of the coefficients in the interpolated image will tend to be less
than this due to the limited information available. This is because the estimate is
given by the mean of the posterior distribution. The mean of the prior distribution
is zero and when there is little information the mean of the posterior will also be
close to zero. The effect of this is to predict less shift dependence and hence a better
(smaller in magnitude) relative quality.
The most significant of these effects are probably the first and last.
8.8 Extensions
8.8.1 Background
The Kriging mean estimate gives biased results when estimating a nonlinear function of
the image [130] (such as the proportion of pixels above a certain threshold). It is better
to generate a range of sample images from the posterior and average the results. In the
context of Kriging approximation methods this generation is called conditional simulation
[36] and works as follows:
1. Construct a queue of locations that cover a grid of positions at which we wish to

estimate the intensity values.
2. Remove a point from the queue and use Kriging to estimate the mean and variance
of the posterior distribution at that point conditional on all the data values and on
all the previous locations for which we have estimated values.
8.8. EXTENSIONS 157
3. Generate a sample from the Gaussian distribution with the estimated mean and
variance and use this to set the value at the new point.
4. Repeat steps 2 and 3 until the queue is empty.
This process generates a single sampled image and can be repeated to generate as many
samples as are desired. Each step in this process involves inverting a square matrix whose
width increases steadily from the number of measurements to the number of locations we
wish to know about. This calculation involves a huge amount of computation for more
than a few hundred locations.
As described, the estimate for the value at a position is based on all the data points and
all the previously calculated points but to increase the speed of the process it is possible to
base the estimate just on some of the close points. It is necessary to include points up to
the range of the variogram if the results are to be valid and again the computation rapidly
becomes prohibitively long.
A multiple grid approach has been proposed [41] that first simulates the values on a
coarse grid using a large data neighbourhood and then the remaining nodes are simulated
with a finer neighbourhood. Our approach can be viewed as a multigrid approach, with
the additional advantage of wavelet filtering to give better decoupling of the different
resolutions making our method faster and more accurate.
An efficient simulation method has been proposed [57, 89] that first generates a random
image with the correct covariance function (but that does not depend on the known data
values), and then adds on a Kriged image based on the data values minus the values of the
random image at the known locations. This will generate a conditional simulation of the
surface and the process can be repeated to generate many different realisations. We now
describe a similar method that acts in wavelet space.
8.8.2 Proposal
In order to efficiently calculate image samples we calculate samples of the wavelet coeffi-
cients and then apply the wavelet reconstruction transform. It is easy to generate samples
of the unimportant wavelet coefficients (whose posterior distribution is the same as their
prior) by simulating independent Gaussian noise of the correct variance.
To generate samples of the important wavelet coefficients consider solving the equations
T
T P/σ
AZ = P T T T y/σ 2 + R
D −1
where R is a vector of random samples from a Gaussian of mean zero and variance 1,
with length equal to the number of measurements plus the number of important wavelet
coefficients. Recall from equation 8.11 that A = D −2 + P T T T T P/σ 2 .
The vector of random variables Ẑ given by
 T 
T P/σ
Ẑ = A−1 P T T T y/σ 2 + R
D −1
will have a multivariate Gaussian distribution and noting that A is a symmetric matrix we
can simplify as follows:
 T 
T P/σ T P/σ
Ẑ ∼ N A−1 P T T T y/σ 2 , A−1 −1 −1
A−1 
D D

∼ N a, A−1 D −2 + P T T T T P/σ 2 A−1

∼ N a, A−1
This shows that such solutions will be samples from the posterior distribution for the
wavelet coefficients. The sparsity of A means that Gaussian elimination allows us to
quickly solve these equations. Gaussian elimination is equivalent to LU factorisation (the
representation of a matrix by the product of a lower triangular matrix L with an upper
triangular matrix U) and so we can generate many samples quickly by calculating this
factorisation once and then calculating Ẑ for several values of R. This is fast because
triangular matrices can be quickly inverted using back substitution.
A similar method of LU factorisation has been used to quickly generate many samples
for the Kriging Conditional Simulation method [2] but this method can only simulate a
few hundred grid nodes before the cost becomes prohibitive. We have generated simulated
images with a quarter of a million grid nodes using the wavelet method in less than a minute
on a single processor. The improvement is possible because the wavelet transform achieves
a good measure of decorrelation between different ranges of the covariance function and
so can interpolate each scale with an appropriate number of coefficients. We also get a
very sparse set of equations which can be solved much faster than the fuller system that
Kriging methods produce.
8.8. EXTENSIONS 159
8.8.3 Trading accuracy for speed

We examine the trade-off between accuracy and speed by adjusting the threshold used
to determine the importance map. An increased threshold makes the equations become
sparser, less correlated, and thus easier to solve. This trade-off only affects the time for
setting up and solving the equations; the time for determining the importance map and
inverting the wavelet transform depends only on the size of the grid. We use a grid of 512
by 512 and generate a mean posterior estimated image using the wavelet method of section
8.6. We calculate the approximation while adjusting the threshold mentioned in section
8.6.2 and measure the total time taken to produce this image (including the determination
of the importance map and the wavelet transforms). The correct image C is defined to be
that produced by the method with zero threshold. For each output image S we define the
signal to noise ratio (SNR) to be

2
C ij
SNR = 10 log10

i j (8.13)
j (Sij − Cij )
2
i
For example, a SNR of 30dB is equivalent to saying that the energy of the error is only
0.1% of the energy of the surface.
We perform the experiment twice, once with 128 measurements, and once with 256
measurements.
Figure 8.5 plots the time taken for the interpolation versus the SNR of the results. For
some of the points we have also displayed the associated threshold level. There is also a
horizontal dashed line drawn at the time taken for a threshold of zero.
It can be seen that the computation decreases for minor increases in the threshold while
producing little additional error. For example, a threshold of 0.3 reduces the time by a
factor of 3 while giving a SNR of 27dB.
Figure 8.6 shows the results of the experiment when we have twice as many data points.
The same computation decrease is evident. Consider the threshold of 0.3. For 128 mea-
surements, the interpolation takes about 6 seconds (SNR=27.2dB). For 256 measurements,
it takes about 14 seconds (SNR=29.3dB). We also timed an interpolation of 512 measure-
ments which took 32 seconds (SNR=33.8dB). This is not quite linear, but notice that
the same threshold gives greater accuracy with more measurements. Therefore when we
have more measurements we can reduce the computation by using a higher threshold while
maintaining the same accuracy. On balance the amount of computation is roughly linear
20
Threshold=0.04
15
Time for interpolation

Threshold=0.1
10
Threshold=0.2
Threshold=0.3
5
0
0 20 40 60 80 100
SNR /dB
Figure 8.5: Computation time versus SNR (128 measurements).
50
Threshold=0.01
Threshold=0.02
Time for interpolation
40 Threshold=0.04
30 Threshold=0.1
20 Threshold=0.2
Threshold=0.3
10
0
0 20 40 60 80 100
SNR /dB
Figure 8.6: Computation time versus SNR (256 measurements).

8.9. DISCUSSION OF MODEL 161
in the number of measurements for a constant accuracy.

In contrast, Kriging scales very badly with the number of data points. The bottle-neck
in Kriging is the inversion of a matrix. To solve a system with S measurements involves
the inversion of a S by S matrix, and such an inversion involves computation roughly cubic
in the number of points. For example, inverting a 256 by 256 matrix in Matlab takes 0.43
seconds, a 512 by 512 matrix takes 5.9 seconds and a 1024 by 1024 matrix takes 50 seconds.
For any problem that is taking a considerable amount of time to solve it will clearly be
preferable to use the new wavelet method.
8.9 Discussion of model

This section discusses the effect of the following assumptions. The model assumes that:
1. the image is a realisation of a 2D stationary discrete Gaussian random process,
2. the mean and covariance of the random process are known,
3. independent Gaussian noise corrupts the measurements,
4. the variance of the measurement noise is known,
5. the measurements are at distinct locations,
6. the measurements lie on grid positions.
The first assumption is the most significant. For most applications the original data
will only be approximately modelled as a stationary Gaussian random process. Often more
information may be known about the likely structure of the data and a more sophisticated
model using this information will almost certainly give better results but will probably also
require much more computation to solve.
We would expect an assumption of Gaussian measurement noise to be reasonably ac-
curate in most cases even for non-Gaussian noise distributions of zero mean and equal
variance. However, in certain circumstances this expectation may not hold. Two examples
are:
1. Infinite variance noise processes (such as alpha-stable noise).

2. A possible measurement model is that a certain proportion of the measurements

are accurate, but that the others are badly corrupted. In this case methods that
jointly estimate the surface and the reliable measurements should be able to give
significantly better results.
The assumption of known mean and covariance of the process will almost never be true,
but there are many methods available for obtaining estimates of these parameters [110].
Results in the literature related to radial basis functions [97] suggest that the precise shape
of the covariance function only has a small effect on the results and therefore we do not
expect the errors in the parameter estimates to be significant. The same argument applies
to the variance of the measurement noise.
The restriction that the measurement locations are at distinct locations is an unimpor-
tant constraint that is needed only for interpolation. The problem is that it is impossible
to interpolate two different values at the same position. A simple approach to the problem
is to replace repeats by a single sample with value given by the average of the repeats. For
approximation all the methods work equally well even with the original samples without
this constraint.
The assumption that the sample locations lie on grid points is another unimportant
constraint because approximating sample locations by the nearest grid point should give
sufficiently accurate results for most applications.
8.10 Conclusions
The first part of this chapter considered alternative interpolation and approximation tech-
niques from a Bayesian viewpoint. We argued that Kriging, Radial Basis Functions, Ban-
dlimited interpolation, and spline interpolation can all be viewed as calculating Bayesian
posterior mean estimates based on particular assumptions about the prior distribution for
the images. More precisely, each method can be viewed as assuming a stationary discrete
Gaussian random process for the prior, with the only theoretical difference between the
methods being the assumed covariance function.
The reason that this is important for complex wavelets is that our proposed method
based on the DT-CWT uses a prior of the same form. Our method has a number of pa-
rameters associated with it and these parameters could be tuned in order that the complex
wavelet method is an approximate implementation of any of the previously mentioned in-

terpolation techniques. However, it is more useful to use the freedom to tune the DT-CWT
method for a particular application.
The first part also pointed out problems with some other techniques:
1. Smoothing spline estimates tend to infinity when extrapolated away from the data
points.
2. Least squares spline approximation techniques cannot be shift invariant.
3. The minimum smoothness norm interpolation based on a decimated wavelet will

suffer from shift dependence.
We also prove from a Bayesian perspective that shift dependence will always be an addi-
tional source of error in estimates.
The second part of the chapter proposed a wavelet method for interpolation/approximation.
We discussed and predicted the effect of shift dependence on measures of aesthetic and sta-
tistical quality. These predictions were tested experimentally and found to be rather inac-
curate but they did give a reasonable guide to the relative importance of shift dependence
for the different methods. In particular, the DWT was found to give significantly shift
dependent results, even compared to the expected statistical error, while the DT-CWT
produced estimates with statistically insignificant errors due to shift dependence.
We also developed a method for generating samples from the posterior distribution
that can generate large numbers of sample images at a cost of one wavelet reconstruction
per sample image. For contrast, a comparable method based on the conjugate gradient
algorithm [88] requires 2K wavelet transforms per sample image where K is the number
of iterations used in the conjugate gradient algorithm.
Finally we found a simple method that can be used to increase the speed of the method
at the cost of a slight decrease in accuracy. Using this method to achieve a constant
accuracy we found that the time to solve the equations is roughly linear in the number
of data points, while the computation for Kriging is roughly cubic in the number of data
points.
This chapter has argued that the DT-CWT gives much better results than the DWT
and much faster results than the NDWT. However, in practice this is not an appropriate
application for any of these wavelet methods. Better and faster results (at least for the
isotropic case) could be obtained with the GPT. In contrast, the next chapter will describe
an application for which the DT-CWT is not only better than the DWT and the NDWT,
but also superior to the leading alternative methods.
Chapter 9
Deconvolution
The purpose of this chapter is to give an example of a Bayesian application that illustrates
the performance gains possible with complex wavelets. We explain how to use a complex
wavelet image model to enhance blurred images.
The background for this chapter is largely contained in appendix C which reviews a
number of deconvolution methods from a Bayesian perspective. We construct an empir-
ical Bayes image prior using complex wavelets and experimentally compare a number of
different techniques for solving the resulting equations.
We compare the results with alternative deconvolution algorithms including a Bayesian
approach based on decimated wavelets and a leading minimax approach based on a special
nondecimated wavelet [58].
The main original contributions are; the new iterative deconvolution method, the ex-
perimental results comparing the results for alternative transforms within the method, and
the experimental comparison with alternative techniques.
9.1 Introduction
Images are often distorted by the measurement process. For example, if a camera lens
is distorted, or incorrectly focused, then the captured images will be blurred. We will
assume that the measurement process can be represented by a known stationary linear
filter followed by the addition of white noise of mean 0 and variance σ 2 . This model can
be written as
y = Hx + n (9.1)
165
166 CHAPTER 9. DECONVOLUTION
where some lexicographic ordering of the original image, x, the observed image, y, and the
observation noise, n, is used. The known square matrix H represents the linear distortion.
As it is assumed to be stationary we can write it using the Fourier transform matrix F as
H = F H MF (9.2)
where M is a diagonal matrix. For an image with P pixels y, x, and n will all be P × 1
column vectors while F , M, and H will be P × P matrices. As both x and n are unknown
equation 9.1 therefore represents P linear equations in 2P unknowns and there are many
possible solutions. This is known as an ill-conditioned problem.
The best solution method depends on what is known about the likely structure of the
images. If the original images are well modelled as a stationary Gaussian random process
then it is well-known that the optimal (in a least squares sense) solution is given by the
Wiener filter. However, for many real world images this model is inappropriate because
there is often a significant change in image statistics for different parts of an image. For
example, in a wavelet transform of an image most of the high frequency wavelet coefficients
tend to have values close to zero, except near object edges where they have much larger
values.
There have been many proposed methods for restoring images that have been degraded
in this way. We restrict our attention to the more mathematically justifiable methods,
ignoring the cruder “sharpening” techniques such as using a fixed high-pass filter or some
simple modification of wavelet coefficients [14]. (These ignored techniques provide a quick,
approximate answer but are less scientifically useful because often they will not provide an
accurate reconstruction even in very low noise conditions.)
For astronomical imaging deconvolution there are three main strands; the CLEAN
method proposed by Högbom [44], maximum-entropy deconvolution proposed by Jaynes
[54, 29], and iterative reconstruction algorithms such as the Richardson-Lucy method [102].
For images containing a few point sources (stars) the CLEAN algorithm can give very accu-
rate reconstructions, but for images of real world scenes these methods are less appropriate.
Alternative image models are found to give better results. Constrained least squares meth-
ods [15] use a filter based regularisation, such as a Laplacian filter, but this tends to give
over smoothed results when the image contains sharp edges. More recently there have been
attempts to improve the performance near edges. These methods include total variation
[123], Markov Random Field (MRF) [56, 132], and wavelet based approaches. There are
9.1. INTRODUCTION 167
two main contrasting methodologies for using wavelets. The first group is based on a min-
imax perspective [38, 52, 58, 84, 87]. The second group is based on a Bayesian perspective
using wavelets to represent the prior expectations for the data [8, 11, 94, 124].
We first describe a general Bayesian framework for image deconvolution. In appendix
C we draw out the connections between the different approaches by reviewing the papers
mentioned above with reference to the Bayesian framework. Section 9.1.2 summarises the
main results from this review. Section 9.1.3 discusses the reasons guiding our choice of prior
model based on the material covered in the appendix. This model is detailed in section 9.2
and then we describe the basic minimisation method in section 9.3. We propose a number
of alternative choices for minimisation that are experimentally compared in section 9.4.
Section 9.5 compares the results to alternative deconvolution methods and section 9.6
presents our conclusions.
9.1.1 Bayesian framework

To treat image deconvolution from the Bayesian perspective we must construct a full
model for the problem in which all the images are treated as random variables. We shall
use upper case (Y,X,N) to represent the images as random variables, and lower case
(y,x,n) to represent specific values for the variables. To specify a full model requires two
probability density functions to be specified:
1. The prior p(x) encodes our expectations about which images are more likely to occur
in the real world.
2. The likelihood p(y|x) encodes our knowledge about the observation model.
(As before, we use the abbreviation x for the event that the random variable X takes
value x.) All the reviewed methods (except for the Richardson-Lucy method described in
section C.5) use the same observation model and so the likelihood is the same for all of the
methods and is given by

(
P
1 − ([Hx]i − yi )2
p(y|x) = √ exp .
i=1 2πσ 2 2σ 2
Given observations y, Bayes’ theorem can be used to calculate the a posteriori probability
density function (known as the posterior pdf):
p(y|x)p(x)
p(x|y) = . (9.3)
p(y)
There are several techniques available to construct an estimate from the posterior pdf.
Normally a Bayes estimator is based on a function L(θ̂, θ) that gives the cost of choosing
the estimate θ̂ when the true value is θ. The corresponding Bayes estimator is the choice of
θ̂ that minimises the expected value of the function based on the posterior pdf. However,
for the purposes of the review it is most convenient to consider the MAP (maximum a
posteriori) estimate.
The MAP estimate is given by the image x that maximises the posterior pdf p(x|y).
xM AP = argmaxx p(x|y)
Usually a logarithmic transform is used to convert this minimisation into a more tractable
form:
p(y|x)p(x)
xM AP = argmaxx
p(y)
= argmaxx p(y|x)p(x)
= argminx − log (p(y|x)p(x))
= argminx − log (p(x)) − log (p(y|x))
P
([Hx]i − yi )2
= argminx f (x) +
i=1
2σ 2
1
= argminx f (x) + 2

Hx − y
2
2σ
where f (x) is defined by
f (x) = − log (p(x)) . (9.4)
In summary, the MAP estimate is given by minimising a cost function 2σ 2 f (x)+

Hx − y
2
where the choice of f (x) corresponds to the expectations about image structure.
This minimisation problem often appears in the regularisation literature in one of two
alternative forms. The first is known as Tikhonov regularisation [117]. A class of feasible
solutions Q is defined as those images for which the norm of the residual image is bounded.
The residual image is the difference between the observed data and the blurred estimate.
Q = {x :
y − Hx
≤
}
9.1. INTRODUCTION 169
Tikhonov defined the regularised solution as the one which minimises a stabilising func-
tional f (x).
xT IKHON OV = argminx∈Q f (x)
The second form is known as Miller regularisation [80]. In this approach the energy of the
residual is minimised subject to a constraint on the value of f (x).
xM ILLER = argmin{x:f (x)≤E}

y − Hx
Using the method of undetermined Lagrangian multipliers it can be shown [15] that both
problems are equivalent to the MAP minimisation (for particular choices of σ).
9.1.2 Summary of review

Appendix C explains the principles behind the standard deconvolution techniques and at-
tempts to explain the differences from within the Bayesian framework. Several of the meth-
ods are equivalent to a particular choice of f (x) (in some cases with additional constraints
to ensure the image is positive). These choices are displayed in table 9.1. Explanations of
these formulae can be found in the appendix, the numbers in brackets indicate the corre-
sponding section. The Landweber and Van Cittert algorithms are special cases of Wiener
Algorithm (section) Expression for f (x)

CLEAN (C.1) − log(β) + α i xi

xi
Maximum Entropy (C.2) i
Pj xj log Pxjixj

i σ2 SNRi | [F x]i |
1 2
Wiener filtering (C.4)

1
|mi |2
K − |mi | | [F x]i |2
2
Van Cittert (C.5) i σ2

1 1−(1−αm i)
|mi |2
Landweber (C.5) i σ2 1−(1−α|mi |2 )K
− |mi |2 | [F x]i |2

i λi | [W x]i |
2
Wang (C.10)

|[W x] |
Starck and Pantin (C.10) − i λi [W x]i − mi − | [W x]i | log mi i

i λi | [W x]i |
p
Belge (C.10)

Piña and Puetter (C.10) ≈ i |wi |p

Figure 9.1: Prior cost function f (x) expressions for standard deconvolution techniques.
filtering (if we remove the positivity constraint). In appendix C we also explain why we
can approximate both constrained least squares (section C.6) and the Richardson-Lucy
algorithm (section C.5) as alternative special cases of the Wiener filter. The reason for
making these connections is because we can predict an upper bound for the performance
of all these methods by evaluating just the best case of Wiener filtering (the oracle Wiener
filter).
Expressions for f (x) for the total variation (section C.7), Markov Random Field (section
C.7), and Banham and Katsagellos’ methods (section C.10) can also be written down1 but
the projection (section C.3) and minimax (section C.8) methods are more difficult to fit
into the framework. The minimax methods are an alternative approach motivated by the
belief that Bayesian methods are inappropriate for use on natural images. Section C.8
discusses the two approaches and explains why we prefer the Bayesian method.
9.1.3 Discussion
This section discusses the reasons guiding the choice of prior based on the review presented
in appendix C. The main issue is to identify the nature of the dependence between the
pixels in the original image.
For astronomical images of sparsely distributed stars an independence assumption may
be reasonable, while for many other kinds of images (including astronomical images of
galaxies) such an assumption is inappropriate. If independence is a reasonable assumption
then the CLEAN, maximum entropy, and maximally sparse methods are appropriate and
the choice largely depends on the desired balance between accuracy and speed. For ex-
ample, the CLEAN method is fast but can make mistakes for images containing clustered
stars.
For images that are expected to be relatively smooth then the Wiener filter and iterative
methods are appropriate. If the images are known to satisfy some additional constraints
(for example, the intensities are often known to be non-negative for physical reasons) or if
the blurring function is space varying then the iterative methods such as Richardson-Lucy
or constrained least squares are appropriate. Otherwise it is better to use the Wiener filter
because it is fast and approximately includes the iterative methods as special cases.
1
We have not included these expressions because, firstly, they require a considerable amount of spe-
cialised notation to be defined and, secondly, these expressions can easily be found in the literature
[8, 56, 90].
9.2. IMAGE MODEL 171
For images of scenes containing discontinuities then the total variation and wavelet
methods are appropriate. The Markov Random Field and total variation methods are
good for images that are well-modelled as being piecewise flat, but for many natural images
this model is only correct for certain parts of the image while in other parts there may
be textured or smoothly varying intensities. The wavelet methods tend to give a good
compromise for images containing such a mixture of discontinuities and texture.
We are interested in examining the potential of the DT-CWT within deconvolution and
therefore we choose to study the restoration of real-world images rather than sparse star-
fields. The previous section described several ways of constructing a prior with wavelets.
For simplicity we choose to use the quadratic cost function proposed by Wang et al [124].
This choice means that our proposed method will be an empirical Bayes approach based
on the non-stationary Gaussian random process model.
9.2 Image model

We will assume that we have a balanced wavelet transform so that P H = W where P H
represents the Hermitian transpose of P . We will assume that the real and imaginary parts
of the transform’s outputs are treated as separate coefficients so that W and P are real
matrices. The assumption of a linear, additive, white Gaussian noise model means that
the likelihood p(y|w) is proportional to

1
exp − 2
HP w − y
2
(9.5)
2σ
As mentioned in the previous section, we choose to use a simple prior model based on an
adaptive quadratic cost function [124]. This can be considered as a simple extension to
the model of chapter 7. Specifically, we use a generative specification in which the real
and imaginary parts of the wavelet coefficients are independently distributed according to
Gaussian distribution laws of zero mean and known variance. We can write that the prior
pdf p(w) for the wavelet coefficients is proportional to

1 H
exp − w Aw (9.6)
2
where A is a diagonal matrix with A−1

ii being the variance of w i . Realisations from the
prior pdf for images could be calculated by generating wavelet coefficients according to this
distribution and then inverting the wavelet transform x = P w. The only difference to the
previous model is that the variances are allowed to vary between coefficients rather than
being the same for all coefficients in a given subband. We assume that each coefficient has
an equal variance in its real and imaginary parts.
9.3 Iterative Solution

The simple assumptions made mean that it is possible to write down the solution to the
problem using a matrix pseudoinverse. However, the number of pixels in such problems
will mean that the pseudoinverse will take a very long time to evaluate and a much quicker
approach is needed.
We define an energy E to be the negative logarithm of the likelihood times the prior
(ignoring any constant offsets). The MAP (maximum a posteriori) answer is given by
minimising this energy. For this problem the MAP estimate will be identical to the Bayesian
posterior mean estimate because the posterior pdf is a multivariate Gaussian. We will
attempt to minimise the energy by repeating low dimensional searches in sensible search
directions. For simplicity we assume that the image has been scaled so that σ = 1. With
this scaling the energy function is given by combining equations 9.5 and 9.6:
1 1
E =
HP w − y
2 + wH Aw (9.7)
2 2
The steps in our method are:
1. Estimate the PSD of the image.
2. Estimate the variances of the wavelet coefficients.
3. Initialise the wavelet coefficients.
4. Calculate a search direction.
5. Minimise the energy along a line defined by the search direction.
6. Repeat steps 4 and 5 ten times.
Figure 9.2 contains a flow diagram of this method. Section 9.3.1 explains the estimation
steps. During the estimation we compute a first estimate x0 of the original image. The
9.3. ITERATIVE SOLUTION 173
Start
Estimate model parameters
Image Initialisation
Calculate Search
Direction
Minimise Energy
along search direction
Have we done No
enough iterations?
Yes
Stop
Figure 9.2: Flow diagram for the proposed wavelet deconvolution method.
detail of this image will probably be unreliable but the lowpass information should be fairly
accurate. We initialise the wavelet coefficients to zero, and the scaling coefficients to the
scaling coefficients in the transform of the image x0 . Later in section 9.4 we will propose
a better initialisation. Section 9.3.2 explains how the search direction is chosen. Section
9.3.3 explains how to minimise the energy within a one dimensional subspace.
9.3.1 Variance estimation

The method of Wang et al [124] used a similar prior model (based on a real wavelet trans-
form). Their method for variance estimation was to perform an edge detection operation
on the original image and then increase the variances of coefficients near edges. Details of
this process were not given and so we use an alternative estimation technique. Figure 9.3
contains a block diagram of the estimation process. This process corresponds to the step
Estimate Model Parameters in the flow diagram of figure 9.2.
The variance estimates are given by the energy of the wavelet coefficients of the wavelet
transform of an estimate of the original image. In other words, we need to:
1. Obtain an estimate x̂ of the original image.
2. Compute the wavelet transform w = W x̂ of this image.
3. Estimate the variances

1
Aii = 2 2
wR(i) + wI(i)
where R(i) is the index of the real part, and I(i) the index of the imaginary part of
the complex wavelet coefficient corresponding to index i.
These steps are represented in figure 9.3 by the Estimate Wavelet Variances block.
A simple estimate for the original image would be the observed data x̂ = y. However,
for a typical blurring operation this would underestimate the variances of the coefficients
at detailed scales. Alternatively we could compute a deconvolved image via the filter
of equation C.5. Full regularisation (α = 1) corresponds to using a Wiener denoised
estimate. Wiener estimates tend to have smoothed edges and will therefore tend to produce
underestimates of the variances near edges. Smaller values of α will preserve the signal
more, but also contain more noise and thus produce overestimates of the variances. Our
Observed Image
Estimate PSD Under−regularized

Deconvolution
Initial estimate of x
Wavelet denoising
Second estimate of x
Estimate
wavelet variances
A
Figure 9.3: Block diagram of deconvolution estimation process.
chosen approach is to use the under regularised inverse (with α = 0.1) followed by soft
thresholding wavelet denoising.
The filter of equation C.5 requires (for α = 0) an estimate of the power spectrum of the
original image. This kind of estimate is often required in deconvolution [84, 48]. In some
experiments we will use the oracle estimate given by the square of the power spectrum of
the original image (before convolution). This is called an oracle estimate because it requires
knowledge of the original image but this information will naturally not be available in any
real application. Nevertheless, such an estimate is useful in testing methods as it removes
the errors caused by bad spectrum estimates. We will always make it clear when we are
using such an oracle estimate2
In a real application we need a different estimation technique. Autoregressive and
Markov image models have been used to estimate image statistics [21] but it is reported
that the method only works well in noise reduction and not in blur removal [48]. The
constrained least squares method is a variant in which the autocorrelation is assumed to be
of a known form [5]. Hillery and Chin propose an iterative Wiener filter which successively
uses the Wiener-filtered signal as an improved prototype to update the power spectrum
estimate [48]. Within this dissertation we are more concerned with the performance of
wavelet methods than classical estimation theory and we use a fast and simple alternative
estimation technique.
1. Calculate the Fourier transform of the observed data F y.

2. Square to get an estimate of the observed power spectrum p̂y i = |[F y]i |2 .
3. By Fourier transforming the observation equation 9.1 it is straightforward to show

that the expected value of this estimate is given by

E p̂y = M H Mpx + σ 2 1N
where 1N is a N × 1 vector of ones. We estimate the power spectrum of the original

image by
−1
p̂x = p̂y − σ 2 1N M H M + βIN
2
In particular, the comparisons of the DT-CWT method with other published results will never cheat
by using an oracle estimate.
where β is a parameter used to avoid over amplification near zeros of the blurring
filter. We use β = 0.01 in the experiments. Any negative elements in the estimated
power spectrum are set to zero.
This represents the Estimate PSD block in figure 9.3.

Once we have the power spectrum, the estimate of the original image is given by the
under regularised filter. This filtering can be expressed in matrix form as
−1
H H H 2
x̂0 = F M M M P̂x + ασ IN Fy (9.8)
where P̂x is a diagonal matrix whose diagonal entries are given by p̂x . This gives the
under-regularised image estimate x̂0 that is further denoised using wavelets. This filtering
is represented by the Under-regularized Deconvolution block in figure 9.3.
A similar approach is used to perform the initial wavelet denoising. First the signal
strengths are estimated for each wavelet coefficient and then a Wiener-style gain is applied
to each coefficient. The details of this algorithm are:
1. Calculate the complex wavelet transform of the image estimate x̂0 . Let wi be the
ith complex wavelet coefficient in the output of this transform. It is important that
wi is complex-valued here. In the rest of this chapter except for the four steps of
this algorithm we use the separated real form of the transform, but here it is more
convenient to use the complex form.
2. Calculate an estimate âi of the signal power in these coefficients.
âi = |wi |2 − γσi2 (9.9)
where σi2 is the variance of the noise in the wavelet coefficient and γ takes some
constant value.
The original white noise of variance σ 2 is coloured by both the wavelet transform
and the inverse filtering. The parameters of both these processes are known which
in theory allows the exact calculation of σi2 . The value of σi will be the same for all
coefficients within the same subband (because the filtering is a stationary filter and
different coefficients in a subband correspond to translated impulse responses). In
practice it is easier to estimate these values by calculating the DT-CWT of an image
containing white noise of variance σ 2 that has been filtered according to equation
9.8. The average energy of the wavelet coefficients in the corresponding subbands
provide estimates of σi2 .
A choice of γ = 1 would seem to give a good estimate of the original signal power.
However, with this choice there is a significant probability that a low power coefficient
will be incorrectly estimated as having a high energy. This is because the noise only
corrupts the coefficients with an average energy of σi2 . In practice we find it is better
to use a larger value to avoid this problem. In the experiments we will always use
γ = 3.
As before negative values of âi are set to zero.
3. New wavelet coefficients are generated using a Wiener style gain law.
âi
ŵi = wi (9.10)
âi + σi2
4. The inverse DT-CWT is applied to the new wavelet coefficients to give an image x̂.
These steps are represented by the Wavelet Denoising block of figure 9.3.
9.3.2 Choice of search direction

This section describes the contents of the Calculate Search Direction step in figure 9.2.
The choice of search direction will only affect the speed of convergence but not the final
result. One obvious choice is the gradient but better search directions are usually produced
by the conjugate gradient algorithm [98]. Both algorithms work best for well-conditioned
Hessians, i.e. for Hessians close to a multiple of the identity matrix. Convergence can be
improved for a badly conditioned system via a preconditioner. We will test three types of
preconditioning in both gradient and conjugate gradient algorithms for a total of six alter-
natives. We first describe the conjugate gradient algorithm and then our preconditioning
choices.
Let g(i) be the preconditioned descent direction at the ith step of the algorithm. We
construct a sequence of search directions h(0) , h(1) , . . . either by the steepest descent algo-
rithm
h(i) = g(i)
or by the conjugate gradient algorithm
|g(i) |2 (i−1)
h(i) = g(i) + h .
|g(i−1) |2
This formula is valid for i > 0. For the first pass, i = 0, the search direction for the
conjugate gradient algorithm is given by h(0) = g(0) .
We compare three types of preconditioning. The first type corresponds to no precon-
ditioning and g(i) is given by the negative gradient of the energy function (E was defined
in equation 9.7),
g(i) = −∇w E
= P H H H y − P H H H HP w − Aw.
The Hessian for our system is P H H H HP + A and so the ideal preconditioner would
H H −1
be P H HP + A which would transform the Hessian to the identity matrix but this
matrix inversion is far too large to be numerically calculated. Instead for the second type we
choose a simpler type of preconditioning that scales the energy function gradient in order
to produce a Hessian with diagonal entries equal to 1. Define scaled wavelet coefficients as
v = S −1 w where S = diag {s} for some vector of scaling coefficients s. The Hessian of the
energy expressed as a function of v is
∇2v E = S H P H H H HP S + S H AS
The ith diagonal entry of this equation is

∇2v E ii
= s2i ti
where ti is the ith diagonal entry of the matrix P H H H HP + A. The required scaling is
√
therefore si = 1/ ti . The gradient is given by
∇v E = S H ∇w E.
This defines appropriate directions for changes in the preconditioned coefficients v. Ap-
propriate directions for changes to the original coefficients w are therefore given by
g(i) = S∇v E
= S 2 ∇w E.
This method requires the precomputation of ti . The diagonal entries of A are known (these
are the inverses of the variance estimates) so consider the matrix P H H H HP . The entry ti
can be calculated by:
1. Set all wavelet coefficients to zero, except for the ith coefficient which is set to 1, to
get a unit vector ei .
2. Invert the wavelet transform to get P ei .
3. Apply the blurring filter H to get HP ei .
4. Apply the spatially reversed blurring filter to get H H HP ei .
5. Apply the wavelet transform to get P H H H HP ei .

6. Pick out the value of the ith coefficient pi = P H H H HP ei i .
7. Calculate ti , ti = Aii + pi .
The value of pi depends only on which subband contains the nonzero coefficient. We can
therefore compute all the pi by applying this process once for each subband. Also note that
because these values (for pi ) depend on the choice of blurring filter and wavelet transform
but not on the observed data they can be computed once and used for many different test
images.
The third type of preconditioning is based on analogy with the WaRD method [84].
To explain the analogy we first derive the analytic solution to the energy minimisation
problem. The expression for energy (equation 9.7) is a quadratic function of w and hence
the optimum can be found by setting the gradient equal to zero. The vector gradient of
the energy is
∇E(w) = −P H H H y + P H H H HP w + Aw
therefore the solution to the problem, wopt , is given by

−1 H H
wopt = P H H H HP + A P H y (9.11)
−1
Note that the factor P H H H HP + A is exactly the same as the ideal preconditioner.
A negative way of looking at this is to say that calculating a good preconditioner involves
the same effort as solving the original equations directly. However, we can also reverse
the logic to say that a reasonable method for solving the original equations will probably
also give a reasonable preconditioning method. On the basis of this logic we propose using
the WaRD method as a preconditioner because ee have found that it gives a good first
approximation to solving the original equations.
We now give the details of how this idea is applied. The WaRD method consists of a
linear filtering stage followed by a wavelet denoising stage. We can write the regularised
linear filtering used in the WaRD method as
Px
xα = F H F HHy
M H MP
2
x + ασ IN
P x
= FH H 2
F P P HHHy
M MPx + ασ IN
where Px is a diagonal matrix containing the estimated PSD of the image along the diagonal
entries. If we compare this equation with equation 9.11 we spot the term P H H H y on the
right of both equation. We deduce that in the WaRD method the rest of the terms together
with the wavelet denoising should provide an approximation to the ideal preconditioner.
On the basis of this analogy we will choose a search direction that is the wavelet denoised
version of the image
Px
xα = F H F P (−∇E|w ) (9.12)
M H MP 2
x + ασ IN
The denoising strategy used by the WaRD estimate is as follows:
1. Wavelet transform the image xα .
2. Modify each wavelet coefficient by

βi2
ŵi = wi (9.13)
βi2 + γi2
where βi2 is the estimated variance of the wavelet coefficients due to our image model
and γi2 is the estimated variance of the wavelets due to the noise (amplified by the
inverse filter).
3. Invert the wavelet transform of the image.
We use the same strategy except that we are calculating a search direction in wavelet space
and so we can omit the final step. The WaRD search direction is therefore given by
g(i) = ŵ.
9.3.3 One dimensional search

This section describes the contents of the Minimise Energy along search direction
step in figure 9.2. Suppose we have an estimate w0 and a search direction δw . Then if we
add on a times the search direction w = w0 + aδw and we can express the energy as a
function of a as

2E(a) =
HP w0 + aHP δw − y
2 + wH
0 + aδw
H
A (w0 + aδw ) (9.14)
We can minimise this expression by setting the derivative with respect to a equal to zero
d(E(a))
= a
HP δw
2 + aδw H Aδw
da
− δw H P H H H (y − HP w0 ) + δw H Aw0 = 0
therefore

δw H P H H H (y − HP w0 ) − δw H Aw0
a=

HP δw
2 + δw H Aδw
When we want to evaluate this expression we never need to do any matrix multiplications
because:
1. Multiplication by P is performed by an inverse wavelet transform,
2. Multiplication by H is performed by the original blurring filter,
3. Multiplication by P H = W is performed by a forward wavelet transform,
4. Multiplication by H H is performed by a reversed version h(−x, −y) of the blurring

filter.
For large blurring filters it is quicker to implement the linear filters using a Fourier trans-
form.
9.4 Convergence experiments and discussion

We consider the same problem studied by Neelamani et al [84] of the 256 × 256 Cameraman
image blurred by a square 9 × 9-point smoother. The blurred signal to noise ratio (BSNR)
is defined as 10 log 10 (
y
2 /(256 ∗ 256)σ 2 ) and noise is added to make this 40dB (we
assume that images have been scaled to have zero mean). We compare the performance of
the six different search direction choices. These will be called
9.4. CONVERGENCE EXPERIMENTS AND DISCUSSION 183
NOPRESD for the steepest descent method with no preconditioning.
NOPRECG for the conjugate gradient algorithm with no preconditioning.
PRESD for the steepest descent preconditioned to have ones along the diagonal of the
Hessian matrix.
PRECG for the conjugate gradient algorithm used with the preconditioned system.
WaRDSD for search directions defined by the WaRD method.
WaRDCG for search directions defined by the conjugate gradient algorithm acting on
the WaRD directions.
We use the oracle estimate for the power spectrum of the original image in order that the
SNR will be a measure of the convergence of the algorithm rather than of the quality of
the power spectrum estimate. Figure
9.4 plots the improvement in SNR (ISNR) defined by
10 log 10
x − y
2 /
x − x̂
2 for the sequence {x̂(1) , x̂(2) , . . . , x̂(10) } of restored images
(n)
produced by these algorithms. We observe the following characteristics of the plot:
1. The same initialisation is used for all methods and therefore all methods have the
same performance at the start of iteration 1.
2. The first pass of the conjugate gradient algorithm uses a steepest descent search
direction and therefore the CG and SD methods give the same performance at the
start of iteration 2.
3. The CG algorithm gives better results than the SD algorithm for the PRE and
NOPRE methods but not for the WaRD iterations.
4. Convegence is very slow without preconditioning (NOPRE).
5. The WaRD method achieves a high ISNR after the first pass, but there is little
subsequent improvement.
6. The results from the preconditioned method (PRE) start at a low ISNR but steadily
improve (a later experiment will show the performance over many more iterations).
7. The ISNR actually decreases on several of the steps when the WaRD direction is
used.
12
11
WaRD
10
9
ISNR/dB
6 PRE
4 NOPRE
3
1 2 3 4 5 6 7 8 9 10
Iteration
Figure 9.4: Performance of different search directions using the steepest descent (x) or the
conjugate gradient algorithm (o).
The WaRD direction is designed to give an estimate of the deconvolved image based on
the assumption that the signal and noise are diagonalised in wavelet space [58]. This is a
reasonable initial approximation and consequently the WaRD direction works well the first
time. However, for subsequent iterations we expect the off-diagonal elements to become
more significant and therefore it is not surprising that the WaRD direction is less effective.
We will discuss the unusual performance of the WaRD direction more at the end of this
section.
In the second experiment we use the WaRD method to initialise the wavelet coefficients
(step Initialisation of figure 9.2). More precisely, we calculate the wavelet transform
of the image x0 (defined in equation 9.8) and then use the WaRD modification step of
equation 9.13 to generate our initial wavelet coefficient estimates. We will call this “WaRD
initialisation”. Note that this is the same as using a single pass of the algorithm with a
WaRD direction based on an intialisation of both scaling and wavelet coefficients to zero.
We compare different choices for the second search direction. Figure 9.5 plots the results
of this experiment. Note that we have a much narrower vertical axis range in this figure
than before. The preconditioned conjugate gradient search direction gives the best final
results. There is an improvement of about 0.05dB from using the preconditioned direction
rather than the WaRD direction after 10 iterations, and another 0.05dB from using the
conjugate gradient algorithm rather than the steepest descent algorithm. Note from figure
9.5 that the WaRD intialisation means that the ISNR is about 11.3dB at the start of the
start of the first iteration, while a single iteration of the WaRD direction (starting from the
original initialisation of just the scaling coefficients) only reached about 10.8dB in figure
9.4. This again supports the argument that the WaRD method works best when used as
it was originally designed rather than to construct search directions.
The ISNR is still improving after ten iterations and the third experiment examines the
improvement over 100 iterations. Figure 9.6 compares the performance for the PRECG,
WaRDCG, and PRESD methods on the same image. The PRECG method performs
best initially, reaching its peak of 11.95dB within about 20 iterations, while the PRESD
method requires about 100 iterations to reach the same ISNR level. The WaRDCG method
displays an oscillation of INSR with increasing iteration, finally settling around 11.9dB.
Figure 9.7 plots the value of the energy function E(w) of equation 9.7 at the start of
each iteration. It can be seen that the PRECG method reaches the lowest energy, as
suggested by the ISNR results. In all of the ISNR plots of this section we have seen that
12
11.9 PRECG
PRESD
11.8
WaRDSD
11.7
11.6
ISNR/dB
11.5
NOPRE
11.4
11.3
11.2
WaRDCG
11.1
11
1 2 3 4 5 6 7 8 9 10
Iteration
Figure 9.5: Performance of different search directions using the steepest descent (x) or the
conjugate gradient algorithm (o) starting from a WaRD intialisation.
12
PRECG
11.9
PRESD
11.8
11.7
11.6
ISNR/dB
11.5 WaRDCG
11.4
11.3
11.2
11.1
11
0 20 40 60 80 100
Iteration
Figure 9.6: Performance of different search directions over 100 iterations
0.27
0.26
0.25
0.24
Energy /pixel
0.23
0.22
0.21 WaRDCG
0.2
0.19
PRESD
0.18
PRECG
0.17
0 20 40 60 80 100
Iteration
Figure 9.7: Value of the energy function over 100 iterations

the WaRDCG method behaves strangely in that the ISNR often decreases with increasing
iterations. Nevertheless, figure 9.7 confirms that the energy function decreases with every
iteration. We have already argued that the WaRD direction should not be expected to
produce sensible search directions except for the first pass (for which it was designed) but
it may still seem strange that the ISNR decreases. Recall that the energy function of
equation 9.7 has two terms, a “observation energy” that measures the degree to which the
current estimate matches the observations, and a “prior energy” that measures how well
the wavelet coefficients match our prior expectations3 . We now suggest an explanation for
how bad directions can cause such problems:
1. At the start of the method the wavelet coefficients are all zero and the estimated
image is a relatively poor fit to the observations. The observation energy is high, and
the prior energy is zero.
2. The first few search directions correct for the most significant places where the ob-
servations disagree with the current estimate at the cost of increasing the size of
some wavelet coefficients. This tends to improve the quality of the estimate. The
observation energy decreases and the prior energy increases.
3. The poor choice of search direction means that some wavelet coefficients are made
large during these initial stages despite having an expected low variance according
to the prior distribution.
4. Subsequent directions attempt to correct for these incorrectly large wavelet coeffi-
cients. The poor choice of search direction means that the direction also introduces
errors elsewhere in the image. The prior energy decreases but now at the cost of in-
creasing the observation energy. This will tend to reduce the quality of the estimate.
In each iteration the total energy (observation energy plus prior energy) decreases but
the internal redistribution of energy between the two terms can cause a corresponding
fluctuation in the ISNR.
We choose ten iterations of the PRECG search direction with a WaRD initialisation as
a reasonably fast and robust choice for the comparisons with alternative methods.
3
Note that we are using the word “prior” in a very loose sense. In our model the matrix A that defines
the prior for a particular image has been generated from the image itself.
9.5. COMPARISON EXPERIMENTS 189
9.5 Comparison experiments

We will compare the performance of algorithms on two images, the cameraman image used
in the previous section and an aerial image of size 256 by 256. The aerial image is provided
by the French Geographical institute (IGN). The two test images (before blurring) are
shown in figure 9.8. We compare two point spread functions (PSF), the 9 by 9 smoother
50 50
100 100
150 150
200 200
250 250
50 100 150 200 250 50 100 150 200 250
Figure 9.8: Test images used in the experiments.
used above, and an alternative 15 by 15 PSF more like a satellite blurring filter. This
alternative PSF is defined as
1
h(x, y) =
(1 + x2 ) (1 + y 2)
for |x|, |y| ≤ 7. This PSF is plotted in figure 9.9. This gives a product of four test data
sets which we will call
CMSQ for the cameraman image blurred with the square PSF.
CMST for the cameraman image blurred with the satellite-like PSF.
IGNSQ for the satellite image blurred with the square PSF.
IGNST for the satellite image blurred with the satellite-like PSF.
We will test the following algorithms
Landweber 235 iterations of the Landweber method. The number of iterations was cho-
sen to maximise the ISNR for the CMSQ image.
0.1
0.08
0.06
0.04
0.02
6
4
2 6
4
0 2
−2 0
−4 −2
−4
−6 −6
Figure 9.9: Alternative PSF used in experiments.
Wiener A Wiener filter using a power spectrum estimated from the observed data[48].
Oracle Wiener A Wiener filter using the (unrealisable) oracle power spectrum estimate.
Mirror The non-decimated form of mirror wavelet deconvolution. This algorithm is de-
scribed in detail in appendix C.
PRECGDT-CWT Ten iterations of the PRECG search direction starting from a WaRD
estimate (using the standard DT-CWT filters of the (13-19) tap near orthogonal
filters at level 1 together with the 14-tap Q-shift filters at level ≥ 2).
PRECGDWT The same algorithm as described in the earlier sections for complex wavelets,
but using a real decimated wavelet formed from a biorthogonal 6,8 tap filter set.
PRECGNDWT The same algorithm as described in the earlier sections for complex
wavelets, but using a real nondecimated wavelet formed from a biorthogonal 6,8 tap
filter set. The operation of each step is equivalent to averaging the operation of a
DWT step over all possible translations.
In each experiment we use a variance of σ 2 = 2. (For the convergence experiments the

variance was about 1.7). Except for the Oracle Wiener method we will use realisable
estimates of the power spectrum. In other words, all of the algorithms (apart from Oracle
Wiener) could be performed on real data. The results of these experiments are tabulated
in figure 9.10. Notice the following features of these results:
Algorithm CMSQ CMST IGNSQ IGNST

Landweber 7.48 3.53 7.49 0.489
Wiener 7.31 2.70 8.00 3.14
Oracle Wiener 8.51 6.38 9.11 6.07
Mirror 3.88 5.28 4.71 4.61
PRECGDT-CWT 9.36 7.3 9.83 6.72
PRECGDWT 8.36 5.43 8.84 4.89
PRECGNDWT 8.64 5.59 9.28 5.09
Figure 9.10: Comparison of ISNR for different algorithms and images /dB
1. The realisable Wiener filter always performs worse than the oracle Wiener by at least
1dB.
2. The Landweber method always performs worse than the Oracle Wiener, but some-
times beats the standard Wiener filter.
3. The Landweber method has a very poor performance on the IGNST image. This
illustrates the problems of an incorrect choice for the number of iterations. Further
tests reveal that for the IGNST image the optimum performance is reached after 36
iterations, reaching a ISNR of 5.4dB.
4. The Mirror wavelet algorithm beats the standard Wiener filter for the satellite like
blurring function (ST), but not for the square blur (SQ).
5. The nondecimated wavelet always performs better than the decimated wavelet.
6. The DT-CWT always performs at least 0.5dB better than any of the other tested
algorithms.
It is possible to rank these performances into three groups:

1. The Landweber, Mirror, and standard Wiener method.
2. The PRECGDWT, PRECGNDWT, and oracle Wiener method.
3. The PRECGDT-CWT method.
The methods in the second group are always better (in these experiments) than those in
the first group, and the method in the third group is always better than any other.
Finally we attempt to duplicate the experimental setups of published results. Some
authors [124] only present results as images and thus it is hard to compare directly but
usually a measure of mean squared error (MSE) or improved signal to noise ratio (ISNR)
is reported.
The cameraman image with a uniform 9 by 9 blur and a blurred signal to noise ratio
(BSNR) of 40dB was originally used by Banham and Katsaggelos [8] who report an ISNR
of 3.58dB for Wiener restoration, 1.71dB for CLS restoration, and 6.68dB for their adap-
tive multiscale wavelet-based restoration (Constrained Least Squares, CLS, restoration was
described in section C.6 and is another deterministic restoration algorithm whose perfor-
mance on this task will always be worse than the Oracle Wiener solution). Neelamani et
al claim [84] that they use the same experimental setup and quote an ISNR of 8.8dB for
the Wiener filter and 10.6dB for the WaRD method. There is a large discrepancy in the
Wiener filter results. A small discrepancy is expected as different power spectrum esti-
mates result in different estimates; Banham and Katsaggelos [8] explicitly state that that
high frequency components of the spectrum are often lost or inaccurately estimated while
Neelamani et al add a small positive constant to the estimate to boost the estimate at high
frequencies. A close examination of the published figures reveals that in fact the setup is
slightly different for the two cases. Banham and Katsaggelos use a filter that averages the
contents in a square neighbourhood centred on each pixel, while Neelamani et al use a filter
that averages the contents in a square neighbourhood whose corner is at the pixel. This
change in setup does not affect the amount of noise added as the BSNR is insensitive to
shifts, nor does it affect the generated estimates. However, it does affect the ISNR because
the starting SNR (for the blurred image) is considerably lowered by a translation.
Fortunately, the difference merely results in a constant offset to the ISNR values. This
offset is given by the ISNR that results from using the centred blurred image rather than
the displaced one. Let H represent the offset filter and S the translation that centres the
impulse response. The blurred image produced by Neelamani et al is given by Hx + n1 ,
while Banham and Katsaggelos produce an image of SHx+n2 where n1 and n2 are vectors
of the noise added to the images. The difference ISNRof f set in the ISNR values for an
image estimate x̂ is therefore

x − Hx − n1
2
x − SHx − n2
2
ISNRof f set = 10 log10 − 10 log10

x − x̂
2
x − x̂
2

x − Hx − n1

x − x̂
2 2
= 10 log10

x − x̂
2
x − SHx − n2
2

x − Hx − n1
2
= 10 log10

x − SHx − n2
2
For the Cameraman image the value of this offset is ISNRof f set = 3.4260dB when there is
no noise added. Using a typical noise realisation reduces this to ISNRof f set = 3.4257dB.
We see that the noise levels are very low compared to the errors caused by blurring.
In our experiments we have followed the setup of Neelamani et al in using the displaced
filter. For comparison we have calculated the results of the PRECGDT-CWT and our
version of standard Wiener on the same image. The results from the literature (including
the adjusted results of Banham and Katsaggelos) are shown in table 9.11, our original
results are printed in bold. In these results the CLS method gives the worst results while
Algorithm ISNR /dB

CLS 5.14 (1.71+3.43)
Wiener (Banham and Katsaggelos) 7.01 (3.58+3.43)
Multiscale Kalman filter 10.11 (6.68+3.43)
Wiener (Neelamani et al) 8.8
WaRD 10.6
Wiener (our version) 8.96
PRECGDT-CWT 11.32
Figure 9.11: Comparison of different published ISNR results for a 9 by 9 uniform blur
applied to the Cameraman image with 40dB BSNR.
the PRECGDT-CWT method gives the best. The adjustment for the offset filter shows
that the WaRD method is 0.5dB better than the multiscale Kalman filter (instead of the
claimed 4dB improvement), while the PRECGDT-CWT method is 0.7dB better than the
WaRD method (and 1.2dB better than the multiscale Kalman filter).
Figure 9.12 displays the deconvolved images for our Wiener and PRECGDT-CWT
approaches. This setup is exactly the same as for the initial experiments in section 9.4.
Original image Observed image
Wiener denoised image (ISNR=8.96) Restored image after 10 iterations (ISNR=11.32)
Figure 9.12: Deconvolution results for a 9 by 9 uniform blur applied to the Cameraman
image with 40dB BSNR using the PRECGDT-CWT method with WaRD initialisation.
The results for our method are slightly worse here (11.32dB instead of 11.9dB) because we
are now using a realisable estimate of the power spectrum. From figure 9.12 we can see
that the results of the PRECGDT-CWT method are considerably sharper and possess less
residual noise than the results of the Wiener filter.
Belge et al use a Gaussian convolutional kernel [11]
1
h(x, y) = exp −(x2 + y 2 )/(2σx σy )
4σx σy
with σx = σy = 2 to blur the standard 256 × 256 Mandrill image and add zero mean white
Gaussian noise to achieve a BSNR of 30dB. This is an unusual way of writing the kernel
(normally there would be a factor of π for normalisation and σx2 σy2 would be used to divide
x2 + y 2 ) but this is the kernel specified in the paper [11]. Their results are presented in

root mean square error ( RMSE = (1/N 2 )
x̂ − x
2 ) which we have converted to ISNR
values (ISNR = −20 log10 (RMSE/R0 ) where R0 is the RMSE of the blurred image). They
compare their method to CLS and a total variation algorithm. Table 9.13 compares these
results with the PRECGDT-CWT method (using a realisable power spectrum estimate).
Our original results are written in bold. We see that in this experiment the PRECGDT-
Algorithm ISNR /dB

CLS 0.716
TV 0.854
Adaptive edge-preserving regularization 0.862
PRECGDT-CWT 1.326
Figure 9.13: Comparison of different published ISNR results for a Gaussian blur applied
to the Mandrill image with 30dB BSNR.
CWT method improves the results by 0.46dB compared to the adaptive edge-preserving
regularization method. The original, blurred, and restored images using our method are
shown in figure 9.14. One warning should be attached to these results: the definition
of SNR is not explicitly stated in the reference, we assume the definition (based on the
variance of the image) given in section 9.4.
Sun [115] used a uniform 3 by 3 blur and 40dB BSNR acting on a 128 by 128 version
of Lenna to test a variety of Modified Hopfield Neural Network methods for solving the
Constrained Least Squares formulation. Sun tested three new algorithms (“Alg. 1”, ”Alg.
2”, ”Alg. 3”) proposed in the paper [115] plus three algorithms from other sources (the
“SA” and “ZCVJ” algorithms [133], and the “PK” algorithm [92]). We claim in section
C.6 that the converged solution must be worse than the Oracle Wiener estimate, but
acknowledge that intermediate results may be better. We tested the PRECGDT-CWT
method (using a realisable power spectrum estimate) and two choices of Wiener filter (the
oracle Wiener filter, and a Wiener filter based on the same power spectrum estimate as used
in the PRECGDT-CWT method) on this problem. The results of all these comparisons is
shown in figure 9.15. Our original results are written in bold.
Original
Blurred Restored
Figure 9.14: Deconvolution results for a Gaussian blur applied to the Mandrill image with
30dB BSNR using the PRECGDT-CWT method with WaRD initialisation.
Algorithm ISNR /dB

Alg. 1 6.84
Alg. 2 6.91
Alg. 3 6.85
SA 7.19
ZCVJ 5.38
PK 5.41
Oracle Wiener 8.13
Realisable Wiener 6.37
PRECGDT-CWT 9.90
Figure 9.15: Comparison of the PRECGDT-CWT and Wiener filtering with published
results of Modified Hopfield Neural Network algorithms for a 3 × 3 uniform blur applied
to the Lenna image with 40dB BSNR.
The best of the previously published results is the “SA” algorithm which attains an
ISNR of 7.19dB. This is better than the realisable Wiener filter results, but almost 1dB
worse than the Oracle Wiener estimate. The PRECGDT-CWT method does particularly
well in this case4 , outperforming the “SA” algorithm by 2.7dB.
9.5.1 Discussion
Mirror wavelets are designed for hyperbolic deconvolution. The inverse filtering produces
large noise amplification for high frequencies and, in order to achieve a bounded variation in
the amplification for a particular subband, the subbands have a tight frequency localisation
for high frequencies. This is an appropriate model for the satellite-like PSF. However, the
9 by 9 smoothing filter (SQ) has many zeros in its frequency response and consequently
the mirror wavelets are inappropriate and give poor results. A better performance could
be achieved by designing a more appropriate wavelet transform but no single wavelet
transform will be best for all blurring functions. In contrast, the Bayesian approach uses
one term (the prior) to encode the information about the image using wavelets, while a
second term (the likelihood) is used to describe the observation model and makes explicit
use of the PSF. A change in PSF requires a change in the likelihood term but the same
4
In this method we use a centred blurring filter to avoid the artificial BSNR improvement described
earlier.
prior wavelet model should remain appropriate. Therefore the same wavelet method gives
good performance for both of the blurring functions.
If we now look at the relative performance of different wavelets we see a familiar result.
The nondecimated wavelet transform outperforms the decimated wavelet transform. The
decimated wavelet gives shift dependent results, and shift dependence will always tend to
cause worse performance for estimation tasks like this one (as discussed in section 8.4.4).
However, the improvement is only about 0.2dB. A much larger improvement is gained from
using the DT-CWT which is an average of 1.15dB better than the NDWT. It is not hard to
see a plausible reason for this; the worst errors occur near edges and the DT-CWT is able to
distinuish edges near 45◦ from those near −45◦ . The signal energy near diagonal edges will
therefore tend to be concentrated in a smaller proportion of the wavelet coefficients than
for a real wavelet transform and hence will be easier to detect. Note in particular that the
PRECGDT-CWT method outperforms even the Oracle Wiener method and consequently
will perform better than any version of standard Wiener filtering, including the Landweber
or Van Cittert iterative techniques.
We have achieved our goal of comparing the performance of complex wavelets with
decimated and nondecimated real wavelets in a practical deconvolution method but we
have certainly not “solved” deconvolution or even fully exploited the potential of complex
wavelets in this application. The most promising direction for further research is by taking
account of the correlations between wavelet coefficients. Deconvolution methods tend to
produce rather blurred estimates near edges in the original image that significantly affect
both the SNR and the perceived quality of the restoration. It may be possible to use the
HMT (Hidden Markov Tree [24]) to deduce the likely presence of large wavelet coefficients
(at a fine detail scale) from the presence of large coefficients at coarser scales and hence im-
prove the estimation. This approach has already been shown to be promising for standard
denoising when there is no blur (as described in chapter 3).
All the results here are based on simulated data in order to allow the performance to
be objectively measured. In real world applications the following isses would need to be
addressed:
1. The estimation of the noise variance and the Point Spread Function (PSF) of the
blurring filter.
2. The degree to which the PSF is linear and stationary.

3. The degree to which a Gaussian noise model is appropriate.
9.6 Conclusions
We conclude that
1. The WaRD algorithm gives a good starting point for the method but provides inad-
equate subsequent directions.
2. The preconditioned conjugate gradient algorithm provides sensible search directions

that achieve good results within ten iterations.
3. The NDWT outperformed the DWT in this approach by about 0.2dB. Shift depen-
dence therefore has a small effect on performance.
4. The DT-CWT outperformed the NDWT by an average of 1.15dB.
5. The Mirror wavelet method performed badly on the 9 by 9 uniform blur due to the
presence of extra zeros in the response. Such minimax algorithms must be tuned to
the particular blurring case.
6. The Landweber and Van Cittert iterative algorithms are special cases of Wiener
filtering, and hence are worse than Oracle Wiener filtering.
7. The DT-CWT method performed better than the Oracle Wiener and hence bet-
ter than all versions of standard Wiener filtering, including methods based on the
Landweber or Van Cittert iteration.
8. The method based on the DT-CWT performed better than all the other methods
tested and better than the published results on similar deconvolution experiments.
In summary, complex wavelets appear to provide a useful Bayesian image model that is
both powerful and requires relatively little computation.
Chapter 10
Discussion and Conclusions
The aim of the dissertation is to investigate the use of complex wavelets for image process-
ing. In particular, we aim to compare complex wavelets with alternative wavelet trans-
forms. In this chapter we explain how the contents of the dissertation support the thesis
that complex wavelets are a useful tool and then discuss the wider implications of the
research. Many peripheral results have already been mentioned in the conclusions section
at the end of each chapter and we will not repeat them here.
First we describe the experimental support for the thesis. We have examined four main
image processing tasks that are described in chapters 5, 6, 8, and 9. For every applica-
tion the experimental results for the complex wavelet methods display an improvement
in accuracy over the standard decimated wavelet methods. These improvements can be
seen qualitatively in the synthesized textures of chapter 4 and most clearly quantitatively
in chapter 9. For most of the applications the complex wavelet method also gives better
results than the nondecimated wavelet transform. The exception is chapter 8 on inter-
polation. For this application the complex wavelet method produces almost exactly the
same results as the nondecimated transform. Nevertheless this application still supports
the thesis because the new method is much faster than the non-decimated method.
For each application we have compared wavelet methods with alternative methods to
determine when the wavelets are useful. The complex wavelet models were found to be
particularly good for segmentation of differently textured regions and image deconvolution.
These cases provide the main experimental justification that complex wavelets are useful
for image processing.
Now we describe the theoretical support for the thesis. There are two main reasons
201
202 CHAPTER 10. DISCUSSION AND CONCLUSIONS
why the complex wavelets are expected to give better results. The first is increased di-
rectionality. It is usually clear why this should increase performance although the precise
amount of improvement will be strongly dependent on the nature of the input images.
Therefore qualitative explanations are given rather than mathematical proofs. In chapter
4 we explain why the extra subbands are necessary for synthesizing texture with a diago-
nal orientation. As a natural corollary we explain in chapter 5 why the extra features are
useful for segmentating textures with diagonal features. The extra subbands also give a
better model for the object edges that commonly appear in images and this results in the
improved deconvolution performance of chapter 9.
The second main reason for improved results is the reduction in shift dependence as
compared to the standard fully decimated wavelet transform. Chapter 2 proves that any
non-redundant wavelet system based on short filters will have significant shift dependence.
Chapter 8 calculates an approximation for the reduction in SNR caused by shift dependence
for interpolation that predicts that complex wavelets should achieve a significant increase
in quality compared to a typical real decimated wavelet.
10.1 Discussion
The experimental comparisons did not always favour a complex wavelet model. The models
were found to be too simple for synthesizing more regular textures like brickwork. The
models were also found to be unnecessarily complicated for interpolating a stationary
random process with an isotropic autocorrelation function. In this case a solution based
on the Gaussian Pyramid transform would be both faster and less shift dependent.
The DT-CWT has a number of properties that are beneficial in different circumstances:
1. Perfect reconstruction
2. Six directional subbands at each scale (when used to analyse images)
3. Near balanced transform
4. Approximate shift invariance
5. Complex outputs
10.2. IMPLICATIONS OF THE RESEARCH 203
Table 10.1 summarises the importance of these properties for a number of applications. A
√
indicates that the property is useful for the corresponding application. We include three
types of application:
A The applications we have tested that are described in the main body of the dissertation.
B The applications we have tested that are not described in this dissertation but have
been published elsewhere [33, 34].
C Applications in the literature described in chapter 3.
Application Perfect Six Near Shift Complex

reconstruction subbands balanced invariance outputs
√ √ √ √
Texture synthesis (A)
√ √
Texture segmentation (A)
√ √ √ √ √
Autocorrelation synthesis (A)
√ √
Interpolation (A)
√ √ √ √ √
Deconvolution (A)
√
Texture database (B,C)
√ √ √ √ √
Levelset segmentation (B)
√ √ √
Motion Estimation (C)
√ √ √ √ √
Denoising (C)
Figure 10.1: Summary of useful properties for different applications
We would expect complex wavelets to be most useful for the applications (such as decon-
volution) that require all of the properties.
10.2 Implications of the research

It is well known that shift dependence often causes a significant degradation in performance.
The experimental results of this dissertation confirm this connection. Consequently many
wavelet based methods rely on the nondecimated form of the wavelet transform. This
dissertation provides evidence that the benefits of such a nondecimated transform can
usually be achieved with only a minor increase in redundancy and computational time
by means of a decimated complex wavelet transform. It is the author’s hope that this
204 CHAPTER 10. DISCUSSION AND CONCLUSIONS
dissertation will help to add complex wavelets to the list of good general purpose transforms
that are automatically considered for a new problem. This is not only on the grounds that
the DT-CWT provides an efficient substitute for the NDWT, but on the stronger claim
that the DT-CWT often improves the results over even nondecimated real transforms.
Chapter 11
Future work
In this dissertation we have shown how complex wavelets can be used on a small selection of
image processing tasks. We have proposed a number of simple methods based on the DT-
CWT that produce good results. This chapter discusses possible future research directions
for these tasks and suggests some additional applications.
11.1 Segmentation
We have demonstrated the power of the complex wavelet features for supervised texture
segmentation. This method is appropriate for the task of classifying satellite imagery into
urban and non-urban regions but images taken at more moderate distances of real world
objects will contain a mixture of image types. Texture-based segmentation will not always
be appropriate and it may be possible to combine the segmentation method described
here with more traditional edge-based segmentation algorithms. A single complex wavelet
transform could be used for detecting both texture and the object edges. Large complex
wavelet coefficients correspond to edges in an image and a simple edge detector can be
built by simply detecting such large coefficients. At each position we have responses in the
six bandpass subbands which allow the orientation to be estimated.
In some preliminary work along these lines we have found that it is possible to signifi-
cantly improve the coarse estimates from this simple scheme using a reassignment technique
[6].
Another possibility is to extend the Hidden Markov Tree (HMT) model [28] to unify
texture and edge based segmentation by adding extra hidden states that encode a form of
205
206 CHAPTER 11. FUTURE WORK
image grammar. This grammar could be used to generate regions that can be recognised
by their texture content, by the presence of large wavelet coefficients near their boundaries,
or even a combination of these. It may even be possible to use the HMT directly for object
recognition if a sufficiently sophisticated grammar is used.
11.2 Texture synthesis

It would be interesting to determine additional features that can be used to improve the
synthesis performance. For example, the texture synthesis method performs poorly on
piecewise smooth images. A piecewise smooth image has large valued wavelet coefficients
at the edges of regions while the wavelet coefficients inside the regions have small values.
In the synthesized images the energy tends to be spread more uniformly across the image.
It may be possible to encode measures of persistence that will tend to group large valued
wavelet coefficients together.
There has been a large amount of research into how the human visual system detects
and processes structure in images. For example, Marr [78] proposed the concept of the
“full primal sketch”. In the full primal sketch groups between similar items are made and
information about their spatial arrangement is extracted. Hubel and Wiesel performed a
famous study into the responses of single neurons at an early stage in the visual system
and discovered responses that were frequency and orientation selective and were similar to
Gabor functions [49, 50]. Such Gabor functions have shapes similar to complex wavelets.
It may be possible to connect the low level features provided by complex wavelets with
higher level structure extraction processes in order to achieve better texture synthesis or
better image understanding.
11.3 Deconvolution
We mentioned in chapter 9 that the most promising direction for future research in de-
convolution is taking account of the correlations between wavelet coefficients. By using
models such as the HMT to give a better prior model for images we hope that the results
will also improve.
An alternative direction is the possibility of using the method for a “super-resolution”
application. Super-resolution is the name for the process of constructing a single high
11.4. OTHER POSSIBILITIES 207
quality image from multiple low quality views of the same image. For example, if a low
resolution video camera is used to film a document then, while individual frames may
be unreadable, it may be possible to construct a readable estimate by combining the
information from several frames. The basic mathematics is very similar but there is the
added complication of having to estimate an appropriate transformation to place all of the
images onto a common reference grid.
11.4 Other possibilities

We have been using the complex wavelet transform for image processing but it can also be
used in spaces of both greater and smaller dimension.
For example, the author is also investigating the application of the DT-CWT to 3D data
sets (such as from medical imaging techniques). For these data sets analogous segmentation
techniques can be used to identify different biological structures.
The DT-CWT can also be used for 1D applications. We have used the DT-CWT to
represent regions with smooth boundaries via a levelset approach [34]. An alternative
approach to shape modelling is to represent each point (x, y) along a contour by the
complex number x + jy to produce a one-dimensional sequence of complex numbers. For
real numbers the 1D DT-CWT only needs to possess 1 high frequency subband for each
scale as the spectrum of real signals is the same for positive and negative frequencies.
However, the 1D DT-CWT can easily be extended to produce subbands for both positive
and negative frequency. Using such an extended DT-CWT on the sequence of complex
numbers mentioned above will produce (complex valued) wavelet coefficients with the
following properties:
1. Anticlockwise curves will be represented by wavelet coefficients corresponding to

positive frequencies.
2. Clockwise curves will be represented by wavelet coefficients corresponding to negative

frequencies.
3. A translation of the contour will only alter the lowpass coefficients.
4. A rotation by θ of the contour will alter the phase of every coefficient by θ.

208 CHAPTER 11. FUTURE WORK
5. A rescaling by a factor of r of the contour will change the magnitude of every coeffi-
cient by a factor of r.
Such a model should be useful for recognition or coding of contours.

Appendix A
Balance and noise gain
A.1 Amplification of white noise by a transform

Consider applying independent white noise of variance σ 2 to each of the inputs {x1 , . . . , xN }
of a real transform represented by w = W x where x ∈ R N , w ∈ R M , M and N are positive
integers, and W is a real M ×N matrix but there is no restriction on the relative sizes of M
and N. We first calculate an expression for the expected energy in the wavelet coefficients.
The ith wavelet coefficient wi is given by

j=N
wi = Wij xj (A.1)
j=1

j=N
and so wi has a N(0, σ 2 j=1 Wij2 ) distribution. E(wi2 ) is given by the variance because
E(wi ) = 0 and so the expected energy in the wavelet coefficients is
i=M

i=M
2
E wi = E wi2 (A.2)
i=1 i=1
j=N
i=M
= σ2 Wij2 (A.3)
i=1 j=1
Now consider the trace of W T W (given by the sum of the diagonal elements)

j=N

T
tr(W W ) = WTW jj
(A.4)
j=1

j=N i=M

= WT ji
Wij (A.5)
j=1 i=1
209
210 Appendix A.2

j=N i=M
= Wij Wij (A.6)
j=1 i=1
j=N
i=M
= Wij2 (A.7)
i=1 j=1
Putting equations A.7 and A.3 together we get the desired result that the expected
total output energy is tr(W T W )σ 2 .
Suppose now we consider the model described in section 2.5.1. We have wavelet coeffi-
cients v consisting of the original wavelet coefficients w = W x plus a vector n containing
independent white Gaussian noise of mean zero and variance σ 2 .
v =w+n
If P is a matrix representing reconstrution filtering (that satisfies PR so that P W = IN )

then the expected energy of the error is given by

E
x − y
2 = E
x − P v
2

= E
P W x − P (w + n)
2

= E
P n
2 .
In words, the energy of the error is given by the energy of the output of the linear transform
P applied to a vector of white noise. Using the first result of this section we can write that
the expected energy of the error is given by tr(P T P )σ 2 . Using definition 2.14 we are now
in a position to write down a first expression for the noise gain g,
E {
y − x
2 }
g =
E {
v − w
2 }
tr(P T P )σ 2
=
Mσ 2
tr(P T P )
=
M
1 2
N M
= P .
M i=1 j=1 ij
A.2 Definition of d1, ...dN

Consider a real linear transform represented by w = W x. If W has size M by N and
now M ≥ N then we can take the singular value decomposition to give three real matrices
Appendix A.3 211
U, S, V such that
W = USV T (A.8)
where
• U has size M by M and is an orthogonal matrix.
• S has size M by N and is zero except for entries on the main diagonal S1,1 , ..., SN,N .
• V has size N by N and is an orthogonal matrix.
(A matrix is defined to be orthogonal if U T = U −1 . For a matrix of size n × n this is

equivalent [17] to saying that the rows form an orthonormal basis of R n .) di is defined to
be the square of the ith singular value:
2
di = Si,i . (A.9)
Without loss of generality we can order these such that d1 ≥ d2 ≥ ... ≥ dN .
A.3 Determining frame bounds

If we write a set of wavelets operating on a finite number of input samples in matrix form,
then the definition of a frame [30] says that a set of wavelets give a frame if and only if
there exists real values for A and B such that 0 < A ≤ B < ∞ and ∀x ∈ R N
A
x
2 ≤
W x
2 ≤ B
x
2 (A.10)
Consider
W x
2 ,

W x
2 = xT W T W x (A.11)
= xT V S T U T USV T x (A.12)
= xT V S T SV T x (A.13)
= yT S T Sy (A.14)

i=N
= di yi2 (A.15)
i=1

i=N
where y = V T x. All the di are real and non-negative and so clearly i=1 di yi2 varies
between dN
y
2 and d1
y
2 as dN and d1 are the smallest and largest of the di . Note that
212 Appendix A.5
by writing ek for the vector in R N with a 1 in the kth place and zeros elsewhere, we can
attain these bounds by y = eN or y = e1 .
Also note that because V T V = V V T = I,
x
2 = xT x = xT V V T x =
V T x
2 =
y
2 .
Putting these last results together we discover that ∀x ∈ R N
dN
x
2 ≤
W x
2 ≤ d1
x
2 (A.16)
with the bounds attainable by x = V eN or x = V e1 . This means that the tightest possible
frame bounds for the transform are dN and d1 and that the wavelets associated with the
transform form a frame if and only if dN > 0.
A.4 Signal energy gain

The definition of a frame says that (except for tight frames) the energy of a signal can
change as it is transformed (within certain bounds) and so we define the signal energy gain
as the gain in energy for a theoretical white noise input.
From A.1 we know that the expected energy in the wavelet coefficients is given by
tr(W T W )σ 2 . The input energy is σ 2 for each of the N inputs and so the signal energy
gain is
tr(W T W )σ 2 1
2
= tr(W T W ) (A.17)
Nσ N
1
= tr(V S T U T USV T ) (A.18)
N
1
= tr(V S T SV T ) (A.19)
N
1
= tr (SV T )T SV T (A.20)
N
1 T
= tr SV (SV T )T (A.21)
N
1 T
= tr SV V S T (A.22)
N
1
= tr(SS T ) (A.23)
N
1
i=N
= di (A.24)
N i=1
where we have made use of the result tr(AT A) = tr(AAT ).

Therefore the gain in signal energy is given by the average of the di .
Appendix A.5 213
A.5 Reconstruction noise gain bound

Consider the problem of inverting w = W x. We assume that the wavelets form a frame in
order for a PR system to be achievable. If M > N then the transform is redundant and
there are many choices for the reconstruction matrix. We seek the solution with minimum
noise gain. Section A.1 shows that the noise gain is given by the sum of the squares of the
entries in the inversion matrix P . We can solve for the minimum noise gain reconstruction
by separately solving the problem of inverting to get each of the xi .
Suppose that the inversion matrix P has rows qT1 , ..., qTN . The ith row qTi is involved
only in producing xi and so we need to minimise
qi
2 subject to qTi W x = xi . This is
equivalent to minimising
qi
2 subject to (qTi W )T = W T qi = ei where ei is defined as in
A.3. This is a standard problem whose solution is given by qi = W (W T W )−1 ei and so
qTi = ei (W T W )−1 W T .
Putting all the separate solutions together we discover that the minimum noise gain
perfect reconstruction matrix Q is given by Q = (W T W )−1 W T . Note that W T W must be
invertible for this solution to exist. W T W will be invertible if and only if all the singular
values are non-zero which will be true if and only if the associated wavelets form a frame.
We can now substitute the singular value decomposition for W to get
Q = (V S T U T USV T )−1 V S T U T (A.25)

= (V S T SV T )−1 V S T U T (A.26)
= (V T )−1 (S T S)−1 V −1 V S T U T (A.27)
= V (S T S)−1 S T U T (A.28)
S T S is the diagonal N by N matrix with diagonal entries of di and so S T S is always

invertible as these di are all non-zero.
Putting white noise of variance σ 2 in each wavelet we have a total expected energy of
Mσ 2 in the wavelets. From A.1 we expect an energy of tr(QT Q)σ 2 after reconstruction
and so we can now calculate the expected noise gain:
tr(QT Q)σ 2 1
= tr(QT Q) (A.29)
Mσ 2 M
1
= tr((V (S T S)−1 S T U T )T V (S T S)−1 S T U T ) (A.30)
M
1
= tr((US(S T S)−1 V T )V (S T S)−1 S T U T ) (A.31)
M
214 Appendix A.7
1
= tr(US(S T S)−1 (S T S)−1 S T U T ) (A.32)
M
1
= tr((S T S)−1 S T U T US(S T S)−1 ) (A.33)
M
1
= tr((S T S)−1 S T S(S T S)−1 ) (A.34)
M
1
= tr((S T S)−1 ) (A.35)
M
1 1
i=N
= (A.36)
M i=1 di
Q represents the transform with minimum noise gain and we conclude that any linear
perfect reconstruction transform that is used to invert W has noise gain bounded below

i=N 1
by M1 i=1 di and this lower bound is achievable.
A.6 Consequences of a tight frame

If the frame is tight then d1 = dN and so all the singular values are equal. From before
we know that the sum of the singular values is N and so d = 1. We can write S T S = IN
and so the matrix Q representing a reconstruction with the least noise gain becomes Q =
V (S T S)−1 S T U T = V S T U T = W T .
Therefore for a tight frame the transform can be inverted by the matrix W T , and this
inversion achieves the lower bound on noise gain.
A.7 Relation between noise gain and unbalance

From section A.1 we know that the noise gain of any linear perfect reconstruction transform,
tr(P T P )
P , used to invert W is given by M
. Consider the unbalance between P T and W.
We define the unbalance to be the sum of the squares of the differences between the two

matrices and so from equation A.7 we find that the unbalance is tr (P T − W )T (P T − W ) .
We can expand this expression to get:

tr (P T − W )T (P T − W ) = tr P P T − W T P T − P W + W T W (A.37)
= tr(P P T ) − tr(W T P T ) − tr(P W ) + tr(W T W )(A.38)
= tr(P T P ) − 2tr(P W ) + tr(W T W ) (A.39)
= tr(P T P ) − 2N + N (A.40)
Appendix A.7 215
= tr(P T P ) − N (A.41)
where we have used that P W = IN (as P is a perfect reconstruction transform) and that
tr(W T W ) = N (from the normalisation condition). We can rearrange this last result to
find that:
g = (N + U)/M (A.42)
tr(P T P )
where g = M
is the noise gain of the reconstruction and U = tr (P − W T )(P T − W )
is the unbalance.
216 Appendix A.7
Appendix B
Useful results
B.1 Summary
This appendix contains a number of useful mathematical results. The results are not
original but are included (expressed in the notation of this dissertation) for the sake of
completion. Although some of the proofs are long, they are all relatively straightforward.
Lemma 1 If Z and Y are independent zero mean wide sense stationary discrete Gaussian
random processes, then the covariance of Z + Y is given by the sum of the covariances of
Z and Y.
Proof. Let RZ (d) be the covariance of Z for vector displacement d and similarly let RY (d)
be the covariance of Y. The covariance RZ+Y of the sum of the random processes is given
by
RZ+Y = E {(Za + Ya ) (Zb + Yb) | b − a = d}

= E {Za Zb + Za Yb + Ya Zb + Ya Yb | b − a = d}
= E {Za Zb | b − a = d} + E {Za Yb | b − a = d}
+E {Ya Zb | b − a = d} + E {Ya Yb | b − a = d}
= RZ (d) + E {Za | b − a = d} E {Yb | b − a = d}
+E {Ya | b − a = d} E {Zb | b − a = d} + RY (d)
= RZ (d) + RY (d)
where we have made use of E {AB} = E {A} E {B} for independent random variables,
and that E {Za } = E {Zb } = 0 as the processes are zero mean. We have also made use of
217
218 Appendix B.1
the equivalence between correlation and covariance for zero mean processes, noting that
Z + Y will also have zero mean.
Lemma 2 The only separable circularly symmetric filter is a Gaussian.
Proof. Suppose the filter is given by f (r) where r is the radius. Then as the filter is
separable we know that
f (r) = g(x)h(y)
and if we assume f (0) = 1 we can adjust the scaling of g and h such that g(0) = h(0) =
f (0) = 1. Then we can set y = 0 to find
f (x) = g(x)h(0) = g(x)
and similarly x = 0 to find
h=g=f
This means that we have the following relationship
f (r) = f (x)f (y)
Taking logarithms and defining w(x2 ) = log f (x) we find
w(x2 + y 2 ) = w(x2 ) + w(y 2)
and so w(a) is a linear function with w(0) = log f (0) = log 1 = 0. Therefore w(a) = ka
for some constant k and we can write the filter as
f (r) = exp{w(r 2)} = exp{kr 2 }
thus showing that the filter must be a Gaussian. If f (0) = A then it is easy to show the
filter is of the form f (r) = A exp{kr 2 }.
Lemma 3

1 1
exp − (z − a) A (z − a) exp − (z − b) B (z − b)
T T
2 2

1 −1 T −1
∝ exp − z − (A + B) (Aa + Bb) (A + B) z − (A + B) (Aa + Bb)
2
Appendix B.1 219
Proof.

1 T 1 T
exp − (z − a) A (z − a) exp − (z − b) B (z − b)
2 2

1
= exp − (z − a) A (z − a) + (z − b) B (z − b)
T T
2

1 T
= exp − z Az − 2z Aa + a Aa + z Bz − 2z Bb + b Bb
T T T T T
2

1 T
= exp − z (A + B) z − 2z (Aa + Bb) + a Aa + b Bb
T T T
2

1 −1 T −1
= k exp − z − (A + B) (Aa + Bb) (A + B) z − (A + B) (Aa + Bb)
2
where

1 −1
k = exp − − (Aa + Bb) (A + B) (Aa + Bb) + a Aa + b Bb
T T T
2
Lemma 4 If C is a S × S symmetric matrix, I is the S × S identity matrix, D is a S × 1

vector, 0 is a S × 1 zero vector, then
 −1 −1
2
I/σM 0 C D C D C 2 −1
 +  = − IσM +C C D
0T 0 DT σ 2 DT σ 2 DT
Proof. This is a special case of the matrix inversion lemma, but we shall prove it
directly by multiplying the RHS by the inverse of the LHS.
 −1 
2
I/σM 0 C D
 + 
0T 0 DT σ 2

C D C 2 −1
− IσM +C C D
DT σ 2 DT
−1
2
I/σM 0 C D C D C D
= +
0T 0 DT σ 2 DT σ 2 DT σ 2

I/σM2
0 C 2 −1
− IσM +C C D
0T 0 DT
−1
C D C 2 −1
− IσM +C C D
DT σ 2 DT
220 Appendix B.2

2
C/σM 2
D/σM 2
C/σM +I−I 2 −1
= +I− IσM +C C D
0T 0 0T
−1
C D C D 2
(IσM +C)
−1
− C D
DT σ 2 DT σ 2 0T

2
C/σM 2
D/σM 2
C + IσM −1
2
= T
+I− T
IσM +C C D 2
/σM
0 0 0

I 2
−1 2
(IσM +C)
−1
+ IσM +C C D − C D
0T 0T

2
C/σM 2
D/σM 2
C + IσM 2
−1
= T
+I− T
IσM +C C D 2
/σM
0 0 0

2
(IσM +C)
−1 2
(IσM +C)
−1
+ C D − C D
0T 0T

2
C/σM 2
D/σM I
= T
+I− T
C D 2
/σM
0 0 0
= I
B.2 Bayesian point inference

This section describes the construction of the posterior distribution for the value at a point
in the case of noisy irregular sampled data from a wide sense stationary discrete Gaussian
random process. The observation model has been described in section 8.2 and we use the
same notation as in chapter 8.
Suppose we wish to obtain a point estimate for the random variable Zk corresponding
to location xk (we assume k > S). Consider the a priori joint distribution of this random
variable with the random variables Z1 , Z2 , . . . , ZS corresponding to the sample locations.
This is a multivariate Gaussian distribution that can be written in the following form:

Γ C D
s N 0, (B.1)
Zk DT σk2
Appendix B.2 221
where we have stacked the first S random variables into a column vector Γ, C is the
covariance between the values at the first S locations, D is the covariance between Zk and
the random variables for the first S locations, and σk2 is the variance of Zk .
The measurement equation 8.3 can be expressed as

Y s N Γ, σ2 IN (B.2)
where σ 2 is the (known) variance of the measurement noise.

Using Bayes’ theorem we can write
p(Y = y | Zk = z, Γ = γ)p(Zk = z, Γ = γ)
p(Zk = z, Γ = γ | Y = y) =
p(Y = y)
The prior pdf p(Zk = z, Γ = γ) is defined by the distribution in equation B.1 and the
likelihood pdf p(Y = y | Zk = z, Γ = γ) is defined by equation B.2 and so we can expand
this equation to
p(Zk = z, Γ = γ | Y = y) =
 T −1 
1 γ C D z
k exp −
y − γ
2 /2σ 2 exp − 
2 z DT σk2 z
 T 
1 γ−y 2
IS σ 0 γ −y
= k exp − 
2 z 0 T
0 z
 T −1 
1 γ C D γ
exp − T

2 z D σk 2
z
where k is a normalisation constant that ensures that the expression is a valid pdf. Using
lemma 3 we can write that
 T 
1 γ γ
p(Zk = z, Γ = γ | Y = y) ∝ exp − −a A−1 −a 
2 z z
where
 −1 −1
2
IS /σ 0 C D
A =  + 
0T 0 DT σk2

IS /σ 2 0 −y
a = A
0T 0 0
222 Appendix B.2

−y/σ 2
= A
0
and so we have proved that the joint posterior distribution is a multivariate Gaussian
distribution with mean a and covariance matrix A.
We are now able to compute the posterior distribution for Zk alone. If we write Zk as
T
0 Γ
Zk =
1 Zk
we can use the algebraic identity proved in lemma 4 to simplify A and calculate that Zk
has a normal distribution with mean
T T
0 0 y/σ 2
a = A −
1 1 0
T
0 C D C 2 −1
= − − σ IS + C C D
1 DT σk2 DT

−y/σ 2
0
 T 
D −y/σ 2
−1
= − − DT σ 2 IS + C C D 
σk2 0
−1
= DT y/σ 2 − DT σ 2 IS + C Cy/σ 2
−1 2 −1
= DT σ 2 IS + C σ IS + C y/σ 2 − DT σ 2 IS + C Cy/σ 2
−1
= DT σ 2 IS + C y.
Appendix C
Review of deconvolution techniques
This appendix reviews a number of deconvolution methods from a Bayesian perspective.

The emphasis in this review is on the estimates produced by the different approaches rather
than on the methods used to solve the minimisation problem. We attempt to relate each
deconvolution technique to the Bayesian framework described in section 9.1.1.
C.1 CLEAN
The CLEAN algorithm [44] consists of two steps that are repeated until the energy of the
residual is below a certain level (Tikhonov regularisation). The algorithm starts with a
blank image x(0) and produces a sequence of restored images x(1) , . . . , x(k) . The steps at
the k th iteration are
1. Find the strength s and position of the greatest intensity in the residual image y −
Hx(k−1) . Let m(k) be the index of this greatest intensity.
2. Add a point to the restored image at the peak position of strength s multiplied by a
damping factor γ known as the loop gain.

[x(k−1) ]i + sγ i = m(k)
[x(k) ]i =
[x(k−1) ]i i = m(k)
Let K be the number of iterations before convergence. The output of the algorithm is
x(K) , the last restored image (for cosmetic reasons this output is often smoothed with a
223
224 Appendix C.1
Gaussian after this restoration process). More efficient implementations of this method
exist, two examples are the Clark algorithm [25] and the Cotton-Schwab algorithm [106].
Marsh and Richardson have proved [79] that under certain conditions the CLEAN esti-
mate is equivalent to the MAP maximisation described in section 9.1.1 (and the additional
constraint that xi ≥ 0 for all i) with the choice of

f (x) = − log(β) + α xi
i
where α is chosen to make the algorithm terminate with K non-zero point sources in the
estimate and β is a constant chosen so that the prior corresponds to a valid pdf (with total
integral equal to one). The conditions for the proof to hold are essentially that the original
image consists of sufficiently well-separated point sources.
This choice of f (x) corresponds to a prior pdf of

p(x) = β exp −α xi I(X)
i
(
= β I(xi ) exp {−αxi }
i
where we define I(x) to be an indicator function that is zero if any component of x is

negative, and one otherwise. In summary, under certain conditions the CLEAN algorithm
uses a prior assumption that all the pixels are independently identically distributed with
an exponential distribution.
This equivalence does not always hold. Jeffs and Gunsay provide a counter example
[55] of two close point sources that are restored by the CLEAN algorithm to three sources
(including a false one half way between the two true sources) whereas the corresponding
Bayesian MAP estimate correctly restores the two original sources. As the MAP estimate
seems to perform better the authors proposed a maximally sparse restoration technique
[55] that explicitly uses the following image prior:
(
p(x) ∝ I(xi ) exp {−αxpi }
i
where p is a shape parameter that is normally in the range 0 < p < 1 for astronomical
denoising. This corresponds to a model in which the prior distribution for each pixel’s
intensity is an independent and identically distributed, one sided generalised p Gaussian.
Appendix C.2 225
The generalised p Gaussian distribution is also known as the Box-Tiao distribution and
includes many other distributions for specific choices of the shape parameter (e.g. p = 1
for exponential, p = 2 for Gaussian, p → ∞ for uniform).
C.2 Maximum Entropy

Briefly stated, the maximum-entropy principle (MAXENT) is [53, 54]
when we make inferences based on incomplete information, we should draw

them from that probability distribution that has the maximum entropy per-
mitted by the information we do have.
One way of applying this principle to image deconvolution [29] results in the following
definition of prior probability:
p(x) ∝ exp {S(x)/λ} I(x)
where
x x
S(x) = −
i log
i
i j xj j xj
and λ is an undetermined parameter. S(x) is known as the configurational entropy of the

image. The maximum entropy method fits naturally into our common Bayesian framework
with the choice of f (x) = −S(x).
There is some variation in the choice of entropy definition [125] and another common
definition used for image deconvolution is
xi
S(x) = − xi log
i
mi e
where mi is the ith component of a default image m. This default image is often chosen to
be a low resolution image of the object and will be the maximum entropy estimate in the
limit as the measurement noise increases to infinity. For this choice of entropy the prior
pdf can be factorised as
(
xi xi
p(x) ∝ I(x) exp log
i
λ mi e
226 Appendix C.3
( xi xi /λ
= I(xi ).
i
m i e
A factorised joint pdf means that the distribution of each pixel is independent. While this
may be appropriate for astronomical images, we have argued that real world images contain
significant correlations and that therefore such a maximum entropy deconvolution is less
appropriate. Note that such a conclusion rejects merely the precise application rather than
the MAXENT principle itself. The grounds for the rejection are that these methods do
not make use of all the available information about the nature of real images. Section C.10
gives an example of using the MAXENT principle together with a wavelet transform to
give a more appropriate algorithm.
C.3 Projection methods

There have been several attempts [42, 71, 131] to perform image recovery by the method
of projection onto convex sets. The convex sets specify required properties for the restored
image and then some algorithm is used to find a solution that is in the intersection of all
of the sets. The simplest method for this is to sequentially project an estimate onto each
of the sets in turn and then repeat until all the constraints are satisfied. A convex set is a
set for which any linear interpolation between two points in the set will also be in the set.
In mathematics, a set S is convex if and only if
∀x1 ∈ S, ∀x2 ∈ S, ∀λ ∈ [0, 1], λx1 + (1 − λ)x2 ∈ S
Combettes uses this method for the problem of restoring an image blurred with a 9 × 9
uniform kernel by means of the following constraints (that each correspond to convex sets)
[27]:
1. The image x should contain nonnegative pixel values.
2. It is assumed that the DFT of the image x is known on one fourth of its support
for low frequencies in both directions. (These known values are taken from the DFT
of the observed image divided by the gain of the blurring filter at the corresponding
frequencies.)
Appendix C.4 227
3. The assumption of Gaussian noise means that, with a 95% confidence coefficient, the
image satisfies the constraint

y − Hx
2 ≤ ρ
where ρ takes some value that can be calculated from statistical tables based on the
variance σ 2 of the measurement noise and the number of pixels. It is assumed that
the above constraint will be satisfied by the restored image.
For a general value of ρ this scheme does not naturally fit into the Bayesian framework,
partly because the final solution can depend on the starting conditions (normally chosen
to be the observed image). However, if ρ is reduced until the intersection of the constraint
sets contains a single point then the method becomes equivalent to Miller regularisation.
The corresponding prior is proportional to the characteristic function of the intersection
of the first two constraints described above. In other words, the prior takes some constant
value within the intersection, but is zero outside.
C.4 Wiener filtering

Wiener filtering [5] can be implemented by the following algorithm:
1. Compute the Fourier transform coefficients fi = [F y]i of the data.
2. Compute the Fourier transform coefficients mi of the blurring filter.
3. Multiply each coefficient fi by a gain gi given by

m∗i
gi =
|mi |2 + 1/SNRi
where SNRi is the estimated ratio of signal energy to noise energy for this coefficient
and m∗i represents the complex conjugate of mi .
4. Invert the Fourier transform of the new coefficients gi fi .
The problem with Wiener filtering is the estimation of signal to noise ratios. It can be
easily shown that the best gain (in terms of minimising the expected energy of the error)
for a given image x is given by SNRi = | [F x]i |2 . We call the corresponding gain the Oracle
gain but this cannot be used in practice because it requires access to the original image.
228 Appendix C.5
In terms of our Bayesian viewpoint the cost function that corresponds to Wiener filtering
is
1
f (x) = | [F x]i |2 (C.1)
i
σ 2 SNR i
The previous methods were equivalent to assuming that the pixel intensity values were
independent and indentically distributed. In contrast, the above cost function cannot be
factorised in the same way. Instead it is the Fourier components of the image that are
assumed independent, but with different Gaussian distributions for each component.
C.5 Iterative methods

The three most common iterative deconvolution methods are the Van Cittert [112], Landwe-
ber [68], and Richardson-Lucy [102] methods. These iterative algorithms define a sequence
x(0) , . . . , x(K) of restored images evolving according to some equation [112]. Van Cittert
proposed the following iteration:

x(n+1) = x(n) + α y − Hx(n) (C.2)
where α is a convergence parameter generally taken as 1. Landweber proposed the iteration

x(n+1) = x(n) + αH T y − Hx(n) (C.3)
The Richardson-Lucy method uses

−1
x(n+1) = diag x(n) H T diag Hx(n) y (C.4)
We use the notation diag {a} for a ∈ R N to represent a diagonal matrix of size N by N
with the entries of a along the diagonal. When these algorithms are used for astronomical
images there is also a step that after each iteration sets all the negative entries in xk to
zero. The restored solution is taken to be the restored image at a particular iteration.
We first describe the usual [81], but misleading, way of viewing these methods within
the Bayesian framework, and then a better way. Consider the Van Cittert and Landweber
methods. If these algorithms converge then the converged solution can be either considered
as the MAP estimate corresponding to a flat (improper) prior
p(x) ∝ 1
Appendix C.5 229
or the maximum likelihood estimate. (If the non-negativity constraint is used then the
solution is the MAP estimate for a prior of p(x) ∝ I(x)). The Richardson-Lucy method
can be viewed in the same way except that it uses a different observation model. Based
on the idea of random photon arrival times the observations are modelled as independent
samples from Poisson distributions where the parameters of the Poisson processes are the
unknown source intensities x. In this model the observations in y consist of non-negative
integer counts of the number of photons detected. The likelihood function is
( [F x]yi i
p(y|x) = exp {− [F x]i }
i
yi !
Although the iterative methods are sometimes justified [81] by these choices of prior and
likelihood the converged estimates can be severely corrupted due to the large amplification
of noise [15] while the intermediate restorations are better. The methods are therefore
unusual in that it is crucial to terminate the algorithm before convergence is reached.
To explain the effect of early convergence we assume that we are deconvolving images
without using the positivity constraint. The effect has been explained [15] with an eigen-
vector analysis but here we give an alternative treatment based on the Fourier transform.
Recall that the blurring filter can be expressed as F H MF . Let o(n) be the Fourier
transform coefficients of the restored images:
o(n) = F x(n)
Taking the Fourier transform of the Van Cittert iteration we find

F x(n+1) = F x(n) + α y − Hx(n)

o(n+1) = F x(n) + α F y − F F H MF x(n)

= o(n) + α f − Mo(n)
= o(n) (IN − αM) + αf
This is a very simple expression because all the matrices are diagonal. We can write the
iteration separately for each Fourier coefficient as
(n+1) (n)
oi = oi (1 − αmi ) + αfi
(0)
Using the initialisation of oi = 0 we can solve this equation to find
(K)

K−1
oi = αfi (1 − αmi )n
n=0
230 Appendix C.5
1 − (1 − αmi )K
= αfi
αmi
1 − (1 − αmi )K
= fi
mi
The restored image produced by the Van Cittert iteration is given by x(K) = F H o(K) and
by comparison with the algorithm for Wiener filtering above we conclude that Van Cittert
restoration is equivalent to Wiener filtering with a gain gi given by
1 − (1 − αmi )K
gi = .
mi
The corresponding assumption about the signal to noise ratio is
1
SNRi = m∗i
gi
− |mi |2
1
= |mi |2
1−(1−αmi )K
− |mi |2
Assuming that α is small enough for the algorithm to converge then it is clear that gi →
1/mi in the limit as K → ∞ but the algorithm is designed to terminate long before
convergence. The assumption about the SNR level of a Fourier coefficient is a function of
α, K, and the level of blurring mi for that coefficient. Figure C.1 shows the assumed SNR
values for gains mi varying between 0 and 1 if K = 3 and α = 1. For a typical blurring
function that decays with increasing frequency the Van Cittert method effectively assumes
that the data has a power spectrum that also decays with increasing frequency. However,
for small gains the assumed SNR actually increases. This may lead to high frequency noise
artefacts in reconstructed images.
Similarly we can also take the Fourier transform of the Landweber method. The only
difference is the multiplication by H T . As this is a real matrix we can write
HT = HH
H
= F H MF
= F HMHF
Using this result we can rewrite the Landweber method in terms of Fourier coefficients as

1 − α|mi |2 + αm∗i fi
(n+1) (n)
oi = oi
Appendix C.5 231
70
60
Assumed SNR/dB
50
40
30
20
10
0 0.2 0.4 0.6 0.8 1
Filter gain
Figure C.1: Effective assumption about SNR levels for Van Cittert restoration (K = 3,α =
1).
232 Appendix C.5
n+1
1 − (1 − α|mi |2 )
= fi m∗i
|mi |2
n+1
1 − (1 − α|mi |2 )
= fi
mi
Again we conclude that the Landweber method is equivalent to Wiener filtering with the
following assumption about the SNR
1
SNRi = m∗i
gi
− |mi |2
1
= |mi |2
1−(1−α|mi |2 )K
− |mi |2
Figure C.2 shows this assumed SNR values for gains mi varying between 0 and 1 if K = 3
and α = 1. The figure shows that the Landweber method has a smooth decrease in
60
50
Assumed SNR/dB
40
30
20
10
0
0 0.2 0.4 0.6 0.8 1
Filter gain
Figure C.2: Effective assumption about SNR levels for Landweber restoration (K = 3,α =
1).
assumed SNR levels even for low gains and therefore the restored results should avoid the
high frequency artifacts of the Van Cittert method.
Appendix C.6 233
The Richarson-Lucy method is not as simple to express in the Fourier domain due
to the presence of multiplications and divisions that are implemented pixel-wise on im-
ages. However, we claim that early termination corresponds to making approximately the
same assumption (of a stationary Gaussian random process) about image structure as in
the Landweber method and consequently that early termination of the Richardson-Lucy
method is approximately a particular case of Wiener filtering.
To demonstrate this claim we imagine applying the Richardson-Lucy method to an
image after increasing all the intensity values (in the data and the intermediate restorations)
by a constant 1/
. The iteration becomes
−1
x(n+1) + 1/
= diag x(n) + 1/
H T diag H x(n) + 1/
(y + 1/
)
We perform this shift in order to be able to approximate multiplications and divisions by

additions and subtractions. We use the notation O(
a ) to represent terms that are of order
a . In other words, O(
a ) represents a polynomial function of
in which every exponent
of
is at least a. We will assume that the image has been rescaled such that the blurring
filter has unity response at zero frequency and thus H1 = H T 1 = 1 (H T corresponds to
filtering with a filter h(−x, −y) rather than h(x, y) and so will also have unity response at
zero frequency). For sufficiently small values of
this expression can be written as

x(n+1) = −1/
+ diag x(n) + 1/
H T diag
1 −
Hx(n) + O(
2 ) (y + 1/
)

= −1/
+ diag x(n) + 1/
H T
IN −
2 diag Hx(n) + O(
3 ) (y + 1/
)

= diag x(n) H T 1 + H T y − H T diag Hx(n) 1 + O(
)

= diag x(n) 1 + H T y − H T Hx(n) + O(
)

= x(n) + H T y − Hx(n) + O(
)
Comparing this with equation C.3 we conclude that (if the algorithm is initialised to have
x(0) = 0) then the shifted Richardson-Lucy method (with removed positivity constraints)
is within order
of the Landweber method. In particular, the restored image from this
shifted Richardson-Lucy method will tend to the Landweber solution in the limit as
→ 0.
We have described the link between the iterative methods and Wiener filtering. An
explicit definition of the assumed cost function is given by substituting the above expresse-
nions for the assumned SNR levels into equation C.1.
234 Appendix C.7
C.6 Constrained Least Squares

Constrained least squares methods use
f (x) =
Cx
2
for some square matrix C known as the regularising operator [15, 61]. This operator is
chosen to apply little regularisation where the signal energy is expected to be high, but
significant regularisation where the noise energy dominates the signal. For many images
the signal energy is concentrated at low frequencies and so the operator is chosen to act
like a high pass filter. One common choice for the coefficients is a discrete approximation
to a 2-D Laplacian filter such as [115]
 
0.7 1 0.7
 
 1 −6.8 1 
 
0.7 1 0.7
Recent attempts to solve this problem have been based on Hopfield neural networks [133,
115, 92] where the idea is to perform gradient descent minimisation of the energy function
with the restriction that the change to each intensity value must always belong to the set
{−1, 0, 1}.
If both the blurring operator H and the regularising operator C represent linear, space-
invariant filters then the energy function can be efficiently represented in terms of the
Fourier coefficients of the image x. In this case it is straightforward to show that the
estimate that minimises the energy function is given by a Wiener filter (with the esti-
mated SNR values inversely proportional to the squares of the Fourier coefficients of the
regularising operator). This proves that the performance of the Oracle Wiener filter is an
upper bound on the performance of such methods unless stopping before convergence gives
improved results.
C.7 Total variation and Markov Random Fields

The total variation functional JT V (u) is defined on a continuous function u(x, y) as [90]
# #
JT V (u) = |∇u|dxdy
y x
Appendix C.8 235
JT V (u) is used as the stabilising functional within Tikhonov regularisation and therefore
the corresponding f (x) is equal to a discretized version of the above integration. To avoid
difficulties with the nondifferentiability the functional Jβ (u) is often used instead [123]
# #
Jβ (u) = |∇u|2 + β 2 dxdy.
y x
The prior pdf corresponding to total variation denoising therefore penalises oscillations
in an image as this increases the total variation, and instead favours edges and tends to
produce piecewise constant images.
Markov Random Fields [56, 132] provide a more general way of constructing a prior pdf
based on local neighbourhoods that again aims to favour smooth regions while permitting
occasional discontinuities.
C.8 Minimax wavelet deconvolution

The minimax wavelet methods generally work via a two stage process that uses a stationary
inverse filter followed by a wavelet denoising technique. The frequency response Gα (f) of
the inverse filter is given by

1 |H(f)|2 Px (f )
Gα (f ) = (C.5)
H(f) |H(f)|2 Px (f) + ασ 2
where Px (f) is the power spectral density of the signal, H(f) is the Fourier transform of the
linear blurring filter, σ 2 is the amount of noise added after blurring, and α is a parameter
that controls the amount of regularisation. Both the inverse filter and the wavelet denoising
can be used to reduce the level of noise. Several choices for the amount of regularisation
and the wavelet denoising method have been proposed. Donoho [38] uses α = 0 so that all
the noise removal must be done by the wavelet denoising. Wiener filtering corresponds to
α = 1 with no wavelet denoising. Nowak and Thul used an under-regularized linear filter
(0 < α < 1) [87] and Neelamani et al studied the effect of the amount of regularization and
found α ≈ 0.25 usually gave good results [84]. This approach is known as WaRD standing
for Wavelet-based Regularized Deconvolution. Kalifa and Mallat [58] use α = 0 together
with a mirror wavelet basis. This algorithm and the mirror wavelet basis are described
in detail in section C.9. An algorithm of this type is used in the production channel of
CNES satellite images[58]. Their mirror wavelet transform is similar to a standard real
236 Appendix C.8
wavelet transform except that additional filtering steps are performed on the highpass
subbands. The extra filtering produces greater frequency localisation for high frequencies.
Extra filtering can also be applied to the DT-CWT to produce a complex wavelet packet
transform [52]. Such a transform has been found to give slightly better performance than
the nondecimated version of the mirror wavelet algorithm (and much superior performance
to the decimated version) [52].
These methods are harder to place within our common framework because they are
motivated by the belief that such a framework is fundamentally flawed and that minimax
methods offer an alternative solution.
Bayesian techniques require stochastic models of the expected signals. It is claimed [58]
that there is no “good” stochastic model for natural images. Instead minimax estimation
makes use of a prior set Θ where the signal is guaranteed to be. The estimator is then
designed by trying to minimise the maximum risk over Θ. Suppose that we have obser-
vations y ∈ R N from which we wish to construct an estimator F̂(y) of the original signal
x ∈ R N . The risk of the estimator is defined to be [58]
+ ,
r(F̂, x) = EN
F̂(y) − x
2 .
Note that this is a function of the original signal x. The original signal is fixed and the
expectation is taken over all possible values for the noise in the observation model.
In the Bayesian approach we have an estimate for the relative probability of different
signals (the prior) and we can compute the total expected risk for an estimator with
the expectation taken over all possile signals. A standard result is that this total risk is
minimised by using the posterior mean estimate. However, the minimax approach avoids
using the prior pdf on the grounds that the prior is not a sufficiently “good” model. Instead
the estimator is based on the maximum risk, defined as
+ ,
r(F̂, Θ) = sup E
F̂(y) − x
2 .
x∈Θ
The estimator is designed to minimise this maximum risk.

This approach is standard in the theory of games. Consider the following game. Player
A has to design a deconvolution algorithm to maximise the SNR of deconvolved images.
Player B has to choose the test images1 . Suppose that after A has designed the algorithm
1
The idea is that B chooses the test image x. This test image is then blurred and degraded according
to equation 9.1 before being given to A.
Appendix C.9 237
B is allowed to see the program and construct a test image deliberately designed to produce
the algorithm’s worst possible performance. It can easily be shown that the best approach
for A (in terms of maximising SNR) is to use the minimax approach to design the algorithm.
However, if player B is not so malicious and simply decides to produce test images according
to some stochastic model then the best approach for A is to use a Bayesian posterior mean
estimate, using B’s stochastic model for the prior pdf.
The conclusions in these two cases are widely accepted, what is not agreed is an appro-
priate approach for the case when player B is not malicious, but when the model used to
produce images is unknown. This is the case for most real world images.
We mentioned earlier the claim that there is no “good” model for natural images.
The problem with this claim is that it is not clear what “good” means. We agree that
realisations from typical Bayesian models do not produce realistic images, but models can
often give a reasonable guide to the relative probability of small deviations from a real
world image. For example, wavelet models will prefer a priori a portion of the image being
smooth rather than containing high frequency noise. Furthermore, if “good” is taken
to mean that the resulting algorithms give high accuracy results then the claim can be
experimentally tested. Later results will show that the Bayesian model is good in this
sense.
In summary, the minimax approach gives results with a known worst case performance.
The methods tend to be robust but take no advantage of the probable structure within
the data (other than limiting the data to the set Θ). Bayesian methods attempt to model
the prior information about likely image structures and thus give results whose quality
depends on the accuracy of the model. The Bayesian method has the potential to give
better reconstructions but it is possible that certain images will be very badly restored. As
we are interested in getting good results on typical images we select the Bayesian method.
C.9 Mirror wavelet deconvolution

This section describes the algorithm for deconvolution using mirror wavelets proposed by
Kalifa and Mallat [58]. The performance of the method depends on the choice of parameter
values. Unfortunately, for commercial reasons, the values used in the published experiments
were not given. We make use of what we believe to be a reasonable choice but it is possible
that other choices could give better performance.
238 Appendix C.9
We first give a brief description of the mirror wavelet transform and then explain how
this transform is used for deconvolution.
C.9.1 Mirror Wavelet Transform

Figure 2.2 of chapter 2 shows the subband decomposition tree for a standard wavelet
transform. Such a tree produces some very short wavelets whose frequency response covers
the upper half of the spectrum. For some blurring functions there can be a considerable
difference in amplification across this range of frequencies. Consequently, at the lower end
there may be a high SNR, while for the highest frequencies the SNR may be very low. It
therefore seems inappropriate to group all these frequencies within a single subband. The
mirror wavelet decomposition addresses this problem by performing recursive filtering on
the most detailed subband [58]. Figure C.3 shows the subband decomposition structure
for the mirror wavelet. The filters are given by the symlet wavelets of order 4. These
Level 3 -
Level 4
x000a
- ↓ 2 - x0000a
- -
H0a
- - Level 2 x00a

↓2

H0a
-
Level 1 - x0a
-
- -
↓2 ↓2 x0001a

H0a H1a
- - -
↓2 - - ↓2

H0a H1a x001a
-

↓2
Standard
H1a x01a

Tree
- - - x000b

↓2

x0000b

H0b
- - x00b

↓2
x

H0b
- - x0b
- - -
- -
↓2 ↓2 x0001b

H0b H1b
- - -
↓2 - - ↓2

Mirror H1a H1b x001b
Tree - H1b ↓2 x01b
Figure C.3: The mirror wavelet tree structure

Appendix C.9 239
orthogonal wavelets possess the least asymmetry and highest number of vanishing moments
for a given support width [30]. The filters in the mirror tree (for levels above 1) are given
by the time reverse of the filters in the standard tree:
H0a (z) = H0b (z −1 ) = −0.0758z 3 − 0.0296z 2 + 0.4976z 1 + 0.8037z 0

+0.2979z −1 − 0.0992−2 − 0.0126z −3 + 0.0322z −4
H1a (z) = H1b (z −1 ) = −0.0322z 3 − 0.0126z 2 + 0.0992z 1 + 0.2979z 0 − 0.8037z −1
+0.4976z −2 + 0.0296z −3 − 0.0758z −4
The tree can be inverted by repeated application of the reconstruction block.

The mirror wavelets are extended to 2D in such a way as to achieve a fine frequency
resolution for increasing frequency. Details of the 2D basis and a fast 2D transform can be
found in [58]. Figure C.4 shows the frequency response contours that are given by a three
level 2D mirror wavelet transform. The frequencies are normalised so that a frequency of
32 corresponds to half the sampling rate.
C.9.2 Deconvolution Algorithm

The deconvolution algorithm consists of inverse filtering followed by wavelet denoising. As
in chapter 9 suppose that the blurring operator is represented by the matrix F H MF where
M is a diagonal matrix with diagonal entries m. Define a new vector u (representing the
inverse filter) by

1/mi if |mi | >
ui =
0 otherwise
We choose
= 0.01 in our experiments. The inverse filtering step produces a new image x0
given by x0 = F H diag {u} F d (this, of course, is implemented using the Fourier transform
rather than matrix multiplication).
The wavelet denoising is based on estimates σk2 of the variance of the (inverse filtered)
noise in the subbands of the mirror wavelet transform of x0 . The inverse filtering step tends
to considerably amplify the noise for high frequencies for typical blurring functions. These
variances can be precisely computed from the Fourier transform of the filter [58] but in
practice it is easier to estimate these values by calculating the mirror wavelet transform of
an image containing white noise of variance σ 2 that has been inverse filtered. The average
energy of the wavelet coefficients in the corresponding subbands provide estimates of σk2 .
240 Appendix C.9
30
25
20
15
10
5 10 15 20 25 30
Figure C.4: 2D frequency responses of the mirror wavelet subbands shown as contours at
75% peak energy amplitude.
Appendix C.10 241
We define a “noise subband” to be a subband for which σk > T where T is some

threshold level. We choose T = 30 in our experiments. These subbands contain very little
useful information.
The wavelet denoising process consists of the following steps:
1. Compute the mirror wavelet transform of x0 .
2. Set all wavelet coefficients in noise subbands to zero.
3. Apply a soft thresholding rule to all the wavelet coefficients. For a coefficient wi
belonging to subband k the new value ŵi is given by


 wi − βσk wi > βσk

ŵi = wi + βσk wi < βσk


 0 otherwise
where we choose β = 1.6 in our experiments.
4. Invert the mirror wavelet transform to compute the deconvolved estimate x̂.
This is a single pass algorithm involving only one forward and one inverse wavelet transform
and hence is fast. In practice (and in our experiments) the shift invariant version of the
mirror wavelet transform is always used as it gives better results. This can be viewed
theoretically as averaging the results of the decimated version over all possible translations.
This averaging is implemented by using the much slower nondecimated form of the mirror
wavelet.
C.10 Wavelet-based smoothness priors

A second group of methods use wavelet transforms to construct a prior pdf for images. We
have shown that the basic model described in chapter 7 (using a common prior variance for
every coefficient in a given subband) is approximately equivalent to a stationary Gaussian
random process model but there are many alternative priors that can be constructed using
wavelets. Some examples of priors that have been used for deconvolution are:
1. Wang et al [124] used the output of an edge detector applied to the noisy data to alter
the degree of regularisation in a multiscale smoothness constraint. This algorithm
242 Appendix C.10
used the cost function

f (x) = λi | [W x]i |2
i
where W represents a forward real wavelet transform (Daubechies’ fifth order com-
pactly supported wavelet [30]) and {λi } are scaling parameters chosen using the
output of the edge detector.
2. Starck and Pantin have proposed [93] a multiscale maximum entropy method that
uses the cost function2

| [W x]i |
f (x) = − λi [W x]i − mi − | [W x]i | log
i
mi
where W represents a nondecimated forward transform (a form of Laplacian pyramid

is used where the lowpass filtering is performed with a 5 × 5 mask based on a cubic
B spline), {mi } are some constant reference values given by mi = σ/100, and {λi }
are weighting factors that alter the degree of regularisation. These factors are chosen
based on the size of the wavelet coefficients of the observed image. If the coefficient is
below a threshold of 3σ then it is deemed to be in the multiresolution support and the
factor is set to some constant value σj that depends only on the scale j of the wavelet
coefficient. These coefficients are allowed to vary in order to match the observations.
However, if the coefficient is large then the factor is set to zero and it is not allowed
to vary. A constraint is added to the problem that requires all coefficients not in the
multiresolution support to be equal to the value in the observed image.
3. Banham and Katsaggelos [8] use an autoregressive prior model which evolves from
coarse to fine scales. The parameters of the model are based on the output of an
edge detector applied to a prefiltered version of the noisy data. A multiscale Kalman
filter is then used to estimate the original image.
4. Belge et al [11] use a non-Gaussian random process model for which the wavelet
coefficients are modelled as being independently distributed according to generalised
Gaussian distribution laws. The resulting energy function is minimised in a doubly
2
This cost function appears strange because it is not a symmetric function of the wavelet coefficients.
This is probably a mistake but we have chosen to keep the form as given in the reference.
Appendix C.10 243
iterative algorithm. In other words, the iterative minimisation algorithm is based

upon another iterative algorithm, which itself requires wavelet and Fourier transforms
to evaluate. This algorithm corresponds to the cost function

f (x) = λi | [W x]i |p
i
where W represents a forward real wavelet transform (Daubechies’ 8 tap most sym-
metric wavelets were used [30]), {λi } are scaling parameters for the different wavelet
coefficients, and p is a parameter chosen near 1.
5. Piña and Puetter use a Pixon method [94, 99] that adapts the number of parameters
describing the image to the smallest number consistent with the observed data. The
parameters in the Pixon method are coefficients of certain kernel functions. These
kernel functions are defined at a number of different scales (typically 4 per octave)
and orientations and can be regarded as the reconstruction wavelets corresponding to
some redundant wavelet transform. Using this interpretation of the kernel functions
suggests that the Pixon method is approximately equivalent to using a sparseness
prior (of the sort seen in section C.1) for the wavelet coefficients:

f (x) = |wi|p
i
where p is a real scalar parameter controlling the degree of sparseness and {wi } is
the set of parameters (wavelet coefficients) that specify the image via the relation
x = Pw
where P is a reconstruction matrix built out of the kernel functions used in the
Pixon method. Note that an estimate based on this objective function would only
approximate the true Pixon estimate as it neglects certain features of the method
(for example, for a particular position in the image there will be, say, K parameters
corresponding to the K different subbands but the Pixon method only allows at most
one of these parameters to be non-zero).
The first three methods use a prior that is a function of the noisy image and are therefore
known as empirical Bayes methods. There are many other choices for the image prior that
have been used in other applications. One example that has already been mentioned in
this dissertation is the Hidden Markov Tree (HMT) model discussed in chapter 3.
244 Bibliography
Bibliography
[1] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid

methods in Image processing. RCA Engineer, 29(6):33–41, 1984.
[2] F. Alabert. The practice of fast conditional simulations through the LU decomposi-
tion of the covariance matrix. Math Geology, 19(5):369–386, 1987.
[3] B. Alpert. A class of bases in L2 for the sparse representation of integral operators.
SIAM J. Math Analysis, 1993.
[4] T. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley &

Sons, 1958.
[5] H. C. Andrews and B. R. Hunt. Digital Image Restoration. Englewood Cliffs, NJ:
Prentice-Hall, 1977.
[6] F. Auger and P. Flandrin. Improving the Readability of Time-Frequency and Time-
Scale Representations by the Reassignment Method. IEEE Trans. on Signal Proc.,
43(5):1068–1089, May 1995.
[7] R. H. Bamberger and M. J. T. Smith. A filter bank for the directional decomposition
of images. IEEE Trans. on Sig. Proc., 40(4):882–893, April 1992.
[8] M. R. Banham and A. K. Katsaggelos. Spatially adaptive wavelet-based multiscale

image restoration. IEEE Trans. Image Proc., 5:619–633, Apr 1996.
[9] S. A. Barker and P. J. W. Rayner. in Lecture Notes in Computer Science Vol. 1223.
(Eds. M. Pelillo and E. R. Hancock), chapter Unsupervised image segmentation using
markov random field models. Springer-Verlag, 1997.
245
246 Bibliography
[10] C. Becchetti and P. Campisi. Binomial linear predictive approach to texture mod-
elling. In International Conference on Digital Signal Processing 97, volume 2, pages
1111–1114, 1997.
[11] M. Belge, M. E. Kilmer, and E. L. Miller. Wavelet domain image restoration with
adaptive edge-preserving regularization. IEEE Trans. on Im. Proc., 9(4):597–608,
Apr 2000.
[12] J. Berger, V. De Oliveira, and B. Sans. Objective Bayesian Analysis of Spatially

Correlated Data Tech. Report 2000-05. Technical report, Centro de Estadstica y
Software Matemtico Universidad Simn Bolvar, 2000.
[13] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer series in
statistics. Springer-Verlag Inc., New York, 1985.
[14] K. Berkner, M. J. Gormish, E. L. Schwartz, and M. Boliek. A new wavelet-based

approach to sharpening and smoothing of images in Besov spaces with applications
to deblurring. In 2000 IEEE Int. Conf. on Image Proc., 2000.
[15] J. Biemond, R. L. Lagendijk, and R. M. Mersereau. Iterative methods for image

deblurring. Proc. IEEE, 78(5):856–883, May 1990.
[16] J. S. De Bonet. Novel Statistical Multiresolution Techniques for Image Synthesis,

Discrimination, and Recognition. Master’s thesis, The Microsoft Advanced Technol-
ogy Vision Research Group, 1997.
[17] I.N. Bronshtein and K. A. Semendyayev. Handbook of Mathematics. Verlag Harri

Deutsch, Thun and Frankfurt/Main, 1958.
[18] P. Burt and E. Adelson. A Multiresolution Spline with Application to Image Mosaics.
ACM Transactions on Graphics, 2:217–236, 1983.
[19] E. J. Candès. Harmonic analysis of neural networks. Applied and Computational

Harmonic Analysis, 6(2):197–218, 1999.
[20] E. J. Candès and D. L. Donoho. Ridgelets: a key to higher-dimensional intermit-

tency? Phil. Trans. Royal Society London A, 357, 1999.
Bibliography 247
[21] R. Chellappa and R.L. Kashyap. Digital image restoration using spatial interaction
models. IEEE Trans. Acoust., Speech, Signal Processing, 30:461–472, June 1982.
[22] R. Chellappa and R.L. Kashyap. Texture synthesis using 2-D noncausal autoregres-
sive models. IEEE Trans. Acoust., Speech, Signal Processing, 33(1):194–203, 1985.
[23] H. Choi and R. Baraniuk. Interpolation and denoising of nonuniformly sampled data
using wavelet-domain processing. In ICASSP 99, 1999.
[24] H. Choi, J. K. Romberg, R.G. Baraniuk, and N. G. Kingsbury. Hidden Markov

Tree Modeling of Complex Wavelet Transforms. In Proceedings of IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing - ICASSP’00, Istanbul,
Turkey, June 2000.
[25] B. G. Clark. An efficient implementation of the algorithm ‘CLEAN’ . Astron. As-

trophys., 89:377–378, 1980.
[26] R. Coifman and D. L. Donoho. Lecture Notes in Statistics, chapter Translation-

invariant de-noising. Springer-Verlag, 1995.
[27] P. L. Combettes. Convex Set Theoretic Image Recovery by Extrapolated Iterations

of Parallel Subgradient Projections. IEEE Trans. on Image Processing, 6(4):493–506,
March 1997.
[28] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-Based Signal Processing

Using Hidden Markov Models. IEEE Transactions on Signal Processing (Special
Issue on Wavelets and Filterbanks), April 1998.
[29] G. J. Daniell and S. F. Gull. Maximum entropy algorithm applied to image enhance-
ment. IEE Proceedings, part E, 127(5):170–172, September 1980.
[30] I. Daubechies. Ten Lectures on Wavelets. SIAM, 1992.
[31] J. Daugman. Two-dimensional spectral analysis of cortical receptive field profiles.

Vision Research, 20:847–856, 1980.
[32] J. Daugman. Complete Discrete 2-D Gabor Transforms by Neural Networks for
Image Analysis and Compression. IEEE Trans. on Acoustics, Speech, and Signal
Proc., 36(7):1169–1179, July 1988.
248 Bibliography
[33] P. F. C. de Rivaz and N. G. Kingsbury. Complex wavelet features for Fast Texture
Image Retrieval. In ICIP 99, 1999.
[34] P. F. C. de Rivaz and N. G. Kingsbury. Fast segmentation using level set curves of
complex wavelet surfaces. In ICIP 2000, 2000.
[35] P. F. C. de Rivaz, N. G. Kingsbury, and J. Moffatt. Wavelets for Fast Bayesian Geo-
physical Interpolation CUED/F-INFENG/TR.354. Technical report, Department of
Engineering, University of Cambridge, July 1999.
[36] C. V. Deutsch and A. G. Journel. GSLIB Geostatistical Software Library and User’s
Guide. Oxford University Press, 1992.
[37] E. J. Diethorn and D. C. Munson, Jr. A linear, time-varying system framework

for noniterative discrete-time band-limited signal extrapolation. IEEE Trans. Signal
Proc., 39:55–68, Jan 1991.
[38] D. L. Donoho. Nonlinear solution of linear inverse problems by Wavelet-Vaguelette

Decomposition. App. Comp. Harmonic Anal., 2:101–126, 1995.
[39] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage.

Biometrika, 81:425–455, 1994.
[40] A. Fournier, D. Fussel, and L. Carpenter. Computer Rendering of Stochastic Models.

Communications of the ACM, 1982.
[41] J. Gómez-Hernández. A stochastic approach to the simulation of block conductiv-

ity fields conditioned upon data measured at a smaller scale. PhD thesis, Stanford
University, Stanford, CA, 1991.
[42] R. Gordon, R. Bender, and G. T. Herman. Algebraic reconstruction techniques

(ART) for three-dimensional electron microscopy and X-ray photography. J. Theoret.
Biol., 29:471–481, December 1970.
[43] S. Hatipoglu, S. Mitra, and N. Kingsbury. Texture classification using dual-tree

complex wavelet transform. In Proceedings of the IEE conference on Image Processing
and its Applications, Manchester, pages 344–347, July 1999.
Bibliography 249
[44] J. A. Högbom. Aperture synthesis with a non-regular distribution of interferometer

baselines. Astron. Astrophys. Suppl. Ser., 15:417–426, 1974.
[45] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In ICIP

95, volume 3, pages 648–651, Oct 1995.
[46] L. Hervé. Multi-resolution analysis of multiplicity d. Application to dyadic interpo-

lation. Comput. Harmonic Anal., pages 299–315, 1994.
[47] P. R. Hill, D. R. Bull, and C. N. Canagarajah. Rotationally invariant texture classi-

fication. In IEE seminar on time-scale and time-frequency analysis and applications,
London, February 2000.
[48] A. D. Hillery and R. T. Chin. Iterative Wiener Filters for Image Restoration. IEEE
Trans. Signal Processing, 39(8):1892–1899, Aug. 1991.
[49] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. J. Physiol. (Lond.), 166:106–154, 1962.
[50] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey
striate cortex. J. Physiol. (Lond.), 195:215–243, 1968.
[51] H. Igehy and L. Pereira. Image Replacement through Texture Synthesis. In ICIP
97, volume 3, pages 186–189, 1997.
[52] A. Jalobeanu, L. Blanc-Féraud, and J. Zerubia. Satellite image deconvolution using

complex wavelet packets. INRIA Research Report 3955, June 2000.
[53] E. T. Jaynes. Information theory and statistical mechanics. Phys. Rev., 106(4):620–
630, May 1957.
[54] E. T. Jaynes. The rationale of maximum entropy methods. Proc. IEEE, 70:939–952,
1982.
[55] B. D. Jeffs and M. Gunsay. Restoration of blurred star field images by maximally
sparse optimization. IEEE transactions on image processing, 2(2):202–211, April
1993.
250 Bibliography
[56] F. C. Jeng and J. W. Woods. Compound Gauss-Markov random fields for image
estimation. IEEE Trans. Acoust., Speech, Signal Processing, 39:914–929, April 1991.
[57] A. G. Journel and C. J. Huijbregts. Mining Geostatistics. London: Academic Press,

1978.
[58] J. Kalifa and S. Mallat. Bayesian inference in wavelet based methods, chapter Mini-
max restoration and deconvolution. Springer, 1999.
[59] A. H. Kam and W. J. Fitzgerald. Image segmentation: An unsupervised multiscale

approach. In Proceedings of the 3rd International Conference on Computer Vision,
Pattern Recognition and Image Processing (CVPRIP 2000), volume II, pages 54–57,
February 2000.
[60] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. Int. J.
Comput. Vis., 1:321–332, 1988.
[61] A. K. Katsaggelos, J. Biemond, R. W. Schafer, and R. M. Mersereau. A regularization

iterative image restoration algorithm. IEEE Trans. Acoust., Speech, Signal Proc.,
39:914–929, April 1991.
[62] N. G. Kingsbury. The dual-tree complex wavelet transform: a new efficient tool
for image restoration and enhancement. In Proc. European Signal Processing Conf.,
pages 319–322, Sep 1998.
[63] N. G. Kingsbury. Shift invariant properties of the Dual-Tree Complex Wavelet Trans-
form. In Proc. IEEE Conf. on Acoustics, Speech and Signal Processing, Phoenix, AZ,
1999.
[64] N. G. Kingsbury. Complex wavelets and shift invariance. In Proc IEE Colloquium
on Time-Scale and Time-Frequency Analysis and Applications, IEE, London, 29 Feb,
2000.
[65] N. G. Kingsbury. Complex wavelets for shift invariant analysis and filtering of signals.
Submitted by invitation to Applied Computational and Harmonic Analysis, June 2000.
[66] T. Kohonen. The Self-Organizing Map. In Proc. IEEE, volume 78, pages 1464–1480,
Sep 1990.
Bibliography 251
[67] M. K. Kwong and P. T. P. Tang. W-Matrices, Nonorthogonal Multiresolution Anal-

ysis, and Finite Signals of Arbitrary Length MCS-P449-0794. Technical report, Ar-
gonne National Laboratory, 1994.
[68] L. Landweber. An iteration formula for fredholm integral equations of the first kind.
Am. J. Math., 73:615–624, 1951.
[69] M. Lang, H. Guo, J. E. Odegard, C. S. Burrus, and R. O. Wells Jr. Noise reduction
using an undecimated discrete wavelet transform. IEEE Signal Processing Letters,
3(1):10–12, 1996.
[70] J. Lebrun and M. Vetterli. High Order Balanced MultiWavelets. In Proceedings of

IEEE ICASSP, Seattle, pages 1529–1532, May 1998.
[71] A. Lent and H. Tuy. An iterative method for the extrapolation of band-limited
functions. J. Math. Anal. Appl., 83:554–565, October 1981.
[72] J. P. Lewis. Texture Synthesis for Digital Painting. In Proceedings of SIGGRAPH

84, 1984.
[73] J. M. Lina and M. Mayrand. Complex Daubechies wavelets. J. of Appl. and Comput.
Harmonic Analysis, pages 219–229, 1995.
[74] J. Magarey. Motion Estimation using Complex Wavelets. PhD thesis, Cambridge
University, 1997.
[75] S. Mallat. A Wavelet Tour Of Signal Processing. Academic Press, 1998.
[76] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of
image data. IEEE Trans. Patt. Anal. Mach. Int. Special Issue on Digital Libraries,
18(8):837–842, Aug 1996.
[77] S. Marcelja. Mathematical description of the responses of simple cortical cells. J.

Opt. Soc. Amer., 70:1297–1300, 1980.
[78] D. Marr. Vision, A Computational Investigation into the Human Representation and
Processing of Visual Information. W.H. Freeman and Company, 1982.
252 Bibliography
[79] K. A. Marsh and J. M. Richardson. The objective function implicit in the CLEAN
algorithm. Astron. Astrophys., 182:174–178, 1987.
[80] K. Miller. Least Squares methods for ill-posed problems with a prescribed bound.
SIAM J. Math. Anal., 1:52–74, 1970.
[81] R. Molina, J. Mateos, and J. Abad. Prior Models and the Richardson-Lucy Restora-
tion Method. In The Restoration of HST Images and Spectra II, R. J. Hanisch and
R. L. White, eds., pages 118–122, 1994.
[82] G. P. Nason and B. W. Silverman. The Stationary wavelet transform and some
statistical applications . Springer Lecture Notes in Statistics, 103:281–300, 1995. in
Wavelets and Statistics (ed. A. Antoniadis & G. Oppenheim).
[83] R. B. Navarro and J. Portilla. Robust method for texture synthesis-by-analysis based
on a multiscale Gabor scheme. In Proceedings of SPIE - The International Society
for Optical Engineering, volume 2657, pages 86–97, 1996.
[84] R. Neelamani, H. Choi, and R. Baraniuk. Wavelet-Domain Regularized Deconvolu-

tion for Ill-Conditioned Systems. In ICIP 99, Kobe, 1999.
[85] D. E. Newland. Harmonic Wavelet Analysis. Proc. R. Soc. Lond. A, 443:203–225,

1993.
[86] D. E. Newland. Harmonic wavelets in vibrations and acoustics. Phil. Trans. R. Soc.
Lond. A, 357(1760):2607–2625, 1999.
[87] R. D. Nowak and M. J. Thul. Wavelet-Vaguelette restoration in photon-limited

imaging. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing – ICASSP 98,
pages 2869–2872, Seattle, 1998.
[88] D. Nychka, C. Wikle, and J.A. Royle. Large spa-

tial prediction problems and nonstationary random fields.
http://goldhill.cgd.ucar.edu/stats/nychka/manuscripts/large5.ps, 1998.
[89] Wonho Oh. Random Field Simulation and an Application of Kriging to Image
Thresholding. ftp://ams.sunysb.edu/pub/papers/theses/who.ps.gz , Dec 1998.
Bibliography 253
[90] S. Osher, L. I. Rudin, and E. Fatemi. Nonlinear Total Variation based noise removal
techniques. Phys. D, 60:259–268, 1992.
[91] R.D. Paget and D. Longstaff. Nonparametric multiscale Markov random field mode
for synthesizing natural textures. In Proceedings of the International Symposium on
Signal Processing and its Applications, ISSPA 96, volume 2, pages 744–747, 1996.
[92] J. K. Paik and A. K. Katsaggelos. Image restoration using a modified Hopfield

network. IEEE Trans. Image Processing, pages 49–63, January 1992.
[93] E. Pantin and J. L. Starck. Deconvolution of astronomical images using the multi-
scale maximum entropy method. Astron. and Astrophys. Suppl. Ser., 118:575–585,
September 1996.
[94] R. K. Piña and R. C. Puetter. Bayesian Image Reconstruction: The Pixon and
Optimal Image Modeling. Publications of the Astronomical Society of the Pacific,
105:630–637, 1993.
[95] M. Pötzsch, T. Maurer, L. Wiskott, and C. v.d. Malsburg. Reconstruction from

graphs labeled with responses of Gabor filters. In Proc. ICANN 1996, 1996.
[96] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statis-
tics of complex wavelet coefficients. International Journal of Computer Vision, To
appear, 2000.
[97] M. J. P. Powell. in J.C. Mason and M.G. Cox (Eds), Algorithms for Approximation,
chapter Radial basis functions for multivariable interpolation: a review, pages 143–
167. Oxford: Clarendon Press, 1987.
[98] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical

Recipes in C. Cambridge University Press, 1988.
[99] R. C. Puetter and A. Yahil. The Pixon Method of Image Reconstruction. In

Proc. ADASS ’98 in Astronomical Data Analysis Software and Systems VIII, D.
M. Mehringer, R. L. Plante, and D. A. Roberts, Eds. ASP Conference Series, pages
307–316, 1998.
254 Bibliography
[100] T. Randen. Comments regarding the Trans. PAMI Article, April 1999.
http://www.ux.his.no/∼tranden/comments.html, 1999.
[101] T. Randen and J. H. Husøy. Filtering for Texture Classification: A Comparative

Study. IEEE Trans. on Patt. Anal. and Mach. Int., 21(4), Apr 1999.
[102] W. H. Richardson. Bayesian-Based Iterative Method of Image Restoration. Journal

of the Optical Society of America, 62(1):55–59, January 1972.
[103] J. K. Romberg and H. Choi. Software for Image Denoising using Wavelet-domain
Hidden Markov Tree Models. http://www.dsp.rice.edu/software/WHMT/.
[104] J. K. Romberg, H. Choi, and R. G. Baraniuk. Shift-Invariant Denoising using

Wavelet-Domain Hidden Markov Trees. In Proc. 33rd Asilomar Conference, Pacific
Grove, CA, October 1999.
[105] I. J. Schoenberg. Spline Functions and the problem of graduation. Proc. Nat. Acad.
Sci., 52:947–950, 1964.
[106] F. R. Schwab. Relaxing the isoplanatism assumption in self-calibration; applications

to low-frequency radio interferometry. Astron. J., 89:1076–1081, 1984.
[107] E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable

Multi-Scale Transforms. IEEE Transactions on Information Theory, Special Issue
on Wavelets, 38:587–607, 1992.
[108] E. P. Simoncelli and J. Portilla. Texture Characterization via Joint Statistics of

Wavelet Coefficient Magnitudes. In ICIP 98, 1998.
[109] D. Slepian. Prolate spheroidal wavefunctions, Fourier analysis and uncertainty I.

Bell. Syst. Tech. J., 40:43–64, 1961.
[110] R. L. Smith. Environmental Statistics.

http://www.stat.unc.edu/faculty/rs/papers/RLS Papers.html , 1998.
[111] M. Spann and R. Wilson. A quadtree approach to image segmentation which com-
bines statistical and spatial information. Pattern Recognition, 18:257–269, 1985.
Bibliography 255
[112] J.-L. Starck and F. Murtagh. Image restoration with noise suppression using the
wavelet transform. Astronomy and Astrophysics, 288:342–348, 1994.
[113] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press,
1997.
[114] V. Strela and A. T. Walden. Signal and Image denoising via Wavelet Thresholding:
Orthogonal and Biorthogonal, Scalar and Multiple Wavelet Transforms TR-98-01.
Technical report, Dept. of Mathematics, Imperial College of Science, Technology &
Medicine, 1998.
[115] Yi Sun. Hopfield Neural Network Based Algorithms for Image Restoration and
Reconstruction–Part I: Algorithms and Simulations. IEEE Transactions on Signal
Processing, 48(7):2105–2118, July 2000.
[116] C. W. Therrien. Discrete random signals and statistical signal processing. Prentice
Hall, 1992.
[117] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. New York:

Wiley, 1977.
[118] G. Turk. Generating Textures on Arbitrary Surfaces Using Reaction-Diffusion. In

Proceedings of SIGGRAPH 91, 1991.
[119] Michael Unser, Akram Aldroubi, and Murray Eden. B-Spline Signal Processing:Part
I–Theory. IEEE Trans. on Sig. Proc., 41(2):821–833, 1993.
[120] R. van Spaendonck, F. C. A. Fernandes, M. Coates, and C. S. Burrus. Non-redundant,

directionally selective complex wavelets. In 2000 IEEE International Conference on
Image Processing, 2000.
[121] M. Vetterli. Filter Banks allowing perfect reconstruction. Signal Processing,

10(3):219–244, 1986.
[122] M. Vetterli and J. Kovac̆ević. Wavelets and Subband Coding. Prentice Hall, Engle-
wood Cliffs:NJ, 1995.
256 Bibliography
[123] C. R. Vogel and M. E. Oman. Fast, robust total variation-based reconstruction of

noisy, blurred images. IEEE transactions on image processing, 7(6):813–824, June
1998.
[124] G. Wang, J. Zhang, and G. W. Pan. Solution of inverse problems in image processing
by wavelet expansions. IEEE Trans. Image Proc., 4:579–593, May 1995.
[125] S. J. Wernecke and L. R. D’Addario. Maximum entropy image reconstruction. IEEE

Transactions on Computers, C-26(4):351–364, April 1976.
[126] R. Wilson. Finite Prolate Spheroidal Sequences and Their Applications I: Generation
and Properties. IEEE Trans. on Patt. Anal. and Mach. Int., 9(6):787–795, Nov. 1987.
[127] R. Wilson and G. H. Granlund. The uncertainty principle in image processing. IEEE
Trans. on Patt. Anal. and Mach. Int., 6(6):758–767, 1984.
[128] R. Wilson and M. Spann. Finite Prolate Spheroidal Sequences and Their Applications
II: Image Feature Description and Segmentation. IEEE Trans. on Patt. Anal. and
Mach. Int., 10(2):193–203, March 1988.
[129] A. Witkin and M. Kass. Reaction-Diffusion Textures. In Proceedings of SIGGRAPH

91, 1991.
[130] D. J. Wolf, K. D. Withers, and M. D. Burnaman. Integration of Well and Seismic

Data Using Geostatistics. AAPG Computer Applications in Geology, pages 177–199,
1994.
[131] D. C. Youla and H. Webb. Image restoration by the method of convex projections:
Part 1, theory. IEEE Trans. Med. Imaging, 1:1982, October 1982.
[132] J. Zhang. The mean field theory for EM procedures in blind MRF image restoration.
IEEE Trans. Image Processing, pages 27–40, January 1993.
[133] Y. T. Zhou, R. Chellappa, A. Vaid, and B. K. Jenkins. Image restoration using a

neural network. IEEE Trans. Acoust., Speech, Signal Processing, 36(7):1141–1151,
July 1988.
Bibliography 257
[134] S. C. Zhu, Y. Wu, and D. Mumford. FRAME: filters, random fields, and minimax
entropy towards a unified theory for texture modelling. In Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, pages
686–693, 1996.

Complex Wavelet Based Image Analysis and PHD

Enviado por

Dados do documento

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Complex Wavelet Based Image Analysis and PHD

Enviado por

Direitos autorais:

Formatos disponíveis

www.GetPedia.

* The Ebook starts from the next page : Enjoy !

This dissertation is submitted for the degree of Doctor of Philosophy

All rights reserved. No part of this work may be reproduced, stored in a

This dissertation investigates the use of complex wavelets in image processing.

4 Complex wavelet texture features 53

7 Bayesian modelling in the wavelet domain 107

8 Interpolation and Approximation 129

10 Discussion and Conclusions 201

11 Future work 205

A Balance and noise gain 209

B Useful results 217

C Review of deconvolution techniques 223

1.1 Guide to the dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1 Building block for wavelet transform . . . . . . . . . . . . . . . . . . . . . 14

4.1 Results of using histogram/energy synthesis . . . . . . . . . . . . . . . . . 62

6.1 Results of original energy matching synthesis . . . . . . . . . . . . . . . . . 96

7.1 Sequence of operations to reconstruct using a Gaussian Pyramid . . . . . . 117

8.1 Count of important coeﬃcients for diﬀerent transforms . . . . . . . . . . . 148

10.1 Summary of useful properties for diﬀerent applications . . . . . . . . . . . 203

BSNR Blurred Signal to Noise Ratio.

1.2 Justiﬁcation for the research

1.3 Bayesian and Non-Bayesian Approaches

characteristic of non-Bayesian approaches:

2. It requires no knowledge about the type of degradation.

4. In practice it is often eﬀective.

1.4 Original contributions

1.4.1 Most important contributions

1. We experimentally compare feature sets for segmentation (5.6,5.9.4).

2. We develop theoretical links between a number of interpolation and approximation

1.4.2 Medium importance contributions

1. We explain why it is impossible to have a useful single-tree complex wavelet or a

3. We describe shift invariant wavelet models in terms of a Gaussian random process

4. We describe how Bayesian interpolation can be implemented with wavelet methods

5. We calculate theoretical predictions of aesthetic and statistical quality of solutions to

6. We explain a method for fast conditional simulation for problems of interpolation

7. We show how the speed of wavelet interpolation can be signiﬁcantly increased by

1.4.3 Least important contributions

1. We derive connections between the singular values of a wavelet transform matrix,

3. We perform segmentation experiments to measure the eﬀect of some additional auto-

5. We experimentally compare the shift dependence of interpolation methods based on

7. We review the main deconvolution techniques from a Bayesian perspective (appendix

1.4.4 Contributions based largely on previous work

4. We explain how the autocorrelation of DT-CWT subbands can be used to improve

1.5 Organisation of the dissertation

Figure 1.1: Guide to the dissertation

2.3 The Wavelet Transform

Figure 2.1: Building block for wavelet transform

This diagram is to be understood as representing the following sequence of operations:

2. Downsample the ﬁlter output by 2 to give output coeﬃcients y0 (n).

4. Downsample the ﬁlter output by 2 to give output coeﬃcients y1 (n).

Downsampling by k is a common operation in subband ﬁltering. It is represented by the

b(n) = a(kn). (2.1)

Figure 2.2: Subband decomposition tree for a 4 level wavelet transform

Figure 2.3: Building block for inverting the wavelet transform

1. Upsample the lowpass coeﬃcients y0 by 2.

2. Filter the upsampled signal with G0 .

3. Upsample the highpass coeﬃcients y1 by 2.

4. Filter the upsampled signal with G1 .

5. Add the two ﬁltered signals together.

Upsampling by k is represented by the notation ↑k . This operation converts a sequence of M

1. Use the block to reconstruct y000 from y0000 and y0001 .

2. Reconstruct y00 from y000 and y001 .