
Blind reverberation cancellation techniques

Massimiliano Tonelli

A thesis submitted for the degree of Master of Philosophy.


The University of Edinburgh.
October 2011
“Man does not live by bread alone.”

Abstract

Reverberation, a component of any sound generated in a natural environment, can degrade
speech intelligibility or, more generally, the quality of a signal produced within a room. In a
typical setup for teleconferencing, for instance, where the microphones receive both the speech
and the reverberation of the surrounding space, it is of interest to have the latter removed from
the signal that will be broadcast. A similar need arises for automatic speech recognition systems,
where the reverberation decreases the recognition rate. More ambitious applications have
addressed the improvement of the acoustics of theatres or even the creation of virtual acoustic
environments. In all these cases dereverberation is critical.

The process of recovering the source signal by removing the unwanted reverberation is called
dereverberation. Usually only a reverberated instance of the signal is available; as a consequence
only a blind approach, which is a more difficult task, is possible. In more precise terms,
unsupervised or blind audio dereverberation is the problem of removing reverberation from an
audio signal without having explicit data regarding the system and the input signal. Different
approaches have been proposed for blind dereverberation. They can be divided into two classes
according to whether or not the inverse acoustic system needs to be estimated.

The aim of this work is to investigate the problem of blind speech dereverberation, and in
particular the methods based on an explicit estimate of the inverse acoustic system, known
as “reverberation cancellation techniques”. The following novel contributions are proposed:
the formulation of single and multichannel dereverberation algorithms based on a maximum
likelihood (ML) approach and on the natural gradient (NG); a new dereverberation structure that
improves the speech and reverberation model decoupling. Experimental results are provided to
confirm the capability of these algorithms to successfully dereverberate speech signals.
Declaration of originality

I hereby declare that the research recorded in this thesis and the thesis itself was composed and
originated entirely by myself in the Department of Electronics and Electrical Engineering at
The University of Edinburgh.

Massimiliano Tonelli

Acknowledgements

I would like to express my gratitude to the following people:

First and foremost, my supervisor, Prof. Michael E. Davies, for his constant support and close
supervision of my project. Without him, this thesis would have never seen the light.

The Institute for Digital Communications, School of Engineering of The University of Edin-
burgh for financial support.

Everyone in my family.

Contents

Declaration of originality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Acronyms and abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

1 Introduction 1
1.1 Why dereverberation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Applications of dereverberation . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Blind dereverberation approaches . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Room acoustics prerequisites 5


2.1 Reverberation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Reverberation time, T60 and EDT . . . . . . . . . . . . . . . . . . . . 7
2.2 Air absorption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Modal description of reverberation . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Statistical model for reverberation . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Dividing the audio spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Reverberation as a linear filter. Impulse responses. . . . . . . . . . . . . . . . . 14
2.6.1 Impulse responses analysis . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Impulse response synthesis techniques . . . . . . . . . . . . . . . . . . . . . . 18
2.7.1 Geometric acoustic based techniques . . . . . . . . . . . . . . . . . . 18
2.7.2 Comments on “Image method for efficiently simulating small-room
acoustics” by J. B. Allen and D. A. Berkley . . . . . . . . . . . . . . . 21
2.8 Numerical solution of the acoustic wave equation . . . . . . . . . . . . . . . 23

3 Impulse response identification and equalization. Input-output techniques 25


3.1 IR identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Pseudo-impulsive methods . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2 Cross-correlation based methods . . . . . . . . . . . . . . . . . . . . 26
3.1.3 Inverse filtering based methods . . . . . . . . . . . . . . . . . . . . . 27
3.1.4 Adaptive identification . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Input-output reverberation cancellation techniques . . . . . . . . . . . . . . . . 30
3.2.1 Single channel reverberation cancellation . . . . . . . . . . . . . . . . 31
3.2.2 Non minimum phase SISO system inversion . . . . . . . . . . . . . . 33
3.2.3 Multi-channel reverberation cancellation, the multiple input/output in-
verse theorem (MINT) . . . . . . . . . . . . . . . . . . . . . . . . . . 36


3.2.4 Multi-channel system inversion . . . . . . . . . . . . . . . . . . . . . 39


3.3 Dereverberation quality measure . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Intrusive measures based on the comparison with the undistorted speech 42
3.3.2 Intrusive Channel-Based Measures . . . . . . . . . . . . . . . . . . . . 44

4 Blind dereverberation techniques. Problem statement and existing technology 47


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Reverberation suppression methods . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Beamforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Spectral subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 LP residual enhancement . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Blind reverberation cancellation methods . . . . . . . . . . . . . . . . . . . . 60
4.4 Signal pre-whitening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.1 A criticism of pre-whitening . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Blind deconvolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Higher order statistics (HOS) methods . . . . . . . . . . . . . . . . . . . . . . 68
4.6.1 Reverberation cancellation based on HOS methods . . . . . . . . . . . 74
4.7 Multi-channel SOS methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.1 SIMO identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.2 SIMO equalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.8 HOS and SOS approaches, a comparison . . . . . . . . . . . . . . . . . . . . . 94
4.9 Other dereverberation cancellation techniques . . . . . . . . . . . . . . . . . . 95
4.9.1 HERB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.9.2 Homomorphic deconvolution . . . . . . . . . . . . . . . . . . . . . . . 97
4.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5 Novel HOS based blind dereverberation algorithms 99


5.1 A single channel maximum likelihood approach to blind audio dereverberation 99
5.1.1 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 A multi-channel maximum likelihood approach to dereverberation . . . . . . . 106
5.2.1 A comparison between ML dereverberation and the MINT . . . . . . . 108
5.2.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 Blind dereverberation algorithms based on the natural gradient . . . . . . . . . 111
5.3.1 The Bussgang algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.2 Natural/Relative Gradient . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.3 Dereverberating speech . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4 A novel dereverberation structure that improves the speech and reverberation
model decoupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.1 On the correct structure for a single channel dereverberator. The re-
versed structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.2 A causal normalized NGA dereverberation algorithm based on the re-
versed structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4.3 Results. The reversed vs the forward structure. . . . . . . . . . . . . . 117
5.5 A blind multichannel dereverberation algorithm based on the natural gradient . 120
5.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

6 Conclusions and Further Research 125


6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125


6.2 Suggestions for further research . . . . . . . . . . . . . . . . . . . . . . . . . 127

A Relative/Natural Gradient De-reverberation Algorithm 129


A.1 The Bussgang algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.2 Natural/Relative Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.3 De-reverberating speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.3.1 A Causal NGA De-reverb Algorithm . . . . . . . . . . . . . . . . . . 133
A.3.2 A Causal NGA Algorithm with Acausal w . . . . . . . . . . . . . . . 133
A.4 The normalization constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
A.4.1 Natural Gradient Blind Deconvolution . . . . . . . . . . . . . . . . . . 135
A.4.2 Natural Gradient Blind De-Reverb . . . . . . . . . . . . . . . . . . . . 136
A.4.3 Causal normalized NGA for De-Reverb . . . . . . . . . . . . . . . . . 137

References 138

List of figures

2.1 Sound reflection from a surface . . . . . . . . . . . . . . . . . . . . . . . . . 5


2.2 The energy decay curve of a Sabinian room . . . . . . . . . . . . . . . . . . . 8
2.3 3-D representation of the sound pressure distribution in a rectangular room for
the tangential mode 3,2,0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Resonance curve of a mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Impulse response of a small room . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Echogram of the previous IR . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 An idealized energetic distribution of an IR . . . . . . . . . . . . . . . . . . . 16
2.8 EDR at low frequencies of the previous IR . . . . . . . . . . . . . . . . . . . 17
2.9 2D representation of the ray tracing algorithm . . . . . . . . . . . . . . . . . . 19
2.10 A virtual source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.11 MISM for a simple rectangular room . . . . . . . . . . . . . . . . . . . . . . 20
2.12 (a) IR from an Allen and Berkley type algorithm; (b) enhanced IR obtained by
using a negative reflection coefficient; (c) measured IR . . . . . . . . . . . . . 21

3.1 An adaptive identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


3.2 A SIMO system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 The total number of filter taps in the equalizer can be reduced by increasing
the number of microphones N . The figure plots equation 3.43 for a 1000-tap
impulse response as the number of channels is increased from 2 to 251. The
non-monotonic decrease of the total tap number is due to the fact that the
channel filter length, M , can only assume integer values (as stated by
equation 3.42). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Illustration of the inversion: (a) single echo impulse response with α = −0.9
and k = 50; (b) truncated inverse filter. A strong reflection requires a very long
inverse filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Illustration of the inversion of a two channel system: h1 (n) = δ(n) − 0.9δ(n −
600), h2 (n) = δ(n) − 0.9δ(n − 1000). (a), (b) Inverse filters calculated by
MINT. (c) Equalized IR. In the multichannel case, a strong reflection can be
perfectly equalized by FIR filters. . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1 Beamformer structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


4.2 Amplitude response of a simple low pass filter. . . . . . . . . . . . . . . . . . . 52
4.3 Beampatterns of linear beamformers: (a) 16 sensors, (b) 32 sensors, (c) Dolph-
Chebyshev 16 sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Wideband behavior of a 16 sensor Dolph-Chebyshev beamformer. . . . . . . . 53
4.5 General structure for LP residual enhancement . . . . . . . . . . . . . . . . . 59
4.6 Linear prediction residual of a voiced segment. . . . . . . . . . . . . . . . . . 64
4.7 Model of the blind SISO deconvolution problem. . . . . . . . . . . . . . . . . 67
4.8 Bussgang type equalizer structure. . . . . . . . . . . . . . . . . . . . . . . . . 72
4.9 A single channel time-domain adaptive algorithm for maximizing kurtosis of
the LP residual. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


4.10 Illustration of the relationships between the input s(n) and the observations
x(n) in an M-channel SIMO system. . . . . . . . . . . . . . . . . . . . . . . . 78
4.11 Multichannel blind equalizer for a SIMO system. . . . . . . . . . . . . . . . . 85
4.12 2-channel Correlation shaping block diagram. . . . . . . . . . . . . . . . . . . 90
4.13 Signal flow of the method proposed by Furuya et al. . . . . . . . . . . . . . . . 93
4.14 Block diagram of the HERB algorithm . . . . . . . . . . . . . . . . . . . . . . 97

5.1 Kurtosis maximization during adaptation . . . . . . . . . . . . . . . . . . . . 100


5.2 Example of misconvergence of the single channel time domain algorithm pro-
posed by Gillespie and Malvar. (a) system IR, (b) identified inverse IR, (c)
equalized IR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 A single channel time-domain adaptive algorithm for maximizing kurtosis of
the LP residual. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 (a) Echogram of the original IR (reference), DRR=-2.9 dB. (b) Echogram of the
equalized IR obtained by the single-channel time-domain kurtosis maximization
algorithm as proposed by Gillespie and Malvar, DRR=-2.7 dB. (c) Echogram
of the equalized IR obtained by the single-channel time-domain kurtosis maxi-
mization algorithm based on equation 5.14, DRR=-1.1 dB . . . . . . . . . . . 105
5.5 Convergence of the single-channel time-domain kurtosis maximization algo-
rithm based on equation 5.14 (solid line) and of the single-channel time-domain
kurtosis maximization algorithm as proposed by Gillespie and Malvar (dashed
line). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6 Two-channel system composed of two single-channel structures . . . . . . . . 107
5.7 Comparison of the MINT and multi-channel equalizer. (a), (b) Inverse filters
calculated by MINT. (d), (e) Inverse filters calculated by the multi-channel al-
gorithm. (c), (f) Equation (4.25) evaluated in both cases. . . . . . . . . . . . . 109
5.8 (a) Reference echogram relating to the shortest source-to-receiver path, DRR=-
2.9 dB. (b) Echogram of the equalized IR obtained by the 8-channel delay-sum
beamformer, DRR=-0.1 dB. (c) Echogram of the equalized IR obtained by the
8-channel ML dereverberator, DRR=2.3 dB. The ML multichannel structure
provides improved dereverberation with respect to the delay and sum beamformer. 110
5.9 Comparison of the adaptation behavior of the NGA and the time domain kur-
tosis maximization proposed by Gillespie and Malvar when they are applied
to a supergaussian input filtered by a single echo with a delay of 50 samples
and a reflection gain =-1. Equalizer order P=100. Perfect equalization is not
achieved since a truncated equalizer is used. . . . . . . . . . . . . . . . . . . . 112
5.10 (a) Diagram of the time domain dereverberation algorithm proposed by Gille-
spie and Malvar (forward structure). (b) Diagram of the proposed model (re-
versed structure). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.11 (a) In the forward structure, the residual is a function of multiple LPC filters.
(b) In the reversed structure, the residual is a function of one LPC filter. . . . . 114


5.12 Results of the toy non-blind example described in section 5.4.3.1. Dereverbera-
tion performance with a time variant source filter: (a1) forward structure; (a2)
reversed structure. While the reversed structure can correctly dereverberate the
input signal, this does not happen for the forward structure. Dereverberation
performance with a time invariant source filter: (b1) forward structure; (b2)
reversed structure. There is no appreciable difference in the performance of the
two structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.13 Echogram of the original impulse response (above), DRR=-2.9 dB, and of the
IR equalized with the reversed structure (below), DRR=-0.7 dB . . . . . . . . 119
5.14 Proposed multi-channel dereverberation structure. . . . . . . . . . . . . . . . 121
5.15 (a) Reference echogram relating to the shortest source-to-receiver path, DRR=-
2.9 dB. (b) Echogram of the equalized IR obtained by the 8-channel delay-sum
beamformer, DRR=-0.1 dB. (c) Echogram of the equalized IR obtained by the
proposed 8-channel dereverberator, DRR=3.1 dB. The proposed structure pro-
vides improved dereverberation with respect to the delay and sum beamformer. . 122
5.16 (a) Reference echogram relating to the shortest source-to-receiver path, DRR=-
12.9 dB. (b) Echogram of the equalized IR obtained by the 12-channel delay-
sum beamformer, DRR=-4.9 dB. (c) Echogram of the equalized IR obtained by
the 12-channel dereverberator (female3 speaker), DRR=-0.9 dB. (d) Echogram
of the equalized IR obtained by the 12-channel dereverberator (male1 speaker),
DRR=-2.9 dB. The proposed algorithm provides better dereverberation with
respect to the delay and sum beamformer. . . . . . . . . . . . . . . . . . . . . 124

A.1 (a) Diagram of the time domain de-reverberation algorithm proposed by Gille-
spie and Malvar (forward structure). (b) Diagram of the proposed model (reversed
structure). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

List of tables

2.1 Air absorption at 50% relative humidity. . . . . . . . . . . . . . . . . . . . . . 9


2.2 Optimal room proportion ratios . . . . . . . . . . . . . . . . . . . . . . . . . . 12

5.1 DRR improvement in dB with respect to a DS beamformer. . . . . . . . . . . . 111


5.2 DRR improvement in dB with respect to a DS beamformer. . . . . . . . . . . . 123

Acronyms and abbreviations

BEM Boundary Element Method


BSD Bark Spectral Distortion
CR Cross Relation
DSB Delay and Sum Beamformer
DRR Direct to Reverberation Ratio
DWM Digital Waveguide Mesh
EDC Energy Decay Curve
EDR Energy Decay Relief
EDT Early Decay Time
FDTD Finite Difference Time Domain
FEM Finite Element Method
FFT Fast Fourier Transform
FIR Finite Impulse Response
GED Generalized Eigenvalue Decomposition
HERB Harmonicity based dEReverBeration
HOS Higher Order Statistics
IACC Interaural Cross Correlation
IIR Infinite Impulse Response
IR Impulse Response
ITU-T International Telecommunication Union – Telecommunication Standardization Sector
LMS Least Mean Square
LIME LInear-predictive Multi-input Equalization
LP Linear Prediction
LPC Linear Prediction Coefficient
LSD Log Spectral Distortion
LTI Linear Time-Invariant


ME-LMS Multiple Error Least Mean Square


MI Modulation Index
MINT Multiple input/output INverse Theorem
MISM Mirror Image Source Method
ML Maximum Likelihood
MLS Maximum Length Sequence
MMSE Minimum Mean Squared Error
MSE Mean Squared Error
NG Natural Gradient
NGA Natural Gradient Algorithm
NMCFLMS Normalized Multichannel Frequency Domain LMS
NPM Normalized Projection Misalignment
PDF Probability Density Function
PEF Prediction Error Filter
PSD Power Spectral Density
RDT Reverberation Decay Tail
SIMO Single-Input Multi-Output
SISO Single-Input Single-Output
SNR Signal to Noise Ratio
SOS Second Order Statistics
SRR Signal to Reverberation Ratio
STFT Short-Time Fourier Transform
STI Speech Transmission Index
UMCLMS Unconstrained Multi-Channel LMS
ZF Zero-Forcing

Nomenclature

Notations

x scalar quantity
x vector quantity
X matrix quantity
x(n) discrete time signal
x(t) continuous time signal

Operators

x ∗ y linear convolution
xᵀ vector transpose
|x| absolute value of x
Re{·} real component of a complex number
Im{·} imaginary component of a complex number
E{·} mathematical expectation
⟨·⟩ spatial expectation
∇ gradient operator

Chapter 1
Introduction

1.1 Why dereverberation?

The way sound waves reflect off various surfaces before reaching our ears is a fascinating
phenomenon, usually taken for granted in our everyday life. Without reverberation our perception
of the world would be greatly affected. To give a few examples, we could hardly speak
to someone who is not facing us directly, we would be less aware of the potential danger of an
approaching car, and we would not appreciate the performance of any orchestra or musical
instrument. Sounds would be lifeless. All the acoustic information conveyed by the surround-
ing space and objects would be missing. So why should we remove it? The reason is simple:
for specific applications, reverberation can introduce detrimental modifications to the source
signal.

1.2 Applications of dereverberation

In all the following cases, and surely in many others, dereverberation would be beneficial.

It is well known that reverberation can decrease speech intelligibility [1]. This is a particularly
severe problem for hearing impaired people [2]. Therefore, speech enhancement techniques
capable of improving the comprehensibility of speech are of interest for applications such as
hearing aids, forensic analysis, surveillance and recording restoration.

A large amount of research is focused on improving human-computer interaction. In automatic
speech recognition systems, reverberation decreases the recognition rate [3]. This is of
particular concern if the distance between the talker and the microphone is large. To overcome
this problem the use of close-talk microphones is, so far, mandatory. Dereverberation would
allow us to design a reliable voice-controlled system without requiring the user to wear a headset
or a microphone.

A similar scenario arises in hands-free communication (mobile phones or teleconferencing
equipment), where it is desirable to broadcast a signal without the reverberation of the sur-
rounding space.

Usually the above-mentioned applications require estimating the original source signal by
removing the reverberation components from the received signal(s), without knowledge of the
surrounding acoustic environment. Therefore this estimate must be performed “blindly”.

1.3 Blind dereverberation approaches

Even though several approaches have been proposed, blind dereverberation algorithms can be
divided into two classes [4]: “reverberation suppression” and “reverberation cancellation” meth-
ods.

Reverberation suppression techniques require no identification of the impulse response of the
acoustic system. This is achieved, for instance, by using beamforming [5] to isolate the signal
coming from a specific direction, or by treating the reverberation as noise and employing
denoising techniques (e.g. spectral subtraction [4]).
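To make the delay-and-sum idea concrete, here is a toy sketch (not taken from the thesis; the function name and the signals are invented for the example): steering delays re-align the direct-path arrivals so that they add coherently across microphones, while energy arriving from other directions does not.

```python
def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer with integer-sample steering delays:
    each channel is advanced by its own delay, then the channels are averaged."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + i] for ch, d in zip(channels, delays)) / len(channels)
            for i in range(n)]

# Toy example: the same pulse arrives 0, 2 and 4 samples late on three mics
src = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mics = [src, [0.0] * 2 + src[:-2], [0.0] * 4 + src[:-4]]
out = delay_and_sum(mics, [0, 2, 4])  # the pulses re-align and add coherently
```

In practice the delays are fractional and frequency-dependent weighting is used, but the principle is the same: no estimate of the room impulse response is required, only the steering direction.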

Reverberation cancellation methods, on the other hand, are based on the explicit estimate of
the inverse acoustic system. Therefore, dereverberation is obtained by convolving the received
signal with an estimate of the acoustic inverse filter.
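As a minimal illustration of reverberation cancellation (a sketch invented for this text, not the thesis's method), consider a single-echo impulse response h(n) = δ(n) + αδ(n − k). Its exact inverse is the infinite series g(n) = Σₘ (−α)ᵐ δ(n − mk); truncating the series gives an FIR approximation of the inverse acoustic filter, and convolving h with it nearly restores a unit impulse:

```python
def convolve(a, b):
    """Linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Single-echo impulse response: h(n) = delta(n) + alpha * delta(n - k)
alpha, k = -0.9, 5
h = [1.0] + [0.0] * (k - 1) + [alpha]

# Truncated inverse: g(n) = sum_m (-alpha)^m delta(n - m*k), m = 0..M
M = 60
g = [0.0] * (M * k + 1)
for m in range(M + 1):
    g[m * k] = (-alpha) ** m

eq = convolve(h, g)                      # equalized IR, close to a unit impulse
residual = max(abs(v) for v in eq[1:])   # leftover echo energy
```

With |α| close to 1 the series decays slowly, so the truncated inverse must be very long — the behaviour illustrated in figure 3.4 for the single-channel case.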

Due to the spatial diversity and temporal instability that characterize acoustic system im-
pulse responses [6], the first class of algorithms has so far offered more effective results in prac-
tical conditions [7] [8]. However, the algorithms belonging to the second class can potentially
lead to ideal performance [9].

At present, practical dereverberation is still largely an unsolved problem. Furthermore,
research has focused on the dereverberation of speech signals, and no systematic
extension to a larger class of acoustic signals exists.

The aim of this work is to investigate blind speech dereverberation, and in particular the “rever-
beration cancellation” techniques.


1.4 Thesis Overview

The work presented in this thesis is structured as follows:

Chapter 2 is an overview of the physics and the modelling of room acoustics. Here some of
the properties of room acoustics, which are important to understand why particular models and
methods are used in speech dereverberation, are introduced.

Chapter 3 examines the principal non-blind techniques for room identification and equaliza-
tion. The discussion highlights the complexity of the dereverberation problem even in the
non-blind context.

Chapter 4 analyses the blind dereverberation problem and provides a wide literature survey of
the existing technology, with a focus on “reverberation cancellation” techniques.

Chapter 5 proposes novel blind dereverberation algorithms based on single and multichannel
“reverberation cancellation” methods. More specifically, single and multichannel de-reverberation
algorithms based on a maximum likelihood approach and on the natural gradient are formu-
lated. A new de-reverberation structure that improves the speech and reverberation model de-
coupling is also discussed. Experimental results are provided to confirm the capability of these
algorithms to successfully de-reverberate speech signals.

Finally, in Chapter 6, the results and contributions are summarised, and several directions for
future research are proposed.

1.5 Publications

Associated with this work are the following publications:

• M. Tonelli, N. Mitianoudis, and M. E. Davies, “A maximum likelihood approach to blind
audio de-reverberation,” in Proc. Digital Audio Effects Conference (DAFx04), pp. 256–261,
2004.

• M. Tonelli, M. Jafari, and M. E. Davies, “A multi-channel maximum likelihood approach
to de-reverberation,” in Proc. European Signal Processing Conf. (EUSIPCO), 2006.

• M. Tonelli and M. E. Davies, “A blind multichannel dereverberation algorithm based on
the natural gradient,” in Proc. IWAENC, 2010.

Chapter 2
Room acoustics prerequisites

2.1 Reverberation

The process of reverberation starts with the production of sound at a location within a room.
The acoustic pressure wave expands radially, reaching walls and other surfaces where energy
is both absorbed and reflected. All reflected energy is reverberation.

2.1.1 Reflections

Let us consider an acoustic plane wave which strikes a perfectly flat and rigid surface of infinite
extension.

The direction of the impinging wave is characterized by an angle θ with respect to the wall
normal, called the angle of incidence. According to the laws of optics, the reflected wave
leaves the boundary at the same angle; furthermore, the wave normals of both waves and
the wall normal lie in the same plane [1]. When the wave impinges on the surface, part
of the wave energy is reflected back and part is absorbed by the medium, either because it is
dissipated by losses occurring within it, or because it is transmitted through it. The absorption
coefficient α is a real value between 0 (perfectly reflective surface) and 1 (perfectly absorbent
surface) that can be calculated by

Figure 2.1: Sound reflection from a surface


α = 4ξ cos(θ) / (|ζ|² cos²(θ) + 2ξ cos(θ) + 1)                       (2.1)

where ζ is the specific acoustic impedance of the surface (the surface impedance normalized
by the characteristic impedance of air) and ξ is its real part [1].

Incidentally, values greater than 1 for the absorption coefficient can be found in the literature.
This is caused by the testing methods and can be misleading: any absorption coefficient listed
as greater than 1 should be taken as 1 in any calculation or consideration [10].

The reflection coefficient r is related to the absorption coefficient α by the equation

α + r = 1. (2.2)

This stems from the principle of energy conservation.

Since only the energetic aspects are considered, the values of the absorption and reflection
coefficients are real numbers. Therefore they do not take into account the frequency depen-
dent phase shift that occurs upon reflection. If those aspects have to be evaluated, a complex
reflection factor for the surface must be used [1]

R(θ) = (ζ cos(θ) − 1) / (ζ cos(θ) + 1).                              (2.3)

The reflection coefficient r and the complex reflection factor R are related by

r = |R|².                                                            (2.4)

While the reflection coefficient r is related to energetic aspects (it describes the amount of
power that is reflected), the R coefficient describes the attenuation of the acoustic pressure
wave.
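Relations (2.1)–(2.4) can be checked numerically. The sketch below (function names and the sample impedance value are assumptions of this example, not taken from the text) computes α directly from equation (2.1) and verifies that it equals 1 − |R(θ)|², i.e. equation (2.2) with r = |R|²:

```python
import math

def absorption(zeta, theta):
    """Absorption coefficient alpha from equation (2.1); xi is the real part of zeta."""
    xi, c = zeta.real, math.cos(theta)
    return 4.0 * xi * c / (abs(zeta) ** 2 * c ** 2 + 2.0 * xi * c + 1.0)

def reflection_factor(zeta, theta):
    """Complex reflection factor R(theta) from equation (2.3)."""
    c = math.cos(theta)
    return (zeta * c - 1.0) / (zeta * c + 1.0)

zeta = 3.0 + 1.5j            # sample specific acoustic impedance (assumed value)
theta = math.radians(30.0)   # angle of incidence
alpha = absorption(zeta, theta)
r = abs(reflection_factor(zeta, theta)) ** 2   # reflection coefficient, eq. (2.4)
# Energy conservation, equation (2.2): alpha + r = 1
```

Expanding |ζ cos θ ± 1|² shows the agreement is exact, not just numerical: the cross term 4ξ cos θ in the numerator of (2.1) is precisely |ζ cos θ + 1|² − |ζ cos θ − 1|².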

Reflection off non-uniform finite surfaces is a more complicated process; it can be approxi-
mated by a specular reflection only when the surface dimensions are large relative to the wavelength
λ of the sound being evaluated (i.e. > 4λ) [11]. The specular reflection hypothesis, which is the foun-
dation of geometric acoustics, fails both when the wavelength is comparable to the unevenness
of the surface, and when the wavelength is comparable to the room dimensions or to the di-
mensions of the objects placed inside the room. All models based on geometric acoustics, which
ignore the undulatory nature of sound propagation, and therefore the diffuse reflection at
high frequencies and the diffraction at low frequencies, are potentially inaccurate.
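As a quick numerical illustration of the > 4λ rule quoted above (the 343 m/s speed of sound and the function name are assumptions of this sketch, not from the text):

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed)

def min_specular_frequency(surface_dim_m, factor=4.0):
    """Lowest frequency at which a flat surface of the given dimension can be
    treated as a specular reflector (dimension > factor * wavelength)."""
    return factor * SPEED_OF_SOUND / surface_dim_m

f_min = min_specular_frequency(1.0)  # a 1 m panel reflects specularly only above ~1.4 kHz
```

Below that frequency the wavelength is comparable to the panel size, and diffraction rather than specular reflection dominates.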

2.1.2 Reverberation time, T60 and EDT

In nature no perfectly reflective surfaces exist, thus the energy of the impinging wave will
decrease at every reflection. This will cause an intensity decay in the reflected energy. The
time in seconds required for the average sound-energy density to reduce to one-millionth (a
decrease of 60 decibels) of its initial steady-state value after the sound source has been stopped
is defined as the reverberation time (RT) or T60 [12]. Reverberation time can be measured by
exciting a room with a wide band (i.e. 20 Hz to 20 kHz) or narrow band signals (i.e. one octave,
1/3 octave, 1/6 octave, etc.) [13]. Very roughly, the T60 can be considered as the time required
for a very loud sound to decay to inaudibility. This concept, introduced by W. C. Sabine at the
turn of the 20th century, is still the most important characteristic for evaluating the acoustical
quality of a room. Sabine determined that reverberation is proportional to the volume of the room and
inversely proportional to the amount of absorption

T60 ∝ V / A    (2.5)

where V is the volume of the room and A is a measure of the total absorption of materials in
the room.
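Sabine's relation, with the usual proportionality constant of 0.161 s/m (valid for c ≈ 343 m/s), is easy to turn into a small calculator. The room dimensions and absorption coefficients below are purely illustrative values, not data from the text.

```python
# Sabine reverberation time: T60 = 0.161 * V / A, with A = sum(S_i * alpha_i).
# The 0.161 s/m constant assumes a speed of sound around 343 m/s.

def sabine_t60(volume_m3, surfaces):
    """surfaces: list of (area_m2, absorption_coefficient) pairs."""
    A = sum(S * alpha for S, alpha in surfaces)  # total absorption in sabins (m^2)
    return 0.161 * volume_m3 / A

# A hypothetical 6 x 5 x 3 m room: hard walls/ceiling, carpeted floor
room = [(2 * (6 * 3 + 5 * 3), 0.05),  # walls
        (6 * 5, 0.04),                # ceiling
        (6 * 5, 0.30)]                # floor
t60 = sabine_t60(6 * 5 * 3, room)
print(f"T60 = {t60:.2f} s")           # → T60 = 1.07 s
```

Doubling the total absorption halves the predicted reverberation time, consistent with the inverse proportionality of Eq. (2.5).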

Sabine's relation was derived empirically. The subsequent formal derivation showed that it holds only under the following hypotheses:

1. the energy decay is the same in all the positions within the room;

2. perfect energy diffusion exists and there is no preferential direction for the reflections;

3. air absorption is neglected;

4. the discrete phenomenon of the energy impinging on the room walls can be modelled as
a continuous one.


Figure 2.2: The energy decay curve of a Sabinian room

These hypotheses are approximately satisfied only in regular rooms that are neither too absorptive nor too large, with uniformly distributed absorption and with the excitation source positioned near the room barycentre. Under these conditions the space is “Sabinian” and the acoustic energy decay curve (EDC) has a linear dB decay over time. Usually the T60 is calculated by a line fit to the portion of the EDC between -5 and -35 dB.

In all other cases (e.g. coupled spaces, large auditoria) a unique T60 is not definable. The T60 therefore becomes a locally defined value depending on the source and receiver positions.
The more the room features depart from the Sabinian hypothesis, the more local variations of
the T60 are present.

A useful measure to evaluate the perception of reverberation while listening to a dynamically changing signal is the early decay time (EDT). The EDT is the 60 dB decay time calculated by a line fit to the portion of the energy decay curve between 0 and -10 dB. The initial rate of decay of reverberant sound appears to be more important than the total reverberation time. A rapid initial decay is interpreted by the human ear as meaning that the reverberation time is short. This gives a more subjective evaluation of the reverberation time [14].

Both the T60 and the EDT are linked only to the energetic aspects of reverberation and hide
all the directional information connected to the room geometry.

2.2 Air absorption

For frequencies above 1 kHz and for very large rooms, the absorption of sound by the air in the space is not negligible [15]. The amount of sound that air absorbs increases with audio frequency and decreases with air density, and also depends on temperature and humidity.

Frequency   Sabins/m³
2000 Hz     0.010
4000 Hz     0.024
8000 Hz     0.086

Table 2.1: Air absorption at 50% relative humidity.

This, in conjunction with the fact that porous absorptive materials have higher absorption at higher frequencies, explains why the treble reverberation time falls off faster.

2.3 Modal description of reverberation

In a perfectly rectangular room with rigid walls the acoustical wave equation

∇²p − (1/c²) · ∂²p/∂t² = 0    (2.6)

where ∇2 is the Laplace operator, p is the acoustic pressure and c the speed of sound, can be
solved in closed form. This approach yields a solution based on the natural resonant frequencies
of the room, called normal modes. The resonant frequencies are given by [16]:

fn = (c/2) · √[ (nx/Lx)² + (ny/Ly)² + (nz/Lz)² ]    (2.7)

where

fn =n-th normal frequency in Hz;

nx , ny , nz =integers from 0 to ∞ that can be chosen separately;

Lx , Ly , Lz = dimensions of the room in meters;

c=speed of sound in m/sec.
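Equation (2.7) is straightforward to evaluate. The sketch below lists the lowest normal modes of a hypothetical rectangular room and classifies each as axial, tangential or oblique according to the number of non-zero indices (anticipating the classification given below).

```python
import itertools

C = 343.0  # speed of sound in m/s

def modes(Lx, Ly, Lz, n_max=4):
    """Return (frequency, (nx, ny, nz), type) for all modes with indices up to n_max."""
    out = []
    for nx, ny, nz in itertools.product(range(n_max + 1), repeat=3):
        if nx == ny == nz == 0:
            continue  # (0,0,0) is not a mode
        f = (C / 2) * ((nx / Lx) ** 2 + (ny / Ly) ** 2 + (nz / Lz) ** 2) ** 0.5
        kind = {1: "axial", 2: "tangential", 3: "oblique"}[sum(n > 0 for n in (nx, ny, nz))]
        out.append((f, (nx, ny, nz), kind))
    return sorted(out)

# Lowest five modes of a hypothetical 6 x 5 x 3 m room
for f, n, kind in modes(6.0, 5.0, 3.0)[:5]:
    print(f"{f:6.1f} Hz  {n}  {kind}")
```

For the 6 × 5 × 3 m example the lowest mode is the (1, 0, 0) axial mode at c/(2 · 6) ≈ 28.6 Hz.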

When a sound source is turned on in an enclosure, it excites one or more of the normal modes of the room. When the source is turned off, the modes continue to resonate their stored energy, each decaying at a separate rate determined by the mode’s damping constant, which depends on the absorption of the room. This is entirely analogous to an electrical circuit containing many parallel resonances [17]. The room response is made up of the combined modes summed vectorially, with both magnitude and phase.

Figure 2.3: 3-D representation of the sound pressure distribution in a rectangular room for the tangential mode 3,2,0

A natural resonant frequency is associated with any triplet of integers nx , ny , nz .

• The mode is defined as axial, if only one value is different from zero (i.e. nx = 1, ny =
0, nz = 0).

• The mode is defined as tangential, if two values are different from zero (i.e. nx =
1, ny = 1, nz = 0),

• The mode is defined as oblique, if all three values are different from zero (i.e. nx =
1, ny = 1, nz = 1).

Axial modes are linked to the concept of standing wave. Assume two flat, solid parallel walls separated by a given distance. A sound source between them radiates sound of a specific frequency.
The wavefront striking the right wall is reflected back toward the source, striking the left wall
where it is reflected back toward the right wall, and so on. One wave travels to the right, the
other toward the left. Only the standing wave, the interaction of the two, is stationary [15].
Tangential and oblique modes are due to reflections upon more surfaces, and therefore suffer
more losses. As a consequence, they tend to be less intense than axial modes.

Within a cavity with perfectly rigid walls, every mode determines a pattern of pressure minima and maxima according to the equation

p(x, y, z) = cos(nxπx/Lx) · cos(nyπy/Ly) · cos(nzπz/Lz).    (2.8)

An example of the pressure distribution associated with a single mode is reported in Fig. 2.3.

Figure 2.4: Resonance curve of a mode

Every mode exhibits a resonance curve as shown in Fig.2.4. The bandwidth of each mode can
be determined by the equation
∆f = 2.2 / T60    (2.9)
where T60 is the reverberation time measured by using a pure tone excitation. Therefore, the
more absorption, the shorter the reverberation time, and the wider the mode resonance. This
means that adjacent modes tend to overlap for rooms with short reverberation time.

For a room with good acoustics, a primary goal is to avoid coincidences of modes, in particular of the axial ones. In fact, coincident modal frequencies tend to overemphasize signal components at that frequency. Modes are directly linked to the room dimensions, therefore mode overlapping or excessive closeness, which can cause beating, is minimized by properly choosing the room dimensions. A uniform distribution of axial modes is also important. Gilford [18] stated that, to avoid coloration, axial modes should not be separated by more than 20 Hz. Another criterion was suggested by Bonello [19], who considered all three types of modes, not axial modes alone. He states that, to provide a good modal distribution, it is desirable to have all modal frequencies in a critical band at least 5% of their frequency apart (e.g. 40 Hz and 41 Hz would not be acceptable).
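A Bonello-style spacing check of this kind is mechanical: given a list of modal frequencies, flag neighbouring pairs closer than 5% of their value. The frequency list below is a short hypothetical example, not computed from any room in the text.

```python
def too_close(freqs, tol=0.05):
    """Return neighbouring pairs of modal frequencies closer than tol (a fraction)
    of their mean value, in the spirit of Bonello's 5% criterion."""
    fs = sorted(freqs)
    return [(a, b) for a, b in zip(fs, fs[1:]) if (b - a) < tol * 0.5 * (a + b)]

# A hypothetical short list of low modal frequencies (Hz)
print(too_close([28.6, 34.3, 40.0, 41.0, 57.2, 57.2]))
# → [(40.0, 41.0), (57.2, 57.2)]
```

The 40/41 Hz pair from the paragraph above is flagged, as is the exact coincidence at 57.2 Hz (a degeneracy that the optimal-ratio tables aim to avoid).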

In the following table, several optimal ratios for the room dimensions are reported. These ratios minimize the modal overlapping [20].

Author Height Width Length Overlapped modes


Sepmeyer 1.0 1.14 1.39 9
Sepmeyer 1.0 1.28 1.54 10
Sepmeyer 1.0 1.6 2.33 9
Louden 1.0 1.4 1.9 11
Louden 1.0 1.3 1.9 8
Louden 1.0 1.5 2.5 12
Volkmann 1.0 1.5 2.5 12
Boner 1.0 1.26 1.59 11
Beranek 1.0 1.6 2.3 9
Beranek 1.0 2.0 3.0 10
Cesare Consumi 1.0 1.49 2.98 6
Cesare Consumi 1.0 1.49 2.22 7

Table 2.2: Optimal room proportion ratios

The behavior of an irregularly shaped room can also be described in terms of its normal modes,
even though a closed form solution may be impossible to achieve [1]. Splaying one or two
walls of a room does not eliminate modal problems, although it might shift them slightly and
provide somewhat better diffusion [21]. However, while the proportions of a rectangular room can be selected to eliminate, or at least greatly reduce, degeneracies, making the sound field asymmetrical by splaying walls only introduces unpredictability.

2.4 Statistical model for reverberation

When a sufficient density of modes exists, an equal energy density at all points in the room
can be assumed. As a consequence, there will be an equal probability that sound will arrive
from any direction. This condition, as mentioned before, is known as a diffuse field, and it can
be described by a statistical model [1]. This statistical model for reverberation is justified for frequencies higher than the so-called Schroeder frequency [1]

fg = 2000 · √(T60 / V)    (2.10)

where T60 is the reverberation time in seconds and V is the volume of the room in m³.

The number Nf of normal modes below frequency f is approximately [17]

Nf = (4πV / 3c³) · f³.    (2.11)

Differentiating with respect to f , we obtain the modal density as a function of frequency

dNf /df = (4πV / c³) · f²    (2.12)

therefore the number of modes per unit bandwidth grows as the square of the frequency.

Similarly, the temporal density of echoes is described by [17]

dNt /dt = (4πc³ / V) · t²    (2.13)

therefore the density of echoes grows as the square of time.
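Equations (2.11)–(2.13) can be evaluated directly. The sketch below uses a hypothetical 90 m³ room; note the derivatives of (2.11) and of the corresponding cubic echo count, which is where the quadratic growth comes from.

```python
import math

C = 343.0  # speed of sound, m/s

def mode_count(V, f):
    """Approximate number of normal modes below frequency f (Eq. 2.11)."""
    return 4 * math.pi * V * f ** 3 / (3 * C ** 3)

def modal_density(V, f):
    """Modes per Hz at frequency f: the derivative of Eq. (2.11)."""
    return 4 * math.pi * V * f ** 2 / C ** 3

def echo_density(V, t):
    """Reflections per second at time t after excitation."""
    return 4 * math.pi * C ** 3 * t ** 2 / V

V = 90.0  # hypothetical room volume, m^3
print(f"modes below 100 Hz: {mode_count(V, 100):.0f}")
print(f"modal density at 1 kHz: {modal_density(V, 1000):.1f} modes/Hz")
print(f"echo density at t = 0.1 s: {echo_density(V, 0.1):.0f} echoes/s")
```

Doubling the frequency (or the time) quadruples the corresponding density, as stated in the text.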

2.5 Dividing the audio spectrum

In considering the acoustics of small rooms, it is useful to consider the audio spectrum divided
in four regions:

1. The region with upper bound given by c/2L, where c is the speed of sound and L the
longest dimension of the room. In this frequency range there is no resonant support for
sound in the room.

2. The region with upper bound given by the Schroeder frequency fg . In this frequency range the wavelength of the sound being considered is comparable to the room dimensions; the room behavior can be described by its modes, and wave acoustics must be used.

3. The region with upper bound given by 4fg . This is a transition region between region 2 and region 4.

4. The region in which the wavelengths are short enough for geometric acoustics to be valid.
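As a concrete illustration of these boundaries, the sketch below evaluates them for a hypothetical 6 × 5 × 3 m room with T60 = 0.5 s; the transition region (3) is taken to end at four times the Schroeder frequency fg.

```python
import math

C = 343.0  # speed of sound, m/s

def region_bounds(Lx, Ly, Lz, t60):
    """Boundaries (Hz) of the four regions of the audio spectrum for a room."""
    V = Lx * Ly * Lz
    L = max(Lx, Ly, Lz)              # longest room dimension
    f1 = C / (2 * L)                 # region 1: no resonant support below this
    fg = 2000 * math.sqrt(t60 / V)   # Schroeder frequency, Eq. (2.10)
    return f1, fg, 4 * fg            # region 4 (geometric acoustics) above 4*fg

f1, fg, f4 = region_bounds(6.0, 5.0, 3.0, 0.5)
print(f"no resonant support below {f1:.1f} Hz")
print(f"modal region up to fg = {fg:.1f} Hz, transition up to {f4:.1f} Hz")
```

For this example fg ≈ 149 Hz, so the modal region dominates a large slice of the low end of the spectrum, consistent with the quoted "small listening room problem".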


“Very small rooms, with too few modal resonances spaced too far apart, are characterized by domination of a great stretch of the audible spectrum by modal resonances. This is the small listening room problem in a nutshell” [15].

2.6 Reverberation as a linear filter. Impulse responses.

From a signal processing perspective, it is convenient to describe a room containing sound sources and listeners as a system with inputs and outputs, where the input and output signal amplitudes correspond to acoustic variables, usually the sound pressure, at points in the room [17]. If the acoustic path (the channel) is modelled as a linear time-invariant system characterized by an Impulse Response (IR) h(n), the source signal, s(n), and the reverberant signal, x(n), are linked by the equation

x(n) = h(n) ∗ s(n) (2.14)

where ∗ denotes the discrete linear convolution.
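Equation (2.14) in code: a dry signal convolved with a toy, hand-made impulse response yields the reverberant observation. The IR taps below are arbitrary illustrative values (note that real IRs contain both positive and negative taps, a point discussed in Section 2.7.2).

```python
import numpy as np

rng = np.random.default_rng(0)

s = rng.standard_normal(1000)       # dry source signal s(n)
h = np.zeros(200)                   # toy impulse response h(n)
h[0] = 1.0                          # direct sound
h[50], h[120] = 0.6, -0.3           # two discrete early reflections (sign matters)
h[1:] += 0.02 * rng.standard_normal(199) * np.exp(-np.arange(1, 200) / 60.0)  # decaying tail

x = np.convolve(h, s)               # reverberant signal x(n) = h(n) * s(n)
print(len(x))                       # → 1199, i.e. 1000 + 200 - 1
```

The full linear convolution of a length-N signal with a length-M IR has length N + M − 1: reverberation smears the signal beyond its original duration.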

The Z transform of the IR is the Room Transfer Function (RTF). Since reverberation is due to energy that decays below the noise floor in a finite amount of time, RTFs are usually described as finite impulse response (FIR) filters, that is, all-zero models (MA models). Recursive models are sometimes assumed (i.e. AR or ARMA models). In general, any RTF can be split into an FIR part and an IIR part [22], and this latter part is independent of the source or receiver position, being given by the common resonances of the room. In fact, modes do not change when the source and receiver locations are modified. For low-frequency modes, the distance between pressure minima and maxima is wide. As a consequence, the room frequency responses, taken in different positions, usually exhibit common peaks at low frequencies. This is an intrinsic physical aspect of room acoustics and does not depend on the adopted model. These peaks can be modelled by common acoustic poles shared among the room transfer functions. This property has been employed to achieve multiple point low frequency equalization [23].

Even if it might be expected that a pole-zero representation would be much more economical
than an all-zero one, in [24] it is shown that in general this is not the case. Furthermore, many
system identification strategies provide an FIR model, while the process to calculate a recursive one is less straightforward [25]. Thus, the FIR model is often preferred.

Figure 2.5: Impulse response of a small room

Figure 2.6: Echogram of the previous IR

2.6.1 Impulse responses analysis

Assuming that the IR of a specific source/listener path has been estimated, several parameters (e.g. T60, EDT, modal distribution, steady state frequency response, intelligibility, clarity, etc.) can be estimated from it. This allows one to verify and to diagnose possible problems in the acoustic quality of a room.

A useful representation of the IR is the echogram, obtained by considering the logarithm of the squared value of the IR, as shown in Fig. 2.6. In the echogram, it is easier to detect the most energetic reflections that characterize the IR. As reported by Gardner [26], the number of these reflections is small in comparison to the length of the whole IR. Usually these reflections are contained in the first 100 ms and are called early reflections. Early reflections convey information about the room shape. When both the direct sound and a reflection are presented frontally and the reflection delay is greater than 80 ms, the reflection is perceived as a distinct echo of the direct sound if it is sufficiently loud. As the reflection delay becomes smaller, the reflection and direct sound fuse into one sound, but with a tonal coloration attributed to the cancellation between the two signals [17].

Figure 2.7: An idealized energetic distribution of an IR

Two portions of reverberation can be distinguished: early reverberation, composed of strong sparse reflections, and late reverberation, characterized by a uniform diffusion of the reflections. Therefore the impulse response of a room is characterized by a transition from a sparse to a stochastic phenomenon. In this sense, sparsity and stochasticity coexist in reverberation. A simple measure, based on this observation, that can determine the point in time when the early reflections have fully transitioned to late reverberation is described in [27].

The energy decay curve, which describes how the energy of the IR decays, is defined by the Schroeder integral

EDC(t) = ∫t^∞ h²(τ) dτ.    (2.15)

The T60 and the EDT can be estimated directly from the EDC by line fits between -5 and -35 dB and between 0 and -10 dB respectively. The T60 and the EDT are often calculated also for 1/3 octave ISO bands.
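The procedure sketches easily: compute the Schroeder integral (2.15) by backward cumulative summation of h², convert to dB, and line-fit the prescribed segment. The exponentially decaying synthetic IR below is only for illustration (its envelope is built to give T60 ≈ 0.4 s).

```python
import numpy as np

fs = 8000
rng = np.random.default_rng(1)
t = np.arange(fs) / fs
h = rng.standard_normal(fs) * np.exp(-3 * np.log(10) * t / 0.4)  # synthetic IR, T60 ~ 0.4 s

def decay_time(h, fs, lo_db, hi_db):
    """60 dB decay time from a line fit to the EDC between hi_db and lo_db."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]               # Schroeder backward integration
    edc_db = 10 * np.log10(edc / edc[0])              # normalize and convert to dB
    idx = np.where((edc_db <= hi_db) & (edc_db >= lo_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # decay slope in dB per second
    return -60.0 / slope

print(f"T60 = {decay_time(h, fs, -35, -5):.2f} s")
print(f"EDT = {decay_time(h, fs, -10, 0):.2f} s")
```

For this ideal single-slope decay the two estimates coincide; in a real room with a non-linear EDC (e.g. coupled spaces) the EDT can differ markedly from the T60.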

A time-frequency extension of Schroeder’s Energy Decay Curve can be defined on the basis of a power density time-frequency representation ρh (t, f ) obtained from h(t)

EDR(t, f ) = ∫t^∞ ρh (τ, f ) dτ.    (2.16)

This generalization of the EDC to multiple frequency bands has been formalized by Jot [28] as the energy decay relief, EDR(t, f ), which is a time-frequency representation of the energy decay. Therefore, EDR(0, f ) gives the power gain as a function of frequency and EDR(t, f0 ) gives the energy decay curve for some frequency f0 . This representation allows one to diagnose undesired unevenness in the time and frequency response of a room.

Figure 2.8: EDR at low frequencies of the previous IR
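A minimal EDR sketch, assuming a magnitude-squared STFT as the power density ρh(t, f) (one common choice; other time-frequency representations are equally valid), with all frame sizes arbitrary:

```python
import numpy as np

def edr(h, frame=256, hop=128):
    """Energy decay relief: backward time-integration of a squared-magnitude STFT of h."""
    n_frames = 1 + (len(h) - frame) // hop
    win = np.hanning(frame)
    spec = np.array([np.abs(np.fft.rfft(win * h[i * hop: i * hop + frame])) ** 2
                     for i in range(n_frames)])     # rho_h(t, f), shape (time, freq)
    return np.cumsum(spec[::-1], axis=0)[::-1]      # integrate from t to the end of the IR

rng = np.random.default_rng(2)
h = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 600.0)  # synthetic decaying IR
E = edr(h)
print(E.shape)   # (time frames, frequency bins); E[0] is the power gain vs frequency
```

Because the integrand is non-negative, every column of the EDR is non-increasing in time: each frequency's slice is a valid per-band energy decay curve.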

Several other parameters can be computed from the IR.

The D50 parameter (also known as “Definition”) is the early to total sound energy ratio. It is defined as:

D50 = ∫0^50ms h²(τ) dτ / ∫0^∞ h²(τ) dτ.    (2.17)

Expressed by values between 0 and 1, it is associated with the degree to which rapidly occurring individual sounds are distinguishable. Usually rooms designed for speech require D50 > 0.5, while rooms designed for music require D50 < 0.5.

The ts parameter (also known as “Centre Time”) is the time of the centre of gravity of the squared impulse response. A high value is an indicator of poor clarity. It is defined as:

ts = ∫0^∞ τ · h²(τ) dτ / ∫0^∞ h²(τ) dτ    (2.18)

and is expressed in ms. Usually rooms designed for speech require ts < 50 ms, while rooms designed for music require 50 ms < ts < 250 ms.
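Both parameters reduce to ratios of energy sums and can be computed directly from a sampled IR; the IR below is a synthetic exponential-decay example, not a measurement.

```python
import numpy as np

fs = 8000
rng = np.random.default_rng(3)
t = np.arange(fs // 2) / fs
h = rng.standard_normal(fs // 2) * np.exp(-t / 0.05)   # synthetic decaying IR

e = h ** 2
d50 = e[: int(0.050 * fs)].sum() / e.sum()             # early-to-total energy ratio, Eq. (2.17)
ts = (t * e).sum() / e.sum() * 1000.0                  # centre time in ms, Eq. (2.18)
print(f"D50 = {d50:.2f}, ts = {ts:.1f} ms")
```

For this IR the energy envelope decays by half roughly every 17 ms, so most of the energy arrives early: D50 comes out well above 0.5 and ts well below 50 ms, i.e. a "speech-friendly" decay by the criteria above.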


Beyond the fact that many other descriptive acoustic parameters can be defined (e.g. Loudness, Intimacy, Warmth, IACC, ST1, etc.), there are four general factors that are of main importance for good acoustics:

• adequate reverberation time according to the room functionality (e.g. classroom, conference hall, theater, church, etc.)

• uniformity in the frequency response (avoid large peaks or dips), that is linked to an
adequate modal density

• uniformity in the time response (avoid excessive comb filtering or echo), that is linked to
an adequate time density

• spatial uniformity of time and frequency response

2.7 Impulse response synthesis techniques

One of the main needs of the acoustician is to be able to predict the acoustics of a room prior to construction. This allows one to correct the design of a room and to evaluate many options and surface treatments. Computer simulations are now possible and several different approaches have been proposed. The main strategies available are based on geometric acoustics or on the direct numerical solution of the acoustic wave equation.

2.7.1 Geometric acoustic based techniques

Being computationally less intensive, these techniques are, at the current state of the art, much more widespread. However they can be inaccurate at low frequencies, where the modal behavior prevails. The most popular methods belonging to this family are ray tracing [29] and the mirror image source method (MISM) [30]. Other derived methods, such as conical and pyramidal tracing, are less usual [31].

2.7.1.1 Ray tracing

The ray tracing method assumes that:

• the propagation of the sound energy takes place along linear trajectories


Figure 2.9: 2D representation of the ray tracing algorithm

• the sound is specularly reflected by surfaces

• the energy of the source is quantized in a finite number of sonic rays

• starting from the source, the propagation takes place in all directions and follows the laws of geometric acoustics

• the sonic rays have infinitesimal section

• geometric divergence of rays takes into account the geometric divergence of energy

• the ray energy decreases for surface and air absorption

• the rays are summed together, neglecting the phase, when in proximity of the receiver

Ray tracing is computationally efficient, can be used for rooms of arbitrary shape, and can easily model diffusion and refraction and, less easily, diffraction. On the down side, the choice of the number of rays to employ and of the size of the receiver is somewhat arbitrary and critical.

2.7.1.2 Mirror image source method (MISM)

The MISM method is based on the idea that a specular reflection from a flat surface is equivalent to the direct field of an identical virtual source placed symmetrically on the opposite side of the surface. As shown in Fig. 2.11, this can be simply extended to higher order reflections by considering as the real source the virtual source of the previous reflection.

The following hypotheses hold for this method:


Figure 2.10: A virtual source

Figure 2.11: MISM for a simple rectangular room

• the sound is specularly reflected by surfaces, according to geometric acoustics

• a virtual source is associated to any specular reflection

• any source emits spherical wavefronts

• the ray energy decreases for geometric divergence, surface and air absorption

MISM is therefore theoretically more solid than ray tracing. However, it becomes extremely inefficient for rooms of arbitrary shape and it cannot model refraction and diffraction.

To address these drawbacks, hybrid methods based on MISM and ray tracing have been pro-
posed [32].


Figure 2.12: a) IR from an Allen and Berkley type algorithm; b) enhanced IR obtained by using a negative reflection coefficient; c) measured IR

2.7.2 Comments on “Image method for efficiently simulating small-room acoustics” by J. B. Allen and D. A. Berkley

The most well known paper about the MISM method is [30] by Allen and Berkley. The proposed algorithm is often used to create synthetic IRs for the evaluation of blind separation, deconvolution and dereverberation methods. One of its assumptions is a positive value for the pressure reflection coefficient. This choice determines a non-negative IR that neglects all phase modifications caused at every reflection. To remove the DC component, the authors suggest high-pass filtering the obtained impulse response. Even after this heuristic post-processing, as shown in Fig. 2.12, the IR is still heavily biased, and very dissimilar to a natural acoustic IR.

Possible improvements to the Allen and Berkley algorithm have been discussed in a recent
paper [33] (2008). In it, the use of a negative value for the pressure reflection coefficient has
been suggested, but without strong physical evidence of the choice.

The physical reason for this improvement can be easily explained: the power absorption and reflection coefficients α and r and the pressure reflection coefficient R (called β in [30]) are linked by the following equation:

α + r = α + R² = 1    (2.19)

therefore two possible values of R can be obtained from the power reflection coefficient r

R = ±√r.    (2.20)

This ambiguity is not discussed in [30], and positive pressure reflection coefficients are chosen without comment. However, if a normally incident plane wave is considered, the incident and the reflected pressure waves, respectively pi and pr , are related by the equation [1]

pr / pi = (z2 − z1 ) / (z2 + z1 )    (2.21)

where z2 is the acoustic impedance of the medium in which the incident and reflected pressure waves propagate and z1 is the impedance of the reflective material¹. Then, if z1 >> z2 (e.g. a concrete wall and the air),

pi ≈ −pr    (2.22)

and the incident and the reflected waves have opposite phase. Therefore a negative coefficient can better model the reflection caused by the jump of impedance from air to a rigid wall.
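As a sanity check of the sign discussion, the following deliberately simplified 1-D image-source sketch (two parallel walls only, not the full Allen and Berkley algorithm; all positions and coefficients are hypothetical) compares R = +√r with R = −√r:

```python
import numpy as np

C, FS = 343.0, 8000  # speed of sound (m/s), sample rate (Hz)

def ir_1d(L, xs, xr, r, sign, n_img=40, dur=0.25):
    """Toy 1-D image-source IR between two parallel walls a distance L apart.
    xs, xr: source/receiver positions; r: power reflection coefficient;
    sign: +1 or -1 selects the pressure reflection factor R = sign * sqrt(r)."""
    R = sign * np.sqrt(r)
    h = np.zeros(int(dur * FS))
    for k in range(-n_img, n_img + 1):
        # the two image lattices, with their reflection (bounce) counts
        for pos, bounces in ((2 * k * L + xs, abs(2 * k)),
                             (2 * k * L - xs, abs(2 * k - 1))):
            d = abs(pos - xr)
            n = int(round(d / C * FS))
            if 0 < d and n < len(h):
                h[n] += (R ** bounces) / d   # 1/d spherical spreading

    return h

h_pos = ir_1d(3.0, 1.0, 2.2, r=0.7, sign=+1)
h_neg = ir_1d(3.0, 1.0, 2.2, r=0.7, sign=-1)
print(f"DC component: +R -> {h_pos.sum():.2f}, -R -> {h_neg.sum():.2f}")
```

With R = +√r every tap is positive and the taps accumulate into a large DC bias; with R = −√r each odd-bounce tap flips polarity and the bias largely cancels, mirroring the improvement visible in Fig. 2.12 b).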

A better solution, but computationally intensive and difficult to realize due to the lack of data
about material properties, would be to consider the complex reflection coefficient of the mate-
rial, and model the surface as a filter with precise phase and modulus response.

Sometimes the MISM problem is formulated in terms of energetic aspects, by using the intensity instead of the pressure level. This can generate confusion about the MISM formulations and about which reflection coefficient should be used (r calculated from the absorption coefficient α, or the complex valued coefficient R obtained from the material impedance). However, this confusion can be dispelled by considering the fact that a microphone is a pressure transducer, not an intensity one.

¹The acoustic impedance of the medium is a real value for a plane wave [1].

Another assumption made in the Allen and Berkley paper is the independence of the reflection coefficient from the angle of incidence. Even if this is not stated explicitly, it implies that the surface is locally reacting, i.e. that the motion at one point of the surface is not related to that at other points. This approximation, however, applies only to some materials (e.g. a porous surface) [16].

These observations suggest that simulations performed with synthetic IRs should be evaluated with care before generalizing the results to a real room. As an example, room IR estimation techniques based on a non-negative impulse response hypothesis have even been proposed [34]. This is not meaningful from a physical point of view: real IRs are both positive and negative. The incorrect non-negativity assumption has probably been influenced by an uncritical use of the Allen and Berkley MISM method.

2.8 Numerical solution of the acoustic wave equation

Room simulation can be realized by the numerical solution of the acoustic wave equation:

∇²p − (1/c²) · ∂²p/∂t² = 0    (2.23)

where ∇2 is the Laplace operator, p is the acoustic pressure and c the speed of sound. The ad-
vantage of this approach is that it gives a correct solution also when it is not possible to neglect
the undulatory nature of sound (i.e. the wavelength is comparable to the room dimension). The
main drawback, however, is efficiency. To assure a minimum of accuracy, the discretization step of the acoustic space under evaluation must be no larger than 1/8 of the shortest wavelength of interest [35]. Therefore it is common to restrict the analysis to the low frequency range or
to small spaces. In fact, the extension to higher frequencies implies the evaluation of billions
of variables, with a huge computational power demand. Furthermore, the boundary conditions
must be specified in term of complex acoustic impedance of the material, and these data are
still largely unavailable.

The most famous method is the finite element method (FEM) [36], that can be employed for
irregular geometries and with variable boundary conditions. The FEM method neither depends
on the system geometry nor on the medium properties, therefore complex systems composed
of different media can be considered. It can be shown that the method converges to the exact
solution when the number of elements is increased. In FEM calculations, the acoustic field in the region of interest is approximated as a sum of simple functions multiplied by unknown coefficients. These coefficients are then calculated by solving a linear system of equations with
the right hand side defined by the sources producing the acoustic field. This calculation is
usually done in the frequency domain, i.e. under the assumption that the fields are oscillating
at one single frequency. While the FEM method is based on the approximation of the spatial
domain, the Boundary Element Method (BEM) is based on the discretization of the boundary
of the spatial domain. This reduces drastically both the computational power requirements and
the data required to specify the system [37].

A different approach is based on the finite difference time domain (FDTD) method [38]. In FDTD calculations, the solution procedure is quite different with respect to FEM. Instead of
approximating the fields directly, the derivatives of the fields are approximated as the difference
of the field values at adjacent locations divided by the distance between these locations. The
acoustic wave equation is thus transformed into a set of difference equations, valid at a number
of points. These points are usually located on a rectangular mesh, or grid. The main advantages
of the FDTD method are that a single calculation is sufficient to study a wide frequency band,
and that the time domain behavior of the reflected sound can be directly inspected [39].

A promising technology for room acoustic simulation is the digital waveguide mesh (DWM) [40]. DWM algorithms are based on a subset of the wider family of finite difference time domain (FDTD) numerical approximations. These methods have been successfully used in room acoustic prediction and can provide, with respect to the geometric acoustics based methods, better accuracy at low frequencies [41].

Chapter 3
Impulse response identification and
equalization. Input-output techniques

The knowledge of the IR of an acoustic space is useful to:

• extract and analyze physical descriptors of the acoustic space;

• simulate acoustic spaces (auralization by convolution);

• reduce the unwanted artifacts due to reverberation (equalization by deconvolution).

In this chapter the main techniques to estimate them will be reviewed.

3.1 IR identification

3.1.1 Pseudo-impulsive methods

Theoretically, the IR of a room, modeling it as a linear time invariant (LTI) system, can be
obtained by generating a Dirac δ in the position of the speaker and sampling the signal in the
position of the listener. However, while in the digital domain it is easy to obtain an impulse,
a δ is a pure mathematical abstraction in the physical domain. Practically, the measurement
can be obtained using pseudo-impulsive signals (e.g. the pop of an exploding balloon), but
only a rough approximation of the real IR is provided. In fact, the lack of uniformity in the
energetic distribution of the pseudo-impulsive source, determines a distortion of the measured
IR and, potentially, a low signal to noise ratio at certain frequencies. On the other hand, pseudo-
impulsive methods are very simple and only a minimum amount of equipment is necessary to
implement them.


3.1.2 Cross-correlation based methods

A better estimate can be obtained using the cross-correlation between the source and the measured signal.

Consider:

• an LTI system;

• x(n), a real valued white random process (used as the input);

• y(n), the output of the system with IR h(n).

The autocorrelation and the cross-correlation functions, defined as

rxx (m) = E {x(n) · x(n + m)} (3.1)

rxy (m) = E {x(n) · y(n + m)} (3.2)

are linked by the relation


rxy (m) = h(m) ∗ rxx (m) (3.3)

where E {.} and ∗ are respectively the statistical expectation and the convolution operators.

Since x(n) is white noise


rxx (m) = δ(m) (3.4)

and thus
rxy (m) = h(m). (3.5)

Therefore, the impulse response can be calculated directly by the cross-correlation between the
white noise emitted by the source and the signal measured at the listener position.

In principle, the expected value must be computed by averaging x(n) · x(n + l) and x(n) ·
y(n + l) over all realizations of the stochastic process x and y. In practice, an estimate can
be obtained by averaging a finite number of realizations. This is called an “ensemble average”
across realizations of a stochastic process. If the signals are stationary (which primarily means
their statistics are time-invariant), then it is possible to average across time to estimate the
expected value. In other words, for stationary noise-like signals, time averages equal ensemble averages [42].

Therefore the autocorrelation and the cross-correlation functions can be estimated by [42]

r̂xx (m) = (1/N) Σ_{n=0}^{N−1} x(n) · x(n + m)    (3.6)

r̂xy (m) = (1/N) Σ_{n=0}^{N−1} x(n) · y(n + m).    (3.7)

The above definitions are only valid for stationary stochastic processes [42].
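The identity r̂xy(m) = h(m) for a white input can be checked numerically. The sketch below applies the time-average estimator (3.7) to a long noise burst driving a short hypothetical FIR system; the filter taps are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
x = rng.standard_normal(N)                 # white input with unit variance, so r_xx(m) = delta(m)
h = np.array([1.0, 0.5, -0.3, 0.2, 0.1])   # unknown system (hypothetical)
y = np.convolve(x, h)[:N]                  # measured output

# r_xy(m) = (1/N) * sum_n x(n) * y(n+m), evaluated for lags m = 0..9
h_est = np.array([np.dot(x[: N - m], y[m:N]) / N for m in range(10)])
print(np.round(h_est, 2))
```

The first five lags reproduce the filter taps and the remaining lags hover near zero; the residual error shrinks as 1/√N, which is why a long averaging time is needed in practice.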

The IR identification approach known as the Maximum Length Sequence (MLS) method [43] is based on the previous observation. The MLS method can offer better linearity and a better signal to noise ratio with respect to pseudo-impulsive approaches. The principal disadvantage of this technique is its strong dependence on the linearity of the measurement system. Non-existent echoes and phase problems can appear even with small non-linearities in the measuring chain.

3.1.3 Inverse filtering based methods

An alternative approach that assures greater noise immunity, improved robustness against mild
time variations of the system and against the non-linearities present in the measurement chain,
is based on the log swept-sine technique [44]. The swept-sine method uses a known sequence
x(n) and a proper inverse filter f (n), so that

x(n) ∗ f (n) = δ(n) (3.8)

thus the unknown IR h(n) can be calculated by convolving the measured output signal y(n)
with the inverse filter f (n):

h(n) = y(n) ∗ f (n). (3.9)

Using as an input a log swept-sine (a sweep with frequency increasing at a logarithmic rate),


Figure 3.1: An adaptive identifier

x(t) = sin[ (ω1 · T / ln(ω2/ω1)) · ( e^{(t/T)·ln(ω2/ω1)} − 1 ) ]    (3.10)

where ω1 and ω2 are the starting and ending frequencies, and T is the sweep duration, the
inverse filter can be calculated in closed form: it is created by time-reversing the excitation
signal x(t) and then applying to it an amplitude envelope that reduces the level by 6 dB/octave,
starting from 0 dB and ending at −6 log2(ω2/ω1) dB [44].
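A minimal sketch of this measurement in Python/NumPy, assuming illustrative parameter values (sample rate, band limits, and a toy echo at 10 ms); the −6 dB/octave envelope is implemented as the exponential exp(−(t/T)·ln(ω2/ω1)):

```python
import numpy as np

fs, T = 4000, 1.0                            # sample rate (Hz), sweep length (s)
w1, w2 = 2 * np.pi * 50, 2 * np.pi * 1500    # start/end angular frequencies
t = np.arange(int(T * fs)) / fs
L = np.log(w2 / w1)
x = np.sin(w1 * T / L * (np.exp(t / T * L) - 1))   # equation 3.10

# Inverse filter: time-reversed sweep with a -6 dB/octave amplitude
# envelope (0 dB at the start, -6*log2(w2/w1) dB at the end)
f = x[::-1] * np.exp(-t / T * L)

d = np.convolve(x, f)                  # x * f: an approximate (scaled) delta
d /= np.abs(d).max()
peak = int(np.argmax(np.abs(d)))       # expected near sample len(x) - 1

# Measuring a toy IR (equation 3.9): direct path plus one 0.6 echo
h = np.zeros(200); h[0], h[40] = 1.0, 0.6
y = np.convolve(x, h)                  # measured system output
h_hat = np.convolve(y, f)              # recovered (scaled, delayed) IR
```

The deconvolved pulse x ∗ f is not a perfect delta: its spread reflects the finite sweep bandwidth, which is why the method is an approximation of equation 3.8.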

3.1.4 Adaptive identification

If the linear system is not time invariant (i.e. a slowly moving source within a room), and the
system inputs and outputs are known, an adaptive filter can be used to identify and track its
IR. The adaptive filter identifies the unknown system by minimizing the error, according to a
chosen criterion, between the system output and the filter output.

One of the most popular adaptive filter algorithms is the least mean square (LMS). “The LMS
algorithm is a linear adaptive filtering algorithm, which consists of two basic processes:

1. A filtering process, which involves computing the output of a linear filter in response
to an input signal and generating an estimation error by comparing this output with a
desired response.

2. An adaptive process, which involves the automatic adjustment of the parameters of the
filter in accordance with the estimation error.” [45].

The LMS algorithm is composed of three basic relationships


1. the filter output


y(n) = hT · x(n) (3.11)

2. the estimation error


e(n) = d(n) − y(n) (3.12)

3. the adaptation of the filter coefficients

hn+1 = hn + µ · e(n) · x(n)    (3.13)

where x is the system/filter input, y is the filter output, d the system output, µ the adaptation
step, x(n) = [x(n), x(n − 1), ..., x(n − p)]T and hn = [hn (0), hn (1), ..., hn (p)]T is the vector
of the FIR filter coefficients at time n.

The main ideas behind the LMS algorithm applied to system identification are based on:

a) a cost function that measures the error in terms of the power of the error signal, defined as:

J = E[e²(n)] = E[|e(n)|²]    (3.14)

J is a quadratic function of the filter coefficients, whose error surface is therefore a “bowl”-like paraboloid

b) the minimum of J can be approached by moving in the direction of steepest descent, using a
proper step size µ

hn+1 = hn − µ · ∇J(n) (3.15)

where ∇J(n) is the gradient of the cost function (that is a function of the estimated IR at time
n)

c) since the gradient usually cannot be calculated exactly, it must be estimated from the available data.

The gradient is the derivative of J with respect to h

∇J = ∇E[e²(n)] = 2E[e(n) · ∇e(n)]    (3.16)

∇e(n) = ∇[d(n) − y(n)] = ∇[d(n) − hT · x(n)] = −x(n)    (3.17)

therefore

∇J = −2E[e(n) · x(n)].    (3.18)

The simplest estimate of the gradient is the instantaneous one

∇J ≈ −2e(n) · x(n)    (3.19)

thus, absorbing the constant factor into the step size µ, the update equation becomes

hn+1 = hn + µ · e(n) · x(n).    (3.20)

It is important to point out that the LMS does not converge to the real minimum, but only in
expectation. Hence the estimate of a time-varying system will be noisy. Several other adaptive
algorithms with faster convergence properties exist [45]. However, the LMS is, due to its simplicity,
the most widely used.
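The three LMS relationships (equations 3.11–3.13) can be sketched in pure Python; the 4-tap “unknown” system, the step size and the signal lengths below are illustrative assumptions:

```python
import random

random.seed(1)
h_true = [1.0, 0.5, -0.3, 0.1]        # unknown system to be identified
p = len(h_true)
h_est = [0.0] * p                      # adaptive filter coefficients
mu = 0.1                               # adaptation step size

x_buf = [0.0] * p                      # [x(n), x(n-1), ..., x(n-p+1)]
for n in range(5000):
    x_buf = [random.uniform(-1, 1)] + x_buf[:-1]          # new input sample
    d = sum(hk * xk for hk, xk in zip(h_true, x_buf))     # system output
    y = sum(wk * xk for wk, xk in zip(h_est, x_buf))      # filter output (3.11)
    e = d - y                                             # estimation error (3.12)
    h_est = [wk + mu * e * xk for wk, xk in zip(h_est, x_buf)]  # update (3.13)

err = max(abs(a - b) for a, b in zip(h_est, h_true))      # converges toward 0
```

With a noiseless, white input the coefficient error decays to machine precision; in a noisy or time-varying setting the estimate would instead fluctuate around the true response, as noted above.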

3.2 Input-output reverberation cancellation techniques

Reverberation may degrade speech intelligibility and, more generally, the quality of sound and
music [1]. This is common for rooms that have not been designed for such functions. In
this case, the recommended approach is to improve the acoustics of the space by modifying
its physical properties. However, this might not be possible due to functional and economic constraints.
If the speaker-to-receiver impulse response(s) is/are known, an estimate of the source signal
can be obtained from its reverberated instance(s) by the following techniques.


3.2.1 Single channel reverberation cancellation

If the acoustic path (the channel) is modelled as a linear time-invariant system characterized by
an Impulse Response (IR), h(n), the source signal, s(n), and the reverberant signal, x(n), are
linked by the equation
x(n) = h(n) ∗ s(n) (3.21)

where ∗ denotes the discrete linear convolution. Dereverberation is achieved by finding a filter
with impulse response w(n) so that

δ(n − τ ) = h(n) ∗ w(n) (3.22)

where w(n) is defined as the inverse filter, or the equalizer, of h(n), δ(n) is the unit sample
sequence and τ a delay [46].

The source signal s(n) is given by

s(n) = w(n) ∗ x(n). (3.23)

3.2.1.1 Zero forcing equalizer

The Zero-Forcing (ZF) equalizer applies the inverse of the channel to the received signal, to
restore the signal before the channel. The inverse of the channel, W (ω), is

W(ω) = 1 / H(ω)    (3.24)

where W (ω) and H(ω) are the Fourier transforms of w and h at frequency ω.
A condition for the existence of the ZF equalizer is that H(ω) must have no spectral nulls (i.e.
the system function H(z) has no zeros on the unit circle). This affects the robustness of the
ZF approach in noisy environments, even for transfer-function zeros close to the unit circle [47].
The inversion would in fact induce an extremely high gain that would amplify the noise (i.e.
ringing). With noise present, perfect source recovery becomes impossible.
A possible solution, among several [48], is to use a regularization scheme [49] or Wiener de-
convolution [50]. In general, it is desirable to be able to flatten the frequency response, but not
at the expense of boosting dips or notches to the point where the boost causes amplifier and
speaker overload or massive amounts of the boosted frequency at other listening positions. To
overcome this problem, one needs to measure the original room frequency response, and not to
equalize for the actual measured response, but for a “regularised” version of the room response
which, in some manner, has filled in deep troughs of the measured response [51].
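A minimal sketch of such a regularized inversion, assuming a simple Tikhonov-style constant β (not the specific schemes of [49] or [51]): the inverse spectrum W(ω) = H*(ω)/(|H(ω)|² + β) flattens the response everywhere except near deep dips, where the boost is limited. The 2-tap channel below is an illustrative choice with a deep dip near Nyquist:

```python
import numpy as np

h = np.zeros(64)
h[0], h[1] = 1.0, 0.9          # channel with a deep spectral dip near Nyquist
H = np.fft.fft(h)

beta = 1e-3                    # regularization constant (illustrative)
W = np.conj(H) / (np.abs(H) ** 2 + beta)   # regularized inverse

# Equalized (circular) response: |H|^2 / (|H|^2 + beta), close to a delta
eq = np.fft.ifft(H * W).real
```

Setting beta = 0 recovers the plain ZF inverse of equation 3.24, with its huge gain at the dip; increasing beta trades equalization depth for noise robustness.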

3.2.1.2 Wiener deconvolution

In Wiener deconvolution, the “closeness” between s(n) and y(n) is measured in terms of the
Mean-Squared Error (MSE)

E{[s(n) − y(n)]²}.    (3.25)

Choosing the equalizer transfer function to minimize 3.25 results in the Minimum Mean-
Squared Error (MMSE) equalizer. The MMSE equalizer is most easily described in the fre-
quency domain

W(ω) = (1/H(ω)) · Ps(ω) / (Ps(ω) + Pv(ω))    (3.26)

where Ps (ω) is the power spectrum of the source signal s(n) and Pv (ω) the power spectrum of
the noise v(n).

The MMSE equalizer can be viewed as a cascade of a ZF equalizer followed by a Wiener
smoothing filter. From equation 3.26 it can be observed that the equalizer magnitude decreases
as the noise at certain frequencies increases. The Wiener smoothing filter therefore attenuates
frequencies depending on their signal-to-noise ratio. An MMSE equalizer takes the spectral
densities of both the source signal and the noise into account to offer the best possible trade-off between
system inversion and noise suppression. Indeed, if no noise is present, Pv(ω) = 0 and the MMSE
equalizer reduces to the ZF equalizer. These benefits normally make the MMSE equalizer
the first choice over the ZF equalizer.

Equation 3.26 reports the most general, usually non-causal, MMSE equalizer. A practical linear
equalizer, on the other hand, must be stable and causal. This can be obtained by designing an
FIR filter, which is stable by construction, that minimizes the following MSE [52]


E{[d(n) − y(n)]²}    (3.27)

where the desired output d(n) = s(n − τ), with τ ≥ 0, is a delayed version of the source
signal s(n). This leads to the Wiener-Hopf equation [50]

w = Rx^{−1} · rdx    (3.28)

where w = [w(0), ..., w(L − 1)]T is the Lth-order inverse filter, Rx = E{xn xnT} is the
autocorrelation matrix of the received signal, and rdx = E{d(n) xn} is the cross-correlation
vector between the target and the received signals.

While providing a closed-form expression, equation 3.28 has practical inconveniences. Rx
cannot be inverted when the power spectrum of the received signal x(n) has spectral nulls
[52]. Furthermore, the direct matrix inversion may require excessive computational resources.
Therefore, iterative procedures, such as the LMS or the Recursive Least Squares (RLS) algo-
rithm [45], may be preferred.
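Equation 3.28 can be sketched with the statistics estimated from data; the noiseless toy channel h = [1, 0.5], delay τ = 0, and filter length L = 12 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(20000)          # white source signal
h = np.array([1.0, 0.5])                # minimum-phase toy channel
x = np.convolve(s, h)[:len(s)]          # received (reverberated) signal
tau, L = 0, 12                          # target delay and inverse-filter length

# Lagged data matrix: row n holds [x(n), x(n-1), ..., x(n-L+1)]
X = np.column_stack([np.concatenate([np.zeros(k), x[:len(x) - k]])
                     for k in range(L)])
d = np.concatenate([np.zeros(tau), s[:len(s) - tau]])   # delayed source

Rx = X.T @ X / len(x)                   # sample autocorrelation matrix
rdx = X.T @ d / len(x)                  # sample cross-correlation vector
w = np.linalg.solve(Rx, rdx)            # Wiener-Hopf solution (eq. 3.28)

mse = np.mean((d - X @ w) ** 2)         # residual error, close to zero here
```

Here the 12-tap FIR closely approximates the infinite inverse (−0.5)^n of the channel; with a non-minimum phase channel, the same construction would need a larger delay τ, which motivates the next section.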

3.2.2 Non-minimum phase SISO system inversion

When the speaker-to-receiver impulse response h(n) is non-minimum phase (i.e. it has zeros
outside the unit circle), the calculation of its inverse filter is problematic. In fact, the inverse
of a non-minimum phase FIR system is an unstable IIR filter. The acoustic signal-transmission
channel is, except for almost anechoic rooms or positions very close to the sound source in normal
rooms, a non-minimum phase function [53]. A possible solution is to consider a truncated FIR
approximation of the inverse IIR filter, which is by definition always stable. However, truncation
can lead to inaccuracies. Therefore, other techniques to invert non-minimum phase single-channel
systems have been investigated.

Homomorphic techniques [54] are based on the decomposition of the IR into minimum-phase and
maximum-phase (all zeros outside the unit circle) components prior to inversion. However,
performing this decomposition is not trivial and therefore, in general, the calculated inverse
filter gives poor results.

The least-square inversion method, proposed in [55] by Mourjopoulos, relies on the minimization
of the cost function

J(τ) = Σ_{j=1}^{L−1} (δ(j − τ) − f(j))²    (3.29)

with f(j) = Σ_{i=1}^{L−1} w(i)·h(j − i), where h is the time-invariant impulse response, w =
[w(0), ..., w(L − 1)]T its least-square inverse filter of length L, and τ a delay.

The MMSE criterion tends to spread the error out in time, causing long non-zero tails (i.e. longer
reverberation) in the equalized impulse response. To address this problem, a weighted least-square
inversion method, which forces the system to equalize the long tails at the expense of the energy
near the main tap, has been proposed in [56] by Gillespie et al.

Both the homomorphic and least-square approaches highlight that an acausal inverse filter is
necessary to compensate for the non-minimum phase component of the system. Such acausal filters
can generate perceptually disturbing “pre-echoes” due to any inversion error (i.e. an imperfect
match between response and inverse filter) [57].

Even if the least-square approach can give in theory very good results, in real-time situations
it offers marginal improvement [58]. In fact, the performance deteriorates dramatically when
a filter designed for one position within the room is employed for equalization at a different
position. The performance also deteriorates for longer filters. This is due to the fact that Room
Transfer Functions (RTFs) vary dramatically from point to point within the same enclosure:
displacements of just a few tenths of the acoustic wavelength can cause large degradations in the
equalized room response [59]. Thermal variations can also induce substantial non-stationarity. Therefore,
even a small mismatch between measured response and inverse filter can introduce measurable
errors larger than those removed by the dereverberation technique [59]. At points in the room
other than that of the measurement microphone, the equalizer and the room response do not
completely cancel each other out, so that a “pre-echo” or pre-response can be heard before the
main sound arrives. Such pre-responses sound highly unnatural and are very audible [60],[61].

These observations led Hatziantoniou and Mourjopoulos to consider it unrealistic that the ideal
equalization filters could be designed for all source/receiver positions. They also observed that
the direct response inversion of the mixed-phase room response can suffer from perceptually
disturbing time/frequency domain artifacts related to pre-echoes and ringing poles compensating
for the original spectral dips. They therefore proposed a room equalization method, called
“Complex Smoothing”, based on a frequency domain regularization scheme of the measured
room responses, that generates an inverse filter of reduced spatial, spectral and time complex-
ity, but still capable of decreasing phase, transient and spectral distortions introduced by room
acoustics [62], [57]. The Complex Smoothed room discrete-frequency response Hcs (ω) is
transformed into a corresponding smoothed room impulse response hcs (n). Then, an inverse
filter wcs (n) is evaluated, which inverts the Complex Smoothed response, i.e.

hcs (n) ∗ wcs (n) ≈ δ(n). (3.30)

This approach does not achieve a theoretically-perfect room acoustics deconvolution, but re-
alizes a good compromise between reduction of perceived reverberation effects and the intro-
duction of inaudible processing artifacts. In the time domain the equalized response has more
power shaped in the direct and early reflection path sounds and less power allocated in some of
the reverberant components. In the frequency domain, the equalization procedure corrects gross
spectral effects, without attempting to compensate for many of the original narrow-bandwidth
spectral dips. Such improvements are achieved to the benefit of reproduction in other positions
within the same enclosure [58]. In fact, since the inverse filter is designed to have progressively
reduced frequency resolution from low to high frequencies, it can compensate for the full range
audio spectrum while modeling the low frequency modes of the room, that are independent of
position.

Another approach to achieve multiple-point equalization, which cannot recover the frequency-response
dips of the multiple room transfer functions but can suppress their common peaks
due to resonance, was proposed by Haneda et al. [23]. The proposed equalization scheme is
based on common acoustic poles equalization and employs an IIR model for the RTF. The
inverse filter, which is a causal FIR, achieves equalization without the pre-echo problem. This
method is only useful for low-frequency equalization. However, since it is quite easy to achieve
reverberation reduction at high frequencies by using foam acoustic absorbers, it can have practical
applications when the modal response of a room needs to be improved. A blind method to
calculate the common acoustic poles has been proposed by Hikichi et al. [63]. In theory, this can
be used to build a self-adapting equalizer that tracks the changing conditions of the room.


3.2.3 Multi-channel reverberation cancellation, the multiple input/output inverse theorem (MINT)

The dereverberation problem can be generalized for an arbitrary N -channel system, where
reverberated instances xi (n) of the source signal s(n) are acquired at N different positions
within a room. This leads to the following set of relations

xi (n) = hi (n) ∗ s(n), 1 ≤ i ≤ N (3.31)

Dereverberation is achieved by finding a set of filters with impulse responses wi(n) so that

δ(n − τ) = Σ_{i=1}^{N} hi(n) ∗ wi(n)    (3.32)

where xi (n), hi (n), wi (n) are respectively the i-th observation, transfer function and the in-
verse filter of the corresponding source-to-receiver channel and τ a delay. The previous equa-
tion can be written in the Z domain as

z^{−τ} = Σ_{i=1}^{N} Hi(z)·Wi(z).    (3.33)

An estimate y(n) = ŝ(n) of the source signal s(n) is given by

ŝ(n) = y(n) = Σ_{i=1}^{N} xi(n) ∗ wi(n).    (3.34)

FIR inverse filters wi (n) exist only if the channel transfer functions Hi (z), i = 1, ..., N , have
no common zeros, or in other words, only if they are coprime. This is what is meant by “channel
diversity”.

Coprimeness is strictly related to the Bézout identity [64]. Bézout’s theorem for polynomials
states that if P and Q are two polynomials with no roots in common, then there exist two other
polynomials A and B such that AP + BQ = 1.

Consider the multichannel model shown in Fig.3.2. If the subchannels are not coprime, then


Figure 3.2: a SIMO system.

there exists a common factor c(z) such that

Hi(z) = c(z)H̃i(z),  i = 1, ..., N.    (3.35)

Consequently, without further information, it is difficult to distinguish whether c(z) is part of the
input signal or part of the channel. Therefore, the identification cannot be made unique (unless
other properties of the input sequence and the channel are used) [65].

The set of filters that satisfy equation 3.32 is not unique. A possible solution is provided by
the MUltiple-input/output INverse Theorem (MINT), proposed in [9]. The MINT allows one
to calculate a set of FIR filters, wi (n), when hi (n) are FIR with no common zeros.

The filters wi (n) can be calculated from the following equations [9]:

Hw = b (3.36)

where

H = [H1 , H2 , .., HN ] (3.37)

and Hi, with i = 1, 2, ..., N,

        ⎡ hi(0)    0      ···     0    ⎤
        ⎢ hi(1)  hi(0)    ···     0    ⎥
        ⎢   ⋮      ⋱       ⋱      ⋮    ⎥
  Hi =  ⎢ hi(J)   ···           hi(0)  ⎥    (3.38)
        ⎢   0    hi(J)    ···   hi(1)  ⎥
        ⎢   ⋮      ⋱       ⋱      ⋮    ⎥
        ⎣   0      0      ···   hi(J)  ⎦

is a (J + M) × M matrix, where J is the length of the impulse response and M the length of the inverse
filter,

w = [w1 (1), ..., w1 (M ), ..., wN (1), ..., wN (M )]T (3.39)

b = [0, ..., 0τ , 1, 0, ..., 0]T (3.40)

where τ (τ ≥ 0) is an arbitrary delay. The inverse impulse responses can be obtained by

w = H+ b (3.41)

where H+ is the pseudo inverse of the matrix H. Any generalized inverse can also be used.
For an N -channel system, considering a J-tap long impulse response, the minimum number of
taps, M , required in each filter is calculated by setting M so that the matrix H is square, i.e. ,
J + M = N M holds, which leads to

M = dJ/(N − 1)e. (3.42)

The filter length can be set at M > dJ/(N − 1)e as well [66], where dxe is the smallest integer
not less than x.
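Equations 3.36–3.41 can be sketched for N = 2 channels; the two channel IRs below are arbitrary coprime examples (chosen so that they share no zeros), and the inverse filters come out exactly FIR:

```python
import numpy as np

def conv_matrix(h, M):
    """(len(h)+M-1) x M convolution (Sylvester) matrix of h."""
    H = np.zeros((len(h) + M - 1, M))
    for c in range(M):
        H[c:c + len(h), c] = h
    return H

h1 = np.array([1.0, 0.5, 0.2, 0.1])      # channel 1 (illustrative)
h2 = np.array([1.0, -0.4, 0.3, -0.2])    # channel 2, no common zeros with h1
M = len(h1) - 1                           # eq. 3.42 with J = len(h1)-1, N = 2
H = np.hstack([conv_matrix(h1, M), conv_matrix(h2, M)])   # square here
b = np.zeros(len(h1) + M - 1)
b[0] = 1.0                                # target: delta(n), delay tau = 0
w = np.linalg.pinv(H) @ b                 # equation 3.41
w1, w2 = w[:M], w[M:]

# Equalized response of eq. 3.32: exactly a delta, up to rounding
equalized = np.convolve(h1, w1) + np.convolve(h2, w2)
```

Note that each inverse filter is shorter than either channel IR — the multichannel Bezout identity removes the need for the long truncated IIR inverses of the single-channel case.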

For a given impulse response of length, J, by increasing the number of microphones, N , the
inverse filter length, M , can be reduced. As a consequence, as shown in Fig. 3.3, the total


Figure 3.3: The total number of filter taps in the equalizer can be reduced by increasing the
number of microphones N. The figure reports equation 3.43 for a 1000-tap
impulse response when the number of channels is increased from 2 to 251. The non-monotonic
decrease of the total tap number is due to the fact that the channel filter
length, M, can only assume integer values (as stated by equation 3.42).

number of taps required for the equalizer

Taps = M · N    (3.43)

is reduced. In other words, the use of more microphones yields less computational demand and
less memory requirements. Thus multi-channel based structures can potentially exploit these
properties to provide more efficient dereverberation.

On the other hand, in [66] Hikichi et al. show that it is desirable to reduce the inverse filter
norm in order to reduce the sensitivity to RTF variations caused by source position changes. This can
be achieved by lengthening the filter and by choosing a positive value for the arbitrary delay τ in
equation 3.40, so that the causality constraint is relaxed. It therefore seems advisable to use
a longer filter length than the one suggested by 3.42.

3.2.4 Multi-channel system inversion

The MINT theorem offers a solution to the instability issue associated with the inversion of
non-minimum phase transfer functions [9], by ensuring that the filters for the equalizer will be
FIR if the channel transfer functions are FIR. Since reverberation is essentially due to energy
that decays within a finite amount of time, every room transfer function can be represented by


Figure 3.4: Illustration of the inversion: (a) single echo impulse response with α = −0.9 and
k = 50; (b) truncated inverse filter. A strong reflection requires a very long inverse
filter.

an FIR filter. Therefore, the inverse impulse responses are characterized by shorter lengths than
in the single-channel case. In fact, also in the minimum-phase case, the inverse of a finite IR is
an IIR filter, and therefore of infinite length.

As an example, let us consider the problem of equalizing a single-echo IR of the kind:

h(n) = δ(n) + α · δ(n − k)    (3.44)

where k is the delay expressed in samples and α the reflection gain.

Its Z transform is

H(z) = 1 + α · z^{−k}.    (3.45)

This filter is also known as an FIR comb. Its inverse transfer function is

W(z) = 1 / (1 + α · z^{−k}).    (3.46)

This filter is also known as an IIR comb.

The inverse Z transform of the IIR comb is

w(n) = Σ_{m=0}^{∞} (−α)^m · δ(n − m·k).    (3.47)

It can be observed that, when the reflection gain increases, the decay rate of the inverse filter
impulse response slows down. This implies, as shown in Fig.3.4, that, if the approximation of
a truncated inverse filter is used, a very long inverse filter might be necessary to obtain a good
dereverberation.
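This can be checked numerically; the sketch below keeps the first M terms of the IIR comb expansion of equation 3.47, leaving a residual of magnitude |α|^M after equalization (α = −0.9 and k = 50 match Fig. 3.4):

```python
alpha, k, M = -0.9, 50, 50        # reflection gain, echo delay, kept terms

h = [0.0] * (k + 1)
h[0], h[k] = 1.0, alpha           # h(n) = delta(n) + alpha * delta(n - k)

w = [0.0] * ((M - 1) * k + 1)     # truncated expansion of equation 3.47
for m in range(M):
    w[m * k] = (-alpha) ** m

# Linear convolution h * w: delta(n) plus a tail term at n = M*k
eq = [0.0] * (len(h) + len(w) - 1)
for i, hi in enumerate(h):
    if hi:
        for j, wj in enumerate(w):
            eq[i + j] += hi * wj

residual = max(abs(v) for v in eq[1:])   # equals |alpha| ** M = 0.9 ** 50
```

All intermediate terms of the convolution cancel exactly; only the truncation residual |α|^M survives, which shrinks slowly when |α| is close to 1 — the very case illustrated in Fig. 3.4.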

On the other hand, the MINT assures, under mild conditions, the existence of a finite length


Figure 3.5: Illustration of the inversion of a two-channel system: h1(n) = δ(n) − 0.9δ(n −
600), h2(n) = δ(n) − 0.9δ(n − 1000). (a),(b) Inverse filters calculated by the MINT.
(c) Equalized IR. In the multichannel case, a strong reflection can be perfectly
equalized by FIR filters.

equalizer. This yields better statistical properties in the equalizer estimation, less computational
demand and less memory requirement. Thus multi-channel based structures can potentially
exploit these properties to provide better and more efficient dereverberation. An example of a
multichannel system inversion by the MINT is shown in Fig. 3.5.

Beyond the already mentioned improvement of the MINT algorithm by Hikichi et al. [66], other
modifications have been published. In [67] a sub-band implementation, which greatly reduces the
computational requirements of the original algorithm, has been proposed by Yamada et al. In
[68] the relation between the MINT and an adaptive algorithm called Multiple Error LMS (ME-LMS)
is discussed. The theory presented reconciles the two approaches and derives explicit
conditions that must be fulfilled for an exact inverse to exist.

3.3 Dereverberation quality measure

At present, there is no agreement on the best metric to adopt for evaluating the performance of
dereverberation algorithms. Dereverberation quality is often associated with the concept of
intelligibility. In phonetics, intelligibility is a measure of how comprehensible
speech is, or the degree to which speech can be understood. Late reverberation and noise are
the main causes of the degradation of speech intelligibility. This is of particular interest in
automatic speech recognition. However, other factors should also be taken into account.
Early reflections are considered to provide a positive contribution by “strengthening” the direct
signal and therefore improving intelligibility. From a “communication” perspective
this is desirable [69], [70], [71], [56]. However, early reflections impress their own strong character
on a signal by creating, often undesired, colorations. There are applications, for instance
hands-free communication or speech recording restoration, where it might be of interest to remove
all reverberation, including early reflections. Therefore, it is debatable whether dereverberation
should cancel all the reverberation components or focus on late reverberation only.

According to this scenario, two kinds of assessment can be found in the literature:

1. the success at increasing automatic speech recognition accuracy

2. the capability of removing reverberation according to a reference signal (i.e. the undis-
torted speech or the system IR)

A possible classification of dereverberation quality measures is given in [4]: intrusive (also
known as end-to-end or reference) and non-intrusive measures. Intrusive measures compare
the distorted signal with the undistorted signal, which is usually called the reference signal.
Non-intrusive measures do not require a reference signal, i.e., the speech quality is determined
given only the distorted speech signal.

3.3.1 Intrusive measures based on the comparison with the undistorted speech

3.3.1.1 Segmental Signal to Reverberation Ratio

The Segmental Signal to Reverberation Ratio (SRRseg) [72] is defined similarly to the
segmental SNR, i.e.

SRRseg = (10/K) Σ_{k=0}^{K−1} log10( Σ_{n=kN}^{kN+N−1} s(n)² / Σ_{n=kN}^{kN+N−1} (s(n) − ŝ(n))² )    (3.48)

where K is the number of frames, N is the frame length in samples (the window length is
usually 32 ms), s(n) is the undistorted signal, and ŝ(n) is the enhanced signal. The higher
the SRRseg, the more closely the dereverberated signal approaches the original signal.
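Equation 3.48 translates directly into code; the test signal below is synthetic, with ŝ = 0.9·s so that every frame has a known SRR of 20 dB:

```python
import math
import random

def srr_seg(s, s_hat, N):
    """Segmental SRR in dB (equation 3.48) with frame length N."""
    K = len(s) // N                     # number of complete frames
    total = 0.0
    for k in range(K):
        num = sum(s[n] ** 2 for n in range(k * N, k * N + N))
        den = sum((s[n] - s_hat[n]) ** 2 for n in range(k * N, k * N + N))
        total += 10 * math.log10(num / den)
    return total / K

random.seed(0)
s = [random.uniform(-1, 1) for _ in range(4096)]
s_hat = [0.9 * v for v in s]            # "enhanced" signal with known 20 dB SRR
```

In practice s would be the clean reference and ŝ the dereverberated output; the per-frame averaging keeps loud frames from dominating the score, as with the segmental SNR.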

3.3.1.2 Log Spectral Distortion

The log-spectral distortion (LSD), also referred to as log-spectral distance is a distance measure,
expressed in dB, between two spectra. The log-spectral distance between spectra P (ω) and
P̂ (ω) is defined as:


k  p1
2
−1
2 X p
LSD(l) =  L(P (ω)) − L(P̂ (ω))  (3.49)
k
k=0

where L = 10 log10 (P (ω))2 . In most cases short-time spectra are used, which are obtained
using the Short Time Fourier Transform (STFT) of the signals. The mean Log Spectral Dis-
tortion is obtained by averaging the previous equation for all frames containing speech. The
most common choices for p are 1, 2, and ∞, yielding mean absolute, root mean square, and
maximum deviation, respectively [4].

3.3.1.3 Speech Transmission Index

Voice can be considered an amplitude-modulated signal [73]. If the original characteristics
of the modulation envelope are retained at the receiver, good intelligibility results. The
amount of modulation, denoted by the Modulation Index (MI), is used to evaluate the Speech
Transmission Index (STI), a widely accepted measure for predicting the effect of room acoustics on
speech intelligibility. The STI is a machine measure of intelligibility whose value varies from
0 (completely unintelligible) to 1 (perfect intelligibility). The modulation index as a function of
modulation frequency can be calculated as follows [4]:

1. The speech signal is analysed using an octave filter bank. The filter bank can be bypassed
resulting in a broad-band analysis.

2. For each octave band the envelope is estimated by taking the magnitude of a standard
Hilbert-transform, which results in the Hilbert envelope. The Hilbert envelope is low-
pass filtered with a 50 Hz low-pass filter and then downsampled to a frequency of 200
Hz. The resulting signal will be referred to as the envelope signal.

3. For each octave band the Power Spectral Density (PSD) of the envelope signal is estimated
using a standard Welch procedure. The parameters used for the Welch procedure
are a window length of 8 seconds and a Hanning window with 40% overlap between successive
windows.

4. The intensity values of the PSD are summed over modulation frequencies for each octave
band and are normalized to 1 using the DC-component of the PSD.


3.3.1.4 Perceptually-Based Measures

Other intrusive measures based on the comparison with the undistorted signal take the human
auditory system into account; among them are the Bark Spectral Distortion (BSD), which transforms
the speech signal into a perceptually relevant domain [74], the Reverberation Decay Tail (RDT),
proposed in [75] by Wen and Naylor, and the PESQ, described in ITU-T Recommendation P.862
(February 2001) [76].

BSD was developed by Wang et al. [77]. It was the first objective measure to incorporate
psychoacoustic responses. Its performance was quite good for speech coding distortions as
compared to traditional objective measures (in time domain and in spectral domain). The BSD
measure is based on the assumption that speech quality is directly related to speech loudness,
which is a psychoacoustical term defined as the magnitude of auditory sensation. In order to
calculate loudness, the speech signal is processed using the results of psychoacoustic measurements.
BSD estimates the overall distortion by using the average Euclidean distance between
loudness vectors of the reference and of the distorted speech. BSD works well in cases where
the distortion in voiced regions represents the overall distortion, because it processes voiced
regions only; for this reason, voiced regions must be detected.

3.3.2 Intrusive Channel-Based Measures

If the acoustic system impulse response is known, dereverberation performance can be
easily evaluated by comparing the original system IR with the equalized one. In a multichannel
system, the impulse response between the source and the closest microphone can be used as the
reference IR.

3.3.2.1 Direct to Reverberation Ratio

The Direct to Reverberation Ratio (DRR) is defined as:

DRR = 10 log10 [ h²(δ) / Σ_{k=0, k≠δ}^{M−1} h²(k) ]  dB    (3.50)

where h(n) is the speaker-to-receiver impulse response, M its length in samples, and δ the
time index of the direct path in samples. The DRR depends on the distance between the source
and the microphone and on the reverberation time of the room. Since the DRR is the ratio of
the energy of the direct path to the rest of the energy in the speaker-to-receiver impulse
response, it does not involve any information about the duration of the impulse response.

It can be shown that the DRR can be estimated from the SRR [78], therefore without explicit
knowledge of the channel impulse response.

3.3.2.2 The Definition index - D50 and D80

The Definition index is the early to total sound energy ratio. It is defined [1] as:

D = [ ∫_0^t h²(τ) dτ / ∫_0^∞ h²(τ) dτ ] · 100%    (3.51)

Expressed as a percentage, it is associated with the degree to which rapidly occurring individual
sounds are distinguishable. The Definition index attempts to define an objective criterion of
what may be called the distinctness of sound. t is usually set to 50 or 80 ms (e.g., for
t = 50 ms the Definition index is denoted by D50).

3.3.2.3 Clarity index - C50 and C80

The Clarity index is the Early to Late reverberation energy Ratio (ELR). It is defined [1] as:

C = 10 log10 [ ∫_0^t h²(τ) dτ / ∫_t^∞ h²(τ) dτ ]    (3.52)

where t is usually set to 50 or 80 ms (e.g., for t = 50 ms the Clarity index is denoted by
C50).
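The three channel-based quantities above can be computed directly from a discrete IR. The toy response below (fs = 1000 Hz, a unit direct path plus a single 0.5 reflection at 100 ms) is an illustrative assumption:

```python
import math

fs = 1000                          # sample rate in Hz (illustrative)
h = [0.0] * 500
h[0], h[100] = 1.0, 0.5            # direct path + one reflection at 100 ms
delta = 0                          # sample index of the direct path

e = [v * v for v in h]             # sample-wise energy of the IR
drr = 10 * math.log10(e[delta] / (sum(e) - e[delta]))      # equation 3.50
n50 = int(0.050 * fs)              # 50 ms expressed in samples
d50 = 100 * sum(e[:n50]) / sum(e)                          # equation 3.51 (%)
c50 = 10 * math.log10(sum(e[:n50]) / sum(e[n50:]))         # equation 3.52
```

Here the reflection falls after the 50 ms boundary, so D50 = 80% while both DRR and C50 equal 10·log10(4) ≈ 6.02 dB; moving the reflection before 50 ms would raise D50 and C50 but leave the DRR unchanged.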

3.3.2.4 Normalized Projection Misalignment

The Normalized Projection Misalignment (NPM) [79] criterion is particularly useful for comparing
an identified IR with a target response. The NPM projects the estimate onto the true impulse
response, ignoring scaling factors. This is mandatory in many situations, because the estimated
impulse response is usually a scaled version of the true impulse response.


NPM(k) = ‖ς(k)‖ / ‖h‖    (3.53)

where

ς(k) = h − (hT ĥ / ĥT ĥ) · ĥ.    (3.54)

This implies that with a perfectly identified set of channels, the NPM will be zero.
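Equations 3.53–3.54 can be sketched as follows (the impulse-response values are arbitrary); scaling the estimate leaves the NPM unchanged, which is exactly the point of the projection:

```python
import numpy as np

def npm(h, h_hat):
    """Normalized projection misalignment, equations 3.53-3.54."""
    # Projection of h onto h_hat removes any scalar gain ambiguity
    proj = (h @ h_hat) / (h_hat @ h_hat) * h_hat
    return np.linalg.norm(h - proj) / np.linalg.norm(h)

h = np.array([1.0, 0.6, -0.2, 0.05])   # "true" IR, illustrative values
```

Evaluating npm(h, 2 * h) gives zero to machine precision, while any estimate that is not collinear with h yields a strictly positive misalignment.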

Chapter 4
Blind dereverberation techniques. Problem statement and existing technology

4.1 Introduction

In recent years, an increasing interest in blind dereverberation techniques has been observed.


The aim of blind dereverberation is to estimate the original source signal by removing the reverberation components from the received signal(s), without knowledge of the surrounding acoustic environment. A similar need arises for automatic speech recognition systems, where reverberation decreases the recognition rate [3], or for hands-free communication, where the microphones receive both the speech and the reverberation of the surrounding space. Signal degradation due to reverberation is also a bottleneck for the performance and applicability of algorithms to practical problems (i.e. the cocktail party problem). A review article on blind speech dereverberation techniques was published in 2005 by Naylor and Gaubitch in [72]. A review on the experimental validation of blind multimicrophone speech dereverberation was published in 2007 by Eneman and Moonen in [80]. Books related to the blind dereverberation problem have been recently published [81], [79], [82], [8].

Even though several approaches have been proposed, a possible discrimination into two classes can be accomplished by considering whether or not the inverse IR needs to be estimated. In fact, all dereverberation algorithms attempt to obtain dereverberation by attenuating the IR effects or by undoing them. In a simplistic view, one approach tries to alleviate the “symptoms” of the signal degradation, while the other attempts to address its “cause”. Due to the spatial diversity and temporal instability that characterize the IRs [6], the first class of algorithms can, at the current state of the art, offer more effective results in practical conditions [7] [8]. However, the algorithms belonging to the second class can potentially lead to ideal performance [9]. It must be noted that practical dereverberation is still largely an unsolved problem.

To be consistent with the useful definitions reported in [4], the first class of algorithms will


be addressed as “reverberation suppression” and the latter as “blind reverberation cancellation” methods.

“Reverberation suppression methods” are based on a diverse set of techniques such as: beamforming [5][7], spectral subtraction [8], temporal envelope processing [83], and LPC enhancement [84] [85].

“Blind reverberation cancellation methods” can be distinguished into two sub-classes: the techniques that are based on blind estimation of the IR followed by its inversion [86] and the ones that attempt to directly estimate the inverse system [87] [88] [89]. While the first methods have the benefit of providing access to the IR estimate, which is of interest for the extraction of many acoustic parameters (i.e. T60 , EDT, C80, etc. [1]), the calculation of the inverse system is not trivial even in the non-blind case [55] [49] [66] and it might lead to inaccuracies [6], [59], [47]. Therefore, for dereverberation purposes, it is probably more effective to estimate the inverse system directly.

Blind reverberation cancellation and suppression methods can be combined to offer hybrid
strategies [90][91].

Another useful distinction within dereverberation algorithms is between single- and multi-channel structures. Multi-channel approaches can take advantage of spatial diversity. While it might seem that the step leading from a single-channel structure to its multi-channel version is a simple generalization, the multi-channel framework can rely on strategies not applicable to the single-channel case [9], [92].

This chapter will focus on reverberation cancellation methods. For completeness, the more relevant reverberation suppression methods will be briefly described.

4.2 Reverberation suppression methods

4.2.1 Beamforming

Beamforming is a spatial filtering technique that can discriminate between signals coming from
different directions. Beamforming applications are several and diverse [5] (i.e. radar, sonar,
telecommunication, geophysical exploration, biomedicine, image processing, acoustics). In
acoustics, beamforming is employed to create sensors with an electronically configurable di-
rectivity pattern. This can be used to separate a source in a noisy environment or to minimize


Figure 4.1: Beamformer structure.

the interference caused by reverberation. Enhanced speech obtained from a far field acquisi-
tion is a typical application. In a similar way, this technique can be used to obtain a higher
intelligibility of the diffused signal by designing a loudspeaker array that can focus the acoustic
energy in a confined spatial region, minimizing the reflections due to the surrounding walls and
objects.

4.2.1.1 FIR filters and beamforming

The simplest form of a beamformer is a linear combination of the signals acquired by an equi-spaced linear array of N omni-directional microphones.

y(k) = \sum_{i=1}^{N} h_i x_i(k) \qquad (4.1)

where h_i is the weight and x_i the signal at the i-th sensor, and y(k) is the beamformer output. For simplicity, this beamformer will be called a “linear beamformer”. This equation has the same structure as an FIR filter.

The weight h_i can be a simple real number or a more complex filter. In this latter form, beamforming, as will be shown in section 4.7.2, has a close similarity to the structures employed in multichannel blind reverberation cancellation methods.

When the linear beamformer operates at a single frequency ω (narrow band hypothesis), the
analogy with an FIR filter is intuitive.


• In an FIR filter, the time interval between two consecutive samples in the filtered signal
is determined by the sampling period T .

• In a linear beamformer, the time interval between two sensors is determined once the Direction Of Arrival (DOA) θ of the impinging wave, its velocity c, and the distance between the sensors d are specified.

The delay present between two neighboring sensors is¹

\tau(\theta) = \frac{d}{c} \sin(\theta). \qquad (4.2)

While an FIR filter is based on the combination of uniformly spaced temporal samples, the linear beamformer is based on the combination of uniformly spaced spatial samples. As in an FIR filter, the beamformer weights h completely determine its response, and FIR filter design techniques can be reused, replacing the concept of frequency with that of directivity.

4.2.1.2 FIR filter frequency response and linear beamformer directivity pattern

The frequency response of an FIR filter of length N , with impulse response h at a sampling
period T is given by

H(\omega) = \sum_{i=1}^{N} h_i e^{-j \omega T (i-1)} \qquad (4.3)

that can be written vectorially as

H(\omega) = h^T d(\omega) \qquad (4.4)

where
h^T = [h(1), h(2), ..., h(N)] \qquad (4.5)

¹A far-field hypothesis is assumed: the sources are far enough away to consider the wave as plane. This relation is not valid in the near field.


and
d(\omega) = [1, e^{j \omega T}, ..., e^{j \omega (N-1) T}]. \qquad (4.6)

The squared absolute value of the frequency response is the filter power response.

To obtain the spatial response of a linear beamformer, it is sufficient to replace the sampling period T with the inter-sensor delay \tau(\theta) = \frac{d}{c} \sin(\theta):

H(\omega) = h^T d(\theta, \omega) \qquad (4.7)

where
d(\theta, \omega) = [1, e^{j \omega \tau(\theta)}, ..., e^{j \omega (N-1) \tau(\theta)}]. \qquad (4.8)

The square of the absolute value of H(ω) is the beamformer “directivity pattern” or “beampattern”.
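As a sanity check on eqs. 4.7-4.8, the beampattern can be evaluated numerically. The sketch below is a minimal illustration (function and variable names are hypothetical), using uniform weights as in eq. 4.10 and half-wavelength spacing for a 16-sensor array:

```python
import cmath
import math

def beampattern(h, d, c, f, theta):
    """Squared magnitude of h^T d(theta, omega) for a uniform linear array."""
    omega = 2.0 * math.pi * f
    tau = (d / c) * math.sin(theta)   # inter-sensor delay, eq. 4.2
    steering = [cmath.exp(1j * omega * i * tau) for i in range(len(h))]
    return abs(sum(w * s for w, s in zip(h, steering))) ** 2

N = 16
c = 343.0               # speed of sound in air, m/s
f = 2000.0              # operating frequency, Hz
d = c / (2.0 * f)       # half-wavelength spacing: the aliasing limit of eq. 4.9
h = [1.0 / N] * N       # uniform weights, as in eq. 4.10

broadside = beampattern(h, d, c, f, 0.0)                 # main lobe at 0 degrees
off_axis = beampattern(h, d, c, f, math.radians(30.0))   # falls in a null
```

At 30° the 16 phasors complete four full turns around the unit circle and cancel exactly; with a spacing larger than c/(2f), grating lobes would reappear in the beampattern.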

In temporal sampling, the highest frequency that can be represented without ambiguity is f_{Nyquist} = f_s/2, where f_s is the sampling frequency. If a signal is sampled with an insufficiently high f_s, the sampled signal, due to the ambiguity in the representation (aliasing), will contain frequency components that are not really present in the original signal. In a similar way, when a plane wave is spatially sampled with sensors placed at a uniform distance d, spatial aliasing can occur. It is possible to show that the highest frequency that can be reconstructed without ambiguity is

f_{max} = \frac{c}{2d}. \qquad (4.9)

Spatial aliasing can cause undesired images in the beam-pattern.

4.2.1.3 A simple beamformer

A simple FIR low pass filter is given by the average of N neighboring samples

h(n) = [1/N, 1/N, ..., 1/N ] . (4.10)

51
Blind dereverberation techniques. Problem statement and existing technology

Figure 4.2: Amplitude response of a simple low pass filter.

Figure 4.3: Beampatterns of linear beamformers: (a) 16 sensors, (b) 32 sensors, (c) Dolph-Chebyshev, 16 sensors.

The filter cut-off frequency is linked to the filter length N. The frequency response of this system can be calculated by (4.3) and its power spectrum is reported in Fig. 4.2.

For a linear beamformer, the same h(n) can be interpreted as a peak of sensitivity at 0◦ , as
shown in Fig.4.3(a). The spatial resolution of the beamformer can be increased by augmenting
the number of sensors, as shown in Fig. 4.3(b).

In FIR filter design it is usual to weight the coefficients with a windowing function to obtain a smoother frequency response, at the expense of frequency resolution. An optimal choice in this sense, which offers equiripple behavior in the stopband, is the Dolph-Chebyshev window [93]. An example of a beamformer obtained by using the weights reported in (4.10) and smoothed by a Dolph-Chebyshev window is reported in Fig. 4.3(c).


Figure 4.4: Wideband behavior of a 16 sensor Dolph-Chebyshev beamformer.

4.2.1.4 Steering and 2-D beamformers

The previous example considered a beamformer with a maximum of sensitivity at 0°. If it is desired to steer the maximum to a different direction, it is sufficient to properly delay the signals acquired by the sensors. It can be shown [5] that the coefficients h of a narrowband beamformer operating at frequency ω0 steered to θ0 are given by

h = d(θ0 , ω0 ). (4.11)

A linear beamformer can discriminate the direction of arrival from a 2-D space. A 3-D dis-
crimination can be obtained by a 2-D matrix of equi-spaced sensors. The smoothing window
coefficients are given by the factorization of the 1-D window. Therefore, the 2-D linear beam-
former can be viewed as the composition of 1-D linear beamformers.

4.2.1.5 Beamformer response to a wide band excitation

If a narrowband beamformer designed for a specific frequency is excited by a wideband plane wave, the array will appear large relative to the wavelength at high frequencies, so the spatial resolution will be high, and small at low frequencies, where the spatial resolution decreases. At high frequencies spatial aliasing might occur, and unwanted maxima of sensitivity might appear in the beampattern. The behavior of a beamformer at different frequencies is shown in Fig. 4.4.

Since the beampattern exhibits a non-constant lobe width at different frequencies, the interfering signal will not be completely suppressed, but only low-pass filtered. Design techniques for beamformers with a constant beampattern have been proposed, among them the approach of Ward et al. [94].


4.2.1.6 Beamforming and dereverberation

A beamformer can be used to reduce reverberation. If a beamformer is oriented toward a source positioned within a reverberant room, the reverberation that does not fall within the beampattern is attenuated. This implies knowledge of the direction of arrival of the source.
A popular method in this sense is the “delay and sum” beamformer (DSB), where the observed
microphone signals are delayed to compensate for different times of arrival and then weighted
and summed [95]. This causes the constructive summation of the components due to the direct
path and the attenuation of the incoherent components due to reverberation [95]. It can be
shown that the DSB forms a beam in the direction of the desired source.

A beamformer is however not capable of reducing the components of reverberation that fall
inside the beampattern. In other words, since reverberation comes from all possible directions
in a room, it will always enter the path of the beam.

Gaubitch [96] has shown that the expected improvement in direct-to-reverberant ratio that can
be achieved with a DSB is

 
E\{DRR\} = 10 \log_{10} \left( \frac{D_0^2 \sum_{m=1}^{M} \sum_{l=1}^{M} \frac{1}{D_m D_l}}{\sum_{m=1}^{M} \sum_{l=1}^{M} \frac{\sin(k \|l_m - l_l\|)}{k \|l_m - l_l\|} \cos(k (D_m - D_l))} \right) \qquad (4.12)

where Dm is the distance between the source and the m-th microphone, lm is the m-th mi-
crophone three dimensional coordinate vector and D0 = minm (Dm ) is the distance from the
source to the closest microphone. k = 2πf /c is the wave number with f denoting frequency
and c being the speed of sound in air.

The expected improvement that can be achieved with the DSB depends only on the distance
between the source and the array and the separation of the microphones and is consequently
independent of the reverberation time. The performance increases by augmenting the number
of microphones and the distance from the source.
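Equation 4.12 is straightforward to evaluate numerically. The sketch below (the function name is hypothetical, the sin(x)/x diffuse-field coherence term is taken from eq. 4.12, and a single frequency is assumed) estimates the expected DRR of a small broadside array:

```python
import math

def expected_drr(mic_pos, src_pos, f, c=343.0):
    """Expected DRR of a delay-and-sum beamformer (eq. 4.12) at frequency f."""
    k = 2.0 * math.pi * f / c
    D = [math.dist(m, src_pos) for m in mic_pos]
    D0 = min(D)
    num = D0 ** 2 * sum(1.0 / (Dm * Dl) for Dm in D for Dl in D)
    den = 0.0
    for m, lm in enumerate(mic_pos):
        for l, ll in enumerate(mic_pos):
            x = k * math.dist(lm, ll)
            coherence = 1.0 if x == 0.0 else math.sin(x) / x  # diffuse field
            den += coherence * math.cos(k * (D[m] - D[l]))
    return 10.0 * math.log10(num / den)

# Four microphones 0.2 m apart, source 2 m away, evaluated at 1 kHz.
mics = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.4, 0.0, 0.0), (0.6, 0.0, 0.0)]
gain_4 = expected_drr(mics, (0.3, 2.0, 0.0), 1000.0)      # several dB of gain
gain_1 = expected_drr(mics[:1], (0.3, 2.0, 0.0), 1000.0)  # single mic: 0 dB
```

Consistent with the text, the result depends only on the geometry, never on the reverberation time: adding microphones or moving the source further from the array (relative to the inter-microphone spacing) raises the expected gain.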

In summary, beamforming and in particular the delay-and-sum beamformer are simple ap-
proaches that can provide moderate improvement in dereverberation.


4.2.2 Spectral subtraction

4.2.2.1 Noise reduction based on spectral subtraction

Spectral subtraction is not a recent approach to noise compensation and was first proposed in
1979 [97]. There is however a vast amount of more recent work in the literature relating to
different implementations and configurations of spectral subtraction.

Spectral subtraction will be described below as summarized in [98].

Spectral subtraction is usually applied to additive noise reduction. Its main advantage is the
simplicity of implementation and the low computational requirements. Speech degraded by
additive noise can be represented by

y(n) = x(n) + v(n) (4.13)

where x(n) is a speech signal corrupted by the additive noise v(n). Assuming that speech and
noise are uncorrelated, the enhanced speech, x̂(n), is obtained from




|\hat{X}(k)| = \begin{cases} |Y(k)| - |\hat{V}(k)|, & \text{if } |Y(k)| > |\hat{V}(k)| \\ 0, & \text{otherwise} \end{cases} \qquad (4.14)

where X(k), Y(k) and V(k) denote the short-time magnitude spectra of x(n), y(n) and v(n) in the kth frame. These short-time magnitude spectra are usually obtained from the discrete Fourier transform (DFT) of sliding frames, typically of the order of 20-40 ms. The noise spectrum is estimated during non-speech intervals. Noise reduction is thus achieved by suppressing the effect of noise in the magnitude spectra only. The subtraction process can operate on true magnitudes or on powers. Phase terms are ignored.

In the particular case of magnitude spectral subtraction, the enhanced speech is reconstructed
by using the phase information of the corrupted signal y(n)

\hat{x}(n) = \mathrm{IDFT}\left[ |\hat{X}(k)| e^{j \angle Y(k)} \right] \qquad (4.15)

where IDFT represents the inverse discrete Fourier transform.
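A single-frame sketch of eqs. 4.14-4.15 (hypothetical helper names, a naive DFT standing in for an FFT) illustrates the half-wave rectification and the reuse of the noisy phase:

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT; adequate for the short frames used here."""
    N = len(x)
    sign = 1.0 if inverse else -1.0
    X = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N)
             for n in range(N)) for k in range(N)]
    return [v / N for v in X] if inverse else X

def spectral_subtraction(y, noise_mag):
    """Magnitude spectral subtraction on one frame (eqs. 4.14-4.15)."""
    Y = dft(y)
    X_hat = [max(abs(Yk) - Vk, 0.0) * cmath.exp(1j * cmath.phase(Yk))
             for Yk, Vk in zip(Y, noise_mag)]
    return [v.real for v in dft(X_hat, inverse=True)]

# Clean tone at bin 2, "noise" tone at bin 5. The bins do not overlap,
# so an exact noise magnitude estimate recovers the clean frame.
N = 16
s = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
v = [0.5 * math.cos(2 * math.pi * 5 * n / N) for n in range(N)]
noise_mag = [abs(Vk) for Vk in dft(v)]
x_hat = spectral_subtraction([a + b for a, b in zip(s, v)], noise_mag)
```

In practice the speech and noise bins overlap and |V̂(k)| is only an estimate; the resulting rectification errors are precisely what produces the musical noise discussed next.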


The main inconvenience with this approach is the generation in the processed signal of an
annoying interference, termed musical noise. This noise is composed of tones at random fre-
quencies [99] and is mainly due to the rectification effect caused by equation 4.14. The spectral
subtraction process can also be described as a filtering operation in the frequency-domain as
[99]

X(k) = G(k)Y (k), with 0 ≤ G(k) ≤ 1. (4.16)

Equations that allow one to design the filter G(k) to perform magnitude subtraction, power spectral subtraction and Wiener filtering have been proposed in [98], [100], [99]. Here, the formulation proposed in [98] is reported:

G(k) = \begin{cases} \left[ 1 - \alpha \left( \frac{V(k)}{Y(k)} \right)^{\gamma_1} \right]^{\gamma_2}, & \text{if } \left( \frac{V(k)}{Y(k)} \right)^{\gamma_1} < \frac{1}{\alpha + \beta} \\ \beta, & \text{otherwise.} \end{cases} \qquad (4.17)

The following control parameters adjust G(k):

• α (oversubtraction factor). It controls the amount of denoising at the expense of raising distortion.

• β (spectral flooring). Instead of setting negative values of |X̂(k)| to zero, a threshold β can be set in such a way that the musical noise caused by the rectification effect is reduced.

• Exponents γ1 and γ2. They determine the path between G(k) = 1 and G(k) = 0. Three classical methods are defined: magnitude subtraction, with γ1 = γ2 = 1; power spectral subtraction, defined by γ1 = 2 and γ2 = 0.5; and Wiener filtering, with γ1 = 2 and γ2 = 1.
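The three classical settings fall out of a direct implementation of eq. 4.17 (a sketch; the function name is hypothetical, and the ratio V(k)/Y(k) is taken as already estimated):

```python
def suppression_gain(v_over_y, alpha=1.0, beta=0.01, gamma1=1.0, gamma2=1.0):
    """Parametric gain G(k) of eq. 4.17 for a single frequency bin."""
    r = v_over_y ** gamma1
    if r < 1.0 / (alpha + beta):
        return (1.0 - alpha * r) ** gamma2
    return beta  # spectral floor for noise-dominated bins

ratio = 0.5  # estimated V(k)/Y(k) in this bin
g_mag = suppression_gain(ratio, gamma1=1.0, gamma2=1.0)  # magnitude subtraction
g_pow = suppression_gain(ratio, gamma1=2.0, gamma2=0.5)  # power subtraction
g_wie = suppression_gain(ratio, gamma1=2.0, gamma2=1.0)  # Wiener filtering
g_floor = suppression_gain(1.0)                          # noise-dominated bin
```

For the same noise-to-observation ratio, power subtraction is the mildest of the three settings and Wiener filtering the most aggressive, which is one way of trading residual noise against distortion.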

Speech distortion and residual noise cannot be minimized simultaneously. Parameter adjustment is dependent on the application. As a general rule, human listeners can tolerate some distortion, but they are sensitive to the fatigue caused by noise. Automatic speech recognizers are usually more susceptible to speech distortion [98].


In [101], the limitations of spectral subtraction have been analyzed. In the same paper, the exact relation between the clean speech spectrum X(k), the noise spectrum V(k), and the distorted signal spectrum Y(k) is given:

X(k) = \left[ |Y(k)|^2 - |V(k)|^2 - X(k) \cdot V^*(k) - X^*(k) \cdot V(k) \right]^{1/2} e^{j \theta_X(k)}. \qquad (4.18)

This expression suggests that three sources of error exist in a practical implementation of spectral subtraction:

• phase errors, arising from the differences between the phase of the corrupted signal θ_Y(k) and the phase of the true signal θ_X(k)

• cross-term errors, from neglecting X(k) · V ∗ (k) and X ∗ (k) · V (k)

• magnitude errors, which refer to the differences between the true noise spectrum |V (k)|
and its estimate |V̂ (k)|

Except at the worst SNR levels, errors in the magnitude make the greatest contribution. However, as noise levels of the order of 0 dB are approached, phase and cross-term errors are no longer negligible and lead to degradations that are comparable to those caused by magnitude errors.

4.2.2.2 Dereverberation based on spectral subtraction

The use of spectral subtraction for speech dereverberation of noise-free speech was proposed
by Lebart et al. in [102]. Spectral subtraction dereverberation methods are based on the obser-
vation that reverberation creates correlation between the signal measured at time t0 and at time
t0 + ∆t. Therefore, reverberation can be reduced by considering as noise the contribution of
the signal at time t0 to the signal at time t0 + ∆t. The problem of reverberation suppression
differs from classical de-noising in that the “reverberation noise” is non-stationary.

The non-stationary reverberation-noise power spectrum is usually based on a statistical model of late reverberation, which assumes that the room IR can be modelled as a zero-mean random sequence modulated by a decaying exponential [103]

h(n) = v(n) e^{-\tau n} u(n) \qquad (4.19)

57
Blind dereverberation techniques. Problem statement and existing technology

where v(n) represents a white zero-mean Gaussian noise, u(n), the unit step function, and τ is
a damping constant related to the reverberation time T60

τ = 3 ln(10)/T60 . (4.20)

The non-stationary reverberation-noise power spectrum, due to late reverberation, can be mod-
elled as [98]

V(n) = e^{-2 \tau T_d} X(n - T_d) \qquad (4.21)

where T_d is the number of samples identifying the threshold that separates the direct component from the late reverberant one (usually between 40 and 80 ms); it is therefore the number of samples over which reverberation is not suppressed. V(n) is an exponentially attenuated power spectrum of the acquired signal x(n).
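Equations 4.19-4.21 reduce to a single attenuation factor applied to a delayed power spectrum. A minimal sketch (the function name is hypothetical; past_psd stands for the power spectrum of the signal observed T_d seconds earlier):

```python
import math

def late_reverb_psd(past_psd, T60, Td):
    """Late-reverberation 'noise' power spectrum of eq. 4.21."""
    tau = 3.0 * math.log(10.0) / T60   # damping constant, eq. 4.20
    g = math.exp(-2.0 * tau * Td)      # equals 10 ** (-6 * Td / T60)
    return [g * p for p in past_psd]

# With T60 = 0.6 s and Td = 50 ms the late tail sits 5 dB below the delayed PSD.
v = late_reverb_psd([1.0, 0.5, 0.25], T60=0.6, Td=0.05)
```

The attenuation in dB is simply 60·T_d/T60, which makes explicit how an error in the blind T60 estimate propagates directly into the amount of subtraction applied.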

To achieve effective dereverberation, an accurate and consistent estimate of the reverberation time is necessary. Since τ is related to the reverberation time, T60 should be estimated blindly from the captured signal. Different approaches have been proposed to tackle this problem [104], [105], [106]. The main difficulty is the requirement of silence regions between spoken words. Particularly in short utterances, this condition may not be fulfilled, with a resulting error in the estimate of T60 .

In contrast to deconvolution methods, reverberation suppression based on spectral subtraction is not sensitive to the fluctuation of impulse responses; it is therefore more robust in practical applications. On the other hand, the nonlinear processing distortion (i.e. musical noise) and the necessity of an accurate blind estimate of the reverberation time can degrade the quality of the processed reverberant speech.

Spectral subtraction is also used as a post-processing step in blind reverberation cancellation algorithms [91], [90]. If this step is applied after the reverberation cancellation, as is done in the referred papers, the exponentially decaying model is no longer valid, and a modified representation of the reverberation noise is required.


Figure 4.5: General structure for LP residual enhancement

4.2.3 LP residual enhancement

The LP residual of reverberant voiced speech segments contains the glottal pulses followed by other peaks due to multi-path reflections. Dereverberation, as reported in Fig. 4.5, can thus be achieved by attenuating these undesired peaks and synthesizing the enhanced speech waveform using the modified LP residual and the time-varying all-pole filter with coefficients calculated from the reverberant speech. This approach relies on the assumption that the LP coefficients are unaffected by reverberation. The validity of this hypothesis will be discussed in more detail in section 4.4. The main interest here is that these methods show similarities with many blind reverberation cancellation algorithms, where the LP residual calculation is often used as a preprocessing step. However, it might be misleading to classify these last methods as “LP residual enhancement algorithms”, since it would be advisable to consider in this class only the algorithms that do not require system identification.

The first approach based on LP residual enhancement was most likely proposed by J.B. Allen
and F. Haven, from Bell Telephone Laboratories Inc., in a patent that was filed in 1972 [84].
A detector to distinguish between voiced and unvoiced speech frames, a pitch estimator, and a
gain estimator were used. All signals were then employed to synthesize a clean LP residual.

LP residual enhancement was also proposed by Yegnanarayana and Murthy [85]. Their method
involves identifying and manipulating the linear prediction residual signal in three different
regions of the speech signal: high SRR, low SRR, and only reverberation component regions.
A weight function is derived to modify the linear prediction residual signal. The weighted
residual signal samples are used to excite a time-varying all-pole filter to obtain perceptually


enhanced speech. A following approach [107] is based on the use of time-aligned Hilbert
envelopes to represent the strength of the peaks in the LP residuals. The Hilbert envelopes are
then summed and used as a weight vector which is applied to the LP residual.

Another approach has been proposed by Griebel and Brandstein [108]. Their technique is based
upon the observation that residual impulses due to the original speech tend to be predictable
while those due to reverberation effects are relatively uncorrelated in both amplitude and time.

All these methods use the scheme reported in Fig.4.5: LPC analysis followed by residual en-
hancement and speech re-synthesis. The advantage is that no system identification is required.
However, the main limitation is that they do not consider the original structure of the excitation
signal and therefore the enhanced residual can differ from the original clean residual and can
result in unnatural sounding speech.

An advanced practical implementation based on the enhancement of the LP residual from the output of a delay-and-sum beamformer was proposed in [109] by Gaubitch et al. The fact that the waveform of the LP residual varies slowly between adjacent larynx cycles was used, so that each such cycle can be replaced by an average of itself and its nearest neighboring cycles. The averaging results in a suppression of spurious peaks in the LP residual caused by room reverberation. This algorithm was practically implemented in a computationally efficient system that can achieve a 5 dB SNR improvement in a real environment without the knowledge or an estimate of the room transfer functions [7].

4.3 Blind reverberation cancellation methods

The reverberation cancellation techniques described in 3.2.1 for the SISO system and in 3.2.3 for the SIMO system calculate the inverse filter(s) from the known system impulse response(s). However, when no information about the system impulse responses is available, the estimation of the inverse filters can be obtained only by observing the system output(s). These techniques are called blind.

Single-channel blind reverberation cancellation is connected to the blind estimation of an equalizer w(n) such that its convolution with the received signal x(n)

ŝ(n) = y(n) = w(n) ∗ x(n) (4.22)


gives an estimate y(n) = ŝ(n) of the source signal s(n).

The relation that exists between the system impulse response and its inverse,

δ(n − Nd ) = h(n) ∗ w(n) (4.23)

or in the frequency domain

W(\omega) = \frac{1}{H(\omega)} \qquad (4.24)

reveals that, in theory, the blind equalization and the blind system identification problems are essentially the same. Therefore any blind identification scheme can be used to perform blind deconvolution by identifying and then inverting the system impulse response. Due to the estimation error of h(n), all the problems regarding system inversion analyzed in 3.2.1 are magnified in the blind case. Therefore it is probably even more appropriate to obtain a direct estimate of the equalizer instead of calculating it by inverting the identified system impulse response.

In a similar way, the multichannel blind reverberation cancellation is connected to the blind
estimation of a set of filters wi (n) that satisfy the equation:

\delta(n - N_d) = \sum_{i=1}^{N} h_i(n) * w_i(n) \qquad (4.25)

and

\hat{s}(n) = y(n) = \sum_{i=1}^{N} x_i(n) * w_i(n) \qquad (4.26)

gives an estimate y(n) = ŝ(n) of the source signal s(n).
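A toy two-channel example (hand-solved filters; the helper name is hypothetical) shows how eq. 4.25 can be satisfied with very short filters when the channels share no common zeros, so that eq. 4.26 recovers the source exactly:

```python
def convolve(a, b):
    """Direct-form linear convolution."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Two length-2 channels with no common zeros.
h1, h2 = [1.0, 0.5], [1.0, -0.5]

# Eq. 4.25 with scalar filters w1, w2 and Nd = 0:
#   n = 0:  w1 + w2         = 1
#   n = 1:  0.5*w1 - 0.5*w2 = 0   ->  w1 = w2 = 0.5
w1, w2 = 0.5, 0.5

s = [1.0, -2.0, 3.0, 0.5]                      # arbitrary source signal
x1, x2 = convolve(s, h1), convolve(s, h2)      # the two received signals
y = [w1 * a + w2 * b for a, b in zip(x1, x2)]  # eq. 4.26: recovers s
```

Note the contrast with the single-channel case: the inverse 1/H(ω) of either channel alone is an infinite-length (and possibly unstable) filter, while the channel pair admits exact FIR inverse filters.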

Blind dereverberation can therefore be viewed as the problem of blindly estimating one or more filters to equalize the received signal. This is much more problematic than in the non-blind case, in particular when colored source signals are considered.

In fact, suppose the channel, h(n), is Linear Time-Invariant (LTI), then the observed signal
x(n) can be expressed as
x(n) = h(n) ∗ s(n) (4.27)


where * denotes convolution. If either h(n) or s(n) are the convolution of two signals,

h(n) = h1 (n) ∗ h2 (n) (4.28)

s(n) = s1 (n) ∗ s2 (n) (4.29)

then
x(n) = h1 (n) ∗ h2 (n) ∗ s1 (n) ∗ s2 (n) (4.30)

thus, it is impossible to determine which component belongs to the source signal and which to the distortion operator. Unambiguous blind deconvolution can be obtained only if s(n) and h(n) are irreducible. An irreducible signal is one which cannot be exactly expressed as the convolution of two or more component signals (except with a delta function) [110]. The problem of unambiguous deconvolution is present whenever we want to decouple the coloration of the source (e.g. the resonances due to the vocal tract in speech) from the effects of the reverberation. Unless additional knowledge is available, blind deconvolution of colored signals (i.e. speech, music) cannot be achieved.

On the other hand, if the source signal s(n) has a “white” spectral structure, several techniques can be employed to blindly estimate the filter(s) that equalize the received signal x(n) [111], [112], [113], [92].

Hence, a reverberation cancellation algorithm can be realized by a blind deconvolution algorithm if the transmitted signal is white or if the actual power spectrum of the received signal is known.

4.4 Signal pre-whitening

The problem of unambiguous deconvolution in speech dereverberation is essentially how to achieve a reliable separation of the room and the vocal tract IRs.

Although the speech signal is not statistically white, it can be modelled as a convolution of the
white signal v(n) and the vocal tract filter a(n)

s(n) = v(n) ∗ a(n) (4.31)


where a(n) has the characteristic of the speech spectrum A(ω). Ideally the whitening filter
g(n) must be able to remove only the inherent correlation of the speech signal without affecting
reverberation, so that

A(ω) · G(ω) = 1 (4.32)

where G(ω) is the frequency response of g(n).

The most popular techniques for pre-whitening the reverberated speech are based on Linear Predictive (LP) analysis [87] or on block whitening by Fast Fourier Transform (FFT) magnitude equalization [91]. The foundation of these methods is the fact that reverberation and speech have different temporal structures. While speech can be considered quasi-stationary within blocks of 20-30 ms duration, reverberation, unless the source is moving, can be considered approximately stationary. Other possible approaches are based on the non-stationarity of signals [114] or on knowledge of the source signal statistics [115] [116].

LP pre-whitening is based on the observation that signals with a harmonic structure can be
modelled using a cascade of second order all-pole sections driven by a white noise signal.

A second order all-pole section can be written as

H(z) = \frac{1}{1 - 2\mathrm{Re}(p) \cdot z^{-1} + |p|^2 z^{-2}} = \frac{1}{1 - a_1 \cdot z^{-1} - a_2 \cdot z^{-2}} \qquad (4.33)
−1

with p a pole expressed in polar form as

p = r · ejω (4.34)

where ω is the normalized angular frequency and r the pole radius.

Given a segment of speech s(n), the optimal coefficients a_i can be calculated [50] by minimizing the squared error

\varepsilon_p = \sum_{n} |e(n)|^2 \qquad (4.35)

where
e(n) = s(n) − ŝ(n) (4.36)


Figure 4.6: Linear prediction residual of a voiced segment.

and

\hat{s}(n) = -\sum_{k=1}^{p} a_p(k) \cdot s(n-k). \qquad (4.37)

This method is called Linear Prediction (LP), since ŝ(n) is an estimate, or prediction, of the signal s(n) in terms of a linear combination of the previous p values. p, defined as the LP order, is related to the number of poles, and therefore to the number of resonant peaks, used to model the signal spectrum.

The residual error e(n) represents the unpredictable components of s(n). Thus e(n) is small when the signal has a harmonic structure and large for transients and noise-like behaviors.

The coefficients ap (k) are called the linear prediction coefficients. The filter

A_p(z) = 1 + \sum_{k=1}^{p} a_p(k) z^{-k} \qquad (4.38)

is called the Prediction Error Filter (PEF).

A speech signal, or more generally any audio signal, can be analyzed by splitting it into quasi-
stationary blocks (20-30 ms for speech) and by calculating the filter coefficients for every block.
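The block-wise LP analysis just described can be sketched as follows (hypothetical function names; the autocorrelation method with a Levinson-Durbin recursion, demonstrated on a synthetic first-order autoregressive signal rather than speech):

```python
def autocorrelation(x, p):
    """Biased autocorrelation r(0..p) of one quasi-stationary block."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(p + 1)]

def levinson(r, p):
    """Levinson-Durbin recursion: coefficients a_p(1..p) of the PEF (eq. 4.38)."""
    a = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a = a_new
        err *= 1.0 - k * k
    return a[1:], err

# Synthetic test signal: s(n) = 0.9 s(n-1) + v(n), with v(n) white-ish LCG noise.
seed, v = 1, []
for _ in range(2000):
    seed = (1103515245 * seed + 12345) % (1 << 31)
    v.append(seed / float(1 << 31) - 0.5)
s = [0.0]
for n in range(1, len(v)):
    s.append(0.9 * s[-1] + v[n])

a, _ = levinson(autocorrelation(s, 1), 1)   # expect a(1) close to -0.9
# Prediction error e(n) = s(n) + a(1) s(n-1): a whitened version of s.
residual = [s[n] + a[0] * s[n - 1] for n in range(1, len(s))]
```

Filtering the block through the PEF removes most of the predictable (harmonic) structure, which is exactly the pre-whitening step used ahead of the blind equalizer.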

In voiced segments, as shown in Fig. 4.6, a quasi-periodic structure can be identified. Large peaks occurring once per pitch period are clearly distinguishable. These peaks are the glottal pulses and are due to the vibration of the vocal folds. These vibrations are the excitation signal for voiced speech. The LP residual calculation removes the harmonic structure of the input signal. Therefore the speech residual has a “whiter” spectral structure in comparison to the original speech signal. Since both the LP filter and the dereverberation FIR filter are convolutional operators, the ambiguity problem is not completely solved by the LP approach. A possible improvement


can be obtained in the multi-channel case, as proposed in [117] by Gaubitch et al. and in [118]
by Triki and Slock, by exploiting both the temporal and the spatial diversity. Gaubitch et al.
show that a more accurate decoupling can be achieved, in a multichannel framework, by av-
eraging the LP coefficients calculated from instances of the same speech recorded in different
positions within a room. Triki and Slock suggest exploiting the spatiotemporal diversity to
estimate the source correlation structure, which hence is used to determine a source whiten-
ing filter. Their observation relies on the fact that, according to statistical acoustics, above the
Schroeder frequency

f_g = 2000 √(T60 / V)    (4.39)
the spatially averaged reverberation spectrum is flat,

⟨|H(ω)|²⟩ = (1 − β) / (πAβ)    (4.40)
where ⟨·⟩ is the spatial expectation (estimated by averaging over all possible source and microphone positions), β is the average wall absorption coefficient and A is the total wall surface area. Therefore the speech spectrum can be estimated by averaging the frequency responses of the speech signals received by multiple microphones. It can be criticized that this might be problematic for rooms where the modal behaviour prevails (i.e. small reverberant rooms): in this case the flatness hypothesis might hold only at relatively high frequencies, and room resonances overlap with speech resonances.
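As a quick numerical check of eq. 4.39 (the room volume and reverberation time below are illustrative assumptions, not values from the thesis):

```python
# Schroeder frequency of eq. 4.39 for a hypothetical room.
def schroeder_frequency(t60, volume):
    """f_g = 2000 * sqrt(T60 / V), T60 in seconds, V in cubic metres."""
    return 2000.0 * (t60 / volume) ** 0.5

# A small reverberant room: V = 50 m^3, T60 = 0.5 s.
fg = schroeder_frequency(0.5, 50.0)   # 200 Hz
```

For such a small room the flat-spectrum assumption of eq. 4.40 only applies above a frequency that overlaps the lower part of the speech band, which is exactly the criticism raised above.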

An iterative method, applicable also in the single-channel case, was proposed in [88] by Yoshioka et al. The authors claim that by jointly estimating the channel's inverse filter and the PEF, the channel's inverse is identifiable due to the time-varying nature of the PEF.

In [91] Furuya et al. suggest whitening the reverberant speech block by block through FFT magnitude equalization. A whitening filter g(k) with a short time span (16-512 taps at a sampling frequency of 12 kHz) was used to remove the correlation due to speech while leaving the correlation of reverberation, which has a longer time span. A possible limitation is that if reflective surfaces are placed in the proximity of the source, the correlation due to these early reflections will also be removed. This happens more evidently if the FFT block length is increased. In their method, the source speech spectrum A(ω) is estimated by using


A(ω) = ⟨|S(ω, m)|⟩ ≈ ⟨|X_j(ω, m)|⟩    (4.41)

where ⟨·⟩ is the spatial expectation, S(ω, m) is the STFT of the original speech signal and X_j(ω, m) is the STFT of the received speech signal x_j(n).

The whitening filter g(k) is calculated by using

G(ω) = 1 / A(ω)    (4.42)

and performing the inverse Fourier transform of G(ω). The decorrelated signal is given by
convolution of the received signal at microphone j, xj (k), and the whitening filter g(k) and is
used to compute the inverse filters instead of using the received signal xj (n).

4.4.1 A criticism of pre-whitening

All the methods that have been discussed offer only an improvement to the unambiguous deconvolution problem for speech, and cannot be extended to signals that contain sustained harmonics (i.e. music or the singing voice). The data window length to be whitened, while critical, is usually chosen on heuristic criteria. Furthermore, it can be argued that dereverberation should happen before whitening, while usually the opposite approach is adopted [87], [118], [91]. In fact, the order of two linear filters can be swapped only if they
are time invariant. Since the vocal tract filter is not stationary, performing whitening before or
after dereverberation will lead to different results. From a physical perspective, reverberation
is added after the convolution with the vocal tract. Therefore reverberation spreads the information contained in a segment of speech into a wider temporal frame, and whitening on a short block basis is unable to remove this. This aspect will be analyzed in more detail in section 5.4.

4.5 Blind deconvolution

Blind deconvolution techniques are an unsupervised learning approach that identifies the in-
verse of an unknown linear time-invariant, possibly non minimum phase, system, without


Figure 4.7: Model of the blind SISO deconvolution problem.

having access to a training sequence (i.e. a desired response). An overview of existing blind deconvolution techniques can be found in [119], [120], [45], [121].

Blind deconvolution is often called, in the communication framework, blind equalization. A


discrete-time model of the linear equalization problem is shown in Fig.4.7.

The following model assumptions are made

• The source signal s(n) is a discrete-time, real, stationary stochastic process with zero
mean and discrete-time power spectrum Ps (ω).

• The distorting system is linear and time invariant (LTI) with discrete-time transfer func-
tion H(ω).

• v(n) is noise, statistically independent of s(n), modelled as a discrete-time, real, stationary stochastic process with zero mean and power spectrum Pv(ω).

• u(n) = x(n) + v(n) is the observed signal.

• The equalizer is an LTI system with discrete-time transfer function W (ω).

In a blind approach no information about H(ω) is available. Therefore, to achieve equalization,


an estimate of the system transfer function or of its inverse must be found. In a noise-free scenario, only second-order statistics (SOS) of the received signal x(n) (i.e. the system output) are needed to equalize the magnitude |H(e^{jω})|. However, SOS are not sufficient to equalize the phase component. In fact, the phase response of an LTI system is not available in the output SOS. This can be observed from the expression of the power spectrum (i.e. the Fourier transform of the autocorrelation) P_x(e^{jω}) of x(n)



P_x(e^{jω}) = Σ_{k=−∞}^{∞} r_x(k) e^{−jkω}    (4.43)

that is linked to the power spectrum of the system input s(n) by the relation

P_x(ω) = P_s(ω) |H(ω)|².    (4.44)

If a different LTI system H′(ω) = H(ω) · A(ω) is considered, where A(ω) is an allpass filter (i.e. unit magnitude and arbitrary phase response), the power spectrum P_x(ω) is unchanged: SOS cannot distinguish between H′(ω) and H(ω). This is why SOS are often described as phase-blind.

A unique relationship between the magnitude |H(ω)| and the phase ∠H(ω) of an LTI system exists only when H is either minimum phase or maximum phase (i.e. the transfer function of the system is stable and has all its zeros confined either to the interior or the exterior of the unit circle in the z-plane) [54]. Therefore a whitening filter, which equalizes only the magnitude response |H(ω)| of a system, is insufficient for blind equalization of mixed-phase systems.
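The phase-blindness of SOS can be verified numerically. In the sketch below (the toy system, the allpass pole a = 0.5 and the FIR truncation length are assumptions), H(z) is cascaded with a first-order allpass section: the magnitude response, and hence the output power spectrum of eq. 4.44, is unchanged, while the phase response is not.

```python
# Phase-blindness demo: H'(z) = H(z) A(z) with A(z) allpass has the same |H| but not the same phase.
import numpy as np

h = np.array([1.0, -0.4, 0.2])                 # a toy LTI system H(z)
a = 0.5
k = np.arange(1, 64)
# First-order allpass A(z) = (a + z^-1)/(1 + a z^-1), FIR-truncated impulse response:
# h_ap(0) = a,  h_ap(n) = (1 - a^2)(-a)^(n-1) for n >= 1.
allpass = np.concatenate(([a], (1 - a ** 2) * (-a) ** (k - 1)))
h_prime = np.convolve(h, allpass)              # H'(z) = H(z) A(z)

H1 = np.fft.rfft(h, 1024)                      # frequency responses on a dense grid
H2 = np.fft.rfft(h_prime, 1024)
# |H1| and |H2| coincide (same output SOS); the phases differ markedly.
```

Any SOS-based method therefore sees H and H′ as the same system, which is why HOS (next section) or multichannel structures are needed for mixed-phase equalization.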

4.6 Higher order statistics (HOS) methods

A possible approach to overcome the limitation of SOS and to recover the phase information is
by using Higher Order Statistics (HOS).

Let u(n), u(n + τ1 ), .., u(n + τk−1 ) denote the random variables obtained by observing the
process at times n, n + τ1 , .., n + τk−1 .

The second, third and fourth-order cumulants for a stationary random process are given by [45]

c₂(τ) = E{u(n) · u(n + τ)}    (4.45)

c₃(τ₁, τ₂) = E{u(n) · u(n + τ₁) · u(n + τ₂)}    (4.46)

c₄(τ₁, τ₂, τ₃) = E{u(n) · u(n + τ₁) · u(n + τ₂) · u(n + τ₃)}
                 − E{u(n) · u(n + τ₁)} · E{u(n + τ₂) · u(n + τ₃)}
                 − E{u(n) · u(n + τ₂)} · E{u(n + τ₃) · u(n + τ₁)}
                 − E{u(n) · u(n + τ₃)} · E{u(n + τ₁) · u(n + τ₂)}    (4.47)

The pth-order moment M_p of a random variable A is

M_p(A) = E{A^p}    (4.48)

where E{·} is the statistical expectation. As an example, M₁(A) = E{A} is the mean, and M₂(A) − (M₁(A))² = E{A²} − (E{A})² is the variance of A.

The generalization to a stationary random process u(n) is the kth-order moment function R_u, defined as

R_u[τ₁, ..., τ_{k−1}] = E{u(n) · u(n + τ₁) ⋯ u(n + τ_{k−1})}    (4.49)

It can be observed that:

• the second-order cumulant c₂(τ) is the same as the autocorrelation function r(τ) (the second-order moment function E{u(n) · u(n + τ)});

• the third-order cumulant c₃(τ₁, τ₂) is the same as the third-order moment function E{u(n) · u(n + τ₁) · u(n + τ₂)};

• the fourth order cumulant differs from the fourth order moment function.

A polyspectrum of order k is defined as the (k − 1)-dimensional Fourier transform of the kth-order cumulant c_k [45]

C_k(ω₁, ..., ω_{k−1}) = Σ_{τ₁=−∞}^{∞} ⋯ Σ_{τ_{k−1}=−∞}^{∞} c_k[τ₁, ..., τ_{k−1}] · e^{−j(ω₁τ₁ + ... + ω_{k−1}τ_{k−1})}    (4.50)

for k = 2, the ordinary power spectrum is obtained

P(ω) = Σ_{τ=−∞}^{∞} c₂(τ) · e^{−jωτ}    (4.51)

for k = 3, we have the bispectrum

C₃(ω₁, ω₂) = Σ_{τ₁=−∞}^{∞} Σ_{τ₂=−∞}^{∞} c₃[τ₁, τ₂] · e^{−j(ω₁τ₁ + ω₂τ₂)}    (4.52)

and the trispectrum for k = 4

C₄(ω₁, ω₂, ω₃) = Σ_{τ₁=−∞}^{∞} Σ_{τ₂=−∞}^{∞} Σ_{τ₃=−∞}^{∞} c₄[τ₁, τ₂, τ₃] · e^{−j(ω₁τ₁ + ω₂τ₂ + ω₃τ₃)}.    (4.53)

Cumulants and polyspectra can be considered respectively as a generalization of the autocorrelation function and of power spectra.

When a real-valued stationary random process is considered, its power spectrum is real; therefore no phase information can be extracted from it. On the other hand, it can be shown that polyspectra preserve phase information. In particular, for a bispectrum the following relationship holds [122]

Bx (ω1 , ω2 ) = Bs (ω1 , ω2 )H(ω1 )H(ω2 )H ∗ (ω1 + ω2 ) (4.54)

where the higher-order spectrum of the received signal x(n) is linked to the higher-order spectrum of the source signal s(n) by a complex-valued relation from which both the magnitude and the phase of the transfer function H(ω) can be identified. In [123] a closed form for an FIR model of the identified system impulse response is given


h(n) = R_y[P, n] / R_y[−P, P],    n = 0, ..., P    (4.55)

where y is the system output and P the FIR model order.

Unfortunately, equation 4.55 has limited practical use, as the model order P must be known and the estimate of the moment function R_y[P, n] comes with a large variance, even in the absence of noise. Nevertheless, it demonstrates the usability of HOS for blind system identification.

A review of blind identification methods based on the explicit use of higher-order spectra can be found in Giannakis [124], Mendel [123] and Nikias and Mendel [125]. See Hatzinakos and Nikias [126] for the application to deconvolution.

A different approach started with the work of Wiggins [127], where it was shown that blind estimation of the equalizer W(ω) could be achieved by maximizing the non-Gaussianity of the received signal. Kurtosis, a measure of the "peakedness" of the probability distribution of a real-valued random variable, was proposed as a metric for the non-Gaussianity. Kurtosis is defined² as the fourth moment around the mean divided by the square of the variance (that is, the second moment) of the probability distribution, minus 3

γ₂ = μ₄ / σ⁴ − 3    (4.56)

therefore it requires the computation of the fourth and second moments only. The previous expression is also known as "excess" kurtosis and is commonly used because γ₂ of a normal distribution is equal to zero, while it is positive for distributions with heavy tails and a peak at zero, and negative for flatter densities with lighter tails. Distributions of positive [negative] kurtosis are thus called super-Gaussian [sub-Gaussian]. A signal with sparse peaks and wide low-level areas is characterized by a high positive kurtosis value.

Donoho [129] provided a statistical foundation to Wiggins’ method, by pointing out that, by the
central limit theorem [130], a filtered version of a non-Gaussian i.i.d. process appears “more
Gaussian” than the source itself. He also concluded that general HOS can be used to reflect the
amount of Gaussianity of a random variable.
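Donoho's observation is easy to reproduce numerically. In the sketch below (the Laplacian source and the 8-tap smearing filter are illustrative assumptions), filtering an i.i.d. super-Gaussian process sharply reduces its excess kurtosis (eq. 4.56), i.e. makes it "more Gaussian":

```python
# Filtering a non-Gaussian i.i.d. process makes it look "more Gaussian".
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis, eq. 4.56: E{x^4}/E^2{x^2} - 3 for zero-mean data."""
    x = x - np.mean(x)
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

rng = np.random.default_rng(2)
s = rng.laplace(size=100_000)          # i.i.d. super-Gaussian source (excess kurtosis 3)
h = np.ones(8) / 8.0                   # reverberation-like smearing filter (assumption)
x = np.convolve(s, h, mode="same")     # filtered version of the source

k_s = excess_kurtosis(s)               # large and positive
k_x = excess_kurtosis(x)               # much closer to zero
```

The drop in kurtosis after filtering is exactly what makes kurtosis maximization a sensible criterion for undoing the filtering blindly.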

Later, Shalvi and Weinstein [111] provided a theoretical foundation to the non-Gaussianity maximization approach, extending it to any non-Gaussian source signal, and a necessary and sufficient condition for blind deconvolution of nonminimum phase linear time-invariant systems.

² Other definitions of kurtosis exist (i.e. μ₄/μ₂² or μ₄ − 3μ₂² [128]).

Figure 4.8: Bussgang type equalizer structure.

More recently [131], the use of the skewness

γ₁ = μ₃ / σ³    (4.57)

and its advantages with respect to kurtosis in some special cases of asymmetric source signals were proposed.

A different class of HOS-based blind deconvolution techniques indirectly exploits the higher order statistics of the received signal by employing a static nonlinear function.

This approach is due to Bellini [113], [132], [133]. In these papers, a technique for blind deconvolution, which is efficient when the input sequence is i.i.d. and the channel distortion is small, is proposed. The algorithms are known as Bussgang algorithms because the deconvolved signal exhibits Bussgang statistics when the algorithm converges in the mean.

Let us recollect the definition given before:

s(n) - the source signal

x(n) - the received signal

y(n) - the estimate of the source signal (the equalized signal)

w(n) - the equalizer impulse response

The Bussgang SISO algorithm for a LTI system is


y(n) = Σ_{p=0}^{P} w(p) x(n − p)    (4.58)

w_new(p) = w(p) + μ e(n) x(n − p)    (4.59)

where e(n) = f(y(n)) − y(n) is the estimation error, P is the equalizer order, μ the adaptation parameter and f(·) the Bussgang nonlinearity³.

The standard Bussgang algorithm has a very slow convergence. To address this problem, Amari
et al. proposed in [134] an on-line adaptive algorithm for blind deconvolution, called Natural
Gradient (NG). The same algorithm was also discovered by Cardoso et al. and described in
[135] as the “relative gradient”.

The SISO Natural Gradient Algorithm (NGA) for an LTI system is [136]

y(n) = Σ_{p=0}^{P} w(p) x(n − p)    (4.60)

u(n) = Σ_{m=0}^{P} w(P − m) y(n − m)    (4.61)

w_new(p) = w(p) + μ [w(p) + f(y(n − P)) u(n − p)].    (4.62)

The standard gradient descent (as in the standard Bussgang algorithm) is most useful for cost
functions that have a single minimum and whose gradients are isotropic in magnitude with
respect to any direction away from this minimum. In practice, however, the cost function
being optimized is multi-modal, and the gradient magnitudes are non-isotropic about any min-
imum. In such a case, the parameter estimates are only guaranteed to locally-minimize the cost
function, and convergence to any local minimum can be slow. The natural gradient adapta-
tion modifies the standard gradient search direction according to Riemannian structure of the
parameter space. While not removing local cost function minima, natural gradient adaptation

³ The −tanh(·) or the −sgn(·) function is usually assumed for super-Gaussian source signals.


provides isotropic convergence properties about any local minimum independently of the model
parametrization and of the dependencies within the signals being processed by the algorithm.
Moreover, natural gradient adaptation overcomes many of the limitations of Newton’s method,
which assumes that the cost function being minimized is approximately locally “quadratic”. By
providing a faster rate of convergence, the natural gradient extends the usability of Bussgang-type methods to non-stationary environments.

In [137] Bell and Sejnowski derive a self-organising learning algorithm which maximises the
information transferred in a network of non-linear units to perform blind deconvolution cancel-
lation of unknown echoes and reverberation in a speech signal. However the examples reported
in their paper are restricted to the deconvolution of unrealistically short IRs.

It must be highlighted that a non-Gaussian source signal s(n) is required by all the HOS blind identification and deconvolution methods. In fact, for a Gaussian distribution with expected value μ and variance σ², the cumulants are k₁ = μ, k₂ = σ², and k₃ = k₄ = ... = 0. Therefore no information is present in the higher-order cumulants/moments.

4.6.1 Reverberation cancellation based on HOS methods

Single-channel reverberation cancellation methods that can be ascribed to blind deconvolution HOS approaches are reported in [87], [90], [138]. All these methods are based on an LPC pre-whitening step, as described in section 4.4, followed by a blind deconvolution algorithm that maximizes the non-Gaussianity of the received signal in a similar fashion to what was suggested by Wiggins [127] and discussed in section 4.6. While their performance in the SISO case is quite limited, a considerable improvement is offered by their extension to a multichannel framework, which allows the use of both spatial and temporal diversity to achieve better deconvolution. However, the way adopted to create this extension often does not have a clear theoretical foundation, and the better behaviour can mainly be justified by the consideration that a MINT-like structure is employed.

In [87] the kurtosis of the reverberant residual was proposed by Gillespie et al. as a reverberation metric. It was observed that for clean voiced speech, LP residuals have strong peaks corresponding to glottal pulses, whereas for reverberated speech such peaks are spread in time. A measure of the amplitude spread of LP residuals can therefore serve as a reverberation metric. By building a filter that maximizes the kurtosis of the reverberant residual it is theoretically possible to identify the inverse function of the RTF and thus to equalize the system. This approach is blind since it requires only the evaluation of the kurtosis of the reverberant residual of the system output. Gillespie's observation can be explained by considering that, by its nature, reverberation is the process of summing a large number of attenuated and delayed copies of the same signal. Thus, by the central limit theorem [129], [130], the reverberated signal has a more Gaussian distribution with respect to the original one. The LPC residual of speech is mainly constituted by the glottal pulses, so it is sparse and characterized by a high, positive kurtosis. Therefore reverberation causes this signal to assume a more Gaussian distribution.

In more detail, the key idea behind the Gillespie approach is to build an FIR filter, applied to
the residual x̃(n) of the received noisy reverberated speech signal x(n),

ỹ(n) = w(n) ∗ x̃(n) (4.63)

so that the kurtosis of ỹ(n) is maximized. Since ỹ(n) is a zero-mean signal, the kurtosis expression can be simplified as

γ₂ = E{ỹ⁴} / E²{ỹ²}.    (4.64)

Using the kurtosis as the cost function, an LMS-like algorithm can be formulated⁴

w(n + 1) = w(n) + μ · ∇J(w(n))    (4.65)

where the gradient of the kurtosis is

∇J(w(n)) = ∂J/∂w = 4 (E{ỹ²} E{ỹ³x̃} − E{ỹ⁴} E{ỹx̃}) / E³{ỹ²}.    (4.66)

⁴ ’+’ instead of ’−’ because maximization of the cost function is desired.

Compared to the LMS algorithm, the kurtosis approach does not require the evaluation of an error signal with respect to a target, although the update equation has a very similar structure. In [87] the gradient is simplified as


Figure 4.9: A single channel time-domain adaptive algorithm for maximizing kurtosis of the
LP residual.

∇J(w(n)) = ∂J/∂w = [4 (E{ỹ²} ỹ² − E{ỹ⁴}) ỹ / E³{ỹ²}] · x̃(n) = f(n) · x̃(n)    (4.67)

and thus the update equation becomes

w(n + 1) = w(n) + µ · f (n) · x̃(n) (4.68)

where μ controls the speed of the adaptation. For an efficient real-time implementation, the expected values are estimated using

E{ỹ(n)²} = β E{ỹ(n − 1)²} + (1 − β) ỹ(n)²    (4.69)

E{ỹ(n)⁴} = β E{ỹ(n − 1)⁴} + (1 − β) ỹ(n)⁴    (4.70)

where β controls the smoothness of the moment estimates.
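The mechanics of eqs. 4.67-4.70 can be sketched as below. This is only an illustrative toy, not the algorithm of [87]: the sparse synthetic "LP residual", the smearing filter, the step size, the smoothing constant and the per-step unit-norm renormalization (added here to keep the scale fixed) are all assumptions.

```python
# Sketch of the online kurtosis-maximization update of eqs. 4.67-4.70.
import numpy as np

rng = np.random.default_rng(6)
N, P, mu, beta = 15000, 16, 1e-5, 0.99
s = np.zeros(N)
mask = rng.random(N) < 0.05                        # sparse, spiky "LP residual" source
s[mask] = rng.standard_normal(int(mask.sum()))
x = np.convolve(s, [1.0, 0.0, 0.7, 0.0, 0.4])[:N]  # smeared ("reverberated") residual
x /= x.std()                                       # unit-variance input (assumption)

w = np.zeros(P); w[0] = 1.0                        # initial spike filter
Ey2, Ey4 = 1.0, 3.0                                # running moment estimates
for n in range(P, N):
    xn = x[n - P + 1:n + 1][::-1]
    y = np.dot(w, xn)
    Ey2 = beta * Ey2 + (1 - beta) * y ** 2         # eq. 4.69
    Ey4 = beta * Ey4 + (1 - beta) * y ** 4         # eq. 4.70
    f = 4 * (Ey2 * y ** 2 - Ey4) * y / Ey2 ** 3    # simplified gradient term, eq. 4.67
    w = w + mu * f * xn                            # eq. 4.68, gradient ascent on kurtosis
    w /= np.linalg.norm(w)                         # unit-norm renormalization (assumption)
```

The recursive moment estimates make the scheme suitable for real-time use, at the price of an extra smoothing parameter β to tune.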

A similar approach is also used as the first stage of a single microphone dereverberation algo-
rithm proposed by Wu and Wang [90]. The algorithm shows satisfactory results for reducing
reverberation effects when T60 is between 0.2s and 0.4s.


4.7 Multi-channel SOS methods

If a multichannel system is considered, SOS can be used both for system identification and
deconvolution.

4.7.1 SIMO identification

4.7.1.1 The “cross relation” and other subspace techniques

In [139], based on the results in [140], Xu, Tong et al. proposed a method to blindly identify a set of FIR filters. The method relies on the fact that all the outputs of a multiple-channel FIR system are correlated when driven by the same input. Any pair of different noise-free instantiations of the same source signal s(n) is linked by the following relations

x_i(n) = h_i(n) ∗ s(n),    x_j(n) = h_j(n) ∗ s(n)    (4.71)

then

x_i(n) ∗ h_j(n) = s(n) ∗ h_i(n) ∗ h_j(n) = x_j(n) ∗ h_i(n);    i, j = 1, 2, ..., M;  i ≠ j.    (4.72)

From this relation, an overdetermined set of linear equations, with h_i, h_j as unknowns, can be written [139]. For n = L, . . . , N, where N is the last sample index of the received data x_i(n) and x_j(n) and L is the maximum length of the channel impulse responses, we have N − L + 1 linear equations

[ X_i(L)   −X_j(L) ] · [ h_jᵀ  h_iᵀ ]ᵀ = 0    (4.73)

where hm = [hm (L), . . . , hm (0)]T and

 
X_m(L) = [ x_m(L)       x_m(L + 1)       . . .   x_m(2L)
           x_m(L + 1)   x_m(L + 2)       . . .   x_m(2L + 1)
           ⋮            ⋮                ⋱       ⋮
           x_m(N − L)   x_m(N − L + 1)   . . .   x_m(N) ]    (4.74)


Figure 4.10: Illustration of the relationships between the input s(n) and the observations x(n)
in an M-channel SIMO system.

Equation 4.73 can be written for each pair of channels (i, j). The equations of all channel pairs can be combined into a larger set of linear equations in terms of h₁, ..., h_M, or simply h = [h₁ᵀ, . . . , h_Mᵀ]ᵀ, so that all the channel impulse responses can be calculated simultaneously

X(L) h = 0    (4.75)

where h is the stacked vector of the impulse responses, and X(L) the matrix containing the received signals,

X(L) = [ X̄₁(L)ᵀ  · · ·  X̄_{M−1}(L)ᵀ ]ᵀ    (4.76)

with

X̄_i(L) = [ 0 . . . 0   X_{i+1}(L)   −X_i(L)   0        . . .   0
           ⋮                                   ⋱
           0 . . . 0   X_M(L)        0         . . .    −X_i(L) ]    (4.77)

The necessary and sufficient conditions to ensure a unique solution to the above equation, or in other words to assure identifiability, are [139]:

1. the channel transfer functions do not share any common zeros;

2. the autocorrelation matrix of the source signal, R_ss = E{s(k)sᵀ(k)}, is of full rank (such that the SIMO system can be fully excited).

The first condition is the usual coprimeness, or channel diversity, identifiability condition for a
SIMO system discussed in section 3.2.3. The second condition is relatively mild and does not imply knowledge of the exact statistics of the input signal, nor does it constrain it to be an i.i.d. process. Therefore, theoretically, any input signal that can fully excite the SIMO system can be
employed. These conditions are sufficient for the blind identification of any SIMO system.
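The cross relation of eq. 4.72 is easy to verify numerically; the sketch below uses toy channels (assumptions, chosen without common zeros) and an arbitrary source, since the identity holds regardless of the source statistics:

```python
# Noise-free 2-channel check of the cross relation: x1 * h2 == x2 * h1.
import numpy as np

rng = np.random.default_rng(4)
s = rng.standard_normal(500)                  # arbitrary source signal
h1 = np.array([1.0, 0.5, -0.2])               # toy channel impulse responses
h2 = np.array([0.8, -0.3, 0.1])               # (no common zeros, assumption)
x1 = np.convolve(s, h1)                       # x1 = h1 * s
x2 = np.convolve(s, h2)                       # x2 = h2 * s
lhs = np.convolve(x1, h2)                     # x1 * h2 = s * h1 * h2
rhs = np.convolve(x2, h1)                     # x2 * h1 = s * h1 * h2
```

The identity follows directly from the commutativity and associativity of convolution, which is why the relation holds deterministically and motivates the "deterministic subspace" label used below.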

This approach is known in the literature as the Cross Relation (CR) approach. The CR approach, a term coined by Hua [141], was discovered independently and in different forms by several authors, among them Liu et al. [142] and Gurelli and Nikias [143]. These algorithms, originally aimed at solving communication problems, are often referred to as "deterministic subspace methods", since the statistical properties of the source are not exploited. Subspace algorithms are based on the idea that the channel (or part of the channel) vector lies in a one-dimensional subspace of either the observation statistics or a block of noiseless observations.

Some of the subspace techniques, such as the EVAM algorithm proposed by Gurelli and Nikias
[143], have been used in dereverberation problems. Gurelli and Nikias showed that the null
space of the correlation matrix of the received signals contains information on the transfer
function relating the source and the microphones. This was extended by Gannot and Moonen
[144], [145] to the speech dereverberation problem.
Even if these techniques are supported by theory, they have several drawbacks in real-life sce-
narios. The Generalized Eigenvalue Decomposition (GED) [146], which is used to construct
the null space of the correlation matrix, is not robust enough, and quite sensitive to small esti-
mation errors in the correlation matrix. Furthermore, the matrices involved become extremely
large causing severe memory and computational requirements. Another problem arises from
the wide dynamic range of the speech signal. This phenomenon may result in an erroneous
estimate of the frequency response of the IRs in the low energy bands of the input signal.
In a following paper [80], Moonen and Eneman showed that, at the current state, even the more
advanced subspace-based dereverberation techniques did not provide, in a real-life scenario,
any signal enhancement. Furthermore, even if most subspace methods can converge quickly,


they are difficult to implement in an adaptive mode and have a high computational load [65]⁵.

4.7.1.2 Adaptive blind channel identification techniques

To overcome these limitations, Huang and Benesty [86] proposed a set of adaptive algorithms to solve the set of linear equations obtained from the CR approach. Their work started from the formulation of the CR approach described in [147], where the channel impulse responses of an identifiable system are blindly determined by calculating the null space of the cross-correlation-like matrix of the channel outputs

Rx h = 0 (4.78)

with

Rx = [ Σ_{i≠1} R_{x_i x_i}   −R_{x₂x₁}              · · ·   −R_{x_M x₁}
       −R_{x₁x₂}             Σ_{i≠2} R_{x_i x_i}    · · ·   −R_{x_M x₂}
       ⋮                     ⋮                      ⋱       ⋮
       −R_{x₁x_M}            −R_{x₂x_M}             · · ·   Σ_{i≠M} R_{x_i x_i} ]    (4.79)

where M is the number of channels,

R_{x_i x_j} = E{x_i(n) x_jᵀ(n)};    i, j = 1, 2, ..., M    (4.80)

and

h = [h₁ᵀ, . . . , h_Mᵀ]ᵀ    (4.81)

is the stacked vector of the impulse responses. For a blindly identifiable SIMO system, the matrix Rx is rank deficient by 1. In the absence of noise, the channel impulse responses can be uniquely determined from Rx, which contains only the SOS of the system outputs.

By following the fact that

x_i(n) ∗ h_j(n) = s(n) ∗ h_i(n) ∗ h_j(n) = x_j(n) ∗ h_i(n);    i, j = 1, 2, ..., M;  i ≠ j;    (4.82)


⁵ A review of other subspace methods can be found in the same paper.


we have, in the absence of noise, the following cross relation at time k

x_iᵀ(k) h_j = x_jᵀ(k) h_i;    i, j = 1, 2, ..., M;  i ≠ j.    (4.83)

When noise is present and/or the estimate of the channel impulse responses deviates from the true value, an a priori error signal is produced

e_ij(k + 1) = x_iᵀ(k + 1) ĥ_j(k) − x_jᵀ(k + 1) ĥ_i(k);    i, j = 1, 2, ..., M;    (4.84)

where ĥ_i(k) is the model filter for the i-th channel at time k. The estimated channel impulse response vector is aligned with the true one, but only up to a non-zero scale factor. This inherent scale ambiguity is usually harmless in most acoustic signal processing applications, but in the development of an adaptive algorithm attention needs to be paid to prevent it from converging to the trivial all-zero estimate. Therefore, a constraint can be imposed on the model filter. Two constraints can be found in the literature: the unit-norm constraint, i.e. ‖ĥ‖ = 1, and the component normalization constraint [147], i.e. cᵀĥ = 1, where c is a constant vector.

The unit-norm constraint can be explained by Parseval's theorem

∫_{−∞}^{+∞} |x(t)|² dt = ∫_{−∞}^{+∞} |X(f)|² df    (4.85)

where X(f ) = F{x(t)} represents the continuous Fourier transform (in normalized, unitary
form) of x(t) and f represents the frequency component of x. The interpretation of this form
of the theorem is that the total energy contained in a waveform x(t) summed across all of time
t is equal to the total energy of the waveform’s Fourier Transform X(f ) summed across all of
its frequency components f .

For a discrete-time system, Parseval's theorem is

Σ_{n=0}^{N−1} |x(n)|² = (1/N) Σ_{k=0}^{N−1} |X[k]|²    (4.86)

where X[k] is the Discrete Fourier Transform (DFT) of x(n), both of length N. Therefore, the unit-norm constraint

‖ĥ‖² = Σ_{n=0}^{N−1} |ĥ(n)|² = 1    (4.87)

imposes a unit constraint on the filter power.

The component normalization constraint is useful when one coefficient of the model filter is known to be equal to some α ≠ 0. Then the vector c = [0, . . . , 1/α, . . . , 0]ᵀ can be properly specified, so that cᵀĥ = 1. Even though the component normalization can be more robust to noise than the unit-norm constraint [147], the knowledge of the location of the component to be normalized and of its value α may not be available in practice, so the unit-norm constraint is more widely used.

With the unit-norm constraint enforced on ĥ, the normalized error signal is

ε_ij(k + 1) = e_ij(k + 1) / ‖ĥ(k)‖    (4.88)

accordingly, the cost function is formulated as

J(k + 1) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} ε²_ij(k + 1)    (4.89)

and the gradient used in the update equation of the algorithm is then given by

∇J(k + 1) = ∂J(k + 1) / ∂ĥ(k) = 2 [R̃_x(k + 1) ĥ(k) − J(k + 1) ĥ(k)] / ‖ĥ(k)‖²    (4.90)

where R̃_x is a matrix with the same structure as the Rx matrix in equation 4.79, but built with the instantaneous values of the received signals,

R̃_{x_i x_j} = x_i(k + 1) x_jᵀ(k + 1);    i, j = 1, 2, ..., M.    (4.91)

This algorithm is known as the Multichannel LMS algorithm (MCLMS). Several algorithms
based on the MCLMS algorithm, and that outperform it, have been proposed by the same


authors. All of them are documented and discussed in [86]. Among them:

• the Unconstrained Multichannel LMS algorithm with optimal step size (VSS-UMCLMS),
that has faster convergence and similar computational load

• the Constrained Multichannel Newton (CMN) algorithm, that has much faster conver-
gence but high computational load

• the normalized multichannel frequency-domain LMS (NMCFLMS), which takes advantage of the computational efficiency of the FFT and which, by orthogonalizing the data, also offers faster convergence [45].

The NMCFLMS algorithm has been documented as being able to identify rather short (256-tap) impulse responses by using a 100-second-long speech signal as the source [86].
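The cross-relation error and cost that all these adaptive variants minimize (eqs. 4.84, 4.88, 4.89) can be sketched directly. This toy evaluates the normalized cost at the true channels versus a mismatched estimate; the 2-channel system, the lengths and the tap ordering h(0), ..., h(L−1) are simplifying assumptions.

```python
# Cross-relation cost of eqs. 4.84 and 4.88-4.89 on a toy noise-free SIMO system.
import numpy as np

rng = np.random.default_rng(5)
L = 4
h = [np.array([1.0, 0.6, 0.2, 0.1]),
     np.array([0.9, -0.4, 0.3, 0.05])]        # toy 2-channel SIMO system (assumption)
s = rng.standard_normal(400)
x = [np.convolve(s, hi)[:400] for hi in h]    # noise-free channel outputs

def cr_cost(h_est, x, L):
    """Sum of squared normalized cross-relation errors over all samples and pairs."""
    norm = np.linalg.norm(np.concatenate(h_est))
    J = 0.0
    for k in range(L - 1, len(x[0])):
        xk = [xi[k - L + 1:k + 1][::-1] for xi in x]       # x_i(k) data vectors
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                e = np.dot(xk[i], h_est[j]) - np.dot(xk[j], h_est[i])  # eq. 4.84
                J += (e / norm) ** 2                                   # eqs. 4.88-4.89
    return J

J_true = cr_cost(h, x, L)              # essentially zero at the true channels
J_wrong = cr_cost([h[0], h[0]], x, L)  # clearly non-zero for a mismatched estimate
```

The cost vanishing only at (scaled versions of) the true channel vectors is what the MCLMS-family gradient updates exploit.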

One of the main challenges for NMCFLMS is that the algorithm suffers from a misconvergence
problem. It has been shown through simulations presented in [148], [149], that the estimated
filter coefficients converge first toward the impulse response of the acoustic system but then
misconverge. Under low signal-to-noise ratio (SNR) conditions, the effect of misconvergence
becomes more significant and occurs at an earlier stage of adaptation. Possible solutions to this
problem have been investigated by Gaubitch et al. in [150], [151] and by Ahmad et al. [148]
[152],[149]. In [153] a noise robust adaptive blind multichannel identification algorithm for
acoustic impulse responses, called ext-NMCFLMS, based on a modification of the NMCFLMS
has been proposed. No information about its performance with longer acoustic IRs is available.

4.7.2 SIMO equalization

The previous approaches do not directly address the problem of system equalization. In fact, a
further step is required to calculate the equalizer once the FIR filters have been estimated.

4.7.2.1 A fundamental theorem for multiple-channel blind equalization

The problem of SIMO system equalization has been investigated by Slock and Papadias in
[154], where it is shown that blind equalization can be achieved by a linear prediction algorithm
applied to the channel bank outputs.


A more general approach to blind multichannel equalization, which provides a unifying framework
for multichannel blind equalization, is reported in Liu and Dong [92]. In this paper, it is
proved that, if no common zeros are shared among the channel transfer functions hi and if the
source signal s(n) is zero-mean and temporally uncorrelated, then an FIR equalizer bank wi
equalizes the FIR channel bank if, and only if, the composite output of the equalizer bank y(n)
is temporally uncorrelated. It is interesting to note that the condition of existence for the filter
bank is identical to the one required by the MINT. While the MINT gives a closed formula to
calculate the equalizer in a non-blind framework, this theorem suggests how to calculate it
in the blind case. The advantage of a multichannel structure with respect to a single-channel one,
when a non-minimum-phase system is considered, is also present in the blind case.

Important aspects that stem out from this theorem are:

• the second-order statistics of the composite output of the equalizer bank contain sufficient
information for blind equalization, though not sufficient information for blind identification

• the hypotheses are very mild: the source signal need not be stationary, nor i.i.d.

• since the linear prediction error is temporally uncorrelated, the method proposed by Slock
and Papadias [154], which achieves direct blind equalization by multichannel linear
prediction, follows immediately from the previous theorem

• no constraints on how to obtain the output decorrelation exist; therefore both linear and
non-linear approaches can be used.
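The constructive side of the theorem can be checked with a short numerical sketch (NumPy; the channel lengths, predictor memory and random seed below are illustrative choices of mine, not taken from [92] or [154]): with a white source and channels sharing no common zeros, predicting the first sensor signal from the joint past of all sensors leaves a prediction error equal to h1(0)s(n), i.e. blind equalization by multichannel linear prediction.

```python
import numpy as np

rng = np.random.default_rng(7)

# toy SIMO setup (illustrative values): white source, two random FIR channels
M, L, Ns = 6, 12, 40000               # channel taps, predictor memory, samples
s = rng.standard_normal(Ns)           # zero-mean, temporally uncorrelated source
h1 = np.r_[1.0, 0.5 * rng.standard_normal(M - 1)]
h2 = np.r_[0.3, 0.5 * rng.standard_normal(M - 1)]
x1 = np.convolve(s, h1)[:Ns]
x2 = np.convolve(s, h2)[:Ns]

# multichannel LP: predict x1(n) from the L past samples of BOTH channels
U = np.array([np.r_[x1[n - L:n][::-1], x2[n - L:n][::-1]] for n in range(L, Ns)])
t = x1[L:]
w_lp, *_ = np.linalg.lstsq(U, t, rcond=1e-8)   # U is rank deficient by design

# the prediction error is the composite equalizer-bank output: for a white
# source it recovers h1(0) * s(n), so the SIMO system has been equalized
e = t - U @ w_lp
rho = np.corrcoef(e, s[L:])[0, 1]     # should be close to 1
```

The residual is also temporally uncorrelated, which is exactly the equalization condition stated by the theorem.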

All the following blind speech dereverberation algorithms are based, or can be explained, by the
Liu and Dong [92] theorem. In all of them in fact, the FIR equalizer bank is calculated to equal-
ize the FIR channel bank by decorrelating the composite output of the equalizer bank y(n). This
decorrelation is achieved by multichannel linear prediction [155],[156], or by decorrelating the
composite filter bank output both by SOS [157] and HOS [88] methods, or by shaping the cor-
relation of the received reverberant signal [158]. The main novelty of these contributions lies in
the techniques used to decouple the speech production system and the room response
system to avoid ambiguous deconvolution.

In [155] Triki and Slock proposed a dereverberation technique based on the observation that a
single-input multi-output (SIMO) system is equalized blindly by applying multichannel linear


Figure 4.11: Multichannel blind equalizer for a SIMO system.

prediction (LP). This approach follows [154] and the Liu and Dong theorem [92]. However
when the input is colored, the multichannel linear prediction will both equalize the reverbera-
tion filter and whiten the source. Therefore, as discussed in section 4.4, it is critical to define a
criterion to decouple the source and the reverberation. To address this problem, channel spatial
diversity and the speech signal non-stationarity were employed to estimate the source correla-
tion structure, which can hence be used to determine a source whitening filter⁶. Multichannel
linear prediction was then applied to the sensor signals filtered by the source whitening filter,
to obtain source dereverberation. The algorithm was tested with synthetic impulse responses
generated by a MISM algorithm, and the simulations show that the proposed equalizer
outperforms the delay-and-sum beamformer. In a following work [118] the dereverberation
in a noisy environment was considered and the robustness of the algorithm was improved by
considering an equalizer based on an MMSE criterion instead of a simple ZF equalizer.

A similar approach based on multichannel LP was considered by Delcroix et al. in [156],[159],
where a two-channel dereverberation algorithm called LInear-predictive Multi-input Equalization
(LIME) is proposed. In their approach multichannel LP was used, but allowing the source
to be whitened, while restoring the coloration in a final stage. The LIME algorithm formulation,
as summarized in [160], is reported here.

The LIME algorithm assumes the following hypothesis:

1. One source and P microphones

2. The room impulse response between the source and the ith microphone is called hi (n).
6. More details are discussed in section 4.4.


The signal received by the ith microphone, ui(n), is the source signal, s(n), convolved with
hi(n), i ∈ 1, ..., P. The microphone closest to the source is assumed to be i = 1.

The room transfer functions (RTFs) are modelled using time invariant polynomials and assumed
to share no common zeros. The RTFs are defined as

H_i(z) = Σ_{k=0}^{M−1} h_i(k) z^{−k},   i ∈ 1, ..., P   (4.92)

where M is the number of taps of the room impulse response. Using matrix formulation the
previous equation can be rewritten as

u_i(n) = H_i^T s(n)   (4.93)

where u_i(n) = [u_i(n), ..., u_i(n−L+1)]^T and H_i is an (M+L−1)×L convolution matrix expressed
as

H_i = [ h_i(0)     0          ···   0
        h_i(1)     h_i(0)     ···   0
        ⋮          ⋮          ⋱     ⋮
        h_i(M−1)   h_i(M−2)         h_i(0)
        0          h_i(M−1)         h_i(1)
        ⋮          ⋮          ⋱     ⋮
        0          0          ···   h_i(M−1) ]   (4.94)

and s(n) = [s(n), ..., s(n − N + 1)]T . The length of the signal vector ui (n) is denoted by L,
and its minimum length is derived from the condition

L ≥ (M − 1) / (P − 1)   (4.95)

as in the MINT theorem (section 3.2.3).

3. The input signal s(n) is assumed to be generated from a finite long-term AR process applied


to a white noise e(n). The Z transform of the AR process is 1/a(z), where a(z) is the AR
polynomial

a(z) = 1 − a_1 z^{−1} − ... − a_N z^{−N},   (4.96)

where N denotes the order of the AR polynomial, given by N = M + L − 1.

Using matrix formulation

s(n) = CT s(n − 1) + e(n) (4.97)

where C is the N×N matrix defined as

C = [ a_1      1    0    ···   0
      a_2      0    1    ···   0
      ⋮        ⋮    ⋱    ⋱     ⋮
      a_{N−1}  0    ···  0     1
      a_N      0    ···  ···   0 ]   (4.98)
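The matrix formulation of equations 4.93–4.94 can be verified numerically; in the sketch below (arbitrary small values of M and L, random data) the convolution matrix H_i is built column by column and H_i^T s(n) is checked against plain convolution.

```python
import numpy as np

rng = np.random.default_rng(8)

M, L = 5, 4                  # illustrative channel length and vector length
N = M + L - 1
h = rng.standard_normal(M)   # one room impulse response h_i
s = rng.standard_normal(50)

# (M+L-1) x L convolution matrix of eq. 4.94: column j holds h shifted down by j
H = np.zeros((N, L))
for j in range(L):
    H[j:j + M, j] = h

n = 30
s_vec = s[n:n - N:-1]        # s(n) = [s(n), ..., s(n-N+1)]^T
u_vec = H.T @ s_vec          # eq. 4.93: u_i(n) = H_i^T s(n)

# reference: u(n-j) = sum_k h(k) s(n-j-k), i.e. samples of the convolution
u_ref = np.array([np.convolve(s, h)[n - j] for j in range(L)])
```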

The LIME algorithm consists of the following steps:

1. Both the prediction filter w and the AR polynomial a(z) are estimated from a matrix Q,
which is defined as [159]

Q = ( E{u(n − 1) u^T(n − 1)} )^+ E{u(n − 1) u^T(n)}   (4.99)

where u(n) = [u_1^T(n), ..., u_P^T(n)]^T, A^+ is the Moore–Penrose generalized inverse of matrix
A, and E{·} denotes the time averaging operator. The covariance matrix is estimated using

E{x(n) y(n)^T} = (1/N_s) Σ_{n=0}^{N_s−1} (x(n) − m_x)(y(n) − m_y)^T   (4.100)

where N_s denotes the length of the reverberant signal segment, x(n) = [x(n), ..., x(n−L+1)]^T,
y(n) = [y(n), ..., y(n−L+1)]^T, and m_x, m_y are their respective mean vectors.

The mean vectors are calculated using

m_x = (1/N_s) Σ_{n=0}^{N_s−1} x(n)   (4.101)

and

m_y = (1/N_s) Σ_{n=0}^{N_s−1} y(n).   (4.102)

The first column of Q gives the prediction filter w, and an estimate of the AR polynomial a(z)
is obtained from the characteristic polynomial of Q.

2. The prediction residual is defined as [159]

ê(n) = u_1(n) − u^T(n − 1) w.   (4.103)

The residual signal is free from the effect of room reverberation but is also excessively whitened.
Filtering the prediction residual with 1/a(z) produces the recovered input signal multiplied by
a factor of h1 (0). Simulation results showed that LIME could achieve almost perfect derever-
beration, but only for short duration impulse responses [161]. Additionally the computation
of large covariance matrices makes LIME a computationally expensive algorithm. In
[162] speech dereverberation in the presence of noise was discussed and the fact that the LIME
algorithm can achieve both dereverberation and noise reduction was shown.
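Steps 1 and 2 can be illustrated with a toy simulation (hypothetical short channels and an AR(2) source with arbitrary coefficients; a real room response would be far longer): the first column of the empirically estimated Q gives the prediction filter, and the residual of equation 4.103 recovers the excessively whitened source, i.e. the AR innovations e(n), up to the scale h1(0).

```python
import numpy as np

rng = np.random.default_rng(9)

# AR(2) source driven by white innovations e(n) (illustrative coefficients)
Ns = 40000
e = rng.standard_normal(Ns)
s = np.zeros(Ns)
for n in range(2, Ns):
    s[n] = 0.6 * s[n - 1] - 0.2 * s[n - 2] + e[n]

M, L = 6, 12                                 # L >= (M-1)/(P-1), with P = 2
h1 = np.r_[1.0, 0.5 * rng.standard_normal(M - 1)]
h2 = np.r_[0.3, 0.5 * rng.standard_normal(M - 1)]
u1 = np.convolve(s, h1)[:Ns]
u2 = np.convolve(s, h2)[:Ns]

def uvec(n):                                 # u(n) = [u1(n..n-L+1), u2(n..n-L+1)]
    return np.r_[u1[n - L + 1:n + 1][::-1], u2[n - L + 1:n + 1][::-1]]

ns = np.arange(L, Ns)
Up = np.array([uvec(n - 1) for n in ns])     # samples of u(n-1)
Uc = np.array([uvec(n) for n in ns])         # samples of u(n)
Rp = (Up - Up.mean(0)).T @ (Up - Up.mean(0)) / len(ns)   # eq. 4.100
Cc = (Up - Up.mean(0)).T @ (Uc - Uc.mean(0)) / len(ns)
Q = np.linalg.pinv(Rp, rcond=1e-8) @ Cc      # eq. 4.99 (R is rank deficient)

w = Q[:, 0]                                  # step 1: prediction filter
e_hat = u1[ns] - Up @ w                      # eq. 4.103: prediction residual
rho = np.corrcoef(e_hat, e[ns])[0, 1]        # residual ~ h1(0) e(n)
```

The Moore–Penrose pseudoinverse in equation 4.99 matters here: the stacked observation vectors live in a lower-dimensional subspace, so the plain inverse of the covariance matrix does not exist.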
In [160], where an improved method based on the LIME approach was proposed, it was shown
experimentally that the algorithm performance is signal dependent, i.e. applying the algorithm
to different segments of the same speech signal produces different performance levels. There-
fore LIME may not perform well, and produce signals whose audible quality is worse than
that of the reverberant. The authors proposed two non-intrusive methods to evaluate the per-
formance of LIME. These methods are used to detect errors in the estimation of the source
signal and isolate the input segments associated with poor dereverberation. For these segments
an alternative inverse filter calculation approach is used, by minimizing the least squares (LS)
error between the estimate of LIME and the filtered microphone signals. These LS filters are
then used to dereverberate the received microphone signals. These modifications led to stable
performances with different speech signals under the same room conditions.


In [88] and [157], Yoshioka et al. proposed two similar methods, one HOS- and the other
SOS-based, applied to the same multichannel structure, to calculate the dereverberation filters.
Both methods rely on the Liu and Dong [92] theorem, since the dereverberation filter is
calculated by decorrelating the composite output of the equalizer bank. The fundamental issue
addressed in both papers is how to estimate a channel’s inverse filter separately from the inverse
filter of the speech generating AR system, or in other words from the prediction error filter
(PEF). The authors claim that by jointly estimating the channel inverse filter and the PEF, the
channel inverse is identifiable due to the time varying nature of the PEF.

In [56] Gillespie et al. showed that penalizing long-term reverberation energy is more effective
than maximizing the signal-to-reverberation ratio (SRR) for improving audible quality and au-
tomatic speech recognition (ASR) accuracy. In a following paper [158], they noticed that the
energy in the tail of the autocorrelation sequence of the received reverberant signal is related
to the amount of long-term reverberation. Based on these observations, a technique called
Correlation Shaping (CS), aimed at reshaping the autocorrelation function of a signal by means
of linear filtering, was proposed. This has the intended effect of reducing the length of the
equalized speaker-to-receiver impulse response to improve audible quality and ASR accuracy
blindly. Dereverberation can, therefore, be achieved by reshaping the autocorrelation of the lin-
ear prediction residual to a Dirac δ function, that is, whitening it. Therefore, this multichannel
algorithm also relies on the Liu and Dong [92] theorem.
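The observation that motivates CS — that the autocorrelation tail grows with long-term reverberation — is easy to reproduce (synthetic white source and an invented noise-like decaying IR; all lengths and thresholds here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(10)

Ns = 20000
s = rng.standard_normal(Ns)                   # dry source (white)
# synthetic noise-like IR with an exponentially decaying envelope
Lh = 2000
h = rng.standard_normal(Lh) * np.exp(-np.arange(Lh) / 400.0)
h[0] = 1.0
x = np.convolve(s, h)[:Ns]                    # reverberant signal

def acf(z):
    # FFT-based autocorrelation estimate, normalized so that r(0) = 1
    n = len(z)
    r = np.fft.irfft(np.abs(np.fft.rfft(z, 2 * n)) ** 2)[:n]
    return r / r[0]

def tail_energy(z, tau0=200, tau1=1500):
    # energy in the autocorrelation tail, the quantity CS tries to suppress
    return np.sum(acf(z)[tau0:tau1] ** 2)

te_dry, te_rev = tail_energy(s), tail_energy(x)   # reverberation inflates the tail
```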
Such a decorrelation process is achieved through adaptive filtering as shown in Fig.4.12. The
actual system is not limited solely to 2 channels, but can be generalized to any number of in-
puts, N . The proposed correlation shaping technique modifies the received reverberant signals,
xi , using a set of adaptive linear filters, gi ,

y(n) = Σ_{i=1}^{N} g_i^T(n) x_i(n)   (4.104)

to minimize the weighted mean square error between the actual output autocorrelation se-
quence, ryy , and the desired output autocorrelation sequence, rdd = δ(τ ). This error is given
by

e(τ) = W(τ) (r_yy(τ) − r_dd(τ))²   (4.105)


Figure 4.12: 2-channel Correlation shaping block diagram.

where W (τ ) is a real valued weight. A large positive value of W (τ ) gives more importance to
the error at a particular lag, τ . The adaptive filter employed uses a normalized gradient descent
approach to accomplish this minimization. The update equation for each filter coefficient l of
the ith channel is

g_i(l, n + 1) = g_i(l, n) − μ ∇_i(l) / √( Σ_i Σ_l ∇_i²(l) ).   (4.106)

It can be shown [158],[3] that the gradient for each filter coefficient, l, that minimizes e(τ) with
respect to the filter coefficients is given by

∇_i(l) = Σ_{τ>0} W(τ) r_yy(τ) ( r_{x_i y}(l − τ) + r_{x_i y}(l + τ) )   (4.107)

where rxi y is the crosscorrelation between the ith channel input and the output. The gradi-
ent normalization improves the convergence properties (similarly to the Normalized LMS) and
normalizes the output signal energy.
Since long reverberation time is especially harmful [158], a “don't care” region in the desired
output autocorrelation function can be included to improve the whitening process for long lags,
at the expense of allowing a higher level of short-term correlation. This is achieved by not including
the first autocorrelation lags in the gradient computation.
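For illustration, the update of equations 4.104–4.107 can be sketched as follows (toy channels, filter lengths, uniform weights W(τ) = 1 and step size are arbitrary choices of mine; this is a sketch of the technique, not the reference implementation of [158]):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 2-channel reverberant setup with short hypothetical channels
Ns, Lg, T = 6000, 10, 30               # samples, filter taps, max lag shaped
s = rng.standard_normal(Ns)
h = [np.array([1.0, 0.5, 0.3, -0.2, 0.1]),
     np.array([0.0, 0.9, 0.4, -0.3, 0.2])]
x = [np.convolve(s, hi)[:Ns] for hi in h]

g = [np.zeros(Lg), np.zeros(Lg)]
g[0][0] = 1.0                          # start as a pass-through of channel 1

def output(g):                         # composite output, eq. 4.104
    return sum(np.convolve(xi, gi)[:Ns] for xi, gi in zip(x, g))

def r(a, b, m):                        # estimate of E{a(n-m) b(n)}, m may be < 0
    return (np.dot(a[:Ns - m], b[m:]) if m >= 0
            else np.dot(a[-m:], b[:Ns + m])) / Ns

def cost(y):                           # sum of e(tau) over tau > 0, with r_dd = 0
    return sum(r(y, y, t) ** 2 for t in range(1, T + 1))

J0, mu = cost(output(g)), 1e-3
for _ in range(30):
    y = output(g)
    grad = [np.array([sum(r(y, y, t) * (r(xi, y, l - t) + r(xi, y, l + t))
                          for t in range(1, T + 1))
                      for l in range(Lg)]) for xi in x]        # eq. 4.107
    nrm = np.sqrt(sum(np.sum(gi ** 2) for gi in grad))
    g = [gi - mu * di / nrm for gi, di in zip(g, grad)]        # eq. 4.106
J_end = cost(output(g))                # weighted autocorrelation error decreases
```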
Experimental results, based on synthetic acoustic IRs, showed that reverberant speech processed
with 4-channel correlation shaping provided better audio quality than either 4-channel DS
processing or an unprocessed single channel.

As a final comment on SIMO equalization methods, it is interesting to notice that some
multichannel HOS methods can also be explained by the Liu and Dong [92] theorem, except that
they do not guarantee that the output is uncorrelated. For instance, the kurtosis maximization
algorithm used in the multichannel dereverberation algorithm discussed in section 4.6.1 and
described in [87] can also be explained as a way to decorrelate the composite output of the
equalizer bank y(n).

4.7.2.2 A blind MINT approach to multiple-channel blind equalization

A different approach, which cannot be explicitly explained by the Liu and Dong [92] theorem, was
proposed in [163] by Furuya et al. This is a promising algorithm, based on the blind MINT
approach. The conventional MINT requires the room impulse responses to calculate the inverse
filters, so it cannot recover speech signals when the room impulse responses are unknown.
However, as suggested by other SOS methods, the inverse filters can be blindly estimated from
the correlation matrix between the input signals, which can be observed.

The NL×NL correlation matrix of the received signals is defined by

R = E{X^T X}   (4.108)

where

X = [X_1 X_2 ... X_N]   (4.109)

and

X_i = [x_i(k), x_i(k − 1), ..., x_i(k − L + 1)]   (4.110)

where N is the number of channels, L = (J − 1)/(N − 1) is the inverse filter length, J is the impulse
response length, and E{·} is the statistical expectation.

If the input signal is statistically white⁷

E{s(k) s(k + n)} = δ(n)   (4.111)

and the microphone closest to the source is known, or estimated in practice by using the time
difference of arrival (TDOA), it can be shown⁸ [163] that the inverse filters w can be calculated as

w = R^{−1} v   (4.112)

where

w = [w_1(1), ..., w_1(L), ..., w_N(1), ..., w_N(L)]^T   (4.113)

and

v = [1, 0, ..., 0]^T   (4.114)

is an NL×1 vector.
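A toy realization of this closed-form blind MINT solution can be sketched as follows (short invented IRs; the closest-microphone hypothesis is encoded by giving h2 a leading delay, and R is estimated from the observed signals only):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical short IRs; microphone 1 is closest, so h2 starts with a pure delay
J_len, L, Ns = 6, 8, 40000              # IR taps, inverse filter taps, samples
h1 = np.r_[1.0, 0.5 * rng.standard_normal(J_len - 1)]
h2 = np.r_[0.0, 0.0, 0.5 * rng.standard_normal(J_len - 2)]
s = rng.standard_normal(Ns)             # white source (pre-whitening assumed)
x1 = np.convolve(s, h1)[:Ns]
x2 = np.convolve(s, h2)[:Ns]

# stacked received-signal vectors and their NL x NL correlation matrix (eq. 4.108)
X = np.array([np.r_[x1[k - L + 1:k + 1][::-1], x2[k - L + 1:k + 1][::-1]]
              for k in range(L - 1, Ns)])
R = X.T @ X / len(X)

v = np.zeros(2 * L)
v[0] = 1.0
w, *_ = np.linalg.lstsq(R, v, rcond=1e-8)   # eq. 4.112 (R is rank deficient)
w1, w2 = w[:L], w[L:]

# equalized response: should approximate a scaled impulse at lag zero
g_eq = np.convolve(w1, h1) + np.convolve(w2, h2)
peak_ratio = g_eq[0] ** 2 / np.sum(g_eq ** 2)
```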

In the situation requiring adaptation where impulse responses are frequently changed, the fol-
lowing recursive time-averaging is used to estimate the correlation matrix [91]

R̂_k = β R̂_{k−1} + (1 − β) X_k^T X_k   (4.115)

where R̂_k is the estimate of R at time k and β is the “forgetting factor”, that is, the weight of
the older estimate R̂_{k−1}. R̂_k is used in equation 4.112 so that

w = R̂_k^{−1} v.   (4.116)

The previous equation can be solved efficiently by using the FFT-based fast conjugate gradient
method described in [164], where a real-time dereverberation system was proposed.

7. A pre-whitening step as described in section 4.4 is assumed.
8. The notation employed here is consistent with the definitions used in the MINT theorem reported in section 3.2.3, while in [91] a different notation is adopted.

Figure 4.13: Signal flow of the method proposed by Furuya et al.

A real acoustic path was employed and the measured IRs were used only to verify the dereverberation
performance. The experiment setup was the following:

• the source signal was two reproductions of male and female speech;

• the number of directive microphones was 4;

• the sampling frequency was 12 kHz;

• the room size was 6.6 m × 4.6 m × 3.1 m, with a reverberation time of about 0.55 s;

• the distance from the source to the microphone was 3.8 m;

• the inverse filter length was 2048 taps.

For performance comparison, the reverberation curves and the dereverberation improvement
were calculated using the measured impulse response (where the impulse response length is
6600 taps). Figure 4.13 shows the real-time dereverberation structure proposed by Furuya et al.
in [164].

It is meaningful to point out that this is the only reverberation cancellation algorithm, among
the ones described so far, that was tested in a realistic environment.


The deconvolution based on inverse filtering does not improve the tail of the reverberation, because
impulse responses are always fluctuating in the real world and an estimation error of the
inverse filters is caused by the deviation of the correlation matrix averaged over a finite duration.
As a possible improvement, a hybrid reverberation cancellation/suppression method is proposed
in [91]. A modified spectral subtraction algorithm is cascaded with the blind MINT deconvolution
algorithm. Spectral subtraction estimates the power spectrum of the reverberation and
then subtracts it from the power spectrum of reverberant speech. Inverse filtering reduces early
reflection, which has most of the power of the reverberation, and then, spectral subtraction sup-
presses the tail of the inverse-filtered reverberation. Inverse filtering reduces the power of the
reverberation, so the nonlinear processing distortion of spectral subtraction is reduced using a
small subtractive power. The authors claim superior dereverberation results for every reverbera-
tion time. On the other hand, even if the algorithm is effective and robust in situations requiring
adaptation, the adaptation speed is still slow for practical applications.

4.8 HOS and SOS approaches, a comparison

According to Brillinger [165], the sample size needed to estimate the nth-order statistics of
a random process to prescribed values of estimation bias and variance increases almost
exponentially with the order n. This is why HOS-based blind deconvolution methods often exhibit
a slower rate of convergence in comparison to SOS-based ones. This can be of concern
in highly non-stationary environments, where an HOS-based algorithm might not have enough
time to track the statistical variations. On the other hand, when a method based on kurtosis
maximization is used, we are not necessarily trying to achieve an accurate estimate of kurtosis,
so performance in one case may not translate into performance in the other. HOS
methods based on the implicit non-linear processing approach (i.e. the Bussgang equalizer) are
less demanding from the complexity perspective, even if they might be prone to local minima
[166]. An advantage of HOS-based methods is that they can be employed in SISO systems,
while SOS methods require a multichannel framework; however, HOS methods require a
non-Gaussian received signal.


4.9 Other dereverberation cancellation techniques

4.9.1 HERB

Nakatani et al. proposed a single-microphone speech dereverberation technique called
harmonicity-based dereverberation (HERB) [89],[167]. The proposed dereverberation method is
based on the harmonicity of speech signals. A harmonic sound, s(n), is a quasi-periodic (QP)
signal, that is an approximately periodic signal in each local time region. The period gradually
changes with time, and it has different periods in different time regions. Long reverberation
makes the observed signal non periodic. Source signals at different times, s(n) and s(n − m),
and therefore with different periods, are mixed. As a consequence, periodicity in the received
reverberant signal x(n) is degraded. The HERB method estimates the dereverberation filter
that makes the observed signal more likely to be a speech signal.

A speech signal s(n) can be locally described as

s(n) = s_h(n) + s_n(n)   (4.117)

where s_h(n) is the harmonic component and s_n(n) the non-harmonic one (i.e. voiced and
unvoiced speech). The harmonic component, s_h(n), can be modelled as a periodic signal within
every short-time frame. When reverberation is added,

x(n) = h_d ∗ s_h(n) + h_r ∗ s_h(n) + h ∗ s_n(n) = x_hd + x_hr + x_n   (4.118)

where x_hd and x_hr are respectively the direct and the reverberant components of the harmonic
signal.

x_hd can be estimated by applying an adaptive harmonic filter to x(n), that is, a comb-like filter
capable of separating over time the dominant sinusoidal components (i.e. the fundamental
frequency f0 and its multiple harmonics). The output of this filter corresponds to a rough
estimate of the harmonic components of the direct signal in the observed signals.

An underlying assumption of this approach is that the separation of speech and reverberation is
possible since the speech periodicity and the reverb have different time scales; from this point
of view, therefore, there are similarities with the methods based on LP residual processing.


The dereverberation filter, W(k), is calculated as

W(k) = A( X̂(l, k) / X(l, k) )   (4.119)

where X(l, k) and X̂(l, k) are the discrete Fourier transforms (DFTs) of the observed reverberant
signal and of the output of the adaptive harmonic filter, respectively, at time frame l and
frequency bin k. A(·) is a function that calculates the weighted average of X̂/X for each k over
different time frames. This filter has been shown to approximate the inverse filter of the acoustic
transfer function between a speaker and a microphone.

In block form, as shown in Fig. 4.14, the HERB algorithm is:

1. the fundamental frequency F0 is estimated for each short-time frame by an iterative
algorithm;

2. the initial estimate of the direct harmonic signal included in x(n) is obtained for each
short-time frame with an adaptive harmonic filter;

3. the DFTs of x(n) and x̂(n) are calculated;

4. the dereverberation filter W(k) is estimated over time;

5. the inverse DFT of W(k) is calculated as w(n);

6. the estimated filter w(n) is convolved with the observed signal to dereverberate it.
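Step 4 (equation 4.119) can be mimicked with a heavily idealized frequency-domain sketch: circular convolution with a short invented IR, and the adaptive harmonic filter output modelled as the true direct frame plus estimation noise. Averaging the per-frame ratios then approaches 1/H(k):

```python
import numpy as np

rng = np.random.default_rng(3)

K, F = 256, 200                       # DFT length (circular model) and frames
h = np.r_[1.0, 0.4 * rng.standard_normal(31)]   # invented short room IR
H = np.fft.fft(h, K)

W_acc = np.zeros(K, dtype=complex)
for _ in range(F):
    d = rng.standard_normal(K)        # direct signal in this frame
    X_lk = np.fft.fft(d) * H          # observed frame X(l, k), circular model
    # harmonic-filter output X^(l, k): direct frame plus estimation noise
    X_hat = np.fft.fft(d) + 0.1 * np.fft.fft(rng.standard_normal(K))
    W_acc += X_hat / X_lk             # per-frame ratio; the noise averages out
W = W_acc / F                         # eq. 4.119 with uniform weights A(.)

# W approximates 1/H: applying it (circularly) to the IR gives ~ a unit impulse
eq_ir = np.fft.ifft(W * H).real
peak_ratio = eq_ir[0] ** 2 / np.sum(eq_ir ** 2)
```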

The conventional formulation of HERB does not provide an analytical framework within which
the dereverberation performance can be optimized. In [168], Nakatani et al. reformulated
HERB as a maximum a posteriori (MAP) estimation problem, in which the dereverberation
filter was determined as one that maximizes the a posteriori probability given the observed
signals. In real speech signals, there are certain non-harmonic components included, and the
adaptive harmonic filter cannot completely eliminate the reverberation. Therefore, repetitive
observations over a sufficiently long time duration are necessary to decrease estimation errors.
A very large training data set is required to calculate a correct dereverberation filter (i.e. more
than 60 minutes of speech data) under the assumption that the system is time-invariant [169].

This major drawback was investigated by Kinoshita et al. in [170], where a fast version of
the HERB algorithm, able to provide the estimation of the dereverberation filter with a greatly
reduced training data set (i.e. 1 minute long), was proposed.

Figure 4.14: Block diagram of the HERB algorithm

The faster convergence is obtained by using a noise reduction algorithm applied to the output
of the adaptive harmonic filter, x̂(n), by recalculating the dereverberation filter in each time
frame and by averaging the filter over
several frames to obtain the final filter. However, despite the good dereverberation quality, the
fast HERB is still too slow for practical applicability, especially considering that the system
must be time-invariant during convergence and that an unrealistic inverse filter length, on the
order of hundreds of thousands of taps [89], is used.

4.9.2 Homomorphic deconvolution

Homomorphic deconvolution has been investigated in [54],[171],[46]. By transforming signals
to the cepstral domain, convolutional noise sources can be turned into an additive disturbance
h_rc[m]:

y(n) = x(n) ∗ h(n) ⇔ y_rc[m] = x_rc[m] + h_rc[m]   (4.120)

where

z_rc[m] = F^{−1}{ log |F{z(n)}| }   (4.121)

is the real cepstrum of the signal z(n) and F is the Fourier transform. Speech can be considered
a “low-quefrency” signal, as x_rc[m] is typically concentrated around small values of m. The room
reverberation h_rc[m], on the other hand, is expected to contain higher-quefrency information.
The amount of reverberation can hence be reduced by appropriate lowpass “liftering” of y_rc[m],
that is, by suppressing high-quefrency information, or through peak picking in the low-quefrency
domain. Even if cepstrum-based techniques are popular in speech recognition, they generally
perform poorly, since speech and the acoustics cannot be clearly separated in the cepstral
domain. Furthermore, even if this separation could be achieved, estimation of the clean speech
signal is still problematic since equation 4.121 removes all phase information. Therefore, the
proposed algorithms can only be successfully applied in simple reverberation scenarios (i.e.
speech degraded by simple echoes). As a final consideration, this is a nonlinear technique that
cannot be described by linear dereverberation filters.
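The simple-echo case mentioned above can be reproduced in a few lines (delay, echo amplitude and the white stand-in source are arbitrary choices): the echo appears as a cepstral peak of height about a/2 at quefrency d, which lowpass liftering would then remove.

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, a = 4096, 100, 0.9
s = rng.standard_normal(n)            # stand-in for the source signal
y = s.copy()
y[d:] += a * s[:-d]                   # y(n) = s(n) * (delta(n) + a*delta(n-d))

def real_cepstrum(z):                 # eq. 4.121
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(z))))

rc = real_cepstrum(y)
# the echo contributes ~ a/2 at quefrency d (weaker terms appear at 2d, 3d, ...)
peak_q = 50 + int(np.argmax(np.abs(rc[50:n // 2])))

# lowpass "liftering": keep only low-quefrency coefficients (cepstrum is symmetric)
rc_lift = rc.copy()
rc_lift[50:n - 49] = 0.0
```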

4.10 Summary

In theory, blind deconvolution of the acoustic channel allows perfect dereverberation. However,
at the current state there are no techniques available to blindly estimate acoustic IRs in a realistic
environment. Therefore no blind dereverberation method based on acoustic IR identification
and inversion exists.

Mourjopoulos' observations [48] about non-blind approaches suggest that even if an almost
perfect estimate of the acoustic IR is available, a subtle change in the conditions would result in
very poor dereverberation. Therefore it might be argued that dereverberation based on identification
and inversion is a weak approach, unless the algorithm can track the system evolution
quickly enough.

However simulations based on the direct identification of the equalizer filter(s) seem more ef-
fective. Blind dereverberation techniques based on kurtosis maximization have been shown to
be practically usable for the reduction of the early reverberant part in a single channel dere-
verberation [90] algorithm. More consistent results have been claimed for the multichannel
extension proposed in [87]. Several methods implicitly based on the fundamental theorem
of multichannel equalization introduced by Liu and Dong in [92] have been proposed. Only
simulations based on synthetic data are available. Therefore it is difficult to evaluate their
performance in a realistic environment. An approach based on the blind MINT has been proposed
in [91] and a working real-time implementation, robust to IR variations, has been claimed.

As a general observation, it is difficult to draw conclusions on the comparative performance
of different dereverberation algorithms, since several dereverberation metrics are used. Furthermore,
the experiment results are often based on unrealistically short synthetic IRs. The
thermore, the experiment results are often based on unrealistically short synthetic IRs. The
restriction to one source is also a serious issue.

Chapter 5
Novel HOS based blind
dereverberation algorithms

This chapter describes new blind dereverberation algorithms based on a HOS approach. This
work has been inspired by the work of Gillespie [87] and it is structured in three main contri-
butions:

1. the formulation of a single channel dereverberation algorithm [172], proposed as an
enhancement of the kurtosis maximization method described in [87];

2. the formulation of a multichannel dereverberation method based on the previously
proposed approach [173];

3. the proposal of novel single and multichannel methods based on the natural gradient,
and of a new dereverberation structure that improves the speech and reverberation model
decoupling [174].

This work has appeared separately in three published conference papers [172],[173],[174].

5.1 A single channel maximum likelihood approach to blind audio dereverberation

The algorithm proposed by Gillespie et al. in [87], and described in section 3.7.2, was imple-
mented. In particular, it appeared reasonable to investigate the basic component of the algorithm
in its most simple form, the single channel time domain one.

Here, the single channel dereverberation framework described in section 3.2.1 will be considered,
where the acoustic path (the channel) is modelled as a linear time-invariant system
characterized by an IR, h(n). Therefore, the source signal, s(n), and the reverberant signal, x(n),
are linked by the equation

x(n) = h(n) ∗ s(n)   (5.1)

where ∗ denotes the discrete linear convolution.

Figure 5.1: Kurtosis maximization during adaptation

Figure 5.2: Example of misconvergence of the single channel time domain algorithm proposed
by Gillespie and Malvar. (a) system IR, (b) identified inverse IR, (c) equalized IR.

An estimate y(n) = ŝ(n) of the source signal s(n) is given by

ŝ(n) = y(n) = w(n) ∗ x(n) (5.2)

where w(n) is the inverse filter.

The main idea of the Gillespie method is that, by building a linear filter that maximizes the
kurtosis of the speech LP residual, it is theoretically possible to identify the inverse filter w(n)
and thus to equalize the system. This approach is blind since it requires only the evaluation
of the kurtosis of the LP residual of the system output. Conversely, as discussed in section
4.4, the use of the LP residual to decouple the harmonic structure of speech and reverberation
introduces ambiguity.


During tests performed with the single channel time domain version of the algorithm described
in [87], stability problems and unexpected results were observed. In some apparently unpre-
dictable conditions, the algorithm captures a harmonic component and creates a resonant spike
in the residual. Therefore even if kurtosis is smoothly maximized (Figure 5.1), the inverse filter
might converge to a highly resonant state (Figure 5.2). By creating isolated strong peaks in
the residual (thus making it sparser) kurtosis increases, but the result is irrelevant for dereverberation
purposes. This issue can be associated with the extreme sensitivity of kurtosis, γ2, to
“outlying” values. Examining the simplified expression of kurtosis for a zero-mean, unit-variance
RV y,

γ2 = E{y⁴} − 3,   (5.3)

it can be noticed that its value is greatly affected by the values of y greater than 1. Similar
criticisms of kurtosis have been raised in the context of Independent Component Analysis [175].
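This sensitivity is simple to demonstrate (sample size and outlier value below are arbitrary): a single outlying sample dominates the sample kurtosis of an otherwise Gaussian signal.

```python
import numpy as np

rng = np.random.default_rng(5)

def kurtosis(y):
    y = (y - y.mean()) / y.std()       # zero mean, unit variance
    return np.mean(y ** 4) - 3.0       # eq. 5.3

y = rng.standard_normal(10000)
k_clean = kurtosis(y)                  # ~ 0 for a Gaussian signal
y_out = y.copy()
y_out[0] = 30.0                        # one "outlying" value among 10000 samples
k_out = kurtosis(y_out)                # kurtosis explodes
```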

Let the random variable y be generated by filtering a random variable x:

y(n) = w^T(n) · x(n)   (5.4)

the derivative of the kurtosis expression is

∂γ2/∂w = E{ 4y³ ∂y/∂w } = 4 E{ y³ ∂(w^T x)/∂w } = 4 E{ y³ x },   (5.5)

which is therefore not bounded and can theoretically diverge.

In order to minimize the sensitivity of the algorithm to “outlying” values, a maximum likelihood (ML) approach was proposed. Maximum likelihood is the procedure of finding the value of one or more parameters that maximizes the likelihood of the observed data.

Assuming an FIR filter so that

y(n) = wT (n) · x(n) (5.6)

the idea is to build w in order to achieve any desired probability density function of the output


y:

max_w E{log P(y)} = max_w E{log P(w^T x)}.    (5.7)

Defining the utility function as

J = E {log(P (y))} (5.8)

its gradient is

∂J/∂w = E{∂ log P(y)/∂w} = E{φ(y) ∂(w^T x)/∂w} = E{φ(y) x}    (5.9)

where

φ(y) = (1/P(y)) ∂P(y)/∂y    (5.10)

therefore the update equation to maximize J is

w(n + 1) = w(n) + µ · ∇J(w(n)) = w(n) + µ · E {φ(y)x} (5.11)

The probability density function of y is chosen in order to have high kurtosis and bounded
derivative. A probability density function with these properties is

P(y) = 1/cosh(y)    (5.12)

giving

φ(y) = (1/P(y)) ∂P(y)/∂y = cosh(y) · ∂/∂y (1/cosh(y)) = − tanh(y)    (5.13)

where cosh and tanh are respectively the hyperbolic cosine and hyperbolic tangent functions.


Figure 5.3: A single channel time-domain adaptive algorithm for maximizing kurtosis of the
LP residual.

The update equation therefore becomes

w(n + 1) = w(n) − µ · E{tanh(y)x}.    (5.14)

Thus the hyperbolic tangent function replaces the kurtosis term. While the latter is unbounded and dominated by a cubic term, the former is bounded and insensitive to outliers. The stochastic gradient version of equation 5.14 is, of course, a Bussgang-type equalizer. Therefore, this derivation also shows that a Bussgang algorithm obtains blind deconvolution by maximizing the kurtosis.
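A minimal batch version of the update in equation 5.14 can be sketched as follows; the function and variable names are illustrative assumptions, and a Laplacian signal stands in for the supergaussian LP residual:

```python
import numpy as np

def ml_batch_update(w, x_res, mu=0.1):
    """One batch update of equation 5.14: w <- w - mu * E{tanh(y) x}.

    w is the current inverse filter and x_res the LP residual of the
    reverberated signal; names and structure are illustrative only.
    """
    P = len(w)
    y = np.convolve(x_res, w)[:len(x_res)]     # y(n) = w(n) * x(n)
    grad = np.array([np.mean(np.tanh(y[p:]) * x_res[:len(x_res) - p])
                     for p in range(P)])       # sample mean of tanh(y(n)) x(n-p)
    w_new = w - mu * grad
    return w_new / np.linalg.norm(w_new)       # per-step normalization, as in [87]

rng = np.random.default_rng(1)
x_res = rng.laplace(size=2000)                 # supergaussian stand-in for a residual
w0 = np.zeros(50); w0[0] = 1.0                 # w = [1, 0, ..., 0] initialization
w1 = ml_batch_update(w0, x_res)
```

In batch mode this update would simply be iterated over the whole file until convergence, which matches the estimation of the expectation by a sample mean.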

5.1.1 Experimental results

A single-channel time-domain algorithm based on equation 5.14 was implemented and employed to equalize the IR of a real room. The proposed method was experimentally compared to the single-channel time-domain kurtosis maximization algorithm proposed in [87] and discussed in section 4.6.1. The two algorithms share the same structure, shown in Fig.5.3, but have different cost functions.

5.1.1.1 Method used in the evaluation:

• all the data reported in the experiments are at a sampling rate of 22050 Hz;

• the clean speech signal, s(n), is a 7.5 second anechoic male speech sample; the source material comprised two reproductions of male and female speech on different subjects;

• the reverberated input file, x(n), has been created by convolving the clean speech signal,
s(n), with a 6000 tap IR, h(n), measured from a real room (T60 of about 400 ms);

• the algorithms produce as an output the estimate of the clean speech signal, y(n), and
calculate the convolution between the original IR, h(n), and the identified inverse IR,
w(n), to evaluate the quality of the processing; in fact, this convolution provides the
equalized IR, and by measuring its DRR, defined in section 3.3.2.1, the dereverberation
performance can be measured.

The following parameters and procedure have been used for the algorithms:

• LPC analysis order = 26. This value allows tracking of up to 13 spectral peaks, thus offering sufficient complexity for speech modeling;

• LPC analysis window length = 25 ms. This allows speech to be considered stationary within the analysis window;

• FIR adaptive filter length = 1000 samples. Theoretically, an inverse filter w(n) with a number of taps greater than the length of the IR should be used. However, a shorter filter is still expected to reduce the intensity of the early reflections. No substantial improvement was observed for longer filters, despite the increased computational cost;

• initial condition of the FIR adaptive filter w = [1, 0, ..., 0]. This allows the filter to start
the adaptation from a meaningful initial condition (the first tap set to one enables the
signal to flow unaffected at the beginning of the adaptation);

• the filter identified at the end of the adaptation has been used as the identified inverse
filter w(n);

• the adaptive filter, w(n), has been normalized at every adaptation step as suggested in
[87];

• to improve the speed and accuracy of convergence, both algorithms were implemented in batch mode1 and the LP residual, x̃(n), was normalized to unit variance. This frees the adaptive algorithms from following the envelope of the residual;
1
Instead of using a “sample by sample” adaptation, the filter coefficients are updated only after the whole file has been processed. This provides a better estimate of the gradient, assuring a more stable convergence.


Figure 5.4: (a) Echogram of the original IR (reference), DRR=-2.9 dB. (b) Echogram of the equalized IR obtained by the single-channel time-domain kurtosis maximization algorithm as proposed by Gillespie and Malvar, DRR=-2.7 dB. (c) Echogram of the equalized IR obtained by the single-channel time-domain kurtosis maximization algorithm based on equation 5.14, DRR=-1.1 dB.

• the adaptation parameter, µ, was set to 0.1 for the proposed method and to 7·10−4 for the algorithm proposed in [87]. These values were heuristically chosen in order to achieve convergence and assure stability2 ;

• the algorithm was left free to adapt also during unvoiced or silent periods as suggested in
[87].

The echograms of the original and equalized IRs are shown in Fig.5.4. As explained in section 2.6.1, the echogram simplifies the IR analysis and highlights the attenuation of the most energetic reflections. The proposed method shows, as reported in Fig. 5.5, better dereverberation performance and faster convergence with respect to the original time-domain kurtosis maximization algorithm introduced in [87].

2
Since it is a batch algorithm, the choice of the adaptation parameter is less crucial than in an on-line algorithm.


Figure 5.5: Convergence of the single-channel time-domain kurtosis maximization algorithm based on equation 5.14 (solid line) and of the single-channel time-domain kurtosis maximization algorithm as proposed by Gillespie and Malvar (dashed line).

5.2 A multi-channel maximum likelihood approach to dereverberation

As in [87], the previous algorithm was extended in [173] to a multichannel structure. The
advantages of a MINT-like structure were expected. This implies:

• a solution to the instability issue associated with the inversion of non-minimum phase
transfer functions, by ensuring that the equalizer will be a set of FIR filters if the channel
transfer functions are FIR;

• shorter filters for each channel with respect to the single-channel case, and therefore better statistical properties in their estimation, less computational demand and lower memory requirements.

It must be noticed that in [87] there is no discussion of the motivation behind this extension. The multichannel structure is simply justified as the direct extension of the single-channel system.

For an N -channel system the update equation becomes

wi (n + 1) = wi (n) − µ · E {tanh(ỹm )x̃i } (5.15)


Figure 5.6: Two-channel system composed of two single-channel structures.

where x̃i = [x(n), x(n − 1), ..., x(n − p)] is the output vector of the i-th LP analysis filter and
ỹm is defined as

ỹm(n) = (1/N) Σ_{i=1}^{N} ỹi(n)    (5.16)

where ỹi(n) is the output of the i-th maximization filter at time n.

The filters are jointly optimized to maximize the likelihood of the output ỹm and the dereverberator output y(n) is defined as

ŝ(n) = y(n) = Σ_{i=1}^{N} wi(n) ∗ xi(n)    (5.17)

where xi (n), wi (n) are respectively the i-th observation and equalizer of the corresponding
source-to-receiver channel.

Extending our results from the single-channel ML technique, we expect the multi-channel structure to benefit from improved stability and less noise in the convergence compared to the kurtosis approach. The ambiguity introduced by the use of the LP residual to decouple the harmonic structure of speech and reverb has been mitigated by spatially averaging the LP analysis coefficients, as proposed in [117].
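One update cycle of equation 5.15 can be sketched as below. The array shapes, the joint normalization and the Laplacian test data are assumptions made for illustration only:

```python
import numpy as np

def multichannel_ml_update(W, X_res, mu=1e-5):
    """One update of equation 5.15 for an N-channel system (illustrative sketch).

    W     : (N, P) array, one inverse filter per channel
    X_res : (N, L) array, LP residual of each channel's observation
    The joint normalization at the end is one way to handle the gain
    ambiguity; the thesis normalizes the filters at every cycle.
    """
    N, P = W.shape
    L = X_res.shape[1]
    # per-channel outputs and their average y_m(n) (equation 5.16)
    Y = np.stack([np.convolve(X_res[i], W[i])[:L] for i in range(N)])
    y_m = Y.mean(axis=0)
    W_new = np.empty_like(W)
    for i in range(N):
        grad = np.array([np.mean(np.tanh(y_m[p:]) * X_res[i, :L - p])
                         for p in range(P)])   # E{tanh(y_m(n)) x_i(n-p)}
        W_new[i] = W[i] - mu * grad            # equation 5.15
    return W_new / np.linalg.norm(W_new)

rng = np.random.default_rng(2)
X_res = rng.laplace(size=(4, 1000))            # four hypothetical channel residuals
W0 = np.zeros((4, 20)); W0[:, 0] = 1.0
W1 = multichannel_ml_update(W0, X_res)
```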


5.2.1 A comparison between ML dereverberation and the MINT

To highlight the affinity of the above algorithm with the MINT, a two-channel system has been
used to equalize a speech signal sampled at 22050 Hz that has been processed with the following
two long echoes:

h1(n) = δ(n) − 0.8 δ(n − 600)
h2(n) = δ(n) − 0.8 δ(n − 1000)    (5.18)

The inverse filters have been calculated by the MINT inverse formula given in [9] with the explicit use of the impulse responses of the system, and this result has been compared to the filters that have been blindly identified by the multi-channel structure. This second approach directly estimates the inverse filters, which is statistically better than estimating the impulse responses of the system and then inverting them. Note that the lengths of the inverse filters are comparable to the length of the longest delay present. This is in contrast to the single-channel case, which would require a much longer filter. The results are shown in Fig.5.7, where the three leftmost plots relate to the MINT method while the three rightmost plots show the algorithm performance. The inverse filters have similar placement of the taps but different gains; both inverse filters provide an equalization for the system (Figures 5.7c, 5.7f). Since we do not enforce that the FIR order be the minimum required, the solution is non-unique and only needs to satisfy the Bezout identity.
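The Bezout identity underlying this comparison can be verified numerically with a least-squares solution of the MINT system. The sketch below uses a scaled-down version of the channels in equation 5.18 (delays 6 and 10 instead of 600 and 1000), purely to keep the example small; it is not the inverse formula of [9]:

```python
import numpy as np

def conv_matrix(h, T):
    """Tall convolution matrix: conv_matrix(h, T) @ w == np.convolve(h, w)."""
    C = np.zeros((len(h) + T - 1, T))
    for j in range(T):
        C[j:j + len(h), j] = h
    return C

g = 0.8
h1 = np.zeros(11); h1[0] = 1.0; h1[6] = -g     # scaled-down h1(n)
h2 = np.zeros(11); h2[0] = 1.0; h2[10] = -g    # scaled-down h2(n)

T = 11                                         # per-channel inverse filter length,
                                               # comparable to the longest delay
C = np.hstack([conv_matrix(h1, T), conv_matrix(h2, T)])
delta = np.zeros(C.shape[0]); delta[0] = 1.0

w, *_ = np.linalg.lstsq(C, delta, rcond=None)  # minimum-norm exact solution
w1, w2 = w[:T], w[T:]

# Bezout identity: h1*w1 + h2*w2 should be (close to) a unit impulse
equalized = np.convolve(h1, w1) + np.convolve(h2, w2)
```

Because the two channels share no common zeros, the stacked system is consistent and the least-squares solution equalizes the system exactly, with FIR inverse filters no longer than the longest delay.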

5.2.2 Experimental results

An eight-channel system that uses a one-point sample mean version of the adaptation rule in (5.15) was also evaluated. A linear array composed of eight omni-directional microphones was employed in a room with a T60 of about 400 ms to measure the impulse responses of the corresponding source-to-receiver acoustic paths. To acquire these impulse responses, the technique reported in [44] was applied. A 4 cm spacing between microphones was chosen for the first experiment and 45 cm for the second, and to simulate a generic setup, the array was not placed orthogonal to the loudspeaker. The minimum microphone-to-loudspeaker distance was 2.5 m. The impulse responses measured for the 4 cm array configuration did not exhibit significant delay misalignment among the channels, while the impulse responses for the 45 cm array were not aligned. The algorithm was applied to male and female speech files sampled


Figure 5.7: Comparison of the MINT and multi-channel equalizer. (a), (b) Inverse filters cal-
culated by MINT. (d), (e) Inverse filters calculated by the multi-channel algorithm.
(c), (f) Equation (4.25) evaluated in both cases.

at 22050 Hz and convolved with the resulting impulse responses (6000 tap long). The results
reported here are for the same speech file used in the experiment described in section 5.1.1.

The following parameters and initializations were used for the algorithm. µ was set to 10−5 , the
LP analysis order to 26 with an LP analysis frame length of 25 ms. The filters were T = 1000
taps long and initialized to wi = [1, 0, 0, 0, 0, ...]. This allows the filters to start the adaptation
from a meaningful initial condition (the first tap set to one enables the signal to flow unaffected
at the beginning of the adaptation). To obtain a more uniform convergence, the residual of each
channel was normalized to a zero mean, unit variance process. The algorithm was left free to
adapt also during unvoiced or silent periods as suggested in [87].

To solve the problem of gain uncertainty, a normalization of the filter coefficients was performed at every update cycle [87]. It should be noted that normalizing all the channels makes the problem over-constrained if we wish to take advantage of the Bezout inverse solution.

The algorithm was found to provide dereverberation in the case of the 4 cm spacing, but not in
the 45 cm configuration, due to the time-misalignment among the channels. After investigation,
it was understood that a meaningful convergence cannot be achieved for this algorithm when
the channels are time-misaligned. Therefore the algorithm cannot identify the inter-channel
delay. Note that this problem equally applies to the use of kurtosis within this framework [87].
Conversely, when the impulse responses were aligned manually, the algorithm converged for


Figure 5.8: (a) Reference echogram relating to the shortest source-to-receiver path, DRR=-2.9 dB. (b) Echogram of the equalized IR obtained by the 8-channel delay-sum beamformer, DRR=-0.1 dB. (c) Echogram of the equalized IR obtained by the 8-channel ML dereverberator, DRR=2.3 dB. The ML multichannel structure provides improved dereverberation with respect to the delay and sum beamformer.

the 45 centimeter setup, and provided dereverberation, although its operation was no longer blind. Figure 5.8(a) shows the echogram of the original impulse response relating to the shortest source-to-receiver path. Figure 5.8(c) shows the equalized impulse response obtained with (5.15), in the case of a one-point sample mean.

The process of alignment is the same as that required for a delay and sum beamformer. In this sense it is worth observing that the algorithms proposed here and in [87] require a preprocessing stage. A large amount of dereverberation is already achieved by the delay and sum beamformer, which however does not produce a consistent attenuation of the isolated early reflections. Figure 5.8(b) shows the echogram of a delay and sum beamformer using the same delays used to align the channels in the proposed method. The performance of the dereverberation algorithm reported in Fig.5.8 has been evaluated by the DRR defined in section 3.3.2.1.
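A delay and sum beamformer and a DRR measure of the kind used in this evaluation can be sketched as follows. The DRR here is the common direct-to-reverberant energy ratio and may differ in detail from the definition given in section 3.3.2.1; all names and parameters are illustrative:

```python
import numpy as np

def delay_and_sum(X, delays):
    """Delay-and-sum beamformer: advance each channel by its delay and average.

    X      : (N, L) multichannel recording
    delays : per-channel integer delays in samples (assumed known here; in
             practice they come from the time-alignment preprocessing stage)
    """
    N, L = X.shape
    out = np.zeros(L)
    for i in range(N):
        d = delays[i]
        out[:L - d] += X[i, d:]
    return out / N

def drr_db(h, direct_len=50):
    """Direct-to-reverberant ratio of an impulse response, in dB.

    Assumes the common definition (direct-path energy over the remaining
    energy); the exact definition used in the thesis is in section 3.3.2.1.
    """
    peak = np.argmax(np.abs(h))
    lo, hi = max(0, peak - direct_len // 2), peak + direct_len // 2
    e_direct = np.sum(h[lo:hi] ** 2)
    e_reverb = np.sum(h ** 2) - e_direct
    return 10.0 * np.log10(e_direct / e_reverb)

# two copies of the same source, the second delayed by 3 samples
rng = np.random.default_rng(3)
s = rng.standard_normal(500)
X = np.stack([s, np.concatenate([np.zeros(3), s[:-3]])])
aligned = delay_and_sum(X, [0, 3])

h = np.zeros(200); h[0] = 1.0; h[50:] = 0.01   # toy IR: direct path + weak tail
drr = drr_db(h)
```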

The performance of the algorithm has been validated by repeating the previous experiment with the same parameters but different speech signals of similar length (between 9 and 10 seconds), from different speakers and in different languages. The results obtained by using the filters calculated after a single run of the algorithm (therefore after 9-10 seconds of adaptation) are reported in table 5.1.

The average DRR improvement with respect to a delay and sum beamformer (which itself provides 2.8 dB of improvement) is 2.3 dB. Therefore the total average DRR improvement due to the dereverberation algorithm is 5.1 dB.


Signal DRR dB
Male 1 (Italian) 1.9
Male 2 (Portuguese) 2.0
Male 3 (English) 2.2
Male 4 (Italian) 2.5
Female 1 (Portuguese) 2.3
Female 2 (Russian) 2.1
Female 3 (Hebrew) 3.0
Female 4 (Dutch) 2.8

Table 5.1: DRR improvement in dB in respect to a DS beamformer.

5.3 Blind dereverberation algorithms based on the natural gradient

5.3.1 The Bussgang algorithm

The previously proposed algorithms rely implicitly on a Bussgang-type algorithm. The Bussgang algorithm, in fact, can be treated as a constrained maximum likelihood problem for an i.i.d. sequence y(n), where the usual normalization term is neglected. Hence a constraint, such as the previously proposed normalization scheme, must be provided to fix the gain.

J(w) = E{− log P (y(n))} (5.19)

where y(n) = w ∗ x(n).

Differentiating and dropping the expectation 3 gives:

∂J/∂w(p) = −f(y(n)) x(n − p)    (5.20)

where f(y) = ∂/∂y log p(y) is the Bussgang non-linearity. This gives the Bussgang algorithm:

w(p) ← w(p) − µ ∂J/∂w(p) = w(p) + µ f(y(k)) x(k − p).    (5.21)

This approach is appealing for its simplicity, but it may be slow to converge, with consequently poor equalization [134].

3
Therefore a one-point sample mean is used, as in the LMS algorithm.

Figure 5.9: Comparison of the adaptation behavior of the NGA and the time domain kurtosis maximization proposed by Gillespie and Malvar when they are applied to a supergaussian input filtered by a single echo with a delay of 50 samples and a reflection gain of -1. Equalizer order P=100. Perfect equalization is not achieved since a truncated equalizer is used.
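A one-point sample-mean version of the update in equation 5.21, with f(y) = −tanh(y) and the gain-fixing normalization, can be sketched as follows (illustrative code, not the implementation used in the experiments):

```python
import numpy as np

def bussgang_equalize(x, P=20, mu=1e-3):
    """Stochastic-gradient (one-point sample mean) Bussgang equalizer, eq. 5.21.

    f(y) = d/dy log p(y) = -tanh(y) for the 1/cosh density; the per-sample
    normalization implements the gain constraint discussed above.
    """
    w = np.zeros(P); w[0] = 1.0
    y = np.zeros_like(x)
    for k in range(P - 1, len(x)):
        frame = x[k - P + 1:k + 1][::-1]        # [x(k), x(k-1), ..., x(k-P+1)]
        y[k] = w @ frame
        w = w + mu * (-np.tanh(y[k])) * frame   # w(p) <- w(p) + mu f(y(k)) x(k-p)
        w = w / np.linalg.norm(w)               # gain-fixing constraint
    return w, y

rng = np.random.default_rng(4)
x = rng.laplace(size=3000)                      # supergaussian test input
w_hat, y_out = bussgang_equalize(x)
```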

5.3.2 Natural/Relative Gradient

The natural/relative gradient, discussed in section 4.6, overcomes these limitations by providing isotropic convergence independently of the model parametrization and of the dependencies within the signals being processed, and it extends the usability of Bussgang-type methods to non-stationary environments.
type methods to non stationary environments.

The usual natural/relative gradient form for the Bussgang algorithm update is [134]:

y(n) = Σ_{p=0}^{P} w(p) e(n − p)    (5.22)

u(n) = Σ_{m=0}^{P} w(P − m) y(n − m)    (5.23)

wnew(p) = w(p) + µ [w(p) + f(y(n − P)) u(n − p)]    (5.24)

where P is the equalizer order and f (.) the Bussgang nonlinearity (the − tanh(.) or the − sgn(.)
functions will be assumed).
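Equations 5.22-5.24 translate almost directly into code. The sketch below recomputes y(k) and u(k) over a window at each call for clarity; a practical implementation would keep running buffers, and all names are illustrative:

```python
import numpy as np

def nga_step(w, e, n, mu=1e-4, f=lambda v: -np.tanh(v)):
    """One NGA update (equations 5.22-5.24); a didactic sketch, not optimized.

    w : equalizer coefficients w(0..P);  e : prewhitened input;
    n : current time index (must satisfy n >= 3P so all indices are valid).
    """
    P = len(w) - 1
    # equation 5.22: y(k) = sum_p w(p) e(k-p), over the window k = n-2P .. n
    y = {k: sum(w[p] * e[k - p] for p in range(P + 1))
         for k in range(n - 2 * P, n + 1)}
    # equation 5.23: u(k) = sum_m w(P-m) y(k-m), for k = n-P .. n
    u = {k: sum(w[P - m] * y[k - m] for m in range(P + 1))
         for k in range(n - P, n + 1)}
    # equation 5.24: w_new(p) = w(p) + mu * (w(p) + f(y(n-P)) u(n-p))
    return np.array([w[p] + mu * (w[p] + f(y[n - P]) * u[n - p])
                     for p in range(P + 1)])

rng = np.random.default_rng(5)
e = rng.laplace(size=400)
w = np.zeros(6); w[0] = 1.0
w_next = nga_step(w, e, n=200)
```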

This algorithm will be addressed for simplicity as the Natural Gradient Algorithm (NGA).

Figure 5.10: (a) Diagram of the time domain dereverberation algorithm proposed by Gillespie and Malvar (forward structure). (b) Diagram of the proposed model (reversed structure).

A comparison of the performance of the time domain kurtosis maximization algorithm proposed in [87] and of the NGA is provided in Fig.5.9, where a reverberated supergaussian white signal is equalized. The two algorithms are evaluated by the Direct to Reverberation Ratio. The NGA is a better candidate for time-domain adaptive equalization schemes.

5.3.3 Dereverberating speech

The application of such equalization schemes to blind dereverberation using the structure in Fig.5.10(a), called from now on the “forward structure”, is relatively easy and consists of LPC pre-whitening followed by the equalization algorithm reported in equations 5.22, 5.23 and 5.24. However we will show that better results can be achieved by using the structure shown in Figure 5.10(b), called from now on the “reversed structure”.

5.4 A novel dereverberation structure that improves the speech and reverberation model decoupling

5.4.1 On the correct structure for a single channel dereverberator. The reversed structure.

The order of two linear filters can be swapped only if they are time invariant. Since the vocal
tract filter is not stationary, the forward and the reversed structure will lead to different results.
The residual e(n) calculated by the forward structure is, as shown in Fig.5.11(a), a function of multiple quasi-stationary blocks, although it is being whitened by only a single LPC filter. This is particularly problematic in the dereverberation setting since the duration over which speech is quasi-stationary is usually significantly less than the room reverberation time.

Figure 5.11: (a) In the forward structure, the residual is a function of multiple LPC filters. (b) In the reversed structure, the residual is a function of one LPC filter.

By performing
the dereverberation before the LP analysis (reversed structure), the modelled residual e(n) is,
as shown in Fig.5.11(b), only a function of one quasi-stationary block. In more detail, if the LP
pre-whitening is performed before the dereverberation (forward structure), the residual e(n) is:

e(n) = Σ_p w(p) y(n − p)    (5.25)

and

y(n) = Σ_l al(n) x(n − l)    (5.26)

and therefore

e(n) = Σ_p w(p) Σ_l al(n − p) x(n − l − p).    (5.27)

Thus e(n) is a function of a number of different LPC coefficients al(n − p), evaluated at multiple time instants.

In contrast, by using the reversed model, which performs the LP pre-whitening after the dereverberation, the residual e(n) is a function of a single LPC filter:

e(n) = Σ_l al(n) y(n − l)    (5.28)

and

y(n) = Σ_p w(p) x(n − p).    (5.29)


Therefore the correct relationship between the LP residual e(n) and x(n) is:

e(n) = Σ_l al(n) y(n − l) = Σ_l al(n) Σ_p w(p) x(n − l − p)    (5.30)

where al(n) is the l-th time-varying LPC filter coefficient at time n. Note that e(n) is only a function of LPC coefficients associated with time n.
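The fact that a time-varying filter does not commute with a time-invariant one, which is the core of the argument above, can be checked numerically. The toy filters below are illustrative stand-ins for the LPC and room filters:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(400)
w = np.array([1.0, -0.5, 0.25])                 # a time-invariant filter (like w)

def tv_filter(sig):
    """A toy time-varying filter: LPC-like coefficients a_l(n) switch every 100 samples."""
    a_blocks = [np.array([1.0, 0.8]), np.array([1.0, -0.8])]
    out = np.zeros_like(sig)
    for n in range(len(sig)):
        a = a_blocks[(n // 100) % 2]            # a_l(n) depends on the block of n
        for l in range(len(a)):
            if n - l >= 0:
                out[n] += a[l] * sig[n - l]
    return out

def ti_filter(sig):
    return np.convolve(sig, w)[:len(sig)]

forward = ti_filter(tv_filter(x))               # time-varying filter first
reversed_order = tv_filter(ti_filter(x))        # time-invariant filter first

# The outputs agree inside each quasi-stationary block but differ at the
# block boundaries, where a_l(n - p) != a_l(n).
max_diff = np.max(np.abs(forward - reversed_order))
```

Inside the first block the two orderings agree exactly, while around every coefficient switch they diverge, mirroring the difference between equations 5.27 and 5.30.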

This observation led to a new algorithm based on the natural gradient that exploits the reversed
structure.

5.4.2 A causal normalized NGA dereverberation algorithm based on the reversed structure

For the moment it is assumed that al (n) are known for all n and l.

The probability model for the reversed structure is:

J(w) = −E{log p(x|w)}
     = −log ||∂y/∂x|| − E{log p(y|w)}
     = −log ||∂y/∂x|| − log ||∂e/∂y|| − E{log p(e|w)}
     = −(1/2π) ∫_{−π}^{π} log |W(ω)| dω − E{log p(e(n))} + constant    (5.31)

where the expression − log ||∂e/∂y|| has been replaced with “constant” since it does not depend
on w and so will vanish when differentiated.

Differentiating this gives:

∂J/∂w(p) = −h(−p) − f(e(n)) ∂e(n)/∂w(p)
         = −h(−p) − f(e(n)) Σ_l al(n) x(n − p − l)    (5.32)

where h(p) denotes the impulse response function for the inverse of w (i.e. h ∗ w = δ0 ) and


f (.) a nonlinearity (the − tanh(.) or the − sgn(.) functions will be assumed).

Now the Bussgang version (standard gradient) of the algorithm is:

wnew(p) = w(p) − µ ∂J/∂w(p).    (5.33)

It can be shown, as reported in the derivation of equation A.33 in the appendix, that an equivariant, normalized form of the previous equation is:

wnew(p) = w(p) + µ [w(p) + f(e(n)) Σ_l al(n) Σ_m w(m) y(n − p + m − l)].    (5.34)

Finally, the causality problem, as in [136], can be addressed. First the dereverberation filter is
constrained to be a causal FIR:

y(n) = Σ_{p=0}^{P} w(p) x(n − p)    (5.35)

then the update is delayed by P samples:



wnew(p) = w(p) + µ [w(p) + f(e(n − P)) Σ_l al(n − P) Σ_m w(m) y(n − p + m − l − P)]    (5.36)

and an auxiliary variable u(n) is introduced:

u(n) = Σ_{m=0}^{P} w(P − m) y(n − m).    (5.37)

The simplified update rule for the reversed structure (equation A.34 in the appendix) is:

wnew(p) = w(p) + µ [w(p) + f(e(n − P)) Σ_{l=0}^{L} al(n − P) u(n − p − l)].    (5.38)

It is of interest to compare (5.35), (5.37), (5.38) to the update equations for the forward structure, (5.22), (5.23), (5.24). The update equations for the reversed structure are more complex, as an additional step is required to calculate the last term in (5.38). However, in the forward structure, an additional filtering step must be calculated outside the adaptation loop to obtain the speech residual. Furthermore, the reversed structure makes replicating the dereverberation filter unnecessary, since it acts directly on the reverberated signal.
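Equations 5.35, 5.37 and 5.38 can be collected into a single update step. The sketch below assumes, as in the text, that the LPC coefficients are known; all names are illustrative and buffers are recomputed instead of kept between calls:

```python
import numpy as np

def reversed_update(w, x, a_of, n, mu=1e-5, f=lambda v: -np.sign(v)):
    """One update of the reversed-structure algorithm (eqs. 5.35, 5.37, 5.38).

    w    : causal FIR dereverberation filter w(0..P)
    x    : reverberated observation
    a_of : callable returning the LPC coefficients [a_0, ..., a_L] at a
           given time index (a_0 = 1; assumed known, as in section 5.4.2)
    """
    P = len(w) - 1

    def y(k):                                   # equation 5.35
        return sum(w[p] * x[k - p] for p in range(P + 1) if 0 <= k - p)

    def u(k):                                   # equation 5.37
        return sum(w[P - m] * y(k - m) for m in range(P + 1))

    a = a_of(n - P)                             # coefficients delayed by P
    e = sum(a[l] * y(n - P - l) for l in range(len(a)))   # residual, eq. 5.28
    fe = f(e)
    # equation 5.38
    return np.array([w[p] + mu * (w[p] + fe *
                     sum(a[l] * u(n - p - l) for l in range(len(a))))
                     for p in range(P + 1)])

rng = np.random.default_rng(7)
x = rng.laplace(size=500)
a_of = lambda k: np.array([1.0, -0.5])          # hypothetical fixed LPC pair
w = np.zeros(8); w[0] = 1.0
w_next = reversed_update(w, x, a_of, n=100)
```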

5.4.3 Results. The reversed vs the forward structure.

In this section we will show that the reversed structure, by respecting the correct ordering of
the stationary (room response) and non-stationary filters (LPC filters), can achieve better dereverberation performance than the forward structure. In order to focus only on the change in structure, a toy non-blind example is used in the first experiment. In the second, the proposed algorithm is applied to a speech signal that has been convolved with an impulse response measured from a real room.

The algorithms, as before, were evaluated by the DRR criterion. The sampling frequency for
all the signals was 22.05 kHz. The f (.) nonlinear function in (5.38) was set to − sgn(.).

5.4.3.1 Toy non-blind example

A unit-variance, supergaussian4 , white noise was filtered by alternating every 25 ms two static FIR filters with impulse responses g1 = [1, 0.5, 0.5, 0.5, 0.5] and g2 = [1, 0, 0, 0, 0]. The resulting signal, s(n), was then reverberated by the single-echo impulse response h(n) = δ(n) − δ(n − 275). To minimize amplitude fluctuations, g1 and g2 were chosen to have stable inverses and similar output power.

The algorithms for the forward and reversed structure, with adaptation parameter µ = 10−5 and given LPC coefficients, were used to recover s(n) from the reverberated signal. Figures 5.12(a1) and 5.12(a2) show the dramatic difference in their dereverberation performance. While the reversed structure can correctly dereverberate the input signal, this does not happen for the forward structure.

If the same experiment is repeated with g2 = g1, that is, if the source filter is time invariant, no difference in the results, as shown in Fig.5.12(b1) and Fig.5.12(b2), is appreciable.

What we see in the first case is solely due to the time variance of the source filter.
4
A supergaussian signal is used to mimic the distribution of the speech residual.


Figure 5.12: Results of the toy non-blind example described in section 5.4.3.1. Dereverbera-
tion performance with a time variant source filter: (a1) forward structure; (a2)
reversed structure. While the reversed structure can correctly dereverberate the
input signal, this does not happen for the forward structure. Dereverberation
performance with a time invariant source filter: (b1) forward structure; (b2) re-
versed structure. There is no appreciable difference in the performance of the two
structures.


Figure 5.13: Echogram of the original impulse response (above), DRR=-2.9 dB, and of the equalized impulse response obtained with the reversed structure (below), DRR=-0.7 dB.

5.4.3.2 Single channel blind dereverberation of real speech

The experiment reported in section 5.1.1 was repeated by using the algorithm proposed in (5.38) with the same settings (except µ = 0.5 · 10−5) and data. In order to achieve better convergence, it is important to prevent the algorithm from tracking the speech amplitude variations. This can be obtained by dividing the LPC coefficients and the estimated residual by the residual's standard deviation, and then using this normalized data in (5.38).

As in section 5.1.1, a 1000-tap dereverberation filter was used. As previously observed, an infinitely long filter would theoretically be required, and therefore the room impulse response is affected only in its first portion. The impulse response of the equalized system reported in Fig.5.13 shows a good attenuation of the first 1000 taps. The global DRR improvement for the whole impulse response is 2.3 dB.

The forward structure has been tested with the same settings. The DRR improvement for the whole impulse response is 1.8 dB. This last result is in agreement with the one obtained by the batch maximum likelihood algorithm proposed in section 5.1.1.

Compared to the previous toy example, this experiment based on speech signals highlights a
less evident difference in the performance of the forward and of the reversed structure. This
was partially expected since the physics of the vocal tract imposes the similarity of the filters
belonging to adjacent blocks. In other words, the LPC coefficients change smoothly from one
block to another. It is however confirmed that the reversed structure can better cope with the


non-time invariance of the source filter. Furthermore, it makes replicating the dereverberation
filter unnecessary.

It is interesting to notice that the on-line algorithm based on the reversed structure also provides better dereverberation with respect to the results of the single-channel batch algorithms, based on standard kurtosis and maximum likelihood, discussed in section 5.1.1 and shown in Fig.5.4.

At the current stage, we were unable to obtain results when longer dereverberation filters were tested. In this case, both the forward and the reversed algorithms tend to be unstable. This can be related to the misconvergence problem investigated in a paper by Douglas et al. [176], where a more robust version of the NGA is reported. Modifying the proposed algorithm to incorporate this improvement seems an interesting direction for future research.

5.5 A blind multichannel dereverberation algorithm based on the natural gradient

The proposed reversed structure can be exploited in a multichannel framework to benefit from a MINT-like form. As previously discussed, this implies:

• a solution to the instability issue associated with the inversion of non-minimum phase
transfer functions, by ensuring that the equalizer will be a set of FIR filters if the channel
transfer functions are FIR;

• shorter filters with respect to the single-channel case, and therefore better statistical properties in their estimation, less computational demand and lower memory requirements.

The advantages of the reversed algorithm become even more evident when a multichannel structure is employed. In fact, while the multichannel structure based on the forward algorithm requires the calculation of the LP residual for each channel, as shown in Fig.5.6, the multichannel reversed structure, as shown in Fig.5.14, requires only a single LP residual calculation and no replication of the dereverberation filters.

For an N -channel system the update equation becomes


Figure 5.14: Proposed multi-channel dereverberation structure.

e(n) = Σ_l al(n) y(n − l)    (5.39)

where al(n) is the l-th time-varying LPC filter coefficient at time n and y(n) is the dereverberator output, defined as

ŝ(n) = y(n) = (1/N) Σ_{i=1}^{N} yi(n)    (5.40)

where yi(n) is the output of the i-th maximization filter

yi (n) = wi (n) ∗ xi (n) (5.41)

and xi(n), wi(n) are respectively the i-th observation and equalizer of the corresponding source-to-receiver channel. The p-th coefficient of the i-th channel maximization filter, wi, is given by

winew(p) = wi(p) + µ [wi(p) + f(e(n − P)) Σ_{l=0}^{L} al(n − P) ui(n − p − l)]    (5.42)

and

ui(n) = Σ_{m=0}^{P} wi(P − m) yi(n − m).    (5.43)
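The output combination of equations 5.40-5.41 and the per-channel auxiliary variable of equation 5.43 can be sketched as follows (shapes and initialization are illustrative):

```python
import numpy as np

def multichannel_output(W, X):
    """Dereverberator output of equations 5.40-5.41 (sketch).

    W : (N, P) array of per-channel equalizers w_i
    X : (N, L) array of observations x_i
    Returns y(n) = (1/N) sum_i (w_i * x_i)(n) and the per-channel outputs y_i.
    """
    N, L = X.shape
    Y = np.stack([np.convolve(X[i], W[i])[:L] for i in range(N)])
    return Y.mean(axis=0), Y

def channel_u(w_i, y_i, n):
    """Auxiliary variable u_i(n) of equation 5.43 for one channel."""
    P = len(w_i) - 1
    return sum(w_i[P - m] * y_i[n - m] for m in range(P + 1) if n - m >= 0)

rng = np.random.default_rng(8)
X = rng.standard_normal((3, 300))
W = np.zeros((3, 10)); W[:, 0] = 1.0            # identity initialization
y, Y = multichannel_output(W, X)
```

With the identity initialization the combined output reduces to the plain channel average, which is the delay-and-sum baseline the update then improves upon.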


Figure 5.15: (a) Reference echogram relating to the shortest source-to-receiver path, DRR=-2.9 dB. (b) Echogram of the equalized IR obtained by the 8-channel delay-sum beamformer, DRR=-0.1 dB. (c) Echogram of the equalized IR obtained by the proposed 8-channel dereverberator, DRR=3.1 dB. The proposed structure provides improved dereverberation with respect to the delay and sum beamformer.

Fig. 5.15(a) shows the echogram of the original impulse relating to the shortest source-to-
receiver path. Figure 5.15(c) shows the equalized impulse response. A large amount of derever-
beration is already achieved by the delay and sum beamformer, Figure 5.15(b), which however
does not produce a consistent attenuation of the isolated early reflections. As in section 5.2.2,
an improved estimation of the LP analysis coefficients can be obtained by averaging the LP
coefficients evaluated from all the channels. This obviously makes the algorithm slightly less
efficient.

5.5.1 Results

The experiment reported in section 5.2.2 was repeated, using the same settings (except µ = 0.5 · 10−5) and data. As can be observed in Fig.5.15, the proposed algorithm outperforms by 0.8 dB the results reported in section 5.2.2 and shown in Fig.5.8.

The performance of the algorithm has been validated by repeating the previous experiment with the same parameters but different speech signals of similar length (between 9 and 10


Signal                      DRR improvement (dB)
Male 1 (Italian) 2.2
Male 2 (Portuguese) 2.9
Male 3 (English) 2.6
Male 4 (Italian) 3.1
Female 1 (Portuguese) 2.9
Female 2 (Russian) 2.3
Female 3 (Hebrew) 3.2
Female 4 (Dutch) 2.9

Table 5.2: DRR improvement in dB with respect to a DS beamformer.

seconds), from different speakers and in different languages. The results obtained by using the filters calculated after a single run of the algorithm (therefore after 9–10 seconds of adaptation) are reported in Table 5.2. The average DRR improvement with respect to a delay and sum beamformer (which itself already provides 2.8 dB of improvement) is 2.8 dB; the total average improvement in the DRR is therefore 5.6 dB. The proposed algorithm outperforms the results reported in section 5.2.2 by 0.5 dB on average.
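As a quick sanity check, the 2.8 dB average quoted above can be reproduced from the Table 5.2 values (the table entries are rounded to one decimal, so the exact mean comes out at about 2.76 dB):

```python
# Table 5.2 values (dB of DRR improvement over the DS beamformer).
table_5_2 = {
    "Male 1": 2.2, "Male 2": 2.9, "Male 3": 2.6, "Male 4": 3.1,
    "Female 1": 2.9, "Female 2": 2.3, "Female 3": 3.2, "Female 4": 2.9,
}
mean_gain = sum(table_5_2.values()) / len(table_5_2)  # ~2.8 dB over the beamformer
total_gain = mean_gain + 2.8                          # the DS beamformer itself gives 2.8 dB
```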

To validate the performance of the algorithm in a more reverberant environment, a similar experiment with a 12-channel system was repeated, using two of the previous speech samples (worst case "male1" and best case "female3"). The experiment settings were the following:

• T60 of about 1 second;

• 20 cm spacing between microphones;

• IR length of 12000 samples;

• inverse filter length 1000 samples;

• adaptation parameter µ = 0.5 · 10^−5;

• also in this experiment the channels were time-aligned manually;

• the filters calculated after a single run of the algorithm (therefore after 9-10 seconds of
adaptation) were used as the dereverberation filter.

Also in this case, as shown in Fig. 5.16, the proposed algorithm provides better dereverberation with respect to the delay and sum beamformer.


Figure 5.16: (a) Reference echogram relating to the shortest source-to-receiver path, DRR = −12.9 dB. (b) Echogram of the equalized IR obtained by the 12-channel delay-sum beamformer, DRR = −4.9 dB. (c) Echogram of the equalized IR obtained by the 12-channel dereverberator (female3 speaker), DRR = −0.9 dB. (d) Echogram of the equalized IR obtained by the 12-channel dereverberator (male1 speaker), DRR = −2.9 dB. The proposed algorithm provides better dereverberation with respect to the delay and sum beamformer.

Chapter 6
Conclusions and Further Research

6.1 Conclusions

The aim of this work was to investigate the problem of blind speech dereverberation, and in
particular of the methods based on the explicit estimate of the inverse acoustic system (rever-
beration cancellation methods).

• In chapter 2, some fundamental prerequisites on room acoustics and modeling have been provided. A relevant observation about the "image method for efficiently simulating small-room acoustics" by J. B. Allen and D. A. Berkley was proposed in section 2.7.2.

• In chapter 3, the fundamental methods regarding the "input-output" impulse response identification and equalization techniques have been discussed. The advantages of multichannel systems with respect to single-channel ones have been highlighted.

• In chapter 4, a literature survey covering the reverberation cancellation methods for speech dereverberation was presented, classifying the contributions in an organized framework.

• In chapter 5, novel single- and multi-channel dereverberation algorithms based on an HOS approach were proposed and, in particular, a novel multi-channel dereverberation structure that improves the decoupling of the speech and reverberation models. This novel algorithm provides better dereverberation performance with respect to other known HOS methods and delay and sum beamformers. The proposed method can be seen as a post-processing enhancement technique for the traditional delay and sum beamformer approach.

The dereverberation problem is still an open issue. This work proposes a partial solution in an idealized framework where, in the absence of noise, the impulse responses of the source-to-receiver paths are time-invariant. In this idealized case, the achieved dereverberation is relevant. The interest in the approaches based on explicit channel inversion was motivated by the


fact that, in theory, these methods can offer perfect dereverberation. In practice, even in the described idealized framework, this does not happen, and the improvement, even if perceptually significant, is limited.

Why does this discrepancy between theory and practice exist?

• A limiting factor of the performance of any dereverberation algorithm is how to decouple the speech and room contributions. This has been previously described as "the unambiguous deconvolution problem". So far no perfect solution has been found in this sense, especially in the single-channel case. Better approaches exist for multichannel systems. However, a high number of microphones spread over a large area of the room is necessary to obtain a convincing estimate of the speech LP coefficients. This is a severe constraint for practical applications. The unambiguous deconvolution problem is present for physical reasons, independently of the blind deconvolution algorithm that is used to perform the system inversion. This is a fundamental issue that is unlikely to be overcome.

• Another limiting factor is the performance of the blind deconvolution algorithm em-
ployed, both in terms of accuracy and speed of convergence, especially when the input
signal fed to the algorithm is not perfectly white.

• Other issues come from all the systematic errors that might be present in the proposed dereverberation system, as for instance the NGA misconvergence described in [176]. Of course these can be addressed and hopefully solved.

Departing from the idealized framework of noiseless, time-invariant acoustic systems, two problems are central and still need to be addressed:

• The speed of convergence of the algorithm must suffice to track the acoustic system
variations.

• The sensitivity of the algorithm to noise might prevent the algorithm from working in
real conditions.

Hopefully, both these issues are practically solvable.


6.2 Suggestions for further research

The main aspects that still need to be considered are:

1. The problem of blind time alignment between channels: a robust algorithm that initializes the inter-channel delay in the multichannel structure must be provided. This is necessary since the proposed algorithm is unable to identify the delay present among the channels. This would allow tests in a fully blind scenario.

2. The behavior in a time-varying environment: even if the proposed algorithm provides fast adaptation (convergence usually happens within seconds), all the simulations were performed with steady impulse responses. Since it is well known [6] that even a small perturbation in the source/receiver placement can cause a dramatic loss in the dereverberation performance, the algorithm should be evaluated in more realistic conditions. A possible test, without incurring the "unambiguous deconvolution problem", would be to use a supergaussian white noise as the source signal.

3. The sensitivity of the NGA algorithm [176] must be analyzed and the algorithm made robust to misconvergence. This should probably provide better dereverberation.

4. The evaluation of the performance in noisy and multiple-speaker environments. This is particularly important since the noise might prevent the algorithm from working. The multiple-speaker problem might be naively addressed by constraining the orientation of the beamformer to a predefined direction.

5. The extension to a wider class of signals beyond speech (e.g. music and all the signals containing sustained harmonic components). This would widen the possible applications.

Appendix A
Relative/Natural Gradient
De-reverberation Algorithm

A.1 The Bussgang algorithm

The Bussgang algorithm can be treated as a constrained maximum likelihood problem for an i.i.d. sequence y(n), where the usual normalization term is neglected. Hence the gain of the filter must in some way be constrained to be fixed.

J(w) = E{− log p(y(n))} (A.1)

where y(n) = w ∗ x(n). Note that the expectations will be dropped in the following.

Differentiating gives:

    ∂J/∂w(p) = −f(y(n)) x(n−p)                                                  (A.2)

where f(y) = ∂/∂y log p(y) is the Bussgang nonlinearity. This gives the Bussgang algorithm:

    w(p) = w(p) − µ ∂J/∂w(p)
         = w(p) + µ f(y(n)) x(n−p)                                              (A.3)
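A minimal sketch of the update (A.3), assuming a Laplacian source prior p(y) ∝ exp(−|y|) so that f(y) = −sign(y); all helper names are hypothetical:

```python
import math

# Hypothetical sketch of a single Bussgang gradient step, Eq. (A.3).

def f(y):
    """f(y) = d/dy log p(y) = -sign(y) for a Laplacian prior."""
    return -math.copysign(1.0, y) if y != 0 else 0.0

def bussgang_step(w, x, n, mu):
    """w(p) <- w(p) + mu * f(y(n)) * x(n-p), with y(n) = (w * x)(n)."""
    def val(seq, k):
        return seq[k] if 0 <= k < len(seq) else 0.0
    y_n = sum(w[p] * val(x, n - p) for p in range(len(w)))
    return [w[p] + mu * f(y_n) * val(x, n - p) for p in range(len(w))]
```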

A.2 Natural/Relative Gradient

The update above is not equivariant. Equivariant means that if the coordinates are changed, but everything else is kept the same, the updates are equivalent.

For example, suppose a(n) = g ∗ x(n) is chosen as the observation (simply by filtering the observed data with some filter g). If w is believed to be the deconvolution filter for x(n), then w̃ = w ∗ g^−1 should naturally be believed to be the deconvolution filter for a(n).

Furthermore, one would hope that a Bussgang update of w would be

consistent with the update of w̃ obtained by starting with w̃ and a(n). However, this is not the case. Showing this will lead to the relative gradient derivation.

Updating w̃ with respect to a(n), the derivative is:

    ∂J/∂w̃(p) = −f(y(n)) a(n−p)                                                 (A.4)

However, the following relationship also holds:

    w = w̃ ∗ g                                                                  (A.5)

So what is the equivalent new w that is obtained by updating w̃ (using the relationship above)?

 
    w(p) = w(p) − µ ((∂J/∂w̃) ∗ g)(p)
         = w(p) − µ Σ_m g(m) ∂J/∂w̃(p−m)                                        (A.6)
         = w(p) + µ Σ_m g(m) f(y(n)) a(n−p+m)

Note that y(n) is unchanged; ∂J/∂w̃(p) is a function of p, and it is this that is being convolved with g.

Now writing a(n) in terms of x(n):

    w(p) = w(p) + µ Σ_m g(m) f(y(n)) ( Σ_l g(l) x(n−p+m−l) )                    (A.7)

or more simply:
w = w + µf (y(n)) (g(n) ∗ g(−n) ∗ x) (A.8)

where the index in g(n) has been included to indicate that one convolution is time reversed.

So from (A.8) it can be observed that the gradient update depends upon the coordinates that have been chosen for the (same) data.

To get an equivariant estimate it is necessary to choose a coordinate system that is unrelated to


the observation. The simplest, and probably the best, is to calculate the gradient update in the
solution domain. That is, choose a(n) = y(n).

130
Relative/Natural Gradient De-reverberation Algorithm


Figure A.1: (a) Diagram of the time domain de-reverberation algorithm proposed by Gillespie and Malvar (forward structure). (b) Diagram of the proposed model (reversed structure).

This gives the usual natural/relative gradient update:

    w = w + µ f(y(n)) (w(n) ∗ w(−n) ∗ x(n))
      = w + µ f(y(n)) (w(−n) ∗ y(n))                                            (A.9)
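The coefficient-wise form of (A.9), w(p) += µ f(y(n)) Σ_m w(m) y(n−p+m), can be sketched directly; the names are hypothetical and, as above, a Laplacian prior is assumed for f:

```python
import math

# Hypothetical sketch of the relative/natural-gradient step, Eq. (A.9):
# the gradient is computed in the solution domain, so only the output y
# and the current filter w are needed.

def f(y):
    return -math.copysign(1.0, y) if y != 0 else 0.0

def nga_step(w, y, n, mu):
    """w(p) <- w(p) + mu * f(y(n)) * sum_m w(m) y(n-p+m)  (= (w(-n) * y)(n-p))."""
    def val(seq, k):
        return seq[k] if 0 <= k < len(seq) else 0.0
    fy = f(val(y, n))
    return [w[p] + mu * fy * sum(w[m] * val(y, n - p + m) for m in range(len(w)))
            for p in range(len(w))]
```

Note that, unlike (A.3), the reverberant observation x never appears in the step: this is what makes the update equivariant.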

A.3 De-reverberating speech

This argument can now be applied to the update associated with speech dereverberation.

In the first instance, no care will be taken as to whether the update is physically realizable (i.e. causal).

First, applying NGA to the “forward structure”, reported in figure A.1(a), and introduced by
Gillespie et al. in [87], is easy. The above update (preferably with a normalizing term included)
is applied to the LPC residual of the speech.

Now what if the correct model is used, that is the reversed structure shown in figure A.1(b)? As
before, the reverberated residual is modeled as an i.i.d. signal. Let us define:

e(n) — the clean residual: e(n) = Σ_l a_l(n) y(n−l), where a_l(n) are the time-varying LPC filter coefficients.

y(n) — the clean speech signal: y(n) = Σ_p w(p) x(n−p), where w(p) is the de-reverberation filter.

x(n) — the reverberant speech signal.

The probability model is:


J(w) = E{− log p(e(n))} (A.10)


where the correct filtering relationship between e(n) and x(n) is:

    e(n) = Σ_l a_l(n) y(n−l)
         = Σ_l a_l(n) Σ_p w(p) x(n−l−p)                                         (A.11)
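The equality of the two forms in (A.11) — filtering x with w first and then applying the LPC coefficients, versus the combined double sum — can be checked numerically. The sketch below uses hypothetical names and arbitrary test data (time-invariant a_l, for simplicity):

```python
# Hypothetical numeric check of the two forms in Eq. (A.11).

def val(seq, k):
    return seq[k] if 0 <= k < len(seq) else 0.0

def e_via_y(a, w, x, n):
    """e(n) = sum_l a_l y(n-l), with y(k) = sum_p w(p) x(k-p)."""
    def y(k):
        return sum(w[p] * val(x, k - p) for p in range(len(w)))
    return sum(a[l] * y(n - l) for l in range(len(a)))

def e_direct(a, w, x, n):
    """e(n) = sum_l a_l sum_p w(p) x(n-l-p)."""
    return sum(a[l] * w[p] * val(x, n - l - p)
               for l in range(len(a)) for p in range(len(w)))

# Arbitrary test data: the two computations agree at every time index.
a, w, x = [1.0, -0.5], [1.0, 0.3, 0.1], [1.0, 2.0, -1.0, 0.5]
forms_agree = all(abs(e_via_y(a, w, x, n) - e_direct(a, w, x, n)) < 1e-12
                  for n in range(8))
```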

It will be assumed that, somehow, al (n) are known for all n and l.

Now the Bussgang version (standard gradient) algorithm is:


    w(p) = w(p) − µ ∂J/∂w(p)
         = w(p) + µ f(e(n)) ∂e(n)/∂w(p)                                         (A.12)

From (A.11) above, this can be evaluated to be:

    w(p) = w(p) + µ f(e(n)) Σ_l a_l(n) x(n−p−l)                                 (A.13)

Following the previous section, let us ask what the equivalent update for w would be if we had started with r(n) = Σ_m g(m) x(n−m) and w̃.

Note, again, that the relationship w(p) = Σ_m g(m) w̃(p−m) holds. This gives:
 
    w(p) = w(p) − µ Σ_m g(m) ∂J/∂w̃(p−m)
         = w(p) + µ Σ_m g(m) f(e(n)) ∂e(n)/∂w(p−m)
         = w(p) + µ Σ_m g(m) f(e(n)) ( Σ_l a_l(n) r(n−p+m−l) )                  (A.14)
         = w(p) + µ Σ_m g(m) f(e(n)) ( Σ_l a_l(n) Σ_q g(q) x(n−p+m−l−q) )

To get the equivariant form it is necessary to choose g = w. This gives:


    w(p) = w(p) + µ f(e(n)) Σ_l a_l(n) ( Σ_m w(m) y(n−p+m−l) )                  (A.15)


A.3.1 A Causal NGA De-reverb Algorithm

Finally the causality problem, as in Amari et al. [136], needs to be addressed.

First the de-reverb filter is assumed to be causal FIR:

    y(n) = Σ_{p=0}^{P} w(p) x(n−p)                                              (A.16)

and similarly it is assumed that the LPC filters have an order L.

As in [136], the update is delayed by P samples:

    w(p) = w(p) + µ f(e(n−P)) Σ_l a_l(n−P) ( Σ_m w(m) y(n−p+m−l−P) )            (A.17)

and an auxiliary variable u(n) is introduced:

    u(n) = Σ_{m=0}^{P} w(P−m) y(n−m)                                            (A.18)

Then the simplified update rule is:

    w(p) = w(p) + µ f(e(n−P)) Σ_{l=0}^{L} a_l(n−P) u(n−p−l)                     (A.19)
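The substitution that turns (A.17) into (A.19) can itself be verified numerically: for a causal FIR w of order P, the inner term Σ_m w(m) y(n−p+m−l−P) of (A.17) equals u(n−p−l) with u as in (A.18). A sketch with hypothetical names and arbitrary data:

```python
# Hypothetical check that Eq. (A.18) reduces Eq. (A.17) to Eq. (A.19).

def val(seq, k):
    return seq[k] if 0 <= k < len(seq) else 0.0

def inner_a17(w, y, n, p, l, P):
    """The inner sum of (A.17): sum_{m=0..P} w(m) y(n-p+m-l-P)."""
    return sum(w[m] * val(y, n - p + m - l - P) for m in range(P + 1))

def u(w, y, n, P):
    """Eq. (A.18): u(n) = sum_{m=0}^{P} w(P-m) y(n-m)."""
    return sum(w[P - m] * val(y, n - m) for m in range(P + 1))

# Arbitrary data: the two expressions agree for every (n, p, l).
w, y = [1.0, 0.5, 0.25], [0.3, -1.0, 2.0, 0.7, 0.1]
identity_holds = all(abs(inner_a17(w, y, n, p, l, 2) - u(w, y, n - p - l, 2)) < 1e-12
                     for n in range(6) for p in range(3) for l in range(2))
```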

A.3.2 A Causal NGA Algorithm with Acausal w

Let:

    y(n) = Σ_{p=−P}^{P} w(p) x(n−p)                                             (A.20)

Note that this means at time n only y(n − P ) is accessible. It is necessary to introduce a delay
of 3P .

Let ũ(n) be defined as:

    ũ(n) = Σ_{m=−P}^{P} w(−m) y(n−m−2P)                                         (A.21)

Thus at time n, ũ(n) is immediately accessible.


The new update is:

    w(p) = w(p) + µ f(e(n−3P)) Σ_l a_l(n−3P) ( Σ_m w(m) y(n−p+m−l−3P) )
         = w(p) + µ f(e(n−3P)) Σ_l a_l(n−3P) ũ(n−p−l−P)                         (A.22)

To check that this is a causal update, consider the term ũ(n − p − l − P ). At time n, ũ(n) is
accessible. Furthermore l ≥ 0 and p ≥ −P . Thus the largest index in ũ is n.
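The index bookkeeping behind this causality claim can be checked mechanically; a small sketch (hypothetical helper name):

```python
# Check that, with l >= 0 and p >= -P, the lookup index n - p - l - P used
# for ũ in Eq. (A.22) never exceeds n, so the update only needs samples of
# ũ that are already available at time n.

def max_u_index(n, P, L):
    """Largest ũ index requested by the update at time n."""
    return max(n - p - l - P for p in range(-P, P + 1) for l in range(L + 1))
```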

A.4 The normalization constant

Recall that in the ML formulation (single channel) for the standard deconvolution problem,
there is a normalization constant. Let’s begin by looking at where this comes from. If a linear
transform model y = Ax is assumed then the probability density functions are related by:

    p(x) = p(y) ||∂y/∂x||
         = p(y) det(A)                                                          (A.23)

Taking logs of equation (A.23) we have:

log p(x) = log p(y) + log det(A) (A.24)

But log det(A) can also be written in terms of the eigenvalues λ_i of A:

    log det(A) = log Π_i |λ_i|
               = Σ_i log |λ_i|                                                  (A.25)
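A small numeric illustration of (A.25), using an upper-triangular matrix so that the eigenvalues can be read off the diagonal (variable names are hypothetical):

```python
import math

# For an upper-triangular A, the eigenvalues are the diagonal entries and
# det(A) is their product, so log det(A) = sum_i log|lambda_i| as in (A.25).

A = [[2.0, 1.0],
     [0.0, 3.0]]
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]         # determinant of 2x2 A
eigvals = [A[0][0], A[1][1]]                          # triangular: diagonal
log_det = math.log(abs(det_A))
sum_log_eig = sum(math.log(abs(lam)) for lam in eigvals)
```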

To apply this normalization factor to filters it is necessary to recall that the eigenfunctions of a
filter are complex exponentials and the eigenvalues are defined by the Fourier Transform (FT).
Thus, if y = w ∗ x, then:
    log p(x) = log p(y) + (1/2π) ∫_{−π}^{π} log |W(ω)| dω                       (A.26)
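The frequency-domain term in (A.26) can be evaluated numerically. The sketch below (hypothetical function name) approximates (1/2π)∫log|W(ω)|dω by a Riemann sum for a two-tap filter w = [1, a]; by Jensen's formula the result is 0 when the zero lies inside the unit circle (|a| < 1) and log|a| when it lies outside:

```python
import cmath
import math

# Riemann-sum approximation of (1/2pi) * integral of log|W(omega)| d omega.

def log_mag_mean(w, N=4096):
    total = 0.0
    for k in range(N):
        omega = 2.0 * math.pi * k / N
        W = sum(wp * cmath.exp(-1j * omega * p) for p, wp in enumerate(w))
        total += math.log(abs(W))
    return total / N

min_phase = log_mag_mean([1.0, 0.5])   # zero at -0.5 (inside):  term ~ 0
max_phase = log_mag_mean([1.0, 2.0])   # zero at -2   (outside): term ~ log 2
```

This is the same quantity that appears as Σ_i log|λ_i| in the matrix case: the eigenvalues of the filter are the values of its Fourier transform.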


where the sum has become a normalized integral because a continuous set of eigenvalues is
present. It is essentially the inverse FT of log |W (ω)| evaluated at n = 0.

A.4.1 Natural Gradient Blind Deconvolution

The full negative log-likelihood function for blind deconvolution can now be considered:

    J(w) = −E{log p(x|w)}
         = −log ||∂y/∂x|| − E{log p(y|w)}                                       (A.27)
         = −(1/2π) ∫_{−π}^{π} log |W(ω)| dω − E{log p(y(n))}

This is equivalent to the log det W term in ICA [175]. Since w(p) is an invertible linear time-invariant system, it is an infinite-dimensional Toeplitz operator.

Differentiating with respect to w(p) gives:


    ∂/∂w(p) ( −(1/2π) ∫_{−π}^{π} log |W(ω)| dω )
        = −(1/2π) ∫_{−π}^{π} ∂/∂w(p) ( log |W(ω)| ) dω
        = −(1/2π) ∫_{−π}^{π} (1/W(ω)) ∂W(ω)/∂w(p) dω                            (A.28)
        = −(1/2π) ∫_{−π}^{π} (1/W(ω)) e^{−jωp} dω

The last line can be deduced by recalling the definition of the Fourier transform, W(ω) = Σ_p w(p) e^{−jωp}, and then differentiating with respect to w(p). Note also that the last line is simply the time-reversed inverse FT. Thus:

    ∂J/∂w(p) = −h(−p) − f(y(n)) x(n−p)                                          (A.29)

where h(p) denotes the impulse response function for the inverse of w (i.e. h ∗ w = δ0 ). Note
that this formulation tries to be consistent with the definition of f (y) = ∂/∂y log p(y) as in the
Bussgang section, above.


It is relatively straightforward to show that the natural gradient normalized update is:

    w(p) = w(p) − µ Σ_l w(l) Σ_m w(m) ∂J/∂w(p−l+m)
         = w(p) + µ Σ_l w(l) Σ_m w(m) ( h(−p+l−m) + f(y(n)) x(n−p+l−m) )
         = w(p) + µ Σ_l w(l) ( δ(l−p) + f(y(n)) y(n−p+l) )                      (A.30)
         = w(p) + µ ( w(p) + f(y(n)) Σ_l w(l) y(n−p+l) )

The last line here is then equivalent to equation (33) in Amari et al. (with the sign change in
f (y)).

A.4.2 Natural Gradient Blind De-Reverb

So what happens in the de-reverb case?

    J(w) = −E{log p(x|w)}
         = −log ||∂y/∂x|| − E{log p(y|w)}
         = −log ||∂y/∂x|| − log ||∂e/∂y|| − E{log p(e|w)}                       (A.31)
         = −(1/2π) ∫_{−π}^{π} log |W(ω)| dω − E{log p(e(n))} + constant

where the expression − log ||∂e/∂y|| has been replaced with “constant” since it does not depend
on w and so will vanish when it is differentiated, which is done next.

    ∂J/∂w(p) = −h(−p) − f(e(n)) ∂e(n)/∂w(p)
             = −h(−p) − f(e(n)) Σ_l a_l(n) x(n−p−l)                             (A.32)

The 'normalized' equivalent of equation (A.14), where g(p) = w(p) is directly used, can now be rewritten as:

    w(p) = w(p) − µ Σ_m w(m) Σ_q w(q) ∂J/∂w(p−m+q)
         = w(p) + µ Σ_m w(m) Σ_q w(q) ( h(m−p−q) + f(e(n)) Σ_l a_l(n) x(n−p+m−q−l) )
         = w(p) + µ Σ_m w(m) ( δ(m−p) + f(e(n)) Σ_l a_l(n) y(n−p+m−l) )         (A.33)
         = w(p) + µ ( w(p) + f(e(n)) Σ_l a_l(n) Σ_m w(m) y(n−p+m−l) )

Compare the last line with that in equation (A.15).

A.4.3 Causal normalized NGA for De-Reverb

The normalized version of the causal NGA de-reverb algorithm can now be immediately written down:

    w(p) = w(p) + µ ( w(p) + f(e(n−P)) Σ_{l=0}^{L} a_l(n−P) u(n−p−l) )          (A.34)

Compare with equation (A.19).

Also a normalized causal update with acausal w filter can be written:

    w(p) = w(p) + µ ( w(p) + f(e(n−3P)) Σ_l a_l(n−3P) ũ(n−p−l−P) )              (A.35)

Compare with equation (A.22).

References

[1] H. Kuttruff, Room Acoustics, 4th ed. New York, USA: Taylor & Francis, 2000.

[2] D. C. Halling and L. Humes, “Factors affecting the recognition of reverberant speech by
elderly listeners,” Journal of Speech, Language, and Hearing Research, 2000.

[3] M. Ferras, “Multi-microphone signal processing for automatic speech recognition in


meeting rooms,” Master’s thesis, ICSI, Berkeley, California, 2005.

[4] E. Habets, Single- and Multi-Microphone Speech Dereverberation using Spectral En-
hancement. PhD thesis, Technische Universiteit Eindhoven, Eindhoven, 2007.

[5] B. D. Van Veen and K. M. Buckley, "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988.

[6] J. Mourjopoulos, “On the variation and invertibility of room impulse response functions,”
Journal of Sound and Vibration, vol. 102, pp. 217–228, sept 1985.

[7] M. R. P. Thomas, N. D. Gaubitch, J. Gudnason, and P. A. Naylor, "A practical multichannel dereverberation algorithm using multichannel DYPSA and spatiotemporal averaging," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA-07), Oct. 2007.

[8] P. Naylor and N. G. Eds., Speech Dereverberation. New Jersey: Springer, 2008.

[9] M. Miyoshi and Y. Kaneda, “Inverse filtering of room acoustics,” IEEE Trans. on Acous-
tics, Speech and Signal Processing, vol. 36, no. 2, pp. 145–152, 1988.

[10] J. Cowan, Handbook Of Environmental Acoustics. Wiley, 1993.

[11] M. D. Egan, Architectural acoustics. J. Ross Publishing, 2007.

[12] McGraw-Hill Dictionary of Scientific and Technical Terms, 7th edition. The McGraw-
Hill Companies, 2011.

[13] “ISO 3382:1997, acoustics. measurement of the reverberation time of rooms with refer-
ence to other acoustic parameters,” 1997.

[14] V. L. Jordan, “Acoustical criteria for auditoriums and their relation to model techniques,”
J. Acoust. Soc. Am., 1970.

[15] F. A. Everest, Master Handbook of acoustics. McGraw-Hill; 4th edition, 2001.

[16] A. D. Pierce, Acoustics. The Acoustical Society of America, 1994.

[17] M. Kahrs and K. Brandenburg (Eds.), Applications of Digital Signal Processing to Audio and Acoustics. Kluwer Academic Publishers, 1998.


[18] C. L. S. Gilford, “The acoustic design of talk studios and listening rooms,” Proc. Inst.
Elect. Engs., vol. 106, pp. 245–258, May 1959.

[19] O. J. Bonello, “A new criterion for the distribution of normal room modes,” J. Audio
Eng. Soc., pp. 597–606, 1981.

[20] M. C. Consumi, Proposta di un nuovo metodo d’indagine per il restauro acustico di


un teatro dell’800. Lettura in frequenza della risposta all’impulso in correlazione con i
parametri del dominio temporale. PhD thesis, Universita’ di Bologna, Bologna, 2005.

[21] Nimura and Tadamoto, “Effect of splayed walls of a room on the steady-state sound
transmission characteristics,” J. Acoust. Soc. Am., vol. 28, pp. 774–775, July 1956.

[22] D. Rocchesso and W. Putnam, “A numerical investigation of the representation of room


transfer functions for artificial reverberation,” Proc. XI Colloquium Mus. Inform. Mario
Baroni Editor, Bologna, pp. 149–152, November 1995.

[23] Y. Haneda, S. Makino, and Y. Kaneda, “Multiple-point equalization of room transfer


functions by using common acoustical poles,” IEEE transactions on speech and audio
processing, vol. 5, no. 4, pp. 325–333, 1997.

[24] S. Gudvangen and S. J. Flockton, “Comparison of pole-zero and all-zero modeling


of acoustic transfer functions,” Electronics letters ISSN 0013-5194, vol. 28, no. 21,
pp. 1976–1978, 1992.

[25] J. Pongsiri, P. Amin, and C. Thompson, “Modeling the acoustic transfer function of a
room,” Proceedings of the 12th International Conference on Scientific Computing and
Mathematical Modeling, p. 44, 1999.

[26] W. G. Gardner, “The virtual acoustic room,” Master’s thesis, MIT Media Lab, Cam-
bridge, 1992.

[27] R. Stewart and M. Sandler, “Statistical measures of early reflections of room impulse
responses,” Proceedings of the DAFx’07, 2007.

[28] J. M. Jot, “An analysis/synthesis approach to real-time artificial reverberation,” Proc. of


the International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
vol. 2, no. 4, pp. 221–224, 1992.

[29] A. Kulowski, “Algorithmic representation of the ray tracing technique,” Applied Acous-
tics, vol. 18, no. 6, pp. 449–469, 1985.

[30] J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,”
J. Acoust. Soc. Am, vol. 65, no. 4, pp. 943–948, 1979.

[31] J. Martin and J. V. D. Van Maercke, “Binaural simulation of concert halls: a new
approach for the binaural reverberation process,” J. Acoust. Soc. Am., vol. 94, no. 6,
pp. 3255–3264, 1993.

[32] M. Vorlander, “Simulation of the transient and steady-state sound propagation in room
using a new combined ray-tracing/image-source algorithm,” J. Acoust. Soc. Am., vol. 86,
no. 1, pp. 172–178, 1989.


[33] E. Lehmann and A. Johansson, “Prediction of energy decay in room impulse responses
simulated with an image-source model,” Journal of the Acoustical Society of America,
vol. 124(1), pp. 269–277, July 2008.

[34] Y. Lin and D. D. Lee, "Bayesian regularization and nonnegative deconvolution for room impulse response estimation," IEEE Transactions on Signal Processing, vol. 54, pp. 839–847, March 2006.

[35] R. Spagnolo, Manuale di acustica applicata. UTET, 2008.

[36] M. Petyt, “Finite element techniques for acoustics,” in R.G. White , J.G. Walker (editors),
Noise and Vibrations, Ellis Horwood Ltd., pp. 355–369, 1986.

[37] R. D. Ciskowski and C. A. Brebbia, Boundary Element Methods in Acoustics. Southampton: Computational Mechanics Publications, 1991.

[38] D. Botteldooren, "Finite-difference time-domain simulation of low-frequency room acoustic problems," Journal of the Acoustical Society of America, vol. 98, no. 6, pp. 3302–3308, 1995.

[39] J.Redondo, R.Pico, B. Roig, and M.R.Avis, “Time domain simulation of sound diffusers
using finite-difference schemes,” Acta Acustica, vol. 93, no. 4, pp. 611–622, 2007.

[40] L. Savioja, J. Backman, A. Jarvinen, and T. Takala, “Waveguide mesh method for low-
frequency simulation of room acoustics,” Proc. of the 15th Int. Congr. Acoust. (ICA95),
vol. 2, pp. 1–4, 1995.

[41] D. T. Murphy, M. Beeson, S. Shelley, A. Southern, and A. Moore, “Hybrid room im-
pulse response synthesis in digital waveguide mesh based room acoustics simulation,”
Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx08), pp. 129–
136, 2008.

[42] J. O. Smith III, Mathematics of the Discrete Fourier Transform (DFT): with Audio Applications, 2nd ed. W3K Publishing, 2008.

[43] D. D. Rife and J. Vanderkooy, “Transfer-function measurement with maximum length


sequences,” Journal of the Audio Engineering Society, vol. 37, no. 6, 1989.

[44] A. Farina, “Simultaneous measurements of impulse response and distortion with a swept-
sine technique,” 108th AES Convention, 2000.

[45] S. Haykin, Adaptive Filter Theory. New Jersey: Prentice-Hall, 2002.

[46] B. D. Radlovic and R. A. Kennedy, "Iterative cepstrum-based approach for speech dereverberation," Proc. of ISSPAA, vol. 1, pp. 55–58, 1999.

[47] L. Fielder, “Analysis of traditional and reverberation-reducing methods of room equal-


ization,” J. Audio Eng. Society, vol. 51, pp. 3–26, January/February 2003.

[48] J. Mourjopoulos, “Comments on analysis of traditional and reverberation-reducing meth-


ods of room equalization,” Journal of the Audio Engineering Society, vol. 51, pp. 1186–
1188, Dec 2003.


[49] O. Kirkeby and P. A. Nelson, “Digital filter design for inversion problems in sound re-
production,” J. Audio Eng. Soc., vol. 47, no. 7/8, pp. 583–595, 1999.

[50] M. H. Hayes, Statistical Digital Signal Processing and Modeling. New York: John Wiley
& Sons, 1996.

[51] M. Gerzon, “Digital room equalization,” Studio Sound, http://www.audiosignal.co.uk,


1991.

[52] J. R. Treichler, J. C. R. Johnson, and M. G. Larimore, Theory and Design of Adaptive


Filters. New Jersey: Prentice-Hall, 2001.

[53] S. Neely and J. B. Allen, “Invertibility of a room impulse response,” J. Acoust. Soc.
Amer., vol. 66, pp. 165–169, 1979.

[54] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood


Cliffs, New Jersey: Prentice-Hall, 1989.

[55] J. Mourjopoulos, P. M. Clarkson, and J. K. Hammond, “A comparative study of least-


squares and homomorphic techniques for the inversion of mixed-phase signals,” Proc.
of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
vol. 7, pp. 1858–1861, May 1982.

[56] B. W. Gillespie and L. E. Atlas, “Acoustic diversity for improved speech recognition
in reverberant environments,” in Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, pp. 557–560, 2002.

[57] P. Hatziantoniou and J. Mourjopoulos, “Errors in real-time room acoustics dereverbera-


tion,” Journal of the Audio Engineering Society, vol. 52, pp. 883–899, September 2004.

[58] P. Hatziantoniou and J. Mourjopoulos, “Real-time room equalization based on complex


smoothing: Robustness results,” Audio Engineering Society 116th Convention, Berlin,
May 2004.

[59] B. D. Radlovic, R. Williamson, and R. A. Kennedy, “Equalization in an acoustic re-


verberant environment: Robustness results,” IEEE Trans. Speech and Audio Processing,
vol. 8, no. 3, pp. 311–319, 2000.

[60] M. Gerzon, “Why do equalisers sound different?,” Studio Sound, vol. 32, pp. 58–65, July
1990.

[61] R. P. Genereux, “Adaptive loudspeaker systems: Correcting for the acoustic environ-
ment,” Proc 8th International Audio Engineering Society Conference, Washington DC,
May 1990.

[62] P. Hatziantoniou and J. Mourjopoulos, “Results for room acoustics equalization based on
smoothed responses,” Audio Engineering Society 114th Convention, Amsterdam, March
2003.

[63] T. Hikichi and M. Miyoshi, "Blind algorithm for calculating common poles based on linear prediction," Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 89–92, 2004.


[64] G. A. Jones and J. M. Jones, Elementary Number Theory. Berlin: Springer-Verlag, 1998.

[65] L. Tong and S. Perreau, “Multichannel blind identification: From subspace to maximum
likelihood methods,” Proc. IEEE, vol. 86, pp. 1951–1968, November 1998.

[66] T. Hikichi, M. Delcroix, and M. Miyoshi, “On robust inverse filter design for room
transfer function fluctuations,” in Proc. European Signal Processing Conf. (EUSIPCO),
2006.

[67] H. Yamada, H. Wang, and F. Itakura, “Inverse filtering of room acoustics,” International
Conference on Acoustics, Speech, and Signal Processing., vol. 2, pp. 969–972, 14-17
Apr 1991.

[68] P. Nelson, F. Orduna-Bustamante, and H. Hamada, “Inverse filter design and equaliza-
tion zones in multichannel soundreproduction,” IEEE Transactions on Speech and Audio
Processing, vol. 3, pp. 185–192, May 1995.

[69] J. Lochner and J. Burger, “The subjective masking of short time delayed echoes by their
primary sounds and their contribution to the intelligibility of speech,” Acustica, vol. 8,
pp. 1–10, 1958.

[70] A. Watkins and N. Holt, “Effects of a complex reflection on vowel indentification,” Jour-
nal of the Acoustical Society of America, vol. 86, pp. 532–542, 2000.

[71] M. Hodgson and E. Nosal, “Effect of noise and occupancy on optimal reverberation times
for speech intelligibility in classrooms,” Journal of the Acoustical Society of America,
vol. 111, pp. 931–939, 2002.

[72] P. Naylor and N. Gaubitch, “Speech dereverberation,” Proc. of the International Work-
shop on Acoustic Echo and Noise Control (IWAENC 2005), 2005.

[73] H. J. M. Steeneken and T. Houtgast, “A physical method for measuring speech-


transmission quality,” Journal of the Acoustical Society of America, vol. 67, pp. 318–326,
1980.

[74] E. Zwicker, “Subdivision of the audible frequency range into critical bands,” Journal of
the Acoustical Society of America, vol. 33, pp. 318–326, 1961.

[75] J.Y.C.Wen and P. Naylor, “An evaluation measure for reverberant speech using tail de-
cay modelling,” in Proc. of the European Signal Processing Conference (EUSIPCO06),
pp. 1–4, 2006.

[76] I. T. U. (ITU-T), “Perceptual evaluation of speech quality (pesq), an objective method


for end-to-end speech quality assessment of narrow-band telephone networks and speech
codecs,” Recommendation P.862, Feb. 2001.

[77] S. Wang, A. Skey, and A. Gersho, “An objective measure for predicting subjective quality
of speech coders,” IEEE Journal on Selected Areas in Communications, vol. 10, no. 5,
1992.


[78] P. A. Naylor, N. D. Gaubitch, and E. A. P. Habets, “Signal-based performance evaluation


of dereverberation algorithms,” Journal of Electrical and Computer Engineering, 2010.

[79] Y. Huang, J. Benesty, and J. Chen, Acoustic MIMO Signal Processing. New Jersey: Springer Topics in Signal Processing, Vol. 1, 2006.

[80] K. Eneman and M. Moonen, “Multimicrophone speech dereverberation: Experimental


validation,” EURASIP Journal on Audio, Speech, and Music Processing, 2007.

[81] J. Benesty, S. Makino, and J. Chen (Eds.), Speech Enhancement. New Jersey: Springer, 2005.

[82] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. New Jersey: Springer Topics in Signal Processing, Vol. 1, 2008.

[83] M. Unoki, M. Toi, and M. Akagi, “Refinement of an MTF-based speech dereverberation
method using an optimal inverse-MTF filter,” Proc. SPECOM, vol. 7, June 2006.

[84] J. Allen, “Synthesis of pure speech from a reverberant signal,” U.S. Patent No. 3786188,
1974.

[85] B. Yegnanarayana and P. Murthy, “Enhancement of reverberant speech using LP residual
signal,” IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 267–281, 2000.

[86] Y. Huang, J. Benesty, and J. Chen, Acoustic MIMO Signal Processing (Signals and
Communication Technology). Secaucus, NJ, USA: Springer-Verlag New York, 2006.

[87] B. W. Gillespie, D. A. F. Florencio, and H. S. Malvar, “Speech de-reverberation via
maximum-kurtosis subband adaptive filtering,” Proc. of the International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), pp. 3701–3704, 2001.

[88] T. Yoshioka, T. Hikichi, M. Miyoshi, and H. Okuno, “Robust decomposition of inverse
filter of channel and prediction error filter of speech signal for dereverberation,” in Proc.
European Signal Processing Conf. (EUSIPCO), 2006.

[89] T. Nakatani, M. Miyoshi, and K. Kinoshita, “Implementation and effects of single chan-
nel dereverberation based on the harmonic structure of speech,” Proc. of the International
Workshop on Acoustic Echo and Noise Control (IWAENC03), pp. 91–94, 2003.

[90] M. Wu and D. Wang, “A two-stage algorithm for enhancement of reverberant speech,”
Proc. of the International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), vol. 1, pp. 1085–1088, March 2005.

[91] K. Furuya and A. Kataoka, “Robust speech dereverberation using multichannel blind
deconvolution with spectral subtraction,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 15, no. 5, pp. 1579–1591, 2007.

[92] R. Liu and G. Dong, “A fundamental theorem for multiple-channel blind equalization,”
IEEE Trans. on Circuits and Systems, vol. 44, pp. 472–473, May 1997.

[93] S. Mitra, Digital Signal Processing. New Jersey: McGraw-Hill, 2002.

[94] D. Ward, R. Kennedy, and R. Williamson, “Theory and design of broadband sensor
arrays with frequency invariant far-field beam patterns,” J. Acoust. Soc. Amer., vol. 97,
pp. 1023–1034, Feb. 1995.
[95] G. W. Elko, “Microphone array systems for hands-free telecommunication,” Speech
Commun., vol. 20, pp. 229–240, Dec. 1996.
[96] N. D. Gaubitch, Blind Identification of Acoustic Systems and Enhancement of Reverber-
ant Speech. PhD thesis, Imperial College, University of London, London, 2006.
[97] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE
Trans. Acoust., Speech, Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[98] F. Pacheco and R. Seara, “Spectral subtraction for reverberation reduction applied to
automatic speech recognition,” Telecommunications Symposium, 2006 International,
vol. 7, pp. 795–800, Sept. 2006.
[99] N. Virag, “Single channel speech enhancement based on masking properties of the hu-
man auditory system,” IEEE Trans. Speech Audio Processing, vol. 7, pp. 126–137, Mar.
1999.
[100] E. A. P. Habets, “Single-channel speech dereverberation based on spectral subtrac-
tion,” Proc. 15th Annual Workshop Circuits, Syst., Signal Process. (ProRISC04), vol. 7,
pp. 250–254, Nov. 2004.
[101] N. W. D. Evans, J. S. Mason, W. M. Liu, and B. Fauve, “On the fundamental limita-
tions of spectral subtraction: an assessment by automatic speech recognition,” in Proc.
European Signal Processing Conf. (EUSIPCO), 2005.
[102] K. Lebart, J. Boucher, and P. Denbigh, “A new method based on spectral subtraction for
speech dereverberation,” Acta Acustica, vol. 87, no. 3, pp. 359–366, 2001.
[103] J. Polack, La transmission de l’énergie sonore dans les salles. PhD thesis, Université
du Maine, Le Mans, 1988.
[104] R. Ratnam, D. L. Jones, and W. D. O’Brien, Jr., “Fast algorithms for blind estimation of
reverberation time,” IEEE Signal Processing Letters, vol. 11, pp. 537–540, June 2004.
[105] Y. Zhang, J. A. Chambers, F. Li, P. Kendrick, and T. Cox, “Blind estimation of reverber-
ation time in occupied rooms,” in Proc. European Signal Processing Conf. (EUSIPCO),
2006.
[106] J. Wen, E. Habets, and P. A. Naylor, “Blind estimation of reverberation time based on the
distribution of signal decay rates,” Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pp. 329–332, 2008.
[107] B. Yegnanarayana, S. R. M. Prasanna, and K. S. Rao, “Speech enhancement using exci-
tation source information,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing,
vol. 1, pp. 541–544, 2002.
[108] S. Griebel and M. Brandstein, “Wavelet transform extrema clustering for multi-channel
speech dereverberation,” in Proc. of the IEEE Workshop on Applications of Signal Pro-
cessing to Audio and Acoustics (WASPAA-99), vol. 1, 1999.

[109] N. D. Gaubitch, P. A. Naylor, and D. B. Ward, “Multimicrophone speech dereverberation
using spatio-temporal averaging,” in Proc. European Signal Processing Conf. (EUSIPCO),
pp. 809–812, Sept. 2004.

[110] D. Kundur and D. Hatzinakos, “Blind image deconvolution,” IEEE Trans. Signal Pro-
cessing, vol. 13, no. 2, pp. 43–64, 1996.

[111] O. Shalvi and E. Weinstein, “New criteria for blind deconvolution of nonminimum phase
systems (channels),” IEEE Trans. on Information Theory, vol. 36, pp. 312–321, March
1990.

[112] L. Tong, G. Xu, and T. Kailath, “Blind identification and equalization based on second-
order statistics: A time domain approach,” IEEE Trans. on Information Theory, vol. 40,
pp. 340–349, March 1994.

[113] S. Bellini and F. Rocca, “Blind deconvolution: polyspectra or Bussgang techniques,”
Digital Communications, pp. 251–263, 1986.

[114] J. Hopgood, Nonstationary Signal Processing with Application to Reverberation Can-
cellation in Acoustic Environments. PhD thesis, University of Cambridge, Cambridge,
2000.

[115] R. Lopez-Valcarce and S. Dasgupta, “Second order statistics based blind channel equal-
ization with correlated sources,” IEEE, vol. 4, pp. 366–369, March 2001.

[116] R. Lopez-Valcarce and S. Dasgupta, “Blind channel equalization with colored sources
based on second-order statistics: A linear prediction approach,” IEEE Transactions on
Signal Processing, vol. 49, pp. 2050–2059, Sept. 2001.

[117] N. D. Gaubitch, P. A. Naylor, and D. B. Ward, “On the use of linear prediction for
dereverberation of speech,” IWAENC, pp. 99–102, 2003.

[118] M. Triki and T. Slock, “AR source modeling based on spatiotemporally diverse multi-
channel outputs and application to multimicrophone dereverberation,” DSP 2007, 15th
International Conference on Digital Signal Processing, July 2007.

[119] S. Haykin, Blind Deconvolution. New Jersey: Prentice-Hall, 1994.

[120] S. Haykin, Unsupervised adaptive filtering. New Jersey: John Wiley & Sons, 2000.

[121] M. Joho, A Systematic Approach to Adaptive Algorithms for Multichannel System Iden-
tification, Inverse Modeling, and Blind Identification. PhD thesis, Swiss Federal Institute
of Technology, Zürich, 2000.

[122] C. L. Nikias and M. R. Raghuveer, “Bispectrum estimation: A digital signal processing
framework,” Proc. IEEE, vol. 75, pp. 869–891, July 1987.

[123] J. Mendel, “Tutorial on higher-order statistics (spectra) in signal processing and system
theory: Theoretical results and some applications,” Proc. IEEE, vol. 79, pp. 278–305,
March 1991.

[124] G. Giannakis and J. Mendel, “Identification of nonminimum phase systems using higher
order statistics,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37,
pp. 360–377, March 1989.
[125] C. Nikias and J. Mendel, “Signal processing with higher order spectra,” IEEE Signal
Processing Magazine, vol. 10, pp. 10–37, July 1993.
[126] D. Hatzinakos and C. Nikias, “Blind equalisation based on higher order statistics
(H.O.S.),” in Blind Deconvolution (S. Haykin, Ed.) [119], pp. 181–258, 1994.
[127] R. A. Wiggins, “Minimum entropy deconvolution,” Geoexploration, vol. 16, pp. 21–35,
1978.
[128] J. A. Cadzow, “Blind deconvolution via cumulant extrema,” IEEE Signal Process.Mag.,
vol. 13, pp. 24–42, May 1996.
[129] D. L. Donoho, “On minimum entropy deconvolution,” Applied Time Series Analysis, D.
F. Findley, Ed. New York: Academic Press, 1981.
[130] A. Papoulis, Probability, Random Variables and Stochastic Processes, 2nd ed. Singa-
pore: McGraw-Hill, 1984.
[131] P. Paajarvi and J. P. LeBlanc, “Computationally efficient norm-constrained adaptive blind
deconvolution using third-order moments,” Proc. of the International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), pp. 752–755, 2006.
[132] S. Bellini and F. Rocca, “Near optimal blind deconvolution,” IEEE, pp. 2236–2239,
1988.
[133] S. Bellini, “Bussgang techniques for blind deconvolution and equalization,” in Blind
Deconvolution (S. Haykin, Ed.). Englewood Cliffs, NJ: Prentice-Hall, 1994.
[134] S. Amari, A. Cichocki, and H. Yang, “A new learning algorithm for blind signal separa-
tion,” Advances in Neural Information Processing Systems 8, MIT Press, pp. 752–763,
1996.
[135] J. Cardoso and B. Laheld, “Equivariant adaptive source separation,” IEEE Trans. Signal
Processing, vol. 44, pp. 3017–3030, 1996.
[136] S. Amari, S. Douglas, A. Cichocki, and H. Yang, “Novel on-line adaptive learning algo-
rithms for blind deconvolution using the natural gradient approach,” 1997.
[137] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind sep-
aration and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129–1159,
1995.
[138] D. Fee, C. Cowan, S. Bilbao, and I. Ozcelik, “Predictive deconvolution and kurtosis
maximization for speech dereverberation,” in Proc. European Signal Processing Conf.
(EUSIPCO), 2006.
[139] G. Xu, H. Liu, L. Tong, and T. Kailath, “A least-squares approach to blind channel iden-
tification,” IEEE Trans. Signal Processing, vol. 43, no. 12, pp. 2982–2993, December
1995.

[140] L. Tong, G. Xu, and T. Kailath, “A new approach to blind identification and equalization
of multipath channels,” Conference Record of the 25th Asilomar Conference on Signals,
Systems and Computers, vol. 2, pp. 856–860, November 1991.

[141] Y. Hua, “Fast maximum likelihood for blind identification of multiple FIR channels,”
IEEE Trans. Signal Processing, vol. 44, pp. 661–672, Mar. 1996.

[142] H. Liu, G. Xu, and L. Tong, “A deterministic approach to blind equalization,” Proc. 27th
Asilomar Conf., Pacific Grove, CA, pp. 751–755, 1993.

[143] M. I. Gurelli and C. Nikias, “EVAM: An eigenvector-based algorithm for multichannel
blind deconvolution of input colored signals,” IEEE Trans. Signal Processing, vol. 43,
pp. 134–149, January 1995.

[144] S. Gannot and M. Moonen, “Subspace methods for multimicrophone speech dereverber-
ation,” in Proceedings of the 7th IEEE/EURASIP International Workshop on Acoustic
Echo and Noise Control (IWAENC 2001), vol. 1, pp. 47–50, September 2001.

[145] S. Gannot and M. Moonen, “Subspace methods for multimicrophone speech dereverber-
ation,” EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1074–
1090, 2003.

[146] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore: Johns
Hopkins University Press, 1996.

[147] C. Avendano, J. Benesty, and D. R. Morgan, “A least squares component normaliza-
tion approach to blind channel identification,” Proc. of the International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 1797–1800, 1999.

[148] M. K. Hasan, J. Benesty, P. A. Naylor, and D. B. Ward, “Improving robustness of blind
adaptive multichannel identification algorithms using constraints,” Proc. 13th European
Signal Processing Conf., 2005.

[149] R. Ahmad, A. W. H. Khong, M. K. Hasan, and P. A. Naylor, “The extended normalized
multichannel FLMS algorithm for blind channel identification,” in Proc. European Signal
Processing Conf. (EUSIPCO), 2006.

[150] N. Gaubitch, M. K. Hasan, and P. Naylor, “Generalized optimal step-size for blind multichan-
nel LMS system identification,” IEEE Signal Processing Letters, vol. 13, pp. 624–627,
October 2006.

[151] N. Gaubitch, J. Benesty, and P. Naylor, “Adaptive common root estimation and the com-
mon zeros problem in blind channel estimation,” in Proc. European Signal Processing
Conf. (EUSIPCO), September 2005.

[152] R. Ahmad, N. Gaubitch, and P. Naylor, “A noise-robust dual filter approach to multichan-
nel blind system identification,” in Proc. European Signal Processing Conf. (EUSIPCO),
2007.

[153] R. Ahmad, A. W. H. Khong, and P. A. Naylor, “A practical adaptive blind multichannel
estimation algorithm with application to acoustic impulse responses,” Proc. IEEE Int.
Conf. Digital Signal Processing, 2007.

[154] D. Slock, A. Meraim, P. Duhamel, D. Lesbert, P. Loubaton, S. Mayrargue, and
E. Moulines, “Prediction error methods for time-domain blind identification of multi-
channel FIR filters,” Proc. of the International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), pp. 1968–1971, May 1995.

[155] M. Triki and T. Slock, “Iterated delay and predict equalization for blind speech derever-
beration,” IWAENC 2006, Paris, September 2006.

[156] M. Delcroix, T. Hikichi, and M. Miyoshi, “Dereverberation of speech signals based on
linear prediction,” in Proc. of the 8th International Conference on Spoken Language
Processing (ICSLP04), Jeju Island, Korea, vol. 2, pp. 877–881, October 2004.

[157] T. Yoshioka, T. Hikichi, and M. Miyoshi, “Second-order statistics based dereverberation
by using nonstationarity of speech,” IWAENC 2006, Paris, September 2006.

[158] B. Gillespie and L. Atlas, “Strategies for improving audible quality and speech recogni-
tion accuracy of reverberant speech,” Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 2003.

[159] M. Delcroix, T. Hikichi, and M. Miyoshi, “Blind dereverberation algorithm for speech
signals based on multi-channel linear prediction,” Acoustical Science and Technology,
vol. 26, pp. 432–439, October 2005.

[160] I. Ram, E. A. P. Habets, Y. Avargel, and I. Cohen, “Multi-microphone speech dereverberation
using LIME and least squares filtering,” in Proc. European Signal Processing Conf. (EU-
SIPCO), Aug. 2008.

[161] M. Delcroix, T. Hikichi, and M. Miyoshi, “Precise dereverberation using multichannel
linear prediction,” IEEE Trans. Audio, Speech, Language Processing, vol. 15, no. 2,
pp. 430–440, 2006.

[162] M. Delcroix, T. Hikichi, and M. Miyoshi, “On the use of LIME dereverberation algorithm
in an acoustic environment with a noise source,” Proc. of the International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 825–828, 2006.

[163] K. Furuya, “Noise reduction and dereverberation using correlation matrix based on the
multiple-input/output inverse-filtering theorem (MINT),” Proc. of International Work-
shop on Handsfree Speech Communication, vol. 15, no. 5, pp. 59–62, 2001.

[164] K. Furuya and A. Kataoka, “FFT-based fast conjugate gradient method for real-time
dereverberation system,” Electronics and Communications in Japan, vol. 90, no. 7, 2007.

[165] D. R. Brillinger, Time series: Data analysis and theory. New York: Holt, Rinehart and
Winston, 1975.

[166] C. R. Johnson, Jr., “Admissibility in blind adaptive channel equalization,” IEEE Contr.
Syst. Mag., pp. 3–15, January 1991.

[167] T. Nakatani and M. Miyoshi, “Blind dereverberation of single channel speech sig-
nal based on harmonic structure,” Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), vol. 1, pp. 92–95, 2003.

[168] T. Nakatani, B. Juang, K. Kinoshita, and M. Miyoshi, “Harmonicity based dereverbera-
tion with maximum a posteriori estimation,” Proc. of the IEEE Workshop on Applications
of Signal Processing to Audio and Acoustics (WASPAA05), vol. 1, pp. 94–97, 2005.

[169] K. Kinoshita, T. Nakatani, and M. Miyoshi, “Harmonicity based dereverberation
for improving automatic speech recognition performance and speech intelligibility,”
IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sciences,
vol. E88-A, no. 7, pp. 1724–1731, 2005.

[170] K. Kinoshita, T. Nakatani, and M. Miyoshi, “Fast estimation of a precise dereverberation
filter based on speech harmonicity,” Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), vol. 1, pp. 1073–1076, March 2005.

[171] D. Bees, M. Blostein, and P. Kabal, “Reverberant speech enhancement using cepstral
processing,” in Proc. of the International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), vol. 2, pp. 977–980, 1991.

[172] M. Tonelli, N. Mitianoudis, and M. E. Davies, “A maximum likelihood approach to blind
audio de-reverberation,” Proc. Digital Audio Effects Conference (DAFx’04), pp. 256–
261, 2004.

[173] M. Tonelli, M. Jafari, and M. E. Davies, “A multi-channel maximum likelihood approach
to de-reverberation,” in Proc. European Signal Processing Conf. (EUSIPCO), 2006.

[174] M. Tonelli and M. E. Davies, “A blind multichannel dereverberation algorithm based on
the natural gradient,” IWAENC, 2010.

[175] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley
& Sons, 2001.

[176] S. Douglas, H. Sawada, and S. Makino, “Natural gradient multichannel blind deconvo-
lution and speech separation using causal FIR filters,” IEEE Transactions on Speech and
Audio Processing, vol. 13, pp. 92–104, January 2005.
