Massimiliano Tonelli
Abstract
The process of recovering the source signal by removing the unwanted reverberation is called dereverberation. Usually, only a reverberated instance of the signal is available; as a consequence, only a blind approach, which is a more difficult task, is possible. In more precise terms, unsupervised or blind audio dereverberation is the problem of removing reverberation from an audio signal without explicit data regarding the system and the input signal. Different approaches have been proposed for blind dereverberation. A possible division into two classes can be made by considering whether or not the inverse acoustic system needs to be estimated.
The aim of this work is to investigate the problem of blind speech dereverberation, and in
particular of the methods based on the explicit estimate of the inverse acoustic system, known
as “reverberation cancellation techniques”. The following novel contributions are proposed:
the formulation of single and multichannel dereverberation algorithms based on a maximum
likelihood (ML) approach and on the natural gradient (NG); a new dereverberation structure that
improves the speech and reverberation model decoupling. Experimental results are provided to
confirm the capability of these algorithms to successfully dereverberate speech signals.
Declaration of originality
I hereby declare that the research recorded in this thesis and the thesis itself was composed and
originated entirely by myself in the Department of Electronics and Electrical Engineering at
The University of Edinburgh.
Massimiliano Tonelli
Acknowledgements
First and foremost, my supervisor, Prof. Michael E. Davies, for his constant support and close supervision of my project. Without him, this thesis would never have seen the light.
The Institute for Digital Communications, School of Engineering of The University of Edinburgh, for financial support.
Everyone in my family.
Contents
Declaration of originality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Acronyms and abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Why dereverberation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Applications of dereverberation . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Blind dereverberation approaches . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
References 138
List of figures
4.10 Illustration of the relationships between the input s(n) and the observations
x(n) in an M-channel SIMO system. . . . . . . . . . . . . . . . . . . . . . . . 78
4.11 Multichannel blind equalizer for a SIMO system. . . . . . . . . . . . . . . . . 85
4.12 2-channel Correlation shaping block diagram. . . . . . . . . . . . . . . . . . . 90
4.13 Signal flow of the method proposed by Furuya et al. . . . . . . . . . . . . . . . 93
4.14 Block diagram of the HERB algorithm . . . . . . . . . . . . . . . . . . . . . . 97
5.12 Results of the toy non-blind example described in section 5.4.3.1. Dereverbera-
tion performance with a time variant source filter: (a1) forward structure; (a2)
reversed structure. While the reversed structure can correctly dereverberate the
input signal, this does not happen for the forward structure. Dereverberation
performance with a time invariant source filter: (b1) forward structure; (b2)
reversed structure. There is no appreciable difference in the performance of the
two structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.13 Echogram of the original impulse response (above), DRR=-2.9 dB, and of the
equalized one obtained with the reversed structure (below), DRR=-0.7 dB . . . . 119
5.14 Proposed multi-channel dereverberation structure. . . . . . . . . . . . . . . . . 121
5.15 (a) Reference echogram relating to the shortest source-to-receiver path, DRR=-
2.9 dB. (b) Echogram of the equalized IR obtained by the 8-channel delay-sum
beamformer, DRR=-0.1 dB. (c) Echogram of the equalized IR obtained by the
proposed 8-channel dereverberator, DRR=3.1 dB. The proposed structure pro-
vides improved dereverberation with respect to the delay and sum beamformer. . 122
5.16 (a) Reference echogram relating to the shortest source-to-receiver path, DRR=-
12.9 dB. (b) Echogram of the equalized IR obtained by the 12-channel delay-
sum beamformer, DRR=-4.9 dB. (c) Echogram of the equalized IR obtained by
the 12-channel dereverberator (female3 speaker), DRR=-0.9 dB. (d) Echogram
of the equalized IR obtained by the 12-channel dereverberator (male1 speaker),
DRR=-2.9 dB. The proposed algorithm provides better dereverberation with
respect to the delay and sum beamformer. . . . . . . . . . . . . . . . . . . . . 124
A.1 (a) Diagram of the time domain dereverberation algorithm proposed by Gille-
spie and Malvar (forward structure). (b) Diagram of the proposed model (re-
versed structure). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
List of tables
Acronyms and abbreviations
Nomenclature
Notations
x scalar quantity
x vector quantity
X matrix quantity
x(n) discrete time signal
x(t) continuous time signal
Operators
Chapter 1
Introduction
The way sound waves reflect off various surfaces before reaching our ears is a fascinating phenomenon, usually taken for granted in our everyday life. Without reverberation, our perception of the world would be greatly affected. To give a few examples, we could hardly speak to someone who is not facing us directly, we would be less aware of the potential danger of an approaching car, and we would not appreciate the performance of any orchestra or musical instrument. Sounds would be lifeless. All the acoustic information conveyed by the surrounding space and objects would be missing. So why should we remove it? The reason is simple: for specific applications, reverberation can introduce detrimental modifications to the source signal.
In all the following cases, and surely in many others, dereverberation would be beneficial.
It is well known that reverberation can decrease speech intelligibility [1]. This is a particularly
severe problem for hearing impaired people [2]. Therefore, speech enhancement techniques
capable of improving the comprehensibility of speech are of interest for applications such as
hearing aids, forensic analysis, surveillance and recording restoration.
Dereverberation is also beneficial in hands-free communication and broadcasting applications (e.g. teleconferencing equipment), where it is desirable to broadcast a signal without the reverberation of the surrounding space.
The above-mentioned applications usually require estimating the original source signal by removing the reverberation components from the received signal(s), without knowledge of the surrounding acoustic environment. Therefore, this estimate must be performed “blindly”.
Even though several approaches have been proposed, blind dereverberation algorithms can be divided into two classes [4]: “reverberation suppression” and “reverberation cancellation” methods. Reverberation suppression methods attenuate the reverberant components without explicitly estimating the inverse acoustic system. Reverberation cancellation methods, on the other hand, are based on the explicit estimate of the inverse acoustic system: dereverberation is obtained by convolving the received signal with an estimate of the acoustic inverse filter.
Due to the spatial diversity and temporal instability that characterize acoustic impulse responses [6], the first class of algorithms has, so far, offered more effective results in practical conditions [7], [8]. However, the algorithms belonging to the second class can potentially lead to ideal performance [9].
At present, practical dereverberation is still largely an unsolved problem. Furthermore, research has focused on the dereverberation of speech signals, and no systematic extension to a larger class of acoustic signals exists.
The aim of this work is to investigate blind speech dereverberation, and in particular the “rever-
beration cancellation” techniques.
Chapter 2 is an overview of the physics and the modelling of room acoustics. It introduces some of the properties of room acoustics that are important to understand why particular models and methods are used in speech dereverberation.
Chapter 3 examines the principal non-blind techniques for room identification and equalization. The discussion highlights the complexity of the dereverberation problem even in the non-blind context.
Chapter 4 analyses the blind dereverberation problem and provides a wide survey of the existing literature, with a focus on “reverberation cancellation” techniques.
Chapter 5 proposes novel blind dereverberation algorithms based on single and multichannel “reverberation cancellation” methods. More specifically, single and multichannel dereverberation algorithms based on a maximum likelihood approach and on the natural gradient are formulated. A new dereverberation structure that improves the decoupling of the speech and reverberation models is also discussed. Experimental results are provided to confirm the capability of these algorithms to successfully dereverberate speech signals.
Finally, in Chapter 6, the results and contributions are summarised, and several directions for
future research are proposed.
1.5 Publications
Chapter 2
Room acoustics prerequisites
2.1 Reverberation
The process of reverberation starts with the production of sound at a location within a room.
The acoustic pressure wave expands radially, reaching walls and other surfaces where energy
is both absorbed and reflected. All reflected energy is reverberation.
2.1.1 Reflections
Let us consider an acoustic plane wave which strikes a perfectly flat and rigid surface of infinite
extension.
The direction of the impinging wave is characterized by an angle θ with respect to the wall normal, called the angle of incidence. According to the laws of optics, the reflected wave leaves the boundary at the same angle. Furthermore, the wave normals of both waves and the wall normal lie in the same plane [1]. When the wave impinges on the surface, part of the wave energy is reflected back and part is absorbed by the medium, either because it is dissipated by losses occurring within it, or because it is transmitted through it. The absorption coefficient α is a real value between 0, for a perfectly reflective surface, and 1, for a perfectly absorbent surface, and can be calculated by
α = 4ξcos(θ) / (|ζ|²cos²(θ) + 2ξcos(θ) + 1) (2.1)
where ζ is the specific acoustic impedance of the surface (the surface impedance normalized
by the characteristic impedance of air) and ξ is its real part [1].
Incidentally, values greater than 1 for the absorption coefficient can be found in the literature. This is caused by the testing methods and can be misleading. Any absorption coefficient listed as greater than 1 should be taken as 1 in any calculation or consideration [10].
The reflection coefficient r describes the fraction of incident energy that is reflected, so that

α + r = 1. (2.2)
Since only the energetic aspects are considered, the values of the absorption and reflection coefficients are real numbers. Therefore they do not take into account the frequency-dependent phase shift that occurs upon reflection. If those aspects have to be evaluated, a complex reflection factor for the surface must be used [1]
R(θ) = (ζcos(θ) − 1) / (ζcos(θ) + 1). (2.3)
The reflection coefficient r and the complex reflection factor R are related by
r = |R|². (2.4)
While the reflection coefficient r is related to energetic aspects (it describes the amount of
power that is reflected), the R coefficient describes the attenuation of the acoustic pressure
wave.
Reflection off non-uniform finite surfaces is a more complicated process, and it can be approximated by a specular reflection only when the surface dimensions are large relative to the wavelength λ of the sound being evaluated (i.e. > 4λ) [11]. The specular reflection hypothesis, which is the foundation of geometric acoustics, fails both when the wavelength is comparable to the unevenness of the surface, and when the wavelength is comparable to the room dimensions or to the dimensions of the objects placed inside the room. All models based on geometric acoustics ignore the wave nature of sound propagation, and therefore diffuse reflection at high frequencies and diffraction at low frequencies; they are thus potentially inaccurate.
In nature no perfectly reflective surfaces exist, thus the energy of the impinging wave will
decrease at every reflection. This will cause an intensity decay in the reflected energy. The
time in seconds required for the average sound-energy density to reduce to one-millionth (a
decrease of 60 decibels) of its initial steady-state value after the sound source has been stopped
is defined as the reverberation time (RT) or T60 [12]. Reverberation time can be measured by
exciting a room with a wide band (i.e. 20 Hz to 20 kHz) or narrow band signals (i.e. one octave,
1/3 octave, 1/6 octave, etc.) [13]. Very roughly, the T60 can be considered as the time required for a very loud sound to decay to inaudibility. This concept, introduced by W.C. Sabine at the turn of the 20th century, is still the most important characteristic used to evaluate the acoustical quality of a room. Sabine determined that the reverberation time is proportional to the volume of the room and inversely proportional to the amount of absorption
T60 ∝ V / A (2.5)
where V is the volume of the room and A is a measure of the total absorption of materials in
the room.
Sabine's relation was derived empirically. Its subsequent formal derivation showed that it holds only under the following hypotheses:
1. the energy decay is the same at all positions within the room;
2. the energy diffusion is perfect and no preferential direction for the reflections exists;
3. the discrete phenomenon of the energy impinging on the room walls can be modelled as a continuous one.
These hypotheses are approximately satisfied only by regular rooms that are neither too absorptive nor too large, with uniformly distributed absorption and with the excitation source positioned near the room barycentre. Under these conditions the space is “Sabinian” and the acoustic energy decay curve (EDC) has a linear dB decay over time. Usually the T60 is calculated by a line fit to the portion of the EDC between -5 and -35 dB.
In all other cases (i.e. coupled spaces, large auditoria) a unique T60 is not definable. The T60 therefore becomes a locally defined value depending on the source and receiver positions. The more the room departs from the Sabinian hypotheses, the larger the local variations of the T60.
Both the T60 and the EDT are linked only to the energetic aspects of reverberation and hide all the directional information connected to the room geometry.
For frequencies above 1 kHz and for very large rooms, the absorption of sound by the air in the space is not negligible [15]. The amount of sound that air absorbs increases with audio frequency and decreases with air density, and also depends on temperature and humidity. This, in conjunction with the fact that porous absorptive materials have higher absorption at higher frequencies, is why the treble reverberation time falls off faster. Typical values of air absorption are:

Frequency | Absorption (Sabins/m³)
2000 Hz   | 0.010
4000 Hz   | 0.024
8000 Hz   | 0.086
In a perfectly rectangular room with rigid walls, the acoustic wave equation

∇²p − (1/c²) ∂²p/∂t² = 0 (2.6)

where ∇² is the Laplace operator, p is the acoustic pressure and c the speed of sound, can be solved in closed form. This approach yields a solution based on the natural resonant frequencies of the room, called normal modes. The resonant frequencies are given by [16]:
fn = (c/2) √[(nx/Lx)² + (ny/Ly)² + (nz/Lz)²] (2.7)

where nx, ny, nz are non-negative integers and Lx, Ly, Lz are the room dimensions.
When a sound source is turned on in an enclosure, it excites one or more of the normal modes of the room. When the source is turned off, the modes continue to resonate their stored energy, each decaying at a separate rate determined by the mode's damping constant, which depends on the absorption of the room. This is entirely analogous to an electrical circuit containing many parallel resonances [17]. The room response is made up of the combined modes summed vectorially, with both magnitude and phase.

Figure 2.3: 3-D representation of the sound pressure distribution in a rectangular room for the tangential mode (3,2,0)
Depending on which of the integers nx, ny, nz are non-zero, three mode types are distinguished (a small numerical example follows the list):
• The mode is defined as axial if only one value is different from zero (e.g. nx = 1, ny = 0, nz = 0).
• The mode is defined as tangential if two values are different from zero (e.g. nx = 1, ny = 1, nz = 0).
• The mode is defined as oblique if all three values are different from zero (e.g. nx = 1, ny = 1, nz = 1).
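As a concrete illustration of equation 2.7 and of this classification, the following Python sketch computes and sorts the modal frequencies of a rectangular room; the room dimensions and the maximum mode order are illustrative assumptions, not values from the text.

import numpy as np
from itertools import product

# Illustrative room dimensions in metres and speed of sound in m/s.
Lx, Ly, Lz = 6.0, 4.5, 3.0
c = 343.0

def mode_frequencies(max_order=4):
    """Resonant frequencies of a rigid rectangular room, equation 2.7."""
    modes = []
    for nx, ny, nz in product(range(max_order + 1), repeat=3):
        if nx == ny == nz == 0:
            continue  # the (0,0,0) triple is not a mode
        f = (c / 2.0) * np.sqrt((nx / Lx) ** 2 + (ny / Ly) ** 2 + (nz / Lz) ** 2)
        kind = {1: "axial", 2: "tangential", 3: "oblique"}[sum(n > 0 for n in (nx, ny, nz))]
        modes.append((f, (nx, ny, nz), kind))
    return sorted(modes)

for f, (nx, ny, nz), kind in mode_frequencies(2)[:10]:
    print(f"{f:7.1f} Hz  ({nx},{ny},{nz})  {kind}")

Sorting the modes by frequency also makes it easy to check spacing criteria such as Gilford's 20 Hz rule or Bonello's 5% rule mentioned below.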
Axial modes are linked to the concept of standing waves. Assume two flat, solid parallel walls separated by a given distance, and a sound source between them radiating sound at a specific frequency. The wavefront striking the right wall is reflected back toward the source, strikes the left wall, where it is reflected back toward the right wall, and so on. One wave travels to the right, the other toward the left. Only the standing wave, the interaction of the two, is stationary [15].
Tangential and oblique modes are due to reflections upon more surfaces, and therefore suffer
more losses. As a consequence, they tend to be less intense than axial modes.
Within a cavity with perfectly rigid walls, every mode determines a pattern of pressure minima and maxima according to the equation

p(x, y, z) = cos(nxπx/Lx) cos(nyπy/Ly) cos(nzπz/Lz). (2.8)
Every mode exhibits a resonance curve as shown in Fig.2.4. The bandwidth of each mode can
be determined by the equation
∆f = 2.2 / T60 (2.9)
where T60 is the reverberation time measured by using a pure tone excitation. Therefore, the
more absorption, the shorter the reverberation time, and the wider the mode resonance. This
means that adjacent modes tend to overlap for rooms with short reverberation time.
For a room with good acoustics, a primary goal is to avoid coincidences of modes, in particular of the axial ones. In fact, coincident modal frequencies tend to overemphasize signal components at that frequency. Modes are directly linked to the room dimensions; mode overlap or excessive closeness, which can cause beating, is therefore minimized by properly choosing the room dimensions. A uniform distribution of axial modes is also important. Gilford [18] stated that, to avoid coloration, axial modes should not be separated by more than 20 Hz. Another criterion was suggested by Bonello [19], who considered all three types of modes, not axial modes alone. He states that, to provide a good modal distribution, it is desirable to have all modal frequencies in a critical band at least 5% of their frequency apart (e.g. 40 Hz and 41 Hz would not be acceptable).
Several optimal ratios for the room dimensions satisfying these criteria have been reported in the literature.
The behavior of an irregularly shaped room can also be described in terms of its normal modes,
even though a closed form solution may be impossible to achieve [1]. Splaying one or two
walls of a room does not eliminate modal problems, although it might shift them slightly and
provide somewhat better diffusion [21]. However, while the proportions of a rectangular room can be selected to eliminate, or at least greatly reduce, degeneracies, making the sound field asymmetrical by splaying walls only introduces unpredictability.
When a sufficient density of modes exists, an equal energy density at all points in the room can be assumed. As a consequence, there is an equal probability that sound will arrive from any direction. This condition, as mentioned before, is known as a diffuse field, and it can be described by a statistical model [1]. This statistical model for reverberation is justified for frequencies higher than the so-called Schroeder frequency [1]
fg = 2000 √(T60/V) (2.10)

where V is the volume of the room in m³ and T60 is in seconds.
The number of normal modes with resonant frequency below f is approximately

Nf = (4πV / 3c³) f³. (2.11)

Differentiating with respect to f gives

dNf/df = (4πV / c³) f² (2.12)

therefore the number of modes per unit bandwidth grows as the square of the frequency. Similarly, the temporal density of reflections arriving at a receiver grows as the square of time:

dNt/dt = (4πc³ / V) t². (2.13)
In considering the acoustics of small rooms, it is useful to consider the audio spectrum divided into four regions (a numerical sketch of the boundaries follows the list):
1. The region with upper bound given by c/2L, where c is the speed of sound and L the longest dimension of the room. In this frequency range there is no resonant support for sound in the room.
2. The region with upper bound given by the Schroeder frequency fg. In this frequency range the wavelength of the sound is comparable to the room dimensions; the room behavior is described by its modes, and wave acoustics must be used.
3. The region with upper bound given by 4fg. This is a transition region between region 2 and region 4.
4. The region in which the wavelengths are short enough for geometric acoustics to be valid.
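These boundaries are easily computed. A minimal sketch follows, assuming illustrative values for the room volume, longest dimension and reverberation time:

import numpy as np

# Hypothetical room: volume V (m^3), longest dimension L (m), reverberation time T60 (s).
V, L, T60 = 80.0, 6.0, 0.5
c = 343.0

f1 = c / (2 * L)              # below f1: no resonant support (region 1)
fg = 2000 * np.sqrt(T60 / V)  # Schroeder frequency, equation 2.10 (region 2 upper bound)
f3 = 4 * fg                   # transition between modal and geometric behaviour (region 3)

print(f"region 1: f < {f1:.1f} Hz")
print(f"region 2: {f1:.1f} Hz - {fg:.1f} Hz (wave acoustics)")
print(f"region 3: {fg:.1f} Hz - {f3:.1f} Hz (transition)")
print(f"region 4: f > {f3:.1f} Hz (geometric acoustics)")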
“Very small rooms, with too few modal resonances spaced too far apart, are characterized by the domination of a great stretch of the audible spectrum by modal resonances. This is the small listening room problem in a nutshell” [15].
The Z transform of the IR is the Room Transfer Function (RTF). Since reverberation is due to energy that decays below the noise floor in a finite amount of time, RTFs are usually described by a finite impulse response (FIR) filter, that is, an all-zero (MA) model. Recursive models are sometimes assumed (i.e. AR or ARMA models). In general, any RTF can be split into an FIR part and an IIR part [22]; the latter part is independent of the source and receiver positions, being given by the common resonances of the room. In fact, modes do not change when the source and receiver locations are modified. For low-frequency modes, the distance between pressure minima and maxima is wide. As a consequence, the room frequency responses measured at different positions usually exhibit common peaks at low frequencies. This is an intrinsic physical aspect of room acoustics and does not depend on the adopted model. These peaks can be modelled by common acoustic poles shared among the room transfer functions. This property has been employed to achieve multiple-point low-frequency equalization [23].
Even if it might be expected that a pole-zero representation would be much more economical than an all-zero one, in [24] it is shown that in general this is not the case. Furthermore, many system identification strategies provide an FIR model, while the process to calculate a recursive one is less straightforward [25]. Thus, the FIR model is often preferred.
Assuming that the IR of a specific source/listener path has been estimated, several parameters (i.e. T60, EDT, modal distribution, steady-state frequency response, intelligibility, clarity, etc.) can be estimated from it. This allows one to verify and diagnose possible problems in the acoustic quality of a room.
The energy decay curve, which describes how the energy of the IR decays, is defined by the Schroeder integral
EDC(t) = ∫_t^∞ h²(τ)dτ. (2.15)
The T60 and the EDT can be directly estimated from the EDC by a linear fit between -5 and -35 dB and between 0 and -10 dB, respectively. The T60 and the EDT are often also calculated for 1/3-octave ISO bands.
A time-frequency extension of Schroeder's Energy Decay Curve can be defined on the basis of a power density time-frequency representation ρh(t, f) obtained from h(t):

EDR(t, f) = ∫_t^∞ ρh(τ, f)dτ. (2.16)
This generalization of the EDC to multiple frequency bands has been formalized by Jot [28] as the energy decay relief, EDR(t, f), which is a time-frequency representation of the energy decay. EDR(0, f) gives the power gain as a function of frequency, and EDR(t, f0) gives the energy decay curve at frequency f0. This representation allows one to diagnose undesired unevenness in the time and frequency response of a room.
The D50 parameter (also known as “Definition”) is the early to total sound energy ratio. It is
defined as:
D50 = ∫₀^50ms h²(τ)dτ / ∫₀^∞ h²(τ)dτ. (2.17)
Expressed by values between 0 and 1, it is associated with the degree to which rapidly occurring individual sounds are distinguishable. Usually rooms designed for speech require D50 > 0.5, while rooms designed for music require D50 < 0.5.
The ts parameter (also known as “Centre Time”) is the time of the centre of gravity of the
squared impulse response. A high value is an indicator of poor clarity. It is defined as:
ts = ∫₀^∞ τ·h²(τ)dτ / ∫₀^∞ h²(τ)dτ (2.18)
and is expressed in ms. Usually rooms designed for speech require ts < 50 ms, while rooms designed for music require 50 ms < ts < 250 ms.
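The parameters defined above can all be estimated from a sampled impulse response. The following sketch illustrates this; the exponentially decaying noise IR is a toy stand-in for a measured response, and the sampling rate and decay constant are arbitrary choices for the example.

import numpy as np

def edc_db(h):
    """Schroeder integral, equation 2.15, in dB relative to the total energy."""
    e = np.cumsum(h[::-1] ** 2)[::-1]     # EDC(t) = integral from t to infinity of h^2
    return 10 * np.log10(e / e[0] + 1e-300)

def decay_time(h, fs, lo_db, hi_db, target=60.0):
    """Line fit to the EDC between lo_db and hi_db, extrapolated to -target dB."""
    edc = edc_db(h)
    idx = np.where((edc <= lo_db) & (edc >= hi_db))[0]
    t = idx / fs
    slope, _ = np.polyfit(t, edc[idx], 1)  # decay rate in dB per second (negative)
    return -target / slope

def d50_ts(h, fs):
    """Definition D50, equation 2.17, and centre time ts, equation 2.18."""
    e = h ** 2
    n50 = int(0.050 * fs)
    d50 = e[:n50].sum() / e.sum()
    ts = (np.arange(len(h)) / fs * e).sum() / e.sum()
    return d50, ts * 1000.0                # ts in ms

# Toy exponentially decaying IR, only for illustration.
fs = 16000
n = np.arange(fs)
h = np.random.default_rng(0).standard_normal(fs) * np.exp(-n / (0.1 * fs))
print("T60 ~", decay_time(h, fs, -5, -35))    # fit between -5 and -35 dB
print("EDT ~", decay_time(h, fs, 0, -10))     # fit between 0 and -10 dB
print("D50, ts:", d50_ts(h, fs))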
Beyond the many other descriptive acoustic parameters that can be defined (i.e. Loudness, Intimacy, Warmth, IACC, ST1, etc.), some general factors are of primary importance for good acoustics:
• an adequate reverberation time for the room functionality (i.e. classroom, conference hall, theatre, church, etc.);
• uniformity in the frequency response (avoiding large peaks or dips), which is linked to an adequate modal density;
• uniformity in the time response (avoiding excessive comb filtering or echo), which is linked to an adequate time density.
One of the main needs of the acoustician is to be able to predict the acoustics of a room prior to construction. This allows one to correct the design of a room and evaluate many options and surface treatments. Computer simulations are now possible, and several different approaches have been proposed. The main strategies available are based on geometric acoustics or on the direct numerical solution of the acoustic wave equation.
Being computationally less intensive, the techniques based on geometric acoustics are currently much more widespread. However, they can be inaccurate at low frequencies, where the modal behavior prevails. The most popular methods belonging to this family are ray tracing [29] and the mirror image source method (MISM) [30]. Other derived methods, such as conical and pyramidal tracing, are less common [31].
Ray tracing is based on the following assumptions:
• the propagation of sound energy takes place along linear trajectories;
• starting from the source, the propagation takes place in all directions and follows the laws of geometric acoustics;
• the geometric divergence of the rays accounts for the geometric divergence of the energy;
• rays passing in proximity of the receiver are summed together, neglecting phase.
Ray tracing is computationally efficient, can be used for rooms of arbitrary shape, and can easily model diffusion and refraction and, less easily, diffraction. On the down side, the choice of the number of rays to employ and of the size of the receiver is somewhat arbitrary and critical.
The MISM method is based on the idea that a specular reflection from a flat surface is equivalent
to the direct field of an identical virtual source placed symmetrically on the opposite side of the
surface. This, as shown in Fig. 2.11, can be simply extended to higher-order reflections by treating the virtual source of the previous reflection as a real source.
In the MISM, too, the ray energy decreases due to geometric divergence and to surface and air absorption.
MISM is therefore theoretically more solid than ray tracing. However, it becomes extremely inefficient for rooms of arbitrary shape, and it cannot model refraction and diffraction. To address these drawbacks, hybrid methods based on MISM and ray tracing have been proposed [32].
Figure 2.12: (a) IR from an Allen and Berkley type algorithm; (b) enhanced IR obtained using a negative reflection coefficient; (c) measured IR
The most well-known paper about the MISM method is [30] by Allen and Berkley. The proposed algorithm is often used to create synthetic IRs for the evaluation of blind separation, deconvolution and dereverberation methods. One of its assumptions is a positive value for the pressure reflection coefficient. This choice produces a non-negative IR that neglects the phase modifications caused at every reflection. To remove the DC component, the authors suggest high-pass filtering the obtained impulse response. Even after this heuristic post-processing, as shown in Fig. 2.12, the IR is still heavily biased and very dissimilar to a natural acoustic IR. Possible improvements to the Allen and Berkley algorithm have been discussed in a recent paper [33] (2008), where the use of a negative value for the pressure reflection coefficient is suggested, but without a strong physical justification for the choice.
The physical reason for this improvement can be easily explained. The power absorption and reflection coefficients α and r and the pressure reflection coefficient R (called β in [30]) are linked by the following equation:

α + r = α + R² = 1 (2.19)

therefore two possible values of R can be obtained from the power reflection coefficient r:

R = ±√r. (2.20)
This ambiguity is not discussed in [30], and positive pressure reflection coefficients are chosen without comment. However, if a normally incident plane wave is considered, the incident and reflected pressure waves, respectively pi and pr, are related by the equation [1]

pr / pi = (z2 − z1) / (z2 + z1) (2.21)

where z2 is the acoustic impedance of the medium carrying the incident and reflected pressure waves and z1 the impedance of the reflective material (for a plane wave, the acoustic impedance of the medium is a real value [1]). Since z1 ≫ z2 for a rigid wall, the incident and the reflected waves have opposite phase. Therefore a negative coefficient can better model the reflection caused by the jump of impedance from air to a rigid wall.
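A small numerical illustration of equations 2.21, 2.4 and 2.19 follows, using the sign convention adopted above; the wall impedance is only an order-of-magnitude assumption.

# Pressure reflection of a normally incident plane wave, equation 2.21.
z_air = 415.0    # characteristic impedance of air, kg/(m^2 s)
z_wall = 8.0e6   # a dense, rigid wall (illustrative order of magnitude)

# Convention used in the text: z2 is the medium carrying the waves, z1 the reflector.
R = (z_air - z_wall) / (z_air + z_wall)
print(R)         # close to -1: the reflected wave has opposite phase

r = R ** 2       # power reflection coefficient, equation 2.4
alpha = 1 - r    # absorption coefficient, equation 2.19
print(r, alpha)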
A better solution, though computationally intensive and difficult to realize due to the lack of data about material properties, would be to consider the complex reflection coefficient of the material and to model the surface as a filter with a precise phase and magnitude response.
Sometimes the MISM problem is formulated in terms of energetic aspects, by using the intensity instead of the pressure level. This can generate confusion about the MISM formulations and about which reflection coefficient should be used (r calculated from the absorption coefficient α, or the complex-valued coefficient R obtained from the material impedance). This confusion can be avoided by considering the fact that a microphone is a pressure transducer, not an intensity one. Another assumption made in the Allen and Berkley paper is the independence of the reflection coefficient from the angle of incidence. Even if this is not stated, it implies that the surface is locally reacting, i.e. that the motion at one point of the surface is independent of the motion at other points. This approximation, however, applies only to some materials (e.g. a porous surface) [16].
These observations suggest that simulations performed with synthetic IRs should be evaluated with care before generalizing the results to a real room. As an example, room IR estimation techniques based on a non-negative impulse response hypothesis have even been proposed [34]. This is not meaningful from a physical point of view: real IRs take both positive and negative values. The erroneous non-negativity assumption has probably been influenced by an uncritical use of the Allen and Berkley MISM method.
Room simulation can be realized by the numerical solution of the acoustic wave equation:
∇²p − (1/c²) ∂²p/∂t² = 0 (2.23)
where ∇² is the Laplace operator, p is the acoustic pressure and c the speed of sound. The advantage of this approach is that it gives a correct solution also when the wave nature of sound cannot be neglected (i.e. when the wavelength is comparable to the room dimensions). The main drawback, however, is efficiency. To ensure a minimum of accuracy, the discretization step of the acoustic space under evaluation must be no larger than 1/8 of the shortest wavelength of interest [35]. Therefore it is common to restrict the analysis to the low-frequency range or to small spaces. In fact, the extension to higher frequencies implies the evaluation of billions of variables, with a huge computational power demand. Furthermore, the boundary conditions must be specified in terms of the complex acoustic impedance of the materials, and these data are still largely unavailable.
The most famous method is the finite element method (FEM) [36], which can be employed for irregular geometries and with variable boundary conditions. The FEM depends neither on the system geometry nor on the medium properties; therefore, complex systems composed of different media can be considered. It can be shown that the method converges to the exact solution when the number of elements is increased. In FEM calculations, the acoustic field itself is approximated directly over each element.
A different approach is based on the finite difference time domain (FDTD) method [38], whose solution procedure is quite different from that of FEM. Instead of approximating the fields directly, the derivatives of the fields are approximated as the difference of the field values at adjacent locations divided by the distance between these locations. The acoustic wave equation is thus transformed into a set of difference equations, valid at a number of points, usually located on a rectangular mesh, or grid. The main advantages of the FDTD method are that a single calculation is sufficient to study a wide frequency band, and that the time-domain behavior of the reflected sound can be directly inspected [39].
A promising technology for room acoustic simulation is the digital waveguide mesh (DWM) [40]. DWM algorithms are a subset of the wider family of finite difference time domain (FDTD) numerical approximations. These methods have been successfully used in room acoustic prediction and can provide better accuracy at low frequencies than methods based on geometric acoustics [41].
Chapter 3
Impulse response identification and
equalization. Input-output techniques
3.1 IR identification
Theoretically, modeling the room as a linear time invariant (LTI) system, its IR can be obtained by generating a Dirac δ at the position of the source and sampling the signal at the position of the listener. However, while in the digital domain it is easy to obtain an impulse, a δ is a pure mathematical abstraction in the physical domain. In practice, the measurement can be performed using pseudo-impulsive signals (e.g. the pop of an exploding balloon), but only a rough approximation of the real IR is obtained. In fact, the lack of uniformity in the energy distribution of the pseudo-impulsive source causes a distortion of the measured IR and, potentially, a low signal-to-noise ratio at certain frequencies. On the other hand, pseudo-impulsive methods are very simple, and only a minimum amount of equipment is necessary to implement them.
A better estimate can be obtained using the cross-correlation between the source and the measured signal. Consider an LTI system with IR h(n), excited by a stationary white noise input x(n) with unit variance, and let y(n) = h(n) ∗ x(n) be the measured output. The autocorrelation of the input is rxx(m) = E{x(n) · x(n + m)} = δ(m), and the input-output cross-correlation is rxy(m) = E{x(n) · y(n + m)} = (rxx ∗ h)(m), where E{.} and ∗ are respectively the statistical expectation and the convolution operators, and thus

rxy(m) = h(m). (3.5)
Therefore, the impulse response can be calculated directly as the cross-correlation between the white noise emitted by the source and the signal measured at the listener position. In principle, the expected value must be computed by averaging x(n) · x(n + l) and x(n) · y(n + l) over all realizations of the stochastic processes x and y. In practice, an estimate can be obtained by averaging a finite number of realizations. This is called an “ensemble average” across realizations of a stochastic process. If the signals are stationary (which primarily means their statistics are time-invariant), then it is possible to average across time to estimate the expected value. In other words, for stationary noise-like signals, time averages equal ensemble averages [42].
Therefore the autocorrelation and the cross-correlation functions can be estimated by [42]

r̂xx(m) = (1/N) Σ_{n=0}^{N−1} x(n) · x(n + m) (3.6)

r̂xy(m) = (1/N) Σ_{n=0}^{N−1} x(n) · y(n + m). (3.7)

The above definitions are only valid for stationary stochastic processes [42].
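The following Python sketch illustrates this identification scheme on a toy IR; the IR, its decay constant and the signal length are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(256) * np.exp(-np.arange(256) / 40.0)  # toy room IR

# Unit-variance white noise excitation and measured response.
N = 200000
x = rng.standard_normal(N)
y = np.convolve(x, h)[:N]

# Time-averaged cross-correlation estimate, equation 3.7: r_xy(m) ~ h(m).
M = len(h)
h_hat = np.array([np.mean(x[: N - m] * y[m:N]) for m in range(M)])
print(np.max(np.abs(h_hat - h)))  # small estimation error, shrinking as N grows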
The IR identification approach known as the Maximum Length Sequence (MLS) method [43] is based on the previous observation. The MLS method can offer better linearity and a better signal-to-noise ratio with respect to pseudo-impulsive approaches. The principal disadvantage of this technique is its strong dependence on the linearity of the measurement system: spurious echoes and phase problems can appear even with small non-linearities in the measuring chain.
An alternative approach, which assures greater noise immunity and improved robustness against mild time variations of the system and against the non-linearities present in the measurement chain, is based on the log swept-sine technique [44]. The swept-sine method uses a known sequence x(n) and a proper inverse filter f(n), so that

x(n) ∗ f(n) = δ(n) (3.8)

thus the unknown IR h(n) can be calculated by convolving the measured output signal y(n) with the inverse filter f(n):

h(n) = y(n) ∗ f(n). (3.9)
Using as input a log swept-sine (a sweep with frequency increasing at a logarithmic rate),

x(t) = sin[ (ω1·T / ln(ω2/ω1)) · (e^{(t/T)·ln(ω2/ω1)} − 1) ] (3.10)

where ω1 and ω2 are the starting and ending frequencies and T is the sweep duration, the inverse filter can be calculated in closed form: it is created by time-reversing the excitation signal x(t) and then applying an amplitude envelope that reduces the level by 6 dB/octave, starting from 0 dB and ending at −6 log₂(ω2/ω1) dB [44].
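A sketch of the sweep of equation 3.10 and of its inverse filter follows; for a log sweep the −6 dB/octave envelope corresponds to a gain falling linearly in dB along the time axis of the reversed sweep. The sampling rate, duration and band edges are illustrative choices.

import numpy as np

fs = 48000
T = 5.0                      # sweep duration in seconds (illustrative)
f1, f2 = 20.0, 20000.0
w1, w2 = 2 * np.pi * f1, 2 * np.pi * f2
t = np.arange(int(T * fs)) / fs
L = np.log(w2 / w1)

# Log swept sine, equation 3.10.
x = np.sin(w1 * T / L * (np.exp(t / T * L) - 1.0))

# Inverse filter: time-reversed sweep with an envelope falling from 0 dB
# to -6*log2(w2/w1) dB over its duration (the -6 dB/octave gain).
env_db = -6.0 * np.log2(w2 / w1) * (t / T)
f = x[::-1] * 10 ** (env_db / 20.0)

# For a measured response y, the IR would be h = y * f (equation 3.9);
# here x * f approximates a band-limited delta delayed by the sweep length.
d = np.convolve(x, f)
print(np.argmax(np.abs(d)) / fs)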
If the linear system is not time invariant (e.g. a slowly moving source within a room), and the system inputs and outputs are known, an adaptive filter can be used to identify and track its IR. The adaptive filter identifies the unknown system by minimizing the error, according to a chosen criterion, between the system output and the filter output.
One of the most popular adaptive filter algorithms is the least mean square (LMS) algorithm. “The LMS algorithm is a linear adaptive filtering algorithm, which consists of two basic processes:
1. A filtering process, which involves computing the output of a linear filter in response
to an input signal and generating an estimation error by comparing this output with a
desired response.
2. An adaptive process, which involves the automatic adjustment of the parameters of the
filter in accordance with the estimation error.” [45].
For system identification, the LMS filter output and coefficient update can be written as

y(n) = hₙᵀ x(n)
e(n) = d(n) − y(n)
hₙ₊₁ = hₙ + µ e(n) x(n)

where x is the system/filter input, y is the filter output, d the system output, µ the adaptation step, x(n) = [x(n), x(n − 1), ..., x(n − p)]ᵀ and hₙ = [hₙ(0), hₙ(1), ..., hₙ(p)]ᵀ is the vector of the FIR filter coefficients at time n.
The main ideas behind the LMS algorithm applied to system identification are the following:

a) a cost function measures the error in terms of the power of the error signal, J(n) = E{e²(n)};

b) the minimum of J is sought by moving in the direction of steepest descent, using a proper step size µ:

hₙ₊₁ = hₙ − (µ/2) ∇J(n)

where ∇J(n) is the gradient of the cost function (which is a function of the estimated IR at time n);

c) since the gradient usually cannot be calculated, it must be estimated from the available data. Replacing E{e²(n)} with the instantaneous value e²(n) gives the estimate ∇̂J(n) = −2 e(n) x(n), and therefore the update hₙ₊₁ = hₙ + µ e(n) x(n) given above.
It is important to point out that the LMS does not converge to the true minimum, but only in expectation; hence the estimate of a time-varying system will be noisy. Several other adaptive algorithms with faster convergence properties exist [45]. However, the LMS is, for its simplicity, the most widely used.
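A minimal sketch of LMS system identification along these lines, using a toy FIR system; the step size, filter order and signal length are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
p = 64                                 # filter order
h_true = rng.standard_normal(p) * np.exp(-np.arange(p) / 10.0)

mu = 0.01                              # adaptation step
h = np.zeros(p)                        # adaptive filter coefficients
x = rng.standard_normal(50000)         # system/filter input
d = np.convolve(x, h_true)[: len(x)]   # system output (desired response)

for n in range(p, len(x)):
    xn = x[n - p + 1 : n + 1][::-1]    # [x(n), x(n-1), ..., x(n-p+1)]
    y = h @ xn                         # filtering process
    e = d[n] - y                       # estimation error
    h = h + mu * e * xn                # adaptive process
print(np.linalg.norm(h - h_true) / np.linalg.norm(h_true))  # relative misadjustment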
Reverberation may degrade speech intelligibility and, more generally, the quality of sound and music [1]. This is common for rooms that have not been designed for such functions. In this case, the recommended approach is to improve the acoustics of the space by modifying its physical properties. However, this might not be possible due to functional and economic constraints.
If the speaker-to-receiver impulse response(s) is/are known, an estimate of the source signal
can be obtained from its reverberated instance(s) by the following techniques.
If the acoustic path (the channel) is modelled as a linear time-invariant system characterized by an impulse response (IR) h(n), the source signal s(n) and the reverberant signal x(n) are linked by the equation

x(n) = h(n) ∗ s(n) (3.21)

where ∗ denotes the discrete linear convolution. Dereverberation is achieved by finding a filter with impulse response w(n) so that

h(n) ∗ w(n) = δ(n − τ) (3.22)

where w(n) is defined as the inverse filter, or the equalizer, of h(n), δ(n) is the unit sample sequence and τ a delay [46].
The Zero-Forcing (ZF) equalizer applies the inverse of the channel to the received signal, to
restore the signal before the channel. The inverse of the channel, W (ω), is
W(ω) = 1 / H(ω) (3.24)
where W (ω) and H(ω) are the Fourier transforms of w and h at frequency ω.
A condition for the existence of the ZF equalizer is that H(ω) has no spectral nulls (i.e. the system function H(z) has no zeros on the unit circle). This affects the robustness of the ZF approach in noisy environments, even for transfer function zeros merely close to the unit circle [47]: the inversion would induce an extremely high gain that amplifies the noise (i.e. ringing). With noise present, perfect source recovery becomes impossible.
A possible solution, among several [48], is to use a regularization scheme [49] or Wiener deconvolution [50]. In general, it is desirable to flatten the frequency response, but not at the expense of boosting dips or notches to the point where the boost causes amplifier and loudspeaker overload, or massive amounts of the boosted frequency at other listening positions. To overcome this problem, one should measure the original room frequency response and equalize not the actual measured response, but a “regularised” version of the room response which, in some manner, has filled in the deep troughs of the measured response [51].
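A simple way to realize such a regularized inversion is sketched below: the inverse gain is bounded by a constant regularization term β. The value of β (and its possible frequency dependence) is an assumption of the example, not a prescription from the cited schemes.

import numpy as np

rng = np.random.default_rng(2)
h = rng.standard_normal(512) * np.exp(-np.arange(512) / 60.0)  # toy room IR

nfft = 4096
H = np.fft.rfft(h, nfft)

# Regularized inverse: W = conj(H) / (|H|^2 + beta). The constant beta bounds
# the gain at spectral dips instead of boosting them without limit.
beta = 1e-2 * np.max(np.abs(H)) ** 2
W = np.conj(H) / (np.abs(H) ** 2 + beta)
w = np.fft.irfft(W, nfft)

eq = np.convolve(h, w)   # equalized response: close to a delta, dips not fully boosted
print(np.abs(eq).argmax())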
In Wiener deconvolution, the “closeness” between s(n) and y(n) is measured by the Mean-Squared Error (MSE)

E{[s(n) − y(n)]²}. (3.25)
Choosing the equalizer transfer function to minimize 3.25 results in the Minimum Mean-Squared Error (MMSE) equalizer, most easily described in the frequency domain:

W(ω) = (1 / H(ω)) · Ps(ω) / (Ps(ω) + Pv(ω)) (3.26)

where Ps(ω) is the power spectrum of the source signal s(n) and Pv(ω) the power spectrum of the noise v(n).
Equation 3.26 gives the most general, usually non-causal, MMSE equalizer. A practical linear equalizer, on the other hand, must be stable and causal. This can be obtained by designing an FIR filter, therefore stable by construction, that minimizes the following MSE [52]

E{[d(n) − y(n)]²} (3.27)

where the desired output d(n) = s(n − τ), with τ ≥ 0, is a delayed version of the source signal s(n). This leads to the Wiener-Hopf equation [50]

w = Rx⁻¹ rdx (3.28)

where w = [w(0), ..., w(L − 1)]ᵀ is the Lth-order inverse filter, Rx = E{xₙxₙᵀ} is the autocorrelation matrix of the received signal and rdx = E{d(n)xₙ} is the cross-correlation vector between the target and the received signals.
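The following sketch estimates the FIR MMSE equalizer of equation 3.28 from sample correlation estimates; a white source, the modelling delay and the filter length are simplifying assumptions of the example.

import numpy as np
from scipy.linalg import solve_toeplitz

rng = np.random.default_rng(3)
h = rng.standard_normal(64) * np.exp(-np.arange(64) / 15.0)  # toy channel IR

s = rng.standard_normal(100000)          # white source, for illustration
x = np.convolve(s, h)[: len(s)]          # received signal
tau, L = 32, 256                         # modelling delay and filter length

# Sample estimates of R_x (Toeplitz autocorrelation) and r_dx, with d(n) = s(n - tau).
r_xx = np.array([np.mean(x[: len(x) - m] * x[m:]) for m in range(L)])
d = np.concatenate([np.zeros(tau), s[: len(s) - tau]])
r_dx = np.array([np.mean(d[m:] * x[: len(x) - m]) for m in range(L)])

w = solve_toeplitz(r_xx, r_dx)           # Wiener-Hopf solution, equation 3.28
eq = np.convolve(h, w)
print(eq.argmax(), eq.max())             # main tap near n = tau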
When the speaker-to-receiver impulse response h(n) is non-minimum phase (i.e. it has zeros outside the unit circle), the calculation of its inverse filter is problematic. In fact, the inverse of a non-minimum phase FIR system is an unstable IIR filter. The acoustic signal-transmission channel is, except in almost anechoic rooms or very close to the sound source in normal rooms, a non-minimum phase function [53]. A possible solution is to consider a truncated FIR approximation of the inverse IIR filter, which is by definition always stable. However, truncation can lead to inaccuracies. Therefore, other techniques to invert a non-minimum phase single-channel system have been investigated.
Homomorphic techniques [54] are based on the decomposition of the IR into minimum-phase and maximum-phase (all zeros outside the unit circle) components prior to inversion. However, performing this decomposition is not trivial and therefore, in general, the calculated inverse filter yields poor results.
The least-square inversion method, proposed in [55] by Mourjopoulos, relies on the minimization of

J(τ) = Σ_j (δ(j − τ) − f(j))² (3.29)

with f(j) = Σ_i w(i)·h(j − i), where h is the time-invariant impulse response, w = [w(0), ..., w(L − 1)]ᵀ its least-square inverse filter of length L, and τ a delay.
MMSE tends to spread the error out in time, causing long non-zero tails (i.e. longer reverberation) in the equalized impulse response. To address this problem, a weighted least-square inversion method that forces the system to equalize the long tails, at the expense of the energy near the main tap, was proposed in [56] by Gillespie et al.
Both the homomorphic and least-square approaches highlight that an acausal inverse filter is necessary to compensate for the non-minimum phase component of the system. Such acausal filters can generate perceptually disturbing “pre-echoes” due to any inversion error (i.e. an imperfect match between response and inverse filter) [57].
Even if the least-square approach can in theory give very good results, in real-time situations it offers marginal improvement [58]. In fact, the performance deteriorates dramatically when a filter designed for one position within the room is employed for equalization at a different position. The performance also deteriorates for longer filters. This is due to the fact that Room Transfer Functions (RTF) vary dramatically from point to point within the same enclosure: a displacement of just a few tenths of the acoustic wavelength can cause large degradations in the equalized room response [59]. Thermal variations can also induce substantial non-stationarity. Therefore, even a small mismatch between the measured response and the inverse filter can introduce measurable errors larger than those removed by the dereverberation technique [59]. At points in the room other than that of the measurement microphone, the equalizer and the room response do not completely cancel each other out, so that a “pre-echo” or pre-response can be heard before the main sound arrives. Such pre-responses sound highly unnatural and are very audible [60], [61].
These observations led Hatziantoniou and Mourjopoulos to consider it unrealistic that ideal equalization filters could be designed for all source/receiver positions. They also observed that the direct inversion of the mixed-phase room response can suffer from perceptually disturbing time/frequency domain artifacts related to pre-echoes and to ringing poles compensating for the original spectral dips. They therefore proposed a room equalization method, called “Complex Smoothing”, based on a frequency-domain regularization scheme of the measured room responses, which generates an inverse filter of reduced spatial, spectral and time complexity, but still capable of decreasing the phase, transient and spectral distortions introduced by room acoustics [62], [57]. The Complex Smoothed room discrete-frequency response Hcs(ω) is transformed into a corresponding smoothed room impulse response hcs(n). Then, an inverse filter wcs(n) is evaluated, which inverts the Complex Smoothed response, i.e.

hcs(n) ∗ wcs(n) ≈ δ(n − τ). (3.30)
This approach does not achieve a theoretically-perfect room acoustics deconvolution, but re-
alizes a good compromise between reduction of perceived reverberation effects and the intro-
duction of inaudible processing artifacts. In the time domain the equalized response has more
power shaped in the direct and early reflection path sounds and less power allocated in some of
the reverberant components. In the frequency domain, the equalization procedure corrects gross
spectral effects, without attempting to compensate for many of the original narrow-bandwidth
spectral dips. Such improvements are achieved to the benefit of reproduction in other positions
within the same enclosure [58]. In fact, since the inverse filter is designed to have progressively reduced frequency resolution from low to high frequencies, it can compensate for the full-range audio spectrum while modeling the low-frequency modes of the room, which are independent of position.
Another approach to multiple-point equalization, which cannot recover the frequency response dips of the multiple room transfer functions but can suppress their common peaks due to resonances, was proposed by Haneda et al. [23]. The proposed equalization scheme is based on the equalization of common acoustic poles and employs an IIR model for the RTF. The inverse filter, which is a causal FIR filter, achieves equalization without the pre-echo problem. This method is only useful for low-frequency equalization. However, since it is quite easy to achieve reverberation reduction at high frequencies by using foam acoustic absorbers, it can have practical applications when the modal response of a room needs to be improved. A blind method to calculate the common acoustic poles has been proposed by Hikichi et al. [63]. In theory, this can be used to build a self-adapting equalizer that tracks changing room conditions.
The dereverberation problem can be generalized to an arbitrary N-channel system, where reverberated instances xi(n) of the source signal s(n) are acquired at N different positions within a room. This leads to the following set of relations

xi(n) = hi(n) ∗ s(n), i = 1, ..., N (3.31)

and dereverberation is achieved by finding a set of filters with impulse responses wi(n) so that
δ(n − τ) = Σ_{i=1}^{N} hi(n) ∗ wi(n) (3.32)
where xi (n), hi (n), wi (n) are respectively the i-th observation, transfer function and the in-
verse filter of the corresponding source-to-receiver channel and τ a delay. The previous equa-
tion can be written in the Z domain as

z^{−τ} = Σ_{i=1}^{N} Hi(z)Wi(z). (3.33)

The source estimate is then obtained by filtering and summing the observations:

ŝ(n) = y(n) = Σ_{i=1}^{N} xi(n) ∗ wi(n). (3.34)
FIR inverse filters wi(n) exist only if the channel transfer functions Hi(z), i = 1, ..., N, have no common zeros, or in other words, only if they are coprime. This is what is meant by “channel diversity”. Coprimeness is strictly related to the Bézout identity [64]. Bézout's theorem for polynomials states that if P and Q are two polynomials with no roots in common, then there exist two other polynomials A and B such that AP + BQ = 1.
Consider the multichannel model shown in Fig. 3.2. If the subchannels are not coprime, their common zeros act as a single-channel factor that cannot be equalized by FIR filters.
The set of filters that satisfy equation 3.32 is not unique. A possible solution is provided by
the MUltiple-input/output INverse Theorem (MINT), proposed in [9]. The MINT allows one
to calculate a set of FIR filters, wi (n), when hi (n) are FIR with no common zeros.
The filters wi (n) can be calculated from the following equations [9]:
H w = b (3.36)

where
Hi = ⎡ hi(0)   0       ···   0     ⎤
     ⎢ hi(1)   hi(0)   ⋱     ⋮     ⎥
     ⎢ ⋮       ⋮       ⋱     0     ⎥
     ⎢ hi(J)   ⋮             hi(0) ⎥ (3.38)
     ⎢ 0       hi(J)         hi(1) ⎥
     ⎢ ⋮       ⋱       ⋱     ⋮     ⎥
     ⎣ 0       ···     0     hi(J) ⎦

is a (J + M) × M convolution matrix, J is the length of the impulse response and M the length of the inverse filter; H = [H1, H2, ..., HN] (3.37), w = [w1ᵀ, ..., wNᵀ]ᵀ with wi = [wi(0), ..., wi(M − 1)]ᵀ (3.39), and b = [0, ..., 0, 1, 0, ..., 0]ᵀ (3.40), whose single non-zero entry at position τ sets the delay in equation 3.32.
The inverse filters are then obtained as

w = H⁺ b (3.41)

where H⁺ is the pseudo-inverse of the matrix H. Any generalized inverse can also be used.
For an N-channel system with J-tap-long impulse responses, the minimum number of taps M required in each filter is obtained by setting M so that the matrix H is square, i.e. J + M = N M, which leads to

M = ⌈J/(N − 1)⌉. (3.42)

The filter length can also be set at M > ⌈J/(N − 1)⌉ [66], where ⌈x⌉ is the smallest integer not less than x.
Figure 3.3: The total number of filter taps in the equalizer can be reduced by increasing the number of microphones N. The figure plots equation 3.43 for a 1000-tap impulse response when the number of channels is increased from 2 to 251. The non-monotonic decrease of the total tap number is due to the fact that the channel filter length M can only assume integer values (as stated by equation 3.42).

For a given impulse response of length J, by increasing the number of microphones N, the inverse filter length M can be reduced. As a consequence, as shown in Fig. 3.3, the total number of filter taps in the equalizer

Taps = M · N (3.43)

is reduced. In other words, the use of more microphones yields lower computational demand and lower memory requirements. Thus multi-channel structures can potentially exploit these properties to provide more efficient dereverberation.
On the other hand, in [66] Hikichi et al. show that it is desirable to reduce the inverse filter norm in order to reduce the sensitivity to RTF variations caused by source position changes. This can be achieved by lengthening the filter and by setting the arbitrary delay τ in equation 3.40 to positive values, so that the causality constraint is relaxed. Therefore, it seems advisable to use a filter longer than the length suggested by 3.42.
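A compact numerical sketch of the MINT for a two-channel system follows; the random exponentially decaying IRs are a toy stand-in for real room responses (two random channels are almost surely coprime).

import numpy as np

def conv_matrix(h, M):
    """(J+M) x M convolution matrix of equation 3.38, with J = len(h) - 1."""
    J = len(h) - 1
    H = np.zeros((J + M, M))
    for j in range(M):
        H[j : j + J + 1, j] = h
    return H

rng = np.random.default_rng(4)
J, N = 200, 2                                  # IR order and number of channels
hs = [rng.standard_normal(J + 1) * np.exp(-np.arange(J + 1) / 30.0) for _ in range(N)]

M = int(np.ceil(J / (N - 1)))                  # minimum filter length, equation 3.42
H = np.hstack([conv_matrix(h, M) for h in hs]) # H = [H_1 H_2], equation 3.37
b = np.zeros(J + M); b[0] = 1.0                # target delta, tau = 0

w = np.linalg.lstsq(H, b, rcond=None)[0]       # w = H^+ b, equation 3.41
w1, w2 = w[:M], w[M:]

eq = np.convolve(hs[0], w1) + np.convolve(hs[1], w2)
print(np.abs(eq[0]), np.max(np.abs(eq[1:])))   # ~1 at n = 0, ~0 elsewhere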
The MINT offers a solution to the instability issue associated with the inversion of non-minimum phase transfer functions [9], by ensuring that the equalizer filters are FIR whenever the channel transfer functions are FIR. Since reverberation is essentially due to energy that decays in a finite amount of time, every room transfer function can be represented by an FIR filter. Moreover, the multichannel inverse impulse responses are characterized by shorter lengths than in the single-channel case. In fact, even in the minimum-phase case, the single-channel inverse of a finite IR is an IIR filter, and therefore of infinite length.

Figure 3.4: Illustration of the inversion: (a) single-echo impulse response with α = −0.9 and k = 50; (b) truncated inverse filter. A strong reflection requires a very long inverse filter.
As an example, let us consider the problem of equalizing a single-echo IR of the kind

h(n) = δ(n) + α · δ(n − k). (3.44)

Its Z transform is

H(z) = 1 + α · z^{−k}. (3.45)
This filter is also known as an FIR comb. Its inverse transfer function is

W(z) = 1 / (1 + α · z^{−k}). (3.46)
It can be observed that, as the reflection gain increases, the decay rate of the inverse filter impulse response slows down. This implies, as shown in Fig. 3.4, that if a truncated approximation of the inverse filter is used, a very long filter might be necessary to obtain a good dereverberation.
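This can be verified numerically. The sketch below builds the truncated inverse of the comb from the series expansion 1/(1 + αz^{−k}) = Σ_m (−α)^m z^{−mk}; the truncation length is an illustrative choice.

import numpy as np

# Truncated inverse of the FIR comb of equation 3.46; the decay of the
# coefficients (-alpha)^m slows down as |alpha| approaches 1.
alpha, k, L = -0.9, 50, 5000
w = np.zeros(L)
w[::k] = (-alpha) ** np.arange(len(w[::k]))

h = np.zeros(k + 1); h[0], h[k] = 1.0, alpha   # single-echo IR, equation 3.44
eq = np.convolve(h, w)
print(np.max(np.abs(eq[1:])))                  # residual error of the truncated inverse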
On the other hand, the MINT assures, under mild conditions, the existence of a finite length equalizer. This yields better statistical properties in the equalizer estimation, lower computational demand and smaller memory requirements. Thus multi-channel based structures can potentially exploit these properties to provide better and more efficient dereverberation. An example of a multichannel system inversion by the MINT is shown in Fig. 3.5.

Figure 3.5: Illustration of the inversion of a two channel system: h1(n) = δ(0) − 0.9δ(n − 600), h2(n) = δ(0) − 0.9δ(n − 1000). (a),(b) Inverse filters calculated by MINT. (c) Equalized IR. In the multichannel case, a strong reflection can be perfectly equalized by FIR filters.
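A minimal numerical sketch of a two-channel MINT inversion in the spirit of Fig. 3.5 (shorter delays are assumed here so the script runs quickly; all names and values are illustrative):

```python
import numpy as np

def conv_matrix(h, M):
    """Convolution (Sylvester) matrix of h acting on a length-M filter."""
    H = np.zeros((len(h) + M - 1, M))
    for j in range(M):
        H[j:j + len(h), j] = h
    return H

J = 201                                   # channel length
h1 = np.zeros(J); h1[0] = 1.0; h1[60] = -0.9
h2 = np.zeros(J); h2[0] = 1.0; h2[100] = -0.9

M = int(np.ceil(J / (2 - 1)))             # equation 3.42 with N = 2
H = np.hstack([conv_matrix(h1, M), conv_matrix(h2, M)])
d = np.zeros(J + M - 1); d[0] = 1.0       # target: a pure delta
w = np.linalg.lstsq(H, d, rcond=None)[0]
w1, w2 = w[:M], w[M:]

eq = np.convolve(h1, w1) + np.convolve(h2, w2)
print("peak error:", np.max(np.abs(eq - d)))  # ~0 when the channels are coprime
```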
Beyond the already mentioned improvement of the MINT algorithm by Hikichi et al. [66], other modifications have been published. In [67] Yamada et al. proposed a sub-band implementation that greatly reduces the computational requirements of the original algorithm. In [68] the relation between the MINT and an adaptive algorithm called Multiple Error-LMS (ME-LMS) is discussed. The theory presented reconciles the two approaches and derives explicit conditions which must be fulfilled if an exact inverse is to exist.
At the current state there is no agreement on the best metric to adopt for evaluating the performance of dereverberation algorithms. Dereverberation quality is often associated with the concept of intelligibility. In phonetics, intelligibility is a measure of how comprehensible speech is, or the degree to which speech can be understood. Late reverberation and noise are the main causes of the degradation in speech intelligibility. This is of particular interest in automatic speech recognition. However, other factors should also be taken into account.
Early reflections are considered to provide a positive contribution when “strengthening” the direct signal and therefore when improving intelligibility. From a “communication” perspective this is desirable [69], [70], [71], [56]. However, early reflections impress their own strong character on a signal by creating colorations that are often undesired. There are applications, as for instance hands-free communication or speech recording restoration, where it might be of interest to remove these colorations as well.

According to this scenario, two kinds of assessment can be found in the literature:

2. the capability of removing reverberation according to a reference signal (i.e. the undistorted speech or the system IR)
3.3.1 Intrusive measures based on the comparison with the undistorted speech
The Segmental Signal to Reverberation Ratio (SRRseg) [72] is defined similarly to the segmental SNR, i.e.

SRR_{seg} = \frac{10}{K} \sum_{k=0}^{K-1} \log_{10} \left( \frac{\sum_{n=kN}^{kN+N-1} s(n)^2}{\sum_{n=kN}^{kN+N-1} (s(n) - \hat{s}(n))^2} \right)    (3.48)

where K is the number of frames, N is the frame length in samples (the window length is usually 32 ms), s(n) is the undistorted signal, and ŝ(n) is the enhanced signal. This implies that the higher the SRRseg, the more closely the dereverberated signal approaches the original signal.
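A hedged implementation sketch of equation 3.48 (assuming time-aligned signals and no exclusion of silent frames; names are illustrative):

```python
import numpy as np

def srr_seg(s, s_hat, frame_len):
    """Segmental signal-to-reverberation ratio (equation 3.48), in dB."""
    K = len(s) // frame_len
    log_ratios = []
    for k in range(K):
        seg = slice(k * frame_len, (k + 1) * frame_len)
        num = np.sum(s[seg] ** 2)
        den = np.sum((s[seg] - s_hat[seg]) ** 2)
        log_ratios.append(np.log10(num / den))
    return 10.0 / K * np.sum(log_ratios)
```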
The log-spectral distortion (LSD), also referred to as log-spectral distance, is a distance measure, expressed in dB, between two spectra. The log-spectral distance between the spectra P(ω) and P̂(ω) of the l-th frame is defined as:

LSD(l) = \left( \frac{2}{K} \sum_{k=0}^{K/2-1} \left| L\{P(\omega_k)\} - L\{\hat{P}(\omega_k)\} \right|^p \right)^{1/p}    (3.49)

where L\{P(\omega)\} = 10 \log_{10} P(\omega) and K is the DFT length. In most cases short-time spectra are used, which are obtained using the Short Time Fourier Transform (STFT) of the signals. The mean Log Spectral Distortion is obtained by averaging the previous equation over all frames containing speech. The most common choices for p are 1, 2, and ∞, yielding mean absolute, root mean square, and maximum deviation, respectively [4].
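A sketch of the mean LSD under these definitions (frame length, hop, window and the p = 2 choice are assumptions for illustration; the mean over rfft bins approximates the (2/K) sum of equation 3.49):

```python
import numpy as np

def mean_lsd(s, s_hat, n_fft=512, hop=256, p=2):
    """Mean log-spectral distortion between two aligned signals (eq. 3.49)."""
    def log_spectra(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
        return 10 * np.log10(power + 1e-12)   # L{P(w)}, floored for stability
    L, L_hat = log_spectra(s), log_spectra(s_hat)
    lsd_per_frame = np.mean(np.abs(L - L_hat) ** p, axis=1) ** (1.0 / p)
    return np.mean(lsd_per_frame)
```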
Voice can be considered as an amplitude modulated signal [73]. If the original characteristics of the modulation envelope are retained at the receiver, good intelligibility is preserved. The amount of modulation, denoted by the Modulation Index (MI), is used to evaluate the Speech Transmission Index (STI), a widely accepted measure to predict the effect of room acoustics on speech intelligibility. The STI is a machine measure of intelligibility whose value varies from 0 (completely unintelligible) to 1 (perfect intelligibility). The modulation index as a function of modulation frequency can be calculated as follows [4]:
1. The speech signal is analysed using an octave filter bank. The filter bank can be bypassed
resulting in a broad-band analysis.
2. For each octave band the envelope is estimated by taking the magnitude of the analytic signal obtained via the Hilbert transform, which results in the Hilbert envelope. The Hilbert envelope is low-pass filtered with a 50 Hz low-pass filter and then downsampled to a frequency of 200 Hz. The resulting signal will be referred to as the envelope signal.

3. For each octave band the Power Spectral Density (PSD) of the envelope signal is estimated using a standard Welch procedure. The parameters used for the Welch procedure are a window length of 8 seconds and a Hanning window with 40% overlap between successive windows.
4. The intensity values of the PSD are summed over modulation frequencies for each octave
band and are normalized to 1 using the DC-component of the PSD.
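A hedged sketch of steps 2-3 in their broad-band variant (octave filtering omitted; the low-pass filter order and function names are assumptions):

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, welch, resample_poly

def envelope_psd(x, fs):
    """Hilbert envelope, 50 Hz low-pass, downsampling to 200 Hz and
    Welch PSD (8 s window, Hanning, 40% overlap), as in steps 2-3."""
    env = np.abs(hilbert(x))                 # Hilbert envelope
    b, a = butter(4, 50 / (fs / 2))          # 50 Hz low-pass (4th order assumed)
    env = filtfilt(b, a, env)
    env = resample_poly(env, 200, fs)        # downsample to 200 Hz (integer fs)
    nper = 8 * 200                           # 8 s window at 200 Hz
    f, psd = welch(env, fs=200, window="hann",
                   nperseg=nper, noverlap=int(0.4 * nper))
    return f, psd
```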
Other intrusive measures based on the comparison with the undistorted signal take into account
the human auditory system, among them the Bark Spectral Distortion (BSD), that transforms
the speech signal into a perceptually relevant domain [74], the Reverberation Decay Tail (RDT)
proposed in [75] by Wen and Naylor, and the PESQ described in ITU-T Recommendation P.862
(February 2001) [76].
BSD was developed by Wang et al. [77]. It was the first objective measure to incorporate
psychoacoustic responses. Its performance was quite good for speech coding distortions as
compared to traditional objective measures (in time domain and in spectral domain). The BSD
measure is based on the assumption that speech quality is directly related to speech loudness,
which is a psychoacoustical term defined as the magnitude of auditory sensation. In order to
calculate loudness, the speech signal is processed using the results of psychoacoustic measurements. BSD estimates the overall distortion by using the average Euclidean distance between
loudness vectors of the reference and of the distorted speech. BSD works well in cases where
the distortion in voiced regions represents the overall distortion, because it processes voiced
regions only; for this reason, voiced regions must be detected.
If the acoustic system impulse response is known, dereverberation performance can be easily evaluated by comparing the original system IR and the equalized one. In a multichannel system the impulse response between the source and the closest microphone can be used as the reference IR. The Direct-to-Reverberant Ratio (DRR) is defined as

DRR = 10 \log_{10} \frac{h^2(\delta)}{\sum_{k=0, k \neq \delta}^{M-1} h^2(k)} \; \mathrm{dB}    (3.50)
where h(n) is the speaker-to-receiver impulse response, M its length in samples, and δ the time-index of the direct path in samples. The DRR depends on the distance between the source and the microphone and on the reverberation time of the room. Since the DRR is a ratio between the energy of the direct path and the rest of the energy in the speaker-to-receiver impulse response, it does not involve any information about the duration of the impulse response. It can be shown that the DRR can be estimated from the SRR [78], and therefore without explicit knowledge of the channel impulse response.
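A direct implementation sketch of equation 3.50 (if the direct-path index is not given, the largest tap is assumed):

```python
import numpy as np

def drr_db(h, delta=None):
    """Direct-to-reverberant ratio of an impulse response (equation 3.50)."""
    h = np.asarray(h, dtype=float)
    if delta is None:
        delta = int(np.argmax(np.abs(h)))   # assume direct path = largest tap
    direct = h[delta] ** 2
    rest = np.sum(h ** 2) - direct
    return 10 * np.log10(direct / rest)
```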
The Definition index is the early-to-total sound energy ratio. It is defined [1] as:

D = \frac{\int_0^t h^2(\tau) \, d\tau}{\int_0^\infty h^2(\tau) \, d\tau} \cdot 100\%    (3.51)

Expressed as a percentage (or unity fraction), it is associated with the degree to which rapidly occurring individual sounds are distinguishable. The Definition index attempts to define an objective criterion of what may be called the distinctness of sound. t is usually set to 50 or 80 ms (i.e., in case t = 50 ms the Definition index is denoted by D50).
The Clarity index is the Early to Late reverberation energy Ratio (ELR). It is defined [1] as:
C = 10 \log_{10} \frac{\int_0^t h^2(\tau) \, d\tau}{\int_t^\infty h^2(\tau) \, d\tau}    (3.52)
where t is usually set to 50 or 80 ms (i.e., in case t = 50ms the Clarity index is denoted by
C50).
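Both indices are straightforward to compute from a sampled IR; a minimal sketch (function names are illustrative):

```python
import numpy as np

def definition_pct(h, fs, t_ms=50):
    """Definition index (equation 3.51), in percent."""
    n = int(fs * t_ms / 1000)
    return 100 * np.sum(h[:n] ** 2) / np.sum(h ** 2)

def clarity_db(h, fs, t_ms=50):
    """Clarity index (equation 3.52), in dB."""
    n = int(fs * t_ms / 1000)
    return 10 * np.log10(np.sum(h[:n] ** 2) / np.sum(h[n:] ** 2))
```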
The Normalized Projection Misalignment (NPM) [79] criterion is particularly useful for comparing the identified IR with a target response. The NPM projects the estimate onto the true impulse response, ignoring scaling factors. This is necessary in many situations, because the estimated impulse response is usually a scaled version of the true impulse response:

NPM(k) = \frac{\| \varsigma(k) \|}{\| h \|}    (3.53)

where

\varsigma(k) = h - \frac{h^T \hat{h}}{\hat{h}^T \hat{h}} \hat{h}.    (3.54)
This implies that with a perfectly identified set of channels, the NPM will be zero.
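A sketch of equations 3.53-3.54:

```python
import numpy as np

def npm(h, h_hat):
    """Normalized projection misalignment (equations 3.53-3.54)."""
    h = np.asarray(h, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    eps = h - (h @ h_hat) / (h_hat @ h_hat) * h_hat   # projection residual
    return np.linalg.norm(eps) / np.linalg.norm(h)    # 0 for a perfect estimate
```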
Chapter 4
Blind dereverberation techniques. Problem statement and existing technology
4.1 Introduction
Even though several approaches have been proposed, a possible discrimination into two classes can be accomplished by considering whether or not the inverse IR needs to be estimated. In fact, all dereverberation algorithms attempt to obtain dereverberation either by attenuating the effects of the IR or by undoing them. In a simplistic view, one approach tries to alleviate the “symptoms” of the signal degradation, while the other attempts to address its “cause”. Due to the spatial diversity and temporal instability that characterize the IRs [6], the first class of algorithms can currently offer more effective results in practical conditions [7] [8]. However, the algorithms belonging to the second class can potentially lead to ideal performance [9]. It must be noted that practical dereverberation is still largely an unsolved problem.
To be consistent with the useful definitions reported in [4], the first class of algorithms will be referred to as “reverberation suppression methods” and the second as “blind reverberation cancellation methods”.
“Reverberation suppression methods” are based on a diverse set of techniques such as: beamforming [5][7], spectral subtraction [8], temporal envelope [83], and LPC enhancement [84] [85].

“Blind reverberation cancellation methods” can be divided into two sub-classes: the techniques based on blind estimation of the IR followed by its inversion [86], and those that attempt to directly estimate the inverse system [87] [88] [89]. While the first methods have the benefit of providing access to the IR estimate, which is of interest for the extraction of many acoustic parameters (i.e. T60, EDT, C80 etc. [1]), the calculation of the inverse system is not trivial even in the non-blind case [55] [49] [66] and it might lead to inaccuracies [6], [59], [47]. Therefore, for dereverberation purposes, it is probably more consistent to directly estimate the inverse system.
Blind reverberation cancellation and suppression methods can be combined to offer hybrid
strategies [90][91].
This chapter will be focused on reverberation cancellation methods. For completeness, the
more relevant reverberation suppression methods will be briefly described.
4.2.1 Beamforming
Beamforming is a spatial filtering technique that can discriminate between signals coming from
different directions. Beamforming applications are several and diverse [5] (i.e. radar, sonar,
telecommunication, geophysical exploration, biomedicine, image processing, acoustics). In
acoustics, beamforming is employed to create sensors with an electronically configurable di-
rectivity pattern. This can be used to separate a source in a noisy environment or to minimize
the interference caused by reverberation. Enhanced speech obtained from a far-field acquisition is a typical application. In a similar way, this technique can be used to obtain higher intelligibility of the diffused signal by designing a loudspeaker array that can focus the acoustic energy in a confined spatial region, minimizing the reflections due to the surrounding walls and objects. The simplest beamformer computes a weighted sum of the sensor signals,

y(k) = \sum_{i=1}^{N} h_i \, x_i(k)    (4.1)

where h_i is the weight and x_i the signal at the i-th sensor, and y(k) is the beamformer output. This beamformer will be called, for simplicity, a “linear beamformer”. This equation has the same structure as an FIR filter.
The weight h_i can be a simple real number or a more complex filter. In this latter form, beamforming, as will be shown in section 4.7.2, bears a close similarity to the structures employed in multichannel blind reverberation cancellation methods.
When the linear beamformer operates at a single frequency ω (narrow band hypothesis), the
analogy with an FIR filter is intuitive.
• In an FIR filter, the time interval between two consecutive samples in the filtered signal is determined by the sampling period T.

• In a linear beamformer, the time interval between two sensors is determined once the Direction Of Arrival (DOA) θ of the impinging wave, its velocity, c, and the distance between the sensors, d, are specified¹:

\tau(\theta) = \frac{d}{c} \sin(\theta).    (4.2)

¹A far-field hypothesis is assumed: the sources are far enough to consider the wave as plane. This relation is not valid in the near field.
While an FIR filter is based on the combination of uniformly spaced temporal samples, the linear beamformer is based on the combination of uniformly spaced spatial samples. As in an FIR filter, the beamformer weights h completely determine its response, and the FIR filter design techniques can be used, replacing the concept of frequency with that of directivity.
4.2.1.2 FIR filter frequency response and linear beamformer directivity pattern
The frequency response of an FIR filter of length N, with impulse response h at a sampling period T, is given by

H(\omega) = \sum_{i=1}^{N} h_i \, e^{-j\omega T (i-1)}    (4.3)

which can be written compactly as

H(\omega) = h^T d^*(\omega)    (4.4)

where

h^T = [h(1), h(2), ..., h(N)]    (4.5)

and

d(\omega) = [1, e^{j\omega T}, ..., e^{j\omega (N-1)T}].    (4.6)
The squared absolute value of the frequency response is the filter power response. To obtain the spatial response of a linear beamformer, it is sufficient to replace the sampling period T with τ(θ) = (d/c) sin(θ), obtaining

H(\theta, \omega) = h^T d^*(\theta, \omega)    (4.7)

where

d(\theta, \omega) = [1, e^{j\omega\tau(\theta)}, ..., e^{j\omega(N-1)\tau(\theta)}].    (4.8)
In temporal sampling, the highest frequency that can be represented without ambiguity is f_Nyquist = f_s/2, where f_s is the sampling frequency. If a signal is sampled with an insufficiently high f_s, the sampled signal, due to the ambiguity in the representation (aliasing), will contain frequency components that are not really present in the original signal. In a similar way, when a plane wave is spatially sampled with sensors placed at a uniform distance d, spatial aliasing can happen. It is possible to show that the highest frequency that can be reconstructed without ambiguity is

f_{max} = \frac{c}{2d}.    (4.9)

A simple FIR low-pass filter is given by the average of N neighboring samples,

h(n) = \frac{1}{N}, \qquad n = 1, ..., N.    (4.10)

The filter frequency cut is linked to the filter length N. The frequency response of this system can be calculated by equation 4.3 and its power spectrum is reported in Fig. 4.2.

Figure 4.3: Beampatterns of linear beamformers: (a) 16 sensors, (b) 32 sensors, (c) Dolph-Chebyshev 16 sensors.
For a linear beamformer, the same h(n) can be interpreted as a peak of sensitivity at 0°, as shown in Fig. 4.3(a). The spatial resolution of the beamformer can be increased by augmenting the number of sensors, as shown in Fig. 4.3(b).

In FIR filter design it is usual to weight the coefficients with a windowing function to obtain a smoother frequency response, at the expense of frequency resolution. An optimal choice in this sense, which offers equiripple behavior in the stopband, is the Dolph-Chebyshev window [93]. An example of a beamformer obtained by using the weights reported in equation 4.10 and smoothed by a Dolph-Chebyshev window is reported in Fig. 4.3(c).
A peak of sensitivity in the direction θ₀ at the frequency ω₀ can be obtained by steering the beamformer, i.e. by choosing the weights

h = d(\theta_0, \omega_0).    (4.11)
A linear beamformer can discriminate the direction of arrival from a 2-D space. A 3-D dis-
crimination can be obtained by a 2-D matrix of equi-spaced sensors. The smoothing window
coefficients are given by the factorization of the 1-D window. Therefore, the 2-D linear beam-
former can be viewed as the composition of 1-D linear beamformers.
If a narrowband beamformer designed for a specific frequency is hit by a wideband plane wave, the array will appear large at high frequencies, where the spatial resolution will therefore be high, and small at low frequencies, where the spatial resolution will decrease. At high frequencies spatial aliasing might occur, and unwanted maxima of sensitivity might appear in the beampattern. The behavior of a beamformer at different frequencies is shown in Fig. 4.4. Since the beampattern exhibits a non-constant lobe width at different frequencies, an interfering signal will not be completely suppressed, but only low-pass filtered. Design techniques for beamformers with a constant beampattern have been proposed, among them the approach of Ward et al. [94].
A beamformer is however not capable of reducing the components of reverberation that fall
inside the beampattern. In other words, since reverberation comes from all possible directions
in a room, it will always enter the path of the beam.
Gaubitch [96] has shown that the expected improvement in direct-to-reverberant ratio that can be achieved with a Delay-and-Sum Beamformer (DSB) is

E\{DRR\} = 10 \log_{10} \frac{ D_0^2 \sum_{m=1}^{M} \sum_{l=1}^{M} \frac{1}{D_m D_l} }{ \sum_{m=1}^{M} \sum_{l=1}^{M} \frac{\sin(k \| l_m - l_l \|)}{k \| l_m - l_l \|} \cos(k (D_m - D_l)) }    (4.12)

where D_m is the distance between the source and the m-th microphone, l_m is the three-dimensional coordinate vector of the m-th microphone and D_0 = min_m(D_m) is the distance from the source to the closest microphone. k = 2πf/c is the wave number, with f denoting frequency and c the speed of sound in air.
The expected improvement that can be achieved with the DSB depends only on the distance
between the source and the array and the separation of the microphones and is consequently
independent of the reverberation time. The performance increases by augmenting the number
of microphones and the distance from the source.
In summary, beamforming and in particular the delay-and-sum beamformer are simple ap-
proaches that can provide moderate improvement in dereverberation.
Spectral subtraction is not a recent approach to noise compensation and was first proposed in
1979 [97]. There is however a vast amount of more recent work in the literature relating to
different implementations and configurations of spectral subtraction.
Spectral subtraction is usually applied to additive noise reduction. Its main advantages are the simplicity of implementation and the low computational requirements. Speech degraded by additive noise can be represented by

y(n) = x(n) + v(n)    (4.13)

where x(n) is a speech signal corrupted by the additive noise v(n). Assuming that speech and noise are uncorrelated, the enhanced speech, x̂(n), is obtained from

|\hat{X}(k)| = \begin{cases} |Y(k)| - |\hat{V}(k)|, & \text{if } |Y(k)| > |\hat{V}(k)| \\ 0, & \text{otherwise} \end{cases}    (4.14)
where X(k), Y(k) and V(k) denote the short-time magnitude spectra of x(n), y(n) and v(n) for the kth frame. These short-time magnitude spectra are usually obtained from the discrete
Fourier transform (DFT) of sliding frames, typically in the order of 20-40 ms. The noise spec-
trum is estimated during non-speech intervals. Noise reduction is thus achieved by suppressing
the effect of noise from the magnitude spectra only. The subtraction process can be in true
magnitude terms or in power terms. Phase terms are ignored.
In the particular case of magnitude spectral subtraction, the enhanced speech is reconstructed by using the phase information of the corrupted signal y(n):

\hat{x}(n) = \mathrm{IDFT}\left[ |\hat{X}(k)| \, e^{j \angle Y(k)} \right]    (4.15)
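A hedged sketch of magnitude spectral subtraction with overlap-add resynthesis (equations 4.14-4.15; the frame parameters, the noise estimate and the window handling are illustrative assumptions):

```python
import numpy as np

def spectral_subtraction(y, noise_mag, n_fft=512, hop=256):
    """Subtract a noise magnitude estimate (eq. 4.14), keep the noisy
    phase, and resynthesize by IDFT and overlap-add (eq. 4.15)."""
    win = np.hanning(n_fft)
    out = np.zeros(len(y))
    for start in range(0, len(y) - n_fft + 1, hop):
        Y = np.fft.rfft(win * y[start:start + n_fft])
        mag = np.maximum(np.abs(Y) - noise_mag, 0.0)   # rectified subtraction
        X = mag * np.exp(1j * np.angle(Y))             # noisy phase retained
        out[start:start + n_fft] += np.fft.irfft(X, n_fft) * win
    return out   # approximate reconstruction; windows not energy-normalized
```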
The main inconvenience with this approach is the generation in the processed signal of an annoying interference, termed musical noise. This noise is composed of tones at random frequencies [99] and is mainly due to the rectification effect caused by equation 4.14. The spectral subtraction process can also be described as a filtering operation in the frequency domain as [99]

\hat{X}(k) = G(k) \cdot Y(k).    (4.16)

Equations that allow one to design the filter G(k) to perform magnitude subtraction, power spectral subtraction and Wiener filtering have been similarly proposed in [98], [100], [99].
Here, the formulation proposed in [98] is reported:

G(k) = \begin{cases} \left[ 1 - \alpha \left( \frac{V(k)}{Y(k)} \right)^{\gamma_1} \right]^{\gamma_2}, & \text{if } \left( \frac{V(k)}{Y(k)} \right)^{\gamma_1} < \frac{1}{\alpha + \beta} \\ \beta, & \text{otherwise.} \end{cases}    (4.17)

• β (spectral flooring). Instead of assigning negative values of |X̂(k)| to zero, a threshold β can be set in such a way that the musical noise caused by the rectification effect is reduced.

• Exponents γ₁ and γ₂. These determine the path between G(k) = 1 and G(k) = 0. Three classical methods are defined: magnitude subtraction, with γ₁ = γ₂ = 1; power spectral subtraction, defined by γ₁ = 2 and γ₂ = 0.5; and Wiener filtering, with γ₁ = 2 and γ₂ = 1.
Speech distortion and residual noise cannot be minimized simultaneously. Parameter adjustment is dependent on the application. As a general rule, human listeners can tolerate some distortion, but they are sensitive to the fatigue caused by noise. Automatic speech recognizers are usually more susceptible to speech distortion [98].
In [101], the limitations of spectral subtraction have been analyzed. In the same paper, the exact relation between the clean speech spectrum X(k) and the noise, V(k), and distorted signal, Y(k), spectra is given:

X(k) = \left[ |Y(k)|^2 - |V(k)|^2 - X(k) \cdot V^*(k) - X^*(k) \cdot V(k) \right]^{1/2} e^{j\theta_X(k)}.    (4.18)

This expression suggests that three sources of error exist in a practical implementation of spectral subtraction:

• phase errors, arising from the differences between the phase of the corrupted signal θ_Y(k) and the phase of the true signal θ_X(k)

• magnitude errors, which refer to the differences between the true noise spectrum |V(k)| and its estimate |V̂(k)|

• cross-term errors, arising from neglecting the cross terms X(k)·V*(k) + X*(k)·V(k) under the assumption that speech and noise are uncorrelated.

Except for the worst levels of SNR, errors in the magnitude make the greatest contribution. However, as noise levels in the order of 0 dB are approached, phase and cross-term errors are not negligible and lead to degradations comparable to those caused by magnitude errors.
The use of spectral subtraction for speech dereverberation of noise-free speech was proposed by Lebart et al. in [102]. Spectral subtraction dereverberation methods are based on the observation that reverberation creates correlation between the signal measured at time t₀ and at time t₀ + Δt. Therefore, reverberation can be reduced by considering as noise the contribution of the signal at time t₀ to the signal at time t₀ + Δt. The problem of reverberation suppression differs from classical de-noising in that the “reverberation noise” is non-stationary.
Following [102], the room impulse response is modelled as an exponentially decaying random process,

h(n) = v(n) \, e^{-\tau n} \, u(n)    (4.19)

where v(n) represents a white zero-mean Gaussian noise, u(n) the unit step function, and τ a damping constant related to the reverberation time T60,

\tau = 3 \ln(10) / T_{60}.    (4.20)

The non-stationary reverberation-noise power spectrum, due to late reverberation, can be modelled as [98]

|V(\omega, n)|^2 = e^{-2\tau T_d} \, |X(\omega, n - T_d)|^2    (4.21)

where T_d is the number of samples identifying the threshold that separates the direct component from the late reverberant one (usually between 40 and 80 ms). Therefore, it is the number of samples where reverberation is not suppressed. V(n) is thus an exponentially attenuated power spectrum of the acquired signal x(n).
The LP residual of reverberant voiced speech segments contains the glottal pulses followed by other peaks due to multi-path reflections. Dereverberation, as illustrated in Fig. 4.5, can thus be achieved by attenuating these undesired peaks and synthesizing the enhanced speech waveform using the modified LP residual and the time-varying all-pole filter with coefficients calculated from the reverberant speech. This approach relies on the assumption that the LP coefficients are unaffected by reverberation. The validity of this hypothesis will be discussed in more detail in section 4.4. The main interest here is that these methods show similarities with many blind reverberation cancellation algorithms, where the LP residual calculation is often used as a preprocessing step. However, it might be misleading to classify these latter methods as “LP residual enhancement algorithms”, since it is advisable to include in this class only the algorithms that do not require system identification.
The first approach based on LP residual enhancement was most likely proposed by J.B. Allen
and F. Haven, from Bell Telephone Laboratories Inc., in a patent that was filed in 1972 [84].
A detector to distinguish between voiced and unvoiced speech frames, a pitch estimator, and a
gain estimator were used. All signals were then employed to synthesize a clean LP residual.
LP residual enhancement was also proposed by Yegnanarayana and Murthy [85]. Their method
involves identifying and manipulating the linear prediction residual signal in three different
regions of the speech signal: high SRR, low SRR, and only reverberation component regions.
A weight function is derived to modify the linear prediction residual signal. The weighted
residual signal samples are used to excite a time-varying all-pole filter to obtain perceptually
enhanced speech. A following approach [107] is based on the use of time-aligned Hilbert
envelopes to represent the strength of the peaks in the LP residuals. The Hilbert envelopes are
then summed and used as a weight vector which is applied to the LP residual.
Another approach has been proposed by Griebel and Brandstein [108]. Their technique is based
upon the observation that residual impulses due to the original speech tend to be predictable
while those due to reverberation effects are relatively uncorrelated in both amplitude and time.
All these methods use the scheme reported in Fig.4.5: LPC analysis followed by residual en-
hancement and speech re-synthesis. The advantage is that no system identification is required.
However, the main limitation is that they do not consider the original structure of the excitation
signal and therefore the enhanced residual can differ from the original clean residual and can
result in unnatural sounding speech.
An advanced practical implementation based on the enhancement of the LP residual from the output of a delay-and-sum beamformer was proposed in [109] by Gaubitch et al. The fact that the waveform of the LP residual varies slowly between adjacent larynx cycles was exploited, so that each such cycle can be replaced by an average of itself and its nearest neighboring
cycles. The averaging results in a suppression of spurious peaks in the LP residual caused by
room reverberation. This algorithm was practically implemented in a computationally efficient
system that can achieve a 5 dB SNR improvement in a real environment without the knowledge
or the estimate of the room transfer functions [7].
The reverberation cancellation techniques described in 3.2.1 for the SISO system and in 3.2.3 for the SIMO system calculate the inverse filter(s) from the known system impulse response(s). However, when no information about the system impulse responses is available, the estimation of the inverse filters can be obtained only by observing the system output(s). These techniques are called blind.

Single channel blind reverberation cancellation is connected to the blind estimation of an equalizer w(n) such that

\delta(n - N_d) = h(n) * w(n)    (4.22)

so that its convolution with the received signal x(n) recovers the source,

\hat{s}(n) = y(n) = w(n) * x(n).    (4.23)

The relation that exists between the system impulse response and its inverse,

W(\omega) = \frac{1}{H(\omega)},    (4.24)
reveals that, in theory, the blind equalization and the blind system identification problems are essentially the same. Therefore any blind identification scheme can be used to perform blind deconvolution by identifying and then inverting the system impulse response. Due to the estimation error of h(n), all the problems regarding the system inversion analyzed in 3.2.1 are magnified in the blind case. Therefore it is probably even more appropriate to estimate the equalizer directly instead of calculating it by inverting the identified system impulse response.
In a similar way, multichannel blind reverberation cancellation is connected to the blind estimation of a set of filters w_i(n) that satisfy the equation

\delta(n - N_d) = \sum_{i=1}^{N} h_i(n) * w_i(n)    (4.25)

and

\hat{s}(n) = y(n) = \sum_{i=1}^{N} x_i(n) * w_i(n).    (4.26)
Blind dereverberation can therefore be viewed as the problem of blindly estimating one or more filters to equalize the received signal. This is much more problematic than in the non-blind case, in particular when colored source signals are considered.
In fact, suppose the channel, h(n), is Linear Time-Invariant (LTI), then the observed signal
x(n) can be expressed as
x(n) = h(n) ∗ s(n) (4.27)
where * denotes convolution. If either h(n) or s(n) is itself the convolution of two signals,

h(n) = h_1(n) * h_2(n)    (4.28)

s(n) = s_1(n) * s_2(n)    (4.29)

then

x(n) = h_1(n) * h_2(n) * s_1(n) * s_2(n)    (4.30)
and it is impossible to tell which components belong to the source signal and which to the distortion operator. Unambiguous blind deconvolution can be obtained only if s(n) and h(n) are irreducible. An irreducible signal is one which cannot be exactly expressed as the convolution of two or more component signals (except with a delta function) [110]. The problem of unambiguous deconvolution is present whenever we want to decouple the coloration of the source (e.g. the resonances due to the vocal tract in speech) from the one connected to the reverberation. Unless additional knowledge is available, blind deconvolution of colored signals (i.e. speech, music) cannot be achieved.
On the other hand, if the source signal s(n) has a “white” spectral structure several techniques
can be employed to blindly estimate the filter(s) to equalize the received signal x(n) [111],
[112], [113], [92].
Although the speech signal is not statistically white, it can be modelled as the convolution of a white signal v(n) and the vocal tract filter a(n),

s(n) = v(n) * a(n)    (4.31)

where a(n) has the characteristic of the speech spectrum A(ω). Ideally the whitening filter g(n) must be able to remove only the inherent correlation of the speech signal without affecting reverberation, so that

g(n) * x(n) = h(n) * v(n).    (4.32)
The most popular techniques for pre-whitening the reverberated speech are based on Linear Predictive (LP) analysis [87] or on block whitening by Fast Fourier Transform (FFT) magnitude equalization [91]. The foundation of these methods is the fact that reverberation and speech have different temporal structures. While speech can be considered quasi-stationary within blocks of 20-30 ms duration, reverberation, unless the source is moving, can be considered approximately stationary. Other possible approaches are based on the non-stationarity of signals [114] or on the knowledge of the source signal statistics [115] [116].
LP pre-whitening is based on the observation that signals with a harmonic structure can be modelled using a cascade of second order all-pole sections driven by a white noise signal,

H(z) = \frac{1}{1 - 2\,\mathrm{Re}(p) \cdot z^{-1} + |p|^2 z^{-2}} = \frac{1}{1 - a_1 \cdot z^{-1} - a_2 \cdot z^{-2}}    (4.33)

with

p = r \cdot e^{j\omega}.    (4.34)
Having a segment of speech s(n), the optimal coefficients a_i can be calculated [50] by minimizing the squared error

\varepsilon_p = \sum_{n=0}^{\infty} |e(n)|^2    (4.35)

where

e(n) = s(n) - \hat{s}(n)    (4.36)

and

\hat{s}(n) = -\sum_{k=1}^{p} a_p(k) \cdot s(n-k).    (4.37)
This method is called Linear Prediction (LP), since ŝ(n) is an estimate, or prediction, of the signal s(n) in terms of a linear combination of the previous p values. p, defined as the LP order, is related to the number of poles, and therefore to the number of resonant peaks, used to model the signal spectrum.

The residual error e(n) represents the unpredictable components of s(n). Thus e(n) is small when the signal has a harmonic structure and large for transients and noise-like behaviors. The coefficients a_p(k) are called the linear prediction coefficients. The filter

A_p(z) = 1 + \sum_{k=1}^{p} a_p(k) z^{-k}    (4.38)

is known as the Prediction Error Filter (PEF), or whitening filter.
A speech signal, or more generally any audio signal, can be analyzed by splitting it into quasi-
stationary blocks (20-30 ms for speech) and by calculating the filter coefficients for every block.
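A hedged sketch of this block-wise LP analysis (autocorrelation method; function names and the default order are illustrative assumptions):

```python
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(s, p):
    """LP coefficients a_p(k) by the autocorrelation method, solving the
    normal equations for the predictor of eqs. 4.35-4.37 directly."""
    r = np.correlate(s, s, mode="full")[len(s) - 1:len(s) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, -r[1:p + 1])

def lp_residual(s, p=12):
    """Residual e(n) obtained by filtering s(n) with A_p(z) (eq. 4.38)."""
    a = lp_coefficients(s, p)
    return lfilter(np.r_[1.0, a], [1.0], s)
```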
Better decoupling between the speech and the channel can be obtained in the multi-channel case, as proposed in [117] by Gaubitch et al. and in [118] by Triki and Slock, by exploiting both the temporal and the spatial diversity. Gaubitch et al. show that a more accurate decoupling can be achieved, in a multichannel framework, by averaging the LP coefficients calculated from instances of the same speech recorded in different positions within a room. Triki and Slock suggest exploiting the spatiotemporal diversity to estimate the source correlation structure, which is then used to determine a source whitening filter. Their observation relies on the fact that, according to statistical acoustics, above the Schroeder frequency
f_g = 2000 \sqrt{\frac{T_{60}}{V}}    (4.39)

the spatially averaged reverberation spectrum is flat,

\langle |H(\omega)|^2 \rangle = \frac{1 - \beta}{\pi A \beta}    (4.40)

where ⟨.⟩ is the spatial expectation (estimated by averaging over all possible source and microphone positions), β is the average wall absorption coefficient and A is the total wall surface area. Therefore the speech spectrum can be estimated by averaging the frequency responses of the speech signals received by multiple microphones. It can be objected that this might be problematic for rooms where the modal behaviour prevails (i.e. small reverberant rooms). In this case, the flatness hypothesis might hold only at relatively high frequencies, and room resonances overlap with speech resonances.
An iterative method, also applicable in the single channel case, was proposed in [88] by Yoshioka et al. The authors claim that, by jointly estimating the channel's inverse filter and the PEF, the channel's inverse becomes identifiable due to the time-varying nature of the PEF.
In [91] Furuya et al. suggest whitening the reverberant speech by block whitening via FFT magnitude equalization. A whitening filter g(k) with a short time span (16-512 taps at a sampling frequency of 12 kHz) is used to remove the correlation due to speech while leaving the correlation of reverberation, which has a longer time span. A possible limitation is that, if reflective surfaces are placed in the proximity of the source, the correlation due to these early reflections will also be removed. This will happen more evidently if the FFT block length is increased. In their method, the source speech spectrum A(ω) is estimated by averaging the received short-time spectra,

A(\omega) = \langle |X_j(\omega, m)| \rangle \approx \langle |S(\omega, m)| \rangle    (4.41)

where ⟨.⟩ is the spatial expectation, S(ω, m) is the STFT of the original speech signal and X_j(ω, m) is the STFT of the received speech signal x_j(n). The whitening filter is obtained by computing

G(\omega) = \frac{1}{A(\omega)}    (4.42)

and performing the inverse Fourier transform of G(ω). The decorrelated signal, given by the convolution of the received signal at microphone j, x_j(k), with the whitening filter g(k), is then used to compute the inverse filters instead of the received signal x_j(n).
All the methods that have been discussed offer only an improvement with respect to the unambiguous deconvolution problem for speech, and cannot be extended to signals that contain sustained harmonics (i.e. music or singing voice). The length of the data window to be whitened, while critical, is usually chosen on heuristic criteria. Furthermore, it can be objected that dereverberation should happen before whitening, while usually the opposite approach is adopted [87], [118], [91]. In fact, the order of two linear filters can be swapped only if they are time invariant. Since the vocal tract filter is not stationary, performing whitening before or after dereverberation will lead to different results. From a physical perspective, reverberation is added after the convolution with the vocal tract. Therefore reverberation spreads the information contained in a segment of speech into a wider temporal frame, and whitening on a short-block basis is unable to remove this. This aspect will be analyzed in more detail in section 5.4.
Blind deconvolution techniques are an unsupervised learning approach that identifies the inverse of an unknown linear time-invariant, possibly non-minimum-phase, system without having access to a training sequence (i.e. a desired response). An overview of existing blind deconvolution techniques can be found in [119], [120], [45], [121].
• The source signal s(n) is a discrete-time, real, stationary stochastic process with zero
mean and discrete-time power spectrum Ps (ω).
• The distorting system is linear and time invariant (LTI) with discrete-time transfer func-
tion H(ω).
The second order statistics (SOS) of the received signal x(n) are described by its power spectrum

P_x(e^{j\omega}) = \sum_{k=-\infty}^{\infty} r_x(k) \, e^{-jk\omega}    (4.43)

which is linked to the power spectrum of the system input s(n) by the relation

P_x(\omega) = |H(\omega)|^2 P_s(\omega).    (4.44)

If a different LTI system H'(ω) = H(ω) · A(ω) is considered, where A(ω) is an allpass filter (i.e. with unit magnitude and arbitrary phase response), the power spectrum P_x(ω) is unchanged. SOS cannot distinguish between H'(ω) and H(ω). This is why SOS is often described as phase-blind.
A unique relationship between the magnitude |H(ω)| and the phase ∠H(ω) of an LTI system exists only when H is either minimum phase or maximum phase (i.e. the transfer function of the system is stable and has all its zeros confined either to the interior or to the exterior of the unit circle in the z-plane) [54]. Therefore, a whitening filter, which equalizes only the magnitude response |H(ω)| of a system, is insufficient for the blind equalization of mixed-phase systems. A possible approach to overcome the limitation of SOS and to recover the phase information is the use of Higher Order Statistics (HOS).
Let u(n), u(n + τ₁), ..., u(n + τ_{k−1}) denote the random variables obtained by observing the process at times n, n + τ₁, ..., n + τ_{k−1}. The second, third and fourth-order cumulants for a stationary random process are given in [45]. As an example, M₁(A) = E{A} is the mean, and M₂(A) − (M₁(A))² = E{A²} − (E{A})² is the variance of A. The generalization to a stationary random process u(n) is the kth-order moment function R_u, defined as

R_u(\tau_1, ..., \tau_{k-1}) = E\{ u(n) \cdot u(n + \tau_1) \cdots u(n + \tau_{k-1}) \}.

For a zero-mean process:

• the second-order cumulant c₂(τ) is the same as the autocorrelation function r(τ) (the second-order moment function E{u(n) · u(n + τ)});

• the third-order cumulant c₃(τ₁, τ₂) is the same as the third-order moment function E{u(n) · u(n + τ₁) · u(n + τ₂)};

• the fourth-order cumulant differs from the fourth-order moment function.
The kth-order polyspectrum is defined as the (k−1)-dimensional Fourier transform of the kth-order cumulant,

C_k(\omega_1, ..., \omega_{k-1}) = \sum_{\tau_1=-\infty}^{\infty} \cdots \sum_{\tau_{k-1}=-\infty}^{\infty} c_k[\tau_1, ..., \tau_{k-1}] \cdot e^{-j(\omega_1 \tau_1 + ... + \omega_{k-1} \tau_{k-1})}    (4.50)

so that the power spectrum, the bispectrum and the trispectrum are, respectively,

P(\omega) = \sum_{\tau=-\infty}^{\infty} c_2(\tau) \cdot e^{-j\omega\tau}    (4.51)

C_3(\omega_1, \omega_2) = \sum_{\tau_1=-\infty}^{\infty} \sum_{\tau_2=-\infty}^{\infty} c_3[\tau_1, \tau_2] \cdot e^{-j(\omega_1\tau_1 + \omega_2\tau_2)}    (4.52)

C_4(\omega_1, \omega_2, \omega_3) = \sum_{\tau_1=-\infty}^{\infty} \sum_{\tau_2=-\infty}^{\infty} \sum_{\tau_3=-\infty}^{\infty} c_4[\tau_1, \tau_2, \tau_3] \cdot e^{-j(\omega_1\tau_1 + \omega_2\tau_2 + \omega_3\tau_3)}.    (4.53)
When a real-valued stationary random process is considered, its power spectrum is real, therefore no phase information can be extracted from it; on the other hand, it can be shown that polyspectra preserve phase information. In particular, for a bispectrum the following relationship holds [122]:

C_3^x(\omega_1, \omega_2) = H(\omega_1) H(\omega_2) H^*(\omega_1 + \omega_2) \, C_3^s(\omega_1, \omega_2)    (4.54)

where the higher-order spectrum of the received signal x(n) is linked to the higher-order spectrum of the source signal s(n) by a complex-valued relation from which both the magnitude and the phase of the transfer function H(ω) can be identified. In [123] a closed form for an FIR model of the identified system impulse response is given,
h(n) = \frac{R_y[P, n]}{R_y[-P, P]}, \qquad n = 0, ..., P.    (4.55)
Unfortunately, equation 4.55 has limited practical use, as the model order P must be known and the estimate of the moment function R_y[p, n] comes with a large variance, even in the absence of noise. Nevertheless, it demonstrates the usability of HOS for blind system identification. A review of blind identification methods based on the explicit use of higher order spectra can be found in Giannakis [124], Mendel [123] and Nikias and Mendel [125]. See Hatzinakos and Nikias [126] for the application to deconvolution.
A different approach started with the work of Wiggins [127], where it was shown that blind estimation of the equalizer W(ω) can be achieved by maximizing the non-Gaussianity of the received signal. Kurtosis, a measure of the “peakedness” of the probability distribution of a real-valued random variable, was proposed as the metric for non-Gaussianity. Kurtosis is defined² as the fourth moment around the mean divided by the square of the variance (the second moment) of the probability distribution, minus 3:

\gamma_2 = \frac{\mu_4}{\sigma^4} - 3    (4.56)

therefore it implies the computation of the fourth and second moments only. The previous expression is also known as kurtosis “excess” and is commonly used because γ₂ of a normal distribution is equal to zero, while it is positive for distributions with heavy tails and a peak at zero, and negative for flatter densities with lighter tails. Distributions of positive [negative] kurtosis are thus called super-Gaussian [sub-Gaussian]. A signal with sparse peaks and wide low-level areas is characterized by a high positive kurtosis value.
Donoho [129] provided a statistical foundation to Wiggins’ method, by pointing out that, by the
central limit theorem [130], a filtered version of a non-Gaussian i.i.d. process appears “more
Gaussian” than the source itself. He also concluded that general HOS can be used to reflect the
amount of Gaussianity of a random variable.
Later, Shalvi and Weinstein [111] provided a theoretical foundation to the non-Gaussianity maximization approach, extending it to any non-Gaussian source signal, and gave a necessary and sufficient condition for the blind deconvolution of non-minimum-phase linear time-invariant systems.

²Other definitions of kurtosis exist (i.e. μ₄/μ₂² or μ₄ − 3μ₂² [128]).
The use of the skewness,

\gamma_1 = \frac{\mu_3}{\sigma^3},    (4.57)

and its advantages with respect to kurtosis in some special cases of asymmetric source signals, has also been proposed.
A different class of HOS-based blind deconvolution techniques exploits higher order statistics of the received signal indirectly, by employing a static non-linear function. This approach is due to Bellini [113], [132], [133]. In these papers, a technique for blind deconvolution, which is efficient when the input sequence is i.i.d. and the channel distortion is small, is proposed. The algorithms are known as Bussgang algorithms because the deconvolved signal exhibits Bussgang statistics when the algorithm converges in the mean value.
The basic Bussgang equalizer computes the output

y(n) = \sum_{p=0}^{P} w(p) \, x(n-p)    (4.58)

and adapts the weights according to

w_{new}(p) = w(p) + \mu \, e(n) \, x(n-p)    (4.59)

where e(n) = f(y(n)) − y(n) is the estimation error, P is the equalizer order, μ the adaptation parameter and f(.) the Bussgang nonlinearity³.
The standard Bussgang algorithm has a very slow convergence. To address this problem, Amari
et al. proposed in [134] an on-line adaptive algorithm for blind deconvolution, called Natural
Gradient (NG). The same algorithm was also discovered by Cardoso et al. and described in
[135] as the “relative gradient”.
The SISO Natural Gradient Algorithm (NGA) for an LTI system is [136]

y(n) = \sum_{p=0}^{P} w(p) \, x(n-p)    (4.60)

u(n) = \sum_{m=0}^{P} w(P-m) \, y(n-m)    (4.61)

w_{new}(p) = w(p) + \mu \left[ w(p) + f(y(n-P)) \, u(n-p) \right].    (4.62)
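A per-sample reference sketch of equations 4.60-4.62 (O(P²) per sample and deliberately unoptimized; the nonlinearity f(y) = −tanh(y) follows the footnote below, while step size and filter length are illustrative assumptions):

```python
import numpy as np

def nga_blind_deconvolution(x, P=64, mu=1e-4):
    """SISO natural gradient algorithm (eqs. 4.60-4.62), a sketch."""
    w = np.zeros(P + 1)
    w[0] = 1.0                                    # start from an identity filter
    y = np.zeros(len(x))
    for n in range(2 * P, len(x)):
        y[n] = w @ x[n - P:n + 1][::-1]           # eq. 4.60: equalizer output
        # eq. 4.61 at lags 0..P: u(n-p) filters y with the time-reversed weights
        u = np.array([w[::-1] @ y[n - p - P:n - p + 1][::-1]
                      for p in range(P + 1)])
        w += mu * (w - np.tanh(y[n - P]) * u)     # eq. 4.62 with f = -tanh
    return w, y
```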
The standard gradient descent (as in the standard Bussgang algorithm) is most useful for cost functions that have a single minimum and whose gradients are isotropic in magnitude with respect to any direction away from this minimum. In practice, however, the cost function being optimized is multi-modal, and the gradient magnitudes are non-isotropic about any minimum. In such a case, the parameter estimates are only guaranteed to locally minimize the cost function, and convergence to any local minimum can be slow. Natural gradient adaptation modifies the standard gradient search direction according to the Riemannian structure of the parameter space. While not removing local cost function minima, natural gradient adaptation provides isotropic convergence properties about any local minimum, independently of the model parametrization and of the dependencies within the signals being processed by the algorithm. Moreover, natural gradient adaptation overcomes many of the limitations of Newton's method, which assumes that the cost function being minimized is approximately locally “quadratic”. By providing a faster rate of convergence, the natural gradient extends the usability of Bussgang-type methods to non-stationary environments.

³The −tanh(.) or the −sgn(.) function is usually assumed for super-Gaussian source signals.
In [137] Bell and Sejnowski derive a self-organising learning algorithm which maximises the information transferred through a network of non-linear units to perform blind deconvolution, cancelling unknown echoes and reverberation in a speech signal. However, the examples reported in their paper are restricted to the deconvolution of unrealistically short IRs.
It must be highlighted that a non-Gaussian source signal s(n) is required by all the HOS blind identification and deconvolution methods. In fact, for a Gaussian distribution with expected value μ and variance σ², the cumulants are k₁ = μ, k₂ = σ², and k₃ = k₄ = ... = 0. Therefore no information is present in the higher order cumulants/moments.
Single channel reverberation cancellation methods that can be ascribed to the HOS blind deconvolution approach are reported in [87], [90], [138]. All these methods are based on an LPC pre-whitening step, as described in section 4.4, followed by a blind deconvolution algorithm that maximizes the non-Gaussianity of the received signal in a similar fashion to what was suggested by Wiggins [127] and discussed in section 4.6. While their performance in the SISO case is quite limited, a considerable improvement is offered by their extension to a multichannel framework, which allows the use of both spatial and temporal diversity to achieve better deconvolution. However, the way this extension is built often lacks a clear theoretical foundation, and the better behavior can be mainly justified by the consideration that a MINT-like structure is employed.
In [87] the kurtosis of the reverberant residual was proposed by Gillespie et al. as a reverberation metric. It was observed that for clean voiced speech the LP residual has strong peaks corresponding to glottal pulses, whereas for reverberated speech such peaks are spread in time. A measure of the amplitude spread of the LP residual can therefore serve as a reverberation metric. By building a filter that maximizes the kurtosis of the reverberant residual it is theoretically possible to identify the inverse function of the RTF and thus to equalize the system. This approach is blind since it requires only the evaluation of the kurtosis of the reverberant residual of the system output. Gillespie's observation can be explained by considering that, by its nature, reverberation is the process of summing a large number of attenuated and delayed copies of the same signal. Thus, by the central limit theorem [129][130], the reverberated signal has a more Gaussian distribution with respect to the original one. The LPC residual of speech mainly consists of the glottal pulses, so it is sparse and characterized by a high, positive kurtosis. Reverberation therefore causes this signal to assume a more Gaussian distribution.
In more detail, the key idea behind the Gillespie approach is to build an FIR filter w(n), applied to the residual x̃(n) of the received noisy reverberated speech signal x(n),

\tilde{y}(n) = w(n) * \tilde{x}(n)    (4.63)

so that the kurtosis of ỹ(n) is maximized. Since ỹ(n) is a zero-mean signal, the kurtosis expression can be simplified as

\gamma_2 = \frac{E\{\tilde{y}^4\}}{E^2\{\tilde{y}^2\}}.    (4.64)

Using the kurtosis as the cost function, an LMS-like algorithm can be formulated⁴,

w(n+1) = w(n) + \mu \, \nabla J(w(n))    (4.65)

with

\nabla J(w(n)) = \frac{\partial J}{\partial w} = \frac{4 \left[ E\{\tilde{y}^2\} E\{\tilde{y}^3 \tilde{x}\} - E\{\tilde{y}^4\} E\{\tilde{y} \tilde{x}\} \right]}{E^3\{\tilde{y}^2\}}.    (4.66)

Compared to the LMS algorithm, the kurtosis approach does not require the evaluation of an error signal with respect to a target, although the update equation has a very similar structure.

⁴'+' instead of '−' because maximization of the cost function is desired.

Figure 4.9: A single channel time-domain adaptive algorithm for maximizing kurtosis of the LP residual.

In [87] the gradient is simplified as
\nabla J(w(n)) = \frac{\partial J}{\partial w} = \frac{4 \left[ E\{\tilde{y}^2\} \tilde{y}^2 - E\{\tilde{y}^4\} \right] \tilde{y}}{E^3\{\tilde{y}^2\}} \cdot \tilde{x}(n) = f(n) \cdot \tilde{x}(n)    (4.67)

where μ in equation 4.65 controls the speed of the adaptation. For an efficient real-time implementation, the expected values are estimated recursively,

E\{\tilde{y}^2\}(n) = \beta E\{\tilde{y}^2\}(n-1) + (1-\beta) \tilde{y}^2(n), \qquad E\{\tilde{y}^4\}(n) = \beta E\{\tilde{y}^4\}(n-1) + (1-\beta) \tilde{y}^4(n)    (4.68)

where β is a smoothing constant.
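A hedged sketch of this adaptive scheme (equations 4.63-4.68; the initial moment values, step size and smoothing constant are illustrative assumptions):

```python
import numpy as np

def kurtosis_max_filter(x_res, P=256, mu=1e-5, beta=0.99):
    """Adaptive kurtosis maximization on the LP residual of reverberant
    speech, in the spirit of [87] (a sketch, not the exact implementation)."""
    w = np.zeros(P)
    w[0] = 1.0
    Ey2, Ey4 = 1.0, 3.0                         # running moment estimates
    y = np.zeros(len(x_res))
    for n in range(P, len(x_res)):
        frame = x_res[n - P + 1:n + 1][::-1]    # [x(n), ..., x(n-P+1)]
        y[n] = w @ frame                        # filtered residual, eq. 4.63
        Ey2 = beta * Ey2 + (1 - beta) * y[n] ** 2      # eq. 4.68
        Ey4 = beta * Ey4 + (1 - beta) * y[n] ** 4
        f = 4 * (Ey2 * y[n] ** 2 - Ey4) * y[n] / Ey2 ** 3   # eq. 4.67
        w += mu * f * frame                     # eq. 4.65, '+' to maximize
    return w, y
```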
A similar approach is also used as the first stage of a single microphone dereverberation algo-
rithm proposed by Wu and Wang [90]. The algorithm shows satisfactory results for reducing
reverberation effects when T60 is between 0.2s and 0.4s.
If a multichannel system is considered, SOS can be used both for system identification and
deconvolution.
In [139], based on the results in [140], Xu, Tong et al. proposed a method to blindly identify a set of FIR filters. The method relies on the fact that all the outputs of a multiple channel FIR system are correlated if driven by the same input. Any pair of noise-free instantiations of the same source signal s(n) is linked by the following relations,

x_i(n) = h_i(n) * s(n)    (4.70)

x_j(n) = h_j(n) * s(n)    (4.71)

then

x_i(n) * h_j(n) = s(n) * h_i(n) * h_j(n) = x_j(n) * h_i(n).    (4.72)

From this relation, an overdetermined set of linear equations, with h_i, h_j as unknowns, can be written [139]. For n = L, ..., N, where N is the last sample index of the received data x_i(n) and x_j(n) and L is the maximum length of the channel impulse responses, we have N − L + 1 linear equations
\left[ X_i(L) \;\; -X_j(L) \right] \begin{bmatrix} h_j \\ h_i \end{bmatrix} = 0    (4.73)

where

X_m(L) = \begin{bmatrix} x_m(L) & x_m(L+1) & \cdots & x_m(2L) \\ x_m(L+1) & x_m(L+2) & \cdots & x_m(2L+1) \\ \vdots & \vdots & \ddots & \vdots \\ x_m(N-L) & x_m(N-L+1) & \cdots & x_m(N) \end{bmatrix}.    (4.74)
Figure 4.10: Illustration of the relationships between the input s(n) and the observations x(n)
in an M-channel SIMO system.
Equation 4.73 can be written for each pair of channels (i, j). The equations for all channel pairs can be combined into a larger set of linear equations in terms of h₁, ..., h_M, or simply h = [h₁^T, ..., h_M^T]^T, so that all the channel impulse responses can be calculated simultaneously:

\mathcal{X}(L) \, h = 0    (4.75)

where h is the vector of the impulse responses and \mathcal{X}(L) the matrix containing the received signals,

\mathcal{X}(L) = \begin{bmatrix} \mathcal{X}_1(L) \\ \vdots \\ \mathcal{X}_{M-1}(L) \end{bmatrix}    (4.76)
with

\mathcal{X}_i(L) = \begin{bmatrix} 0 & \cdots & 0 & X_{i+1}(L) & -X_i(L) & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & X_M(L) & 0 & \cdots & -X_i(L) \end{bmatrix}.    (4.77)
The necessary and sufficient conditions to ensure a unique solution to the above equation, or in other words to assure identifiability, are [139]:

1. the channel transfer functions are coprime, i.e. they do not share any common zeros;

2. the autocorrelation matrix of the source signal, R_ss = E{s(k) s^T(k)}, is of full rank (such that the input signal can fully excite the SIMO system).

The first condition is the usual coprimeness, or channel diversity, identifiability condition for a SIMO system discussed in section 3.2.3. The second condition is relatively mild and does not imply knowledge of the exact statistics of the input signal, nor does it constrain it to be an i.i.d. process. Therefore, theoretically, any input signal that can fully excite the SIMO system can be employed. These conditions are sufficient for the blind identification of any SIMO system.
This approach is known in the literature as the Cross Relation (CR) approach. The CR approach, a term coined by Hua [141], was discovered independently and in different forms by several authors, among them Liu et al. [142], and Gurelli and Nikias [143]. These algorithms, originally aimed at solving communication problems, are often referred to as “deterministic subspace methods”, since the statistical properties of the source are not exploited. Subspace algorithms are based on the idea that the channel (or part of the channel) vector lies in a one-dimensional subspace of either the observation statistics or a block of noiseless observations.
Some of the subspace techniques, such as the EVAM algorithm proposed by Gurelli and Nikias
[143], have been used in dereverberation problems. Gurelli and Nikias showed that the null
space of the correlation matrix of the received signals contains information on the transfer
function relating the source and the microphones. This was extended by Gannot and Moonen
[144], [145] to the speech dereverberation problem.
Even if these techniques are supported by theory, they have several drawbacks in real-life sce-
narios. The Generalized Eigenvalue Decomposition (GED) [146], which is used to construct
the null space of the correlation matrix, is not robust enough, and quite sensitive to small esti-
mation errors in the correlation matrix. Furthermore, the matrices involved become extremely
large causing severe memory and computational requirements. Another problem arises from
the wide dynamic range of the speech signal. This phenomenon may result in an erroneous
estimate of the frequency response of the IRs in the low energy bands of the input signal.
In a following paper [80], Moonen and Eneman showed that, at the current state, even the more advanced subspace-based dereverberation techniques do not provide any signal enhancement in a real-life scenario. Furthermore, even if most subspace methods can converge quickly, they are difficult to implement in an adaptive mode and have a high computational load [65].
To overcome these limitations, Huang and Benesty [86] proposed a set of adaptive algorithms to solve the set of linear equations obtained from the CR approach. Their work started from the formulation of the CR approach described in [147], where the channel impulse responses of an identifiable system are blindly determined by calculating the null space of the cross-correlation-like matrix of the channel outputs,

R_x \, h = 0    (4.78)

with

R_x = \begin{bmatrix} \sum_{i \neq 1} R_{x_i x_i} & -R_{x_2 x_1} & \cdots & -R_{x_M x_1} \\ -R_{x_1 x_2} & \sum_{i \neq 2} R_{x_i x_i} & \cdots & -R_{x_M x_2} \\ \vdots & \vdots & \ddots & \vdots \\ -R_{x_1 x_M} & -R_{x_2 x_M} & \cdots & \sum_{i \neq M} R_{x_i x_i} \end{bmatrix}    (4.79)

where

R_{x_i x_j} = E\{ x_i(k) \, x_j^T(k) \}    (4.80)

and

h = [h_1^T, ..., h_M^T]^T    (4.81)

is the vector of the impulse responses. For a blindly identifiable SIMO system, the matrix R_x is rank deficient by 1. In the absence of noise, the channel impulse responses can be uniquely determined from R_x, which contains only the SOS of the system outputs.
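A hedged batch sketch of this null-space identification (equations 4.78-4.79; the correlation estimator, data layout and names are illustrative assumptions, and the estimate is recovered only up to scale):

```python
import numpy as np

def cross_corr(xi, xj, L):
    """Estimate R_{x_i x_j} = E{x_i(k) x_j^T(k)} by time averaging."""
    Fi = np.lib.stride_tricks.sliding_window_view(xi, L)
    Fj = np.lib.stride_tricks.sliding_window_view(xj, L)
    return Fi.T @ Fj / len(Fi)

def cr_identify(x, L):
    """Blind SIMO identification: take h as the eigenvector of R_x
    associated with the smallest eigenvalue (the null space, eq. 4.78)."""
    M = len(x)
    R = [[cross_corr(x[i], x[j], L) for j in range(M)] for i in range(M)]
    Rx = np.zeros((M * L, M * L))
    for i in range(M):
        for j in range(M):
            blk = (sum(R[k][k] for k in range(M) if k != i)
                   if i == j else -R[j][i])          # block pattern of eq. 4.79
            Rx[i * L:(i + 1) * L, j * L:(j + 1) * L] = blk
    vals, vecs = np.linalg.eigh(Rx)
    return vecs[:, 0].reshape(M, L)   # unit norm; scale/ordering ambiguity remains
```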
When noise is present and/or the estimate of the channel impulse responses deviates from the true value, an a priori error signal is produced,

e_{ij}(k+1) = x_i^T(k+1) \, \hat{h}_j(k) - x_j^T(k+1) \, \hat{h}_i(k), \qquad i, j = 1, ..., M    (4.82)

where ĥ_i(k) is the model filter for the i-th channel at time k. The estimated channel impulse response vector is aligned with the true one, but only up to a non-zero scale factor. This inherent scale ambiguity is usually harmless in most acoustic signal processing applications. But in the development of an adaptive algorithm, attention needs to be paid to prevent it from converging to a trivial all-zero estimate. Therefore, a constraint can be imposed on the model filter. Two constraints can be found in the literature: the unit-norm constraint, i.e. ||ĥ|| = 1, and the component normalization constraint [147], i.e. c^T ĥ = 1, where c is a constant vector.
Recall Parseval's theorem:
$$\int_{-\infty}^{\infty} |x(t)|^2\,dt = \int_{-\infty}^{\infty} |X(f)|^2\,df \qquad (4.85)$$
where X(f ) = F{x(t)} represents the continuous Fourier transform (in normalized, unitary
form) of x(t) and f represents the frequency component of x. The interpretation of this form
of the theorem is that the total energy contained in a waveform x(t) summed across all of time
t is equal to the total energy of the waveform’s Fourier Transform X(f ) summed across all of
its frequency components f .
In the discrete domain, the theorem becomes
$$\sum_{n=0}^{N-1} |x(n)|^2 = \frac{1}{N}\sum_{k=0}^{N-1} |X[k]|^2 \qquad (4.86)$$
where $X[k]$ is the Discrete Fourier Transform (DFT) of $x(n)$, both of length $N$. Therefore, the unit-norm constraint can be written as
$$\|\hat{\mathbf{h}}\|^2 = \sum_{n=0}^{N-1} |\hat h(n)|^2 = 1. \qquad (4.87)$$
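The discrete form of the theorem can be verified numerically in a few lines; the sketch below assumes numpy's unnormalized DFT convention, which matches equation (4.86).

```python
import numpy as np

x = np.random.randn(512)                  # arbitrary real test signal
X = np.fft.fft(x)                         # numpy's DFT is unnormalized
lhs = np.sum(np.abs(x) ** 2)              # time-domain energy
rhs = np.sum(np.abs(X) ** 2) / len(x)     # frequency-domain energy / N
assert np.isclose(lhs, rhs)               # equation (4.86) holds
```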
The component normalization constraint is useful when one coefficient of the model filter is known to be equal to $\alpha \neq 0$. Then the vector $\mathbf{c} = [0, \ldots, 1/\alpha, \ldots, 0]^T$ can be properly specified, so that $\mathbf{c}^T\hat{\mathbf{h}} = 1$. Even though component normalization can be more robust to noise than the unit-norm constraint [147], knowledge of the location of the component to be normalized and of its value $\alpha$ may not be available in practice, so the unit-norm constraint is more widely used.
The adaptive algorithm minimizes the cost function
$$J(k+1) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} e_{ij}^2(k+1) \qquad (4.89)$$
whose gradient, under the unit-norm constraint, is
$$\nabla J(k+1) = \frac{\partial J(k+1)}{\partial \hat{\mathbf{h}}(k)} = \frac{2\left[\tilde{\mathbf{R}}_x(k+1)\,\hat{\mathbf{h}}(k) - J(k+1)\,\hat{\mathbf{h}}(k)\right]}{\hat{\mathbf{h}}^T(k)\,\hat{\mathbf{h}}(k)} \qquad (4.90)$$
where $\tilde{\mathbf{R}}_x$ is a matrix with the same structure as the $\mathbf{R}_x$ matrix in equation (4.79), but built from the instantaneous values of the received signals.
This algorithm is known as the Multichannel LMS algorithm (MCLMS). Several algorithms
based on the MCLMS algorithm, and that outperform it, have been proposed by the same
authors. All of them are documented and discussed in [86]. Among them:
• the Unconstrained Multichannel LMS algorithm with optimal step size (VSS-UMCLMS), which has faster convergence and a similar computational load;
• the Constrained Multichannel Newton (CMN) algorithm, which has much faster convergence but a high computational load;
• the normalized multichannel frequency domain LMS (NMCFLMS), which takes advantage of the computational efficiency of the FFT and, by orthogonalizing the data, also offers faster convergence [45].
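To make the cross-relation machinery concrete, the following minimal numpy sketch implements one unit-norm constrained MCLMS step built directly on the a priori error and the cost of equations (4.78)-(4.90). It is an illustration under simplifying assumptions (frame-wise inner products stand in for the running convolutions; the name mclms_step is hypothetical), not the authors' reference implementation.

```python
import numpy as np

def mclms_step(x_frames, h_hat, mu=0.01):
    """One unit-norm constrained MCLMS step on the cross-relation cost.

    x_frames : (M, L) array, the most recent L samples of each channel
    h_hat    : (M, L) array, current estimates of the M channel IRs
    """
    M = h_hat.shape[0]
    grad = np.zeros_like(h_hat)
    J = 0.0
    for i in range(M - 1):
        for j in range(i + 1, M):
            # a priori cross-relation error: x_i * h_j - x_j * h_i
            e = x_frames[i] @ h_hat[j] - x_frames[j] @ h_hat[i]
            J += e ** 2
            grad[j] += 2 * e * x_frames[i]
            grad[i] -= 2 * e * x_frames[j]
    h_new = h_hat - mu * grad
    return h_new / np.linalg.norm(h_new), J   # re-impose the unit norm
```

The division by the norm at the end is what prevents the adaptation from collapsing to the trivial all-zero solution discussed above.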
The NMCFLMS algorithm has been documented as being able to identify rather short (256-tap) impulse responses using a 100-second-long speech recording as the source signal [86].
One of the main challenges for NMCFLMS is that the algorithm suffers from a misconvergence
problem. It has been shown through simulations presented in [148], [149], that the estimated
filter coefficients converge first toward the impulse response of the acoustic system but then
misconverge. Under low signal-to-noise ratio (SNR) conditions, the effect of misconvergence
becomes more significant and occurs at an earlier stage of adaptation. Possible solutions to this
problem have been investigated by Gaubitch et al. in [150], [151] and by Ahmad et al. [148]
[152], [149]. In [153], a noise-robust adaptive blind multichannel identification algorithm for acoustic impulse responses, called ext-NMCFLMS and based on a modification of the NMCFLMS, was proposed. No information about its performance with longer acoustic IRs is available.
The previous approaches do not directly address the problem of system equalization. In fact, a
further step is required to calculate the equalizer once the FIR filters have been estimated.
The problem of SIMO system equalization has been investigated by Slock and Papadias in [154], where it is shown that blind equalization can be achieved by a linear prediction algorithm applied to the channel bank outputs.
A more general approach, which provides a unifying framework for multichannel blind equalization, is reported in Liu and Dong [92]. In this paper it is proved that, if the channel transfer functions $h_i$ share no common zeros and if the source signal $s(n)$ is zero-mean and temporally uncorrelated, then an FIR equalizer bank $w_i$ equalizes the FIR channel bank if, and only if, the composite output of the equalizer bank $y(n)$ is temporally uncorrelated. It is interesting to note that the existence condition for the filter bank is identical to the one required by the MINT. While the MINT gives a closed formula to calculate the equalizers in a non-blind framework, this theorem suggests how to calculate them in the blind case. The advantage of a multichannel structure with respect to a single-channel one when a non-minimum phase system is considered is present in the blind case as well. Several observations follow from this theorem:
• the second-order statistics of the composite output of the equalizer bank contain sufficient information for blind equalization, even though they do not contain sufficient information for blind identification;
• the hypotheses are very mild: the source signal need not be stationary, nor i.i.d.;
• since the linear prediction error is temporally uncorrelated, the method proposed by Slock and Papadias [154], which achieves direct blind equalization by multichannel linear prediction, follows immediately from the theorem;
• no constraints exist on how to obtain the output decorrelation; therefore both linear and non-linear approaches can be used.
All the following blind speech dereverberation algorithms are based on, or can be explained by, the Liu and Dong [92] theorem: in all of them, the FIR equalizer bank is calculated to equalize the FIR channel bank by decorrelating the composite output of the equalizer bank $y(n)$. This decorrelation is achieved by multichannel linear prediction [155], [156], by decorrelating the composite filter bank output with SOS [157] or HOS [88] methods, or by shaping the correlation of the received reverberant signal [158]. The main novelty of these contributions lies in the techniques used to decouple the speech production system from the room response system, to avoid ambiguous deconvolution.
In [155] Triki and Slock proposed a dereverberation technique based on the observation that a
single-input multi-output (SIMO) system is equalized blindly by applying multichannel linear
prediction (LP). This approach follows [154] and the Liu and Dong theorem [92]. However, when the input is colored, multichannel linear prediction will both equalize the reverberation filter and whiten the source. Therefore, as discussed in section 4.4, it is critical to define a criterion to decouple the source and the reverberation. To address this problem, channel spatial diversity and the non-stationarity of the speech signal were employed to estimate the source correlation structure, which can then be used to determine a source whitening filter (more details are discussed in section 4.4). Multichannel linear prediction was then applied to the sensor signals filtered by the source whitening filter, to obtain source dereverberation. The algorithm was tested with synthetic impulse responses generated by a MISM algorithm, and the simulations show that the proposed equalizer outperforms the delay and sum beamformer. In a subsequent work [118], dereverberation in a noisy environment was considered, and the robustness of the algorithm was improved by employing an equalizer based on an MMSE criterion instead of a simple ZF equalizer.
The LIME (LInear-predictive Multi-input Equalization) algorithm, discussed below, is built on the following signal model.
2. The room impulse response between the source and the $i$-th microphone is called $h_i(n)$.
The signal received by the $i$-th microphone, $u_i(n)$, is the source signal, $s(n)$, convolved with $h_i(n)$, $i \in 1, \ldots, P$; $i = 1$ is assumed for the microphone closest to the source.
The room transfer functions (RTFs) are modelled using time invariant polynomials and assumed
to share no common zeros. The RTFs are defined as
$$H_i(z) = \sum_{k=0}^{M-1} h_i(k)\,z^{-k}, \qquad i \in 1, \ldots, P \qquad (4.92)$$
where $M$ is the number of taps of the room impulse response. Using matrix notation, the previous equation can be rewritten as
$$\mathbf{u}_i(n) = \mathbf{H}_i^T\,\mathbf{s}(n) \qquad (4.93)$$
where $\mathbf{u}_i(n) = [u_i(n), \ldots, u_i(n-L+1)]^T$ and $\mathbf{H}_i$ is an $(M+L-1) \times L$ convolution matrix expressed as
$$\mathbf{H}_i = \begin{bmatrix}
h_i(0) & 0 & \cdots & 0\\
h_i(1) & h_i(0) & \ddots & \vdots\\
\vdots & h_i(1) & \ddots & 0\\
h_i(M-1) & \vdots & \ddots & h_i(0)\\
0 & h_i(M-1) & & h_i(1)\\
\vdots & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & h_i(M-1)
\end{bmatrix} \qquad (4.94)$$
and $\mathbf{s}(n) = [s(n), \ldots, s(n-N+1)]^T$. The length of the signal vector $\mathbf{u}_i(n)$ is denoted by $L$, and its minimum value is derived from the condition
$$L \ge \frac{M-1}{P-1}. \qquad (4.95)$$
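The convolution matrix of equation (4.94) is easy to build and verify numerically; the following short sketch (function name and toy values illustrative) checks that multiplying by it reproduces a linear convolution.

```python
import numpy as np

def convolution_matrix(h, L):
    # (M + L - 1) x L matrix whose columns are shifted copies of h,
    # so that H @ w equals np.convolve(h, w) for any length-L vector w
    M = len(h)
    H = np.zeros((M + L - 1, L))
    for col in range(L):
        H[col:col + M, col] = h
    return H

h = np.array([1.0, -0.5, 0.25])   # toy 3-tap impulse response
H = convolution_matrix(h, L=4)
w = np.random.randn(4)
assert np.allclose(H @ w, np.convolve(h, w))
```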
3. The input signal s(n) is assumed to be generated from a finite long-term AR process applied
to a white noise e(n). The Z transform of the AR process is 1/a(z), where a(z) is the AR
polynomial
$$a(z) = 1 - a_1 z^{-1} - \cdots - a_N z^{-N} \qquad (4.96)$$
where $N$ denotes the order of the AR polynomial, given by $N = M + L - 1$. The AR polynomial can be associated with its companion matrix
$$\mathbf{C} = \begin{bmatrix}
a_1 & 1 & 0 & \cdots & 0\\
a_2 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \ddots & \vdots\\
a_{N-1} & 0 & \cdots & 0 & 1\\
a_N & 0 & \cdots & \cdots & 0
\end{bmatrix} \qquad (4.98)$$
The algorithm then proceeds as follows.
1. Both the prediction filter $\mathbf{w}$ and the AR polynomial $a(z)$ are estimated from a matrix $\mathbf{Q}$, which is defined as [159]
$$\mathbf{Q} = \left(E\left\{\mathbf{u}(n-1)\,\mathbf{u}^T(n-1)\right\}\right)^{+} E\left\{\mathbf{u}(n-1)\,\mathbf{u}^T(n)\right\} \qquad (4.99)$$
where u(n) = [uT1 (n), ..., uTP (n)]T , A+ is the Moore-Penrose generalized inverse of matrix
A, and E {.} denotes the time averaging operator. The covariance matrix is estimated using
$$E\left\{\mathbf{x}(n)\,\mathbf{y}(n)^T\right\} = \frac{1}{N_s}\sum_{n=0}^{N_s-1}\left(\mathbf{x}(n)-\mathbf{m}_x\right)\left(\mathbf{y}(n)-\mathbf{m}_y\right)^T \qquad (4.100)$$
where $N_s$ denotes the length of the reverberant signal segment, $\mathbf{x}(n) = [x(n), \ldots, x(n-L+1)]^T$, $\mathbf{y}(n) = [y(n), \ldots, y(n-L+1)]^T$, and $\mathbf{m}_x$, $\mathbf{m}_y$ are their respective mean vectors,
$$\mathbf{m}_x = \frac{1}{N_s}\sum_{n=0}^{N_s-1}\mathbf{x}(n) \qquad (4.101)$$
and
$$\mathbf{m}_y = \frac{1}{N_s}\sum_{n=0}^{N_s-1}\mathbf{y}(n). \qquad (4.102)$$
The first column of Q gives the prediction filter w, and an estimate of the AR polynomial a(z)
is obtained from the characteristic polynomial of Q.
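The estimation of Q in equations (4.99)-(4.102) can be sketched as follows; the code assumes the stacked vectors u(n) are already available as the columns of a matrix, and the extraction of w and a(z) follows the text (first column of Q, characteristic polynomial of Q). Names are illustrative, not those of the original implementation.

```python
import numpy as np

def lime_q(U):
    """Sketch of equations (4.99)-(4.102).

    U : (D, Ns) array whose n-th column is the stacked vector
        u(n) = [u_1(n)^T, ..., u_P(n)^T]^T.
    Returns the prediction filter (first column of Q) and the AR
    polynomial estimate (characteristic polynomial of Q).
    """
    U0, U1 = U[:, :-1], U[:, 1:]           # pairs (u(n-1), u(n))
    m0 = U0.mean(axis=1, keepdims=True)    # mean vectors, eqs. (4.101)-(4.102)
    m1 = U1.mean(axis=1, keepdims=True)
    Ns = U0.shape[1]
    R00 = (U0 - m0) @ (U0 - m0).T / Ns     # E{u(n-1) u(n-1)^T}, eq. (4.100)
    R01 = (U0 - m0) @ (U1 - m1).T / Ns     # E{u(n-1) u(n)^T}
    Q = np.linalg.pinv(R00) @ R01          # equation (4.99)
    return Q[:, 0], np.poly(Q)             # prediction filter w, a(z)
```

The use of the Moore-Penrose pseudoinverse (np.linalg.pinv) mirrors the $(\cdot)^+$ in equation (4.99) and keeps the computation well-defined when the covariance matrix is rank deficient.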
The residual signal is free from the effect of room reverberation, but it is also excessively whitened. Filtering the prediction residual with $1/a(z)$ produces the recovered input signal multiplied by a factor of $h_1(0)$. Simulation results showed that LIME could achieve almost perfect dereverberation, but only for short-duration impulse responses [161]. Additionally, the computation of large covariance matrices makes LIME a computationally expensive algorithm. In [162], speech dereverberation in the presence of noise was discussed, and it was shown that the LIME algorithm can achieve both dereverberation and noise reduction.
In [160], where an improved method based on the LIME approach was proposed, it was shown experimentally that the algorithm performance is signal dependent, i.e. applying the algorithm to different segments of the same speech signal produces different performance levels. Therefore LIME may not perform well, and may produce signals whose audible quality is worse than that of the reverberant signal. The authors proposed two non-intrusive methods to evaluate the performance of LIME. These methods are used to detect errors in the estimation of the source signal and to isolate the input segments associated with poor dereverberation. For these segments an alternative inverse filter calculation approach is used, which minimizes the least squares (LS) error between the LIME estimate and the filtered microphone signals. These LS filters are then used to dereverberate the received microphone signals. These modifications led to stable performance with different speech signals under the same room conditions.
In [88] and [157], Yoshioka et al. proposed two similar methods, one HOS and the other SOS based, applied to the same multichannel structure, to calculate the dereverberation filters. Both methods rely on the Liu and Dong [92] theorem, since the dereverberation filter is calculated by decorrelating the composite output of the equalizer bank. The fundamental issue addressed in both papers is how to estimate the channel inverse filter separately from the inverse filter of the speech-generating AR system, in other words from the prediction error filter (PEF). The authors claim that, by jointly estimating the channel inverse filter and the PEF, the channel inverse is identifiable due to the time-varying nature of the PEF.
In [56], Gillespie et al. showed that penalizing long-term reverberation energy is more effective than maximizing the signal-to-reverberation ratio (SRR) for improving audible quality and automatic speech recognition (ASR) accuracy. In a following paper [158], they noticed that the energy in the tail of the autocorrelation sequence of the received reverberant signal is related to the amount of long-term reverberation. Based on these observations, a technique called Correlation Shaping (CS) was proposed, aimed at reshaping the autocorrelation function of a signal by means of linear filtering. This has the intended effect of blindly reducing the length of the equalized speaker-to-receiver impulse response, to improve audible quality and ASR accuracy. Dereverberation can, therefore, be achieved by reshaping the autocorrelation of the linear prediction residual to a Dirac δ function, that is, by whitening it. Hence this multichannel algorithm also relies on the Liu and Dong [92] theorem.
Such a decorrelation process is achieved through adaptive filtering, as shown in Fig. 4.12. The actual system is not limited solely to 2 channels, but can be generalized to any number of inputs, $N$. The proposed correlation shaping technique modifies the received reverberant signals, $x_i$, using a set of adaptive linear filters, $g_i$,
$$y(n) = \sum_{i=1}^{N} \mathbf{g}_i^T(n)\,\mathbf{x}_i(n) \qquad (4.104)$$
to minimize the weighted mean square error between the actual output autocorrelation sequence, $r_{yy}$, and the desired output autocorrelation sequence, $r_{dd} = \delta(\tau)$. This error is given by
$$e(\tau) = W(\tau)\left(r_{yy}(\tau) - r_{dd}(\tau)\right)^2 \qquad (4.105)$$
where $W(\tau)$ is a real-valued weight. A large positive value of $W(\tau)$ gives more importance to
the error at a particular lag, τ . The adaptive filter employed uses a normalized gradient descent
approach to accomplish this minimization. The update equation for each filter coefficient l of
the ith channel is
$$g_i(l, n+1) = g_i(l, n) - \mu\,\frac{\nabla_i(l)}{\sqrt{\sum_i\sum_l \nabla_i^2(l)}}. \qquad (4.106)$$
It can be shown [158], [3] that the gradient for each filter coefficient, $l$, that minimizes $e(\tau)$ with respect to the filter coefficients is given by
$$\nabla_i(l) = \sum_{\tau>0} W(\tau)\,r_{yy}(\tau)\left(r_{x_i y}(l-\tau) + r_{x_i y}(l+\tau)\right) \qquad (4.107)$$
where rxi y is the crosscorrelation between the ith channel input and the output. The gradi-
ent normalization improves the convergence properties (similarly to the Normalized LMS) and
normalizes the output signal energy.
Since long reverberation times are especially harmful [158], a “don't care” region in the desired output autocorrelation function can be included to improve the whitening process for long lags,
at the expense of allowing higher level short-term correlation. This is achieved by not including
the first autocorrelation lags in the gradient computation.
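A sketch of the correlation shaping gradient of equation (4.107), including the “don't care” region, is shown below. The lag bookkeeping and the helper xcorr are illustrative choices, not the implementation of [158].

```python
import numpy as np

def xcorr(a, b, max_lag):
    """r_ab(m) = (1/n) sum_n a(n) b(n+m), for m = -max_lag .. max_lag."""
    n = min(len(a), len(b))
    r = np.zeros(2 * max_lag + 1)
    for i, m in enumerate(range(-max_lag, max_lag + 1)):
        if m >= 0:
            r[i] = np.dot(a[:n - m], b[m:n]) / n
        else:
            r[i] = np.dot(a[-m:n], b[:n + m]) / n
    return r

def cs_gradient(x_i, y, W, L, dontcare=0):
    """Gradient of equation (4.107) for channel i, skipping the first
    `dontcare` lags (the "don't care" region)."""
    max_lag = len(W) - 1 + L
    ryy = xcorr(y, y, len(W) - 1)       # output autocorrelation r_yy(tau)
    rxy = xcorr(x_i, y, max_lag)        # cross-correlation r_{x_i y}
    mid_y, mid_x = len(W) - 1, max_lag  # indices of lag 0
    grad = np.zeros(L)
    for l in range(L):
        for tau in range(dontcare + 1, len(W)):
            grad[l] += W[tau] * ryy[mid_y + tau] * (
                rxy[mid_x + l - tau] + rxy[mid_x + l + tau])
    return grad
```

The per-coefficient update of equation (4.106) then divides each gradient entry by the global norm of all gradients across channels and taps before applying the step size µ.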
Experimental results, based on synthetic acoustic IRs, showed that reverberant speech processed with 4-channel correlation shaping provided better audio quality than either 4-channel DS processing or an unprocessed single channel.
As a final comment on SIMO equalization methods, it is interesting to notice that some HOS multichannel methods can also be explained by the Liu and Dong [92] theorem, even though they do not guarantee that the output is uncorrelated. For instance, the kurtosis maximization algorithm used in the multichannel dereverberation algorithm discussed in section 4.6.1 and described in [87] can also be explained as a way to decorrelate the composite output of the equalizer bank $y(n)$.
A different approach, which cannot be explicitly explained by the Liu and Dong [92] theorem, was proposed in [163] by Furuya et al. This is a promising algorithm, based on the blind MINT approach. The conventional MINT requires the room impulse responses to calculate the inverse filters, so it cannot recover speech signals when the room impulse responses are unknown. However, as suggested by other SOS methods, the inverse filters can be blindly estimated from the correlation matrix between the input signals, which can be observed (a pre-whitening step, as described in section 4.4, is assumed):
$$\mathbf{R} = E\left\{\mathbf{X}^T\mathbf{X}\right\} \qquad (4.108)$$
where $\mathbf{X}$ is the data matrix built from the observed microphone signals, $N$ is the number of channels, $L = \frac{J-1}{N-1}$ is the inverse filter length, $J$ is the impulse response length, and $E\{\cdot\}$ denotes statistical expectation.
Assuming that the microphone closest to the source is known, or estimated in practice by using the time difference of arrival (TDOA), it can be shown [163] that the inverse filters can be calculated as
$$\mathbf{w} = \mathbf{R}^{-1}\mathbf{v} \qquad (4.112)$$
where
$$\mathbf{w} = [w_1(1), \ldots, w_1(M), \ldots, w_N(1), \ldots, w_N(M)]^T \qquad (4.113)$$
and
$$\mathbf{v} = [1, 0, \ldots, 0]^T \qquad (4.114)$$
is an $NL \times 1$ vector. (The notation employed here is consistent with the definitions used in the MINT theorem reported in section 3.2.3, while a different notation is adopted in [91].)
In situations requiring adaptation, where the impulse responses change frequently, the following recursive time-averaging is used to estimate the correlation matrix [91]:
$$\hat{\mathbf{R}}_k = \beta\,\hat{\mathbf{R}}_{k-1} + (1-\beta)\,\mathbf{X}_k^T\mathbf{X}_k \qquad (4.115)$$
where $\hat{\mathbf{R}}_k$ is the estimate of $\mathbf{R}$ at time $k$ and $\beta$ is the “forgetting factor”, that is, the weight of the older estimate $\hat{\mathbf{R}}_{k-1}$. $\hat{\mathbf{R}}_k$ is used in equation (4.112), so that
$$\mathbf{w} = \hat{\mathbf{R}}_k^{-1}\mathbf{v}. \qquad (4.116)$$
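One recursive step of this scheme, equations (4.115)-(4.116), can be sketched as follows; the diagonal loading used to stabilize the solve is an added safeguard, not part of the original method.

```python
import numpy as np

def blind_mint_update(R_prev, X_frame, beta=0.995):
    """Recursive correlation estimate (4.115) and inverse-filter solve (4.116).

    R_prev  : previous correlation matrix estimate
    X_frame : current data matrix X built from the microphone signals
    """
    R_k = beta * R_prev + (1 - beta) * X_frame.T @ X_frame
    v = np.zeros(R_k.shape[0]); v[0] = 1.0           # v = [1, 0, ..., 0]^T
    # diagonal loading (an assumption, not in the original formulation)
    w = np.linalg.solve(R_k + 1e-8 * np.eye(R_k.shape[0]), v)
    return R_k, w
```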
The previous equation can be solved efficiently by using the FFT-based fast conjugate gradient method described in [164], where a real-time dereverberation system was proposed. A real acoustic path was employed, and the measured IRs were used only to verify the dereverberation performance. The experimental setup was the following:
• the source signal was two reproductions of male and female speech;
• the room size was 6.6 m × 4.6 m × 3.1 m, with a reverberation time of about 0.55 s;
For performance comparison, the reverberation curves and the dereverberation improvement were calculated using the measured impulse response (6600 taps long). Figure 4.13 shows the real-time dereverberation structure proposed by Furuya et al. in [164].
It is worth pointing out that this is the only reverberation cancellation algorithm, among the ones described so far, that has been tested in a realistic environment.
Deconvolution based on inverse filtering does not improve the tail of the reverberation, because real-world impulse responses fluctuate constantly, and the estimation error of the inverse filters is caused by the deviation of the correlation matrix, which is averaged over a finite duration. As a possible improvement, a hybrid reverberation cancellation/suppression method is proposed in [91]: a modified spectral subtraction algorithm is cascaded with the blind MINT deconvolution algorithm. Spectral subtraction estimates the power spectrum of the reverberation and subtracts it from the power spectrum of the reverberant speech. Inverse filtering reduces the early reflections, which carry most of the power of the reverberation; spectral subtraction then suppresses the tail of the inverse-filtered reverberation. Since inverse filtering reduces the power of the reverberation, the nonlinear processing distortion of spectral subtraction is reduced, as a small subtractive power suffices. The authors claim superior dereverberation results for every reverberation time. On the other hand, even if the algorithm is effective and robust in situations requiring adaptation, the adaptation speed is still too slow for practical applications.
According to Brillinger [165], the sample size needed to estimate the $n$th-order statistics of a random process to prescribed values of estimation bias and variance increases almost exponentially with the order $n$. This is why HOS-based blind deconvolution methods often exhibit a slower rate of convergence than SOS-based ones. This can be of concern in highly non-stationary environments, where an HOS-based algorithm might not have enough time to track the statistical variations. On the other hand, when a method based on kurtosis maximization is used, we are not necessarily trying to achieve an accurate estimate of the kurtosis, so performance in one case may not translate into performance in the other. HOS methods based on the implicit non-linear processing approach (i.e. the Bussgang equalizer) are less demanding from the complexity perspective, even if they might be prone to local minima [166]. An advantage of HOS-based methods is that they can be employed in SISO systems, while SOS methods require a multichannel framework; HOS methods, however, require a non-Gaussian received signal.
4.9.1 HERB
The speech signal is modelled as the sum of a harmonic component $s_h(n)$ and a non-harmonic component $s_n(n)$ (i.e. voiced and unvoiced speech). The harmonic component, $s_h(n)$, can be modelled as a periodic signal within every short-time frame. When reverberation is added, the harmonic part of the observed signal splits as $x_h(n) = x_{hd}(n) + x_{hr}(n)$, where $x_{hd}$ and $x_{hr}$ are respectively the direct and the reverberant components of the harmonic signal.
$x_{hd}$ can be estimated by applying an adaptive harmonic filter to $x(n)$, that is, a comb-like filter capable of separating over time the dominant sinusoidal components (i.e. the fundamental frequency $f_0$ and its multiple harmonics). The output of this filter is a rough estimate of the harmonic components of the direct signal within the observed signals.
An underlying assumption of this approach is that the separation of speech and reverberation is
possible since the speech periodicity and the reverb have different time scales, therefore, from
this point of view, there are similarities with the methods based on the LP residual processing.
The dereverberation filter is estimated, for each frequency bin, as
$$W(k) = \mathcal{A}\!\left(\frac{\hat X(l,k)}{X(l,k)}\right)$$
where $X(l,k)$ and $\hat X(l,k)$ are the discrete Fourier transforms (DFTs) of the observed reverberant signal and of the output of the adaptive harmonic filter, respectively, at time frame $l$ and frequency bin $k$. $\mathcal{A}(\cdot)$ is a function that calculates the weighted average of $\hat X/X$ for each $k$ over different time frames. This filter has been shown to approximate the inverse filter of the acoustic transfer function between speaker and microphone.
2. the initial estimate of the direct harmonic signal included in x(n) is estimated for each
short-time frame with an adaptive harmonic filter;
3. the DFTs of $x(n)$ and $\hat x(n)$ are calculated;
6. the estimated filter w(n) is convolved with the observed signal to dereverberate it.
The conventional formulation of HERB does not provide an analytical framework within which
the dereverberation performance can be optimized. In [168], Nakatani et al. reformulated
HERB as a maximum a posteriori (MAP) estimation problem, in which the dereverberation
filter was determined as one that maximizes the a posteriori probability given the observed
signals. In real speech signals, there are certain non-harmonic components included, and the
adaptive harmonic filter cannot completely eliminate the reverberation. Therefore, repetitive
observations over a sufficiently long time duration are necessary to decrease estimation errors.
A very large training data set is required to calculate a correct dereverberation filter (i.e. more
than 60 minutes of speech data) under the assumption that the system is time-invariant [169].
This major drawback was investigated by Kinoshita et al. in [170], where a fast version of
the HERB algorithm, able to provide the estimation of the dereverberation filter with a greatly
reduced training data set (i.e. 1 minute long), was proposed. The faster convergence is obtained by applying a noise reduction algorithm to the output of the adaptive harmonic filter, $\hat x(n)$, by recalculating the dereverberation filter in each time frame, and by averaging the filters over several frames to obtain the final filter. However, despite the good dereverberation quality, fast HERB is still too slow for practical applicability, especially considering that the system must remain time-invariant during convergence and that an unrealistic inverse filter length, on the order of hundreds of thousands of taps [89], is used.
Since convolution in the time domain becomes addition in the cepstral domain, the real cepstrum of the reverberant signal can be written as $y_{rc}[m] = x_{rc}[m] + h_{rc}[m]$, where
$$z_{rc}[m] = \mathcal{F}^{-1}\left\{\log\left|\mathcal{F}\{z(n)\}\right|\right\} \qquad (4.121)$$
is the real cepstrum of a signal $z(n)$ and $\mathcal{F}$ is the Fourier transform. Speech can be considered a “low quefrent” signal, as $x_{rc}[m]$ is typically concentrated around small values of $m$. The room reverberation $h_{rc}[m]$, on the other hand, is expected to contain higher “quefrent” information.
The amount of reverberation can hence be reduced by appropriate lowpass “liftering” of $y_{rc}[m]$, that is, by suppressing high “quefrent” information, or through peak picking in the low “quefrent” domain. Even if cepstrum-based techniques are popular in speech recognition, they generally perform poorly here, since speech and the room acoustics cannot be clearly separated in the cepstral domain. Furthermore, even if this separation could be achieved, estimation of the clean speech signal would still be problematic, since equation (4.121) removes all phase information. Therefore, the proposed algorithms can only be applied successfully in simple reverberation scenarios (i.e. speech degraded by simple echoes). As a final consideration, this is a nonlinear technique that cannot be described by linear dereverberation filters.
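For illustration only, a crude low-quefrency liftering sketch is given below; since equation (4.121) discards the phase, the resynthesis step (here reusing the original phase) is an arbitrary illustrative choice, which is precisely the limitation noted above.

```python
import numpy as np

def lowpass_lifter(y, cutoff):
    """Keep only the low-quefrency part of the real cepstrum (eq. 4.121).
    The phase is discarded by the log-magnitude, so the output is
    resynthesized with the original phase, an illustrative choice."""
    Y = np.fft.fft(y)
    y_rc = np.fft.ifft(np.log(np.abs(Y) + 1e-12)).real   # real cepstrum
    y_rc[cutoff:-cutoff] = 0.0            # suppress high-quefrency content
    mag = np.exp(np.fft.fft(y_rc).real)   # liftered magnitude spectrum
    return np.fft.ifft(mag * np.exp(1j * np.angle(Y))).real
```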
4.10 Summary
In theory, blind deconvolution of the acoustic channel allows perfect dereverberation. However, at present there are no techniques available to blindly estimate acoustic IRs in a realistic environment. Therefore no blind dereverberation method based on acoustic IR identification and inversion exists.
Mourjopoulos's [48] observations about non-blind approaches suggest that, even if an almost perfect estimate of the acoustic IR is available, a subtle change in the conditions would result in very poor dereverberation. Therefore it might be argued that dereverberation based on identification and inversion is a weak approach, unless the algorithm can track the system evolution quickly enough.
However, simulations based on the direct identification of the equalizer filter(s) seem more effective. Blind dereverberation techniques based on kurtosis maximization have been shown to be practically usable for the reduction of the early reverberant part in a single-channel dereverberation algorithm [90]. More consistent results have been claimed for the multichannel extension proposed in [87]. Several methods implicitly based on the fundamental theorem of multichannel equalization introduced by Liu and Dong in [92] have been proposed; only simulations based on synthetic data are available, so it is difficult to evaluate their performance in a realistic environment. An approach based on the blind MINT has been proposed in [91], and a working real-time implementation, robust to IR variations, has been claimed.
Chapter 5
Novel HOS based blind
dereverberation algorithms
This chapter describes new blind dereverberation algorithms based on a HOS approach. This work has been inspired by the work of Gillespie [87] and is structured in three main contributions:
1. the formulation of a single channel dereverberation algorithm based on a maximum likelihood (ML) approach [172];
2. its extension to a multichannel structure [173];
3. the proposal of novel single and multichannel methods based on the natural gradient and of a new dereverberation structure that improves the speech and reverberation model decoupling [174].
This work has appeared separately in three published conference papers [172],[173],[174].
The algorithm proposed by Gillespie et al. in [87], and described in section 3.7.2, was implemented. In particular, it appeared reasonable to investigate the basic component of the algorithm in its simplest form: the single channel, time domain version.
Here we consider the single channel dereverberation framework described in section 3.2.1, where the acoustic path (the channel) is modelled as a linear time-invariant system characterized by an IR, $h(n)$. Therefore, the source signal, $s(n)$, and the reverberant signal, $x(n)$, are linked by the equation
$$x(n) = h(n) * s(n). \qquad (5.1)$$
Figure 5.2: Example of misconvergence of the single channel time domain algorithm proposed
by Gillespie and Malvar. (a) system IR, (b) identified inverse IR, (c) equalized IR.
The main idea of the Gillespie method is that, by building a linear filter that maximizes the kurtosis of the speech LP residual, it is theoretically possible to identify the inverse filter $w(n)$ and thus to equalize the system. This approach is blind, since it requires only the evaluation of the kurtosis of the LP residual of the system output. Conversely, as discussed in section 4.4, the use of the LP residual to decouple the harmonic structure of speech from the reverberation introduces ambiguity.
During tests performed with the single channel time domain version of the algorithm described in [87], stability problems and unexpected results were observed. In some apparently unpredictable conditions, the algorithm captures a harmonic component and creates a resonant spike in the residual. Therefore, even if the kurtosis is smoothly maximized (Figure 5.1), the inverse filter might converge to a highly resonant state (Figure 5.2). By creating isolated strong peaks in the residual (thus making it sparser), the kurtosis increases, but the result is irrelevant for dereverberation purposes. This issue can be associated with the extreme sensitivity of the kurtosis, $\gamma_2$, to “outlying” values. Examining the simplified expression of the kurtosis for a zero-mean, unit-variance random variable $y$,
$$\gamma_2 = E\left\{y^4\right\} - 3, \qquad (5.3)$$
it can be noticed that its value is greatly affected by values of $y$ greater than 1. Similar criticisms of kurtosis have been raised in the context of Independent Component Analysis [175].
Its gradient with respect to the filter $\mathbf{w}$ is
$$\frac{\partial \gamma_2}{\partial \mathbf{w}} = E\left\{4y^3\,\frac{\partial y}{\partial \mathbf{w}}\right\} = 4E\left\{y^3\,\frac{\partial \mathbf{w}^T\mathbf{x}}{\partial \mathbf{w}}\right\} = 4E\left\{y^3\mathbf{x}\right\}. \qquad (5.5)$$
In order to minimize the sensitivity of the algorithm to “outlying” values, a maximum likelihood (ML) approach was proposed. Maximum likelihood estimation is the procedure of finding the values of one or more parameters that maximize the likelihood of the observations. Here, the idea is to build $\mathbf{w}$ in order to achieve a desired probability density function of the output $y$:
$$\max_{\mathbf{w}}\, E\left\{\log P(y)\right\} = \max_{\mathbf{w}}\, E\left\{\log P(\mathbf{w}^T\mathbf{x})\right\}. \qquad (5.7)$$
Denoting this cost function by $J$, its gradient is
$$\frac{\partial J}{\partial \mathbf{w}} = E\left\{\frac{\partial \log P(y)}{\partial \mathbf{w}}\right\} = E\left\{\phi(y)\,\frac{\partial \mathbf{w}^T\mathbf{x}}{\partial \mathbf{w}}\right\} = E\left\{\phi(y)\,\mathbf{x}\right\} \qquad (5.9)$$
where
$$\phi(y) = \frac{1}{P(y)}\frac{\partial P(y)}{\partial y}. \qquad (5.10)$$
The probability density function of $y$ is chosen so as to have high kurtosis and a bounded derivative. A probability density function with these properties is
$$P(y) = \frac{1}{\cosh(y)} \qquad (5.12)$$
giving
$$\phi(y) = \frac{1}{P(y)}\frac{\partial P(y)}{\partial y} = \cosh(y)\,\frac{\partial}{\partial y}\frac{1}{\cosh(y)} = -\tanh(y) \qquad (5.13)$$
where cosh and tanh are respectively the hyperbolic cosine and the hyperbolic tangent functions. The resulting gradient-ascent update for the inverse filter is
$$\mathbf{w}_{\text{new}} = \mathbf{w} + \mu\,E\left\{-\tanh(y)\,\mathbf{x}\right\}. \qquad (5.14)$$
Figure 5.3: A single channel time-domain adaptive algorithm for maximizing kurtosis of the
LP residual.
Thus the hyperbolic tangent function replaces the kurtosis term: while the latter is unbounded and dominated by a cubic term, the former is bounded and insensitive to outliers. The stochastic gradient version of equation (5.14) is, of course, a Bussgang-type equalizer. Therefore, this derivation also shows that a Bussgang algorithm performs blind deconvolution by maximizing the kurtosis.
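A minimal batch implementation of this update, assuming an LP residual that has already been normalized to unit variance (as in the experiments described below), could look as follows; the per-step filter normalization follows the suggestion in [87], and the function name is illustrative.

```python
import numpy as np

def ml_batch_update(x_res, w, mu=0.1):
    """One batch gradient-ascent step on E{log P(y)} with phi(y) = -tanh(y)
    (equations 5.9-5.14). x_res is the unit-variance LP residual of the
    reverberant signal; w is the current inverse filter estimate."""
    y = np.convolve(x_res, w)[:len(x_res)]        # filter the residual
    grad = np.zeros_like(w)
    for p in range(len(w)):
        # E{phi(y(n)) x(n-p)} estimated over the whole batch
        grad[p] = np.mean(-np.tanh(y[p:]) * x_res[:len(x_res) - p])
    w = w + mu * grad
    return w / np.linalg.norm(w)                  # normalization as in [87]
```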
A single-channel time-domain algorithm based on equation 5.14 was implemented and em-
ployed to equalize the IR of a real room. The proposed method was experimentally compared
to the performance of the single-channel time-domain kurtosis maximization algorithm pro-
posed in [87] and discussed in section 4.6.1. The two algorithms share the same structure,
shown in Fig.5.3, but have different cost functions.
• all the data reported in the experiments are at a sampling rate of 22050 Hz;
• the clean speech signal, s(n), is a 7.5 second anechoic male speech sample;
• the reverberated input file, x(n), has been created by convolving the clean speech signal,
s(n), with a 6000 tap IR, h(n), measured from a real room (T60 of about 400 ms);
• the algorithms produce as output the estimate of the clean speech signal, y(n), and calculate the convolution between the original IR, h(n), and the identified inverse IR, w(n), to evaluate the quality of the processing; this convolution provides the equalized IR, and by measuring its DRR, defined in section 3.3.2.1, the dereverberation performance can be quantified.
The following parameters and procedure have been used for the algorithms:
• LPC analysis order = 26. This value allows tracking of up to 13 spectral peaks, and thus offers sufficient complexity for speech modelling;
• LPC analysis window length = 25 ms. This allows speech to be considered stationary within the analysis window;
• FIR adaptive filter length = 1000 taps. Theoretically, an inverse filter w(n) with more taps than the length of the IR should be used. However, a shorter filter is still expected to reduce the intensity of the early reflections, and no substantial improvement was observed for longer filters, despite the increased computational cost;
• initial condition of the FIR adaptive filter w = [1, 0, ..., 0]. This allows the filter to start
the adaptation from a meaningful initial condition (the first tap set to one enables the
signal to flow unaffected at the beginning of the adaptation);
• the filter identified at the end of the adaptation has been used as the identified inverse
filter w(n);
• the adaptive filter, w(n), has been normalized at every adaptation step as suggested in
[87];
• to improve the speed and accuracy of convergence, both algorithms were implemented in batch mode (instead of using a “sample by sample” adaptation, the filter coefficients are updated only after the whole file has been processed; this provides a better estimate of the gradient, assuring a more stable convergence) and the LP residual, x̃(n), was normalized to unit variance. This frees the adaptive algorithms from following the envelope of the residual;
Figure 5.4: (a) Echogram of the original IR (reference), DRR = −2.9 dB. (b) Echogram of the equalized IR obtained by the single-channel time-domain kurtosis maximization algorithm as proposed by Gillespie and Malvar, DRR = −2.7 dB. (c) Echogram of the equalized IR obtained by the proposed single-channel time-domain ML algorithm based on equation (5.14), DRR = −1.1 dB.
• the adaptation parameter, µ, was set to 0.1 for the proposed method and to 7·10−4 for the algorithm proposed in [87]; this parameter was chosen heuristically in order to achieve convergence and assure stability (since these are batch algorithms, the choice of the adaptation parameter is less crucial than in an on-line algorithm);
• the algorithm was left free to adapt also during unvoiced or silent periods, as suggested in [87].
The echograms of the original and equalized IRs are shown in Fig. 5.4. As explained in section 2.6.1, the echogram simplifies the IR analysis and highlights the attenuation of the most energetic reflections. The proposed method shows, as reported in Fig. 5.5, better dereverberation performance and faster convergence with respect to the original time domain kurtosis maximization algorithm introduced in [87].
As in [87], the previous algorithm was extended in [173] to a multichannel structure. The advantages of a MINT-like structure were expected, namely:
• a solution to the instability issue associated with the inversion of non-minimum phase transfer functions, by ensuring that the equalizer will be a set of FIR filters if the channel transfer functions are FIR;
• shorter filters for each channel with respect to the single-channel case, and therefore better statistical properties in their estimation, lower computational demand and lower memory requirements.
It must be noticed that [87] contains no discussion of the motivation behind this extension: the multichannel structure is simply justified as the direct extension of the single-channel system.
The resulting multichannel update rule is
$$\mathbf{w}_i^{\text{new}} = \mathbf{w}_i + \mu\,E\left\{\phi(\tilde y_m)\,\tilde{\mathbf{x}}_i\right\} \qquad (5.15)$$
where $\tilde{\mathbf{x}}_i = [\tilde x_i(n), \tilde x_i(n-1), \ldots, \tilde x_i(n-p)]^T$ is the output vector of the $i$-th LP analysis filter and $\tilde y_m$ is defined as
$$\tilde y_m = \frac{1}{N}\sum_{i=1}^{N} \tilde y_i(n) \qquad (5.16)$$
where ỹi (n) is the output of the i-th maximization filter at time n.
The filters are jointly optimized to maximize the likelihood of the output $\tilde y_m$, and the dereverberator output $y(n)$ is defined as
$$\hat s(n) = y(n) = \sum_{i=1}^{N} w_i(n) * x_i(n) \qquad (5.17)$$
where $x_i(n)$ and $w_i(n)$ are respectively the $i$-th observation and the equalizer of the corresponding source-to-receiver channel.
Extending our results from the single-channel ML technique, we expect the multichannel structure to benefit from improved stability and less noise in the convergence compared to the kurtosis approach. The ambiguity introduced by the use of the LP residual to decouple the harmonic structure of speech and reverberation has been mitigated by spatially averaging the LP analysis coefficients, as proposed in [117].
To highlight the affinity of the above algorithm with the MINT, a two-channel system has been used to equalize a speech signal sampled at 22050 Hz that had been processed with the following two long echoes:
$$\begin{aligned} h_1(n) &= \delta(n) - 0.8\,\delta(n-600)\\ h_2(n) &= \delta(n) - 0.8\,\delta(n-1000). \end{aligned} \qquad (5.18)$$
The inverse filters have been calculated with the MINT inverse formula given in [9], using the impulse responses of the system explicitly, and this result has been compared to the filters blindly identified by the multichannel structure. The second approach directly estimates the inverse filters, which is statistically better than estimating the impulse responses of the system and then inverting them. Note that the lengths of the inverse filters are comparable to the length of the longest delay present; this is in contrast to the single-channel case, which would require a much longer filter. The results are shown in Fig. 5.7, where the three leftmost plots relate to the MINT method while the three rightmost plots show the algorithm performance. The inverse filters have similar placement of the taps but different gains; both sets of inverse filters equalize the system (Figures 5.7c, 5.7f). Since we do not enforce that the FIR order be the minimum required, this solution is non-unique, and it only needs to satisfy the Bezout identity.
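As a cross-check, the reference inverse filter pair for the two echoes of equation (5.18) can also be computed by direct least squares on the Bezout identity; the sketch below is an illustration of that identity, not the closed MINT formula of [9].

```python
import numpy as np

def conv_matrix(h, L):
    # (len(h)+L-1) x L convolution matrix: columns are shifted copies of h
    H = np.zeros((len(h) + L - 1, L))
    for c in range(L):
        H[c:c + len(h), c] = h
    return H

# the two long echoes of equation (5.18), zero-padded to a common length
h1 = np.zeros(1001); h1[0], h1[600] = 1.0, -0.8
h2 = np.zeros(1001); h2[0], h2[1000] = 1.0, -0.8

L = 1001                                  # inverse filter length per channel
H = np.hstack([conv_matrix(h1, L), conv_matrix(h2, L)])
d = np.zeros(H.shape[0]); d[0] = 1.0      # Bezout target: a unit impulse
w, *_ = np.linalg.lstsq(H, d, rcond=None)
w1, w2 = w[:L], w[L:]

equalized = np.convolve(h1, w1) + np.convolve(h2, w2)
print(np.max(np.abs(equalized - d)))      # near zero if equalization succeeded
```

Since the two channels share no common zeros, the stacked system is solvable and np.linalg.lstsq returns a (minimum-norm) exact equalizer, illustrating the non-uniqueness noted above.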
An eight-channel system that uses a one-point sample mean version of the adaptation rule in
(5.15) was also evaluated. A linear array composed of eight omni-directional microphones
was employed in a room with a T60 of about 400 ms to measure the impulse responses of
the corresponding source-to-receiver acoustic paths. To acquire these impulse responses, the
technique reported in [44] was applied. A 4 cm spacing between microphones was chosen for
the first experiment and 45 cm for the second, and to simulate a generic setup, the array was
not placed orthogonal to the loudspeaker. The minimum microphone-to-loudspeaker distance was 2.5 m. The impulse responses measured for the 4 cm array configuration did not exhibit
significant delay misalignment among the channels, while the impulse responses for the 45 cm
array were not aligned. The algorithm was applied to male and female speech files sampled
Figure 5.7: Comparison of the MINT and multi-channel equalizer. (a), (b) Inverse filters cal-
culated by MINT. (d), (e) Inverse filters calculated by the multi-channel algorithm.
(c), (f) Equation (4.25) evaluated in both cases.
at 22050 Hz and convolved with the resulting impulse responses (6000 tap long). The results
reported here are for the same speech file used in the experiment described in section 5.1.1.
The following parameters and initializations were used for the algorithm. µ was set to 10−5 , the
LP analysis order to 26 with an LP analysis frame length of 25 ms. The filters were T = 1000
taps long and initialized to wi = [1, 0, 0, 0, 0, ...]. This allows the filters to start the adaptation
from a meaningful initial condition (the first tap set to one enables the signal to flow unaffected
at the beginning of the adaptation). To obtain a more uniform convergence, the residual of each
channel was normalized to a zero mean, unit variance process. The algorithm was left free to
adapt also during unvoiced or silent periods as suggested in [87].
To solve the problem of gain uncertainty, a normalization of the filter coefficients was performed at every update cycle [87]. It should be noted that normalizing all the channels makes the problem over-constrained if we wish to take advantage of the Bezout inverse solution.
The algorithm was found to provide dereverberation in the case of the 4 cm spacing, but not in
the 45 cm configuration, due to the time-misalignment among the channels. After investigation,
it was understood that a meaningful convergence cannot be achieved for this algorithm when
the channels are time-misaligned. Therefore the algorithm cannot identify the inter-channel
delay. Note that this problem equally applies to the use of kurtosis within this framework [87].
Conversely, when the impulse responses were aligned manually, the algorithm converged for
Figure 5.8: (a) Reference echogram relating to the shortest source-to-receiver path, DRR = −2.9 dB. (b) Echogram of the equalized IR obtained by the 8-channel delay-sum beamformer, DRR = −0.1 dB. (c) Echogram of the equalized IR obtained by the 8-channel ML dereverberator, DRR = 2.3 dB. The ML multichannel structure provides improved dereverberation with respect to the delay and sum beamformer.
the 45 cm setup, and provided dereverberation, although its operation was no longer blind. Figure 5.8(a) shows the echogram of the original impulse response relating to the shortest source-to-receiver path. Figure 5.8(c) shows the equalized impulse response obtained with (5.15), in the case of a one-point sample mean.
The process of alignment is the same as required for a delay and sum beamformer. In this sense it is worth observing that the algorithms proposed here and in [87] require a preprocessing stage. A large amount of dereverberation is already achieved by the delay and sum beamformer, which however does not produce a consistent attenuation of the isolated early reflections. Figure 5.8(b) shows the echogram of a delay and sum beamformer using the same delays used to align the channels in the proposed method. The performance of the dereverberation algorithms reported in Fig. 5.8 has been evaluated by the DRR defined in section 3.3.2.1.
The performance of the algorithm has been validated by repeating the previous experiment with the same parameters but different speech signals of similar length (between 9 and 10 seconds), from different speakers and in different languages. The results obtained by using the filters calculated after a single run of the algorithm (i.e. after 9-10 seconds of adaptation) are reported in Table 5.1.
The average DRR improvement with respect to a delay and sum beamformer (which provides 2.8 dB of improvement) is 2.3 dB. Therefore the total average improvement in the DRR due to the dereverberation algorithm is 5.1 dB.
Table 5.1: DRR improvement for different speech signals (relative to the delay and sum beamformer).

Signal                  DRR (dB)
Male 1 (Italian)        1.9
Male 2 (Portuguese)     2.0
Male 3 (English)        2.2
Male 4 (Italian)        2.5
Female 1 (Portuguese)   2.3
Female 2 (Russian)      2.1
Female 3 (Hebrew)       3.0
Female 4 (Dutch)        2.8
The previously proposed algorithms rely implicitly on a Bussgang-type algorithm. The Bussgang algorithm, in fact, can be treated as a constrained maximum likelihood problem for an i.i.d. sequence y(n), where the usual normalization term is neglected. Hence a constraint, such as the previously proposed normalization scheme, must be provided to keep the gain fixed.
The gradient of the cost function with respect to each coefficient is
$$\frac{\partial J}{\partial w(p)} = -f(y(n))\,x(n-p) \qquad (5.20)$$
where $f(y) = \frac{\partial}{\partial y}\log p(y)$ is the Bussgang non-linearity. This gives the Bussgang algorithm:
$$w(p) \leftarrow w(p) - \mu\,\frac{\partial J}{\partial w(p)} = w(p) + \mu\,f(y(n))\,x(n-p). \qquad (5.21)$$
This approach is appealing for its simplicity, but it may be slow to converge (a one-point sample mean is therefore used, as in the LMS algorithm).
Figure 5.9: Comparison of the adaptation behavior of the NGA and of the time domain kurtosis maximization algorithm proposed by Gillespie and Malvar, when applied to a supergaussian input filtered by a single echo with a delay of 50 samples and a reflection gain of −1. Equalizer order P = 100. Perfect equalization is not achieved since a truncated equalizer is used.
The usual natural/relative gradient form of the Bussgang algorithm update is [134]:
$$y(n) = \sum_{p=0}^{P} w(p)\,e(n-p) \qquad (5.22)$$
$$u(n) = \sum_{m=0}^{P} w(P-m)\,y(n-m) \qquad (5.23)$$
$$w_{\text{new}}(p) = w(p) + \mu\left[w(p) + f(y(n-P))\,u(n-p)\right] \qquad (5.24)$$
where $P$ is the equalizer order and $f(\cdot)$ the Bussgang nonlinearity (the $-\tanh(\cdot)$ or the $-\operatorname{sgn}(\cdot)$ functions will be assumed).
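A direct sample-by-sample transcription of equations (5.22)-(5.24) is sketched below; the step size and the one-point sample mean are illustrative choices, not the tuned values used later in the experiments.

```python
import numpy as np

def nga_pass(e, w, mu=1e-4, f=lambda v: -np.tanh(v)):
    """One pass of the natural-gradient Bussgang update (5.22)-(5.24)
    over a pre-whitened signal e(n)."""
    P = len(w) - 1
    y = np.zeros(len(e))
    u = np.zeros(len(e))
    for n in range(len(e)):
        k = min(n, P) + 1
        # (5.22): equalizer output y(n) = sum_p w(p) e(n-p)
        y[n] = np.dot(w[:k], e[n - np.arange(k)])
        # (5.23): u(n) = sum_m w(P-m) y(n-m)
        u[n] = np.dot(w[::-1][:k], y[n - np.arange(k)])
        if n >= 2 * P:
            # (5.24): update driven by the delayed output y(n-P)
            w = w + mu * (w + f(y[n - P]) * u[n - np.arange(P + 1)])
    return w, y
```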
This algorithm will be referred to, for simplicity, as the Natural Gradient Algorithm (NGA).
Figure 5.10: (a) Diagram of the time domain dereverberation algorithm proposed by Gillespie
and Malvar (forward structure). (b) Diagram of the proposed model (reversed
structure).
A comparison of the performance of the time domain kurtosis maximization algorithm proposed in [87] and of the NGA is provided in Fig. 5.9, where a reverberated supergaussian white signal is equalized. The two algorithms are evaluated by the direct-to-reverberant ratio. The NGA is the better candidate for time domain adaptive equalization schemes.
The application of such equalization schemes to blind dereverberation using the structure in Fig. 5.10(a), called from now on the “forward structure”, is relatively easy and consists of LPC pre-whitening followed by the equalization algorithm reported in equations (5.22), (5.23) and (5.24). However, we will show that better results can be achieved by using the structure shown in Figure 5.10(b), called from now on the “reversed structure”.
5.4.1 On the correct structure for a single channel dereverberator. The reversed
structure.
The order of two linear filters can be swapped only if they are time invariant. Since the vocal
tract filter is not stationary, the forward and the reversed structure will lead to different results.
The residual e(n) calculated by the forward structure is, as shown in Fig. 5.11(a), a function of multiple quasi-stationary blocks, although it is whitened by only a single LPC filter. This is particularly problematic in the dereverberation setting, since the duration over which speech is quasi-stationary is usually significantly less than the room reverberation time. By performing the dereverberation before the LP analysis (reversed structure), the modelled residual e(n) is, as shown in Fig. 5.11(b), only a function of one quasi-stationary block.

Figure 5.11: (a) In the forward structure, the residual is a function of multiple LPC filters. (b) In the reversed structure, the residual is a function of one LPC filter.

In more detail, if the LP
pre-whitening is performed before the dereverberation (forward structure), the residual e(n) is:
$$e(n) = \sum_p w(p)\,y(n-p) \qquad (5.25)$$
and
$$y(n) = \sum_l a_l(n)\,x(n-l) \qquad (5.26)$$
and therefore
$$e(n) = \sum_p w(p)\sum_l a_l(n-p)\,x(n-l-p). \qquad (5.27)$$
In contrast, using the reversed model, which performs the LP pre-whitening after the dereverberation, the residual $e(n)$ is a function of a single LPC filter:
$$e(n) = \sum_l a_l(n)\,y(n-l) \qquad (5.28)$$
and
$$y(n) = \sum_p w(p)\,x(n-p). \qquad (5.29)$$
Therefore the correct relationship between the LP residual $e(n)$ and $x(n)$ is:
$$e(n) = \sum_l a_l(n)\,y(n-l) = \sum_l a_l(n)\sum_p w(p)\,x(n-l-p) \qquad (5.30)$$
where $a_l(n)$ is the $l$-th time-varying LPC filter coefficient at time $n$. Note that $e(n)$ is only a function of the LPC coefficients associated with time $n$.
This observation led to a new algorithm based on the natural gradient that exploits the reversed
structure.
For the moment it is assumed that $a_l(n)$ is known for all $n$ and $l$. In the resulting log-likelihood cost function, the expression $-\log\|\partial e/\partial y\|$ has been replaced by a constant, since it does not depend on $w$ and so vanishes when differentiated. The gradient of the cost function is then
$$\frac{\partial J}{\partial w(p)} = -h(-p) - f(e(n))\,\frac{\partial e(n)}{\partial w(p)} = -h(-p) - f(e(n))\sum_l a_l(n)\,x(n-p-l) \qquad (5.32)$$
where $h(p)$ denotes the impulse response of the inverse of $w$ (i.e. $h * w = \delta_0$) and $f(\cdot)$ is the Bussgang nonlinearity.
It can be shown, as reported in the derivation of equation (A.33) in the appendix, that an equivariant, normalized form of the previous equation is:
$$w_{\text{new}}(p) = w(p) + \mu\left[w(p) + f(e(n))\sum_l a_l(n)\,y(n-p+m-l)\right]. \qquad (5.34)$$
Finally, the causality problem can be addressed as in [136]. First, the dereverberation filter is constrained to be a causal FIR:
$$y(n) = \sum_{p=0}^{P} w(p)\,x(n-p) \qquad (5.35)$$
$$u(n) = \sum_{m=0}^{P} w(P-m)\,y(n-m). \qquad (5.37)$$
The simplified update rule for the reversed structure (equation (A.34) in the appendix) is:
$$w_{\text{new}}(p) = w(p) + \mu\left[w(p) + f(e(n-P))\sum_{l=0}^{L} a_l(n-P)\,u(n-p-l)\right]. \qquad (5.38)$$
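The reversed-structure recursion of equations (5.35), (5.37) and (5.38) can be sketched as follows; the array a of time-varying LPC (whitening) coefficients is assumed given, as in the text, and the start-up index handling is an illustrative simplification.

```python
import numpy as np

def reversed_pass(x, a, w, mu=1e-5, f=lambda v: -np.sign(v)):
    """One pass of the reversed-structure recursion (5.35), (5.37), (5.38).

    x : reverberant signal
    a : (len(x), L+1) array; a[n] holds the whitening (LPC) filter taps
        valid at time n, with a[n, 0] = 1
    w : initial dereverberation filter of length P+1
    """
    P, L = len(w) - 1, a.shape[1] - 1
    y = np.zeros(len(x)); u = np.zeros(len(x)); e = np.zeros(len(x))
    for n in range(len(x)):
        k = min(n, P) + 1
        y[n] = np.dot(w[:k], x[n - np.arange(k)])          # (5.35)
        u[n] = np.dot(w[::-1][:k], y[n - np.arange(k)])    # (5.37)
        kl = min(n, L) + 1
        e[n] = np.dot(a[n, :kl], y[n - np.arange(kl)])     # residual, eq. (5.28)
        if n >= 2 * P + L:
            # (5.38): equivariant update driven by the delayed residual
            corr = np.array([np.dot(a[n - P], u[n - p - np.arange(L + 1)])
                             for p in range(P + 1)])
            w = w + mu * (w + f(e[n - P]) * corr)
    return w, y, e
```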
It is of interest to compare (5.35), (5.37) and (5.38) with the update equations for the forward structure, (5.22), (5.23) and (5.24). The update equations for the reversed structure are more complex, as an additional step is required to calculate the last term in (5.38). In the forward structure, however, this additional filtering must be calculated outside the adaptation loop to obtain the speech residual. Furthermore, the reversed structure makes replicating the dereverberation filter unnecessary, since it acts directly on the reverberated signal.
In this section we will show that the reversed structure, by respecting the correct ordering of the stationary (room response) and non-stationary (LPC) filters, can achieve better dereverberation performance than the forward structure. In order to focus only on the change in structure, a toy non-blind example is used in the first experiment. In the second, the proposed algorithm is applied to a speech signal convolved with an impulse response measured from a real room. The algorithms, as before, were evaluated by the DRR criterion. The sampling frequency for all the signals was 22.05 kHz. The $f(\cdot)$ nonlinear function in (5.38) was set to $-\operatorname{sgn}(\cdot)$.
A unit-variance, supergaussian (chosen to mimic the distribution of the speech residual), white noise was filtered by alternating every 25 ms two static FIR filters with impulse responses $g_1 = [1, 0.5, 0.5, 0.5, 0.5]$ and $g_2 = [1, 0, 0, 0, 0]$. The resulting signal, $s(n)$, was then reverberated by the single-echo impulse response $h(n) = \delta(n) - \delta(n-275)$. To minimize amplitude fluctuations, $g_1$ and $g_2$ were chosen to have stable inverses and similar output power.
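The toy source signal can be generated along the following lines; the Laplacian excitation and the block-wise convolution truncation are illustrative simplifications of the block-stationary source model.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 22050
block = int(0.025 * fs)                  # 25 ms blocks
g1 = np.array([1.0, 0.5, 0.5, 0.5, 0.5])
g2 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# supergaussian (Laplacian) white excitation with unit variance
exc = rng.laplace(scale=1 / np.sqrt(2), size=5 * fs)

# alternate the two source filters every block (block memory is truncated,
# a crude simplification)
s = np.zeros_like(exc)
for b in range(len(exc) // block):
    g = g1 if b % 2 == 0 else g2
    seg = slice(b * block, (b + 1) * block)
    s[seg] = np.convolve(exc[seg], g)[:block]

# single-echo reverberation: h(n) = delta(n) - delta(n - 275)
x = s.copy()
x[275:] -= s[:-275]
```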
The algorithms for the forward and the reversed structure, with adaptation parameter $\mu = 10^{-5}$ and given LPC coefficients, were used to dereverberate $s(n)$. Figures 5.12(a1) and 5.12(a2) show the dramatic difference in their dereverberation performance: while the reversed structure can correctly dereverberate the input signal, the forward structure cannot. If the same experiment is repeated with $g_2 = g_1$, that is, with a time-invariant source filter, no appreciable difference in the results is observed, as shown in Fig. 5.12(b1) and Fig. 5.12(b2). What we see in the first case is therefore due solely to the time variance of the source filter.
Figure 5.12: Results of the toy non-blind example described in section 5.4.3.1. Dereverbera-
tion performance with a time variant source filter: (a1) forward structure; (a2)
reversed structure. While the reversed structure can correctly dereverberate the
input signal, this does not happen for the forward structure. Dereverberation
performance with a time invariant source filter: (b1) forward structure; (b2) re-
versed structure. There is no appreciable difference in the performance of the two
structures.
Figure 5.13: Echogram of the original impulse response (above), DRR = −2.9 dB, and of the IR equalized with the reversed structure (below), DRR = −0.7 dB.
The experiment reported in section 5.1.1 was repeated using the algorithm proposed in (5.38), with the same settings (except $\mu = 0.5\cdot10^{-5}$) and data. In order to achieve better convergence, it is important to prevent the algorithm from tracking the speech amplitude variations. This can be obtained by dividing the LPC coefficients and the estimated residual by the residual's standard deviation, and then using this normalized data in (5.38).
As in section 5.1.1, a 1000-tap dereverberation filter was used. As previously observed, an infinitely long filter would theoretically be required; therefore, only the first portion of the room impulse response is affected. The impulse response of the equalized system, reported in Fig. 5.13, shows a good attenuation of the first 1000 taps. The global DRR improvement for the whole impulse response is 2.3 dB.
The forward structure has been tested with the same settings. The DRR improvement for the whole impulse response is 1.8 dB. This last result is in agreement with the one obtained by the batch maximum likelihood algorithm proposed in section 5.1.1.
Compared to the previous toy example, this experiment based on speech signals highlights a
less evident difference in the performance of the forward and of the reversed structure. This
was partially expected since the physics of the vocal tract imposes the similarity of the filters
belonging to adjacent blocks. In other words, the LPC coefficients change smoothly from one
block to another. It is however confirmed that the reversed structure can better cope with the
time variance of the source filter. Furthermore, it makes replicating the dereverberation filter unnecessary.
It is interesting to notice that the on-line algorithm based on the reversed structure also provides better dereverberation with respect to the single-channel batch algorithms based on standard kurtosis and maximum likelihood, discussed in section 5.2.2 and shown in Fig. 5.4.
At the current state, we were unable to obtain results when longer dereverberation filters were tested: in this case, both the forward and the reversed algorithms tend to be unstable. This can be related to the misconvergence problem investigated in a paper by Douglas et al. [176], where a more robust version of the NGA is reported. Modifying the proposed algorithm to include this improvement seems an interesting direction for future research.
As in the forward case, the advantages of a MINT-like multichannel structure are expected:
• a solution to the instability issue associated with the inversion of non-minimum phase transfer functions, by ensuring that the equalizer will be a set of FIR filters if the channel transfer functions are FIR;
• shorter filters with respect to the single-channel case, and therefore better statistical properties in their estimation, lower computational demand and lower memory requirements.
The advantages of the reversed algorithm become even more evident when a multichannel structure is employed. In fact, while the multichannel structure based on the forward algorithm requires the calculation of the LP residual for each channel, as shown in Fig. 5.6, the multichannel reversed structure, as shown in Fig. 5.14, requires only a single LP residual calculation and no replication of the dereverberation filters.
e(n) = \sum_l a_l(n)\, y(n-l) \qquad (5.39)

where $a_l(n)$ is the $l$-th time-varying LPC filter coefficient at time $n$ and $y(n)$ is the dereverberator output, defined as

\hat{s}(n) = y(n) = \frac{1}{N} \sum_{i=1}^{N} y_i(n) \qquad (5.40)

with $y_i(n) = \sum_p w_i(p)\, x_i(n-p)$, where $x_i(n)$ and $w_i(n)$ are respectively the $i$-th observation and the equalizer of the corresponding source-to-receiver channel. The $p$-th coefficient of the $i$-th channel maximization filter, $w_i$, is given by

w_i^{new}(p) = w_i(p) + \mu \left[ w_i(p) + f(e(n-P)) \sum_{l=0}^{L} a_l(n-P)\, u_i(n-p-l) \right] \qquad (5.42)

and

u_i(n) = \sum_{m=0}^{P} w_i(P-m)\, y_i(n-m). \qquad (5.43)
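A per-sample sketch of the update (5.39)–(5.43) follows, for $n$ large enough that all indices are valid. The nonlinearity f, the skipping of start-up indices and the assumption that the time-varying LPC coefficients are already known are ours, not choices made in the text:

import numpy as np

def reversed_multichannel_step(w, x, a, n, mu, f=lambda v: -np.tanh(v)):
    # One sample of the multichannel reversed-structure update (5.39)-(5.43).
    #   w : (N, P+1) equalizer taps, one row per channel
    #   x : (N, T)   microphone observations
    #   a : (T, L+1) time-varying LPC coefficients, assumed known
    N, P1 = w.shape
    P, L = P1 - 1, a.shape[1] - 1

    def y_i(i, m):  # per-channel equalizer output: sum_p w_i(p) x_i(m-p)
        return sum(w[i, p] * x[i, m - p] for p in range(P1) if m - p >= 0)

    def y(m):       # dereverberator output (5.40): average over channels
        return np.mean([y_i(i, m) for i in range(N)])

    # prediction error (5.39), evaluated with the delay P used in (5.42)
    e = sum(a[n - P, l] * y(n - P - l) for l in range(L + 1))

    for i in range(N):
        def u_i(m):  # filtered regressor (5.43)
            return sum(w[i, P - q] * y_i(i, m - q) for q in range(P1))
        for p in range(P1):
            grad = sum(a[n - P, l] * u_i(n - p - l) for l in range(L + 1))
            w[i, p] += mu * (w[i, p] + f(e) * grad)  # update (5.42)
    return w, e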
Figure 5.15: (a) Reference echogram relating to the shortest source-to-receiver path, DRR = −2.9 dB. (b) Echogram of the equalized IR obtained by the 8-channel delay-sum beamformer, DRR = −0.1 dB. (c) Echogram of the equalized IR obtained by the proposed 8-channel dereverberator, DRR = 3.1 dB. The proposed structure provides improved dereverberation with respect to the delay and sum beamformer.
Fig. 5.15(a) shows the echogram of the original impulse response relating to the shortest source-to-receiver path. Figure 5.15(c) shows the equalized impulse response. A large amount of dereverberation is already achieved by the delay and sum beamformer, Figure 5.15(b), which however does not produce a substantial attenuation of the isolated early reflections. As in section 5.2.2, an improved estimation of the LP analysis coefficients can be obtained by averaging the LP coefficients evaluated from all the channels. This obviously makes the algorithm slightly less efficient.
5.5.1 Results
The experiment reported in section 5.2.2 was repeated, using the same settings (except $\mu = 0.5 \cdot 10^{-5}$) and data. As can be observed in Fig. 5.15, the proposed algorithm outperforms the results reported in section 5.2.2 and shown in Fig. 5.8 by 0.8 dB.
The performance of the algorithm has been validated by repeating the previous experiment with the same parameters but different speech signals of similar length (between 9 and 10 seconds), from different speakers and in different languages. The results obtained by using the filters calculated after a single run of the algorithm (therefore after 9–10 seconds of adaptation) are reported in Table 5.2. The average DRR improvement with respect to a delay and sum beamformer (which already provides 2.8 dB of improvement) is 2.8 dB; the total average improvement in the DRR is therefore 5.6 dB. The proposed algorithm outperforms the results reported in section 5.2.2 by 0.5 dB on average.

Table 5.2: DRR improvement for different speakers and languages.

    Signal                   DRR (dB)
    Male 1 (Italian)             2.2
    Male 2 (Portuguese)          2.9
    Male 3 (English)             2.6
    Male 4 (Italian)             3.1
    Female 1 (Portuguese)        2.9
    Female 2 (Russian)           2.3
    Female 3 (Hebrew)            3.2
    Female 4 (Dutch)             2.9
• the filters calculated after a single run of the algorithm (therefore after 9–10 seconds of adaptation) were used as the dereverberation filter.
Also in this case, as shown in Fig. 5.16, the proposed algorithm provides better dereverberation with respect to a delay and sum beamformer.
Figure 5.16: (a) Reference echogram relating to the shortest source-to-receiver path, DRR = −12.9 dB. (b) Echogram of the equalized IR obtained by the 12-channel delay-sum beamformer, DRR = −4.9 dB. (c) Echogram of the equalized IR obtained by the 12-channel dereverberator (female 3 speaker), DRR = −0.9 dB. (d) Echogram of the equalized IR obtained by the 12-channel dereverberator (male 1 speaker), DRR = −2.9 dB. The proposed algorithm provides better dereverberation with respect to the delay and sum beamformer.
Chapter 6
Conclusions and Further Research
6.1 Conclusions
The aim of this work was to investigate the problem of blind speech dereverberation, and in particular the methods based on the explicit estimate of the inverse acoustic system (reverberation cancellation methods).
• In chapter 2, some fundamental prerequisites on room acoustics and modeling were provided. A relevant observation about the “image method for efficiently simulating small-room acoustics” by J. B. Allen and D. A. Berkley was proposed in section 2.7.2.
The dereverberation problem is still an open issue. This work proposes a partial solution in an idealized framework where, in the absence of noise, the impulse responses of the source-to-receiver paths are time-invariant. In this idealized case, the achieved dereverberation is substantial. The interest in the approaches based on explicit channel inversion was driven by the
fact that, in theory, these methods can offer perfect dereverberation. In practice, even in the described idealized framework, this does not happen, and the improvement, even if perceptually noticeable, is limited.
• Another limiting factor is the performance of the blind deconvolution algorithm em-
ployed, both in terms of accuracy and speed of convergence, especially when the input
signal fed to the algorithm is not perfectly white.
• Other issues come from all the systematic errors that might be present in the proposed dereverberation system, such as the NGA misconvergence described in [176]. Of course these can be addressed and hopefully solved.
Departing from the idealized framework of a noiseless, time-invariant acoustic system, two problems are central and still need to be addressed:
• The speed of convergence of the algorithm must suffice to track the acoustic system
variations.
• The sensitivity of the algorithm to noise might prevent it from working in real conditions.
1. The problem of blind time alignment between channels: a robust algorithm that initializes the inter-channel delays in the multichannel structure must be provided. This is necessary since the proposed algorithm is unable to identify the delays present among the channels. This would allow tests in a fully blind scenario.
2. The behavior in a time-varying environment: even if the proposed algorithm provides fast adaptation (convergence usually happens within seconds), all the simulations were performed with steady impulse responses. Since it is well known [6] that even a small perturbation in the source/receiver placement can cause a dramatic loss in dereverberation performance, the algorithm should be evaluated in more realistic conditions. A possible test, without running into the “unambiguous deconvolution problem”, would be to use supergaussian white noise as the source signal (a sketch of such a source is given after this list).
3. The problem of the sensitivity of the NGA algorithm [176] must be analyzed, and the algorithm should be made robust to misconvergence. This would probably provide better dereverberation.
4. The evaluation of the performance in noisy and multiple-speaker environments. This is particularly important since the noise might prevent the algorithm from working. The multiple-speaker problem might be naively addressed by constraining the orientation of the beamformer to a predefined direction.
5. The extension to a wider class of signals beyond speech (e.g. music and all signals containing sustained harmonic components). This would widen the possible applications.
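As a rough illustration of the test source suggested in item 2, supergaussian white noise can be generated, for instance, from a Laplacian distribution; the choice of distribution, sampling rate and duration below are our assumptions, not prescriptions from this work:

import numpy as np

rng = np.random.default_rng(0)
fs, duration = 16000, 10.0                       # illustrative values
src = rng.laplace(0.0, 1.0, int(fs * duration))  # supergaussian, white
src /= np.max(np.abs(src))                       # unit peak amplitude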
Appendix A
Relative/Natural Gradient
De-reverberation Algorithm
The Bussgang algorithm can be treated as a constrained maximum likelihood problem for an i.i.d. sequence y(n), where the usual normalization term is neglected; hence the scale must in some way be constrained to be fixed. Neglecting this term, the instantaneous cost is

J = -\log p(y(n)), \qquad y(n) = \sum_p w(p)\, x(n-p) \qquad (A.1)
Differentiating gives:
\frac{\partial J}{\partial w(p)} = -f(y(n))\, x(n-p) \qquad (A.2)
where $f(y) = \frac{\partial}{\partial y} \log p(y)$ is the Bussgang nonlinearity. This gives the Bussgang algorithm:

w(p) = w(p) - \mu \frac{\partial J}{\partial w(p)} = w(p) + \mu f(y(n))\, x(n-p) \qquad (A.3)
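As an illustration, a minimal Python sketch of one stochastic step of (A.3); the nonlinearity below, f(y) = −tanh(y), is a common smooth stand-in for ∂/∂y log p(y) under a supergaussian source model, and is our assumption rather than a choice made in this work:

import numpy as np

def bussgang_step(w, x, n, mu, f=lambda v: -np.tanh(v)):
    # One step of the Bussgang update (A.3).
    #   w : (P+1,) deconvolution filter
    #   x : observed signal, with n >= len(w) - 1
    xs = x[n - len(w) + 1:n + 1][::-1]  # x(n), x(n-1), ..., x(n-P)
    y_n = float(np.dot(w, xs))          # y(n) = sum_p w(p) x(n-p)
    w += mu * f(y_n) * xs               # w(p) += mu f(y(n)) x(n-p)
    return w, y_n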
The update above is not equivariant. Equivariance means that if the coordinates are changed, but everything else is kept the same, the updates are equivalent.
For example, suppose $a(n) = g * x(n)$ is chosen as the observation (simply by filtering the observed data with some filter $g$). If $w$ is believed to be the deconvolution filter for $x(n)$, then $\tilde{w} = w * g^{-1}$ should naturally be the deconvolution filter for $a(n)$. Furthermore, one would hope that an update of $w$ from the Bussgang algorithm would be
consistent with the update of $\tilde{w}$ when starting from $\tilde{w}$ and $a(n)$. However, this is not the case; showing this leads to the relative gradient derivation.
\frac{\partial J}{\partial \tilde{w}(p)} = -f(y(n))\, a(n-p) \qquad (A.4)
So what is the equivalent new w that is obtained by updating w̃ (using the relationship above)?
w(p) = w(p) - \mu \left[\frac{\partial J}{\partial \tilde{w}}\right](p) * g
     = w(p) - \mu \sum_m g(m)\, \frac{\partial J}{\partial \tilde{w}(p-m)}
     = w(p) + \mu \sum_m g(m)\, f(y(n))\, a(n-p+m) \qquad (A.6)
Note that $y(n)$ is unchanged. $\partial J/\partial \tilde{w}(p)$ is a function of $p$, and it is this function that is being convolved with $g$.
or, more simply:

w(p) = w(p) + \mu f(y(n))\, \big(g(n) * g(-n) * x\big)(n-p) \qquad (A.8)

where the index in $g(n)$ has been included to indicate that one of the convolutions is time-reversed.
So from (A.8) it can be observed that the gradient update depends upon the coordinates chosen for the (same) data.
Figure A.1: (a) Diagram of the time-domain de-reverberation algorithm proposed by Gillespie and Malvar (forward structure). (b) Diagram of the proposed model (reversed structure).
This argument can now be applied to the update associated with speech dereverberation. In the first instance, no attention will be paid to whether the update is physically realizable (i.e. causal). First, applying the NGA to the “forward structure”, reported in figure A.1(a) and introduced by Gillespie et al. in [87], is easy. The above update (preferably with a normalizing term included) is applied to the LPC residual of the speech.
Now what if the correct model is used, that is, the reversed structure shown in figure A.1(b)? As before, the reverberated residual is modeled as an i.i.d. signal. Let us define:

e(n), the clean residual: $e(n) = \sum_l a_l(n)\, y(n-l)$, where $a_l(n)$ are the time-varying LPC filter coefficients;

y(n), the clean speech signal: $y(n) = \sum_p w(p)\, x(n-p)$, where $w(p)$ is the de-reverberation filter,
where the correct filtering relationship between $e(n)$ and $x(n)$ is:

e(n) = \sum_l a_l(n)\, y(n-l) = \sum_l a_l(n) \sum_p w(p)\, x(n-l-p) \qquad (A.11)
It will be assumed that, somehow, the $a_l(n)$ are known for all $n$ and $l$. The gradient update is then

w(p) = w(p) - \mu \frac{\partial J}{\partial w(p)} = w(p) + \mu f(e(n))\, \frac{\partial e(n)}{\partial w(p)} \qquad (A.12)

and, since $\partial e(n)/\partial w(p) = \sum_l a_l(n)\, x(n-p-l)$,

w(p) = w(p) + \mu f(e(n)) \sum_l a_l(n)\, x(n-p-l) \qquad (A.13)
Following the previous section, let us ask what the equivalent update for $w$ would be if one had started with $r(n) = \sum_m g(m)\, x(n-m)$ and $\tilde{w}$. Note, again, that the relationship $w(p) = \sum_m g(m)\, \tilde{w}(p-m)$ holds. This gives:
w(p) = w(p) - \mu \sum_m g(m)\, \frac{\partial J}{\partial \tilde{w}(p-m)}
     = w(p) + \mu \sum_m g(m)\, f(e(n))\, \frac{\partial e(n)}{\partial \tilde{w}(p-m)}
     = w(p) + \mu \sum_m g(m)\, f(e(n)) \left( \sum_l a_l(n)\, r(n-p+m-l) \right) \qquad (A.14)
     = w(p) + \mu \sum_m g(m)\, f(e(n)) \left( \sum_l a_l(n) \sum_q g(q)\, x(n-p+m-l-q) \right)
y(n) = \sum_{p=0}^{P} w(p)\, x(n-p) \qquad (A.16)

u(n) = \sum_{m=0}^{P} w(P-m)\, y(n-m) \qquad (A.18)

w(p) = w(p) + \mu f(e(n-P)) \sum_{l=0}^{L} a_l(n-P)\, u(n-p-l) \qquad (A.19)
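A sample-by-sample Python sketch of (A.16), (A.18) and (A.19) follows; the nonlinearity f and the unit-impulse initialization of w are our assumptions:

import numpy as np

def causal_nga_dereverb(x, a, P, L, mu, f=lambda v: -np.tanh(v)):
    # Causal de-reverberation update (A.16), (A.18), (A.19). Sketch only.
    #   x : observed (reverberant) signal
    #   a : (T, L+1) time-varying LPC coefficients, assumed known
    T = len(x)
    w = np.zeros(P + 1); w[0] = 1.0              # identity initialization
    y = np.zeros(T)
    u = np.zeros(T)
    for n in range(T):
        xs = x[max(0, n - P):n + 1][::-1]
        y[n] = np.dot(w[:len(xs)], xs)           # (A.16)
        ys = y[max(0, n - P):n + 1][::-1]
        u[n] = np.dot(w[::-1][:len(ys)], ys)     # (A.18): sum_m w(P-m) y(n-m)
        if n >= P + L:
            e = np.dot(a[n - P], y[n - P - L:n - P + 1][::-1])  # e(n-P)
            for p in range(P + 1):
                grad = sum(a[n - P, l] * u[n - p - l]
                           for l in range(L + 1) if n - p - l >= 0)
                w[p] += mu * f(e) * grad         # (A.19)
    return w, y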
Let:

y(n) = \sum_{p=-P}^{P} w(p)\, x(n-p) \qquad (A.20)

Note that this means that at time $n$ only $y(n-P)$ is accessible. It is necessary to introduce a delay of $3P$:

\tilde{u}(n) = \sum_{m=-P}^{P} w(P-m)\, y(n-m-2P) \qquad (A.21)
To check that this is a causal update, consider the term ũ(n − p − l − P ). At time n, ũ(n) is
accessible. Furthermore l ≥ 0 and p ≥ −P . Thus the largest index in ũ is n.
Recall that in the (single channel) ML formulation for the standard deconvolution problem, there is a normalization constant. Let us begin by looking at where this comes from. If a linear transform model $y = Ax$ is assumed, then the probability density functions are related by:

p(x) = p(y) \left| \frac{\partial y}{\partial x} \right| = p(y)\, |\det(A)| \qquad (A.23)
But $\log|\det(A)|$ can also be written in terms of the eigenvalues $\lambda_i$ of $A$:

\log|\det(A)| = \log \prod_i |\lambda_i| = \sum_i \log|\lambda_i| \qquad (A.25)
To apply this normalization factor to filters it is necessary to recall that the eigenfunctions of a filter are complex exponentials and the eigenvalues are defined by the Fourier transform (FT). Thus, if $y = w * x$, then:

\log p(x) = \log p(y) + \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|W(\omega)|\, d\omega \qquad (A.26)
where the sum has become a normalized integral because a continuous set of eigenvalues is
present. It is essentially the inverse FT of log |W (ω)| evaluated at n = 0.
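As a quick numerical check, this normalization term can be approximated on an FFT grid; the FFT size and the small regularizer below are arbitrary choices of ours:

import numpy as np

def log_det_term(w, nfft=4096):
    # Approximate (1/2pi) * integral of log|W(omega)| d(omega) in (A.26),
    # i.e. the inverse FT of log|W(omega)| evaluated at n = 0.
    W = np.fft.fft(w, nfft)
    return float(np.mean(np.log(np.abs(W) + 1e-12)))  # Riemann sum over omega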
The full negative log-likelihood function for blind deconvolution can now be considered:

J = -\log p(y(n)) - \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|W(\omega)|\, d\omega \qquad (A.27)

This is equivalent to the $\log \det W$ term in ICA [175]. Since $w(p)$ is an invertible linear time-invariant system, it is an infinite-dimensional Toeplitz operator. Differentiating the normalization term gives

\frac{\partial}{\partial w(p)} \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|W(\omega)|\, d\omega = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{e^{-j\omega p}}{W(\omega)}\, d\omega = h(-p) \qquad (A.28)

The last line can be deduced by recalling the definition of the Fourier transform, $W(\omega) = \sum_p w(p)\, e^{-j\omega p}$, and then differentiating with respect to $w(p)$; note also that the last line is simply the time-reversed inverse FT of $1/W(\omega)$. Thus:
\frac{\partial J}{\partial w(p)} = -h(-p) - f(y(n))\, x(n-p) \qquad (A.29)
where $h(p)$ denotes the impulse response of the inverse of $w$ (i.e. $h * w = \delta_0$). Note that this formulation tries to be consistent with the definition $f(y) = \partial/\partial y\, \log p(y)$ used in the Bussgang section above.
It is relatively straightforward to show that the natural gradient normalized update is:

w(p) = w(p) - \mu \sum_l w(l) \sum_m w(m)\, \frac{\partial J}{\partial w(p-l+m)}
     = w(p) + \mu \sum_l w(l) \sum_m w(m) \left[ h(-p+l-m) + f(y(n))\, x(n-p+l-m) \right]
     = w(p) + \mu \sum_l w(l) \left[ \delta(l-p) + f(y(n))\, y(n-p+l) \right]
     = w(p) + \mu \left[ w(p) + f(y(n)) \sum_l w(l)\, y(n-p+l) \right] \qquad (A.30)
The last line here is then equivalent to equation (33) in Amari et al. (with the sign change in
f (y)).
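The last line of (A.30) can be sketched per sample as follows (non-causal, as in the text at this point; the nonlinearity f and the buffering convention are our assumptions):

import numpy as np

def nga_normalized_update(w, y, n, mu, f=lambda v: -np.tanh(v)):
    # Last line of (A.30): w(p) += mu [ w(p) + f(y(n)) sum_l w(l) y(n-p+l) ].
    # y must hold the equalizer output for indices n-P ... n+P.
    P1 = len(w)
    w_old = w.copy()                              # use pre-update taps in the sum
    for p in range(P1):
        acc = np.dot(w_old, y[n - p:n - p + P1])  # sum_l w(l) y(n-p+l)
        w[p] += mu * (w[p] + f(y[n]) * acc)
    return w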
For the reversed structure, the corresponding cost is

J = -\log p(e(n)) - \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|W(\omega)|\, d\omega + \text{constant} \qquad (A.31)

where the expression $-\log \|\partial e/\partial y\|$ has been replaced with “constant” since it does not depend on $w$ and so will vanish when it is differentiated, which is done next.
\frac{\partial J}{\partial w(p)} = -h(-p) - f(e(n))\, \frac{\partial e(n)}{\partial w(p)} = -h(-p) - f(e(n)) \sum_l a_l(n)\, x(n-p-l) \qquad (A.32)
The ‘normalized’ equivalent of equation (A.14), with $g(p) = w(p)$, can now be written:
w(p) = w(p) - \mu \sum_m w(m) \sum_q w(q)\, \frac{\partial J}{\partial w(p-m+q)}
     = w(p) + \mu \sum_m w(m) \sum_q w(q) \left[ h(m-p-q) + f(e(n)) \sum_l a_l(n)\, x(n-p+m-q-l) \right]
     = w(p) + \mu \sum_m w(m) \left[ \delta(m-p) + f(e(n)) \sum_l a_l(n)\, y(n-p+m-l) \right]
     = w(p) + \mu \left[ w(p) + f(e(n)) \sum_m w(m) \sum_l a_l(n)\, y(n-p+m-l) \right] \qquad (A.33)
The normalized versions of the causal NGA de-reverberation algorithm can now be immediately written down:

w(p) = w(p) + \mu \left[ w(p) + f(e(n-P)) \sum_{l=0}^{L} a_l(n-P)\, u(n-p-l) \right] \qquad (A.34)

w(p) = w(p) + \mu \left[ w(p) + f(e(n-3P)) \sum_l a_l(n-3P)\, \tilde{u}(n-p-l-P) \right] \qquad (A.35)
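Relative to the causal sketch given after (A.19), the normalized update (A.34) only adds the w(p) term inside the bracket; a one-function sketch under the same assumptions:

def normalized_causal_update(w, e_del, u, a_del, n, mu, f):
    # Normalized causal NGA update (A.34). Sketch only.
    #   e_del : e(n-P);  a_del : the L+1 coefficients a_l(n-P)
    #   u     : buffer of u(.) values computed as in (A.18)
    for p in range(len(w)):
        grad = sum(a_del[l] * u[n - p - l]
                   for l in range(len(a_del)) if n - p - l >= 0)
        w[p] += mu * (w[p] + f(e_del) * grad)
    return w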
References
[1] H. Kuttruff, Room Acoustics, 4th ed. New York, USA: Taylor & Francis, 2000.
[2] D. C. Halling and L. Humes, “Factors affecting the recognition of reverberant speech by
elderly listeners,” Journal of Speech, Language, and Hearing Research, 2000.
[4] E. Habets, Single- and Multi-Microphone Speech Dereverberation using Spectral En-
hancement. PhD thesis, Technische Universiteit Eindhoven, Eindhoven, 2007.
[6] J. Mourjopoulos, “On the variation and invertibility of room impulse response functions,”
Journal of Sound and Vibration, vol. 102, pp. 217–228, Sept. 1985.
[8] P. Naylor and N. Gaubitch, eds., Speech Dereverberation. New Jersey: Springer, 2008.
[9] M. Miyoshi and Y. Kaneda, “Inverse filtering of room acoustics,” IEEE Trans. on Acous-
tics, Speech and Signal Processing, vol. 36, no. 2, pp. 145–152, 1988.
[12] McGraw-Hill Dictionary of Scientific and Technical Terms, 7th edition. The McGraw-
Hill Companies, 2011.
[13] ISO 3382:1997, Acoustics: Measurement of the reverberation time of rooms with reference to other acoustic parameters, 1997.
[14] V. L. Jordan, “Acoustical criteria for auditoriums and their relation to model techniques,”
J. Acoust. Soc. Am., 1970.
[18] C. L. S. Gilford, “The acoustic design of talk studios and listening rooms,” Proc. Inst.
Elect. Engs., vol. 106, pp. 245–258, May 1959.
[19] O. J. Bonello, “A new criterion for the distribution of normal room modes,” J. Audio
Eng. Soc., pp. 597–606, 1981.
[21] T. Nimura, “Effect of splayed walls of a room on the steady-state sound transmission characteristics,” J. Acoust. Soc. Am., vol. 28, pp. 774–775, July 1956.
[25] J. Pongsiri, P. Amin, and C. Thompson, “Modeling the acoustic transfer function of a
room,” Proceedings of the 12th International Conference on Scientific Computing and
Mathematical Modeling, p. 44, 1999.
[26] W. G. Gardner, “The virtual acoustic room,” Master’s thesis, MIT Media Lab, Cam-
bridge, 1992.
[27] R. Stewart and M. Sandler, “Statistical measures of early reflections of room impulse
responses,” Proceedings of the DAFx’07, 2007.
[29] A. Kulowski, “Algorithmic representation of the ray tracing technique,” Applied Acous-
tics, vol. 18, no. 6, pp. 449–469, 1985.
[30] J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,”
J. Acoust. Soc. Am, vol. 65, no. 4, pp. 943–948, 1979.
[31] J. Martin, D. Van Maercke, and J.-P. Vian, “Binaural simulation of concert halls: a new approach for the binaural reverberation process,” J. Acoust. Soc. Am., vol. 94, no. 6, pp. 3255–3264, 1993.
[32] M. Vorländer, “Simulation of the transient and steady-state sound propagation in rooms using a new combined ray-tracing/image-source algorithm,” J. Acoust. Soc. Am., vol. 86, no. 1, pp. 172–178, 1989.
[33] E. Lehmann and A. Johansson, “Prediction of energy decay in room impulse responses
simulated with an image-source model,” Journal of the Acoustical Society of America,
vol. 124(1), pp. 269–277, July 2008.
[34] Y. Lin and D. D. Lee, “Bayesian regularization and nonnegative deconvolution for room impulse response estimation,” IEEE Transactions on Signal Processing, vol. 54, pp. 839–847, March 2006.
[36] M. Petyt, “Finite element techniques for acoustics,” in R. G. White and J. G. Walker (eds.), Noise and Vibration, Ellis Horwood Ltd., pp. 355–369, 1986.
[39] J. Redondo, R. Picó, B. Roig, and M. R. Avis, “Time domain simulation of sound diffusers using finite-difference schemes,” Acta Acustica, vol. 93, no. 4, pp. 611–622, 2007.
[40] L. Savioja, J. Backman, A. Jarvinen, and T. Takala, “Waveguide mesh method for low-
frequency simulation of room acoustics,” Proc. of the 15th Int. Congr. Acoust. (ICA95),
vol. 2, pp. 1–4, 1995.
[41] D. T. Murphy, M. Beeson, S. Shelley, A. Southern, and A. Moore, “Hybrid room im-
pulse response synthesis in digital waveguide mesh based room acoustics simulation,”
Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx08), pp. 129–
136, 2008.
[42] J. O. Smith III, Mathematics of the Discrete Fourier Transform (DFT): with Audio Applications, 2nd ed. W3K Publishing, 2008.
[44] A. Farina, “Simultaneous measurements of impulse response and distortion with a swept-
sine technique,” 108th AES Convention, 2000.
[46] B. D. Radlovic and R. A. Kennedy, “Iterative cepstrum-based approach for speech de-
reverberation,” Proc. of ISSPAA, vol. 1, pp. 55–58, 1999.
[49] O. Kirkeby and P. A. Nelson, “Digital filter design for inversion problems in sound re-
production,” J. Audio Eng. Soc., vol. 47, no. 7/8, pp. 583–595, 1999.
[50] M. H. Hayes, Statistical Digital Signal Processing and Modeling. New York: John Wiley
& Sons, 1996.
[53] S. Neely and J. B. Allen, “Invertibility of a room impulse response,” J. Acoust. Soc.
Amer., vol. 66, pp. 165–169, 1979.
[56] B. W. Gillespie and L. E. Atlas, “Acoustic diversity for improved speech recognition
in reverberant environments,” in Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, pp. 557–560, 2002.
[60] M. Gerzon, “Why do equalisers sound different?,” Studio Sound, vol. 32, pp. 58–65, July
1990.
[61] R. P. Genereux, “Adaptive loudspeaker systems: Correcting for the acoustic environ-
ment,” Proc 8th International Audio Engineering Society Conference, Washington DC,
May 1990.
[62] P. Hatziantoniou and J. Mourjopoulos, “Results for room acoustics equalization based on
smoothed responses,” Audio Engineering Society 114th Convention, Amsterdam, March
2003.
[63] T. Hikichi and M. Miyoshi, “Blind algorithm for calculating common poles based on lin-
ear prediction.,” Proc. of the International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), vol. 4, no. 17–21, pp. 89–92, 2004.
[64] G. A. Jones and J. M. Jones, Elementary Number Theory. New York: Berlin: Springer-
Verlag, 1998.
[65] L. Tong and S. Perreau, “Multichannel blind identification: From subspace to maximum
likelihood methods,” Proc. IEEE, vol. 86, pp. 1951–1968, November 1998.
[66] T. Hikichi, M. Delcroix, and M. Miyoshi, “On robust inverse filter design for room
transfer function fluctuations,” in Proc. European Signal Processing Conf. (EUSIPCO),
2006.
[67] H. Yamada, H. Wang, and F. Itakura, “Inverse filtering of room acoustics,” International
Conference on Acoustics, Speech, and Signal Processing., vol. 2, pp. 969–972, 14-17
Apr 1991.
[68] P. Nelson, F. Orduna-Bustamante, and H. Hamada, “Inverse filter design and equaliza-
tion zones in multichannel soundreproduction,” IEEE Transactions on Speech and Audio
Processing, vol. 3, pp. 185–192, May 1995.
[69] J. Lochner and J. Burger, “The subjective masking of short time delayed echoes by their
primary sounds and their contribution to the intelligibility of speech,” Acustica, vol. 8,
pp. 1–10, 1958.
[70] A. Watkins and N. Holt, “Effects of a complex reflection on vowel indentification,” Jour-
nal of the Acoustical Society of America, vol. 86, pp. 532–542, 2000.
[71] M. Hodgson and E. Nosal, “Effect of noise and occupancy on optimal reverberation times
for speech intelligibility in classrooms,” Journal of the Acoustical Society of America,
vol. 111, pp. 931–939, 2002.
[72] P. Naylor and N. Gaubitch, “Speech dereverberation,” Proc. of the International Work-
shop on Acoustic Echo and Noise Control (IWAENC 2005), 2005.
[74] E. Zwicker, “Subdivision of the audible frequency range into critical bands,” Journal of
the Acoustical Society of America, vol. 33, pp. 318–326, 1961.
[75] J.Y.C.Wen and P. Naylor, “An evaluation measure for reverberant speech using tail de-
cay modelling,” in Proc. of the European Signal Processing Conference (EUSIPCO06),
pp. 1–4, 2006.
[77] S. Wang, A. Skey, and A. Gersho, “An objective measure for predicting subjective quality
of speech coders,” IEEE Journal on Selected Areas in Communications, vol. 10, no. 5,
1992.
[79] Y. Huang, J. Benesty, and J. Chen, Acoustic MIMO Signal Processing. New Jersey: Springer Topics in Signal Processing, Vol. 1, 2006.
[81] J. Benesty, S. Makino, and J. Chen, eds., Speech Enhancement. New Jersey: Springer, 2005.
[82] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. New Jersey: Springer Topics in Signal Processing, Vol. 1, 2008.
[84] J. Allen, “Synthesis of pure speech from a reverberant signal,” U.S. Patent No. 3786188,
1974.
[86] Y. Huang, J. Benesty, and J. Chen, Acoustic MIMO Signal Processing (Signals and Communication Technology). Secaucus, NJ, USA: Springer-Verlag New York, 2006.
[89] T. Nakatani, M. Miyoshi, and K. Kinoshita, “Implementation and effects of single chan-
nel dereverberation based on the harmonic structure of speech,” Proc. of the International
Workshop on Acoustic Echo and Noise Control (IWAENC03), pp. 91–94, 2003.
[91] K. Furuya and A. Kataoka, “Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1579–1591, 2007.
[92] R. Liu and G. Dong, “A fundamental theorem for multiple-channel blind equalization,” IEEE Trans. on Circuits and Systems, vol. 44, pp. 472–473, May 1997.
[93] S. Mitra, Digital Signal Processing. New Jersey: McGraw-Hill, 2002.
[94] D. Ward, R. Kennedy, and R. Williamson, “Theory and design of broadband sensor
arrays with frequency invariant far-field beam patterns,” J. Acoust. Soc. Amer., vol. 97,
pp. 1023–1034, Feb. 1995.
[95] G. W. Elko, “Microphone array systems for hands-free telecommunication,” Speech
Commun., vol. 20, pp. 229–240, Dec. 1996.
[96] N. D. Gaubitch, Blind Identification of Acoustic Systems and Enhancement of Reverber-
ant Speech. PhD thesis, Imperial College, University of London, London, 2006.
[97] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[98] F. Pacheco and R. Seara, “Spectral subtraction for reverberation reduction applied to
automatic speech recognition,” Telecommunications Symposium, 2006 International,
vol. 7, pp. 795–800, Sept. 2006.
[99] N. Virag, “Single channel speech enhancement based on masking properties of the hu-
man auditory system,” IEEE Trans. Speech Audio Processing, vol. 7, pp. 126–137, Mar.
1999.
[100] E. A. P. Habets, “Single-channel speech dereverberation based on spectral subtrac-
tion,” Proc. 15th Annual Workshop Circuits, Syst., Signal Process. (ProRISC04), vol. 7,
pp. 250–254, Nov. 2004.
[101] N. W. D. Evans, J. S. Mason, W. M. Liu, and B. Fauve, “On the fundamental limita-
tions of spectral subtraction: an assessment by automatic speech recognition,” in Proc.
European Signal Processing Conf. (EUSIPCO), 2005.
[102] K. Lebart, J. Boucher, and P. Denbigh, “A new method based on spectral subtraction for speech dereverberation,” Acta Acustica, vol. 87, no. 3, pp. 359–366, 2001.
[103] J. Polack, La transmission de l'énergie sonore dans les salles. PhD thesis, Université du Maine, Le Mans, 1988.
[104] R. Ratnam, D. L. Jones, and W. D. O'Brien, Jr., “Fast algorithms for blind estimation of reverberation time,” IEEE Signal Processing Letters, vol. 11, pp. 537–540, June 2004.
[105] Y. Zhang, J. A. Chambers, F. Li, P. Kendrick, and T. Cox, “Blind estimation of reverber-
ation time in occupied rooms,” in Proc. European Signal Processing Conf. (EUSIPCO),
2006.
[106] J. Wen, E. Habets, and P. A. Naylor, “Blind estimation of reverberation time based on the
distribution of signal decay rates,” Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pp. 329–332, 2008.
[107] B. Yegnanarayana, S. R. M. Prasanna, and K. S. Rao, “Speech enhancement using exci-
tation source information,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing,
vol. 1, pp. 541–544, 2002.
[108] S. Griebel and M. Brandstein, “Wavelet transform extrema clustering for multi-channel speech dereverberation,” in Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA-99), vol. 1, 1999.
[110] D. Kundur and D. Hatzinakos, “Blind image deconvolution,” IEEE Signal Processing Magazine, vol. 13, no. 2, pp. 43–64, 1996.
[111] O. Shalvi and E. Weinstein, “New criteria for blind deconvolution of nonminimum phase systems (channels),” IEEE Trans. on Information Theory, vol. 36, pp. 312–321, March 1990.
[112] L. Tong, G. Xu, and T. Kailath, “Blind identification and equalization based on second-order statistics: a time domain approach,” IEEE Trans. on Information Theory, vol. 40, pp. 340–349, March 1994.
[115] R. Lopez-Valcarce and S. Dasgupta, “Second order statistics based blind channel equalization with correlated sources,” IEEE, vol. 4, pp. 366–369, March 2001.
[116] R. Lopez-Valcarce and S. Dasgupta, “Blind channel equalization with colored sources based on second-order statistics: A linear prediction approach,” IEEE Transactions on Signal Processing, vol. 49, pp. 2050–2059, Sept. 2001.
[117] N. D. Gaubitch, P. A. Naylor, and D. B. Ward, “On the use of linear prediction for
dereverberation of speech,” IWAENC, pp. 99–102, 2003.
[118] M. Triki and T. Slock, “AR source modeling based on spatiotemporally diverse multi-
channel outputs and application to multimicrophone dereverberation,” DSP 2007, 15th
International Conference on Digital Signal Processing, July 2007.
[120] S. Haykin, Unsupervised adaptive filtering. New Jersey: John Wiley & Sons, 2000.
[121] M. Joho, A Systematic Approach to Adaptive Algorithms for Multichannel System Identification, Inverse Modeling, and Blind Identification. PhD thesis, Swiss Federal Institute of Technology, Zürich, 2000.
[123] J. Mendel, “Tutorial on higher-order statistics (spectra) in signal processing and system
theory: Theoretical results and some applications,” Proc. IEEE, vol. 79, pp. 278–305,
March 1991.
[124] G. Giannakis and J. Mendel, “Identification of nonminimum phase systems using higher
order statistics,” IEEE Transaction on Acoustics, Speech and Signal Processing, vol. 37,
pp. 360–377, March 1989.
[125] C. Nikias and J. Mendel, “Signal processing with higher order spectra,” IEEE Signal
Processing Magazine, vol. 10, pp. 10–37, July 1993.
[126] D. Hatzinakos and C. Nikias, “Blind equalisation based on higher order statistics
(H.O.S.),” [119], pp. 181–258, 1994.
[127] R. A. Wiggins, “Minimum entropy deconvolution,” Geoexploration, vol. 16, pp. 21–35,
1978.
[128] J. A. Cadzow, “Blind deconvolution via cumulant extrema,” IEEE Signal Process.Mag.,
vol. 13, pp. 24–42, May 1996.
[129] D. L. Donoho, “On minimum entropy deconvolution,” Applied Time Series Analysis, D.
F. Findley, Ed. New York: Academic Press, 1981.
[130] A. Papoulis, Probability, Random Variables and Stochastic Processes, 2nd ed. Singa-
pore: McGraw-Hill, 1984.
[131] P. Paajarvi and J. P. LeBlanc, “Computationally efficient norm-constrained adaptive blind deconvolution using third-order moments,” Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 752–755, 2006.
[132] S. Bellini and F. Rocca, “Near optimal blind deconvolution,” IEEE, pp. 2236–2239,
1988.
[133] S. Bellini, “Bussgang techniques for blind deconvolution and equalization,” in Blind Deconvolution (S. Haykin, ed.), Englewood Cliffs, NJ: Prentice-Hall, 1994.
[134] S. Amari, A. Cichocki, and H. Yang, “A new learning algorithm for blind signal separa-
tion,” Advances in Neural Information Processing Systems 8. MIT Press., pp. 752–763,
1996.
[135] J. Cardoso and B. Laheld, “Equivariant adaptive source separation,” IEEE Trans. Signal Processing, vol. 44, pp. 3017–3030, 1996.
[136] S. Amari, S. Douglas, A. Cichocki, and H. Yang, “Novel on-line adaptive learning algo-
rithms for blind deconvolution using the natural gradient approach,” 1997.
[137] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind sep-
aration and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129–1159,
1995.
[138] D. Fee, C. Cowan, S. Bilbao, and I. Ozcelik, “Predictive deconvolution and kurtosis
maximization for speech dereverberation,” in Proc. European Signal Processing Conf.
(EUSIPCO), 2006.
[139] G. Xu, H. Liu, L. Tong, and T. Kailath, “A least-squares approach to blind channel identification,” IEEE Trans. Signal Processing, vol. 43, no. 12, pp. 2982–2993, December 1995.
[140] L. Tong, G. Xu, and T. Kailath, “A new approach to blind identification and equalization of multipath channels,” Conference Record of the 25th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 856–860, November 1991.
[141] Y. Hua, “Fast maximum likelihood for blind identification of multiple FIR channels,”
IEEE Trans. Signal Processing, vol. 44, pp. 661–672, Mar. 1996.
[142] H. Liu, G. Xu, and L. Tong, “A deterministic approach to blind equalization,” Proc. 27th
Asilomar Conf., Pacific Grove, CA, pp. 751–755, 1993.
[144] S. Gannot and M. Moonen, “Subspace methods for multimicrophone speech dereverber-
ation,” in Proceedings of the 7th IEEE/EURASIP International Workshop on Acoustic
Echo and Noise Control (IWAENC 2001), vol. 1, pp. 47–50, September 2001.
[145] S. Gannot and M. Moonen, “Subspace methods for multimicrophone speech dereverberation,” EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1074–1090, 2003.
[146] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore: Johns Hopkins University Press, 1996.
[150] N. D. Gaubitch, M. K. Hasan, and P. A. Naylor, “Generalized optimal step-size for blind multichannel LMS system identification,” IEEE Signal Processing Letters, vol. 13, pp. 624–627, October 2006.
[151] N. Gaubitch, J. Benesty, and P. Naylor, “Adaptive common root estimation and the com-
mon zeros problem in blind channel estimation,” in Proc. European Signal Processing
Conf. (EUSIPCO), September 2005.
[152] R. Ahmad, N. Gaubitch, and P. Naylor, “A noise-robust dual filter approach to multichan-
nel blind system identification,” in Proc. European Signal Processing Conf. (EUSIPCO),
2007.
[155] M. Triki and T. Slock, “Iterated delay and predict equalization for blind speech derever-
beration,” IWAENC 2006, Paris, September 2006.
[158] B. Gillespie and L. Atlas, “Strategies for improving audible quality and speech recogni-
tion accuracy of reverberant speech,” Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 2003.
[159] M. Delcroix, T. Hikichi, and M. Miyoshi, “Blind dereverberation algorithm for speech
signals based on multi-channel linear prediction,” Acoustical Science and Technology,
vol. 26, pp. 432–439, October 2005.
[162] M. Delcroix, T. Hikichi, and M. Miyoshi, “On the use of the LIME dereverberation algorithm in an acoustic environment with a noise source,” Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 825–828, 2006.
[163] K. Furuya, “Noise reduction and dereverberation using correlation matrix based on the
multiple-input/output inverse-filtering theorem (MINT),” Proc. of International Work-
shop on Handsfree Speech Communication, vol. 15, no. 5, pp. 59–62, 2001.
[164] K. Furuya and A. Kataoka, “FFT-based fast conjugate gradient method for real-time
dereverberation system,” Electronics and Communications in Japan, vol. 90, no. 7, 2007.
[165] D. R. Brillinger, Time series: Data analysis and theory. New York: Holt, Rinehart and
Winston, 1975.
[167] T. Nakatani and M. Miyoshi, “Blind dereverberation of single channel speech sig-
nal based on harmonic structure,” Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), vol. 1, pp. 92–95, 2003.
[171] D. Bees, M. Blostein, and P. Kabal, “Reverberant speech enhancement using cepstral
processing,” Acoustics, Speech, and Signal Processing, 1991. International Conference
on, vol. 2, pp. 977–980, 1991.
[175] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley & Sons, 2001.
[176] S. Douglas, H. Sawada, and S. Makino, “Natural gradient multichannel blind deconvo-
lution and speech separation using causal FIR filters,” IEEE Transactions on Speech and
Audio Processing, vol. 13, pp. 92–104, January 2005.