
A Pitch Detection Method Based On Continuous Wavelet Transform For Harmonic Signal

This paper proposes a new pitch detection method based on continuous wavelet transform that can work properly under noisy environments even with a short frame duration. The method uses a harmonic analyzing wavelet in the continuous wavelet transform to obtain information about pitch from a scalogram. It is expected to provide both robustness to noise and ability to detect pitch within a short time frame.


Acoust. Sci. & Tech.

24, 1 (2003)

PAPER

A pitch detection method based on continuous wavelet transform for harmonic signal

Yoshifumi Chisaki¹, Hidetoshi Nakashima², Shuuichi Shiroshita², Tsuyoshi Usagawa² and Masanao Ebata³
¹ Faculty of Engineering, Kumamoto University, 2–39–1 Kurokami, Kumamoto, 860–8555 Japan
² Graduate School of Science and Technology, Kumamoto University, 2–39–1 Kurokami, Kumamoto, 860–8555 Japan
³ Kumamoto National College of Technology, 2659–2 Suya, Nishigoshi, Kikuchigun, Kumamoto, 861–1102 Japan
(Received 3 October 2001, Accepted for publication 2 July 2002)

Abstract: Some conventional pitch detection methods require a frame length that is too long to track a rapid transient of pitch. Although there are wavelet-based pitch detection methods which require only a few pitch periods per frame, they are not robust enough against noise. This paper proposes a new pitch detection method which works properly under noisy environments even if the frame duration is short. The proposed method consists of a power level detector, a signal analyzer, an autocorrelator, a voiced-unvoiced detector and a lag time interpolator. The signal analyzer is based on the continuous wavelet transform using a harmonic analyzing wavelet. The use of the harmonic analyzing wavelet gives more information about the pitch in a scalogram. Simulations of pitch detection are performed for a harmonic chirp signal and for speech signals. The performance is compared with that of two conventional pitch detection methods, the cepstrum method and the modified correlation method. As a result, the pitch detection performance of the proposed method under a noisy environment is better than that of the two conventional methods. In particular, the largest improvement in performance is obtained for male voices.

Keywords: Pitch detection, Continuous wavelet transform, Harmonic signal


PACS number: 43.72.Ar, 43.66.Hg

1. INTRODUCTION
Pitch information plays an important role in pitch-based speech processing systems, such as a speech rate converter, speech morphing, a speech enhancer for a telephone line and so on [1–6]. Pitch detection based on a short duration is required to improve the performance of these systems because the pitch frequency changes rapidly. Some references mention that the duration of the steady state of speech is 10–20 ms and that the dynamic range of the pitch frequency within a short time is wider than an octave [1,2]. At the same time, robustness against noise is also one of the essential issues in improving the performance of pitch-based applications.

There have been many papers on pitch detection of speech since the 1960s [2,7]. In the literature, simulation results show that more than seven or eight pitch periods are required as a fixed frame length to perform stable pitch detection for male speech; hence a frame duration of at least 70 ms is required when the pitch is 100 Hz. Moreover, other references describe pitch detection methods based on a short duration not only for clean speech but also under noisy environments [1,3,8,9]. In particular, a pitch detection method based on instantaneous frequency proposed by Atake [9] improves robustness against noise with a short duration over other methods based on instantaneous frequency. However, pitch detection still requires a duration of over 40 ms when the pitch is 100 Hz, since the frame length for these methods corresponds to the analysis frequency. For high-performance pitch-based systems, a shorter frame is preferred in order to reduce the group delay. Thus, the development of a pitch detection method which works with a short frame length and is also robust against noise is an important issue.

In studies based on autocorrelation [2,7], the performance of pitch detection is degraded under noisy environments because the autocorrelation peak corresponding to the pitch is not enhanced relative to other peaks corresponding to noise. There are various techniques to decrease the effects of noise in autocorrelation, e.g. low-pass filtering in the cepstrum method and the modified correlation method. However, the robustness against noise of those techniques is insufficient when the frame length becomes shorter. In order to obtain further robustness against noise, we consider using both the harmonic components of speech and the temporal information corresponding to the pitch in a scalogram obtained by a continuous wavelet transform. The continuous wavelet transform is a suitable transform because much information about the pitch can be obtained from a scalogram. It is expected that information about the pitch along both the time and the frequency axes can be obtained by using the following two major flexibilities for speech analysis. The first is flexibility in arranging the resolution of both time and frequency. The second is the flexible selection of the transform kernel, which is called an analyzing wavelet or a mother wavelet. The analyzing wavelet plays an important role in deciding the characteristics of the wavelet transform. The only restriction on an analyzing wavelet is the admissible condition. Thus, there is a large variety of mathematical functions that can be selected as an analyzing wavelet when the inverse wavelet transform is not needed. In order to extract pitch information, a function based on a harmonic structure is selected as the analyzing wavelet because our target signal, a speech signal, has a harmonic structure. In a pitch detection method based on autocorrelation, pitch candidates are selected from local peaks in the autocorrelation. As studies on pitch detection of music signals [10,11] show, relative power enhancement of the local peak corresponding to the pitch is important even though each harmonic component of speech has a perturbation of frequency. The power of each high-order harmonic component is summed up at the pitch frequency in the scalogram because the frequency characteristics of the harmonic analyzing wavelet are the same as those of a comb filter. Thus, it is expected that the relative power level at the fundamental frequency obtained by the wavelet transform using the harmonic analyzing wavelet is higher than that obtained by the wavelet transform using a Gabor function [3,9].

In addition, another characteristic of the continuous wavelet transform based on the harmonic analyzing wavelet gives one more benefit for robustness against noise. The continuous wavelet transform brings a series of local peaks with an interval of a pitch period along the time axis in a scalogram. Therefore, gathering pitch period information from the pitch interval series contributes to enhancing the power of a local peak in a short time. Moreover, the series of local peaks along the time axis contributes to the detection of the pitch when the power level at the fundamental frequency is insufficient; namely, for a missing fundamental signal, such as a speech signal over telecommunication lines.

In this paper, a pitch detection method based on the continuous wavelet transform is proposed. The proposed method consists of a power level detector, a signal analyzer, an autocorrelator, a voiced-unvoiced detector and a lag time interpolator. In the signal analyzer block, the continuous wavelet transform is used to obtain both pitch detection based on a short duration and robustness against noise. Owing to both the relative enhancement of the power level and the series of local peaks, pitch detection based on a short duration and robustness against noise are expected simultaneously.

First, the procedure of the proposed method is described. In particular, the advantages of the continuous wavelet transform using a harmonic analyzing wavelet are explained in detail. Secondly, the parameters of the proposed scheme are discussed through computer simulation. The proposed method is compared with the conventional modified correlation method and the cepstrum method in all simulations. In order to confirm the performance of pitch detection, a harmonic chirp signal is used as an input signal. Moreover, simulation results on robustness against noise are shown for speech with added white or pink noise. Finally, the performance of the proposed method is summarized on the basis of those results in comparison with the modified correlation method and the cepstrum method.

2. PROCEDURE OF PITCH DETECTION
In this section, the procedure of pitch detection is described. Figure 1 shows a block diagram of the proposed method. The pitch detection method consists of 5 blocks: a power level detector, a signal analyzer, an autocorrelator, a voiced-unvoiced detector and a lag time interpolator. The pitch detection is performed frame by frame with a frame shift of $T_{shift}$.

STEP 1:
In the first stage, the power level detector works as a switch for turning the pitch detection process on and off. The pitch detection is performed when the following conditions are satisfied,

$$10 \log(P) > P_{TH}, \quad P > 0, \tag{1}$$

where $P$ is the average power level with respect to time in a frame, and $P_{TH}$ is a threshold derived from the average power level of the background noise.
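The power gate of STEP 1 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `frame_is_active`, the mean-square definition of the average power, the base-10 logarithm and the threshold of -40 dB are all assumptions made for this example.

```python
import math


def frame_is_active(frame, p_th_db):
    """Return True when the frame passes the power gate of Eq. (1).

    Assumptions (not stated in the paper): P is the mean of the squared
    samples, and "log" is the base-10 logarithm.
    """
    if not frame:
        return False
    p = sum(x * x for x in frame) / len(frame)  # average power P of the frame
    # P > 0 is checked first so that log10 is never called on zero.
    return p > 0.0 and 10.0 * math.log10(p) > p_th_db


# Hypothetical threshold, e.g. a little above a measured noise floor.
P_TH_DB = -40.0

loud = [0.5, -0.5] * 100        # P = 0.25, about -6 dB
quiet = [0.001, -0.001] * 100   # P = 1e-6, about -60 dB
silent = [0.0] * 200            # P = 0, the gate must stay closed

print(frame_is_active(loud, P_TH_DB),
      frame_is_active(quiet, P_TH_DB),
      frame_is_active(silent, P_TH_DB))
```

Frames that fail the gate would simply be skipped, so the more expensive wavelet analysis only runs on frames whose power exceeds the noise-derived threshold.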

Y. CHISAKI et al.: PITCH DETECTION BASED ON CONTINUOUS WAVELET TRANSFORM

[Fig. 1 flow, as far as it can be recovered from the diagram: a frame (advanced by Tshift) → test of 10 log(P) against PTH (No: skip to next frame) → wavelet transform using Harmonic Analyzing Wavelet, W(t, f) → autocorrelation in scalogram, R(τ, f) → summation of autocorrelation, R(τ) → maximum value search in autocorrelation domain → test of 20 log(R(τ)/R(0)) against LTH (No: invalid frame) → interpolation of a lag time with spline function → estimated pitch f0]
Fig. 1 Block diagram of pitch detection.

[Fig. 2 panels (a) and (b): magnitude (dB, 0 to −90) versus relative frequency (0 to 0.5)]
Fig. 2 Example of frequency response. (a): Gabor function. (b): Harmonic Analyzing Wavelet (Proposed).

[Fig. 3 sketch: power versus frequency at f0, 2f0, 3f0, ..., (n−1)f0, nf0, marking Ph0 and Pg0 at f0 and the background noise level]
Fig. 3 Illustration of characteristics of harmonic wavelet transform.

STEP 2:
The wavelet transform is performed in this stage. Since the admissible condition is the only restriction on the analyzing wavelet, there is a large variety of mathematical functions to select from when the inverse wavelet transform is not needed. The Gabor function is often used as the analyzing wavelet for the continuous wavelet transform [3,9]. Figures 2(a) and 2(b) show examples of the frequency response of the Gabor function and of the proposed analyzing wavelet, respectively. As shown in Fig. 2(a), the frequency response of the Gabor function can be regarded as that of a band-pass filter. Since the lag time for a pitch candidate in an autocorrelation is chosen by searching for the peak of maximum value, a relative enhancement of a local peak gives robustness against noise. In order to bring information about the pitch into a scalogram, we propose to make use of an analyzing wavelet which contains high-order harmonic components. Since the frequency response of the proposed analyzing wavelet, as shown in Fig. 2(b), is the same as that of a comb filter, the power at the harmonic components is summed up at the pitch frequency by the proposed analyzing wavelet, as illustrated in Fig. 3. $P_{g0}$ and $P_{h0}$ at $f_0$ in Fig. 3 denote the power levels obtained by the Gabor function and by the proposed analyzing wavelet, respectively. Therefore, it is expected that the relative enhancement of power by the summation of power over the harmonic components brings robustness against noise.

Moreover, the proposed method brings other useful information about the pitch into a scalogram: a series of local peaks at intervals of the pitch period appears along the time axis (i.e. shift axis) in the scalogram. The ripples which correspond to the pitch period along the time axis appear in a scalogram not only at the pitch frequency but also at other frequencies. Mismatching of phase at each harmonic component only modifies the pattern of the ripple, and the error ratio of pitch estimation caused by the phase mismatch is less than 2% when the duration of the analyzing wavelet is longer than a duration corresponding to 3 fundamental periods [12].

Hereafter, the proposed analyzing wavelet and the proposed method are called the Harmonic Analyzing Wavelet and the Harmonic Wavelet Transform (HWT), respectively.

A general form of the wavelet transform is expressed as follows,

$$(W_\psi f)(b, a) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{|a|}}\, \overline{\psi\!\left(\frac{x - b}{a}\right)}\, f(x)\, dx, \tag{2}$$

where
$\psi(x)$ : analyzing wavelet,
$f(x)$ : input signal,
$a$ : scale parameter,
$b$ : shift parameter,
$\overline{x}$ : conjugate of $x$.

The analyzing wavelet $\psi(x)$ must satisfy the following admissible condition,

$$\int_{-\infty}^{\infty} \psi(x)\, dx = 0. \tag{3}$$

In order to apply a harmonic function to the analyzing wavelet, let us write a part of Eq. (2) as follows,

$$g(b, a, x) = \frac{1}{\sqrt{a}}\, \overline{\psi\!\left(\frac{x - b}{a}\right)}. \tag{4}$$

Since the inverse of the scale parameter, $1/a$, and the shift parameter $b$ correspond to the frequency $f$ and the time $t$ in the scalogram, respectively, Eq. (2) can be expressed by using Eq. (4) with $t$ and $f$ as follows,

$$W(t, f) = \int_{-\infty}^{\infty} g(t, f, x)\, f(x)\, dx. \tag{5}$$

A harmonic structure is adopted for the function $g(t, f, x)$. In addition, the factor $1/\sqrt{a}$ shown in Eq. (4) should also be taken into account. The function $g(t, f, x)$ is derived as follows. We define a basic function $h(t, f, x)$ which has the characteristics of a harmonic structure,

$$h(t, f, x) = \left( \sum_{k=1}^{n} \alpha_k\, e^{j(2\pi f k (x - t) + \theta_k)} \right) w(x), \tag{6}$$

$$-\frac{N}{2f} \le x \le \frac{N}{2f}, \tag{7}$$

where $\alpha_k$ and $\theta_k$ denote the amplitude and phase of the $k$-th order harmonic component, respectively, $f$ is the fundamental frequency for analysis and $n$ is the number of harmonics. As higher harmonic components do not keep a strict harmonic structure [13], the value of $n$ is set to at most 20, limited so that the frequency of the highest harmonic component does not exceed the Nyquist frequency. $w(x)$ is a Gaussian window function. The window is designed to give localization within a duration of $N$ periods of the fundamental frequency. In order to obtain the highest performance, $\alpha_k$ and $\theta_k$ should be the same as those of the target signal. However, it is difficult to obtain suitable values of $\alpha_k$ and $\theta_k$ frame by frame because the target signal is speech. Therefore, fixed values are assigned to $\alpha_k$ and $\theta_k$ in this paper. It is confirmed that the performance is sufficient for pitch detection, as shown in this paper and in other literature [12]. In addition, although a combination of values of $\alpha_k$ and $\theta_k$ gives a variety of patterns of the ripple waveform, the period of ripples corresponding to the pitch period always appears with the highest power for any values of $\alpha_k$ and $\theta_k$ [12].

Finally, the analyzing wavelet $g(t, f, x)$ is expressed with a normalization factor for the function $h(t, f, x)$ and a factor $\sqrt{f}$ in place of $1/\sqrt{a}$ in Eq. (4) as follows,

$$g(t, f, x) = \frac{\sqrt{f}}{\|h\|_2}\, h(t, f, x), \tag{8}$$

where $\|\cdot\|_2$ is the $L^2$ norm.

Examples of the proposed analyzing wavelet are shown in Fig. 4. From the left side of Fig. 4, each successive waveform of the analyzing wavelet has a doubled analysis frequency. As the figure shows, Eq. (8) satisfies the admissible condition shown in Eq. (3).

[Fig. 4 sketch: W(t, f) over the time (t) and frequency (f) axes, with analyzing wavelets g(t, f, x) spanning x = −N/2f to x = N/2f around x = 0]
Fig. 4 Examples of analyzing wavelets.

STEP 3:
To determine a pitch period candidate, the scalogram is converted in two stages. First, the autocorrelation is calculated along the time axis of the scalogram to synchronize the lag time at each frequency. Second, the autocorrelation is integrated over frequency to determine a final candidate for the pitch period.

This scheme is similar to that of human pitch perception models, such as the Meddis-Hewitt model, which is based on psychophysical studies [14,15]. As a first process, the conversion from the scalogram into the autocorrelation domain is executed by Eq. (9).


$$R(\tau, f) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} \{ W(t, f)\, W(t + \tau, f) \}\, dt, \tag{9}$$

where $W(t, f)$ is the set of values obtained by the wavelet transform. For calculations on a computer, $W(t, f)$ is zero-padded outside the duration of the signal frame.

As a second process, the conversion from $R(\tau, f)$ to $R(\tau)$ is performed by integration in order to relatively enhance the local peaks of the pitch candidates. $R(\tau)$ is defined as the integrated autocorrelation and is expressed as follows,

$$R(\tau) = \int_{F_L}^{F_H} R(\tau, f)\, df, \tag{10}$$

where $F_L$ and $F_H$ are the lowest and the highest frequencies of the pitch analysis interval, respectively, and $\tau$ is the lag time.

The lag time $\tau$ at which $R(\tau)$ takes its maximum value in the range $1/F_H \le \tau \le 1/F_L$ corresponds to the pitch period.

STEP 4:
In the search for a local peak as a pitch candidate, a peak of maximum value in the integrated autocorrelation domain is detected whether the frame is voiced or unvoiced. Therefore, a voiced-unvoiced decision is required. As the values of $R(\tau)$ at local peaks correspond to the power level of the harmonic components of the observed signal, the following condition is defined to distinguish whether a frame is voiced or unvoiced,

$$20 \log \frac{\max\left(R(\tau);\ 1/F_H \le \tau \le 1/F_L\right)}{R(0)} \ge L_{TH}, \tag{11}$$

where $\max(\cdot)$ is a function which gives the maximum value, and $L_{TH}$ is a threshold. $L_{TH}$ is determined by a preliminary experiment.

STEP 5:
In the final step, $R(\tau)$ is interpolated with respect to the lag time by a cubic spline function to improve the resolution of the lag time. The resolution of the lag time becomes 10 times higher than the original. The final estimate of the pitch is obtained by inverting the interpolated lag time.

3. SIMULATIONS
In this section, three simulations are performed. In the first simulation, the scalogram, correlogram and integrated autocorrelation are shown by using a pseudo vowel. In the second simulation, a harmonic chirp signal is used to show the basic performance. Finally, the performance of pitch detection for speech is examined by using words and sentences uttered by females and males. The sampling frequency and the quantization resolution are 10 kHz and 16 bits, respectively. The frame shift $T_{shift}$ is set to 10 ms.

3.1. Confirmation of Each Process
In this subsection, each result is shown as a scalogram, a correlogram and an integrated autocorrelation. Pseudo vowel signals whose fundamental frequency and order of harmonic components are set to 200 Hz and 10, respectively, are used as input signals. The lower and higher frequencies of the search range, $F_L$ and $F_H$, are set to 50 Hz and 500 Hz, respectively, because the pitch of the test signals is 200 Hz. The frame length is set to 20.0 ms. $N$, the number of fundamental periods needed to satisfy the admissible condition, is set to 3, and $n$, the maximum number of harmonics of the analyzing wavelet, is set to 20. $\alpha_k$ and $\theta_k$ are set to 1 and 0, respectively.

The threshold parameter $L_{TH}$ for the voiced-unvoiced decision is obtained as follows. An unvoiced frame is simulated by adding noise to a frame of voiced speech. White and pink noise are added to the pseudo vowel. The values of the left-hand side of Eq. (11) are calculated for combinations of vowel and noise. The value of the left-hand side corresponds to the degree of distortion of the harmonic structure, and it becomes smaller as the harmonic structure is distorted. As a result, in the case of white noise, the values range from −26.3 dB, for incorrect estimation, to −3.6 dB, for clean speech. In the case of pink noise, the left-hand side values are 1 dB greater than those in the case of white noise. According to this preliminary experiment, the parameter $L_{TH}$ is set to −24 dB for both white and pink noise.

Figure 5 shows a scalogram of the pseudo vowel /a/ obtained by the proposed method without noise. The vertical and horizontal axes represent frequency and time (i.e. scale and shift in the wavelet transform), respectively. Peaks are located at intervals of around 5 ms along the time axis. This interval corresponds to the fundamental frequency.

To pick up the information about the pitch, the autocorrelation in the scalogram is calculated. Figure 6 shows the autocorrelation defined by Eq. (9). The vertical and horizontal axes represent frequency and lag time of the autocorrelation, respectively. In this figure, peaks are aligned along the frequency axis around 5 ms and 10 ms, which correspond to the pitch and to half the pitch, respectively.

To obtain a pitch period candidate, the integrated autocorrelation is calculated. Figure 7 shows the integrated autocorrelation defined in Eq. (10). The vertical and horizontal axes represent the magnitude of the integrated autocorrelation and the lag time, respectively. As shown in Fig. 7, the maximum local peak in the interval $[1/F_H, 1/F_L]$ is located at 5.00 ms; namely, the estimated fundamental frequency is 200 Hz. This is the pitch period of the test signal. Figure 8 shows the integrated autocorrelation when white noise is added at SNR = 0 dB. As the figure shows, the magnitudes of the local peaks are degraded. In this case, the maximum magnitude is located at 2.50 ms; namely, the estimated pitch is 400 Hz.
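The chain from a scalogram to a pitch estimate (STEP 3 to STEP 5) can be sketched in discrete form as below. This is an illustrative sketch under stated assumptions rather than the authors' implementation. All function names are hypothetical. The rows of the "scalogram" are synthetic cosines that ripple at the pitch period, standing in for actual HWT output. The integral of Eq. (10) is replaced by a plain sum over rows. The logarithm is taken as base 10, and natural boundary conditions are assumed for the cubic spline, since the paper does not specify them. The test pitch of 210 Hz is chosen only so that the true lag falls between samples.

```python
import math


def autocorr_row(w):
    """Discrete counterpart of Eq. (9) for one frequency row.

    Samples outside the frame are treated as zero (zero padding).
    """
    n = len(w)
    return [sum(w[t] * w[t + lag] for t in range(n - lag)) / n
            for lag in range(n)]


def integrated_autocorr(scalogram):
    """Discrete counterpart of Eq. (10): sum R(tau, f) over the rows (f)."""
    rows = [autocorr_row(row) for row in scalogram]
    return [sum(column) for column in zip(*rows)]


def spline_second_derivs(y):
    """Second derivatives of a natural cubic spline on a unit-spaced grid."""
    n = len(y)
    m = [0.0] * n
    cp = [0.0] * n
    dp = [0.0] * n
    for i in range(1, n - 1):  # forward sweep of the tridiagonal solve
        d = 6.0 * (y[i - 1] - 2.0 * y[i] + y[i + 1])
        denom = 4.0 - cp[i - 1]
        cp[i] = 1.0 / denom
        dp[i] = (d - dp[i - 1]) / denom
    for i in range(n - 2, 0, -1):  # back substitution
        m[i] = dp[i] - cp[i] * m[i + 1]
    return m


def spline_eval(y, m, x):
    """Evaluate the natural cubic spline at a fractional sample index x."""
    i = min(int(x), len(y) - 2)
    u = x - i
    a = 1.0 - u
    return (a * y[i] + u * y[i + 1]
            + ((a ** 3 - a) * m[i] + (u ** 3 - u) * m[i + 1]) / 6.0)


def estimate_pitch(r, fs, f_low, f_high, l_th_db, upsample=10):
    """Peak search in [1/f_high, 1/f_low], voiced test, lag interpolation."""
    lo = math.ceil(fs / f_high)
    hi = min(math.floor(fs / f_low), len(r) - 1)
    peak = max(range(lo, hi + 1), key=lambda i: r[i])
    if r[0] <= 0.0 or r[peak] <= 0.0:
        return None
    if 20.0 * math.log10(r[peak] / r[0]) < l_th_db:  # Eq. (11) not satisfied
        return None
    m = spline_second_derivs(r)
    # Search one sample either side of the peak on a finer lag grid.
    grid = [peak - 1 + j / upsample for j in range(2 * upsample + 1)]
    grid = [x for x in grid if 0.0 < x <= len(r) - 1]
    best = max(grid, key=lambda x: spline_eval(r, m, x))
    return fs / best  # pitch is the inverse of the interpolated lag time


fs = 10_000       # sampling frequency used in Section 3 (Hz)
f0 = 210.0        # hypothetical test pitch (Hz)
n_samples = 200   # 20.0 ms frame
# Stand-in scalogram: three rows rippling at the pitch period.
rows = [[math.cos(2.0 * math.pi * f0 * t / fs + phase)
         for t in range(n_samples)] for phase in (0.0, 1.0, 2.0)]
r = integrated_autocorr(rows)
pitch = estimate_pitch(r, fs, f_low=50.0, f_high=500.0, l_th_db=-24.0)
print(round(pitch, 1))
```

The zero padding makes the autocorrelation decay with lag, which is why the first pitch-period peak wins over its multiples in this sketch; the same decay also pulls the peak slightly toward shorter lags, so the estimate is close to, but not exactly, the test pitch.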

Fig. 5 Scalogram for pseudo vowel /a/.

Fig. 6 Autocorrelation along time (shift) axis in scalogram.

Fig. 7 Integrated autocorrelation for pseudo vowel without noise. [Plot of R(τ) versus lag time (0 to 20 ms), marking the search range for local peaks from 1/FH to 1/FL and the pitch candidate (max magnitude).]

Fig. 8 Integrated autocorrelation for pseudo vowel when a white noise is added at SNR = 0 dB. [Same axes and markings as Fig. 7.]

3.2. Pitch Detection for Chirp Signal
In this subsection, the basic characteristics of pitch detection are examined by using a harmonic chirp signal. The proposed method is compared with two conventional methods, the cepstrum method [7] and the modified correlation method [2]. The cepstrum method determines the pitch from cepstral peaks. The modified correlation method is based on the autocorrelation of the LPC residual signal. These methods perform well for clean speech; however, a long frame length, such as over 40 ms, is required to perform accurate pitch detection.

The conditions of the simulation using a harmonic chirp signal are as follows. The fundamental frequencies at the onset and offset times are 50 Hz and 600 Hz, respectively. The fundamental frequency sweeps exponentially to simulate a rapid change at higher frequencies. The number of harmonic components of the signal is set to four.

In the proposed method, the parameters $F_L$ and $F_H$ in Eq. (10) are set to 30 Hz and 700 Hz, respectively, because the frequency range of the harmonic chirp signal is from 50 Hz to 600 Hz. $L_{TH}$ is set to −24 dB from the result shown in the previous simulation. $N$ and $n$ are 20 and 3, respectively. The frame length is set to 20.0 ms.

In the conventional methods, the frequency range for pitch detection and the length of the frame obtained from the observed signal are set to the same conditions as those for the proposed method. A Hanning window is used for pre-processing. In the cepstrum method, zero padding is performed in order to improve the frequency resolution. The padded frame length is 4 times longer than the original. The cepstrum coefficients corresponding to frequencies over 1.5 kHz are flattened by using the mean value of the coefficients under 1.5 kHz, as shown in other papers [1,9,15].

Figure 9 illustrates the simulation results of each method. Figures 9(a), 9(b) and 9(c) are obtained by the cepstrum method, the modified correlation method and the proposed method, respectively. Circle marks denote the estimated pitch, and dotted lines show the fundamental frequency of the chirp signal. The vertical axis represents frequency on a logarithmic scale.

In the case of the cepstrum method shown in Fig. 9(a), pitches are not detected correctly when the frequency is under 400 Hz. It is also shown that the frequency resolution is insufficient at higher frequencies because of the sampling rate.

As for the modified correlation method in Fig. 9(b), four times the pitch frequency is estimated when the true pitch is under 147.8 Hz.

In the case of the proposed method shown in Fig. 9(c), an error of pitch detection occurs when the pitch is under 88.9 Hz.

By comparing these results, the proposed method performs better than the other two conventional methods. In particular, the improvement is apparent at lower frequencies. This is one of the benefits of the proposed method.

[Fig. 9 panels (a) CEP, (b) MOC and (c) HWT: frequency (Hz, ticks at 50, 100, 200 and 600) versus time (0 to 3 s), each with FRM_LEN = 20.0 ms]
Fig. 9 Result of pitch detection when a duration for a frame is 20 ms. (a) Cepstrum method (CEP). (b) Modified correlation method (MOC). (c) the proposed method (HWT).

3.3. Simulation for Speech Signals
In this subsection, the performance of pitch detection is evaluated by both gross pitch error and fine pitch error. After the simulation, the performance of pitch detection with various frame lengths is discussed. Moreover, robustness against noise is also discussed. Finally, processing time is examined.

3.3.1 Conditions for simulation and evaluation method
Speech signals, as input signals, are recorded in an anechoic room. A glottal vibration is recorded simultaneously as a reference pitch. The sampling rate is 48 kHz because the frequency resolution of the reference pitch estimation depends on the sampling rate. The reference pitch is obtained by autocorrelation of the glottal waveform. As doubled and half pitches are sometimes estimated, the final reference pitches are determined by human inspection. A Japanese female and a Japanese male utter 2 sentences and 7 words. The sentences are /ashitano tenki wa nani desu ka?/ and /fukuoka made no kippu ga hoshii no desuga/. The words are /ohayou/, /kumamoto/, /kagoshima/, /meiru/, /keikeitii/ (KKT), /tiikeiyuu/ (TKU), and /keieibii/ (KAB). The sampling rate of the speech signal is converted from 48 kHz to 10 kHz for simulation.

Performance is evaluated by both gross pitch error and fine pitch error. In the evaluation of gross pitch error, the absolute frequency difference between the estimated pitch and the reference pitch is used in some references [16,17]. However, the range of pitch frequency for pitch detection is wide, such as from 50 Hz to 500 Hz. Therefore, gross pitch error based on a ratio is used, as shown in some studies [1,18].

Let us assume that the reference and estimated pitch frequencies are $f_k$ and $\hat{f}_k$ for the $k$-th frame. Then, the ratio of relative error is defined as follows,

$$\hat{e}(k) = \frac{|f_k - \hat{f}_k|}{f_k}, \tag{12}$$

where $k$ denotes the frame index. Error frames are defined as frames for which the ratio of relative error $\hat{e}(k)$ is greater than 0.05, i.e. 5%. Gross pitch error (GPE) is calculated as follows,

$$\mathrm{GPE} = \frac{N_{error}}{N_{all}} \times 100 \ (\%), \tag{13}$$

where $N_{error}$ and $N_{all}$ are the number of error frames and the number of all frames, respectively.

In the evaluation of fine pitch error, the standard deviation and the mean error are calculated over frames for which the ratio of relative error $\hat{e}(k)$ is less than 0.05, as shown in some studies [19,20].

3.3.2 Length for analysis frame
This subsection examines how the frame length affects the errors of pitch analysis. The frame length is varied as 15.0, 20.0, 25.6, 30.0, and 51.2 ms. In the simulations, white and pink noise are added at SNR = 5 dB. In preliminary experiments with several values of SNR, the performance of pitch detection was degraded extremely below SNR = 5 dB. Therefore, this value is selected as the lowest SNR in this paper. The SNR is calculated over the whole duration of the signal. The proposed method is compared with the cepstrum and the modified correlation methods. Figures 10(a) and 10(b) show the results for female and male speech, respectively. Filled and open marks denote the results for white noise and pink noise, respectively. Circles, triangles and squares show the results of the proposed (HWT), modified correlation (MOC) and cepstrum (CEP) methods, respectively. The vertical and horizontal axes are gross pitch error and frame length, respectively. According to the results shown in both (a) and (b), the performance of the proposed method is better than those of the modified correlation method and the cepstrum method. In all cases, the worst performance is shown when the frame length is 15.0 ms. The reason for the degradation of the performance is that the power level of the harmonic components in the lower frequency range cannot be obtained sufficiently as the frame length becomes shorter. Moreover, the gross pitch error increases only slightly as the frame length is shortened down to 20.0 ms, even though the SNR is 5 dB.

With respect to the type of noise, the performance of the proposed method is quite a bit better than that of the conventional methods, except in the case of pink noise for female speech. The performance for pink noise is the worst because the power level of pink noise in the range of the pitch frequency is higher than that in other frequency ranges.

According to these results, the proposed method shows the best performance among the three methods. It is also shown that the performance of the proposed method is not degraded when the frame length is over 20.0 ms.

[Fig. 10 panels: gross pitch error (%, 0 to 100) versus frame length (15.0, 20.0, 25.6, 30.0 and 51.2 ms) for HWT, MOC and CEP with white and pink noise. (a) GPE vs. frame length in case of female speech. (b) GPE vs. frame length in case of male speech.]
Fig. 10 Results of pitch detection evaluated by Gross pitch error (GPE). (a) and (b) are in case of female and male, respectively. Circle, triangle and square represent harmonic wavelet transform (HWT), modified correlation (MOC) and cepstrum (CEP) methods, respectively. Filled and open marks denote results for white noise and pink noise, respectively.

3.3.3 Performance against noise
Figures 11(a) and 11(b) show the performance of pitch detection in the case of white and pink noise, respectively, when the frame length obtained from the observed signal is 20 ms. Three methods, the proposed (HWT), the modified correlation (MOC) and the cepstrum (CEP), are compared. Their results are denoted by circle, triangle and square marks, respectively. Filled and open marks denote the results for females and males, respectively. The vertical and horizontal axes are gross pitch error and SNR, respectively. In both Figs. 11(a) and 11(b), the performance of each method is degraded as the SNR becomes smaller, except for the result of the cepstrum method for male speech. At each SNR, the performance of the proposed method is better than that of the other two conventional methods.

[Fig. 11 panels: gross pitch error (%, 0 to 100) versus SNR (5, 10, 15, 20, 30 and 40 dB) for HWT, MOC and CEP with female and male speech. (a) GPE vs. SNR in case of white noise. (b) GPE vs. SNR in case of pink noise.]
Fig. 11 Performance of pitch detection evaluated by Gross pitch error (GPE). (a) and (b) are in case of white noise and pink noise, respectively. Circle, triangle and square represent harmonic wavelet transform (HWT) method, modified correlation (MOC) method and cepstrum (CEP) method, respectively. Filled and open marks denote results for female and male, respectively.

Furthermore, the performance of the proposed method is examined by gross pitch error in several frequency bands when the frame length obtained from the observed signal is 20.0 ms. Figure 12 shows the gross pitch error at each frequency when the SNR is 15 dB and 30 dB. Gross pitch error is calculated at each center frequency from 100 Hz to 300 Hz in 50 Hz steps. The bandwidth at each frequency is 50 Hz. In Fig. 12, filled and open marks are for female and male speech, respectively. Circle and triangle marks represent the results of the proposed (HWT) and modified correlation (MOC) methods, respectively. Solid and dotted lines denote the cases of SNR = 15 dB and 30 dB, respectively.

Performance cannot be compared between frequency bands because the SNR in each frequency band is not the same. Performance at 100 Hz is not better than that

at other frequencies at each SNR in Fig. 12 because the segmental SNR at 100 Hz is the lowest among the frequency bands. In every case except 300 Hz at SNR = 30 dB, the performance of the proposed method is better than that of the modified correlation method at both SNR = 15 dB and 30 dB. Moreover, it is confirmed that the difference in performance between the proposed method and the modified correlation method is greater at SNR = 15 dB than at SNR = 30 dB.

[Figure 12: gross pitch error (%) versus pitch frequency (Hz), from 100 Hz to 300 Hz, for HWT and MOC with female and male speech at SNR = 15 dB and 30 dB.]

Fig. 12 Gross pitch error (GPE) at each center frequency from 100 Hz to 300 Hz in 50 Hz steps for female and male speech at SNR = 15 dB and 30 dB. Filled and open marks are for female and male speech, respectively. Circle and triangle marks represent the proposed method and the modified correlation method, respectively.

Table 1 Standard deviation of fine pitch error (Hz) when the frame length of the observed signal is 20.0 ms.

(a) female
SNR (dB)   ∞      30     20     15     10     5
HWT        3.03   3.15   3.22   3.30   4.10   4.71
MOC        2.85   3.18   3.35   3.36   3.95   4.91

(b) male
SNR (dB)   ∞      30     20     15     10     5
HWT        2.48   2.39   2.45   2.43   2.55   2.68
MOC        1.94   1.96   2.05   2.22   2.07   1.69

Table 2 Mean value of fine pitch error (Hz) when the frame length of the observed signal is 20.0 ms.

(a) female
SNR (dB)   ∞      30     20     15     10     5
HWT        0:57   0:56   0:50   0:64   0:40   0:30
MOC        0:84   0:88   0:72   0.31   1.45   3.54

(b) male
SNR (dB)   ∞      30     20     15     10     5
HWT        0.58   0.60   0.52   0.50   0.49   0.63
MOC        0:29   0:14   0:10   0.18   0.74   0.46
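The error measures reported in Fig. 12 and in Tables 1 and 2 can be illustrated with a short sketch. It is not the authors' implementation. The text states only that fine pitch error statistics are computed over frames whose relative error ê(k) is below 0.05; the sketch assumes that ê(k) is the absolute difference between estimated and reference pitch divided by the reference pitch, that every remaining frame counts as a gross error, that a frame is assigned to the band containing its reference pitch, and that the population standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, pstdev

FINE_THRESHOLD = 0.05  # relative-error bound for fine pitch error (from the text)


def relative_error(estimated_hz, reference_hz):
    """Signed relative error of one frame (assumed definition)."""
    return (estimated_hz - reference_hz) / reference_hz


def pitch_error_stats(estimated, reference):
    """Return (GPE in %, mean fine error in Hz, std of fine error in Hz)."""
    gross = 0
    fine_errors = []
    for est, ref in zip(estimated, reference):
        if abs(relative_error(est, ref)) < FINE_THRESHOLD:
            fine_errors.append(est - ref)  # signed error in Hz
        else:
            gross += 1  # assumption: all other frames are gross errors
    gpe = 100.0 * gross / len(reference)
    if not fine_errors:
        return gpe, None, None
    # Assumption: population standard deviation (pstdev), not sample.
    return gpe, mean(fine_errors), pstdev(fine_errors)


def gpe_by_band(estimated, reference,
                centers=(100, 150, 200, 250, 300), width=50):
    """GPE per band; bands with no frames are left out of the result."""
    result = {}
    for center in centers:
        pairs = [(e, r) for e, r in zip(estimated, reference)
                 if center - width / 2 <= r < center + width / 2]
        if pairs:
            est, ref = zip(*pairs)
            result[center] = pitch_error_stats(est, ref)[0]
    return result


# Toy data: four frames, one of which is an octave error.
reference = [100, 100, 200, 200]
estimated = [101, 99, 200, 400]
print(pitch_error_stats(estimated, reference))
print(gpe_by_band(estimated, reference))
```

In the toy data, one frame out of four is a gross error, so the overall GPE is 25 %, while the per-band view places that error entirely in the 200 Hz band. This is the kind of breakdown that Fig. 12 provides for the real speech data.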
In addition, the difference in performance between female and male speech in Figs. 11(a) and 11(b) can be explained as follows. The performance in Figs. 11(a) and 11(b) can be regarded as the sum of the performance in each frequency band in Fig. 12. Therefore, the performance for female speech is better than that for male speech because the pitch frequency of the female speech is higher than 187.6 Hz in this simulation.

Next, fine pitch error is examined. The simulation conditions for fine pitch error are the same as those for gross pitch error. The standard deviation and mean of the fine pitch error are calculated over frames in which the relative error ê(k) is less than 0.05, as in some studies [19,20]. Table 1 shows the standard deviation of the fine pitch error, and Table 2 shows its mean. In terms of standard deviation, the proposed method is not always better than the modified correlation method at each SNR for female speech, and for male speech the modified correlation method has the smaller standard deviation at every SNR. In terms of the mean value, the proposed method is likewise not always better than the conventional method.

According to the results, the performance of the proposed method is better than that of the conventional pitch detection methods with respect to gross pitch error.

3.3.4 Performance of processing time
The processing time is examined. The hardware conditions are as follows: the CPU is a Pentium III (550 MHz) and the memory size is 128 MB. As for software, the operating system is Linux (kernel 2.2.18) and gcc (version egcs-2.91.66) is used as the compiler. Performance is measured by off-line processing. A chirp signal with a duration of 2.0 s is used as the input signal. As a result, the processing time is 120 s for the modified correlation method. For the proposed method, the processing time is 8.6 times longer than that of the modified correlation method.

4. CONCLUSION
A pitch detection method based on the continuous wavelet transform for a harmonic signal is proposed. The basic characteristics of pitch detection are confirmed by using a harmonic chirp signal. In addition, a simulation of pitch detection with added noise is performed. According to the results of all simulations, it is confirmed that the proposed method has advantages with respect to frame length and robustness against noise. However, the proposed method requires a longer processing time than the conventional method. We plan to use temporal information and human auditory phenomena in order to improve performance. Furthermore, we will discuss adapting this algorithm to parallel processing in order to reduce the processing time.
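As a rough illustration of the processing-time figures in Section 3.3.4, the sketch below converts them into real-time factors. Only the 2.0 s signal duration, the 120 s processing time and the factor of 8.6 come from the text. Reading "8.6 times longer" as 8.6 times the modified correlation time, and assuming an ideal linear speed-up from parallel processing, are assumptions made for this example; the names are hypothetical.

```python
import math

SIGNAL_SECONDS = 2.0   # duration of the chirp input (from the text)
MOC_SECONDS = 120.0    # modified correlation processing time (from the text)
HWT_RATIO = 8.6        # proposed method relative to MOC (from the text)


def real_time_factor(processing_seconds, signal_seconds):
    """Seconds of computation needed per second of input signal."""
    return processing_seconds / signal_seconds


# Assumption: "8.6 times longer" is read as 8.6 x the MOC time.
hwt_seconds = MOC_SECONDS * HWT_RATIO

# Assumption: ideal linear speed-up with no communication overhead.
workers_for_parity = math.ceil(HWT_RATIO)

print(f"MOC: {MOC_SECONDS:.0f} s, "
      f"{real_time_factor(MOC_SECONDS, SIGNAL_SECONDS):.0f}x real time")
print(f"HWT: {hwt_seconds:.0f} s, "
      f"{real_time_factor(hwt_seconds, SIGNAL_SECONDS):.0f}x real time")
print(f"Workers to match MOC under ideal scaling: {workers_for_parity}")
```

Under these assumptions the proposed method would need about 1032 s for the 2.0 s input, and at least nine ideally scaling workers would be required just to match the processing time of the modified correlation method.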


ACKNOWLEDGMENTS
We would like to thank Prof. Adrianus J. M. Houtsma, IPO at the Eindhoven University of Technology, for comments, suggestions, and review.
Part of this research was supported by the Ono Acoustics Research Fund (2001).

REFERENCES
[1] T. Takagi, N. Seiyama and E. Miyasaka, "A method for pitch extraction of speech signals using autocorrelation functions through multiple window-lengths," IEICE A, J80-A, 1341–1350 (1997).
[2] F. Itakura and S. Saito, "Speech information compression based on the maximum likelihood spectral estimation," J. Acoust. Soc. Jpn. (J), 9, 463–472 (1971).
[3] H. Kawahara, "Speech representation and transformation using adaptive interpolation of weighted spectrum: VOCODER revisited," Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vol. 2, pp. 1303–1306 (1997).
[4] Y. Chisaki, T. Usagawa and M. Ebata, "Speech enhancements in multiple speaker condition using adaptive comb filter," Proc. Inter-Noise 95, Vol. II, pp. 1125–1128 (1995).
[5] M. Nakai, H. Shimodaira and S. Sagayama, "Prosodic phrase segmentation based on pitch-pattern clustering," Trans. IEICE, J77-A, 206–214 (1993).
[6] T. Yoshimura, S. Hayamizu, H. Omura and K. Tanaka, "Pitch pattern clustering of user utterances in human-machine dialogue," ICSLP '96, Vol. 2, pp. 837–840 (1996).
[7] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Am., 22, 293–309 (1967).
[8] T. Abe, T. Kobayashi and S. Imai, "Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency," Proc. ICSLP '96, Vol. 2, pp. 1277–1280 (1996).
[9] Y. Atake, T. Irino, H. Kawahara, J. Lu, S. Nakamura and K. Shikano, "Robust estimation of fundamental frequency using instantaneous frequencies of harmonic components," IEICE D-II, J83-D-II, 2077–2086 (2000).
[10] R. Abe and N. Kambayashi, "Frequency estimation based on harmonic product spectrum method," J. Acoust. Soc. Jpn. (J), 53, 691–697 (1997).
[11] R. Abe and N. Kambayashi, "An algorithm for fundamental frequency estimation using the harmonic spectral model," J. Acoust. Soc. Jpn. (J), 54, 296–304 (1998).
[12] Y. Chisaki, T. Usagawa and M. Ebata, "Spectral analysis for inharmonic signal by signal specific analyzing wavelet," Tech. Rep. IEICE, EA98-95, pp. 29–36 (1998).
[13] S. Kato and J. Miwa, "Pitch detection using moving average and band-limitation in cepstrum method and its application," Tech. Rep. IEICE, SP94-95, pp. 29–36 (1995).
[14] R. Meddis and M. J. Hewitt, "Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification," J. Acoust. Soc. Am., 89, 2866–2882 (1991).
[15] R. Meddis and M. J. Hewitt, "Modeling the identification of concurrent vowels with different fundamental frequencies," J. Acoust. Soc. Am., 91, 233–245 (1992).
[16] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg and C. A. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoust. Speech Signal Process., ASSP-24, 399–418 (1976).
[17] H. Ohmura and K. Tanaka, "Fine pitch contour extraction by voice fundamental wave filtering method," J. Acoust. Soc. Jpn. (J), 51, 509–518 (1995).
[18] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg and C. A. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoust. Speech Signal Process., ASSP-24, 399–418 (1976).
[19] N. Kunieda, T. Shimamura and J. Suzuki, "Pitch extraction by using autocorrelation function on the log spectrum," IEICE A, J80-A, 435–443 (1997).
[20] A. Sasou and S. Nakamura, "A pitch extraction method using wavelet transform," IEICE A, J80-A, 1848–1856 (1997).
