A Pitch Detection Method Based On Continuous Wavelet Transform For Harmonic Signal
24, 1 (2003)
PAPER
Abstract: Some conventional pitch detection methods require a frame that is too long to track a
rapid transient of pitch. Wavelet-based pitch detection methods require only a few pitch periods
per frame, but they are not robust enough against noise. This paper proposes a new pitch detection
method that works properly in noisy environments even when the frame duration is short. The
proposed method consists of a power level detector, a signal analyzer, an autocorrelator, a
voiced-unvoiced detector and a lag time interpolator. The signal analyzer is based on the
continuous wavelet transform using a harmonic analyzing wavelet, which gives more information
about a pitch in a scalogram. Pitch detection is simulated for a harmonic chirp signal and for
speech signals, and the performance is compared with two conventional pitch detection methods,
the cepstrum method and the modified correlation method. In noisy environments, the proposed
method detects pitch better than the two conventional methods. In particular, the largest
improvement is obtained for male voices.
Acoust. Sci. & Tech. 24, 1 (2003)
of pitch detection is degraded under noisy environments because a peak of autocorrelation
corresponding to a pitch is not enhanced against other peaks corresponding to noise. There are
various techniques to decrease the effects of noise in autocorrelation, e.g. low-pass filtering in
the cepstrum method and the modified correlation method. However, the robustness of those
techniques against noise is insufficient when the frame length becomes shorter. In order to obtain
further robustness against noise, we consider using both the harmonic components of speech and the
temporal information corresponding to a pitch in the scalogram obtained by a continuous wavelet
transform. The continuous wavelet transform is a suitable transform because much information
about a pitch can be obtained from a scalogram. It is expected that information about a pitch
along both the time and frequency axes can be obtained by using the following two major
flexibilities for speech analysis. The first is flexibility in arranging the resolution of both
time and frequency. The second is flexible selection of the transform kernel, which is called an
analyzing wavelet or a mother wavelet. The analyzing wavelet plays an important role in deciding
the characteristics of the wavelet transform. The only restriction on an analyzing wavelet is the
admissibility condition. Thus, a large variety of mathematical functions can be selected as an
analyzing wavelet when the inverse wavelet transform is not needed. In order to extract pitch
information, a function based on a harmonic structure is selected as the analyzing wavelet because
our target signal, a speech signal, has a harmonic structure. In a pitch detection method based on
autocorrelation, pitch candidates are selected from local peaks in the autocorrelation. As studies
on pitch detection of music signals [10,11] show, relative power enhancement of the local peak
corresponding to a pitch is important even though each harmonic component of speech has a
frequency perturbation. The power of each high-order harmonic component is summed at the pitch
frequency in the scalogram because the frequency characteristics of the harmonic analyzing wavelet
are the same as those of a comb filter. Thus, it is expected that the relative power level at the
fundamental frequency obtained by the wavelet transform using the harmonic analyzing wavelet is
higher than that obtained by the wavelet transform using a Gabor function [3,9].

In addition, another characteristic of the continuous wavelet transform based on the harmonic
analyzing wavelet gives one more benefit for robustness against noise. The continuous wavelet
transform produces series of local peaks spaced by a pitch period along the time axis in a
scalogram. Therefore, gathering pitch period information from these pitch interval series enhances
the power of a local peak within a short time. Moreover, the series of local peaks along the time
axis contributes to the detection of a pitch when the power level at the fundamental frequency is
insufficient; namely, for a missing fundamental signal, such as a speech signal over
telecommunication lines.

In this paper, a pitch detection method based on the continuous wavelet transform is proposed. The
proposed method consists of a power level detector, a signal analyzer, an autocorrelator, a
voiced-unvoiced detector and a lag time interpolator. In the signal analyzer block, the continuous
wavelet transform is used to obtain both pitch detection over a short duration and robustness
against noise. Owing to both the relative enhancement of the power level and the series of local
peaks, pitch detection over a short duration and robustness against noise are expected
simultaneously.

First, the procedure of the proposed method is described. In particular, the advantages of the
continuous wavelet transform using a harmonic analyzing wavelet are detailed. Secondly, parameters
for the proposed scheme are discussed through computer simulation. The proposed method is compared
with the conventional modified correlation method and the cepstrum method in all simulations. In
order to confirm the performance of pitch detection, a harmonic chirp signal is used as an input
signal. Moreover, simulation results on robustness against noise are shown for speech with added
white or pink noise. Finally, the performance of the proposed method is summarized from those
results in comparison with the modified correlation method and the cepstrum method.

2. PROCEDURE OF PITCH DETECTION
In this section, the procedure of pitch detection is described. Figure 1 shows a block diagram of
the proposed method. The pitch detection method consists of 5 blocks: a power level detector, a
signal analyzer, an autocorrelator, a voiced-unvoiced detector and a lag time interpolator. Pitch
detection is performed frame by frame with a frame shift of $T_{shift}$.

STEP 1:
In the first stage, the power level detector works as a switch that turns the pitch detection
process on and off. Pitch detection is performed when the following conditions are satisfied,

$$10 \log(P) > P_{TH}, \qquad P > 0, \tag{1}$$

where $P$ is the average power level with respect to time in a frame, and $P_{TH}$ is a threshold
derived from the average power level of the background noise.
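The power gate of Eq. (1) can be illustrated with a minimal sketch. It assumes that the logarithm is base 10 (so the level is in dB) and that P is the mean squared sample value of the frame; neither is stated explicitly in the text. The function names and the `margin_db` parameter are hypothetical, since the paper does not say how the threshold is derived from the background noise.

```python
import math


def average_power(frame):
    """Mean squared amplitude of one frame (assumed definition of P)."""
    return sum(x * x for x in frame) / len(frame)


def threshold_from_noise(noise_frame, margin_db=6.0):
    """Hypothetical rule: noise level in dB plus a safety margin (assumption)."""
    return 10.0 * math.log10(average_power(noise_frame)) + margin_db


def frame_is_active(frame, p_th_db):
    """Return True when the frame satisfies Eq. (1): 10 log(P) > P_TH and P > 0."""
    if not frame:
        return False
    p = average_power(frame)
    if p <= 0.0:  # the condition P > 0 also guards the logarithm
        return False
    return 10.0 * math.log10(p) > p_th_db
```

With a threshold of -20 dB, a frame of constant amplitude 0.5 (about -6 dB) passes the gate, while an all-zero frame or a frame at about -60 dB is skipped.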
Y. CHISAKI et al.: PITCH DETECTION BASED ON CONTINUOUS WAVELET TRANSFORM
[Figure 1 residue: flow chart. A frame is taken with shift $T_{shift}$. If $10 \log(P) > P_{TH}$ does
not hold, the process skips to the next frame. Otherwise the steps are: wavelet transform using the
harmonic analyzing wavelet, $W(t, f)$; autocorrelation in the scalogram, $R(\tau, f)$; summation of
autocorrelation, $R(\tau)$; maximum value search in the autocorrelation domain; test of
$20 \log(R(\tau)/R(0))$ against $L_{TH}$, after which the frame is either declared invalid or passed
to interpolation of the lag time with a spline function, giving the estimated pitch $f_0$.]
Fig. 1 Block diagram of pitch detection.

[Figure 2 residue: two panels of magnitude (dB, 0 to -90) versus relative frequency (0 to 0.5).]
Fig. 2 Example of frequency response. (a): Gabor function. (b): Harmonic Analyzing Wavelet
(Proposed).

[Figure residue, caption not recovered: power versus frequency at f0, 2f0, 3f0, ..., (n-1)f0, nf0,
with labels Ph0, Pg0 and background noise.]
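The autocorrelation stages of the block diagram in Fig. 1 can be sketched as follows. This is only an illustration under stated assumptions: the scalogram is taken as a list of rows, one row of magnitude samples per analysis frequency; the comparison against the threshold is assumed to be "greater than"; and a biased autocorrelation is used. The wavelet transform itself and the spline interpolation of the lag time are not reproduced. The names `autocorr` and `detect_lag` are hypothetical.

```python
import math


def autocorr(x, max_lag):
    """Biased autocorrelation of one scalogram row for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def detect_lag(scalogram, min_lag, max_lag, l_th_db):
    """Sum the row autocorrelations, pick the largest peak, apply the level test.

    Returns the lag in samples, or None when the frame is judged invalid.
    """
    summed = [0.0] * (max_lag + 1)
    for row in scalogram:  # R(tau, f) for each frequency f
        for lag, value in enumerate(autocorr(row, max_lag)):
            summed[lag] += value  # R(tau) = sum over f of R(tau, f)
    if summed[0] <= 0.0:
        return None
    best = max(range(min_lag, max_lag + 1), key=lambda lag: summed[lag])
    if summed[best] <= 0.0:
        return None
    level_db = 20.0 * math.log10(summed[best] / summed[0])
    # Assumed direction of the test: valid when the level exceeds L_TH.
    return best if level_db > l_th_db else None
```

For a sampling rate fs, the integer lag returned here corresponds to a pitch estimate of fs / lag; the paper refines the lag with a spline before this conversion.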
for the ripple, and the error ratio of pitch estimation caused by phase mismatch is less than 2%
when the duration of the analyzing wavelet is longer than a duration corresponding to 3 fundamental
periods [12].

[Figure residue, caption not recovered: $W(t, f)$ with marks at $x = -N/(2f)$, $x = 0$ and
$x = N/(2f)$.]
[Figure 9 residue: panel (c), HWT, frame length 20.0 ms; frequency (Hz) versus time (s), 0 to 3 s.]
Fig. 9 Result of pitch detection when a duration for a frame is 20 ms. (a) Cepstrum method (CEP).
(b) Modified correlation method (MOC). (c) the proposed method (HWT).

88.9 Hz.

By comparing these results, the proposed method performs better than the other two conventional
methods. In particular, the improvement is apparent at lower frequencies. This is one of the
benefits of the proposed method.

3.3. Simulation for Speech Signals
In this subsection, the performance of pitch detection is evaluated by both gross pitch error and
fine pitch error. After the simulation, the performance of pitch detection with various frame
lengths is discussed. Moreover, robustness against noise is also discussed. Finally, processing
time is examined.

3.3.1 Conditions for simulation and evaluation method
Speech signals, used as input signals, are recorded in an anechoic room. A glottal vibration is
recorded simultaneously as a reference pitch. The sampling rate is 48 kHz because the frequency
resolution of a standard pitch estimation depends on the sampling rate. A reference pitch is
obtained by autocorrelation of the glottal waveform. As doubled and halved pitches are sometimes
estimated, the final reference pitches are determined by human inspection. A Japanese female and a
Japanese male utter 2 sentences and 7 words. The sentences are /ashitano tenki wa nani desu ka?/
and /fukuoka made no kippu ga hoshii no desuga/. The words are /ohayou/, /kumamoto/, /kagoshima/,
/meiru/, /keikeitii/

$$\hat{e}(k) = \frac{|f_k - \hat{f}_k|}{f_k}, \tag{12}$$

where $k$ denotes a frame index. A frame is defined as an error frame when the relative error
$\hat{e}(k)$ is greater than 0.05, i.e. 5%. Gross pitch error (GPE) is calculated as follows,

$$\mathrm{GPE} = \frac{N_{error}}{N_{all}} \times 100 \ (\%), \tag{13}$$

where $N_{error}$ and $N_{all}$ are the number of error frames and the number of all frames,
respectively.

In the evaluation of fine pitch error, the standard deviation and the mean error are calculated
over the frames in which the relative error $\hat{e}(k)$ is less than 0.05, as in some studies
[19,20].

3.3.2 Length for analysis frame
This subsection examines how the frame length affects errors in pitch analysis. The frame length is
varied as 15.0, 20.0, 25.6, 30.0, and 51.2 ms. In the simulations, white and pink noise are added at
SNR = 5 dB. In preliminary experiments with several values of SNR, the performance of pitch
detection is degraded extremely under SNR = 5 dB. Therefore, this value is selected as the lowest
SNR in this paper. The SNR is calculated over the whole duration of the signal. The proposed method
is compared with the cepstrum and the modified correlation methods. Figures 10(a) and 10(b) show
the results for female and male speech, respectively. Filled and open marks denote the results for
white noise and pink noise, respectively. Circles, triangles and squares show the results of the
proposed (HWT), modified correlation (MOC) and cepstrum (CEP) methods, respectively. The vertical
and horizontal axes are gross pitch error and frame length, respectively. According to the results
shown in both (a) and (b), the performance of the proposed method is better than that of both the
modified correlation method and the cepstrum method. In all cases, the worst performance is shown
when the frame length is 15.0 ms. The reason for degradation of the performance is
[Figure 10 residue: gross pitch error [%] versus frame length [ms] (15.0, 20.0, 25.6, 30.0, 51.2)
for HWT, MOC and CEP with white and pink noise. Panel title: (a) GPE vs. frame length in case of
female speech.]
Fig. 10 Results of pitch detection evaluated by Gross pitch error (GPE). (a) and (b) are in case of
female and male, respectively. Circle, triangle and square represent harmonic wavelet transform
(HWT), modified correlation (MOC) and cepstrum (CEP) methods, respectively. Filled and open marks
denote results for white noise and pink noise, respectively.

[Figure 11 residue: gross pitch error [%] versus SNR [dB]. Panel title: (a) GPE vs. SNR in case of
white noise.]
Fig. 11 Performance of pitch detection evaluated by Gross pitch error (GPE). (a) and (b) are in
case of white noise and pink noise, respectively. Circle, triangle and square represent harmonic
wavelet transform (HWT) method, modified correlation (MOC) method and cepstrum (CEP) method,
respectively. Filled and open marks denote results for female and male, respectively.
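The error measures of Eqs. (12) and (13), which Figs. 10 and 11 report, can be sketched as below. Several points are assumptions rather than statements from the paper: f_k is taken as the reference pitch and the other value as the estimate, the fine pitch error is taken as the signed difference (estimate minus reference) in Hz, and the population standard deviation is used. All function names are hypothetical.

```python
import statistics


def relative_errors(f_ref, f_est):
    """Relative error of Eq. (12) for each frame k (f_ref assumed to be f_k)."""
    return [abs(fr - fe) / fr for fr, fe in zip(f_ref, f_est)]


def gross_pitch_error(f_ref, f_est, threshold=0.05):
    """GPE of Eq. (13): percentage of frames whose relative error exceeds 5%."""
    errors = relative_errors(f_ref, f_est)
    n_error = sum(1 for e in errors if e > threshold)
    return 100.0 * n_error / len(errors)


def fine_pitch_error(f_ref, f_est, threshold=0.05):
    """Mean and standard deviation (Hz) over frames with relative error < 5%.

    Returns None when no frame qualifies. The sign convention and the use of
    the population standard deviation are assumptions.
    """
    diffs = [fe - fr for fr, fe in zip(f_ref, f_est)
             if abs(fr - fe) / fr < threshold]
    if not diffs:
        return None
    return statistics.mean(diffs), statistics.pstdev(diffs)
```

For reference pitches of 100, 100, 200, 200 Hz and estimates of 101, 110, 200, 100 Hz, two of the four frames exceed 5%, so the GPE is 50%, and the fine pitch error is computed from the remaining two frames only.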
[Figure 12 residue: gross pitch error versus pitch frequency (Hz), 100 to 300 Hz.]
Fig. 12 Gross pitch error (GPE) at each center frequency from 100 Hz to 300 Hz with 50 Hz step for
female and male speech in case of SNR = 15 dB and 30 dB. Filled and open marks are in case of
female and male, respectively. Circle and triangle marks represent the proposed method and modified
correlation method, respectively.

at other frequencies at each SNR in Fig. 12 because the segmental SNR at 100 Hz is the lowest among
the SNRs of the frequency bands. In each case except for 300 Hz at SNR = 30 dB, the performance of
the proposed method is better than that of the modified correlation method at both SNR = 15 dB and
30 dB. Moreover, it is confirmed that the difference in performance between the proposed method and
the modified correlation method at SNR = 15 dB is greater than that at SNR = 30 dB.

In addition, the reason for the difference in performance between female and male speech in
Figs. 11(a) and 11(b) can be considered as follows. The performance in Figs. 11(a) and 11(b) can be
regarded as the sum of the performance at each frequency band in Fig. 12. Therefore, the
performance for female speech is better than that for male speech because the pitch frequency for
female speech is higher than 187.6 Hz in this simulation.

Next, fine pitch error is examined. The simulation conditions for fine pitch error are the same as
those for gross pitch error. The standard deviation and the mean of the fine pitch error are
calculated over the frames in which the relative error $\hat{e}(k)$ is less than 0.05, as in some
studies [19,20]. Table 1 shows the standard deviation of the fine pitch error. The mean of the fine
pitch error is shown in Table 2. With respect to standard deviation, the performance of the
proposed method is not always better than that of the modified correlation method at each SNR,
except for male. From the point of view of the mean value, it is also shown that the performance of
the proposed method is not always better than that of the conventional method.

According to the results, the performance of the proposed method is better than that of the
conventional pitch detection methods with respect to gross pitch error.

Table 1 Standard deviation of fine pitch error.

(a) female
SNR (dB)    ∞       30      20      15      10      5
HWT         3.03    3.15    3.22    3.30    4.10    4.71
MOC         2.85    3.18    3.35    3.36    3.95    4.91

(b) male
SNR (dB)    ∞       30      20      15      10      5
HWT         2.48    2.39    2.45    2.43    2.55    2.68
MOC         1.94    1.96    2.05    2.22    2.07    1.69

Table 2 Mean value of fine pitch error when a frame length obtained from observed signal is
20.0 ms (Hz).

(a) female
SNR (dB)    ∞       30      20      15      10      5
HWT         -0.57   -0.56   -0.50   -0.64   -0.40   -0.30
MOC         -0.84   -0.88   -0.72   0.31    1.45    3.54

(b) male
SNR (dB)    ∞       30      20      15      10      5
HWT         0.58    0.60    0.52    0.50    0.49    0.63
MOC         -0.29   -0.14   -0.10   0.18    0.74    0.46

3.3.4 Performance of processing time
Processing time is examined. The hardware conditions are as follows. The CPU and the memory size
are Pentium III (550 MHz) and 128 MB, respectively. As for software, the operating system is Linux
(kernel 2.2.18) and gcc (version egcs-2.91.66) is used as the compiler. Performance is measured by
off-line processing. A chirp signal with a duration of 2.0 s is used as the input signal. As a
result, the processing time is 120 s for the modified correlation method. The processing time of
the proposed method is 8.6 times longer than that of the modified correlation method.

4. CONCLUSION
A pitch detection method based on the continuous wavelet transform for a harmonic signal is
proposed. Basic characteristics of pitch detection are confirmed by using a harmonic chirp signal.
In addition, a simulation of pitch detection with added noise is performed. According to the
results of all simulations, it is confirmed that the proposed method has advantages with respect to
frame length and robustness against noise. On the other hand, the proposed method requires a longer
processing time than the conventional method. We plan to use temporal information and human
auditory phenomena in order to improve performance. Furthermore, we will discuss applying this
algorithm to parallel processing in order to reduce the processing time.