
Pitch Estimation Using A Full/Multi-Band Approaches: Mikhail Tadjikov, Arya Ahmadi

This document discusses pitch estimation techniques including peak-to-peak detection, cepstral analysis, and multi-band detection. It compares these methods on a dataset of speech samples from male and female speakers that include clean speech and speech mixed with noise. The techniques are implemented and their pitch estimation accuracy is evaluated against ground truths and an industry standard tool. Voice detection is also briefly examined to determine voiced and unvoiced segments before pitch estimation. Experimental results are analyzed and conclusions are drawn regarding the performance of the different pitch estimation methods.

Copyright
© Attribution Non-Commercial (BY-NC)
Pitch Estimation Using a Full/Multi-Band Approaches
Mikhail Tadjikov, Arya Ahmadi

Abstract—Pitch estimation is an important and complex problem that has been evolving for decades. Noise and varying individual characteristics are common pitfalls even in today's pitch estimation algorithms. We discuss the methodology and results for Peak-to-Peak, Cepstral Analysis and Multi-band detection. Embedded in pitch detection is another well-known problem - voicing detection. This problem is examined only briefly to keep the focus of the report on pitch estimation. We conclude by comparing experimental results with ground truths and industry standard tools.

Index Terms—Pitch, detection, Cepstrum, voice, frequency, multi-band.

I. INTRODUCTION

In speech, the pitch conveys information about voicing, prosody, and speaker identity [1]. Pitch estimation is an ever evolving topic in speech and signal processing. Issues including, but not limited to, physiology, language differences, performance degradation due to noise, and discontinuities between voiced and unvoiced segments all add varying levels of complexity to accurate pitch estimation. Low frequency noise is a particular pitfall of many algorithms, as much of the relevant information in pitch estimation is below 1 kHz. Various methods for pitch estimation have been developed in both the time and frequency domains. In the time domain, a relatively elementary method for pitch detection estimates the period of the time domain signal by counting the number of zero crossings; however, this method is extremely sensitive to noise and to the changing period inherent in speech. Another example of time domain analysis is simplified inverse filter tracking (SIFT); SIFT utilizes a pre-emphasized signal as a part of Linear Predictive Coding (LPC) to improve the accuracy of pitch estimation. The Average Magnitude Difference Function (AMDF) and the Average Squared Mean Difference Function (ASMDF) are examples of more elaborate pitch estimation algorithms [6], [2], [7]. On the other hand, the frequency domain can also be utilized to estimate pitch. Cepstral analysis, a non-linear technique that employs the Fourier Transform (FT) and was first described by Bogert, et al. [3] in 1963, is often used in speech recognition. It is important to note that the Cepstral analysis is performed in the frequency domain as it is not completely analogous. At its foundation, Cepstral analysis transforms a convolution of two signals into the superposition of the Cepstra [2], [7]. Physically, this represents the segregation of the low frequency excitation and source information produced by the vocal cords. In our work we compare various methods for pitch estimation given a training set of 12 signals. The development sets included two male and two female sets, each of which included a clean sample, a sample with 10 dB of white noise, and a sample with 10 dB babble noise. The ground truth files for each of the male and female sets were also provided. Each of the above speech signals was sampled at 16 kHz, with analysis performed at a frame rate of 10 msec. Empirically determined, varying window sizes were used for GPE and VDE calculations. We will compare SIFT, Cepstral analysis, an elementary frequency domain algorithm, and finally an industry standard, Wavesurfer.

II. METHODS

A. Voice Detection
In the interest of time we decided to go with a simple voice detection algorithm - energy detection and thresholding. We use the energy detection to determine the voicing of individual frames via a simple equation:

$$E(k) = \sum_{n=(k-1)N+1}^{kN} \lVert x[n] \rVert^2 \qquad (1)$$

where x[n] is the input waveform, N denotes the frame size and k is the current frame. Our frame size is 10 milliseconds, or 160 points. After we compute the energy for all frames, we establish the threshold as one hundredth of the largest frame energy:

$$E_{threshold} = \frac{\max_k E(k)}{100} \qquad (2)$$

With the threshold established, each frame's energy is compared against it. If the frame's energy is below the threshold, the frame is deemed unvoiced and the F0 estimate is set to 0.

B. Peak to Peak Detection
Pitch is also known as the fundamental frequency, and the fundamental frequency of the speaker produces integer harmonics of itself throughout the spectrum. Peak to peak detection, or peak spacing detection, is a frequency domain technique that is used to determine the spacing (in terms of frequency) between the harmonics. Ideally all the harmonic frequencies will be exactly F0 apart; however, this is not always the case. There are two main factors that introduce error into this detection: noise and quantization errors. This technique is more-or-less immune to the effects of noise at high SNR, as in our case, where we consider an SNR of 10 dB.

Quantization errors arise from the limited resolution of the FFT, since ∆F = FS/N, where FS is the sampling frequency of the data and N is the size of our FFT. Even though we are performing 1024 point FFT operations in our algorithms the
window only contains 640 points, thus giving us a frequency resolution of ∆F = 25Hz.

[Figure 1: "Processed data from 800−1600Hz Band"; x-axis: Frequency (Hz), y-axis: Amplitude (dB)]
Fig. 1. Peaks in the 800-1600Hz band after noise removal and FFT.

From figure 1, we can observe that the peaks are spaced almost evenly by 15 FFT bins. Therefore, by averaging we can reduce quantization errors and improve the estimate. In this particular example the average spacing is (16 + 14 + 16)/3 = 15.33 points, which gives us F0 = 239.53Hz (15.33 bins at 16000/1024 = 15.625 Hz per bin of the 1024-point FFT).

The peak-to-peak detection algorithm can be used in a full-band or in a multi-band approach. Depending on the level and type of noise present in the signal, estimating the harmonic spacing in different bands can yield a better overall estimate.

C. Cepstral Analysis

[Figure 2: flowchart]
Fig. 2. Flowchart of Full-band Cepstral Analysis.

Detecting pitch frequency by looking at the spectrum of a spectrum is called Cepstral analysis, "Cepstrum" being a play on the word "Spectrum". A peak in the cepstrum denotes that the signal is a linear combination of multiples of the pitch frequency. The pitch period can be found as the number of the coefficient where the peak occurs. When the signal is periodic there are spikes with equal distances in the spectrum. However, sometimes it is difficult to determine the pitch frequency since the different harmonics are not necessarily present in the spectrum. The cepstrum consists of two elements. One element is from the excitation sequence (a pulse train for voiced speech) in the higher quefrencies. The other element originates from the vocal tract impulse response and is present in the lower quefrencies. The peak in the higher quefrencies indicates that there are some periodicities in the signal. The peak is located at the period of the fundamental frequency. The cepstrum is produced in the following way.

$$X(z) = \mathrm{FFT}(x[n])$$
$$\hat{X}(z) = 1 - \log(X(z))$$
$$\hat{x}[n] = \mathrm{IFFT}(\hat{X}(z))$$

[Figure 3: "Cepstrum"; x-axis: Samples, y-axis: Amplitude; marked peak at X = 71, Y = 0.3504]
Fig. 3. Sample Cepstrum of a single window with a clearly defined peak at the 71st sample. This translates into an F0 estimate of 225.35Hz, while the truth for this frame is 227.27Hz.

Although this method has been used quite a lot before in different forms, we tailored it to our requirements. We first analyze each frame (of length 40 ms) to categorize the segment as voiced or unvoiced based on an energy threshold on the autocorrelated signal, which is preprocessed to remove the higher frequencies (retaining 1 kHz) to reduce the effect of white noise. This differs from the conventional method, in which the V/UV decision is made in the cepstral domain; we do not go to the cepstral domain until we are sure that the segment is voiced. Once we are sure that the given segment is voiced, we obtain the next three frames (each of length 10 ms) along with the one under analysis to determine the pitch period. This is done under the practical assumption that speech is a smooth phenomenon and hence there cannot be an abrupt change in the pitch. Moreover, this feature makes our algorithm robust for pitch detection, as it can capture pitch as low as 50 Hz. The signal thus obtained is then analyzed in the quefrency domain
after high pass liftering (retaining quefrencies N = 50:320) to eliminate the undesirable noise and higher pitch components (on the order of 400 Hz and greater), as we have only male and female speakers whose pitch ranges from 50 Hz to 400 Hz. A step by step implementation of the process is given in figure 2. The plots are from a sample run of the Matlab code for pitch estimation for a female speaker. The liftered cepstrum, which retains pitch periods in the desirable range and eliminates noisy and vocal tract quefrencies, is shown in figure 3.
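A minimal Python sketch of the voiced-frame path described in this section may help make the steps concrete. It assumes the conventional real cepstrum (inverse FFT of the log magnitude spectrum) and is not the authors' Matlab code, which is not shown. The function names, the pure-Python FFT, and the 1e-12 floor inside the logarithm are illustrative assumptions; the 16 kHz sampling rate, the 1024-point FFT, the 50:320 quefrency window, and the rule that an unvoiced frame gets F0 = 0 come from the text. The voicing gate is simplified here to a raw frame-energy threshold.

```python
import cmath
import math

FS = 16000    # sampling rate from the paper (Hz)
NFFT = 1024   # FFT size from the paper; this toy FFT needs a power of two


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def ifft(spec):
    """Inverse FFT computed with the conjugate trick."""
    n = len(spec)
    forward = fft([complex(s).conjugate() for s in spec])
    return [v.conjugate() / n for v in forward]


def real_cepstrum(frame):
    """Inverse FFT of the log magnitude spectrum (assumed form of the transform)."""
    padded = (list(frame) + [0.0] * NFFT)[:NFFT]  # zero-pad or truncate to NFFT
    log_mag = [math.log(abs(v) + 1e-12) for v in fft(padded)]  # floor is an assumption
    return [v.real for v in ifft(log_mag)]


def cepstral_f0(frame, q_lo=50, q_hi=320):
    """Pick the largest cepstral value inside the liftering window q_lo..q_hi."""
    ceps = real_cepstrum(frame)
    peak_q = max(range(q_lo, q_hi + 1), key=lambda q: ceps[q])
    return FS / peak_q  # quefrency in samples -> frequency in Hz


def estimate_f0(frame, threshold):
    """Energy gate first; only voiced frames go to the cepstral domain."""
    energy = sum(s * s for s in frame)
    if energy < threshold:
        return 0.0  # unvoiced frame: F0 is reported as 0
    return cepstral_f0(frame)


# Toy input (hypothetical): a unit pulse and a half-amplitude copy 80 samples later.
frame = [0.0] * 640
frame[0], frame[80] = 1.0, 0.5
print(estimate_f0(frame, threshold=0.1))  # 200.0
```

In the toy input the repeat interval is 80 samples, so the cepstral peak falls at quefrency 80 and the estimate is 16000/80 = 200 Hz. The same conversion maps the 71-sample peak of figure 3 to 16000/71 = 225.35 Hz.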

D. Multi-band detection
In the multi-band approach, energy detection is first used to separate the voiced and unvoiced parts of the waveform. In parallel, male/female detection is used to obtain the cutoff frequency value (FC). The waveform is then divided spectrally into multiple bands using a filter bank consisting of 10 to 19 channels. The channels are built from fourth-order Butterworth filters. After this stage, each band is passed through the Cepstral analysis algorithm and the peak detection algorithm. Each channel then provides us with a pitch frequency, so for each frame we get 10 to 19 different pitch frequencies depending on the size of our filter bank. Different filter bank arrangements were tested for female and male voices.

Fig. 4. Flow-chart of Multi-band Analysis
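The per-frame combination of channel estimates that this section goes on to describe (discard estimates outside a plausible pitch range, then average the rest with equal weights) can be sketched in a few lines of Python. The function and constant names are hypothetical, and returning 0 when no estimate survives is an assumption made here to match the unvoiced convention; the 50 to 300 Hz and 50 to 200 Hz limits are the ones the text gives for female and male voices.

```python
FEMALE_RANGE = (50.0, 300.0)  # Hz, post-processing limits stated for female voices
MALE_RANGE = (50.0, 200.0)    # Hz, post-processing limits stated for male voices


def combine_band_estimates(estimates, f_min, f_max):
    """Average the per-channel F0 estimates that fall inside [f_min, f_max]."""
    kept = [f for f in estimates if f_min <= f <= f_max]
    if not kept:
        return 0.0  # assumption: no plausible estimate means no pitch is reported
    return sum(kept) / len(kept)  # equal weight for every remaining channel


# Hypothetical channel outputs for one frame of a female speaker.
channel_f0 = [220.0, 230.0, 450.0, 30.0, 240.0]
print(combine_band_estimates(channel_f0, *FEMALE_RANGE))  # 230.0
```

Replacing the plain mean with a weighted mean would be the natural place to try giving some bands more influence than others.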
For female voices the best results were achieved when 12 channels were used, with narrower channels at around 50 to 100 Hz and wider channels from 300 Hz up to 400 Hz. The frequency cuts were made at 50, 60, 70, 80, 90, 100, 120, 150, 170, 200, 300, 400 Hz.

Despite the fact that the spectral content of a male voice is more concentrated at the lower frequencies, for male voices the best results were achieved when the filter bank ranged from 10 Hz all the way up to 5000 Hz. The frequency cuts were made at 10, 400, 1000, 2000, 3000, 4000, 5000 Hz. Finally, we post-process the values obtained on a per-frame basis (as explained below) and declare the pitch value. An overall diagram of this algorithm is shown in figure 4.

For the post processing of the female voices, we found that eliminating pitch frequencies above 300 Hz and below 50 Hz increased our accuracy in detecting the right pitch frequencies but decreased the detection of zeros. Thus, for detecting zeros we did not eliminate any frequency band at the post detection phase, but for detecting pitch frequencies we did eliminate frequencies above 300 Hz and below 50 Hz. We then simply computed the average over all the remaining frequencies.

In the post processing of the male voices, we simply eliminated pitch frequencies below 50 Hz and above 200 Hz and then computed the average over the remaining pitch frequencies to come to a final decision on the pitch.

E. Wavesurfer
Wavesurfer is an open source speech processing software for viewing, editing, and labeling audio data. It is developed at the Center for Speech Technology at KTH in Stockholm, Sweden. The version used for this project is Wavesurfer 1.8.5, which was released on Nov. 1, 2005 [8]. The main reason for using Wavesurfer over other speech processing tools is its simple and customizable user interface. Other advantages include multi-platform support, a flexible interface, handling of long signals, easy extension through plug-ins, and so on [9].

Fig. 5. Wavesurfer screen shot

Figure 5 is a screen shot of the Wavesurfer speech analysis
display. It shows the waveform, spectrogram, time-axis, pitch contour, and the complete waveform of the signal respectively. The properties window is used to change the parameter settings. It can be obtained by right-clicking on the pitch contour pane. The following Wavesurfer settings are used to find the pitch contour at a frame rate of 10 milliseconds:

Parameter                Value
Pitch method             ESPS
Max pitch value          600 Hz
Min pitch value          50 Hz
Analysis window length   0.01 s
Frame interval           0.01 s

TABLE I
WAVESURFER CONFIGURATION PARAMETERS

Wavesurfer saves the data of the pitch contour pane in four columns for every frame, and only the first column is the fundamental frequency. Therefore, the saved data file needs to be converted to .txt format for MATLAB to parse the data and fetch F0. MATLAB also reads the wave sound files and calculates the total number of frames with a frame rate of 10 milliseconds. If there is a difference between the actual number of frames and the number of F0 estimates, the data will be zero-padded.

III. RESULTS AND DISCUSSION

A. Voice Detection
Since we used the same voice detection algorithm throughout our project, the results can be observed in tables III and IV.

After manipulating the threshold values for a while we realized that there is a large drawback to using energy detection as a means to determine voicing. The problem arises from the fact that even the most optimal threshold only gives an equilibrium between Voiced-to-Unvoiced Errors and Unvoiced-to-Voiced Errors. There are many frames that are voiced but have a relatively low energy when compared to E_threshold, thus creating false negatives.

B. Peak-to-Peak Detection
The peak-to-peak algorithm showed a great deal of promise during the development stages; however, we were unable to achieve reasonable results due to numerous factors.

C. Wavesurfer
After acquiring the results from Wavesurfer we compared them to our truth files, with results summarized in table II.

          Female           Male
          GPE     VDE      GPE     VDE
Clean     2       4.63     1.56    7.13
Noise     0.86    10.41    0.98    13.17
Babble    2.39    9.66     1.82    12.47

TABLE II
WAVESURFER RESULTS

From the table, it can be seen that the GPE and VDE results are smaller compared to the other methods, since Wavesurfer is an industry tool. Good results can be obtained regardless of the type of noise or speaker.

In Wavesurfer, there are two algorithms to find the pitch values: AMDF (Average Magnitude Difference Function) and ESPS (Entropic Speech Processing System) [4]. The ESPS algorithm was used in this project. For future work, the AMDF algorithm can be used to compare results. Table I shows that the average pitch of a human being ranges from 125 to 300 Hz. By setting the Wavesurfer pitch limit to 60-400 Hz, one can eliminate noise in the signal while maintaining the information of the speech. Other noise elimination techniques that can be implemented in the future include filtering noise and deleting some outlying samples. From our observation, we noticed that the fundamental frequencies of the last three frames are always missing. This may be due to silence at the end of the speech file. Hence, we pad zeros at the end of the pitch contour to equalize the number of F0s observed and the total number of frames. P. Pelle proposed a new method for pitch estimation using phase locked loops (PLL). She claimed that PLL outperformed Wavesurfer under noisy speech conditions [5]. For future work, we can incorporate PLL on top of the Wavesurfer results to get a better pitch estimation.

D. Cepstral Analysis
These results were achieved by performing pre-processing on the signal before the analysis was performed. In the revision of the full-band detection algorithm, causality has been sacrificed for performance improvements. The steps of pre-processing are as follows:
1) Coarse Pitch Estimation - we acquire a rough estimate of the fundamental frequency of the signal. This is used primarily to determine if the speaker is male or female.
2) Band Filtering - although this is a full-band estimation, it doesn't mean we have to use the whole spectrum from 0 − 8000Hz. Instead we focus on one band in particular, from just below our rough F0 estimate to 4F0. By "zooming" into this band we were able to double the accuracy of our results. Most of the improvement was achieved by eliminating the low frequency noise.
3) Voice Detection - the signal goes through basic energy detection and thresholding as described in equations 1 and 2.
4) Liftering - depending on whether the speaker is male or female we lifter in the range of [50-1000Hz] or [150-1000Hz] for males and females, respectively.
5) Peak Detection - upon liftering, we find the local maximum and translate it back into frequency. This is our F0 estimate.

The results for Cepstral Analysis were fairly good, as expected. The GPE for females was a little lower than for their male counterparts. Using Cepstrum analysis, the gross pitch error for female voices fell in the range of 2 − 5% and for male voices in the range of 4 − 8%. There is a noticeable difference between the estimation of male and female pitches. This can be attributed to the presence of low frequency noise in most signals. These results are by far the best
out of all the methods we attempted. It is also important to note that Cepstral analysis is highly resistant to noise due to detection in the frequency domain. An additional benefit of using Cepstral analysis is the speed: computation times for each of the .wav files were an order of magnitude lower than the duration of the file in seconds.

          Female           Male
          GPE     VDE      GPE     VDE
Clean     2.20    9.25     4.36    16.09
Noise     2.48    8.75     5.41    15.94
Babble    4.38    9.75     7.96    15.97

TABLE III
FULL-BAND CEPSTRUM ANALYSIS RESULTS

E. Multi-band Detection
The results of the multi-band detection algorithm are shown below in table IV for both male and female speakers.

          Female           Male
          GPE     VDE      GPE     VDE
Clean     9.46    15.1     16.54   18.89
Noise     9.99    14.99    24.10   19.07
Babble    17.46   22.67    23.88   23.18

TABLE IV
MULTI-BAND CEPSTRAL ANALYSIS RESULTS

As expected, the overall error increases as you move from normal to noise to babble. As can be seen, since our voiced/unvoiced detection algorithm is based on the waveform, the introduction of noise has made the detection error high by introducing more energy in different bands. For female speakers the GPE is fairly reasonable, being around 9.5 and 10% for normal and noisy speech, and about 17.5% for babble. However, for male speakers the errors are higher due to the concentration of the filter bank at higher frequencies rather than the lower frequencies where the spectral content of male voices is richer.

Overall we were expecting lower percentage errors using the multi-band approach than the full-band approach; thus this method needs further investigation. One thing to look at is the weights we assign to each band when combining the F0s detected per band. In the presented algorithm, we use the same weight for the remaining frequencies after filtering the F0 dataset that we get per frame; however, our results may change if we choose the important bands (per investigation) and assign higher weights to those bands. Another thing to investigate is the width of each channel in the filter bank. Although this was investigated thoroughly, perhaps increasing the number of channels in the filter bank at the frequencies of interest will have a mitigating effect on our errors.

IV. PROJECT CONTRIBUTIONS

A. Mikhail Tadjikov
Pitch Detection algorithms, Voiced/Unvoiced Detection, Full-band Detection, Project Report and parts of in-class presentation.

B. Arya Ahmadi
Pitch Detection algorithms, Voiced/Unvoiced Detection, Multi-band Detection, in-class presentation and parts of Project Report.

ACKNOWLEDGMENT
The authors would like to thank Prof. Alwan for providing necessary guidance and knowledge needed to complete this project.

REFERENCES
[1] Cariani, Peter. "Neuro Correlates of Pitch and Complex Tones." Journal of Neurophysiology. 6.3 (1996): 1698-1716. Print.
[2] Paul Boersma. (1993). Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound. Institute of Phonetic Sciences, University of Amsterdam, Proceedings 17.
[3] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, "The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo Autocovariance, Cross-Cepstrum and Saphe Cracking". Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, Ed) Chapter 15, 209-243. New York: Wiley, 1963.
[4] C. Grigoras, "Catalina Forensic Audio Toolbox Users Manual".
[5] P. A. Pelle, A Robust Pitch Extraction System Based on Phase Locked Loops, In Proc. ICASSP, 2006.
[6] Gerhard, David. Pitch Extraction and Fundamental Frequency: History and Current Techniques.
[7] Patricio de la Cuadra, Aaron Master, Craig Sapp. Efficient Pitch Detection Techniques for Interactive Music. [Center for Computer Research in Music and Acoustics, Stanford University].
[8] Kare Sjolander and Jonas Beskow. Wavesurfer version 1.5.7, 2003. Centre for Speech Technology (CTT) at Royal Institute of Technology (KTH) in Stockholm, Sweden. URL: http://www.speech.kth.se/wavesurfer/.
[9] K. Sjolander and J. Beskow. WaveSurfer - an Open Source Speech Tool, In Proc. Int. Conf. on Spoken Language Processing, 2000.

Tadjikov, Mikhail received his B.S. in Electrical and Computer Engineering from New Mexico State University in 2007 with a Minor in Mathematics. He is currently a graduate student working on his Masters Degree in Electrical Engineering at the University of California, Los Angeles. Mikhail is currently a graduate student within the Electrical Engineering Department at the Henry Samueli School of Engineering at the University of California, Los Angeles, working in the lab of Prof. Danijela Cabric.

Ahmadi, Arya received his B.S. in Electrical Engineering from the University of California, Irvine in 2008. He is currently working towards his Masters Degree in Electrical Engineering at the University of California, Los Angeles. Ahmadi is currently a graduate student within the Electrical Engineering Department at the Henry Samueli School of Engineering at the University of California, Los Angeles, working in the lab of Prof. Danijela Cabric.
