Pitch Detection Algorithms
Prajwal S Rao*, Khoushikh S*, Sriram Ravishankar*, R Advaith Ananthkrishnan*, Balachandra K*
Abstract - Digital Signal Processing deals with performing varied mathematical operations on different types of signals, and in this paper we study pitch estimation algorithms applied to music signals. We give a detailed study of time domain and frequency domain pitch detection algorithms, namely Modified Autocorrelation, Average Magnitude Difference Function, Yin, and Cepstrum. The algorithms have been tested with real and synthesized music signals, and their performances were compared based on time complexity and error obtained.

Keywords - Pitch, AUTO, AMDF, YIN, CEPS, GER

I. INTRODUCTION

Speech is produced by the vibration of the vocal cords when air moves from the lungs (the power house of speech) to the oral cavity through the trachea, or windpipe. The rate at which the vocal cords vibrate is known as the fundamental frequency, or pitch, of an individual. The number of vibrations per second that make up a wave travelling in a material, such as a sound wave, is called its frequency or pitch. Males generally have low-pitched voices whose frequency ranges from 55 Hz to 131 Hz (which roughly maps to the A1 to C3 notes on a standard keyboard/piano), while females have high-pitched voices whose frequency ranges from 170 Hz to 262 Hz (which roughly maps to the F3 to C4 notes). Musical instruments have a large frequency range, starting from as low as 16 Hz and going up to as high as 4000 Hz (C0 to B7). However, frequency content above 1000 Hz becomes less significant for pitch, so we limit our area of interest to below 1000 Hz.

II. PITCH DETECTION ALGORITHMS

Pitch can be calculated in the time domain or the frequency domain. Calculation in the time domain works directly on the actual audio data, whereas pitch detection in the frequency domain involves moving from the time domain to the frequency domain using operations like the Fourier transform.

A. Pitch Detection in Time Domain

Time domain algorithms process the data in its raw form as it is usually read from a sound card - a series of uniformly spaced samples representing the movement of a waveform over time. For example, 44100 samples per second is a common recording speed. Each input sample x[n] is assumed to be a real number in the range -1 to 1 inclusive, with its value representing the height of the waveform at discrete time instance 'n'. This section summarizes time domain pitch algorithms such as Autocorrelation (AUTO), the Square Difference Function or Yin method (YIN), the Average Magnitude Difference Function (AMDF), and the Simple Feature Based Method.

1) Modified Autocorrelation Method

One of the most popular methods for pitch estimation is autocorrelation. It takes an input function, x[n], and correlates it with itself; that is, each element is multiplied by a shifted version of x[n], and the results are summed to get a single autocorrelation value.

r[n] = \sum_{m=0}^{N-1} x[m] \cdot x[m+n]    (1)

where n = 1, 2, ..., N/2; N being the frame length.

Autocorrelation methods need at least two pitch periods to detect pitch. This means that in order to detect a fundamental frequency of 40 Hz, at least 50 milliseconds (ms) of the speech signal must be analyzed. Hence, we shift only up to half the window size. The above equation represents the autocorrelation of an input signal x[n]. If a signal is periodic with period p, then the autocorrelation will have maxima at multiples of p, where the function matches itself. The autocorrelated function has a global maximum at n = 0 (the origin), which is the primary peak, and local maxima at integer multiples of p. So, the distance of the first local maximum from the origin represents the fundamental time period, and the inverse of the fundamental period gives the pitch or fundamental frequency. We eliminate all peaks below a threshold value so that peak picking becomes easier and the number of erroneous peaks to check is reduced. This thresholding is the difference between the autocorrelation and modified autocorrelation methods. Figure 1 is the block diagram to calculate pitch using AUTO.

Steps involved in finding the pitch of an audio sample by modified autocorrelation (a short sketch of these steps follows Fig. 1):
1. The entire audio is broken into smaller segments, which is known as windowing. We chose a window size of 2048 samples; the minimum frequency or pitch obtainable with 2048 samples is approximately 50 Hz. We multiply the segmented audio by a suitable window function.
2. Autocorrelation is performed on the windowed audio.
3. We eliminate all the values below a threshold value and then check for peaks or maxima. The index of the maximum is noted.
4. The pitch or fundamental frequency of the windowed audio is obtained by dividing the sampling frequency (Fs) by the maximum index.
5. We increment the window pointer by the hop length. If we use 50 percent overlap, we increment the pointer by 1024 samples.
6. We check if the window pointer has reached the end of the audio. If it has not, we go back to step 1; otherwise, the computation of pitch for the entire audio is complete and we perform the desired operation on the pitch, such as plotting.

Fig. 1. Block diagram representation of pitch estimation using modified autocorrelation
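As a rough illustration of steps 1-6, the following NumPy sketch estimates the pitch of one frame by modified autocorrelation and tracks it over a signal. The Hann window, the threshold of 0.3 times the zero-lag peak, and the function names are illustrative assumptions, not values taken from the paper.

import numpy as np

def autocorr_pitch(frame, fs, threshold=0.3):
    """Pitch of one windowed frame via modified autocorrelation (steps 1-4)."""
    n = len(frame)
    frame = frame * np.hanning(n)                        # step 1: apply a window function (Hann assumed)
    r = np.correlate(frame, frame, mode="full")[n - 1:]  # step 2: autocorrelation at non-negative lags
    r[r < threshold * r[0]] = 0.0                        # step 3: discard values below a threshold
    for lag in range(1, n // 2 - 1):                     # search only up to half the window size
        if r[lag] > 0 and r[lag] >= r[lag - 1] and r[lag] >= r[lag + 1]:
            return fs / lag                              # step 4: pitch = Fs / index of first surviving peak
    return 0.0                                           # no peak above the threshold (e.g. silence)

def track_pitch(x, fs, win=2048, hop=1024, estimator=autocorr_pitch):
    """Steps 5-6: slide a 2048-sample window with 50 percent overlap over the audio."""
    return [estimator(x[i:i + win], fs) for i in range(0, len(x) - win + 1, hop)]

With a 2048-sample window at Fs = 44100 Hz, the lag search up to 1024 samples corresponds to the roughly 50 Hz lower limit mentioned in step 1.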
2) Average Magnitude Difference Function

D[n] = \sum_{m=0}^{N-1} | x[m] - x[m+n] |    (2)

where n = 1, 2, ..., N; N being the frame length.

When the waveform is shifted by an amount n that is not the period, the differences become greater and cause an increased sum, whereas when n equals the period the sum tends to a minimum. In this algorithm we find the location of the local minimum. Figure 3 is the block diagram to calculate pitch using AMDF.

Steps involved in finding the pitch of an audio sample by the Average Magnitude Difference Function (a short sketch follows the list):
1. The entire audio is broken into smaller segments, which is known as windowing. We chose a window size of 2048 samples; the minimum frequency or pitch obtainable with 2048 samples is approximately 50 Hz. We multiply the segmented audio by a suitable window function.
2. We take the absolute difference of the windowed audio with a circularly shifted copy of the windowed sample.
3. We sum the result obtained in step 2 and store the sum value.
4. We circularly shift the windowed audio and go back to step 2. The process of circular shifting and taking the absolute difference is repeated till we get back the initial windowed signal.
5. We check for the minimum value obtained in step 3, i.e. the minimum of the sums. The index of the minimum is noted.
6. The pitch or fundamental frequency of the windowed audio is obtained by dividing the sampling frequency (Fs) by the minimum index.
7. We increment the window pointer by the hop length. If we use 50 percent overlap, we increment the pointer by 1024 samples.
8. We check if the window pointer has reached the end of the audio. If it has not, we go back to step 1; otherwise, the computation of pitch for the entire audio is complete and we perform the desired operation on the pitch, such as plotting.
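A minimal NumPy sketch of the AMDF steps above, implementing equation (2) with circular shifts. The window choice, the 50-1000 Hz search range, and the use of the deepest minimum in that range (rather than the first local minimum, as described in step 5) are simplifying assumptions of this sketch.

import numpy as np

def amdf_pitch(frame, fs, fmin=50.0, fmax=1000.0):
    """Pitch of one windowed frame from the AMDF of equation (2)."""
    n = len(frame)
    frame = frame * np.hanning(n)                       # step 1: window the segment
    d = np.empty(n)
    for shift in range(n):                              # steps 2-4: circular shift and accumulate
        d[shift] = np.sum(np.abs(frame - np.roll(frame, shift)))
    lo = max(1, int(fs / fmax))                         # d[0] is always 0, so skip very small lags
    hi = min(int(fs / fmin), n // 2)
    lag = lo + np.argmin(d[lo:hi])                      # step 5: deepest minimum in the lag range of interest
    return fs / lag                                     # step 6: pitch = Fs / minimum index

Frame-by-frame tracking (steps 7-8) can reuse the track_pitch helper from the autocorrelation sketch, e.g. track_pitch(x, fs, estimator=amdf_pitch).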
Figure 4 shows the average magnitude difference function applied to a windowed audio. The first local minimum occurs at the 133rd sample. The sampling frequency (Fs) for the windowed sample is 44100 samples/sec, so the pitch or fundamental frequency is given by Fs divided by the minimum index, which is 44100/133 = 331.57 Hz.

3) Yin Method

The idea behind using the square difference function (SDF), or Yin method, is that if a signal is pseudo-periodic then any two adjacent periods of the waveform are similar in shape. So, if the waveform is shifted by one period and compared to its original self, most of the peaks and troughs will line up well. If one simply takes the differences from one waveform to the other and then sums them up, the result is not useful, as some values are positive and some negative, tending to cancel each other out. This could be dealt with by using the absolute value of the difference, as in the equation stated above; however, it is more common to sum the square of the differences, where each term contributes a non-negative amount to the total.

d[n] = \sum_{m=0}^{N-1} ( x[m] - x[m+n] )^2    (3)

where n = 1, 2, ..., N; N being the frame length.

When the waveform is shifted by an amount n that is not the period, the differences become greater and cause an increased sum, whereas when n equals the period the sum tends to a minimum. In this algorithm we find the location of the local minimum. Figure 5 is the block diagram to calculate pitch using YIN. The first steps mirror the AMDF procedure, except that the squared difference between the windowed audio and its circularly shifted copy is accumulated; the remaining steps are:
5. We check for the minimum value obtained in step 4, i.e. the minimum of the sums. The index of the minimum is noted.
6. The pitch or fundamental frequency of the windowed audio is obtained by dividing the sampling frequency (Fs) by the minimum index.
7. We increment the window pointer by the hop length. If we use 50 percent overlap, we increment the pointer by 1024 samples.
8. We check if the window pointer has reached the end of the audio. If it has not, we go back to step 1; otherwise, the computation of pitch for the entire audio is complete and we perform the desired operation on the pitch, such as plotting.

Figure 6 shows the Yin method applied to a windowed audio. The first local minimum occurs at the 343rd sample. The sampling frequency (Fs) for the windowed sample is 44100 samples/sec, so the pitch or fundamental frequency is given by Fs divided by the minimum index, which is 44100/343 = 128.57 Hz.
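A minimal NumPy sketch of equation (3) as a frame-level pitch estimator. This is the plain squared difference function described above, not the full YIN algorithm of de Cheveigne and Kawahara with its cumulative-mean normalization; the Hann window, the 50-1000 Hz search range, and the names are assumptions of the sketch.

import numpy as np

def sdf_pitch(frame, fs, fmin=50.0, fmax=1000.0):
    """Pitch of one windowed frame from the squared difference function, eq. (3)."""
    n = len(frame)
    frame = frame * np.hanning(n)
    d = np.empty(n)
    for shift in range(n):                              # circular shift, as in the AMDF steps
        diff = frame - np.roll(frame, shift)
        d[shift] = np.sum(diff * diff)                  # squared differences
    lo = max(1, int(fs / fmax))                         # d[0] is always 0, so skip very small lags
    hi = min(int(fs / fmin), n // 2)
    lag = lo + np.argmin(d[lo:hi])
    return fs / lag

For the worked example above, a frame whose deepest minimum falls at lag 343 gives 44100/343 = 128.57 Hz.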
Fig. 7. Block diagram representation of pitch estimation using the CEPS algorithm

Steps involved in finding the pitch of an audio sample by the cepstrum method (a short sketch follows the list):
1. The entire audio is broken into smaller segments, which is known as windowing. We chose a window size of 2048 samples; the minimum frequency or pitch obtainable with 2048 samples is approximately 50 Hz. We multiply the segmented audio by a suitable window function.
2. We then move to the frequency domain by applying the FFT (Fourier transform) to the windowed audio.
3. We apply the logarithm to the transformed audio.
4. We move back to the time domain by taking the IFFT (inverse Fourier transform).
5. Now we check for peaks or maxima. The index of the maximum is noted.
6. The pitch or fundamental frequency of the windowed audio is obtained by dividing the sampling frequency (Fs) by the maximum index.
7. We increment the window pointer by the hop length. If we use 50 percent overlap, we increment the pointer by 1024 samples.
8. We check if the window pointer has reached the end of the audio. If it has not, we go back to step 1; otherwise, the computation of pitch for the entire audio is complete and we perform the desired operation on the pitch, such as plotting.
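A minimal NumPy sketch of the cepstrum steps above (FFT, log magnitude, IFFT, peak picking). The Hann window, the small constant guarding the logarithm, and the 50-1000 Hz quefrency search range are assumptions of the sketch.

import numpy as np

def cepstrum_pitch(frame, fs, fmin=50.0, fmax=1000.0):
    """Pitch of one windowed frame via the real cepstrum."""
    n = len(frame)
    frame = frame * np.hanning(n)                       # step 1: window the segment
    spectrum = np.fft.rfft(frame)                       # step 2: move to the frequency domain
    log_mag = np.log(np.abs(spectrum) + 1e-12)          # step 3: logarithm of the magnitude spectrum
    ceps = np.fft.irfft(log_mag)                        # step 4: back to the time (quefrency) domain
    lo = max(1, int(fs / fmax))                         # search only quefrencies in the pitch range of interest
    hi = min(int(fs / fmin), n // 2)
    peak = lo + np.argmax(ceps[lo:hi])                  # step 5: strongest cepstral peak
    return fs / peak                                    # step 6: pitch = Fs / peak quefrency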
2) Average Magnitude Difference Function
The time complexity of the AMDF method is O(n²) since it involves two nested traversals: the outer traversal runs through the audio signal and the inner traversal circularly shifts the windowed sample. Not much significant time is spent on the subtraction of the two sequences.

3) Yin Method
The time complexity of the YIN method is O(n²) since it also involves two nested traversals: the outer traversal runs through the audio signal and the inner traversal circularly shifts the windowed sample. Significant time is spent on the multiplication of the two sequences. Because of these two reasons, Yin takes more time for computation.

4) Cepstrum Method
The time complexity of the CEPS method is O(n) since it just involves traversing the windowed segment once, i.e. just running through the audio length. The major amount of time is used in determining the peaks of the processed audio.

TABLE I. Computation time (in seconds) for various pitch detection algorithms

Instrument sample    AUTO     AMDF     YIN      CEPS
Violin-e6            0.267    0.769    1.373    0.125
Violin-e4            0.215    0.323    0.507    0.115
Trumpet-e3           0.307    1.396    2.432    0.134
Oboe-g5              0.250    0.643    0.986    0.120
Guitar-b2            0.279    0.991    1.748    0.137
Flute-a4             0.366    2.261    3.672    0.142
Bass-c1              0.327    1.057    1.855    0.130
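As a rough way to reproduce this kind of comparison, the following sketch times the four frame-level estimators from the earlier sketches (autocorr_pitch, amdf_pitch, sdf_pitch and cepstrum_pitch, all assumed to be defined or imported) over a synthetic tone; absolute numbers will of course differ from Table I with machine and implementation.

import time
import numpy as np

def time_estimator(estimator, x, fs, win=2048, hop=1024):
    """Wall-clock time to run one frame-level estimator over a whole signal."""
    start = time.perf_counter()
    for i in range(0, len(x) - win + 1, hop):
        estimator(x[i:i + win], fs)
    return time.perf_counter() - start

fs = 44100
t = np.arange(2 * fs) / fs                              # two seconds of audio
x = np.sin(2 * np.pi * 331.0 * t)                       # synthetic tone near E4
for est in (autocorr_pitch, amdf_pitch, sdf_pitch, cepstrum_pitch):  # from the earlier sketches
    print(est.__name__, round(time_estimator(est, x, fs), 3), "s")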
Where the summation runs from 1 to the total number of frames. We take the absolute sum of the error to avoid cancellation. In autocorrelation, the proper peak may not be detected due to improper thresholding, as a result of which a wrong pitch is detected. A similar error can be expected in the case of the cepstrum method. If the fundamental frequency is not dominant compared to its harmonics, the cepstrum method fails. This can be seen when the B2 note is played on a guitar: the amplitude at 121 Hz is much smaller than the amplitude at 242 Hz (the first harmonic of the B2 note) in the spectral domain, as a result of which we got a GER of 187%. All the pitch detection algorithms used perform well when the pitch is less than 1000 Hz, i.e. for low-frequency components. So, when the E6 note is played, the chance of error is higher, as is evident when compared to the other notes. Figure 6 and Table 2 show the comparison of error percentage for multiple pitch detection algorithms.
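The exact per-frame error formula summed over frames is not reproduced in the text above; as one plausible reading (frame-wise absolute relative error between the estimated and reference pitch, averaged over frames and expressed in percent), a small sketch with illustrative names:

import numpy as np

def gross_error_percent(estimated, reference):
    """Average absolute relative pitch error over frames, in percent.
    One plausible reading of the measure described above; the paper's
    exact formula may differ."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    return 100.0 * np.mean(np.abs(est - ref) / ref)

# Example: an estimator that always locks onto the first harmonic of a 121 Hz note
print(gross_error_percent([242.0, 242.0], [121.0, 121.0]))   # 100.0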
TABLE II. Comparison of error in various pitch detection algorithms (in %)

The error in computation follows the order
CEPS > AUTO > AMDF > YIN
The order of computation time is the same as observed earlier, which is
CEPS < AUTO < AMDF < YIN
Figure 10 shows the detection of pitch using the various pitch detection algorithms for a piano audio.
We have discussed parameters such as time complexity and GER for the pitch detection algorithms. Here we discuss the pros and cons of the individual algorithms.

A. Modified Autocorrelation Method
1. Advantages
i. This method of pitch detection is simple and has a faster computation.
ii. The concept is easy to understand since its mathematical modeling is simple.
2. Disadvantages
i. Choosing the level of the peak is challenging. Peak picking may give all the peaks before the first local maximum, but those are not the actual peak. Hence, we make use of the adaptive autocorrelation function.
ii. The error in calculating the pitch is moderate.
D. Cepstrum Method
1. Advantages
i. Cepstrum analysis, once understood, is a very simple way to estimate spectral components since it does not deal with phase.
ii. The computation time is moderate since it involves the FFT and IFFT repeatedly.
2. Disadvantages
i. Performing the FFT and IFFT is computationally expensive and there is a chance of losing some data.
ii. Cepstrum analysis is essentially a low-pass filter that filters, i.e. averages, the spectral components.
iii. For the cepstrum method the fundamental must be dominant compared to its harmonics, else the chances of error are high. This is explained in the GER part.