ENCODER
The block diagram of the encoder is described as follows:
The input to the encoder is human speech, and the output is the bit stream to be transmitted. The coder runs at 2400 bits per second with a frame time of 22.5 ms, so each frame carries 2400 × 0.0225 = 54 bits. The input speech is sampled by an A/D converter at 8000 Hz, so a 22.5 ms frame contains 8000 × 0.0225 = 180 samples.
1. LOW FREQUENCY REMOVAL
The first step is to remove the low frequencies between DC and 60 Hz. This is accomplished by a 4th-order Chebyshev type II highpass filter with a cutoff frequency of 60 Hz and a stopband rejection of 30 dB. From now on, the output of this filter is considered the input speech to the system.
2. PITCH CALCULATION
The pitch is found in several steps, as shown in this block diagram:
2.1. INTEGER PITCH CALCULATION
To find the integer pitch, the input speech is first passed through a 6th-order Butterworth lowpass filter with a 1 kHz cutoff. The integer pitch is the first pitch estimate computed in the process, and it is obtained from an autocorrelation function. The standard requires the autocorrelation to be evaluated for lags of 20 to 160 samples, which leads to a vector of 2 × 160 = 320 samples for the autocorrelation. Every calculation is centered on the last sample in the current frame.
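The normalized autocorrelation function produces one value per lag. We are interested in its maximum value: the lag at which this maximum occurs is the integer pitch $P_1$. The normalized autocorrelation is defined as:
$$r(\tau) = \frac{c_\tau(0,\tau)}{\sqrt{c_\tau(0,0)\,c_\tau(\tau,\tau)}}$$
where
$$c_\tau(m,n) = \sum_{k=-\lfloor \tau/2 \rfloor - 80}^{-\lfloor \tau/2 \rfloor + 79} s_{k+m}\,s_{k+n}$$
and sample index 0 is the center of the pitch analysis window.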
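The integer pitch search described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names, the 160-sample window (`half_window=80`), and the way the window start is shifted by half the lag are assumptions made for the example, and the lag range simply follows the 20 to 160 figure quoted in the text.

```python
import math

def normalized_autocorr(s, center, lag, half_window=80):
    """Normalized autocorrelation r(lag) over a 160-sample window.

    The window start is shifted by lag // 2 so that the two segments
    being compared straddle `center` (assumed centering rule).
    """
    start = center - lag // 2 - half_window
    c_0t = c_00 = c_tt = 0.0
    for k in range(start, start + 2 * half_window):
        a, b = s[k], s[k + lag]
        c_0t += a * b   # cross term c(0, lag)
        c_00 += a * a   # energy of the undelayed segment
        c_tt += b * b   # energy of the delayed segment
    denom = math.sqrt(c_00 * c_tt)
    return c_0t / denom if denom > 0.0 else 0.0

def integer_pitch(s, center, min_lag=20, max_lag=160):
    """Return (lag, r) for the lag with the largest normalized autocorrelation."""
    best_lag = min_lag
    best_r = normalized_autocorr(s, center, min_lag)
    for lag in range(min_lag + 1, max_lag + 1):
        r = normalized_autocorr(s, center, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r
```

With `center = 160` the function touches exactly the samples 0 to 319 of the buffer, which matches the 320-sample vector mentioned earlier. Note that a perfectly periodic signal scores equally well at every multiple of its period, so a search like this can return a multiple of the true pitch.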
2.2. PITCH REFINEMENT AND VOICING ANALYSIS
First, the speech signal is passed through five parallel bandpass filters. The filters are 6th-order Butterworth, with passbands as shown in the figure below:
2.2.1. PITCH REFINEMENT
The pitch refinement algorithm uses the output of the first bandpass filter (0-500 Hz), and its candidates are the integer pitch values from the current and the previous frames. Because we are dealing with sampled values, the basic assumption is that the real pitch lies at a fractional offset $\Delta$ from the integer pitch. The following steps find the offset and, with it, the fractional pitch:
For every integer pitch candidate, a normalized autocorrelation function is applied over lags from 5 samples shorter to 5 samples longer than the candidate. A fractional pitch refinement is then performed around the optimum integer pitch lag. Assuming this lag has a value of T samples, a linear interpolation is performed between lags T and T+1. This assumes that the pitch lies between T and T+1, but in fact it may lie between T-1 and T. To resolve this, we compute $c_T(0,T-1)$ and $c_T(0,T+1)$, where $c_T(m,n) = \sum_k s_{k+m}\,s_{k+n}$ is the correlation of the signal at offsets m and n over the analysis window, and decide in which interval the maximum falls. If $c_T(0,T-1) > c_T(0,T+1)$, the maximum falls between T-1 and T, and T is decremented by one before the linear interpolation. The offset $\Delta$ is given by:
$$\Delta = \frac{c_T(0,T+1)\,c_T(T,T) - c_T(0,T)\,c_T(T,T+1)}{c_T(0,T+1)\,[c_T(T,T) - c_T(T,T+1)] + c_T(0,T)\,[c_T(T+1,T+1) - c_T(T,T+1)]}$$
And the normalized autocorrelation at the fractional pitch value is given by:
$$r(T+\Delta) = \frac{(1-\Delta)\,c_T(0,T) + \Delta\,c_T(0,T+1)}{\sqrt{c_T(0,0)\,[(1-\Delta)^2\,c_T(T,T) + 2\Delta(1-\Delta)\,c_T(T,T+1) + \Delta^2\,c_T(T+1,T+1)]}}$$
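A minimal sketch of the fractional refinement step follows. It assumes the delayed signal is linearly interpolated between lags T and T+1, and it chooses the offset that maximizes the normalized correlation of that interpolated signal with the undelayed one. The helper names and the explicit `start`/`length` window arguments are illustrative choices, not part of the standard.

```python
import math

def corr(s, start, length, m, n):
    """c(m, n): sum of s[k+m] * s[k+n] over the analysis window."""
    return sum(s[k + m] * s[k + n] for k in range(start, start + length))

def fractional_refine(s, start, length, T):
    """Refine integer lag T to a fractional lag; returns (pitch, r)."""
    def c(m, n):
        return corr(s, start, length, m, n)

    # Decide whether the maximum lies in [T-1, T] or in [T, T+1].
    if c(0, T - 1) > c(0, T + 1):
        T -= 1

    c0T, c0T1 = c(0, T), c(0, T + 1)
    cTT, cTT1, cT1T1 = c(T, T), c(T, T + 1), c(T + 1, T + 1)

    # Offset that maximizes the normalized correlation of the
    # linearly interpolated delayed signal.
    den = c0T1 * (cTT - cTT1) + c0T * (cT1T1 - cTT1)
    delta = (c0T1 * cTT - c0T * cTT1) / den if den != 0.0 else 0.0

    num = (1 - delta) * c0T + delta * c0T1
    energy = c(0, 0) * ((1 - delta) ** 2 * cTT
                        + 2 * delta * (1 - delta) * cTT1
                        + delta ** 2 * cT1T1)
    r = num / math.sqrt(energy) if energy > 0.0 else 0.0
    return T + delta, r
```

For example, for a sinusoid whose period is 50.5 samples, starting the refinement from either T = 50 or T = 51 leads to the interval between 50 and 51 and an offset close to one half.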
This produces two fractional pitch candidates and their corresponding normalized autocorrelation values. The candidate with the higher normalized autocorrelation becomes the fractional pitch $P_2$, and its normalized autocorrelation is saved as $r(P_2)$.
2.2.2. VOICING ANALYSIS
For an accurate voicing analysis, the speech signal is split into five spectral bands. The algorithm combines two methods:
The first method applies the normalized autocorrelation function to every band and takes its maximum value, which indicates the strength of the voicing in that band. This method works well for frames that contain stationary voiced speech, but not when the pitch is changing, because the autocorrelation function then yields small values that do not represent the voicing itself.
The second method works on the envelope of the band signal, obtained by full-wave rectification followed by a smoothing filter with a zero at DC and a complex pole pair at 150 Hz with a radius of 0.97. In other words, the smoothing filter removes DC and emphasizes a narrow band around 150 Hz. The autocorrelation function is applied to this envelope, and its maximum is taken. The voicing strength of each band is the higher of the results of the two methods. The five voicing strengths are saved in $Vbp_i$, where $i = 1, \dots, 5$.
Another tool that helps in the voicing decision is the peakiness calculation. It is computed on the residual signal over a 160-sample window centered on the last sample in the current frame. The residual signal is denoted by $r_n$:
$$\text{peakiness} = \frac{\sqrt{\frac{1}{160}\sum_{n=1}^{160} r_n^2}}{\frac{1}{160}\sum_{n=1}^{160} |r_n|}$$
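The peakiness measure is simple enough to sketch directly: it is the RMS value of the residual window divided by the mean of its absolute values. The function name and the zero-signal guard are assumptions added for the example.

```python
import math

def peakiness(residual):
    """Ratio of RMS value to mean absolute value of a residual window."""
    n = len(residual)
    rms = math.sqrt(sum(x * x for x in residual) / n)
    mean_abs = sum(abs(x) for x in residual) / n
    # Guard for an all-zero window (assumption: report a flat signal).
    return rms / mean_abs if mean_abs > 0.0 else 1.0
```

A constant-magnitude window gives a peakiness of 1, while a single impulse in a 160-sample window gives the square root of 160, so isolated pulses in the residual push the value up sharply.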
The peakiness value is the ratio between the RMS value and the average absolute value of the residual, and it detects peaks in the residual signal. If the peakiness exceeds 1.34, the lowest band (0-500 Hz) is voiced, which means forcing $Vbp_1 = 1$. If the peakiness exceeds 1.6, the lowest three bands (0-500 Hz, 500-1000 Hz, and 1000-2000 Hz) are voiced, which means forcing $Vbp_1 = Vbp_2 = Vbp_3 = 1$.
2.3. FINAL PITCH CALCULATION
The final pitch is calculated from the residual signal, which is first passed through a 6th-order Butterworth lowpass filter with a 1 kHz cutoff frequency. First, the normalized autocorrelation function is applied over lags from 5 samples shorter to 5 samples longer than $P_2$ rounded to the nearest integer. Another fractional pitch refinement is then applied around the optimum integer pitch lag. This yields the final pitch $P_3$ and its normalized autocorrelation $r(P_3)$. The parameters used by the algorithm that finds the final pitch are as follows:
The input speech signal.
The residual signal.
The fractional pitch $P_2$.
2.3.1. PITCH DOUBLING CHECK
This tool allows the encoder to detect pitch values that are multiples of the actual pitch. To use it, we supply the pitch to be checked, $P$, and the doubling threshold, $D_{th}$. The output of the tool is the checked pitch, $P_c$, and its correlation, $r(P_c)$. The algorithm is as follows:
First, a fractional pitch refinement is performed around $P$, which gives initial values for $P_c$ and $r(P_c)$. Next, we look for the largest value of $k$ for which $r(P_c/k) > D_{th}\,r(P_c)$. The value $r(P_c/k)$ is calculated in two steps: the first is a fractional pitch refinement around $P_c/k$, which produces a tentative value; the second is a double verification, applied when $P_c/k$ is a short pitch value. If such a $k$ is found, a fractional pitch refinement is performed around $P_c/k$, and the result updates $P_c$ and $r(P_c)$. Afterwards, if the updated $P_c$ is itself a short pitch value, the double verification is performed once more. In effect, given a pitch candidate and its correlation as inputs, the double verification returns the smaller of two correlation values. This protects against spurious short pitch values.
2.3.2. GAIN CALCULATION
The gain is calculated on the input signal twice per frame, with a window length that is determined as follows:
When $Vbp_1 > 0.6$, the window length is the shortest multiple of $P_2$ which is longer than 120 samples. If this length exceeds 320 samples, it is divided by 2.
When $Vbp_1 \le 0.6$, the window length is 120 samples.
The gain is the RMS value, measured in dB, of the signal in a window of length L. The first window gives the parameter $G_1$, and it is centered 90 samples before the last sample in the current frame.
The second window gives the parameter $G_2$, and it is centered on the last sample in the current frame.
2.3.3. AVERAGE PITCH UPDATE
The long-term average pitch is used for smoothing strong pitch values. It is based on a buffer that holds the three most recent strong pitch values. If $r(P_3) > 0.8$ and $G_2 > 30$ dB (where G represents the gain), then $P_3$ is a strong pitch value and it is placed in this buffer. If this condition is false, all three pitch values in the buffer are moved toward the default pitch, $P_{default} = 50$ samples, according to:
$$p_i = 0.95\,p_i + 0.05\,P_{default}, \quad i = 1, 2, 3$$
Afterwards, the average pitch $P_{avg}$ is updated as the median of the three values in the buffer.
2.3.4. FINAL PITCH ALGORITHM
At this stage the final pitch calculation can be made. The algorithm is described below:
3. LPC ANALYSIS
The linear prediction analysis follows two paths: the first is the analysis of the speech signal, and the second is the analysis of the residual signal.
3.1. SPEECH SIGNAL
A 10th-order linear prediction analysis is performed on the input speech signal. The speech signal is multiplied by a Hamming window of 200 samples (25 ms), centered on the last sample in the current frame, and the autocorrelation function is calculated on this windowed signal. The coefficients are found with the Levinson-Durbin recursion, which exploits the Toeplitz structure of the autocorrelation matrix. At this stage we have 10 coefficients that represent the predictor for this window. The second step is a 15 Hz bandwidth expansion, calculated as follows:
$$a_i' = a_i\,\gamma^i, \quad i = 1, \dots, 10$$
where $\gamma = 0.994$.
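The two LPC steps above can be sketched as follows. The sketch assumes the predictor convention in which sample n is predicted as the sum of `a[i]` times sample n-i, and it relates the expansion factor to the bandwidth through exp(-pi * B / fs), which for B = 15 Hz and fs = 8000 Hz comes out very close to 0.994. The function names are illustrative.

```python
import math

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor coefficients.

    r is the autocorrelation sequence r[0..order]. Returns (a, err) where
    a[i-1] multiplies sample n-i in the prediction (assumed convention)
    and err is the final prediction error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input, stop early
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err

def expansion_factor(bandwidth_hz, fs):
    """Per-coefficient factor for a given bandwidth expansion in Hz."""
    return math.exp(-math.pi * bandwidth_hz / fs)

def bandwidth_expand(a, gamma=0.994):
    """Multiply the i-th coefficient (1-based) by gamma ** i."""
    return [coef * gamma ** (i + 1) for i, coef in enumerate(a)]
```

A typical use would be to compute the autocorrelation of the Hamming-windowed frame for lags 0 to 10, call `levinson_durbin(r, 10)`, and pass the resulting coefficients through `bandwidth_expand`.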
In other words, each prediction coefficient $a_i$ is multiplied by the factor $0.994^i$, where $i = 1, \dots, 10$.
3.2. RESIDUAL SIGNAL
To obtain the residual signal from the speech signal, the speech signal is passed through a filter whose coefficients are the 10 LPC coefficients that have already been calculated. This filter is the prediction error filter $A(z)$ formed from those coefficients.
This is an FIR filter, and its output is the residual signal. Because the speech signal is a vector of 320 samples, the residual signal is also 320 samples long. The residual window is centered on the last sample in the current frame.
4. APERIODIC FLAG
The aperiodic flag handles the fact that, in voiced frames, the speech signal is not perfectly periodic. In these cases the flag is raised, telling the decoder to use a non-periodic excitation to simulate unstable glottal pulses. The flag is determined from the lowest band (0-500 Hz): it is set to 1 when $Vbp_1 < 0.5$ and set to 0 otherwise.
5. FOURIER MAGNITUDE CALCULATION
This analysis measures the Fourier magnitudes of the residual signal in the frequency domain. It takes a vector of 200 samples of the residual signal and applies a 512-point FFT with zero padding. The output is the magnitudes of the first 10 pitch harmonics of the residual signal.
6. QUANTIZATION
This chapter deals with the quantization of the encoder parameters.
6.1. QUANTIZATION OF PREDICTION COEFFICIENTS
First, the 10 LPC coefficients found in the earlier stages are converted into 10 LSF components. LSF stands for Line Spectrum Frequency, and the LSFs are represented in Hz. The second step is to arrange these 10 LSF components in ascending order, which gives the LSF vector. We then make sure that the 10 LSF frequencies are separated from one another by at least 50 Hz; if they are not, a separation algorithm is applied, as described below:
The result of this algorithm is an LSF vector that is arranged in ascending order, with a difference of at least 50 Hz between adjacent elements.
The next stage is the vector quantization, which is performed by a multi-stage vector quantizer (MSVQ). The MSVQ codebook consists of four stages of 128, 64, 64, and 64 levels, as shown in this figure:
The algorithm finds the quantized vector $\hat{f}$ which, as seen in the figure above, is the sum of the vectors selected in each stage. The main purpose of the MSVQ is to find the quantized vector that best represents the original LSF vector. To do so, the MSVQ finds the codebook vector that minimizes the square of the weighted Euclidean distance, $d^2(f, \hat{f})$, between the original and the quantized LSF vectors:
$$d^2(f, \hat{f}) = \sum_{i=1}^{10} w_i\,(f_i - \hat{f}_i)^2$$
Where:
$$w_i = \begin{cases} P(f_i)^{0.3}, & 1 \le i \le 8 \\ 0.64\,P(f_i)^{0.3}, & i = 9 \\ 0.16\,P(f_i)^{0.3}, & i = 10 \end{cases}$$
and $P(f_i)$ is the inverse prediction filter power spectrum evaluated at frequency $f_i$. That is, the weights are taken from the original spectrum.
The search is performed by saving the best 8 indexes at every stage. In the 1st stage, the 8 codebook vectors that give the minimum error are saved. In the 2nd stage, the search is combined with the results of the 1st stage and repeated to find the best indexes. The algorithm proceeds in this way through all 4 stages. For every LSF vector the algorithm produces the 8 best index candidates, and because there are 10 LSF components, the search ends with an array of candidate vectors. The final step is to arrange the quantized LSF vector in ascending order and to ensure that the difference between every two adjacent frequencies is at least 50 Hz. The algorithm for this was described earlier.
6.2. PITCH QUANTIZATION
The final pitch value is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 samples to 160 samples. These pitch values are then mapped to a 7-bit codeword using a look-up table. The all-zero codeword indicates an unvoiced frame, and it is sent if $Vbp_1 \le 0.6$.
6.3. GAIN QUANTIZATION
The gain is represented by two values: $G_1$ and $G_2$. $G_2$ is quantized with a 5-bit uniform quantizer ranging from 10 to 77 dB. $G_1$ is quantized to 3 bits using the following algorithm, where $G_{2,p}$ is the $G_2$ of the previous frame.
The algorithm is: if $G_2$ for the current frame is within 5 dB of $G_{2,p}$, and $G_1$ is within 3 dB of the average of the $G_2$ values for the current and the previous frames, then quantizer_index = 0, which tells the decoder to set $G_1$ to the mean of the $G_2$ values for the current and the previous frames.
Otherwise, the frame represents a transition, and $G_1$ is quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the $G_2$ values for the current and the previous frames to 6 dB above the maximum of those values. If the range limits saturate, they are clamped to 10 dB and 77 dB.
6.4. BANDPASS VOICING QUANTIZATION
This is the quantizing algorithm for the voicing strengths $Vbp_i$. The result is the quantized voicing strengths.
6.5. FOURIER MAGNITUDE QUANTIZATION
The algorithm for this quantization is described in the following steps:
-Obtaining the quantized predictor coefficients from the quantized LSF vector.
-Generating the residual window based on the quantized predictor coefficients.
-Applying a 200-sample Hamming window and performing a 512-point complex FFT with zero padding.
-Transforming the complex FFT output to magnitudes, after which the harmonics are found with a spectral peak-picking algorithm.
The spectral peak-picking algorithm finds the maximum magnitude of each pitch harmonic. The search is divided into several regions of width $512/\hat{P}_3$, where $\hat{P}_3$ is the quantized pitch. The location of the harmonics in these regions is calculated by the formula $512\,i/\hat{P}_3$, where $i$ is the number of the harmonic required. The number of harmonics is the smaller of 10 and $\hat{P}_3/4$. After the harmonics are found, they are normalized to an RMS value of 1. If fewer than 10 harmonics are found when the process finishes, the remaining ones are given the value 1. At this stage the resulting 10 harmonics are quantized with a codebook of 256 vectors (an 8-bit index). The codebook is searched using a perceptually weighted Euclidean distance, with the weights given below, which emphasize low frequencies over higher ones:
$$w_i = \frac{117}{25 + 75\,\left(1 + 1.4\,(f_i/1000)^2\right)^{0.69}}$$
Where $f_i = 8000\,i/60$ is the frequency in Hz of the $i$-th harmonic for the default pitch period of 60 samples. The search minimizes the weighted squared error between the magnitudes and the codebook values.
7. ERROR PROTECTION AND BIT PACKING
To improve performance in the presence of channel errors, the coder parameters that are unused in the unvoiced mode are replaced with forward error correction (FEC). Three Hamming (7,4) codes and one Hamming (8,4) code are used. In the unvoiced mode there is no need to transmit the 8 bits of Fourier magnitudes, the 4 bits of bandpass voicing, and the 1 bit of the aperiodic flag. That gives a total of 13 spare bits, which the FEC algorithm replaces with the parity bits of the Hamming codes (3 × 3 + 4 = 13). The algorithm protects the first MSVQ index and the two gain values. The parity generator matrix for the Hamming (7,4) code is:
The parity generator matrix for the Hamming (8,4) code is:
The algorithm is as follows: the protected bits are placed into a column vector, which is then multiplied by the parity matrix. The result is a parity vector to be transmitted. Defining the 3-bit and 4-bit parity vectors as:
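The parity computation (a matrix-vector product modulo 2) and single-error correction can be sketched as below. The parity rows used here are a common textbook choice for a systematic Hamming (7,4) code and are an assumption for illustration; they are not claimed to be the matrices of the MELP standard, and the function names are hypothetical.

```python
# Textbook systematic Hamming (7,4) parity rows.
# Assumption: NOT taken from the MELP standard.
PARITY_7_4 = [
    (1, 1, 0, 1),
    (1, 0, 1, 1),
    (0, 1, 1, 1),
]

def parity_bits(data, matrix=PARITY_7_4):
    """Multiply the 4 protected bits by the parity matrix, modulo 2."""
    return [sum(m * d for m, d in zip(row, data)) % 2 for row in matrix]

def encode(data):
    """Codeword = 4 data bits followed by 3 parity bits."""
    return list(data) + parity_bits(data)

def correct(codeword):
    """Return the 4 data bits after correcting at most one bit error."""
    data, received_parity = list(codeword[:4]), codeword[4:]
    syndrome = tuple(p ^ q for p, q in zip(parity_bits(data), received_parity))
    if any(syndrome):
        # A flipped data bit produces the matching column of the matrix;
        # any other non-zero syndrome means a parity bit was hit.
        columns = [tuple(row[i] for row in PARITY_7_4) for i in range(4)]
        if syndrome in columns:
            data[columns.index(syndrome)] ^= 1
    return data
```

An (8,4) code can be obtained from such a code by adding one overall parity bit, which is what makes double-error detection possible.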
The parity vector is placed in the spare bits of the unvoiced frame and transmitted.
8. TRANSMISSION BIT STREAM
The 54 bits that form the output of the encoder are listed below. They are divided into 2 columns depending on the mode of the frame: voiced or unvoiced. These 54 bits are listed without forward error correction.
The arrangement of these 54 bits in the bit stream also depends on the mode of the frame, voiced or unvoiced, and for an unvoiced frame the forward error correction bits are placed to protect the relevant bits.
Notice: Some of the material in this site taken from books, federal standards, and other sites. This is purely an Educational site - if you find we have violated any copyright, please contact us and we will remove the material in violation.
MELP DECODER
Overview
The MELP parameters which are quantized and transmitted are the final pitch, $P_3$; the bandpass voicing strengths, $Vbp_i$; the two gain values, $G_1$ and $G_2$; the linear prediction coefficients, $a_i$; the Fourier magnitudes; and the aperiodic flag. The use of the following quantization procedures is required for interoperability among various implementations.
Decoder Block Diagram
Bit Unpacking and Error Correction:
After the encoder operation, all the described parameters are transmitted to the decoder through a transmission channel, which may introduce bit errors at the decoder side. It is therefore important to verify the received data with the error correction mechanism. The received bits are unpacked from the channel and assembled into the parameter codewords according to this table:
Parameter decoding is different for the voiced and unvoiced modes, according to the following table, which shows how the 54 bits in a MELP frame are allocated among the parameters:
The pitch is decoded first, since it contains the mode information.
Pitch Decoding:
The table below is used in decoding the 7-bit pitch code to determine whether a frame is voiced or unvoiced, or whether a frame erasure is indicated.
If the pitch code is all-zero or has only one bit set, then the unvoiced mode is used. If two bits are set, a frame erasure is indicated. Otherwise, the pitch value is decoded and the voiced mode is used. In the unvoiced mode, the (8,4) Hamming code is decoded to correct single bit errors and detect double errors. If an uncorrectable error is detected, a frame erasure is indicated. Otherwise, the (7,4) Hamming codes are decoded, correcting single errors but without double error detection. (The theory of error correction with Hamming codes was explained in the Encoder chapter.) If any erasure is detected in the current frame, by the Hamming code, by the pitch code, or directly signaled from the channel, then a frame repeat mechanism is implemented. All of the parameters for the current frame are replaced with the parameters from the previous frame. In addition, the first gain term is set equal to the second gain term so that no gain transitions are allowed. If an erasure is not indicated, the remaining parameters are decoded. The LSFs are checked for ascending order and minimum separation as described in Section 6.1, QUANTIZATION OF PREDICTION COEFFICIENTS, in the Encoder. In the unvoiced mode, default parameter values are used for the pitch, jitter, bandpass voicing, and Fourier magnitudes. The pitch value is set to 50 samples, the jitter is set to 25%, all of the bandpass voicing strengths are set to 0, and the Fourier magnitudes are set to 1. In the voiced mode, Vbp1 is set to 1; jitter is set to 25% if the aperiodic flag is a 1; otherwise jitter is set to 0%. The bandpass voicing strength for each of the upper four bands is set to 1 if the corresponding bit is a 1; otherwise the voicing strength is set to 0. There is one exception: if 0001 is received for Vbp2, Vbp3, Vbp4, and Vbp5, respectively, then Vbp5 is set to 0. When the special all-zero code for the first gain parameter, G1, is received, some errors in the second gain parameter, G2, can be detected and corrected. This correction process provides improved performance in channel errors.
Gain Decoding:
The decoding for the two gain parameters is shown in the following flow chart:
Noise Attenuation:
For quiet input signals, a small amount of gain attenuation is applied to both decoded gain parameters using a power subtraction rule. This attenuation is a simplified, frequency-invariant case of the Smoothed Spectral Subtraction noise suppression method defined in: L. Arslan, A. McCree, and V. Viswanathan, New Methods for Adaptive Noise Suppression, Proceedings of IEEE ICASSP 1995, pp. 812-815. This article describes how to estimate the gain of the speech signal without the environmental noise. Before determining the attenuation for the first gain term, $G_1$, a background noise estimate, $G_n$, is updated as follows:
$$G_n = \begin{cases} G_n + C_{up}, & G_1 > G_n + C_{up} \\ G_n - C_{down}, & G_1 < G_n - C_{down} \\ G_1, & \text{otherwise} \end{cases}$$
where $C_{up}$ and $C_{down}$ are the upward and downward step sizes in dB per update.
The noise estimator thus moves up by 3 dB per second and down by 12 dB per second at the gain update rate of 88.9 updates per second (two gain values per 22.5 ms frame), and it is clamped between 10 and 80 dB. Noise estimation is disabled for repeated frames in order to prevent repeated attenuation. The background noise estimate is also used in the adaptive spectral enhancement calculation, which is described later.
Gain Refinement:
The gain $G_1$ is modified by subtracting a (positive) correction term, $\Delta G$, given in dB by
$$\Delta G = -10 \log_{10}\left(1 - 10^{\,0.1\,(G_n + 3 - G_1)}\right)$$
where $G_n$ is the background noise estimate [dB] and $G_1$ is the first gain term [dB]. The correction is clamped to a maximum value of 6 dB to avoid fluctuations and signal distortion.
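The noise tracking and gain correction can be sketched as follows. The step sizes are derived here from the stated rates (3 dB and 12 dB per second at two gain updates per 22.5 ms frame), so they are approximations rather than the standard's tabulated constants. The 3 dB offset inside the power subtraction, the 20 dB cap applied to the noise estimate, and the function names are assumptions of this sketch.

```python
import math

UPDATES_PER_SECOND = 2 / 0.0225          # two gains per 22.5 ms frame
C_UP = 3.0 / UPDATES_PER_SECOND          # about 0.034 dB per update
C_DOWN = 12.0 / UPDATES_PER_SECOND       # about 0.135 dB per update

def update_noise_estimate(g_noise, gain):
    """Slowly track the background noise level (all values in dB)."""
    if gain > g_noise + C_UP:
        g_noise += C_UP
    elif gain < g_noise - C_DOWN:
        g_noise -= C_DOWN
    else:
        g_noise = gain
    return min(max(g_noise, 10.0), 80.0)

def attenuate_gain(gain, g_noise, offset_db=3.0, max_correction=6.0,
                   noise_cap=20.0):
    """Subtract a power-subtraction correction of at most 6 dB from gain."""
    g_n = min(g_noise, noise_cap)        # only quiet signals are attenuated
    ratio = 10.0 ** (0.1 * (g_n + offset_db - gain))
    if ratio >= 1.0:
        correction = max_correction      # logarithm undefined, use the clamp
    else:
        correction = min(-10.0 * math.log10(1.0 - ratio), max_correction)
    return gain - correction
```

In a decoder loop, `update_noise_estimate` would be called before `attenuate_gain` for each of the two gain terms, and both calls would be skipped for repeated frames.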
To ensure that the attenuation is applied only to quiet signals, the value of $G_n$ used in the correction equation is clamped to an upper limit of 20 dB. The noise estimation and gain modification steps are then repeated for the second gain term, $G_2$. Noise estimation and gain attenuation are disabled for repeated frames.
Parameter Interpolation:
All MELP synthesis parameters are interpolated pitch-synchronously for each synthesized pitch period. The interpolated parameters are the gain (in dB), LSFs, pitch, jitter, Fourier magnitudes, pulse and noise coefficients for mixed excitation, and the spectral tilt coefficient for the adaptive spectral enhancement filter. Gain is linearly interpolated between the second gain of the prior frame, $G_{2,p}$, and the first gain of the current frame, $G_1$, if the starting point, $t_0$, of the new pitch period is less than 90. Otherwise, gain is interpolated between $G_1$ and $G_2$. Normally, the other parameters are linearly interpolated between the past and current frame values. The interpolation factor, int, for these parameters is based on the starting point of the new pitch period:
$$\text{int} = t_0 / 180$$
There are two exceptions to this interpolation procedure. First, if there is an onset with a high pitch frequency, pitch interpolation is disabled and the new pitch is used immediately. This condition is met when $G_1$ is more than 6 dB greater than $G_{2,p}$ and the current frame's pitch period is less than half the prior frame's pitch period. The second exception also involves a gain onset. If $G_2$ differs from $G_{2,p}$ by more than 6 dB, then the LSFs, spectral tilt, and pitch are interpolated using the interpolated gain trajectory as a basis, since the gain is transmitted twice per frame and has a more accurate interpolation path. In this case, the interpolation factor is given by:
$$\text{int} = \frac{G_{int} - G_{2,p}}{G_2 - G_{2,p}}$$
where Gint is the interpolated gain. This interpolation factor is then clamped between 0 and 1.
Mixed Excitation Generation:
The mixed excitation is generated as the sum of the filtered pulse and noise excitations. As described in the Bit Unpacking paragraph, the parameter values (pitch, jitter, bandpass voicing, and Fourier magnitudes) differ between voiced and unvoiced signals.
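The pulse excitation, $e(n)$, is computed using an inverse Discrete Fourier Transform of one pitch period in length. The final equation for the pulse excitation is:
$$e(n) = \frac{1}{T} \sum_{k=0}^{T-1} M(k)\,e^{\,j 2\pi k n / T}, \quad n = 0, 1, \dots, T-1$$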
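Pitch Period Estimation:
The pitch period, T, is the interpolated pitch value plus the jitter times the pitch, where the jitter is the interpolated jitter strength times the output of a uniform random number generator between -1 and 1. This pitch period is rounded to the nearest integer and clamped between 20 and 160. Writing $P_{int}$ for the interpolated pitch, $J_{int}$ for the interpolated jitter strength, and $u$ for the random number, the equation that describes the final pitch period is:
$$T = P_{int} + P_{int}\,J_{int}\,u, \quad -1 \le u \le 1$$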
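A small sketch of the jittered pitch period follows. The function name and the optional `rand` argument, which lets a caller supply the random draw and makes the function easy to test, are assumptions of this example. Ties in rounding are resolved upward here, which is a choice of the sketch rather than something the text specifies.

```python
import math
import random

def jittered_pitch_period(pitch, jitter_strength, rand=None):
    """Pitch period in samples after applying jitter, rounding and clamping.

    pitch           -- interpolated pitch value in samples
    jitter_strength -- interpolated jitter strength, e.g. 0.25 for 25%
    rand            -- optional callable returning a number in [-1, 1]
    """
    u = random.uniform(-1.0, 1.0) if rand is None else rand()
    period = pitch + pitch * jitter_strength * u
    period = int(math.floor(period + 0.5))   # round to nearest integer
    return max(20, min(160, period))
```

With a jitter strength of 0.25 and a pitch of 50 samples, the resulting period stays within roughly 38 to 63 samples.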
All of the phases for the pulse excitation are set to zero, hence M(k) is real. Since the pulse excitation is real, the magnitudes obey:
M(T - k) = M(k), k = 1, 2, ..., L
where L = T/2 if T is even, and L = (T-1)/2 if T is odd.
The DC term, M(0), is set to 0. Magnitude terms M(k), k = 1, 2, ..., 10, are set to the interpolated values of the Fourier magnitudes, and any magnitudes not otherwise specified are set to 1.
To prevent rapid changes at the start of the pitch period, the pulse excitation is circularly shifted by ten samples of delay, so that the main excitation pulse occurs at the tenth sample of the period. The pulse is then multiplied by the square root of the pitch to give a unity RMS signal, and then multiplied by 1000 to give a nominal signal level. The noise is generated by a uniform random number generator with an RMS value of 1000 and a range of -1732 to 1732.
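The pulse generation steps can be put together in a short sketch: build the symmetric zero-phase magnitude vector, take a direct inverse DFT (cheap enough for periods of at most 160 samples), shift circularly by ten samples, and scale. The function name and the plain-Python DFT are illustrative assumptions, not the reference implementation.

```python
import math

def pulse_excitation(T, fourier_mags):
    """One pitch period of pulse excitation for an integer period T."""
    L = T // 2                      # T/2 for even T, (T-1)/2 for odd T
    M = [1.0] * T                   # unspecified magnitudes default to 1
    M[0] = 0.0                      # DC term is zero
    for k in range(1, min(len(fourier_mags), L) + 1):
        M[k] = fourier_mags[k - 1]
        M[T - k] = fourier_mags[k - 1]   # symmetry M(T-k) = M(k)

    # Zero phase and symmetric magnitudes: the inverse DFT reduces to a
    # cosine sum and the result is real.
    e = [sum(M[k] * math.cos(2.0 * math.pi * k * n / T) for k in range(T)) / T
         for n in range(T)]

    # Circular shift by ten samples, then scale to the nominal level.
    shifted = [e[(n - 10) % T] for n in range(T)]
    scale = math.sqrt(T) * 1000.0
    return [scale * x for x in shifted]
```

With all magnitudes equal to 1 the result is a single dominant pulse at index 10, and the RMS value is close to, but slightly below, 1000 because the DC term has been removed.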
The pulse and noise excitation signals are then filtered and summed to form the mixed excitation. The pulse filter for the current frame is given by the sum of all the bandpass filter coefficients for the voiced frequency bands, while the noise filter is given by the sum of the bandpass filter coefficients for the unvoiced bands.
These filter coefficients are interpolated pitch-synchronously. The bandpass filter coefficients for each of the five bands are given in Appendix A of the MELP Federal Standard specification.
Adaptive Spectral Enhancement:
The adaptive spectral enhancement filter is applied to the mixed excitation signal. This filter is a tenth-order pole/zero filter with additional first-order tilt compensation. Its coefficients are generated by bandwidth expansion of the linear prediction filter transfer function, A(z), corresponding to the interpolated LSFs. The transfer function of the enhancement filter, $H_{ase}(z)$, is given by:
$$H_{ase}(z) = \frac{A(z/\alpha)}{A(z/\beta)}\,(1 + \mu z^{-1})$$
where,
$$\alpha = 0.5\,p, \qquad \beta = 0.8\,p$$
and the tilt coefficient $\mu$ is first calculated from the first reflection coefficient $k_1$, then interpolated, then multiplied by p, the signal probability. The first reflection coefficient, $k_1$, is calculated from the decoded LSFs. By the MELP predictor coefficient sign convention, $k_1$ is usually negative for voiced spectra. The signal probability is estimated by comparing the current interpolated gain, $G_{int}$, to the background noise estimate $G_n$ using the formula:
$$p = \frac{G_{int} - G_n - 12}{18}$$
This signal probability is clamped between 0 and 1.
Linear Prediction Synthesis:
The synthesis uses a direct-form filter with coefficients corresponding to the interpolated LSFs. The interpolated LSFs, reconstructed from the vector quantization parameters, are converted back to LPC filter coefficients, so the LPC synthesis filter is the all-pole filter $1/A(z)$.
Since the excitation is generated at an arbitrary level, the speech gain must be introduced into the synthesized speech. The correct scaling factor, $S_{scale}$, is computed for each synthesized pitch period of length T by dividing the desired RMS value ($G_{int}$ must be converted from dB) by the RMS value of the unscaled synthetic speech signal $\hat{s}_n$:
$$S_{scale} = \frac{10^{\,G_{int}/20}}{\sqrt{\frac{1}{T}\sum_{n=1}^{T} \hat{s}_n^2}}$$
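The scaling step amounts to converting the gain from dB to a linear RMS target and dividing by the measured RMS of the period. A minimal sketch, with a hypothetical function name and a guard for an all-zero period added as an assumption:

```python
import math

def gain_scale_factor(gain_db, unscaled):
    """Scale factor that brings one synthesized period to the target RMS."""
    rms = math.sqrt(sum(x * x for x in unscaled) / len(unscaled))
    if rms == 0.0:
        return 0.0                      # nothing to scale (assumed guard)
    return 10.0 ** (gain_db / 20.0) / rms
```

Multiplying every sample of the period by this factor yields a signal whose RMS value equals the decoded gain expressed on a linear scale.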
To prevent discontinuities in the synthesized speech, this scale factor is linearly interpolated between the previous and current values for the first ten samples of the pitch period.
Pulse Dispersion:
The pulse dispersion filter is a 65th-order FIR filter derived from a spectrally flattened triangle pulse. The coefficients are listed in Appendix B of the MELP Federal Standard specification.
Synthesis Loop Control:
After processing each pitch period, the decoder updates $t_0$ by adding T, the number of samples in the period just synthesized. If $t_0 < 180$, synthesis for the current frame continues from the Parameter Interpolation step. Otherwise, the decoder buffers the remainder of the current period, which extends beyond the end of the current frame, and subtracts 180 from $t_0$ to produce its initial value for the next frame.
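The loop control can be sketched as a small helper that lists where each pitch period starts inside a 180-sample frame and how far the last period reaches into the next frame. The function name and the `next_period` callback, which stands in for the interpolation and pitch period steps, are hypothetical.

```python
FRAME_SIZE = 180

def frame_period_starts(t0, next_period):
    """Return (starts, carry) for one frame.

    t0          -- starting point of the first pitch period in this frame
    next_period -- callable taking the current start and returning the
                   period length T in samples (hypothetical stand-in)
    starts      -- starting points of all periods synthesized in the frame
    carry       -- initial t0 for the next frame
    """
    starts = []
    while t0 < FRAME_SIZE:
        starts.append(t0)
        t0 += next_period(t0)
    return starts, t0 - FRAME_SIZE
```

With a constant period of 50 samples and an initial starting point of 0, the periods of the first frame start at 0, 50, 100, and 150, and the next frame begins at starting point 20.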