18EC743-MMC-Module-4 Notes


Multimedia Communications-[18EC743]

Module-4

Audio and Video Compression


Contents
 Audio compression
 Video compression
 Video compression principles
 Video compression standards

Introduction:

 Both audio and most video signals are continuously varying analog signals.
 After digitization, both comprise a continuous stream of digital values, each value representing the amplitude of
a sample of the particular analog signal taken at repetitive time intervals.
 Thus, the compression algorithms associated with digitized audio and video are different from those of text and
image.

Audio Basics:

 Digitization of audio signals is done using the digitization process known as pulse code modulation (PCM).
 PCM involves sampling the (analog) audio signal/waveform at a minimum rate which is twice that of the
maximum frequency component that makes up the signal.
 If the (frequency) bandwidth of the communications channel to be used is less than that of the signal, the signal is
first band-limited to the channel bandwidth and the sampling rate is then determined by the bandwidth of the
channel; the resulting signal is known as a band-limited signal.
 In the case of a speech signal, the maximum frequency component is 10 kHz and hence the minimum sampling
rate is 20 ksps.
 Ex.: Sampling rate and number of bits per sample:
 Sampling rate: Speech = 20 ksps, Music = 40 ksps.
 Number of bits per sample: Speech = 12 bits, Music = 16 bits.
 Stereophonic signal has 2 channels; hence, two signals must be digitized.
 Due to the above reason, this produces a bit rate of 240 kbps for a speech signal and 1.28 Mbps for a
stereophonic music/general audio signal, the latter being the rate used with compact discs.
 In practice, however, in most multimedia applications involving audio, the bandwidth of the communication
channels that are available dictate rates that are less than these.
 It can be achieved in one of two ways:
1. The audio signal is sampled at a lower rate (if necessary, with fewer bits per sample).
2. Compression algorithm is used.
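To make the arithmetic above concrete, the short sketch below (Python, purely illustrative) computes the raw PCM bit rates quoted for speech and for stereophonic music.

    def pcm_bit_rate(sampling_rate_hz, bits_per_sample, channels=1):
        # Raw (uncompressed) PCM bit rate in bits per second
        return sampling_rate_hz * bits_per_sample * channels

    print(pcm_bit_rate(20_000, 12))              # speech: 240000 bps = 240 kbps
    print(pcm_bit_rate(40_000, 16, channels=2))  # stereo music: 1280000 bps = 1.28 Mbps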


Disadvantage:
 The quality of the decoded signal is reduced due to the loss of the higher frequency components from the original
signal.
 The use of fewer bits per sample results in the introduction of higher levels of quantization noise.
 Thus often, a compression algorithm is used and its advantages are as follows:
 This achieves a comparable perceptual quality (as perceived by the ear) to that obtained with a higher sampling
rate but with a reduced bandwidth requirement.

Differential pulse code modulation (DPCM):

A DPCM encoder and decoder are shown in Figure (a) and a simplified timing diagram of the encoder is shown in
Figure (b).


 Differential pulse code modulation (DPCM) is a derivative of standard PCM.


 Exploits the fact that, for most audio signals, the range of the differences in amplitude between successive
samples of the audio waveform is less than the range of the actual sample amplitudes.
 Hence, if only the digitized difference signal is used to encode the waveform then fewer bits are required than for
a comparable PCM signal with the same sampling rate.
Operation:
 The previous digitized sample of the analog input signal is held in the register (R), a temporary storage facility,
and the difference signal (DPCM) is computed by subtracting the current contents of the register (R0) from the
new digitized sample output by the ADC (PCM).
 The value in the register is then updated by adding to the current register contents the computed difference signal
output by the subtractor prior to its transmission.
 The decoder operates by simply adding the received difference signal (DPCM) to the previously computed signal
held in the register (PCM).
 Typical savings with DPCM, are limited to just 1 bit, which, for a standard PCM voice signal, example, reduces
the bit rate requirement from 64kbps to 56 kbps.
 For the above circuit , the output of the ADC is used directly and hence the accuracy of each computed
difference signal also known as the residual (signal) is determined by the accuracy of the previous signal/value
held in the register.
 All ADC operations produce quantization errors and hence a string of, say, positive errors will have a
cumulative effect on the accuracy of the value that is held in the register.
 Thus, with a basic DPCM scheme, the previous value held in the register is only an approximation.
 For this reason, a more sophisticated technique has been developed for estimating the previous value, known as
predicting.
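Before moving on to prediction, the following minimal sketch (Python, illustrative; integer samples, no quantizer shown) captures the register-and-subtractor loop described above.

    def dpcm_encode(pcm_samples):
        # Difference signal: DPCM = PCM - R (previous value held in the register)
        register = 0
        dpcm = []
        for sample in pcm_samples:
            dpcm.append(sample - register)
            register = sample          # register updated ready for the next sample
        return dpcm

    def dpcm_decode(dpcm_values):
        # Decoder: add each received difference to the previously computed value
        register = 0
        pcm = []
        for d in dpcm_values:
            register += d
            pcm.append(register)
        return pcm

    pcm = [10, 12, 13, 11, 9]
    assert dpcm_decode(dpcm_encode(pcm)) == pcm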


 Prediction is implemented by estimating the previous signal using not only the estimate of the current
signal, but also varying proportions of a number of the immediately preceding estimated signals.
 The proportions used are determined by what is known as predictor coefficients and the principle is shown in the
figure below.
 In this scheme, the difference signal is computed by subtracting varying proportions of the last three predicted
values from the current digitized value output by the ADC.
 Ex.: If the three predictor coefficients have the values C1 = 0.5 and C2 = C3 = 0.25, then the contents of register
R1 would be shifted right by 1 bit (thereby multiplying its contents by 0.5) and registers R2 and R3 by 2 bits
(multiplying their contents by 0.25).
 The three shifted values are then-added together and the resulting sum subtracted from the current digitized value
output by the ADC (PCM).
 The current contents of register R1 are then transferred to register R2 and those of register R2 to register R3.
 The new predicted value is then loaded into register R1 ready for the next sample to be processed.
 The decoder operates in a similar way by adding the same proportions of the last three computed PCM signals to
the received DPCM signal.
 By using this approach, a similar performance level to standard PCM is obtained by using only 6 bits for the
difference signal which, at a sampling rate of 8 ksps, produces a bit rate of 48 kbps.
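A minimal sketch of the third-order prediction step is given below (Python, illustrative). It assumes the registers hold reconstructed values (prediction plus difference) so that encoder and decoder stay in step; the shift-based multiplications follow the C1 = 0.5, C2 = C3 = 0.25 example above.

    def predict(r1, r2, r3):
        # 0.5*R1 + 0.25*R2 + 0.25*R3 implemented as right shifts (integer samples)
        return (r1 >> 1) + (r2 >> 2) + (r3 >> 2)

    def encode_step(pcm, regs):
        r1, r2, r3 = regs
        p = predict(r1, r2, r3)
        difference = pcm - p                       # DPCM value that is transmitted
        # R2 <- R1, R3 <- R2; the reconstructed value (p + difference) goes into R1
        return difference, (p + difference, r1, r2)

    def decode_step(difference, regs):
        r1, r2, r3 = regs
        sample = predict(r1, r2, r3) + difference
        return sample, (sample, r1, r2)

    regs_enc = regs_dec = (0, 0, 0)
    for pcm in [100, 104, 110, 108]:
        d, regs_enc = encode_step(pcm, regs_enc)
        s, regs_dec = decode_step(d, regs_dec)
        assert s == pcm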

Adaptive differential PCM:

 Definition: a varying number of bits is used for the difference signal depending on its amplitude; that is,
fewer bits are used to encode (and hence transmit) smaller difference values than larger values.
 Advantages: saving in bandwidth and improvement in audio quality.
 An international standard for this is defined in ITU-T Recommendation G.721.
 This is based on the same principle as DPCM except that an eighth-order predictor is used and the number of bits used
to quantize each difference value is varied.
Variants:
 4 bits per difference value, producing 32 kbps, to obtain a better quality output than with third-order DPCM, or
2 bits, producing 16 kbps, if lower bandwidth is more important.
 A second ADPCM standard, which is a derivative of G.721, is defined in ITU-T Recommendation G.722.
 This provides better sound quality than the G.721 standard at the expense of added complexity.
 To achieve this, it uses an added technique known as subband coding.
 The input speech bandwidth is extended to be from 50 Hz through to 7 kHz compared with 3.4 kHz for a
standard PCM system and hence the wider bandwidth produces a higher-fidelity speech signal.
 This is particularly important in conferencing applications, in order to enable the members of the conference to
discriminate between the different voices of the members present.
 The general principle of the scheme is shown in Figure below


 To allow for the higher signal bandwidth, prior to sampling the audio input signal, it is first passed through two
filters:
 One which passes only signal frequencies in the range 50 Hz through to 3.5 kHz, and the other only frequencies
in the range 3.5 kHz through to 7 kHz.
 In this way, the input (speech) signal is effectively divided into two separate equal-bandwidth
signals, the first known as the lower subband signal and the second the upper subband signal.
 Each subband is then sampled and encoded independently using ADPCM, the sampling rate of the upper
subband signal being 16ksps to allow for the presence of the higher frequency components in this subband.
 The use of two subbands has the advantage that different bit rates can be used for each subband.
 The frequency components that are present in the lower subband signal have a higher perceptual importance than
those in the higher subband.
 The operating bit rate can be 64, 56, or 48 kbps. With a bit rate of 64 kbps, the lower subband is ADPCM
encoded at 48 kbps and the upper subband at 16 kbps.
 The two bitstreams are then multiplexed (merged) together to produce the transmitted (64kbps) signal in such a
way that the decoder in the receiver is able to divide them back again into two separate streams for decoding.
 A third standard based on ADPCM is also available. This is defined in ITU-T Recommendation G.726.
 This also uses subband coding but with a speech bandwidth of 3.4 kHz. The operating bit rate can be 40, 32, 24,
or 16 kbps.
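For the 64 kbps mode of the subband scheme described above, the arithmetic works out as in the sketch below (Python, illustrative). The per-subband bit allocations (6 and 2 bits) and the effective 8 ksps sample rate per subband are assumptions used only for illustration; the text itself only fixes the 48 kbps + 16 kbps split.

    # G.722-style 64 kbps mode: lower subband at 48 kbps, upper subband at 16 kbps
    subbands = {
        "lower (50 Hz - 3.5 kHz)": (8_000, 6),   # (samples per second, bits per sample) - assumed
        "upper (3.5 kHz - 7 kHz)": (8_000, 2),
    }
    total = 0
    for name, (rate, bits) in subbands.items():
        kbps = rate * bits // 1000
        total += kbps
        print(f"{name}: {kbps} kbps")
    print(f"multiplexed stream: {total} kbps")   # 48 + 16 = 64 kbps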


Adaptive predictive coding:


 Higher levels of compression at higher levels of complexity can be obtained by making the predictor coefficients
adaptive. This is the principle of adaptive predictive coding (APC).
 Here the predictor coefficients continuously change.
 Practically, the optimum set of predictor coefficients continuously vary since they are a function of the
characteristics of the audio signal being digitized.
 To exploit this property, the input speech signal is divided into fixed time segments and, for each segment, the
currently prevailing characteristics are determined.
 The optimum sets of coefficients are then computed and these are used to predict the signal more
accurately.
 This type of compression can reduce the bandwidth requirements to 8 kbps while still obtaining an acceptable
perceived quality.

Linear predictive coding:


 The key to this approach is to identify the set of perceptual features to be used. In terms of speech, the three
features which determine the perception of a signal by the ear are its:
1. Pitch: this is closely related to the frequency of the signal and is important because the ear is more sensitive to
frequencies in the range 2-5 kHz than to frequencies that are higher or lower than these;
2. Period: this is the duration of the signal;
3. Loudness: this is determined by the amount of energy in the signal.
 The origins of the sound are important. These are known as vocal tract excitation parameters and classified as:
Voiced sounds: these are generated through the vocal cords and examples include the sounds relating to the
letters m, v, and l;
Unvoiced sounds: with these the vocal cords are open and examples include the sounds relating to the letters f
and s.
 When the above features have been obtained from the source waveform, it is possible to use them, together with a
suitable model of the vocal tract, to generate a synthesized version of the original speech signal.
 The basic features of an LPC encoder/decoder are shown in Figure below.
Operation:
 The input speech waveform is first sampled and quantized at a defined rate.
 A block of digitized samples known as a segment is then analyzed to determine the various perceptual
parameters of the speech that it contains.
 The speech signal generated by the vocal tract model in the decoder is a function of the present output of the
speech synthesizer as determined by the current set of model coefficients plus a linear combination of the
previous set of model coefficients.
 Thus, the vocal tract model used is adaptive and, the encoder determines and sends a new set of coefficients for
each quantized segment.


 Hence, the output of the encoder is a string of frames, one for each segment.
 Each frame contains fields for pitch and loudness (the period is determined by the sampling rate being used), a
notification of whether the signal is voiced or unvoiced, and a new set of computed model coefficients.
 Some LPC encoders use up to ten sets of previous model coefficients to predict the output sound (LPC-10) and
use bit rates as low as 2.4 kbps or even 1.2 kbps.
 However, the generated sound at these rates is often very synthetic and hence LPC coders are used primarily in
military application in which bandwidth is all-important.
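The frame structure just described can be sketched as a simple data type (Python, illustrative only; the field names, types, and the fixed-length segmentation helper are assumptions, not the encoding defined by any particular LPC standard).

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LPCFrame:
        # One frame output by the encoder for a single analyzed segment
        pitch: float                # fundamental frequency estimate for the segment
        loudness: float             # energy in the segment
        voiced: bool                # voiced (vocal cords vibrating) or unvoiced
        coefficients: List[float]   # vocal tract model coefficients for this segment

    def segments(samples, segment_length):
        # Split the digitized speech into fixed-length segments for analysis
        return [samples[i:i + segment_length]
                for i in range(0, len(samples), segment_length)]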

Code-excited LPC:
 Synthesizers used in most LPC decoders are based on a very basic model of the vocal tract.
 A more sophisticated version of this, known as a code-excited linear prediction (CELP) model, is also used.
 Ex.: enhanced excitation (LPC) models.
 Such models are used primarily for applications in which the amount of bandwidth available is limited but the perceived quality
of the speech must be of an acceptable standard for use in various multimedia applications.
 In this model, for encoding purposes, just a limited set of segments is used, each known as a waveform template.
 A precomputed set of templates are held by the encoder and decoder in what is known as a template codebook.
 Each of the individual digitized samples that make up a particular template in the codebook are differentially
encoded
 Each codeword that is sent selects a particular template from the codebook whose difference values best match
those quantized by the encoder.
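The selection step can be sketched as a nearest-template search (Python, illustrative; mean-squared error is used here as the matching measure, which is an assumption - the CELP standards define their own criteria).

    def best_template(codebook, target_segment):
        # Return the index (codeword) of the template that best matches the target
        def mse(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        return min(range(len(codebook)), key=lambda i: mse(codebook[i], target_segment))

    codebook = [[0, 1, 2, 3], [3, 2, 1, 0], [1, 1, 1, 1]]   # toy waveform templates
    print(best_template(codebook, [1, 1, 1, 2]))            # -> 2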


 Hence there is continuity from one set of samples to another; as a result, an improvement in sound quality is
obtained.
 There are four international standards available: ITU-T Recommendations G.728, G.729, G.729(A), and
G.723.1, all of which give a good perceived quality at low bit rates.
 All coders of this type have a delay associated with them, which is incurred while each block of digitized
samples is analyzed by the encoder and the speech is reconstructed at the decoder.
 The combined delay value is known as the coder's processing delay.
 Also, before the speech samples can be analyzed, it is necessary to buffer the block of samples in memory.
 The time to accumulate the block of samples is known as the algorithmic delay and, in some CELP coders,
this is extended to include samples from the next successive block, a technique known as lookahead.
 These delays in the coders are in addition to the end-to-end transmission delay over the network.
 Nevertheless, the combined delay value of a coder is an important parameter as it often determines the suitability
of the coder for a specific application.
Example:
 For a conventional telephony application, a low-delay coder is required since a large delay can impede the flow
of a conversation.
 For an interactive application that involves the output of speech stored in a file, a delay of several seconds before
the speech starts to be output is often acceptable and hence the coder's delay is less important.
 Other parameters of the coder that are considered are the complexity of the coding algorithm and the perceived
quality of the output speech.
 In general, a compromise has to be reached between a coder's speech quality and its delay/complexity.
 Application: both LPC and CELP are used primarily in telephony applications for the compression of
speech.

Perceptual Coding:
 Perceptual encoders are designed for the compression of general audio, such as that associated with a digital television
broadcast.
 They use a model known as a psychoacoustic model, whose role is to exploit a number of the limitations of the human ear.
 Here, sampled segments of the source audio waveform are analyzed - as with CELP-based coders - but only those
features that are perceptible to the ear are transmitted.
 Ex.: although the human ear is sensitive to signals in the range 15 Hz through to 20 kHz, the level of sensitivity to each
signal is non-linear; that is, the ear is more sensitive to some signals than others.
 When multiple signals are present a strong signal may reduce the level of sensitivity of the ear to other signals
which are near to it in frequency, this effect is known as frequency masking.
 When the ear hears a loud sound, it takes a short but finite time before it can hear a quieter sound, an effect
known as temporal masking.
 A psychoacoustic model is used to identify those signals that are influenced by both these effects.


 These are then eliminated from the transmitted signal and, in so doing, the amount of information to be
transmitted is reduced.
 Application: a range of general audio compression algorithms.

Sensitivity of the ear:


 For the ear, its dynamic range is the ratio of the loudest sound it can hear to the quietest sound and is in the
region of 96dB.
 However, the sensitivity of the ear varies with the frequency of the signal and, assuming just a single frequency signal
is present at any one time, the perception threshold of the ear - that is, its minimum level of sensitivity as a
function of frequency - is shown in the figure below.
 The ear is most sensitive to signals in the range 2-5 kHz, and hence signals within this band are the quietest
sounds the ear can detect.
 The vertical axis indicates the amplitude level of all the other signal frequencies, relative to this level and
measured in dB, that is required for them to be heard.
 Thus, in the figure, although the two signals A and B have the same relative amplitude, signal A would be
heard - that is, it is above the hearing threshold - while signal B would not.
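The 96 dB figure ties in with the 16 bits per sample used for music earlier in the module: a linear quantizer gives roughly 6 dB of dynamic range per bit, as the short check below shows (Python; the 6 dB-per-bit rule is a standard approximation, not something stated in the text).

    import math

    def dynamic_range_db(bits):
        # Approximate dynamic range of a linear quantizer: 20*log10(2**bits), ~6.02 dB per bit
        return 20 * math.log10(2 ** bits)

    print(round(dynamic_range_db(16), 1))   # ~96.3 dB, comparable to the ear's ~96 dB range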

Frequency masking:
 When an audio sound consists of multiple frequencies, the sensitivity of the ear changes and varies with the
relative amplitude of the signals.
 The curve shown below illustrates how the sensitivity of the ear changes in the vicinity of a loud signal. Signal B
is larger in amplitude than signal A, and this causes the basic sensitivity curve of the ear to be distorted in the
region of signal B.
 As a result, signal A will no longer be heard even though, on its own, it is above the hearing threshold of the ear
for a signal of that frequency.
 This is the origin of the term frequency masking and, in practice, the masking effect also varies with frequency
as we show in Figure below.
 Different curves show the masking effect of a selection of different frequency signals - 1, 4, and 8 kHz - and the
width of the masking curves - that is, the range of frequencies that are affected - increases with increasing
frequency. The width of each curve at a particular signal level is known as the critical bandwidth for that
frequency.

 Experiments have shown that, for frequencies less than 500 Hz, the critical bandwidth remains constant at about
100 Hz. For frequencies greater than 500 Hz, however, the critical bandwidth increases (approximately) linearly
in multiples of 100Hz.
 Example: For a signal of 1 kHz (2 x 500 Hz), the critical bandwidth is 200 Hz (2 x 100 Hz).
 For a signal of 5 kHz (10 x 500 Hz), the critical bandwidth is 1000 Hz (10 x 100 Hz).
Thus, if the magnitude of the frequency components that make up an audio sound can be determined, then it is
possible to determine which frequencies need to be transmitted and which need not be.
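The critical-bandwidth rule just described can be expressed directly (Python, illustrative).

    def critical_bandwidth_hz(frequency_hz):
        # Constant 100 Hz below 500 Hz, then increasing (roughly) linearly in multiples of 100 Hz
        if frequency_hz < 500:
            return 100.0
        return 100.0 * (frequency_hz / 500.0)

    print(critical_bandwidth_hz(1_000))   # 200.0 Hz  (2 x 100)
    print(critical_bandwidth_hz(5_000))   # 1000.0 Hz (10 x 100)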

Temporal masking:
 Definition: it is the time taken by the human ear to hear a quiet sound after hearing a loud sound.
 Temporal masking, or non-simultaneous masking, occurs when a sudden stimulus sound makes inaudible other
sounds which are present immediately preceding or following the stimulus.
 The general effect of temporal masking is as shown in the graph.
 From the figure we see that after a loud sound ceases it takes a short but finite time for the signal amplitude to
decay.
 During this decay time, signals whose amplitudes are less than the decay envelope will not be heard and thus need
not be transmitted.
 In order to exploit temporal masking, it is necessary to process the input audio waveform over a time period that
is comparable with that associated with temporal masking.


MPEG audio Coders:

 The Moving Picture Experts Group (MPEG) was formed by the ISO to formulate a set of standards relating to a
range of multimedia applications that involve the use of video with sound.
 The coders associated with the audio compression part of these standards are known as MPEG audio coders and a
number of these use perceptual coding.
 All the signal processing operations associated with a perceptual coder are carried out digitally; a schematic
diagram of a basic encoder and decoder is shown in the figure below.

 The time-varying audio input signal is first sampled and quantized using PCM, the sampling rate and number
of bits per sample being determined by the specific application.
 The bandwidth that is available for transmission is divided into a number of frequency subbands using a
bank of analysis filters which, because of their role, are also known as critical-band filters.
 Each frequency subband is of equal width and, essentially, the bank of filters maps each set of 32 (time-
related) PCM samples into an equivalent set of 32 frequency samples, one per subband. Hence each is
known as a subband sample and indicates the magnitude of each of the 32 frequency components that are
present in a segment of the audio input signal of a time duration equal to 32 PCM samples.
 Ex.: assuming 32 subbands and a sampling rate of 32 ksps - and thus a maximum signal frequency of 16 kHz -
each subband has a bandwidth of 500 Hz.
 In a basic encoder, the time duration of each sampled segment of the audio input signal is equal to the time to
accumulate 12 successive sets of 32 PCM (and hence subband) samples; that is, a time duration equal to
12 x 32 = 384 PCM samples.
 In addition to filtering the input samples into separate frequency subbands, the analysis filter bank also
determines the maximum amplitude of the 12 subband samples in each subband.
 Each of these is known as the scaling factor for the subband and these are passed both to the psychoacoustic
model and, together with the set of frequency samples in each subband, to the corresponding quantizer block.
 The processing associated with both frequency and temporal masking is carried out by the psychoacoustic
model which is performed concurrently with the filtering and analysis operations.
 The 12 sets of 32 PCM samples are first transformed into an equivalent set of frequency components using a
mathematical technique known as the discrete Fourier transform.
 Using the known hearing thresholds and masking properties of each subband, the model determines the
various masking effects of this set of signals.
 The output of the model is a set of what are known as signal-to-mask ratios (SMRs), which indicate those
frequency components whose amplitude is below the related audible threshold.
 In addition, the set of scaling factors is used to determine the quantization accuracy - and hence the bit allocations -
to be used for each of the audible components.
 This is done so that those frequency components that are in regions of highest sensitivity can
be quantized with more accuracy - and hence less quantization noise - than those in regions where the ear is less
sensitive.
 In a basic encoder, all the frequency components in a sampled segment are encoded and these are carried in a
frame, the format of which is shown in the figure below.

 The header contains information such as the sampling frequency that has been used.
 The quantization is performed in two stages using a form of companding.
 The peak amplitude level in each subband (i.e., the scaling factor) is first quantized using 6 bits (giving 1 of 64
levels) and a further 4 bits are then used to quantize each of the 12 frequency components in the subband relative to
this level.
 Collectively this is known as the subband sample (SBS) format and, in this way, all the information
necessary for decoding is carried within each frame.
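The arithmetic implied by the subband sample (SBS) format above is summarized in the sketch below (Python, illustrative; it counts only the scaling factors and subband samples, ignoring the header, any bit-allocation side information, and the ancillary data).

    SUBBANDS = 32
    SAMPLES_PER_SUBBAND = 12
    SCALE_FACTOR_BITS = 6     # 1 of 64 levels per subband
    SAMPLE_BITS = 4           # each of the 12 samples, relative to the scaling factor

    payload_bits = SUBBANDS * (SCALE_FACTOR_BITS + SAMPLES_PER_SUBBAND * SAMPLE_BITS)
    print(payload_bits)       # 32 * (6 + 48) = 1728 bits per frame

    # At 32 ksps a frame spans 12 x 32 = 384 PCM samples, i.e. 12 ms of audio,
    # so this payload alone corresponds to 1728 bits / 0.012 s = 144 kbps.
    print(payload_bits / (384 / 32_000))   # 144000.0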


 In the decoder, after the magnitude of each set of 32 subband samples has been determined by the
dequantizer, these are passed to the synthesis filter bank.
 The latter then produces the corresponding set of PCM samples which are decoded to produce the time-
varying analog output segment.
 The ancillary data field at the end of a frame is optional and is used, for example, to carry additional coded
samples associated with, say, the surround-sound that is present with some digital video broadcasts.
 The use in the encoder of different scaling factors for each subband means that the frequency components in
the different subbands have varying levels of quantization noise associated with them.
 The bank of synthesis filters in the decoder, however, limits the level of quantization noise in each subband
to the same band of frequencies as the set of frequency components in that subband.
 As a result, the effect of quantization noise is reduced since the signal-to-noise ratio in each subband is
increased by the larger amplitude of the signal frequency components in each subband masking the reduced
level of quantization noise that is-present.
 From the figure we see that, the psychoacoustic model is not required in the decoder and, as a result, it is less
complex than the encoder.
 This is a particularly desirable feature in audio and video broadcast applications as the cost of the decoder is
an important factor.
 Also, it means that different psychoacoustic models can be used or, if bandwidth is plentiful, none at all.
 An international standard based on this approach is defined in ISO Recommendation 11172-3.
 There are three levels of processing associated with this standard, known as layers 1, 2, and 3.

 A higher layer makes better use of the psychoacoustic model and hence a higher compression rate can be
achieved.
 The three layers require increasing levels of complexity (and hence cost) to achieve a particular perceived
quality; the choice of layer and bit rate is often a compromise between the desired perceived quality and the
available bit rate.


Dolby audio coders:

 The psychoacoustic models associated with the various MPEG coders control the quantization accuracy of each
subband sample by computing and allocating the number of bits to be used to quantize each sample.
 As the quantization accuracy that is used for each sample in a subband may vary from one set of subband
samples to the next, the bit allocation information that is used to quantize the samples in each subband is sent
with the actual quantized samples.
 This information is then used by the decoder to dequantize the set of subband samples in the frame.
 This mode of operation of perceptual coder is known, therefore, as the forward adaptive bit allocation mode.
 A simplified schematic diagram showing this operational mode is given in figure (a) below:

 Advantage of MPEG Coders: psychoacoustic model is required only in the encoder.


 Disadvantage of MPEG Coders: a significant portion of each encoded frame contains bit allocation information
which, in turn, leads to a relatively inefficient use of the available bit rate.
 A variation of this approach is to use a fixed bit allocation strategy for each subband which is then used by both
the encoder and decoder.
 The principle of operation of this mode is shown in figure (b) below:


 Typically, the bit allocations that are selected for each subband are determined by the known sensitivity
characteristics of the ear, and the use of fixed allocations means that this information need not be sent in the
frame.
 This approach is used in a standard known as Dolby AC-1, the acronym "AC" meaning acoustic coder.
 Dolby AC-1 was designed for use in satellites to relay FM radio programs and the sound associated with
television programs.
 It uses a low-complexity psychoacoustic model with 40 subbands at a sampling rate of 32 ksps, and
proportionately more at 44.1 and 48 ksps. A typical compressed bit rate is 512 kbps for two-channel stereo.
 A second variation, which allows the bit allocations per subband to be adaptive while at the same time
minimizing the overheads in the encoder bit-stream, is for the decoder also to contain a copy of the
psychoacoustic model.
 This is then used by the decoder to compute the same - or very similar - bit allocations that the psychoacoustic
model in the encoder used to quantize each set of subband samples.
 In order for the psychoacoustic model in the decoder to carry out its own computation of the bit allocations, it is
necessary for it to have a copy of the subband samples.
 Hence with this operational mode, instead of each frame containing bit allocation information in addition to the
set of quantized samples, it contains the encoded frequency coefficients that are present in the sampled
waveform segment.
 The above method is known as the encoded spectral envelope and this mode of operation, the backward
adaptive bit allocation mode.
 Figure(c) below illustrates the principle of operation of both the encoder and decoder.

 This approach is used in the Dolby AC-2 standard which is utilized in many applications including the
compression associated with the audio of a number of PC sound cards.
 Advantage of this method: these produce audio of hi-fi quality at a bit rate of 256 kbps.
 Disadvantage: for broadcast applications, since the psychoacoustic model is required in the decoder, the model in
the encoder cannot be modified without changing all decoders. To meet this requirement, a third variation
has been developed that uses both backward and forward bit allocation principles.
 This is known as the hybrid backward/forward adaptive bit allocation mode and is illustrated in the figure below.


 As we can deduce from part (c) of the figure, with the backward bit allocation method on its own, since the
psychoacoustic model uses the encoded spectral envelope, the quantization accuracy of the subband samples is
affected by the quantization noise introduced by the spectral encoder.
 In the hybrid scheme, although a backward adaptive bit allocation scheme is used as in AC-2 - using PMB - an
additional psychoacoustic model - PMF - is used to compute the difference between the bit allocations computed
by PMB and those that are computed by PMF using the forward adaptive bit allocation scheme.
 This information is then used by PMB to improve the quantization accuracy of the set of subband samples.
 The modification information is also sent in the encoded frame and is used by the PMB in the decoder to
improve the dequantization accuracy.
 In addition, should it be required to modify the operational parameters of the PMB in the encoder and
decoder(s), then information can be sent also with the computed difference information.
 As we can see from the figure, the PMF must compute two sets of quantization information for each set of
subband samples and hence is relatively complex.
 However, since this is not required in the decoder, this is not an issue.
 The hybrid approach is used in the Dolby AC-3 standard, which has been defined for use in a similar range of
applications as the MPEG audio standards, including the audio associated with advanced television (ATV).
 Ex.: the HDTV standard in North America; in this application, the acoustic quality of both the MPEG and
Dolby audio coders was found to be comparable.
 The sampling rate can be 32, 44.1, or 48 ksps depending on the bandwidth of the source audio signal. Each
encoded block contains 512 subband samples. However, in order to obtain continuity from one block to the
next, the last 256 subband samples in the previous block are repeated to become the first 256 samples in the
next block and hence each block contains only 256 new samples.
 Assuming a PCM sampling rate of 32 ksps, although each block of new samples is of 8 ms duration (256/32 ms),
the duration of each encoder block is 16 ms (512/32 ms). The audio bandwidth at this sampling rate is approximately
15 kHz, and each of the 256 subbands has a nominal bandwidth of 62.5 Hz (16 kHz / 256).
 The (stereo) bit rate is, typically, 192 kbps.
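The block-overlap arithmetic for AC-3 quoted above can be checked with the short sketch below (Python, illustrative).

    BLOCK_SIZE = 512          # subband samples per encoded block
    NEW_SAMPLES = 256         # only the second half of each block is new audio
    SAMPLING_RATE = 32_000    # one of the permitted rates (32, 44.1 or 48 ksps)

    print(1000 * BLOCK_SIZE / SAMPLING_RATE)   # 16.0 ms - duration of each encoder block
    print(1000 * NEW_SAMPLES / SAMPLING_RATE)  # 8.0 ms  - new audio contributed per block

    def overlapped_blocks(samples):
        # Split a PCM stream into 512-sample blocks that overlap by 256 samples
        return [samples[i:i + BLOCK_SIZE]
                for i in range(0, len(samples) - BLOCK_SIZE + 1, NEW_SAMPLES)]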


Video Compression:
Video compression is used in a range of applications:
Interpersonal: video telephony and video conferencing.
Interactive: access to stored video in various forms.
Entertainment: digital television and movie/video-on-demand.

Video compression principles:


 With respect to compression, video is simply a sequence of digitized pictures, also referred to as moving
pictures, and the terms "frame" and "picture" are used interchangeably.
 One approach to compressing a video source is to apply the JPEG algorithm to each frame independently.
 This approach is known as moving JPEG or MJPEG.
 The drawback of this approach is that the typical compression ratios obtainable with JPEG are only between 10:1 and 20:1.
 In practice, in addition to the spatial redundancy present in each frame, considerable redundancy is often present
between a set of frames since, in general, only a small portion of each frame is involved with any motion that is
taking place. Ex.: the movement of a person's lips or eyes in a video telephony application and a person or
vehicle moving across the screen in a movie.
 In the latter case, a typical scene in a movie has a minimum duration of about 3 seconds, assuming a frame
refresh rate of 60 frames per second; each scene is composed of a minimum of 180 frames.
 Thus by sending only information relating to those segments of each frame that have movement associated with
them, considerable additional savings in bandwidth can be made by exploiting the temporal differences that exist
between many of the frames.
 The technique that is used to exploit the high correlation between successive frames is to predict the content of
many of the frames.
 This prediction is based on a combination of a preceding and, in some instances, a succeeding frame.
 Instead of sending the source video as a set of individually compressed frames, just a selection is sent in this
form and, for the remaining frames, only the differences between the actual frame contents and the predicted
frame contents are sent.
 The accuracy of the prediction operation is determined by how well any movement between successive frames is
estimated. This operation is known as motion estimation.
 As the estimation process is not exact, additional information must also be sent to indicate any small differences
between the predicted and actual positions of the moving segments involved. This is known as motion
compensation.
Frame types:
 There are two basic types of compressed frame:
 Intra-coded frames or I-frames (frames that are encoded independently).
 Predictive frames (P-frames) and bidirectional frames (B-frames): because of the way they are derived, these
frames are also known as inter-coded or interpolation frames.


 A typical sequence of frames involving just I- and P-frames is shown in the figure below

 A sequence involving all three frame types is shown in Figure below

I Frames:
 I-frames are encoded without reference to any other frames.
 Each frame is treated as a separate (digitized) picture and the Y, Cb and Cr matrices are encoded independently
using the JPEG algorithm except that the quantization threshold values that are used are the same for all DCT
coefficients.
 Thus, the level of compression obtained with I-frames is relatively small.
 In principle, therefore, it would appear to be best if these were limited to, say, the first frame relating to a new
scene in a movie.
 In practice, however, the compression algorithm is independent of the contents of frames and hence has no
knowledge of the start and end of scenes.
 I-frames must be present in the output stream at regular intervals in order to allow for the possibility of the
contents of an encoded I-frame being corrupted during transmission.
 If an I-frame was corrupted then, in the case of a movie since the predicted frames are based on the contents of
an I-frame, a complete scene would be lost which, of course, would be totally unacceptable.
 Therefore, I-frames are inserted into the output stream relatively frequently.
 The number of frames/pictures between successive I-frames is known as a group of pictures or GOP.
 It is represented by the symbol N and typical values for N are from 3 through to 12.


P Frames:
 The encoding of a P-frame is relative to the contents of either a preceding I-frame or a preceding P-frame.
 P-frames are encoded using a combination of motion estimation and motion compensation and hence
significantly higher levels of compression can be obtained with them.
 The number of P-frames between each successive pair of I-frames is limited since any errors present in the first P-
frame will be propagated to the next.
 The number of frames between a P-frame and the immediately preceding I- or P-frame is called the prediction
span.
 It is given the symbol M and typical values range from 1 through to 3, as shown in the figures above.

B Frames:
 Motion estimation involves comparing small segments of two consecutive frames for differences and, should a
difference be detected, a search is carried out to determine to which neighboring segment the original segment
has moved.
 In order to minimize the time for each search, the search region is limited to just a few neighboring segments.
 In applications such as video telephony, the amount of movement between consecutive frames is relatively small
and hence this approach works well.
 In applications that involve very fast moving objects, however, it is possible for a segment to have moved
outside of the search region.
 To allow for this possibility, in applications such as movies, in addition to P-frames, a second type of prediction
frame is used.
 These are the B-frames and, as can be seen in the figure above, their contents are predicted using search regions in both
past and future frames.
 In addition to allowing for occasional moving objects, this also provides better motion estimation when an object
moves in front of or behind another object.


Decoding Operation:

I Frames:

 Information relating to I-frames can be decoded immediately on receipt in order to recreate the original frame.

P Frames:

 The received information is first decoded and the resulting information is then used, together with the decoded
contents of the preceding I- or P-frame, to derive the decoded frame contents.

B Frames:

 The received information is first decoded and the resulting information is then used, together with both the
immediately preceding I- or P-frame contents and the immediately succeeding P- or I-frame contents, to derive
the decoded frame contents.
 In order to minimize the time required to decode each B-frame, the order of encoding (and transmission) of the
(encoded) frames is changed so that both the preceding and succeeding I- or P-frames are available when the B-
frame is received.
 Ex.: Consider an uncoded (display-order) frame sequence given as: IBBPBBPBBI…
 Then the reordered (transmission) sequence would be: IPBBPBBIBB…
 Thus, with B-frames, the decoded contents of both the immediately preceding I- or P- frame and the immediately
succeeding P- or I-frame are available when the B-frame is received.
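A minimal sketch of this reordering is shown below (Python, illustrative): each B-frame is simply held back until the I- or P-frame it depends on has been sent.

    def transmission_order(display_order):
        # Hold B-frames back until the next I- or P-frame (their second reference) is sent
        out, pending_b = [], []
        for frame in display_order:
            if frame == "B":
                pending_b.append(frame)
            else:
                out.append(frame)
                out.extend(pending_b)
                pending_b = []
        return out + pending_b

    print("".join(transmission_order(list("IBBPBBPBBI"))))   # IPBBPBBIBB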

PB Frames:

 This does not refer to a new frame type as such but rather the way two neighboring P- and B-frames are encoded
as if they were a single frame.
 The general principle is as shown in Figure below.
 This is done in order to increase the frame rate without significantly increasing the resulting bit rate required .

D Frames:

 D-frames have been defined for use in specific applications such as movie/video-on-demand.
 With this type of application, a user (at home) can select and watch a particular movie/video which is stored in a
remote server connected to a network.


 The selection operation is performed by means of a set-top box and, as with a VCR, the user may wish to rewind
or fast-forward through the movie.
 Clearly, this requires the compressed video to be decompressed at much higher speeds.
 Hence to support this function, the encoded video stream also contains D-frames which are inserted at regular
intervals throughout the stream.
 These are highly compressed frames and are ignored during the decoding of P- and B- frames.
 In the DCT compression algorithm, the DC coefficient associated with each 8 x 8 block of pixels - both for the
luminance and the two chrominance signals - is the mean of all the values in the related block.
 Hence, by using only the encoded DC coefficients of each block of pixels in the periodically inserted D-frames, a
low-resolution sequence of frames is provided each of which can be decoded at the higher speeds that are
expected with the rewind and fast- forward operations.

Motion estimation and Compensation:


 The encoded contents of both P- and B-frames are predicted by estimating any motion that has taken place
between the frame being encoded and the preceding I- or P-frame and, in the case of B-frames, the succeeding P-
or I-frame.

 The various steps that are involved in encoding each P-frame are shown in Figure below.
 In figure (a) above, the digitized contents of the Y matrix associated with each frame are first divided into two-
dimensional blocks of 16 x 16 pixels, each known as a macroblock.
 Ex.: the 4:1:1 digitization format is assumed and hence the related Cb and Cr matrices in the macroblock are both
8 x 8 pixels.
 For identification purposes, each macroblock has an address associated with it and, since the block size used for
the DCT operation is also 8 x 8 pixels, a macroblock comprises four DCT blocks for luminance and one each for
the two chrominance signals.
 To encode a P-frame, the contents of each macroblock in the frame - known as the target frame - are compared on
a pixel-by-pixel basis with the contents of the corresponding macroblock in the preceding I- or P-frame.
 The latter is known as the reference frame.


 If a close match is found, then only the address of the macroblock is encoded. If a match is not found, the search
is extended to cover an area around the macroblock in the reference frame.

 Typically, this comprises a number of macroblocks as shown in Figure above.


 In practice, the various standards do not specify either the extent of the search area or a specific search strategy
and instead specify only how the results of the search are to be encoded.
 Normally, only the contents of the Y matrix are used in the search, and a match is said to be found if the mean of
the absolute errors in all the pixel positions in the difference macroblock is less than a given threshold.
 Hence, using a particular strategy, all the possible macroblocks in the selected search area in the reference frame
are searched for a match and, if a close match is found, two parameters are encoded.
 The first is known as the motion vector and indicates the (x, y) offset between the macroblock being encoded and the
location of the block of pixels in the reference frame which produces the (close) match.
 The search - and hence offset - can be either on macroblock boundaries or, as in the figure, on pixel boundaries.
The motion vector is then said to be single-pixel resolution.
 The second parameter is known as the prediction error and comprises three matrices (one each for Y, Cb and Cr),
each of which contains the difference values (in all the pixel locations) between those in the target macroblock
and the set of pixels in the search area that produced the close match.
 Since the physical area of coverage of a macroblock is small, the motion vectors can have relatively large values.
 Also, most moving objects are normally much larger than a single macroblock. Hence, when an object moves,
multiple macroblocks are affected in a similar way.
 The motion vectors, therefore, are encoded using differential encoding (DE) and the resulting codewords are then
Huffman encoded.
 The three difference matrices, however, are encoded using the same steps as for I-frames: DCT, quantization,
entropy encoding.


 Finally, if a match cannot be found - for example if the moving object has moved out of the extended search area
- the macroblock is encoded independently in the same way as the macroblocks in an I frame.
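A minimal sketch of the search described above is given below (Python, illustrative; frames are plain lists of pixel rows, the search range and matching threshold are assumed values, and an exhaustive single-pixel-resolution search is used rather than any particular fast strategy).

    def mean_abs_error(target_block, candidate_block):
        # Mean of the absolute pixel differences between two equal-sized blocks
        total = sum(abs(t - c)
                    for row_t, row_c in zip(target_block, candidate_block)
                    for t, c in zip(row_t, row_c))
        return total / (len(target_block) * len(target_block[0]))

    def block(frame, top, left, size):
        return [row[left:left + size] for row in frame[top:top + size]]

    def motion_estimate(target_mb, ref_frame, mb_top, mb_left,
                        size=16, search_range=8, threshold=4.0):
        # Returns ((dx, dy), error) for the best match in the search area, or None if
        # no candidate is below the threshold (the macroblock is then intra-coded)
        best = None
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                top, left = mb_top + dy, mb_left + dx
                if (top < 0 or left < 0 or
                        top + size > len(ref_frame) or left + size > len(ref_frame[0])):
                    continue
                err = mean_abs_error(target_mb, block(ref_frame, top, left, size))
                if best is None or err < best[1]:
                    best = ((dx, dy), err)
        return best if best is not None and best[1] < threshold else None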

Encoding B-frame:
 Any motion is estimated with reference to both the immediately preceding I- or P-frame and the immediately
succeeding P- or I-frame.
 The general scheme is shown in Figure below:

 The motion vector and difference matrices are computed using first the preceding frame as the reference and then
the succeeding frame as the reference.
 A third motion vector and set of difference matrices are then computed using the target and the mean of the two
other predicted sets of values.
 The set with the lowest difference matrices is then chosen and these are encoded in the same way as for P-
frames. The motion vector is then said to be in a resolution of a fraction of a pixel, e.g. half-pixel resolution.

Dept. of ECE-ATMECE-Mysuru Page 23


Multimedia Communications-[18EC743]

Implementation of Frames:

 A schematic diagram showing the essential units associated with the encoding of I-frames is given in the
figure above.
 From the figure above we can deduce that, the encoding procedure used for the macroblocks that make up an I-
frame is the same as that used in the JPEG standard to encode each 8 x 8 block of pixels.
 The procedure involves each macroblock being encoded using the three steps:
Forward DCT, Quantization and Entropy Encoding.
 Hence, assuming four blocks for luminance and two for chrominance, each macroblock would require six 8 x 8
pixel blocks to be encoded.

 A schematic diagram showing the essential units associated with the encoding of P-frames is given in the
figure above.
 In the case of P-frames, the encoding of each macroblock is dependent on the output of the motion estimation
unit which, in turn, depends on the contents of the macroblock being encoded and the contents of the macro-
block in the search area of the reference frame that produces the closest match to that being encoded.
 There are three possibilities:
1. If the two contents are the same, only the address of the macroblock in the reference frame is encoded.
2. If the two contents are very close, both the motion vector and the difference matrices associated with the
macroblock in the reference frame are encoded.
3. If no close match is found, then the target macro-block is encoded in the same way as a macroblock in an I-
frame.


 As we can see in figure (b), in order to carry out its role, the motion estimation unit, which contains the search logic,
utilizes a copy of the (uncoded) reference frame.
 This is obtained by taking the computed difference values between the frame currently being compressed (the
target frame) and the current reference frame and decompressing them using the dequantize (DQ) plus inverse
DCT (IDCT) blocks.
 After the complete target frame has been compressed, the related set of difference values is used to update the
current reference frame contents ready to encode the next (target) frame.
 The same procedure is followed for encoding B-frames except both the preceding (reference) frame and the
succeeding frame to the target frame are involved.
 A schematic diagram showing the essential units associated with the encoding of B-frames is given in the
figure below.

 An example format of the encoded bitstream output by the encoder is given in the figure below

 Thus, for each macroblock, it is necessary to identify the type of encoding that has been used.
 This is the role of the formatter and a typical format that is used to encode the macroblocks in each frame is
shown in part (d) of the figure.


 The type field indicates the type of frame being encoded - I-, P-, or B- - and the address identifies the location of the
macroblock in the frame.
 The quantization value is the threshold value that has been used to quantize all the DCT coefficients in the
macroblock, and the motion vector is the encoded vector if one is present.
 The blocks present field indicates which of the six 8 x 8 pixel blocks that make up the macroblock are present, if any,
and, for those present, the JPEG-encoded DCT coefficients for each block are given.
 Thus, the amount of information output by the encoder varies and depends on the complexity of the source
video.
 Hence a basic video encoder of the type shown in the figures above generates an encoded bitstream that has a
variable bit rate.
Decoding:
 The decoding of the received bitstream is simpler (and hence faster) than the encoding operation as the time-
consuming motion estimation processing is not required.
 As the encoded bitstream is received and decoded, each new frame is assembled a macroblock at a time.
 Decoding the macroblocks of an I-frame is the same as decoding the blocks in a JPEG-encoded image.
 To decode a P-frame, the decoder retains a copy of the immediately preceding (decoded) I- or P-frame in a buffer
and uses this, together with the received encoded information relating to each macroblock, to build the Y, Cb,
and Cr matrices for the new frame in a second buffer.
 With uncoded macroblocks, the macro-block's address is used to locate the corresponding macroblock in the
previous frame and its contents are then transferred into the second buffer unchanged.
 With fully encoded macroblocks, these are decoded directly and the contents transferred to the buffer.
 With macroblocks containing a motion vector and a set of difference matrices, then these are used, together with
the related set of matrices in the first buffer, to determine the new values for the macroblock in the second buffer.
 The decoding of B-frames is similar, except three buffers are used:
 one containing the decoded contents of the preceding I- or P-frame,
 one for the decoded contents of the succeeding P- or I-frame, and
 one to hold the frame being assembled.

Performance:
 The compression ratio for I-frames and hence all intra-coded frames is similar to that obtained with JPEG and,
for video frames, typically is in the region of 10:1 through 20:1 depending on the complexity of the frame
contents.
 The compression ratio of both P- and B- frames is higher and depends on the search method used.
 However, typical figures are in the region of 20:1 through 30:1 for P- frames and 30:1 through 50:1 for B-
frames.


H.261:
 The H.261 video compression standard has been defined by the ITU-T for the provision of video telephony and
videoconferencing services over an integrated-services digital network (ISDN).
 Thus the standard assumes that the network offers transmission channels of multiples of 64 kbps.
 The standard is also known, therefore, as p x 64 where p can be 1 through 30.
 The digitization format used is either the common intermediate format (CIF) or the quarter CIF (QCIF).
Normally, the CIF is used for videoconferencing and the QCIF for video telephony.
 Each frame is divided into macroblocks of 16 x 16 pixels for compression; the horizontal resolution is reduced
from 360 to 352 pixels to produce an integral number of macroblocks (22 per line).
 Hence, since both formats use subsampling of the two chrominance signals at half the rate used for the
luminance signal, the spatial resolution of each format is:
CIF: Y= 352 x 288, Cb = Cr = 176 x 144
QCIF: Y= 176 x 144, Cb = Cr = 88 x 72
 Progressive (non-interlaced) scanning is used with a frame refresh rate of 30 fps for the CIF and either 15 or 7.5 fps
for the QCIF.
 Only I- and P-frames are used in H.261, with three P-frames between each pair of I-frames.
 The encoding of each of the six 8 x 8 pixel blocks that make up each macroblock in both I- and P-frames - four blocks
for Y and one each for Cb and Cr - is carried out.
 The format of each encoded macroblock is shown in outline in Figure below.

 For each macroblock:

1. The address is used for identification purposes.
2. The type field indicates whether the macroblock has been encoded independently (intra-coded) or with reference
to a macroblock in a preceding frame (inter-coded).
3. The quantization value is the threshold value that has been used to quantize all the DCT coefficients in the
macroblock.
4. The motion vector is the encoded vector, if one is present.
5. The coded block pattern indicates which of the six 8 x 8 pixel blocks that make up the macroblock are present, if
any, and, for those present, the JPEG-encoded DCT coefficients are given for each block.


The format of each complete frame is shown in the figure below

 The picture start code indicates the start of each new (encoded) video frame/picture.
 The temporal reference field is a time stamp that enables the decoder to synchronize each video block with
an associated audio block containing the same time stamp.
 The picture type field indicates whether the frame is an I- or P-frame.
 Although the encoding operation is carried out on individual macroblocks, a larger data structure known as a
group of (macro)blocks (GOB) is also defined. This is a matrix of 11 x 3 macroblocks, the size of which has
been chosen so that both the CIF and QCIF comprise an integral number of GOBs - 12 in the case of the CIF (2 x 6)
and 3 in the case of the QCIF (1 x 3) - which allows interworking between the two formats; the sketch below checks this arithmetic.
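    GOB_MACROBLOCKS = 11 * 3          # a GOB is an 11 x 3 matrix of macroblocks

    def macroblocks(width, height, mb_size=16):
        return (width // mb_size) * (height // mb_size)

    for name, (w, h) in {"CIF": (352, 288), "QCIF": (176, 144)}.items():
        mbs = macroblocks(w, h)
        print(name, mbs, "macroblocks =", mbs // GOB_MACROBLOCKS, "GOBs")
    # CIF:  22 x 18 = 396 macroblocks = 12 GOBs
    # QCIF: 11 x  9 =  99 macroblocks =  3 GOBs

(Python, illustrative only; it simply confirms the GOB counts quoted above.)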
 At the head of each GOB is a unique start code which is chosen so that no valid sequence of variable-length
codewords from the table of codewords used in the entropy encoding stage can produce the same code.
 In the event of a transmission error affecting a GOB, the decoder simply searches the received bitstream for this
code, which signals the start of the next GOB. For this reason the start code is also known as a resynchronization
marker.


 Each GOB has a group number associated with it which allows for a string of GOBs to be missing from a
particular frame.
 This may be necessary, for example, if the amount of (compressed) information to be transmitted is temporarily
greater than the bandwidth of the transmission channel.
 With motion estimation, the amount of information produced during the compression operation varies.
 However, since the transmission bandwidth that is available with the target applications of the H.261 standard is
fixed - 64 kbps or multiples of this - in order to optimize the use of this bandwidth, it is necessary to convert the
variable bit rate produced by the basic encoder into a constant bit rate.
 This is achieved by first passing the encoded bitstream output by the encoder through a first-in, first-out (FIFO)
buffer prior to it being transmitted and then providing a feedback path from this to the quantizer unit within
the encoder.

The role of FIFO buffer is shown in figure below.

Role of FIFO Buffer:


 The output bit rate produced by the encoder is determined by the quantization threshold values that are used; the
higher the threshold the lower the accuracy and hence the lower is the output bit rate.
 Hence, since the same compression technique is used for all the macroblocks in a frame, it is possible to obtain a
constant output bit rate from the encoder by dynamically varying the quantization threshold used. This is the
role of the FIFO buffer.
FIFO Buffer:
 As the name implies, the order of the output from a FIFO buffer is the same as that on input.
 However, since the output rate from the buffer is a constant - determined by the (constant) bit rate of the
transmission channel - if the input rate temporarily exceeds the output rate then the buffer will start to fill.
 Conversely, if the input rate falls below the output rate then the buffer contents will decrease.
 In order to exploit this property, two threshold levels are defined: the low threshold and the high threshold.
 The amount of information in the buffer is continuously monitored and, should the contents fall below the low
threshold, then the quantization threshold is reduced, thereby increasing the output rate from the encoder.


 Conversely, should the contents increase beyond the high threshold, then the quantization threshold is increased in order to reduce the output rate from the encoder.
 Normally, the control procedure operates at the GOB level rather than at the macroblock level.
 Hence, should the high threshold be reached, first the quantization value associated with the GOB is increased and, if this is not sufficient, GOBs are dropped until the overload subsides.
 Of course, any adjustments to the quantization threshold values that are made must also be made to those used in the matching dequantizer.
 In addition, the standard also allows complete frames to be dropped in order to match the frame rate to the level of transmission bandwidth that is available. A minimal sketch of this feedback loop follows.
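The sketch below illustrates the buffer-occupancy feedback described above. The threshold values, step size, and quantizer range are illustrative assumptions only; they are not taken from the H.261 standard.

```python
def adjust_quantizer(buffer_fill: int, quant: int,
                     low: int = 2_000, high: int = 8_000,
                     q_min: int = 1, q_max: int = 31) -> int:
    """Return the new quantization threshold given the current FIFO occupancy (in bits).

    Occupancy above the high threshold -> coarser quantization (fewer bits produced);
    occupancy below the low threshold  -> finer quantization (more bits produced).
    """
    if buffer_fill > high:
        return min(q_max, quant + 2)    # reduce the encoder output rate
    if buffer_fill < low:
        return max(q_min, quant - 2)    # increase the encoder output rate
    return quant                        # occupancy within limits: leave the threshold unchanged
```

In practice the same adjustment would also be signaled to the matching dequantizer, as noted above.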

H.263:
 This video compression standard has been defined by the ITU-T for use in a range of video applications over wireless networks and the PSTN.
 The applications include video telephony, videoconferencing, security surveillance, interactive games playing, and so on, all of which require the output of the video encoder to be transmitted across the network connection in real time as it is output by the encoder.
 The access circuits to the PSTN operate in an analog mode and, to transmit a digital signal over these circuits, a modem is required.
 Typical maximum bit rates over switched connections range from 28.8 kbps through to 56 kbps and hence the requirement of the video encoder is to compress the video associated with these applications down to very low bit rates.
 The basic structure is based on that used in the H.261 standard.
 At bit rates lower than 64 kbps, however, the H.261 encoder gives a relatively poor picture quality, since it uses only I- and P-frames and at low bit rates it has to revert to using a high quantization threshold and a relatively low frame rate.
 The high quantization threshold leads to what are known as blocking artifacts, which are caused by the macroblocks encoded using high thresholds differing visibly from those quantized using lower thresholds.
 The use of a low frame rate can also lead to jerky movements.
 In order to minimize these effects, the H.263 standard has a number of advanced coding options compared with those used in an H.261 encoder.
Digitization Formats:
 The two formats are the QCIF and the sub-QCIF (S-QCIF).
 Each frame is divided into macroblocks of 16 x 16 pixels for compression; the horizontal resolution is reduced from 180 to 176 pixels to produce an integral number (11) of macroblocks.
 Hence, since subsampling of the two chrominance signals is used, the two alternative spatial resolutions are:
QCIF: Y = 176 x 144, Cb = Cr = 88 x 72; S-QCIF: Y = 128 x 96, Cb = Cr = 64 x 48
Progressive scanning is used with a frame refresh rate of either 15 or 7.5 fps. (The corresponding raw bit rates are computed in the sketch below.)
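As a check on these figures, the raw (uncompressed) bit rates of the two formats can be computed as below, assuming 8 bits per sample and the 4:2:0 resolutions quoted above.

```python
def raw_bit_rate(y, cb, cr, fps, bits_per_sample=8):
    """Uncompressed bit rate for one frame format; resolutions given as (width, height)."""
    samples = y[0] * y[1] + cb[0] * cb[1] + cr[0] * cr[1]
    return samples * bits_per_sample * fps

qcif  = raw_bit_rate((176, 144), (88, 72), (88, 72), fps=15)   # ~4.56 Mbps
sqcif = raw_bit_rate((128, 96),  (64, 48), (64, 48), fps=15)   # ~2.21 Mbps
print(f"QCIF   @ 15 fps: {qcif / 1e6:.2f} Mbps")
print(f"S-QCIF @ 15 fps: {sqcif / 1e6:.2f} Mbps")
```

Halving the frame rate to 7.5 fps halves these figures, which shows why a compression algorithm is still essential at PSTN bit rates.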


 The support of both formats is mandatory only for the decoder; the encoder need support only one of them.
 The motion estimation unit is not required in the decoder, which is hence less expensive to implement than the encoder.
 The additional cost to the decoder of supporting both formats is only small.
 However, by having a choice for the encoder, a low-cost encoder design based on the S-QCIF can be used for applications such as games playing, or a more sophisticated design based on the QCIF can be used for videoconferencing.
 The decoder can be the same in both cases as it supports both formats.
Frame Types:
 In order to obtain the higher levels of compression that are needed, H.263 uses I-, P- and B-frames.
 Also, in order to use as high a frame rate as possible, neighboring pairs of P- and B-frames can optionally be encoded as a single entity; the resulting frame is known as a PB-frame and, because of the much reduced encoding overheads that are required, its use enables a higher frame rate to be used with a given transmission channel.
 A PB-frame comprises a B-frame and the immediately succeeding P-frame.
 The encoded information for the macroblocks in both these frames is interleaved, with the information for the P-frame preceding that of the B-frame.
 Hence at the decoder, as the encoded information is received, each macroblock of the P-frame is reconstructed first using the received information relating to the P-macroblock and the retained contents of the preceding P-frame.
 The contents of the reconstructed P-macroblock are then used, together with the received encoded information relating to the macroblock in the B-frame and the retained contents of the preceding P-frame, to bidirectionally predict the decoded contents of the B-macroblock.
 When the decoding of both frames is complete, the B-frame is played out first, followed by the P-frame.

Unrestricted Motion Vectors

 Normally, the motion vectors associated with predicted macroblocks are restricted to defined areas in the reference frame around the location in the target frame of the macroblock being encoded.
 In particular, the search area is restricted to the edge of the frame.
 This means that should a small portion of a potential close-match macroblock fall outside of the frame boundary, then the target macroblock is automatically encoded as for an I-frame.
 This occurs even though the portion of the macroblock within the frame area is a close match.
 To overcome this limitation, in the unrestricted motion vector mode, for those pixels of a potential close-match macroblock that fall outside of the frame boundary the edge pixels themselves are used instead and, should the resulting macroblock produce a close match, then the motion vector, if necessary, is allowed to point outside of the frame area.
 This option is particularly useful with the small digitized frame formats that are used with the H.263 standard.


Error Resilience:

 The target network for the H.263 standard is a wireless network or the PSTN.
 With this type of network there is a relatively high probability that transmission bit errors will be present in the bitstream received by the decoder.
 Normally such errors are characterized by periods when a string of error-free frames is received followed by a short burst of errors that typically corrupts a string of macroblocks within a frame.
 In practice, it is not possible to identify the specific macroblocks that are corrupted but rather only that the related group of blocks (GOB) contains one or more macroblocks that are in error.
 Also, because the contents of many frames are predicted from information in other frames, it is highly probable that the same GOB in each of the following frames that are derived from the GOB in error will also contain errors.
 This means that when an error in a GOB occurs, the error will persist for a number of frames, making the errors more apparent to the viewer.
 When an error in a GOB is detected, the decoder skips the remaining macroblocks in the affected GOB and searches for the resynchronization marker (start code) at the head of the next GOB. It then recommences decoding from the start of this GOB.
 In order to mask the error from the viewer, an error concealment scheme is incorporated into the decoder.
 Since a PSTN provides only a relatively low bit rate transmission channel, to conserve bandwidth, intracoded (I) frames are inserted at relatively infrequent intervals in applications such as video telephony in which the video and audio are being transmitted in real time.
 The lack of I-frames has the effect that errors within a GOB may propagate to other regions of the frame due to the resulting errors in the motion estimation vectors and motion compensation information.
 With digitization formats such as the QCIF the resulting effect can be very annoying to the viewer, since the initial errors can spread to neighboring GOBs. The schemes defined to limit this are error tracking, independent segment decoding, and reference picture selection.
Error Tracking

 In video telephony, a two-way communication channel is required for the exchange of the compressed audio and video information generated by the codec in each terminal.
 This means that there is always a return channel from the receiving terminal back to the sending terminal, and this is used in all three schemes by the decoder in order to inform the related encoder that an error in a GOB has been detected.
 Errors are detected in a number of ways, including:
 one or more out-of-range motion vectors,
 one or more invalid variable-length codewords,
 one or more out-of-range DCT coefficients,
 an excessive number of DCT coefficients within a macroblock.

 In the error tracking scheme, the encoder retains what is known as error prediction information for all the GOBs in each of the most recently transmitted frames; that is, the likely spatial and temporal effects on the macroblocks in the following frames that will result if a specific GOB in a frame is corrupted.
 When an error is detected, the return channel is used by the decoder to send a negative acknowledgment (NAK) message back to the encoder in the source codec containing both the frame number and the location of the GOB in the frame that is in error.
 The encoder then uses the error prediction information relating to this GOB to identify the macroblocks in the GOBs of later frames that are likely to be affected.
 It then proceeds to transmit the macroblocks in these frames in their intracoded form, as shown in the figures below; a minimal sketch of this procedure follows the figures.
 H.263 error tracking scheme: (a) example error propagation

 H.263 error tracking scheme: (b) same example with error tracking applied.
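The encoder-side bookkeeping of the error tracking scheme can be sketched as below. The data structures, message format, and the force_intra callback are assumptions for illustration only.

```python
# Hypothetical sketch: on receiving a NAK(frame, gob) the encoder looks up which
# macroblocks in later frames were predicted (directly or indirectly) from that GOB
# and forces them to be sent in intracoded form.

error_prediction = {}   # (frame_no, gob_no) -> set of (later_frame_no, macroblock_no)

def record_dependency(src_frame, src_gob, dst_frame, dst_mb):
    """Called during encoding whenever a macroblock is predicted from a given GOB."""
    error_prediction.setdefault((src_frame, src_gob), set()).add((dst_frame, dst_mb))

def handle_nak(frame_no, gob_no, force_intra):
    """force_intra(frame_no, macroblock_no) marks a macroblock to be intracoded next time."""
    for later_frame, mb in error_prediction.get((frame_no, gob_no), set()):
        force_intra(later_frame, mb)
```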


Independent Segment Decoding:

 The aim of this scheme is not to overcome errors that occur within a GOB but rather to prevent these errors from affecting neighboring GOBs in succeeding frames.
 To achieve this, each GOB is treated as a separate sub-video which is independent of the other GOBs in the frame.
 Hence the motion estimation and compensation are limited to the boundary pixels of each GOB rather than of the frame.
 Its operation is shown in parts (a) and (b) of the figure below.
 As part (a) shows, although when an error in a GOB occurs the same GOB in each successive frame is affected (until a new intracoded GOB is sent by the encoder), neighboring GOBs are not affected.
 Clearly, however, a limitation of this scheme is that the efficiency of the motion estimation is reduced, the search area being limited to a single GOB.
 The scheme is not normally used on its own but in conjunction with either the error tracking scheme or the reference picture selection scheme.
 Independent segment decoding: (a) effect of a GOB being corrupted.

Independent segment decoding: (b) when used with error tracking.


Reference Picture Selection


 This scheme is similar to error tracking in that it aims to stop errors propagating, with the decoder returning acknowledgment messages when an error in a GOB is detected.
 The scheme can be operated in two different modes:
1) NAK mode and 2) ACK mode.
 NAK mode: only GOBs in error are signaled by the decoder returning a NAK message. Normally, intercoded frames are encoded using the intracoded (I) frame as the initial reference frame.
 However, during the encoding of intercoded frames, a copy of each decoded frame is retained by the encoder.
 Therefore, the encoder can select any of these as the reference; for example, when the NAK relating to frame 2 is received, the encoder selects GOB 3 of frame 1 as the reference to encode GOB 3 of the next frame.
 With this scheme the GOB in error will propagate for a number of frames, the number being determined by the round-trip delay between the NAK being sent by the decoder and an intercoded frame derived from an error-free reference being received.
 Reference picture selection with independent segment decoding: (a) NAK mode

 ACK mode: all frames received without errors are acknowledged by the decoder returning an ACK message.
 Only frames that have been acknowledged are used as reference frames.
 Hence, in the example, the lack of an ACK for frame 3 means that frame 2 must be used to encode frame 6 in addition to frame 5.
 At this point the ACK for frame 4 is received and hence the encoder then uses this to encode frame 7.
 The effect of using a reference frame which is distant from the frame being encoded is to reduce the encoding efficiency for that frame.
 The ACK mode performs best when the round-trip delay of the communication channel is short, that is, less than the time the encoder takes to encode each frame. (A sketch of the ACK-mode bookkeeping is given after the figure below.)


 Reference picture selection with independent segment decoding: (b) ACK mode.
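The encoder-side bookkeeping for the ACK mode can be sketched as follows; the frame numbering and helper names are assumptions for illustration.

```python
# Hypothetical sketch of ACK-mode reference selection: only frames that the decoder
# has acknowledged may be used as the reference for motion-compensated prediction.

acknowledged = set()     # frame numbers the decoder has confirmed as received error-free

def on_ack(frame_no):
    """Called when an ACK message arrives over the return channel."""
    acknowledged.add(frame_no)

def choose_reference(current_frame_no):
    """Return the most recent acknowledged frame, or None if only an I-frame can be sent."""
    candidates = [f for f in acknowledged if f < current_frame_no]
    return max(candidates) if candidates else None
```

The further the chosen reference is from the current frame, the poorer the prediction and hence the lower the encoding efficiency, which is why a short round-trip delay suits this mode.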

MPEG: Motion Pictures Expert Group


 Formed by the ISO to formulate a set of standards relating to a range of multimedia applications that involve the use of video with sound.
 The outcome is a set of three standards which relate to either the recording or the transmission of integrated audio and video streams.
 Each is targeted at a particular application domain and describes how the audio and video are compressed and integrated together.
 The three standards, which use different video resolutions, are MPEG-1, MPEG-2 and MPEG-4.
MPEG-1
 This is defined in a series of documents which are all subsets of ISO Recommendation 11172.
 The video resolution is based on the source intermediate digitization format (SIF) with a resolution of up to 352 x 288 pixels.
 The standard is intended for the storage of VHS-quality audio and video on CD-ROM at bit rates of up to 1.5 Mbps.
 In practice, higher bit rates of multiples of this are more common in order to provide faster access to the stored material.
 The MPEG-1 and MPEG-2 video standards use a similar video compression technique to H.261.


 The digitization format used with MPEG-1 is the SIF. Each frame is divided into macroblocks of 16 x 16 pixels for compression; the horizontal resolution is reduced from 360 to 352 pixels to produce an integral number (22) of macroblocks.
 Since the two chrominance signals are subsampled at half the rate of the luminance signal, the spatial resolutions for the two types of video source are:
 NTSC: Y = 352 x 240, Cb = Cr = 176 x 120
 PAL: Y = 352 x 288, Cb = Cr = 176 x 144
 Progressive scanning is used with a refresh rate of 30 Hz for NTSC and 25 Hz for PAL.
 The standard allows the use of I-frames only, I- and P-frames only, or I-, P- and B-frames. No D-frames are supported in any of the MPEG standards and hence, in the case of MPEG-1, I-frames must be used for the various random access functions associated with VCRs.
 The accepted maximum random-access time is 0.5 seconds and so this is the main factor, along with video quality, that influences the maximum separation of I-frames in the frame sequence used.
 Two example sequences are: IBBPBBPBBI... and IBBPBBPBBPBBI...
 The first is the original sequence proposed for use with PAL (which has the slower frame refresh rate) and the second is for use with NTSC.
 The compression algorithm used is based on the H.261 standard.
 Hence each macroblock is made up of 16 x 16 pixels in the Y plane and 8 x 8 pixels in the Cb and Cr planes.
 However, there are two main differences.
 The first is that time stamps can be inserted within a frame to enable the decoder to resynchronize more quickly in the event of one or more corrupted or missing macroblocks.
 The number of macroblocks between two time stamps is known as a slice; a slice can comprise from 1 through to the maximum number of macroblocks in a frame.
 Typically a slice is made equal to 22 macroblocks, which is the number of macroblocks in a line.
 The second difference arises because the introduction of B-frames increases the time interval between I- and P-frames.
 To allow for the resulting increase in the separation of moving objects within P-frames, the search window in the reference frame is increased.
 Also, to improve the accuracy of the motion vectors, a finer resolution is used.
 Typical compression ratios vary from about 10:1 for I-frames, 20:1 for P-frames and 50:1 for B-frames.
 At the top level, the completely compressed video is known as a sequence, which in turn consists of a string of groups of pictures (GOPs), each comprising a string of I-, P- or B-pictures/frames in the defined sequence.
 Each picture/frame is made up of N slices, each of which comprises multiple macroblocks, and so on down to individual 8 x 8 pixel blocks.
 Hence, in order for the decoder to decompress the received bitstream, each data structure must be clearly identified within the bitstream.


 MPEG-1 video bitstream structure: (a) composition.

 MPEG-1 video bitstream structure: (b) format.

 The start of the sequence is indicated by a sequence start code. This is followed by three sets of parameters, each of which applies to the complete video sequence.
 The video parameters specify the screen size and aspect ratio, the bitstream parameters indicate the bit rate and the size of the memory/frame buffers that are required, and the quantization parameters contain the contents of the quantization tables that are to be used for the various frame/picture types. These are followed by the encoded video stream, which is in the form of a string of GOPs.
 Each GOP is separated by a start code followed by a time stamp for synchronization purposes and a parameters field that defines the particular sequence of frame types used in the GOP.
 This is then followed by the string of encoded pictures/frames in the GOP.
 Each picture/frame is separated by a picture start code and is followed by a type field (I, P, or B), buffer parameters, which indicate how full the memory buffer should be before the decoding operation starts, and encode parameters, which indicate the resolution used for the motion vectors. This is followed by a string of slices, each comprising a string of macroblocks.
 Each slice is separated by a slice start code followed by a vertical position field, which defines the scan line the slice relates to, and a quantization parameter that indicates the scaling factor that applies to this slice. This is then followed by a string of macroblocks, each of which is encoded in the same way as for H.261. (A sketch of this nesting is given below.)

MPEG-2
 This is defined in a series of documents which are all subsets of ISO Recommendation 13818.
 It is intended for the recording and transmission of studio-quality audio and video.
 The standard covers four levels of video resolution:
 LOW: based on the SIF digitization format with a resolution of up to 352 x 288 pixels. It is compatible with the MPEG-1 standard and produces VHS-quality video. The audio is of CD quality and the target bit rate is up to 4 Mbps.
 MAIN: based on the 4:2:0 digitization format with a resolution of up to 720 x 576 pixels. This produces studio-quality video and the audio allows for multiple CD-quality audio channels. The target bit rate is up to 15 Mbps, or 20 Mbps with the 4:2:2 format.
 HIGH 1440: based on the 4:2:0 digitization format with a resolution of 1440 x 1152 pixels. It is intended for HDTV at bit rates of up to 60 Mbps, or 80 Mbps with the 4:2:2 format.
 HIGH: based on the 4:2:0 digitization format with a resolution of 1920 x 1152 pixels. It is intended for wide-screen HDTV at a bit rate of up to 80 Mbps, or 100 Mbps with the 4:2:2 format.
 In addition, there are five profiles associated with each level: simple, main, spatial resolution, quantization accuracy, and high.
 These have been defined so that the four levels and five profiles collectively form a two-dimensional table which acts as a framework for all standards activities associated with MPEG-2.
 The low level of MPEG-2 is compatible with MPEG-1, MP@ML is used for digital television broadcasting, and the two high levels relate to HDTV.

MP@ML
 This is the standard for digital TV broadcasting. Interlaced scanning is used, with a resulting frame refresh rate of either 30 Hz (NTSC) or 25 Hz (PAL), and the 4:2:0 digitization format is used with a resolution of either 720 x 480 pixels at 30 Hz or 720 x 576 pixels at 25 Hz.
 The output bit rate from the multiplexer can range from 4 Mbps through to 15 Mbps, the actual rate used being determined by the bandwidth available with the broadcast channel.
 The video coding scheme is similar to that used in MPEG-1; the main difference is that interlaced scanning is used. Interlaced scanning means that each frame is made up of two fields, and hence the DCT blocks can be derived in one of two ways: field mode or frame mode. The choice of mode is determined by the amount of motion present in the video. If a large amount of motion is present then it is better to perform the DCT encoding operation on the lines in each field separately, since this will produce a higher compression ratio owing to the shorter time interval between successive fields.
 Alternatively, if there is little movement, the frame mode can be used since the longer time interval between successive frames is then less important. In this case, the macroblocks/DCT blocks are derived from the lines in each complete frame. The standard allows either mode to be used, the choice being determined by the type of video. For example, a live sports event is likely to be encoded using the field mode and a studio-based program using the frame mode.
 In relation to the motion estimation associated with the encoding of macroblocks in P- and B-frames, three different modes are possible: field, frame, and mixed. (A sketch of the mixed-mode decision is given below.)
 Field mode: the motion vector for each macroblock is computed using the search window around the corresponding macroblock in the immediately preceding (I or P) field for P-frames and B-frames and, for B-frames, also the immediately succeeding (P or I) field. The motion vectors therefore relate to the amount of movement that has taken place in the time to scan one field.
 Frame mode: a macroblock in an odd field is encoded relative to that in the preceding/succeeding odd field, and similarly for the macroblocks in even fields. The motion vectors relate to the amount of movement that has taken place in the time to scan two fields, that is, the time to scan a complete frame.
 Mixed mode: the motion vectors for both the field and frame modes are computed and the one with the smallest mean value is selected.
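The mixed-mode decision can be sketched as below. The two vector lists are assumed to come from separate field-mode and frame-mode motion searches (the search routines themselves are not shown), and the mean magnitude is used as the selection criterion.

```python
import math

def mean_magnitude(vectors):
    """Mean length of a list of (dx, dy) motion vectors."""
    return sum(math.hypot(dx, dy) for dx, dy in vectors) / len(vectors)

def choose_motion_mode(field_vectors, frame_vectors):
    """Pick field or frame mode, whichever gives the smaller mean vector value (mixed mode).

    field_vectors / frame_vectors are simply lists of (dx, dy) tuples here.
    """
    if mean_magnitude(field_vectors) <= mean_magnitude(frame_vectors):
        return "field", field_vectors
    return "frame", frame_vectors
```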

HDTV
 There are three standards associated with HDTV:
i. Advanced Television (ATV) in North America
ii. Digital Video Broadcast (DVB) in Europe
iii. Multiple sub-Nyquist Sampling Encoding (MUSE) in Japan and the rest of Asia
 All three standards define the digitization format and the audio and video compression schemes used, and also how the resulting digital bitstreams are transmitted over the different types of broadcast network.
 There is also an ITU-R HDTV specification concerned with the digital format used in TV studios for the production of HDTV programs and also for the international exchange of programs; this defines a 16/9 aspect ratio with 1920 samples per line and 1152 lines per frame.
 Currently, interlaced scanning is used with the 4:2:0 digitization format. In the future it is expected that progressive scanning will be introduced with the 4:2:2 format.
 The ATV standard was formulated by an alliance of a large number of TV manufacturers known as the Grand Alliance. It includes the ITU-R HDTV specification and a second, lower-resolution format; this also uses a 16/9 aspect ratio but with a resolution of 1280 x 720 pixels.
 The video compression algorithm in both cases is based on the main profile at the high level (MP@HL) of MPEG-2 and the audio compression standard is based on Dolby AC-3.
 The DVB HDTV standard is based on a 4/3 aspect ratio and defines a resolution of 1440 samples per line and 1152 lines per frame. This is exactly twice the resolution of the lower-definition PAL digitization format of 720 x 576.
 The video compression algorithm is based on SSP@H1440 (the spatially scalable profile at high 1440) of MPEG-2, which is similar to that used with MP@HL. The audio compression standard is MPEG audio layer 2.
 The MUSE standard is based on a 16/9 aspect ratio with a digitization format of 1920 samples per line and 1035 lines per frame. The video compression algorithm used is MP@HL.
 MPEG-2 DCT block derivation with I-frames: (a) effect of interlaced scanning.

 MPEG-2 DCT block derivation with I-frames: (b) field mode;

 MPEG-2 DCT block derivation with I-frames: (c) frame mode.


MPEG-4

 Initially this standard was concerned with a similar range of applications to those of H.263, each running over very low bit rate channels ranging from 4.8 to 64 kbps.
 Later its scope was expanded to embrace a wide range of interactive multimedia applications over the Internet and the various types of entertainment networks.
 The three MPEG standards are each in three parts: video, audio and systems.
 The video and audio parts are concerned with the way each is compressed and how the resulting bitstreams are formatted.
 The systems part is concerned with how the two streams are integrated together to produce a synchronized output stream.
 The standard contains features to enable a user not only to passively access a video sequence but also to access and manipulate the individual elements that make up each scene within the sequence/video.
 If the accessed video is a computer-generated cartoon, for example, the user may be given the capability by the creator of the video to reposition, delete, or alter the movements of the individual characters within a scene.
 Because of its high coding efficiency with scenes of the type found in applications such as video telephony, the standard is also used over low bit rate networks, for which it is an alternative to the H.263 standard.
Scene Composition:
 MPEG-4 has a number of what are called content-based functionalities. Before being compressed, each scene is defined in the form of a background and one or more foreground audio-visual objects (AVOs). Each AVO is in turn defined in the form of one or more video objects and/or audio objects.
 For example, a stationary car in a scene may be defined using just a single video object, while a person who is talking may be defined using both an audio and a video object.
 In a similar way, each video and audio object may itself be defined as being made up of a number of subobjects; for example, a person's face may be defined in the form of three subobjects: head, eyes and mouth. Once this has been done, the encoding of the background and each AVO is carried out separately.
 An AVO consisting of both audio and video objects has additional timing information relating to it to enable the receiver to synchronize the various objects and subobjects together before they are decoded.
 Each AVO has a separate object descriptor associated with it which allows the object (providing the creator of the AVO allows it) to be manipulated by the viewer prior to it being decoded and played out. The language used to describe and modify objects is called the binary format for scenes (BIFS). This has commands to delete an object and, in the case of a video object, to change its shape, appearance and color, and to animate the object in real time.
 Audio objects have a similar set of commands, for example to change the volume. It is also possible to have multiple versions of an AVO, the first containing the base-level compressed AV streams and the others various levels of enhancement.
 This type of compression is called scaling and allows the encoded contents of an AVO to be played out at a rate and resolution that match those of the interactive terminal being used. At a higher level, the composition of the scene is defined in a separate scene descriptor. This defines the way the various AVOs are related to each other in the context of the complete scene. (A simple compositional sketch is given below.)
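Content-based composition of a scene can be pictured with the sketch below. The object names and the simplified BIFS-like delete operation are assumptions for illustration and do not follow the actual BIFS syntax.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AVObject:
    name: str
    video_subobjects: List[str] = field(default_factory=list)   # e.g. head, eyes, mouth
    audio_subobjects: List[str] = field(default_factory=list)
    visible: bool = True
    volume: float = 1.0

@dataclass
class Scene:
    background: str
    objects: List[AVObject] = field(default_factory=list)

    def delete_object(self, name):
        """Simplified, BIFS-like 'delete' command acting on the scene description."""
        self.objects = [o for o in self.objects if o.name != name]

# Example: a talking person (audio + video object) in front of a static background
scene = Scene(background="street",
              objects=[AVObject("person", ["head", "eyes", "mouth"], ["speech"]),
                       AVObject("car")])
scene.delete_object("car")      # viewer removes the stationary car before play-out
```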


AV compression
 Each AVO is compressed using one of a number of algorithms, the choice depending on the available bit rate of the transmission channel and the sound/picture quality required.
 As part (a) of the figure below shows, the derivation of each video object plane (VOP) is a difficult image processing task. It involves identifying regions of a frame that have similar properties such as color, texture, or brightness.
 Each of the resulting object shapes is then bounded by a rectangle to form the related VOP. A VOP that has no motion associated with it produces a minimum of compressed information and, since objects that move often occupy only a small portion of the scene/frame, the bit rate of the multiplexed video stream is much lower than that obtained with the other standards.
 For applications that support interaction with a particular VOP, a number of advanced coding algorithms are available to perform the shape coding functions.
 MPEG-4 coding principles: (a) encoder/decoder schematics:


 VOP encoder schematic.

Transmission format

 MPEG-4 Part 14, or MP4, is a digital multimedia container format most commonly used to store video and audio, but it can also be used to store other data such as subtitles and still images. Like most modern container formats, it allows streaming over the Internet.
 All information relating to a frame/scene encoded in MPEG-4 is transmitted over a network in the form of a transport stream consisting of a multiplexed stream of packetized elementary streams (PESs).
 The compressed audio and video information relating to each AVO in the scene is called an elementary stream (ES); this is carried in the payload field of a PES packet. Each PES packet contains a type field in the packet header and this is used by the FlexMux layer to identify and route the PES to the related synchronization block in the synchronization layer. The compressed audio and video associated with each AVO is carried in a separate ES. (A minimal demultiplexing sketch follows.)
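A minimal sketch of the demultiplexing step described above is given below, using a hypothetical packet layout (a type field in the header plus a payload); it is not the actual PES or FlexMux format.

```python
from collections import defaultdict

def demultiplex(pes_packets):
    """Route each PES packet to the elementary stream identified by its header type field.

    Each packet is modelled here as a dict: {'type': <stream id>, 'payload': <bytes>}.
    Returns one reassembled elementary stream per type, ready for the synchronization layer.
    """
    streams = defaultdict(bytearray)
    for packet in pes_packets:
        streams[packet['type']] += packet['payload']
    return streams

# Example: one AVO with separate video and audio elementary streams
packets = [{'type': 'video-AVO1', 'payload': b'\x01\x02'},
           {'type': 'audio-AVO1', 'payload': b'\x0a'},
           {'type': 'video-AVO1', 'payload': b'\x03'}]
print({k: bytes(v) for k, v in demultiplex(packets).items()})
```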


Exercise: A digitized video is to be compressed using the MPEG-1 standard. Assuming a frame sequence
of: IBBPBBPBBPBBI... and average compression ratios of 10:1 (I), 20:1 (P) and 50:1 (B), derive the
average bit rate that is generated by the encoder for both the NTSC and PAL digitization formats.
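A worked sketch of this exercise is given below, assuming 8 bits per sample, the MPEG-1 SIF resolutions quoted earlier, and a 12-frame GOP containing one I-, three P- and eight B-frames. Both formats come out at roughly 1 Mbps.

```python
def mpeg1_bit_rate(y, c, fps, ratios=(10, 20, 50), gop=(1, 3, 8), bits=8):
    """Average MPEG-1 bit rate for the sequence IBBPBBPBBPBB (repeated).

    y, c    : (width, height) of the luminance and of each chrominance component
    ratios  : average compression ratios for (I, P, B) frames
    gop     : number of (I, P, B) frames per group of pictures
    """
    raw_frame = (y[0] * y[1] + 2 * c[0] * c[1]) * bits           # bits per uncompressed frame
    bits_per_gop = sum(n * raw_frame / r for n, r in zip(gop, ratios))
    frames_per_gop = sum(gop)
    return bits_per_gop * fps / frames_per_gop                    # bits per second

ntsc = mpeg1_bit_rate((352, 240), (176, 120), fps=30)   # ~1.04 Mbps
pal  = mpeg1_bit_rate((352, 288), (176, 144), fps=25)   # ~1.04 Mbps
print(f"NTSC: {ntsc / 1e6:.2f} Mbps, PAL: {pal / 1e6:.2f} Mbps")
```

The two results are equal because the raw (uncompressed) bit rates of the NTSC and PAL SIF formats are the same (about 30.4 Mbps), so applying the same per-frame-type compression ratios yields the same average output rate.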

