Multimedia Computing
UNIT I
1. Multimedia and Hypermedia :
The Internet Engineering Task Force (IETF) standardizes the technologies. The W3C has listed the following three goals for the WWW: universal access of web resources (by everyone everywhere), effectiveness of navigating available information, and responsible use of posted material.

WWW size was estimated at over one billion pages. Sony unveiled the first Blu-ray Disc prototypes in October 2000, and the first prototype player was released in April 2003 in Japan.

Multimedia Web information systems represent a whole new breed of information systems intended to process, store, retrieve, and disseminate information through the Internet using Web protocols.

Since its inception, the World Wide Web (WWW) has revolutionized the way people interact, learn, and communicate. A whole new breed of information systems has been developed to process, store, retrieve, and disseminate information through the Internet using Web protocols. These systems use hypertext to link documents and present any kind of information, including multimedia, to the end user through a Web browser. Web-based information systems are often called Web Information Systems (WIS). A WIS that uses multimedia as its key component is a Multimedia Web Information System.

For a concrete appreciation of the current state of multimedia software tools available for carrying out tasks in multimedia, we now include a quick overview of software categories and products. These tools are really only the beginning—a fully functional multimedia project can also call for stand-alone programming as well as just the use of predefined tools to fully exercise the capabilities of machines and the Internet.

In courses we teach using this text, students are encouraged to try these tools, producing full-blown and creative multimedia productions.

Yet this textbook is not a “how-to” book about using these tools—it is about understanding the fundamental design principles behind these tools! With a clear understanding of the key multimedia data structures, algorithms, and protocols, a student can make smarter and more advanced use of such tools, fully unleash their potential, and even improve the tools themselves or develop new tools.

The categories of software tools we examine here are

• Music sequencing and notation

• Digital audio

• Graphics and image editing

• Video editing

• Animation

• Multimedia authoring.

3.1 Music Sequencing and Notation :

Cakewalk Pro Audio: Cakewalk Pro Audio is a very straightforward music-notation program for “sequencing.” The term sequencer comes from older devices that stored sequences of notes in the MIDI music language.
Finale, Sibelius: Finale and Sibelius are two composer-level notation systems; these programs likely set the bar for excellence, but their learning curve is fairly steep.

Digital Audio: Digital audio tools deal with accessing and editing the actual sampled sounds that make up audio.

3.2 Graphics and Image Editing

Adobe Illustrator: Illustrator is a powerful publishing tool for creating and editing vector graphics, which can easily be exported for use on the Web.

Adobe Photoshop: Photoshop is the standard tool for graphics, image processing, and image manipulation. Layers of images, graphics, and text can be separately manipulated for maximum flexibility, and its set of filters permits creation of sophisticated lighting effects.

3.3 Video Editing :

Adobe Premiere: Premiere is a simple, intuitive video editing tool for nonlinear editing—putting video clips into any order. Video and audio are arranged in tracks, like a musical score. It provides a large number of video and audio tracks, superimpositions, and virtual clips. A large library of built-in transitions, filters, and motions for clips allows easy creation of effective multimedia productions.

3.4 Animation :

DirectX: DirectX, a Windows API that supports video, images, audio, and 3D animation, is a common API used to develop multimedia Windows applications such as computer games.

Animation Software: Autodesk 3ds Max (formerly 3D Studio Max) includes a number of high-end professional tools for character animation, game development, and visual effects production. Models produced using this tool can be seen in several consumer games, such as for the Sony PlayStation.

3.5 Multimedia Authoring :

Tools that provide the capability for creating a complete multimedia presentation, including interactive user control, are called authoring programs.

Adobe Flash: Flash allows users to create interactive movies by using the score metaphor—a timeline arranged in parallel event sequences, much like a musical score consisting of musical notes. Elements in the movie are called symbols in Flash. Symbols are added to a central repository, called a library, and can be added to the movie's timeline. Once the symbols are present at a specific time, they appear on the Stage, which represents what the movie looks like at a certain time, and can be manipulated and moved by the tools built into Flash. Finished Flash movies are commonly used to show movies or games on the Web.
Windows MetaFile (WMF) is the native vector file format for the Microsoft Windows operating environment. WMF files actually consist of a collection of Graphics Device Interface (GDI) function calls, also native to the Windows environment. When a WMF file is “played” (typically using the Windows PlayMetaFile() function), the described graphic is rendered. WMF files are ostensibly device-independent and unlimited in size. The later Enhanced Metafile Format Plus Extensions (EMF+) format is device independent.

5.6 PS and PDF :

PostScript is an important language for typesetting, and many high-end printers have a PostScript interpreter built into them. PostScript is a vector-based, rather than pixel-based, picture language: page elements are essentially defined in terms of vectors. With fonts defined this way, PostScript includes vector/structured graphics as well as text; bit-mapped images can also be included in output files. Encapsulated PostScript files add some information for including PostScript files in another document. Several popular graphics programs, such as Adobe Illustrator, use PostScript. Note, however, that the PostScript page description language does not provide compression; in fact, PostScript files are just stored as ASCII. Therefore files are often large, and in academic settings, it is common for such files to be made available only after compression by some Unix utility, such as compress or gzip.

6.1.1 Light and Spectra :

Light is an electromagnetic wave, and its color is characterized by the wavelength of the wave. Laser light consists of a single wavelength: e.g., a ruby laser produces a bright, scarlet red beam. So if we were to make a plot of the light intensity versus wavelength, we would see a spike at the appropriate red wavelength, and no other contribution to the light.

In contrast, most light sources produce contributions over many wavelengths. However, humans cannot detect all light, but just contributions that fall in the “visible wavelengths.” Short wavelengths produce a blue sensation, and long wavelengths produce a red one.

6.1.2 Human Vision :

The eye works like a camera, with the lens focusing an image onto the retina (upside-down and left-right reversed). The retina consists of an array of rods and three kinds of cones. These receptors are called such because they are shaped like cones and rods, respectively. The rods come into play when light levels are low and produce an image in shades of gray (“all cats are gray at night!”).
The brain makes use of differences R−G, G−B, and B−R, as well as combining all of R, G, and B into a high-light-level achromatic channel (and thus we can say that the brain is good at algebra).

For images produced from computer graphics, we store integers proportional to intensity in the frame buffer; then we should have a gamma correction LUT between the frame buffer and the display. If gamma correction is applied to floats before quantizing to integers, before storage in the frame buffer, then in fact we can use only 8 bits per channel and still avoid contouring artifacts.

The light reflected from a surface forms a color signal C(λ), which consists of the product of the illuminant E(λ) times the reflectance S(λ): C(λ) = E(λ) S(λ). The equations similar to Eq. (4.2), then, that take into account the image formation model are

R = ∫ E(λ) S(λ) q_R(λ) dλ
G = ∫ E(λ) S(λ) q_G(λ) dλ
B = ∫ E(λ) S(λ) q_B(λ) dλ

where q_R(λ), q_G(λ), and q_B(λ) are the sensor spectral-sensitivity functions.

A perceptually based color system is built around visual-system percepts of Lightness L∗; hue h∗, meaning a magnitude-independent notion of color; and chroma c∗, meaning the purity (vividness) of a color.

Chrominance refers to the difference between a color and a reference white at the same luminance. It can be represented by the color differences U, V:

U = B′ − Y′
V = R′ − Y′

In the YUV scheme these differences are scaled, with R′, G′, B′ normalized to values between 0 and 1. This makes the equations as follows:

Y′ = 0.299 R′ + 0.587 G′ + 0.114 B′
U = −0.147 R′ − 0.289 G′ + 0.436 B′
V = 0.615 R′ − 0.515 G′ − 0.100 B′
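To make the conversion above concrete, here is a minimal Python sketch (our own illustration; the helper name rgb_to_yuv is not from the text) that applies the matrix just given to gamma-corrected R′G′B′ values in [0, 1]:

import numpy as np

# YUV matrix from the equations above (BT.601 luma weights).
RGB_TO_YUV = np.array([
    [ 0.299,  0.587,  0.114],   # Y' row
    [-0.147, -0.289,  0.436],   # U row: 0.492 * (B' - Y')
    [ 0.615, -0.515, -0.100],   # V row: 0.877 * (R' - Y')
])

def rgb_to_yuv(rgb):
    # Map gamma-corrected R'G'B' values in [0, 1] to Y'UV.
    return RGB_TO_YUV @ np.asarray(rgb, dtype=float)

print(rgb_to_yuv([1.0, 1.0, 1.0]))   # white: Y' = 1, U = V = 0

For white, the chrominance components vanish, matching the definition of chrominance as the difference from a reference white at the same luminance.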
An interlaced raster scan traces a complete picture in two passes, the “odd” and “even” fields—two fields make up one frame.

In fact, the odd lines (starting from 1) end up at the middle of a line at the end of the odd field, and the even scan starts at a half-way point. First the solid (odd) lines are traced—P to Q, then R to S, and so on, ending at T—then the even field starts at U and ends at V.

2.1.1 NTSC Video :

The NTSC TV standard is mostly used in North America and Japan. It uses a familiar 4:3 aspect ratio (i.e., the ratio of picture width to height) and 525 scan lines per frame at 30 fps. More exactly, for historical reasons NTSC uses 29.97 fps—or, in other words, 33.37 ms per frame.

2.1.2 PAL Video :

PAL (Phase Alternating Line) uses 625 scan lines per frame at 25 fps, with a 4:3 aspect ratio and interlaced fields. Its broadcast TV signals are also used in composite video. This important standard is widely used in Western Europe, China, India, and many other parts of the world. Because it has higher resolution than NTSC (625 vs. 525 scan lines), the visual quality of its pictures is generally better.

PAL uses the YUV color model with an 8 MHz channel, allocating a bandwidth of 5.5 MHz to Y and 1.8 MHz each to U and V. The color subcarrier frequency is fsc ≈ 4.43 MHz. To improve picture quality, chroma signals have alternate signs (e.g., +U and −U) in successive scan lines; hence the name “Phase Alternating Line.”
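As a quick check of these numbers, a few lines of Python (our own illustration):

print(1000 / 29.97)   # NTSC frame period: ~33.37 ms
print(1000 / 25)      # PAL frame period: 40.0 ms
print(625 / 525)      # PAL has ~1.19x the scan lines of NTSC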
developed by Sony and NHK in Japan in the variables, x, and y, or video, which depends
late 1970s. on 3 variables, x, y, t. The amplitude value is a
continuous quantity.
2.3 Digitization of Sound :
Since we are interested in working with such
2.3.1 What is Sound? data in computer storage, we must digitize
the analog signals (i.e., continuous-valued
Sound is a wave phenomenon like light, but it
voltages) produced by microphones.
is macroscopic and involves molecules of air
Digitization means conversion to a stream of
being compressed and expanded under the
numbers—preferably integers for efficiency.
action of some physical device. For example, a
Since the graph is two-dimensional, to fully
speaker in an audio system vibrates back and
digitize the signal shown we have to sample in
forth and produces a longitudinal pressure
each dimension—in time and in amplitude.
wave that we perceive as sound.
Sampling means measuring the quantity we
(As an example, we get a longitudinal wave by are interested in, usually at evenly spaced
vibrating a Slinky along its length; in contrast, intervals.
we get a transverse wave by waving the Slinky
back and forth perpendicular to its length).
Without air there is no sound—for example,
in space.
This makes the design of “surround sound” The first kind of sampling—using
possible. Since sound consists of measurable measurements only at evenly spaced time
pressures at any 3D point, we can detect it by intervals—is simply called sampling
measuring the pressure level at a location, (surprisingly), and the rate at which it is
using a transducer to convert pressure to performed is called the sampling frequency.
voltage levels. Figure 6.3a shows this type of digitization.
2.3.2 Digitization:
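As a minimal sketch of sampling in both dimensions (our own illustration, not from the text), the following samples a 440 Hz sine wave at evenly spaced instants and then quantizes each amplitude to one of 256 levels:

import numpy as np

SAMPLING_RATE = 8000     # samples per second (the sampling frequency)
DURATION = 0.01          # seconds of signal to digitize

# Sampling in time: measure the signal at evenly spaced instants.
t = np.arange(0, DURATION, 1.0 / SAMPLING_RATE)
analog = np.sin(2 * np.pi * 440 * t)       # a 440 Hz tone in [-1, 1]

# Sampling in amplitude (quantization): map each value to an 8-bit integer.
quantized = np.round((analog + 1.0) / 2.0 * 255).astype(np.uint8)

print(quantized[:10])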
MIDI Concepts
As a start, suppose we want to encode the call numbers of the 120 million or so items in the Library of Congress (a mere 20 million, if we consider just books). Why don't we just transmit each item as a 27-bit number, giving each item a unique binary code (since 2^27 > 120,000,000)?

The main problem is that this “great idea” requires too many bits. And in fact there exist many coding techniques that will effectively reduce the total number of bits needed to represent the above information. The process involved is generally referred to as compression. We had a beginning look at compression schemes aimed at audio. There, we had to first consider the complexity of transforming analog signals to digital ones, whereas here, we shall consider that we at least start with digital signals. For example, even though we know an image is captured using analog signals, the file produced by a digital camera is indeed digital.

If we predict that each sample will be the same as the previous one, the differences will have a more peaked histogram, with a maximum around zero. Consequently, if we then go on to assign bitstring codewords to differences, we can assign short codes to prevalent values and long codewords to rarely occurring ones.

2.5.3 Lossless Predictive Coding :

Predictive coding simply means transmitting differences—we predict the next sample as being equal to the current sample and send not the sample itself but the error involved in making this assumption. That is, if we predict that the next sample equals the previous one, then the error is just the difference between previous and next.

Our prediction scheme could also be more complex. However, we do note one problem. Suppose our integer sample values are in the range 0 .. 255. Then differences could be as much as −255 .. 255. So we have unfortunately increased our dynamic range (ratio of maximum to minimum) by a factor of two: we may well need more bits than we needed before to transmit some differences. Fortunately, we can use a trick to get around this problem.
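A minimal sketch of these ideas in Python (our own illustration): predict each sample as equal to the previous one, keep the differences, and observe both the peaked histogram and the widened range:

import numpy as np

rng = np.random.default_rng(1)

# A slowly varying 8-bit "signal": each sample stays close to the last.
signal = np.clip(128 + np.cumsum(rng.integers(-4, 5, size=1000)), 0, 255)

# Predict each sample as equal to the previous one; keep the error.
error = np.diff(signal)

# The errors cluster around zero (a peaked histogram), so short codes can
# go to the prevalent small values and long codes to the rare large ones.
print("error spread: ", error.min(), "..", error.max())
print("signal spread:", signal.min(), "..", signal.max())

# The catch: samples occupy 0 .. 255, but differences may in general range
# over -255 .. 255, doubling the dynamic range to be coded.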
3.3.1 Shannon–Fano Algorithm :

The encoding steps of the Shannon–Fano algorithm can be presented in the following top-down manner:

1. Sort the symbols according to the frequency count of their occurrences.

2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol.

3.3.2 Huffman Coding :

First presented by Huffman in a 1952 paper, this method attracted an overwhelming amount of research and has been adopted in many important and/or commercial applications, such as fax machines, JPEG, and MPEG. In contradistinction to Shannon–Fano, which is top-down, the encoding steps of the Huffman algorithm are described in the following bottom-up manner. Let us use the same example word, HELLO. A similar binary coding tree will be used as above, in which the left branches are coded 0 and the right branches coded 1. A simple list data structure is also used.

The LZW decoder builds up the same dictionary dynamically while receiving the data—the encoder and the decoder both develop the same dictionary. Since a single code can now represent more than one symbol/character, data compression is realized.

LZW proceeds by placing longer and longer repeated entries into a dictionary, then emitting the code for an element rather than the string itself, if the element has already been placed in the dictionary. The predecessors of LZW are LZ77 and LZ78, due to Jacob Ziv and Abraham Lempel in 1977 and 1978. Welch improved the technique in 1984. LZW is used in many applications, such as UNIX compress, GIF for images, WinZip, and others.

Example (LZW compression for the string ABABBABCABABBA): Let us start with a very simple dictionary (also referred to as a string table), initially containing only three characters, with codes as follows: 1 = A, 2 = B, 3 = C.

The output codes are 1 2 4 5 2 3 4 6 1. Instead of 14 characters, only 9 codes need to be sent. If we assume each character or code is transmitted as a byte, that is quite a saving (the compression ratio would be 14/9 ≈ 1.56). (Remember, LZW is an adaptive algorithm, in which the encoder and decoder independently build their own string tables. Hence, there is no overhead involving transmitting the string table.)
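A minimal LZW encoder in Python (our own sketch) that reproduces the example above:

def lzw_encode(message, alphabet):
    # Start the string table with single characters, coded from 1.
    table = {ch: code for code, ch in enumerate(alphabet, start=1)}
    s, output = "", []
    for c in message:
        if s + c in table:           # keep extending the current string
            s = s + c
        else:                        # emit the code for s; add s+c to the table
            output.append(table[s])
            table[s + c] = len(table) + 1
            s = c
    output.append(table[s])          # emit the code for the final string
    return output

print(lzw_encode("ABABBABCABABBA", "ABC"))   # [1, 2, 4, 5, 2, 3, 4, 6, 1]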
Arithmetic coding can treat the whole message as one unit and achieve a fractional number of bits for each input symbol. In practice, the input data is usually broken up into chunks to avoid error propagation. In our presentation below, we will start with a simplistic approach and include a terminator symbol. Then we will introduce some improved methods for practical implementations.
low = 0.0; high = 1.0; range = 1.0;
while (symbol != terminator)
{
  get (symbol);
  // compute the new high before updating low, so both use the old low
  high = low + range * Range_high(symbol);
  low = low + range * Range_low(symbol);
  range = high - low;
}
output a code so that low <= code < high;
END

Since we were dealing with signals in the time domain for audio, practitioners generally refer to images as signals in the spatial domain. The generally slowly changing nature of imagery spatially produces a high likelihood that neighboring pixels will have similar intensity values. Given an original image I(x, y), using a simple difference operator we can define a difference image d(x, y) as follows:

d(x, y) = I(x, y) − I(x − 1, y)
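A quick Python sketch (our own illustration) of this difference operator on a synthetic image:

import numpy as np

# A smooth synthetic image: intensity ramps slowly from left to right.
I = np.tile(np.arange(256) // 4, (64, 1)).astype(np.int16)

# d(x, y) = I(x, y) - I(x - 1, y): difference along each scan line.
d = I[:, 1:] - I[:, :-1]

print(I.min(), I.max())   # 0 63: the original intensity range
print(d.min(), d.max())   # 0 1: neighboring pixels are nearly identical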
The midtread quantizer has zero as one of its output values; hence, it is also known as a dead-zone quantizer, because it turns a range of nonzero input values into the zero output. The midtread quantizer is important when source data represents the zero value by fluctuating between small positive and negative numbers. Applying the midtread quantizer in this case would produce an accurate and steady representation of the value zero. For the special case θ = 1, where θ is the quantizer step size, we can simply compute the output values for these quantizers as

Q_midrise(x) = ⌈x⌉ − 1/2
Q_midtread(x) = ⌊x + 1/2⌋

To obtain better decorrelation, we can group blocks of consecutive samples from the source input into vectors. Let X = {x1, x2, ..., xk}^T be a vector of samples. Whether our input data is an image, a piece of music, an audio or video clip, or even a piece of text, there is a good chance that a substantial amount of correlation is inherent among neighboring samples xi.

The rationale behind transform coding is that if Y is the result of a linear transform T of the input vector X in such a way that the components of Y are much less correlated, then Y can be coded more efficiently than X.
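A small Python sketch (our own illustration) of this rationale, using the simplest decorrelating transform, a 2-point average/difference pair applied to strongly correlated neighboring samples:

import numpy as np

rng = np.random.default_rng(0)

# Neighboring samples are highly correlated: x2 is x1 plus small noise.
x1 = rng.normal(0, 10, size=10000)
x2 = x1 + rng.normal(0, 1, size=10000)
X = np.stack([x1, x2])

# Linear transform T: average and difference of each sample pair.
T = np.array([[0.5,  0.5],
              [0.5, -0.5]])
Y = T @ X

print(np.var(X, axis=1).round(1))   # both components carry large variance
print(np.var(Y, axis=1).round(1))   # nearly all variance is in Y[0]

Since the difference component carries almost no energy, most of the bit budget can be spent on the average component, which is why Y codes more efficiently than X.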
The transform indices u and v have the same range as i and j. The general definition of the transform is

F(u, v) = Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} f(i, j) K(i, j; u, v)

where f(i, j) is the input and K(i, j; u, v) is the transform kernel (for the DCT, a product of cosines).

Consider again a time-dependent signal f(t) (it is best to base the discussion on continuous functions to start with). The traditional method of signal decomposition is the Fourier transform.
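As a tiny illustration of Fourier decomposition (our own, not from the text), numpy's FFT recovers the two frequencies present in a composite signal:

import numpy as np

fs = 1000                        # sampling rate, Hz
t = np.arange(0, 1, 1 / fs)      # one second of samples

# A signal containing two pure frequencies, 50 Hz and 120 Hz.
f = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(f))            # magnitude at each frequency
freqs = np.fft.rfftfreq(len(f), 1 / fs)

print(freqs[spectrum > 100])     # [ 50. 120.]: the two components recovered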
If we repeatedly take averages and differences and keep results for every step, we effectively create a multiresolution analysis of the sequence. For images, this would be equivalent to creating smaller and smaller summary images, one-quarter the size for each step, and keeping track of differences from the average as well. Mentally stacking the full-size image, the quarter-size image, the sixteenth-size image, and so on, creates a pyramid. The full set, along with the difference images, is the multiresolution decomposition.

The EZW algorithm addresses two problems: obtaining the best image quality for a given bitrate and accomplishing this task in an embedded fashion. An embedded code is one that contains all lower rate codes “embedded” at the beginning of the bitstream. The bits are effectively ordered by importance in the bitstream. An embedded code allows the encoder to terminate the encoding at any point and thus meet any target bitrate exactly. Similarly, a decoder can cease to decode at any point and produce reconstructions corresponding to all lower rate encodings.

4.4.1 The Zerotree Data Structure :

The coding of the significance map is achieved using a new data structure called the zerotree. A wavelet coefficient x is said to be insignificant with respect to a given threshold T if |x| < T. The four symbols are:

• The zerotree root. The root of the zerotree is encoded with a special symbol indicating that the insignificance of the coefficients at finer scales is completely predictable.
• Isolated zero. The coefficient is insignificant but has some significant descendants.

• Positive significance. The coefficient is significant with a positive value.

• Negative significance. The coefficient is significant with a negative value.

Pearlman gives a full description of this algorithm.

The image compression techniques discussed in the previous chapters (e.g., JPEG and JPEG2000) exploit spatial redundancy, the phenomenon that picture contents often change relatively slowly across a frame. Video compression, in addition, exploits the temporal redundancy between successive frames. The three main steps of these motion compensation based algorithms are (a sketch of step 1 follows the list):

1. Motion estimation (motion vector search)

2. Motion compensation based prediction

3. Derivation of the prediction error, i.e., the difference
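A minimal Python sketch of step 1 (our own illustration), exhaustive block-matching motion estimation: for one block of the target frame, search a small window in the reference frame for the displacement that minimizes the mean absolute difference:

import numpy as np

def motion_vector(ref, target, x, y, N=8, p=4):
    # Exhaustive search for the best (dx, dy) for the N x N target block
    # at (x, y), within a +/- p pixel window in the reference frame.
    block = target[y:y + N, x:x + N].astype(int)
    best_mad, best_mv = None, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= ref.shape[0] - N and 0 <= xx <= ref.shape[1] - N:
                cand = ref[yy:yy + N, xx:xx + N].astype(int)
                mad = np.mean(np.abs(block - cand))   # mean absolute difference
                if best_mad is None or mad < best_mad:
                    best_mad, best_mv = mad, (dx, dy)
    return best_mv

# Toy frames: the target frame is the reference shifted 3 pixels right.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64))
target = np.roll(ref, 3, axis=1)

print(motion_vector(ref, target, 16, 16))   # (-3, 0): block came from 3 px left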