Multimedia Computing
UNIT I
1. Multimedia and Hypermedia :
The Internet Engineering Task Force (IETF) standardizes the technologies. The W3C has listed the following three goals for the WWW: universal access of web resources (by everyone everywhere), effectiveness of navigating available information, and responsible use of posted material.

WWW size was estimated at over one billion pages. Sony unveiled the first Blu-ray Disc prototypes in October 2000, and the first prototype player was released in April 2003 in Japan.

Multimedia Web information systems represent a whole new breed of information systems intended to process, store, retrieve, and disseminate information through the Internet using Web protocols.

Since its inception, the World Wide Web (WWW) has revolutionized the way people interact, learn, and communicate. A whole new breed of information systems has been developed to process, store, retrieve, and disseminate information through the Internet using Web protocols. These systems use hypertext to link documents and present any kind of information, including multimedia, to the end user through a Web browser. Web-based information systems are often called Web Information Systems (WIS). A WIS that uses multimedia as its key component is a Multimedia Web Information System.

For a concrete appreciation of the current state of multimedia software tools available for carrying out tasks in multimedia, we now include a quick overview of software categories and products. These tools are really only the beginning—a fully functional multimedia project can also call for stand-alone programming as well as just the use of predefined tools to fully exercise the capabilities of machines and the Internet.

In courses we teach using this text, students are encouraged to try these tools, producing full-blown and creative multimedia productions.

Yet this textbook is not a “how-to” book about using these tools—it is about understanding the fundamental design principles behind these tools! With a clear understanding of the key multimedia data structures, algorithms, and protocols, a student can make smarter and more advanced use of such tools, fully unleash their potential, and even improve the tools themselves or develop new tools.

The categories of software tools we examine here are

• Music sequencing and notation

• Digital audio

• Graphics and image editing

• Video editing

• Animation

• Multimedia authoring.

3.1 Music Sequencing and Notation :

Cakewalk Pro Audio: Cakewalk Pro Audio is a very straightforward music-notation program for “sequencing.” The term sequencer comes from older devices that stored sequences of notes in the MIDI music language.
Finale, Sibelius: Finale and Sibelius are two composer-level notation systems; these programs likely set the bar for excellence, but their learning curve is fairly steep.

Digital Audio: Digital audio tools deal with accessing and editing the actual sampled sounds that make up audio.

3.2 Graphics and Image Editing

Adobe Illustrator: Illustrator is a powerful publishing tool for creating and editing vector graphics, which can easily be exported for use on the Web.

Adobe Photoshop: Photoshop is the standard tool for graphics, image processing, and image manipulation. Layers of images, graphics, and text can be separately manipulated for maximum flexibility, and its set of filters permits creation of sophisticated lighting effects.

3.3 Video Editing :

Adobe Premiere: Premiere is a simple, intuitive video editing tool for nonlinear editing—putting video clips into any order. Video and audio are arranged in tracks, like a musical score. It provides a large number of video and audio tracks, superimpositions, and virtual clips. A large library of built-in transitions, filters, and motions for clips allows easy creation of effective multimedia productions.

3.4 Animation :

DirectX: DirectX, a Windows API that supports video, images, audio, and 3D animation, is a common API used to develop multimedia Windows applications such as computer games.

Animation Software: Autodesk 3ds Max (formerly 3D Studio Max) includes a number of high-end professional tools for character animation, game development, and visual effects production. Models produced using this tool can be seen in several consumer games, such as for the Sony PlayStation.

3.5 Multimedia Authoring :

Tools that provide the capability for creating a complete multimedia presentation, including interactive user control, are called authoring programs.

Adobe Flash: Flash allows users to create interactive movies by using the score metaphor—a timeline arranged in parallel event sequences, much like a musical score consisting of musical notes. Elements in the movie are called symbols in Flash. Symbols are added to a central repository, called a library, and can be added to the movie's timeline. Once the symbols are present at a specific time, they appear on the Stage, which represents what the movie looks like at a certain time, and can be manipulated and moved by the tools built into Flash. Finished Flash movies are commonly used to show movies or games on the Web.
Windows MetaFile (WMF) is the native vector file format for the Microsoft Windows operating environment. WMF files actually consist of a collection of Graphics Device Interface (GDI) function calls, also native to the Windows environment. When a WMF file is “played” (typically using the Windows PlayMetaFile() function), the described graphic is rendered. WMF files are ostensibly device-independent and unlimited in size. The later Enhanced Metafile Format Plus Extensions (EMF+) format is device independent.

5.6 PS and PDF :

PostScript is an important language for typesetting, and many high-end printers have a PostScript interpreter built into them. PostScript is a vector-based, rather than pixel-based, picture language: page elements are essentially defined in terms of vectors. With fonts defined this way, PostScript includes vector/structured graphics as well as text; bit-mapped images can also be included in output files. Encapsulated PostScript files add some information for including PostScript files in another document. Several popular graphics programs, such as Adobe Illustrator, use PostScript. Note, however, that the PostScript page description language does not provide compression; in fact, PostScript files are just stored as ASCII. Therefore files are often large, and in academic settings, it is common for such files to be made available only after compression by some Unix utility, such as compress or gzip.

6.1.1 Light and Spectra :

Light is an electromagnetic wave, and its color is characterized by the wavelength of the wave. Laser light consists of a single wavelength: e.g., a ruby laser produces a bright, scarlet red beam. So if we were to make a plot of the light intensity versus wavelength, we would see a spike at the appropriate red wavelength, and no other contribution to the light.

In contrast, most light sources produce contributions over many wavelengths. However, humans cannot detect all light, but just contributions that fall in the “visible wavelengths.” Short wavelengths produce a blue sensation, and long wavelengths produce a red one.

6.1.2 Human Vision :

The eye works like a camera, with the lens focusing an image onto the retina (upside-down and left-right reversed). The retina consists of an array of rods and three kinds of cones. These receptors are called such because they are shaped like cones and rods, respectively. The rods come into play when light levels are low and produce an image in shades of gray (“all cats are gray at night!”).
The brain makes use of differences R−G, G−B, and B−R, as well as combining all of R, G, and B into a high-light-level achromatic channel (and thus we can say that the brain is good at algebra).

For images produced from computer graphics, we store integers proportional to intensity in the frame buffer; then we should have a gamma correction LUT between the frame buffer and the display. If gamma correction is applied to floats before quantizing to integers, before storage in the frame buffer, then in fact we can use only 8 bits per channel and still avoid contouring artifacts.

The light reflected from a surface forms a color signal C(λ), which consists of the product of the illuminant E(λ) times the reflectance S(λ): C(λ) = E(λ) S(λ). The equations similar to Eq. (4.2), then, that take into account the image formation model are

R = ∫ E(λ) S(λ) q_R(λ) dλ
G = ∫ E(λ) S(λ) q_G(λ) dλ
B = ∫ E(λ) S(λ) q_B(λ) dλ

where q_R(λ), q_G(λ), and q_B(λ) are the sensor spectral-sensitivity functions.

A perceptually based color system is built around visual-system percepts of Lightness L∗; hue h∗, meaning a magnitude-independent notion of color; and chroma c∗, meaning the purity (vividness) of a color.

Chrominance refers to the difference between a color and a reference white at the same luminance. It can be represented by the color differences U, V:

U = B′ − Y′
V = R′ − Y′

In the YUV scheme these differences are scaled, with R′, G′, B′ normalized to values between 0 and 1. This makes the equations as follows:

Y′ = 0.299 R′ + 0.587 G′ + 0.114 B′
U = −0.147 R′ − 0.289 G′ + 0.436 B′
V = 0.615 R′ − 0.515 G′ − 0.100 B′
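To make the conversion above concrete, here is a minimal Python sketch (our own illustration; the helper name rgb_to_yuv is not from the text) that applies the matrix just given to gamma-corrected R′G′B′ values in [0, 1]:

import numpy as np

# YUV matrix from the equations above (BT.601 luma weights).
RGB_TO_YUV = np.array([
    [ 0.299,  0.587,  0.114],   # Y' row
    [-0.147, -0.289,  0.436],   # U row: 0.492 * (B' - Y')
    [ 0.615, -0.515, -0.100],   # V row: 0.877 * (R' - Y')
])

def rgb_to_yuv(rgb):
    # Map gamma-corrected R'G'B' values in [0, 1] to Y'UV.
    return RGB_TO_YUV @ np.asarray(rgb, dtype=float)

print(rgb_to_yuv([1.0, 1.0, 1.0]))   # white: Y' = 1, U = V = 0

For white, the chrominance components vanish, matching the definition of chrominance as the difference from a reference white at the same luminance.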
An interlaced raster scan traces a complete picture in two passes, the “odd” and “even” fields—two fields make up one frame.

In fact, the odd lines (starting from 1) end up at the middle of a line at the end of the odd field, and the even scan starts at a half-way point. First the solid (odd) lines are traced—P to Q, then R to S, and so on, ending at T—then the even field starts at U and ends at V.

2.1.1 NTSC Video :

The NTSC TV standard is mostly used in North America and Japan. It uses a familiar 4:3 aspect ratio (i.e., the ratio of picture width to height) and 525 scan lines per frame at 30 fps. More exactly, for historical reasons NTSC uses 29.97 fps—or, in other words, 33.37 ms per frame.

2.1.2 PAL Video :

PAL (Phase Alternating Line) uses 625 scan lines per frame at 25 fps, with a 4:3 aspect ratio and interlaced fields. Its broadcast TV signals are also used in composite video. This important standard is widely used in Western Europe, China, India, and many other parts of the world. Because it has higher resolution than NTSC (625 vs. 525 scan lines), the visual quality of its pictures is generally better.

PAL uses the YUV color model with an 8 MHz channel, allocating a bandwidth of 5.5 MHz to Y and 1.8 MHz each to U and V. The color subcarrier frequency is fsc ≈ 4.43 MHz. To improve picture quality, chroma signals have alternate signs (e.g., +U and −U) in successive scan lines; hence the name “Phase Alternating Line.”
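As a quick check of these numbers, a few lines of Python (our own illustration):

print(1000 / 29.97)   # NTSC frame period: ~33.37 ms
print(1000 / 25)      # PAL frame period: 40.0 ms
print(625 / 525)      # PAL has ~1.19x the scan lines of NTSC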
developed by Sony and NHK in Japan in the variables, x, and y, or video, which depends
late 1970s. on 3 variables, x, y, t. The amplitude value is a
continuous quantity.
2.3 Digitization of Sound :
Since we are interested in working with such
2.3.1 What is Sound? data in computer storage, we must digitize
the analog signals (i.e., continuous-valued
Sound is a wave phenomenon like light, but it
voltages) produced by microphones.
is macroscopic and involves molecules of air
Digitization means conversion to a stream of
being compressed and expanded under the
numbers—preferably integers for efficiency.
action of some physical device. For example, a
Since the graph is two-dimensional, to fully
speaker in an audio system vibrates back and
digitize the signal shown we have to sample in
forth and produces a longitudinal pressure
each dimension—in time and in amplitude.
wave that we perceive as sound.
Sampling means measuring the quantity we
(As an example, we get a longitudinal wave by are interested in, usually at evenly spaced
vibrating a Slinky along its length; in contrast, intervals.
we get a transverse wave by waving the Slinky
back and forth perpendicular to its length).
Without air there is no sound—for example,
in space.
This makes the design of “surround sound” The first kind of sampling—using
possible. Since sound consists of measurable measurements only at evenly spaced time
pressures at any 3D point, we can detect it by intervals—is simply called sampling
measuring the pressure level at a location, (surprisingly), and the rate at which it is
using a transducer to convert pressure to performed is called the sampling frequency.
voltage levels. Figure 6.3a shows this type of digitization.
2.3.2 Digitization:
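As a minimal sketch of sampling in both dimensions (our own illustration, not from the text), the following samples a 440 Hz sine wave at evenly spaced instants and then quantizes each amplitude to one of 256 levels:

import numpy as np

SAMPLING_RATE = 8000     # samples per second (the sampling frequency)
DURATION = 0.01          # seconds of signal to digitize

# Sampling in time: measure the signal at evenly spaced instants.
t = np.arange(0, DURATION, 1.0 / SAMPLING_RATE)
analog = np.sin(2 * np.pi * 440 * t)       # a 440 Hz tone in [-1, 1]

# Sampling in amplitude (quantization): map each value to an 8-bit integer.
quantized = np.round((analog + 1.0) / 2.0 * 255).astype(np.uint8)

print(quantized[:10])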
MIDI Concepts
As a start, suppose we want to encode the call numbers of the 120 million or so items in the Library of Congress (a mere 20 million, if we consider just books). Why don't we just transmit each item as a 27-bit number, giving each item a unique binary code (since 2^27 > 120,000,000)?

The main problem is that this “great idea” requires too many bits. And in fact there exist many coding techniques that will effectively reduce the total number of bits needed to represent the above information. The process involved is generally referred to as compression. We had a beginning look at compression schemes aimed at audio. There, we had to first consider the complexity of transforming analog signals to digital ones, whereas here, we shall consider that we at least start with digital signals. For example, even though we know an image is captured using analog signals, the file produced by a digital camera is indeed digital.

If we predict that each sample will be the same as the previous one, the differences will have a more peaked histogram, with a maximum around zero. Consequently, if we then go on to assign bitstring codewords to differences, we can assign short codes to prevalent values and long codewords to rarely occurring ones.

2.5.3 Lossless Predictive Coding :

Predictive coding simply means transmitting differences—we predict the next sample as being equal to the current sample and send not the sample itself but the error involved in making this assumption. That is, if we predict that the next sample equals the previous one, then the error is just the difference between previous and next.

Our prediction scheme could also be more complex. However, we do note one problem. Suppose our integer sample values are in the range 0 .. 255. Then differences could be as much as −255 .. 255. So we have unfortunately increased our dynamic range (ratio of maximum to minimum) by a factor of two: we may well need more bits than we needed before to transmit some differences. Fortunately, we can use a trick to get around this problem.
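A minimal sketch of these ideas in Python (our own illustration): predict each sample as equal to the previous one, keep the differences, and observe both the peaked histogram and the widened range:

import numpy as np

rng = np.random.default_rng(1)

# A slowly varying 8-bit "signal": each sample stays close to the last.
signal = np.clip(128 + np.cumsum(rng.integers(-4, 5, size=1000)), 0, 255)

# Predict each sample as equal to the previous one; keep the error.
error = np.diff(signal)

# The errors cluster around zero (a peaked histogram), so short codes can
# go to the prevalent small values and long codes to the rare large ones.
print("error spread: ", error.min(), "..", error.max())
print("signal spread:", signal.min(), "..", signal.max())

# The catch: samples occupy 0 .. 255, but differences may in general range
# over -255 .. 255, doubling the dynamic range to be coded.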
3.3.1 Shannon–Fano Algorithm :

The encoding steps of the Shannon–Fano algorithm can be presented in the following top-down manner:

1. Sort the symbols according to the frequency count of their occurrences.

2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol.

3.3.2 Huffman Coding :

First presented by Huffman in a 1952 paper, this method attracted an overwhelming amount of research and has been adopted in many important and/or commercial applications, such as fax machines, JPEG, and MPEG. In contradistinction to Shannon–Fano, which is top-down, the encoding steps of the Huffman algorithm are described in the following bottom-up manner. Let us use the same example word, HELLO. A similar binary coding tree will be used as above, in which the left branches are coded 0 and the right branches coded 1. A simple list data structure is also used.

The LZW decoder builds up the same dictionary dynamically while receiving the data—the encoder and the decoder both develop the same dictionary. Since a single code can now represent more than one symbol/character, data compression is realized.

LZW proceeds by placing longer and longer repeated entries into a dictionary, then emitting the code for an element rather than the string itself, if the element has already been placed in the dictionary. The predecessors of LZW are LZ77 and LZ78, due to Jacob Ziv and Abraham Lempel in 1977 and 1978. Welch improved the technique in 1984. LZW is used in many applications, such as UNIX compress, GIF for images, WinZip, and others.

Example (LZW compression for the string ABABBABCABABBA): Let us start with a very simple dictionary (also referred to as a string table), initially containing only three characters, with codes as follows: 1 = A, 2 = B, 3 = C.

The output codes are 1 2 4 5 2 3 4 6 1. Instead of 14 characters, only 9 codes need to be sent. If we assume each character or code is transmitted as a byte, that is quite a saving (the compression ratio would be 14/9 ≈ 1.56). (Remember, LZW is an adaptive algorithm, in which the encoder and decoder independently build their own string tables. Hence, there is no overhead involving transmitting the string table.)
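A minimal LZW encoder in Python (our own sketch) that reproduces the example above:

def lzw_encode(message, alphabet):
    # Start the string table with single characters, coded from 1.
    table = {ch: code for code, ch in enumerate(alphabet, start=1)}
    s, output = "", []
    for c in message:
        if s + c in table:           # keep extending the current string
            s = s + c
        else:                        # emit the code for s; add s+c to the table
            output.append(table[s])
            table[s + c] = len(table) + 1
            s = c
    output.append(table[s])          # emit the code for the final string
    return output

print(lzw_encode("ABABBABCABABBA", "ABC"))   # [1, 2, 4, 5, 2, 3, 4, 6, 1]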
Arithmetic coding can treat the whole message as one unit and achieve a fractional number of bits for each input symbol. In practice, the input data is usually broken up into chunks to avoid error propagation. In our presentation below, we will start with a simplistic approach and include a terminator symbol. Then we will introduce some improved methods for practical implementations.
low = 0.0; high = 1.0; range = 1.0;
while (symbol != terminator)
{
  get (symbol);
  // compute the new high before updating low, so both use the old low
  high = low + range * Range_high(symbol);
  low = low + range * Range_low(symbol);
  range = high - low;
}
output a code so that low <= code < high;
END

Since we were dealing with signals in the time domain for audio, practitioners generally refer to images as signals in the spatial domain. The generally slowly changing nature of imagery spatially produces a high likelihood that neighboring pixels will have similar intensity values. Given an original image I(x, y), using a simple difference operator we can define a difference image d(x, y) as follows:

d(x, y) = I(x, y) − I(x − 1, y)
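A quick Python sketch (our own illustration) of this difference operator on a synthetic image:

import numpy as np

# A smooth synthetic image: intensity ramps slowly from left to right.
I = np.tile(np.arange(256) // 4, (64, 1)).astype(np.int16)

# d(x, y) = I(x, y) - I(x - 1, y): difference along each scan line.
d = I[:, 1:] - I[:, :-1]

print(I.min(), I.max())   # 0 63: the original intensity range
print(d.min(), d.max())   # 0 1: neighboring pixels are nearly identical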
The midtread quantizer has zero as one of its output values; hence, it is also known as a dead-zone quantizer, because it turns a range of nonzero input values into the zero output. The midtread quantizer is important when source data represents the zero value by fluctuating between small positive and negative numbers. Applying the midtread quantizer in this case would produce an accurate and steady representation of the value zero. For the special case θ = 1, where θ is the quantizer step size, we can simply compute the output values for these quantizers as

Q_midrise(x) = ⌈x⌉ − 1/2
Q_midtread(x) = ⌊x + 1/2⌋

To obtain better decorrelation, we can group blocks of consecutive samples from the source input into vectors. Let X = {x1, x2, ..., xk}^T be a vector of samples. Whether our input data is an image, a piece of music, an audio or video clip, or even a piece of text, there is a good chance that a substantial amount of correlation is inherent among neighboring samples xi.

The rationale behind transform coding is that if Y is the result of a linear transform T of the input vector X in such a way that the components of Y are much less correlated, then Y can be coded more efficiently than X.
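A small Python sketch (our own illustration) of this rationale, using the simplest decorrelating transform, a 2-point average/difference pair applied to strongly correlated neighboring samples:

import numpy as np

rng = np.random.default_rng(0)

# Neighboring samples are highly correlated: x2 is x1 plus small noise.
x1 = rng.normal(0, 10, size=10000)
x2 = x1 + rng.normal(0, 1, size=10000)
X = np.stack([x1, x2])

# Linear transform T: average and difference of each sample pair.
T = np.array([[0.5,  0.5],
              [0.5, -0.5]])
Y = T @ X

print(np.var(X, axis=1).round(1))   # both components carry large variance
print(np.var(Y, axis=1).round(1))   # nearly all variance is in Y[0]

Since the difference component carries almost no energy, most of the bit budget can be spent on the average component, which is why Y codes more efficiently than X.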
The transform indices u and v have the same range as i and j. The general definition of the transform is

F(u, v) = Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} f(i, j) K(i, j; u, v)

where f(i, j) is the input and K(i, j; u, v) is the transform kernel (for the DCT, a product of cosines).

Consider again a time-dependent signal f(t) (it is best to base the discussion on continuous functions to start with). The traditional method of signal decomposition is the Fourier transform.
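As a tiny illustration of Fourier decomposition (our own, not from the text), numpy's FFT recovers the two frequencies present in a composite signal:

import numpy as np

fs = 1000                        # sampling rate, Hz
t = np.arange(0, 1, 1 / fs)      # one second of samples

# A signal containing two pure frequencies, 50 Hz and 120 Hz.
f = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(f))            # magnitude at each frequency
freqs = np.fft.rfftfreq(len(f), 1 / fs)

print(freqs[spectrum > 100])     # [ 50. 120.]: the two components recovered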
If we repeatedly take averages and differences and keep results for every step, we effectively create a multiresolution analysis of the sequence. For images, this would be equivalent to creating smaller and smaller summary images, one-quarter the size for each step, and keeping track of differences from the average as well. Mentally stacking the full-size image, the quarter-size image, the sixteenth-size image, and so on, creates a pyramid. The full set, along with the difference images, is the multiresolution decomposition.

The EZW algorithm addresses two problems: obtaining the best image quality for a given bitrate and accomplishing this task in an embedded fashion. An embedded code is one that contains all lower rate codes “embedded” at the beginning of the bitstream. The bits are effectively ordered by importance in the bitstream. An embedded code allows the encoder to terminate the encoding at any point and thus meet any target bitrate exactly. Similarly, a decoder can cease to decode at any point and produce reconstructions corresponding to all lower rate encodings.

4.4.1 The Zerotree Data Structure :

The coding of the significance map is achieved using a new data structure called the zerotree. A wavelet coefficient x is said to be insignificant with respect to a given threshold T if |x| < T. The four symbols are:

• The zerotree root. The root of the zerotree is encoded with a special symbol indicating that the insignificance of the coefficients at finer scales is completely predictable.
• Isolated zero. The coefficient is insignificant but has some significant descendants.

• Positive significance. The coefficient is significant with a positive value.

• Negative significance. The coefficient is significant with a negative value.

Pearlman gives a full description of this algorithm.

The image compression techniques discussed in the previous chapters (e.g., JPEG and JPEG2000) exploit spatial redundancy, the phenomenon that picture contents often change relatively slowly across a frame. Video compression, in addition, exploits the temporal redundancy between successive frames. The three main steps of these motion compensation based algorithms are (a sketch of step 1 follows the list):

1. Motion estimation (motion vector search)

2. Motion compensation based prediction

3. Derivation of the prediction error, i.e., the difference
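A minimal Python sketch of step 1 (our own illustration), exhaustive block-matching motion estimation: for one block of the target frame, search a small window in the reference frame for the displacement that minimizes the mean absolute difference:

import numpy as np

def motion_vector(ref, target, x, y, N=8, p=4):
    # Exhaustive search for the best (dx, dy) for the N x N target block
    # at (x, y), within a +/- p pixel window in the reference frame.
    block = target[y:y + N, x:x + N].astype(int)
    best_mad, best_mv = None, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= ref.shape[0] - N and 0 <= xx <= ref.shape[1] - N:
                cand = ref[yy:yy + N, xx:xx + N].astype(int)
                mad = np.mean(np.abs(block - cand))   # mean absolute difference
                if best_mad is None or mad < best_mad:
                    best_mad, best_mv = mad, (dx, dy)
    return best_mv

# Toy frames: the target frame is the reference shifted 3 pixels right.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64))
target = np.roll(ref, 3, axis=1)

print(motion_vector(ref, target, 16, 16))   # (-3, 0): block came from 3 px left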