Imaging and Image Representation: 2.1 Sensing Light
Figure 2.1: A point source of illumination lights a surface element of an object; the surface normal N, the surface reflectance, the radiance reflected toward the camera, and the irradiance received at a sensor element along the optical axis Z are labeled.
Figure 2.1 shows a simple model of common photography: a surface element, illuminated by a single source (the sun or a flash bulb), reflects radiation toward the camera, which senses it via chemicals on film. More details of this situation are covered in Chapter 6. Wavelengths in the light range result from generating or reflecting mechanisms very near the surface of objects. We are concerned with many properties of electro-magnetic radiation in this book; however, we will usually give a qualitative description of phenomena and leave the quantitative details to books in physics or optics. Application engineering requires some knowledge of the material being sensed and the radiation and sensor used.
Figure 2.2: A CCD (charge-coupled device) camera imaging a vase; discrete cells convert
light energy into electrical charges, which are represented as small numbers when input to
a computer.
If the digital image has 500 rows and 500 columns of byte-sized gray values, a memory array of a quarter of a million bytes is obtained. A CCD camera sometimes plugs into a computer board, called a frame grabber, which contains memory for the image and perhaps control of the camera. New designs now allow for direct digital communication (e.g. using the IEEE 1394 standard). Today major camera manufacturers offer digital cameras that can store a few dozen images in memory within the camera body itself; some contain a floppy disk for this purpose. These images can be input for computer processing at any time. Figure 2.3 sketches an entire computer system with both camera input and graphics output. This is a typical system for an industrial vision task or medical imaging task. It is also typical for multimedia computers, which may have an inexpensive camera available to take images for teleconferencing purposes. The role of a frame buffer as a high speed
Figure 2.3: A computer system for vision: a lens and CCD array view the 3D scene, an A/D converter fills a frame buffer, and a processor runs machine vision algorithms and drives a graphic display.
Figure 2.4: Other useful array geometries: (a) circular, (b) linear, (c) "ROSA".
image store is central here: the camera provides an input image which is stored in digital form in the frame buffer after analog-to-digital conversion, where it is available for display to the user and for processing by various computer algorithms. The frame buffer actually may store several images or their derivatives.
A computer program processing a digital image might refer to pixel values as I[r, c] or I[r][c], where I is an array name and r and c are row and column numbers, respectively. This book uses such notation in the algorithms presented. Some cameras can be set so that they produce a binary image: pixels are either 0 or 1, representing dark versus bright, or the reverse. A simple algorithm can produce the same effect by changing all pixels below some threshold value t to 0 and all pixels at or above it to 1. An example was given in Chapter 1 where a magnetic resonance image was thresholded to contrast high blood flow versus low blood flow.
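As a concrete sketch of that thresholding step, the following short C program converts a small grey-level array to a binary one; the array contents and the threshold value 100 are made-up values for illustration.

#include <stdio.h>

#define ROWS 4
#define COLS 6

/* Threshold a grey-level image: pixels below t become 0, pixels at or above t become 1. */
void threshold(unsigned char I[ROWS][COLS], unsigned char B[ROWS][COLS], unsigned char t)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            B[r][c] = (I[r][c] >= t) ? 1 : 0;
}

int main(void)
{
    unsigned char I[ROWS][COLS] = {
        { 10,  20, 200, 210,  30,  15},
        { 12, 190, 220, 205,  25,  18},
        { 11,  22, 180, 199,  21,  14},
        { 13,  19,  17,  16,  20,  12}
    };
    unsigned char B[ROWS][COLS];

    threshold(I, B, 100);              /* t = 100 chosen arbitrarily for the example */

    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++)
            printf("%d ", B[r][c]);
        printf("\n");
    }
    return 0;
}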
Image Formation
The geometry of image formation can be conceptualized as the projection of each point of the 3D scene through the center of projection or lens center onto the image plane. The intensity at the image point is related to the intensity radiating from the 3D surface point: the actual relationship is complex as we'll later learn. This projection model can be physically justified since a pin-hole camera can actually be made by using a camera box with a small hole and no lens at all. A CCD camera usually will employ the same kind of lens as 35mm film cameras used for family photos. A single lens with two convex surfaces is shown in Figure 2.2, but most actual lenses are compound with more than two refracting surfaces. There are two very important points to be made. First, the lens is a light collector: light reaches the image point via an entire cone of rays reaching the lens from the 3D point. Three rays are shown projecting from the top of the vase in Figure 2.2; these determine the extremes of the cone of rays collected by the lens for only the top of the vase. A similar cone of rays exists for all other scene points. Because of geometric imperfections in the lens, different bending of different colors of light, and other phenomena, the cone of rays actually results in a finite or blurred spot on the image plane called the circle of confusion. Secondly, the CCD sensor array is constructed of physically discrete units and not infinitesimal points;
thus, each sensor cell integrates the rays received from many neighboring points of a 3D surface. These two effects cause blurring of the image and limit its sharpness and the size of the smallest scene details that can be sensed.
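The projection geometry just described can be made concrete with a few lines of code. The sketch below assumes a camera-centered frame with the optical axis along Z and uses the usual pinhole relations x = fX/Z and y = fY/Z; the focal length and the sample point are invented values.

#include <stdio.h>

/* Project a 3D scene point (X, Y, Z), given in camera coordinates with the
   optical axis along Z, through a pinhole of focal length f onto the image plane. */
void project(double f, double X, double Y, double Z, double *x, double *y)
{
    *x = f * X / Z;    /* image coordinates (x, y) on the plane z = f */
    *y = f * Y / Z;
}

int main(void)
{
    double f = 0.05;                   /* 50 mm focal length, expressed in meters */
    double X = 0.2, Y = 0.1, Z = 2.0;  /* a point 2 m in front of the camera */
    double x, y;

    project(f, X, Y, Z, &x, &y);
    printf("image point: (%g, %g) m\n", x, y);   /* (0.005, 0.0025) */
    return 0;
}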
CCD arrays are manufactured on chips typically measuring about 1 cm x 1 cm. If the array has 640 x 480 pixels or 512 x 512 pixels, then each pixel has a real width of roughly 0.001 inch. There are other useful ways of placing CCD sensor cells on the image plane (or image line) as shown in Figure 2.4. A linear array can be used in cases where we only need to measure the width of objects or where we may be imaging and inspecting a continuous web of material flowing by the camera. With a linear array, 1000 to 5000 pixels are available in a single row. Such an array can be used in a push broom fashion where the linear sensor is moved across the material being scanned, as done with a hand held scanner or in highly accurate mechanical scanners, such as flatbed scanners. Currently, many flatbed scanners are available for a few hundred dollars and are used to acquire digital images from color photos or print media. Cylindrical lenses are commonly used to focus a "line" in the real world onto the linear CCD array. The circular array would be handy for inspecting analog dials such as on watches or speedometers: the object is positioned carefully relative to the camera and the circular array is scanned for the image of the needle. The interesting "ROSA" partition shown in Figure 2.4(c) provides a hardware solution to integrating all the light energy falling into either sectors or bands of the circle. It was designed for quantizing the power spectrum of an image, but might have other simple uses as well. Chip manufacturing technology presents opportunities for implementing other custom designs.
Exercise 2
Suppose that an analog clock is to be read using a CCD camera that stares directly at it. The center of the clock images at the center of a 256 x 256 digital image and the hour hand is twice the width of the minute hand but 0.7 times its length. To locate the images of the hands of the clock we need to scan the pixels of the digital image in a circular fashion. (a) Give a formula for computing r(t) and c(t) for pixels I[r, c] on a circle of radius R centered at the image center I[256, 256], where t is the angle made between the ray to I[r, c] and the horizontal axis. (b) Is there a problem in controlling t so that a unique sequence of pixels of a digital circle is generated? (*c) Do some outside reading in a text on computer graphics and report on a practical method for generating such a digital circle.
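One naive approach to part (a), offered here only as a hedged sketch, is to step the angle t and round to the nearest pixel; the center, radius, and angular step below are arbitrary. Running it also hints at the difficulty raised in part (b): too coarse a step skips pixels of the digital circle, while too fine a step revisits them.

#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979

int main(void)
{
    /* Illustrative values only: center row/column, radius in pixels, angular step. */
    int r0 = 128, c0 = 128;
    double R = 40.0;
    double dt = 1.0 / 40.0;            /* roughly one pixel of arc per step */
    int last_r = -1, last_c = -1;

    for (double t = 0.0; t < 2.0 * PI; t += dt) {
        /* Rows increase downward in raster coordinates, hence the minus sign. */
        int r = (int)floor(r0 - R * sin(t) + 0.5);
        int c = (int)floor(c0 + R * cos(t) + 0.5);
        if (r != last_r || c != last_c) {          /* suppress immediate repeats */
            printf("I[%d, %d]\n", r, c);
            last_r = r; last_c = c;
        }
    }
    return 0;
}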
Video cameras
Video cameras creating imagery for human consumption record sequences of images at a rate of 30 per second, enabling a representation of object motion over time in addition to the spatial features represented in the single images or frames. To provide for smooth human perception, 60 half frames per second are used: these half frames are all odd image rows followed by all even image rows in alternate succession. An audio signal is also encoded. Video cameras creating imagery for machine consumption can record images at whatever rate is practical and need not use the half frame technique.

Figure 2.5: Crude sketch of the human eye as camera. (Much more detail can be obtained in the 1985 book by Levine.)
Frames of a video sequence are separated by markers and some image compression scheme
is usually used to reduce the amount of data. The analog TV standards have been carefully
designed to satisfy multiple requirements: the most interesting features allow for the same
signal to be used for either color or monochrome TVs and to carry sound or text signals as
well. The interested reader should consult the related reading and the summary of MPEG
encoding given below. We continue here with the notion of digital video being just a sequence of 2D digital images.
CCD camera technology for machine vision has sometimes suffered from display standards designed for human consumption. First, the interlacing of odd/even frames in a video sequence, needed to give a smooth picture to a human, makes unnecessary complexity for machine vision. Secondly, many CCD arrays have had pixels with a 4:3 ratio of width to height because most displays for humans have a 4:3 size ratio. Square pixels and a single scale parameter would benefit machine vision. The huge consumer market has driven device construction toward human standards and machine vision developers have had to either adapt or pay more for devices made in limited quantities.
The Human Eye
Crudely speaking, the human eye is a spherical camera with a 20mm focal length lens at the outside focusing the image on the retina, which is opposite the lens and fixed on the inside of the surface of the sphere (see Figure 2.5). The iris controls the amount of light passing through the lens by controlling the size of the pupil. Each eye has one hundred million receptor cells, quite a lot compared to a typical CCD array. Moreover, the retina is unevenly populated with sensor cells. An area near the center of the retina, called the fovea, has a very dense concentration of color receptors, called cones. Away from the center, the density of cones decreases while the density of black-white receptors, the rods, increases. The human eye senses three separate intensities for three constituent colors of a single surface spot imaging on the fovea, because the light received from that spot falls on 3 different types of cones. Each type of cone has a special pigment that is sensitive to wavelengths of light in a certain range. One of the most intriguing properties of the human eye-brain is its ability to smoothly perceive a seamless and stable 3D world even though the eyes are constantly moving. These saccades of the eye are necessary for proper human visual perception. A significant part of the human brain is engaged in processing visual input. Other characteristics of the human visual system will be discussed at various points in the book: in particular, more details of color perception are given in Chapter 6.

Figure 2.6: Images showing various distortions. (Left) Grey level clipping during A/D conversion occurs at the intersection of some bright stripes; (center) blooming increases the intensity at the neighbors of bright pixels; (right) barrel distortion is often observed when short focal length lenses are used.
Exercise 3
Assume that a human eyeball is 1 inch in diameter and that 10^8 rods and cones populate a fraction of 1= of its inner surface area. What is the average size of area covered by a single receptor? (Remember, however, that foveal receptors are packed much more densely than this average, while peripheral receptors are more sparse.)
Quantization effects
The digitization process collects a sample of intensity from a discrete area of the scene and maps it to one of a discrete set of grey values, and thus is susceptible to both mixing and rounding problems. These are addressed in more detail in the next section.
Types of images
In computing with images, it is convenient to work with both the concepts of analog image and digital image. The picture function is a mathematical model that is often used in analysis where it is fruitful to consider the image as a function of two variables. All of functional analysis is then available for analyzing images. The digital image is merely a 2D rectangular array of discrete values. Both image space and intensity range are quantized into a discrete set of values, permitting the image to be stored in a 2D computer memory structure. It is common to record intensity as an 8-bit (1-byte) number which allows values of 0 to 255. 256 different levels is usually all the precision available from the sensor and also is usually enough to satisfy the consumer. And, bytes are convenient for computers. For example, an image might be declared in a C program as "char I[512][512];". Each pixel of a color image would require 3 such values. In some medical applications, 10-bit encoding is used, allowing 1024 different intensity values, which approaches the limit of humans in discerning them.
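To make the storage figures concrete, here is a minimal C sketch of a byte-per-pixel grey-scale image and a three-byte-per-pixel color image; the array sizes and the field names of the color struct are arbitrary choices for the example.

#include <stdio.h>
#include <string.h>

#define ROWS 512
#define COLS 512

/* One 8-bit intensity per pixel: values 0..255. */
static unsigned char I[ROWS][COLS];

/* A color pixel carries three such values, e.g. red, green, blue. */
struct rgb { unsigned char r, g, b; };
static struct rgb C[ROWS][COLS];

int main(void)
{
    memset(I, 0, sizeof I);            /* roughly a quarter-megabyte of grey values */
    I[100][200] = 255;                 /* set one pixel to maximum intensity */

    C[100][200].r = 255;               /* the same location in a color image */
    C[100][200].g = 0;
    C[100][200].b = 0;

    printf("grey image uses %zu bytes, color image uses %zu bytes\n",
           sizeof I, sizeof C);
    return 0;
}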
The following definitions are intended to clarify important concepts and also to establish notation used throughout this book. We begin with an ideal notion of an analog image created by an ideal optical system, which we assume to have infinite precision. Digital images are formed by sampling this analog image at discrete locations and representing the intensity at a location as a discrete value. All real images are affected by physical processes that limit precision in both position and intensity.
Figure 2.7: Different coordinate systems used for images: (a) raster oriented uses row and column coordinates starting at [0, 0] from the top left; (b) Cartesian coordinate frame with [0, 0] at the lower left; (c) Cartesian coordinate frame with [0, 0] at the image center. (d) Relationship of pixel center point [x, y] to area element sampled in array element I[i, j].
4 Definition A grey scale image is a monochrome digital image I[r, c] with one intensity
value per pixel.
5 Definition A multispectral image is a 2D image M[x,y] which has a vector of values
at each spatial point or pixel. If the image is actually a color image, then the vector has 3
elements.
6 Definition A binary image is a digital image with all pixel values 0 or 1.
7 Definition A labeled image is a digital image L[r, c] whose pixel values are symbols from a finite alphabet. The symbol value of a pixel denotes the outcome of some decision made for that pixel. Related concepts are thematic image and pseudo-colored image.
A coordinate system must be used to address individual pixels of an image: to operate on it in a computer program, to refer to it in a mathematical formula, or to address it relative to device coordinates. Different systems used in this book and elsewhere are shown in Figure 2.7. Unfortunately, different computer tools often use different systems and the user will need to get accustomed to them. Fortunately, concepts are not tied to a coordinate system. In this book, concepts are usually discussed using a Cartesian coordinate system consistent with mathematics texts, while image processing algorithms usually use raster coordinates.
Image Quantization and Spatial Measurement
Each pixel of a digital image represents a sample of some elemental region of the real image as is shown in Figure 2.2. If the pixel is projected from the image plane back out to the source material in the scene, then the size of that scene element is the nominal resolution
Figure 2.8: Four digital images of two faces; (a) 127 rows of 176 columns; (b) (126x176) created by averaging each 2 x 2 neighborhood of (a) and replicating the average four times to produce a 2 x 2 average block; (c) (124x176) created in same manner from (b); (d) (120x176) created in same manner from (c). Effective nominal resolutions are (127x176), (63x88), (31x44), (15x22) respectively. (Try looking at the blocky images by squinting; it usually helps by blurring the annoying sharp boundaries of the squares.) Photo courtesy of Frank Biocca.
of the sensor. For example, if a 10 inch square sheet of paper is imaged to form a 500 x 500 digital image, then the nominal resolution of the sensor is 0.02 inches. This concept may not make sense if the scene has a lot of depth variation, since the nominal resolution will vary with depth and surface orientation. The field of view of an imaging sensor is a measure of how much of the scene it can see. The resolution of a sensor is related to its precision in making spatial measurements or in detecting fine features. (With careful use, and some model information, a 500 x 500 pixel image can be used to make measurements to an accuracy of 1 part in 5000, which is called subpixel resolution.)
8 Definition The nominal resolution of a CCD sensor is the size of the scene element
that images to a single pixel on the image plane.
9 Definition The term resolution refers to the precision of the sensor in making measurements, but is formally defined in different ways. If defined in real world terms, it may just be the nominal resolution, as in "the resolution of this scanner is one meter on the ground", or it may be the number of line pairs per millimeter that can be "resolved" or distinguished in the sensed image. A totally different concept is the number of pixels available, as in "the camera has a resolution of 640 by 480 pixels". This latter definition has an advantage in that it states into how many parts the field of view can be divided, which relates to both the capability to make precise measurements and to cover a certain region of a scene. If precision of measurement is a fraction of the nominal resolution, this is called subpixel resolution.
Figure 2.8 shows four images of the same face to emphasize resolution effects: humans can recognize a familiar face using 64x64 resolution, and maybe using 32x32 resolution, but 16x16 is insufficient. In solving a problem using computer vision, the implementor should use an appropriate resolution; too little resolution will produce poor recognition or imprecise measurements while too much will unnecessarily slow down algorithms and waste memory.
10 Definition The field of view of a sensor (FOV) is the size of the scene that it can sense, for example 10 inches by 10 inches. Since this may vary with depth, it may be more meaningful to use angular field of view, such as 55 degrees by 40 degrees.
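The two definitions can be tied together with a small computation. The sketch below assumes a flat scene viewed head-on, so that the nominal resolution is simply the field of view divided by the number of pixels; the 10 inch sheet and 500 x 500 image repeat the example given above.

#include <stdio.h>

int main(void)
{
    /* Field of view of a flat scene viewed head-on (example values from the text). */
    double fov_width = 10.0, fov_height = 10.0;   /* inches */
    int cols = 500, rows = 500;                   /* pixels */

    /* Nominal resolution: size of the scene element imaged by one pixel. */
    double res_x = fov_width / cols;
    double res_y = fov_height / rows;

    printf("nominal resolution: %.3f x %.3f inches per pixel\n", res_x, res_y);  /* 0.020 x 0.020 */
    return 0;
}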
Since a pixel in an image measures an area in the real world and not a point, its value often is determined by a mixture of different materials. For example, consider a satellite image where each pixel samples from a spot of the earth 10m x 10m. Clearly, that pixel value may be a sample of water, soil, and vegetation combined. The problem appears in a severe form when binary images are formed. Reconsider the above example of imaging a sheet of paper with 10 characters per inch. Many image pixels will overlap a character boundary and hence receive a mixture of higher intensity from the background and lower intensity from the character; the net result being a value in between background and character that could be set to either 0 or 1. Whichever value it is, it is partly incorrect!
Figure 2.9 gives details of quantization problems. Assume that the 2D scene is a 10x10 array of black (brightness 0) and white (brightness 8) tiles as shown at the left in the figure. The tiles form patterns that are 2 bright spots and two bright lines of different widths. If the image of the scene falls on a 5x5 CCD array such that each 2x2 set of adjacent tiles falls precisely on one CCD element, the result is the digital image shown in Figure 2.9(b). The top left CCD element senses intensity (0 + 0 + 0 + 8)/4 = 2, which is the average intensity from four tiles. The set of four bright tiles at the top right falls on two CCD elements, each of which integrates the intensity from two bright and two dark tiles. The single row of bright tiles of intensity 8 images as a row of CCD elements sensing intensity 4, while the double row images as two rows of intensity 4; however, the two lines in the scene are blended together in the image. If the image is thresholded at t = 3, then a bright pattern consisting of one tile will be lost from the image and the three other features will all fuse into one region! If the camera is displaced by an amount equivalent to one tile in both the horizontal and vertical direction, then the image shown in Figure 2.9(d) results. The shape of the 4-tile bright spot is distorted in (d) in a different manner than in (b), and the two bright lines in the scene result in a "ramp" in (d) as opposed to the constant grey region of (b); moreover, (d) shows three object regions whereas (b) shows two. Figure 2.9 shows that the images of scene features that are nearly the size of one pixel are unstable.
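The behavior described above is easy to reproduce. The C sketch below builds the 10x10 tile scene described in the text (one bright tile, a 2x2 bright spot, a single bright row, and a double bright row), averages each 2x2 block into one CCD element, and thresholds at t = 3; it prints the 5x5 grey image and the binary image side by side.

#include <stdio.h>

#define N 10   /* scene is a 10 x 10 array of tiles */
#define M 5    /* each 2 x 2 block of tiles falls on one of 5 x 5 CCD elements */

int main(void)
{
    /* Black tiles have brightness 0, white tiles brightness 8. */
    int scene[N][N] = {
        {0,0,0,0,0,0,0,0,0,0},
        {0,8,0,0,0,0,8,8,0,0},
        {0,0,0,0,0,0,8,8,0,0},
        {0,0,0,0,0,0,0,0,0,0},
        {0,0,0,0,0,0,0,0,0,0},
        {8,8,8,8,8,8,8,8,8,8},
        {0,0,0,0,0,0,0,0,0,0},
        {8,8,8,8,8,8,8,8,8,8},
        {8,8,8,8,8,8,8,8,8,8},
        {0,0,0,0,0,0,0,0,0,0}
    };
    int image[M][M], binary[M][M];
    int t = 3;   /* threshold from the text */

    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) {
            /* Each CCD element integrates (here: averages) a 2 x 2 block of tiles. */
            image[i][j] = (scene[2*i][2*j]   + scene[2*i][2*j+1] +
                           scene[2*i+1][2*j] + scene[2*i+1][2*j+1]) / 4;
            binary[i][j] = (image[i][j] >= t) ? 1 : 0;
        }

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < M; j++) printf("%d ", image[i][j]);
        printf("   ");
        for (int j = 0; j < M; j++) printf("%d ", binary[i][j]);
        printf("\n");
    }
    return 0;
}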
Figure 2.9 shows how spatial quantization effects impose limits on measurement accuracy and detectability. Small features can be missed or fused, and even when larger features are detected, their spatial extent might be poorly represented. Note how the bright set
Figure 2.9: Quantization effects: (a) a 10x10 scene of black (0) and white (8) tiles; (b) the 5x5 digital image obtained when each 2x2 block of tiles falls on one CCD element; (c) the same scene with the sampling grid displaced by one tile horizontally and vertically; (d) the resulting 5x5 digital image.
11 Definition A mixed pixel is an image pixel whose intensity represents a sample from
a mixture of material types in the real world.
Figure 2.10: Many devices or application programs create, consume, or convert image data. Standard format image files (IFs) are needed to do this productively for a family of devices and programs.
called raster order, perhaps with line-feeds separating rows. Information such as image type,
size, time taken, and creation method is not part of a raw image. Such information might
be handwritten on a tape label or in someone's research notebook; this is inadequate. (One
project, in which one author took part, videotaped a bar code before videotaping images
from the experiment. The computer program would then process the bar code to obtain
overall non-image information about the experimental treatment.) Most recently developed
standard formats contain a header with non-image information necessary to label the data
and to decode it.
Several formats originated with companies creating image processing or graphics tools;
in some cases but not in others, public documentation and conversion software are available.
The details provided below should give the reader practical information for handling computer images. Although the details are changing rapidly with technology, there are several general concepts contained in this section that should endure.
Image File Header
A file header is needed to make an image file self-describing so that image processing tools can work with it. The header should contain the image dimensions, type, date of creation, and some kind of title. It may also contain a color table or coding table to be used to interpret pixel values. A nice feature not often available is a history section containing notes on how the image was created and processed.
Image Data
Some formats can handle only limited types of images, such as binary and monochrome; however, those surviving today have continued to grow to include more image types and features. Pixel size and image size limits typically differ between different file formats. Several formats can handle a sequence of frames. Multimedia formats are evolving and include image data along with text, graphics, music, etc.
Data Compression
Many formats provide for compression of the image data so that all pixel values are not directly encoded. Image compression can reduce the size of an image to 30% or even 3% of its raw size depending on the quality required and method used. Compression can be lossless or lossy. With lossless compression, the original image can be recovered exactly. With lossy compression, the pixel representations cannot be recovered exactly: sometimes a loss of quality is perceived, but not always. To implement compression, the image file must include some overhead information about the compression method and parameters. Most digital images are very different from symbolic digital information: loss or change of a few bits of digital image data will have little or no effect on its consumer, regardless of whether it is a human or machine. The situation is quite different for most other computer files; for example, changing a single bit in an employee record could change the salary field by $8192 or the apartment address from 'A' to 'B'. Image compression is an exciting area that spans the gamut from signal processing to object recognition. Image compression is discussed at several points of this textbook, but is not systematically treated.
Run-code A : 8(0)5(1)12(0)3(1)7(0)9(1)5(0)
Run-code B : (8,12)(25,27)(35,43)
Figure 2.11: Run-coding encodes the runs of consecutive 0 or 1 values, and for some domains, yields an efficiently compressed image.
Assume a binary image; for each image row, we could record the number of 0's followed by the number of 1's, alternating across the entire row. Figure 2.11A gives an example. Run-code B of the figure shows a more compact encoding of just the 1-runs from which we can still recover the original row. We will use such encodings for some algorithms in this book. Run-coding is often used for compression within standard file formats.
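A minimal sketch of both encodings for a single binary row follows; the row is reconstructed from the runs shown in Figure 2.11, and the alternating-count convention (starting with the number of leading 0's) is one reasonable reading of Run-code A.

#include <stdio.h>

/* Print Run-code A for one binary row: alternating counts of 0s and 1s,
   starting with the count of leading 0s (possibly zero). */
void run_code_a(const int *row, int n)
{
    int val = 0, count = 0;
    for (int i = 0; i < n; i++) {
        if (row[i] == val) {
            count++;
        } else {
            printf("%d(%d)", count, val);
            val = row[i];
            count = 1;
        }
    }
    printf("%d(%d)\n", count, val);
}

/* Print Run-code B: the (first, last) column of each run of 1s. */
void run_code_b(const int *row, int n)
{
    for (int i = 0; i < n; i++) {
        if (row[i] == 1) {
            int first = i;
            while (i + 1 < n && row[i + 1] == 1) i++;
            printf("(%d,%d)", first, i);
        }
    }
    printf("\n");
}

int main(void)
{
    int row[49];
    /* Build the example row of Figure 2.11: 1s in columns 8-12, 25-27, 35-43. */
    for (int i = 0; i < 49; i++)
        row[i] = ((i >= 8 && i <= 12) || (i >= 25 && i <= 27) || (i >= 35 && i <= 43)) ? 1 : 0;

    run_code_a(row, 49);   /* 8(0)5(1)12(0)3(1)7(0)9(1)5(0) */
    run_code_b(row, 49);   /* (8,12)(25,27)(35,43) */
    return 0;
}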
PGM: Portable Grey Map
One of the simplest file formats for storing and exchanging image data is the PBM or "Portable Bit Map" family of formats (PBM/PGM/PPM). The image header and pixel information are encoded in ASCII. The image file representing an image of 8 rows of 16 columns with maximum grey value of 192 is shown in Figure 2.12. Two graphic renderings are also shown; each is the output of image conversion tools applied to the original text input. The image at the lower left was made by replicating the pixels to make a larger image of 32 rows of 64 columns each; the image at the lower right was made by first converting to JPG format with lossy compression. The first entry of the PGM file is the Magic Value, "P2" in our example, indicating how the image information is coded (ASCII grey levels in our example). Binary, rather than ASCII, pixel coding is available for large pictures. (The magic value for binary PGM is "P5".)
P2
16 8
192
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 128 128 64 64 64 128 128 64 64 192 192 64 64 64
64 64 128 128 64 64 64 128 128 64 64 192 192 64 64 64
64 64 128 128 128 128 128 128 128 64 64 64 64 64 64 64
64 64 128 128 128 128 128 128 128 64 64 128 128 64 64 64
64 64 128 128 64 64 64 128 128 64 64 128 128 64 64 64
64 64 128 128 64 64 64 128 128 64 64 128 128 64 64 64
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
Figure 2.12: Text (ASCII) file representing an image of the word "Hi"; 64 is the background level, 128 is the level of "H" and the lower part of "i", and 192 is the level of the dot of the "i". At the lower left is a printed picture made from the above text file using image format conversion tools. At the bottom right is an image made using a lossy compression algorithm.
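Because the format is so simple, a complete ASCII PGM file can be written by a short C program. The sketch below writes the "Hi" image of Figure 2.12; the output file name hi.pgm is an arbitrary choice.

#include <stdio.h>

#define ROWS 8
#define COLS 16

int main(void)
{
    /* The "Hi" image of Figure 2.12: background 64, letters 128, dot of the "i" 192. */
    static const int I[ROWS][COLS] = {
        {64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64},
        {64,64,128,128,64,64,64,128,128,64,64,192,192,64,64,64},
        {64,64,128,128,64,64,64,128,128,64,64,192,192,64,64,64},
        {64,64,128,128,128,128,128,128,128,64,64,64,64,64,64,64},
        {64,64,128,128,128,128,128,128,128,64,64,128,128,64,64,64},
        {64,64,128,128,64,64,64,128,128,64,64,128,128,64,64,64},
        {64,64,128,128,64,64,64,128,128,64,64,128,128,64,64,64},
        {64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64}
    };

    FILE *fp = fopen("hi.pgm", "w");
    if (fp == NULL) return 1;

    /* Header: magic value, columns, rows, maximum grey value. */
    fprintf(fp, "P2\n%d %d\n%d\n", COLS, ROWS, 192);

    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++)
            fprintf(fp, "%d ", I[r][c]);
        fprintf(fp, "\n");
    }
    fclose(fp);
    return 0;
}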
Exercise 8
Locate an image viewing toolset on your computer system. (These might be available just by clicking on your image file icon.) Use one image of a face and one of a landscape; both should originally be of high quality, say 800 x 600 color pixels, from a flatbed scanner or digital camera. Convert the image among different formats (GIF, TIFF, JPEG, etc.). Record the size of the encoded image files in bytes and note the quality of the image; consider the overall scene plus small details.
many regions do not change much from one frame to the next and an encoding scheme can just encode changes and even predict frames from frames before and after in the video sequence. (Future versions of MPEG will have codes for recognized objects and program code to generate their images.) Media quality is determined at the time of encoding. Motion JPEG is a hybrid scheme which just applies JPEG compression to single video frames and does not take advantage of temporal redundancy. While encoding and decoding are simplified using Motion JPEG, compression is not as good, so memory usage and transmission will be poorer than with MPEG. Use of motion vectors by MPEG for compression of video is described in Chapter 9.
Comparison of Formats
Table 2.1 compares some popular image formats in terms of storage size. The left columns of the table apply to the tiny 8 x 16 greyscale picture "Hi" whereas the right column applies to a 347 x 489 color image. It is possible to obtain different size pictures for the "same" image by using different sequences of format conversions. For example, the CARS.TIF file output from the scanner was 509,253 bytes, whereas a conversion to a GIF file with only 256 colors required 138,267 bytes, and a TIF file derived from that required 171,430 bytes. This final TIF file had fewer bits in the color codes, but appeared to be qualitatively the same viewed on the CRT. The JPEG file, one third its size, also displayed the same. While the lossy JPEG is a clear winner in terms of space, this is at a cost of decoding complexity which may require hardware for real-time performance.
Figure 2.13: A complex scene with many kinds of depth cues for human perception.
the viewer and the surface orientation. In the park, we can see individual blades of grass or
maple leaves up close, while far away we see only green color. The change of image texture
due to perspective viewing of a surface receding in the distance is called the texture gradient.
Chapter 12 gives much more discussion of the issues just mentioned.
Figure 2.14: Five coordinate frames needed for 3D scene analysis: world W, object O (for
pyramid Op or block Ob ), camera C, real image F and pixel image I.
Figure 2.15: Boresighted multispectral scanner aboard a satellite. Radiation from a single
surface element is refracted into separate components based on wavelength.
rectangular area is scanned. Having a single sensor gives one advantage over the CCD array: there should be less variation in intensity values due to manufacturing differences. Another advantage is that many more rows and columns can be obtained. Such an instrument is slow, however, and cannot be used in automation environments.
The reader might find some interest in the following history of a related scanning technique. In the 1970's in the lab of Azriel Rosenfeld, many pictures were input for computer processing in the following manner. Black and white pictures were taken and wrapped around a steel cylinder. Usually 9 x 9 inch pictures or collages were scanned at once. The cylinder was placed in a standard lathe which spun all spots of the picture area in front of a small LED and sensor that measured light reflecting off each spot. Each revolution of the cylinder produced a row of 3600 pixels that were stored as a block on magnetic tape whose recording speed was in sync with the lathe! The final tape file had 3600 x 3600 pixels, usually containing many experimental data sets that were then separated by software.
* Color and Multispectral Images
Because the human eye has separate receptors for sensing light in separate wavelength bands, it may be called a multispectral sensor. Some color CCD cameras are built by placing a thin refracting film just in front of the CCD array. The refracting film disperses a single beam of white light into 4 beams falling on 4 neighboring cells of the CCD array. The digital image it produces can be thought of as a set of four interleaved color images, one for each of the 4 differently refracted wavelength components. The gain in spectral information is traded for a loss in spatial resolution. In a different design, a color wheel is synchronously rotated in the optical pathway so that during one period of time only red light is passed; then blue, then green. (A color wheel is just a disk of transparent film with equal size sectors of each color.) The CCD array is read 3 times during one rotation of the color wheel to obtain 3 separate images. In this design, sensing speed is traded for color sensitivity; a point on a rapidly moving object may actually image to different pixels on the image plane during acquisition of the 3 separate images.
Some satellites use the concept of sensing through a straw or boresight: each spot of the earth is viewed through a boresight so that all radiation from that spot is collected at the same instant of time, while radiation from other spots is masked. See Figure 2.15. The beam of radiation is passed through a prism which disperses the different wavelengths onto a linear CCD array which then simultaneously samples and digitizes the intensity in the several bands used. (Recall that light of shorter wavelength is bent more by the prism than light of longer wavelength.) Figure 2.15 shows a spectrum of 5 different bands resulting in a pixel that is a vector [b1, b2, b3, b4, b5] of 5 intensity values. A 2D image is created by moving the boresight or using a scanning mirror to get columns of a given row. The motion of the satellite in its orbit around the earth yields the different rows of the image. As you might expect, the resulting image suffers from motion distortion: the set of all scanned spots forms a trapezoidal region on the earth whose form can be obtained from the "rectangular" digital image file using the warping methods of Chapter 11. By having a spectrum of intensity values rather than just a single intensity for a single spot of earth, it is often possible to classify the ground type as water or forest or asphalt, etc.
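As a loose illustration of that last remark, the following sketch classifies one multispectral pixel by the nearest class mean; the band values and the class means are invented for the example and do not come from any real sensor, and nearest-mean is only one of many possible decision rules.

#include <stdio.h>

#define BANDS 5
#define CLASSES 3

int main(void)
{
    /* Hypothetical mean spectra (5 bands) for three ground types; the numbers
       are invented for illustration only. */
    const char  *name[CLASSES] = { "water", "forest", "asphalt" };
    const double mean[CLASSES][BANDS] = {
        { 60, 50, 40, 20, 10 },    /* water   */
        { 40, 60, 45, 90, 70 },    /* forest  */
        { 80, 80, 75, 70, 65 }     /* asphalt */
    };

    double pixel[BANDS] = { 45, 58, 47, 85, 66 };   /* one multispectral pixel */

    int best = 0;
    double best_d = 1e30;
    for (int k = 0; k < CLASSES; k++) {
        double d = 0.0;
        for (int b = 0; b < BANDS; b++) {
            double diff = pixel[b] - mean[k][b];
            d += diff * diff;      /* squared Euclidean distance to the class mean */
        }
        if (d < best_d) { best_d = d; best = k; }
    }
    printf("pixel classified as %s\n", name[best]);   /* forest for this example */
    return 0;
}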
* X-ray
X-ray devices transmit X-ray radiation through material, often parts of the human body, but also welded pipes and jars of applesauce. Sensors record transmitted energy at image points on the far side of the emitter in much the same manner as with the microdensitometer. Low energy at one sensed image point indicates an accumulation of material density along the entire ray projected from the emitter. It is easy to imagine a 2D X-ray film being exposed to X-rays passing through a body. 3D sensing can be accomplished using a CT scanner ("cat" scanner), which mathematically constructs a 3D volume of density values from data collected by projecting X-rays along many different rays through the body. At the right in Figure 2.16 is a 2D computer graphic rendering of high density 3D voxels from a CT scan of a dog: these voxels are rendered as if they were nontransparent reflecting surfaces seen from a particular viewpoint. A diagnostician can examine the sensed bone structure from any viewpoint.
Exercise 11
Think about some of your own dental X-rays: what was bright and what was dark, the dense tooth or the softer cavity? Why?
Figure 2.16: (Left) A maximum intensity projection (MIP) made by projecting the brightest pixels from a set of MRA slices from a human head (provided by MSU Radiology); (right) a computer generated image displaying high density voxels of a set of CT scans as a set of illuminated surface elements. Data courtesy of Theresa Bernardo.
Figure 2.17: A LIDAR sensor can produce pixels containing both range and intensity.
can be determined by analytical geometry: we have one equation in those 3 unknowns from
the light sheet and 2 equations in those 3 unknowns from the imaging ray; solving these 3
simultaneous linear equations yields the location of the 3D surface point. In Chapter 13,
calibration methods are given that enable us to derive the necessary equations from several
measurements made on the workbench.
The above argument is even simpler if a single beam of light is projected rather than an entire plane of light. Many variations exist and a sensor is usually chosen according to the particular application. To scan an entire scene, a light sheet or beam must be swept across the scene. Scanning mirrors can be used to do this, or objects can be translated past the sheet over time using a conveyor system. Many creative designs can be found in the literature. Machines using multiple light sheets are used to do automobile wheel alignment and to check the fit of car doors during manufacture. When looking at specific objects in very specific poses, image analysis may just need to verify if a particular image stripe is close enough to some ideal image position. The stream of observations from the sensor is used to adjust online manufacturing operations for quality control and for reporting offline.
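To make the counting of equations concrete, the sketch below solves the three simultaneous linear equations for one surface point by Cramer's rule. The plane coefficients, image point, and focal length are invented example values, and the imaging ray is written in the simplified form x = (xc/f)z, y = (yc/f)z for a camera at the world origin; the general calibrated case is treated in Chapter 13.

#include <stdio.h>

/* Determinant of a 3 x 3 matrix (fine for solving one point by Cramer's rule). */
static double det3(const double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

int main(void)
{
    /* Row 0: the light-sheet plane a x + b y + c z = d (one equation).
       Rows 1-2: the imaging ray as two linear equations, here x - (xc/f) z = 0
       and y - (yc/f) z = 0. All coefficients are invented example values. */
    double xc = 0.01, yc = 0.005, f = 0.05;        /* image point and focal length */
    double A[3][3] = {
        { 1.0, 0.0, 1.0 },                         /* plane: x + z = 2 */
        { 1.0, 0.0, -xc / f },
        { 0.0, 1.0, -yc / f }
    };
    double b[3] = { 2.0, 0.0, 0.0 };

    double D = det3(A);
    double x[3];
    for (int i = 0; i < 3; i++) {
        double Ai[3][3];
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
                Ai[r][c] = (c == i) ? b[r] : A[r][c];   /* replace column i with b */
        x[i] = det3(Ai) / D;
    }
    printf("surface point: (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}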
2.10 References
More specific information about the design of imaging devices can be found in the text by Schalkoff (1989). Tutorials and technical specifications of charge-coupled devices are readily found on the web using a search engine: one example is a tutorial provided by the University of Wisconsin at www.mrsec.wisc.edu/edetc/ccd.html. One of several early articles on the development of color CCD cameras is by Dillon et al (1978). Discussion and modeling of many optical phenomena can be found in the book by Hecht and Zajac (1976).
Triangulating 3D sensor: a laser stripe projector defines the light-sheet plane a xw + b yw + c zw = d in world coordinates; the imaging ray through the CCD camera image point (xc, yc) yields two more linear equations in the three unknowns (xw, yw, zw).