Bailey Python Book
Duane A. Bailey
Williams College
September 2013
This primordial text copyrighted 2010-2013 by Duane Bailey
Preface ix
0 Python 1
0.1 Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Statements and Comments . . . . . . . . . . . . . . . . . . . . . . 3
0.3 Object-Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.4 Built-in Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
0.4.1 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
0.4.2 Booleans . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
0.4.3 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.4.4 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.4.5 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
0.4.6 Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . 13
0.4.7 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
0.5 Sequential Statements . . . . . . . . . . . . . . . . . . . . . . . . 14
0.5.1 Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . 14
0.5.2 Assignment (=) Statement . . . . . . . . . . . . . . . . . . 15
0.5.3 The pass Statement . . . . . . . . . . . . . . . . . . . . . 15
0.5.4 The del Statement . . . . . . . . . . . . . . . . . . . . . . 15
0.6 Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.6.1 If statements . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.6.2 While loops . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.6.3 For statements . . . . . . . . . . . . . . . . . . . . . . . . 18
0.6.4 Comprehensions . . . . . . . . . . . . . . . . . . . . . . . 20
0.6.5 The return statement . . . . . . . . . . . . . . . . . . . . 21
0.6.6 The yield operator . . . . . . . . . . . . . . . . . . . . . . 21
0.7 Exception Handling . . . . . . . . . . . . . . . . . . . . . . . . . . 21
0.7.1 The assert statement . . . . . . . . . . . . . . . . . . . . 22
0.7.2 The raise Statement . . . . . . . . . . . . . . . . . . . . . 22
0.7.3 The try Statement . . . . . . . . . . . . . . . . . . . . . . 23
0.8 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
0.8.1 Method Definition . . . . . . . . . . . . . . . . . . . . . . 24
0.8.2 Method Renaming . . . . . . . . . . . . . . . . . . . . . . 27
0.8.3 Lambda Expressions: Anonymous Methods . . . . . . . . 28
0.8.4 Class Definitions . . . . . . . . . . . . . . . . . . . . . . . 28
0.8.5 Modules and Packages . . . . . . . . . . . . . . . . . . . . 29
0.8.6 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
0.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Contract-Based Design 63
3.1 Using Interfaces as Contracts . . . . . . . . . . . . . . . . . . . . 64
3.2 Polymorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Incrementalism: Good or Bad? . . . . . . . . . . . . . . . . . . . 68
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4 Lists 71
4.1 Python’s Built-in list Class . . . . . . . . . . . . . . . . . . . . . 71
4.1.1 Storage and Allocation . . . . . . . . . . . . . . . . . . . . 71
4.1.2 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Alternative list Implementations . . . . . . . . . . . . . . . . . . 74
4.2.1 Singly-linked Lists . . . . . . . . . . . . . . . . . . . . . . 74
4.2.2 Doubly-Linked Lists . . . . . . . . . . . . . . . . . . . . . 83
4.2.3 Circularly Linked Lists . . . . . . . . . . . . . . . . . . . . 89
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5 Iteration 93
5.1 The Iteration Abstraction . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Traversals of Data Structures . . . . . . . . . . . . . . . . . . . . 98
5.2.1 Implementing Structure Traversal . . . . . . . . . . . . . . 98
5.2.2 Use of Structure Traversals . . . . . . . . . . . . . . . . . 100
5.3 Iteration-based Utilities . . . . . . . . . . . . . . . . . . . . . . . 102
7 Sorting 125
7.1 Approaching the Problem . . . . . . . . . . . . . . . . . . . . . . 125
7.2 Selection Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.3 Insertion Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.4 Quicksort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.5 Mergesort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.5.1 Adaptive Sorts . . . . . . . . . . . . . . . . . . . . . . . . 140
7.6 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.7 Key Functions: Supporting Multiple Sort Keys . . . . . . . . . . . 144
7.8 Radix and Bucket Sorts . . . . . . . . . . . . . . . . . . . . . . . . 145
7.9 Sorting Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.10 Ordering Objects Using Comparators . . . . . . . . . . . . . . . . 148
7.11 Vector-Based Sorting . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.12 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
13 Sets 249
13.1 The Set Abstract Base Class . . . . . . . . . . . . . . . . . . . . . 249
13.2 Example: Sets of Integers . . . . . . . . . . . . . . . . . . . . . . 251
13.3 Hash tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
13.3.1 Fingerprinting Objects: Hash Codes . . . . . . . . . . . . . 255
13.4 Sets of Hashable Values . . . . . . . . . . . . . . . . . . . . . . . 261
13.4.1 HashSets . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
13.4.2 ChainedSets . . . . . . . . . . . . . . . . . . . . . . . . . . 267
13.5 Freezing Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 270
14 Maps 273
14.1 Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
14.1.1 Example: The Symbol Table, Revisited . . . . . . . . . . . 273
14.1.2 Unordered Mappings . . . . . . . . . . . . . . . . . . . . . 274
14.2 Tables: Sorted Mappings . . . . . . . . . . . . . . . . . . . . . . . 277
14.3 Combined Mappings . . . . . . . . . . . . . . . . . . . . . . . . . 278
15 Graphs 279
15.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
15.2 The Graph Interface . . . . . . . . . . . . . . . . . . . . . . . . . 280
15.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
15.3.1 Abstract Classes Reemphasized . . . . . . . . . . . . . . . 287
15.3.2 Adjacency Matrices . . . . . . . . . . . . . . . . . . . . . . 288
15.3.3 Adjacency Lists . . . . . . . . . . . . . . . . . . . . . . . . 296
15.4 Examples: Common Graph Algorithms . . . . . . . . . . . . . . . 301
15.4.1 Reachability . . . . . . . . . . . . . . . . . . . . . . . . . . 302
15.4.2 Topological Sorting . . . . . . . . . . . . . . . . . . . . . . 304
15.4.3 Transitive Closure . . . . . . . . . . . . . . . . . . . . . . 306
15.4.4 All Pairs Minimum Distance . . . . . . . . . . . . . . . . . 307
15.4.5 Greedy Algorithms . . . . . . . . . . . . . . . . . . . . . . 307
15.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
A Answers 313
A.1 Solutions to Self Check Problems . . . . . . . . . . . . . . . . . . 313
A.2 Solutions to Select Problems . . . . . . . . . . . . . . . . . . . . . 313
for Mary,
my wife and best friend
0.1 Execution
There are several ways to use Python, depending on the level of interaction
required. In interactive mode, Python accepts and executes commands one
statement at a time, much like a calculator. For very small experiments, and
while you are learning the basics of the language, this is a helpful approach to
learning about Python.1
1 For those who are concerned about the penalties associated with crafting Python data structures,
we encourage you to read on, in Part II of this text, where we consider how to tune Python structures.
Here, for example, we experiment with calculating the golden ratio:
% python3
Python 3.1.2 (r312:79360M, Mar 24 2010, 01:33:18)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> golden = (1+math.sqrt(5))/2
>>> print(golden)
1.61803398875
>>> quit()
Experimentation is important to our understanding of the workings of any com-
plex system. Python encourages this type of exploration.
Much of the time, however, we place our commands in a single file, or
script. To execute those commands, one simply types python3 followed by
the name of the script file. For example, a traditional first script, hello.py,
prints a friendly greeting:
print("Hello, world!")
This is executed by typing:
python3 hello.py
This causes the following output:
Hello, world!
Successfully writing and executing this program makes you a Python program-
mer. (If you’re new to Python programming, you might print and archive this
program for posterity.)
Many of the scripts in this book implement new data structures; we’ll see
how to do that in time. Other scripts—we’ll think of them as applications—are
meant to be directly used to provide a service. Sometimes we would like to
use a Python script to extend the functionality of the operating system by
implementing a new command or application. To do this seamlessly typically
involves some coordination with the operating system. If you’re using the Unix
operating system, for example, the first line of the script is the shebang line.3
Here are the entire contents of the file golden:
#!/usr/bin/env python3
import math
golden = (1+math.sqrt(5))/2
print(golden)
When the script is made executable (or blessed), then the command appears as
2 Throughout this text we explicitly type python3, which executes the third major version of Python.
Local documentation for your implementation of Python 3 is accessible through the shell command
pydoc3.
3 The term shebang is slang for the hash-bang or shell-comment-bang line. These two characters
form a magic number that informs the Unix loader that the program is a script to be interpreted.
The script interpreter (python3) is found by the /usr/bin/env program.
a new application, part of the operating system. In Unix, we make a script
executable by giving its user (u+) permission to execute (x) the script:
% chmod u+x golden
% golden
1.61803398875
Once the executable script is placed in a directory mentioned by the PATH envi-
ronment variable, the Python script becomes a full-fledged Unix command.4
4 In this text, we’ll assume that the current directory (in Unix, this directory is .) has been placed
in the path of executables.
0.2 Statements and Comments
explicitly continued long statements make your script unreadable and are to be
avoided.
Comments play an important role in understanding the scripts you write;
we discuss the importance of comments in Chapter 2. Comments begin with a
hash mark (#) and are typically indented as though they were statements. The
comment ends at the end of the line.
0.3 Object-Orientation
Much of the power of Python comes from the simple fact that everything is
an object. Collections of data, methods, and even the most primitive built-in
constants are objects. Because of this, effective programmers can be uniformly
expressive, whether they’re manipulating constants or objects with complex be-
havior. For example, elementary arithmetic operations on integers are imple-
mented as factory methods that take one or more integer values and produce
new integer values. Some accessor methods gain access to object state (e.g. the
length of a list), while other mutator methods modify the state of the object (e.g.
append a value to the end of the list). Sometimes we say the mutator methods
are destructive to emphasize the fact an object’s state changes. The programmer
who can fully appreciate the use of methods to manipulate objects is in a better
position to reason about how the language works at all levels of abstraction. In
cases where Python’s behavior is hard to understand, simple experiments can
be performed to hone one’s model of the language’s environment and execution
model.
We think about this early on—even before we meet the language in any
significant way—because the deep understanding of how Python works and
how to make it most effective often comes from careful thought and simple
experimentation.
defined for a particular class may be used to directly manipulate the class’s
objects.5
It is sometimes useful to have a placeholder value, a targetless reference,
or a value that does not refer to any other type. In Python this value is None.
Its sole purpose in Python is to represent “I’m not referring to anything.” In
Python, an important idiomatic expression guards against accessing the meth-
ods of a None or null reference:
if v is not None:
    ...
0.4.1 Numbers
Numeric types may be integer, real, or complex (PLRM 3.2). Integers have no
fractional part, may be arbitrarily large, and integer literals are written without
a decimal point. Real-valued literals are always written with a decimal point,
and may be expressed in scientific notation. Pure complex literals are indicated
by a trailing j, and arbitrary complex values result from adding an integer or
real value to a pure complex one. These three built-in types are all examples of
the Number type.6
5 Python makes heavy use of generic methods, like len(l), that nonetheless ultimately make a
call to an object's method, in this case, the special method l.__len__().

type           literal
numbers        1, -10, 0
               3.4, 0.0, -.5, 3., 1.0e-5, 0.1E5
               0j, 3.1+1.3e4J
booleans       False, True
strings        "hello, world"
               'hello, world'
tuples         (a,b)
               (a,)
lists          [3,3,1,7]
dictionaries   {"name": "duane", "robot": False, "height": 72}
sets           {11, 2, 3, 5, 7}
objects        <__main__.animal object at 0x10a7edd90>
The arithmetic operations found in the table of Figure 3 are supported by all
numeric types. Different types of values combined with arithmetic operations
are promoted to the most general numeric type involved to avoid loss of infor-
mation. Integers, for example, may be converted to floating point or complex
values, if necessary. The functions int(v), float(v), and complex(v) convert
numeric and string values, v, to the respective numerics. This may appear to be
a form of casting often found in other languages but, with thought, it is obvious
that these conversion functions are simply constructors that flexibly accept
a wide variety of types to initialize the value.
C or Java programmers should note that exponentiation is directly supported
and that Python provides both true, full-precision division (/) and floor
division (//), which rounds the quotient down toward negative infinity. The expression
(a//b)*b+(a%b)
always returns a value equal to a. Comparison operations may be chained (e.g.
0 <= i < 10) to efficiently perform range testing for numeric types. Integer
values may be written in base 16 (with an 0x prefix), in base 8 (with an 0o
prefix), or in binary (with an 0b prefix).
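These rules are easy to check experimentally; a short sketch (assuming Python 3):

```python
# Mixed-type arithmetic promotes toward the most general type involved.
x = 2 + 0.5
print(type(x).__name__)                # float

# The floor-division/remainder invariant: (a//b)*b + (a%b) == a,
# even when an operand is negative.
for a, b in [(7, 3), (-7, 3), (7, -3)]:
    print((a // b) * b + (a % b) == a)  # True in every case

# Chained comparisons perform a range test in one expression.
i = 5
print(0 <= i < 10)                     # True

# Integer literals in base 16, base 8, and base 2.
print(0x10, 0o10, 0b10)                # 16 8 2
```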
6 The Number type is found in the numbers module. The type is rarely used directly.
Figure 3 The 33 words reserved in Python and not available for other use
0.4.2 Booleans
The boolean type, bool, includes the values False and True. In numeric expres-
sions these are treated as values 0 and 1, respectively. When boolean values are
constructed from numeric values, non-zero values become True and zero val-
ues become False. When container objects are encountered, they are converted
to False if they are empty, or to True if the structure contains values. Where
Python requires a condition, non-boolean expressions are converted to boolean
values first.
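A few conversions illustrate these rules:

```python
# Non-zero numeric values convert to True; zero converts to False.
print(bool(42), bool(0))            # True False
# Containers convert according to whether they hold any values.
print(bool([]), bool([0]))          # False True
# Any non-empty string is True, even the string "False".
print(bool(""), bool("False"))      # False True
```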
Boolean values can be manipulated with the logical operations not, and, and
or. The unary not operation converts its value to a boolean (if necessary), and
then computes the logical opposite of that value. For example:
>>> not True
False
>>> not 0
True
>>> not "False" # this is a string
False
The operations and and or are short-circuiting operations. The binary and op-
eration, for example, immediately returns the left-hand side if it is equivalent
to False. Otherwise, it returns the right-hand side. One of the left or right
operands is always returned, and that value is never converted to a bool. For example:
>>> False and True
False
>>> 0 and f(1) # f is never called
0
>>> "left" and "right"
"right"
This operation is sometimes thought of as a guard: the left argument to the and
operation is a condition that must be met before the right side is evaluated.
The or operator returns the left argument if it is equivalent to True, other-
wise it returns its right argument. This behaves as follows:
>>> True or 0
True
>>> False or True
True
>>> '' or 'default'
'default'
We sometimes use the or operator to provide a default value (the right side), if
the left side is False or empty.
Although the and and or operators have very flexible return values, we typ-
ically imagine those values are booleans. In most cases the use of and or or
to compute a non-boolean value makes the code difficult to understand, and
the same computation can be accomplished using other techniques, just as effi-
ciently.
0.4.3 Tuples
The tuple type is a container used to gather zero or more values into a single
object. In C-like languages these are called structs. Tuple constants or tuple
displays are formed by enclosing comma-separated expressions in parentheses.
The last field of a tuple can always be followed by an optional comma, but
if there is only one expression, the final comma is required. An empty tuple
is specified as a comma-free pair of parentheses (()). In non-parenthesized
contexts the comma acts as an operator that concatenates expressions into a
tuple. Good style dictates that tuples are enclosed in parentheses, though they
are only required if the tuple appears as an argument in a function or method
call (f((1,2))) or if the empty tuple (written ()) is specified. Tuple displays
give one a good sense of Python’s syntactic flexibility.
Tuples are immutable; the collection of item references cannot change, though
the objects they reference may be mutable.
>>> emptyTuple = ()
>>> president = ('barack','h.','obama',1.85)
>>> skills = 'python','snake charming'
>>> score = (('red sox',9),('yankees',2))
>>> score[0][1]+score[1][1]
11
The fields of a tuple are directly accessed by a non-negative index, thus the
value president[0] is 'barack'. The number of fields in a tuple can be determined
by calling the len(t) function; len(skills), for example, returns 2.
When a negative value is used as an index, it is first added to the length of the
tuple. Thus, president[-1] returns president[3], or the president’s height in
meters.
When a tuple display is used as a value, the fields of a tuple are the result
of evaluation of expressions. If the display is to be an assignable target (i.e.
it appears on the left side of an assignment) each item must be composed of
assignable names. In these packed assignments the right side of the assignment
must have a similar shape, and the binding of names is equivalent to a collection
of independent, simultaneous or parallel assignments. For example, the idiom
for exchanging the values referenced by names jeff and eph is shown on the
second line, below
>>> (jeff,eph) = ('trying harder','best')
>>> (jeff,eph) = (eph,jeff) # swap them!
>>> print("jeff={0} eph={1}".format(jeff,eph))
jeff=best eph=trying harder
This idiom is quite often found in sorting applications.
0.4.4 Lists
Lists are variable-length ordered containers and correspond to arrays in C or
vectors in Java. List displays are collections of values enclosed by square brack-
ets ([...]). Unlike tuples, lists can be modified in ways that often change
their length. Because there is little ambiguity in Python about the use of square
brackets, it is not necessary (though it is allowed) to follow the final element
of a list with a comma, even if the list contains one element. Empty lists are
represented by square brackets without a comma ([]).
You can access elements of a list using an index. The first element of list l is
l[0] and the length of the list can be determined by len(l). The last element
of the list is l[len(l)-1], or l[-1]. You add a new element e to the “high
index” end of list l with l.append(e). That element may be removed from l
and returned with l.pop(). Appending many elements from a second list (or
other iterable7 ) is accomplished with extend (e.g. l.extend(l2)). Lists can be
reversed (l.reverse()) or reordered (l.sort()), in place.
Portions of lists (or tuples) may be extracted using slices. A slice is a regular
pattern of indexes. There are several forms, as shown in the table of Figure 4. To
delete or remove a value from a list, we use the del operator: del l[i] removes
the element at the ith index, moving elements at higher indices downward.
Slices of lists may be used for their value (technically, an r-value, since they
appear on the right side of assignment), or as a target of assignment (technically,
an l-value). When used as an r-value, a copy of the list slice is constructed as a
new list whose values are shared with the original. When the slice appears as
a target in an assignment, that portion of the original list is removed, and the
assigned value is inserted in place. When the slice appears as a target of a del
operation, the portion of the list is logically removed.
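Each of these three uses of a slice can be sketched briefly:

```python
l = [0, 1, 2, 3, 4, 5]

# As an r-value: a new list whose values are shared with the original.
middle = l[2:4]
print(middle)            # [2, 3]

# As an assignment target: the slice is removed and the assigned
# values are spliced in; the lengths need not match.
l[2:4] = ['a', 'b', 'c']
print(l)                 # [0, 1, 'a', 'b', 'c', 4, 5]

# As the target of del: that portion is logically removed.
del l[2:5]
print(l)                 # [0, 1, 4, 5]
```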
Lists, like tuples, can be used to bundle targets for parallel list assignment.
For example, the swapping idiom we have seen in the tuple discussion can be
recast using lists:
>>> here="bruce wayne"
>>> away="batman"
>>> [here,away] = [away,here]
7 Lists are just one example of an iterable object. Iterable objects, which are ubiquitous in Python,
are discussed in more detail in Chapter 5.
>>> here
'batman'
>>> away
’bruce wayne’
In practice, this approach is rarely used.
When the last assignable item of a list is preceded by an asterisk (*), the
target is assigned the list of (zero or more) remaining unassigned values. This
allows, for example, the splitting of a list into parts in a single packed assign-
ment:
>>> l = [2, 3, 5, 7, 11]
>>> [head,*l] = l
>>> head
2
>>> l
[3, 5, 7, 11]
As we shall see in Chapter 4, lists are versatile structures, but they can be
used inefficiently if you are not careful.
0.4.5 Strings
Strings (type str) are ordered lists of characters enclosed by quote (') or
quotation (") delimiters. There is no difference between these two quoting
mechanisms, except that each may be used to specify string literals that contain the
other type of quote. For example, one might write:
demand = "See me in the fo’c’sle, sailor!"
reply = "Where’s that, Sir?"
retort = ’What do you mean, "Where\’s that, Sir?"?’
In any case, string-delimiting characters may be protected from interpretation
or escaped within a string. String literals are typically specified within a single
line of Python code. Other escape sequences may be used to include a variety
of white space characters (see Figure 6). To form a multi-line string literal,
enclose it in triple quotation marks. As we shall see throughout this text, triple-
quoted strings often appear as the first line of a method and serve as a simple
documentation form called a docstring.
Unlike many other languages, Python has no character type. Instead, strings of
length 1 serve that purpose. Like lists and tuples, strings are indexable
sequences, so indexing and slicing operations can be used. String objects in Python
may not be modified; they are immutable. Instead, string methods are factory
operations that construct new str objects as needed.
The table of Figure 7 is a list of many of the functions that may be called on
strings.
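Because strings are immutable, methods such as upper and replace act as factories, returning new strings and leaving the original untouched:

```python
s = "hello, world"
print(s.upper())                      # HELLO, WORLD
print(s.replace("world", "python"))   # hello, python
print(s)                              # hello, world  (unchanged)

# Indexing and slicing work just as with lists and tuples.
print(s[0], s[-5:])                   # h world
```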
0.4.6 Dictionaries
We think of lists as sequences because their values are accessed by non-negative
index values. The dictionary object is a container of key-value pairs (associations),
where the keys are immutable and unique. (The technical reason for immutability
is discussed later, in Section ??.) Because the keys may not have any natural
order (for example, they may have multiple types), the notion of a sequence is
not appropriate. Freeing ourselves of this restriction allows dictionaries
to be efficient mechanisms for representing discrete functions or mappings.
Dictionary displays are written using curly braces ({...}), with key-value
pairs joined by colons (:) and separated by commas (,). For example, we have
the following:
>>> bigcities = { 'new york':'big apple',
...               'new orleans':'big easy',
...               'dallas':'big d' }
>>> statebirds = { 'AL' : 'yellowhammer', 'ID' : 'mountain bluebird',
...                'AZ' : 'cactus wren', 'AR' : 'mockingbird',
...                'CA' : 'california quail', 'CO' : 'lark bunting',
...                'DE' : 'blue hen chicken', 'FL' : 'mockingbird',
...                'NY' : 'eastern bluebird' }
Given an immutable value, key, you can access the corresponding value using a
simple indexing syntax, d[key]. Similarly, new key-value pairs can be inserted
into the dictionary with d[key] = value, or removed with del d[key]:
>>> bigcities['edmonton'] = 'big e'
>>> for state in statebirds:
...     if statebirds[state].endswith('bluebird'):
...         print(state)
...
NY
ID
8 A view generates its value “lazily.” An actual list is not created, but generated. We look into
generators in Section 5.1.
0.4.7 Sets
Sets are unordered collections of unique immutable values, much like the collec-
tion of keys in a dictionary. Because of their similarity to dictionary types, they
are specified using curly braces ({...}), but the entries are simply immutable
values—numbers, strings, and tuples are common. Typically, all entries are the
same type of object, but they may be mixed. Here are some simple examples:
>>> newEngland = { 'ME', 'NH', 'VT', 'MA', 'CT', 'RI' }
>>> commonwealths = { 'KY', 'MA', 'PA', 'VA' }
>>> colonies13 = { 'CT', 'DE', 'GA', 'MA', 'MD', 'NC', 'NH',
...                'NJ', 'NY', 'PA', 'RI', 'SC', 'VA' }
>>> newEngland - colonies13 # new New England states
{'ME', 'VT'}
>>> commonwealths - colonies13 # new commonwealths
{'KY'}
Sets support union (s1|s2), intersection (s1&s2), difference (s1-s2), and sym-
metric difference (s1^s2). You can test to see if e is an element of set s with
e in s (or the opposite, e not in s), and perform various subset and equality
tests (s1<=s2, s1<s2).
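The remaining operations behave as you would expect; a brief sketch using the sets defined above:

```python
newEngland = {'ME', 'NH', 'VT', 'MA', 'CT', 'RI'}
commonwealths = {'KY', 'MA', 'PA', 'VA'}

print(newEngland & commonwealths)           # {'MA'}: the only New England commonwealth
print(sorted(newEngland | commonwealths))   # union, sorted for a stable ordering
print(newEngland ^ commonwealths)           # members of exactly one of the two sets
print('MA' in newEngland)                   # True
print({'MA', 'CT'} <= newEngland)           # True: a subset test
```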
We discuss the implementation of sets in Chapter 13.
0.5.1 Expressions
Any expression may appear on a line by itself and has no impact on the script
unless the expression has some side effect. Useful expressions call functions
that directly manipulate the environment (like assignment or print()) or they
perform operations or method calls that manipulate the state of an object in the
environment.
When Python is used interactively to run experiments or perform calculations,
expressions can be used to verify the state of the script. If expressions
typed in this mode return a value, that value is printed, and the variable _ is set
to that value. The value None does not get printed, unless it is part of another
structure.
0.6.1 If statements
Choices in Python are made with the if statement. When this statement is
encountered, the condition is evaluated and, if True, the statements of the fol-
lowing suite are executed. If the condition is False, the suite is skipped and, if
provided, the suite associated with the optional else is executed. For example,
the following statement sets isPrime to False if factor divides number exactly:
if (number % factor) == 0:
    isPrime = False
Sometimes it is useful to choose between two courses of action. The following
statement is part of the “hailstone” computation:
if (number % 2) == 0:
    number = number // 2
else:
    number = 3*number + 1
If the number is divisible by 2, it is halved, otherwise it is approximately tripled.
The if statement is potentially nested (each suite could contain if state-
ments), especially in situations where a single value is being tested for a series
of conditions. In these cases, where the indentation can become unnecessarily deep
and unreadable, the elif reserved word is equivalent to else with a subordinate
if, but does not require further indentation. The following two if statements
both search a binary tree for a value:
if tree.value == value:
    return True
else:
    if value < tree.value:
        return value in tree.left
    else:
        return value in tree.right

if value == tree.value:
    return True
elif value < tree.value:
    return value in tree.left
else:
    return value in tree.right
Both statements perform the same decision-making and have the same perfor-
mance, but the second is more pleasing, especially since the conditions perform
related tests.
It is important to note that Python does not contain a “case” or “switch”
statement as you might find in other languages. The same effect, however, can
be accomplished using code snippets stored in a data structure; we’ll see an
extended example of this later.
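Although the extended example comes later, the essential technique is easy to sketch: store small functions in a dictionary, indexed by the value you would otherwise switch on. (The operator names here are our own, chosen for illustration.)

```python
# A dictionary of functions stands in for a switch statement.
def add(a, b): return a + b
def sub(a, b): return a - b
def mul(a, b): return a * b

operations = {'+': add, '-': sub, '*': mul}

def apply(op, a, b):
    # dict.get supplies the "default case" when op is unrecognized
    return operations.get(op, lambda a, b: None)(a, b)

print(apply('+', 2, 3))   # 5
print(apply('*', 2, 3))   # 6
print(apply('?', 2, 3))   # None
```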
f = 2
isPrime = True
while f*f <= n:
    if n%f == 0:
        isPrime = False
        break
    f = 3 if f == 2 else f+2
if isPrime:
    print(n)
Here, the boolean value, isPrime, keeps track of the premature exit from the
loop, if a factor is found. In the following code, Python will print out the number
only if it is prime:
f = 2
while f*f <= n:
    if n%f == 0:
        break
    f = 3 if f == 2 else f+2
else:
    print("{} is prime.".format(n))
Notice that the else suite is executed only when the while loop finishes without
encountering a break. In Python, the while loop is motivated by the need to search
for something. In this light, the else is seen as a mechanism for handling a
failed search. In languages like C and Java, the programmer has a choice of
addressing these kinds of problems with a while or a for loop. As we shall
see over the course of this text, Python’s for loop is a subtly more complex
statement.
9 One definition of an iterable structure is, essentially, that it can be used as the object of for loop
iteration.
While many for loops make use of range, it is important to understand
that range is simply one example of an iterable. A focus of this book is the
construction of a rich set of iterable structures. Making the best use of for
loops is an important step toward making Python scripts efficient and readable.
For example, if you’re processing the characters of a string, one approach might
be to loop over the legal index values:
# s is a string
for i in range(len(s)):
    process(s[i])
10 In fact, one may think of a slice as performing indexing based on the variable of a for loop over
the equivalent range.
Here, process(s[i]) performs some computation on each character. A better
approach might be to iterate across the characters, c, directly:
# s is a string
for c in s:
process(c)
The latter approach is not only more efficient, but the loop is expressed at an
appropriate logical level, something that can be more difficult to achieve in other languages.
Iterators are a means of generating the values that are encountered as one
traverses a structure. A generator (a method that contains a yield) is a method
for generating a (possibly infinite) stream of values for consideration in a for
loop. One can easily imagine a random number generator or a generator of
the sequence of prime numbers. Perfect numbers are easily detected if one has
access to the nontrivial factors of a value:
sum = 0
for f in factors(n):
    sum += f
if sum == n:
    print("{} is perfect.".format(n))
Later, we’ll see how for loops can be used to compactly form comprehensions—
literal displays for built-in container classes.
0.6.4 Comprehensions
One of the novel features of Python is the ability to generate tuple, list, dictionary,
and set displays from conditional loops over iterable structures.
Collectively, these runtime constructions are termed comprehensions. The avail-
ability of comprehensions in Python allows for the compact on-the-fly construc-
tion of container types. The simplest form of list comprehension has the general
form:
[ expression for v in iterable ]
where expression is a function of the variable v. Dictionary and set
comprehensions are similarly structured, while the parenthesized form is,
strictly speaking, a generator expression (Python has no true tuple
comprehension, though tuple() can consume the generator to build one):
( expression for v in iterable )                      # generator expression
{ key-expression:value-expression for v in iterable } # dictionary comprehension
{ expression for v in iterable }                      # set comprehension
Note that the difference between set and dictionary comprehensions is the use of
colon-separated key-value pairs in the dictionary construction.
It is also possible to conditionally include elements of the container, or to
nest multiple loops. For example, the following defines an upper-triangular
multiplication table:
>>> table = { (i,j):i*j for i in range(10) for j in range(10) if i < j }
>>> table[2,3]
6
>>> table[3,2]
KeyError: (3, 2)
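A quick, self-contained illustration of each comprehension form (note that the parenthesized form yields a generator object, which tuple() can consume):

```python
squares_list = [v*v for v in range(5)]        # list: [0, 1, 4, 9, 16]
squares_set  = {v*v for v in range(-2, 3)}    # set: duplicates collapse to {0, 1, 4}
squares_dict = {v: v*v for v in range(3)}     # dict: {0: 0, 1: 1, 2: 4}
squares_gen  = (v*v for v in range(5))        # a generator, not a tuple
print(tuple(squares_gen))                     # (0, 1, 4, 9, 16)
```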
A for loop such as
for item in container:
    print(item)
is actually a while construct with the following form:
try:
    iterator = iter(container)
    while True:
        item = next(iterator)
        print(item)
except StopIteration:
    pass
Knowing this helps us better understand the subtle techniques that are used to
control for loops. We’ll see more on this topic when we consider iteration, in
Chapter 5.
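The desugaring above can be observed directly by driving an iterator by hand with iter and next:

```python
it = iter([10, 20])       # ask the container for its iterator
print(next(it))           # 10
print(next(it))           # 20
try:
    next(it)              # exhausted: raises StopIteration
except StopIteration:
    print("done")
```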
0.8 Definitions
What makes Python more than a simple calculator is the ability to define new
methods and classes. Methods describe how computations may be performed,
while classes describe how new objects are constructed. These are important
tasks with subtleties that give Python much of its expressiveness and beauty.
Like most beautiful things, this will take time to appreciate. We will touch on
the most important details of method and class definition, here, but much of
this book may be regarded as a discussion of how these important mechanisms
may be used to make programs more efficient. The reader is encouraged to
scan through this section quickly and then return at a later time with more
experience.
def hail(x):
    """From x, compute the next term of the 'hailstone sequence'"""
    if odd(x):
        return 3*x+1
    else:
        return x//2
def hailstorm(n):
    """Returns when the hailstone sequence, beginning at n, reaches 1"""
    while n != 1:
        n = hail(n)
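The helper odd used by hail does not appear in this excerpt of the text; a minimal definition consistent with its use would be:

```python
def odd(x):
    """Return True exactly when x is odd."""
    return x % 2 != 0

def hail(x):
    """From x, compute the next term of the 'hailstone sequence'."""
    return 3*x + 1 if odd(x) else x // 2

print(hail(7), hail(22))   # 22 11
```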
In the definition of odd, n is a formal parameter. When the odd function is called,
n stands in for the actual parameter passed to the method. The formal parame-
ter names provide a way for the programmer to describe a computation before
there are any actual parameter values in hand. This is a common practice in
most sciences: when computing the roots of a polynomial a · x² + b · x + c we
If an actual parameter is not provided, the default value is used as the actual.11 Here is a function
that counts the number of vowels that appear in a word:
def vowelCount(word,vowels='aeiouy'):
    """return the number of vowels appearing in word
    (by default, the string of vowels is 'aeiouy')"""
    occurrences = [c for c in word.lower() if c in vowels]
    return len(occurrences)
and here is how the vowelCount function might be called:
>>> vowelCount("ytterbium")   # English
4
>>> vowelCount("cwm","aeiouwy")   # Welsh
1
11 There are some subtleties here stemming from the fact that the default value—which may be
mutable—is constructed once, at the time the method is defined. Be aware.
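The footnote's warning can be demonstrated with a small (hypothetical) helper whose default value is a list; the list is created once, when the def statement executes, and is shared across calls:

```python
def extend(item, bucket=[]):      # the [] default is built exactly once
    bucket.append(item)
    return bucket

print(extend(1))                  # [1]
print(extend(2))                  # [1, 2] -- the same list, reused!
print(extend(3, bucket=[]))       # [3]    -- an explicit fresh list avoids this
```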
12 Use of the word “lambda,” here, is a reference to the lambda calculus, a formal system for
understanding computation. In that formalism, anonymous functions are often reasoned about and
treated as objects. The analogy here is entirely appropriate.
13 These definitions, though not directly accessible, are accessible to functions you do import.
0.8.6 Scope
As we have seen above, Python accesses the objects (e.g. data, methods, classes)
in our programs through names. When we use a name as a value, Python must
work hard to find the correct definition of the name. In method definitions,
for example, the names of formal parameters can obscure definitions of those
names made outside the method. The visibility of a name throughout a script
determines its scope, and the rules that determine where Python finds definitions
are called scope rules.
Names come into existence through the binding of the name to an object
reference. Examples of bindings are
1. Assignment of the name to a value (i.e. using the assignment operator, =),
2. Names of formal parameters in a method definition,
3. Use of the name in a for loop,
4. Definition of the name as a method (def) or class (class), or
5. Values imported with the from ... import statements.
(There are other more obscure binding operations as well. For those details see
Section 4.1 of the Python Language Reference Manual.)
When we use or reference a name, Python searches for the binding that is in
the tightest containing block of code. A block is a collection of statements that
are executed as a unit. The blocks of Python include:
1. A function definition (beginning def name...),
2. A class definition (beginning class name...), or
3. The containing module (i.e. the file the contains the definition).
The scope of a name is that portion of code where the name is visible. So,
for example, the scope of any name bound within a function is the entire body
of the function. The scope of a name bound in a class is the set of executable
statements of the class (not including its functions). Names bound at the mod-
ule level are visible everywhere within the module, so they are called global
names.
It is sometimes useful to bind names locally in a way that might hide bind-
ings in a wider scope. For example, formal parameters in method definitions
are local bindings to actual values. References to these names will resolve to
the actual parameters, not to similarly named bindings found in a larger scope.
The environment of a statement is the set of all scopes that contain the state-
ment. The environment of a statement contains, essentially, all the code that
might contain bindings that could be referenced by the statement. This environ-
ment is, effectively, an encapsulation of all information that helps us understand
the meaning or semantics of a statement’s execution. The execution stack keeps
track of the nesting of the execution of methods and indirectly annotates the
environment with bindings that help us understand the information necessary
to execute the statement.
When Python makes a reference to a name bound in the tightest containing
scope, we say the binding and any reference are local. Again, because scoping is
implied by the location of bindings, it is occasionally difficult to re-bind objects
to names that were originally bound outside the current local scope. By default,
bindings are made within the local scope. So, for example, when we make the
following definition:
def outer():
    v = 1
    def inner():
        v = 2
    inner()
    return v
a call to outer() returns 1.
To force a binding to the name just visible outside the local scope, we give
Python a hint by indicating that a name is nonlocal. In this definition of outer,
the value returned is 2, not 1:
def outer():
    v = 1
    def inner():
        nonlocal v
        v = 2
    inner()
    return v
When there are global values (including built-ins and imported values) that
you wish to rebind, we declare those bindings with global:
v = 0
def outer():
    v = 1
    def inner():
        global v
        v = 2
    inner()
    return v
Here, a call to outer() returns 1, but the global value, v, becomes 2.
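The three behaviors can be compared side by side in one script:

```python
v = 0

def outer_local():
    v = 1
    def inner():
        v = 2              # a brand-new local v; nothing outside changes
    inner()
    return v               # 1

def outer_nonlocal():
    v = 1
    def inner():
        nonlocal v
        v = 2              # rebinds outer_nonlocal's v
    inner()
    return v               # 2

def outer_global():
    v = 1
    def inner():
        global v
        v = 2              # rebinds the module-level v
    inner()
    return v               # 1 (the local v is untouched)

print(outer_local(), outer_nonlocal(), outer_global(), v)   # 1 2 1 2
```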
In practice, Python’s scope rules lead to the most obvious behavior. Because
of this, the use of scope hinting is rarely required. The need to use scope hints,
in most cases, suggests an obscure and, perhaps, ugly approach.
0.9 Discussion
Python is a fun, modern, object-oriented scripting language. We can write big
programs in Python, but if we need to get good performance from our programs
we need to organize our data effectively. Before we investigate how to
build efficient data structures, it is vitally important that we understand the
fundamentals of the language itself.
Everything in Python is an object, even the most primitive types and val-
ues. Objects are accessed through names that stand for references to objects.
All data—including classes, objects, and methods—are accessed through these
names. All features of an object are associated with attributes, some of which
are determined when the object is constructed. All objects are produced by
a class, which determines their type. Since the objects that are referenced by
names can be dynamically changed, so can their types.
Python provides a few constructs that allow us to control the flow of exe-
cution. Some, like the if statement, parallel those from other languages. The
looping statements, however, differ in significant ways—ways that, as we shall
see, support very powerful programming abstractions.
Python is a subtle language that takes time to understand. The reader is en-
couraged to return to this chapter or, more importantly, the resources available
at python.org, as a means of reconsidering what has been learned in light of
the new ideas we encounter in this brief text.
Problems
Selected solutions to problems begin on page ??.
0.1 Write a script that prints Hello, world.
0.2 Write scripts for each of the following tasks:
1. Print the integers 1 through 5, one per line.
2. Print the integers 5 downto 1, all on one line. You may need to learn more
about the statement print by typing help(print) from within Python.
3. Print the multiples of 3 between 0 and 15, inclusive. Make it easy for
someone maintaining your script to change 3 or 15 to another integer.
4. Print the values -n to n, each with its remainder when divided by m. Test
your program with n=4 and m=3 and observe that n modulo m is not nec-
essarily -n modulo m.
0.3 What does the statement a=b=42 do?
0.4 Assume that neg=-1, pos=1, and zero=0. What is the result of each of
the following expressions?
1. neg < zero < pos
2. neg < zero > neg
3. (zero == zero) + zero
4. (zero > neg and zero < pos)
5. zero > (neg and zero) < 1/0
0.5 What is an immutable object? Give two examples of immutable object
types.
0.6 What is the difference between a name, a reference, and an object?
0.7 Indicate the types of the following expressions, or indicate it is unknown:
1. i
2. 1729
3. ’c’
4. "abc"
5. "abc"[0]
6. (’baseball’,)
7. ’baseball’,’football’,’soccer’
8. ’US’ : ’soccer’, ’UK’ : ’football’
9. i*i for i in range(10)
Chapter 1
The Object-Oriented Method
The focus of language designers is to develop programming languages that are
simple to use but provide the power to accurately and efficiently describe the
details of large programs and applications. The development of an object-oriented
model in Python is one such effort.
Throughout this text we focus on developing data structures using object-oriented
programming (OOP). Using this paradigm the programmer spends time developing
templates for structures called classes. The templates are then used to
construct instances or objects. A majority of the statements in object-oriented
programs involve applying functions to objects to have them report or change
their state. Running a program involves, then, the construction and coordination
of objects. In this way, languages like Python are object-oriented.
In all but the smallest programming projects, abstraction is a useful tool for
writing working programs. In programming languages including Java, Scheme,
and C, and in the examples we have seen so far in Python, the details of a
program's implementation are hidden away in its procedures and functions. This
approach involves procedural abstraction. In object-oriented programming the
details of the implementation of data structures are hidden away within objects.
This approach involves data abstraction. Many modern programming
languages use object orientation to support basic abstractions of data. We re-
view the details of data abstraction and the concepts involved in the design of
interfaces for objects in this chapter.
1 The author once saw the final preparations of a House pizza at Supreme Pizza which involved the
sprinkling of a fine powder over the fresh-from-the-oven pie. Knowing, exactly, what that powder
was would probably detract from the taste of the pie.
our intent to limit access to them, in which case we think of them as being private.
A typical class declaration is demonstrated by the following simple class
that keeps track of the location of a rectangle on a screen:
class Rect:
    """A class describing the geometry of a rectangle."""
    __slots__ = [ "_left", "_top", "_width", "_height" ]

    def __init__(self, midx = 50, midy = 50, width = 100, height = 100):
        """Initialize a rectangle.
        If not specified otherwise, the rectangle is located
        with its upper-left at (0,0), with width and height 100."""
        self._width = int(width)
        self._height = int(height)
        self._left = int(midx) - (self._width//2)
        self._top = int(midy) - (self._height//2)

    def left(self):
        """Return the left coordinate of the rectangle."""
        return self._left

    def right(self):
        """Return the right coordinate of the rectangle."""
        return self.left()+self.width()

    def bottom(self):
        """Return the bottom coordinate of the rectangle."""
        return self.top()+self.height()

    def top(self):
        """Return the top coordinate of the rectangle."""
        return self._top

    def width(self):
        """Return the width of the rectangle."""
        return self._width

    def height(self):
        """Return the height of the rectangle."""
        return self._height

    def area(self):
        """Return the area of the rectangle."""
        return self.width()*self.height()

    def center(self):
        """Return the (x,y) coordinates of the center of the Rect."""
        return (self.left()+self.width()//2,self.top()+self.height()//2)

    def __repr__(self):
        """Return a string representation of the rectangle."""
        (x,y) = self.center()
        return "Rect({0},{1},{2},{3})".format(
            x, y, self.width(), self.height())
Near the top of the class definition, we declare the slots, or a list of names of
all the attributes we expect this class to have. Our Rect object maintains four
pieces of information: its left and top coordinates, and its width and height, so
four slots are listed. The declaration of __slots__ is not required, but it serves
as a commitment to using only these attributes to store the characteristics of
the rectangle object. If we attempt to assign a value to an attribute that does
not appear in the attribute list, Python will complain. This is one of the first of
many principles that are meant to instill good Python programming practice.

Principle 1 Commit to a fixed list of attributes by explicitly declaring them.
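The complaint is an AttributeError, which we can provoke with a small class of our own (a hypothetical Point, not part of the text):

```python
class Point:
    __slots__ = ["_x", "_y"]      # commit to exactly these attributes

    def __init__(self, x, y):
        self._x = x
        self._y = y

p = Point(1, 2)
try:
    p._z = 3                      # not declared in __slots__
except AttributeError as e:
    print("rejected:", e)
```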
The method __init__ is responsible for filling in the attributes of a newly allocated
Rect object. The formal comments (found in triple quotes) at the top
of each method are pre- and postconditions. (We will discuss these in detail in
Chapter 2.) The __init__ method2 is the equivalent of a constructor in other
languages. The method initializes all the attributes of an associated object, plac-
ing the object into a predictable and consistent initial state. To construct a new
Rect users make a call to Rect(). Parameters to this method are passed directly
to the __init__ method. One extra parameter, self, is a reference to the
object that is to be initialized. Python makes it fairly easy to pick-and-choose
which parameters are passed to the constructor and which are left to their de-
fault values. Without exception, attributes are accessed in the context of self.
It is important to notice, at this point, that there were choices on how we
would represent the rectangle. Here, while we allow the user to construct
the rectangle based on its center coordinate, the rectangle’s location is actu-
ally stored using the left-top coordinates. The choice is, of course, arbitrary, and
from the external perspective of a user, it is unimportant.
The methods left, right, top, bottom, width and height, and area are
accessor or property methods that allow users to access these logical features
of the rectangle. Some (left and top) are directly represented as attributes in
the structure, while others (right, bottom, and area) are computed on-the-fly.
Again, because the user is expected to access the logical
attributes of the rectangle through these methods, there is no need to access the actual
attributes directly. Indeed, we indicate our expectation of privacy for the actual
attributes by prefixing their names with an underscore (_). If you find yourself
Exercise 1.1 Nearly everything can be improved. Are there improvements that
might be made to the Rect class?
We should feel pleased with the progress we have made. We have developed
the signatures for the rectangle interface, even though we have no immediate
application. We also have some emerging ideas about implementing the Rect
internally. Though we may decide that we wish to change that internal repre-
sentation, by using an interface to protect the implementation of the internal
state, we have insulated ourselves from changes suggested by inefficiencies we
may yet discover.
3 Python, unlike Java and C++, does not actually enforce a notion of privacy. As the implementers
are fond of pointing out, that privacy is a “gentleman’s agreement”.
4 It is interesting to note that many compact fluorescent bulbs actually have (very small) processor
chips that carefully control the power used by the light. Newer bulbs use digital circuitry that
effectively simulates more expensive analog circuitry (resistors, capacitors, and the like) of the past.
We realize the implementation doesn’t matter, because of the power of abstraction.
1.4 A Special-Purpose Class: A Bank Account
def balance(self):
    """Get balance from account."""

def deposit(self, amount):
    """Make deposit into account."""
The substance of these methods has purposefully been removed because, again,
it is unimportant for us to know exactly how a BankAccount is implemented.
We have ways to construct new accounts, as well as ways to access the account
name and value, and two methods to update the balance. The special method
__eq__ is provided to allow us to see if two accounts are the same (i.e. they
have the same account number).
Let’s look at the implementation of these methods, individually. To build
a new bank account, you must call the BankAccount constructor with one or
two parameters (the initial balance is optional, and defaults to 0). The ac-
count number provided never changes over the life of the BankAccount—if it
were necessary to change the value of the account number, a new BankAccount
would have to be made, and the balance would have to be transferred from one
to the other. The constructor plays the important role of performing the one-time
initialization of the account name field. Here is the code for a BankAccount
constructor:
def __init__(self, acc, bal=0.0):
    """Create a bank account."""
    self._account = acc
    self._balance = bal
always returns a value, and rarely has a side-effect. When attributes, properties,
or features of an object are acted on—typically indicated by the use of a verb
(e.g. deposit)—we declare the method normally, without the decoration. Ac-
tions optionally return values, and frequently have side-effects. Thus properties
are accessed as attributes, and actions are performed by methods.
In the end, properties do little more than pass along the information found
in the _account and _balance attributes, respectively. We call such methods
accessors. In a different implementation of the BankAccount, the balance would
not have to be explicitly stored—the value might be, for example, the difference
between two attributes, _deposits and _drafts. Given the interface, it is not
much of a concern to the user which implementation is used.
We provide two more methods, deposit and withdraw, that explicitly mod-
ify the current balance. These are actions that change the state of the object;
they are called mutator methods:
def deposit(self, amount):
    """Make deposit into account."""
    self._balance += amount
Again, because these actions are represented by verbs, they’re not decorated
with the @property annotation. Because we would like to change the balance
of the account, it is important to have a method that allows us to modify it. On
the other hand, we purposefully don’t have a setAccount method because we
do not want the account number to be changed without a considerable amount
of work (work that, by the way, models reality).
Here is a simple application that determines whether it is better to deposit
$100 in an account that bears 5 percent interest for 10 years, or to deposit $100
in an account that bears 2½ percent interest for 20 years. It makes use of the
BankAccount object just outlined:
jd = BankAccount("Jain Dough")
jd.deposit(100)
js = BankAccount("Jon Smythe",100.00)
for year in range(10):
    jd.deposit(jd.balance * 0.05)
for year in range(20):
    js.deposit(js.balance * 0.025)
print("Jain invests $100 over 10 years at 5%.")
print("After 10 years {0} has ${1}".format(jd.account,jd.balance))
print("Jon invests $100 over 20 years at 2.5%.")
print("After 20 years {0} has ${1}".format(js.account,js.balance))
While this application may seem rather trivial, it is easy to imagine a large-scale
application with similar needs.6
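The full BankAccount implementation is elided in this excerpt; assuming a condensed version in which balance and account are read-only properties (consistent with the jd.balance usage above), the comparison can be checked end to end:

```python
class BankAccount:
    __slots__ = ["_account", "_balance"]

    def __init__(self, acc, bal=0.0):
        self._account = acc
        self._balance = bal

    @property
    def account(self):
        return self._account

    @property
    def balance(self):
        return self._balance

    def deposit(self, amount):
        self._balance += amount

jd = BankAccount("Jain Dough")
jd.deposit(100)
js = BankAccount("Jon Smythe", 100.00)
for year in range(10):
    jd.deposit(jd.balance * 0.05)    # 5% compounded for 10 years
for year in range(20):
    js.deposit(js.balance * 0.025)   # 2.5% compounded for 20 years
print("Jain: ${:.2f}  Jon: ${:.2f}".format(jd.balance, js.balance))
```

Jon's slower rate, compounded for twice as long, comes out slightly ahead.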
We now consider the design of the KV class. Notice that while the type of
data maintained is different, the purpose of the KV is very similar to that of the
BankAccount class we just discussed. A KV is a key-value pair such that the key
cannot be modified, but the value can be manipulated. Here is the interface for
the KV class:
5 Python provides many more efficient techniques for implementing this application. The approach,
here, is for demonstration purposes. Unprofessional driver, open course, please attempt.
6 Pig Latin has played an important role in undermining court-ordered restrictions placed on music
piracy. When Napster—the rebel music trading firm—put in checks to recognize copyrighted music
by title, traders used Pig Latin translators to foil the recognition software!
1.5 A General-Purpose Class: A Key-Value Pair
class KV:
    __slots__ = ["_key", "_value"]

    def __init__(self,key,value=None):

    @property
    def key(self):

    @property
    def value(self):

    @value.setter
    def value(self,value):

    def __str__(self):

    def __repr__(self):

    def copy(self):

    def __eq__(self,other):

    def __hash__(self):
Like the BankAccount class, the KV class has a straightforward initializer that
sets the attributes pre-declared in the slot list:
__slots__ = ["_key", "_value"]

def __init__(self,key,value=None):
    assert key is not None, "key-value pair key is not None"
    self._key = key
    self._value = value
Usually the initializer allows the user to construct a new KV by initializing both
fields. On occasion, however, we may wish to have a KV whose key field is
set, but whose value field is left referencing nothing. (An example might be a
medical record: initially the medical history is incomplete, perhaps waiting to
be forwarded from a previous physician.) For this purpose, we provide a default
value for the value argument of None. None is the way we indicate a reference
that references nothing.
Now, given a particular KV, it is useful to be able to retrieve the key or value.
The attribute implementing the key (_key) is hidden. Our intent is that users of
the KV class must depend on methods to read or write the attribute values. In
Python, read-only attributes are typically implemented as a property. This ap-
proach makes the key portion of the KV appear to be a standard attribute, when
in fact it is a hidden attribute accessed by a similarly named method. We make
@property
def value(self):
    return self._value
The value property, however, can be changed as well. Mutable properties are
set using a setter method, which is indicated by a setter decoration. The setter
method simply takes its parameter and assigns it to the underlying _value at-
tribute:
@value.setter
def value(self,value):
    self._value = value
Setter methods are used in circumstances where the associated property is the
target of an assignment.
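The property/setter protocol can be seen in a self-contained sketch (a hypothetical Cell class, simpler than KV):

```python
class Cell:
    __slots__ = ["_value"]

    def __init__(self, value=None):
        self._value = value

    @property
    def value(self):                 # read access: c.value
        return self._value

    @value.setter
    def value(self, value):          # write access: c.value = ...
        self._value = value

c = Cell(10)
print(c.value)      # 10
c.value = 42        # triggers the setter method
print(c.value)      # 42
```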
There are other methods that are made available to users of the KV class,
but we will not discuss the details of that code until later. Some of the methods
are required and some are just nice to have around. While the code may look
complicated, we take the time to implement it correctly, so that we will not have
2. Identify, given your operations, those attributes that support the state of
your object. Information about an object’s state is carried by attributes
within the object between operations that modify the state. Since there
may be many ways to encode the state of your object, your description of
the state may be very general.
3. Identify any rules of consistency. In the Rect class, for example, it would
not be good to have a negative width or height. We will frequently focus
on these basic statements about consistency to form pre- and postcondi-
tion checks. A word list, for example, should probably not contain numeric
digits.
4. Determine the general form of the initializers. Initializers are synthetic:
their sole responsibility is to get a new object into a good initial and consis-
tent state. Don’t forget to consider the best state for an object constructed
using the parameterless constructor.
5. Identify the types and kinds of information that, though not directly vis-
ible outside the class, efficiently provide the information needed by the
methods that transform the object. Important choices about the internals
of a data structure are usually made at this time. Sometimes, competing
approaches are developed until a comparative evaluation can be made.
That is the subject of much of this book.
The operations necessary to support a list of words can be sketched out with-
out much controversy, even if we don’t know the intimate details of constructing
the word game itself. Once we see how the data structure is used, we have a
handle on the design of the interface. We can identify the following general use
of the WordList object:
possibleWords = WordList()                 # read a word list
possibleWords.length(min=2,max=4)          # keep only 5 letter words
playing = True
while playing:
    word = possibleWords.pick()            # pick a random word
    print("I'm thinking of a {} letter word.".format(len(word)))
    userWord = readGuess(word,possibleWords)
    while userWord != word:
        printDifferences(word,userWord)
        userWord = readGuess(word,possibleWords)
    print('You win!')
    possibleWords.remove(word)             # remove word from play
    playing = playAgainPrompt()
Let’s consider these lines. One of the first lines (labeled declaration) declares
a reference to a WordList. For a reference to refer to an object, the object must
be constructed. We require, therefore, an initializer for a WordList that reads a
list of words from a source, perhaps a dictionary. To make the game moderately
challenging, we thin the word list to those words that are exactly 5 letters long.
def pick(self):
    """Select a word randomly from word list."""

def contains(self,word):
    """Returns true exactly when word list contains word."""

def remove(self, word):
    """Remove a word from the word list, if present."""

def alphabet(self,alpha):
    """Thin word list to words drawn from alphabet alpha."""
1.7 A Full Example Class: Ratios
We will leave the implementation details of this example until later. You might
consider various ways that the WordList might be implemented. As long as
the methods of the interface can be supported by your data structure, your
implementation is valid.
Exercise 1.3 Finish the sketch of the WordList class to include details about its
attributes.
def fact(n):
    result = 1
    for i in range(1,n+1):
        result *= i
    return result

def computeE():
    e = Ratio(0)
    for i in range(15):
        e += Ratio(1,fact(i))
    print("Euler's constant = {}".format(e.numerator/e.denominator))
    print("Approximate ratio = {}".format(e))

computeE()
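We can check the arithmetic ahead of time with the standard library's fractions.Fraction standing in for the forthcoming Ratio class:

```python
from fractions import Fraction

def fact(n):
    result = 1
    for i in range(1, n+1):
        result *= i
    return result

e = Fraction(0)
for i in range(15):
    e += Fraction(1, fact(i))   # sum of 1/i! for i = 0..14

print(e)          # the exact ratio of the 15-term series
print(float(e))   # close to e = 2.71828...
```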
Given our thinking, the following attributes and constructor lay the founda-
tion for our Ratio class:
@total_ordering
class Ratio(numbers.Rational):
    __slots__ = [ "_top", "_bottom" ]

    def __init__(self,numerator=1,denominator=1):
        """Construct a ratio of numerator/denominator."""
        self._top = int(numerator)
        self._bottom = int(denominator)
        assert denominator != 0
        self._lowestTerms()
This method makes use of gcd, a general purpose function that is declared out-
side the class. First, it is not a computation that depends on a Ratio (so it should
not be a method of the Ratio class), and second, it may be useful to others. By
making gcd a top-level method, everyone can make use of it. The computation
is a slightly optimized form of the granddaddy of all computer algorithms, first
described by Euclid:
def gcd(a,b):
    """A private utility function that determines the greatest common
    divisor of a and b."""
    while b != 0:
        (a,b) = (b,a%b)
    return a if a >= 0 else -a
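The _lowestTerms helper called by the constructor is not shown in this excerpt; a plausible standalone sketch of its job, built on gcd (the choice to normalize the sign into the numerator is an assumption):

```python
def gcd(a, b):
    """Euclid's algorithm, returning a non-negative result."""
    while b != 0:
        (a, b) = (b, a % b)
    return a if a >= 0 else -a

def lowest_terms(top, bottom):
    """Reduce top/bottom to lowest terms, keeping the sign in the numerator."""
    f = gcd(top, bottom)
    if bottom < 0:           # assumption: keep the denominator positive
        f = -f
    return (top // f, bottom // f)

print(lowest_terms(4, 6))    # (2, 3)
print(lowest_terms(2, -4))   # (-1, 2)
```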
The Ratio object is based on Python’s notion of a rational value (thus the
mention of Rational in the class definition). This is our first encounter with
inheritance. For the moment, we’ll just say that thinking of this class as a rational
value obliges us to provide certain basic operations, which we’ll get to shortly.
The primary purpose of the class, of course, is to hold a numeric value. The
various components of a Ratio that are related to its value we make accessible
through basic properties, numerator and denominator:
@property
def numerator(self):
    return self._top

@property
def denominator(self):
    return self._bottom
Because these are declared as properties, we may read these values as though
numerator and denominator were fields of the class. We are not able to modify
the values. We also provide two methods—int and float—that provide integer
and floating point equivalents of the Ratio. We may find these useful when
we’re mixing calculations with Python’s other built-in types.
def int(self):
    """Compute the integer portion of the Ratio."""
    return int(float(self))

def float(self):
    """Compute the floating point equivalent of this Ratio."""
    return self._top/self._bottom
Now, because Python has a rich collection of numeric classes, and because
many of the operations that act on those values are associated with mathemat-
ical operations, it is useful to have the Ratio class provide the special methods
that support these basic operations. First, some basic operations support several
forms of comparison, while others extend the meaning of existing mathematical
functions, like abs that computes the absolute value of a numeric. These meth-
ods are as follows:
def __lt__(self,other):
    """Compute self < other."""
    return self._top*other._bottom < self._bottom*other._top

def __le__(self,other):
    """Compute self <= other."""
    return self._top*other._bottom <= self._bottom*other._top

def __eq__(self,other):
    """Compute self == other."""
    return self._top*other._bottom == self._bottom*other._top

def __abs__(self):
    """Compute the absolute value of Ratio."""
    return Ratio(abs(self._top),self._bottom)

def __neg__(self):
    """Compute the negation of Ratio."""
    return Ratio(-self._top,self._bottom)
Notice that many of these methods have a notion of self and other. The
self object appears on the left side of the binary operator, while other appears
on the right.
Basic mathematical operations on Ratio are similarly supported. Most of
these operations return a new Ratio value holding the result. Because of this,
the Ratio class consists of instances that are read-only.
def __add__(self,other):
    """Compute new Ratio, self+other."""
    return Ratio(self._top*other._bottom+
                 self._bottom*other._top,
                 self._bottom*other._bottom)

def __sub__(self,other):
    """Compute self - other."""
    return Ratio(self._top*other._bottom-
                 self._bottom*other._top,
                 self._bottom*other._bottom)

def __mul__(self,other):
    """Compute self * other."""
    return Ratio(self._top*other._top,
                 self._bottom*other._bottom)

def __floordiv__(self,other):
    """Compute the truncated form of division."""
    return math.floor(self/other)

def __mod__(self,other):
    """Compute self % other."""
    return self-other*(self//other)

def __truediv__(self,other):
    """Compute self / other."""
    return Ratio(self._top*other._bottom,self._bottom*other._top)
def __str__(self):
if self._bottom == 1:
return "{0}".format(self._top)
else:
return "{0}/{1}".format(self._top,self._bottom)
The integers in the format string give the relative positions of the parameters consumed, starting with the first, parameter 0. Notice that we are using the format method for strings but we're not printing the value out; the result can be used for a variety of purposes. That decision, whether to print it out or not, belongs to the caller of these two functions.
You can see the difference by comparing the results of the following interac-
tive Python expressions:
>>> a = Ratio(4,6)
>>> print(str(a))
2/3
>>> print(repr(a))
Ratio(2,3)
This is a good time to note that the format method can, with these routines, print both forms of the ratio:
>>> "{0!r} = {0}".format(a)
Ratio(2,3) = 2/3
The string returned by the __str__ method is meant to be understood by hu-
mans. The __repr__ method, on the other hand, can be interpreted in a way
that constructs the current state of the Ratio. This will be pivotal when we
discuss how to checkpoint and reconstruct states of objects.
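To see these special methods working together, here is a pared-down, runnable sketch of the Ratio class. One assumption on our part: the initializer reduces the fraction to lowest terms, as the interactive session above (where Ratio(4,6) prints as 2/3) suggests.

```python
from math import gcd

class Ratio:
    """A pared-down sketch of the Ratio class discussed in this section."""
    def __init__(self, top, bottom=1):
        g = gcd(top, bottom)          # reduce to lowest terms
        if bottom < 0:                # keep any sign in the numerator
            g = -g
        self._top = top // g
        self._bottom = bottom // g

    def __lt__(self, other):
        return self._top*other._bottom < self._bottom*other._top

    def __add__(self, other):
        return Ratio(self._top*other._bottom + self._bottom*other._top,
                     self._bottom*other._bottom)

    def __str__(self):
        if self._bottom == 1:
            return "{0}".format(self._top)
        return "{0}/{1}".format(self._top, self._bottom)

    def __repr__(self):
        return "Ratio({0},{1})".format(self._top, self._bottom)

a = Ratio(4, 6)
print("{0!r} = {0}".format(a))   # Ratio(2,3) = 2/3
```

Because addition produces a new Ratio, sums reduce automatically: str(Ratio(1,2) + Ratio(1,2)) yields "1".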
1.8 Interfaces
Sometimes it is useful to describe the interface for a number of different classes,
without committing to an implementation. For example, in later sections of this
text we will implement a number of data structures that are mutable—that is,
they are able to be modified by adding or removing values. We can, for all of
these classes, specify a few of their fundamental methods by using the Python
abstract base class mechanism, which is made available as part of Python 3’s abc
package. Here, for example, is an abstract base class, Linear, that requires a
class to implement both an add and remove method:
import abc
from collections.abc import Iterable, Iterator, Sized, Hashable
from structure.view import View
from structure.decorator import checkdoc, mutatormethod, hashmethod
# (Freezable is another abstract class defined in the structure package)

class Linear(Iterable, Sized, Freezable):
    """
    An abstract base class for classes that primarily support organization
    through add and remove operations.
    """

    @abc.abstractmethod
    def add(self, value):
        """Insert value into data structure."""
        ...

    @abc.abstractmethod
    def remove(self):
        """Remove next value from data structure."""
        ...
Notice that the method bodies have been replaced by ellipses. Essentially, this
means the implementation has yet to be specified. Specifying just the signatures
of methods in a class is similar to writing boilerplate for a contract. When we
are interested in writing a new class that meets the Linear interface, we can
choose to have it implement the Linear interface. For example, our WordList
structure of Section 1.6 might have made use of our Linear interface by begin-
ning its declaration as follows:
class WordList(Linear):
When an instance of the WordList class is constructed, Python checks that each of the methods mentioned in the Linear interface (add and remove) is actually implemented. If any of the methods that have been marked with the @abstractmethod decorator does not ultimately get implemented in WordList, attempting to create a WordList raises a TypeError. In this case, only remove is part of the WordList specification, so we must either (1) not have WordList implement the Linear interface or (2) augment the WordList class with an add method.
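The enforcement can be demonstrated with a small, self-contained sketch (the class names Partial and Complete are ours, chosen for illustration): a subclass that leaves an abstract method unimplemented remains abstract and cannot be instantiated, while one that supplies both methods is concrete.

```python
import abc

class Linear(abc.ABC):
    """A pared-down version of the Linear interface."""
    @abc.abstractmethod
    def add(self, value):
        """Insert value into data structure."""
        ...

    @abc.abstractmethod
    def remove(self):
        """Remove next value from data structure."""
        ...

class Partial(Linear):          # implements only remove; still abstract
    def remove(self):
        return None

class Complete(Linear):         # implements both; now concrete
    def __init__(self):
        self._data = []
    def add(self, value):
        self._data.append(value)
    def remove(self):
        return self._data.pop()

try:
    Partial()                   # add is still abstract
except TypeError as e:
    print("cannot instantiate:", e)

c = Complete()
c.add(42)
print(c.remove())               # 42
```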
Currently, our WordList is close to, but not quite, a Linear. Applications
that demand the functionality of a Linear will not be satisfied with a WordList.
Having the class implement an interface increases the flexibility of its use. Still,
it may require considerable work for us to upgrade the WordList class to the
level of a Linear. It may even work against the design of the WordList to
provide the missing methods. The choices we make are part of an ongoing
design process that attempts to provide the best implementations of structures
did later, when you optimize your implementation.

Principle 3 Design and abide by interfaces as though you were the user.

If the data are protected in this manner, you cannot easily access them from outside the class, and you are forced to abide by the restricted access of the interface.
1.10 Discussion
The construction of substantial applications involves the development of com-
plex and interacting structures. In object-oriented languages, we think of these
structures as objects that communicate through the passing of messages or,
more formally, the invocation of methods.
We use object orientation in Python to write the structures found in this
book. It is possible, of course, to design data structures without object orienta-
tion, but any effective data structuring model ultimately depends on the use of
some form of abstraction that allows the programmer to avoid considering the
complexities of particular implementations.
Problems
Selected solutions to problems begin on page ??.
1.1 Which of the following are primitive Java types: int, Integer, double,
Double, String, char, Association, BankAccount, boolean, Boolean?
1.2 Which of the following variables are associated with valid constructor
calls?
BankAccount a,b,c,d,e,f;
Association g,h;
a = new BankAccount("Bob",300.0);
b = new BankAccount(300.0,"Bob");
c = new BankAccount(033414,300.0);
d = new BankAccount("Bob",300);
e = new BankAccount("Bob",new Double(300));
f = new BankAccount("Bob",(double)300);
g = new Association("Alice",300.0);
h = new Association("Alice",new Double(300));
Chapter 2
Comments, Conditions,
and Assertions
CONSIDER THIS: WE CALL OUR PROGRAMS “CODE”! Computer languages, including Python, are designed to help express algorithms in a manner that a machine can understand. Making a program run more efficiently often makes it less understandable. If language design were driven by the need to make the program readable by programmers, it would be hard to argue against programming in English. (Okay, perhaps French!)

A comment is a carefully crafted piece of text that describes the state of the machine, the use of a variable, or the purpose of a control construct. Many of us, though, write comments for the same reason that we exercise: we feel guilty. You feel that, if you do not write comments in your code, you “just know” something bad is going to happen. Well, you are right. A comment you write today will help you out of a hole you dig tomorrow. (Ruth Krauss: “A hole is to dig.”)

Commenting can be especially problematic in Python, a scripting language
that is often used to hastily construct solutions to problems. Unfortunately,
once a program is in place, even if it was hastily constructed, it gains a little
momentum: it is much more likely to stick around. And when uncommented
programs survive, they’re likely to be confusing to someone in the future. Or,
comments are hastily written after the fact, to help understand the code. In
either case, the time spent thinking seriously about the code has long since
passed and the comment might not be right. If you write comments beforehand,
while you are designing your code, it is more likely your comments will describe
what you want to do as you carefully think it out. Then, when something goes
wrong, the comment is there to help you find the errors in the code. In fairness,
the code and the comment have a symbiotic relationship. Writing one or the
other does not really feel complete, but writing both supports you with the
redundancy of concept.
The one disadvantage of comments is that, unlike code, they cannot be
checked. Occasionally, programmers come across comments such as “If you
think you understand this, you don’t!” or “Are you reading this?” One could, of
course, annotate programs with mathematical formulas. As the program is com-
piled, the mathematical comments are distilled into very concise descriptions of
what should be going on. When the output from the program’s code does not
match the result of the formula, something is clearly wrong with your logic. But Semiformal
which logic? The writing of mathematical comments is a level of detail most convention: a
programmers would prefer to avoid. meeting of tie
haters.
2.2 Assertions
In days gone by, homeowners would sew firecrackers in their curtains. If the house were to catch fire, the curtains would burn, setting off the firecrackers. It was an elementary but effective fire alarm. (And the batteries never needed replacing.)

An assertion is an assumption you make about the state of your program. In Python, we will encode our assumptions about the state of the program using an assertion statement. The statement does nothing if the assertion is true, but it halts your program with an error message if it is false. It is a firecracker to sew in your program. If you sew enough assertions into your code, you will get
Here's an example of a check to make sure that the precondition for the root function was met: should we call root with a negative value, the assertion fails, the message describing the failure is printed out, and the program comes to a halt. Here's what appears at runtime: the first two lines of the traceback indicate that we were trying to print the square root of -4, on line 22 of root.py. The root function's assertion, on line 9 of root.py, failed because the value was negative. This resulted in a failed assertion (thus the AssertionError). The problem is (probably) on line 22 of root.py, so debugging our code should probably start at that location.
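The original listing is not reproduced here; as a reconstruction of what such a precondition check might look like (our sketch, not the book's exact code), consider:

```python
import math

def root(x):
    """Compute the square root of x.

    Precondition: x is non-negative."""
    # the assertion documents (and enforces) the precondition
    assert x >= 0, "root: cannot take square root of negative value {}".format(x)
    return math.sqrt(x)

print(root(4.0))   # 2.0
```

Calling root(-4) raises an AssertionError carrying the formatted message, and the resulting traceback points first at the offending call site.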
When you form the messages to be presented in the AssertionError, re-
member that something went wrong. The message should describe, as best
as possible, the context of the failed assumption. Thus, while only a string is
required for an assertion message, it is useful to generate a formatted string de-
scribing the problem. The formatting is not called if the assertion test succeeds.
A feature of Python's native assertion testing is that the tests can be automatically removed at compile time when one feels secure about the way the code works. Once the code is believed correct, we might run the above code with the command:
python3 -O root.py
The -O switch indicates that you wish to optimize the code. One aspect of the
optimization is to remove the assertion-testing code. This may significantly im-
prove performance of functions that are frequently called, or whose assertions
involve complex tests.
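A small script makes the effect of the -O switch visible (the file name check.py is our own, for illustration):

```python
# Run as:  python3 check.py     -> the assertion fires
#          python3 -O check.py  -> the assertion is stripped
x = -4
try:
    assert x >= 0, "precondition failed: {} is negative".format(x)
    print("assertion was stripped (optimized run)")
except AssertionError as e:
    print("assertion fired:", e)
```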
2.4 Craftsmanship
If you really desire to program well, a first step is to take pride in your work—
pride enough to sign your name on everything you do. Through the centuries,
fine furniture makers signed their work, painters finished their efforts by dab-
bing on their names, and authors inscribed their books. Programmers should
stand behind their creations.
Computer software has the luxury of immediate copyright protection—it is
a protection against piracy, and a modern statement that you stand behind the
belief that what you do is worth fighting for. If you have crafted something
as best you can, add a comment at the top of your code: If, of course, you
have stolen work from another, avoid the comment and carefully consider the
appropriate attribution.
2.5 Discussion
Effective programmers consider their work a craft. Their constructions are well
considered and documented. Comments are not necessary, but documentation
makes working with a program much easier. One of the most important com-
ments you can provide is your name—it suggests you are taking credit and re-
sponsibility for things you create. It makes our programming world less anony-
mous and more humane.
Special comments, including conditions and assertions, help the user and
implementor of a method determine whether the method is used correctly.
While it is difficult for compilers to determine the “spirit of the routine,” the
implementor is usually able to provide succinct checks of the sanity of the function. (“I've done my time!”) Five minutes of appropriate condition description and checking provided by the implementor can prevent hours of debugging by the user.
Problems
Selected solutions to problems begin on page ??.
2.1 What are the pre- and postconditions for the math.sqrt method?
2.2 What are the pre- and postconditions for the ord method?
2.3 What are the pre- and postconditions for the chr method?
2.4 What are the pre- and postconditions for str.join?
2.5 Improve the comments on an old program.
2.6 What are the pre- and postconditions for the math.floor method?
2.7 What are the pre- and postconditions for the math.asin method?
Chapter 3
Contract-Based Design
One way we might measure the richness of a programming language is to ex-
amine the diversity of the data abstractions that are available to a programmer.
Over time, of course, as people contribute to the effort to support an environ-
ment for solving a wide variety of problems, new abstractions are developed
and used. In many object-oriented languages, these data types are arranged
into type hierarchies that can be likened to a taxonomy of ways we arrange data
to effectively solve problems. The conceptually simplest type is stored at the top
or the base of the hierarchy. In Python, this type is called object. The remain-
ing classes are all seen as either direct extensions of object, or other classes
that, themselves, are ultimately extensions of object. When one class, say int,
extends another, say object, we refer to object as the superclass or base class,
and we call int a subclass or extension.
Languages, like Python, that support this organization of types benefit in two
important ways. First, type extension supports the notion of the specialization
of purpose. Typically, through the fleshing out of methods using method over-
riding, we can commit to aspects of the implementation, but only as necessary.
There are choices, of course, in the approach, so each extension of a type leads
to a new type motivated by a specific purpose. In this way, we observe spe-
cialization. Perhaps the most important aspect of this approach to writing data
structures is the adoption, or inheritance, of code we have already committed
to in the base class. Inheritance is motivated by the need to reuse code. Sec-
ondly, because more general superclasses—like automobile—can (often) act as
stand-ins for more specific classes—like beetle—the users of subclass beetle
can (often) manipulate beetle objects in the more general terms of its super-
class, automobile. That is because many methods (like inspect, wash, and
race) are features that are common to all specializations of automobile. This
powerful notion—that a single method can be designed for use with a variety of
types—is called polymorphism. The use of inheritance allows us to use a general
base type to manipulate instances of any concrete subclass.
One of the motivations of this book is to build out the object hierarchy in
an effort to develop a collection of general purpose data structures in a careful
and principled fashion. This exercise, we hope, serves as a demonstration of the
benefits of thoughtful engineering. Helpful to our approach is the use of the ab-
stract base class system. This is an extension to the standard Python type system
that, nonetheless, is used extensively in Python’s built-in collections package.
It supports a modern approach to the incremental development of classes and
allows the programmer to specify the user’s view of a class, even if the specifics
of implementations have not yet been fleshed out. In other languages, like Java
and C++, the specification of an incomplete type is called an interface. Python
does not formally have any notion of interface design, but as we shall see, it
allows for the notion of partially implemented or abstract classes that can form
the basis of fully implemented, concrete type extension. This chapter is dedicated to developing these formal notions into an organized approach to data structure design.
@abc.abstractmethod
def append(self, value):
    """Append object to end of list."""
    ...

@abc.abstractmethod
def count(self, value):
    """Return the number of occurrences of value."""
    ...

@abc.abstractmethod
def extend(self, iterable):
    """Extend list by appending values from iterable."""
    ...

@abc.abstractmethod
def index(self, value, start=0, stop=-1):
    """Return index of value in list, or -1 if missing."""
    ...

@abc.abstractmethod
def insert(self, index, value):
    """Insert object before index."""
    ...

@abc.abstractmethod
def pop(self, index=-1):
    """Remove and return item at index (default last).

    Raises IndexError if list is empty or index is out of range."""
    ...

@abc.abstractmethod
def remove(self, value):
    """Remove first occurrence of value."""
    ...

def __repr__(self):
    """Return string representation of this list."""
    return "{}({})".format(type(self).__name__, str(list(self)))
This interface, List, outlines what the user can expect from the various im-
plementations of list-like objects. First, there are several methods (append,
count, extend, etc.) that are available for general use. Their docstrings help
the user understand the designer’s intent, but there is no implementation. In-
stead, the specifics of the implementation—which are generally unimportant to
the user—are left to implementors of List-like classes. Notice that we were
able to construct this definition even though we don’t know any details of the
actual implementation of the built-in list class.
These code-less abstract methods are decorated with the @abc.abstractmethod decoration.1 To indicate that code is missing, we use an ellipsis (...) or the Python synonym, pass. The ellipses indicate the designer's promise, as part of the larger contract, to provide code as more concrete extension classes
part of the larger contract, to provide code as more concrete extension classes
are developed. Classes that extend an abstract base class may choose to provide
the missing code, or not. If a concrete implementation of an abstract method is not provided, the subclass remains “abstract.” When a class provides concrete
implementations of all of the abstract methods that have been promised by its
superclasses, the class designer has made all of the important decisions, and the
implementation becomes concrete. The purpose of this book, then, is to guide
the reader through the process of developing an abstract base class, which we
use as the description of an interface, and then committing (in one or more
steps in the type hierarchy) to one or more concrete implementations.
Because the specification of an abstract base class is incomplete, it is impossible to directly construct instances of the class. It is possible, however, that some concrete extension of the abstract class will provide all the implementation details that are missing here. Instances of the concrete subclass are then also considered instances of this base class. It is often convenient to
cast algorithms in terms of operations on instances of an abstract base class (e.g.
to inspect an automobile), even though the actual objects are concrete (e.g.
beetle instances), and have committed to particular implementation decisions
(e.g. air-cooled engines). If the performance of a concrete class is less than
ideal, it is then a relatively minor task to switch to another implementation.
The portion of the type hierarchy developed in this text (the “structure
package”) is depicted in Appendix ??. There, we see that abstract base classes
appear high in the hierarchy, and that concrete implementations are direct or
indirect extensions of these core interfaces.
In some cases we would like to register an existing concrete class as an im-
plementation of a particular interface, an extension of an abstract base class of
our own design. For example, the List interface, above, is consistent with the
built-in list class, but that relationship is not demonstrated by the type hier-
archy. After all, the List interface was developed after the internal list class
was developed. In these situations we make use of the special register class
method available to abstract base classes:
List.register(list)
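The effect of register can be demonstrated with a tiny stand-in for the List interface (this sketch is ours, not the structure package's code):

```python
from abc import ABC, abstractmethod

class List(ABC):
    """A one-method stand-in for the List interface described above."""
    @abstractmethod
    def append(self, value):
        """Append value to end of list."""
        ...

List.register(list)   # declare the built-in list a "virtual" subclass

print(issubclass(list, List))    # True
print(isinstance([1, 2], List))  # True
```

Note that register performs no checking: it is a promise, on the designer's part, that list already honors the contract.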
1 In Python, decorations are identifiers preceded by the @ symbol. These are functions that work at
the meta level, on the code itself, as it is defined. You can read more about decorations in Section ??.
Before we leave this topic we point out that the List interface is an exten-
sion of two other interfaces that are native to Python’s collections package:
Iterable and Sized. These are abstract base classes that indicate promises to
provide even more methods. The Iterable interface promises an important method supporting iteration, called __iter__; the Sized interface promises the method __len__, which indicates the structure has a notion of length or size. Collectively, these “mix-in” interfaces put significant burdens on anyone
that might be interested in writing a new List implementation. If, however, the
contract can be met (i.e. the abstract methods are ultimately implemented), we
have an alternative to the list class. We shall see, soon, that alternatives to the
list class can provide flexibility when we seek to improve the performance of
a system.
How, then, can we write application code that avoids a commitment to a particular implementation? We consider that question next, as we discuss polymorphism.
3.2 Polymorphism
While variables in Python are not declared to have particular types, every ob-
ject is the result of instantiating a particular class and thus has a very specific
origin in the type hierarchy. One can ask a value what its type is with the type
function:
>>> type(3)
<class 'int'>
>>> i = 3
>>> type(i)
<class 'int'>
>>> i = 'hello, world'
>>> type(i)
<class 'str'>
A consistent development strategy—and our purpose, here, is to foster such
techniques—means that the position of new classes will be used to reflect the
(sometimes subtle) relationship between types. Those relations are the result of
common ancestors in the type hierarchy. For example, the boolean class, bool,
is a subclass of the integer type, int. This allows a boolean value to be used in
any situation where an int is required. A little experimentation will verify that
the boolean constant True is, essentially, the integer 1, while False is a stand-in
for zero. We can see this by using bool values in arithmetic expressions:
>>> 1+True
2
>>> 3*False
0
>>> False*True
0
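The subclass relationship itself can be confirmed directly with isinstance and issubclass:

```python
print(isinstance(True, bool))    # True
print(isinstance(True, int))     # True: bool is a subclass of int
print(issubclass(bool, int))     # True
print(sum([True, True, False]))  # 2: booleans act as 1 and 0
```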
2 Actually, we know that classes are callable types because we actually call the class (viz. list())
as a factory method to create instances of the class.
3.4 Discussion
A great deal of the optimism about Python as a programming utility stems from its uniform commitment to a type hierarchy. This hierarchy reflects the relationships
between not only concrete types, but also the partially implemented abstract
types, or interfaces, that establish expectations about the consistency-preserving
operations in concrete classes. While Python provides many classes, it is impossible (and unnecessary) to predict every need, and so Python has many features that allow us to extend the type hierarchy in ways that reflect common engineering principles. These include:
3 Williams students should note that, at graduation, the Dean need only recommend students for a
degree based on “notable progress in learning”.
1. isinstance(True,bool)
2. isinstance(True,int)
3. isinstance(type(True),bool)
Problems
Selected solutions to problems begin on page ??.
3.1 For a method to be declared abstract, it must be prefixed with the @abc.abstractmethod decoration. Explain why it is not sufficient to simply provide ellipses (...) for the method body.
3.2 The abstract base class
mechanism is an add-on to the Python 3 language. The notion of an abstract
method is not directly supported by the language itself. Observing the output of
pydoc3 collections.Iterable, indicate how an abstract class might be iden-
tified.
3.3 Investigate each of the following interfaces and identify the correspond-
ing methods required by the abstract class: (1) collections.abc.Iterable,
(2) collections.abc.Sized, (3) collections.abc.Container, (4) collections.abc.Hashable,
and (5) collections.abc.Callable.
Chapter 4
Lists
In Python, a sequence is any iterable object that can be efficiently indexed by
a range of integers. One of the most important built-in sequences is the list,
first discussed in Section 0.4. In this chapter we will design a family of list-like
sequences, each with different performance characteristics.
Before we develop our own structures, we spend a little time thinking about
how Python’s native list structure is implemented. This will allow us to com-
pare alternative approaches and highlight their differences.
array needs to be reallocated again. One approach that is commonly used when
reallocating an extensible array is to initially allocate a small array (in analyses
we will assume it has one element, but it is typically larger) and to double the
length of the array as necessary.
Each time a list is extended, we must transfer each of the values that are
already stored within the vector. Because the extension happens less and less
frequently, the total number of reference copies required to incrementally grow
a list to size n is as large as
1 + 2 + 4 + ... + 2^(lg n) = 2n - 1
or approximately twice the number of elements currently in the array. Thus the
average number of times each value is copied as n elements are added to a list
is approximately 2.
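The arithmetic above can be checked with a short simulation. Here we count, for each value, its initial write plus every copy made during reallocation (a sketch; real allocators typically start with more than one cell):

```python
def writes_to_grow(n):
    """Count initial writes plus reallocation copies while appending
    n values to an array that starts at capacity 1 and doubles when full."""
    capacity, size, writes = 1, 0, 0
    for _ in range(n):
        if size == capacity:   # full: allocate a double-sized array
            writes += size     # ...and copy every existing value across
            capacity *= 2
        size += 1
        writes += 1            # the initial write of the new value
    return writes

print(writes_to_grow(1024))    # 2047, i.e. 2n - 1
```

Per element, that is just under 2 writes on average, matching the analysis.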
Storage Overhead
We can make a few more observations. If we construct a list by adding a single
element at a time it eventually fills the current allocation—it is making 100%
use of the available space—and so it must reallocate, doubling the size—making
its use 50% of the available space. It consumes this space in a linear manner
until the next extension. Averaging over all sizes we see that the list takes up
75% of the space allocated. Another way to think about this is that, on average,
a list has a 33% storage overhead—a possibly expensive price to pay, but we are
motivated by a stronger desire to reduce the cost of maintaining the integrity of
the data through the extension process.
The designers of Python felt 33% overhead was a bit too much to pay, so Python is a little less aggressive. Each time a vector of size n is extended, it allocates n/8 + c extra locations. The value of c is small, and, for analysis purposes, we can think of the reallocation as growing to about (9/8)n. As the size of the list grows large, the average number of copies required per element is approximately 9. Since, in the limit, the array is extended by about 12.5%, the average unused allocated space is about 6.25%, or an average list overhead of about 6.7%.
When a list shrinks one can always live within the current allocation, so
copying is not necessary. When the list becomes very small compared to its
current allocation, it may be expensive to keep the entire allocation available.
When doubling is used to extend lists, it may be useful to shrink the allocation
when the list size drops below one third of the allocated space. The new allo-
cation should contain extra cells, but not more than twice the current list size.
Shrinking at one third the allocation may seem odd, but shrinking the allocation
when the list length is only half the current allocation does not actually result
in any freed memory.
Other approaches are, of course, possible, and the decision may be aided by observing the actual growth and shrinkage of lists in real programs. Python reallocates a smaller list (to about (9/8)n + c) when the size, n, dips below 50% of the allocation.
Exercise 4.1 Explain why it is important that a list’s elements are stored contigu-
ously.
    Args:
        data: the data to be stored in this node.
        next: the node that follows this in the linked list.
    """
    self._data = data
    self._next = next

@property
def value(self):
    """The value stored in this node."""
    return self._data

@property
def next(self):
    """Reference to the node that follows this node in the list."""
    return self._next

@next.setter
def next(self, new_next):
    """Set next reference to new_next.

    Args:
        new_next: the new node to follow this one."""
    self._next = new_next
This structure is, essentially, a pair of references—one that refers to the data
stored in the list, and another that is a link to the node that follows. The ini-
tializer, __init__, has keyword parameters with default values so that either
parameter can be defaulted and so that when _Nodes are constructed, the pa-
rameter interpretation can be made explicit. In this class we have chosen to
think of the slots as properties (through the use of the @property decorator)
even though this essentially makes the slots public. Again, this approach al-
lows us to hide the details of the implementation, however trivial, so that in the
future we may choose to make changes in the design. For example, the use of
properties to hide accessor methods makes it easier to instrument the code (for
example, by adding counters) if we were interested in gathering statistics about
_Node use.
Figure 4.1 An empty singly-linked list (left) and one with n values (right).
Let’s now consider how multiple _Nodes can be strung together, connected by
next references, to form a new class, LinkedList. Internally, each LinkedList
object keeps track of two private pieces of information, the head of the Linked-
List (_head) and its length (_len). The head of the list points to the first
_Node in the list, if there is one. All other _Nodes are indirectly referenced
through the first node’s next reference. To initialize the LinkedList, we set the
_head reference to None, and we zero the _len slot. At this point, the list is
in a consistent state (we will, for the moment, ignore the statements involving
frozen and hash).
Our initializer, like that of the built-in list class, can take a source of data
that can be used to prime the list.
def __init__(self, data=None, frozen=False):
    """Initialize a singly linked list."""
    self._hash = None
    self._head = None
    self._len = 0
    self._frozen = False
    if data is not None:
        self.extend(data)
    if frozen:
        self.freeze()
Because the LinkedList is in a consistent internal state, it is valid to call its
extend method, which is responsible for efficiently extending the LinkedList
with the values found in the “iterable” data source. We will discuss the details
of extend later, as well.
Before we delve too far into the details of this class, we construct a private
utility method, _add. This procedure adds a new value after a specific _Node
(loc) or, if loc is omitted or is None, it places the new value at the head of the
list.
@mutatormethod
def _add(self,value,loc=None):
    """Add an element to the beginning of the list, or after loc, if given.

    Args:
        value: the value to be added to the beginning of the list.
        loc: if non-None, a _Node after which the value is inserted.
    """
    if loc is None:
        self._head = _Node(value,self._head)
    else:
        loc.next = _Node(value,loc.next)
    self._len = self._len + 1
Clearly, this method is responsible for constructing a new _Node and updating the length of the list, which has grown by one. It will be the common workhorse of methods that add values to the list.
It is useful to take a moment and think a little about the number of possible
ways we might call _add on a list with n values. The method can be called with
loc set to None, or it can be called with loc referencing any of the n _Nodes
that are already in the list. There are n + 1 different ways to call _add. This
parallels the fact that there are n + 1 different ways that a new value can be
inserted into a list of n elements. This is a subtle proof that the default value
of loc, None, is necessary for _add to be fully capable. To think in this way is
to make an information-theoretic argument about the appropriateness of an
approach. Confident that we were correct, we move on.
4.2 Alternative list Implementations 77
Let’s think about how the length of the list is accessed externally. In Python,
the size of any object that holds a variable number of values—a container class—
is found by calling the utility function, len, on the object. This function, in turn,
calls the object’s special method, __len__. For the LinkedList class, which
carefully maintains its length in _add and _free, this method is quite simple:
def __len__(self):
    """Compute the length of the list."""
    return self._len
Throughout this text we will see several other special methods. These methods
allow our classes to have the look-and-feel of built-in classes, so we implement
them when they are meaningful.
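The dispatch from len to __len__ can be seen with a toy container, independent of the LinkedList class (the Box class here is purely illustrative):

```python
class Box:
    """A tiny container used only to illustrate the __len__ protocol."""
    def __init__(self, items):
        self._items = list(items)
    def __len__(self):
        # len(box) is routed here by Python
        return len(self._items)

b = Box([1, 2, 3])
print(len(b))  # 3
```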
Next is the append method. Recall that this method is responsible for adding
a single value to the end of the list:
@mutatormethod
def append(self, value):
    """Append a value at the end of the list (iterative version).
    Args:
        value: the value to be added to the end of the list.
    """
    loc = self._head
    if loc:
        while loc.next is not None:
            loc = loc.next
    self._add(value,loc)
This method iterates through the links of the list, looking for the node that has
a None next reference—the last node in the list. The _add method is then used
to construct a new terminal node. The last element of the list is then made to
reference the new tail.
Exercise 4.2 Suppose we kept an extra, unused node at the head of this list. How
would this change our implementation of append?
Notice that it is natural to use a while loop here and that a for loop is
considerably more difficult to use efficiently. In other languages, like C or Java,
this choice is less obvious.
It may be instructive to consider a recursive approach to implementing append:
@mutatormethod
def append_recursive(self, value, loc=None):
    """Append a value at the end of the list.
    Args:
        value: the value to be appended
        loc: if non-null, a _Node, after which the value is inserted.
    """
    if loc is None:
        # add to tail
        loc = self._head
    if loc is None or loc.next is None:
        # base case: we’re at the tail
        self._add(value,loc)
    else:
        # problem reduction: append from one element below
        self.append_recursive(value,loc.next)
While the method can take two parameters, the typical external usage is to call
it with just one—the value to be appended. The loc parameter is a pointer
into the list. It is, if not None, a mechanism to facilitate the append operation’s
search for the last element of the list. (The new value will be inserted in a new
_Node referenced by this last element.) When loc is None, the search begins
from the list’s head; the first if statement is responsible for this default behav-
ior. The second if statement recursively traverses the LinkedList, searching
for the tail, which is identified by a next field whose value is None.
Logically, these two implementations are identical. Most would consider the
use of recursion to be natural and beautiful, but there are limits to the number
of recursive calls that can be supported by Python’s call stack.1 It is likely, also,
that the recursive implementation will suffer a performance penalty since there
are, for long lists, many outstanding calls to append that are not present in the
iterative case; they are pure overhead. Compilers for some languages are able
to automatically convert simple tail-recursive forms (methods that end with a
single recursive call) into the equivalent iterative, while-based implementation.
Over time, of course, compiler technology improves and the beauty of recursive
solutions will generally outweigh any decrease in performance. Still, it is one
decision the implementor must consider, especially in production code. Fortunately,
procedural abstraction hides this decision from our users. Any beauty
here is sub rosa.
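The call-stack limit just mentioned can be inspected, and adjusted with care, from within a program (the default of 1000 is a common configuration, not a guarantee):

```python
import sys

# the maximum recursion depth before Python raises RecursionError;
# commonly 1000 by default
limit = sys.getrecursionlimit()
print(limit)

# raising the limit trades safety for deeper recursion, e.g. to let a
# method like append_recursive handle longer lists
sys.setrecursionlimit(limit + 1000)
```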
Our initializer, recall, made use of the extend operation. This operation
appends (possibly many) values to the end of the LinkedList. Individually,
each append will take O(n) time, but only because it takes O(n) time just to find
the end of the list. Our implementation of extend avoids repeatedly searching
for the end of the list:
def _findTail(self):
    """Find the last node in the list.
    Returns:
        A reference to the last node in the list.
    """
    loc = self._head
    if loc:
        while loc.next is not None:
            loc = loc.next
    return loc
1 Python’s stack is limited to 1000 entries. This may be increased, but it is one way that Python can halt, not because of errors in logic, but because of resource issues.
@mutatormethod
def extend(self, data):
    """Append values in iterable to list.
    Args:
        data: an iterable source for data.
    """
    tail = self._findTail()
    for value in data:
        self._add(value,tail)
        tail = self._head if tail is None else tail.next
A private helper method, _findTail, returns a reference to the last element, or
None, if the list is empty.
Exercise 4.3 Write a recursive version of _findTail.
Once the tail is found, a value is added after the tail, and then the temporary
tail reference is updated, sliding down to the next node. Notice again that the
use of the _add procedure hides the details of updating the list length. It is
tempting to avoid the O(n) search for the tail of the list if no elements would
ultimately be appended. A little thought, though, demonstrates that this would
introduce an if statement into the loop. The trade-off, then, would be
to improve the performance of the relatively rare trivial extend at the cost of
slowing down every other case, perhaps significantly.
Before we see how to remove objects from a list, it is useful to think about
how to see if they’re already there. Two methods __contains__ and count (a
sequence method peculiar to Python) are related. The special __contains__
method (called when the user invokes the in operator; see page 7) simply tra-
verses the elements of the list, from head to tail, and compares the object sought
against the value referenced by the current node:
def __contains__(self,value):
    """Return True iff value is contained in list."""
    loc = self._head
    while loc is not None:
        if loc.value == value:
            return True
        loc = loc.next
    return False
As soon as a match is found (as identified by the == operator; recall our
extended discussion of this on page 42), the procedure returns True. This be-
havior is important: a full traversal of an n element list takes O(n) time, but
the value we’re looking for may be near the head of the list. We should return
as soon as the result is known. Of course if the entire list is traversed and the
while loop finishes, the value was not encountered and the result is False.
While the worst case time is O(n), in the expected case2 the method finds the
value much more quickly.
The count method counts how many equal values there are:
def count(self, value):
    """Count occurrences of value in list."""
    result = 0
    loc = self._head
    while loc is not None:
        if loc.value == value:
            result += 1
        loc = loc.next
    return result
The structure of this code is very similar to __contains__, but count must
always traverse the entire list. Failure to consider all the elements in the list
may yield incorrect results. Notice that __contains__ could be written in terms
of count but its performance would suffer.
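That remark can be made concrete with a sketch (a plain Python list stands in for the linked structure; CountingBag is illustrative only): defining membership via count is correct, but always pays for a full traversal, even when the value sits at the head.

```python
class CountingBag:
    """Illustration only: membership implemented in terms of count."""
    def __init__(self, items):
        self._items = list(items)
    def count(self, value):
        # always examines every element
        return sum(1 for x in self._items if x == value)
    def __contains__(self, value):
        # correct, but O(n) even when value is the first element
        return self.count(value) > 0

bag = CountingBag([3, 1, 4, 1, 5])
print(3 in bag, bag.count(1))  # True 2
```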
Now, let’s consider how to remove elements. Removing a value is, in some
sense, the “inverse” of inserting a value, so it would not be surprising if
the structure of the methods that support removal of values were similar to
what we’ve seen so far for insertion. The counterpart to _add is a private utility
method, _free:
@mutatormethod
def _free(self, target):
    result = target.next
    target.next = None
    self._len -= 1
    return result
When we’re interested in removing a particular node, say target, we pass
target to _free and the routine unlinks the target node from the list. It re-
turns the portion of the list that hangs off the target node. This value is useful,
because it is the new value of the next field in the _Node that precedes target
or if target was at the head of the list, it is the new _head value. We will fre-
quently see, then, a statement similar to
prior.next = self._free(prior.next)
Finally, _free is the place where we hide the details of reducing the length of
the list.
When we want to remove a specific value from a LinkedList, we call remove.
@mutatormethod
def remove(self,value):
2 The term expected case depends, of course, on the distribution of accesses that we might “expect”.
For example, in some situations we might know that we’re always looking for a value that is known
to be in the list (perhaps as part of an assert). In other situations we may be looking for values that
have a non-uniform distribution. Expected case analysis is discussed in more detail in Section ??.
3 Actually, __getitem__ can accept a range of fairly versatile indexing specifications called slices, as hinted at earlier on page 10.
4 The pop method, along with append, supports the notion of a stack. We will see more about stacks when we discuss linear structures in Chapter 6.
prior.next = self._free(prior.next)
return result
Since this method removes values from the list, it makes use of the _free
method.
1. The head and tail of the LinkedList are treated somewhat differently.
This directly leads to differences in performance between head-oriented
operations and tail-oriented operations. Since, for many applications the
head and tail are likely to be frequently accessed, it would be nice to have
these locations supported with relatively fast operations.
2. The internal _add and _free methods both require you to be able to access
the node that falls before the target node. This makes for some admittedly
awkward code that manages a prior reference rather than a current
reference. Some of this can be resolved with careful use of recursion,
but the awkwardness of the operation is inherent in the logic that all the
references point in one direction, toward the tail of the list. It would be nice
to remove a node given just a reference to that node.
3. LinkedLists that are empty have a _head that is None. All other lists have
_head referencing a _Node. This leads to a number of un-beautiful if
statements that must test for these two different conditions. For example,
in the _add method, there are two ways that a new _Node is linked into
the list, depending on whether the list was initially empty or not. Unusual
internal states (like an empty LinkedList) are called boundary conditions
or edge- or corner-cases. Constantly having to worry about the potential
boundary case slows down the code for all cases. We would like to avoid
this.
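Drawback 2 can be seen in miniature: deleting from a singly linked chain requires a handle on the node before the one being removed (a sketch with a minimal node class, not the book’s _Node):

```python
class Node:
    # minimal singly linked node, for illustration only
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

head = Node(1, Node(2, Node(3)))
# to remove the node holding 2, we must hold the node holding 1
prior = head
prior.next = prior.next.next

# walk the chain to observe the result
values = []
node = head
while node is not None:
    values.append(node.value)
    node = node.next
print(values)  # [1, 3]
```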
Here, we consider a natural extension to singly-linked lists that establishes
not only forward references (toward the tail of the list), but also backward
references (toward the list head). The class is DoublyLinkedList. Admit-
tedly, this will take more space to store, but the result may be a faster or—
for some, a greater motivation—more beautiful solution to constructing lists
from linked nodes. Our discussion lightly parallels that of the LinkedList
class, but we highlight those features that demonstrate the relative beauty of
the DoublyLinkedList implementation.
Like the LinkedList class, the DoublyLinkedList stores its values in a pri-
vate5 _Node class. Because the list will be doubly-linked, two references will
have to be maintained in each node: the tail-ward _next and head-ward _prev.
While, in most cases, a _Node will hold a reference to valid user-supplied
data, it will be useful for us to have some seemingly extraneous or dummy nodes
that do not hold data (see Figure 4.2). These nodes ease the complexities of the
edge-cases associated with the ends of the list. For accounting and verification
purposes, it would be useful to be able to identify these dummy nodes. We
could, for example, add an additional boolean (“Am I a dummy node?”), but
that boolean would be a feature of every node in the list. Instead we will simply
have the _data reference associated with a dummy node be a reference to the
node itself. Since _Node references are not available outside the class, it would
be impossible for an external user to construct such a structure for their own
purpose.6
The _Node implementation looks like this:
@checkdoc
5 The class is not “as private” as private classes in languages like Java. Here, we simply mean
that the class is not mentioned in the module variable __all__, the collection of identifiers that
are made available with the from...import * statement. In any case, Python will not confuse the
_Node class that supports the LinkedList with the _Node class that supports the DoublyLinkedList
class.
6 This is another information theoretic argument: because we’re constructing a structure we can
prove it cannot be confused with user data, thus we’re able to convey this is a dummy/non-data
node. The construction of valid dummy node definitions can be complicated. For example, we
cannot use a value field of None as an indicator, since None is reasonably found in a list as user data.
Figure 4.2 An empty doubly linked list, left, and a non-empty list, on right. Dummy
nodes are shown in gray.
class _Node:
    __slots__ = ["_data", "_prev", "_next"]
    def __init__(self, data=None, prev=None, next=None, dummy=False):
        """Construct a node; dummy nodes reference themselves as data."""
        # (initializer and value property reconstructed from the
        # surrounding discussion: a dummy node's _data refers to itself)
        self._data = self if dummy else data
        self._prev = prev
        self._next = next
    @property
    def value(self):
        """The user data held in this node."""
        assert self._data is not self  # fails only for dummy nodes
        return self._data
    @property
    def next(self):
        """The reference to the next element in list (toward tail)."""
        return self._next
    @next.setter
    def next(self, new_next):
        """Set next reference."""
        self._next = new_next
    @property
    def prev(self):
        """The reference to the previous element in list (toward head)."""
        return self._prev
    @prev.setter
    def prev(self, new_prev):
        """Set previous reference."""
        self._prev = new_prev
Typically, the _Nodes hold data; they’re not “dummy” _Nodes. When a dummy
node is constructed, the _data reference points to itself. If the value property
accessor method is ever called on a dummy node, it fails the assertion check.
In a working implementation it would be impossible for this assertion to fail;
when we’re satisfied our implementation is correct we can disable assertion
testing (run Python with the -O switch), or comment out this particular assert.7
The initializer for the DoublyLinkedList parallels that of the LinkedList,
except that the _head and _tail references are made to point to dummy nodes.
These nodes, in turn, point to each other through their next or prev references.
As the list grows, its _Nodes are found between _head.next and _tail.prev.
def __init__(self, data=None, frozen=False):
    """Construct a DoublyLinkedList structure."""
    self._head = _Node(dummy=True)
    self._tail = _Node(dummy=True)
    self._tail.prev = self._head
    self._head.next = self._tail
    self._len = 0
    if data:
        self.extend(data)
All insertion and removal of _Nodes from the DoublyLinkedList makes use of
two private utility routines, _link and _unlink. Let’s first consider _unlink.
Suppose we wish to remove an item from list l. We accomplish that with:
self._unlink(item)
Compare this with the _free method in LinkedList. There, it was necessary
for the caller of _free to re-orient the previous node’s next reference
because the previous node was not accessible from the removed node. Here,
_unlink is able to accomplish that task directly. _unlink is written as follows:
@mutatormethod
def _unlink(self,node):
    """unlink this node from surrounding nodes"""
    node.prev.next = node.next
    node.next.prev = node.prev
The process of introducing a _Node into a list inverts the operation of _unlink:
7 Commenting out the assertion is better than physically removing the assertion. Future readers of
the code who see the assertion are then reminded of the choice.
Figure 4.3 A doubly linked list before (left) and after (right) the action of _unlink.
After the operation the node that has been removed (and referenced here, by a temporary
reference) remembers its location in the list using the retained pointers. This makes re-
insertion particularly quick.
@mutatormethod
def _link(self, node):
    """link to this node from its desired neighbors"""
    node.next.prev = node
    node.prev.next = node
Paired, these two methods form a beautiful way to temporarily remove (using
_unlink) and re-introduce (using _link) a list element (see Figure 4.3).8 Such
operations are also useful when we want to move an element from one location
to another, for example while sorting a list.
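The unlink-then-relink pattern of Figure 4.3 can be traced with bare nodes (a sketch only; the real class routes these updates through its prev/next properties):

```python
class Node:
    # minimal doubly linked node, for illustration only
    def __init__(self, value=None):
        self.value = value
        self.prev = self.next = None

# build head <-> a <-> tail by hand
head, a, tail = Node(), Node("a"), Node()
head.next = a; a.prev = head
a.next = tail; tail.prev = a

# unlink: splice a out; a itself retains its prev/next references
a.prev.next = a.next
a.next.prev = a.prev
assert head.next is tail and tail.prev is head

# relink: the retained references make re-insertion immediate
a.prev.next = a
a.next.prev = a
print(head.next is a, a.next is tail)  # True True
```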
We can now cast the _add and _free methods we first saw associated with
singly linked lists:
@mutatormethod
def _add(self, value, location):
    """Add a value in list after location (a _Node)"""
    # create a new node with desired references
    entry = _Node(value,prev=location,next=location.next)
    # incorporate it into the list
    self._link(entry)
    # logically add element to the list
    self._len += 1
8 This observation led Don Knuth to write one of the most interesting papers of the past few years, called Dancing Links, [?].
@mutatormethod
def _free(self,item):
    """Logically remove the indicated node from its list."""
    # unlink item from list
    self._unlink(item)
    # clear the references (for neatness)
    item.next = item.prev = None
    # logically remove element from the list
    self._len -= 1
Again, these operations are responsible for hiding the details of accounting (i.e.
maintaining the list length) and for managing the allocation and disposal of
nodes.
Most of the implementation of the DoublyLinkedList class follows the think-
ing of the LinkedList implementation. There are a couple of methods that
can make real use of the linkage mechanics of DoublyLinkedLists. First, the
append method is made considerably easier since the tail of the list can be found
in constant time:
@mutatormethod
def append(self, value):
    """Append a value at the end of the list."""
    self._add(value,self._tail.prev)
Because of the reduced cost of the append operation, the extend operation can
now be efficiently implemented in the naïve way, by simply appending each of
the values found in the iterable data:
@mutatormethod
def extend(self, data):
    """Append values in iterable to list (as efficiently as possible)."""
    for x in data:
        self.append(x)
Finally, we look at the insert method. Remember that the cost of inserting
a new value into an existing list is O(n); the worst-case performance involves
traversing the entire list to find the appropriate location. This could be improved
by a factor of two by searching for the correct location from the beginning or
the end of the list, whichever is closest:
@mutatormethod
def insert(self,index,value):
    """Insert value in list at offset index."""
    # normalize the index
    l = len(self)
    if index > l:
        index = l
    if index < -l:
        index = -l
    if index < 0:
        index += l
Figure 4.4 An empty circular list (left) and one with many nodes (right). Only the tail
reference is maintained, and a dummy node is used to avoid special code for the empty
list.
self._frozen = False
if data:
    self.extend(data)
self._frozen = frozen
When the list is empty, of course, there is just one node, the dummy. Otherwise,
the dummy node appears after the tail, and before the head of the list. For
simplicity, we define a _head property:
@property
def _head(self):
    """A property derived from the tail of the list."""
    return self._tail.next.next
The _add and _free methods work with the node that falls before the desired
location (because we only have forward links) and, while most boundary
conditions have been eliminated by the use of the dummy node, they must
constantly be on guard that the node referenced by _tail is no longer the last (in
_add) or that it has been removed (in _free). In both cases, the _tail must be
updated:
@mutatormethod
def _add(self,value,location=None):
    """Add an element to the beginning of the list, or after loc if loc."""
    if location is None:
        location = self._tail.next
    element = _Node(value,next=location.next)
    location.next = element
    if location == self._tail:
        self._tail = self._tail.next
    self._size += 1
def _free(self,previous):
    element = previous.next
    previous.next = element.next
    if self._tail == element:
        self._tail = previous
    element.next = None
    self._size -= 1
    return element.value
These few considerations are sufficient to implement the circular list. All other
methods are simply borrowed from the lists we have seen before.
Exercise 4.4 The pop method is typically used to remove an element from the tail
of the list. Write the pop method for CircularList.
In almost any case where a LinkedList is satisfactory, a CircularList is
as efficient, or better. The only cost comes at the (very slight) performance de-
crease for methods that work with the head of the list, which is always indirectly
accessed in the CircularList.
4.3 Discussion
It is useful to reflect on the various choices provided by our implementations of
the List class.
The built-in list class is compact and, if one knows the index of an element
in the list, the element can be accessed quickly—in constant time. Searching for
values is linear (as it is with all our implementations). Inserting a value causes
all elements that will ultimately appear later in the list to be shifted. This is
a linear operation, and can be quite costly if the list is large and the index is
small. The only location where insertion is not costly is at the high-indexed
tail of the list. Finally, when a list grows, it may have to be copied to a new
larger allocation of memory. Through careful accounting and preallocation, we
can ameliorate the expense, but it comes at the cost of increasingly expensive
(but decreasingly frequent) reallocations of memory. This can lead to occasional
unexpected delays.
The singly linked list classes provide the same interface as the list class.
They are slightly less space efficient, but their incremental growth is more grace-
ful. Once a correct location for insertion (or removal) is found the inserting
(removing) of a value can be accomplished quite efficiently. These structures
are less efficient at directly indexed operations—operations that are motivating
concepts behind the list design. The CircularList keeps track of both head
and tail allowing for fast insertions, but removing values from the tail remains
a linear operation.
Supporting a linked list implementation with links in both directions makes
it a very symmetric data structure. Its performance is very similar to the other
linked list classes. Backwards linking allows for efficient removal of values from
the tail. Finally, it is quite easy to temporarily remove values from a doubly
linked list and later undo the operation; all of the information necessary is
stored directly within the node. In space conscious applications, however, the
overhead of DoublyLinkedList is probably prohibitive.
Problems
Selected solutions to problems begin on page 313.
4.1 Regular problem.
Chapter 5
Iteration
We have spent a considerable amount of time thinking about how data abstrac-
tion can be used to hide the unnecessary details of an implementation. In this
chapter, we apply the same principles of abstraction to the problem of hiding
or encapsulating basic control mechanisms. In particular, we look at a powerful
abstraction called iteration, which is the basis for Python’s main looping con-
struct, the for loop. In the first sections of this chapter, we think about general
abstractions of iteration. Our last topics consider how these abstractions can be
used as a unified mechanism for traversing data structures without exposing the
details otherwise hidden by the data abstraction.
Notice that while most of the code is focused on generating the primes, the
part that is responsible for consuming the values (here, the print statement on
line 18) is actually embedded in the code itself.
Our code would be improved if we could simply generate a list of primes,
and then make use of that list elsewhere. We could construct and return a list:
def firstPrimes(max=None):
    sieve = []
    n = 2
    while (max is None) or (n < max):
        is_prime = True
        for p in sieve:
            if p*p > n:
                break
            if n % p == 0:
                is_prime = False
                break
        if is_prime:
            sieve.append(n)
        n = 3 if n == 2 else n+2
    return sieve
Now, the code that computes the primes has been logically separated from the
code that makes use of them:
for p in firstPrimes(100):
print(p)
1 We will still want to keep track of a list of primes, but only for internal use. This approach, of
course, has all the same motivations we outlined when we discussed the abstract data type; here
we’re simply hiding the implementation of the determination of prime values.
signaled by the use of the yield statement—are actually classes that wrap the
intermediate state of the computation. The initializer creates a new version of
the generator, and the special method __next__ forces the generator to com-
pute the next value to be yielded:
class primeGen(Iterable):
    def __init__(self):
        self._sieve = []
        self._n = 2
    def __next__(self):
        yielding = False
        while not yielding:
            is_prime = True
            for p in self._sieve:
                if p*p > self._n:
                    break
                if self._n % p == 0:
                    is_prime = False
                    break
            if is_prime:
                self._sieve.append(self._n)
                # effectively yield the use of the machine
                # returning _n as the computed value
                result = self._n
                yielding = True
            self._n = 3 if self._n == 2 else self._n+2
        return result
    def __iter__(self):
        return self
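For comparison, the yield-based generator function this class hand-implements lets Python preserve the local state between calls to next automatically (a sketch consistent with, but not copied from, the book’s primes generator):

```python
def primes():
    # local variables persist between yields; no explicit class needed
    sieve = []
    n = 2
    while True:
        if all(n % p != 0 for p in sieve if p * p <= n):
            sieve.append(n)
            yield n
        n = 3 if n == 2 else n + 2

g = primes()
print([next(g) for _ in range(5)])  # [2, 3, 5, 7, 11]
```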
When the primes generator is defined, the effect is essentially to generate the
primeGen class. As with many of the special methods that are available in
Python, the __next__ method for objects of this class is indirectly invoked by
the builtin function next.2 Here is how we might explicitly make use of this
approach:
def primeFactors3(n):
    result = []
    primeSource = primeGen()
    p = next(primeSource)
    while n > 1:
        if n % p == 0:
            result.append(p)
            n //= p
        else:
            p = next(primeSource)
    return result
2 This may seem strange to non-Python users. The reason that special methods are called in this indirect way is to help the language verify the correct use of polymorphism and to better handle targets that are None.
As a final exercise, we consider how we might rewrite our prime factor com-
putation as a generator. Here, the values that should be produced are the (possi-
bly repeated) prime factors, and there is always a finite number of them. Because
of this, it is necessary for our generator to indicate the end of the sequence of
values. Fortunately, Python has a builtin exception class dedicated to this pur-
pose, StopIteration. When this exception is raised it is an indication that
any for loop that is consuming the values produced by the generator should
terminate iteration.
Here, then, is our lazy generator for prime factors of a number, prime-
FactorGen:
def primeFactorGen(n):
    for p in primeGen():
        if n == 1:
            raise StopIteration()
        while n % p == 0:
            yield p
            n //= p
This could be used either as the basis of a for loop:
print([p for p in primeFactorGen(234234234)])
or as an equivalent loop that explicitly terminates on a StopIteration signal:
i = primeFactorGen(234234)
results = []
while True:
    try:
        results.append(next(i))
    except StopIteration:
        break
print(results)
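A note on modern Python: since PEP 479 (Python 3.7), raising StopIteration inside a generator body is converted to a RuntimeError; the portable way to end the sequence is simply to return. A self-contained sketch in that style, with an incrementing trial divisor standing in for primeGen (this works because smaller prime factors are removed before any composite divisor is tried):

```python
def primeFactorGen(n):
    p = 2
    while n > 1:
        while n % p == 0:
            yield p
            n //= p
        p = 3 if p == 2 else p + 2
    # falling off the end of a generator raises StopIteration
    # on the caller's behalf

print(list(primeFactorGen(60)))  # [2, 2, 3, 5]
```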
item = l[i] 4
except IndexError: 5
break
...
i += 1
When a class does not define the special __iter__ method, Python will provide
an implementation that looks like the one for lists, by default. In other words,
the definition of the __getitem__ method, in a very indirect way, allows a data
structure to be the target of a for loop. (It is instructive to write the simplest
class you can whose traversal yields the values 0 through 9; see Problem ??.)
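The fallback protocol can be demonstrated directly: a class defining only __getitem__ over small non-negative integers is iterable, with iteration stopping at the first IndexError (the Squares class is illustrative only):

```python
class Squares:
    """Defines only __getitem__; Python supplies the iteration."""
    def __getitem__(self, i):
        # iteration probes indices 0, 1, 2, ... until IndexError
        if not (0 <= i < 5):
            raise IndexError(i)
        return i * i

print([x for x in Squares()])  # [0, 1, 4, 9, 16]
```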
Suppose, however, that l was a LinkedList. Then the indexing operation of
line 4 in the previous code can be costly; we have seen that for linked structures,
the __getitem__ method takes time that is O(n) in the worst case. Can an
efficient traversal be constructed?
The answer, of course, is Yes. What is required is an alternative specification
of the __iter__ generator that generates the stream of values without giving
away any of the hidden details of the implementation. Here, for example, is
a suitable implementation of the iteration method for the LinkedList class:
def __iter__(self):
    """Iterate over the values of this list in the order they appear."""
    current = self._head
    while current is not None:
        yield current.value
        current = current.next
Because the local variable current keeps track of the state of the traversal be-
tween successive yield statements, the iterator is always able to quickly ac-
cess the next element in the traversal. (This is opposed to having to, in the
__getitem__ method, restart the search for the appropriate element from the
head of the linked list.) Now, we see that, while random access of linked list
structures can be less efficient than for built-in list objects, the iterative traver-
sal of linked lists is just as efficient as for their built-in counterparts.
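The same generator-method technique works on any hand-built chain; a minimal self-contained version (MiniList is a stripped-down stand-in, not the book’s class):

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class MiniList:
    """Just enough linked list to show a generator-based __iter__."""
    def __init__(self, values):
        self._head = None
        for v in reversed(values):
            self._head = Node(v, self._head)
    def __iter__(self):
        # current holds the traversal state between yields
        current = self._head
        while current is not None:
            yield current.value
            current = current.next

print(list(MiniList([1, 2, 3])))  # [1, 2, 3]
```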
We see, then, that the support of efficient traversals of container classes in-
volves the careful implementation of one of two special methods. If there is
a natural interpretation of the indexing operation, c[i], the special method
__getitem__ should be provided. Since nearly every class holds a finite num-
ber of values, the value of the index should be verified as a valid index and,
if it is not valid, an IndexError exception should be raised. Typically, the in-
dex is an integer, though we will see many classes where this need not be the
case. If indexing can be meaningfully performed with small non-negative in-
tegers, Python can be depended upon to write an equally meaningful method
for iterating across the container object. It may not be efficient (as was the case
for all of our linked structures), but it will work correctly. Where traversal effi-
ciency can be improved—it nearly always can, with a little state—a class-specific
implementation of the special __iter__ method will be an improvement over
Python’s default method. When indexing of a structure by small positive inte-
gers is not meaningful, a traversal mechanism is typically still very useful but
Python can not be depended upon to provide a useful mechanism. It must be
specified directly or inherited from superclasses.
Many container classes, and especially those that implement the List interface,
support a wide variety of special methods. Most of these may be cast in
terms of high-level operations that involve traversals and other special methods.
As a demonstration of the power of traversals to support efficient operations, we
evaluate a few of these methods.
First, we consider the extend method of the linked list class. An important
feature of this operation is the traversal of the Iterable that is passed in:
def extend(self, data):
    """Append values in iterable to list.
    Args:
        data: an iterable source for data.
    """
    tail = self._findTail()
    for value in data:
        self._add(value,tail)
        tail = self._head if tail is None else tail.next
be another instance of the LinkedList class. This might have been called, for
example, from the __add__ method, which is responsible for LinkedList con-
catenation with the + operator. There, we use the extend method twice.
def __add__(self,iterable):
    """Construct a new list with values of self as well as iterable."""
    result = LinkedList(self)
    result.extend(iterable)
    return result
The first call is within the initializer for the result object. Since we have effi-
ciently implemented the special __iter__ method for self, this operation will
be as efficient as possible—linear in the list length. The second call to extend,
which is explicit, is responsible for the catenation of values from the iterable
object. That object may be a LinkedList instance; if it is, we can rest assured
that the traversal is efficient.
Two special methods are often implemented quite early in the development
process, mainly to support debugging. The first, __str__, is responsible for
converting the object to a compact and “informal” string representation. This
method is called whenever the object is used in a place where a string is called
for.3 One approach might be to simply construct, by concatenation, a long,
comma-separated list of string representations of each item in the container:
def __str__(self):
    result = ""
    for item in self:
        if result:
            result += ", "
        result += str(item)
    return result
Here, the container class (notice we don't know what kind of container this is,
just that it supports iteration) is traversed, and each value encountered
produces a string representation of itself. These are concatenated together to
form the final string representation.
This approach, while effective, is quite inefficient. Since strings are not
mutable in Python, each catenation operation forces a new, larger copy of the
representation to be constructed. The result is an operation whose number of
character copies may grow as O(n²), not O(n). An improvement on this approach
is to use the join operation, which efficiently concatenates many strings
together, with intervening separator strings (here, a comma):
def __str__(self):
    item_strings = [str(item) for item in self]
    return ", ".join(item_strings)
Here, item_strings is the list of collected string representations, constructed
through a list comprehension. The comprehension operation, like the for loop,
constructs an iterator to perform the traversal.
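The difference between the two formulations can be checked with an ordinary list standing in for the container; both produce the same string, but the join version copies each character only once:

```python
items = [1, 2, "three", 4.0]

# quadratic formulation: each += copies the growing result string
result = ""
for item in items:
    if result:
        result += ", "
    result += str(item)

# linear formulation: collect the pieces, then join once
joined = ", ".join(str(item) for item in items)

assert result == joined == "1, 2, three, 4.0"
```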
While we’re thinking about string-equivalents for container classes, it’s worth
thinking about the __repr__ special method. This method, like __str__, con-
structs a string representation of an object, but it typically meets one more
requirement—that it be a valid Python expression that, when evaluated,4 con-
structs an equal object. Since container classes can typically be constructed by
providing a list of initial values, we can “cheat” by producing the list represen-
tation of the container class, wrapped by the appropriate constructor:
def __repr__(self):
    return "{}({})".format(type(self).__name__, repr(list(self)))
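The round-trip property this method aims for can be checked directly with a builtin container; here collections.deque stands in for one of our classes:

```python
from collections import deque

d = deque([1, 2, 3])
# the repr names the constructor, wrapped around the list of values
assert repr(d) == "deque([1, 2, 3])"
# evaluating the repr reconstructs an equal object
assert d == eval(repr(d))
```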
3 This plays much the same role as the toString method in Java.
4 Because Python is interpreted, it is quite simple to evaluate an expression that appears within
a string s: one simply calls eval(s). Thus, it should be the case that obj == eval(repr(obj))
always returns True.
def __init__(self, *args):
    self._min = args[0] if len(args) > 1 else 0
    self._max = args[-1]

def __iter__(self):
    current = self._min
    while current < self._max:
        yield current
        current += 1
The initializer interprets one or two arguments5 as the bounds on a range
of values to be generated. When necessary, the __iter__ method performs the
idiomatic loop that yields each value in turn. The builtin range class is only
slightly more complex, supporting increments or strides other than 1 and sup-
porting a small number of simple features.6 Python’s count iterator is even
5 The builtin range object is much more complex, taking as many as three arguments, as well as
slice values.
6 The slice class is a builtin class that provides the mechanism for complex slice indexing through
the __getitem__ method. While slice is not iterable, it provides an indices(len) method that
returns a tuple of indices for a container, provided the container has length len. The resulting tuple,
then, can be used as the arguments (using the * operator) to initialize a range object which you can
then use to traverse the indicated portion of the container.
7 A similar feature, not used here, is the ** specifier that identifies a formal parameter that is to
be bound to a dictionary of all the keyword-value pairs specified as actual parameters. We learn
more about dictionaries in Chapter 14.
The values list is intended to collect the next value from each of the subor-
dinate iterators. This might not be successful, of course, if one of the iterators
expires during a call to next. This is indicated by the raising of a StopIteration
exception, caught on line 10. When that occurs, the complete list of values can-
not be filled out, so the generator terminates by explicitly returning. This causes
a StopIteration exception to be generated on behalf of the zip iterator. One
might be tempted to simply re-raise the exception that was caught, but this is
not a proper way to terminate a generator; the return is preferred. If
the collection of next values is successful, the tuple of values is generated.
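Since the listing for the zip generator itself falls on a page not reproduced here, the following sketch (names and argument handling are assumed) captures the behavior the text describes:

```python
def zip_sketch(*iterables):
    # a sketch of a zip-like generator; details of the book's version assumed
    iters = [iter(it) for it in iterables]
    while True:
        values = []
        for i in iters:
            try:
                values.append(next(i))
            except StopIteration:
                # an iterator expired: end this generator with a plain return
                return
        yield tuple(values)

# the shorter iterable determines the number of tuples generated
assert list(zip_sketch("ab", [1, 2, 3])) == [("a", 1), ("b", 2)]
```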
It might be tempting to use a tuple comprehension8 similar to:
try:
    yield tuple(next(i) for i in iters)
except StopIteration:
    return
but something subtle happens here: a StopIteration raised by an inner call to
next is absorbed by the comprehension itself, which quietly ends early and
yields a short tuple rather than terminating the enclosing generator. This code
is exactly the same as:
try:
    temp = []
    iters_iter = iter(iters)
    while True:
        try:
            i = next(iters_iter)
            temp.append(next(i))
        except StopIteration:
            break
    yield tuple(temp)
except StopIteration:
    return
Exercise 5.1 What happens when you call zip with one iterable object?
Exercise 5.2 What happens when you provide zip no iterable objects? Is this a
reasonable behavior? If not, is there an elegant fix?
Tool                           Purpose
count([start=0[, step=1]])     A stream of values counting from start by step.

Figure 5.1 Iterator manipulation tools from the itertools package.
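The count tool from this table generates an unending arithmetic stream; islice, another itertool, can truncate it for inspection:

```python
from itertools import count, islice

# count(10, 2) yields 10, 12, 14, ... indefinitely; islice takes the first 4
assert list(islice(count(10, 2), 4)) == [10, 12, 14, 16]
```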
Chapter 6
Linear Structures
The classes that implement the List interface support a large number of oper-
ations with varied efficiency. Sometimes when we design new data structures
we strip away operations in an attempt to restrict access. In the case of linear
structures, the most important operations in the interface are those that add
and remove values. Because it is not necessary, for example, to support random
access, the designer of linear structures is able to efficiently implement add and
remove operations in novel ways.
The two main linear structures are the stack and the queue. These two
general structures are distinguished by the relationship between the ordering
of values placed within the structure and the ordering of the values removed.
Because of their general utility, linear structures are central to the
control of the flow of data in many important algorithms.
def __init__(self):
    """Construct an unfrozen Linear data structure."""
    self._hash = None
    self._frozen = False

@abc.abstractmethod
def add(self, value):
    """Insert value into data structure."""
    ...
@abc.abstractmethod
def remove(self):
    """Remove next value from data structure."""
    ...

@abc.abstractmethod
def __len__(self):
    """The number of items in the data structure."""
    ...

@property
def empty(self):
    """Data structure contains no items."""
    return len(self) == 0

@property
def full(self):
    """Data structure has no more room."""
    return False

@abc.abstractmethod
def peek(self):
    """The next value that will be removed."""
    ...

@abc.abstractmethod
def __iter__(self):
    """An iterator over the data structure."""
    ...

def __str__(self):
    """Return a string representation of this linear."""
    return str(list(self))

def __repr__(self):
    """Parsable string representation of the data structure."""
    return self.__class__.__name__ + "(" + str(self) + ")"
Clearly, any value that is removed from a linear structure has already been
added. The order in which values appear through successive remove opera-
tions, however, is not specified. For example, a stack object always returns the
value which was added to the stack most recently. We might decide, however,
to return the value that was added least recently (a queue), or the value that
is largest or smallest (a priority queue). While an algorithm may only make
use of the add and remove methods, the possible re-orderings of values as they
pass through the linear structure typically plays a pivotal role in making the
algorithm efficient. The limitations placed on the Linear interface help to guar-
antee that the relationship between the adds and removes is not tainted by other
less important operations.
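The contrast between these reordering policies can be sketched with builtin containers standing in for the classes developed below (our Stack and Queue are not assumed here):

```python
from collections import deque

stack = []            # LIFO: add and remove at the tail of a list
for v in (1, 2, 3):
    stack.append(v)
assert [stack.pop() for _ in range(3)] == [3, 2, 1]

queue = deque()       # FIFO: add at the tail, remove from the head
for v in (1, 2, 3):
    queue.append(v)
assert [queue.popleft() for _ in range(3)] == [1, 2, 3]
```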
@property
def top(self):
    """Show next value to be removed from structure."""
    return self.peek()
In addition to these alternative methods, the LIFO interface suggests that any
class that implements the interface abides by the reordering policy that the
last items placed in the linear are the first to come out. Nothing in the in-
terface actually guarantees that fact, but the implementation of the interface
provides a type-oriented way for checking whether a structure implements the
policy. A subtle aspect of this interface is the class header, which declares
metaclass=abc.ABCMeta. When we wish to implement abstract base classes
that are roots—that do not extend another class—it is necessary to declare the
class in this manner to include it in the abstract base class hierarchy.1
It is natural to think of storing the values of a stack in some form of List.
Indeed, the List interface provides a routine, pop, that suggests that the im-
plementors had imagined this use. Since the list version of pop defaults to
removing the element at the far end of the list (at index -1), we will place the
elements of the stack in that orientation. This has the advantage, of course,
that push/add and pop/remove are performed where they are most efficiently
implemented in lists, in constant (O(1)) time. This implementation also has
the feature that the structure can grow to be arbitrarily large; it is unbounded.
We can actually provide, as an abstract class (i.e., a class with unspecified,
abstract methods), much of the implementation in an UnboundedLinear structure.
Here are the details:
class UnboundedLinear(Linear):
    __slots__ = ["_data"]

    @abc.abstractmethod
    def remove(self):
        """Remove a value from the linear structure."""
        ...
@abc.abstractmethod
1 This is not ideal. It reflects the development of Python over many years. Since hierarchies of
incompletely specified classes (abc hierarchy) are a relatively new addition to Python, the default
behavior (for backwards compatibility) is not to place new classes in this hierarchy. For technical
reasons, it is not ideal to mix old-style and new-style class definitions, so we opt to include all our
structure classes in the abc hierarchy by directly specifying the metaclass, as we have here, or by
extending an existing structure (as we do much of the time).
def peek(self):
    """The next value that will be popped. Equivalent to top."""
    ...

def __len__(self):
    """The number of items in the linear structure."""
    return len(self._data)
Though not all the details of the implementation can be predicted at this point,
because we were motivated by the particular efficiencies of a list, we commit
to using some form of List implementation as the internal container for our
unbounded Linear classes. Here, the initializer takes a parameter, container,
that is bound to the class associated with the desired container object. In
Python, classes are, themselves, objects. Calling these class-objects creates in-
stances of that class. Here, the variable container is simply a formal-parameter,
a stand-in, for the desired class. On line 7 the underlying List is allocated.
While we cannot commit to a reordering policy, here, without loss of gener-
ality, we decide to add elements to the tail of the underlying list. We could have
made the decision to add them to the head of the list, but then, in the case of
Stack implementations, we would be forced to remove or pop values from the
head, not the tail, which seems contrary to the default behavior of list’s pop
method. Because we can’t commit to a remove implementation, peek (which
allows us to see the about-to-be-removed value) must also remain abstract. The
special method __len__ simply returns the length of the underlying container.
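Since the initializer itself falls outside this excerpt, a sketch of what the text describes (the class name and default are assumptions) might read:

```python
class UnboundedLinearSketch:
    """A sketch of the initializer described in the text; details assumed."""
    __slots__ = ["_data"]

    def __init__(self, container=list):
        # container is a class object; calling it allocates the backing List
        self._data = container()

    def add(self, value):
        # add at the tail, matching the orientation chosen in the text
        self._data.append(value)

    def __len__(self):
        return len(self._data)

s = UnboundedLinearSketch()
s.add(1)
s.add(2)
assert len(s) == 2 and s._data == [1, 2]
```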
Finally, we can develop an implementation of a stack. This structure is
an unbounded LIFO structure, so we will extend two abstract base classes,
UnboundedLinear and LIFO, as we have just seen. You should note that these
two interfaces do not overlap—they declare methods with different signatures—
therefore any reference to a method specified by either class is uniquely defined.
class Stack(UnboundedLinear,LIFO,Freezable):
The initializer provides a mechanism for declaring the container type, as dis-
cussed earlier. This is simply passed to the UnboundedLinear class for alloca-
tion, and then the structure is extended by pushing any initial values. While,
internally, we imagine the stack placed in the list with the top on the right, ex-
ternally, we represent it with the top on the left. This requires us to reverse the
def peek(self):
    """The next value that will be popped. Equivalent to top."""
    assert not self.empty
    return self._data[-1]
Notice that remove calls pop on the underlying List, the entire motivation for
our particular orientation of the stack in the underlying container.
Note that extending LIFO causes the push and pop methods and the top
property to be defined.
While we want to hide the implementation from the user, it would be nice
to be able to iterate through the values of the Stack in the order that we would
expect them to appear from successive remove operations. For this reason, we
construct an iterator that traverses the underlying list in reverse order:
def __iter__(self):
    """An iterator over the Stack."""
    return reversed(self._data)
The reversed iterator is an itertool that simply takes an underlying iterator
(here, the one provided by the list self._data) and reverses the order that
the elements appear. This requires (possibly) substantial storage to maintain
the intermediate, in-order list. A more efficient technique might write a Stack-
specific generator that directly yields the values of the underlying list in reverse
order. We leave that as an exercise for the reader.
Now that we have defined an iterator for the stack we can appreciate the
general implementation of the special __str__ method provided by the Linear
interface. It simply returns a string that represents a list of values in iteration
order:
def __str__(self):
    """Return a string representation of this linear."""
    return str(list(self))
The special __repr__ method, recall, depends directly on the meaningful rep-
resentation of __str__. We must always be observant that the special meth-
ods __init__, __iter__, __str__, and __repr__ work consistently, especially
when they are defined across many related classes.
FIFO, structures. Queues are typically used to solve problems in the order that
they were first encountered. “To do” lists and agenda-driven algorithms typi-
cally have some form of queue as the fundamental control mechanism. When
removed values are reinserted (as, obviously, the least recently added item),
the structure provides a round robin organization. Thus, queues are commonly
found in systems that periodically provide exclusive access to shared resources.
Operating systems often turn over computational resources to processes in ap-
proximate round-robin order.
The ordering concept of FIFO structures is so important that we provide a base
class implementation of three aliases commonly used with queue-like structures:
the enqueue and dequeue methods, and the next property.
@checkdoc
class FIFO(metaclass=abc.ABCMeta):
    """
    A base class for queue-like objects.
    """

    @property
    def next(self):
        """Return next removed value from data structure."""
        return self.peek()
Again, we use these methods when we wish to emphasize the importance of
FIFO ordering of values stored within the linear structure.
Paralleling our construction of the Stack class, the Queue class is based on
the UnboundedLinear and FIFO classes. Between these two base classes, most
of the Queue methods have been implemented.
class Queue(UnboundedLinear,FIFO):
the head of a list. When queues are long, the list class may induce significant
penalties for removing from the head of the structure. If we know the Queue
will hold, ever, only a few elements, we can always pass a container=list
parameter when we initialize the Queue. These are the subtle choices that go
beyond simply meeting correctness requirements.
What remains, recall, is the remove method, which determines the first-in,
first-out policy. Elements are added at the tail of the list, and are removed from
the head. Since elements only appear at the head after all longer-resident values
have been removed, the policy is FIFO. The peek method allows us to see this
value before it is removed from the structure:
@mutatormethod
def remove(self):
    """Wrapper method for dequeue to comply with ABC Linear."""
    assert not self.empty
    return self._data.pop(0)

def peek(self):
    """The next value that will be dequeued. Equivalent to next."""
    assert not self.empty
    return self._data[0]
Here, the use of the unaesthetic pop operation is hidden by our implementation.
We are thankful.
Because the natural order of the elements to be removed from the queue
parallels the natural order of elements that appear in the underlying list, the
central __iter__ method can simply fall back on iterator for the underlying
list.
def __iter__(self):
    """An iterator over the Queue."""
    return iter(self._data)
Notice, by the way, that we have to explicitly ask (via iter) for the iterator from
the list class. In the Stack iterator, we used reversed, an iterator that asked for
the subordinate iterator from the List class on our behalf.
2 For example, many “smart card” technologies allow the embedding of limited computation in a
low-cost portable device. It is reasonable, in these situations, to impose a limit on certain resources,
like stack or queue sizes, fully disclosing them as part of the technology’s API. Programming for
these devices can be challenging, but it is often rewarding to realize how efficient computation can
occur within these imposed limits.
@property
def empty(self):
    """Data structure is empty."""
    return self._size == 0

@property
def capacity(self):
    """Return the size of the underlying buffer."""
    return self._extent

@property
def full(self):
    """Data structure cannot accept more items."""
    return self._size == self._extent

@abc.abstractmethod
def add(self, value):
    """Add a value to the linear structure."""
    ...

def remove(self):
    """Remove a value from the linear structure."""
    if self.empty:
        raise EmptyError()
    else:
        result = self._data[self._next]
        self._data[self._next] = None
        self._next = (self._next+1) % self._extent
        self._size -= 1
        return result

def peek(self):
    """The next value that will be popped. Equivalent to top."""
    if self.empty:
        raise EmptyError()
    else:
        return self._data[self._next]

def __len__(self):
    """The number of items in the linear structure."""
    return self._size
By implementing the remove operation at this point in the class hierarchy, we
are making the (arbitrary) decision that all buffer-based linear structures will
be oriented in the same direction—the head of the structure is found at index
def __iter__(self):
    """An iterator over the Stack."""
    return (self._data[i % self._extent]
            for i in range(self._next, self._next+self._size))
Notice that we update the _next index, not by subtracting 1, but by adding the
extent less 1. This ensures that all the intermediate values are non-negative,
where the modulus operator (%) is well-behaved.3 Thus, the val-
3 The concern, here, is that the definition of n%m for n<0 is different for different languages, though
most strive to have the remainder computation meet the constraint that n%m == n-(n/m)*m. In this
particular application, we would like to have the result be a valid index, a value that should never
ues produced by the next method are found at indices that increase (wrapping
around to index 0, if necessary), over time. The iterator, then, must carefully
return the values starting at index _next.
def __iter__(self):
    """An iterator over the Queue."""
    return (self._data[i % self._extent]
            for i in range(self._next, self._next+self._size))
Here, of course, the list is made longer simply by increasing _size, provided
that the queue is not full. There is no need to worry about the magnitude of
_size since, if the queue was not full, _size must have been less than _extent.
Because we have oriented the buffer-based stack and queue the same way
(from the perspective of the remove operation), the iterators are precisely the
same. It is reasonable, at this point, to consider whether the __iter__ method
should be pushed back up into the super-class, BoundedLinear, where it might
serve to motivate a similar orientation in other subclasses of BoundedLinear.
be negative. In Python, the sign of the result is always the same as the sign of the divisor (m, above).
In Java, the sign of the result is always the same as the dividend (n). In C the result is defined by the
underlying hardware. From our point of view, Python computes the correct result, but the answer
is never incorrect if we take the extra step described in the text, to guarantee the dividend is always
non-negative.
together to form a doubly linked list. The first element of the deque is located
somewhere within the first block, and the last element is located somewhere
within the last block.4 All other elements are stored contiguously from after the
first element, through successive blocks, and until the last element in the last
block. The deque also quietly supports a bounded version of the structure by
allowing the user to specify a maximum length associated with the structure.
Appending values to either end causes excess elements to be lost off the other.
Our version of deque maintains several state variables, including _list, the
underlying doubly linked list; _first and _len, that determine the span of
the deque across the blocks of store; the _maxlen variable is either None if the
deque is unbounded, or the maximum length of the buffer. Here’s how our
implementation of the deque class begins:
from collections import Iterable
from structure.doubly_linked_list import DoublyLinkedList
from structure.decorator import checkdoc, hashmethod, mutatormethod
@checkdoc
class deque(Iterable):
    # size of the blocks of contiguous elements.
    blocksize = 64
4 For simplicity, we assume there is always at least one block backing the deque.
def __getitem__(self, index):
    """Get a value stored at a position in the deque indicated by index."""
    if index >= len(self):
        raise IndexError("Index {} out of range.".format(index))
    (block, line) = self._addr(self._first + index)
    return self._list[block][line]

def __setitem__(self, index, value):
    """Assign the element of the deque at index to value."""
    if index >= len(self):
        raise IndexError("Index {} out of range.".format(index))
    (block, line) = self._addr(self._first + index)
    self._list[block][line] = value
The reader may find it useful to walk through these methods with index set to
zero: it simply accesses the _first entry of block 0 of the _list. It is important
to observe, here, that the performance of this routine is O(1) and optimized
to search for the appropriate block from the nearest end of the structure (see
page ??).
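The _addr helper used above is not shown in this excerpt; a plausible sketch (the straightforward, non-optimized version) converts a logical offset into (block, line) coordinates within fixed-size blocks:

```python
blocksize = 64  # matches deque.blocksize in the text

def addr(offset):
    # block index, then position within that block; a sketch of _addr
    return (offset // blocksize, offset % blocksize)

assert addr(0) == (0, 0)
assert addr(64) == (1, 0)
assert addr(70) == (1, 6)
```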
We now consider the operations that increase the size of the structure:
append and appendleft. The append operation checks to see if the last block in
the _list is full, potentially extends the list, and then writes the desired value
at the block and index indicated by the new value of _last:
def append(self, value):
    """Append value to the far end of the deque."""
    (block, line) = self._addr(self._last+1)
    if block >= len(self._list):
        self._list.append(deque.blocksize*[None])
    self._len += 1
    self._list[block][line] = value
    if (self._maxlen is not None) and (self._len > self._maxlen):
        self.popleft()
Here, we note that increasing the size of the list simply requires incrementing
_len; _first stays where it is and _last is a property whose value depends on
_first and _len. The appendleft operation is symmetric, save for the need
to update _first:
def appendleft(self, value):
    """Append value to the near end of the list."""
    self._first = (self._first-1) % deque.blocksize
    if self._first == deque.blocksize-1:
        self._list.insert(0, deque.blocksize*[None])
    self._len += 1
    self._list[0][self._first] = value
    if (self._maxlen is not None) and (self._len > self._maxlen):
        self.pop()
An important part of both of these methods is to enforce any buffer limits that
may have been imposed at the construction of the deque.
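The bounded behavior described here can be observed with Python's builtin deque, which our implementation mirrors:

```python
from collections import deque

d = deque([1, 2, 3], maxlen=3)
d.append(4)          # full: the value at the far left is lost
assert list(d) == [2, 3, 4]
d.appendleft(0)      # full again: the value at the far right is lost
assert list(d) == [0, 2, 3]
```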
All other growth operations are cast in terms of one of the append methods.
For example, extend simply (and efficiently) extends the list through multiple
appends:
def extend(self, iterable):
    """Add elements from iterable to the far end of the deque."""
    for v in iterable:
        self.append(v)
The extendleft method is the same, but performs an appendleft, instead,
reversing the order of the elements placed at the beginning of the deque.
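The reversal performed by extendleft is easy to see with the builtin deque:

```python
from collections import deque

d = deque([3, 4])
d.extendleft([2, 1, 0])   # each value is appendleft-ed, reversing the order
assert list(d) == [0, 1, 2, 3, 4]
```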
There are no surprises in the pop and popleft methods. The reader is en-
couraged to sketch out and verify possible implementations.
A single rotate method allows the elements of the deque to be shifted, end-
around, to the right (for positive n) or left (for negative n). With a little thought
it becomes clear that these rotates are related and that smaller left (or right)
rotates are preferred over larger right (or left) rotates:
def rotate(self, n=1):
    """Pop elements from one end of list and append on opposite.

    If n > 0, this method removes n elements from far end of list and
    successively appends them on the near end; a rotate right of n.
    If n < 0, the opposite happens: a rotate left of n.

    Args:
        n: number of elements to rotate deque to the right.
    """
    l = len(self)
    l2 = l//2
    n = ((n + l2) % l) - l2
    while n < 0:
        self.append(self.popleft())
        n += 1
    while n > 0:
        self.appendleft(self.pop())
        n -= 1
While not a feature of Python’s deque objects, insertion and deletion of values at
arbitrary locations can be accomplished by combinations of rotations, appends,
and pops.
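The normalization step at the heart of rotate deserves a closer look: a requested rotation n is mapped into the half-open range [−l//2, l−l//2) so that the shorter direction is always chosen. A small self-contained check:

```python
def normalize(n, l):
    # maps a rotation amount into [-l//2, l - l//2), as in rotate above
    l2 = l // 2
    return ((n + l2) % l) - l2

assert normalize(7, 8) == -1    # rotating right 7 == rotating left 1
assert normalize(1, 8) == 1     # small rotations are unchanged
assert normalize(-5, 8) == 3    # rotating left 5 == rotating right 3
```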
Finally, because deque structures support __len__ and __getitem__ meth-
ods, Python constructs a default iterator. Unfortunately, __getitem__ is a linear
operation, so a traversal of the structure with the default iterator leads to O(n²)
performance in time. This can be improved if we use a generator to manage a
long-running traversal over the underlying doubly linked list. One implementa-
tion is as follows:
def __iter__(self):
    """Traverse the elements of the deque from head to tail."""
    lit = iter(self._list)
    offset = self._first
    block = None
    while offset <= self._last:
        index = offset % deque.blocksize
        if block is None:
            block = next(lit)
        yield block[index]
        offset += 1
        if index == deque.blocksize-1:
            block = None
This generator is quite efficient, but it is not as pretty as one might expect,
largely because care has to be taken when _first or _last fall at one of the
ends of an underlying block.
Python’s deque, which is implemented in a manner similar to this, provides
a general alternative to linear structure implementation. Because its versatility
provides many features that we may not want accessible in our linear
structures, it may allow users to violate the FIFO or LIFO behaviors we
desire. It remains, however, a suitable container atop which we may implement
more restrictive interfaces that support only FIFO or LIFO operations. Because,
for example, the deque does not provide a general insert method, it cannot
be classified as a list. Such a method would be easy to construct, provided we
think carefully about which element should be removed if the insert grows the
deque beyond its desired maximum length.
6.7 Conclusions
Often, one of the most important features of a container class is to take away
features of an underlying container. For linear structures, we seek to guarantee
certain order relationships between the values inserted into the linear structure
and those extracted from it. In our examples and, as we shall continue to see in
later chapters, these relationships are important to making sure that algorithms
achieve a desired performance or, even more critically, compute the correct
result.
Chapter 7
Sorting
Computers spend a considerable amount of their time keeping data in order.
When we view a directory or folder, the items are sorted by name or type or
modification date. When we search the Web, the results are returned sorted by
“applicability.” At the end of the month, our checks come back from the bank
sorted by number, and our deposits are sorted by date. Clearly, in the grand
scheme of things, sorting is an important function of computers. Not surpris-
ingly, data structures can play a significant role in making sorts run quickly. This
chapter begins an investigation of sorting methods.
1 We focus on lists of integers to maintain a simple approach. These techniques, of course, can be
applied to other container classes, provided that some relative comparison can be made between
two elements. This is discussed in Section 7.9.
value:  40  2  1 43  3 65  0 −1 58  3 42  4
index:   0  1  2  3  4  5  6  7  8  9 10 11
(a) Unordered

value:  −1  0  1  2  3  3  4 40 42 43 58 65
index:   0  1  2  3  4  5  6  7  8  9 10 11
(b) Sorted

Figure 7.1 The relations between entries in unordered and sorted lists of integers.
(l[i-1],l[i]) = (l[i],l[i-1])
Observe that the only potentially time-consuming operations that occur in this
sort are comparisons (on line 4) and exchanges (on line 5). While the cost
of comparing integers is relatively small, if each element of the list were to
contain a long string (for example, a DNA sequence) or a complex object (for
example, a Library of Congress entry), then the comparison of two values might
be a computationally intensive operation. Similarly, the cost of performing an
exchange is to be avoided.2 We can, therefore, restrict our attention to the
number of comparison and exchange (or data movement) operations that occur
in sorts in order to adequately evaluate their performance.
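Only the exchange line of the bubble sort listing survives in this excerpt; the following reconstruction is a sketch, and the line numbering the text refers to belongs to the book's listing, not this one:

```python
def bubble(l, n):
    # each pass bubbles the largest remaining value to the right
    for last in range(n, 1, -1):
        for i in range(1, last):
            if l[i-1] > l[i]:                     # comparison
                (l[i-1], l[i]) = (l[i], l[i-1])   # exchange

data = [40, 2, 1, 43, 3, 65, 0, -1, 58, 3, 42, 4]
bubble(data, len(data))
assert data == sorted(data)
```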
In bubble sort each pass of the bubbling phase performs n − 1 comparisons
and as many as n − 1 exchanges. Thus the worst-case cost of performing bubble
sort is O(n²) operations. In the best case, none of the comparisons leads to an
exchange. Even then, though, the algorithm has quadratic behavior.3
Most of us are inefficient sorters. Anyone having to sort a deck of cards or
a stack of checks is familiar with the feeling that there must be a better way to
do this. As we shall see, there probably is: most common sorting techniques
used in day-to-day life run in O(n²) time, whereas the best single processor
comparison-based sorting techniques are expected to run in only O(n log n)
time. (If multiple processors are used, we can reduce this to O(log n) time,
but that algorithm is beyond the scope of this text.) We shall investigate some
² In languages like Python, where objects are manipulated through references, the cost of an exchange of even large objects is usually fairly trivial. In some languages, however, the cost of exchanging large values stored directly in the list is a real concern.
³ If, as we noted in Figure 7.2, we stopped when we performed a pass with no exchanges, bubble sort would run in O(n) time on data that were already sorted. Still, the average case would be quadratic.
Bubble
40 2 1 43 3 65 0 −1 58 3 42 4
2 1 40 3 43 0 −1 58 3 42 4 65
1 2 3 40 0 −1 43 3 42 4 58 65
1 2 3 0 −1 40 3 42 4 43 58 65
1 2 0 −1 3 3 40 4 42 43 58 65
1 0 −1 2 3 3 4 40 42 43 58 65
0 −1 1 2 3 3 4 40 42 43 58 65
−1 0 1 2 3 3 4 40 42 43 58 65
Detectable finish
−1 0 1 2 3 3 4 40 42 43 58 65
−1 0 1 2 3 3 4 40 42 43 58 65
−1 0 1 2 3 3 4 40 42 43 58 65
−1 0 1 2 3 3 4 40 42 43 58 65
Figure 7.2 The passes of bubble sort: hops indicate “bubbling up” of large values.
Shaded values are in sorted order. A pass with no exchanges indicates sorted data.
other sorting techniques that run in O(n²) time, on average, and some that run in O(n log n) time. In the end we will attempt to understand what makes the
successful sorts successful.
Our first two sorting techniques are based on natural analogies.
(Notice that the maximum is not updated unless a larger value is found.) Now, consider where this maximum value would be found if the data were sorted: it should be clear that it belongs to the right, in the highest indexed location. All we need to do is swap the last element of the unordered elements of the list with the maximum. Once this swap is completed, we know that at least that one value is in the correct location—the one on the right—and we logically reduce the size of the problem—the number of unsorted values—by one. If we correctly place each of the n − 1 largest values in successive passes (see Figure 7.3), we have selection sort. Here is how the entire method appears in Python:
def selection(l,n):
    for sorted in range(n):
        unsorted = n-sorted
        max = 0   # assume element 0 is the largest
        for loc in range(1,unsorted):
            if l[max] < l[loc]:
                # bigger value found
                max = loc
        # l[max] is largest value; swap it into the last unsorted slot
        (l[max],l[unsorted-1]) = (l[unsorted-1],l[max])
We can think of selection sort as an optimized bubble sort that simply makes one exchange—the one that moves the maximum unsorted value to its final location. Bubble sort performs as many as n − 1 exchanges on each pass. Like bubble sort, however, selection sort’s performance is dominated by O(n²) time for comparisons.
The performance of selection sort is independent of the order of the data: if the data are already sorted, it takes selection sort just as long to sort as if the data were unsorted. For this reason, selection sort is only an interesting thought experiment; many other sorts perform better on all inputs. We now think about sorting with a slightly different analogy.
7.3 Insertion Sort
def _is(l,low,high):
    for i in range(low+1,high+1):
        item = l[i]
        loc = i
        while low < loc and item < l[loc-1]:
            l[loc] = l[loc-1]   # slide larger sorted values right
            loc -= 1
        l[loc] = item
40 2 1 43 3 4 0 −1 58 3 42 65
40 2 1 43 3 4 0 −1 42 3 58 65
40 2 1 3 3 4 0 −1 42 43 58 65
40 2 1 3 3 4 0 −1 42 43 58 65
−1 2 1 3 3 4 0 40 42 43 58 65
−1 2 1 3 3 0 4 40 42 43 58 65
−1 2 1 0 3 3 4 40 42 43 58 65
−1 2 1 0 3 3 4 40 42 43 58 65
−1 0 1 2 3 3 4 40 42 43 58 65
−1 0 1 2 3 3 4 40 42 43 58 65
−1 0 1 2 3 3 4 40 42 43 58 65
Figure 7.3 Profile of the passes of selection sort: shaded values are sorted. Circled
values are maximum among unsorted values and are moved to the low end of sorted
values on each pass.
A total of n − 1 passes are made over the list, with a new unsorted value inserted each time. The value inserted is likely to be neither a new minimum nor maximum value. Indeed, if the list was initially unordered, the value will, on average, end up near the middle of the previously sorted values, causing the inner loop to terminate early. On random data the running time of insertion sort is expected to be dominated by O(n²) compares and data movements (most of the compares will lead to the movement of a data value).
If the list is initially in order, one compare is needed at every pass to verify that the value is as big as all the previously sorted values. Thus, the inner loop is executed exactly once for each of n − 1 passes. The best-case running time performance of the sort is therefore dominated by O(n) comparisons (there are
no movements of data within the list). Because of this characteristic, insertion
sort is often used when data are very nearly ordered (imagine the cost of adding
a small number of new cards to an already sorted poker hand).
In contrast, if the list was previously in reverse order, the value must be compared with every sorted value to find the correct location. As the comparisons are made, the larger values are moved to the right to make room for the new value. The result is that each of O(n²) compares leads to a data movement, and the worst-case running time of the algorithm is O(n²).
7.4 Quicksort
Since the process of sorting numbers consists of moving each value to its ulti-
mate location in the sorted list, we might make some progress toward a solution
if we could move a single value to its ultimate location. This idea forms the basis
of a fast sorting technique called quicksort.
One way to find the correct location of, say, the leftmost value—called a
pivot—in an unsorted list is to rearrange the values so that all the smaller val-
ues appear to the left of the pivot, and all the larger values appear to the right.
One method of partitioning the data is shown here. It returns the final location
for what was originally the leftmost value:
def _partition(l,left,right):
    while left < right:
        while left < right and l[left] < l[right]:
            right -= 1
        if left != right:
            (l[left],l[right]) = (l[right],l[left])
            left += 1
        while left < right and l[left] < l[right]:
            left += 1
        if left != right:
            (l[left],l[right]) = (l[right],l[left])
            right -= 1
    return left

Insert
40 2 1 43 3 65 0 −1 58 3 42 4
2 40 1 43 3 65 0 −1 58 3 42 4
1 2 40 43 3 65 0 −1 58 3 42 4
1 2 40 43 3 65 0 −1 58 3 42 4
1 2 3 40 43 65 0 −1 58 3 42 4
1 2 3 40 43 65 0 −1 58 3 42 4
0 1 2 3 40 43 65 −1 58 3 42 4
−1 0 1 2 3 40 43 65 58 3 42 4
−1 0 1 2 3 40 43 58 65 3 42 4
−1 0 1 2 3 3 40 43 58 65 42 4
−1 0 1 2 3 3 40 42 43 58 65 4
−1 0 1 2 3 3 4 40 42 43 58 65
Figure 7.4 Profile of the passes of insertion sort: shaded values form a “hand” of sorted values. Circled values are successively inserted into the hand.

40 2 1 43 3 65 0 −1 58 3 42 4
4 2 1 43 3 65 0 −1 58 3 42 40
4 2 1 40 3 65 0 −1 58 3 42 43
4 2 1 3 3 65 0 −1 58 40 42 43
4 2 1 3 3 40 0 −1 58 65 42 43
4 2 1 3 3 −1 0 40 58 65 42 43
4 2 1 3 3 −1 0 40 58 65 42 43
Figure 7.5 The partitioning of a list’s values based on the (shaded) pivot value 40. Snapshots depict the state of the data after the if statements of the partition method.
The indices left and right start at the two ends of the list (see Figure 7.5) and
move toward each other until they coincide. The pivot value, being leftmost in
the list, is indexed by left. Everything to the left of left is smaller than the
pivot, while everything to the right of right is larger. Each step of the main
loop compares the left and right values and, if they’re out of order, exchanges
them. Every time an exchange occurs the index (left or right) that references
the pivot value is alternated. In any case, the nonpivot variable is moved toward
40 2 1 43 3 65 0 −1 58 3 42 4
partition
4 2 1 3 3 −1 0 40 58 65 42 43
0 2 1 3 3 −1 4 43 42 58 65
−1 0 1 3 3 2 42 43 65
−1 1 3 3 2 42
2 3 3
2 3
−1 0 1 2 3 3 4 40 42 43 58 65
Figure 7.6 Profile of quicksort: leftmost value (the circled pivot) is used to position
value in final location (indicated by shaded) and partition list into relatively smaller and
larger values. Recursive application of partitioning leads to quicksort.
def _qs(l,low,high):
    if low < high:
        i = _partition(l,low,high)
        _qs(l,low,i-1)
        _qs(l,i+1,high)
In practice, of course, the splitting of the values is not always optimal (see the
placement of the value 4 in Figure 7.6), but a careful analysis suggests that even
with these “tough breaks” quicksort takes only O(n log n) time.
When either sorted or reverse-sorted data are to be sorted by quicksort, the
results are disappointing. This is because the pivot value selected (here, the
leftmost value) finds its ultimate location at one end of the list or the other.
This reduces the sort of n values to a sort of n − 1 values (and not n/2), and the sort requires O(n) passes of an O(n)-step partition. The result is an O(n²) sort.
Since nearly sorted data are fairly common, this result is to be avoided.
Notice that picking the leftmost value is not special. If, instead, we attempt
to find the correct location for the middle value, then other arrangements of
data will cause the degenerate behavior. In short, for any specific or determin-
istic partitioning technique, a degenerate arrangement exists. The key to more
consistent performance, then, is a nondeterministic partitioning that correctly
places a value selected at random (see Problem ??). There is, of course, a very
unlikely chance that the data are in order and the positions selected induce a
degenerate behavior, but that chance is small and successive runs of the sorting
algorithm on the same data are exceedingly unlikely to exhibit the same behavior. So, although the worst-case behavior is still O(n²), its expected behavior is O(n log n).
Improving Quicksort
While the natural implementation strategy for quicksort is recursive, it’s impor-
tant to realize that Python has a limited number of method calls that can be
outstanding at one time. A typical value for this limit is 1000. That means, for
example, sorting a list of 1024 values that is already sorted not only incurs a per-
formance penalty, it will actually fail when the recursive calls cascade beyond
1000 deep. We consider how this limit can be avoided.
When data is randomly ordered, there can be as many as O(log n) outstand-
ing calls to quicksort. The limit of 1000 levels of recursion is not a problem in
cases where the partitioning of the list is fairly even. However, quicksort can
have degenerate behavior when the correct location for the selected pivot is at
one end of the list or the other. The reason is that the larger half of the par-
titioned list is nearly as large as the list itself. This can lead to O(n) levels of
recursion.
There are two ways that this can be dealt with. First, we can choose, at each
partitioning, a random element as the pivot. Our partitioning strategy is easily
adapted to this approach: we simply begin the partition by swapping the left
element with some other element picked at random. In this way, the selection
of extreme values at each partitioning stage is quite unlikely. It is also possible
to simply shuffle the data at the beginning of the sort, with similar expected
runtime behavior. Randomization leads to expected O(n log n) running time
and O(log n) stack depth.
A second optimization is to note that once the partition has been completed,
the remaining tasks in quicksort are recursive calls. When a procedure ends
with a recursive call it is called tail recursion, which can be converted into a
simple call-free loop. Quicksort ends with two calls: one can be converted into
a loop, and the other cannot be eliminated in any direct way. Thus the following
procedure is equivalent to our previous formulation:
def _qs(l,low,high):
    while low < high:
        i = _partition(l,low,high)
        _qs(l,low,i-1)
        low = i+1
The order of the recursive calls in quicksort is, of course, immaterial. If we
choose to convert one of the recursive calls, we should choose to eliminate the
call that sorts the large half of the list. The following code eliminates half of the
calls and ensures the depth of the stack never grows larger than O(log n) call
frames:
def _qs(l,low,high):
    while low < high:
        i = _partition(l,low,high)
        if i < (low+high)//2:
            # sort fewer smaller numbers recursively
            _qs(l,low,i-1)
            low = i+1
        else:
            # sort fewer larger numbers recursively
            _qs(l,i+1,high)
            high = i-1
Quicksort is an excellent sort when data are to be sorted with little extra
space. Because the speed of partitioning depends on the random access nature
of lists, quicksort is not suitable when used with structures that don’t support
efficient indexing. In these cases, however, other fast sorts are often possible.
7.5 Mergesort
Suppose that two friends are to sort a list of values. One approach might be
to divide the deck in half. Each person then sorts one of two half-decks. The
sorted deck is then easily constructed by combining the two sorted half-decks.
This careful interleaving of sorted values is called a merge.
It is straightforward to see that a merge takes at least O(n) time, because every value has to be moved into the destination deck. Still, within n − 1 comparisons, the merge must be finished. Since each of the n − 1 comparisons (and potential movements of data) takes at most constant time, the merge is no worse than linear.
There are, of course, some tricky aspects to the merge operation—for exam-
ple, it is possible that all the cards in one half-deck are smaller than all the cards
in the other. Still, the performance of the following merge code is O(n):
def _merge(l,l0,h0,l1,h1,temp):
    # save left sub-array into temp
    for i in range(l0,l1):
        temp[i] = l[i]
    target = l0
    while l0 <= h0 and l1 <= h1:
        if l[l1] < temp[l0]:
            l[target] = l[l1]
            l1 += 1
        else:
            l[target] = temp[l0]
            l0 += 1
        target += 1
    while l0 <= h0:
        l[target] = temp[l0]
        target += 1
        l0 += 1
This code is fairly general, but a little tricky to understand (see Figure 7.7). We assume that the two sublists to be merged are adjacent within l—we copy the lower half of the range into temp while the upper half of the range remains in l (see Figure 7.7a). The first loop compares the first remaining element of each list to determine which should be copied over to l first (Figure 7.7b). That loop continues until one of the two sublists is emptied (Figure 7.7c). If l is the emptied sublist, the remainder of the temp list is transferred (Figure 7.7d). If the temp list was emptied, the remainder of the l list is already located in the correct place—in l!
Returning to our two friends, we note that before the two lists are merged
each of the two friends is faced with sorting half the cards. How should this be
done? If a deck contains fewer than two cards, it’s already sorted. Otherwise,
each person could recursively hand off half of his or her respective deck (now
one-fourth of the entire deck) to a new individual. Once these small sorts are
finished, the quarter decks are merged, finishing the sort of the half decks, and
the two half decks are merged to construct a completely sorted deck. Thus, we
might consider a new sort, called mergesort, that recursively splits, sorts, and
reconstructs, through merging, a deck of cards. The logical “phases” of merge-
sort are depicted in Figure 7.8.
def _ms(l,low,high,temp):
    if high <= low:
        return
    mid = (low+high)//2
    if low < mid:
        _ms(l,low,mid,temp)
    if mid+1 < high:
        _ms(l,mid+1,high,temp)
    _merge(l,low,mid,mid+1,high,temp)
Note that this sort requires a temporary list to perform the merging.
data −1 0 3 4 42 58
(a) 0 1 2 3 4 5 6 7 8 9 10 11 12 13
temp 1 2 3 40 43 65
di 6 ti 0 ri 0
data −1 0 1 2 3 3 4 42 58
(b) 0 1 2 3 4 5 6 7 8 9 10 11 12 13
temp 40 43 65
di 8 ti 3 ri 5
data −1 0 1 2 3 3 4 40 42 43 58
(c) 0 1 2 3 4 5 6 7 8 9 10 11 12 13
temp 65
di 12 ti 5 ri 11
data −1 0 1 2 3 3 4 40 42 43 58 65
(d) 0 1 2 3 4 5 6 7 8 9 10 11 12 13
temp
di 12 ti 6 ri 12
Figure 7.7 Four stages of a merge of two six element lists (shaded entries are partic-
ipating values): (a) the initial location of data; (b) the merge of several values; (c) the
point at which a list is emptied; and (d) the final result.
40 2 1 43 3 65 0 −1 58 3 42 4 Split
40 2 1 43 3 65 0 −1 58 3 42 4
40 2 1 43 3 65 0 −1 58 3 42 4
40 2 1 43 3 65 0 −1 58 3 42 4
2 1 3 65 −1 58 42 4
Merge
40 1 2 43 3 65 0 −1 58 3 4 42
1 2 40 3 43 65 −1 0 58 3 4 42
1 2 3 40 43 65 −1 0 3 4 42 58
−1 0 1 2 3 3 4 40 42 43 58 65
Figure 7.8 Profile of mergesort: values are recursively split into unsorted lists that are
then recursively merged into ascending order.
This temporary list is only used by a single merge at a time, so it is allocated once and garbage collected after the sort. We hide this detail with a public wrapper procedure that allocates the list and calls the recursive sort:
def mergesort(l,n):
    temp = len(l)*[None]
    _ms(l,0,n-1,temp)
Clearly, the depth of the splitting is determined by the number of times that n can be divided in two and still have a value of 1 or greater: log₂ n. At each level of splitting, every value must be merged into its respective sublist. It follows that at each logical level, there are O(n) compares over all the merges. Since there are log₂ n levels, we have O(n · log n) units of work in performing a mergesort.
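Collected into a single self-contained sketch and exercised on the chapter's example data, the pieces fit together as follows:

```python
def _merge(l, l0, h0, l1, h1, temp):
    # merge the adjacent sorted runs l[l0..h0] and l[l1..h1];
    # the left run is first saved in temp so it cannot be overwritten
    for i in range(l0, l1):
        temp[i] = l[i]
    target = l0
    while l0 <= h0 and l1 <= h1:
        if l[l1] < temp[l0]:
            l[target] = l[l1]
            l1 += 1
        else:
            l[target] = temp[l0]
            l0 += 1
        target += 1
    while l0 <= h0:
        l[target] = temp[l0]
        target += 1
        l0 += 1

def _ms(l, low, high, temp):
    # recursively sort the two halves, then merge them
    if high <= low:
        return
    mid = (low + high) // 2
    _ms(l, low, mid, temp)
    _ms(l, mid + 1, high, temp)
    _merge(l, low, mid, mid + 1, high, temp)

def mergesort(l, n):
    temp = n * [None]
    _ms(l, 0, n - 1, temp)

data = [40, 2, 1, 43, 3, 65, 0, -1, 58, 3, 42, 4]
mergesort(data, len(data))
# data is now [-1, 0, 1, 2, 3, 3, 4, 40, 42, 43, 58, 65]
```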
One of the unappealing aspects of mergesort is that it is difficult to merge
two lists without significant extra memory. If we could avoid the use of this
extra space without significant increases in the number of comparisons or data
movements, then we would have an excellent sorting technique.
Recently, there has been renewed interest in designing sorts that are com-
binations of the sorts we have seen so far. Typically, the approach is to select
an appropriate sort based on the characteristics of the input data. In our next
section, we see how this adaptation can be implemented.
Figure 7.9 The relative performance of mergesort and insertion sort for lists of length
1 through 256. Note that mergesort is more efficient than insertion sort only for lists
larger than 32 elements.
Since insertion sort (on random data) appears to be more efficient than
merge sort when a list has 64 or fewer elements, we can modify our mergesort
algorithm to take advantage of that fact:
def adaptivemergesort(l,n):
    temp = n*[None]
    _ams(l,0,n-1,temp)

def _ams(l,low,high,temp):
    n = high-low+1
    if n < 64:
        _is(l,low,high)
    else:
        if high <= low:
            return
        mid = (low+high)//2
        if low < mid:
            _ams(l,low,mid,temp)
        if mid+1 < high:
            _ams(l,mid+1,high,temp)
        _merge(l,low,mid,mid+1,high,temp)
Clearly, this adaptive mergesort behaves as insertion sort for smaller list sizes, but for larger lists the mergesort call tree is trimmed of its bottom 6 levels of recursive calls in favor of the relatively fast insertion sort technique. In Figure ??, the adaptive merge sort takes on the characteristics of insertion sort for lists of length 64 or less, and the shape of the curve is determined primarily by the characteristics of mergesort for larger lists.
Quicksort can benefit from the same type of analysis and we find that, similarly, for lists of length 64 or less, insertion sort is preferred. To introduce the optimization we note that once partitioning has divided the values into sublists that are relatively small and large compared to the pivot, the values in each sublist must ultimately occupy that span of the sorted list. We could sort small sublists immediately, with insertion sort, or we can delay the sorting of small sublists until after the quicksort has otherwise completed its activity. A single insertion sort on the entire list will then be fairly efficient: each element is close to its final destination (and thus quickly slid into place with an insertion pass) and elements that were pivots are in precisely the correct location (so insertion sort will not move them at all). Our optimized adaptive quicksort appears, then, as follows:
def quicksorta(l,n):
    shuffle(l,n)
    _qsa(l,0,n-1)
    insertion(l,0,n-1)

def _qsa(l,low,high):
    while low+64 < high:
        i = _partition(l,low,high)
        if i < (high+low)//2:
            _qsa(l,low,i-1)
            low = i+1
        else:
            _qsa(l,i+1,high)
            high = i-1

Figure 7.10 The relative performance of insertion, merge, and adaptive merge sorts.
7.6 Stability
When the objects to be sorted are complex, we frequently determine the order-
ing of the data based on a sort key. Sort keys either appear directly in the object,
or are derived from data found within the object. In either case, the keys for two
objects may be equal even if the objects are not. For example, we may perform
a sort of medical records based on the billing id, a value that may be shared
among many individuals in a family. While the billing id for family members is
the same, their medical histories are not.
Of course, some objects may have an option of several different sort keys.
Voters, for example, may be sorted by surname, or by the street or precinct
where they live. Having the choice of keys allows us to combine or pipeline
two or more sorts on different keys. The result is a full sort that distinguishes
for i in range(len(l)):
    l[i] = (l[i],i)
sort(l)
for j in range(len(l)):
    l[j] = l[j][0]
4 Indeed, before the advent of computer-based sorting, names were sorted using punched cards,
with several passes of a stable mechanical sort based on the character punched in a particular
column. A stable sort was made on each column, from right to left, leaving the punch cards in
alphabetical order at the end of all passes.
for i in range(len(l)):
    l[i] = KV(keyfun(l[i]),l[i])
This list can then be sorted independently of the original list since KV values are compared based just on their keys. Once the sort is finished, the list is undecorated with:
for i in range(len(l)):
    l[i] = l[i].value
Notice that these decorate operations are not list comprehensions since the list is to be sorted in place. As with the process of stabilizing a sort, we could have used tuples instead of key-value pairs, but because the key function may return equal key values for list elements that are not equal, it is important to make sure that comparisons never fall through to the component that carries the original value. The correct approach, then, is a decoration strategy that uses 3-tuples:
for i in range(len(l)):
    l[i] = (keyfun(l[i]), i, l[i])
Because equal key values reduce to the comparison of original list index val-
ues, which are necessarily distinguishable, the 3-tuples are clearly compared
without accessing the original value, l[i]. A pleasant side-effect of this time-
saving strategy is the introduction of stability in any otherwise unstable sort.
The undecorate appears as
for i in range(len(l)):
    l[i] = l[i][2]
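As a concrete illustration, here is a complete decorate-sort-undecorate pass using 3-tuples; the key function (string length) is purely hypothetical:

```python
# sort strings by a hypothetical key (their length); ties are broken
# by original position, so the sort is stable and the original
# values themselves are never compared
l = ["pear", "fig", "apple", "kiwi"]
for i in range(len(l)):
    l[i] = (len(l[i]), i, l[i])   # decorate
l.sort()
for i in range(len(l)):
    l[i] = l[i][2]                # undecorate
# l is now ['fig', 'pear', 'kiwi', 'apple']
```

Note that the equal-length strings "pear" and "kiwi" keep their original relative order.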
Exercise 7.1 Explain why this sorting technique always takes O(n) time for a
deck of n cards.
Such an approach is the basis for a general sorting technique called bucket sort.
By quickly inspecting values (perhaps a word) we can approximately sort them
into different buckets (perhaps based on the first letter). In a subsequent pass we can sort the values in each bucket with, perhaps, a different sort. The buckets of sorted values are then accumulated, carefully maintaining the order of
the buckets, and the result is completely sorted. Unfortunately, the worst-case
behavior of this sorting technique is determined by the performance of the al-
gorithm we use to sort each bucket of values.
Exercise 7.2 Suppose we have n values and m buckets and we use insertion sort
to perform the sort of each bucket. What is the worst-case time complexity of this
sort?
Such a technique can be used to sort integers, especially if we can partially sort
the values based on a single digit. For example, we might develop a support
function, digit, that, given a number n and a decimal place d, returns the
value of the digit in the particular decimal place. If d was 0, it would return the
units digit of n. Here is a recursive implementation:
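One possible reconstruction, treating d == 0 as the units place (the name digit follows the description above):

```python
def digit(n, d):
    # return the digit of n in decimal place d; d == 0 is the
    # units digit, d == 1 the 10's digit, and so on
    if d == 0:
        return n % 10
    return digit(n // 10, d - 1)
```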
Here is the code for placing a list of integer values among 10 buckets, based on the value of digit d. For example, if we have numbers between 1 and 52 and we set d to 1, this code almost sorts the values based on their 10's digit.
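A bucketPass along these lines might look like the following sketch; the digit computation is inlined, and the gather step keeps values within a bucket in their original relative order, which is what makes the pass stable:

```python
def bucketPass(data, d):
    # distribute values among 10 buckets keyed on decimal digit d,
    # then gather the buckets back into data in bucket order
    buckets = [[] for _ in range(10)]
    for v in data:
        buckets[(v // 10**d) % 10].append(v)
    data[:] = [v for b in buckets for v in b]
```

With d set to 0 and the data of Figure 7.11, this pass yields the figure's "Digit 0" row.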
We now have the tools to support a new sort, radix sort. The approach is to
use the bucketPass code to sort all the values based on the units place. Next,
all the values are sorted based on their 10’s digit. The process continues until
enough passes have been made to consider all the digits. If it is known that
values are bounded above, then we can also bound the number of passes as
well. Here is the code to perform a radix sort of values under 1 million (six
passes):
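A radixSort consistent with this description, using the bucketPass sketched above (repeated here so the example stands alone):

```python
def bucketPass(data, d):
    # stable distribution pass on decimal digit d
    buckets = [[] for _ in range(10)]
    for v in data:
        buckets[(v // 10**d) % 10].append(v)
    data[:] = [v for b in buckets for v in b]

def radixSort(data):
    # one pass per decimal place, least significant digit first;
    # six passes handle any value under 1 million
    for d in range(6):
        bucketPass(data, d)
```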
After the first bucketPass, the values are ordered, based on their units digit.
All values that end in 0 are placed near the front of data (see Figure 7.11), all
the values that end in 9 appear near the end. Among those values that end in 0,
the values appear in the order they originally appeared in the list. In this regard,
we can say that bucketPass is a stable sorting technique. All other things being
equal, the values appear in their original order.
During the second pass, the values are sorted, based on their 10’s digit.
Again, if two values have the same 10’s digit, the relative order of the values is
maintained. That means, for example, that 140 will appear before 42, because
after the first pass, the 140 appeared before the 42. The process continues, until
all digits are considered. Here, six passes are performed, but only three are
necessary (see Problem ??).
There are several important things to remember about the construction of
this sort. First, bucketPass is stable. That condition is necessary if the work
of previous passes is not to be undone. Secondly, the sort is unlikely to work if
the passes are performed from the most significant digit toward the units digit.
Finally, since the number of passes is independent of the size of the data list,
the speed of the entire sort is proportional to the speed of a single pass. Careful
7.8 Radix and Bucket Sorts
Start
140 2 1 43 3 65 0 11 58 3 42 4
Digit 0
140 0 1 11 2 42 43 3 3 4 65 58
Digit 1
0 1 2 3 3 4 11 140 42 43 58 65
Digit 2
0 1 2 3 3 4 11 42 43 58 65 140
Digit 3
0 1 2 3 3 4 11 42 43 58 65 140
Digit 4
0 1 2 3 3 4 11 42 43 58 65 140
Digit 5
0 1 2 3 3 4 11 42 43 58 65 140
Finish
Figure 7.11 The state of the data list between the six passes of radixSort. The bound-
aries of the buckets are identified by vertical lines; bold lines indicate empty buckets.
Since, on every pass, paths of incoming values to a bucket do not cross, the sort is stable.
Notice that after three passes, the radixSort is finished. The same would be true, no
matter the number of values, as long as they were all under 1000.
Figure 7.12 A list of phone entries for the 107th Congressional Delegation from Ore-
gon State, before and after sorting by telephone (shaded).
book, the entries would ideally be ordered based on the name associated with
the entry. Sometimes, however, the compareTo method does not provide the
ordering desired, or worse, the compareTo method has not been defined for an
object. In these cases, the programmer turns to a simple method for specifying an
external comparison method called a comparator. A comparator is an object that
contains a method that is capable of comparing two objects. Sorting methods,
then, can be developed to apply a comparator to two objects when a comparison
is to be performed. The beauty of this mechanism is that different comparators
can be applied to the same data to sort in different orders or on different keys.
In Java a comparator is any class that implements the java.util.Comparator
interface. This interface provides the following method:
Like the compareTo method we have seen earlier, the compare method re-
turns an integer that identifies the relationship between two values. Unlike
the compareTo method, however, the compare method is not associated with
the compared objects. As a result, the comparator is not privy to the implemen-
tation of the objects; it must perform the comparison based on information that
is gained from accessor methods.
As an example of the implementation of a Comparator, we consider the implementation of a case-insensitive comparison of Strings, called CaselessComparator. This comparison method converts both String objects to uppercase and then performs the standard String comparison:
The result of the comparison is that strings that are spelled similarly in different
cases appear together. For example, if a list contains the words of the children’s
tongue twister:
we would expect the words to be sorted into the following order:
This should be compared with the standard ordering of String values, which
would generate the following output:
To use a Comparator in a sorting technique, we need only replace the use
of compareTo methods with compare methods from a Comparator. Here, for
example, is an insertion sort that makes use of a Comparator to order the values
in a list of Objects:
Note that in this description we don’t see the particulars of the types involved.
Instead, all data are manipulated as Objects, which are specifically manipulated
by the compare method of the provided Comparator.
through a parenthesized cast. If the type of the fetched value doesn’t match
the type of the cast, the program throws a class cast exception. Here, we cast
the result of get in the compareTo method to indicate that we are comparing
PhoneEntrys.
It is unfortunate that the insertionSort has to be specially coded for use
with the PhoneEntry objects.
7.12 Conclusions
Sorting is an important and common process on computers. In this chapter we
considered several sorting techniques with quadratic running times. Bubble sort
approaches the problem by checking and rechecking the relationships between
elements. Selection and insertion sorts are based on techniques that people
commonly use. Of these, insertion sort is most frequently used; it is easily
coded and provides excellent performance when data are nearly sorted.
Two recursive sorting techniques, mergesort and quicksort, use recursion
to achieve O(n log n) running times, which are optimal for comparison-based
techniques on single processors. Mergesort works well in a variety of situations,
but often requires significant extra memory. Quicksort requires a random access
structure, but runs with little space overhead. Quicksort is not a stable sort
because it has the potential to swap two values with equivalent keys.
We have seen, with radix sort, that it is possible to have a linear sorting algorithm, but it cannot be based on compares. Instead, the technique involves carefully ordering values based on looking at portions of the key. The technique is, practically speaking, not useful for general-purpose sorting, although for many years punched cards were efficiently sorted using precisely the method described here.
Sorting is, arguably, the most frequently executed algorithm on computers
today. When we consider the notion of an ordered structure, we will find that
algorithms and structures work hand in hand to help keep data in the correct
order.
8.1 Decoration
A large portion of this chapter is about the development of programs that help
us instrument and measure characteristics about other programs. One of the
strengths of Python is its ability to support this engineering process. The first
step along this path is an understanding of a process called decoration, or the
annotation of functions and classes.
Suppose we have designed a data structure that provides a feature that is used in many different places in a program. An example might be the append method of a list implementation. How might we keep track of the number of times we call this method? It is likely to be error-prone to find the places where the method is called, and count those; we might not even have access to all of those call sites. Clearly, it would be necessary to modify the code itself.
We might imagine, however, that the process of gathering these statistics
could be complex—perhaps at least as complex as the code associated with
append. It is likely, also, that success in analyzing the frequency of append
methods will lead, ultimately, to the analysis of other methods of the list class
as well. In all of this, it is important that we not confuse the analysis tool with
the method itself. For this reason, it is frequently useful to “abstract away” the
details of the analysis in the hope that it lowers the barrier to use and that it
does not obscure the object being analyzed.
Python provides a very simple syntactic mechanism, called a decorator,
that can be used to augment our definitions of other methods and
classes. Remember, in Python, functions are first-class objects. Once a function
is defined, references to it can be passed around and operated on like any other
object.1 A decorator is a higher-order function that is called with another func-
tion as its only parameter. The purpose of the decorator is to return an alterna-
tive function to be used in its place. Typically, the alternative function is simply
a new function that wraps the original: it performs some preprocessing of the
function's arguments, calls the function with its arguments, and then performs
some postprocessing of the results. Once defined, the higher-order function
is applied by mentioning it as a decoration just before the original function is
defined:
@decorator
def f(x):
    print(x)

is the same as

def f(x):
    print(x)
f = decorator(f)
In our case, decorating a function fun with @accounting is equivalent to writing
fun = accounting(fun). Our job is to write the decorator function, accounting.
Our first version is as follows:
def accounting(f):
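The body of the decorator does not survive in this excerpt; a minimal sketch, consistent with the counter attribute used in the examples that follow (the *args form is an assumption, so that any function may be wrapped), is:

```python
def accounting(f):
    """Wrap f so that wrapped_f.counter counts the calls made to it."""
    def wrapped_f(*args, **kwargs):
        wrapped_f.counter += 1          # tally this call
        return f(*args, **kwargs)       # delegate to the original function
    wrapped_f.counter = 0               # the publicly visible call count
    return wrapped_f
```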
1 As we shall see, shortly, the same holds true of classes. When we define a class, it becomes a
callable object, just like a method.
f = {0, 1, 1, 2, 3, 5, ...}
>>> fibo(5)
8
>>> fibo(10)
89
>>> fibo(2000)
(significant time passes, humanity fades)
So, what is happening? Since the amount of computation directly attributed to
a single call to fibo is small (a condition test and a possible addition), it must
be the case that many function calls are being performed. To see how many, we
can decorate fibo with our newfound accounting decorator:
@accounting
def fibo(n):
    if n <= 1:
        return 1
    else:
        return fibo(n-1) + fibo(n-2)
The effect of this is to quietly wrap fibo with code that increments the counter
each time the function is called. We can now use this to count the actual calls
to the fibo method:
>>> fibo(3)
3
>>> fibo.counter
5
>>> fibo.counter = 0; fibo(5)
8
>>> fibo.counter
15
>>> fibo.counter = 0; fibo(20)
10946
>>> fibo.counter
21891
Clearly, things get out of hand rather quickly. Our next decorator demonstrates
how we can dramatically reduce the complexity of our recursive implementation
without losing its beauty.
Exercise 8.1 Write a new method, fibocalls(n), that computes the number of
calls that fibo(n) will make.
n fibocalls(n)
0 1
1 1
2 3
3 5
4 9
5 15
6 25
7 41
8 67
9 109
10 177
20 21891
100 1146295688027634168201
200 907947388330615906394593939394821238467651
Figure 8.1 The number of calls made by fibo for various values of n.
return result
wrapped_f.cache = dict()
return wrapped_f
The wrapper function, wrapped_f, uses a tuple of the particular arguments
passed to the function f to look up any pre-computed results. If the function call
was previously computed, the previous result is returned, avoiding the overhead
of redundantly computing it. Otherwise, the result of the new computation is
recorded for possible future use and then returned.
Before the wrapper is returned as the new definition, we attach an empty
dictionary as the cache attribute on the wrapper. Over time, of course, this
dictionary fills with the history of function calls and their respective results.
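Assembled from the fragment above and this description, a complete sketch of the memoize decorator might read:

```python
def memoize(f):
    """Wrap f so that each distinct argument tuple is computed only once."""
    def wrapped_f(*args):
        if args not in wrapped_f.cache:
            wrapped_f.cache[args] = f(*args)   # record for possible future use
        return wrapped_f.cache[args]           # return the (possibly cached) result
    wrapped_f.cache = dict()                   # history of calls and their results
    return wrapped_f
```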
To make use of the memoize decorator, we tuck it between the accounting
decorator and the function definition so that we may count the number of mem-
oized calls:
@accounting
@memoize
def fibo(n):
    if n <= 1:
        return 1
    else:
        return fibo(n-1) + fibo(n-2)
This decoration order is a shorthand for

def fibo(n):
    ...
fibo = accounting(memoize(fibo))
[0, 1, 2, 4, 8, 16, 5, 10, 3, 6, 12, 13, 17, 11, 7, 14, 15, 9, 18, 19]
>>> l = list(reversed(range(20)))
>>> l
[19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
>>> quicksort(l,key=syrlen)
>>> l
[1, 0, 2, 4, 8, 16, 5, 10, 3, 6, 13, 12, 17, 11, 7, 15, 14, 9, 19, 18]
Notice that the relative ordering of values that have the same Syracuse trajec-
tory length is determined by the initial ordering of the list to be sorted. Because
the sortoptions decorator is used for each of our sorting techniques, the results
of the experiment are the same no matter which sort we use.
starttime = time()
quicksort(l)
stoptime = time()
elapsedtime = stoptime-starttime
print("The elapsed time is {} seconds.".format(elapsedtime))
finished call to the function f. Decorating the quicksort method would then
allow us to perform the following:
>>> quicksort(l)
>>> quicksort.elapsedtime
0.004123442259382993
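The decorator that attaches the elapsedtime attribute is not shown in this excerpt; a sketch consistent with its use above (the name timed and the *args form are assumptions) would be:

```python
from time import time

def timed(f):
    """Wrap f so that wrapped_f.elapsedtime reports the last call's duration."""
    def wrapped_f(*args, **kwargs):
        starttime = time()
        result = f(*args, **kwargs)
        stoptime = time()
        wrapped_f.elapsedtime = stoptime - starttime   # seconds, as a float
        return result
    wrapped_f.elapsedtime = 0.0
    return wrapped_f
```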
This tells us that it took about four milliseconds to perform the sort. If we run
the experiment ten times, we get the following output:
>>> for i in range(10):
... k = list(l)
... quicksort(k)
...     print("Time is {} seconds.".format(quicksort.elapsedtime))
...
Time is 0.008211135864257812 seconds.
Time is 0.006258964538574219 seconds.
Time is 0.004125833511352539 seconds.
Time is 0.003949880599975586 seconds.
Time is 0.004060029983520508 seconds.
Time is 0.003949165344238281 seconds.
Time is 0.004272937774658203 seconds.
Time is 0.0040531158447265625 seconds.
Time is 0.003945112228393555 seconds.
Time is 0.004277944564819336 seconds.
How are we to interpret this data? Clearly there is some variability in the
elapsed time. Most modern computers run tens or even hundreds of processes
that, at various times, may interrupt your experiment as it progresses. For ex-
ample, the system clock needs to be updated, mail is checked, the mouse image
is moved, garbage is collected, and the display is updated. If any of these dis-
tractions happens during your experiment, the elapsed time associated with
quicksort is increased. When multiple experiments are timed, each may be
thought of as an upper bound on the length of time. It is appropriate, then, to
take the minimum time as a reasonable estimate of the time consumed. Further
measurements may be larger, but over many experiments, the estimate becomes
more accurate. Here, for example, we see that quicksort can sort 1000 integers
in approximately 3945 microseconds.
Some experiments, of course, are finished in well under the resolution of the
clock. (In this book, all times are measured in units informally called a click,
the length of time it takes Python to increment an integer; on the author's
machine, a click is approximately 34 nanoseconds.) In these cases it can be
useful to repeat the experiment enough times to make the total time significantly
greater than the resolution of the clock. If, for example, we measure 1000
repeated experiments that take a total of 1000 microseconds, the time of each
individual experiment is about 1 microsecond. One must take care, of course,
to consider the overhead associated with running the experiments 1000 times.
Again, on the author's machine, that may be as much as 63 nanoseconds per
iteration.
On most modern unix-based operating systems, the system allows a wait-
ing process to run largely uninterrupted until the next scheduling event.
Switch Purpose
-n Directly specify the number of trials to be timed
-s Specify the setup statement
-r Specify the number of times to repeat the entire experiment (default is 3)
-- Separate switches from statements to be timed
We can test out our theory by timing increasingly complex experiments. Our
simplest is to increment an integer variable.
i += 1
The timeit module reports this as taking about 45 nanoseconds:
and then use that to expose the time of an increment that follows:
python3 -m timeit -- 'i=0' 'i+=1'
10000000 loops, best of 3: 0.0408 usec per loop
which yields a somewhat smaller time of about 27 nanoseconds. The reason for
this is the fact that the Python interpreter can optimize these two instructions
into a single simpler operation that directly loads 1 into i.
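The same measurements can be made from inside a program through the timeit module's Python API (a sketch; the statement strings mirror the command line above):

```python
import timeit

# time one million increments; timeit.timeit returns total elapsed seconds
total = timeit.timeit(stmt='i += 1', setup='i = 0', number=1_000_000)
print(total / 1_000_000)   # approximate seconds per single increment
```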
The ability to perform very precise timings allows for careful experimental
design and evaluation of algorithms.
Chapter 9
Sorted Structures
One important use of data structures is to help keep data sorted—the smallest
value in the structure might be stored close to the front, while the largest value
would be stored close to the rear. Once a structure is kept sorted it also becomes
potentially useful as a sorting method: we simply insert our possibly unordered
data into the structure and then extract the values in sorted order. To do this,
however, it is necessary to compare data values to see if they are in the correct
order. In this chapter we will discuss approaches to the various problems as-
sociated with maintaining ordered structures. First we review material we first
encountered when we considered sorting.
Python has a very complex scheme to make sure that ordering can be deter-
mined, even when specific comparison methods have not been defined. In ad-
dition, Python does not make assumptions about the consistency of the rela-
tionships between the comparison methods; for example, it’s not necessary that
== and != return opposite boolean values. While it may sometimes be useful
to define these operations in inconsistent ways, most classes defined in Python
benefit from the definition of one or more of these methods in a consistent
manner.
Let's look at a particular example. Suppose that we're interested
in determining the relative ordering of Ratio values. (Recall, we first met the
Ratio class in Chapter ??, on page ??.) The relative ordering of Ratio values
depends not directly on the ordering of the values held internally, but on the
ordering of the products of opposite numerators and denominators of the Ratios
involved. For this reason the implementation of the special ordering methods is
nontrivial. As examples, we consider the definition of the __lt__ and __eq__
methods from that class:
def __lt__(self, other):
    """Compute self < other."""
    return self._top*other._bottom < self._bottom*other._top

def __le__(self, other):
    """Compute self <= other."""
    return self._top*other._bottom <= self._bottom*other._top

def __eq__(self, other):
    """Compute self == other."""
    return (self._top == other._top) and \
           (self._bottom == other._bottom)
When two Ratio values are compared, the appropriate products are computed
and compared, much as we learned in grade school. Note that the implemen-
tation of the __lt__ special method involves a < comparison of the products—
integer values. This, of course, calls the special method __lt__ for integers.
The implementation of the <= operator follows the same pattern. The imple-
mentation of the __eq__ method ultimately performs a == test of numerators
and denominators, and depends on the ratios always being represented in
lowest terms, which, as you may recall, they are. Other operators can be defined
for the Ratio class in a similar manner.
Some classes, of course, represent values that cannot be ordered in a partic-
ular manner (Complex values, for example). In these cases, the rich comparison
methods are not defined and an exception is raised if you attempt to compare
them. In the context of our current thinking, it is meaningless, then, to place
values of this type in a sorted structure.
In implementations similar to Ratio, of course, it is important that we define
these methods in a consistent manner. Careful attention will make this
possible, but given that there are a half dozen related methods, it is
important that maintainers of these classes be vigilant: a change to one method
should be reflected in the implementation of all the methods.
Python provides, as part of the functools package, a class decorator,
total_ordering, that, given the implementation of one or more of the
comparison methods, defines each of the others in a consistent manner. For
example, the Ratio class is decorated with the total_ordering decorator to
guarantee that each of the rich comparison methods is correctly defined.
Typically, when efficient operations cannot be inherited, specific definitions
are given for __lt__, __le__, and __eq__, and the remaining rich comparison
functions are provided by the decorator.
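As an illustration of the decorator at work (the Version class here is hypothetical, not from the text), total_ordering derives the remaining comparisons from __eq__ and __lt__:

```python
from functools import total_ordering

@total_ordering
class Version:
    """A hypothetical class: only __eq__ and __lt__ are written by hand."""
    def __init__(self, major, minor):
        self.major = major
        self.minor = minor
    def __eq__(self, other):
        return (self.major, self.minor) == (other.major, other.minor)
    def __lt__(self, other):
        return (self.major, self.minor) < (other.major, other.minor)

# __le__, __gt__, and __ge__ are supplied, consistently, by the decorator
print(Version(1, 2) <= Version(1, 3))   # True
```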
"""
An abstract base class for SortedList, SkipList, SearchTree, SplayTree,
and RedBlackSearchTree.
...
def __eq__(self, other):
    """Data structure has equivalent contents to other."""
    if len(self) != len(other):
        return False
    for x, y in zip(self, other):
        if x != y:
            return False
    return True
This interface demands that implementors ultimately provide methods that add
values to the structure and remove them by value. Since the values are always
held within the structure in fully sorted ordering, the equality test for these
containers can perform pairwise comparisons of the values that occur in the
traversals.
Figure 9.2 Finding the correct location for a comparable value in a sorted list. The top
search finds a value in the array; the bottom search fails to find the value, but finds the
correct point of insertion. The shaded area is not part of the list during search.
We present here the code for determining the index of a value in a SortedList.
Be aware that if the value is not in the list, the routine returns the ideal loca-
tion to insert the value. This may be a location that is just beyond the bounds
of the list.
def _locate(self, target):
    """Find the correct location to insert target into list."""
    low = 0
    high = len(self._data)
    mid = (low + high) // 2
    while high > low:
        mid_obj = self._data[mid]
        if target < mid_obj:
            # if the object we're trying to place is less than mid_obj, look left
            # new highest possible index is mid
            high = mid
        else:
            # otherwise, lowest possible index is one to the right of mid
            low = mid + 1
        # recompute mid with new low or high
        mid = (low + high) // 2
    # if target is in the list, the existing copy is located at self._data[low - 1]
    return low
For each iteration through the loop, low and high determine the bounds of the
list currently being searched. mid is computed to be the middle element (if
there are an even number of elements being considered, it is the leftmost of
the two middle elements). This middle element is compared with the parame-
ter, and the bounds are adjusted to further constrain the search. The _locate
method imagines that you're searching for an ideal insertion point for a new value. If
values in the list are equal to the value searched for, the _locate method will
return the index of the first value beyond the equal values. This approach makes
it possible to implement stable sorting relatively easily. Since the portion of the
list participating in the search is roughly halved each time, the total number of
times around the loop is approximately O(log2 n). This is a considerable im-
provement over the implementation of the __contains__ method for lists of
arbitrary elements—that routine is linear in the size of the structure.
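The convention used by _locate, returning the index just past any equal values, is the same one followed by bisect.bisect_right in the standard library; the list below reproduces the values of Figure 9.2:

```python
from bisect import bisect_right

data = [-1, 0, 1, 2, 3, 3, 4, 40, 42, 43, 58, 65]
print(bisect_right(data, 40))   # 8: the slot just past the 40 at index 7
print(bisect_right(data, 93))   # 12: one beyond the end of the list
```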
Notice that _locate is declared with a leading underscore and is thought
of as a private member of the class. This makes it difficult for a user to call di-
rectly, and makes it hard for a user to write code that depends on the underlying
implementation. To convince yourself of the utility of this, note that several Sorted con-
tainers of this chapter have exactly the same interface (so these data types can
be interchanged), but they are completely different structures. If the _locate
method were made public, then code could be written that makes use of this
list-specific method, and it would be impossible to switch implementations.
Implementation of the _locate method makes most of the nontrivial Sorted
methods more straightforward. The __contains__ special method is imple-
mented by searching for the appropriate location of the value in the sorted list
and returns true if this location actually contains an equivalent value. Because
_locate would return an index one beyond any equivalent values, we would
expect to find equivalent values at the location just prior to the one returned by
_locate.
def __contains__(self, value):
    """Determine if a SortedList contains value."""
    index = self._locate(value) - 1
    # if list is empty, value cannot be in it
    if index == -1:
        return False
    return self._data[index] == value
The add method simply adds an element to the SortedList in the position
indicated by the _locate method:
def add(self, value):
    """Insert value at the correct location in the SortedList."""
    index = self._locate(value)
    self._data.insert(index, value)
Notice that if two equal values are added to the same SortedList, the last value
added is inserted later in the list.
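Putting the pieces together, a small self-contained sketch (the class skeleton here is an assumption; the book's SortedList carries more methods) shows the behavior on duplicates:

```python
class SortedList:
    """A minimal sketch holding sorted values in a Python list."""
    def __init__(self):
        self._data = []

    def _locate(self, target):
        # binary search for the slot just past any values equal to target
        low, high = 0, len(self._data)
        while high > low:
            mid = (low + high) // 2
            if target < self._data[mid]:
                high = mid
            else:
                low = mid + 1
        return low

    def add(self, value):
        self._data.insert(self._locate(value), value)

    def __contains__(self, value):
        index = self._locate(value) - 1
        return index >= 0 and self._data[index] == value

l = SortedList()
for v in [40, -1, 3, 40, 2]:
    l.add(v)
print(l._data)             # [-1, 2, 3, 40, 40]
print(40 in l, 5 in l)     # True False
```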
class SkipListNode():
    """A node for use in SkipList."""
    __slots__ = ["data", "next"]

    def __init__(self, data=None, next=None):
        self.data = data
        # pointer to next node in SkipList
        # next[i] is the next node at level i
        self.next = [] if next is None else next
The SkipList is, essentially, a linked list of node structures. It keeps explicit
track of its size, a probability associated with computing the node height, and
a list of next pointers, one for each path through the list. It is initialized in the
following manner:
@checkdoc
class SkipList(Sorted):

    def __len__(self):
        """The length of the list at level 0."""
        return self._size
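The initializer itself does not survive in this excerpt; a sketch consistent with the attributes used by the methods in this section (the parameter name probability is an assumption, and SkipListNode is repeated so the sketch is self-contained) might read:

```python
class SkipListNode:
    """A node for use in SkipList."""
    __slots__ = ["data", "next"]
    def __init__(self, data=None, next=None):
        self.data = data
        self.next = [] if next is None else next

class SkipList:
    def __init__(self, probability=0.5):
        self._size = 0                # number of values in the level-0 list
        self._p = probability         # chance a node appears at the next level up
        self._head = SkipListNode()   # sentinel; _head.next[i] starts the level-i path

    @property
    def height(self):
        # the number of paths through the list
        return len(self._head.next)

    def __len__(self):
        return self._size
```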
Now, let’s assume that there are several elements in the list. What is nec-
essary to locate a particular value? The search begins along the highest level,
shortest path. This path, i, connects only a few nodes, and helps us get to the
approximate location. We walk along the path, finding the largest value on the
path that is less than the value sought. This is the predecessor of our value (if it
is in the list) at level i. Since this node is in every lower level path, we transfer
the search to the next lower level, i-1. As we find the predecessor at each level,
we save a reference to each node. If the list has to be modified—by removing
the value sought, or adding a copy if it is missing—each of these nodes must
have their respective paths updated.
def _find(self, value):
    """Create a list of the nodes immediately to the left of where value should be located."""
    node = self._head
    predecessors = []
    for l in reversed(range(self.height)):
        # search for predecessor on level l
        nextNode = node.next[l]
        while (nextNode is not None) and (nextNode.data < value):
            (node, nextNode) = (nextNode, nextNode.next[l])
        # we've found predecessor at level l
        predecessors.append(node)
        # note that we continue from node's lower pointers
    predecessors.reverse()
    return predecessors
Notice that it is important that the search begin with the highest level path;
going from the outer path to the innermost path guarantees that we find the
predecessor nodes quickly. If we consider the paths from lower levels upward,
we must begin the search of each path at the head; this does not guarantee
sub-linear performance.
How can we evaluate the performance of the _find operation? First, let's
compute the approximate length of each path. The lowest level path (level 0)
has n nodes. The second lowest path has pn nodes. Each successively higher
path is reduced in length, to p^i * n for the path at level i. Given this, the number
of paths through the list is approximately log_{1/p} n, the number of times that n
can be reduced by a factor of p. Another way to think about this is that between
each pair of nodes at level i, there are approximately 1/p nodes at level i-1. Now,
a _find operation searches, in the expected case, about half the way through the
top-most path. Once the appropriate spot is found, the search continues at the
next level, considering some portion of the 1/p nodes at that level. This continues
log_{1/p} n times. The result is a running time of (1/p) log_{1/p} n, or O(log n) steps.
The result of _find is a list of references to nodes that are the largest values
smaller than the value sought. Entry 0 in this list is a reference to the predeces-
sor of the value sought. If the value that follows the node referenced by entry 0
in this list is not the value we seek, the value is not in the list.
def __contains__(self, value):
    """value is in SkipList."""
    location = self._find(value)
    target = location[0].next[0]
    return (target is not None) and (target.data == value)
Another way to think about the result of _find is that it is a list of nodes whose
links might be updated if the node sought were to be inserted. In the case of
removals, these nodes are the ones that must be updated when the value is
removed.
The add method performs a _find operation which identifies, for each path,
the node that would be the predecessor in a successful search for the value. A
new node is constructed and its height (that is, the number of paths that will
include the node) is randomly selected. The insertion of the node involves con-
structing the next pointers for the new node; these pointers are simply the next
references currently held by the predecessor nodes. For each of these levels it
is also necessary to update the next pointers of the predecessors so they each
point to our new node. Here is the code:
@mutatormethod
def add(self, value):
    """Insert a SkipListNode with data set to value at the appropriate location."""
    # create node
    node = SkipListNode(value)
    nodeHeight = self._random_height()
    for l in range(self.height, nodeHeight):
        self._head.next.append(None)
    # list of nodes whose pointers must be patched
    predecessors = self._find(value)
    # construct the pointers leaving node
    node.next = [predecessors[i].next[i] for i in range(nodeHeight)]
    for i in range(nodeHeight):
        predecessors[i].next[i] = node
    self._size += 1
It is possible to improve, slightly, the performance of this operation by merging
the code of _find and add methods, but the basic complexity of the operation
is not changed.
The remove operation begins with code that is similar to the __contains__
method. If the value-holding node is found, remove replaces the next fields of
the predecessors with references from the node targeted for removal:
@mutatormethod
def remove(self, value):
    """Remove value from SkipList at all levels where it appears."""
    predecessors = self._find(value)
    target = predecessors[0].next[0]
9.5 Conclusions
Figure 9.3 For small list sizes, SortedLists are constructed more quickly than
SkipLists.
Figure 9.4 Based on insertion sort, SortedList construction is an O(n^2) process, while
SkipList construction is O(n log n).
Chapter 10
Binary Trees
Recursion is a beautiful approach to structuring. We commonly think of
recursion as a form of structuring the control of programs, but self-reference can
be used just as effectively in the structuring of program data. In this chapter,
we investigate the use of recursion in the construction of branching structures
called trees.
Most of the structures we have already investigated are linear—their natural
presentation is in a line. Trees branch. The result is that where there is an
inherent ordering in linear structures, we find choices in the way we order the
elements of a tree. These choices are an indication of the reduced “friction” of
the structure and, as a result, trees provide us with the fastest ways to solve
many problems.
Before we investigate the implementation of trees, we must develop a con-
cise terminology.
10.1 Terminology
A tree is a collection of elements, called nodes, and relations between them,
called edges. Usually, data are stored within the nodes of a tree. Two trees are
disjoint if no node or edge is found common to both. A trivial tree has no nodes
and thus no data. An isolated node is also a tree.
From these primitives we may recursively construct more complex trees. Let
r be a new node and let T1 , T2 , . . . , Tn be a (possibly empty) set—a forest—of
distinct trees. A new tree is constructed by making r the root of the tree, and
establishing an edge between r and the root of each tree, Ti , in the forest. We
refer to the trees, Ti , as subtrees. We draw trees with the root above and the
trees below. Figure ?? is an aid to understanding this construction.
The parent of a node is the adjacent node appearing above it (see Figure ??).
The root of a tree is the unique node with no parent. The ancestors of a node
n are the roots of trees containing n: n, n’s parent, n’s parent’s parent, and so
on. The root is the ancestor shared by every node in the tree. A child of a node
n is any node that has n as its parent. The descendants of a node n are those
nodes that have n as an ancestor. A leaf is a node with no children. Note that
n is its own ancestor and descendant. A node m is the proper ancestor (proper
descendant) of a node n if m is an ancestor (descendant) of n, but not vice versa.
In a tree T , the descendants of n form the subtree of T rooted at n. Any node of
a tree T that is not a leaf is an interior node. Roots can be interior nodes. Nodes
1 At the time of this writing, modern technology has not advanced to the point of allowing nodes
of degree other than 2.
2 This is the Texan born in Massachusetts; the other Texan was born in Connecticut.
MABeaky = BinTree("Martha")
GHWalker = BinTree("George", DDWalker, MABeaky)
JHWear = BinTree("James II")
NEHolliday = BinTree("Nancy")
LWear = BinTree("Lucretia", JHWear, NEHolliday)
DWalker = BinTree("Dorothy", GHWalker, LWear)
GHWBush = BinTree("George", PSBush, DWalker)
For each person we develop a node that either has no links (the parents were
not included in the database) or has references to other pedigrees stored as
BinTrees. Arbitrarily, we choose to maintain the father’s pedigree on the left
side and the mother’s pedigree along the right. We can then answer simple
questions about ancestry by examining the structure of the tree. For example,
who are the direct female relatives of the President?
person = GHWBush
while not person.right.empty:
    person = person.right
    print(person.value)
The results are
Dorothy
Lucretia
Nancy
Exercise 10.1 These are, of course, only some of the female relatives of President
Bush. Write a program that prints all the female names found in a BinTree
representing a pedigree chart.
One feature that would be useful is the ability to add branches to
a tree after the tree was constructed. For example, we might determine that
James Wear had parents named William and Sarah. The database might be
updated as follows:
JHWear.left = BinTree("William")
SAYancey = BinTree("Sarah")
JHWear.right = SAYancey
represent expressions using binary trees. Each value in the expression appears
as a leaf, while the operators are internal nodes that represent the reduction of
two values to one (for example, l - 1 is reduced to a single value for use on
the left side of the multiplication sign). The expression tree associated with our
expression is shown in Figure 10.1a. We might imagine that the following code
constructs the tree and prints 1:
# two variables, l and r, both initially zero
l = BinTree(["l", 0])
r = BinTree(["r", 0])
# compute r = 1+(l-1)*2
t = BinTree(operator('-'), l, BinTree(1))
t = BinTree(operator('*'), t, BinTree(2))
t = BinTree(operator('+'), BinTree(1), t)
t = BinTree(operator('='), r, t)
print(eval(t))
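The book's operator and eval machinery is not reproduced in this excerpt; a simplified, self-contained sketch of recursive evaluation (interior nodes here hold Python functions, leaves hold numbers, and variable handling is omitted) follows:

```python
import operator as op

class Node:
    """A minimal stand-in for BinTree, for illustration only."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def evaluate(t):
    if t.left is None and t.right is None:
        return t.value                                       # a leaf holds a number
    return t.value(evaluate(t.left), evaluate(t.right))      # interior: apply operator

# 1 + (l - 1) * 2, with l bound to 2 for this illustration
l = Node(2)
t = Node(op.sub, l, Node(1))
t = Node(op.mul, t, Node(2))
t = Node(op.add, Node(1), t)
print(evaluate(t))   # 1 + (2 - 1) * 2 = 3
```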
10.4 Implementation
We now consider the implementation of binary trees. As with our linked list
implementations, we will construct a self-referential binary tree class, BinTree.
The recursive design motivates implementation of many of the BinTree opera-
tions as recursive methods. However, because the base case of recursion often
involves an empty tree we will make use of a dedicated “sentinel” node that
represents the empty tree. This simple implementation will be the basis of a
large number of more advanced structures we see throughout the remainder of
the text.
Figure 10.2 The structure of a BinTree. The parent reference opposes a left or right
child reference in the parent node.
def __init__(self, data=None, left=None, right=None, dummy=False, frozen=False):
    """Construct a binary tree node."""
    self._hash = None
    if dummy:
        # this code is typically used to initialize self.Empty
        self._data = None
        self._left = self
        self._right = self
        self._parent = None
        self._frozen = True
    else:
        # first, declare the node as a disconnected entity
        self._left = None
        self._right = None
        self._parent = None
        self._data = data
        self._frozen = False
        # now, nontrivially link in subtrees
        self.left = left if left is not None else self.Empty
        self.right = right if right is not None else self.Empty
    if frozen:
        self.freeze()
We call the initializer with dummy set to True when an empty BinTree is needed.
The result of this initializer is the single empty node that will represent mem-
bers of the fringe of empty trees found along the edge of the binary tree. The
most common initialization occurs when dummy is False. Here, the initializer
makes calls to “setting” routines. These routines allow one to set the references
of the left and right subtrees, but also ensure that the children of this node ref-
erence this node as their parent. This is the direct cost of implementing forward
and backward references along every link. The return, though, is the consid-
erable simplification of other code within the classes that make use of BinTree
methods.
Principle 6 Don't let opposing references show through the interface.
Once the node has been constructed, its value can be inspected and modi-
fied using the value-based functions that parallel those we have seen with other
types:
@property
def value(self):
    """The data stored in this node."""
    return self._data

@value.setter
def value(self, newValue):
    self._data = newValue
The BinTree may be used as the basis for our implementation of some fairly
complex programs and structures.
Figure 10.3 The state of the database in the midst of playing infinite_questions.
Exercise 10.2 What questions would the computer ask if you were thinking of a
truck?
When the game is played, the computer is very likely to lose. The database
can still benefit by incorporating information about the losing situation. If the
program guessed a computer and the item was a car, we could incorporate the
car and a question “Does it have wheels?” to distinguish the two objects. As it
turns out, the program is not that difficult.
def play(database):
    if not database.left.empty:
        # ask a question
        answer = input(database.value + " ")
        if answer == "yes":
            play(database.left)
        else:
            play(database.right)
    else:
        # make a guess
        answer = input("Is it " + database.value + "? ")
        if answer == "yes":
            print("I guessed it!")
        else:
            answer = input("Darn! What were you thinking of? ")
            newObject = BinTree(answer)
            oldObject = BinTree(database.value)
            database.left = newObject
            database.right = oldObject
            database.value = \
                input("What question would distinguish {} from {}? "
                      .format(newObject.value, oldObject.value))
The program distinguishes questions from guesses by checking to see whether
there is a left child. If there is, the node holds a question, since the two
children need to be distinguished.
The program is very careful to expand the database by adding new leaves
at the node that represents a losing guess. If we aren’t careful, we can easily
corrupt the database by growing a tree with the wrong topology.
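The update step at the heart of play can be sketched in isolation. The Node class and learn function below are hypothetical stand-ins (not the book's BinTree) with just the attributes play manipulates; they show how a losing guess node is split into a question whose yes-branch leads to the new object:

```python
class Node(object):
    """Hypothetical stand-in for BinTree with just value, left, right."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def learn(node, new_object, question):
    """Split a losing guess into a question with two leaf children.
    The new (yes) object goes left; the old guess goes right."""
    node.left = Node(new_object)
    node.right = Node(node.value)   # capture the old guess before overwriting
    node.value = question

db = Node("a computer")                      # a one-guess database
learn(db, "a car", "Does it have wheels?")
print(db.value, "/", db.left.value, "/", db.right.value)
```

The order of assignments matters: the old guess must be captured in the right child before the node's value is replaced by the question, which is exactly the discipline the paragraph above describes.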
Here is the output of a run of the game that demonstrates the ability of the
database to incorporate new information—that is to learn:
Do you want to play a game?
Think of something...I’ll guess it
Is it a computer?
Darn. What were you thinking of?
What question would distinguish a car from a computer?
Do you want to play again?
Think of something...I’ll guess it
Does it have a horn?
Is it a car?
Darn. What were you thinking of?
What question would distinguish a unicorn from a car?
Do you want to play again?
Think of something...I’ll guess it
Does it have a horn?
Is it magical?
Is it a car?
I guessed it!
Do you want to play again?
Have a good day!
Exercise 10.3 Make a case for or against this program as a (simple) model for human learning through experience.
10.6 Recursive Methods
temp = []
listify(self,temp)
return iter(temp)
The core of this implementation is the helper listify. It simply traverses the tree rooted at its parameter and appends every value encountered. Since it recursively adds all of a node's left descendants, then the node itself, and then its right descendants, it produces an in-order traversal. Since the list preserves this order, the elements may be consumed at the user's leisure.
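The recursive helper can be sketched as follows. The tiny T class is a hypothetical stand-in for BinTree (with None marking an empty subtree); only the in-order shape of listify is the point:

```python
class T(object):
    """Hypothetical stand-in for BinTree; None children mark empty subtrees."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def listify(tree, acc):
    """In-order: left subtree, then the node itself, then the right subtree."""
    if tree is not None:
        listify(tree.left, acc)
        acc.append(tree.value)
        listify(tree.right, acc)

#     2
#    / \
#   1   3
temp = []
listify(T(2, T(1), T(3)), temp)
print(temp)   # [1, 2, 3]
```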
Exercise 10.4 Rewrite the other iteration techniques using the recursive approach.
Preorder traversal. Each node is visited before any of its children are visited.
Typically, we visit a node, and then each of the nodes in its left subtree,
followed by each of the nodes in the right subtree. A preorder traversal of
the expression tree in the margin visits the nodes in the order: =, r, +, 1, ∗, −, l, 1, and 2.
In-order traversal. Each node is visited after all the nodes of its left subtree
have been visited and before any of the nodes of the right subtree. The in-
order traversal is usually only useful with binary trees, but similar traversal
mechanisms can be constructed for trees of arbitrary arity. An in-order
traversal of the expression tree visits the nodes in the order: r, =, 1, +, l, −, 1, ∗, and 2. Notice that, while this representation is similar to
the expression that actually generated the binary tree, the traversal has
removed the parentheses.
Postorder traversal. Each node is visited after its children are visited. We visit
all the nodes of the left subtree, followed by all the nodes of the right
subtree, followed by the node itself. A postorder traversal of the expression
3 Reverse Polish Notation (RPN) was developed by Jan Lukasiewicz, a philosopher and mathemati-
cian of the early twentieth century, and was made popular by Hewlett-Packard in their calculator
wars with Texas Instruments in the early 1970s.
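The three orders can be checked on a small stand-in for the expression tree of r = 1 + (l - 1) * 2. The (value, left, right) tuple encoding below is an assumption of this sketch, not the book's BinTree:

```python
# (value, left, right); None marks an absent child
l_minus_1 = ("-", ("l", None, None), ("1", None, None))
product   = ("*", l_minus_1, ("2", None, None))
total     = ("+", ("1", None, None), product)
tree      = ("=", ("r", None, None), total)

def preorder(t):
    if t is None: return []
    v, l, r = t
    return [v] + preorder(l) + preorder(r)

def inorder(t):
    if t is None: return []
    v, l, r = t
    return inorder(l) + [v] + inorder(r)

def postorder(t):
    if t is None: return []
    v, l, r = t
    return postorder(l) + postorder(r) + [v]

print(preorder(tree))   # ['=', 'r', '+', '1', '*', '-', 'l', '1', '2']
print(inorder(tree))    # ['r', '=', '1', '+', 'l', '-', '1', '*', '2']
print(postorder(tree))
```

The in-order result reads like the original expression with the parentheses removed, as the text observes; the postorder result is the RPN form mentioned in the footnote.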
10.7 Traversals of Binary Trees
todo = Stack()
addNode(self,todo)
while not todo.empty:
    tree = todo.pop()
    yield tree.value
    addNode(tree.right,todo)
    addNode(tree.left,todo)
As we can see, todo is the private stack used to keep track of references to
unvisited nodes encountered while visiting other nodes. Another way to think
about it is that it is the frontier of nodes encountered on paths from the root
that have not yet been visited.
Figure 10.4 Three cases of determining the next current node for preorder traversals. Node A has a left child A′ as the next node; node B has no left, but a right child B′; and node C is a leaf and finds its closest “right cousin,” C′.
todo = Stack()
pushLeftDescendants(self,todo)
while not todo.empty:
    tree = todo.pop()
    yield tree.value
    pushLeftDescendants(tree.right,todo)
Since the first element considered in an in-order traversal is the leftmost descen-
dant of the root, pushing each of the nodes from the root down to the leftmost
descendant places that node on the top of the stack.
When the current node is popped from the stack, the next element of the
traversal must be found. We consider two scenarios:
1. If the current node has a right subtree, the nodes of that tree have not
been visited. At this stage we should push the right child, and all the
nodes down to and including its leftmost descendant, on the stack.
2. If the node has no right child, the subtree rooted at the current node has
been fully investigated, and the next node to be considered is the closest
unvisited ancestor of the former current node—the node just exposed on
the top of the stack.
As we shall see later, it is common to order the nodes of a binary tree so that
left-hand descendants of a node are smaller than the node, which is, in turn,
smaller than any of the rightmost descendants. In such a situation, the in-order
traversal plays a natural role in presenting the data of the tree in order. For this
reason, the __iter__ method returns the iterator formed from the inOrder generator.
    if not tree.left.empty:
        tree = tree.left
    else:
        tree = tree.right

todo = Stack()
pushToLeftLeaf(self,todo)
while not todo.empty:
    node = todo.pop()
    yield node.value
    if not todo.empty:
        # parent is on top (todo.peek())
        parent = todo.peek()
        if node is parent.left:
            pushToLeftLeaf(parent.right,todo)
Once the stack has been initialized, the top element on the stack is the leftmost
leaf, the first value to be visited. The node is popped from the stack and yielded.
Now, if a node has just been traversed (as indicated by the yield statement), it
and its descendants must all have been visited thus far in the traversal. Either
the todo stack is empty (we’re currently at the root of the traversal) or the top of
the stack is the parent of the current node. If the current node is its parent’s left
child, we speculatively continue the search through the parent’s right subtree. If
there is no right subtree, or if the current node is a right child, no further nodes
are pushed onto the stack and the next iteration will visit the parent node.
Since the todo stack simply keeps track of the path to the root of the traversal, the stack can be dispensed with by simply using the nodes that are singly linked through the parent pointers. One must be careful, however, to observe two
constraints on the stackless traversal. First, one must make sure that traversals
of portions of a larger tree are not allowed to “escape” through self, the vir-
tual root of the traversal. Second, since there is one BinTree.Empty, we have
carefully protected that node from being “re-parented”. It is important that the
traversal not accidentally step off of the tree being traversed into the seemingly
disconnected empty node.
todo = Queue()
addNode(self,todo)
while not todo.empty:
    tree = todo.dequeue()
    yield tree.value
    addNode(tree.left,todo)
    addNode(tree.right,todo)
The traversal is initialized by adding the root of the search to the queue (pre-
suming the traversed subtree is not empty). The traversal step involves remov-
ing a node from the queue. Since a queue is a first-in first-out structure, this is
the node that has waited the longest since it was originally exposed and added
to the queue. After we visit a node, we enqueue the children of the node (first the left, then the right). With a little work it is easy to see that these are either
nieces and nephews or right cousins of the next node to be visited; it is impos-
sible for a node and its descendants to be resident in the queue at the same
time.
Unlike the other iterators, this method of traversing the tree is meaningful
regardless of the degree of the tree. On the other hand, the use of the queue,
here, does not seem to be optional; the code would be made considerably more
complex if it were recast without the queue.
In each of these iterators we have declared utility functions (addNode, etc.) that have helped us describe the logical purpose behind the management of the linear todo structure. There is, of course, some overhead in declaring these functions, but from a code-clarity point of view, these functions help considerably in understanding the motivations behind the algorithm. The method addNode is common to two different traversals (preOrder and levelOrder), but the type of the linear structure is different. The use of the add method instead of the structure-specific equivalents (push and enqueue) might allow us a small degree of conceptual code reuse.
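That reuse can be sketched with the standard library's deque standing in for both Stack and Queue. The unified traverse function and its tuple-encoded tree are assumptions of this sketch, not the book's classes; the point is that swapping the pop discipline converts preorder into level order:

```python
from collections import deque

def traverse(tree, pop_newest):
    """Yield values; pop_newest=True behaves like a stack (preorder),
    pop_newest=False behaves like a queue (level order)."""
    todo = deque()
    def addNode(node):
        if node is not None:
            todo.append(node)
    addNode(tree)
    while todo:
        value, left, right = todo.pop() if pop_newest else todo.popleft()
        yield value
        if pop_newest:          # stack: push right first so left pops first
            addNode(right)
            addNode(left)
        else:                   # queue: left first preserves level order
            addNode(left)
            addNode(right)

#       1
#      / \
#     2   3
#    /
#   4
t = (1, (2, (4, None, None), None), (3, None, None))
print(list(traverse(t, True)))    # [1, 2, 4, 3]  preorder
print(list(traverse(t, False)))   # [1, 2, 3, 4]  level order
```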
@property
def root(self):
    """Root of containing tree."""
    result = self
    while result.parent is not None:
        result = result.parent
    return result
on the last level. The depth of this node below the root of the traversal is the
height of the tree.
To facilitate this computation we make use of a private level-order traversal,
_nodes, that yields not values, but BinTree nodes:
def _nodes(self):
    """A level-order traversal of the binary tree nodes in this tree."""
    def addNode(node,todo):
        if not node.empty:
            todo.add(node)
    todo = Queue()
    addNode(self,todo)
    while not todo.empty:
        node = todo.remove()
        yield node
        addNode(node.left,todo)
        addNode(node.right,todo)
For a tree with n nodes, this traversal takes O(n) time, and the todo structure may grow to include references to as many as ⌈n/2⌉ nodes.
Exercise 10.5 Rewrite the levelOrder generator based on the _nodes traversal.
Now, with the _nodes generator, we can write the height property by re-
turning the depth of the last node encountered.
@property
def height(self):
    """Height of the tree rooted at this node."""
    if self.empty:
        return -1
    node = list(self._nodes())[-1]
    # how far below self is the last lowest node?
    return node.pathLength(self)
Proof: We prove this by induction on the height of the tree. Suppose the tree has height 0. Then it has exactly one node, which is also a leaf. Since 2^(0+1) − 1 = 1, the observation holds, trivially.
Our inductive hypothesis is that perfect trees of height k < h have 2^(k+1) − 1 nodes. Since h > 0, we can decompose the tree into two perfect subtrees of height h − 1, under a common root. Each of the perfect subtrees has 2^((h−1)+1) − 1 = 2^h − 1 nodes, so there are 2(2^h − 1) + 1 = 2^(h+1) − 1 nodes. This is the result we sought to prove, so by induction on tree height we see the observation must hold for all perfect binary trees. ∎
This observation suggests that if we can compute the height and size of a
tree, we have a hope of detecting a perfect tree by comparing the height of the
tree to the number of nodes:
@property
def perfect(self):
    """Tree is full, with all leaves at the same depth."""
    if self.empty:
        return True
    nodes = list(self._nodes())
    last = nodes[-1]
    height = last.pathLength(self)
    return (1<<(height+1))-1 == len(nodes)
Notice the return statement makes use of shifting 1 to the left by height+1 binary places. This is equivalent to computing 2**(height+1). The result is that the function can detect a perfect tree in the time it takes to traverse the entire tree, O(n).
There is one significant disadvantage, though. If the tree is sparse but has height greater than, say, 100, the return statement must compute very large values, even though the tree itself is reasonably small.
Exercise 10.6 Rewrite the perfect method so that it avoids the manipulation of
large values for small trees.
We now prove some useful facts about binary trees that help us evaluate per-
formance of methods that manipulate them. First, we consider a pretty result:
if a tree has lots of leaves, it must branch in lots of places.
Observation 10.2 The number of full nodes in a binary tree is one less than the
number of leaves.
With this result, we can now demonstrate that just over half the nodes of a
perfect tree are leaves.
Proof: In a perfect binary tree, all nodes are either full interior nodes or leaves. The number of nodes is the sum of the number of full nodes F and the number of leaves L. Since, by Observation 10.2, F = L − 1, we know that the count of nodes is F + L = 2L − 1 = 2^(h+1) − 1. This leads us to conclude that L = 2^h and that F = 2^h − 1. ∎
This result demonstrates that for many simple tree methods half of the time is
spent processing leaves.
Using the _nodes traversal we can identify full trees by rejecting any tree
that contains a node with one child:
@property
def full(self):
    """All nodes are parents of 2 children or are leaves."""
    for node in self._nodes():
        if node.degree == 1:
            return False
    return True
Since this procedure can be accomplished by simply looking at all of the nodes,
it runs in O(n) time on a tree with n nodes.
We can recognize a complete binary tree by performing a level-order traver-
sal and realizing that any full nodes appear before any leaves. Between these
two types of nodes is potentially one node with just a left subtree.
@property
def complete(self):
    """Leaves of tree at one or two levels, deeper leaves at left."""
    allowableDegree = 2
    for node in self._nodes():
        degree = node.degree
        if degree > allowableDegree:
            return False
        if degree < 2:
            if not node.right.empty:
                return False
            allowableDegree = 0
    return True
Again, because all of the processing can be accomplished in a single pass of all
of the nodes, a complete tree can be identified in O(n) time.
Figure 10.6 The Mountains Huffman tree. Leaves are labeled with the characters they represent. Paths from root to leaves provide Huffman bit strings.
If each letter in the string is represented by 8 bits (as they often are), the entire
string takes 256 bits of storage. Clearly this catchy phrase does not use the full
range of characters, and so perhaps 8 bits are not needed. In fact, there are 13
distinct characters so 4 bits would be sufficient (4 bits can represent any of 16
values). This would halve the amount of storage required, to 128 bits.
If each character were represented by a unique variable-length string of bits,
further improvements are possible. Huffman encoding of characters allows us to
reduce the size of this string to only 111 bits by assigning frequently occurring
letters (like “o”) short representations and infrequent letters (like “a”) relatively
long representations.
Huffman encodings can be represented by binary trees whose leaves are the
characters to be represented. In Figure 10.6 left edges are labeled 0, while right
edges are labeled 1. Since there is a unique path from the root to each leaf, there
is a unique sequence of 1’s and 0’s encountered as well. We will use the string
of bits encountered along the path to a character as its representation in the
compressed output. Note also that no string is a prefix for any other (otherwise
one character would be an ancestor of another in the tree). This means that,
given the Huffman tree, decoding a string of bits involves simply traversing the tree and writing out the leaves encountered.
Figure 10.7 The Huffman tree of Figure 10.6, but with nodes labeled by total frequencies of descendant characters.
The construction of a Huffman tree is an iterative process. Initially, each
character is placed in a Huffman tree of its own. The weight of the tree is the
frequency of its associated character. We then iteratively merge the two most
lightweight Huffman trees into a single new Huffman tree whose weight is the
sum of weights of the subtrees. This continues until one tree remains. One
possible tree for our example is shown in Figure 10.7.
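The merge loop can be sketched with the standard library's heapq acting as the pool of weighted trees. The tuple-encoded trees, the tie-breaking counter, and the helper names below are assumptions of this sketch, not the book's classes; the loop itself is the repeated merging of the two most lightweight trees described above:

```python
import heapq
from collections import Counter

def flatten(t):
    """The characters at the leaves of a tuple-encoded tree."""
    if isinstance(t, tuple):
        return flatten(t[0]) + flatten(t[1])
    return [t]

def huffman_code_lengths(text):
    """Length, in bits, of each character's code after the merging process."""
    trees = [(w, i, ch) for i, (ch, w) in enumerate(Counter(text).items())]
    heapq.heapify(trees)
    lengths = {ch: 0 for _, _, ch in trees}
    counter = len(trees)                   # tie-breaker: never compare trees
    while len(trees) > 1:
        w1, _, t1 = heapq.heappop(trees)   # the two most lightweight trees
        w2, _, t2 = heapq.heappop(trees)
        heapq.heappush(trees, (w1 + w2, counter, (t1, t2)))
        counter += 1
        for ch in flatten(t1) + flatten(t2):
            lengths[ch] += 1               # every merged leaf sinks one level
    return lengths

lengths = huffman_code_lengths("aaabbc")
print(lengths)   # the frequent 'a' gets a shorter code than 'b' or 'c'
```

Each merge pushes every leaf of the two combined subtrees one level deeper, so the depth accumulated per character is exactly the length of its Huffman bit string.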
Our approach is to use BinTrees to maintain the structure. This allows the
use of recursion and easy merging of trees. Leaves of the tree carry descriptions
of characters and their frequencies:
@total_ordering
class leaf(object):
    __slots__ = ["frequency","ch"]

    def __init__(self,c):
        self.ch = c
        self.frequency = 1

    def __lt__(self,other):
        return self.ch < other.ch

    def __eq__(self,other):
        return self.ch == other.ch

    def __str__(self):
        return "leaf({} with frequency {})".format(self.ch,self.frequency)
Intermediate nodes carry no data at all. Their relation to their ancestors deter-
mines their portion of the encoding. The entire tree is managed by a wrapper
class, huffmanTree:
@total_ordering
class huffmanTree(object):
    __slots__ = ["root","totalWeight"]

    def __lt__(self,other):
        return (self.totalWeight < other.totalWeight) or \
               ((self.totalWeight == other.totalWeight) and
                (self.root.value < other.root.value))

    def __eq__(self,other):
        return (self.totalWeight == other.totalWeight) and \
               (self.root.value == other.root.value)

    def print(self):
        printTree(self.root,"")

    def __str__(self):
        if self.root.height == 0:
            return "huffmanTree({})".format(self.root.value)
        else:
            return "huffmanTree({},{},{})".format(self.totalWeight,
                                                  self.root.left,self.root.right)
This class is an ordered class because it implements the __lt__ method and is decorated with the @total_ordering decorator. That method allows the trees to be ordered by their total weight during the merging process. The utility method print generates our output recursively, building up a different encoding along every path.
We now consider the construction of the tree:
def main():
    s = stdin
    freq = LinkedList()
    for line in s:
        for c in line:
            if c == '\n':
                continue
            query = leaf(c)
            if query not in freq:
                freq.insert(0,query)
            else:
                item = freq.remove(query)
                item.frequency += 1
                freq.insert(0,item)
    if len(trees) > 0:
        ti = iter(trees)
        next(ti).print()
There are three phases in this method: the reading of the data, the construction
of the character-holding leaves of the tree, and the merging of trees into a single
encoding. Several things should be noted:
Again, the total number of bits that would be used to represent our com-
pressed phrase is only 111, giving us a compression rate of 56 percent. In these
days of moving bits about, the construction of efficient compression techniques
is an important industry—one industry that depends on the efficient implemen-
tation of data structures.
Figure 10.8 The genealogy of President Clinton, presented as a linear table. Each
individual is assigned an index i. The parents of the individual can be found at locations
2i and 2i + 1. Performing an integer divide by 2 generates the index of a child. Note the
table starts at index 1.
Exercise 10.7 Modify an existing list class so that it keeps track of the index of
its first element. This value may be any integer and should default to zero. Make
sure, then, that the item-based methods respect this new indexing.
One possible approach to storing tree information like this is to store entries
in key-value pairs in the list structure, with the key being the index. In this way,
the tree can be stored compactly and, if the associations are kept in an ordered
structure, they can be referenced with only a logarithmic slowdown.
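The index-value association can be sketched with an ordinary dict standing in for the ordered structure (the names and sample labels below are hypothetical; the arithmetic is the 2i, 2i + 1, and i // 2 scheme of Figure 10.8):

```python
# hypothetical table rooted at index 1; parents of individual i at 2i and 2i+1
ancestry = {
    1: "child",
    2: "father",                   # 2 * 1
    3: "mother",                   # 2 * 1 + 1
    6: "maternal grandfather",     # 2 * 3
    7: "maternal grandmother",     # 2 * 3 + 1
}

def parents(i):
    return ancestry.get(2 * i), ancestry.get(2 * i + 1)

def child_of(i):
    return ancestry.get(i // 2)    # integer divide by 2 finds the child

print(parents(3))      # ('maternal grandfather', 'maternal grandmother')
print(child_of(6))     # 'mother'
```

Only indices actually present are stored, which is the compactness the paragraph describes; replacing the dict with an ordered mapping gives the logarithmic lookup mentioned above.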
Exercise 10.8 Describe what would be necessary to allow support for trees with
degrees up to eight (called octtrees). At what cost do we achieve this increased
functionality?
10.11 Conclusions
The tree is a nonlinear structure. Because of branching in the tree, we will
find it is especially useful in situations where decisions can guide the process of
adding and removing nodes.
@abc.abstractmethod
def add(self):
    """Insert value into data structure."""
    ...

@abc.abstractmethod
def first(self):
    """The next value that will be removed."""
    ...

@abc.abstractmethod
def remove(self):
    ...

def values(self):
    """View of all items contained in data structure."""
    return View(self)

def __hash__(self):
    """Hash value for data structure."""
    # total starting constant from builtin tuplehash function
    total = 1000003
    for v in self.values():
        try:
            x = hash(v)
        except TypeError as y:
            raise y
        total = total + x
    return total

def __repr__(self):
    """Parsable string representation of data structure."""
    return self.__class__.__name__ + "(" + repr(list(self.values())) + ")"
Because they must be kept in order, the elements of a PriorityQueue are totally
ordered. In this interface the smallest values are found near the front of the
queue and will be removed soonest.1 The add operation is used to insert a new
value into the queue. At any time a reference to the minimum value can be
obtained with the first method and is removed with remove. A view method
exposes the elements through a traversal. The remaining methods are similar
to those we have seen before.
Notice that the PriorityQueue only extends the Sized interface. First, as
a matter of convenience, PriorityQueue methods consume and return val-
ues that must be orderable. Most structures we have encountered manipulate
unconstrained generic objects. Though similar, the PriorityQueue is not a
1 If explicit priorities are to be associated with values, the user may insert a tuple whose first value
is an ordered value such as an int. In this case, the associated value—the data element—need not
be ordered.
Queue. There is, for example, no dequeue method. Though this might be reme-
died, it is clear that the PriorityQueue need not act like a first-in, first-out
structure. At any time, the value about to be removed is the current minimum
value. This value might have been the first value inserted, or it might have just
recently “cut in line” before larger values. Still, the priority queue is just as
general as the stack and queue since, with a little work, one can associate with
inserted values a priority that forces any Linear behavior in a PriorityQueue.
The simplicity of the abstract priority queue makes its implementation relatively straightforward. In this chapter we will consider three implementations: one based on the use of a SortedList and two based on a novel structure called a heap. First, we consider an example that emphasizes the simplicity of our interface.
@property
def first(self):
    """The smallest value in the priority queue."""
    return self._data[-1]

@mutatormethod
def remove(self):
    """Remove the smallest value from the priority queue."""
    self._dirty = True
    return self._data.pop()

@mutatormethod
def add(self,value):
    """Add value to priority queue."""
    self._data.append(None)
    j = len(self._data)-1
    while (j > 0) and not (value < self._data[j-1]):
        self._data[j] = self._data[j-1]
        j -= 1
    self._data[j] = value
    self._dirty = True
The first operation takes constant time. The remove operation caches and
removes the first value of the SortedList with a linear-time complexity. This
can be easily avoided by reversing the way that the values are stored in the
11.4 A Heap Implementation
Principle 8 Avoid unnaturally extending a natural interface.
Exercise 11.1 Although the SortedList class does not directly support the PriorityQueue
interface, it nonetheless can be used, protected inside another class. What are the
advantages and disadvantages?
Definition 11.1 A heap is a binary tree whose root references the minimum value
and whose subtrees are, themselves, heaps.
Definition 11.2 A heap is a binary tree whose values are in ascending order on
every path from root to leaf.
Figure 11.1 Four heaps containing the same values. Note that there is no ordering among siblings. Only heap (b) is complete.
We will draw our heaps in the manner shown in Figure 11.1, with the minimum value on the top and the possibly larger values below. Notice that each of
the four heaps contains the same values but has a different structure. Clearly,
there is a great deal of freedom in the way that the heap can be oriented—for
example, exchanging subtrees does not violate the heap property (heaps (c)
and (d) are mirror images of each other). While not every tree with these four
values is a heap, many are (see Problems ?? and ??). This flexibility reduces the
friction associated with constructing and maintaining a valid heap and, there-
fore, a valid priority queue. When friction is reduced, we have the potential for
increasing the speed of some operations.
We will say that a heap is a complete heap if the binary tree holding the values of the heap is complete. Any set of n values may be stored in a complete heap. (To see this we need only sort the values into ascending order and place them
in level order in a complete binary tree. Since the values were inserted in as-
cending order, every child is at least as great as its parent.) The abstract notion
of a complete heap forms the basis for the first of two heap implementations of
a priority queue.
Figure 11.2 An abstract heap (top) and its list representation. Arrows from parent to child are not physically part of the list, but are indices computed by the heap’s left and right methods.
3. The right child of a value stored in location i may be found at the location
following the left child, location 2(i + 1) = (2i + 1) + 1.
def _left(i):
    """The index of the left child of i."""
    return 2*i+1

def _right(i):
    """The index of the right child of i."""
    return 2*(i+1)
The functions _parent, _left, and _right are standalone functions (that is, they're not declared as part of the Heap class) to indicate that they do not actually have to be called on any instance of a heap. Instead, their values are functions of their parameters only. Their names contain a leading underscore to emphasize that their use is limited to the Heap module.
Notice, in this mapping, that while the list is not maintained in ascending
order, any path from the root to a leaf does encounter values in ascending order.
Initially, the Heap is represented by an empty list that we expect to grow. If the
list is ever larger than necessary, slots not associated with tree nodes are set to
None. The Heap initialization method is fairly straightforward:
@checkdoc
class Heap(PriorityQueue,Freezable):
Figure 11.3 The addition of a value (2) to a list-based heap. (a) The value is inserted into a free location known to be part of the result structure. (b) The value is percolated up to the correct location on the unique path to the root.
parent = _parent(leaf)
value = self._data[leaf]
while (leaf > 0) and (value < self._data[parent]):
    self._data[leaf] = self._data[parent]
    leaf = parent
    parent = _parent(leaf)
self._data[leaf] = value
recursively, with the value sinking into smaller subtrees (Figure ??b), possibly
becoming a leaf. Since any single value is a heap, the recursion must stop by
the time the newly inserted value becomes a leaf.
Here is the private method associated with the pushing down of the root:
def _pushDownRoot(self,root):
    """Take a (possibly large) value at the root and two subtrees that
    are heaps, and push the root down to a point where it forms a
    single heap in O(log n) time."""
    heapSize = len(self)
    value = self._data[root]
    while root < heapSize:
        childpos = _left(root)
        if childpos < heapSize:
            if (_right(root) < heapSize) and \
               self._data[childpos+1] < self._data[childpos]:
                childpos += 1
            # Assert: childpos indexes smaller of two children
            if self._data[childpos] < value:
                self._data[root] = self._data[childpos]
                root = childpos
            else:
                self._data[root] = value
                return
        else:
            self._data[root] = value
            return
The remove method simply involves returning the smallest value of the heap,
but only after the rightmost element of the list has been pushed downward.
@mutatormethod
def remove(self):
    """Remove the minimum element from the Heap."""
    assert len(self) > 0
    self._dirty = True
    minVal = self.first
    self._data[0] = self._data[len(self)-1]
    self._data.pop()
    if len(self) > 1:
        self._pushDownRoot(0)
    return minVal
Each iteration in _pushDownRoot pushes a large value down into a smaller heap on a path from the root to a leaf. Therefore, the performance of remove is O(log n), an improvement over the behavior of the Prioritylist implementation.
Since we have implemented all the required methods of the PriorityQueue,
the Heap implements the PriorityQueue and may be used wherever a priority
queue is required.
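Python's standard library heapq module maintains exactly this kind of list-based min-heap, so it can serve as a quick sanity check of the behavior described above (it is an analog of, not the book's, Heap class). Using the values of Figure 11.2, elements drain in ascending order no matter what order they arrive in:

```python
import heapq

data = []
for v in [-1, 0, 1, 43, 3, 3, 2, 65, 58, 40, 42, 4]:
    heapq.heappush(data, v)       # percolate the new leaf upward

# repeatedly remove the root; each pop pushes the last leaf down
drained = [heapq.heappop(data) for _ in range(len(data))]
print(drained)   # values emerge in ascending order
```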
Figure 11.4 Removing a value from the heap shown in (a) involves moving the rightmost value of the list to the top of the heap as in (b). Note that this value is likely to violate the heap property but that the subtrees will remain heaps.
Figure 11.5 Removing a value (continued). In (a) the newly inserted value at the root is pushed down along a shaded path following the smallest children (lightly shaded nodes are also considered in determining the path). In (b) the root value finds, over several iterations, the correct location along the path. Smaller values shift upward to make room for the new value.
The advantage of the Heap mechanism is that, because of the unique mapping of complete trees to the list, it is unnecessary to explicitly store the connections between elements. Even though we are able to get improved performance over the Prioritylist, we do not have to pay a space penalty. The complexity arises, instead, in the code necessary to support the insertion and removal of values.
Notice that we keep track of the size of the heap locally, rather than asking the BinTree for its size. This is simply a matter of efficiency, but it requires us to maintain the value within the add and remove procedures. Once we commit
1. If the left heap has no left child, make the right heap the left child of the
left heap (see Figure 11.6b).
2. Otherwise, exchange the left and right children of the left heap. Then
merge (the newly made) left subheap of the left heap with the right heap
(see Figure 11.6d).
Notice that if the left heap has one subheap, the right heap becomes the left
subheap and the merging is finished. Here is the code for the merge method:
def _merge(self, left, right):
    """Merge two BinTree heaps into one."""
    if left.empty: return right
    if right.empty: return left
    leftVal = left.value
    rightVal = right.value
    if self._cmp(rightVal, leftVal) < 0:
        result = self._merge(right,left)
    else:
        result = left
        if result.left.empty:
            result.left = right
        else:
            temp = result.right
            result.right = result.left
            result.left = self._merge(temp,right)
    return result
Figure 11.6 Different cases of the merge method for SkewHeaps. In (a) one of the heaps is empty. In (b) and (c) the right heap becomes the left child of the left heap. In (d) the right heap is merged into what was the right subheap.
To remove the minimum value from the heap we must extract and return
the value at the root. To construct the smaller resulting heap we detach both
subtrees from the root and merge them together. The result is a heap with all
the values of the left and right subtrees, but not the root. This is precisely the
result we require. Here is the code:
@mutatormethod
def remove(self):
    """Remove the minimum element from the SkewHeap."""
    self._dirty = True
    result = self._root.value
    self._root = self._merge(self._root.left,self._root.right)
    self._count -= 1
    return result
The remaining priority queue methods for skew heaps are implemented in a
relatively straightforward manner.
Because a skew heap has unconstrained topology (see Problem ??), it is
possible to construct examples of skew heaps with degenerate behavior. For
example, adding a new maximum value can take O(n) time. For this reason
we cannot put very impressive bounds on the performance of any individual
operation. The skew heap, however, is an example of a self-organizing structure:
inefficient operations spend some of their excess time making later operations
run more quickly. If we are careful, time “charged against” early operations can
be amortized or redistributed to later operations, which we hope will run very
efficiently. This type of analysis can be used, for example, to demonstrate that
m > n skew heap operations applied to a heap of size n take no more than
O(m log n) time. On average, then, each operation takes O(log n) time. For applications where a significant number of requests will be made of a heap, this amortized performance is appealing.
11.5 Example: Circuit Simulation
Notice that there is a time associated with the set method. This helps us document when different events happen in the component. These events are simulated by a comparable Event class. This class describes a change in logic level on an input pin for some component. As the simulation progresses, Events are created and scheduled for simulation in the future. The ordering of Events is based on an event time. Here are the details:
from functools import total_ordering

@total_ordering
class Event(object):
    __slots__ = ["_time", "_level", "_c"]

    def __init__(self, time, connector, level):
        # constructor not shown in the original listing; this form matches how go uses the fields
        self._time = time
        self._c = connector
        self._level = level

    def go(self):
        self._c.component.set(self._time, self._c.pin, self._level)

    def __lt__(self, other):
        return self._time < other._time
def simulate():
    """Run the simulation until the event queue is empty.
    Returns the final clock time."""
    global EventQueue, time
    low = 0.0   # voltage of low logic
    high = 3.0  # voltage of high logic
    clock = 0.0
    while not EventQueue.empty:
        e = EventQueue.remove()
        # keep track of time
        clock = e._time
        # simulate the event
        e.go()
    print("-- circuit stable after {} ns --".format(clock))
    return clock
As events are processed, the logic level on a component's pins is updated. If the inputs to a component change, new Events are scheduled one gate delay later for each component connected to the output pin. For Sources and Probes, we write a message to the output indicating the change in logic level. Clearly, when there are no more events in the priority queue, the simulated circuit is stable. If the user is interested, he or she can change the logic level of a Source and resume the simulation by running the simulate method again.
We are now equipped to simulate the circuit of Figure 11.7. The first portion
of the following code sets up the circuit, while the second half simulates the
effect of toggling the input several times:
global time, EventQueue
low = 0   # voltage of low logic
high = 3  # voltage of high logic
EventQueue = Heap()

# set up circuit
inv = Inverter(0.2)
and_gate = And(0.8)
output = Probe("output")
input = Source("input", inv.pin(1))
input.connectTo(and_gate.pin(2))
inv.connectTo(and_gate.pin(1))
and_gate.connectTo(output.pin(1))

# simulate circuit
time = 0
time = simulate()
input.set(time+1.0, 0, high)  # first: set input high
time = simulate()
input.set(time+1.0, 0, low)   # later: set input low
time = simulate()
input.set(time+1.0, 0, high)  # later: set input high
time = simulate()
input.set(time+1.0, 0, low)   # later: set input low
simulate()
When run, the following output is generated:
1.0 ns: output now 0 volts
-- circuit stable after 1.0 ns --
2.0 ns: input set to 3 volts
2.8 ns: output now 3 volts
3.0 ns: output now 0 volts
-- circuit stable after 3.0 ns --
4.0 ns: input set to 0 volts
-- circuit stable after 5.0 ns --
6.0 ns: input set to 3 volts
6.8 ns: output now 3 volts
7.0 ns: output now 0 volts
When the input is moved from low to high, a short spike is generated on the
output. Moving the input to low again has no impact. The spike is generated by
the rising edge of a signal, and its width is determined by the gate delay of the
inverter. Because the spike is so short, it would have been difficult to detect it
using real hardware.2 Devices similar to this edge detector are important tools
for detecting changing states in the circuits they monitor.
11.6 Conclusions
We have seen three implementations of priority queues: one based on a list
that keeps its entries in order and two others based on heap implementations.
The list implementation demonstrates how any ordered structure may be
adapted to support the operations of a priority queue.
Heaps form successful implementations of priority queues because they relax
the conditions on “keeping data in priority order.” Instead of maintaining data
in sorted order, heaps maintain a competition between values that becomes
progressively more intense as the values reach the front of the queue. The cost
of inserting and removing values from a heap can be made to be as low as
O(log n).
If the constraint of keeping values in a list is too much (it may be impossible, for example, to allocate a single large chunk of memory), or if one wants to avoid the uneven cost of extending a list, a dynamic mechanism is useful. The SkewHeap is such a mechanism, keeping data in general heap form. Over a number of operations the skew heap performs as well as the traditional list-based implementation.
2 This is a very short period of time. During the time the output is high, light travels just over
2 inches!
Chapter 12
Search Trees
Structures are often the subject of a search. We have seen, for example, that binary search is a natural and efficient algorithm for finding values within ordered, randomly accessible structures. Recall that at each point the algorithm compares the value sought with the value in the middle of the structure. If they are not equal, the algorithm performs a similar, possibly recursive search on one side or the other. The pivotal feature of the algorithm, of course, was that the underlying structure was in order. The result was that a value could be efficiently found in approximately logarithmic time. Unfortunately, the modifying operations, add and remove, had complexities that were determined by the linear nature of the vector.
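The binary search strategy recalled here can be sketched, for an ordered Python list, as follows (illustrative code, not a listing from the text):

```python
def binary_search(a, value):
    """Return an index of value in sorted list a, or -1 if absent."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2     # compare against the middle element
        if a[mid] == value:
            return mid
        elif value < a[mid]:
            hi = mid - 1         # continue in the left half
        else:
            lo = mid + 1         # continue in the right half
    return -1
```

Each comparison discards half of the remaining candidates, which is exactly the behavior we hope to impose on a branching structure.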
Heaps have shown us that by relaxing our notions of order we can improve
on the linear complexities of adding and removing values. These logarithmic
operations, however, do not preserve the order of elements in any obviously
useful manner. Still, if we were somehow able to totally order the elements of
a binary tree, then an algorithm like binary search might naturally be imposed
on this branching structure.
When equal values are added to the tree, we have them on the left. This preference is arbitrary. If we assume that values equal to the root will be found in the left subtree, but in actuality some are located in the right, then we might expect inconsistent behavior from methods that search for these values. In fact, this is not the case.
def __init__(self, data=None, frozen=False, key=None, reverse=False):
    """Construct a SearchTree from an iterable source."""

@property
def empty(self):
    """SearchTree contains no items."""

@mutatormethod
def add(self, value):
    """Add value to SearchTree."""

def _insert(self, value):
    """Insert value into appropriate location in SearchTree in O(log n) time."""

def __contains__(self, value):
    """value is in SearchTree."""

@mutatormethod
def remove(self, value):
    """Remove value from SearchTree."""
Unlike the BinTree, the SearchTree provides only one iterator method. This method provides for an in-order traversal of the tree, which, with some thought, allows access to each of the elements in order. (Maybe even with no thought!)
class SymTab(object):
    __slots__ = ["_table"]

    def __init__(self):
        self._table = SearchTree()

    def __contains__(self, symbol):
        return KV(symbol, None) in self._table

    def __setitem__(self, symbol, value):
        pair = KV(symbol, value)
        if pair in self._table:
            self._table.remove(pair)
        self._table.add(pair)

    def __getitem__(self, symbol):
        a = KV(symbol, None)
        if a in self._table:
            a = self._table.get(a)
            return a.value
        else:
            return None

    def __delitem__(self, symbol):
        a = KV(symbol, None)
        if a in self._table:
            self._table.remove(a)

    def remove(self, symbol):
        a = KV(symbol, None)
        if a in self._table:
            a = self._table.remove(a)
            return a.value
        else:
            return None
Based on such a table, we might have a program that reads in a number of alias-name pairs terminated by the word END. After that point, the program prints out the fully translated aliases:
from sys import stdin

table = SymTab()
reading = True
for line in stdin:
    words = line.split()
    if len(words) == 0:
        continue
    if reading:
        if words[0] == 'END':
            reading = False
        else:
            table[words[0]] = words[1]
    else:
        name = words[0]
        while name in table:
            name = table[name]
        print(name)
Given the input:

three 3
one unity
unity 1
pi three
END
one
two
three
pi

the program generates the following output:

1
two
3
3
We will consider general associative structures in Chapter ??, when we discuss
dictionaries. We now consider the details of actually supporting the SearchTree
structure.
12.4 Implementation
In considering the implementation of a SearchTree, it is important to remember that we are implementing a Sorted. The methods of the Sorted accept and return values that are to be compared with one another. By default, we assume that the data are ordered and that the natural order is sufficient. If alternative orders are necessary, or an ordering is to be enforced on elements that do not directly implement a __lt__ method, alternative key functions may be used. Essentially, all that we depend upon is the compatibility of the key function and the elements of the tree.
We begin by noticing that a SearchTree is little more than a binary tree
with an imposed order. We maintain a reference to a BinTree and explicitly
keep track of its size. The constructor need only initialize these two fields and
suggest an ordering of the elements to implement a state consistent with an
empty binary search tree:
@checkdoc
class SearchTree(Sorted, Freezable):
    Tree = BinTree
    __slots__ = ["_root", "_count", "_dirty", "_frozen", "_hash", "_cmp"]

    def __init__(self, data=None, frozen=False, key=None, reverse=False):
        """Construct a SearchTree from an iterable source."""
        self._root = self.Tree.Empty
        self._count = 0
        self._frozen = False
        self._cmp = Comparator(key=key, reverse=reverse)
        if data:
            for x in data:
                self.add(x)
        self._frozen = frozen
        self._hash = None
Most of the work is accomplished by a protected method, _locate, that finds the correct location to insert the value; we then use that method as the basis for implementing the public methods: add, contains, and remove. Our approach to the method _locate is to have it return a reference to the location that identifies the correct point of insertion for the new value. This method, of course, makes heavy use of the ordering. Here is the Python code for the method:
def _locate(self, value):
    """A private method to return either
    (1) a node that contains the value sought, or
    (2) a node that would be the parent of a node created to hold value, or
    (3) an empty tree if the tree is empty."""
    result = node = self._root
    while not node.empty:
        nodeValue = node.value
        result = node
        relation = self._cmp(value, nodeValue)
        if relation == 0: return result
        node = node.left if relation < 0 else node.right
    return result
The approach of the _locate method parallels binary search. Comparisons are
made with the _root, which serves as a median value. If the value does not
match, then the search is refocused on either the left side of the tree (among
smaller values) or the right side of the tree (among larger values). In either
case, if the search is about to step off the tree, the current node is returned: if
the value were added, it would be a child of the current node.
Once the _locate method is written, the __contains__ method must check
to see if the node returned by _locate actually equals the desired value:2
def __contains__(self, value):
    """value is in SearchTree."""
    if self._root.empty: return False
    possibleLocation = self._locate(value)
    return 0 == self._cmp(value, possibleLocation.value)
2 We reemphasize at this point the importance of making sure that the __eq__ method for an object is consistent with the ordering suggested by the __lt__ method of the particular Comparator.
3 With a little thought, it is clear to see that this is a correct location. If there are two copies of a value in a tree, the second value added is a descendant and predecessor (in an in-order traversal) of the located value. It is also easy to see that a predecessor has no right child, and that if one is added, it becomes the predecessor.
@mutatormethod
def add(self, value):
    """Add value to SearchTree."""
    self._insert(value)

def _insert(self, value):
    """Insert value into appropriate location in SearchTree in O(log n) time."""
    self._dirty = True
    newNode = self.Tree(value)
    if self._root.empty:
        self._root = newNode
    else:
        insertLocation = self._locate(value)
        nodeValue = insertLocation.value
        if self._cmp(nodeValue, value) < 0:
            insertLocation.right = newNode
        else:
            if not insertLocation.left.empty:
                self._pred(insertLocation).right = newNode
            else:
                insertLocation.left = newNode
    self._count += 1
    return newNode
Our add code makes use of the protected “helper” function, _pred, which, after
calling _rightmost, returns a pointer to the node that immediately precedes
the indicated root:
def _pred(self, root):
    """Node immediately preceding root."""
    return self._rightmost(root.left)

def _rightmost(self, root):
    """Rightmost descendant of root."""
    result = root
    while not result.right.empty:
        result = result.right
    return result
A similar routine can be written for successor, and would be used if we preferred
to store duplicate values in the right subtree.
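The symmetric successor helpers might look like the following sketch, written here over a minimal stand-alone node class in which None plays the role of the empty tree (these names are illustrative, not the book's BinTree interface):

```python
class Node:
    """Minimal binary tree node for illustration."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def leftmost(root):
    """Leftmost descendant of root."""
    while root.left is not None:
        root = root.left
    return root

def succ(root):
    """Node immediately following root among its descendants:
    the leftmost node of the right subtree."""
    return leftmost(root.right)
```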
We now approach the problem of removing a value from a binary search
tree. Observe that if it is found, it might be an internal node. The worst case
occurs when the root of a tree is involved, so let us consider that problem.
There are several cases. First (Figure 12.2a), if the root of a tree has no left child, the right subtree can be used as the resulting tree. Likewise (Figure 12.2b), if there is no right child, we simply return the left. A third case (Figure 12.2c) occurs when the left subtree has no right child. Then, the right subtree, a tree with values no smaller than the left root, is made the right subtree of the left. The left root is returned as the result. The opposite circumstance could also be true.

Figure 12.2 The three simple cases of removing a root value from a tree.
We are, then, left to consider trees with a left subtree that, in turn, contains
a right subtree (Figure 12.3). Our approach to solving this case is to seek out
the predecessor of the root and make it the new root. Note that even though the
predecessor does not have a right subtree, it may have a left. This subtree can
take the place of the predecessor as the right subtree of a nonroot node. (Note
that this is the result that we would expect if we had recursively performed our
node-removing process on the subtree rooted at the predecessor.)
Finally, here is the Python code that removes the top BinTree of a tree and
returns the root of the resulting tree:
def _remove_node(self, top):
    """Remove the top of the binary tree pointed to by top.
    The result is the root of the tree to be used as replacement."""
    left, right = top.left, top.right
    top.left = top.right = self.Tree.Empty
    # There are three general cases:
    # - the root is empty (avoid): return same
    # - the root has fewer than two children: the one (or empty) is new top
    # - otherwise: the root's predecessor becomes the new top (a plausible
    #   completion follows; the original listing is truncated here)
    if left.empty: return right
    if right.empty: return left
    if left.right.empty:               # left has no right child (Figure 12.2c)
        left.right = right
        return left
    newTop = self._rightmost(left)     # the predecessor of top
    newTop.parent.right = newTop.left  # its left subtree takes its place
    newTop.left = left
    newTop.right = right
    return newTop
Figure 12.3 Removing the root of a tree with a rightmost left descendant.
With the combined efforts of the _remove_node and _locate methods, we can now simply locate a value in the search tree and, if found, remove it from the tree. We must be careful to update the appropriate references to rehook the modified subtree back into the overall structure.
Notice that inserting and removing elements in this manner ensures that the in-order traversal of the underlying tree delivers the values stored in the nodes in a manner that respects the necessary ordering. We use this, then, as our preferred iteration method.
def __iter__(self):
    """Iterator over the SearchTree."""
    return self._root.valueOrder()
Exercise 12.1 One possible approach to keeping duplicate values in a binary search
tree is to keep a list of the values in a single node. In such an implementation,
each element of the list must appear externally as a separate node. Modify the
SearchTree implementation to make use of these lists of duplicate values.
[Figure: a right rotation about a node y with left child x, and its inverse, a left rotation; subtrees A, B, and C keep their in-order positions.]
A rotation exchanges a node and one of its children. A right rotation takes a left child, x, of a node y and reverses their relationship. This induces certain obvious changes in connectivity of subtrees, but in all other ways, the tree remains the same. In particular, there is no structural effect on the tree above the original location of node y. A left rotation is precisely the opposite of a right rotation; these operations are inverses of each other.
The code for rotating a binary tree about a node is a method of the BinTree class. We show, here, rotateRight; a similar method performs a left rotation. (Finally, a right-handed method!)

def rotateRight(self):
    """Rotate this node (a left subtree) so it is the root."""
    parent = self.parent
    newRoot = self.left
    wasChild = parent != None
    wasLeft = self.isLeftChild
    # hook in new root (sets newRoot's parent, as well)
    self.left = newRoot.right
    newRoot.right = self
    if wasChild:
        if wasLeft:
            parent.left = newRoot
        else:
            parent.right = newRoot
For each rotation accomplished, the nonroot node moves upward by one
level. Making use of this fact, we can now develop an operation to splay a tree
at a particular node. It works as follows:
Figure 12.5 Two of the rotation pairs used in the splaying operation. The other cases are mirror images of those shown here.
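The rotation pairs of Figure 12.5 can be made concrete. The following stand-alone sketch (a minimal node class with parent pointers, not the book's BinTree) splays a node to the root using zig-zig and zig-zag pairs:

```python
class Node:
    """Minimal node with a parent pointer, for illustration."""
    def __init__(self, value):
        self.value = value
        self.parent = self.left = self.right = None

def rotate_up(x):
    """Rotate x above its parent (a right or left rotation, as appropriate)."""
    p, g = x.parent, x.parent.parent
    if p.left is x:                       # right rotation about p
        p.left, x.right = x.right, p
        if p.left: p.left.parent = p
    else:                                 # left rotation about p
        p.right, x.left = x.left, p
        if p.right: p.right.parent = p
    p.parent, x.parent = x, g
    if g:
        if g.left is p: g.left = x
        else: g.right = x

def splay(x):
    """Move x to the root with paired rotations."""
    while x.parent:
        p = x.parent
        g = p.parent
        if g and ((g.left is p) == (p.left is x)):
            rotate_up(p)                  # zig-zig: rotate the parent first
        elif g:
            rotate_up(x)                  # zig-zag: rotate x twice
        rotate_up(x)                      # zig (or the second rotation of the pair)
    return x
```

Note the zig-zig case rotates the parent before the node itself; performing the rotations in this order is what halves the depth of the access path.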
After the splay has been completed, the node x is located at the root of the
tree. If node x were to be immediately accessed again (a strong possibility),
the tree is clearly optimized to handle this situation. It is not the case that the
tree becomes more balanced (see Figure 12.5a). Clearly, if the tree is splayed at
an extremal value, the tree is likely to be extremely unbalanced. An interesting
feature, however, is that the depth of the nodes on the original path from x to
the root of the tree is, on average, halved. Since the average depth of these
nodes is halved, they clearly occupy locations closer to the top of the tree where
they may be more efficiently accessed.
To guarantee that the splay has an effect on all operations, we simply perform each of the binary search tree operations as before, but we splay the tree at the node accessed or modified during the operation. In the case of remove, we splay the tree at the parent of the value removed.
One difficulty with the splay operation is that it potentially modifies the structure of the tree. For example, the __contains__ method, a method normally considered nondestructive, potentially changes the underlying topology of the tree. This makes it difficult to construct iterators that traverse the SplayTree since the user may use the value found from the iterator in a read-only operation that inadvertently modifies the structure of the splay tree. This can have disastrous effects on the state of the iterator. (It can also wreck your day.) A way around this difficulty is to have the iterator keep only that state information that is necessary to help reconstruct, with help from the structure of the tree, the complete state of our traditional nonsplay iterator. In the case of the __iter__, we keep track of two references: a reference to an “example” node of the tree and a reference to the current node inspected by the iterator. The example node helps recompute the root whenever the iterator is reset. To determine what nodes would have been stored in the stack in the traditional iterator, the stack of unvisited ancestors of the current node, we consider each node on the (unique) path from the root to the current node. Any node whose left child is also on the path is an element of our “virtual stack.” In addition, the top of the stack maintains the current node (see Figure ??).

Figure 12.6 A splay tree iterator, the tree it references, and the contents of the virtual stack driving the iterator.
The constructor sets the appropriate underlying references and resets the iterator into its initial state. Because the SplayTree is dynamically restructuring, the root value passed to the constructor may not always be the root of the tree. Still, one can easily find the root of the current tree, given a node: follow parent pointers until one is None. Since the first value visited in an inorder traversal is the leftmost descendant, the reset method travels down the leftmost branch (logically pushing values on the stack) until it finds a node with no left child.
The current node points to, by definition, an unvisited node that is, logically, on the top of the outstanding node stack. Therefore, the hasNext and get methods may access the current value immediately.
All that remains is to move the iterator from one state to the next. The next method first checks to see if the current (just visited) element has a right child. If so, current is set to the leftmost descendant of the right child, effectively popping off the current node and pushing on all the nodes physically linking the current node and its successor. When no right descendant exists, the subtree rooted at the current node has been completely visited. The next node to be visited is the node under the top element of the virtual stack: the closest ancestor whose left child is also an ancestor of the current node.
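A stand-alone sketch of this successor computation follows, using a minimal node class with parent pointers (the names are illustrative, not the book's iterator):

```python
class Node:
    """Minimal binary tree node with parent pointers."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self

def leftmost(node):
    while node.left is not None:
        node = node.left
    return node

def successor(node):
    """In-order successor recovered from the tree shape alone (the 'virtual stack')."""
    if node.right is not None:          # descend: push the path down to the successor
        return leftmost(node.right)
    # climb: find the nearest ancestor whose left child is also an ancestor of node
    child, node = node, node.parent
    while node is not None and node.right is child:
        child, node = node, node.parent
    return node
```

Because the state is recomputed from parent pointers rather than stored in a stack, the traversal survives the restructuring performed by intervening splays.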
The iterator is now able to maintain its position through splay operations.
Again, the behavior of the splay tree is logarithmic when amortized over a
number of operations. Any particular operation may take more time to execute,
but the time is usefully spent rearranging nodes in a way that tends to make the
tree shorter.
From a practical standpoint, the overhead of splaying the tree on every operation may be hard to justify if the operations performed on the tree are relatively random. On the other hand, if the access patterns tend to generate degenerate binary search trees, the splay tree can improve performance.
Exercise 12.2 Describe a strategy for keeping a binary search tree as short as
possible. One example might be to unload all of the values and to reinsert them in
a particular order. How long does your approach take to add a value?
12.7 An Alternative: Red-Black Trees

3. Every path from a node to a descendant leaf contains the same number of black nodes.
The result of constructing trees with these rules is that the height of the tree
measured along two different paths cannot differ by more than a factor of 2:
two red nodes may not appear contiguously, and every path must have the
same number of black nodes. This would imply that the height of the tree is
O(log2 n).
Exercise 12.3 Prove that the height of the tree with n nodes is no worse than
O(log2 n).
The red-black coloring is maintained as the tree is probed and modified. The methods add and remove are careful to maintain
the red-black structure through at most O(log n) rotations and re-colorings of
nodes. For example, if a node that is colored black is removed from the tree, it
is necessary to perform rotations that either convert a red node on the path to
the root to black, or reduce the black height (the number of black nodes from
root to leaf) of the entire tree. Similar problems can occur when we attempt to
add a new node that must be colored black.
The code for red-black trees can be found online as RedBlackSearchTree.
While the code is too tedious to present here, it is quite elegant and leads to
binary search trees with very good performance characteristics.
The implementation of the RedBlackSearchTree structure in the structure package demonstrates another approach to packaging a binary search tree that is important to discuss. Like the BinTree structure, the RedBlackSearchTree is
defined as a recursive structure represented by a single node. The RedBlackSearchTree
also contains a dummy-node representation of the empty tree. This is useful in
reducing the complexity of the tests within the code, and it supports the notion
that leaves have children with color, but most importantly, it allows the user
to call methods that are defined even for red-black trees with no nodes. This
approach—coding inherently recursive structures as recursive classes—leads to
side-effect free code. Each method has an effect on the tree at hand but does not
modify any global structures. This means that the user must be very careful to
record any side effects that might occur. In particular, it is important that methods that cause modifications to the structure return the “new” value of the tree.
If, for example, the root of the tree was the object of a remove, that reference is
no longer useful in maintaining contact with the tree.
To compare the approaches of the SearchTree wrapper and the recursive RedBlackSearchTree, we present here the implementation of the SymTab structure we investigated at the beginning of the chapter, but cast in terms of RedBlackSearchTrees. Comparison of the approaches is instructive (important differences are highlighted with uppercase comments).
class RBSymTab(object):
    __slots__ = ["_table"]

    def __init__(self):
        self._table = RedBlackSearchTree()

    def __contains__(self, symbol):
        return KV(symbol, None) in self._table

    def __setitem__(self, symbol, value):
        a = KV(symbol, value)
        if a in self._table:
            self._table = self._table.remove(a)  # RECORD THE RETURNED TREE
        self._table = self._table.add(a)         # RECORD THE RETURNED TREE

    def __getitem__(self, symbol):
        a = KV(symbol, None)
        if a in self._table:
            a = self._table.get(a)
            return a.value
        else:
            return None

    def remove(self, symbol):
        a = KV(symbol, None)
        if a in self._table:
            a = self._table.get(a)
            self._table = self._table.remove(a)  # RECORD THE RETURNED TREE
            return a.value
        else:
            return None
12.8 Conclusions
A binary search tree is the product of imposing an order on the nodes of a binary
tree. Each node encountered in the search for a value represents a point where
a decision can be accurately made to go left or right. If the tree is short and
fairly balanced, these decisions have the effect of eliminating a large portion of
the remaining candidate values.
The binary search tree is, however, a product of the history of the insertion
of values. Since every new value is placed at a leaf, the internal nodes are left
untouched and make the structure of the tree fairly static. The result is that
poor distributions of data can cause degenerate tree structures that adversely
impact the performance of the various search tree methods.
To combat the problem of unbalanced trees, various rotation-based optimizations are possible. In splay trees, rotations are used to force a recently accessed value and its ancestors closer to the root of the tree. The effect is often to shorten degenerate trees, resulting in an amortized logarithmic behavior. A remarkable feature of this implementation is that there is no space penalty: no accounting information needs to be maintained in the nodes.
Chapter 13
Sets
It is often useful to keep track of an unordered collection of unique values. For this purpose, a set is the best choice. Python provides several built-in types that have set-like semantics, and we will investigate those in this chapter and the next. First, however, we think about the identity of objects.
class ListSet(Set):
    __slots__ = ["_data"]

    def __init__(self, data=None):
        """Create ListSet, possibly populated by an iterable."""
        self._data = []
        if data:
            for x in data:
                self.add(x)

    def add(self, value):
        """Add a value, if not already present."""
        if value not in self:
            self._data.append(value)

    def discard(self, value):
        """Remove a value, if present."""
        if value in self:
            self._data.remove(value)

    def clear(self):
        """Empty the set."""
        if self._data:
            self._data = []

    def __contains__(self, value):
        """Check for value in set."""
        return value in self._data

    def __iter__(self):
        """Return a traversal of the set elements."""
        return iter(self._data)

    def __len__(self):
        """Return number of elements in set."""
        return len(self._data)

    def __repr__(self):
        return "ListSet(" + repr(self._data) + ")"
To populate the set we incorporate values using add or, in the initializer, the element-by-element addition through iteration. Removal of values can be accomplished by discard, which removes the matching value if present and is silent otherwise,1 and clear, which empties the ListSet. While clear is a generic method of the MutableSet class, it is often more efficient when overridden for each class.
Here are some interactions of sets that become possible, even with this simple implementation:
>>> a = ListSet(range(10))
>>> b = ListSet(range(5,7))
>>> a < b
False
>>> b < a
True
>>> a == b
False
>>> a == set(range(9,-1,-1))
True
>>> a == range(9,-1,-1)
False
>>> a - b
ListSet([0, 1, 2, 3, 4, 7, 8, 9])
>>> a | b
ListSet([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a.remove(5)
>>> a & b
ListSet([6])
>>> a.isdisjoint(b)
False
>>> b.clear()
>>> len(b)
0

1 This is compared with the operation remove, which typically raises a ValueError if the value is not found.
Notice, by the way, that the set-combining operations generate new objects of the type found on the left side of the operation. Thus, taking a ListSet and removing values found in a builtin set generates a new ListSet.
Because the __contains__ method takes O(n) time, many operations are
linear. For example, add must first check for the presence of a value before
performing a constant-time append of the value to the list. This means, for example, that initialization from an iterable source of n values takes O(n^2) time.
The reader is encouraged to consider ways that the performance of basic operations can be improved, though it will be necessary to override the default methods from the abstract base class to get the best performance.
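As one illustration of such an improvement (a sketch, not the book's code), keeping the underlying list sorted makes membership tests logarithmic via the standard bisect module:

```python
from bisect import bisect_left

class SortedListSet:
    """A set over a sorted list: O(log n) membership, O(n) insertion."""
    def __init__(self, data=None):
        self._data = []
        for x in (data or []):
            self.add(x)

    def __contains__(self, value):
        i = bisect_left(self._data, value)   # binary search for the slot
        return i < len(self._data) and self._data[i] == value

    def add(self, value):
        i = bisect_left(self._data, value)
        if i == len(self._data) or self._data[i] != value:
            self._data.insert(i, value)      # insert keeps the list sorted

    def __len__(self):
        return len(self._data)
```

Insertion still costs O(n) for the shift, but __contains__, and with it duplicate detection, no longer scans the whole list.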
The reason that the __contains__ method is slow is that the value to be located may be found in any position within the list. If, however, we could predict where the value would be found if it were there, the contains method might be a constant-time operation. This observation is the basis for our next example of a set, a BitSet.
The bitwise-or operation simply computes a value whose individual bits are set if the bit is set in either (or both) of the composed values.

def _index(n):
    """Compute index of the array element that holds bit n."""
    return n//BitSet.WIDTH

def _offset(n):
    """Compute bit index of bit n in array element."""
    return n%BitSet.WIDTH

def _bit(n):
    """Generate a 1 in bit n, 0 otherwise."""
    return 1<<n

The value of BitSet.WIDTH would be 8 for this implementation, since we're using arrays of bytes to store the membership information. The conversion of index and bit offset back to an integer value is relatively simple: we compute index*BitSet.WIDTH+offset.
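This arithmetic can be checked directly; with a width of 8, bit 19, for example, lands in byte 2 at offset 3 (a quick stand-alone check mirroring the helpers, not part of the BitSet listing):

```python
WIDTH = 8   # bits per bytearray element

def index(n):
    return n // WIDTH    # which byte holds bit n

def offset(n):
    return n % WIDTH     # which bit within that byte

# the (index, offset) pair converts back to the element losslessly
assert index(19) == 2 and offset(19) == 3
assert index(19) * WIDTH + offset(19) == 19
```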
The BitSet keeps track of two items: the underlying bytearray, _bytes, and the number of elements in the set, _count, which is otherwise expensive to determine (you might think about how this might be accomplished). The initialization of this structure makes use of the clear method, which is responsible for resetting the BitSet into an empty state.
class BitSet(MutableSet, Freezable):
    WIDTH = 8
    __slots__ = ["_bytes", "_count"]

    def clear(self):
        """Discard all elements of set."""
        self._count = 0
        self._bytes = bytearray()
At any time the extent of the BitSet contains only the bytes that have been
necessary to encode the presence of values from 0 to the maximum value en-
countered during the life of the BitSet. For this reason, it is always necessary
to check to see if an operation is valid within the current extent and, on occa-
sion, to expand the extent of the set to allow relatively large, new values to be
added. The hidden method _probe checks the extent while _ensure expands
the range of values that may be stored.
def _probe(self, n):
    """Verify bit n is in the extent of the set."""
    index = _index(n)
    return index < len(self._bytes)

def _ensure(self, n):
    """Ensure that bit n is within the capacity of the set."""
    index = _index(n)
    increment = index - (len(self._bytes) - 1)
    if increment > 0:
        self._bytes.extend(increment * [0])
The action of the _ensure method is much the same as the extension of a list:
new bytes are only added as they are needed. Because the bytearray extend
operation might involve copying the entire array of bytes, an operation that is
linear in the size of the set, it might be preferred to have the extent expanded
through exponential growth. The reader is encouraged to revisit that discussion
on page ??.
The __contains__ operation is arguably the most important operation. It
first probes, eliminating the possibility of membership if the _probe fails. If the
_probe is successful the value may or may not be in the set, so it is important
that we check that the particular bit is set to 1.
def __contains__(self, n):
    """n is in set. Equivalent to __getitem__(n)."""
    if not self._probe(n): return False
    index = _index(n)
    bitnum = _offset(n)
    return 0 != (self._bytes[index] & _bit(bitnum))
While Python typically treats zero values and empty strings as False, that
interpretation only occurs when a boolean value is needed. In this case, however,
the result of the bitwise and (&) is an integer formed from the bits that
result from the operation. The value returned, then, is an integer, so to convert
it to a boolean value, the comparison with zero is necessary.
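A quick check illustrates the point:

```python
bits = 0b0010_0100
mask = 1 << 2
assert (bits & mask) == 4              # the raw result is an int, not a bool
assert isinstance(bits & mask, int)
assert (0 != (bits & mask)) is True    # the comparison yields a proper bool
```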
Two methods, add and discard, must set and clear bits. To add a value we
first ensure that the appropriate byte is available (if not already present) to be set.
def add(self, n):
    """Add n to set."""
    self._ensure(n)
    index = _index(n)
    bitnum = _offset(n)
    if not (self._bytes[index] & _bit(bitnum)):
        self._bytes[index] |= _bit(bitnum)
        self._count += 1
Because bits are being turned on, the important operation here is or. Of course,
we only add one to _count when the bit was not previously set.
The discard method removes values if present. It does not shrink the
bytearray, though it could. Here, we use _probe to exit quickly if n could
not possibly be in the BitSet. The correct bit is set to 0, and the count
decremented, only if it is currently 1.
def discard(self, n):
    """Remove n from self, if present; remain silent if not there."""
    if not self._probe(n): return
    index = _index(n)
    bitnum = _offset(n)
    if self._bytes[index] & _bit(bitnum):
        self._bytes[index] &= ~(_bit(bitnum))
        self._count -= 1
Here, because we’re trying to turn off the bit, we use an and operation with a
mask that is missing the bit to be cleared.
As we have seen, the bitwise operations | and & are used to force bits to be
1 or 0. If we know the current setting, it is possible to toggle or complement
an individual bit with the bitwise exclusive-or operator, ^. Since, in both the add
and discard operations, we know the current state and are simply interested
in changing the state, both operations could use the exclusive-or operation in a
symmetric manner.
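For example, exclusive-or toggles a bit regardless of its current value:

```python
byte = 0b0000_0000
bit = 1 << 3
byte ^= bit                 # the bit was 0: xor turns it on
assert byte == 0b0000_1000
byte ^= bit                 # the bit was 1: xor turns it off
assert byte == 0
```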
The __iter__ function for BitSets is particularly interesting. Unlike other
structures to be traversed, we may have to examine several locations before
we yield a single value. We use the enumerate iteration to produce individual
bytes and their respective index values. Only if a byte is not zero do we
check the individual bits. If one of these bits is set, we reconstruct the corresponding
number, based on its index (i) and bit offset (b). The expression
i*BitSet.WIDTH+b, we note, is the combining operation that is the inverse of
the _index and _offset hidden methods.
def __iter__(self):
    """(lazy) iterator of all elements in set."""
    for i, v in enumerate(self._bytes):
        if v != 0:
            for b in range(BitSet.WIDTH):
                if v & _bit(b):
                    yield (i * BitSet.WIDTH) + b
For very sparsely populated sets that cover a wide range of values, the __iter__
method may loop many times between production of values.
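A freestanding sketch of the same traversal over a raw bytearray:

```python
WIDTH = 8
data = bytearray([0b0000_0101, 0, 0b1000_0000])   # bits 0, 2, and 23 set
found = [i * WIDTH + b
         for i, v in enumerate(data) if v != 0    # skip empty bytes entirely
         for b in range(WIDTH) if v & (1 << b)]
assert found == [0, 2, 23]
```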
The BitSet structure is an efficient implementation of a Set because members
are stored at easily computed, dedicated locations. These
characteristics allow us to add and discard values from sets
in constant time. There are downsides, of course: sometimes we want to store
values that are not integers, and sparse sets may take up considerable space.
We now consider techniques that allow us to construct sets of objects other than
integers with some bounds on the space utilization.
The Hashable abstract base class, found in the collections.abc package, promises
the definition of the special method, __hash__. This method is called when the
builtin function hash is called with an object as its sole parameter. Many builtin
primitive types, including numeric types, booleans, and strings, all define this
method. Immutable container types, such as tuples, cannot be directly
modified and, thus, are hashable. Because the list is a dynamic container
type, a hash code is not available. In those situations where one wishes to
have a hashable list-like container, a tuple is usually first constructed from the
desired list.
Because we think of the comparison of hash values as a first approximation
to a full equality test of immutable objects, it is vital that the hash codes of equal
values be equal.
Principle 11 Equivalent objects (the same under __eq__) should return equiva-
lent hash codes (as computed by __hash__).
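Python's builtin types honor this principle, even across numeric types:

```python
assert 2 == 2.0
assert hash(2) == hash(2.0)          # equal values, equal hash codes
s1 = "data" + " structures"
s2 = "data structures"
assert s1 == s2 and hash(s1) == hash(s2)
```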
2 While it may be tricky to build good hash methods, bad hashing techniques do not generally cause
failure, just bad performance.
3 Actually, in CPython (a common version of Python on most platforms), the int -1 hashes to
-2. This is due to a technical constraint of the implementation of Python in C that is relatively
unimportant to this discussion. Still, do not be surprised if, when you expect a hash value of -1,
Python computes a value of -2.
4 Python also transparently supports Unicode strings. The discussion here applies equally well
Figure 13.1 Numbers of words from the UNIX spelling dictionary hashing to each of
the 997 buckets of a default hash table, if sum of characters is used to generate hash
code.
slot of a 997 element list.5 The periodic peaks demonstrate the fact that some
slots of the table are heavily preferred over others. The performance of looking
up and modifying values in the hash table will vary considerably, depending
on the slot that is targeted by the hash function. Clearly, it would be useful to
continue our search for a good mechanism.
Another approach might be to weight each character of the string by its
position. To ensure that even very short strings have the potential to generate
large hash values, we can provide exponential weights: the hash code for an
l-character string, s, is

    s[0]*c^0 + s[1]*c^1 + ... + s[l-1]*c^(l-1)

that is, the sum of s[i]*c^i for i from 0 to l-1.
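The formula can be sketched directly in Python (weighted_hash is a hypothetical name; c = 256 reproduces the poorly distributed scheme of Figure 13.3):

```python
def weighted_hash(s, c=256, table_size=997):
    """Sum of ord(s[i]) * c**i, reduced to a bucket index."""
    total = 0
    for i, ch in enumerate(s):
        total += ord(ch) * (c ** i)
    return total % table_size

# 'a'*1 + 'b'*2 = 97 + 196 = 293 when c == 2
assert weighted_hash("ab", c=2) == 293
assert 0 <= weighted_hash("alexandrite") < 997
```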
5 It is desirable to have the size of the list of target locations be a prime number because it spreads
out the keys that might collide because of poor hash definition. For example, if every hash code
were a multiple of 8, every cell in a 997 element array is a potential target. If the table had size
1024, any location that is not a multiple of 8 would not be targeted.
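The footnote's claim is easy to verify:

```python
# Hash codes that are all multiples of 8, distributed over two table sizes.
hits_997 = {(8 * k) % 997 for k in range(2000)}
hits_1024 = {(8 * k) % 1024 for k in range(2000)}
assert len(hits_997) == 997     # every bucket of the prime-sized table is reachable
assert len(hits_1024) == 128    # only 1 in 8 buckets of the 1024 table is ever used
```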
Figure 13.2 Frequency of dictionary words hashing to each of 997 buckets if characters
are weighted by powers of 2 to generate hash code. Some periodicity is observable.
Figure 13.3 Frequency of words from dictionary hashing to each of 997 buckets if
hash code is generated by weighting characters by powers of 256. Notice that bucket
160 has 2344 words, well above the expected average of 237.
Figure 13.4 Frequency of words from dictionary hashing to each of 997 buckets, using
the Python str hash code generation. The deviation is small and the distribution is
seeded with a random constant determined each Python session.
where p is a prime6 and r0 and r1 are random values selected when Python
starts up (see Figure 13.4). Because p is prime, the effect of multiplying by
powers of p is similar to shifting by a fractional number of bits; not only do
the bits move to the left, but they change value, as well. The values r0 and r1
introduce an element of randomness that works against collisions of particular
strings. It also means that the hash codes for individual runs of a program will
differ.7
As of Python 3.3, the hash computation for tuples computes a modified sum of hash
codes whose weights are related to the square of the distance the element is
from the end of the tuple.
Little about this technique is important to know other than that
the hash code of each element contributes to the hash code for the tuple. Because
of this, tuples that contain mutable values (like lists) do not have hash
codes, because lists, themselves, are not hashable. In addition, in CPython (the
standard for most platforms), hash codes can never be -1; if a hash of -1 is pro-
duced, it is mapped, automatically, to -2. The reason for this is an unimportant
implementation decision, but it means that all tuples whose elements are each
either -1 or -2 hash to exactly the same value!
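This CPython quirk is observable directly (the results below assume CPython):

```python
assert hash(-1) == -2 and hash(-2) == -2
# Consequently, these distinct tuples collide, since tuple hashing
# depends only on the hash codes of the elements:
assert hash((-1, -2)) == hash((-2, -1)) == hash((-2, -2))
```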
13.4.1 HashSets
We now implement a HashSet whose values are stored sparsely in a list. All ele-
ments in the table are stored in a fixed-length list whose length is, ideally, prime.
Initialization ensures that each slot within the array is set to None. Eventually,
slots will contain hashable values. We use a list for speed, but any container with
constant time access to elements would be a logical alternative.
@checkdoc
class HashSet(MutableSet,Freezable):
RESERVED = "Rent This Space"
SIZES = [3, 7, 17, 37, 67, 127, 257, 509, 997, 1999, 4001, 8011,
16001, 32003, 64007, 125003, 250007, 500009, 1000003]
__slots__ = ["_data", "_capacity", "_maxload", "_dirty", "_counter", "_frozen", "_hash"]
Figure 13.5 Hashing color names of antique glass. (a) Values are hashed into the first
available slot, possibly after rehashing. (b) The lookup process uses a similar approach
to possibly find values.
self._frozen = frozen
self._hash = None
Figure 13.6 (a) Deletion of a value leaves a shaded reserved cell as a place holder. (b)
A reserved cell is considered empty during insertion and full during lookup.
Figure 13.7 (a) Primary clustering occurs when two values that hash to the same slot
continue to compete during rehashing. (b) Rehashing causes keys that initially hash to
different slots to compete.
def _rehash(self, value, index):
    """Find another slot based on value and old hash index."""
    return (index + 1) % self.capacity
Figure 13.8 The keys of Figure 13.7 are rehashed by an offset determined by the
alphabet code (a = 1, b = 2, etc.) of the second letter. No clustering occurs, but strings
must have two letters!
    self._checkLoad(self._counter + 1)
    index = self._locate(value)
    if _free(self._data[index]):
        self._counter += 1
    self._data[index] = value
    self._dirty = True
The get function works similarly: we simply return the value that matches
in the table, or raise a KeyError if no equivalent value could be found.
def get(self, value):
    """Return object stored in HashSet that is equal to value."""
    if value not in self:
        raise KeyError("Value {} not found in HashSet.".format(value))
    index = self._locate(value)
    return self._data[index]
To discard a value from the HashSet, we locate the correct slot for the value
and, if found, we leave a reserved mark (HashSet.RESERVED) to maintain con-
sistency in _locate.
@mutatormethod
def discard(self, value):
    """Remove value from HashSet, if present, or silently do nothing."""
    index = self._locate(value)
    if _free(self._data[index]):
        return
    self._data[index] = HashSet.RESERVED
    self._dirty = True
    self._counter = self._counter - 1
Hash tables are efficiently traversed. Our approach is based on the list iter-
ator, yielding all values that are not “free”; values of None and HashSet.RESERVED
are logically empty slots in the table. When the iterator is queried for a value,
the underlying list is searched from the current point forward to find the next
meaningful reference. The iterator must eventually inspect every element of the
structure, even if very few of the elements are currently used.8
8 The performance of this method could be improved by linking the contained values together. This
would, however, incur an overhead on the add and discard methods that may not be desirable.
13.4.2 ChainedSets
Open addressing is a satisfactory method for handling hashing of data, if one
can be assured that the hash table will not get too full. When open addressing
is used on nearly full tables, it becomes increasingly difficult to find an empty
slot to store a new value.
One approach to avoiding the complexities of open addressing—reserved
values and table extension—is to handle collisions in a fundamentally different
manner. External chaining solves the collision problem by inserting all elements
that hash to the same bucket into a single collection of values. Typically, this
collection is an unordered list. The success of the hash table depends heavily on
the fact that the average length of the linked lists (the load factor of the table) is
small and the inserted objects are uniformly distributed. When the objects are
uniformly distributed, the deviation in list size is kept small and no list is much
longer than any other.
The process of locating the correct slot in an externally chained table in-
volves simply computing the initial hash for the key and “modding” by the table
size. Once the appropriate bucket is located, we verify that the collection is
constructed and the value in the collection is updated. Because Python’s list
classes do not allow element retrieval by value, we may have to remove and
reinsert the appropriate value.
@mutatormethod
def add(self, value):
    """Add value to ChainSet, regardless of whether it is already present."""
    index = self._locate(value)
    bucket = self._data[index]
    if bucket is None:
        self._data[index] = []
        bucket = self._data[index]
    if value in bucket:
        bucket.remove(value)
    else:
        self._counter += 1
    bucket.append(value)
    self._dirty = True
Most of the other methods are implemented in a similar manner: they locate
the appropriate bucket to get a list, then they search for the equivalent value
within the list.
One method, __iter__, essentially requires iteration over two dimensions
of the hash table. An outer loop searches for buckets in the hash table that
contain values, and an inner loop explicitly iterates across each bucket's list.
This is part of the price we must pay for being able to store arbitrarily large
numbers of values in each bucket of the hash table.
def __iter__(self):
    for l in self._data:
        if l is not None:
            for v in l:
                yield v

def __len__(self):
    """The number of objects stored in ChainSet."""
    return self._counter

def __str__(self):
    """String representation of ChainSet."""
    return str(list(self))

def __repr__(self):
    """Evaluable representation of this ChainSet."""
    return "ChainSet({!r})".format(list(self))
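A minimal external-chaining sketch, independent of the book's ChainSet (the names table and chain_add are hypothetical):

```python
capacity = 7
table = [None] * capacity          # buckets are created lazily

def chain_add(value):
    """Insert value into the bucket list chosen by hashing."""
    i = hash(value) % capacity
    if table[i] is None:
        table[i] = []
    if value not in table[i]:
        table[i].append(value)

for v in (3, 10, 17):              # all three collide: 3 % 7 == 10 % 7 == 17 % 7
    chain_add(v)
assert table[3] == [3, 10, 17]     # one bucket holds the whole chain
assert table[0] is None            # untouched buckets remain empty
```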
Figure 13.9 The time required to construct large ordered structures from random val-
ues.
Figure 13.10 The time required to construct small ordered structures from random
values.
the structure is immutable, but the value is not. Since KV types are considered
equal if their keys are equal, the simplest effective hashing mechanism (and the
one actually used) for a KV pair is to simply use the hash code of the key. For
this reason, we can construct HashSets and ChainSets of KV types.
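A sketch of such a KV pair (the attribute names here are assumptions, not the structure package's actual definition):

```python
class KV:
    """Key-value pair: equality and hashing depend only on the key."""
    __slots__ = ("key", "value")

    def __init__(self, key, value=None):
        self.key = key
        self.value = value

    def __eq__(self, other):
        return isinstance(other, KV) and self.key == other.key

    def __hash__(self):
        return hash(self.key)      # the mutable value never affects the hash

a, b = KV("alias", "bill"), KV("alias", "william")
assert a == b                      # same key, different values
assert hash(a) == hash(b)          # so the hash codes agree, as required
```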
For other types that are mutable, it is sometimes useful to construct frozen
versions that are immutable. For example, Python provides both sets and
frozensets. The reason for this (and the implementors of sets were undoubt-
edly sensitive to this) is that we often want to construct power sets or sets of sets.
For these purposes, (mutable) set objects are made immutable by constructing
immutable frozenset copies as they are added to the main set structure. Unfor-
tunately, the construction of frozen copies of structures is cumbersome enough
that few of Python’s classes have frozen alternates.
Another approach, and one we use in the structure package, is to provide
each structure with the ability to freeze or become immutable. This process
cannot be directly reversed other than by constructing a fresh, mutable copy.
As an example of the technique, we will now look more closely at the imple-
mentation of the LinkedList class. This class, recall, has all the semantics of
Python’s builtin list class: values can be added, removed, found and modified.
To freeze such a structure is to disable all methods that have the potential of
modifying the state of the hashable part of the structure.
First, let’s look at the __hash__ method, to see the scope of the problem:
@hashmethod
def __hash__(self):
    """Hash value for LinkedList."""
    if self._hash:
        return self._hash
    index = 73
    total = 91
    for v in self:
        try:
            x = hash(v)
        except TypeError as y:
            raise y
        total = total + (x * index)
        index = index + 1
    self._hash = total
    return total
Notice that the hash code for the structure is affected by the hash code for each
of the elements contained within the list. If we are to freeze the list, we will
have to freeze each of the elements that are contained. Freezing LinkedList
structures is deep or recursive. If we fail to freeze an element of the structure,
the element can be changed, thus affecting the hash code of the entire structure.
To indicate that a structure may be frozen, we have it extend the Freezable
abstract base class. This requires two methods, freeze and frozen:
class Freezable(Hashable):

    @abc.abstractmethod
    def freeze(self):
        """Freeze the structure, making it immutable."""
        ...

    @abc.abstractproperty
    def frozen(self):
        """Data structure is frozen."""
        ...
One approach to providing this interface is to supply a hidden _frozen
attribute. This boolean is True when the structure is immutable. When the
structure is mutable, _frozen is False. During initialization, we carefully
initialize these variables to allow the ability to populate the structure with values
from an iterable source. Typically, we provide the ability to specify the value for
a boolean keyword, frozen, to indicate that the structure is initially frozen.
def __init__(self, data=None, frozen=False):
    """Initialize a singly linked list."""
    self._hash = None
    self._head = None
    self._len = 0
    self._frozen = False
    if data is not None:
        self.extend(data)
    if frozen:
        self.freeze()
The two required methods for any container structure are typically written
as they are for LinkedLists:
def freeze(self):
    for v in self:
        if isinstance(v, Freezable):
            v.freeze()
    self._frozen = True
    self._hash = hash(self)

@property
def frozen(self):
    return self._frozen
Notice how the structure of freeze parallels that of __hash__. Also, because
freezing the structure will indirectly freeze the hash code, we compute and
cache this value, though it’s not logically necessary. Calling the __hash__ func-
tion before the structure is frozen causes an exception to be raised. This behav-
ior is indicated by the @hashmethod decorator.
@hashmethod
def __hash__(self):
During the life of a structure, its state is changed by methods that change or
mutate the state. These methods, of course, should not be callable if the object
is frozen. The structure package provides a method decorator, @mutatormethod,
that indicates which methods change the structure’s state.
@mutatormethod
def append(self, value):
    """Append a value at the end of the list (iterative version).

    Args:
        value: the value to be added to the end of the list.
    """
    ...
The effect is to wrap the function with a test to ensure that self.frozen is
False. If, on the other hand, the structure is frozen, an exception is raised.
Once frozen, a structure cannot be unfrozen. It may, however, be used as
an iterable source for initializing mutable structures. More efficient techniques
are possible with the copy package and Python’s serialization methods, called
pickling.
Chapter 14
Maps
One of the most important primitive data structures of Python is the dictionary,
dict. The dictionary is, essentially, a set of key-value pairs where the keys
are unique. In abstract terms, a dictionary is an implementation of a function,
where the domain of the function is the collection of keys, and the range is the
collection of values.
In this chapter we look at a simple reimplementation of the builtin dict
class, as well as a number of extensions of the notion of a map. First, we revisit
the symbol table application to make use of a dict.
14.1 Mappings
A mapping is a data structure that captures the correspondence between values
found in a domain of keys and members of a range of values. For most implementations,
the size of the range is finite, but it is not required. In Python, the
abstract base class for immutable mappings is collections.abc.Mapping, which
requires the implementation of a number of methods, the most important of
which is __getitem__. Recall that this method provides support for “indexing”
the structure. Mutable mappings require the implementation of methods that
can modify the mapping including __setitem__ and __delitem__.
The builtin class, dict, implements the methods required by MutableMapping.
As an example of the use of these mappings, we revisit the symbol table example
from the previous chapter.
reading = False
print("Table contains aliases: {}".format(table.keys()))
else:
table[words[0]] = words[1]
else:
name = words[0]
while name in table:
name = table[name]
print(name)
Here, we see that Python's native dict object can be used directly to maintain
the symbol map structure. The semantics of indexing are extended to allow us to
access values by an arbitrary hashable key. The Mapping interface demands, as
well, the construction of three different types of views, or collections of data:
keys returns a view of the valid keys in the dictionary, items returns a view of
the valid (key,value) 2-tuples, and values returns a view of a collection of the
values that correspond to keys. The values in this last view are not necessarily
unique or hashable. The keys view is used in our symbol table application to
identify those words that would be rewritten.
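A short demonstration with the builtin dict:

```python
table = {"bill": "william", "beth": "elizabeth"}
ks = table.keys()
assert set(ks) == {"bill", "beth"}
assert ("bill", "william") in table.items()

table["bob"] = "robert"            # views track later changes to the dict
assert "bob" in ks
```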
is unimportant (and cannot be predicted; that’s what we’re looking for!). The
result is the actual mapping we need. The get method performs a similar operation,
but has an optional second parameter which is the value to return if
there was no matching key-value pair in the set.
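For example:

```python
table = {"bill": "william"}
assert table.get("bill") == "william"
assert table.get("zoe") is None                  # missing key: default is None
assert table.get("zoe", "unknown") == "unknown"  # explicit fallback value
```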
Deletion of an association is determined by the appropriate key. The
__delitem__ method removes the pair from the underlying set.
@mutatormethod
def __delitem__(self, key):
    """Remove key-value pair associated with key from Dictionary."""
    self._kvset.remove(KV(key, None))
Notice that we use remove, which, unlike set's discard method, will raise an
exception if the item is not found. This is the behavior we desire in the MutableMapping
method. The pop method works similarly, but returns the value associated with
the key removed.
A number of other straightforward methods are required to meet the spec-
ification of a MutableMapping. The clear method removes all the key-value
associations. The easiest approach is to clear the underlying set.
def clear(self):
    """Remove all contents of Dictionary."""
    self._kvset.clear()
As with the set and list implementations, a popitem method is provided that
allows us to pick an arbitrary mapping element to remove and return. In most
implementations, the pair returned by popitem is simply the first one that would
be encountered in a traversal of the mapping.
@mutatormethod
def popitem(self):
    """Remove and return some key-value pair as a tuple.

    If Dictionary is empty, raise a KeyError."""
    if len(self) == 0:
        raise KeyError("popitem(): dictionary is empty")
    item = next(iter(self.items()))
    (key, value) = item
    self._kvset.remove(KV(key, None))
    return item
Because MutableMappings store key-value pairs, the value returned is an arbi-
trary pair formed into a (key,value) 2-tuple.
Views are objects that allow one to iterate (often multiple times) across a
subordinate target structure. When changes are made to the target, they are
immediately reflected in the view. Every view supports __len__ (it’s Sized),
__contains__ (it’s a Container), and __iter__ (it’s Iterable). All Mapping
types provide three methods that construct views of the underlying structure:
items, which returns a view of key-value tuples, keys, which returns a view of
the keys in the structure, and values, which returns a view of all the values
that appear in the structure. In the symbol map example, we saw how the keys
method could be used to generate a list of valid keys to the map. The Mapping
class provides generic implementations of each of the methods items, keys, and
values similar to the following:
def items(self):
    """View over all tuples of key-value pairs in Dictionary."""
    return ItemsView(self)

def keys(self):
    """View over all keys in Dictionary."""
    return KeysView(self)

def values(self):
    """View over all values in Dictionary."""
    return ValuesView(self)
Each type of view has a corresponding abstract base class that provides a generic
implementation of the appropriate functionality. These classes, ItemsView,
KeysView, and ValuesView, are found in the collections.abc package. Views
don't typically have internal state (other than a reference to the target structure),
but in the right circumstances (for example, when the target structure is
frozen) they provide an opportunity for efficiently caching read-only views of
structures when on-the-fly traversals with iterators are computationally costly.
An important benefit of the use of the HashSet class is the ability of this
set implementation to gracefully grow as the number of entries in the mapping
increases. Since most uses of mappings will actually involve a surprisingly small
number of key-value pairs, we initially set the capacity (the number of buckets
in the HashSet) to be relatively small. As the number of mapping entries in-
creases, this capacity increases as well. It is, then, relatively unimportant that
we estimate the capacity of the Mapping very accurately.
However, if we were to depend on a ChainSet, whose capacity determines
the fixed number of buckets to be used during the life of the structure, the
performance would degrade quickly if the capacity were underestimated. This is
because the performance of the structure is largely determined by the structure
that manages the multiple entries found in each bucket.

14.2 Tables: Sorted Mappings

One of the costs of having near-constant time access to key-value pairs using
hashing is the loss of order. If the keys can be ordered, it is natural to want to
have the key-value pairs stored in sorted order. In the structure package we call
sorted mappings, tables. The implementation of table structures is not logically
much different from the implementation of dictionaries; we simply keep our KV
pairs in a Sorted structure. Because of its (expected) speed, the SplayTree is a
natural target for such an implementation, though any Sorted structure would
work.
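As a stand-in for the SplayTree-backed Table, a sorted mapping can be sketched with bisect over parallel sorted lists (TinyTable and its attributes are hypothetical, not the structure package's implementation):

```python
import bisect

class TinyTable:
    """Sorted mapping sketch: keys kept sorted, values kept in parallel."""

    def __init__(self):
        self._keys = []
        self._values = []

    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value        # replace existing association
        else:
            self._keys.insert(i, key)      # insert, keeping keys sorted
            self._values.insert(i, value)

    def __getitem__(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)

    def __iter__(self):
        return iter(self._keys)

t = TinyTable()
t["beth"] = "elizabeth"
t["al"] = "albert"
assert list(t) == ["al", "beth"]           # keys come back in sorted order
assert t["beth"] == "elizabeth"
```

A real Table would trade the O(n) inserts of this sketch for the splay tree's amortized O(log n) operations.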
Internally, the Table collects the KV pairs in the _sorted attribute. Since the
table is a container, it is necessary to implement the __contains__ method,
15.1 Terminology
A graph G consists of a collection of vertices v ∈ V_G and relations or edges
(u, v) ∈ E_G between them (see Figure 15.1). An edge is incident to (or mentions)
each of its two component vertices. A graph is undirected if each of its edges
is considered a set of two unordered vertices, and directed if the mentioned
vertices are ordered (e.g., referred to as the source and destination). A graph S
is a subgraph of G if and only if V_S ⊆ V_G and E_S ⊆ E_G. Simple examples of
graphs include the list and the tree.

In an undirected graph, the number of edges (u, v) incident to a vertex u
is its degree. In a directed graph, the outgoing edges determine its out-degree
(or just degree) and incoming edges its in-degree. A source is a vertex with no
incoming edges, while a sink is a vertex with no outgoing edges.

Two edges (u, v) and (v, w) are said to be adjacent. A path is a sequence
of n distinct, adjacent edges (v_0, v_1), (v_1, v_2), ..., (v_{n-1}, v_n). In a simple path
the vertices are distinct, except for, perhaps, the end points v_0 and v_n. When
v_0 = v_n, the simple path is a cycle.

Two vertices u and v are connected (written u ↝ v) if and only if a simple
path of the graph mentions u and v as its end points. A subgraph S is a connected
component (or, often, just a component) if and only if S is a largest subgraph of
G such that for every pair of vertices u, v ∈ V_S either u ↝ v or v ↝ u.
(Components are always connected.)
Figure 15.1 Some graphs. Each node a is adjacent to node b, but never to d. Graph
G has two components, one of which is S. The directed tree-shaped graph is a directed,
acyclic graph. Only the top left graph is complete.
@property
def frozen(self):
    """Graph is frozen."""

def freeze(self):
    """Freeze graph to prevent adding and removing vertices and edges and
    to allow hashing."""
@abc.abstractmethod
def add(self, label):
    """Add vertex with label label to graph."""

@abc.abstractmethod
def add_edge(self, here, there, edge_label):
    """Add edge from here to there with its own label edge_label to graph."""

@abc.abstractmethod
def remove(self, label):
    """Remove vertex with label label from graph."""

@abc.abstractmethod
def remove_edge(self, here, there):
    """Remove edge from here to there."""

@abc.abstractmethod
def get(self, label):
    """Return actual label of indicated vertex.

    Return None if no vertex with label label is in graph."""

@abc.abstractmethod
def get_edge(self, here, there):
    """Return actual edge from here to there."""

@abc.abstractmethod
def __contains__(self, label):
    """Graph contains vertex with label label."""

@abc.abstractmethod
def contains_edge(self, here, there):
    """Graph contains edge from vertex with label here to vertex with
    label there."""

@abc.abstractmethod
def visit(self, label):
    """Set visited flag on vertex with label label and return previous
    value."""

@abc.abstractmethod
def visit_edge(self, edge):
    """Set visited flag on edge and return previous value."""
@abc.abstractmethod
def visited(self, label):
"""Vertex with label label has been visited."""
60
@abc.abstractmethod
def visited_edge(self, edge):
"""edge has been visited."""
@abc.abstractmethod 65
def reset(self):
"""Set all visited flags to false."""
@abc.abstractmethod
def __len__(self): 70
"""The number of vertices in graph."""
@abc.abstractmethod
def degree(self, label):
"""The number of vertices adjacent to vertex with label label."""
@abc.abstractproperty
def edge_count(self):
"""The number of edges in graph."""
80
@abc.abstractmethod
def __iter__(self):
"""Iterator across all vertices of graph."""
@abc.abstractmethod 85
def vertices(self):
"""View of all vertices in graph."""
@abc.abstractmethod
def neighbors(self, label): 90
"""View of all vertices adjacent to vertex with label label.
Must be edge from label to other vertex."""
@abc.abstractmethod
def edges(self): 95
"""View across all edges of graph."""
@abc.abstractmethod
def clear(self):
"""Remove all vertices from graph."""
100
15.2 The Graph Interface 283
@property
def empty(self):
"""Graph contains no vertices."""
@property
def directed(self): 105
"""Graph is directed."""
@abc.abstractmethod
def __hash__(self):
"""Hash value for graph.""" 110
def __eq__(self, other):
"""self has same number of vertices and edges as other and all
vertices and edges of self are equivalent to those of other."""
Because edges can be fully identified by their constituent vertices, edge operations
sometimes require pairs of vertex labels. Since it is useful to implement
both directed and undirected graphs, we can determine the type of a specific
graph using the directed property. In undirected graphs, the addition of an
edge effectively adds a directed edge in both directions. Many algorithms keep
track of their progress by visiting vertices and edges. This is so common that it
seems useful to provide direct support for adding (visit), checking (visited),
and removing (reset) marks on vertices and edges.
Two iterators—generated by __iter__ and edges—traverse the vertices and
edges of a graph, respectively. A special iterator—generated by neighbors—
traverses the vertices adjacent to a given vertex. From this information, outbound
edges can be determined.
Before we discuss particular implementations of graphs, we consider the
abstraction of vertices and edges. From the user’s point of view a vertex is a
label. Abstractly, an edge is an association of two vertices and an edge label. In
addition, we must keep track of objects that have been visited. These features
of vertices and edges are independent of the implementation of graphs; thus we
commit to an interface for these objects early. Let’s consider the _Vertex class.
@checkdoc
class _Vertex(object):

    @property
    def label(self):
        """The label of this _Vertex."""

    def visit(self):
        """Set visited flag true and return its previous value."""
        old_visited = self.visited
        self._visited = True
        return old_visited

    def reset(self):
        """Set visited flag false."""
        self._visited = False

    @property
    def visited(self):
        """This _Vertex has been visited."""
        return self._visited

    def __str__(self):
        """String representation of the _Vertex."""
        return repr(self)

    def __hash__(self):
        """Hash value for _Vertex."""
        if self._hash is None:
            try:
                self._hash = hash(self.label)
            except TypeError as y:
                raise y
        return self._hash
This class is similar to a KV pair: the label portion of the _Vertex cannot be
modified, but the visited flag can be freely set and reset. Two _Vertex objects
are considered equal if their labels are equal. It is a bare-bones interface. It
should also be noted that the _Vertex is a nonpublic class (thus the leading
underscore). Since a _Vertex is not visible through the Graph interface, there
is no reason for the user to have access to the _Vertex class.
Because the Edge class is a visible feature of the Graph interface (you might
ask why—see Problem ??), it is declared publicly:
@checkdoc
class Edge(object):

    @property
    def frozen(self):
        """Edge is frozen."""
        return self._frozen

    def freeze(self):
        """Freeze Edge."""
        self._frozen = True

    @property
    def here(self):
        """Starting vertex."""
        return self._here

    @property
    def there(self):
        """Ending vertex."""
        return self._there

    @property
    def label(self):
        """The label of this Edge."""
        return self._label

    @label.setter
    @mutatormethod
    def label(self, value):
        """Set label to value."""
        self._label = value

    def visit(self):
        """Set visited flag true and return its previous value."""
        old_visited = self._visited
        self._visited = True
        return old_visited

    @property
    def visited(self):
        """This Edge has been visited."""
        return self._visited

    @property
    def directed(self):
        """This Edge is directed."""
        return self._directed

    def reset(self):
        """Set visited flag false."""
        self._visited = False

    @hashmethod
    def __hash__(self):
        """Hash value for Edge."""
        if self._hash is None:
            try:
                self._hash = hash(self.here) + hash(self.there) + hash(self.label)
            except TypeError as y:
                raise y
        return self._hash

    def __repr__(self):
        """Parsable string representation for Edge."""
15.3 Implementations
Now that we have a good feeling for the graph interface, we consider traditional
implementations. (As "traditional" as this science gets, anyway!) Nearly every
implementation of a graph has characteristics of one of these two approaches.
Our approach to specifying these implementations, however, will be dramatically
impacted by the availability of object-oriented features. We first remind
ourselves of the importance of abstract base classes in Python and as a feature
of our design philosophy.
def __init__(self, directed=False):
    """Construct an abstract Graph. Specify if the graph is directed."""
    self._frozen = False
    self._directed = directed

@property
def directed(self):
    """Graph is directed."""
    return self._directed
When we must write code that is dependent on particular design decisions, we
delay it by decorating an abstract header for the particular method. For example,
we will need to add edges to our graph, but the implementation depends
on whether or not the graph is directed. Looking ahead, here is what the declaration
for add_edge looks like in the abstract Graph class:
@abc.abstractmethod
def add_edge(self, here, there, edge_label):
    """Add edge from here to there with its own label edge_label to graph."""
    ...
That's it! It is simply a promise that code will eventually be written.
Once the abstract class is described as fully as possible, we implement (and
possibly extend) it, in various ways, committing in each case to a particular
approach of storing vertices and edges. Because we declare the class GraphMatrix
to be an extension of the Graph class, all the code written for the abstract Graph
class is inherited; it is as though it had been written for the GraphMatrix class.
By providing the missing pieces of code (tailored for our matrix implementation
of graphs), the extension class becomes concrete. We can actually construct
instances of the GraphMatrix class.
A related concept, subtyping, allows us to use any extension of a class wherever
the extended class could be used. We call the class that was extended the
base type or superclass, and the extension the subtype or subclass. Use of subtyping
allows us to write code thinking of an object as a Graph and not necessarily
a GraphList. Because GraphMatrix is an extension of Graph, it is a Graph.
Even though we cannot construct a Graph, we can correctly manipulate concrete
subtypes using the methods described in the abstract class. In particular,
a call to the method add_edge calls the method of GraphMatrix because each
object carries information that will allow it to call the correct implementation.
In parallel, we develop an abstract base class for private _Vertex implementations
which, in each graph implementation, is made concrete by a dedicated
extension to a hidden, specific vertex class.
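The mechanism can be sketched with Python's abc module. The classes below (AbstractGraph and TinyGraph) are simplified stand-ins for illustration, not this chapter's actual code:

```python
# A minimal sketch of abstract base classes and subtyping. The abstract
# class promises add_edge and shares common code; the concrete subclass
# supplies the missing pieces and may then be instantiated.
import abc

class AbstractGraph(abc.ABC):
    @abc.abstractmethod
    def add_edge(self, here, there):
        """Add an edge from here to there."""

    def contains_edge_either_way(self, u, v):
        # shared code: written once here, inherited by every extension
        return self.contains_edge(u, v) or self.contains_edge(v, u)

class TinyGraph(AbstractGraph):
    def __init__(self):
        self._edges = set()

    def add_edge(self, here, there):
        self._edges.add((here, there))

    def contains_edge(self, here, there):
        return (here, there) in self._edges

g = TinyGraph()     # concrete subclass: may be constructed
g.add_edge(0, 1)
```

Attempting AbstractGraph() raises a TypeError, while a TinyGraph is a perfectly good AbstractGraph wherever one is expected; a call to add_edge dispatches to the TinyGraph implementation.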
We now return to our normally scheduled implementations!
    0 1 2 3 4
0   F F T F T
1   F F F F F
2   T F F T T
3   F F T F T
4   T F T T F
Figure 15.2 (a) An undirected graph and (b) its adjacency matrix representation. Each
nontrivial edge is represented twice across the diagonal—once in the gray and once in
the white—making the matrix symmetric.
         Destination
         0 1 2 3 4
Source 0 T T F F T
       1 F F T F F
       2 F F F T F
       3 F F F F F
       4 T F F T F
Figure 15.3 (a) A directed graph and (b) its adjacency matrix representation. Each
edge appears exactly once in the matrix.
1 If the vertices are totally ordered—say we can determine that u < v—then we can reduce the
cost of representing edges by a factor of two by storing information about equivalent edges (u, v)
and (v, u) with a boolean at [u][v].
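The footnote's idea might be sketched as follows. A plain dict stands in for the matrix, and the add_edge and contains_edge helpers are illustrative, not this chapter's methods:

```python
# A sketch of halving the storage for undirected edges: with totally
# ordered vertices, the edge {u, v} is stored once, under the
# normalized key (min(u, v), max(u, v)).
edges = {}

def add_edge(u, v, label=None):
    """Record the undirected edge {u, v} exactly once."""
    edges[(min(u, v), max(u, v))] = label

def contains_edge(u, v):
    """Look up the edge under its normalized key."""
    return (min(u, v), max(u, v)) in edges

add_edge(3, 1, "road")
```

Lookup is symmetric: contains_edge(1, 3) and contains_edge(3, 1) consult the same entry.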
manipulate an _index field. Each index is a small integer that identifies the
dedicated matrix row and column that maintains adjacency information about
each vertex. To help allocate the indices, we keep a free list (a set) of available
indices.
One feature of our implementation has the potential to catch the unwary
programmer by surprise. Because we use a mapping to organize vertex labels, it
is important that the vertex label class implement the __hash__ function in such
a way as to guarantee that if two labels are equal (using the __eq__ method),
they have the same hash code.
We can now consider the protected data and initializer for the GraphMatrix
class:
@checkdoc
class GraphMatrix(Graph):
in an inconsistent state.
By default, these Graph instances are undirected. By specifying the directed
keyword, one can change this behavior for the life of the graph. The graph can
also be seeded with an initial set of vertices and/or edges.
The add method adds a vertex. If the vertex already exists, the operation
does nothing. If it is new to the graph, an index is allocated from the free list,
a new _MatrixVertex object is constructed, and the label-vertex association is
recorded in the _matrix map. The newly added vertex mentions no edges,
initially.
@mutatormethod
def add(self, label):
    """Add vertex with label label to graph."""
    if label not in self:
        # index is last available space in _free
        # if _free is empty, index is len(self)
        if len(self._free) != 0:
            index = self._free.pop()
        else:
            index = len(self)
            # grow matrix: add row, column
            self._capacity = self._capacity + 1
            for there in self._matrix:
                there.append(None)
            self._matrix.append([None] * self._capacity)
        # increasing current size
        self._size = self._size + 1
        # create _Vertex object
        vertex = _MatrixVertex(label, index)
        # add vertex to _vertex_dict
        self._vertex_dict[label] = vertex
An important part of making the matrix space efficient is to keep a list of matrix
indices (_free) that are not currently in use. When all indices are in use and a
new vertex is added, the matrix is extended by adding a new row and column.
The new index is immediately used to support the vertex entry.
Removing a vertex reverses the add process. We must, however, be sure to
set each element of the vertex's matrix row and column to None, removing any
mentioned edges (we may wish to add a new, isolated vertex with this index in
the future). We then remove the vertex from the _matrix map and "recycle"
its index by adding it to the list of free indices. As with all of our remove
methods, we return the previous value of the label. (Even though the labels
match using __eq__, it is likely that they are not precisely the same. Once
returned, the user can extract any unknown information from the previous label
before the value is collected as garbage.)
@mutatormethod
def remove(self, label):
    """Remove vertex with label label from graph."""
    if label in self:
        # get the vertex object and ask for its index
        index = self._vertex_dict[label].index
        # set all entries in row index of _matrix to None
        # can't travel from here to any other vertex
        # count number of edges being removed and adjust _edge_count accordingly
        count = 0
        for i in range(len(self._matrix[index])):
            if self._matrix[index][i]:
                count = count + 1
            self._matrix[index][i] = None
        self._edge_count = self._edge_count - count
        # set all entries in column index of each _matrix row to None
        # can't travel from any other vertex to here
        count = 0
        for there in self._matrix:
            if there[index]:
                count = count + 1
            there[index] = None
        # in directed graph, these edges were counted separately
        # so need to subtract them from _edge_count
        if self.directed:
            self._edge_count = self._edge_count - count
        # current size is decreasing
        self._size = self._size - 1
        # add index to _free
        self._free.add(index)
        # remove vertex from _vertex_dict
        del self._vertex_dict[label]
Within the graph we store references to Edge objects. Each Edge records all
of the information necessary to position it within the graph, including whether it
is directed or not. This allows the equals method to work on undirected edges,
even if the vertices were provided in the opposite order (see Problem ??). To
add an edge to the graph, we require two vertex labels and an edge label (which
may be None). The vertex labels uniquely identify the vertices within the graph.
To add the edge, we construct a new Edge with the appropriate information;
this reference (which is, effectively, True) is written to the appropriate matrix
entries: undirected graphs update one or two locations; directed graphs update
just one. Here is the add_edge method:
@mutatormethod
def add_edge(self, here, there, edge_label=None):
    """Add edge from here to there with its own label edge_label to graph."""
    if here not in self:
        self.add(here)
The neighbors view traverses the matrix row, yielding all vertices that appear opposite the queried vertex:
class NeighborsViewMatrix(Iterable):

    def __iter__(self):
        g = self._target
        if self._label in g:
            # the _Vertex object with label label
            vertex = g._vertex_dict[self._label]
            row = g._matrix[vertex.index]
            for edge in [e for e in row if e is not None]:
                if edge.here != self._label:
                    yield edge.here
                else:
                    yield edge.there

    def __len__(self):
        return self._target.degree(self._label)
All that remains is to construct an iterator over the edges of the graph. In
this code we see a good example of the use of views: a single view is constructed
of all the values found in the vertex dictionary. This view is traversed
by two iterators in a nested loop, so it is important that these iterators act
independently. This approach ensures that we only consider those parts of the
matrix that are actively used to support the connectivity of the graph. From
these entries, we return those that are not None; these are edges of the graph.
For directed graphs, we return every Edge encountered. For undirected graphs,
we return only those edges that are found in the upper triangle of the matrix
(where the column index is at least as large as the row index).
@checkdoc
class EdgesViewMatrix(Iterable):
    """A view of all the edges of a GraphMatrix.
    Returned by the edges method of a GraphMatrix."""

    __slots__ = ["_target"]

    def __iter__(self):
        verts = self._target._vertex_dict.values()
        for here in verts:
            row = self._target._matrix[here.index]
            for there in verts:
                if self._target.directed or (there.index >= here.index):
                    this_edge = row[there.index]
                    if this_edge:
                        yield this_edge

    def __len__(self):
        return self._target.edge_count
class _ListVertex(_Vertex):

    __slots__ = ["_adjacencies"]

    def __init__(self, label):
        """Create a vertex with no adjacencies."""
        super().__init__(label)
        # store edges adjacent to this vertex
        # in directed graphs, store edges with this vertex as here (edges from this vertex)
        self._adjacencies = []
0: 2, 4
1:
2: 3, 0, 4
3: 2, 4
4: 3, 2, 0
Figure 15.4 (a) An undirected graph and (b) its adjacency list representation. Each
edge is represented twice in the structure. (Compare with Figure 15.2.)
0: 0, 4, 1
1: 2
2: 3
3:
4: 0, 3
Figure 15.5 (a) A directed graph and (b) its adjacency list representation. Each edge
appears once in the source list. (Compare with Figure 15.3.)
@checkdoc
class GraphList(Graph):

    __slots__ = ["_vertex_dict"]

    def __init__(self, vertices=None, edges=None, directed=False):
        """Construct a GraphList from an iterable source. Specify if the graph is directed."""
        super().__init__(directed)
        self._vertex_dict = {}
        if vertices:
            for vertex in vertices:
                self.add(vertex)
        if edges:
            for edge in edges:
                if len(edge) == 2:
                    self.add_edge(edge[0], edge[1])
                elif len(edge) == 3:
                    self.add_edge(edge[0], edge[1], edge[2])
                else:
                    raise KeyError("Incorrect parameters for initializing edge.")
Our approach to extending the abstract GraphList type to support directed and
undirected graphs is similar to that described in the adjacency matrix
implementation.
Adding a vertex (if it is not already there) involves simply adding it to the
underlying mapping.
@mutatormethod
def add(self, label):
"""Add vertex with label label to graph."""
if label not in self:
vertex = _ListVertex(label)
self._vertex_dict[label] = vertex
An edge is added to the appropriate adjacency lists. For a directed graph, we
insert the edge in the list associated with the source vertex. For an undirected
graph, a reference to the edge must be inserted into both lists. It is important,
of course, that a reference to a single edge be inserted in both lists so that
changes to the edge are maintained consistently.
@mutatormethod
def add_edge(self, here, there, edge_label=None):
    """Add edge from here to there with its own label edge_label to graph."""
    self.add(here)
    self.add(there)
    # get vertices from _vertex_dict
    here_vertex = self._vertex_dict[here]
    there_vertex = self._vertex_dict[there]
    # create edge
    new = Edge(here_vertex.label, there_vertex.label, edge_label, directed=self.directed)
    # tell here vertex that this edge is adjacent
    here_vertex.add_edge(new)
    # in undirected graph, this edge is also adjacent to there
    if not self.directed:
        there_vertex.add_edge(new)
Removing an edge is simply a matter of removing it from the source and (if
the graph is undirected) the destination vertex adjacency lists.
@mutatormethod
def remove_edge(self, here, there):
    """Remove edge from here to there."""
    if self.contains_edge(here, there):
        here_vertex = self._vertex_dict[here]
        there_vertex = self._vertex_dict[there]
        # create edge object that will be equal to the edge we want to remove from each vertex
        edge = Edge(here_vertex.label, there_vertex.label, directed=self.directed)
        old_edge = here_vertex.remove_edge(edge)
        # in a directed graph, this step is skipped:
        if not self.directed:
            there_vertex.remove_edge(edge)
        return old_edge.label
    else:
        return None
Notice that to remove an edge a "pattern" edge must be constructed to identify
(through __eq__) the target of the remove.
Many of the remaining edge and vertex methods have been greatly simplified
by our having extended the _Vertex class. Here, for example, is the degree
method:
def degree(self, label):
    """The number of vertices adjacent to vertex with label label."""
    if label in self:
        return self._vertex_dict[label].degree
This code calls the _ListVertex degree method. That, in turn, returns the
length of the underlying adjacency list. Most of the remaining methods are
simply implemented.
The view constructed by the edges method is made complex by the fact that a
single edge will be seen twice in an undirected graph. To account for this we
only report edges in the list of adjacencies for vertex when vertex is the source
of the edge.
@checkdoc
class EdgesViewList(Iterable):
    """A view of all the edges of a GraphList.
    Returned by the edges method of a GraphList."""

    __slots__ = ["_target"]

    def __iter__(self):
        for vertex in self._target._vertex_dict.values():
            for edge in vertex._adjacencies:
                if self._target.directed or (edge.here is vertex.label):
                    yield edge

    def __len__(self):
        return self._target.edge_count
15.4.1 Reachability
Once data are stored within a graph, it is often desirable to identify vertices
that are reachable from a common source (see Figure 15.6). One approach is
to treat the graph as you would a maze and, using search techniques, find the
reachable vertices. For example, we may use depth-first search: each time we
visit an unvisited vertex we seek to further deepen the traversal. The following
code demonstrates how we might use recursion to search for unvisited vertices:
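Such a recursive search might be sketched as follows, using a plain adjacency dict and an explicit visited set in place of the graph's per-vertex flags; the course names are illustrative, echoing Figure 15.6:

```python
# A sketch of recursive depth-first reachability: each time we visit an
# unvisited vertex, we deepen the traversal from it.
def reach(g, v, visited=None):
    """Return the set of vertices reachable from v."""
    if visited is None:
        visited = set()
    if v not in visited:
        visited.add(v)          # visit this vertex...
        for w in g[v]:          # ...then deepen from each neighbor
            reach(g, w, visited)
    return visited

prereqs = {
    'compilers': {'data structures', 'organization'},
    'data structures': {'discrete math'},
    'organization': set(),
    'discrete math': set(),
    'surfing': set(),
}
```

Calling reach(prereqs, 'compilers') marks every prerequisite of the compilers course, direct or indirect, while unrelated vertices (surfing) are left unvisited.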
Figure 15.6 Courses you might be expected to have taken if you're in a compiler design
class. (a) A typical prerequisite graph (classes point to prerequisites). Note the
central nature of data structures! (b) Bold courses can be reached as requisite courses
for compiler design.
Actually, the time stamps are useful only for purposes of illustration. In fact,
we can simply append vertices to the end of a list at the time that they would
Figure 15.7 The progress of a topological sort of the course graph. The time interval
following a node label indicates the time interval spent processing that node or its
descendants. Dark nodes reachable from compiler design are all processed during the
interval [1–16]—the interval associated with compiler design.
This algorithm is clearly O(|V|³): each iterator visits |V| vertices and (for
adjacency matrices) the check for existence of an edge can be performed in
constant time.
To see how the algorithm works, we number the vertices in the order they
are encountered by the vertex iterator. After k iterations of the outer loop,
all “reachability edges” of the subgraph containing just the first k vertices are
completely determined. The next iteration extends this result to a subgraph of
k + 1 vertices. An inductive approach to proving this algorithm correct (which
we avoid) certainly has merit.
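The computation described here is Warshall's transitive-closure algorithm; a minimal sketch over a boolean adjacency matrix, with integer indices standing in for vertex labels, might read:

```python
# Warshall's algorithm: after iteration k of the outer loop, reach[u][v]
# is True iff a path from u to v uses only the first k vertices as
# intermediates.
def warshall(adj):
    """Return the reachability matrix of a boolean adjacency matrix."""
    n = len(adj)
    reach = [row[:] for row in adj]     # copy; do not clobber the input
    for k in range(n):                  # allow k as an intermediate vertex
        for u in range(n):
            for v in range(n):
                reach[u][v] = reach[u][v] or (reach[u][k] and reach[k][v])
    return reach

adj = [
    [False, True,  False],   # 0 -> 1
    [False, False, True ],   # 1 -> 2
    [False, False, False],
]
```

After the run, reach[0][2] is True even though no edge joins 0 and 2 directly: the path passes through vertex 1.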
Clearly, edge labels could contain more information than just the path length.
For example, the path itself could be constructed, stored, and produced on
request, if necessary. Again, the complexity of the algorithm is O(|V|³). This is
satisfactory for dense graphs, especially if they're stored in adjacency matrices,
but for sparse graphs the checking of all possible edges seems excessive. Indeed,
other approaches can improve these bounds. We leave some of these for your
next course in algorithms!
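With path lengths in place of booleans, the same triple loop becomes Floyd's all-pairs shortest-path algorithm. A sketch, using math.inf to mark a missing edge:

```python
# Floyd's algorithm: relax every pair (u, v) through each possible
# intermediate vertex k; entries hold shortest known path lengths.
from math import inf

def floyd(dist):
    """Return the matrix of shortest path lengths between all pairs."""
    n = len(dist)
    d = [row[:] for row in dist]        # copy; do not clobber the input
    for k in range(n):
        for u in range(n):
            for v in range(n):
                if d[u][k] + d[k][v] < d[u][v]:
                    d[u][v] = d[u][k] + d[k][v]
    return d

dist = [
    [0,   3,   inf],   # 0 -> 1 costs 3
    [inf, 0,   4  ],   # 1 -> 2 costs 4
    [inf, inf, 0  ],
]
```

The shortest path from 0 to 2 has length 7 (through vertex 1), while vertex 0 remains unreachable from 2.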
[Figure legend: "Connecting the world. All distances are approximate nautical miles."]
Figure 15.8 The progress of a minimum spanning tree computation. Bold vertices
and edges are part of the tree. Harrisburg and Trenton, made adjacent by the graph's
shortest edge, are visited first. At each stage, a shortest external edge adjacent to the
tree is incorporated.
if not graph.visited(v):
searching = False
graph.visit_edge(e)
2 We use a key function that targets the label of the edge here, which would be typical for this algorithm. Another possibility would be to extend the Edge class to totally order edges based on their label.
Short edges are removed from the queue until an unvisited vertex is
mentioned by a new tree edge (e), or the queue is emptied.
Over the course of the algorithm, consideration of each edge and vertex
results in a priority queue operation. The running time, then, is
O((|V| + |E|) log |V|).
Notice that the first edge added may not be the graph’s shortest.
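The whole procedure (Prim's algorithm) might be sketched in standalone form, with heapq as the priority queue and a weighted adjacency dict in place of the book's graph and its edge labels. The city names and distances below are illustrative, not Figure 15.8's exact data:

```python
# A sketch of Prim's minimum-spanning-tree computation: repeatedly
# incorporate a shortest edge leaving the tree built so far.
import heapq

def prim(g, start):
    """Return a set of (weight, u, v) edges forming a spanning tree."""
    visited = {start}
    tree = set()
    # seed the queue with every edge leaving the start vertex
    queue = [(w, start, v) for v, w in g[start].items()]
    heapq.heapify(queue)
    while queue:
        w, u, v = heapq.heappop(queue)   # shortest external edge
        if v not in visited:
            visited.add(v)
            tree.add((w, u, v))
            for x, wx in g[v].items():
                if x not in visited:
                    heapq.heappush(queue, (wx, v, x))
    return tree

g = {
    'Harrisburg': {'Trenton': 100, 'Dover': 150},
    'Trenton': {'Harrisburg': 100, 'Dover': 120},
    'Dover': {'Harrisburg': 150, 'Trenton': 120},
}
```

Here the Harrisburg-Trenton edge (the shortest) is incorporated first, then Trenton-Dover; the 150-mile edge is popped later but discarded because both of its ends are already in the tree.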
while v:
    if v not in result:
        # a new city is considered; record result
        result[v] = possible
        vdist = possible[0]
        # add all outgoing roads to priority queue
        for w in g.neighbors(v):
            (distance, highway) = g.get_edge(v, w).label
            q.add((distance + vdist, v, w, highway))
    # find next closest city to start
    if len(q) == 0: break  # break out of this loop
    possible = q.remove()
    (distance, somewhere, v, highwayname) = possible
return result
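A self-contained sketch of this single-source shortest-path computation (Dijkstra's algorithm) can be written with heapq as the priority queue and a plain weighted adjacency dict standing in for the graph and its Edge labels. The example distances reproduce the vertex annotations suggested by the accompanying figure (a:0, f:3, e:7, b:8, d:12):

```python
# Dijkstra's algorithm: the first time a vertex is removed from the
# priority queue, its recorded distance is the shortest possible.
import heapq

def dijkstra(g, start):
    """Map each reachable vertex to its shortest distance from start."""
    result = {}
    queue = [(0, start)]
    while queue:
        dist, v = heapq.heappop(queue)
        if v not in result:              # first arrival is the shortest
            result[v] = dist
            for w, weight in g[v].items():
                if w not in result:
                    heapq.heappush(queue, (dist + weight, w))
    return result

g = {
    'a': {'b': 8, 'f': 3},
    'b': {'a': 8},
    'f': {'a': 3, 'e': 4},
    'e': {'f': 4, 'd': 5},
    'd': {'e': 5},
}
```

Vertex e is reached at distance 7 (a to f to e), not by a direct edge, and d at 12; vertices with no path from a would simply be absent from the result.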
[Figure: successive stages of a shortest-path computation; each vertex label is annotated with its shortest known distance from the source, e.g. a:0, f:3, e:7, b:8, d:12.]
15.5 Conclusions
In this chapter we have investigated two traditional implementations of graphs.
The adjacency matrix stores information about each edge in a square matrix
while the adjacency list implementation keeps track of edges that leave each
vertex. The matrix implementation is ideal for dense graphs, where the number
of actual edges is high, while the list implementation is best for representing
sparse graphs.
Our approach to implementing graph structures is to use partial implementations,
called abstract classes, and extend them until they are concrete, or complete.
Other methods are commonly used, but this has the merit that common
code can be shared among similar classes. Indeed, this inheritance is one of the
features commonly found in object-oriented languages.
This last section is, in effect, a stepping stone to an investigation of algorithms.
There are many approaches to answering graph-related questions, and
because of the dramatic differences in complexities in different implementations,
the solutions are often affected by the underlying graph structure.
Finally, we note that many of the seemingly simple graph-related problems
cannot be efficiently solved with any reasonable representation of graphs. Those
problems are, themselves, a suitable topic for many future courses of study.
Appendix A
Answers
This section contains answers to many problems presented to the reader in the
text. In the first section, we provide answers to self check problems. In the
second section, we provide answers to many of the problems from the text.