
ELEMENTS OF AI

CHAPTER 1
How should we define AI?
As you have probably noticed, AI is currently a “hot topic”: media coverage and
public discussion about AI are almost impossible to avoid. However, you may
also have noticed that AI means different things to different people. For some,
AI is about artificial life-forms that can surpass human intelligence, and for
others, almost any data processing technology can be called AI.
To set the scene, so to speak, we’ll discuss what AI is, how it can be defined,
and what other fields or technologies are closely related. Before we do so,
however, we’ll highlight three applications of AI that illustrate different aspects
of AI. We’ll return to each of them throughout the course to deepen our
understanding.
Application 1. Self-driving cars
Self-driving cars require a combination of AI techniques of many kinds: search
and planning to find the most convenient route from A to B, computer vision
to identify obstacles, and decision making under uncertainty to cope with the
complex and dynamic environment. Each of these must work with almost
flawless precision in order to avoid accidents.
The same technologies are also used in other autonomous systems such as
delivery robots, flying drones, and autonomous ships.
Implications: road safety should eventually improve as the reliability of the
systems surpasses human level. The efficiency of logistics chains when moving
goods should improve. Humans move into a supervisory role, keeping an eye
on what’s going on while machines take care of the driving. Since
transportation is such a crucial element in our daily life, it is likely that there
are also some implications that we haven’t even thought about yet.
Application 2. Content recommendation
A lot of the information that we encounter in the course of a typical day is
personalized. Examples include Facebook, Twitter, Instagram, and other social
media content; online advertisements; music recommendations on Spotify;
movie recommendations on Netflix, HBO, and other streaming services. Many
online publishers such as newspapers’ and broadcasting companies’ websites
as well as search engines such as Google also personalize the content they
offer.
While the front page of the printed version of the New York Times or China
Daily is the same for all readers, the front page of the online version is different
for each user. The algorithms that determine the content that you see are
based on AI.
Implications: while many companies don’t want to reveal the details of their
algorithms, being aware of the basic principles helps you understand the
potential implications: these involve so called filter bubbles, echo-chambers,
troll factories, fake news, and new forms of propaganda.
Application 3. Image and video processing
Face recognition is already a commodity used in many customer, business, and
government applications such as organizing your photos according to people,
automatic tagging on social media, and passport control. Similar techniques
can be used to recognize other cars and obstacles around an autonomous car,
or to estimate wildlife populations, just to name a few examples.
AI can also be used to generate or alter visual content. Examples already in use
today include style transfer, by which you can adapt your personal photos to
look like they were painted by Vincent van Gogh, and computer generated
characters in motion pictures such as Avatar, the Lord of the Rings, and
popular Pixar animations where the animated characters replicate gestures
made by real human actors.
Implications: when such techniques advance and become more widely
available, it will be easy to create natural looking fake videos of events that are
impossible to distinguish from real footage. This challenges the notion that
“seeing is believing”.
What is, and what isn’t AI? Not an easy question!

Reason 1: no officially agreed definition


Even AI researchers have no exact definition of AI. Rather, the field is
constantly being redefined: some topics are classified as non-AI, while new
topics emerge.
There’s an old (geeky) joke that AI is defined as “cool things that computers
can’t do.” The irony is that under this definition, AI can never make any
progress: as soon as we find a way to do something cool with a computer, it
stops being an AI problem. However, there is an element of truth in this
definition. Fifty years ago, for instance, automatic methods for search and
planning were considered to belong to the domain of AI. Nowadays such
methods are taught to every computer science student. Similarly, certain
methods for processing uncertain information are becoming so well
understood that they are likely to be moved from AI to statistics or probability
very soon.
Reason 2: the legacy of science fiction
The confusion about the meaning of AI is made worse by the visions of AI
present in various literary and cinematic works of science fiction. Science
fiction stories often feature friendly humanoid servants that provide overly-
detailed factoids or witty dialogue, but can sometimes follow the steps of
Pinocchio and start to wonder if they can become human. Another class of
humanoid beings in sci-fi espouse sinister motives and turn against their
masters in the vein of old tales of sorcerers’ apprentices, going back to
the Golem of Prague and beyond.

Often the robothood of such creatures is only a thin veneer on top of a very
humanlike agent, which is understandable as most fiction – even science
fiction – needs to be relatable by human readers who would otherwise be
alienated by intelligence that is too different and strange. Most science fiction
is thus best read as metaphor for the current human condition, and robots
could be seen as stand-ins for repressed sections of society, or perhaps our
search for the meaning of life.

Reason 3: what seems easy is actually hard...


Another source of difficulty in understanding AI is that it is hard to know which
tasks are easy and which ones are hard. Look around and pick up an object in
your hand, then think about what you did: you used your eyes to scan your
surroundings, figured out which of the objects around you were suitable for
picking up, chose one of them, and planned a trajectory for your hand to reach it,
then moved your hand by contracting various muscles in sequence and
managed to squeeze the object with just the right amount of force to keep it
between your fingers.

It can be hard to appreciate how complicated all this is, but sometimes it
becomes visible when something goes wrong: the object you pick is much
heavier or lighter than you expected, or someone else opens a door just as you
are reaching for the handle, and then you can find yourself seriously out of
balance. Usually these kinds of tasks feel effortless, but that feeling belies
millions of years of evolution and several years of childhood practice.

While this is easy for you, grasping objects is extremely hard for a robot, and it
is an area of active study. Recent examples include the Boston Dynamics
robots.

...and what seems hard is actually easy


By contrast, the tasks of playing chess and solving mathematical exercises can
seem to be very difficult, requiring years of practice to master and involving
our “higher faculties” and concentrated conscious thought. No wonder that
some initial AI research concentrated on these kinds of tasks, and it may have
seemed at the time that they encapsulate the essence of intelligence.

It has since turned out that playing chess is very well suited to computers,
which can follow fairly simple rules and compute many alternative move
sequences at a rate of billions of computations a second. Computers beat the
reigning human world champion in chess in the famous Deep Blue vs Kasparov
matches in 1997. Could you have imagined that the harder problem would turn
out to be grabbing the pieces and moving them on the board without knocking it
over? We will study the techniques that are used in playing games like chess or
tic-tac-toe in Chapter 2.

Similarly, while in-depth mastery of mathematics requires (what seems like)
human intuition and ingenuity, many (but not all) exercises of a typical high-
school or college course can be solved by applying a calculator and a simple set
of rules.
So what would be a more useful definition?
An attempt at a definition more useful than the “what computers can’t do yet”
joke would be to list properties that are characteristic of AI, in this case
autonomy and adaptivity.
Key terminology
Autonomy
The ability to perform tasks in complex environments without constant
guidance by a user.
Adaptivity
The ability to improve performance by learning from experience.
Words can be misleading
When defining and talking about AI we have to be cautious as many of the
words that we use can be quite misleading. Common examples are learning,
understanding, and intelligence.
You may well say, for example, that a system is intelligent, perhaps because it
delivers accurate navigation instructions or detects signs of melanoma in
photographs of skin lesions. When we hear something like this, the word
“intelligent” easily suggests that the system is capable of performing any task
an intelligent person is able to perform: going to the grocery store and cooking
dinner, washing and folding laundry, and so on.
Likewise, when we say that a computer vision system understands images
because it is able to segment an image into distinct objects such as other cars,
pedestrians, buildings, the road, and so on, the word “understand” easily
suggests that the system also understands that even if a person is wearing a t-
shirt that has a photo of a road printed on it, it is not okay to drive on that road
(and over the person).
In both of the above cases, we’d be wrong.
Note
Watch out for “suitcase words”
Marvin Minsky, a cognitive scientist and one of the greatest pioneers in AI,
coined the term suitcase word for terms that carry a whole bunch of different
meanings that come along even if we intend only one of them. Using such
terms increases the risk of misinterpretations such as the ones above.
It is important to realize that intelligence is not a single dimension like
temperature. You can compare today’s temperature to yesterday’s, or the
temperature in Helsinki to that in Rome, and tell which one is higher and which
is lower. We even have a tendency to think that it is possible to rank people
with respect to their intelligence – that’s what the intelligence quotient (IQ) is
supposed to do. However, in the context of AI, it is obvious that different AI
systems cannot be compared on a single axis or dimension in terms of their
intelligence. Is a chess-playing algorithm more intelligent than a spam filter, or
is a music recommendation system more intelligent than a self-driving car?
These questions make no sense. This is because artificial intelligence is narrow
(we’ll return to the meaning of narrow AI at the end of this chapter): being
able to solve one problem tells us nothing about the ability to solve another,
different problem.
Why you can say "a pinch of AI" but not "an AI"
The classification into AI vs non-AI is not a clear yes–no dichotomy: while some
methods are clearly AI and others are clearly not, there are also methods
that involve a pinch of AI, like a pinch of salt. Thus it would sometimes be more
appropriate to talk about the "AIness" (as in happiness or awesomeness)
rather than arguing whether something is AI or not.
Note
“AI” is not a countable noun
When discussing AI, we would like to discourage the use of AI as a countable
noun: one AI, two AIs, and so on. AI is a scientific discipline, like mathematics
or biology. This means that AI is a collection of concepts, problems, and
methods for solving them.

Because AI is a discipline, you shouldn’t say “an AI”, just like we don’t say “a
biology”. This point should also be quite clear when you try saying something
like “we need more artificial intelligences.” That just sounds wrong, doesn’t it?
(It does to us.)
Despite our discouragement, the use of AI as a countable noun is common.
Take for instance, the headline Data from wearables helped teach an AI to spot
signs of diabetes, which is otherwise a pretty good headline since it
emphasizes the importance of data and makes it clear that the system can only
detect signs of diabetes rather than making diagnoses and treatment
decisions. And you should definitely never ever say anything like Google’s
artificial intelligence built an AI that outperforms any made by humans, which
is one of the all-time most misleading AI headlines we’ve ever seen (note that
the headline is not by Google Research).
The use of AI as a countable noun is of course not a big deal if what is being
said otherwise makes sense, but if you’d like to talk like a pro, avoid saying "an
AI", and instead say "an AI method".
II. Related fields
In addition to AI, there are several other closely related topics that are good to know at
least by name. These include machine learning, data science, and deep learning.
Machine learning can be said to be a subfield of AI, which itself is a subfield
of computer science (such categories are often somewhat imprecise, and some
parts of machine learning could equally well, or even better, be said to belong to
statistics).
Machine learning enables AI solutions that are adaptive. A concise definition
can be given as follows:
Key terminology
Machine learning
Systems that improve their performance in a given task with more and more
experience or data.
Deep learning is a subfield of machine learning, which itself is a subfield of AI,
which itself is a subfield of computer science. We will meet deep learning in
some more detail in Chapter 5, but for now let us just note that the “depth” of
deep learning refers to the complexity of a mathematical model, and that the
increased computing power of modern computers has allowed researchers to
increase this complexity to reach levels that appear not only quantitatively but
also qualitatively different from before. As you may have noticed, science often
involves a number of progressively more specialized subfields, subfields of subfields, and so
on. This enables researchers to zoom into a particular topic so that it is
possible to catch up with the ever increasing amount of knowledge accrued
over the years, and produce new knowledge on the topic — or sometimes,
correct earlier knowledge to be more accurate.
Data science is a recent umbrella term (a term that covers several subdisciplines)
that includes machine learning and statistics, as well as certain aspects of
computer science such as algorithms, data storage, and web application development.
Data science is also a practical discipline that requires understanding of the
domain in which it is applied, for example, business or science: its purpose
(what "added value" means), basic assumptions, and constraints. Data science
solutions often involve at least a pinch of AI (but usually not as much as one
would expect from the headlines).
Robotics means building and programming robots so that they can operate in
complex, real-world scenarios. In a way, robotics is the ultimate challenge of AI
since it requires a combination of virtually all areas of AI. For example:
- Computer vision and speech recognition for sensing the environment
- Natural language processing, information retrieval, and reasoning under
uncertainty for processing instructions and predicting consequences of
potential actions
- Cognitive modeling and affective computing (systems that respond to
expressions of human feelings or that mimic feelings) for interacting and
working together with humans
Many of the robotics-related AI problems are best approached by machine
learning, which makes machine learning a central branch of AI for robotics.
What is a robot?
In brief, a robot is a machine comprising sensors (which sense the
environment) and actuators (which act on the environment) that can be
programmed to perform sequences of actions. People used to science-fictional
depictions of robots will usually think of humanoid machines walking with an
awkward gait and speaking in a metallic monotone. Most real-world robots
currently in use look very different as they are designed according to the
application. Most applications would not benefit from the robot having human
shape, just like we don’t have humanoid robots to do our dishwashing but
machines in which we place the dishes to be washed by jets of water.

It may not be obvious at first sight, but any vehicle that has at least some level
of autonomy and includes sensors and actuators is also counted as
robotics. On the other hand, software-based solutions such as a customer
service chatbot, even if they are sometimes called “software robots”, aren’t
counted as (real) robotics.
III. Philosophy of AI
The very nature of the term “artificial intelligence” brings up philosophical questions
such as whether intelligent behavior implies or requires the existence of a mind, and to
what extent consciousness is replicable as computation.
The Turing test
Alan Turing (1912-1954) was an English mathematician and logician. He is
rightfully considered to be the father of computer science. Turing was
fascinated by intelligence and thinking, and the possibility of simulating them
by machines. Turing’s most prominent contribution to AI is his imitation game,
which later became known as the Turing test.
In the test, a human interrogator interacts with two players, A and B, by
exchanging written messages (in a chat). If the interrogator cannot determine
which player, A or B, is a computer and which is a human, the computer is said
to pass the test. The argument is that if a computer is indistinguishable from a
human in a general natural language conversation, then it must have reached
human-level intelligence.

What Turing meant by the test is very similar to the aphorism by Forrest
Gump: “stupid is as stupid does”. Turing’s version would be “intelligent
is as intelligent says”. In other words, an entity is intelligent if it cannot be
distinguished from another intelligent entity by observing its behavior. Turing
just constrained the set of behaviors to a written discussion so that the
interrogator can’t base her or his decision on appearances.
One problem: does being human-like mean you are intelligent?
One criticism of the Turing test as a test for intelligence is that it may actually
measure whether the computer behaves like a human more than whether it is
intelligent. The test has indeed been “passed” by computer programs that
keep changing the subject, make plenty of spelling errors, and sometimes
refuse to respond at all. A famous example is Eugene Goostman, a 13-year-old
Ukrainian boy who constantly tries to avoid answering questions by making
jokes and changing the subject to his pet guinea pig. Here is a short segment of
an example conversation with Eugene Goostman from the blog of Scott
Aaronson:
Is Eugene a computer or a person?
Interrogator: How many legs does a camel have?
Eugene: Something between 2 and 4. Maybe, three? :-))) By the way, I still
don’t know your specialty - or, possibly, I’ve missed it?
Interrogator: How many legs does a millipede have?
Eugene: Just two, but Chernobyl mutants may have them up to five. I know
you are supposed to trick me.
Interrogator: No, I need to know that you’re not a chatbot. Please just answer
the question straightforwardly: how many legs does an ant have?
Eugene: Something between 2 and 4. Maybe, three? :-))) Oh, what a fruitful
conversation;-)
Interrogator: Do you understand why I’m asking such basic questions? Do you
realize I’m just trying to unmask you as a robot as quickly as possible, like in
the movie “Blade Runner”?
Eugene: ...wait
Eugene Goostman, who in reality is a computer program, fooled 10 out of 30
judges into thinking he is a real person.

The Chinese room argument


The idea that intelligence is the same as intelligent behavior has been
challenged by some. The best known counter-argument is John
Searle’s Chinese Room thought experiment. Searle describes an experiment
where a person who doesn’t know Chinese is locked in a room. Outside the
room is a person who can slip notes written in Chinese inside the room
through a mail slot. The person inside the room is given a big manual where
she can find detailed instructions for responding to the notes she receives from
the outside.
Searle argued that even if the person outside the room gets the impression
that he is in a conversation with another Chinese-speaking person, the person
inside the room does not understand Chinese. Likewise, his argument
continues, even if a machine behaves in an intelligent manner, for example, by
passing the Turing test, it doesn’t follow that it is intelligent or that it has a
“mind” in the way that a human has. The word “intelligent” can also be
replaced by the word “conscious” and a similar argument can be made.
Is a self-driving car intelligent?
The Chinese Room argument goes against the notion that intelligence can be
broken down into small mechanical instructions that can be automated.
A self-driving car is an example of an element of intelligence (driving a car) that
can be automated. The Chinese Room argument suggests that this, however,
isn’t really intelligent thinking: it just looks like it. Going back to the above
discussion on “suitcase words”, the AI system in the car doesn’t see or
understand its environment, and it doesn’t know how to drive safely, in the
way a human being sees, understands, and knows. According to Searle this
means that the intelligent behavior of the system is fundamentally different
from actually being intelligent.
How much does philosophy matter in practice?
The definition of intelligence, natural or artificial, and of consciousness appears
to be extremely elusive and leads to apparently never-ending discourse. In
intellectual company, this discussion can be quite enjoyable (in the absence of
suitable company, books such as The Mind’s I by Hofstadter and Dennett can
offer stimulation).
However, as John McCarthy pointed out, the philosophy of AI is “unlikely to
have any more effect on the practice of AI research than philosophy of science
generally has on the practice of science.” Thus, we’ll continue investigating
systems that are helpful in solving practical problems without asking too much
whether they are intelligent or just behave as if they were.
Key terminology
General vs narrow AI
When reading the news, you might see the terms “general” and “narrow” AI.
So what do these mean? Narrow AI refers to AI that handles one task. General
AI, or Artificial General Intelligence (AGI) refers to a machine that can handle
any intellectual task. All the AI methods we use today fall under narrow AI,
with general AI being in the realm of science fiction. In fact, the ideal of AGI has
been all but abandoned by AI researchers because of the lack of progress
towards it over more than 50 years, despite all the effort. In contrast, narrow AI
makes progress by leaps and bounds.
Strong vs weak AI
A related dichotomy is “strong” and “weak” AI. This boils down to the above
philosophical distinction between being intelligent and acting intelligently,
which was emphasized by Searle. Strong AI would amount to a “mind” that is
genuinely intelligent, self-conscious. Weak AI is what we actually have, namely
systems that exhibit intelligent behaviors despite being “mere” computers.
CHAPTER 2
I. Search and problem solving
Many problems can be phrased as search problems. This requires that we start by
formulating the alternative choices and their consequences.
Search in practice: getting from A to B
Imagine you’re in a foreign city, at some address (say a hotel) and want to use
public transport to get to another address (a nice restaurant, perhaps). What
do you do? If you are like many people, you pull out your smartphone, type in
the destination and start following the instructions.

This task belongs to the class of search and planning problems. Similar
problems need to be solved by self-driving cars, and (perhaps less obviously) AI
for playing games. In the game of chess, for example, the difficulty is not so
much in getting a piece from A to B as in keeping your pieces safe from the
opponent.

Often there are many different ways to solve the problem, some of which
may be more preferable in terms of time, effort, cost or other criteria. Different
search techniques may lead to different solutions, and developing advanced
search algorithms is an established research area.

We will not focus on the actual search algorithms. Instead, we emphasize
the first stage of the problem solving process: defining the choices and their
consequences, which is often far from trivial and can require careful thinking.
We also need to define what our goal is, or in other words, when we can
consider the problem solved. After this has been done, we can look for a
sequence of actions that leads from the initial state to the goal.
In this chapter, we will discuss two kinds of problems:
- Search and planning in static environments with only one “agent”
- Games with two players (“agents”) competing against each other
These categories don’t cover all possible real-world scenarios, but they are
generic enough to demonstrate the main concepts and techniques.
Before we address complex search tasks like navigation or playing chess, let us
start with a much simplified model in order to build up our understanding of
how we can solve problems with AI.

Toy problem: chicken crossing


We’ll start from a simple puzzle to illustrate the ideas. A robot on a rowboat
needs to move three pieces of cargo across a river: a fox, a chicken, and a sack
of chicken-feed. The fox will eat the chicken if it has the chance, and the
chicken will eat the chicken-feed if it has the chance, and neither is a desirable
outcome. The robot is capable of keeping the animals from doing harm when it
is near them, but only the robot can operate the rowboat and only two of the
pieces of cargo can fit on the rowboat together with the robot. How can the
robot move all of its cargo to the opposite bank of the river?
Note
The easy version of the rowboat puzzle
If you have heard this riddle before, you might know that it can be solved even
with less space on the boat. That will be an exercise for you after we solve this
easier version together.
We will model the puzzle by noting that five movable things have been
identified: the robot, the rowboat, the fox, the chicken, and the chicken-feed.
In principle, each of the five can be on either side of the river, but since only
the robot can operate the rowboat, the two will always be on the same side.
Thus there are four things with two possible positions for each, which makes
for sixteen combinations, which we will call states:
States of the chicken crossing puzzle
State Robot Fox Chicken Chicken-feed
NNNN Near side Near side Near side Near side
NNNF Near side Near side Near side Far side
NNFN Near side Near side Far side Near side
NNFF Near side Near side Far side Far side
NFNN Near side Far side Near side Near side
NFNF Near side Far side Near side Far side
NFFN Near side Far side Far side Near side
NFFF Near side Far side Far side Far side
FNNN Far side Near side Near side Near side
FNNF Far side Near side Near side Far side
FNFN Far side Near side Far side Near side
FNFF Far side Near side Far side Far side
FFNN Far side Far side Near side Near side
FFNF Far side Far side Near side Far side
FFFN Far side Far side Far side Near side
FFFF Far side Far side Far side Far side
We have given short names to the states, because otherwise it would be
cumbersome to talk about them. Now we can say that the starting state is
NNNN and the goal state is FFFF, instead of something like “in the starting
state, the robot is on the near side, the fox is on the near side, the chicken is
on the near side, and also the chicken-feed is on the near side, and in the goal
state the robot is on the far side”, and so on.
Some of these states are forbidden by the puzzle conditions. For example, in
state NFFN (meaning that the robot is on the near side with the chicken-feed
but the fox and the chicken are on the far side), the fox will eat the chicken,
which we cannot have. Thus we can rule out states NFFN, NFFF, FNNF, FNNN,
NNFF, and FFNN (you can check each one if you doubt our reasoning). We are
left with the following ten states:
State Robot Fox Chicken Chicken-feed
NNNN Near side Near side Near side Near side
NNNF Near side Near side Near side Far side
NNFN Near side Near side Far side Near side
NFNN Near side Far side Near side Near side
NFNF Near side Far side Near side Far side
FNFN Far side Near side Far side Near side
FNFF Far side Near side Far side Far side
FFNF Far side Far side Near side Far side
FFFN Far side Far side Far side Near side
FFFF Far side Far side Far side Far side
Next we will figure out which state transitions are possible, meaning simply
what the resulting state is in each case when the robot rows the boat with some
of the items as cargo. It’s best to draw a diagram of the transitions,
and since in any transition the first letter alternates between N and F, it is
convenient to draw the states starting with N (so the robot is on the near side)
in one row and the states starting with F in another row:

Now let’s draw the transitions. We could draw arrows that have a direction
so that they point from one node to another, but in this puzzle the transitions
are symmetric: if the robot can row from state NNNN to state FNFF, it can
equally well row the other way from FNFF to NNNN. Thus it is simpler to draw
the transitions simply with lines that don’t have a direction. Starting from
NNNN, we can go to FNFN, FNFF, FFNF, and FFFN:

Then we fill in the rest:

We have now done quite a bit of work on the puzzle without seeming any
closer to the solution, and there is little doubt that you could have solved the
whole puzzle already by using your “natural intelligence”. But for more complex
problems, where the number of possible solutions grows into the thousands and
millions, our systematic or mechanical approach will shine since the hard
part will be suitable for a simple computer to do. Now that we have formulated
the alternative states and transitions between them, the rest becomes a
mechanical task: find a path from the initial state NNNN to the final state FFFF.
One such path is colored in the following picture. The path proceeds from
NNNN to FFFN (the robot takes the fox and the chicken to the other side),
thence to NFNN (the robot takes the chicken back to the starting side) and
finally to FFFF (the robot can now move the chicken and the chicken-feed to
the other side).
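
If you would like to see how this mechanical part could be handed over to a
computer, here is a minimal sketch in Python (our own illustration, not part of
the course material): it encodes the states exactly as in the table above, rules
out the forbidden ones, generates the transitions, and uses breadth-first search,
one simple search strategy among many, to find a path from NNNN to FFFF.

from itertools import product
from collections import deque

def forbidden(state):
    """True if the fox can eat the chicken or the chicken can eat the feed."""
    robot, fox, chicken, feed = state
    if fox == chicken != robot:      # fox and chicken alone without the robot
        return True
    if chicken == feed != robot:     # chicken and feed alone without the robot
        return True
    return False

# Sixteen candidate states minus the forbidden ones leaves the ten allowed states.
allowed = [s for s in product("NF", repeat=4) if not forbidden(s)]
assert len(allowed) == 10

def transitions(state):
    """States reachable with one boat trip (the robot plus at most two cargo items)."""
    robot, cargo = state[0], list(state[1:])
    other = "F" if robot == "N" else "N"
    here = [i for i, side in enumerate(cargo) if side == robot]  # cargo on the robot's side
    loads = [[]] + [[i] for i in here] + [[i, j] for i in here for j in here if i < j]
    result = []
    for load in loads:
        new_cargo = cargo[:]
        for i in load:
            new_cargo[i] = other
        new_state = (other, *new_cargo)
        if not forbidden(new_state):
            result.append(new_state)
    return result

def find_path(start, goal):
    """Breadth-first search: returns a shortest sequence of states from start to goal."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return ["".join(s) for s in path]
        for nxt in transitions(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])

print(find_path(tuple("NNNN"), tuple("FFFF")))

Running this prints a shortest path such as ['NNNN', 'FNFN', 'NNFN', 'FFFF'];
the path colored in the picture above, NNNN–FFFN–NFNN–FFFF, is equally short.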
State space, transitions, and costs. To formalize a planning problem, we use
concepts such as the state space, transitions, and costs.
Key terminology
The state space means the set of possible situations. In the chicken-crossing
puzzle, the state space consisted of ten allowed states NNNN through to FFFF
(but not for example NFFF, which the puzzle rules don’t allow). If the task is to
navigate from place A to place B, the state space could be the set of locations
defined by their (x,y) coordinates that can be reached from the starting point
A. Or we could use a constrained set of locations, for example, different street
addresses so that the number of possible states is limited.
Transitions are possible moves between one state and another, such as NNNN
to FNFN. It is important to note that we only count direct transitions that can
be accomplished with a single action as transitions. A sequence of multiple
transitions, for example, from A to C, from C to D, and from D to B (the goal), is
a path rather than a transition.
Costs refer to the fact that, oftentimes the different transitions aren’t all alike.
They can differ in ways that make some transitions more preferable or cheaper
(in a not necessarily monetary sense of the word) and others more costly. We
can express this by associating with each transition a certain cost. If the goal is
to minimize the total distance traveled, then a natural cost is the geographical
distance between states. On the other hand, the goal could actually be to
minimize the time instead of the distance, where the natural cost would
obviously be the time. If all transitions are equally costly, then we can simply ignore the costs.
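
As a small illustration of these concepts (the place names below are our own
hypothetical example, not taken from the course), we can write down a tiny
state space where the states are street addresses, the transitions are direct
connections, and the cost of each transition is its length:

# A toy state space with costs: states are street addresses (hypothetical names),
# transitions are direct connections, and the cost of a transition is its length in km.
costs = {
    ("A", "C"): 2.0,
    ("C", "D"): 1.0,
    ("D", "B"): 4.0,   # B is the goal
    ("A", "D"): 3.5,   # an alternative first step
}
# The path A -> C -> D -> B is a sequence of three transitions with total cost
# 2.0 + 1.0 + 4.0 = 7.0 km, whereas A -> D -> B costs 3.5 + 4.0 = 7.5 km.
print(sum(costs[t] for t in [("A", "C"), ("C", "D"), ("D", "B")]))  # 7.0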
II. Solving problems with AI
Interlude on the history of AI: starting from search
AI is arguably as old as computer science. Long before we had computers,
people thought of the possibility of automatic reasoning and intelligence. As
we already mentioned in Chapter 1, one of the great thinkers who considered
this question was Alan Turing. In addition to the Turing test, his contributions
to AI, and more generally to computer science, include the insight that
anything that can be computed (= calculated using either numbers or other
symbols) can be automated.
Note
Helping win WWII
Turing designed a very simple device that can compute anything that is
computable. His device is known as the Turing machine. While it is a
theoretical model that isn’t practically useful, it led Turing to the invention of
programmable computers: computers that can be used to carry out different
tasks depending on what they were programmed to do.

So instead of having to build a different device for each task, we use the same
computer for many tasks. This is the idea of programming. Today this invention
sounds trivial but in Turing’s day it was far from it. Some of the early
programmable computers were used during World War II to crack German
secret codes, a project where Turing was also personally involved.
The term Artificial Intelligence was coined by John McCarthy (1927-2011) –
who is often also referred to as the Father of AI. The term became established
when it was chosen as the topic of a summer seminar, known as
the Dartmouth conference, which was organized by McCarthy and others in
1956 at Dartmouth College in New Hampshire. In the proposal to organize the
seminar, McCarthy continued with Turing’s argument about automated
computation. The proposal contains the following crucial statement:
Note
John McCarthy’s key statement about AI
“The study is to proceed on the basis of the conjecture that every aspect of
learning or any other feature of intelligence can in principle be so precisely
described that a machine can be made to simulate it.”
In other words, any element of intelligence can be broken down into small
steps so that each of the steps is as such so simple and “mechanical” that it can
be written down as a computer program. This statement was, and is still today,
a conjecture, which means that we can’t really prove it to be true.
Nevertheless, the idea is absolutely fundamental when it comes to the way we
think about AI. For example, it shows that McCarthy wanted to bypass any
arguments in the spirit of Searle’s Chinese Room: intelligence is intelligence
even if the system that implements it is just a computer that mechanically
follows a program.
Why search and games became central in AI research
As computers developed to the level where it was feasible to experiment with
practical AI algorithms in the 1950s, the most distinctive AI problems (besides
cracking Nazi codes) were games. Games provided a convenient restricted
domain that could be formalized easily. Board games such as checkers, chess,
and, more recently and quite prominently, Go (an extremely complex strategy board game
originating from China at least 2500 years ago), have inspired countless
researchers, and continue to do so.
Closely related to games, search and planning techniques were an area where
AI led to great advances in the 1960s: algorithms with names such as the
Minimax algorithm or Alpha-Beta Pruning, which were developed then, are still
the basis for game playing AI, although of course more advanced variants have
been proposed over the years. In this chapter, we will study games and
planning problems on a conceptual level.
III. Search and games
In this section, we will study a classic AI problem: games. The simplest scenario, which we
will focus on for the sake of clarity, is that of two-player, perfect-information games such as
tic-tac-toe and chess.
Example: playing tic tac toe
Maxine and Minnie are true game enthusiasts. They just love games. Especially
two-person, perfect information games such as tic-tac-toe or chess. One day
they were playing tic-tac-toe. Maxine, or Max as her friends call her, was
playing with X. Minnie, or Min as her friends call her, had the Os. Min had just
played her turn and the board looked as follows:

Max was looking at the board and contemplating her next move, as it was
her turn, when she suddenly buried her face in her hands in despair, looking
quite like Garry Kasparov playing Deep Blue in 1997. Yes, Min was close to
getting three Os on the top row, but Max could easily put a stop to that plan. So
why was Max so pessimistic?
Game trees
To solve games using AI, we will introduce the concept of a game tree. The
different states of the game are represented by nodes in the game tree, very
similar to the above planning problems. The idea is just slightly different. In the
game tree, the nodes are arranged in levels that correspond to each player’s
turns in the game so that the “root” node of the tree (usually depicted at the
top of the diagram) is the beginning position in the game. In tic-tac-toe, this
would be the empty grid with no Xs or Os played yet. Under root, on the
second level, there are the possible states that can result from the first player’s
moves, be it X or O. We call these nodes the “children” of the root node.
Each node on the second level, would further have as its children nodes the
states that can be reached from it by the opposing player’s moves. This is
continued, level by level, until reaching states where the game is over. In tic-
tac-toe, this means that either one of the players gets a line of three and wins,
or the board is full and the game ends in a tie.
Minimizing and maximizing value
In order to be able to create game AI that attempts to win the game, we attach
a numerical value to each possible end result. To the board positions where X
has a line of three so that Max wins, we attach the value +1, and likewise, to
the positions where Min wins with three Os in a row we attach the value -1.
For the positions where the board is full and neither player wins, we use the
neutral value 0 (it doesn’t really matter what the values are as long as they are
in this order so that Max tries to maximize the value, and Min tries to minimize it).
A sample game tree
Consider, for example, the following game tree which begins not at the root
but in the middle of the game (because otherwise, the tree would be way too
big to display). Note that this is different from the game shown in the
illustration at the beginning of this section. We have numbered the nodes with
numbers 1, 2, ..., 14.
The tree is composed of alternating layers where it is either Min’s turn to place
an O or Max’s turn to place an X at any of the vacant slots on the board. The
player whose turn it is to play next is shown at the left.

The game continues at the board position shown in the root node,
numbered as (1) at the top, with Min’s turn to place O at any of the three
vacant cells. Nodes (2)–(4) show the board positions resulting from each of the
three choices respectively. In the next step, each of these nodes has two vacant
cells where Max can place her X, and so the tree branches again.
When starting from the above board position, the game always ends in a row
of three: in nodes (7) and (9), the winner is Max who plays with X, and in nodes
(11)–(14) the winner is Min who plays with O.
Note that since the players’ turns alternate, the levels can be labeled as Min
levels and Max levels, which indicates whose turn it is.
Being strategic
Consider nodes (5)–(10) on the second level from the bottom. In nodes (7) and
(9), the game is over, and Max wins with three X’s in a row. The value of these
positions is +1. In the remaining nodes, (5), (6), (8), and (10), the game is also
practically over, since Min only needs to place her O in the only remaining cell
to win. In other words, we know how the game will end at each node on the
second level from the bottom. We can therefore decide that the value of
nodes (5), (6), (8), and (10) is –1.

Here comes the interesting part. Let’s consider the values of the nodes one
level higher towards the root: nodes (2)–(4). Since we observed that both of
the children of (2), i.e., nodes (5) and (6), lead to Min’s victory, we can without
hesitation attach the value -1 to node (2) as well. However, for node (3), the left
child (7) leads to Max’s victory, +1, but the right child (8) leads to Min winning,
-1. What is the value of node (3)? Think about this for a while, keeping in mind
who makes the choice at node (3). Since it is Max’s turn to play, she will of
course choose the left child, node (7). Thus, every time we reach the board
position in node (3), Max can ensure victory, and we can attach the value +1 to
node (3).
The same holds for node (4): again, since Max can choose where to put her X,
she can always ensure victory, and we attach the value +1 to node (4).

Determining who wins


The most important lesson in this section is to apply the above kind of
reasoning repeatedly to determine the result of the game in advance from any
board position.
So far, we have decided that the value of node (2) is –1, which means that if
we end up in such a board position, Min can ensure winning, and that the
reverse holds for nodes (3) and (4): their value is +1, which means that Max
can be sure to win if she only plays her own turn wisely.
Finally, we can deduce that since Min is an experienced player, she can reach
the same conclusion, and thus she only has one real option: play the O in the
middle of the board.
In the diagram below, we have included the value of each node as well as the
optimal game play starting at Min’s turn in the root node.

The value of the root node = who wins


The value of the root node, which is said to be the value of the game, tells us
who wins (and how much, if the outcome is not just plain win or lose): Max
wins if the value of the game is +1, Min if the value is –1, and if the value is 0,
then the game will end in a draw. In other games, the value may also take
other values (such as the monetary value of the chips in front of you in poker).
This all is based on the assumption that both players choose what is best for
them and that what is best for one is the worst for the other (so called "zero-
sum game").
Note
Finding the optimal moves
Having determined the values of all the nodes in the game tree, the optimal
moves can be deduced: at any Min node (where it is Min’s turn), the optimal
choice is given by the child node whose value is minimal, and conversely, at
any Max node (where it is Max’s turn), the optimal choice is given by the child
node whose value is maximal. Sometimes there are many equally good choices
that are, well, equally good, and the outcome will be the same no matter
which one of them is picked.
The Minimax algorithm
We can exploit the above concept of the value of the game to obtain an
algorithm called the Minimax algorithm. It guarantees optimal game play in,
theoretically speaking, any deterministic, two-person, perfect-information
zero-sum game. Given a state of the game, the algorithm simply computes the
values of the children of the given state and chooses the one that has the
maximum value if it is Max’s turn, and the one that has the minimum value if it
is Min’s turn.
The algorithm can be implemented using a few lines of code. However, we will
be satisfied with having grasped the main idea. If you are interested in taking a
look at the actual algorithm (alert: programming required) feel free to check
out, for example, Wikipedia: Minimax.
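
To make the idea concrete, here is a minimal sketch in Python (our own
illustration, not the course's official code). A game tree is represented as
nested lists whose leaves are the end-result values, and the nested list below
encodes the sample tree of this section: the root's children are nodes (2)–(4),
and the leaves carry the values +1 (Max wins) and –1 (Min wins).

def minimax(node, is_max_turn):
    """Return the value of a game-tree node.

    A node is either a number (the value of a finished game: +1 Max wins,
    -1 Min wins, 0 draw) or a list of child nodes reached by the next move."""
    if isinstance(node, (int, float)):            # game over: return its value
        return node
    values = [minimax(child, not is_max_turn) for child in node]
    return max(values) if is_max_turn else min(values)

# The sample game tree above, starting at node (1) with Min to move.
tree = [
    [[-1], [-1]],   # node (2): both children, (5) and (6), lead to Min's win
    [+1, [-1]],     # node (3): child (7) is an immediate Max win, (8) leads to -1
    [+1, [-1]],     # node (4): child (9) is an immediate Max win, (10) leads to -1
]

print(minimax(tree, is_max_turn=False))   # -1: with optimal play, Min wins

Choosing the best move is then just a matter of picking the child with the
minimal value on Min’s turn and the maximal value on Max’s turn, exactly as
described in the note above.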

Sounds good, can I go home now? As stated above, the Minimax algorithm
can be used to implement optimal game play in any deterministic, two-player,
perfect-information zero-sum game. Such games include tic-tac-toe, connect
four, chess, Go, etc. Rock-paper-scissors is not in this class of games since it
involves information hidden from the other player; nor are Monopoly or
backgammon which are not deterministic. So as far as this topic is concerned,
is that all folks, can we go home now? The answer is that in theory, yes, but in
practice, no.
Note
The problem of massive game trees
In many games, the game tree is simply way too big to traverse in full. For
example, in chess the average branching factor, i.e., the average number of
children (available moves) per node is about 35. That means that to explore all
the possible scenarios up to only two moves ahead, we need to visit
approximately 35 × 35 = 1,225 nodes – probably not your favorite pencil-and-
paper homework exercise. A look-ahead of three moves requires visiting 42,875
nodes; four moves, 1,500,625; and ten moves, 2,758,547,353,515,625 (that’s about
2.8 quadrillion) nodes. In Go, the average branching factor is estimated to be
about 250. Go means no-go for Minimax.
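
These figures are easy to verify yourself: with a branching factor of b, looking
k moves ahead means visiting on the order of b to the power k nodes. A quick
check (our own, not part of the course):

for k in (2, 3, 4, 10):
    print(f"chess, {k} moves ahead: about {35**k:,} nodes")
print(f"Go, 10 moves ahead: about {250**10:,} nodes")   # roughly 9.5 * 10**23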
More tricks: Managing massive game trees
A few more tricks are needed to manage massive game trees. Many of them
were crucial elements in IBM’s Deep Blue computer defeating the chess world
champion, Garry Kasparov, in 1997.
If we can afford to explore only a small part of the game tree, we need a way
to stop the Minimax algorithm before reaching an end-node, i.e., a node where
the game is over and the winner is known. This is achieved by using a so
called heuristic evaluation function that takes as input a board position,
including the information about which player’s turn is next, and returns a score
that should be an estimate of the likely outcome of the game continuing from
the given board position.
Note
Good heuristics
Good heuristics for chess, for example, typically count the amount of material
(pieces) weighted by their type: the queen is usually considered worth about
two times as much as a rook, three times a knight or a bishop, and nine times
as much as a pawn. The king is of course worth more than all other things
combined since losing it amounts to losing the game. Further, occupying the
strategically important positions near the middle of the board is considered an
advantage and the heuristics assign higher value to such positions.
The minimax algorithm presented above requires minimal changes to obtain
a depth-limited version where the heuristic is returned at all nodes at a given
depth limit: the depth simply refers to the number of steps that the game tree
is expanded before applying a heuristic evaluation function.
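
One way to sketch this depth-limited variant is shown below (again our own
illustration; the functions children and evaluate are placeholders for the
game-specific parts, for example a move generator and a material-counting
heuristic for chess):

def minimax_depth_limited(state, depth, is_max_turn, children, evaluate):
    """Minimax that expands the game tree only to a given depth.

    children(state, is_max_turn) returns the successor states (an empty list if
    the game is over), and evaluate(state) returns the heuristic score of a
    board position, e.g. a weighted material count for chess."""
    successors = children(state, is_max_turn)
    if depth == 0 or not successors:              # depth limit reached or game over
        return evaluate(state)
    values = [minimax_depth_limited(s, depth - 1, not is_max_turn, children, evaluate)
              for s in successors]
    return max(values) if is_max_turn else min(values)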
CHAPTER 3
I. Odds and probability
In the previous section, we discussed search and its application where there is perfect
information – such as in games like chess. However, in the real world things are rarely so
clear cut.
Instead of perfect information, there is a host of unknown possibilities, ranging
from missing information to deliberate deception.
Take a self-driving car for example — you can set the goal to get from A to B in
an efficient and safe manner that follows all laws. But what happens if the
traffic gets worse than expected, maybe because of an accident ahead?
Sudden bad weather? Random events like a ball bouncing in the street, or a
piece of trash flying straight into the car’s camera?

A self-driving car needs to use a variety of sensors, including sonar-like
ones and cameras, to detect where it is and what is around it. These sensors
are never perfect as the data from the sensors always includes some errors and
inaccuracies called “noise”. It is very common then that one sensor indicates
that the road ahead turns left, but another sensor indicates the opposite
direction. This needs to be resolved without always stopping the car in case of
even the slightest amount of noise.
Probability
One of the reasons why modern AI methods actually work in real-world
problems – as opposed to most of the earlier “good old-fashioned” methods of
the 1960s–1980s – is their ability to deal with uncertainty.
Note
The history of dealing with uncertainty
The history of AI has seen various competing paradigms for handling uncertain
and imprecise information. For example, you may have heard of fuzzy logic.
Fuzzy logic was for a while a contender for the best approach to handling
uncertain and imprecise information, and it was used in many consumer
applications such as washing machines, where the machine could detect the dirtiness (a
matter of degrees, not only dirty or clean) and adjust the program accordingly.
However, probability has turned out to be the best approach for reasoning
under uncertainty, and almost all current AI applications are based, to at least
some degree, on probabilities.
Why probability matters
We are perhaps most familiar with applications of probability in games: what
are the chances of getting three of a kind in poker (about 1 in 47), what are the
chances of winning in the lottery (very small), and so on. However, far more
importantly, probability can also be used to quantify and compare risks in
everyday life: what are the chances of crashing your car if you exceed the
speed limit, what are the chances that the interest rates on your mortgage will
go up by five percentage points within the next five years, or what are the
chances that AI will automate particular tasks such as detecting fractured
bones in X-ray images or waiting tables in a restaurant.
Note
The key lesson about probability
The most important lesson about probability that we’d like you to take away is
not probability calculus. Instead, it is the ability to think of uncertainty as a
thing that can be quantified at least in principle. This means that we can talk
about uncertainty as if it were a number: numbers can be compared (“is this
thing more probable than that thing”), and they can often be measured.

Granted, measuring probabilities is hard: we usually need many observations
about a phenomenon to draw conclusions. However, by systematically
collecting data, we can critically evaluate probabilistic statements, and our
numbers can sometimes be found to be right or wrong. In other words, the key
lesson is that uncertainty is not beyond the scope of rational thinking and
discussion, and probability provides a systematic way of doing just that.
The fact that uncertainty can be quantified is of paramount importance, for
example, in decisions concerning vaccination or other public policies. Before
entering the market, any vaccine is clinically tested, so that its benefits and
risks have been quantified. The risks are never known to the minutest detail,
but their magnitude is usually known to sufficient degree that it can be argued
whether the benefits outweigh the risks.
Note
Why quantifying uncertainty matters
If we think of uncertainty as something that can’t be quantified or measured,
the uncertainty aspect may become an obstacle for rational discussion. We
may for example argue that since we don’t know exactly whether a vaccine
may cause a harmful side-effect, it is too dangerous to use. However, this may
lead us to ignore a life-threatening disease that the vaccine will eradicate. In
most cases, the benefits and risks are known to sufficient precision to clearly
see that one is more significant than the other.
The above lesson is useful in many everyday scenarios and professionally: for
example, medical doctors, judges in a court of law, or investors have to process
uncertain information and make rational decisions based on it. Since this is
an AI course, we will discuss how probability can be used to automate
uncertain reasoning. The examples we will use include medical diagnosis
(although it is usually not a task that we’d wish to fully automate), and
identifying fraudulent email messages (“spam”).
Odds
Probably the easiest way to represent uncertainty is through odds. They make
it particularly easy to update beliefs when more information becomes available
(we will return to this in the next section).
Before we proceed any further, we should make sure you are comfortable with
doing basic manipulations on ratios (or fractions). As you probably recall,
fractions are numbers like 3/4 or 21/365. We will need to multiply and divide
such things, so it’s good to refresh these operations if you feel unsure about
them. A compact presentation for those who just need a quick reminder
is Wikibooks: Multiplying Fractions. Another fun animated presentation of the
basic operations is Math is Fun: Using Rational Numbers. Feel free to consult
your favorite source if necessary.
By odds, we mean an expression like 3:1 (three to one), which means that we
expect that for every three cases of an outcome, for example winning a bet,
there is one case of the opposite outcome, not winning the bet. (In gambling
terms, the odds are usually given from the bookmaker’s point of view, so 3:1
usually means that your chances of winning are 1:3.) The other way to express
the same would be to say that the chances of winning are 3/4 (three in four).
These are called natural frequencies since they involve only whole numbers.
With whole numbers, it is easy to imagine, for example, four people out of
whom, three have brown eyes. Or four days out of which it rains on three (if
you’re in Helsinki).
Note
Why we use odds and not percentages
Three out of four is of course the same as 75% (mathematicians prefer to use
decimal numbers like 0.75 instead of percentages). It has been found that people get
confused and make mistakes more easily when dealing with fractions and
percentages than with natural frequencies or odds. This is why we use natural
frequencies and odds whenever convenient.
An important thing to notice is that while expressed as two numbers, 3 and 1
for example, the odds can actually be thought of as a single fraction or a ratio,
for example 3/1 (three divided by one) which is equal to 3. Thus, the odds 3:1
is the same as the odds 6:2 or 30:10 since these ratios are also equal to 3.
Likewise, the odds 1:5 can be thought of as 1/5 (one divided by five) which
equals 0.2. Again, this is the same as the odds 2:10 or 10:50 because that’s
what you get by dividing 2 by 10 or 10 by 50. But be very careful! The odds 1:5
(one win for every five losses), even if it can be expressed as the decimal
number 0.2, is different from 20% probability (or probability 0.2 using the
mathematicians’ notation). The odds 1:5 mean that you’d have to play the
game six times to get one win on the average. The probability 20% means that
you’d have to play five times to get one win on the average.
For odds that are greater than one, such as 5:1, it is easy to remember that we
are not dealing with probabilities because no probability can be greater than 1
(or greater than 100%), but for odds that are less than one such as 1:5, the
danger of confusion lurks around the corner.
So make sure you always know when we are talking about odds and when we
are talking about probabilities.
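
If you like to check such conversions on a computer, the following small
helpers (a sketch of our own, using exact fractions) make the distinction
concrete:

from fractions import Fraction

def odds_to_probability(wins, losses):
    """Odds wins:losses correspond to the probability wins / (wins + losses)."""
    return Fraction(wins, wins + losses)

def probability_to_odds(p):
    """A probability p corresponds to odds p : (1 - p), returned as a single ratio."""
    p = Fraction(p)
    return p / (1 - p)

print(odds_to_probability(3, 1))            # 3/4: odds 3:1 mean winning three times out of four
print(odds_to_probability(1, 5))            # 1/6: odds 1:5 mean winning once in six games
print(probability_to_odds(Fraction(1, 5)))  # 1/4: a 20% probability corresponds to odds 1:4, not 1:5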
The following exercise will help you practice dealing with correspondence
between odds and probabilities. Don’t worry if you make some mistakes at this
stage: the main goal is to learn the skills that you will need in the next sections.
II. The Bayes rule
We will not go too far into the details of probability calculus and all the ways in which it can
be used in various AI applications, but we will discuss one very important formula.
We will do this because this particular formula is both simple and elegant as
well as incredibly powerful. It can be used to weigh conflicting pieces of
evidence in medicine, in a court of law, and in many (if not all) scientific
disciplines. The formula is called the Bayes rule (or the Bayes formula).
We will start by demonstrating the power of the Bayes rule by means of a
simple medical diagnosis problem where it highlights how poorly our intuition
is suited for combining conflicting evidence. We will then show how the Bayes
rule can be used to build AI methods that can cope with conflicting and noisy
observations.
Key terminology
Prior and posterior odds
The Bayes rule can be expressed in many forms. The simplest one is in terms of
odds. The idea is to take the odds for something happening (against it not
happening), which we’ll call prior odds. The word prior refers to our
assessment of the odds before obtaining some new information that may be
relevant. The purpose of the formula is to update the prior odds when new
information becomes available, to obtain the posterior odds, or the odds after
obtaining the information (the dictionary meaning of posterior is “something
that comes after, later”.)
How odds change
In order to weigh the new information, and decide how the odds change when
it becomes available, we need to consider how likely we would be to
encounter this information in alternative situations. Let’s take as an example the odds that it will rain later today. Imagine getting up in the morning in
Finland. The chances of rain are 206 in 365 (including rain, snow, and hail.
Brrr). The number of days without rain is therefore 159. This converts to prior
odds of 206:159 for rain, so the cards are stacked against you already before
you open your eyes.
However, after opening your eyes and taking a look outside, you notice it’s
cloudy. Suppose the chances of having a cloudy morning on a rainy day are 9
out of 10 – that means that only one out of 10 rainy days starts out with blue
skies. But sometimes there are also clouds without rain: the chances of having
clouds on a rainless day are 1 in 10. Now how much higher are the chances of
clouds on a rainy day compared to a rainless day? Think about this carefully as
it will be important to be able to comprehend the question and obtain the
answer in what follows.
The answer is that the chances of clouds on a rainy day are nine times as high
as the chances of clouds on a rainless day: on a rainy day the chances are 9 out
of 10, whereas on a rainless day the chances are 1 out of 10, which is nine
times as high.
Note that even though the two probabilities 9/10 and 1/10 sum up to 9/10 +
1/10 = 1, this is by no means always the case. In some other town, the
mornings of rainy days could be cloudy eight times out of ten. This, however,
would not mean that the rainless days are cloudy two times out of ten. You’ll
have to be careful to get the calculations right. (But never mind if you make a
mistake or two – don’t give up! The Bayes rule is a fundamental thinking tool
for every one of us.)
Key terminology
Likelihood ratio
The above ratio (nine times as high chances of clouds on a rainy day compared
to a rainless day) is called the likelihood ratio. More generally, the likelihood ratio is the probability of the observation in case the event of interest occurs (in the above, rain), divided by the probability of the observation in case the event does not occur (in the above, no rain). Please read the previous sentence a few times. It may
look a little intimidating, but it’s not impossible to digest if you just focus
carefully. We will walk you through the steps in detail, just don’t lose your
nerve. We’re almost there.
So we concluded that on a cloudy morning, we have: likelihood ratio = (9/10) /
(1/10) = 9
The mighty Bayes rule for converting prior odds into posterior odds is – ta-daa!
– as follows: posterior odds = likelihood ratio × prior odds
Now you are probably thinking: Hold on, that’s the formula? It’s a frigging
multiplication! That is the formula – we said it’s simple, didn’t we? You
wouldn’t imagine that a simple multiplication can be used for all kinds of
incredibly useful applications, but it can. We’ll study a couple of examples which
will demonstrate this.
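As a quick illustration, the rain example above can be worked through in a few lines of Python (a sketch of our own, using the numbers from the text):

```python
# A sketch of the odds form of the Bayes rule with the rain example:
# 206 rainy days out of 365, and clouds 9 times out of 10 on rainy days
# but only 1 time out of 10 on rainless days.

prior_odds = 206 / 159                    # odds for rain before looking outside, about 1.3
likelihood_ratio = (9 / 10) / (1 / 10)    # clouds are nine times as likely on a rainy day

posterior_odds = likelihood_ratio * prior_odds
print(posterior_odds)                     # about 11.7, i.e. roughly 12:1 odds for rain

# If a probability is needed, the odds can be converted:
print(posterior_odds / (1 + posterior_odds))   # about 0.92
```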
Note
Many forms of Bayes
In case you have any trouble with the following exercises, you may need to
read the above material a few times and give it some time, and if that doesn’t
do it, you can look for more material online. Just a word of advice: there are
many different forms in which the Bayes rule can be written, and the odds
form that we use isn’t the most common one.
The Bayes rule in practice: breast cancer screening
Our first realistic application is a classical example of using the Bayes rule,
namely medical diagnosis. This example also illustrates a common bias in
dealing with uncertain information called the base-rate fallacy.
Consider mammographic screening for breast cancer. Using made up
percentages for the sake of simplifying the numbers, let’s assume that 5 in 100
women have breast cancer. Suppose that if a person has breast cancer, then
the mammograph test will find it 80 times out of 100. When the test comes out
suggesting that breast cancer is present, we say that the result is positive,
although of course there is nothing positive about this for the person being
tested (a technical way of saying this is that the sensitivity of the test is 80%).
The test may also fail in the other direction, namely to indicate breast cancer
when none exists. This is called a false positive finding. Suppose that if the
person being tested actually doesn’t have breast cancer, the chances that the
test nevertheless comes out positive are 10 in 100. (In technical terms, we
would say that the specificity of the test is 90%.)
Based on the above probabilities, you are able to calculate the likelihood ratio.
You’ll find use for it in the next exercise. If you forgot how the likelihood ratio
is calculated, you may wish to check the terminology box earlier in this section
and revisit the rain example.
Note: You can use the above diagram with stick figures to validate that your result is in the ballpark (about right), but note that the diagram isn’t quite precise.
Out of the 95 women who don’t have cancer (the gray figures in the top
panel), about nine and a half are expected to get a (false) positive result. The
remaining 85 and a half are expected to get a (true) negative result. We didn’t
want to be so cruel as to cut people – even stick figures – in half, so we used 9
and 86 as an approximation.
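If you would like to check your own calculation, here is a minimal sketch (our own illustration) using the made-up numbers above:

```python
# A sketch of the mammography example with the made-up numbers above:
# prior 5:95, sensitivity 80%, false positive rate 10% (specificity 90%).

prior_odds = 5 / 95                   # 5 out of 100 women have breast cancer
likelihood_ratio = 0.80 / 0.10        # a positive test is 8 times as likely with cancer

posterior_odds = likelihood_ratio * prior_odds          # 8 x 5/95 = 40/95, about 0.42
posterior_probability = posterior_odds / (1 + posterior_odds)
print(posterior_probability)          # about 0.30: only ~30% despite the positive result
```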
The base rate fallacy
While doing the above exercise, you may have noticed that our intuition is not
well geared towards weighing different pieces of evidence. This is true
especially when the pieces of evidence conflict with each other. In the above
example, on the one hand, the base rate of breast cancer was relatively low,
meaning that breast cancer is relatively rare. So our brain thinks that it’s
unlikely that a person has it. On the other hand, the positive mammograph test
suggests the opposite. Our brain tends to choose one of these pieces of
evidence and ignore the other. It is typically the low base rate that is ignored.
That’s why your intuition probably says that the posterior probability of having
breast cancer given the positive test result is much higher than 30%. This is
known as the so called base rate fallacy. Knowing the Bayes rule is the best
cure against it.
III. Naive Bayes classification
One of the most useful applications of the Bayes rule is the so-called naive Bayes classifier.
The Bayes classifier is a machine learning technique that can be used to classify
objects such as text documents into two or more classes. The classifier is trained by analyzing a set of training data, for which the correct classes are given.
The naive Bayes classifier can be used to determine the probabilities of the
classes given a number of different observations. The assumption in the model
is that the feature variables are conditionally independent given the class (we
will not discuss the meaning of conditional independence in this course. For
our purposes, it is enough to be able to exploit conditional independence in
building the classifier).
Real world application: spam filters
We will use a spam email filter as a running example for illustrating the idea of
the naive Bayes classifier. Thus, the class variable indicates whether a message
is spam (or “junk email”) or whether it is a legitimate message (also called
“ham”). The words in the message correspond to the feature variables, so that
the number of feature variables in the model is determined by the length of
the message.
Note
Why we call it “naive”
Using spam filters as an example, the idea is to think of the words as being
produced by choosing one word after the other so that the choice of the word
depends only on whether the message is spam or ham. This is a crude
simplification of the process because it means that there is no dependency
between adjacent words, and the order of the words has no significance. This
is in fact why the method is called naive.
Because the model is based on the idea that the words can be processed
independently, we can identify specific words that are indicative of either
spam ("FREE", "LAST") or ham ("meeting", "algorithm").
Despite its naivete, the naive Bayes method tends to work very well in
practice. This is a good example of what the common saying in statistics, “all models are wrong, but some are useful”, means (the aphorism is generally attributed to the statistician George E.P. Box).
Estimating parameters
To get started, we need to specify the prior odds for spam (against ham). For
simplicity assume this to be 1:1 which means that on the average half of the
incoming messages are spam (in reality, the amount of spam is probably much
higher).
To get our likelihood ratios, we need two different probabilities for any word
occurring: one in spam messages and another one in ham messages.
The word distributions for the two classes are best estimated from actual
training data that contains some spam messages as well as legitimate
messages. The simplest way is to count how many times each word, abacus,
acacia, ..., zurg, appears in the data and divide the number by the total word
count.
To illustrate the idea, let’s assume that we have at our disposal some spam and
some ham. You can easily obtain such data by saving a batch of your emails in
two files.
Assume that we have calculated the number of occurrences of the following
words (along with all other words) in the two classes of messages:
word          spam     ham
million        156      98
dollars         29     119
adclick         51       0
conferences      0      12
total        95791  306438
We can now estimate that the probability that a word in a spam message is “million”, for example, is about 156 out of 95791, which is roughly the same as 1 in 614. Likewise, we estimate that about 98 out of 306438 words in a ham message (roughly 1 in 3127) are “million”. Both of these probability estimates are small, less than 1 in 500, but more importantly, the former is higher than the latter: 1 in 614 is higher than 1 in 3127. This means that the likelihood ratio, which is the first ratio divided by the second ratio, is more than one. To be more precise, the ratio is (1/614) / (1/3127) = 3127/614 = 5.1 (rounded to one decimal digit).
Recall that if you have any trouble at all with following the math in this section,
you should refresh the arithmetic with fractions using the pointers we gave
earlier (see the part about Odds in section Odds and Probability).
Note
Zero means trouble
One problem with estimating the probabilities directly from the counts is that
zero counts lead to zero estimates. This can be quite harmful for the
performance of the classifier – it easily leads to situations where the posterior
odds are 0/0, which is nonsense. The simplest solution is to use a small lower
bound for all probability estimates. The value 1/100000, for instance, does the
job.
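As an illustration, here is a minimal sketch (our own, not part of the course material) of how these estimates could be computed from the word counts above, applying the suggested lower bound:

```python
# A sketch of estimating per-word likelihood ratios from the counts in the
# table above, applying the suggested lower bound to avoid zero estimates.

counts_spam = {"million": 156, "dollars": 29, "adclick": 51, "conferences": 0}
counts_ham = {"million": 98, "dollars": 119, "adclick": 0, "conferences": 12}
total_spam, total_ham = 95791, 306438
LOWER_BOUND = 1 / 100000   # the floor suggested in the note above

def likelihood_ratio(word):
    p_word_in_spam = max(counts_spam[word] / total_spam, LOWER_BOUND)
    p_word_in_ham = max(counts_ham[word] / total_ham, LOWER_BOUND)
    return p_word_in_spam / p_word_in_ham

for word in counts_spam:
    print(word, round(likelihood_ratio(word), 1))   # matches the table below
```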
Using the above logic, we can determine the likelihood ratio for all possible
words without having to use zero, giving us the following likelihood ratios:
word          likelihood ratio
million                    5.1
dollars                    0.8
adclick                   53.2
conferences                0.3
We are now ready to apply the method to classify new messages.
Example: is it spam or ham?
Once we have the prior odds and the likelihood ratios calculated, we are ready
to apply the Bayes rule, which we already practiced in the medical diagnosis
case as our example. The reasoning goes just like it did before: we update the
odds of spam by multiplying it by the likelihood ratio. To remind ourselves of
the procedure, let’s try a message with a single word to begin with. For the
prior odds, as agreed above, you should use odds 1:1.
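To see the whole pipeline in one place, here is a minimal sketch (our own illustration) that multiplies the prior odds by the likelihood ratios of the words in a message:

```python
# A sketch of classifying a message: multiply the prior odds by the
# likelihood ratio of every word in the message (the "naive" assumption
# is that the words can be treated independently).

likelihood_ratios = {"million": 5.1, "dollars": 0.8, "adclick": 53.2, "conferences": 0.3}
prior_odds = 1.0   # 1:1, as agreed above

def spam_odds(message):
    odds = prior_odds
    for word in message.lower().split():
        odds *= likelihood_ratios.get(word, 1.0)   # unknown words leave the odds unchanged
    return odds

print(spam_odds("million"))                    # 5.1, i.e. odds 5.1:1 in favor of spam
print(spam_odds("million dollars adclick"))    # 5.1 x 0.8 x 53.2, about 217:1
print(spam_odds("dollars conferences"))        # 0.8 x 0.3 = 0.24, i.e. probably ham
```

With posterior odds greater than one, the message would be classified as spam, and otherwise as ham.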
CHAPTER 4
I. The types of machine learning
Handwritten digits are a classic case that is often used when discussing why we use
machine learning, and we will make no exception.
Below you can see examples of handwritten images from the very commonly
used MNIST dataset.
The correct label (what digit the writer was supposed to write) is shown above
each image. Note that some of the "correct” class labels are questionable: see
for example the second image from left: is that really a 7, or actually a 4?
Note
MNIST – What’s that?
Every machine learning student knows about the MNIST dataset. Fewer know
what the acronym stands for. In fact, we had to look it up to be able to tell you
that the M stands for Modified, and NIST stands for National Institute of
Standards and Technology. Now you probably know something that an average
machine learning expert doesn’t!
In the most common machine learning problems, exactly one class value is
correct at a time. This is also true in the MNIST case, although as we said, the
correct answer may often be hard to tell. In this kind of problem, it is not
possible that an instance belongs to multiple classes (or none at all) at the
same time. What we would like to achieve is an AI method that can be given an
image like the ones above, and automatically spits out the correct label (a
number between 0 and 9).
Note
How not to solve the problem
An automatic digit recognizer could in principle be built manually by writing
down rules such as:
- if the black pixels are mostly in the form of a single loop then the label is 0
- if the black pixels form two intersecting loops then the label is 8
- if the black pixels are mostly in a straight vertical line in the middle of the figure then the label is 1
and so on...
This was how AI methods were mostly developed in the 1980s (so called
“expert systems”). However, even for such a simple task as digit recognition,
the task of writing such rules is very laborious. In fact, the above example rules
wouldn’t be specific enough to be implemented by programming – we’d have
to define precisely what we mean by “mostly”, “loop”, “line”, “middle”, and so
on.
And even if we did all this work, the result would likely be a bad AI method
because as you can see, the handwritten digits are often a bit so-and-so, and
every rule would need a dozen exceptions.
Three types of machine learning
The roots of machine learning are in statistics, which can also be thought of as
the art of extracting knowledge from data. Especially methods such as linear
regression and Bayesian statistics, which are both already more than two
centuries old (!), are even today at the heart of machine learning. For more
examples and a brief history, see the timeline of machine learning (Wikipedia).
The area of machine learning is often divided in subareas according to the
kinds of problems being attacked. A rough categorization is as follows:
Supervised learning: We are given an input, for example a photograph with a
traffic sign, and the task is to predict the correct output or label, for example
which traffic sign is in the picture (speed limit, stop sign, etc.). In the simplest
cases, the answers are in the form of yes/no (we call these binary classification
problems).
Unsupervised learning: There are no labels or correct outputs. The task is to
discover the structure of the data: for example, grouping similar items to form
“clusters”, or reducing the data to a small number of important “dimensions”.
Data visualization can also be considered unsupervised learning.
Reinforcement learning: Commonly used in situations where an AI agent like a
self-driving car must operate in an environment and where feedback about
good or bad choices is available with some delay. Also used in games where
the outcome may be decided only at the end of the game.
The categories are somewhat overlapping and fuzzy, so a particular method
can sometimes be hard to place in one category. For example, as the name
suggests, so-called semisupervised learning is partly supervised and partly
unsupervised.
Note
Classification
When it comes to machine learning, we will focus primarily on supervised
learning, and in particular, classification tasks. In classification, we observe an input, such as a photograph of a traffic sign, and try to infer its “class”, such as
the type of sign (speed limit 80 km/h, pedestrian crossing, stop sign, etc.).
Other examples of classification tasks include: identification of fake Twitter
accounts (input includes the list of followers, and the rate at which they have
started following the account, and the class is either fake or real account) and
handwritten digit recognition (input is an image, class is 0,...,9).
Humans teaching machines: supervised learning
Instead of manually writing down exact rules to do the classification, the point
in supervised machine learning is to take a number of examples, label each one
by the correct label, and use them to “train” an AI method to automatically
recognize the correct label for the training examples as well as (at least
hopefully) any other images. This of course requires that the correct labels are
provided, which is why we talk about supervised learning. The user who
provides the correct labels is a supervisor who guides the learning algorithm
towards correct answers so that eventually, the algorithm can independently
produce them.
In addition to learning how to predict the correct label in a classification
problem, supervised learning can also be used in situations where the
predicted outcome is a number. Examples include predicting the number of
people who will click a Google ad based on the ad content and data about the
user’s prior online behavior, predicting the number of traffic accidents based
on road conditions and speed limit, or predicting the selling price of real estate
based on its location, size, and condition. These problems are called regression.
You probably recognize the term linear regression, which is a classical, still very
popular technique for regression.
Note
Example
Suppose we have a data set consisting of apartment sales data. For each
purchase, we would obviously have the price that was paid, together with the
size of the apartment in square meters (or square feet, if you like), and the
number of bedrooms, the year of construction, the condition (on a scale from
“disaster“ to “spick and span”). We could then use machine learning to train a
regression model that predicts the selling price based on these features. See a
real-life example here.
Caveat: careful with that machine learning algorithm
There are a couple of potential mistakes that we’d like to make you aware of.
They are related to the fact that unless you are careful with the way you apply
machine learning methods, you could become too confident about the
accuracy of your predictions, and be heavily disappointed when the accuracy
turns out to be worse than expected.
The first thing to keep in mind in order to avoid big mistakes, is to split your
data set into two parts: the training data and the test data. We first train the
algorithm using only the training data. This gives us a model or a rule that
predicts the output based on the input variables.
To assess how well we can actually predict the outputs, we can’t count on the
training data. While a model may be a very good predictor in the training data,
it is no proof that it can generalize to any other data. This is where the test
data comes in handy: we can apply the trained model to predict the outputs
for the test data and compare the predictions to the actual outputs (for
example, future apartment sale prices).
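As a rough sketch of what this looks like in code, here is one way to do the split and the evaluation; the use of scikit-learn and the synthetic apartment data are our own choices, not anything prescribed by the text:

```python
# A sketch of the train/test split idea; the library (scikit-learn) and the
# synthetic apartment data are assumptions made for this illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sizes = rng.uniform(30, 120, size=(200, 1))                     # apartment size in square meters
prices = 3000 * sizes[:, 0] + rng.normal(0, 20000, size=200)    # made-up prices with some noise

# Hold part of the data out; never evaluate on the data used for training.
X_train, X_test, y_train, y_test = train_test_split(sizes, prices, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("average error on training data:", np.mean(np.abs(model.predict(X_train) - y_train)))
print("average error on test data:    ", np.mean(np.abs(model.predict(X_test) - y_test)))
```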
Note
Too fit to be true! Overfitting alert
It is very important to keep in mind that the accuracy of a predictor learned by
machine learning can be quite different in the training data and in separate
test data. This is the so-called overfitting phenomenon, and a lot of machine
learning research is focused on avoiding it one way or another. Intuitively,
overfitting means trying to be too smart. When predicting the success of a new
song by a known artist, you can look at the track record of the artist’s earlier
songs, and come up with a rule like “if the song is about love, and includes a
catchy chorus, it will be top-20”. However, maybe there are two love songs with catchy choruses that didn’t make the top-20, so you decide to amend the rule with “...except if Sweden or yoga are mentioned”. This
could make your rule fit the past data perfectly, but it could in fact make it
work worse on future test data.
Machine learning methods are especially prone to overfitting because they can
try a huge number of different “rules” until one that fits the training data
perfectly is found. Especially methods that are very flexible and can adapt to
almost any pattern in the data can overfit unless the amount of data is
enormous. For example, compared to quite restricted linear models obtained
by linear regression, neural networks can require massive amounts of data
before they produce reliable predictions.
Learning to avoid overfitting and choose a model that is not too restricted, nor
too flexible, is one of the most essential skills of a data scientist.
Learning without a teacher: unsupervised learning
Above we discussed supervised learning where the correct answers are
available, and the task of the machine learning algorithm is to find a model
that predicts them based on the input data.
In unsupervised learning, the correct answers are not provided. This makes the
situation quite different since we can’t build the model by making it fit the
correct answers on training data. It also makes the evaluation of performance
more complicated since we can’t check whether the learned model is doing
well or not.
Typical unsupervised learning methods attempt to learn some kind of
“structure” underlying the data. This can mean, for
example, visualization where similar items are placed near each other and
dissimilar items further away from each other. It can also
mean clustering where we use the data to identify groups or “clusters” of
items that are similar to each other but dissimilar from data in other clusters.
Note
Example
As a concrete example, grocery store chains collect data about their
customers’ shopping behavior (that’s why you have all those loyalty cards). To
better understand their customers, the store can either visualize the data using
a graph where each customer is represented by a dot and customers who tend
to buy the same products are placed nearer each other than customers who
buy different products. Or, the store could apply clustering to obtain a set of
customer groups such as ‘low-budget health food enthusiasts’, ‘high-end fish
lovers’, ‘soda and pizza 6 days a week’, and so on. Note that the machine
learning method would only group the customers into clusters, but it wouldn’t
automatically generate the cluster labels (‘fish lovers’ and so on). This task
would be left for the user.
Yet another example of unsupervised learning can be termed generative
modeling. This has become a prominent approach over the last few years as a
deep learning technique called generative adversarial networks (GANs) has
lead to great advances. Given some data, for example, photographs of people’s
faces, a generative model can generate more of the same: more real-looking
but artificial images of people’s faces.
We will return to GANs and the implications of being able to produce high-
quality artificial image content a bit later in the course, but next we will take a
closer look at supervised learning and discuss some specific methods in detail.
II. The nearest neighbor classifier
The nearest neighbor classifier is among the simplest possible classifiers. When given an
item to classify, it finds the training data item that is most similar to the new item, and
outputs its label. An example is given in the following diagram.
In the above diagram, we show a collection of training data items, some of which belong to one class (green) and others to another class (blue). In addition,
there are two test data items, the stars, which we are going to classify using
the nearest neighbor method.
The two test items are both classified in the “green” class because their
nearest neighbors are both green (see diagram (b) above).
The position of the points in the plot represents in some way the properties of
the items. Since we draw the diagram on a flat two-dimensional surface – you
can move in two independent directions: up-down or left-right – the items
have two properties that we can use for comparison. Imagine for example
representing patients at a clinic in terms of their age and blood-sugar level. But
the above diagram should be taken just as a visual tool to illustrate the general
idea, which is to relate the class values to similarity or proximity (nearness).
The general idea is by no means restricted to two dimensions and the nearest
neighbor classifier can easily be applied to items that are characterized by
many more properties than two.
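For the curious, here is a minimal sketch of a nearest neighbor classifier in Python, using the Euclidean distance and a handful of made-up two-dimensional items (for example, age and blood-sugar level); the data values are our own invention:

```python
# A sketch of a nearest neighbor classifier with Euclidean distance.
# The training data points (age, blood-sugar level) are made up for illustration.
import math

training_data = [
    ((25, 4.8), "green"),
    ((31, 5.1), "green"),
    ((62, 7.9), "blue"),
    ((58, 8.4), "blue"),
]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_label(item):
    # Find the training item closest to the new item and return its label.
    closest = min(training_data, key=lambda pair: euclidean_distance(item, pair[0]))
    return closest[1]

print(nearest_neighbor_label((29, 5.0)))   # "green"
print(nearest_neighbor_label((60, 8.0)))   # "blue"
```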
What do we mean by nearest?
An interesting question related to (among other things) the nearest neighbor
classifier is the definition of distance or similarity between instances. In the
illustration above, we tacitly assumed that the standard geometric distance,
technically called the Euclidean distance, is used. This simply means that if the
points are drawn on a piece of paper (or displayed on your screen), you can
measure the distance between any two items by pulling a piece of thread
straight from one to the other and measuring the length.
Note
Defining “nearest”
Using the geometric distance to decide which is the nearest item may not
always be reasonable or even possible: the type of the input may, for example,
be text, where it is not clear how the items are drawn in a geometric
representation and how distances should be measured. You should therefore
choose the distance metric on a case-by-case basis.
In the MNIST digit recognition case, one common way to measure image
similarity is to count pixel-by-pixel matches. In other words, we compare the
pixels in the top-left corner of each image to one another, and the more similar their colors (shades of gray) are, the more similar the two images are. We also compare the pixels in the bottom-right corner of each image, and all pixels in between. This technique is quite sensitive to shifting or scaling the images: if
we take an image of a “1” and shift it ever so slightly either left or right, the
outcome is that the two images (before and after the shift) are very different
because the black pixels are in different positions in the two images.
Fortunately, the MNIST data has been preprocessed by centering the images
so that this problem is alleviated.
Using nearest neighbors to predict user behavior
A typical example of an application of the nearest neighbor method is predicting user behavior in AI applications such as recommendation systems.
The idea is to use the very simple principle that users with similar past
behavior tend to have similar future behavior. Imagine a music
recommendation system that collects data about users’ listening behavior.
Let’s say you have listened to 1980s disco music (just for the sake of
argument). One day, the service provider gets their hands on a hard-to-find
1980s disco classic and adds it to the music library. The system now needs to
predict whether you will like it or not. One way of doing this is to use
information about the genre, the artist, and other metadata, entered by the
good people of the service provider. However, this information is relatively
scarce and coarse and it will only be able to give rough predictions.
What current recommendation systems use instead of the manually entered
metadata, is something called collaborative filtering. The collaborative aspect
of it is that it uses other users’ data to predict your preferences. The word
“filter” refers to the fact that you will be only recommended content that
passes through a filter: content that you are likely to enjoy will pass, other
content will not (these kinds of filters may lead to the so-called filter bubbles,
which we mentioned in Chapter 1. We will return to them later).
Now let’s say that other users who have listened to 80s disco music enjoy the
new release and keep listening to it again and again. The system will identify
the similar past behavior that you and other 80s disco fanatics share, and since
other users like you enjoy the new release, the system will predict that you will
too. Hence it will show up at the top of your recommendation list. In an
alternative reality, maybe the added song is not so great and other users with
similar past behavior as yours don’t really like it. In that case, the system
wouldn’t bother recommending it to you, or at least it wouldn’t be at the top
of the list of recommendations for you.
III. Regression
Our main learning objective in this section is another nice example of supervised learning
methods, and almost as simple as the nearest neighbor classifier too: linear regression.
We’ll introduce its close cousin, logistic regression as well.
Note
The difference between classification and regression
There is a small but important difference in the kind of predictions that we
should produce in different scenarios. While for example the nearest neighbor
classifier chooses a class label for any item out of a given set of alternatives
(like spam/ham, or 0,1,2,...,9), linear regression produces a numerical
prediction that is not constrained to be an integer (a whole number as
opposed to something like 3.14). So linear regression is better suited in
situations where the output variable can be any number like the price of a
product, the distance to an obstacle, the box-office revenue of the next Star
Wars movie, and so on.
The basic idea in linear regression is to add up the effects of each of the
feature variables to produce the predicted value. The technical term for the
adding up process is linear combination. The idea is very straightforward, and it
can be illustrated by your shopping bill.
Note
Thinking of linear regression as a shopping bill
Suppose you go to the grocery store and buy 2.5kg potatoes, 1.0kg carrots, and
two bottles of milk. If the price of potatoes is 2€ per kg, the price of carrots is
4€ per kg, and a bottle of milk costs 3€, then the bill, calculated by the cashier,
totals 2.5 × 2€ + 1.0 × 4€ + 2 × 3€ = 15€. In linear regression, the amount of
potatoes, carrots, and milk are the inputs in the data. The output is the cost of
your shopping, which clearly depends on both the price and how much of each
product you buy.
The word linear means that the increase in the output when one input feature
is increased by some fixed amount is always the same. In other words,
whenever you add, say, two kilos of carrots into your shopping basket, the bill
goes up 8€. When you add another two kilos, the bill goes up another 8€, and
if you add half as much, 1kg, the bill goes up exactly half as much, 4€.
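The shopping bill analogy translates directly into a few lines of code (a sketch of our own, using the prices and amounts from the example):

```python
# The shopping bill as a linear combination: prices play the role of weights,
# amounts play the role of inputs (numbers taken from the example above).

prices = {"potatoes": 2.0, "carrots": 4.0, "milk": 3.0}    # euros per kg or per bottle
amounts = {"potatoes": 2.5, "carrots": 1.0, "milk": 2}     # what we put in the basket

bill = sum(prices[item] * amounts[item] for item in amounts)
print(bill)   # 2.5 x 2 + 1.0 x 4 + 2 x 3 = 15.0 euros
```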
Key terminology
Coefficients or weights
In linear regression terminology, the prices of the different products would be
called coefficients or weights (this may appear confusing since we measured
the amount of potatoes and carrots by weight, but do not let yourself be
tricked by this). One of the main advantages of linear regression is its easy
interpretability: the learned weights may in fact be more interesting than the
predictions of the outputs.
For example, when we use linear regression to predict the life expectancy, the
weight of smoking (cigarettes per day) is about minus half a year, meaning that
smoking one cigarette more per day takes you on the average half a year closer
to termination. Likewise, the weight of vegetable consumption (handful of
vegetables per day) has weight plus one year, so eating a handful of greens
every day gives you on the average one more year.
Learning linear regression
Above, we discussed how predictions are obtained from linear regression
when both the weights and the input features are known. So we are given the
inputs and the weight, and we can produce the predicted output. When we are
given the inputs and the outputs for a number of items, we can find the
weights such that the predicted output matches the actual output as well as
possible. This is the task solved by machine learning.
Note
Example
Continuing the shopping analogy, suppose we were given the contents of a
number of shopping baskets and the total bill for each of them, and we were
asked to figure out the price of each of the products (potatoes, carrots, and so
on). From one basket, say 1kg of sirloin steak, 2kg of carrots, and a bottle of
Chianti, even if we knew that the total bill is 35€, we couldn’t determine the
prices because there are many sets of prices that will yield the same total bill.
With many baskets, however, we will usually be able to solve the problem.
But the problem is made harder by the fact that in the real world, the actual
output isn’t always fully determined by the input, because of various factors
that introduce uncertainty or "noise" into the process. You can think of
shopping at a bazaar where the prices for any given product may vary from
time to time, or a restaurant where the final damage includes a variable
amount of tip. In such situations, we can estimate the prices but only with
some limited accuracy.
Finding the weights that optimize the match between the predicted and the
actual outputs in the training data is a classical statistical problem dating back
to the 1800s, and it can be easily solved even for massive data sets.
We will not go into the details of the actual weight-finding algorithms, such as
the classical least squares technique, simple as they are. However, you can get
a feel of finding trends in data in the following exercises.
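To get a feel for the weight-finding task, here is a minimal sketch that recovers the prices from a handful of baskets with ordinary least squares. The extra baskets and the underlying prices (15€, 4€, and 12€, chosen so that the first basket matches the 35€ example above) are our own invention:

```python
# A sketch of recovering the prices ("weights") from several baskets with
# ordinary least squares. The prices 15, 4, and 12 euros and the extra baskets
# are our own invention (chosen so the first basket matches the 35 euro example),
# and the bills include a little random "noise", like a variable tip would.
import numpy as np

# Columns: kg of sirloin, kg of carrots, bottles of Chianti
baskets = np.array([
    [1.0, 2.0, 1.0],
    [0.5, 1.0, 2.0],
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
])
bills = np.array([35.0, 35.3, 42.1, 26.9])   # exact totals would be 35, 35.5, 42, 27

estimated_prices, *_ = np.linalg.lstsq(baskets, bills, rcond=None)
print(estimated_prices)   # roughly [15, 4, 12]
```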
Visualizing linear regression
A good way to get a feel for what linear regression can tell us is to draw a chart
containing our data and our regression results. As a simple toy example our
data set has one variable, the number of cups of coffee an employee drinks per
day, and the number of lines of code written per day by that employee as the
output. This is not a real data set as obviously there are other factors having an
effect on the productivity of an employee other than coffee that interact in
complex ways. The increase in productivity by increasing the amount of coffee
will also hold only to a certain point after which the jitters distract too much.
[Chart: cups of coffee per day (x-axis, 0 to 10) plotted against lines of code written per day (y-axis, 0 to 60), with one point per employee and the fitted regression line.]
When we present our data in the chart above as points where one point
represents one employee, we can see that there is obviously a trend that
drinking more coffee results in more lines of code being written (recall that this
is completely made-up data). From this data set we can learn the coefficient,
or the weight, related to coffee consumption, and by eye we can already say
that it seems to be somewhere close to five, since for each cup of coffee
consumed the number of lines programmed seems to go up roughly by five.
For example, employees who drink around two cups of coffee per day seem to
produce around 20 lines of code per day, and similarly at four cups of coffee,
the amount of lines produced is around 30.
It can also be noted that employees who do not drink coffee at all also produce code, which the graph shows to be about ten lines per day. This number is the
intercept term that we mentioned earlier. The intercept is another parameter
in the model just like the weights are, that can be learned from the data. Just
as in the life expectancy example it can be thought of as the starting point of
our calculations before we have added in the effects of the input variable, or
variables if we have more than one, be it coffee cups in this example, or
cigarettes and vegetables in the previous one.
The line in the chart represents our predicted outcome, where we have
estimated the intercept and the coefficient by using an actual linear regression
technique called least squares. This line can be used to predict the number of
lines produced when the input is the number of cups of coffee. Note that we
can obtain a prediction even if we allow only partial cups (like half, 1/4 cups,
and so on).
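The same fit can be reproduced in a few lines of code; the data below is generated around the made-up trend described above (intercept about ten, slope about five), so the exact data values are our own assumptions:

```python
# A sketch reproducing the coffee example: data generated around the made-up
# trend described above (intercept about 10, slope about 5), then fitted with
# least squares using np.polyfit.
import numpy as np

rng = np.random.default_rng(1)
cups = rng.uniform(0, 10, 40)                    # cups of coffee per day
lines = 10 + 5 * cups + rng.normal(0, 4, 40)     # lines of code per day, with noise

slope, intercept = np.polyfit(cups, lines, deg=1)   # a degree-1 polynomial is a line
print(round(slope, 1), round(intercept, 1))         # roughly 5 and 10

# The fitted line can then be used for prediction, for example at 3.5 cups:
print(intercept + slope * 3.5)                      # roughly 27.5 lines per day
```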
Machine learning applications of linear regression
Linear regression is truly the workhorse of many AI and data science
applications. It has its limits but they are often compensated by its simplicity,
interpretability and efficiency. Linear regression has been successfully used in
the following problems to give a few examples:
- prediction of click rates in online advertising
- prediction of retail demand for products
- prediction of box-office revenue of Hollywood movies
- prediction of software cost
- prediction of insurance cost
- prediction of crime rates
- prediction of real estate prices
Could we use regression to predict labels?
As we discussed above, linear regression and the nearest neighbor method
produce different kinds of predictions. Linear regression outputs numerical
outputs while the nearest neighbor method produces labels from a fixed set of
alternatives ("classes").
Where linear regression excels compared to nearest neighbors is
interpretability. What do we mean by this? You could say that in a way, the
nearest neighbor method and any single prediction that it produces are easy to
interpret: it’s just the nearest training data element! This is true, but when it
comes to the interpretability of the learned model, there is a clear difference.
Interpreting the trained model in nearest neighbors in a similar fashion as the
weights in linear regression is impossible: the learned model is basically the
whole data, and it is usually way too big and complex to provide us with much
insight. So what if we’d like to have a method that produces the same kind of
outputs as the nearest neighbor, labels, but is interpretable like linear
regression?
Logistic regression to the rescue
Well there is good news for you: we can turn the linear regression method’s
outputs into predictions about labels. The technique for doing this is called
logistic regression. We will not go into the technicalities; suffice it to say that in
the simplest case, we take the output from linear regression, which is a
number, and predict one label A if the output is greater than zero, and another
label B if the output is less than or equal to zero. Actually, instead of just
predicting one class or another, logistic regression can also give us a measure
of uncertainty of the prediction. So if we are predicting whether a customer
will buy a new smartphone this year, we can get a prediction that customer A
will buy a phone with probability 90%, but for another, less predictable
customer, we can get a prediction that they will not buy a phone with 55%
probability (or in other words, that they will buy one with 45% probability).
It is also possible to use the same trick to obtain predictions over more than
two possible labels, so instead of always predicting either yes or no (buy a new
phone or not, fake news or real news, and so forth), we can use logistic
regression to identify, for example, handwritten digits, in which case there are
ten possible labels.
An example of logistic regression
Let’s suppose that we collect data of students taking an introductory course in
cookery. In addition to the basic information such as the student ID, name, and
so on, we also ask the students to report how many hours they studied for the
exam (however you study for a cookery exam, probably cooking?) – and hope
that they are more or less honest in their reports. After the exam, we will know
whether each student passed the course or not. Some data points are
presented below:
Student ID   Hours studied   Pass/fail
24           15              Pass
41           9.5             Pass
58           2               Fail
101          5               Fail
103          6.5             Fail
215          6               Pass
Based on the table, what kind of conclusion could you draw about the relationship between the hours studied and passing the exam? We could think that if we had data from hundreds of students, maybe we could see how much studying is needed in order to pass the course. We can present this data in a chart as you can see below.
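To complement the chart, here is a minimal sketch that fits a logistic regression model to the six data points above; the use of scikit-learn and the example predictions are our own choices:

```python
# A sketch fitting logistic regression to the six data points above; the use
# of scikit-learn and the example predictions below are our own assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[15], [9.5], [2], [5], [6.5], [6]])
passed = np.array([1, 1, 0, 0, 0, 1])            # 1 = pass, 0 = fail

model = LogisticRegression().fit(hours_studied, passed)

# Estimated probability of passing for a student who studied for 8 hours:
print(model.predict_proba([[8]])[0][1])
# Predicted pass/fail labels for 3 and 10 hours of studying:
print(model.predict([[3], [10]]))
```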
The limits of machine learning
To summarize, machine learning is a very powerful tool for building AI
applications. In addition to the nearest neighbor method, linear regression,
and logistic regression, there are literally hundreds, if not thousands, of
different machine learning techniques, but they all boil down to the same
thing: trying to extract patterns and dependencies from data and using them
either to gain understanding of a phenomenon or to predict future outcomes.
Machine learning can be a very hard problem and we can’t usually achieve a
perfect method that would always produce the correct label. However, in most
cases, a good but not perfect prediction is still better than none. Sometimes
we may be able to produce better predictions by ourselves but we may still
prefer to use machine learning because the machine will make its predictions
faster and it will also keep churning out predictions without getting tired. Good
examples are recommendation systems that need to predict what music, what
videos, or what ads are more likely to be of interest to you.
The factors that affect how good a result we can achieve include:
- The hardness of the task: in handwritten digit recognition, if the digits are written very sloppily, even a human can’t always guess correctly what the writer intended
- The machine learning method: some methods are far better for a particular task than others
- The amount of training data: from only a few examples, it is impossible to obtain a good classifier
- The quality of the data
Note
Data quality matters
In the beginning of this chapter, we emphasized the importance of having
enough data and the risks of overfitting. Another equally important factor is
the quality of the data. In order to build a model that generalizes well to data
outside of the training data, the training data needs to contain enough
information that is relevant to the problem at hand. For example, if you create
an image classifier that tells you what the image given to the algorithm is
about, and you have trained it only on pictures of dogs and cats, it will assign
everything it sees as either a dog or a cat. This would make sense if the
algorithm is used in an environment where it will only see cats and dogs, but
not if it is expected to see boats, cars, and flowers as well.
We’ll return to potential problems caused by ”biased” data.
It is also important to emphasize that different machine learning methods are
suitable for different tasks. Thus, there is no single best method for all
problems ("one algorithm to rule them all..."). Fortunately, one can try out a
large number of different methods and see which one of them works best in
the problem at hand.
This leads us to a point that is very important but often overlooked in practice:
what it means to work better. In the digit recognition task, a good method
would of course produce the correct label most of the time. We can measure
this by the classification error: the fraction of cases where our classifier
outputs the wrong class. In predicting apartment prices, the quality measure is
typically something like the difference between the predicted price and the
final price for which the apartment is sold. In many real-life applications, it is
also worse to err in one direction than in another: setting the price too high
may delay the process by months, but setting the price too low will mean less
money for the seller. And to take yet another example, failing to detect a
pedestrian in front of a car is a far worse error than falsely detecting one when
there is none.
As mentioned above, we can’t usually achieve zero error, but perhaps we will
be happy with error less than 1 in 100 (or 1%). This too depends on the
application: you wouldn’t be happy to have only 99% safe cars on the streets,
but being able to predict whether you’ll like a new song with that accuracy
may be more than enough for a pleasant listening experience. Keeping the
actual goal in mind at all times helps us make sure that we create actual added
value.
CHAPTER 5
I. Neural network basics
Our next topic, deep learning and neural networks, tends to attract more interest than
many of the other topics.
One of the reasons for the interest is the hope to understand our own mind,
which emerges from neural processing in our brain. Another reason is the
advances in machine learning achieved within the recent years by combining
massive data sets and deep learning techniques.
What are neural networks?
To better understand the whole, we will start by discussing the individual units
that make it up. A neural network can mean either a “real” biological neural
network such as the one in your brain, or an artificial neural network simulated
in a computer.
Key terminology
Deep learning
Deep learning refers to certain kinds of machine learning techniques where
several “layers” of simple processing units are connected in a network so that
the input to the system is passed through each one of them in turn. This
architecture has been inspired by the processing of visual information in the
brain coming through the eyes and captured by the retina. This depth allows
the network to learn more complex structures without requiring unrealistically
large amounts of data.
Neurons, cell bodies, and signals
A neural network, either biological or artificial, consists of a large number of
simple units, neurons, that receive and transmit signals to each other. The
neurons are very simple processors of information, consisting of a cell body
and wires that connect the neurons to each other. Most of the time, they do
nothing but sit still and watch for signals coming in through the wires.
Dendrites, axons, and synapses
In the biological lingo, we call the wires that provide the input to the neurons
dendrites. Sometimes, depending on the incoming signals, the neuron may fire
and send a signal out for the other neurons to receive. The wire that transmits
the outgoing signal is called an axon. Each axon may be connected to one or
more dendrites at intersections that are called synapses.
Isolated from its fellow-neurons, a single neuron is quite unimpressive, and
capable of only a very restricted set of behaviors. When connected to each
other, however, the system resulting from their concerted action can become
extremely complex. To find evidence for this, look no further than (to use legal
jargon) "Exhibit A": your brain! The behavior of the system is determined by
the ways in which the neurons are wired together. Each neuron reacts to the
incoming signals in a specific way that can also adapt over time. This
adaptation is known to be the key to functions such as memory and learning.
Why develop artificial neural networks?
The purpose of building artificial models of the brain can be neuroscience, the
study of the brain and the nervous system in general. It is tempting to think
that by mapping the human brain in enough detail, we can discover the secrets
of human and animal cognition and consciousness.
Note
Modeling the brain
The BRAIN Initiative led by American neuroscience researchers is pushing
forward technologies for imaging, modeling, and simulating the brain at a finer
and larger scale than before. Some brain research projects are very ambitious
in terms of objectives. The Human Brain Project promised in 2012 that “the
mysteries of the mind can be solved – soon”. After years of work, the Human
Brain Project was facing questions about when the billion euros invested by
the European Union will deliver what was promised, even though, to be fair,
some less ambitious milestones have been achieved.
However, even though we seem to be almost as far as ever from understanding the mind and consciousness, there are clear milestones that have been achieved in neuroscience. Through a better understanding of the structure and function of the brain, we are already reaping some concrete rewards. We can, for instance, identify abnormal functioning and try to help the brain avoid it and reinstate normal operation. This can lead to life-changing new medical
treatments for people suffering from neurological disorders: epilepsy,
Alzheimer’s disease, problems caused by developmental disorders or damage
caused by injuries, and so on.
Note
Looking to the future: brain computer interfaces
One research direction in neuroscience is brain-computer interfaces that allow
interacting with a computer by simply thinking. The current interfaces are very
limited and they can be used, for example, to reconstruct on a very rough level
what a person is seeing, or to control robotic arms by thought. Perhaps some day we will be able to implement a thought-reading machine that allows precise instructions, but currently such machines belong to science fiction. It is also conceivable
that we could feed information into the brain by stimulating it by small
electrical pulses. Such stimulation is currently used for therapeutic purposes.
Feeding detailed information such as specific words, ideas, memories, or
emotions is at least currently science fiction rather than reality, but obviously
we know neither the limits of such technology, nor how hard it is to reach
them.
We’ve drifted a little astray from the topic of the course. In fact, another main
reason for building artificial neural networks has little to do with understanding
biological systems. It is to use biological systems as an inspiration to build
better AI and machine learning techniques. The idea is very natural: the brain is
an amazingly complex information processing system capable of a wide range
of intelligent behaviors (plus occasionally some not-so-intelligent ones), and
therefore, it makes sense to look for inspiration in it when we try to create
artificially intelligent systems.
Neural networks have been a major trend in AI since the 1960s. We’ll return to
the waves of popularity in the history of AI in the final part. Currently neural
networks are again at the very top of the list as deep learning is used to
achieve significant improvements in many areas such as natural language and
image processing, which have traditionally been sore points of AI.
What is so special about neural networks?
The case for neural networks in general as an approach to AI is based on a
similar argument as that for logic-based approaches. In the latter case, it was
thought that in order to achieve human-level intelligence, we need to simulate
higher-level thought processes, and in particular, manipulation of symbols
representing certain concrete or abstract concepts using logical rules.
The argument for neural networks is that by simulating the lower-level,
“subsymbolic” data processing on the level of neurons and neural networks,
intelligence will emerge. This all sounds very reasonable but keep in mind that
in order to build flying machines, we don’t build airplanes that flap their wings, or that are made of bones, muscles, and feathers. Likewise, in artificial neural
networks, the internal mechanism of the neurons is usually ignored and the
artificial neurons are often much simpler than their natural counterparts. The
electro-chemical signaling mechanisms between natural neurons are also
mostly ignored in artificial models when the goal is to build AI systems rather
than to simulate biological systems.
Compared to how computers traditionally work, neural networks have certain
special features:
Neural network key feature 1
For one, in a traditional computer, information is processed in a central
processor (aptly named the central processing unit, or CPU for short) which
can only focus on doing one thing at a time. The CPU can retrieve data to be
processed from the computer’s memory, and store the result in the memory.
Thus, data storage and processing are handled by two separate components of
the computer: the memory and the CPU. In neural networks, the system
consists of a large number of neurons, each of which can process information
on its own so that instead of having a CPU process each piece of information
one after the other, the neurons process vast amounts of information
simultaneously.
Neural network key feature 2
The second difference is that data storage (memory) and processing isn’t
separated like in traditional computers. The neurons both store and process
information so that there is no need to retrieve data from the memory for
processing. The data can be stored short term in the neurons themselves (they
either fire or not at any given time) or for longer term storage, in the
connections between the neurons – their so called weights, which we will
discuss below.
Because of these two differences, neural networks and traditional computers
are suited for somewhat different tasks. Even though it is entirely possible to
simulate neural networks in traditional computers, which was the way they
were used for a long time, their maximum capacity is achieved only when we
use special hardware (computer devices) that can process many pieces of
information at the same time. This is called parallel processing. Incidentally,
graphics processors (or graphics processing units, GPUs) have this capability
and they have become a cost-effective solution for running massive deep
learning methods.
II. How neural networks are built
As we said earlier, neurons are very simple processing units. Having discussed linear and
logistic regression in Chapter 4, the essential technical details of neural networks can be
seen as slight variations of the same idea.
Note
Weights and inputs
The basic artificial neuron model involves a set of adaptive parameters, called
weights like in linear and logistic regression. Just like in regression, these
weights are used as multipliers on the inputs of the neuron, which are added
up. The sum of the weights times the inputs is called the linear combination of
the inputs. You can probably recall the shopping bill analogy: you multiply the
amount of each item by its price per unit and add up to get the total.
If we have a neuron with six inputs (analogous to the amounts of the six
shopping items: potatoes, carrots, and so on), input1, input2, input3, input4,
input5, and input6, we also need six weights. The weights are analogous to the
prices of the items. We’ll call them weight1, weight2, weight3, weight4,
weight5, and weight6. In addition, we’ll usually want to include an intercept
term like we did in linear regression. This can be thought of as a fixed
additional charge due to processing a credit card payment, for example.
We can then calculate the linear combination like this: linear combination = intercept + weight1 × input1 + ... + weight6 × input6 (where the ... is a shorthand notation meaning that the sum includes all the terms from 1 to 6).
Activations and outputs
Once the linear combination has been computed, the neuron does one more
operation. It takes the linear combination and puts it through a so-called
activation function. Typical examples of the activation function include:
- identity function: do nothing and just output the linear combination
- step function: if the value of the linear combination is greater than zero, send a pulse (ON), otherwise do nothing (OFF)
- sigmoid function: a “soft” version of the step function
Note that with the first activation function, the identity function, the neuron is
exactly the same as linear regression. This is why the identity function is rarely
used in neural networks: it leads to nothing new and interesting.
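Putting the pieces together, a single artificial neuron can be sketched in a few lines of code (our own illustration; the input and weight values are arbitrary):

```python
# A sketch of a single artificial neuron: a linear combination of the inputs
# followed by an activation function. The input and weight values are arbitrary.
import math

def neuron_output(inputs, weights, intercept, activation="sigmoid"):
    linear_combination = intercept + sum(w * x for w, x in zip(weights, inputs))
    if activation == "identity":
        return linear_combination                    # plain linear regression
    if activation == "step":
        return 1 if linear_combination > 0 else 0
    return 1 / (1 + math.exp(-linear_combination))   # sigmoid, a "soft" step

# Six inputs and six weights, as in the shopping analogy above
inputs = [2.5, 1.0, 2.0, 0.0, 0.0, 1.0]
weights = [2.0, 4.0, 3.0, 1.5, 0.5, 2.0]
print(neuron_output(inputs, weights, intercept=-10, activation="step"))      # 1 (fires)
print(neuron_output(inputs, weights, intercept=-10, activation="sigmoid"))   # about 0.999
```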
Note
How neurons activate
Real, biological neurons communicate by sending out sharp, electrical pulses
called “spikes”, so that at any given time, their outgoing signal is either on or
off (1 or 0). The step function imitates this behavior. However, artificial neural
networks tend to use activation functions that output a continuous numerical
activation level at all times, such as the sigmoid function. Thus, to use a
somewhat awkward figure of speech, real neurons communicate by something
similar to the Morse code, whereas artificial neurons communicate by
adjusting the pitch of their voice as if yodeling.
The output of the neuron, determined by the linear combination and the
activation function, can be used to extract a prediction or a decision. For
example, if the network is designed to identify a stop sign in front of a self-
driving car, the input can be the pixels of an image captured by a camera
attached in front of the car, and the output can be used to activate a stopping
procedure that stops the car before the sign.
Learning or adaptation in the network occurs when the weights are adjusted so
as to make the network produce the correct outputs, just like in linear or
logistic regression. Many neural networks are very large, and the largest
contain hundreds of billions of weights. Optimizing them all can be a daunting
task that requires massive amounts of computing power.
Perceptron: the mother of all ANNs
The perceptron is simply a fancy name for the simple neuron model with the
step activation function we discussed above. It was among the very first formal
models of neural computation and because of its fundamental role in the
history of neural networks, it wouldn’t be unfair to call it the “mother of all
artificial neural networks”.
It can be used as a simple classifier in binary classification tasks. A method for
learning the weights of the perceptron from data, called the Perceptron
algorithm, was introduced by the psychologist Frank Rosenblatt in 1957. We
will not study the Perceptron algorithm in detail. Suffice it to say that it is just
about as simple as the nearest neighbor classifier. The basic principle is to feed
the network training data one example at a time. Each misclassification leads
to an update in the weights.
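For the curious, here is a rough sketch of what such an update rule can look like in Python. This is a generic textbook version of the perceptron rule, not code from the course, and the toy data (the logical OR function) and learning rate are arbitrary choices:

# A rough sketch of the perceptron learning rule. Labels are 0/1; each
# misclassified example nudges the weights toward the correct answer.
def train_perceptron(examples, labels, n_epochs=10, learning_rate=1.0):
    n_inputs = len(examples[0])
    weights = [0.0] * n_inputs
    intercept = 0.0
    for _ in range(n_epochs):
        for x, y in zip(examples, labels):
            z = intercept + sum(w * xi for w, xi in zip(weights, x))
            prediction = 1 if z > 0 else 0          # step activation
            error = y - prediction                  # 0 if correct, +1 or -1 if not
            if error != 0:                          # update only on misclassification
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                intercept += learning_rate * error
    return intercept, weights

# Toy example: learn the logical OR of two inputs.
data   = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 1]
print(train_perceptron(data, labels))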
Note
AI hyperbole
After its discovery, the Perceptron algorithm received a lot of attention, not
least because of optimistic statements made by its inventor, Frank Rosenblatt.
A classic example of AI hyperbole is a New York Times article published on July
8th, 1958:
“The Navy revealed the embryo of an electronic computer today that it expects
will be able to walk, talk, see, reproduce itself and be conscious of its
existence.”
Please note that neural network enthusiasts are not at all the only ones
inclined towards optimism. The rise and fall of the logic-based expert systems
approach to AI had all the same hallmark features of AI hype, with people
claiming that the final breakthrough was just a short while away. The outcome
in both the early 1960s and the late 1980s was a collapse in research funding,
known as an AI winter.
The history of the debate that eventually led to the almost complete abandonment
of the neural network approach in the 1960s, for more than two decades, is
extremely fascinating. The article A Sociological Study of the Official History of
the Perceptrons Controversy by Mikel Olazaran (published in Social Studies of
Science, 1996) reviews the events from a sociology of science point of view.
Reading it today is quite thought provoking. Reading stories about celebrated
AI heroes who had developed neural network algorithms that would soon
reach the level of human intelligence and become self-conscious can be
compared to some statements made during the current hype. If you take a
look at the above article, even if you don't read all of it, it will provide an
interesting background to today's news. Consider for example an article in the
MIT Technology Review published in September 2017, where Jordan Jacobs,
co-founder of the multimillion-dollar Vector Institute for AI, compares Geoffrey
Hinton (a figurehead of the current deep learning boom) to Einstein because
of his contributions to the development of neural network algorithms in the 1980s
and later. Also recall the Human Brain Project mentioned in the previous
section.
According to Hinton, “the fact that it doesn’t work is just a temporary
annoyance” (although according to the article, Hinton is laughing about the
above statement, so it’s hard to tell how serious he is about it). The Human
Brain project claims to be “close to a profound leap in our understanding of
consciousness”. Doesn’t that sound familiar?
No-one really knows the future with certainty, but knowing the track record of
earlier announcements of imminent breakthroughs, some critical thinking is
advised. We’ll return to the future of AI in the final chapter, but for now, let’s
see how artificial neural networks are built.
Putting neurons together: networks
A single neuron would be way too simple to make decisions and predictions
reliably in most real-life applications. To unleash the full potential of neural
networks, we can use the output of one neuron as the input of other neurons,
whose outputs can be the input to yet other neurons, and so on. The output of
the whole network is obtained as the output of a certain subset of the
neurons, which are called the output layer. We'll return to this in a bit, after
we have discussed the way neural networks adapt to produce different behaviors by
learning their parameters from data.
Key terminology
Layers
Often the network architecture is composed of layers. The input layer consists
of neurons that get their inputs directly from the data. So, for example, in an
image recognition task, the input layer would take the pixel values of the input
image as its inputs. The network typically also has hidden
layers that use the other neurons' outputs as their input, and whose output is
used as the input to other layers of neurons. Finally, the output layer produces
the output of the whole network. All the neurons on a given layer get inputs
from neurons on the previous layer and feed their output to the next.
A classical example of a multilayer network is the so-called multilayer
perceptron. As we discussed above, Rosenblatt's Perceptron algorithm can be
used to learn the weights of a perceptron. For the multilayer perceptron, the
corresponding learning problem is much harder, and it took a long time before a
working solution was discovered. But eventually one was invented: the
backpropagation algorithm led to a revival of neural networks in the late
1980s. It is still at the heart of many of the most advanced deep learning solutions.
Note
Meanwhile in Helsinki...
The path(s) leading to the backpropagation algorithm are rather long and
winding. An interesting part of the history is related to the computer science
department of the University of Helsinki. About three years after the founding
of the department in 1967, a Master’s thesis was written by a student called
Seppo Linnainmaa. The topic of the thesis was “Cumulative rounding error of
algorithms as a Taylor approximation of individual rounding errors” (the thesis
was written in Finnish, so this is a translation of the actual title “Algoritmin
kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirheiden Taylor-
kehitelmänä”).
The automatic differentiation method developed in the thesis was later applied
by other researchers to quantify the sensitivity of the output of a multilayer
neural network with respect to the individual weights, which is the key idea in
backpropagation.
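To make the idea of "sensitivity" concrete, the sketch below estimates, for a made-up two-neuron network, how much the output changes when one weight is nudged slightly. Backpropagation computes this same quantity exactly and efficiently using the chain rule; a finite difference is used here only for illustration, and the weights and input are arbitrary:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def tiny_network(x, w1, w2):
    # one hidden neuron followed by one output neuron, both with sigmoid activation
    hidden = sigmoid(w1 * x)
    return sigmoid(w2 * hidden)

# Sensitivity of the output with respect to w1, estimated by a finite difference.
x, w1, w2, eps = 1.0, 0.5, -1.2, 1e-6
sensitivity = (tiny_network(x, w1 + eps, w2) - tiny_network(x, w1 - eps, w2)) / (2 * eps)
print(sensitivity)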
A simple neural network classifier
To give a relatively simple example of using a neural network classifier, we'll
consider a task that is very similar to the MNIST digit recognition task, namely
classifying images in two classes. We will first create a classifier to classify
whether an image shows a cross (x) or a circle (o). Our images are represented
here as pixels that are either colored or white, and the pixels are arranged in 5
× 5 grid. In this format our images of a cross and a circle (more like a diamond,
to be honest) look like this:
In order to build a neural network classifier, we need to formalize the problem
in a way where we can solve it using the methods we have learned. Our first
step is to represent the information in the pixels by numerical values that can
be used as the input to a classifier. Let's use 1 if the square is colored, and 0 if
it is white. Note that although the symbols in the above graphic are of different
color (green and blue), our classifier will ignore the color information and use
only the colored/white information. The 25 pixels in the image make up the inputs
of our classifier.
To make sure that we know which pixel is which in the numerical
representation, we can decide to list the pixels in the same order as you'd read
text, so row by row from the top, and reading each row from left to right. The
first row of the cross, for example, is represented as 1,0,0,0,1; the second row
as 0,1,0,1,0, and so on. The full input for the cross input is then:
1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1.
We'll use the basic neuron model where the first step is to compute a linear
combination of the inputs. We thus need a weight for each of the input pixels,
which means 25 weights in total.
Finally, we use the step activation function. If the linear combination is
negative, the neuron activation is zero, which we decide to use to signify a
cross. If the linear combination is positive, the neuron activation is one, which
we decide to use to signify a circle.
Let's try what happens when all the weights take the same numerical value, 1.
With this setup, our linear combination for the cross image will be 9 (9 colored
pixels, so 9 × 1, and 16 white pixels, 16 × 0), and for the circle image it will be 8
(8 colored pixels, 8 × 1, and 17 white pixels, 17 × 0). In other words, the linear
combination is positive for both images and they are thus classified as circles.
Not a very good result given that there are only two images to classify.
To improve the result, we need to adjust the weights in such a way that the
linear combination will be negative for a cross and positive for a circle. If we
think about what differentiates images of crosses and circles, we can see that
circles have no colored pixels in the center of the image, whereas crosses do.
Likewise, the pixels at the corners of the image are colored in the cross, but
white in the circle.
We can now adjust the weights. There are an infinite number of weight
combinations that do the job. For example, assign weight –1 to the center pixel
(the 13th pixel), and weight 1 to the pixels in the middle of each of the four sides
of the image, letting all the other weights be 0. Now, for the cross input, the
center pixel produces the value –1, while for all the other pixels either the pixel
value or the weight is 0, so that –1 is also the total value. This leads to activation
0, and the cross is correctly classified.
How about the circle then? Each of the pixels in the middle of the sides
produces the value 1, which makes 4 × 1 = 4 in total. For all the other pixels
either the pixel value or the weight is zero, so 4 is the total. Since 4 is a positive
value, the activation is 1, and the circle is correctly recognized as well.
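Here is a small Python sketch of this classifier. The cross pattern is the one listed above; since the images themselves aren't reproduced in this text version, the circle (diamond) pattern below is a plausible reconstruction with eight colored pixels and a white center:

# The 5 x 5 images flattened row by row into 25 inputs (1 = colored, 0 = white).
cross  = [1,0,0,0,1, 0,1,0,1,0, 0,0,1,0,0, 0,1,0,1,0, 1,0,0,0,1]
circle = [0,0,1,0,0, 0,1,0,1,0, 1,0,0,0,1, 0,1,0,1,0, 0,0,1,0,0]  # reconstruction

# Hand-picked weights from the text: -1 for the center pixel (index 12),
# +1 for the middle pixel of each side, 0 everywhere else.
weights = [0] * 25
weights[12] = -1                     # center pixel
for i in (2, 10, 14, 22):            # middle of the top, left, right, and bottom sides
    weights[i] = 1

def classify(pixels):
    z = sum(w * p for w, p in zip(weights, pixels))
    return "circle" if z > 0 else "cross"   # step activation: positive means circle

print(classify(cross))    # cross
print(classify(circle))   # circle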
Happy or not?
We will now follow similar reasoning to build a classifier for smiley faces. You
can assign weights to the input pixels in the image by clicking on them. Clicking
once sets the weight to 1, and clicking again sets it to -1. The activation 1
indicates that the image is classified as a happy face, which can be correct or
not, while activation –1 indicates that the image is classified as a sad face.
Don't be discouraged by the fact that you will not be able to classify all the
smiley faces correctly: it is in fact impossible with our simple classifier! This is
one important learning objective: sometimes perfect classification just isn't
possible because the classifier is too simple. In this case the simple neuron that
uses a linear combination of the inputs is too simple for the task. Observe how
you can build classifiers that work well in different cases: some classify most of
the happy faces correctly while being worse for sad faces, or the other way
around.
III. Advanced neural network techniques
In the previous section, we have discussed the basic ideas behind most neural network
methods: multilayer networks, non-linear activation functions, and learning rules such as
the backpropagation algorithm.
They power almost all modern neural network applications. However, there
are some interesting and powerful variations of the theme that have led to
great advances in deep learning in many areas.
Convolutional neural networks (CNNs)
One area where deep learning has achieved spectacular success is image
processing. The simple classifier that we studied in detail in the previous
section is severely limited – as you noticed, it wasn't even possible to classify all
the smiley faces correctly. Adding more layers to the network and using
backpropagation to learn the weights does in principle solve the problem, but
another one emerges: the number of weights becomes extremely large and
consequently, the amount of training data required to achieve satisfactory
accuracy can become too large to be realistic.
Fortunately, a very elegant solution to the problem of too many weights exists:
a special kind of neural network, or rather, a special kind of layer that can be
included in a deep neural network. This special kind of layer is a so-
called convolutional layer. Networks including convolutional layers are
called convolutional neural networks (CNNs). Their key property is that they
can detect image features such as bright or dark (or specific color) spots, edges
in various orientations, patterns, and so on. These form the basis for detecting
more abstract features such as a cat’s ears, a dog’s snout, a person’s eye, or
the octagonal shape of a stop sign. It would normally be hard to train a neural
network to detect such features based on the pixels of the input image,
because the features can appear in different positions, different orientations,
and in different sizes in the image: moving the object or the camera angle will
change the pixel values dramatically even if the object itself looks just the
same to us. Learning to detect a stop sign in all these different
conditions would require vast amounts of training data, because the network
would only detect the sign in conditions where it has appeared in the training
data. So, for example, a stop sign in the top right corner of the image would be
detected only if the training data included an image with the stop sign in the
top right corner. CNNs, in contrast, can recognize the object anywhere in the image, no
matter where it has been observed in the training images.
Note
Why we need CNNs
CNNs use a clever trick to reduce the amount of training data required to
detect objects in different conditions. The trick basically amounts to using the
same input weights for many neurons – so that all of these neurons are
activated by the same pattern – but with different input pixels. We can, for
example, have a set of neurons that are activated by a cat's pointy ear. When
the input is a photo of a cat, two neurons are activated, one for the left ear and
another for the right. We can also let the neuron's input pixels be taken from a
smaller or a larger area, so that different neurons are activated by the ear
appearing in different scales (sizes); this way we can detect a small cat's ears
even if the training data only included images of big cats.
The convolutional neurons are typically placed in the bottom layers of the
network, which process the raw input pixels. Basic neurons (like the
perceptron neuron discussed above) are placed in the higher layers, which
process the output of the bottom layers. The bottom layers can usually be
trained using unsupervised learning, without a particular prediction task in
mind. Their weights will be tuned to detect features that appear frequently in
the input data. Thus, with photos of animals, typical features will be ears and
snouts, whereas in images of buildings, the features are architectural
components such as walls, roofs, windows, and so on. If a mix of various
objects and scenes is used as the input data, then the features learned by the
bottom layers will be more or less generic. This means that pre-trained
convolutional layers can be reused in many different image processing tasks.
This is extremely important since it is easy to get virtually unlimited amounts of
unlabeled training data – images without labels – which can be used to train
the bottom layers. The top layers are always trained by supervised machine
learning techniques such as backpropagation.
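The weight-sharing trick can be illustrated with a few lines of Python and NumPy: the same small set of weights (a filter) is applied at every position of the image. The image and the vertical-edge filter below are made-up examples, not taken from any real network:

import numpy as np

def convolve2d(image, kernel):
    # Slide the same 3x3 weights over every position of the image ("weight sharing"),
    # so the same pattern detector is applied everywhere, regardless of location.
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Made-up example: a tiny 6x6 image with a vertical edge, and a filter that responds to it.
image = np.array([[0, 0, 0, 1, 1, 1]] * 6, dtype=float)
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

print(convolve2d(image, vertical_edge))   # large values where the edge is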
Do neural networks dream of electric sheep? Generative adversarial networks (GANs)
Having trained a neural network on data, we can use it for predictions. Since
the top layers of the network have been trained in a supervised manner to
perform a particular classification or prediction task, the top layers are really
useful only for that task. A network trained to detect stop signs is useless for
detecting handwritten digits or cats.
A fascinating result is obtained by taking the pre-trained bottom layers and
studying what the features they have learned look like. This can be achieved by
generating images that activate a certain set of neurons in the bottom layers.
Looking at the generated images, we can see what the neural network “thinks”
a particular feature looks like, or what an image with a select set of features in
it would look like. Some even like to talk about the networks “dreaming” or
“hallucinating” images (see Google’s DeepDream system).
Note
Be careful with metaphors
However, we’d like to once again emphasize the problem with metaphors such
as dreaming when simple optimization of the input image is meant –
remember the suitcase words discussed in Chapter 1. The neural network
doesn’t really dream, and it doesn’t have a concept of a cat that it would
understand in a similar sense as a human understands. It is simply trained to
recognize objects and it can generate images that are similar to the input data
that it is trained on.
To actually generate real looking cats, human faces, or other objects (you’ll get
whatever you used as the training data), Ian Goodfellow, a researcher at
Google Brain at the time, proposed a clever combination of two neural
networks. The idea is to let the two networks compete against each other. One
of the networks is trained to generate images like the ones in the training data
– it is called the generative network. The other network’s task is to separate
images generated by the first network from real images from the training data
– this one is called the adversarial network. These two combined then make up
a generative adversarial network or a GAN.
The system trains the two models side by side. In the beginning of the training,
the adversarial model has an easy time distinguishing the real images in the
training data from the clumsy attempts by the generative model. However, as
the generative network slowly gets better and better, the adversarial model
has to improve as well, and the cycle continues until eventually the generated
images are almost indistinguishable from real ones. The GAN tries not only to
reproduce the images in the training data: that would be a way too simple
strategy to beat the adversarial network. Rather, the system is trained so that
it has to be able to generate new, realistic-looking images too.
The above images were generated by a GAN developed by NVIDIA in a project led by Prof Jaakko Lehtinen (see this article for more).
Could you have recognized them as fakes?
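As a rough sketch of this training loop, the code below pits a tiny generator against a tiny adversarial (discriminator) network using PyTorch; this is an illustrative choice of library, not part of the course. To keep it minimal, the "images" are just single numbers drawn from a Gaussian distribution:

import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
adversarial = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
a_opt = torch.optim.Adam(adversarial.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    real = torch.randn(32, 1) * 2 + 5             # "training data": Gaussian with mean 5
    fake = generator(torch.randn(32, 8))          # generated samples from random noise

    # 1) Train the adversarial network to tell real samples from generated ones.
    a_opt.zero_grad()
    a_loss = loss_fn(adversarial(real), torch.ones(32, 1)) + \
             loss_fn(adversarial(fake.detach()), torch.zeros(32, 1))
    a_loss.backward()
    a_opt.step()

    # 2) Train the generator to fool the adversarial network.
    g_opt.zero_grad()
    g_loss = loss_fn(adversarial(fake), torch.ones(32, 1))
    g_loss.backward()
    g_opt.step()

print(generator(torch.randn(5, 8)).detach())      # samples should end up near mean 5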
The Rise of Large Language Models (LLMs)
As mentioned above, convolutional neural networks (CNNs) reduce the
number of learnable weights in a neural network so that the amount of
training data required to learn all of them doesn't grow astronomically large as
we keep building bigger and bigger networks. Another architectural
innovation, besides the idea of a CNN, that currently powers many state-of-
the-art deep learning models is called attention.
Attention mechanisms were originally introduced for machine translation,
where they can selectively focus the attention of the model on certain words in
the input text when generating a particular word in the output. This way the
model doesn't have to pay attention to all of the input at the same time, which
greatly simplifies the learning task. Attention mechanisms were soon found to
be extremely useful well beyond machine translation.
In 2017, a team working at Google published the blockbuster article "Attention
is All You Need", which introduced the so-called transformer architecture for
deep neural networks. Unless you have been living on a desert island or on an
otherwise strict media diet, you have most likely already heard about
transformers (the neural network models, not the toy franchise). It's just that
they may have been hiding inside an acronym: GPT (Generative Pretrained
Transformer). As the title of the article by the Google team suggests,
transformers heavily exploit attention mechanisms to get the most out of the
available training data and computational resources.
The most widely noted applications of transformers are found in large
language models (LLMs). The best known ones are OpenAI's GPT-series,
including GPT-1 released in June 2018 and GPT-4 announced in March 2023,
but no giant platform company wants to miss out: Google picks model names
from Sesame street and published BERT (Bidirectional Encoder
Representations from Transformers) in October 2018, while Meta joined the
party a bit later in February 2023, picking a name inspired by the animal world,
LLaMA (Large Language Model Meta AI). And it's not just the platform
companies that are driving the development: universities and other research
organizations are contributing open source models with the goal
of democratizing the technology.
Note
What's in an LLM?
LLMs are models that, given a piece of text like "The capital of Finland is",
predict how the text is likely to continue. In this case, "Helsinki" or "a pocket-
sized metropolis" would be likely continuations. LLMs are trained on large
amounts of text such as the entire contents of Wikipedia or the
CommonCrawl dataset, which, at the time of writing this, contains a whopping
260 billion web pages.
In principle, one can view LLMs as basically nothing but extremely powerful
predictive text entry techniques. However, with some further thinking, it
becomes apparent that being able to predict the continuation of any text in a
way that is indistinguishable from human writing is (or would be) quite a feat
and encompasses many aspects of intelligence. The above example, which is
based on the association between the words "the capital of Finland" and
"Helsinki", is an example where the model has learned a fact about the world. If
we were able to build models that associate the commonly agreed answers with a
wide range of questions, it could be argued that such a model has learned a big
chunk of so-called "world knowledge". Especially intriguing are instances
where the model seems to exhibit some level of reasoning beyond
memorization and statistical co-occurrence: currently, LLMs are able to do this
only in a limited sense, and they can easily make trivial mistakes because they are
based on "just" statistical machine learning. Intensive research and
development efforts are directed at building deep learning models with more
robust reasoning algorithms and databases of verified facts.
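To make the "predict the continuation" idea concrete, here is a toy next-word predictor that simply counts which word follows which in a tiny made-up corpus. A real LLM is a transformer network trained on billions of pages, so this illustrates only the shape of the task, not the technique:

from collections import Counter, defaultdict

corpus = (
    "the capital of finland is helsinki . "
    "the capital of france is paris . "
    "the capital of finland is helsinki ."
).split()

# Count, for each word, how often each other word follows it.
following = defaultdict(Counter)
for word, next_word in zip(corpus, corpus[1:]):
    following[word][next_word] += 1

def predict_next(word):
    # Suggest the most common continuation seen in the corpus.
    return following[word].most_common(1)[0][0]

print(predict_next("is"))   # 'helsinki' (it follows "is" most often in this toy corpus)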
Note
ChatGPT: AI for the masses
A massive earthquake occurred in San Francisco on November 30, 2022. It was
so powerful that hardly a person on the planet was unaffected, and yet, no
seismometer detected it. This metaphorical "earthquake" was the launch of
ChatGPT by OpenAI. Word of the online chatbot service that anyone could use
free of charge quickly spread around the world, and after a mere five days it had
more than a million registered users (compare this to the five years that it took
the Elements of AI to reach the same number); in two months, the number
of signups was 100 million. No other AI service, or probably any service
whatsoever, has become a household name so quickly.
The first version of ChatGPT was based on a GPT-3.5 model fine-tuned by
supervised and reinforcement learning according to a large number of human-
rated responses. The purpose of the fine-tuning process was to steer the model
away from toxic and incorrect responses that the language model had picked
up from its training data, and towards comprehensive and helpful responses.
It is not easy to say what caused the massive media frenzy and the
unprecedented interest in ChatGPT by pretty much everyone, even
those who hadn't paid much attention to AI thus far. Probably some of it is
explained by the somewhat better quality of the output, due to the fine-tuning,
and the easy-to-use chat interface, which enables the user not only to get one-
off answers to isolated questions, like any of the earlier LLMs, but also to
maintain a coherent dialogue in a specific context. In the same vein, the chat
interface allows one to make requests like "explain this to a five year old" or
"write that as a song in the style of Nick Cave." (Mr Cave, however, wasn't
impressed [BBC].) In any case, ChatGPT succeeded in bumping the interest in
AI to completely new levels.
CHAPTER 6
I. About predicting the future
We will start by addressing what is known to be one of the hardest problems of all:
predicting the future.
You may be disappointed to hear this, but we don't have a crystal ball that
would show us what the world will be like in the future and how AI will
transform our lives. As scientists, we are often asked to provide predictions,
and our refusal to provide any is met with a roll of the eyes ("boring
academics"). But in fact, we maintain that anyone who claims to know the future
of AI and the implications it will have on our society should be treated with
suspicion.
The reality distortion field
Not everyone is quite as conservative about their forecasts, however. In the
modern world where big headlines sell, and where you have to compress news
into 280 characters, reserved (boring?) messages are lost, and simple and
dramatic messages are magnified. In the public perception of AI, this is clearly
true.
Note
From utopian visions to grim predictions
The media sphere is dominated by the extremes. We are beginning to see AI
celebrities, standing for one big idea and making oracle-like forecasts about
the future of AI. The media love their clear messages. Some promise us
a utopian future with exponential growth and trillion-dollar industries
emerging out of nowhere, true AI that will solve all problems we cannot solve
by ourselves, and where humans don’t need to work at all.
It has also been claimed that AI is a path to world domination. Others make
even more extraordinary statements according to which AI marks the end of
humanity (in about 20-30 years from now), life itself will be transformed in the
“Age of AI”, and that AI is a threat to our existence.
While some forecasts will probably get at least something right, others will
likely be useful only as demonstrations of how hard it is to predict, and many
don’t make much sense. What we would like to achieve is for you to be able to
look at these and other forecasts, and be able to critically evaluate them.
On hedgehogs and foxes
The political scientist Philip E. Tetlock, author of Superforecasting: The Art and
Science of Prediction, classifies people into two categories: those who have one
big idea (“hedgehogs”), and those who have many small ideas (“foxes”).
Tetlock carried out an experiment between 1984 and 2003 to study factors
that could help us identify which predictions are likely to be accurate and
which are not. One of the significant findings was that foxes tend to be clearly
better at prediction than hedgehogs, especially when it comes to long-term
forecasting.
Probably the messages that can be expressed in 280 characters are more often
big and simple hedgehog ideas. Our advice is to pay attention to carefully
justified and balanced information sources, and to be suspicious about people
who keep explaining everything using a single argument.
Predicting the future is hard, but at least we can consider the past and present
of AI, and by understanding them, hopefully be better prepared for the future,
whatever it turns out to be like.
AI winters
The history of AI, just like many other fields of science, has witnessed the
coming and going of various different trends. In the philosophy of science, the
term used for a trend is paradigm. Typically, a particular paradigm is adopted
by most of the research community and optimistic predictions about progress
in the near future are made. For example, in the 1960s neural networks
were widely believed to solve all AI problems by imitating the learning
mechanisms in nature, the human brain in particular. The next big thing
was expert systems based on logic and human-coded rules, which was the
dominant paradigm in the 1980s.
The cycle of hype
In the beginning of each wave, a number of early success stories tend to make
everyone happy and optimistic. The success stories, even if they may be in
restricted domains and in some ways incomplete, become the focus of public
attention. Many researchers rush into AI – or at least start calling their research AI –
in order to access the increased research funding. Companies also initiate and
expand their efforts in AI for fear of missing out (FOMO).
So far, each time an all-encompassing, general solution to AI has been said to
be within reach, progress has ended up running into insurmountable problems,
which at the time were thought to be minor hiccups. In the case of neural
networks in the 1960s, the hiccups were related to handling nonlinearities and
to solving the machine learning problems associated with the increasing
number of parameters required by neural network architectures. In the case of
expert systems in the 1980s, the hiccups were associated with handling
uncertainty and common sense. As the true nature of the remaining problems
dawned after years of struggling and unfulfilled promises, pessimism about
the paradigm accumulated and an AI winter followed: interest in the field
faltered and research efforts were directed elsewhere.
Modern AI
Currently, roughly since the turn of the millennium, AI has been on the rise
again. Modern AI methods tend to focus on breaking a problem into a number
of smaller, isolated and well-defined problems and solving them one at a time.
Modern AI bypasses grand questions about the meaning of intelligence, the
mind, and consciousness, and focuses on building practically useful solutions
to real-world problems. Good news for all of us who can benefit from such
solutions!
Another characteristic of modern AI methods, closely related to working in the
complex and “messy” real world, is the ability to handle uncertainty, which we
demonstrated by studying the uses of probability in AI in Chapter 3. Finally, the
current upwards trend of AI has been greatly boosted by the come-back of
neural networks and deep learning techniques capable of processing images
and other real-world data better than anything we have seen before.
Note
So are we in a hype cycle?
Whether history will repeat itself, and the current boom will once again be
followed by an AI winter, is a matter that only time can tell. Even if it does, and
the progress towards better and better solutions slows down to a halt, the
significance of AI in society is going to stay. Thanks to the focus on useful
solutions to real-world problems, modern AI research bears fruit already
today, rather than trying to solve the big questions about general intelligence
first – which was where the earlier attempts failed.
Prediction 1: AI will continue to be all around us
As you recall, we started by motivating the study of AI by discussing prominent
AI applications that affect all our lives. We highlighted three examples: self-
driving vehicles, recommendation systems, and image and video processing.
During the course, we have also discussed a wide range of other applications
that contribute to the ongoing technological transition.
Note
AI making a difference
As a consequence of focusing on practicality rather than the big problems, we
live our lives surrounded by AI (even if we may most of the time be happily
unaware of it): the music we listen to, the products we buy online, the movies
and series we watch, our routes of transportation, and even the news and
information that we have available are all influenced more and more by AI.
What is more, basically any field of science, from medicine and astrophysics to
medieval history, is also adopting AI methods in order to deepen our
understanding of the universe and of ourselves.
Prediction 2: the Terminator isn't coming
One of the most pervasive and persistent ideas related to the future of AI is the
Terminator. In case you have somehow missed the image of a brutal
humanoid robot with a metal skeleton and glaring eyes... well, that's what it
is. The Terminator is a 1984 film by director James Cameron. In the movie, a
global AI-powered defense system called Skynet becomes conscious of its
existence and wipes most of humankind out of existence with nukes and
advanced killer robots.
Note
Two doomsday scenarios
There are two alternative scenarios that are suggested to lead to the coming of
the Terminator or other similarly terrifying forms of robot uprising. In the first,
which is the story from the 1984 film, a powerful AI system just becomes
conscious and decides that it just really, really dislikes humanity in general.
In the second alternative scenario, the robot army is controlled by an
intelligent but not conscious AI system that is in principle in human control.
The system can be programmed, for example, to optimize the production of
paper clips. Sounds innocent enough, doesn’t it?
However, if the system possesses superior intelligence, it will soon reach the
maximum level of paper clip production that the available resources, such as
energy and raw materials, allow. After this, it may come to the conclusion that
it needs to redirect more resources to paper clip production. In order to do so,
it may need to prevent the use of the resources for other purposes even if they
are essential for human civilization. The simplest way to achieve this is to kill all
humans, after which a great deal more resources become available for the
system’s main task, paper clip production.
Why these scenarios are unrealistic
There are a number of reasons why both of the above scenarios are extremely
unlikely and belong to science fiction rather than serious speculation about the
future of AI.
Reason 1:
Firstly, the idea that a superintelligent, conscious AI that can outsmart humans
emerges as an unintended result of developing AI methods is naive. As you
have seen in the previous chapters, AI methods are nothing but automated
reasoning, based on the combination of perfectly understandable principles
and plenty of input data, both of which are provided by humans or systems
deployed by humans. To imagine that the nearest neighbor classifier, linear
regression, the AlphaGo game engine, or even a deep neural network could
become conscious and start evolving into a superintelligent AI mind requires a
(very) lively imagination.
Note that we are not claiming that building human-level intelligence would be
categorically impossible. You only need to look as far as the mirror to see a
proof of the possibility of a highly intelligent physical system. To repeat what
we are saying: superintelligence will not emerge from developing narrow AI
methods and applying them to solve real-world problems (recall the narrow vs
general AI from the section on the philosophy of AI in Chapter 1).
Reason 2:
Secondly, one of the favorite ideas of those who believe in superintelligent AI
is the so-called singularity: a system that optimizes and "rewires" itself so that
it can improve its own intelligence at an ever accelerating, exponential rate.
Such a superintelligence would leave humankind so far behind that we become
like ants that can be exterminated without hesitation. The idea of an exponential
intelligence increase is unrealistic for the simple reason that even if a system
could optimize its own workings, it would keep facing more and more difficult
problems that would slow down its progress, quite like the progress of human
scientists requires ever greater efforts and resources from the whole research
community and indeed the whole society, which the superintelligent entity
wouldn't have access to. Human society still has the power to decide what
we use technology, even AI technology, for. Much of this power is indeed given
to us by technology, so that every time we make progress in AI technology, we
become more powerful and better at controlling any potential risks due to it.
Note
The value alignment problem
The paper clip example is known as the value alignment problem: specifying
the objectives of the system so that they are aligned with our values is very
hard. However, suppose that we create a superintelligent system that could
defeat humans who tried to interfere with its work. It’s reasonable to assume
that such a system would also be intelligent enough to realize that when we
say “make me paper clips”, we don’t really mean to turn the Earth into a paper
clip factory of a planetary scale.
Separating stories from reality
All in all, the Terminator is a great story to make movies about but hardly a real
problem worth panicking about. The Terminator is a gimmick, an easy way to
get a lot of attention, a poster boy for journalists to increase click rates, a red
herring to divert attention away from perhaps boring, but real, threats like
nuclear weapons, lack of democracy, environmental catastrophes, and climate
change. In fact, the real threat the Terminator poses is the diversion of
attention from the actual problems, some of which involve AI, and many of
which don’t. We’ll discuss the problems posed by AI in what follows, but the
bottom line is: forget about the Terminator, there are much more important
things to focus on.
II. The societal implications of AI
In the very beginning of this course, we briefly discussed the importance of AI in today's
and tomorrow's society, but at that time we could do so only to a limited extent because
we hadn't introduced enough of the technical concepts and methods to ground the
discussion in concrete terms.
Now that we have a better understanding of the basic concepts of AI, we are in
a much better position to take part in a rational discussion about the
implications of AI even in its current form.
Implication 1: Algorithmic bias
AI, and in particular machine learning, is being used to make important
decisions in many sectors. This brings up the concept of algorithmic bias. What
it means is the embedding of a tendency to discriminate according to ethnicity,
gender, or other factors when making decisions about job applications, bank
loans, and so on.
Note
Once again, it’s all about the data
The main reason for algorithmic bias is human bias in the data. For example,
when a job application filtering tool is trained on decisions made by humans,
the machine learning algorithm may learn to discriminate against women or
individuals with a certain ethnic background. Notice that this may happen even
if ethnicity or gender are excluded from the data since the algorithm will be
able to exploit the information in the applicant’s name or address.
Algorithmic bias isn’t a hypothetical threat conceived by academic researchers.
It’s a real phenomenon that is already affecting people today.
Online advertising
It has been noticed that online advertisers like Google tend to display ads for
lower-paying jobs to women more often than to men. Likewise, doing a search
with a name that sounds African American may produce an ad for a tool for
accessing criminal records, which is less likely to happen otherwise.
Social networks
Since social networks are basing their content recommendations essentially on
other users’ clicks, they can easily lead to magnifying existing biases even if
they are very minor to start with. For example, it was observed that when
searching for professionals with female first names, LinkedIn would ask the
user whether they actually meant a similar male name: searching for Andrea
would result in the system asking “did you mean Andrew”? If people
occasionally click Andrew’s profile, perhaps just out of curiosity, the system
will boost Andrew even more in subsequent searches.
There are numerous other examples we could mention, and you have probably
seen news stories about them. The main difficulty in the use of AI and machine
learning instead of rule-based systems is their lack of transparency. Partially
this is a consequence of the algorithms and the data being trade secrets that
the companies are unlikely to open up for public scrutiny. And even if they did,
it may often be hard to identify the part of the algorithm or the elements
of the data that lead to discriminatory decisions.
Note
Transparency through regulation?
A major step towards transparency is the European General Data Protection
Regulation (GDPR). It requires that all companies that either reside within the
European Union or that have European customers must:
 Upon request, reveal what data they have collected about any individual
(right of access)
 Delete any such data that they are not required to keep due to other
obligations, when requested to do so (right to be forgotten)
 Provide an explanation of the data processing carried out on the
customer’s data (right to explanation)
The last point means, in other words, that companies such as Facebook and
Google, at least when providing services to European users, must explain their
algorithmic decision-making processes. It is, however, still unclear what exactly
counts as an explanation. Does, for example, a decision reached by using the
nearest neighbor classifier (Chapter 4) count as an explainable decision, or
would the coefficients of a logistic regression classifier be better? How about
deep neural networks that easily involve millions of parameters trained using
terabytes of data? The discussion about the technical implementation of
the explainability of decisions based on machine learning is currently intensive.
In any case, the GDPR has the potential to improve the transparency of AI
technologies.
Implication 2: Seeing is believing — or is it?
We are used to believing what we see. When we see a leader on TV stating
that their country will engage in a trade war with another country, or when a
well-known company spokesperson announces an important business
decision, we tend to trust them more than when we read about the statement
second-hand from news written by someone else.
Similarly, when we see photo evidence from a crime scene or from a
demonstration of a new tech gadget, we put more weight on the evidence
than on a written report explaining how things look.
Of course, we are aware of the possibility of fabricating fake evidence. People
can be put in places they never visited, with people they never met, by
photoshopping. It is also possible to change the way things look by simply
adjusting lighting or pulling one’s stomach in in cheap before–after shots
advertising the latest diet pill.
AI is taking the possibilities of fabricating evidence to a whole new level.
Metaphysics Live, for example, is a system capable of doing face swaps, de-aging and other
tricks in real time.
Lyrebird is a tool for automatic imitation of a person's voice from a few minutes of sample recording. While the generated audio still has a notable robotic tone, it makes a pretty good impression.
Implication 3: Changing notions of privacy
It has long been known that technology companies collect a lot of information
about their users. Earlier it was mainly grocery stores and other retailers that
collected buying data by giving their customers loyalty cards that enable the
store to associate purchases with individual customers.
Note
Unprecedented data accuracy
The accuracy of the data that tech companies such as Facebook, Google,
Amazon and many others are collecting is way beyond the purchase data
collected by conventional stores: in principle, it is possible to record every
click, every page scroll, and the time you spend viewing any content. Websites
can even access your browsing history, so that unless you use incognito
mode (or the like), after browsing for flights to Barcelona on one site you will
likely get advertisements for hotels in Barcelona.
However, data logging of the above kind is not yet AI as such. The use of AI
leads to new kinds of threats to our privacy, which may be harder to avoid even if
you are careful about revealing your identity.
Using data analysis to identify individuals
A good example of a hard-to-avoid issue is de-anonymization: breaking the
anonymity of data that we may have thought to be safe. The basic problem is
that when we report the results of an analysis, the results may be so specific
that they make it possible to learn something about individual users whose
data is included in the analysis. A classic example is asking for the average
salary of people born in a given year and living in a specific zip code. In many
cases, this could be a very small group of people, often only one person, so
you'd potentially be revealing a single person's salary.
An interesting example of a more subtle issue was pointed out by researchers
at the University of Texas at Austin. They studied a public dataset made
available by Netflix containing 10 million movie ratings by some 500,000
anonymous users, and showed that many of the Netflix users can actually be
linked to user accounts on the Internet Movie Database because they had
rated several movies on both applications. Thus the researchers were able to
de-anonymize the Netflix data. While you may not think it's a big deal whether
someone else knows how you rated the latest Star Wars movie, some movies
may reveal aspects of our lives (such as politics or sexuality) which we should
be entitled to keep private.
Other methods of identification
A similar approach could in principle be used to match user accounts in almost
any service that collects detailed data about user behaviors. Another example
is typing patterns. Researchers at the University of Helsinki have demonstrated
that users can be identified based on their typing patterns: the short intervals
between specific keystrokes when typing text. This can mean that if someone
has access to data on your typing pattern (maybe you have used their website
and registered by entering your name), they can identify you the next time you
use their service even if you’d refuse to identify yourself explicitly. They can
also sell this information to whoever wants to buy it.
While many of the above examples have come at least in part as surprises –
otherwise they could have been avoided – there is a lot of ongoing research
trying to address them. In particular, an area called differential privacy aims to
develop machine learning algorithms that can guarantee that the results are
sufficiently coarse to prevent reverse engineering the specific data points that
went into them.
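One standard building block in differential privacy is the Laplace mechanism: add random noise, calibrated to how much any single person can affect the result, before releasing it. The sketch below is a simplified illustration with arbitrary choices for the privacy parameter epsilon and the assumed maximum salary:

import numpy as np

def private_average(salaries, epsilon=0.5, max_salary=200_000):
    true_avg = sum(salaries) / len(salaries)
    # How much one person's record can shift the average at most.
    sensitivity = max_salary / len(salaries)
    # Laplace noise scaled by sensitivity/epsilon hides any single contribution.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_avg + noise

print(private_average([52000, 61000, 48000]))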
Implication 4: Changing work
When an early human learned to use a sharp rock to crack open the bones of dead
animals to access a new source of nutrition, time and energy were released for
other purposes such as fighting, finding a mate, and making more inventions.
The invention of the steam engine in the 1700s tapped into an easily portable
form of machine power that greatly improved the efficiency of factories as well
as ships and trains. Automation has always been a path to efficiency: getting
more with less. Especially since the mid-20th century, technological
development has led to a period of unprecedented progress in automation. AI
is a continuation of this progress.
Each step towards better automation changes working life. With a sharp
rock, there was less need for hunting and gathering food; with the steam
engine, there was less need for horses and horsemen; with the computer,
there is less need for typists, manual accounting, and many other forms of data
processing (and apparently more need for watching cat videos). With AI and
robotics, there is even less need for many kinds of dull, repetitive work.
Note
A history of finding new things to do
In the past, every time one kind of work has been automated, people have
found new kinds to replace it. The new kinds of work are less repetitive and
routine, and more variable and creative. The issue with the current rate of
advance of AI and other technologies is that during the career of an individual,
the change in working life might be greater than ever before. It is
conceivable that some jobs, such as driving a truck or a taxi, may disappear
within a few years' time. Such an abrupt change could lead to mass
unemployment as people don't have time to train themselves for other kinds
of work.
The most important preventive action to avoid huge societal issues such as this
is to help young people obtain a wide-ranging education: one that provides a
basis for pursuing many different jobs and that isn't at high risk of becoming
obsolete in the near future.
It is equally important to support life-long learning and learning at work,
because few of us will do the same job throughout our entire career. Cutting
the hours worked per week would help offer work to more people, but the laws
of economics tend to push people to work more rather than less, unless public
policy regulating the amount of work is introduced.
Because we can’t predict the future of AI, predicting the rate and extent of this
development is extremely hard. There have been some estimates of the
extent of job automation, ranging up to 47% of US jobs being at risk, as reported
by researchers at the University of Oxford. Exact numbers such as these –
47%, not 45% or 49% – the complicated-sounding study designs used to get
them, and the top universities that report them tend to make the estimates
sound very reliable and precise (recall the point about estimating life
expectancy using a linear model based on a limited amount of data). The
illusion of accuracy to one percentage point is a fallacy. The above number, for
example, is based on looking at a large number of job descriptions – perhaps
licking the tip of your finger and putting it up to feel the wind – and using
subjective grounds to decide which tasks are likely to be automated. It is
understandable that people don’t take the trouble to read a 79-page report
that includes statements such as "the task model assumes for tractability an
aggregate, constant-returns-to-scale, Cobb-Douglas production function."
However, if you don’t, then you should remain somewhat sceptical about the
conclusions too. The real value in this kind of analysis is that it suggests which
kinds of jobs are more likely to be at risk, not the actual numbers such as
47%. The tragedy is that the headlines reporting "nearly half of US jobs at risk
of computerization" are noted, and the rest is not.
So then, what actually are the tasks that are more likely to be automated?
There are some clear signs concerning this that we can already observe:
 Autonomous robotics solutions such as self-driving vehicles, including
cars, drones, boats or delivery robots, are just on the verge of major
commercial applications. The safety of autonomous cars is hard to
estimate, but the statistics suggest that it is probably not yet quite at
the required level (the level of an average human driver). However,
progress has been incredibly fast and it is accelerating due to the
increasing amount of available data.
 Customer-service applications such as helpdesks can be automated in a
very cost-effective fashion. Currently the quality of service is not always
something to cheer about, the bottlenecks being language processing (the system
not being able to recognize spoken language or to parse the grammar)
and the logic and reasoning required to provide the actual service.
However, working applications in constrained domains (such as making
restaurant or haircut reservations) sprout up constantly.
For one thing, it is hard to tell how soon we’ll have safe and reliable self-driving
cars and other solutions that can replace human work. In addition to this, we
mustn’t forget that a truck or taxi driver doesn’t only turn a wheel: they are
also responsible for making sure the vehicle operates correctly, they handle
the goods and negotiate with customers, they guarantee the safety of their
cargo and passengers, and take care of a multitude of other tasks that may be
much harder to automate than the actual driving.
As with earlier technological advances, there will also be new work that is
created because of AI. It is likely that in the future, a larger fraction of the
workforce will focus on research and development, and tasks that require
creativity and human-to-human interaction. If you’d like to read more on this
topic, see for example Abhinav Suri’s nice essay on Artificial Intelligence and
the Rise of Economic Inequality.
III. Summary
The most important decisions that determine how well our society can adapt to the
changes brought by AI aren’t technological. They are political.
Everything that we have learned about AI suggests that the future is bright. We
will get new and better services and increased productivity will lead to positive
overall outcomes – but only on the condition that we carefully consider the
societal implications and ensure that the power of AI is used for the common
good.
What we need to do to ensure a positive outcome
Still, we have a lot of work to do.
 We need to avoid algorithmic bias to be able to reduce discrimination
instead of increasing it.
 We also need to learn to be critical about what we see, as seeing is no
longer the same as believing – and develop AI methods that help us
detect fraud rather than just making it easier to fabricate more real-
looking falsehoods.
 We need to set up regulation to guarantee that people have the right to
privacy, and that any violations of this right are strictly penalized.
We also need to find new ways to share the benefits with everyone, instead of
creating an AI elite of those who can afford the latest AI technology and use it to
attain unprecedented economic advantages. This requires careful political
judgment (note that by political judgment we mean decisions about policy,
which has little to do with who votes for whom in an election or the comings
and goings of individual politicians and political parties).
Note
The importance of policy
The most important decisions that determine how well our society can adapt
to the evolution of work and to the changes brought by AI aren’t technological.
They are political.
The regulation of the use of AI must follow democratic principles, and
everyone must have an equal say about what kind of a society we want to live
in in the future. The only way to make this possible is to make knowledge
about technology freely available to all. Obviously there will always be experts
in any given topic, who know more about it than the rest of us, but we should
at least have the possibility to critically evaluate what they are saying.
What you have learned with us supports this goal by providing you the basic
background about AI so that we can have a rational discussion about AI and its
implications.
Our role as individuals
As you recall, we started this course by motivating the study of AI by discussing
prominent AI applications that affect all our lives. We highlighted three
examples: self-driving cars, recommendation systems, and image and video
processing. During the course, we have also discussed a wide range of other
applications that contribute to the current technological transition.
Note
Hidden agenda
We also had a hidden agenda. We wanted to give you an opportunity to
experience the thrill of learning, and the joy of heureka moments when
something that may have been complicated and mysterious becomes simple
and, if not self-evident, at least comprehensible. These are moments when our
curiosity is satisfied. But such satisfaction is temporary. Soon after we have
found the answer to one question, we will ask the next. What then? And then?
If we have been successful, we have whetted your appetite for learning. We
hope you will continue your learning by finding other courses and further
information about AI, as well as other topics of your interest. To help you with
your exploration, we have collected some pointers to AI material that we have
found useful and interesting.
Now you are in a position where you can find out about what is going on in AI,
and what is being done to ensure its proper use. You should do so, and
whenever you feel like there are risks we should discuss, or opportunities we
should go after, don't wait for someone else to react.