Elements of AI


I.How should we define AI?

In our very first section, we’ll become familiar with the concept of AI by looking into its
definition and some examples.

As you have probably noticed, AI is currently a “hot topic”: media coverage and public discussion about AI are almost impossible to avoid. However, you may also have noticed that AI means different things to different people. For some, AI is about artificial life-forms that can surpass human intelligence, and for others, almost any data processing technology can be called AI.

To set the scene, so to speak, we’ll discuss what AI is, how it can be defined,
and what other fields or technologies are closely related. Before we do so,
however, we’ll highlight three applications of AI that illustrate different aspects
of AI. We’ll return to each of them throughout the course to deepen our
understanding.

Application 1. Self-driving cars

Self-driving cars require a combination of AI techniques of many kinds: search and planning to find the most convenient route from A to B, computer vision to identify obstacles, and decision making under uncertainty to cope with the complex and dynamic environment. Each of these must work with almost flawless precision in order to avoid accidents.

The same technologies are also used in other autonomous systems such as
delivery robots, flying drones, and autonomous ships.

Implications: road safety should eventually improve as the reliability of the systems surpasses human level. The efficiency of logistics chains when moving goods should improve. Humans move into a supervisory role, keeping an eye on what’s going on while machines take care of the driving. Since transportation is such a crucial element in our daily life, it is likely that there are also some implications that we haven’t even thought about yet.

Application 2. Content recommendation


A lot of the information that we encounter in the course of a typical day is
personalized. Examples include Facebook, X, Instagram, TikTok and other
social media content; online advertisements; music recommendations on
Spotify; movie recommendations on Netflix, Max, and other streaming
services. Many online publishers such as newspapers’ and broadcasting
companies’ websites as well as search engines such as Google also
personalize the content they offer.

While the frontpage of the printed version of the New York Times or China
Daily is the same for all readers, the frontpage of the online version is different
for each user. The algorithms that determine the content that you see are
based on AI.

Implications: while many companies don’t want to reveal the details of their
algorithms, being aware of the basic principles helps you understand the
potential implications: these involve so called filter bubbles, echo-chambers,
troll factories, fake news, and new forms of propaganda.

Application 3. Image and video processing

Face recognition is already a commodity used in many customer, business, and government applications such as organizing your photos according to people, automatic tagging on social media, and passport control. Similar techniques can be used to recognize other cars and obstacles around an autonomous car, or to estimate wildlife populations, just to name a few examples.

AI can also be used to generate or alter visual content. Examples already in use today include style transfer, by which you can adapt your personal photos to look like they were painted by Vincent van Gogh, and computer generated characters in motion pictures such as Avatar, The Lord of the Rings, and popular Pixar animations where the animated characters replicate gestures made by real human actors.

Implications: when such techniques advance and become more widely available, it will be easy to create natural looking fake videos of events that are impossible to distinguish from real footage. This challenges the notion that “seeing is believing”.

What is, and what isn’t AI? Not an easy question!

The popularity of AI in the media is in part due to the fact that people have started using the term when they refer to things that used to be called by other names. Almost anything from statistics and business analytics to manually encoded if-then rules may nowadays be called AI. Why is this so? Why is the public perception of AI so nebulous? Let’s look at a few reasons.

Reason 1: no officially agreed definition

Even AI researchers have no exact definition of AI. Rather, the field is constantly being redefined as some topics are reclassified as non-AI and new topics emerge.

There’s an old (geeky) joke that AI is defined as “cool things that computers
can’t do.” The irony is that under this definition, AI can never make any
progress: as soon as we find a way to do something cool with a computer, it
stops being an AI problem. However, there is an element of truth in this
definition. Fifty years ago, for instance, automatic methods for search and
planning were considered to belong to the domain of AI. Nowadays such
methods are taught to every computer science student. Similarly, certain
methods for processing uncertain information are becoming so well
understood that they are likely to be moved from AI to statistics or probability
very soon.

Reason 2: the legacy of science fiction

The confusion about the meaning of AI is made worse by the visions of AI present in various literary and cinematic works of science fiction. Science fiction stories often feature friendly humanoid servants that provide overly detailed factoids or witty dialogue, but can sometimes follow the steps of Pinocchio and start to wonder if they can become human. Another class of humanoid beings in sci-fi espouse sinister motives and turn against their masters in the vein of old tales of sorcerers’ apprentices, going back to the Golem of Prague and beyond.

Often the robothood of such creatures is only a thin veneer on top of a very humanlike agent, which is understandable as most fiction – even science fiction – needs to be relatable to human readers who would otherwise be alienated by intelligence that is too different and strange. Most science fiction is thus best read as a metaphor for the current human condition, and robots could be seen as stand-ins for repressed sections of society, or perhaps our search for the meaning of life.

Reason 3: what seems easy is actually hard...

Another source of difficulty in understanding AI is that it is hard to know which tasks are easy and which ones are hard. Look around and pick up an object in your hand, then think about what you did: you used your eyes to scan your surroundings, figured out where there were suitable objects to pick up, chose one of them and planned a trajectory for your hand to reach that one, then moved your hand by contracting various muscles in sequence and managed to squeeze the object with just the right amount of force to keep it between your fingers.

It can be hard to appreciate how complicated all this is, but sometimes it
becomes visible when something goes wrong: the object you pick is much
heavier or lighter than you expected, or someone else opens a door just as
you are reaching for the handle, and then you can find yourself seriously out
of balance. Usually these kinds of tasks feel effortless, but that feeling belies
millions of years of evolution and several years of childhood practice.

While the task is easy for you, grasping objects is extremely hard for a robot, and it remains an area of active study. Recent examples include the Boston Dynamics robots.

...and what seems hard is actually easy

By contrast, the tasks of playing chess and solving mathematical exercises can seem to be very difficult, requiring years of practice to master and involving our “higher faculties” and concentrated conscious thought. No wonder that some initial AI research concentrated on these kinds of tasks, and it may have seemed at the time that they encapsulate the essence of intelligence.

It has since turned out that playing chess is very well suited to computers, which can follow fairly simple rules and compute many alternative move sequences at a rate of billions of computations a second. Computers beat the reigning human world champion in chess in the famous Deep Blue vs Kasparov matches in 1997. Who could have imagined that the harder problem would turn out to be grabbing the pieces and moving them on the board without knocking it over? We will study the techniques that are used in playing games like chess or tic-tac-toe in Chapter 2.

Similarly, while in-depth mastery of mathematics requires (what seems like) human intuition and ingenuity, many (but not all) exercises of a typical high-school or college course can be solved by applying a calculator and a simple set of rules.

So what would be a more useful definition?

An attempt at a definition more useful than the “what computers can’t do yet” joke would be to list properties that are characteristic of AI, in this case autonomy and adaptivity.

Key terminology

Autonomy
The ability to perform tasks in complex environments without constant
guidance by a user.

Adaptivity
The ability to improve performance by learning from experience.

Words can be misleading

When defining and talking about AI we have to be cautious, as many of the words that we use can be quite misleading. Common examples are learning, understanding, and intelligence.

You may well say, for example, that a system is intelligent, perhaps because it
delivers accurate navigation instructions or detects signs of melanoma in
photographs of skin lesions. When we hear something like this, the word
“intelligent” easily suggests that the system is capable of performing any task
an intelligent person is able to perform: going to the grocery store and cooking
dinner, washing and folding laundry, and so on.

Likewise, when we say that a computer vision system understands images because it is able to segment an image into distinct objects such as other cars, pedestrians, buildings, the road, and so on, the word “understand” easily suggests that the system also understands that even if a person is wearing a t-shirt that has a photo of a road printed on it, it is not okay to drive on that road (and over the person).

In both of the above cases, we’d be wrong.

Note

Watch out for “suitcase words”


Marvin Minsky, a cognitive scientist and one of the greatest pioneers in AI,
coined the term suitcase word for terms that carry a whole bunch of different
meanings that come along even if we intend only one of them. Using such
terms increases the risk of misinterpretations such as the ones above.

It is important to realize that intelligence is not a single dimension like
temperature. You can compare today’s temperature to yesterday’s, or the
temperature in Helsinki to that in Rome, and tell which one is higher and
which is lower. We even have a tendency to think that it is possible to rank
people with respect to their intelligence – that’s what the intelligence quotient
(IQ) is supposed to do. However, in the context of AI, it is obvious that
different AI systems cannot be compared on a single axis or dimension in
terms of their intelligence. Is a chess-playing algorithm more intelligent than a
spam filter, or is a music recommendation system more intelligent than a self-
driving car? These questions make no sense. This is because artificial
intelligence is narrow (we’ll return to the meaning of narrow AI at the end of
this chapter): being able to solve one problem tells us nothing about the ability
to solve another, different problem.

Why you can say "a pinch of AI" but not "an AI"

The classification into AI vs non-AI is not a clear yes–no dichotomy: while some methods are clearly AI and others are clearly not AI, there are also methods that involve a pinch of AI, like a pinch of salt. Thus it would sometimes be more appropriate to talk about the "AIness" (as in happiness or awesomeness) rather than arguing whether something is AI or not.

Note
“AI” is not a countable noun
When discussing AI, we would like to discourage the use of AI as a countable
noun: one AI, two AIs, and so on. AI is a scientific discipline, like mathematics
or biology. This means that AI is a collection of concepts, problems, and
methods for solving them.

Because AI is a discipline, you shouldn’t say “an AI”, just like we don’t say “a
biology”. This point should also be quite clear when you try saying something
like “we need more artificial intelligences.” That just sounds wrong, doesn’t it?
(It does to us.)

Despite our discouragement, the use of AI as a countable noun is common.
Take for instance, the headline Data from wearables helped teach an AI to
spot signs of diabetes, which is otherwise a pretty good headline since it
emphasizes the importance of data and makes it clear that the system can
only detect signs of diabetes rather than making diagnoses and treatment
decisions. And you should definitely never ever say anything like Google’s
artificial intelligence built an AI that outperforms any made by humans, which
is one of the all-time most misleading AI headlines we’ve ever seen (note that
the headline is not by Google Research).

The use of AI as a countable noun is of course not a big deal if what is being
said otherwise makes sense, but if you’d like to talk like a pro, avoid saying
"an AI", and instead say "an AI method".

Exercise 1: Is this AI or not?


Which of the following are AI and which are not? Choose yes, no, or “kind of”, where “kind of” means that it can be considered AI or not, depending on the viewpoint.

Note: You will only be able to submit the answer once, after which the correct
answers will be revealed, so take your time and re-read the material above if you
feel like it. That said, don't worry if you get some of them wrong – some of them
are debatable in any case because these kinds of things are rarely perfectly clear
cut. We are quite sure that if you just focus and do your best, you will have no
problems achieving a successful overall result in the end. Making mistakes is one
of the best opportunities to learn.
Spreadsheet that calculates sums and other pre-defined functions on given data
Yes
No
Kind of

The answer is correct


The outcome is determined by the user-specified formula, no AI needed.
Predicting the stock market by fitting a curve to past data about stock prices
Yes
No
Kind of

The answer is correct


Fitting a simple curve is not really AI, but there are so many different curves to
choose from, even if there's a lot of data to constrain them, that one needs
machine learning/AI to get useful results.
A GPS navigation system for finding the fastest route
Yes
No
Kind of

The answer is correct


The signal processing and geometry used to determine the coordinates isn't AI,
but providing good suggestions for navigation (shortest/fastest routes) is AI,
especially if variables such as traffic conditions are taken into account.
A music recommendation system such as Spotify that suggests music based on the users' listening behavior
Yes
No
Kind of

The answer is correct


The system learns from the users' (not only your) listening behavior.
Big data storage solutions that can store huge amounts of data (such as images or video) and stream them to many users at the same
time
Yes
No
Kind of

The answer is correct


Storing and retrieving specific items from a data collection is neither adaptive nor autonomous.
Photo editing features such as brightness and contrast in applications such as Photoshop
Yes
No
Kind of

The answer is correct


Adjustments such as color balance, contrast, and so on, are neither adaptive nor
autonomous, but the developers of the application may use some AI to
automatically tune the filters.
Style transfer filters in applications such as Prisma that take a photo and transform it into different art styles (impressionist, cubist, ...)
Yes
No
Kind of

The answer is not correct


Such methods typically learn image statistics (read: what small patches of the
image in a certain style look like up close) and transform the input photo so
that its statistics match the style, so the system is adaptive.
6/7 answers correct

II.Related fields
In addition to AI, there are several other closely related topics that are good to know at least
by name. These include machine learning, data science, and deep learning.

Machine learning can be said to be a subfield of AI, which itself is a subfield of computer science (such categories are often somewhat imprecise, and some parts of machine learning could equally well, or even better, be said to belong to statistics). Machine learning enables AI solutions that are adaptive. A concise definition can be given as follows:

Key terminology

Machine learning
Systems that improve their performance in a given task with more and more
experience or data.
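
To make the definition concrete, here is a minimal sketch of our own (not part of the course): the same learning method, given more and more data, tends to perform better on data it has not seen before. It assumes Python with scikit-learn installed; the digits dataset and the k-nearest-neighbors classifier are arbitrary choices.

    # A toy learning curve: performance improves with experience (data).
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for n in (20, 100, 500, len(X_train)):
        model = KNeighborsClassifier().fit(X_train[:n], y_train[:n])
        print(f"trained on {n:4d} examples -> test accuracy {model.score(X_test, y_test):.2f}")
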
Deep learning is a subfield of machine learning, which itself is a subfield of AI,
which itself is a subfield of computer science. We will meet deep learning in
some more detail in Chapter 5, but for now let us just note that the “depth” of
deep learning refers to the complexity of a mathematical model, and that the
increased computing power of modern computers has allowed researchers to
increase this complexity to reach levels that appear not only quantitatively but
also qualitatively different from before. As you can see, science often involves
number of progressively more special subfields, subfields of subfields, and so
on. This enables researchers to zoom into a particular topic so that it is
possible to catch up with the ever increasing amount of knowledge accrued
over the years, and produce new knowledge on the topic — or sometimes,
correct earlier knowledge to be more accurate.

Data science is a recent umbrella term (a term that covers several subdisciplines) that includes machine learning and statistics as well as certain aspects of computer science, including algorithms, data storage, and web application development. Data science is also a practical discipline that requires understanding of the domain in which it is applied, for example business or science: its purpose (what "added value" means), basic assumptions, and constraints. Data science solutions often involve at least a pinch of AI (but usually not as much as one would expect from the headlines).

Robotics means building and programming robots so that they can operate in
complex, real-world scenarios. In a way, robotics is the ultimate challenge of
AI since it requires a combination of virtually all areas of AI. For example:

• Computer vision and speech recognition for sensing the environment

• Natural language processing, information retrieval, and reasoning under uncertainty for processing instructions and predicting consequences of potential actions

• Cognitive modeling and affective computing (systems that respond to expressions of human feelings or that mimic feelings) for interacting and working together with humans

Many of the robotics-related AI problems are best approached by machine learning, which makes machine learning a central branch of AI for robotics.

Note

What is a robot?
In brief, a robot is a machine comprising sensors (which sense the environment)
and actuators (which act on the environment) that can be programmed to
perform sequences of actions. People used to science-fictional depictions of
robots will usually think of humanoid machines walking with an awkward gait
and speaking in a metallic monotone. Most real-world robots currently in use
look very different as they are designed according to the application. Most
applications would not benefit from the robot having human shape, just like we
don’t have humanoid robots to do our dishwashing but machines in which we
place the dishes to be washed by jets of water.

It may not be obvious at first sight, but any kind of vehicle that has at least some level of autonomy and includes sensors and actuators is also counted as robotics. On the other hand, software-based solutions such as a customer service chatbot, even if they are sometimes called “software robots”, aren’t counted as (real) robotics.

Exercise 2: Taxonomy of AI
A taxonomy is a scheme for classifying many things that may be special cases of
one another. We have explained the relationships between a number of
disciplines or fields and pointed out, for example, that machine learning is
usually considered to be a subfield of AI.

A convenient way to visualize a taxonomy is an Euler diagram. An Euler diagram (closely related to the more familiar Venn diagrams) consists of shapes that correspond to concepts, which are organized so that overlap between the shapes corresponds to overlap between the concepts (see for example Wikipedia: Euler Diagram).

Notice that a taxonomy does not need to be strictly hierarchical. A discipline can
be a subfield of more than one more general topic: for example, machine learning
can also be thought to be a subfield of statistics. In this case, the subfield concept
would be placed in the overlap between the more general topics.

Your task: Construct a taxonomy in the Euler diagram example given below
showing the relationships between the following things: AI, machine learning,
computer science, data science, and deep learning.

Where would you put AI?
Section A
Section B
Section C
Section D
Section E

The answer is correct


AI is a part of computer science.
Where would you put machine learning?
Section A
Section B
Section C
Section D
Section E

The answer is correct


Machine learning is usually considered to be a part of AI.
Where would you put computer science?
Section A
Section B
Section C
Section D
Section E

The answer is correct


Computer science is a relatively broad field that includes AI but also other
subfields such as distributed computing, human-computer interaction, and
software engineering.
Where would you put data science?
Section A
Section B
Section C
Section D
Section E

The answer is correct


Data science needs computer science and AI. However, it also involves a lot of
statistics, business, law, and other application domains, so it is usually not
considered to be a part of computer science.
Where would you put deep learning?
Section A
Section B
Section C
Section D
Section E

The answer is correct


Deep learning is a part of machine learning.
5/5 answers correct

Exercise 3: Examples of tasks


Consider the following example tasks. Try to determine which AI-related fields are involved in them. Select all that apply. (Hint: machine learning almost always involves some kind of statistics.)

Note: This exercise is meant to inspire you to think about the different aspects of AI and their role in various applications. As there are no clear-cut answers to many of these questions, this exercise will not be included in the grading. Nevertheless, we suggest that you do your best and try to answer as well as you can, but don't worry if our answers differ from yours.
Autonomous car
Statistics
Robotics
Machine learning

The answer is correct


Autonomous cars apply a wide range of techniques to function. These include
statistics, robotics, and machine learning.
Steering a rocket into orbit
Statistics
Robotics
Machine learning

The answer is not correct


In order to steer a rocket into orbit, robotics is needed to fire the engines at the right times and with the right power.
Online ad optimization
Statistics
Robotics
Machine learning

The answer is correct


In order to optimize ads online, machine learning and statistics are needed to
deliver the correct type of ads to the right audience, and to measure the
effectiveness of the optimization.
Customer service chatbot
Statistics
Robotics
Machine learning

The answer is not correct


A customer service chatbot will need machine learning to process human-produced language in such a way that it can act on it.
Summarizing Gallup results
Statistics
Robotics
Machine learning

The answer is not correct


Summarizing Gallup results is a classical case of using statistics to produce insights.
2/5 answers correct

III.Philosophy of AI
The very nature of the term “artificial intelligence” brings up philosophical questions about whether intelligent behavior implies or requires the existence of a mind, and to what extent consciousness is replicable as computation.

The Turing test

Alan Turing (1912-1954) was an English mathematician and logician. He is rightfully considered to be the father of computer science. Turing was fascinated by intelligence and thinking, and the possibility of simulating them by machines. Turing’s most prominent contribution to AI is his imitation game, which later became known as the Turing test.

In the test, a human interrogator interacts with two players, A and B, by exchanging written messages (in a chat). If the interrogator cannot determine which player, A or B, is a computer and which is a human, the computer is said to pass the test. The argument is that if a computer is indistinguishable from a human in a general natural language conversation, then it must have reached human-level intelligence.

What Turing meant by the test is quite similar to the aphorism by Forrest Gump: “stupid is as stupid does”. Turing’s version would be “intelligent is as intelligent says”. In other words, an entity is intelligent if it cannot be distinguished from another intelligent entity by observing its behavior. Turing simply constrained the set of behaviors to written discussion so that the interrogator can’t base her or his decision on appearances.

One problem: does being human-like mean you are intelligent?

One criticism of the Turing test as a test for intelligence is that it may actually
measure whether the computer behaves like a human more than whether it is
intelligent. The test has indeed been “passed” by computer programs that
keep changing the subject, make plenty of spelling errors, and sometimes
refuse to respond at all. A famous example is Eugene Goostman, a 13-year-old Ukrainian boy who constantly tries to avoid answering questions by making jokes and changing the subject to his pet guinea pig. Here is a short
segment of an example conversation with Eugene Goostman from the blog of
Scott Aaronson:

Is Eugene a computer or a person?

Interrogator: How many legs does a camel have?

Eugene: Something between 2 and 4. Maybe, three? :-))) By the way, I still
don’t know your specialty - or, possibly, I’ve missed it?

Interrogator: How many legs does a millipede have?

Eugene: Just two, but Chernobyl mutants may have them up to five. I know
you are supposed to trick me.

Interrogator: No, I need to know that you’re not a chatbot. Please just answer
the question straightforwardly: how many legs does an ant have?

Eugene: Something between 2 and 4. Maybe, three? :-))) Oh, what a fruitful
conversation;-)

Interrogator: Do you understand why I’m asking such basic questions? Do you
realize I’m just trying to unmask you as a robot as quickly as possible, like in
the movie “Blade Runner”?

Eugene: ...wait

Eugene Goostman, who in reality is a computer program, fooled 10 out of 30 judges into thinking he is a real person.

The Chinese room argument

The idea that intelligence is the same as intelligent behavior has been
challenged by some. The best known counter-argument is John
Searle’s Chinese Room thought experiment. Searle describes an experiment
where a person who doesn’t know Chinese is locked in a room. Outside the
room is a person who can slip notes written in Chinese inside the room
through a mail slot. The person inside the room is given a big manual where
she can find detailed instructions for responding to the notes she receives
from the outside.

Searle argued that even if the person outside the room gets the impression
that he is in a conversation with another Chinese-speaking person, the person
inside the room does not understand Chinese. Likewise, his argument
continues, even if a machine behaves in an intelligent manner, for example,
by passing the Turing test, it doesn’t follow that it is intelligent or that it has a
“mind” in the way that a human has. The word “intelligent” can also be
replaced by the word “conscious” and a similar argument can be made.

Is a self-driving car intelligent?

The Chinese Room argument goes against the notion that intelligence can be
broken down into small mechanical instructions that can be automated.
A self-driving car is an example of an element of intelligence (driving a car)
that can be automated. The Chinese Room argument suggests that this,
however, isn’t really intelligent thinking: it just looks like it. Going back to the
above discussion on “suitcase words”, the AI system in the car doesn’t see or
understand its environment, and it doesn’t know how to drive safely, in the
way a human being sees, understands, and knows. According to Searle this
means that the intelligent behavior of the system is fundamentally different
from actually being intelligent.

How much does philosophy matter in practice?

The definition of intelligence, natural or artificial, and consciousness appears to be extremely evasive and leads to apparently never-ending discourse. In intellectual company, this discussion can be quite enjoyable (in the absence of suitable company, books such as The Mind’s I by Hofstadter and Dennett can offer stimulation).

However, as John McCarthy pointed out, the philosophy of AI is “unlikely to have any more effect on the practice of AI research than philosophy of science generally has on the practice of science.” Thus, we’ll continue investigating systems that are helpful in solving practical problems without asking too much whether they are intelligent or just behave as if they were.

Key terminology

General vs narrow AI
When reading the news, you might see the terms “general” and “narrow” AI. So
what do these mean? Narrow AI refers to AI that handles one task. General AI, or Artificial General Intelligence (AGI), refers to a machine that can handle any intellectual task. All the AI methods we use today fall under narrow AI, with general AI being in the realm of science fiction. In fact, the ideal of AGI has been all but abandoned by AI researchers because of the lack of progress towards it over more than 50 years, despite all the effort. In contrast, narrow AI makes progress in leaps and bounds.

Strong vs weak AI
A related dichotomy is “strong” and “weak” AI. This boils down to the above
philosophical distinction between being intelligent and acting intelligently, which
was emphasized by Searle. Strong AI would amount to a “mind” that is genuinely
intelligent and self-conscious. Weak AI is what we actually have, namely systems
that exhibit intelligent behaviors despite being “mere” computers.

Exercise 4: Definitions, definitions


Which definition of AI do you like best? How would you define AI?

Let's first scrutinize the following definitions that have been proposed earlier:

1. "cool things that computers can't do"


2. machines imitating intelligent human behavior
3. autonomous and adaptive systems

Your task:

• Do you think these are good definitions? Consider each of them in turn and try to come up with things that they get wrong – either things that you think should be counted as AI but aren't according to the definition, or vice versa. Explain your answers with a few sentences per item (so just saying that all the definitions look good or bad isn't enough).
• Also come up with your own, improved definition that solves some of the
problems that you have identified with the above candidates. Explain with
a few sentences how your definition may be better than the above ones.

Please read the above instructions carefully and answer both of the items
above in the text box below. Your answer will be reviewed by other users
and by the instructors. Please answer in English, and check your answer
before clicking 'Submit' because once submitted, you can no longer edit
your answer.

Note: On using ChatGPT and similar models: We appreciate you putting an AI tool into use. Feel free to consult it but note that since we are interested in your own opinion, we expect you to write your final answers yourself.

Your answer:

1. "cool things that computers can't do"
This is a fun and useful way of defining AI to explain it to young people like 6th graders, but this description lacks the primary goal of creating AI - to serve humans. One could argue that serving humans, such as doing their routine work, analyzing thousands of pieces of information, etc., is cool. What is cool is subjective and constantly changing. For example, Amazon Alexa or Siri, which were once considered great achievements in AI, were cool, but now such voice assistants exist on many devices, so they are like routine tasks. Moreover, the definition cannot be applied to future computers. According to Moore's Law, which observes that the number of transistors on a microchip doubles approximately every 12-18 months, computing power will double exponentially. This means one cannot predict what future computers will be capable of with such power, especially with the inclusion of quantum computing.

2. machines imitating intelligent human behavior
This definition does not explicitly point out that AI can also imitate non-human creatures such as animals e.g. Spot - a robotic dog by Boston Dynamics. It does not include non-intelligent human behavior either. As noted in the current chapter, human behavior is not necessarily intelligent. I mean, who doesn't walk into a room and forget why they went there? If AI does something unintelligent then there will be many critiques and it will be seen as more than just a malfunction. This raises another question: what are the standards for being intelligent? It is arguable that everyone would consider people doing daily chores so intelligent, but if the machines were to do it, such as a humanoid robot that can do the laundry (not a washing machine) or wash the dishes (not a dishwasher), they would be considered intelligent. As can be seen, there's a bit of ambiguity here. On the other hand, a human is more intelligent than a cockroach, but scientists cannot even create AI with the intelligence level of such a small insect considering the neurons inside their tiny brains. This means there is a broad gap between AI today and AI that will reach or surpass human intelligence. Thus the above description of “machines behaving like intelligent humans” is yet to come. In my opinion, a more comprehensive definition would be "machines trying to imitate humans and other entities."

3. autonomous and adaptive systems
I personally liked this definition the most among the three. It is rather self-explanatory and descriptive. However, it still has some flaws. Firstly, there are AI tools that are non-autonomous or semi-autonomous. For example, Sophia, a realistic AI humanoid robot that can interact with its surroundings, is semi-autonomous as she generates speech either autonomously or through scripts written by her developers. As for being adaptive, the definition doesn’t specify what is being adapted. AI can be self-adaptive (adapting its own processes) or adaptive to its surroundings.

Finally, combining the above definitions into a single thorough definition that avoids ambiguity related to personal preferences for what is “cool,” and considers non-human entities, intelligence levels, and degrees of autonomy and adaptability, AI can be defined as: Computer systems that possess certain autonomy and adaptability, capable of attempting to imitate both intelligent and non-intelligent behaviors of humans and other entities, with the main goal of serving humans.

Example answer

There is no right or wrong answer, but here’s what we think:

“Cool things that computers can't do"

The good: this adapts to include new problems in the future, and captures a wide range of AI such as computer vision and natural language processing.

The bad: it rules out any "solved" problems, and it is very hard to say what counts as "cool".

“Machines imitating intelligent human behavior”

The good: the same as in the previous one. Also, imitate is a good word since it doesn't require that the AI solutions should "be" intelligent (whatever that means); it's instead enough to act intelligently.

The bad: the definition is almost self-referential in that it immediately leads to the question of what is 'intelligent'. Also, this one is too narrow in the sense that it only includes human-like intelligent behavior and excludes other forms of intelligence such as so-called swarm intelligence (intelligence exhibited by, for example, ant colonies).

“Autonomous and adaptive systems”

The good: it highlights two main characteristics of AI, captures things like robots,
self-driving cars, and so on, also nicely fits machine learning-based AI methods
that adapt to the training data.

The bad: once again, these lead to further questions and the definition of
'autonomous' in particular isn't very clear (is a vacuum cleaner bot autonomous?
How about a spam filter?). Furthermore, not all AI systems need to be
autonomous and we can in fact often achieve much more by combining human
and machine intelligence.



After completing Chapter 1 you should be able to:

• Explain autonomy and adaptivity as key concepts for explaining AI

• Distinguish between realistic and unrealistic AI (science fiction vs. real life)

• Express the basic philosophical problems related to AI including the implications of the Turing test and Chinese room thought experiment

I.Search and problem solving


Many problems can be phrased as search problems. This requires that we start by
formulating the alternative choices and their consequences.

Search in practice: getting from A to B

Imagine you’re in a foreign city, at some address (say a hotel) and want to use
public transport to get to another address (a nice restaurant, perhaps). What
do you do? If you are like many people, you pull out your smartphone, type in
the destination and start following the instructions.

This kind of task belongs to the class of search and planning problems. Similar problems need to be solved by self-driving cars, and (perhaps less obviously) AI for playing games. In the game of chess, for example, the difficulty is not so much in getting a piece from A to B as in keeping your pieces safe from the opponent.

Often there are many different ways to solve the problem, some of which may
be more preferable in terms of time, effort, cost or other criteria. Different
search techniques may lead to different solutions, and developing advanced
search algorithms is an established research area.

We will not focus on the actual search algorithms. Instead, we emphasize the
first stage of the problem solving process: defining the choices and their
consequences, which is often far from trivial and can require careful thinking.
We also need to define what our goal is, or in other words, when we can
consider the problem solved. After this has been done, we can look for a
sequence of actions that leads from the initial state to the goal.

In this chapter, we will discuss two kinds of problems:

• Search and planning in static environments with only one “agent”

• Games with two players (“agents”) competing against each other

These categories don’t cover all possible real-world scenarios, but they are generic enough to demonstrate the main concepts and techniques.

Before we address complex search tasks like navigation or playing chess, let us start from a much simplified model in order to build up our understanding of how we can solve problems with AI.

Toy problem: chicken crossing

We’ll start from a simple puzzle to illustrate the ideas. A robot on a rowboat
needs to move three pieces of cargo across a river: a fox, a chicken, and a
sack of chicken-feed. The fox will eat the chicken if it has the chance, and the
chicken will eat the chicken-feed if it has the chance, and neither is a
desirable outcome. The robot is capable of keeping the animals from doing
harm when it is near them, but only the robot can operate the rowboat and
only two of the pieces of cargo can fit on the rowboat together with the robot.
How can the robot move all of its cargo to the opposite bank of the river?

Note

The easy version of the rowboat puzzle


If you have heard this riddle before, you might know that it can be solved even
with less space on the boat. That will be an exercise for you after we solve this
easier version together.
We will model the puzzle by noting that five movable things have been
identified: the robot, the rowboat, the fox, the chicken, and the chicken-feed.
In principle, each of the five can be on either side of the river, but since only
the robot can operate the rowboat, the two will always be on the same side.
Thus there are four things with two possible positions for each, which makes
for sixteen combinations, which we will call states:

States of the chicken crossing puzzle

State Robot Fox Chicken Chicken-feed

NNNN Near side Near side Near side Near side
NNNF Near side Near side Near side Far side
NNFN Near side Near side Far side Near side
NNFF Near side Near side Far side Far side
NFNN Near side Far side Near side Near side
NFNF Near side Far side Near side Far side
NFFN Near side Far side Far side Near side
NFFF Near side Far side Far side Far side
FNNN Far side Near side Near side Near side
FNNF Far side Near side Near side Far side
FNFN Far side Near side Far side Near side
FNFF Far side Near side Far side Far side
FFNN Far side Far side Near side Near side
FFNF Far side Far side Near side Far side
FFFN Far side Far side Far side Near side
FFFF Far side Far side Far side Far side

We have given short names to the states, because otherwise it would be
cumbersome to talk about them. Now we can say that the starting state is
NNNN and the goal state is FFFF, instead of something like “in the starting
state, the robot is on the near side, the fox is on the near side, the chicken is
on the near side, and also the chicken-feed is on the near side, and in the
goal state the robot is on the far side”, and so on.

Some of these states are forbidden by the puzzle conditions. For example, in
state NFFN (meaning that the robot is on the near side with the chicken-feed
but the fox and the chicken are on the far side), the fox will eat the chicken,
which we cannot have. Thus we can rule out states NFFN, NFFF, FNNF,
FNNN, NNFF, and FFNN (you can check each one if you doubt our
reasoning). We are left with the following ten states:

State Robot Fox Chicken Chicken-feed

NNNN Near side Near side Near side Near side
NNNF Near side Near side Near side Far side
NNFN Near side Near side Far side Near side
NFNN Near side Far side Near side Near side
NFNF Near side Far side Near side Far side
FNFN Far side Near side Far side Near side
FNFF Far side Near side Far side Far side
FFNF Far side Far side Near side Far side
FFFN Far side Far side Far side Near side
FFFF Far side Far side Far side Far side

Next we will figure out which state transitions are possible, meaning simply what the resulting state is in each case when the robot rows the boat with some of the items as cargo. It’s best to draw a diagram of the transitions, and since in any transition the first letter alternates between N and F, it is convenient to draw the states starting with N (so the robot is on the near side) in one row and the states starting with F in another row:

Now let’s draw the transitions. We could draw arrows that have a direction so
that they point from one node to another, but in this puzzle the transitions are
symmetric: if the robot can row from state NNNN to state FNFF, it can equally
well row the other way from FNFF to NNNN. Thus it is simpler to draw the
transitions simply with lines that don’t have a direction. Starting from NNNN,
we can go to FNFN, FNFF, FFNF, and FFFN:

Then we fill in the rest:

We have now done quite a bit of work on the puzzle without seeming any
closer to the solution, and there is little doubt that you could have solved the
whole puzzle already by using your “natural intelligence”. But for more
complex problems, where the number of possible solutions grows in the
thousands and in the millions, our systematic or mechanical approach will
shine since the hard part will be suitable for a simple computer to do. Now that
we have formulated the alternative states and transitions between them, the
rest becomes a mechanical task: find a path from the initial state NNNN to the
final state FFFF.

One such path is colored in the following picture. The path proceeds from
NNNN to FFFN (the robot takes the fox and the chicken to the other side),
thence to NFNN (the robot takes the chicken back on the starting side) and
finally to FFFF (the robot can now move the chicken and the chicken-feed to
the other side).
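
The whole procedure is mechanical enough to hand over to a computer. Here is a minimal sketch of our own in Python (not part of the course): it enumerates the sixteen candidate states, drops the six forbidden ones, connects states reachable by a single boat trip, and finds a shortest path by breadth-first search. The state encoding follows the text; all names and other details are our choices.

    from collections import deque
    from itertools import product

    def forbidden(s):
        robot, fox, chicken, feed = s
        # The fox eats the chicken, or the chicken eats the feed,
        # whenever they are left together without the robot.
        return (fox == chicken != robot) or (chicken == feed != robot)

    STATES = {"".join(p) for p in product("NF", repeat=4) if not forbidden(p)}

    def transitions(state, capacity=2):
        """Allowed states reachable by one boat trip with up to `capacity` items."""
        robot = state[0]
        other = "F" if robot == "N" else "N"
        for cargo in range(8):  # every subset of {fox, chicken, feed}
            moved = [i + 1 for i in range(3) if cargo >> i & 1]
            # only items on the robot's own side can board the boat
            if len(moved) > capacity or any(state[i] != robot for i in moved):
                continue
            t = "".join(other if i == 0 or i in moved else state[i] for i in range(4))
            if t in STATES:
                yield t

    def shortest_path(start="NNNN", goal="FFFF", capacity=2):
        """Breadth-first search: finds a path with the fewest transitions."""
        paths, queue = {start: [start]}, deque([start])
        while queue:
            s = queue.popleft()
            if s == goal:
                return paths[s]
            for t in transitions(s, capacity):
                if t not in paths:
                    paths[t] = paths[s] + [t]
                    queue.append(t)

    print(shortest_path())                     # a 3-transition path such as
                                               # ['NNNN', 'FNFN', 'NNFN', 'FFFF']
    print(len(shortest_path(capacity=1)) - 1)  # 7: the one-item boat of Exercise 5

Setting capacity to 1 gives the traditional one-item boat, which is exactly Exercise 5 below.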

State space, transitions, and costs

To formalize a planning problem, we use concepts such as the state space, transitions, and costs.

Key terminology

The state space
means the set of possible situations. In the chicken-crossing puzzle, the state
space consisted of ten allowed states NNNN through to FFFF (but not for
example NFFF, which the puzzle rules don’t allow). If the task is to navigate from
place A to place B, the state space could be the set of locations defined by their
(x,y) coordinates that can be reached from the starting point A. Or we could use a
constrained set of locations, for example, different street addresses so that the
number of possible states is limited.

Transitions
are possible moves between one state and another, such as NNNN to FNFN. It is
important to note that we only count direct transitions that can be accomplished
with a single action as transitions. A sequence of multiple transitions, for
example, from A to C, from C to D, and from D to B (the goal), is a path rather
than a transition.

Costs
refer to the fact that, oftentimes, the different transitions aren’t all alike. They can
differ in ways that make some transitions more preferable or cheaper (in a not
necessarily monetary sense of the word) and others more costly. We can express
this by associating with each transition a certain cost. If the goal is to minimize
the total distance traveled, then a natural cost is the geographical distance
between states. On the other hand, the goal could actually be to minimize the
time instead of the distance, in which case the natural cost would obviously be
the time. If all the transitions are equal, then we can ignore the costs.
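
As a small illustration of our own (the places and costs are made up), we can attach a cost to each transition and define the cost of a path as the sum of its transition costs; a path with more transitions can still be the cheaper one:

    # Transition costs on a tiny map: A->B directly, or via C and D.
    costs = {
        ("A", "B"): 5,                        # direct, but a slow road
        ("A", "C"): 1, ("C", "D"): 1, ("D", "B"): 2,
    }

    def path_cost(path):
        """Total cost of a path = the sum of its transition costs."""
        return sum(costs[step] for step in zip(path, path[1:]))

    print(path_cost(["A", "B"]))             # 5: one transition
    print(path_cost(["A", "C", "D", "B"]))   # 4: three transitions, lower total cost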

Exercise 5: A smaller rowboat


In the traditional version of this puzzle the robot can only fit one thing on the
boat with it. The state space is still the same, but fewer transitions are possible.

Using the diagram with the possible states below as a starting point, draw the
possible transitions in it (it is MUCH easier to do this with a pencil and paper
than without).

Having drawn the state transition diagram, find the shortest path from NNNN to
FFFF, and calculate the number of transitions on it.

Please type your answer as the number of transitions in the shortest path (just a
single number like "12"). Do NOT include any further description of your
solution. Hint: Do not count the number of states, but the number of transitions.
For example, the number of transitions in the path NNNN→FFNF→NFNF→FFFF
is 3 instead of 4.

Your answer:
7

Your answer is correct


Correct. There are two shortest paths that lead from the start NNNN to the goal
FFFF. One of them is NNNN -> FNFN -> NNFN -> FFFN -> NFNN -> FFNF -> NFNF
-> FFFF, and the other NNNN -> FNFN -> NNFN -> FNFF -> NNNF -> FFNF ->
NFNF -> FFFF. Intuitively, the strategy is to move the chicken to the other side first, and then go back and get either the fox or the feed, and take it to the far side too.
The robot then takes the chicken back to the near side to save it from being eaten
or from eating the feed, and takes the other remaining object (fox or feed) from
the near side to the far side. Finally, the robot goes to fetch the chicken and takes
it to the far side to reach the goal.
Exercise 6: The Towers of Hanoi
Let's do another puzzle: the well-known Towers of Hanoi. In our version, the
puzzle involves three pegs, and two discs: one large, and one small (actually,
there can be any number of discs but for the exercise, two is enough to
demonstrate the principle).

In the initial state, both discs are stacked in the first (leftmost) peg. The goal is to
move the discs to the third peg. You can move one disc at a time, from any peg to
another, as long as there is no other disc on top of it. It is not allowed to put a
larger disc on top of a smaller disc.

This picture shows the initial state and the goal state. There are also seven other
states so that the total number of possible states is nine: three ways to place the
large disc and for each of them, three ways to place the small disc.
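
If you want to check your drawing afterwards, here is a small sketch of our own (not part of the exercise answer) that enumerates the nine states and the legal moves between them; a state is a pair giving the peg (0, 1, or 2) of the large and the small disc:

    from itertools import product

    states = list(product(range(3), repeat=2))   # (large, small): 3 x 3 = 9 states

    def moves(state):
        """States reachable by moving one disc legally."""
        large, small = state
        result = []
        for peg in range(3):
            if peg != small:
                result.append((large, peg))   # the small disc is on top: it can always move
            if peg != large and small != large and peg != small:
                result.append((peg, small))   # the large disc moves only if the small one
                                              # is neither on it nor on the target peg
        return result

    print(len(states))    # 9
    print(moves((0, 0)))  # [(0, 1), (0, 2)]: from the start, only the small disc can move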

Your task: Draw the state diagram. The diagram should include all the nine
possible states in the game, connected by lines that show the possible transitions.
The picture below shows the overall structure of the state diagram and the
positions of the first three states. It shows that from the starting state (at the top
corner), you can move to two other states by moving the small disc. Complete the
state diagram by placing the remaining states in the correct places. Note that the
transitions are again symmetric and you can also move sideways (left or right) or
up in the diagram.

After solving the task using pen and paper, enter your solution by choosing
which state belongs to which node in the diagram. (Hint: Each state belongs to
exactly one node).

Choose for each node (1–6) in the above diagram the correct state A–F from below.

What state should be in box 1?
State A
State B
State C
State D
State E
State F

The answer is correct


Correct
What state should be in box 2?
State A
State B
State C
State D
State E
State F

The answer is correct


Correct
What state should be in box 3?
State A
State B
State C
State D
State E
State F

The answer is correct


Correct
What state should be in box 4?
State A
State B
State C
State D
State E
State F

The answer is correct


Correct
What state should be in box 5?
State A
State B
State C
State D
State E
State F

The answer is correct


Correct
What state should be in box 6?
State A
State B
State C
State D
State E
State F

The answer is correct


Correct
6/6 answers correct

II.Solving problems with AI


Interlude on the history of AI: starting from search

AI is arguably as old as computer science. Long before we had computers, people thought of the possibility of automatic reasoning and intelligence. As we already mentioned in Chapter 1, one of the great thinkers who considered this question was Alan Turing. In addition to the Turing test, his contributions to AI, and more generally to computer science, include the insight that anything that can be computed (= calculated using either numbers or other symbols) can be automated.

Note

Helping win WWII


Turing designed a very simple device that can compute anything that is
computable. His device is known as the Turing machine. While it is a theoretical
model that isn’t practically useful, it led Turing to the invention of programmable
computers: computers that can be used to carry out different tasks depending on
what they were programmed to do.

So instead of having to build a different device for each task, we use the same
computer for many tasks. This is the idea of programming. Today this invention
sounds trivial but in Turing’s days it was far from it. Some of the early
programmable computers were used during World War II to crack German
secret codes, a project where Turing was also personally involved.

The term Artificial Intelligence was coined by John McCarthy (1927-2011) –
who is often also referred to as the Father of AI. The term became established
when it was chosen as the topic of a summer seminar, known as
the Dartmouth conference, which was organized by McCarthy and others in
1956 at Dartmouth College in New Hampshire. In the proposal to organize the
seminar, McCarthy continued with Turing’s argument about automated
computation. The proposal contains the following crucial statement:

Note

John McCarthy’s key statement about AI


“The study is to proceed on the basis of the conjecture that every aspect of
learning or any other feature of intelligence can in principle be so precisely
described that a machine can be made to simulate it.”
In other words, any element of intelligence can be broken down into small
steps so that each of the steps is as such so simple and “mechanical” that it
can be written down as a computer program. This statement was, and is still
today, a conjecture, which means that we can’t really prove it to be true.
Nevertheless, the idea is absolutely fundamental when it comes to the way we
think about AI. For example, it shows that McCarthy wanted to bypass any
arguments in the spirit of Searle’s Chinese Room: intelligence is intelligence
even if the system that implements it is just a computer that mechanically
follows a program.

Why search and games became central in AI research

As computers developed to the level where it was feasible to experiment with practical AI algorithms in the 1950s, the most distinctive AI problems (besides cracking Nazi codes) were games. Games provided a convenient restricted domain that could be formalized easily. Board games such as checkers, chess, and recently quite prominently Go (an extremely complex strategy board game originating from China at least 2500 years ago), have inspired countless researchers, and continue to do so.

Closely related to games, search and planning techniques were an area where AI led to great advances in the 1960s: algorithms with names such as the Minimax algorithm or Alpha-Beta Pruning, which were developed then, are still the basis for game playing AI, although of course more advanced variants have been proposed over the years. In this chapter, we will study games and planning problems on a conceptual level.

III.Search and games


In this section, we will study a classic AI problem: games. The simplest scenario, which we will focus on for the sake of clarity, is two-player, perfect-information games such as tic-tac-toe and chess.

Example: playing tic-tac-toe

Maxine and Minnie are true game enthusiasts. They just love games.
Especially two-person, perfect information games such as tic-tac-toe or chess.
One day they were playing tic-tac-toe. Maxine, or Max as her friends call her,
was playing with X. Minnie, or Min as her friends call her, had the Os. Min had
just played her turn and the board looked as follows:

Max was looking at the board and contemplating her next move, as it was her
turn, when she suddenly buried her face in her hands in despair, looking quite
like Garry Kasparov playing Deep Blue in 1997.

Yes, Min was close to getting three Os on the top row, but Max could easily
put a stop to that plan. So why was Max so pessimistic?

Game trees

To solve games using AI, we will introduce the concept of a game tree. The
different states of the game are represented by nodes in the game tree, very
similar to the above planning problems. The idea is just slightly different. In
the game tree, the nodes are arranged in levels that correspond to each
player’s turns in the game so that the “root” node of the tree (usually depicted
at the top of the diagram) is the beginning position in the game. In tic-tac-toe,
this would be the empty grid with no Xs or Os played yet. Under the root, on the second level, there are the possible states that can result from the first player’s moves, be it X or O. We call these nodes the “children” of the root node.

Each node on the second level would further have as its children the states that can be reached from it by the opposing player’s moves. This is continued, level by level, until reaching states where the game is over. In tic-tac-toe, this means that either one of the players gets a line of three and wins, or the board is full and the game ends in a tie.

Minimizing and maximizing value

In order to be able to create game AI that attempts to win the game, we attach
a numerical value to each possible end result. To the board positions where X
has a line of three so that Max wins, we attach the value +1, and likewise, to
the positions where Min wins with three Os in a row we attach the value -1.
For the positions where the board is full and neither player wins, we use the
neutral value 0 (it doesn’t really matter what the values are as long as they are
in this order so that Max tries to maximize the value, and Min tries to minimize
it).

A sample game tree

Consider, for example, the following game tree which begins not at the root
but in the middle of the game (because otherwise, the tree would be way too
big to display). Note that this is different from the game shown in the
illustration in the beginning of this section. We have numbered the nodes with
numbers 1, 2, ..., 14.

The tree is composed of alternating layers where it is either Min’s turn to place
an O or Max’s turn to place an X at any of the vacant slots on the board. The
player whose turn it is to play next is shown at the left.

The game continues at the board position shown in the root node, numbered
as (1) at the top, with Min’s turn to place O at any of the three vacant cells.
Nodes (2)–(4) show the board positions resulting from each of the three
choices, respectively. In the next step, each node has two vacant cells where
Max can play her X, and so the tree branches again.

Starting from the above board position, the game always ends in a
row of three: in nodes (7) and (9), the winner is Max who plays with X, and in
nodes (11)–(14) the winner is Min who plays with O.

Note that since the players’ turns alternate, the levels can be labeled as Min
levels and Max levels, which indicates whose turn it is.

Being strategic

Consider nodes (5)–(10) on the second level from the bottom. In nodes (7)
and (9), the game is over, and Max wins with three X’s in a row. The value of
these positions is +1. In the remaining nodes, (5), (6), (8), and (10), the game
is also practically over, since Min only needs to place her O in the only
remaining cell to win. In other words, we know how the game will end at each
node on the second level from the bottom. We can therefore decide that the
value of nodes (5), (6), (8), and (10) is also –1.

Here comes the interesting part. Let’s consider the values of the nodes one
level higher towards the root: nodes (2)–(4). Since we observed that both of
the children of (2), i.e., nodes (5) and (6), lead to Min’s victory, we can without
hesitation attach the value -1 to node (2) as well. However, for node (3), the
left child (7) leads to Max’s victory, +1, but the right child (8) leads to Min
winning, -1. What is the value of node (3)? Think about this for a while,
keeping in mind who makes the choice at node (3).

Since it is Max’s turn to play, she will of course choose the left child, node (7).
Thus, every time we reach the board position in node (3), Max can ensure
victory, and we can attach the value +1 to node (3).

The same holds for node (4): again, since Max can choose where to put her
X, she can always ensure victory, and we attach the value +1 to node (4).

Determining who wins

The most important lesson in this section is that the above kind of reasoning
can be applied repeatedly to determine the result of the game in advance from
any board position.

So far, we have decided that the value of node (2) is –1, which means that if
we end up in such a board position, Min can ensure winning, and that the
reverse holds for nodes (3) and (4): their value is +1, which means that Max
can be sure to win if she only plays her own turn wisely.

Finally, we can deduce that since Min is an experienced player, she can reach
the same conclusion, and thus she only has one real option: play the O in the
middle of the board.

In the diagram below, we have included the value of each node as well as the
optimal game play starting at Min’s turn in the root node.

The value of the root node = who wins

The value of the root node, which is said to be the value of the game, tells us
who wins (and how much, if the outcome is not just plain win or lose): Max
wins if the value of the game is +1, Min if the value is –1, and if the value is 0,
then the game will end in a draw. In other games, the value may also take
other values (such as the monetary value of the chips in front of you in poker
for example).

This all is based on the assumption that both players choose what is best for
them and that what is best for one is the worst for the other (so called "zero-
sum game").

Note

Finding the optimal moves


Having determined the values of all the nodes in the game tree, the optimal
moves can be deduced: at any Min node (where it is Min’s turn), the optimal
choice is given by the child node whose value is minimal, and conversely, at any
Max node (where it is Max’s turn), the optimal choice is given by the child node
whose value is maximal. Sometimes there are many equally good choices that
are, well, equally good, and the outcome will be the same no matter which one of
them is picked.

The Minimax algorithm


We can exploit the above concept of the value of the game to obtain an
algorithm called the Minimax algorithm. It guarantees optimal game play in,
theoretically speaking, any deterministic, two-person, perfect-information zero-
sum game. Given a state of the game, the algorithm simply computes the
values of the children of the given state and chooses the one that has the
maximum value if it is Max’s turn, and the one that has the minimum value if it
is Min’s turn.

The algorithm can be implemented using a few lines of code. However, we will
be satisfied with having grasped the main idea. If you are interested in taking
a look at the actual algorithm (alert: programming required) feel free to check
out, for example, Wikipedia: Minimax.
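
To make the idea concrete, here is a minimal sketch in Python (our own
illustration, not part of the original text): the game tree is given explicitly
as nested lists, where a leaf is the value of a finished game and an inner
node is the list of its children.

    def minimax(node, is_max_turn):
        # A leaf is a number: the value of a finished game (+1, 0, or -1).
        if isinstance(node, (int, float)):
            return node
        # Otherwise evaluate all children; the turn alternates at each level.
        values = [minimax(child, not is_max_turn) for child in node]
        return max(values) if is_max_turn else min(values)

    # The sample game tree of this section: the root (Min's turn) has three
    # children, nodes (2)-(4), whose children in turn have the values we
    # derived above for nodes (5)-(10).
    tree = [[-1, -1], [+1, -1], [+1, -1]]
    print(minimax(tree, is_max_turn=False))   # prints -1: the value of the game

Running the sketch prints –1, matching the conclusion we reached by hand:
with optimal play, Min wins.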

Sounds good, can I go home now?

As stated above, the Minimax algorithm can be used to implement optimal
game play in any deterministic, two-player, perfect-information zero-sum
game. Such games include tic-tac-toe, connect four, chess, Go, etc. Rock-
paper-scissors is not in this class of games since it involves information
hidden from the other player; nor are Monopoly or backgammon which are not
deterministic. So as far as this topic is concerned, is that all folks, can we go
home now? The answer is that in theory, yes, but in practice, no.

Note

The problem of massive game trees


In many games, the game tree is simply way too big to traverse in full. For
example, in chess the average branching factor, i.e., the average number of
children (available moves) per node is about 35. That means that to explore all
the possible scenarios up to only two moves ahead, we need to visit
approximately 35 x 35 = 1225 nodes – probably not your favorite pencil-and-
paper homework exercise. A look-ahead of three moves requires visiting 42875
nodes; four moves 1500625; and ten moves 2758547353515625 (that’s about
2.7 quadrillion) nodes. In Go, the average branching factor is estimated to be
about 250. Go means no-go for Minimax.
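
The arithmetic above is easy to verify yourself; for example, in Python:

    # With branching factor 35, a look-ahead of d moves requires visiting
    # roughly 35**d nodes.
    for d in (2, 3, 4, 10):
        print(d, 35 ** d)
    # prints: 2 1225, 3 42875, 4 1500625, 10 2758547353515625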

More tricks: Managing massive game trees


A few more tricks are needed to manage massive game trees. Many of them
were crucial elements in IBM’s Deep Blue computer defeating the chess world
champion, Garry Kasparov, in 1997.

If we can afford to explore only a small part of the game tree, we need a way
to stop the Minimax algorithm before reaching an end-node, i.e., a node
where the game is over and the winner is known. This is achieved by using a
so called heuristic evaluation function that takes as input a board position,
including the information about which player’s turn is next, and returns a score
that should be an estimate of the likely outcome of the game continuing from
the given board position.

Note

Good heuristics
Good heuristics for chess, for example, typically count the amount of material
(pieces) weighted by their type: the queen is usually considered worth about two
times as much as a rook, three times a knight or a bishop, and nine times as much
as a pawn. The king is of course worth more than all other things combined since
losing it amounts to losing the game. Further, occupying the strategically
important positions near the middle of the board is considered an advantage and
the heuristics assign higher value to such positions.
The minimax algorithm presented above requires minimal changes to obtain
a depth-limited version where the heuristic is returned at all nodes at a given
depth limit: the depth simply refers to the number of steps that the game tree
is expanded before applying a heuristic evaluation function.
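
To sketch how small the change is, here is a depth-limited variant in Python
(again our own illustration: each node is a pair holding a heuristic score and
a list of children, an end position simply has no children, and the scores
below are made up):

    def minimax(node, depth, is_max_turn):
        score, children = node
        # At an end position, or when the depth limit is reached, return the
        # node's score: an exact value or a heuristic estimate, respectively.
        if not children or depth == 0:
            return score
        values = [minimax(child, depth - 1, not is_max_turn) for child in children]
        return max(values) if is_max_turn else min(values)

    def leaf(value):
        return (value, [])   # an end position with a known value

    tree = (0, [(0.5, [leaf(+1), leaf(-1)]),
                (-0.2, [leaf(0), leaf(-1)])])
    print(minimax(tree, depth=1, is_max_turn=True))   # 0.5: heuristic estimates used
    print(minimax(tree, depth=2, is_max_turn=True))   # -1: the full search disagrees

Note how the shallow search is misled by the optimistic heuristic score 0.5:
searching one level deeper reveals that the true value of the position is –1.
This is exactly why better heuristics and deeper searches make stronger
game-playing AI.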

Exercise 7: Why so pessimistic, Max?


Let's return to the tic-tac-toe game described in the beginning of this section. To
narrow down the space of possible end-games to consider, we can observe that
Max must clearly place an X on the top row to avoid imminent defeat:

Now it's Min's turn to play an O. Evaluate the value of this state of the game as
well as the other states in the game tree where the above position is the root,
using the Minimax algorithm.
Your task:
Look at the game tree starting from the below board position. Using a pencil and
paper, fill in the values of the bottom-level nodes where the game is over. Note
that this time some of the games end in a draw, which means that the value of
the node is 0 (instead of -1 or 1).

Next continue filling the values of the nodes in the next level up. Since there is no
branching at that level, the values on the second-lowest level are the same as at
the bottom level.

On the second-highest level, fill in the values by choosing for each node the
maximum of the values of the child nodes – as you notice, this is a MAX level.
Finally, fill in the root node's value by choosing the minimum of the root node's
child nodes' values. This is the value of the game.

Enter the value of the game as your answer.

1
-1
0

The answer is correct


The value is -1. The values on the second level are 0, 0, and -1. The values on the
third level are -1, 0, -1, 0, -1, -1, which are the same as the values on the bottom
level. As you can see, Max has every reason to be serious since by playing in the
bottom-right corner, Min can guarantee a win. The inevitable victory of Min can
also be seen from the value of the game -1.
Note

The limitations of plain search


It may look like we have a method to solve any problem by specifying the states
and transitions between them, and finding a path from the current state to our
goal. Alas, things get more complicated when we want to apply AI in real world
problems. Basically, the number of states in even a moderately complex real-
world scenario grows out of hand, and we can’t find a solution by exhaustive
search (“brute force”) or even by using clever heuristics.
Moreover, the transitions which take us from one state to the next when we
choose an action are not deterministic. This means that whatever we choose to
do will not always completely determine the outcome because there are factors
that are beyond our control, and that are often unknown to us.

The algorithms we have discussed above can be adapted to handle some
randomness, for example randomness in choosing cards from a shuffled deck or
throwing dice. This means that we will need to introduce the concept of
uncertainty and probability. Only then can we begin to approach real-world AI
instead of simple puzzles and games. This is the topic of Chapter 3.
After completing Chapter 2 you should be able to:

• Formulate a real-world problem as a search problem


• Formulate a simple game (such as tic-tac-toe) as a game tree
• Use the minimax principle to find optimal moves in a limited-size game
tree
I.Odds and probability

In the previous section, we discussed search and its application where there is perfect
information – such as in games like chess. However, in the real world things are rarely
so clear cut.

Instead of perfect information, there is a host of unknown possibilities,
ranging from missing information to deliberate deception.

Take a self-driving car for example — you can set the goal to get from A
to B in an efficient and safe manner that follows all laws. But what
happens if the traffic gets worse than expected, maybe because of an
accident ahead? Sudden bad weather? Random events like a ball
bouncing in the street, or a piece of trash flying straight into the car’s
camera?

A self-driving car needs to use a variety of sensors, including sonar-like
ones and cameras, to detect where it is and what is around it. These
sensors are never perfect as the data from the sensors always includes
some errors and inaccuracies called “noise”. It is very common that
one sensor indicates that the road ahead turns left, but another sensor
indicates the opposite direction. This needs to be resolved without
always stopping the car in case of even the slightest amount of noise.

Probability

One of the reasons why modern AI methods actually work in real-world
problems – as opposed to most of the earlier “good old-fashioned”
methods in the 1960s–1980s – is their ability to deal with uncertainty.

Note

The history of dealing with uncertainty

The history of AI has seen various competing paradigms for handling
uncertain and imprecise information. For example, you may have heard of
fuzzy logic. Fuzzy logic was for a while a contender for the best approach
to handle uncertain and imprecise information and was used in many
customer applications such as washing machines where the machine could
detect the dirtiness (a matter of degrees, not only dirty or clean) and
adjust the program accordingly.

However, probability has turned out to be the best approach for reasoning
under uncertainty, and almost all current AI applications are based, to at
least some degree, on probabilities.

Why probability matters

We are perhaps most familiar with applications of probability in games:
what are the chances of getting three of a kind in poker (about 1 in 47),
what are the chances of winning in the lottery (very small), and so on.
However, far more importantly, probability can also be used to quantify
and compare risks in everyday life: what are the chances of crashing
your car if you exceed the speed limit, what are the chances that the
interest rates on your mortgage will go up by five percentage points
within the next five years, or what are the chances that AI will automate
particular tasks such as detecting fractured bones in X-ray images or
waiting tables in a restaurant.

Note

The key lesson about probability

The most important lesson about probability that we’d like you to take
away is not probability calculus. Instead, it is the ability to think of
uncertainty as a thing that can be quantified at least in principle. This
means that we can talk about uncertainty as if it were a number: numbers
can be compared (“is this thing more probable than that thing”), and they
can often be measured.

Granted, measuring probabilities is hard: we usually need many
observations about a phenomenon to draw conclusions. However, by
systematically collecting data, we can critically evaluate probabilistic
statements, and our numbers can sometimes be found to be right or
wrong. In other words, the key lesson is that uncertainty is not beyond the
scope of rational thinking and discussion, and probability provides a
systematic way of doing just that.

The fact that uncertainty can be quantified is of paramount importance,
for example, in decisions concerning vaccination or other public policies.
Before entering the market, any vaccine is clinically tested, so that its
benefits and risks have been quantified. The risks are never known to
the minutest detail, but their magnitude is usually known to sufficient
degree that it can be argued whether the benefits outweigh the risks.

Note

Why quantifying uncertainty matters

If we think of uncertainty as something that can’t be quantified or
measured, the uncertainty aspect may become an obstacle for rational
discussion. We may for example argue that since we don’t know exactly
whether a vaccine may cause a harmful side-effect, it is too dangerous to
use. However, this may lead us to ignore a life-threatening disease that the
vaccine will eradicate. In most cases, the benefits and risks are known to
sufficient precision to clearly see that one is more significant than the
other.

The above lesson is useful in many everyday scenarios and
professionally: for example, medical doctors, judges in a court of law, or
investors have to process uncertain information and make rational
decisions based on it. Since this is an AI course, we will discuss
how probability can be used to automate uncertain reasoning. The
examples we will use include medical diagnosis (although it is usually
not a task that we’d wish to fully automate), and identifying fraudulent
email messages (“spam”).

Exercise 8: Probabilistic forecasts

For this exercise, remember the key points from the above discussion:
probability can be quantified (expressed as a number) and it can be right
or wrong. But also keep in mind that it is usually not possible to draw
conclusions about whether a particular number was right or wrong based
on a single observation.

Consider the following four probabilistic forecasts and outcomes. What can
we conclude based on the outcome about the correctness of the forecasts?
Can we conclude that the probability given by the forecast was
indeed the correct probability (choose "right"), that the forecast was
wrong (choose "wrong"), or can we conclude neither way (choose "cannot
be concluded")?

The weather forecast says it's going to rain with 90% probability tomorrow but the day turns out to be all sun and no rain.

Right
Wrong
Cannot be concluded

The answer is correct

We can't conclude that the weather forecast was wrong based on only the
single event. The forecast said it's going to rain with 90% probability,
which means it would not rain with 10% probability or in one out of 10
days. It is perfectly plausible that the day in question was the 1 in 10
event. Concluding that the probability 90% was correct would also be
wrong because by the same argument, we could then conclude that 80%
chance of rain was also correct, and both cannot be correct at the same
time.

The weather forecast says it's going to rain with 0% probability tomorrow but the day turns out to be rainy.

Right
Wrong
Cannot be concluded

The answer is correct

The weather forecast was wrong because a 0% probability means that it
should definitely not rain. But it did.

Suppose you monitor a weather forecaster for a long time. You only consider the days for which the forecast gives 80%
chance of rain. You find that in the long run, on the average it rains on three out of every five days.

Right
Wrong
Cannot be concluded

The answer is correct

The weather forecasts are wrong if they predict 80% chance of rain and it
rains only 60% (three out of five) of the time in the long run. (Note that
we'd really need to keep track of the accuracy for a long time to reach this
conclusion but that's what "in the long run" means.) In practice, weather
forecasters actually tend to provide this kind of 'wrong' predictions just
to be safe: people are often quite disappointed when the weather turns
out to be worse than predicted but pleasantly surprised when it turns out
better than predicted.

In the United States presidential election 2016, a well-known political forecast blog, FiveThirtyEight, gave Clinton a 71.4%
chance of winning (vs Trump's 28.6%). However, contrary to the prediction, Donald Trump was elected the 45th president of
the United States.

Right
Wrong
Cannot be concluded

The answer is correct

Cannot be concluded to be wrong (or right). Sometimes unlikely things
happen. Considering the previous item, it would actually have been
wrong to predict, say, 90% or 100% chance for Trump if there simply
isn't enough information available to anticipate the outcome. In other
words, perhaps Trump's victory was a rare (or rareish) event with 28.6%
probability. Such events are expected to happen in more than one out of
four cases, after all.

Odds

Probably the easiest way to represent uncertainty is through odds. They
make it particularly easy to update beliefs when more information
becomes available (we will return to this in the next section).

Before we proceed any further, we should make sure you are
comfortable with doing basic manipulations on ratios (or fractions). As
you probably recall, fractions are numbers like 3/4 or 21/365. We will
need to multiply and divide such things, so it’s good to refresh these
operations if you feel unsure about them. A compact presentation for
those who just need a quick reminder is Wikibooks: Multiplying
Fractions. Another fun animated presentation of the basic operations
is Math is Fun: Using Rational Numbers. Feel free to consult your
favorite source if necessary.

By odds, we mean an expression like 3:1 (three to one), which means
that we expect that for every three cases of an outcome, for example
winning a bet, there is one case of the opposite outcome, not winning
the bet. (In gambling terms, the odds are usually given from the
bookmaker’s point of view, so 3:1 usually means that your chances of
winning are 1:3.) The other way to express the same would be to say
that the chances of winning are 3/4 (three in four). These are called
natural frequencies since they involve only whole numbers. With whole
numbers, it is easy to imagine, for example, four people out of whom
three have brown eyes. Or four days out of which it rains on three (if
you’re in Helsinki).

Note

Why we use odds and not percentages

Three out of four is of course the same as 75% (mathematicians prefer to
use fractions like 0.75 instead of percentages). It has been found that
people get confused and make mistakes more easily when dealing with
fractions and percentages than with natural frequencies or odds. This is
why we use natural frequencies and odds whenever convenient.

An important thing to notice is that while expressed as two numbers, 3
and 1 for example, the odds can actually be thought of as a single
fraction or a ratio, for example 3/1 (three divided by one), which is equal
to 3. Thus, the odds 3:1 is the same as the odds 6:2 or 30:10 since
these ratios are also equal to 3. Likewise, the odds 1:5 can be thought
of as 1/5 (one divided by five), which equals 0.2. Again, this is the same
as the odds 2:10 or 10:50 because that’s what you get by dividing 2 by
10 or 10 by 50. But be very careful! The odds 1:5 (one win for every five
losses), even if it can be expressed as the decimal number 0.2, is
different from 20% probability (or probability 0.2 using the
mathematicians’ notation). The odds 1:5 mean that you’d have to play
the game six times to get one win on the average. The probability 20%
means that you’d have to play five times to get one win on the average.

For odds that are greater than one, such as 5:1, it is easy to remember
that we are not dealing with probabilities because no probability can be
greater than 1 (or greater than 100%), but for odds that are less than
one, such as 1:5, the danger of confusion lurks around the corner.

So make sure you always know when we are talking about odds and
when we are talking about probabilities.

The following exercise will help you practice dealing with the
correspondence between odds and probabilities. Don’t worry if you
make some mistakes at this stage: the main goal is to learn the skills
that you will need in the next sections.

Exercise 9: Odds

As we already mentioned above, the odds 3:1 – for example three rainy
days for each rainless day – corresponds to probability 0.75 (or in
percentages 75%).

In general, if the odds in favor of an event are x:y, the probability of the
event is given by x / (x+y). Try that with the odds 3:1 if you like. You
should get the answer 0.75.

As we also pointed out, the odds 6:2 corresponds to exactly the same
probability as the odds 3:1 because when we let x=6 and y=2, and write
them in the formula x / (x+y), we get 6/(6+2), which comes out as 6/8 =
3/4 = 0.75.

Your task:

For the first three items 1–3, convert from odds to probabilities expressed
as natural frequencies; for example from 1:1 to 1/2. Give your answer as a
fraction, for example 2/3.

For the last three items 4–6, convert the odds into probabilities expressed
as percentages (e.g. 4.2%). Give your answer in percentages using a single
decimal, for example 12.2%.

Hint: the calculations can be done with a simple calculator and the
formulas can be found above.

The odds for getting three of a kind in poker are about 1:46.

Your answer:
1/47

Your answer is correct

Correct. There are 46 situations where you do not get three of a kind for
one where you get it, so the probability is 1/(1+46) = 1/47.

The odds for rain in Helsinki are 206:159.

Your answer:
206/365

Your answer is correct

Correct. There are 206 rainy days for 159 dry days, so the probability is
206/(206+159) = 206/365.

The odds for rain in San Diego are 23:342.

Your answer:
23/365

Your answer is correct

Correct. There are 23 rainy days for 342 dry days, so the probability is
23/(23+342) = 23/365.

The odds for getting three of a kind in poker are about 1:46.

Your answer:
2.1%

Your answer is correct

Correct. Previously we had the probability as 1/(1+46) = 1/47, which
gives us roughly 0.0213, which rounds to 2.1%.

The odds for rain in Helsinki are 206:159.

Your answer:
56.4%

Your answer is correct

Correct. Previously we had the probability as 206/(206+159) = 206/365,
which gives us roughly 0.5644, which rounds to 56.4%.

The odds for rain in San Diego are 23:342.

Your answer:
6.3%

Your answer is correct

Correct. Previously we had the probability as 23/(23+342) = 23/365,
which gives us roughly 0.0630, which rounds to 6.3%.

II.The Bayes rule


We will not go too far into the details of probability calculus and all the ways in which it can
be used in various AI applications, but we will discuss one very important formula.

We will do this because this particular formula is both simple and elegant as
well as incredibly powerful. It can be used to weigh conflicting pieces of
evidence in medicine, in a court of law, and in many (if not all) scientific
disciplines. The formula is called the Bayes rule (or the Bayes formula).

We will start by demonstrating the power of the Bayes rule by means of a
simple medical diagnosis problem where it highlights how poorly our intuition
is suited for combining conflicting evidence. We will then show how the Bayes
rule can be used to build AI methods that can cope with conflicting and noisy
observations.

Key terminology

Prior and posterior odds


The Bayes rule can be expressed in many forms. The simplest one is in terms of
odds. The idea is to take the odds for something happening (against it not
happening), which we’ll call prior odds. The word prior refers to our assessment
of the odds before obtaining some new information that may be relevant. The
purpose of the formula is to update the prior odds when new information
becomes available, to obtain the posterior odds, or the odds after obtaining the
information (the dictionary meaning of posterior is “something that comes after,
later”.)
How odds change

In order to weigh the new information, and decide how the odds change when
it becomes available, we need to consider how likely we would be to
encounter this information in alternative situations. Let’s take as an example,
the odds that it will rain later today. Imagine getting up in the morning in
Finland. The chances of rain are 206 in 365 (including rain, snow, and hail.
Brrr). The number of days without rain is therefore 159. This converts to prior
odds of 206:159 for rain, so the cards are stacked against you already before
you open your eyes.

However, after opening your eyes and taking a look outside, you notice it’s
cloudy. Suppose the chances of having a cloudy morning on a rainy day are 9
out of 10 – that means that only one out of 10 rainy days start out with blue
skies. But sometimes there are also clouds without rain: the chances of
having clouds on a rainless day are 1 in 10. Now how much higher are the
chances of clouds on a rainy day compared to a rainless day? Think about
this carefully as it will be important to be able to comprehend the question and
obtain the answer in what follows.

The answer is that the chances of clouds on a rainy day are nine times as
high as the chances of clouds on a rainless day: on a rainy day the chances
are 9 out of 10, whereas on a rainless day the chances are 1 out of 10, which
is nine times as high.

Note that even though the two probabilities 9/10 and 1/10 sum up to 9/10 +
1/10 = 1, this is by no means always the case. In some other town, the
mornings of rainy days could be cloudy eight times out of ten. This, however,
would not mean that the rainless days are cloudy two times out of ten. You’ll
have to be careful to get the calculations right. (But never mind if you make a
mistake or two – don’t give up! The Bayes rule is a fundamental thinking tool
for every one of us.)

Key terminology

Likelihood ratio
The above ratio (nine times as high chances of clouds on a rainy day compared to
a rainless day) is called the likelihood ratio. More generally, the likelihood ratio is
the probability of the observation in case the event of interest (in the above,
rain), divided by the probability of the observation in case of no event (in the
above, no rain). Please read the previous sentence a few times. It may look a little
intimidating, but it’s not impossible to digest if you just focus carefully. We will
walk you through the steps in detail, just don’t lose your nerve. We’re almost
there.
So we concluded that on a cloudy morning, we have: likelihood ratio = (9/10) /
(1/10) = 9

The mighty Bayes rule for converting prior odds into posterior odds is – ta-
daa! – as follows: posterior odds = likelihood ratio × prior odds

Now you are probably thinking: Hold on, that’s the formula? It’s a frigging
multiplication! That is the formula – we said it’s simple, didn’t we? You
wouldn’t imagine that a simple multiplication can be used for all kinds of
incredibly useful applications, but it can. We’ll study a couple of examples which
will demonstrate this.
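
To see just how simple the formula is in use, here is the rain example as a
few lines of Python (a sketch; note that only the first number of the odds
gets multiplied):

    # Prior odds for rain in Helsinki: 206:159.
    prior = (206, 159)
    # Likelihood ratio: clouds are nine times as likely on a rainy morning
    # as on a rainless one.
    likelihood_ratio = (9 / 10) / (1 / 10)
    # The Bayes rule in odds form: multiply the "for" side by the ratio.
    posterior = (prior[0] * likelihood_ratio, prior[1])
    print(posterior)   # (1854.0, 159), i.e. posterior odds 1854:159 for rain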

Note

Many forms of Bayes


In case you have any trouble with the following exercises, you may need to read
the above material a few times and give it some time, and if that doesn’t do it, you
can look for more material online. Just a word of advice: there are many different
forms in which the Bayes rule can be written, and the odds form that we use isn’t
the most common one. Here are a couple links that you may find useful.

• Maths Doctor: Bayes’ Theorem and medical testing


• Better Explained: Understanding Bayes Theorem With Ratios

Exercise 10: Bayes rule (part 1 of 2)


Apply the Bayes rule to calculate the posterior odds for rain having observed
clouds in the morning in Helsinki.

As we calculated above, the prior odds for rain is 206:159 and the likelihood
ratio for observing clouds is 9.

Give your result in the form of odds, xx:yy, where xx and yy are numbers. (Note
that xx and yy does not mean that the numbers should have two digits each.)
Remember that when multiplying odds, you should only multiply the numerator
(the xx part). For example, if you multiply the odds 5:3 by 5, the result is 25:3.
Give the answer without simplifying the expression even if both sides have a
common factor.

Your answer:
1854:159

Your answer is correct


Correct. The prior odds are 206:159. The likelihood ratio is 9, so we get the
posterior odds for rain given clouds to be 9 × 206:159 = 1854:159. So in the long
run, on the days when we observe clouds in the morning, we can expect 1854
rainy days for every 159 rainless days, which is about the same as 12 rainy days
for one rainless day. If we wanted to express this as a probability (even though
this was not the question), we could use the formula x / (x+y) to get the value
1854 / (1854+159) which is about 0.92, or about 92% chance of rain when there
are clouds in the morning. Better take an umbrella.

The Bayes rule in practice: breast cancer screening

Our first realistic application is a classical example of using the Bayes rule,
namely medical diagnosis. This example also illustrates a common bias in
dealing with uncertain information called the base-rate fallacy.

Consider mammographic screening for breast cancer. Using made up
percentages for the sake of simplifying the numbers, let’s assume that 5 in
100 women have breast cancer. Suppose that if a person has breast cancer,
then the mammograph test will find it 80 times out of 100. When the test
comes out suggesting that breast cancer is present, we say that the result is
positive, although of course there is nothing positive about this for the person
being tested (a technical way of saying this is that the sensitivity of the test is
80%).

The test may also fail in the other direction, namely to indicate breast cancer
when none exists. This is called a false positive finding. Suppose that if the
person being tested actually doesn’t have breast cancer, the chances that the
test nevertheless comes out positive are 10 in 100. (In technical terms, we
would say that the specificity of the test is 90%.)
Based on the above probabilities, you are able to calculate the likelihood ratio.
You’ll find use for it in the next exercise. If you forgot how the likelihood ratio is
calculated, you may wish to check the terminology box earlier in this section
and revisit the rain example.

Note: You can use the above diagram with stick figures to validate that your
result is in the ballpark (about right), but note that the diagram isn’t quite precise.
Out of the 95 women who don’t have cancer (the gray figures in the top
panel), about nine and a half are expected to get a (false) positive result. The
remaining 85 and a half are expected to get a (true) negative result. We didn’t
want to be so cruel as to cut people – even stick figures – in half, so we used
9 and 86 as an approximation.

Exercise 11: Bayes rule (part 2 of 2)


Consider the above breast cancer scenario. An average woman takes the
mammograph test and gets a positive test result suggesting breast cancer. What
do you think are the odds that she has breast cancer given the observation that
the test is positive?

First, use your intuition without applying the Bayes rule, and write down on a
piece of paper (not in the answer box below) what you think the chances of
having breast cancer are after a positive test result. The intuitive answer will not
be a part of your answer. It will be just for your own information.

Next, calculate the posterior odds for her having breast cancer using the Bayes
rule. This will be your answer.

Hints:

1. Start by calculating the prior odds.
2. Determine the probability of the observation in case of the event (cancer).
3. Determine the probability of the observation in case of no event (no
cancer).
4. Obtain the likelihood ratio as the ratio of the above two probabilities.
5. Finally, multiply the prior odds by the likelihood ratio.

Enter the posterior odds as your solution below. Give the answer in the form
xx:yy where xx and yy are numbers, without simplifying the expression even if
both sides have a common factor.
Your answer:
40:95

Your answer is correct


Correct. The prior odds describe the situation before getting the test result. Since
five out of 100 women have breast cancer, there is on the average five women
with breast cancer for every 95 women without breast cancer, and therefore, the
prior odds are 5:95. The likelihood ratio is the probability of a positive result in
case of cancer divided by the probability of a positive result in case of no cancer.
With the above numbers, this is given by 80/100 divided by 10/100, which is 8.
The Bayes rule now gives the posterior odds of breast cancer given the positive
test result: posterior odds = 8 × 5:95 = 40:95, which is the correct answer. So
despite the positive test result, the odds are actually against the person having
breast cancer: among the women who are tested positive, there are on the
average 40 women with breast cancer for every 95 women without breast
cancer. Note: If we would like to express the chances of breast cancer given the
positive test result as a probability (even though this is not what the exercise
asked for), we would consider the 40 cases with cancer and the 95 cases without
cancer together, and calculate what portion of the total 40 + 95 = 135 individuals
have cancer. This gives the result 40 out of 135, or about 30%. This is much
higher than the prevalence of breast cancer, 5 in 100, or 5%, but still the chances
are that the person has no cancer. If you compare the solution to your intuitive
answer, they tend to be quite different for most people. This demonstrates how
poorly suited our intuition is for handling uncertain and conflicting information.

The base rate fallacy

While doing the above exercise, you may have noticed that our intuition is not
well geared towards weighing different pieces of evidence. This is true
especially when the pieces of evidence conflict with each other. In the above
example, on the one hand, the base rate of breast cancer was relatively low,
meaning that breast cancer is relatively rare. So our brain thinks that it’s
unlikely that a person has it. On the other hand, the positive mammograph
test suggests the opposite. Our brain tends to choose one of these pieces of
evidence and ignore the other. It is typically the low base rate that is ignored.
That’s why your intuition probably says that the posterior probability of having
breast cancer given the positive test result is much higher than 30%. This is
known as the so called base rate fallacy. Knowing the Bayes rule is the best
cure against it.
III.Naive Bayes classification
One of the most useful applications of the Bayes rule is the so-called naive Bayes classifier.

The Bayes classifier is a machine learning technique that can be used to
classify objects such as text documents into two or more classes. The
classifier is trained by analyzing a set of training data, for which the correct
classes are given.

The naive Bayes classifier can be used to determine the probabilities of the
classes given a number of different observations. The assumption in the
model is that the feature variables are conditionally independent given the
class (we will not discuss the meaning of conditional independence in this
course. For our purposes, it is enough to be able to exploit conditional
independence in building the classifier).

Real world application: spam filters

We will use a spam email filter as a running example for illustrating the idea of
the naive Bayes classifier. Thus, the class variable indicates whether a
message is spam (or “junk email”) or whether it is a legitimate message (also
called “ham”). The words in the message correspond to the feature variables,
so that the number of feature variables in the model is determined by the
length of the message.

Note

Why we call it “naive”


Using spam filters as an example, the idea is to think of the words as being
produced by choosing one word after the other so that the choice of the word
depends only on whether the message is spam or ham. This is a crude
simplification of the process because it means that there is no dependency
between adjacent words, and the order of the words has no significance. This is
in fact why the method is called naive.
Because the model is based on the idea that the words can be processed
independently, we can identify specific words that are indicative of either
spam ("FREE", "LAST") or ham ("meeting", "algorithm").

Despite its naivete, the naive Bayes method tends to work very well in
practice. This is a good example of what the common saying in statistics, “all
models are wrong, but some are useful”, means (the aphorism is generally
attributed to statistician George E.P. Box).

Estimating parameters

To get started, we need to specify the prior odds for spam (against ham). For
simplicity, assume this to be 1:1, which means that on the average half of the
incoming messages are spam (in reality, the amount of spam is probably
much higher).

To get our likelihood ratios, we need two different probabilities for any word
occurring: one in spam messages and another one in ham messages.

The word distributions for the two classes are best estimated from actual
training data that contains some spam messages as well as legitimate
messages. The simplest way is to count how many times each word, abacus,
acacia, ..., zurg, appears in the data and divide the number by the total word
count.

To illustrate the idea, let’s assume that we have at our disposal some spam
and some ham. You can easily obtain such data by saving a batch of your
emails in two files.

Assume that we have calculated the number of occurrences of the following
words (along with all other words) in the two classes of messages:

word          spam      ham
million        156       98
dollars         29      119
adclick         51        0
conferences      0       12
total        95791   306438
We can now estimate that the probability that a word in a spam message is
“million”, for example, is about 156 out of 95791, which is roughly the same as
1 in 614. Likewise, we get the estimate that 98 out of 306438 words in a ham
message, which is about the same as 1 in 3127, are “million”. Both of these
probability estimates are small, less than 1 in 500, but more importantly, the
former is higher than the latter: 1 in 614 is higher than 1 in 3127. This means
that the likelihood ratio, which is the first ratio divided by the second ratio, is
more than one. To be more precise, the ratio is (1/614) / (1/3127) = 3127/614
= 5.1 (rounded to one decimal digit).
Recall that if you have any trouble at all with following the math in this section,
you should refresh the arithmetic with fractions using the pointers we gave
earlier (see the part about Odds in section Odds and Probability).

Note

Zero means trouble


One problem with estimating the probabilities directly from the counts is that
zero counts lead to zero estimates. This can be quite harmful for the performance
of the classifier – it easily leads to situations where the posterior odds are 0/0,
which is nonsense. The simplest solution is to use a small lower bound for all
probability estimates. The value 1/100000, for instance, does the job.
Using the above logic, we can determine the likelihood ratio for all possible
words without having to use zero, giving us the following likelihood ratios:

word          likelihood ratio
million                    5.1
dollars                    0.8
adclick                   53.2
conferences                0.3
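
These ratios can be computed mechanically from the word counts. Here is a
sketch in Python using the counts tabulated above and the 1/100000 floor
from the note:

    counts_spam = {"million": 156, "dollars": 29, "adclick": 51, "conferences": 0}
    counts_ham = {"million": 98, "dollars": 119, "adclick": 0, "conferences": 12}
    total_spam, total_ham = 95791, 306438
    FLOOR = 1 / 100000   # lower bound for all probability estimates

    for word in counts_spam:
        p_word_given_spam = max(counts_spam[word] / total_spam, FLOOR)
        p_word_given_ham = max(counts_ham[word] / total_ham, FLOOR)
        print(word, round(p_word_given_spam / p_word_given_ham, 1))
    # prints: million 5.1, dollars 0.8, adclick 53.2, conferences 0.3
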
We are now ready to apply the method to classify new messages.

Example: is it spam or ham?

Once we have the prior odds and the likelihood ratios calculated, we are
ready to apply the Bayes rule, which we already practiced in the medical
diagnosis case as our example. The reasoning goes just like it did before: we
update the odds of spam by multiplying it by the likelihood ratio. To remind
ourselves of the procedure, let’s try a message with a single word to begin
with. For the prior odds, as agreed above, you should use odds 1:1.

Exercise 12: One word spam filter


Let's start with a message that only has one word in it: “million”.

Your task: Calculate the posterior odds for spam given this word using the table
above, starting from prior odds 1:1. Keep in mind that the odds is not the same as
the probability, which we would usually express as a percentage.
Give your answer in the form of a single decimal number x.x using the dot '.' as
the decimal separator.

(Remember that odds can be represented as xx:yy or simply as a single decimal


number, say z.z (where z.z = xx/yy). You may wish to revisit the discussion on
this just before Exercise 9 in Section 3.1 (Odds and Probability).)

Your answer:
5.1

Your answer is correct


Correct. As you may have noticed, the structure of this exercise is identical to that
of the previous exercise about medical diagnosis. We have the class label spam or
ham, and one piece of evidence that we can use to update our prior odds to
obtain the posterior odds. We decided above that the prior odds are 1:1. The
likelihood ratio is obtained by dividing the probability of the word 'million' in
spam divided by the probability of the word 'million' in ham. This we already
calculated above, and it can be found in the table of likelihood ratios: the value is
5.1. Now multiply the prior odds by the likelihood ratio to get 1:1 × 5.1 = 5.1. This
is the posterior odds. Again, the posterior odds means that for messages that
include the word 'million', there are on the average 5.1 spam messages for every
ham message. Or to use whole numbers, there are 51 spam messages for every
10 ham messages. The probability value is therefore 51 / (51+10) = 51/61, or
approximately 83.6 %.

To handle the rest of the words in a message, we can use exactly the same
procedure. The posterior odds after one word, which you calculated in the
previous exercise, will become the prior odds for the next word, and so on.

Exercise 13: Full spam filter


Now use the naive Bayes method to calculate the posterior odds for spam given
the message “million dollars adclick conferences”.

You should again start with the prior odds 1:1, and then multiply the odds
repeatedly by the likelihood ratios for each of the four words. Notice that the
likelihood ratios are tabulated above for your reference (these are the numbers
5.1, 0.8, and so on).
Your task: Express the result as posterior odds without any rounding of the
result. You may take a look at the solution of the previous exercise for help.

Your answer:
65.1168

Your answer is correct


Correct. We start in the same way as the previous exercise. Multiplying the prior
odds by the likelihood ratio 5.1 (for the word 'million') gives posterior odds 5.1.
Next we'll simply keep multiplying the odds by the likelihood ratios for the rest
of the message. The likelihood ratios can be found in the table above: 0.8
('dollars'), 53.2 ('adclick'), and 0.3 ('conferences'). The product of all these
numbers equals 1:1 × 5.1 × 0.8 × 53.2 × 0.3 = 65.1168. This means that for
messages that include all these four words, there are on the average about 65
spam messages for each ham message, or about 651 spam messages for every 10
ham messages. If we wanted to get the probability value (which was not asked),
it is about 651 / (651+10) = 651 / 661 or approximately 98.5 %. This message
would probably end up in your junk mail folder.
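
To recap, the whole filter fits in a few lines of Python (a sketch using the
likelihood ratios tabulated above and prior odds 1:1):

    likelihood_ratios = {"million": 5.1, "dollars": 0.8,
                         "adclick": 53.2, "conferences": 0.3}

    def spam_odds(message, prior_odds=1.0):
        # Multiply the prior odds by the likelihood ratio of each word:
        # the posterior after one word becomes the prior for the next.
        odds = prior_odds
        for word in message.split():
            odds *= likelihood_ratios[word]
        return odds

    print(spam_odds("million"))                               # 5.1
    print(spam_odds("million dollars adclick conferences"))   # about 65.1168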

Hooray! You have now mastered a powerful technique used every day in a
wide range of real-world AI applications, the naive Bayes classifier. Even if
you had to skip some of the technicalities, you should try to make sure you
understood the basic principles of applying probabilities to update beliefs. As
we discussed in the beginning of this Chapter, the main advantage of
probabilistic reasoning is the ability to handle uncertain and conflicting
evidence. Using examples in medical diagnosis and spam filtering, we
demonstrated how this works in practice.

After completing Chapter 3 you should be able to:

• Express probabilities in terms of natural frequencies


• Apply the Bayes rule to infer risks in simple scenarios
• Explain the base-rate fallacy and avoid it by applying Bayesian
reasoning

I.The types of machine learning


Handwritten digits are a classic case that is often used when discussing why we use machine
learning, and we will make no exception.
Below you can see examples of handwritten images from the very commonly
used MNIST dataset.

The correct label (what digit the writer was supposed to write) is shown above
each image. Note that some of the "correct” class labels are questionable: see
for example the second image from left: is that really a 7, or actually a 4?

Note

MNIST – What’s that?


Every machine learning student knows about the MNIST dataset. Fewer know
what the acronym stands for. In fact, we had to look it up to be able to tell you
that the M stands for Modified, and NIST stands for National Institute of
Standards and Technology. Now you probably know something that an average
machine learning expert doesn’t!
In the most common machine learning problems, exactly one class value is
correct at a time. This is also true in the MNIST case, although as we said, the
correct answer may often be hard to tell. In this kind of problem, it is not
possible that an instance belongs to multiple classes (or none at all) at the
same time. What we would like to achieve is an AI method that can be given
an image like the ones above, and automatically spits out the correct label (a
number between 0 and 9).

Note

How not to solve the problem


An automatic digit recognizer could in principle be built manually by writing
down rules such as:

• if the black pixels are mostly in the form of a single loop then the label is
0
• if the black pixels form two intersecting loops then the label is 8
• if the black pixels are mostly in a straight vertical line in the middle of
the figure then the label is 1

and so on...
This was how AI methods were mostly developed in the 1980s (so called
“expert systems”). However, even for such a simple task as digit recognition,
the task of writing such rules is very laborious. In fact, the above example rules
wouldn’t be specific enough to be implemented by programming – we’d have
to define precisely what we mean by “mostly”, “loop”, “line”, “middle”, and so
on.

And even if we did all this work, the result would likely be a bad AI method
because as you can see, the handwritten digits are often a bit so-and-so, and
every rule would need a dozen exceptions.

Three types of machine learning

The roots of machine learning are in statistics, which can also be thought of
as the art of extracting knowledge from data. Especially methods such as
linear regression and Bayesian statistics, which are both already more than
two centuries old (!), are even today at the heart of machine learning. For
more examples and a brief history, see the timeline of machine
learning (Wikipedia).

The area of machine learning is often divided into subareas according to the
kinds of problems being attacked. A rough categorization is as follows:

Supervised learning: We are given an input, for example a photograph with a
traffic sign, and the task is to predict the correct output or label, for example
which traffic sign is in the picture (speed limit, stop sign, etc.). In the simplest
cases, the answers are in the form of yes/no (we call these binary
classification problems).

Unsupervised learning: There are no labels or correct outputs. The task is to
discover the structure of the data: for example, grouping similar items to form
“clusters”, or reducing the data to a small number of important “dimensions”.
Data visualization can also be considered unsupervised learning.

Reinforcement learning: Commonly used in situations where an AI agent like
a self-driving car must operate in an environment and where feedback about
good or bad choices is available with some delay. Also used in games where
the outcome may be decided only at the end of the game.
The categories are somewhat overlapping and fuzzy, so a particular method
can sometimes be hard to place in one category. For example, as the name
suggests, so-called semisupervised learning is partly supervised and partly
unsupervised.

Note

Classification
When it comes to machine learning, we will focus primarily on supervised
learning, and in particular, classification tasks. In classification, we observe an
input, such as a photograph of a traffic sign, and try to infer its “class”, such as the
type of sign (speed limit 80 km/h, pedestrian crossing, stop sign, etc.). Other
examples of classification tasks include: identification of fake Twitter accounts
(input includes the list of followers, and the rate at which they have started
following the account, and the class is either fake or real account) and
handwritten digit recognition (input is an image, class is 0,...,9).

Humans teaching machines: supervised learning

Instead of manually writing down exact rules to do the classification, the point
in supervised machine learning is to take a number of examples, label each
one by the correct label, and use them to “train” an AI method to automatically
recognize the correct label for the training examples as well as (at least
hopefully) any other images. This of course requires that the correct labels are
provided, which is why we talk about supervised learning. The user who
provides the correct labels is a supervisor who guides the learning algorithm
towards correct answers so that eventually, the algorithm can independently
produce them.

In addition to learning how to predict the correct label in a classification
problem, supervised learning can also be used in situations where the
predicted outcome is a number. Examples include predicting the number of
people who will click a Google ad based on the ad content and data about the
user’s prior online behavior, predicting the number of traffic accidents based
on road conditions and speed limit, or predicting the selling price of real estate
based on its location, size, and condition. These problems are
called regression. You probably recognize the term linear regression, which is
a classical, still very popular technique for regression.
Note

Example
Suppose we have a data set consisting of apartment sales data. For each
purchase, we would obviously have the price that was paid, together with the
size of the apartment in square meters (or square feet, if you like), and the
number of bedrooms, the year of construction, the condition (on a scale from
“disaster” to “spick and span”). We could then use machine learning to train a
regression model that predicts the selling price based on these features. See a
real-life example here.

Caveat: careful with that machine learning algorithm

There are a couple of potential mistakes that we’d like to make you aware of.
They are related to the fact that unless you are careful with the way you apply
machine learning methods, you could become too confident about the
accuracy of your predictions, and be heavily disappointed when the accuracy
turns out to be worse than expected.

The first thing to keep in mind in order to avoid big mistakes, is to split your
data set into two parts: the training data and the test data. We first train the
algorithm using only the training data. This gives us a model or a rule that
predicts the output based on the input variables.

To assess how well we can actually predict the outputs, we can’t count on the
training data. While a model may be a very good predictor in the training data,
it is no proof that it can generalize to any other data. This is where the test
data comes in handy: we can apply the trained model to predict the outputs
for the test data and compare the predictions to the actual outputs (for
example, future apartment sale prices).
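
In code, the procedure could look like the following sketch (using the
scikit-learn library; the tiny apartment data set here is made up purely for
illustration):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Input: apartment size in square meters; output: selling price.
    sizes = [[30], [45], [60], [75], [90], [105], [120], [135]]
    prices = [100, 150, 195, 240, 280, 310, 355, 380]

    # Hold out a quarter of the data as test data.
    X_train, X_test, y_train, y_test = train_test_split(
        sizes, prices, test_size=0.25)

    model = LinearRegression().fit(X_train, y_train)  # train on training data only
    print(model.score(X_test, y_test))  # accuracy (R squared) on unseen test data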

Note

Too fit to be true! Overfitting alert


It is very important to keep in mind that the accuracy of a predictor learned by
machine learning can be quite different in the training data and in separate test
data. This is the so-called overfitting phenomenon, and a lot of machine learning
research is focused on avoiding it one way or another. Intuitively, overfitting
means trying to be too smart. When predicting the success of a new song by a
known artist, you can look at the track record of the artist’s earlier songs, and
come up with a rule like “if the song is about love, and includes a catchy chorus, it
will be top-20”. However, maybe there are two love songs with catchy choruses
that didn’t make the top-20, so you decide to continue the rule “...except if
Sweden or yoga are mentioned” to improve your rule. This could make your rule
fit the past data perfectly, but it could in fact make it work worse on future test
data.

Machine learning methods are especially prone to overfitting because they can
try a huge number of different “rules” until one that fits the training data
perfectly is found. Especially methods that are very flexible and can adapt to
almost any pattern in the data can overfit unless the amount of data is enormous.
For example, compared to quite restricted linear models obtained by linear
regression, neural networks can require massive amounts of data before they
produce reliable predictions.
Learning to avoid overfitting and choose a model that is not too restricted, nor
too flexible, is one of the most essential skills of a data scientist.

Learning without a teacher: unsupervised learning

Above we discussed supervised learning where the correct answers are
available, and the task of the machine learning algorithm is to find a model
that predicts them based on the input data.

In unsupervised learning, the correct answers are not provided. This makes
the situation quite different since we can’t build the model by making it fit the
correct answers on training data. It also makes the evaluation of performance
more complicated since we can’t check whether the learned model is doing
well or not.

Typical unsupervised learning methods attempt to learn some kind of
“structure” underlying the data. This can mean, for example, visualization
where similar items are placed near each other and dissimilar items further
away from each other. It can also mean clustering, where we use the data to
identify groups or “clusters” of items that are similar to each other but
dissimilar from the data in other clusters.
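
As a minimal sketch of the clustering idea, here is k-means in Python with
scikit-learn; the tiny customer data below is invented for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: counts of purchases in three product groups
# [health foods, fish, soda and pizza] (made-up numbers).
customers = np.array([
    [9, 1, 0], [8, 2, 1],   # mostly health foods
    [1, 9, 1], [2, 8, 0],   # mostly fish
    [0, 1, 9], [1, 0, 8],   # mostly soda and pizza
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
# The algorithm only assigns cluster numbers; naming the clusters
# ('fish lovers' and so on) is left for the user.
print(kmeans.labels_)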

Note

Example
As a concrete example, grocery store chains collect data about their customers’
shopping behavior (that’s why you have all those loyalty cards). To better
understand its customers, the store can visualize the data using a graph where
each customer is represented by a dot and customers who tend to buy the
same products are placed nearer each other than customers who buy different
products. Alternatively, the store could apply clustering to obtain a set of
customer groups such as ‘low-budget health food enthusiasts’, ‘high-end fish
lovers’, ‘soda and pizza 6 days a week’, and so on. Note that the machine
learning method would only group the customers into clusters; it wouldn’t
automatically generate the cluster labels (‘fish lovers’ and so on). This task
would be left for the user.
Yet another example of unsupervised learning can be termed generative
modeling. This has become a prominent approach over the last few years as
a deep learning technique called generative adversarial networks (GANs) has
led to great advances. Given some data, for example, photographs of
people’s faces, a generative model can generate more of the same: more
real-looking but artificial images of people’s faces.

We will return to GANs and the implications of being able to produce high-
quality artificial image content a bit later in the course, but next we will take a
closer look at supervised learning and discuss some specific methods in more
detail.

II.The nearest neighbor classifier


The nearest neighbor classifier is among the simplest possible classifiers. When given an item
to classify, it finds the training data item that is most similar to the new item, and outputs its
label. An example is given in the following diagram.

In the above diagram, we show a collection of training data items, some of
which belong to one class (green) and others to another class (blue). In
addition, there are two test data items, the stars, which we are going to
classify using the nearest neighbor method.

The two test items are both classified in the “green” class because their
nearest neighbors are both green (see diagram (b) above).

The position of the points in the plot represents in some way the properties of
the items. Since we draw the diagram on a flat two-dimensional surface – you
can move in two independent directions: up-down or left-right – the items have
two properties that we can use for comparison. Imagine for example
representing patients at a clinic in terms of their age and blood-sugar level.
But the above diagram should be taken just as a visual tool to illustrate the
general idea, which is to relate the class values to similarity or proximity
(nearness). The general idea is by no means restricted to two dimensions and
the nearest neighbor classifier can easily be applied to items that are
characterized by many more properties than two.
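
As a minimal sketch, assuming the ordinary geometric (Euclidean) distance
discussed in the next subsection, a nearest neighbor classifier can be written
in a few lines of Python and works in any number of dimensions. The patient
data below is invented for illustration.

import numpy as np

def nearest_neighbor(train_X, train_y, item):
    # Euclidean distance from the new item to every training item.
    distances = np.sqrt(((train_X - item) ** 2).sum(axis=1))
    # Output the label of the closest training item.
    return train_y[np.argmin(distances)]

# Made-up training data: patients described by [age, blood-sugar level].
train_X = np.array([[25, 4.5], [62, 7.8], [40, 5.1], [70, 8.2]])
train_y = np.array(["green", "blue", "green", "blue"])

print(nearest_neighbor(train_X, train_y, np.array([58, 7.0])))  # prints "blue"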

What do we mean by nearest?

An interesting question related to (among other things) the nearest neighbor
classifier is the definition of distance or similarity between instances. In the
illustration above, we tacitly assumed that the standard geometric distance,
technically called the Euclidean distance, is used. This simply means that if
the points are drawn on a piece of paper (or displayed on your screen), you
can measure the distance between any two items by pulling a piece of thread
straight from one to the other and measuring the length.

Note

Defining “nearest”
Using the geometric distance to decide which is the nearest item may not always
be reasonable or even possible: the type of the input may, for example, be text,
where it is not clear how the items are drawn in a geometric representation and
how distances should be measured. You should therefore choose the distance
metric on a case-by-case basis.
In the MNIST digit recognition case, one common way to measure image
similarity is to compare the images pixel by pixel. In other words, we compare
the pixels in the top-left corner of each image to one another, and the more
similar their colors (shades of gray) are, the more similar the two images are.
We also compare the pixels in the bottom-right corner of each image, and all
pixels in between. This technique is quite sensitive to shifting or scaling the
images: if we take an image of a “1” and shift it ever so slightly either left or
right, the outcome is that the two images (before and after the shift) are very
different because the black pixels are in different positions in the two images.
Fortunately, the MNIST data has been preprocessed by centering the images
so that this problem is alleviated.
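
A small sketch of the pixel-by-pixel comparison in Python (our own
illustration, with tiny 3×3 “images” rather than real MNIST digits):

import numpy as np

def pixel_distance(img_a, img_b):
    # Sum of absolute pixel-by-pixel differences between two
    # equally sized grayscale images; smaller means more similar.
    return np.abs(img_a.astype(float) - img_b.astype(float)).sum()

one = np.array([[0, 1, 0],
                [0, 1, 0],
                [0, 1, 0]])
shifted_one = np.array([[1, 0, 0],
                        [1, 0, 0],
                        [1, 0, 0]])

print(pixel_distance(one, one))          # 0.0: identical images
print(pixel_distance(one, shifted_one))  # 6.0: a small shift changes many pixels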

Using nearest neighbors to predict user behavior


A typical example of an application of the nearest neighbor method is
predicting user behavior in AI applications such as recommendation systems.

The idea is to use the very simple principle that users with similar past
behavior tend to have similar future behavior. Imagine a music
recommendation system that collects data about users’ listening behavior.
Let’s say you have listened to 1980s disco music (just for the sake of
argument). One day, the service provider gets their hands on a hard-to-find
1980s disco classic, and adds it into the music library. The system now needs
to predict whether you will like it or not. One way of doing this is to use
information about the genre, the artist, and other metadata, entered by the
good people of the service provider. However, this information is relatively
scarce and coarse and it will only be able to give rough predictions.

What current recommendation systems use instead of the manually entered
metadata is something called collaborative filtering. The collaborative aspect
of it is that it uses other users’ data to predict your preferences. The word
“filter” refers to the fact that you will only be recommended content that
passes through a filter: content that you are likely to enjoy will pass, other
content will not (these kinds of filters may lead to the so-called filter bubbles,
which we mentioned in Chapter 1; we will return to them later).

Now let’s say that other users who have listened to 80s disco music enjoy the
new release and keep listening to it again and again. The system will identify
the similar past behavior that you and other 80s disco fanatics share, and
since other users like you enjoy the new release, the system will predict that
you will too. Hence it will show up at the top of your recommendation list. In an
alternative reality, maybe the added song is not so great and other users with
similar past behavior as yours don’t really like it. In that case, the system
wouldn’t bother recommending it to you, or at least it wouldn’t be at the top of
the list of recommendations for you.

The following exercise will illustrate this idea.



Exercise 14: Customers who bought similar products


In this exercise, we will build a simple recommendation system for an online
shopping application where the users' purchase history is recorded and used to
predict which products the user is likely to buy next.
We have data from six users. For each user, we have recorded their recent
shopping history of four items and the item they bought after buying these four
items:

User | Shopping history | Purchase
Sanni | boxing gloves, Moby Dick (novel), headphones, sunglasses | coffee beans
Jouni | t-shirt, coffee beans, coffee maker, coffee beans | coffee beans
Janina | sunglasses, sneakers, t-shirt, sneakers | ragg wool socks
Henrik | 2001: A Space Odyssey (dvd), headphones, t-shirt, boxing gloves | flip flops
Ville | t-shirt, flip flops, sunglasses, Moby Dick (novel) | sunscreen
Teemu | Moby Dick (novel), coffee beans, 2001: A Space Odyssey (dvd), headphones | coffee beans

The most recent purchase is the one in the rightmost column, so for example,
after buying a t-shirt, flip flops, sunglasses, and Moby Dick (novel), Ville bought
sunscreen. Our hypothesis is that after buying similar items, other users are also
likely to buy sunscreen.

To apply the nearest neighbor method, we need to define what we mean by
nearest. This can be done in many different ways, some of which work better
than others. Let’s use the shopping history to define the similarity (“nearness”)
by counting how many of the items have been purchased by both users.

For example, users Ville and Henrik have both bought a t-shirt, so their similarity
is 1. Note that flip flops doesn't count because we don't include the most recent
purchase when calculating the similarity — it is reserved for another purpose.

Our task is to predict the next purchase of customer Travis who has bought the
following products:

User | Shopping history | Purchase
Travis | green tea, t-shirt, sunglasses, flip flops | ?

You can think of Travis as our test data, and the above six users as our
training data.

Proceed as follows:
1. Calculate the similarity of Travis relative to the six users in the training
data (done by adding together the number of similar purchases by the
users).
2. Having calculated the similarities, identify the user who is most similar to
Travis by selecting the largest of the calculated similarities.
3. Predict what Travis is likely to purchase next by looking at the most recent
purchase (the rightmost column in the table) of the most similar user from
the previous step.
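
As a minimal sketch, the three steps can be carried out in Python as follows,
using the data from the table above:

histories = {
    "Sanni":  ["boxing gloves", "Moby Dick (novel)", "headphones", "sunglasses"],
    "Jouni":  ["t-shirt", "coffee beans", "coffee maker", "coffee beans"],
    "Janina": ["sunglasses", "sneakers", "t-shirt", "sneakers"],
    "Henrik": ["2001: A Space Odyssey (dvd)", "headphones", "t-shirt", "boxing gloves"],
    "Ville":  ["t-shirt", "flip flops", "sunglasses", "Moby Dick (novel)"],
    "Teemu":  ["Moby Dick (novel)", "coffee beans", "2001: A Space Odyssey (dvd)", "headphones"],
}
purchases = {
    "Sanni": "coffee beans", "Jouni": "coffee beans", "Janina": "ragg wool socks",
    "Henrik": "flip flops", "Ville": "sunscreen", "Teemu": "coffee beans",
}
travis = ["green tea", "t-shirt", "sunglasses", "flip flops"]

# Step 1: similarity = number of distinct items shared with Travis's history.
similarities = {user: len(set(items) & set(travis)) for user, items in histories.items()}
# Step 2: pick the most similar user.
nearest = max(similarities, key=similarities.get)
# Step 3: predict that user's most recent purchase.
print(nearest, purchases[nearest])  # prints: Ville sunscreen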

Who is the user most similar to Travis?

Your answer:
Ville

Your answer is correct


Correct. When you calculate the similarities between Travis and all the other
users, Ville and Travis will have the largest similarity with a similarity of 3.

What is the predicted purchase for Travis?

Your answer:
sunscreen

Your answer is correct


Correct. Since Ville's latest purchase was sunscreen, we will recommend it also to
Travis.

2/2 answers correct

In the above example, we only had six users’ data and our prediction was
probably very unreliable. However, online shopping sites often have millions
of users, and the amount of data they produce is massive. In many cases,
there is a horde of users whose past behavior is very similar to yours, and
whose purchase history gives a pretty good indication of your interests.

These predictions can also be self-fulfilling prophecies in the sense that you
are more likely to buy a product if it is recommended to you by the system,
which makes it tricky to evaluate how well they actually work. The same kinds
of recommendation systems are also used to recommend music, movies,
news, and social media content to users. In the context of news and social
media, filters created by such systems can lead to filter bubbles.

Exercise 15: Filter bubbles


As discussed above, recommending news or social media content that a user is
likely to click or like may lead to filter bubbles where the users only see content
that is in line with their own values and views.

1. Do you think that filter bubbles are harmful? After all, they are created by
recommending content that the user likes. What negative consequences, if
any, may be associated with filter bubbles? Feel free to look for more
information from other sources.
2. Think of ways to avoid filter bubbles while still being able to recommend
content to suit personal preferences. Come up with at least one suggestion.
You can look for ideas from other sources, but we'd like to hear your own
ideas too!

Note: your answer should be at least a few sentences for each part.

Note: On using ChatGPT and similar models: We appreciate you putting an AI
tool into use. Feel free to consult it, but note that since we are interested in your
own opinion, we expect you to write your final answers yourself.

Your answer:
1. Indeed, filter bubbles might be pretty harmful in certain scenarios. Although
they might be useful for advertisers and recommendation companies, as users
are more likely to look for products similar to those they have tried, they can
also be used with bad intentions. One risk is related to a neuromarketing
strategy: products that are shown more frequently become more familiar to
users, thus making the users trust and believe the ads, perceiving them as
needs rather than wants. This is related to a psychological concept called
cognitive ease, where repeated exposure to information makes it seem true.
Users affected by this trick tend to prefer the product they see more
frequently, hence creating an artificial sense of cognitive ease. (More about
cognitive ease can be learned from the following video on one of my favorite
YouTube channels, Veritasium:
https://www.youtube.com/watch?v=cebFWOlx848). From a user’s perspective,
being brainwashed is far from harmless. This psychological trick raises another
problem: the spreading of misinformation. As mentioned earlier, users are more
likely to believe information that is repeated. If a similar product is seen plenty
of times, despite being ineffective, users will perceive similar products as good
due to filter bubbles. Filter bubbles can also limit the users from seeing
different points of view and, in some sense, prevent them from leaving their
comfort zones. Additionally, constantly seeing similar content might reduce
critical thinking skills because users are not encouraged to question and
analyze different viewpoints.
2. One way to avoid filter bubbles is to allow the user to adjust the diversity
level of the content they see. If users can understand that they are inside a
filter bubble, it could be easier to control it. Nowadays, plenty of apps let users
select their areas of interest so that the program brings appropriate adverts.
This is a pretty good way of handling diversity in recommendations, as it gives
the users control over their content. Another way could be to automatically
include a few different recommendations for every certain number of similar
items. This wouldn’t deter users from using the app and would also introduce
them to new content while maintaining their usual flow. For example, as a
YouTube user, I sometimes encounter recommended videos from different
channels with diverse content.

Example answer

Filters as such are useful. They recommend content such as music that we like.
The bubble phenomenon, where users get a biased view of facts and opinions, on
the other hand, is clearly harmful. There are no "alternative facts" – the
alternative of a fact is not a fact – but information is always presented from some
point of view. If we are not exposed to more than one point of view, we can easily
end up holding a biased worldview.

Getting completely rid of filter bubbles is probably not a good idea. After all, we
have always liked different things and been interested in different things. In the
days of print media, the newspaper that we'd read was our filter bubble, which
made sure that the bubble didn't get too small.
We believe that the first step to avoid the harmful effects of filter bubbles is to
recognize when we are inside one. To this end, it is helpful if the applications we
use clearly indicate that the recommendations we see don't represent a
balanced overall view of the content. This can be achieved by letting the user
explore the content also through other views than a list of recommendations. In
Spotify, you can choose music based on its genre, and Netflix and HBO provide
recommendations in various different categories.

III.Regression
Our main learning objective in this section is another nice example of supervised learning
methods, almost as simple as the nearest neighbor classifier: linear regression. We’ll also
introduce its close cousin, logistic regression.

Note

The difference between classification and regression


There is a small but important difference in the kind of predictions that we
should produce in different scenarios. While for example the nearest neighbor
classifier chooses a class label for any item out of a given set of alternatives (like
spam/ham, or 0,1,2,...,9), linear regression produces a numerical prediction that
is not constrained to be an integer (a whole number as opposed to something
like 3.14). So linear regression is better suited in situations where the output
variable can be any number like the price of a product, the distance to an
obstacle, the box-office revenue of the next Star Wars movie, and so on.
The basic idea in linear regression is to add up the effects of each of the
feature variables to produce the predicted value. The technical term for the
adding up process is linear combination. The idea is very straightforward, and
it can be illustrated by your shopping bill.

Note

Thinking of linear regression as a shopping bill


Suppose you go to the grocery store and buy 2.5kg potatoes, 1.0kg carrots, and
two bottles of milk. If the price of potatoes is 2€ per kg, the price of carrots is 4€
per kg, and a bottle of milk costs 3€, then the bill, calculated by the cashier, totals
2.5 × 2€ + 1.0 × 4€ + 2 × 3€ = 15€. In linear regression, the amount of potatoes,
carrots, and milk are the inputs in the data. The output is the cost of your
shopping, which clearly depends on both the price and how much of each
product you buy.
The word linear means that the increase in the output when one input feature
is increased by some fixed amount is always the same. In other words,
whenever you add, say, two kilos of carrots into your shopping basket, the bill
goes up 8€. When you add another two kilos, the bill goes up another 8€, and
if you add half as much, 1kg, the bill goes up exactly half as much, 4€.
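
In Python, the shopping bill is just this kind of linear combination (a small
sketch; the prices play the role of the weights):

amounts = [2.5, 1.0, 2]    # kg of potatoes, kg of carrots, bottles of milk
prices = [2.0, 4.0, 3.0]   # euros per unit: these are the "weights"

# The linear combination: multiply each amount by its price and add up.
bill = sum(amount * price for amount, price in zip(amounts, prices))
print(bill)  # 15.0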

Key terminology

Coefficients or weights
In linear regression terminology, the prices of the different products would be
called coefficients or weights (this may appear confusing since we measured the
amount of potatoes and carrots by weight, but do not let yourself be tricked by
this). One of the main advantages of linear regression is its easy interpretability:
the learned weights may in fact be more interesting than the predictions of the
outputs.

For example, when we use linear regression to predict the life expectancy, the
weight of smoking (cigarettes per day) is about minus half a year, meaning that
smoking one cigarette more per day takes you on the average half a year closer
to termination. Likewise, the weight of vegetable consumption (handful of
vegetables per day) has weight plus one year, so eating a handful of greens every
day gives you on the average one more year.

Exercise 16: Linear regression


Suppose that an extensive study is carried out, and it is found that in a particular
country, the life expectancy (the average number of years that people live)
among non-smoking women who don't eat any vegetables is 80 years. Suppose
further that on the average, men live 5 years less. Also take the numbers
mentioned above: every cigarette per day reduces the life expectancy by half a
year, and a handful of veggies per day increases it by one year.

Calculate the life expectancies for the following example cases:


For example, the first case is a male (subtract 5 years), smokes 8 cigarettes per
day (subtract 8 × 0.5 = 4 years), and eats two handfuls of veggies per day (add 2
× 1 = 2 years), so the predicted life expectancy is 80 - 5 - 4 + 2 = 73 years.

Gender | Smoking (cigarettes per day) | Vegetables (handfuls per day) | Life expectancy (years)
male | 8 | 2 | 73
male | 0 | 6 | A
female | 16 | 1 | B
female | 0 | 4 | C

Your task: Enter the correct value as an integer (whole number) for the missing
sections A, B, and C above.

Your answer:
81

Your answer is correct


Correct. A: 80 - 5 + 6 = 81

Your answer:
73

Your answer is correct


Correct. B: 80 - 8 + 1 = 73

Your answer:
84

Your answer is correct


Correct. C: 80 + 4 = 84

3/3 answers correct


In the above exercise, the life expectancy of non-smoking, veggie-hating
women, 80 years, was the starting point for the calculation. The technical term
for the starting point is the intercept. We will return to this below when we
discuss how to learn linear regression models from data.
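
For the record, here is a small sketch (our own Python, not part of the
exercise) of the predictor used above, with the intercept of 80 years and the
weights from the text:

def life_expectancy(gender, cigarettes_per_day, veggie_handfuls_per_day):
    intercept = 80.0                                   # non-smoking, veggie-free women
    gender_effect = -5.0 if gender == "male" else 0.0  # men live 5 years less
    return (intercept + gender_effect
            - 0.5 * cigarettes_per_day        # each daily cigarette: half a year less
            + 1.0 * veggie_handfuls_per_day)  # each daily handful: one year more

print(life_expectancy("male", 8, 2))     # 73.0
print(life_expectancy("male", 0, 6))     # 81.0 (A)
print(life_expectancy("female", 16, 1))  # 73.0 (B)
print(life_expectancy("female", 0, 4))   # 84.0 (C)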

Learning linear regression

Above, we discussed how predictions are obtained from linear regression
when both the weights and the input features are known. So we are given the
inputs and the weights, and we can produce the predicted output.

When we are given the inputs and the outputs for a number of items, we can
find the weights such that the predicted output matches the actual output as
well as possible. This is the task solved by machine learning.

Note

Example
Continuing the shopping analogy, suppose we were given the contents of a
number of shopping baskets and the total bill for each of them, and we were
asked to figure out the price of each of the products (potatoes, carrots, and so
on). From one basket, say 1kg of sirloin steak, 2kg of carrots, and a bottle of
Chianti, even if we knew that the total bill is 35€, we couldn’t determine the
prices because there are many sets of prices that will yield the same total bill.
With many baskets, however, we will usually be able to solve the problem.
But the problem is made harder by the fact that in the real world, the actual
output isn’t always fully determined by the input, because of various factors
that introduce uncertainty or "noise" into the process. You can think of
shopping at a bazaar where the prices for any given product may vary from
time to time, or a restaurant where the final damage includes a variable
amount of tip. In such situations, we can estimate the prices but only with
some limited accuracy.

Finding the weights that optimize the match between the predicted and the
actual outputs in the training data is a classical statistical problem dating back
to the 1800s, and it can be easily solved even for massive data sets.

We will not go into the details of the actual weight-finding algorithms, such as
the classical least squares technique, simple as they are. However, you can
get a feel of finding trends in data in the following exercises.

Visualizing linear regression


A good way to get a feel for what linear regression can tell us is to draw a
chart containing our data and our regression results. As a simple toy example,
our data set has one input variable, the number of cups of coffee an employee
drinks per day, and as the output the number of lines of code written per day
by that employee. This is not a real data set, as obviously there are factors
other than coffee that affect an employee’s productivity, and they interact in
complex ways. The increase in productivity from increasing the amount of
coffee will also hold only up to a certain point, after which the jitters distract
too much.

[Figure: scatter plot of cups of coffee per day (x-axis, 0 to 10) against lines of
code written (y-axis, 0 to 60), with a fitted regression line.]
When we present our data in the chart above as points, where one point
represents one employee, we can see that there is obviously a trend suggesting
that drinking more coffee results in more lines of code being written (recall that
this is completely made-up data). From this data set we can learn the coefficient,
or weight, related to coffee consumption, and by eye we can already say that it
seems to be somewhere close to five, since for each cup of coffee consumed the
number of lines programmed seems to go up by roughly five. For example,
employees who drink around two cups of coffee per day seem to produce around
20 lines of code per day, and similarly at four cups of coffee, the number of
lines produced is around 30.

It can also be noted that employees who do not drink coffee at all produce
code too: the graph shows this baseline to be about ten lines. This number is
the intercept term that we mentioned earlier. The intercept is another
parameter in the model, just like the weights, and it can be learned from the
data. Just as in the life expectancy example, it can be thought of as the
starting point of our calculations before we add in the effects of the input
variable or variables, be it coffee cups in this example, or cigarettes and
vegetables in the previous one.

The line in the chart represents our predicted outcome, where we have
estimated the intercept and the coefficient by using an actual linear regression
technique called least squares. This line can be used to predict the number of
lines produced when the input is the number of cups of coffee. Note that we
can obtain a prediction even for partial cups (half a cup, a quarter of a cup,
and so on).
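
As a sketch of how such a fit can be computed, the following uses NumPy’s
least squares polynomial fit on made-up coffee data chosen to roughly match
the chart (intercept near ten, slope near five):

import numpy as np

cups = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
lines = np.array([11, 14, 20, 26, 29, 36, 40, 46, 50])  # invented data

# Fit a straight line (degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(cups, lines, deg=1)
print(f"intercept ~ {intercept:.1f}, slope ~ {slope:.1f}")

# The fitted line gives predictions even for partial cups.
print("prediction for 2.5 cups:", intercept + slope * 2.5)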
Exercise 17: Life expectancy and education (part 1 of 2)
Let’s study the link between the total number of years spent in school (including
everything between preschool and university) and life expectancy. Here is data
from three different countries displayed in a figure represented by dots:

[Figure: scatter plot of years of education (x-axis, 0 to 22) against life
expectancy in years (y-axis, 0 to 90), showing the three data points and an
adjustable line.]

We have one country where the average number of years in school is 10 and life
expectancy is 57 years, another country where the average number of years in
school is 13 and life expectancy is 53 years, and a third country where the
average number of years in school is 20 and life expectancy is 80 years.

You can drag the end points of the solid line to position the line in such a way
that it follows the trend of the data points. Note that you will not be able to get
the line fit perfectly with the data points, and this is fine: some of the data points
will lie above the line, and some below it. The most important part is that the line
describes the overall trend.

After you have positioned the line you can use it to predict the life expectancy.

Given the data, what can you tell about the life expectancy of people who have 15
years of education? Important: Notice that even if you can obtain a specific
prediction, down to a fraction of a year, by adjusting the line, you may not
necessarily be able to give a confident prediction. Take the limited amount of
data into account when giving your answer.

It is exactly 64 years
It is certainly between 60 and 70 years
It is certainly 70 years or less
It is probably less than 90

The answer is correct


Correct. The few data points that we have make it impossible to say almost
anything about the life expectancy based on the data alone. Of course, one can
know a great deal about life expectancy from other sources, but the data in the
above chart is insufficient to do so. The first choice is clearly stating too much.
While the intervals in the second and the third choice are likely to be valid, the
word 'certainly' makes them unjustified. There is a chance, greater than zero,
that the value turns out to be, for example, greater than 70. Thus the only choice
that we can be comfortable with is the fourth one.

Exercise 18: Life expectancy and education (part 2 of 2)


In the previous exercise, we only had data from three countries. The full data set
consists of data from 14 different countries, presented here in a graph:

[Figure: scatter plot of years of education (x-axis, 0 to 22) against life
expectancy in years (y-axis, 0 to 90), showing data from 14 countries.]

Based on this data, would your prediction about the life expectancy of people
with 15 years of education change? If so, why?

Which of the following options would best match your estimate of the life
expectancy for people with 15 years of education? Choose the most specific
option that you think is justified by fitting the straight line model to the above
data.

Probably between 45 and 50 years
Probably between 50 and 90 years
Probably between 69 and 71 years
Probably between 15 and 150 years

The answer is correct


The first choice would clearly be an odd estimate since the data strongly
suggest that very few countries have life expectancy less than 50, and none of
the data points with more than 12 years of education fall below 50. We can't be
sure, of course, but life expectancy between 45 and 50 years would in this case
be highly unexpected. The second choice is correct because it fits the general
trend, and all data points with more than 12 years of education fall within this
interval. The interval 69 to 71 years in the third choice could well include the
actual value, but based on the above data, it would be too bold to claim to know
the outcome with such high accuracy. The interval 15 to 150 years of the fourth
choice would almost certainly include the actual value, but we think that it
would be a poor summary of what we can learn from the data for the reason
that it is too vague.
It should be pointed out that studies like those used in the above exercises
cannot identify causal relationships. In other words, from this data alone, it is
impossible to say whether studying actually increases life expectancy through
a better-informed and healthier lifestyle or other mechanisms, or whether the
apparent association between life expectancy and education is due to
underlying factors that affect both. It is likely that, for example, in countries
where people tend to be highly educated, nutrition, healthcare, and safety are
also better, which increases life expectancy. With this kind of simple analysis,
we can only identify associations, which can nevertheless be useful for
prediction.

Machine learning applications of linear regression

Linear regression is truly the workhorse of many AI and data science
applications. It has its limits, but they are often compensated by its simplicity,
interpretability and efficiency. Linear regression has been successfully used in
the following problems, to give a few examples:

• prediction of click rates in online advertising
• prediction of retail demand for products
• prediction of box-office revenue of Hollywood movies
• prediction of software cost
• prediction of insurance cost
• prediction of crime rates
• prediction of real estate prices

Could we use regression to predict labels?

As we discussed above, linear regression and the nearest neighbor method
produce different kinds of predictions. Linear regression outputs numerical
predictions while the nearest neighbor method produces labels from a fixed
set of alternatives ("classes").

Where linear regression excels compared to nearest neighbors is
interpretability. What do we mean by this? You could say that in a way, the
nearest neighbor method and any single prediction that it produces are easy
to interpret: it’s just the nearest training data element! This is true, but when it
comes to the interpretability of the learned model, there is a clear difference.
Interpreting the trained model in nearest neighbors in a similar fashion as the
weights in linear regression is impossible: the learned model is basically the
whole data, and it is usually way too big and complex to provide us with much
insight. So what if we’d like to have a method that produces the same kind of
outputs as the nearest neighbor, labels, but is interpretable like linear
regression?
Logistic regression to the rescue

Well, there is good news for you: we can turn the linear regression method’s
outputs into predictions about labels. The technique for doing this is called
logistic regression. We will not go into the technicalities; suffice it to say that in
the simplest case, we take the output from linear regression, which is a
number, and predict one label A if the output is greater than zero, and another
label B if the output is less than or equal to zero. Actually, instead of just
predicting one class or another, logistic regression can also give us a
measure of uncertainty of the prediction. So if we are predicting whether a
customer will buy a new smartphone this year, we can get a prediction that
customer A will buy a phone with probability 90%, but for another, less
predictable customer, we can get a prediction that they will not buy a phone
with 55% probability (or in other words, that they will buy one with 45%
probability).

It is also possible to use the same trick to obtain predictions over more than
two possible labels, so instead of always predicting either yes or no (buy a
new phone or not, fake news or real news, and so forth), we can use logistic
regression to identify, for example, handwritten digits, in which case there are
ten possible labels.

An example of logistic regression

Let’s suppose that we collect data on students taking an introductory course in
cookery. In addition to basic information such as the student ID, name,
and so on, we also ask the students to report how many hours they studied for
the exam (however one studies for a cookery exam – by cooking, probably?) –
and hope that they are more or less honest in their reports. After the exam, we
will know whether each student passed the course or not. Some data points
are presented below:

Student ID | Hours studied
24 | 15
41 | 9.5
58 | 2
101 | 5
103 | 6.5
215 | 6
Based on the table, what kind of conclusion could you draw about the
relationship between the hours studied and passing the exam? If we had data
from hundreds of students, maybe we could see the amount of study needed
to pass the course. We can present this data in a chart as you can see below.

Exercise 19: Logistic regression


[Figure: chart with hours studied on the x-axis (-5 to 20) and probability of
passing on the y-axis (0% to 100%), showing a logistic regression curve.]

Each dot on the figure corresponds to one student. On the bottom of the figure
we have the scale for how many hours the student studied for the exam, and the
students who passed the exam are shown as dots at the top of the chart, and the
ones who failed are shown at the bottom. We’ll use the scale on the left to
indicate the predicted probability of passing, which we’ll get from the logistic
regression model as we explain just below. Based on this figure, you can see
roughly that students who spent longer studying had better chances of passing
the course. Especially the extreme cases are intuitive: with less than an hour’s
work, it is very hard to pass the course, but with a lot of work, most will be
successful. But what about those who spend time studying somewhere in
between the extremes? If you study for 6 hours, what are your chances of passing?

We can quantify the probability of passing using logistic regression. The curve in
the figure can be interpreted as the probability of passing: for example, after
studying for five hours, the probability of passing is a little over 20%. We will not
go into the details on how to obtain the curve, but it will be similar to how we
learn the weights in linear regression.
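
A minimal sketch of fitting such a curve with scikit-learn’s logistic
regression; the hours and pass/fail outcomes below are invented to resemble
the figure, not the course’s actual data:

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])  # 1 = passed the exam

model = LogisticRegression().fit(hours, passed)
# predict_proba returns [P(fail), P(pass)] for each input.
print(model.predict_proba([[5.0]])[0][1])   # probability of passing after 5 hours
print(model.predict_proba([[10.5]])[0][1])  # close to the 80% region in the figure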

If you wanted to have an 80% chance of passing a university exam, based on the
above figure, how many hours should you approximately study for?

6-7 hours
7-8 hours
8-9 hours
10-11 hours

The answer is correct


Correct. The other answers give roughly a 30%, a 50%, and a 70% chance of
passing respectively. To have an 80% chance of passing, you should study for
around 10-11 hours.
Logistic regression is also used in a great variety of real-world AI applications
such as predicting financial risks, in medical studies, and so on. However, like
linear regression, it is also constrained by the linearity property and we need
many other methods in our toolbox. We will return to the linearity issue later
when we discuss neural networks.

The limits of machine learning

To summarize, machine learning is a very powerful tool for building AI
applications. In addition to the nearest neighbor method, linear regression,
and logistic regression, there are literally hundreds, if not thousands, of
different machine learning techniques, but they all boil down to the same
thing: trying to extract patterns and dependencies from data and using them
either to gain understanding of a phenomenon or to predict future outcomes.

Machine learning can be a very hard problem and we can’t usually achieve a
perfect method that would always produce the correct label. However, in most
cases, a good but not perfect prediction is still better than none. Sometimes
we may be able to produce better predictions by ourselves but we may still
prefer to use machine learning because the machine will make its predictions
faster and it will also keep churning out predictions without getting tired. Good
examples are recommendation systems that need to predict what music, what
videos, or what ads are more likely to be of interest to you.

The factors that affect how good a result we can achieve include:

• The hardness of the task: in handwritten digit recognition, if the digits
are written very sloppily, even a human can’t always guess correctly
what the writer intended
• The machine learning method: some methods are far better for a
particular task than others
• The amount of training data: from only a few examples, it is impossible
to obtain a good classifier
• The quality of the data

Note

Data quality matters


In the beginning of this chapter, we emphasized the importance of having enough
data and the risks of overfitting. Another equally important factor is
the quality of the data. In order to build a model that generalizes well to data
outside of the training data, the training data needs to contain enough
information that is relevant to the problem at hand. For example, if you create an
image classifier that tells you what the image given to the algorithm is about, and
you have trained it only on pictures of dogs and cats, it will classify everything it
sees as either a dog or a cat. This would make sense if the algorithm is used in an
environment where it will only see cats and dogs, but not if it is expected to see
boats, cars, and flowers as well.

We’ll return to potential problems caused by “biased” data.


It is also important to emphasize that different machine learning methods are
suitable for different tasks. Thus, there is no single best method for all
problems ("one algorithm to rule them all..."). Fortunately, one can try out a
large number of different methods and see which one of them works best in
the problem at hand.

This leads us to a point that is very important but often overlooked in practice:
what it means to work better. In the digit recognition task, a good method
would of course produce the correct label most of the time. We can measure
this by the classification error: the fraction of cases where our classifier
outputs the wrong class. In predicting apartment prices, the quality measure is
typically something like the difference between the predicted price and the
final price for which the apartment is sold. In many real-life applications, it is
also worse to err in one direction than in another: setting the price too high
may delay the process by months, but setting the price too low will mean less
money for the seller. And to take yet another example, failing to detect a
pedestrian in front of a car is a far worse error than falsely detecting one when
there is none.

As mentioned above, we can’t usually achieve zero error, but perhaps we will
be happy with error less than 1 in 100 (or 1%). This too depends on the
application: you wouldn’t be happy to have only 99% safe cars on the streets,
but being able to predict whether you’ll like a new song with that accuracy
may be more than enough for a pleasant listening experience. Keeping the
actual goal in mind at all times helps us make sure that we create actual
added value.

After completing Chapter 4 you should be able to:

• Explain why machine learning techniques are used
• Distinguish between unsupervised and supervised machine learning
scenarios
• Explain the principles of three supervised learning methods: the
nearest neighbor method, linear regression, and logistic regression

I.Neural network basics


Our next topic, deep learning and neural networks, tends to attract more interest than many
of the other topics.

One of the reasons for the interest is the hope to understand our own mind,
which emerges from neural processing in our brain. Another reason is the
advances in machine learning achieved in recent years by combining
massive data sets and deep learning techniques.

What are neural networks?

To better understand the whole, we will start by discussing the individual units
that make it up. A neural network can mean either a “real” biological neural
network such as the one in your brain, or an artificial neural network simulated
in a computer.

Key terminology

Deep learning
Deep learning refers to certain kinds of machine learning techniques where
several “layers” of simple processing units are connected in a network so that the
input to the system is passed through each one of them in turn. This architecture
has been inspired by the processing of visual information in the brain coming
through the eyes and captured by the retina. This depth allows the network to
learn more complex structures without requiring unrealistically large amounts of
data.

Neurons, cell bodies, and signals


A neural network, whether biological or artificial, consists of a large number of
simple units, neurons, that receive and transmit signals to each other. The
neurons are very simple processors of information, consisting of a cell body and
wires that connect the neurons to each other. Most of the time, they do nothing
but sit still and watch for signals coming in through the wires.
Dendrites, axons, and synapses
In the biological lingo, we call the wires that provide the input to the neurons
dendrites. Sometimes, depending on the incoming signals, the neuron may fire
and send a signal out for the other neurons to receive. The wire that transmits
the outgoing signal is called an axon. Each axon may be connected to one or more
dendrites at intersections that are called synapses.
Isolated from its fellow-neurons, a single neuron is quite unimpressive, and
capable of only a very restricted set of behaviors. When connected to each
other, however, the system resulting from their concerted action can become
extremely complex. To find evidence for this, look no further than (to use legal
jargon) "Exhibit A": your brain! The behavior of the system is determined by
the ways in which the neurons are wired together. Each neuron reacts to the
incoming signals in a specific way that can also adapt over time. This
adaptation is known to be the key to functions such as memory and learning.

Exercise 20. Elements of a neural network


Label the different components of a neuron into the diagram below. Hint: The
input of the neuron comes from the left and the output goes to the right.

[Diagram: a neuron with four parts labeled A–D; the input arrives from the
left and the output leaves to the right.]

Synapse (connection) – the answer is correct
Dendrite (input) – the answer is correct
Cell body – the answer is correct
Axon (output) – the answer is correct

4/4 answers correct

Why develop artificial neural networks?

One purpose of building artificial models of the brain is neuroscience: the
study of the brain and the nervous system in general. It is tempting to think
that by mapping the human brain in enough detail, we can discover the
secrets of human and animal cognition and consciousness.

Note

Modeling the brain


The BRAIN Initiative led by American neuroscience researchers is pushing
forward technologies for imaging, modeling, and simulating the brain at a finer
and larger scale than before. Some brain research projects are very ambitious in
terms of objectives. The Human Brain Project promised in 2012 that “the
mysteries of the mind can be solved – soon”. After years of work, the Human
Brain Project was facing questions about when the billion euros invested by the
European Union will deliver what was promised, even though, to be fair, some
less ambitious milestones have been achieved.
However, even though we seem to be almost as far as ever from understanding
the mind and consciousness, there are clear milestones that have been achieved
in neuroscience. Through a better understanding of the structure and function
of the brain, we are already reaping some concrete rewards. We can, for
instance, identify abnormal functioning and try to help the brain avoid it and
reinstate normal operation. This can lead to life-changing new medical
treatments for people suffering from neurological disorders: epilepsy,
Alzheimer’s disease, problems caused by developmental disorders or damage
caused by injuries, and so on.

Note

Looking to the future: brain computer interfaces


One research direction in neuroscience is brain-computer interfaces that allow
interacting with a computer by simply thinking. The current interfaces are very
limited, and they can be used, for example, to reconstruct on a very rough level
what a person is seeing, or to control robotic arms by thought. Perhaps someday
we will be able to implement a thought-reading machine that allows precise
instructions, but currently such machines belong to science fiction. It is also
conceivable that we could feed information into the brain by stimulating it with
small electrical pulses. Such stimulation is currently used for therapeutic
purposes. Feeding in detailed information such as specific words, ideas,
memories, or emotions is at least currently science fiction rather than reality,
but obviously we know neither the limits of such technology, nor how hard it is
to reach them.
We’ve drifted a little astray from the topic of the course. In fact, another main
reason for building artificial neural networks has little to do with understanding
biological systems. It is to use biological systems as an inspiration to build
better AI and machine learning techniques. The idea is very natural: the brain
is an amazingly complex information processing system capable of a wide
range of intelligent behaviors (plus occasionally some not-so-intelligent ones),
and therefore, it makes sense to look for inspiration in it when we try to create
artificially intelligent systems.

Neural networks have been a major trend in AI since the 1960s. We’ll return to
the waves of popularity in the history of AI in the final part. Currently neural
networks are again at the very top of the list as deep learning is used to
achieve significant improvements in many areas such as natural language
and image processing, which have traditionally been sore points of AI.

What is so special about neural networks?


The case for neural networks in general as an approach to AI is based on a
similar argument as that for logic-based approaches. In the latter case, it was
thought that in order to achieve human-level intelligence, we need to simulate
higher-level thought processes, and in particular, manipulation of symbols
representing certain concrete or abstract concepts using logical rules.

The argument for neural networks is that by simulating the lower-level,
“subsymbolic” data processing on the level of neurons and neural networks,
intelligence will emerge. This all sounds very reasonable, but keep in mind that
in order to build flying machines, we don’t build airplanes that flap their wings,
or that are made of bones, muscles, and feathers. Likewise, in artificial neural
networks, the internal mechanism of the neurons is usually ignored and the
artificial neurons are often much simpler than their natural counterparts. The
electro-chemical signaling mechanisms between natural neurons are also
mostly ignored in artificial models when the goal is to build AI systems rather
than to simulate biological systems.

Compared to how computers traditionally work, neural networks have certain
special features:

Neural network key feature 1

For one, in a traditional computer, information is processed in a central
processor (aptly named the central processing unit, or CPU for short), which
can only focus on doing one thing at a time. The CPU can retrieve data to be
processed from the computer’s memory, and store the result in the memory.
Thus, data storage and processing are handled by two separate components
of the computer: the memory and the CPU. In neural networks, the system
consists of a large number of neurons, each of which can process information
on its own, so that instead of having a CPU process each piece of information
one after the other, the neurons process vast amounts of information
simultaneously.

Neural network key feature 2

The second difference is that data storage (memory) and processing aren’t
separated like in traditional computers. The neurons both store and process
information, so there is no need to retrieve data from the memory for
processing. The data can be stored short term in the neurons themselves
(they either fire or not at any given time) or, for longer-term storage, in the
connections between the neurons – their so-called weights, which we will
discuss below.
Because of these two differences, neural networks and traditional computers
are suited for somewhat different tasks. Even though it is entirely possible to
simulate neural networks in traditional computers, which was the way they
were used for a long time, their maximum capacity is achieved only when we
use special hardware (computer devices) that can process many pieces of
information at the same time. This is called parallel processing. Incidentally,
graphics processors (or graphics processing units, GPUs) have this capability
and they have become a cost-effective solution for running massive deep
learning methods.

II.How neural networks are built


As we said earlier, neurons are very simple processing units. Having discussed linear and
logistic regression in Chapter 4, you can view the essential technical details of neural
networks as slight variations of the same idea.

Note

Weights and inputs


The basic artificial neuron model involves a set of adaptive parameters, called
weights like in linear and logistic regression. Just like in regression, these
weights are used as multipliers on the inputs of the neuron, which are added up.
The sum of the weights times the inputs is called the linear combination of the
inputs. You can probably recall the shopping bill analogy: you multiply the
amount of each item by its price per unit and add up to get the total.
If we have a neuron with six inputs (analogous to the amounts of the six
shopping items: potatoes, carrots, and so on), input1, input2, input3, input4,
input5, and input6, we also need six weights. The weights are analogous to
the prices of the items. We’ll call them weight1, weight2, weight3, weight4,
weight5, and weight6. In addition, we’ll usually want to include an intercept
term like we did in linear regression. This can be thought of as a fixed
additional charge due to processing a credit card payment, for example.

We can then calculate the linear combination like this: linear combination =
intercept + weight1 × input1 + ... + weight6 × input6 (where the ... is a
shorthand notation meaning that the sum includes all the terms from 1 to 6).

With some example numbers we could then get:

10.0 + 5.4 × 8 + (-10.2) × 5 + (-0.1) × 22 + 101.4 × (-5) + 0.0 × 2 + 12.0 × (-3) = -543.0
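
The same linear combination can be computed in Python, for example, like
this (a small sketch):

import numpy as np

intercept = 10.0
weights = np.array([5.4, -10.2, -0.1, 101.4, 0.0, 12.0])
inputs = np.array([8, 5, 22, -5, 2, -3])

# Dot product of weights and inputs, plus the intercept.
linear_combination = intercept + weights @ inputs
print(linear_combination)  # -543.0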

Exercise 21: Weights and inputs


In this exercise, consider the following expression that has both weights and
inputs: 10.0 + 5.4 × 8 + (-10.2) × 5 + (-0.1) × 22 + 101.4 × (-5) + 0.0 × 2 + 12.0 × (-3) = -543.0
What is the intercept term in the expression?
a) 543.0
b) 10.0
c) -3
d) 5.4

The answer is correct


Correct. The intercept is the number in the equation that is not multiplied by
any variable.
What are the inputs?
a) 8, 5, 22, -5, 2, -3
b) 5.4, 8, -10.2, 5, -0.1, 22, 101.4, -5, 0.0, 2, 12.0, -3
c) 5.4, -10.2, -0.1, 101.4, 0.0, 12.0
d) 43.2, -51.0, -2.2, -507.0, 0.0, -36.0

The answer is correct


Correct. Compare the equation in the exercise to the one above in the definition:
we defined the linear combination to be intercept + weights x inputs, so the
inputs are the second numbers in the multiplication.
Which of the inputs needs to be changed the least to increase the output by a certain amount?
a) first
b) second
c) third
d) fourth
The answer is not correct
Incorrect. The fourth weight is the largest one. To increase the output by some
predetermined amount, the fourth input would have to be changed the least.
What happens when the fifth input is incremented by one?
a) nothing
b) the output increases by one
c) the output increases by two
d) something else

The answer is correct


Correct. The weight for the fifth input is 0.0, which means that no matter what
value the fifth input has, its effect on the linear combination is always zero.
3/4 answers correct
The weights are almost always learned from data using the same ideas as in
linear or logistic regression, as discussed previously. But before we discuss
this in more detail, we’ll introduce another important stage that a neuron
completes before it sends out an output signal.

Activations and outputs

Once the linear combination has been computed, the neuron does one more
operation. It takes the linear combination and puts it through a so-called
activation function. Typical examples of the activation function include:

• identity function: do nothing and just output the linear combination
• step function: if the value of the linear combination is greater than zero,
send a pulse (ON), otherwise do nothing (OFF)
• sigmoid function: a “soft” version of the step function

Note that with the first activation function, the identity function, the neuron is
exactly the same as linear regression. This is why the identity function is
rarely used in neural networks: it leads to nothing new and interesting.
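
As a small sketch, the three activation functions can be written in Python as
follows:

import numpy as np

def identity(z):
    return z                         # output the linear combination as is

def step(z):
    return 1.0 if z > 0 else 0.0     # fire (ON) only when the input is positive

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # a "soft" step between 0 and 1

for z in (-5.0, 0.0, 5.0):
    print(z, identity(z), step(z), round(sigmoid(z), 3))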

Note

How neurons activate


Real, biological neurons communicate by sending out sharp, electrical pulses
called “spikes”, so that at any given time, their outgoing signal is either on or off
(1 or 0). The step function imitates this behavior. However, artificial neural
networks tend to use activation functions that output a continuous numerical
activation level at all times, such as the sigmoid function. Thus, to use a
somewhat awkward figure of speech, real neurons communicate by something
similar to the Morse code, whereas artificial neurons communicate by adjusting
the pitch of their voice as if yodeling.

The output of the neuron, determined by the linear combination and the
activation function, can be used to extract a prediction or a decision. For
example, if the network is designed to identify a stop sign in front of a self-
driving car, the input can be the pixels of an image captured by a camera
attached in front of the car, and the output can be used to activate a stopping
procedure that stops the car before the sign.

Learning or adaptation in the network occurs when the weights are adjusted
so as to make the network produce the correct outputs, just like in linear or
logistic regression. Many neural networks are very large, and the largest
contain hundreds of billions of weights. Optimizing them all can be a daunting
task that requires massive amounts of computing power.

Exercise 22: Activations and outputs


Below are graphs for three different activation functions with different
properties. First we have the sigmoid function, then the step function, and finally
the identity function. IMPORTANT: Note the different y-axis (vertical) scale in the
identity function chart.

[Figure: three charts showing the sigmoid, step, and identity activation
functions for inputs from -10 to 10. The sigmoid and step outputs range from
0 to 1; the identity output ranges from -10 to 10.]
Which of the activations described above gives:
the highest output for an input of 5?
Sigmoid
Step
Identity

The answer is correct


Correct. The identity function will give an output of 5 for an input of 5. The
sigmoid will output something very close to 1, and the step function will output
exactly 1.
the lowest output for an input of -5?
Sigmoid
Step
Identity

The answer is correct


Correct. The identity function will give an output of -5 for an input of -5. The
sigmoid will output something very close to 0, and the step function will output
exactly 0.
the highest output for an input of -2.5?
Sigmoid
Step
Identity



Correct. For an input of -2.5, the identity function will output -2.5, and the step
function will output 0. The sigmoid function will output something that is
higher than 0 but lower than 0.1.

Perceptron: the mother of all ANNs

The perceptron is simply a fancy name for the simple neuron model with the
step activation function we discussed above. It was among the very first
formal models of neural computation and because of its fundamental role in
the history of neural networks, it wouldn’t be unfair to call it the “mother of all
artificial neural networks”.

It can be used as a simple classifier in binary classification tasks. A method for learning the weights of the perceptron from data, called the Perceptron algorithm, was introduced by the psychologist Frank Rosenblatt in 1957. We will not study the Perceptron algorithm in detail. Suffice to say that it is just about as simple as the nearest neighbor classifier. The basic principle is to feed the network training data one example at a time: each misclassification leads to an update in the weights.
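
To give a flavor of just how simple it is, here is a minimal sketch of the algorithm in Python. The learning rate, the number of epochs, and the logical-OR toy data are our own illustrative choices; only the update-on-mistake rule is the Perceptron algorithm itself.

def train_perceptron(examples, num_inputs, epochs=10, learning_rate=1.0):
    # examples: list of (inputs, label) pairs, where labels are 0 or 1
    weights = [0.0] * num_inputs
    for _ in range(epochs):
        for inputs, label in examples:
            z = sum(w * x for w, x in zip(weights, inputs))
            prediction = 1 if z > 0 else 0    # step activation
            error = label - prediction        # 0 if correct, +1 or -1 if not
            if error != 0:                    # update the weights only on mistakes
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
    return weights

# Toy task: learn the logical OR of two binary inputs
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(train_perceptron(examples, num_inputs=2))   # converges to [1.0, 1.0]

Each misclassified example nudges the weights so that the same example is more likely to be classified correctly the next time around.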

Note
AI hyperbole
After its discovery, the Perceptron algorithm received a lot of attention, not least
because of optimistic statements made by its inventor, Frank Rosenblatt. A
classic example of AI hyperbole is a New York Times article published on July
8th, 1958:
“The Navy revealed the embryo of an electronic computer today that it expects
will be able to walk, talk, see, reproduce itself and be conscious of its existence.”

Please note that neural network enthusiasts are not at all the only ones inclined towards optimism. The rise and fall of the logic-based expert systems approach to AI had all the same hallmarks of AI hype, with people claiming that the final breakthrough was just around the corner. The outcome, both in the early 1960s and the late 1980s, was a collapse in research funding known as an AI winter.
The history of the debate that eventually led to the almost complete abandonment of the neural network approach in the 1960s, for more than two decades, is extremely fascinating. The article A Sociological Study of the Official History of the Perceptrons Controversy by Mikel Olazaran (published in Social Studies of Science, 1996) reviews the events from a sociology of science point of view. Reading it today is quite thought-provoking. The stories about celebrated AI heroes who had developed neural network algorithms that would soon reach the level of human intelligence and become self-conscious bear comparison to some statements made during the current hype. If you take a look at the article, even if you don't read all of it, it will provide an interesting background to today's news. Consider for example a piece in the MIT Technology Review published in September 2017, in which Jordan Jacobs, co-founder of the multimillion-dollar Vector Institute for AI, compares Geoffrey Hinton (a figurehead of the current deep learning boom) to Einstein because of his contributions to the development of neural network algorithms in the 1980s and later. Also recall the Human Brain Project mentioned in the previous section.

According to Hinton, “the fact that it doesn’t work is just a temporary annoyance” (although according to the article, Hinton is laughing about the above statement, so it’s hard to tell how serious he is about it). The Human Brain Project claims to be “close to a profound leap in our understanding of consciousness”. Doesn’t that sound familiar?

No-one really knows the future with certainty, but knowing the track record of
earlier announcements of imminent breakthroughs, some critical thinking is
advised. We’ll return to the future of AI in the final chapter, but for now, let’s
see how artificial neural networks are built.

Putting neurons together: networks

A single neuron would be way too simple to make decisions and predictions reliably in most real-life applications. To unleash the full potential of neural networks, we can use the output of one neuron as the input of other neurons, whose outputs can be the input to yet other neurons, and so on. The output of the whole network is obtained as the output of a certain subset of the neurons, which are called the output layer. We’ll return to this in a bit, after we discuss the way neural networks adapt to produce different behaviors by learning their parameters from data.

Key terminology

Layers
Often the network architecture is composed of layers. The input layer consists of neurons that get their inputs directly from the data. In an image recognition task, for example, the neurons in the input layer would take the pixel values of the input image as their inputs. The network typically also has hidden layers that use the outputs of other neurons as their inputs, and whose outputs in turn serve as inputs to further layers of neurons. Finally, the output layer produces the output of the whole network. All the neurons on a given layer get inputs from neurons on the previous layer and feed their outputs to the next.
A classical example of a multilayer network is the so-called multilayer
perceptron. As we discussed above, Rosenblatt's Perceptron algorithm can be
used to learn the weights of a perceptron. For the multilayer perceptron, the corresponding learning problem is much harder, and it took a long time before a working solution was discovered. But eventually, one was invented: the
backpropagation algorithm led to a revival of neural networks in the late
1980s. It is still at the heart of many of the most advanced deep learning
solutions.
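
To see what layered computation looks like in practice, here is a minimal sketch in Python of a forward pass through a tiny two-layer network. The weight values are arbitrary illustrative numbers; in a real network they would be learned from data, for example with backpropagation.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weight_matrix):
    # Each neuron in the layer computes a linear combination of all the
    # inputs from the previous layer, followed by a sigmoid activation
    return [sigmoid(sum(w * x for w, x in zip(weights, inputs)))
            for weights in weight_matrix]

hidden_weights = [[0.5, -1.0],   # weights of hidden neuron 1
                  [1.5, 2.0]]    # weights of hidden neuron 2
output_weights = [[1.0, -1.0]]   # weights of the single output neuron

x = [1.0, 0.0]                         # the input layer: raw data
hidden = layer(x, hidden_weights)      # outputs of the hidden layer
print(layer(hidden, output_weights))   # output of the whole network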

Note

Meanwhile in Helsinki...
The path(s) leading to the backpropagation algorithm are rather long and
winding. An interesting part of the history is related to the computer science
department of the University of Helsinki. About three years after the founding of
the department in 1967, a Master’s thesis was written by a student called Seppo
Linnainmaa. The topic of the thesis was “Cumulative rounding error of
algorithms as a Taylor approximation of individual rounding errors” (the thesis
was written in Finnish, so this is a translation of the actual title “Algoritmin
kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirheiden Taylor-
kehitelmänä”).

The automatic differentiation method developed in the thesis was later applied
by other researchers to quantify the sensitivity of the output of a multilayer
neural network with respect to the individual weights, which is the key idea in
backpropagation.

A simple neural network classifier

To give a relatively simple example of using a neural network classifier, we'll consider a task that is very similar to the MNIST digit recognition task, namely classifying images in two classes. We will first create a classifier to classify whether an image shows a cross (x) or a circle (o). Our images are represented here as pixels that are either colored or white, and the pixels are arranged in a 5 × 5 grid. In this format our images of a cross and a circle (more like a diamond, to be honest) look like this:

In order to build a neural network classifier, we need to formalize the problem in a way where we can solve it using the methods we have learned. Our first step is to represent the information in the pixels by numerical values that can be used as the input to a classifier. Let's use 1 if the square is colored, and 0 if it is white. Note that although the symbols in the above graphic are of different colors (green and blue), our classifier will ignore the color information and use only the colored/white information. The 25 pixels in the image make up the inputs of our classifier.

To make sure that we know which pixel is which in the numerical representation, we can decide to list the pixels in the same order as you'd read text, so row by row from the top, and reading each row from left to right. The first row of the cross, for example, is represented as 1,0,0,0,1; the second row as 0,1,0,1,0, and so on. The full input for the cross is then: 1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1.

We'll use the basic neuron model where the first step is to compute a linear combination of the inputs. We thus need a weight for each of the input pixels, which means 25 weights in total.

Finally, we use the step activation function. If the linear combination is negative, the neuron activation is zero, which we decide to use to signify a cross. If the linear combination is positive, the neuron activation is one, which we decide to use to signify a circle.

Let's see what happens when all the weights take the same numerical value, 1.
With this setup, our linear combination for the cross image will be 9 (9 colored
pixels, so 9 × 1, and 16 white pixels, 16 × 0), and for the circle image it will be
8 (8 colored pixels, 8 × 1, and 17 white pixels, 17 × 0). In other words, the
linear combination is positive for both images and they are thus classified as
circles. Not a very good result given that there are only two images to classify.

To improve the result, we need to adjust the weights in such a way that the
linear combination will be negative for a cross and positive for a circle. If we
think about what differentiates images of crosses and circles, we can see that
circles have no colored pixels in the center of the image, whereas crosses do.
Likewise, the pixels at the corners of the image are colored in the cross, but
white in the circle.

We can now adjust the weights. There are an infinite number of weights that do the job. For example, assign weight –1 to the center pixel (the 13th pixel), and weight 1 to the pixels in the middle of each of the four sides of the image, letting all the other weights be 0. Now, for the cross input, the center pixel produces the value –1, while for all the other pixels either the pixel value or the weight is 0, so that –1 is also the total value. This leads to activation 0, and the cross is correctly classified.

How about the circle then? Each of the pixels in the middle of the sides
produces the value 1, which makes 4 × 1 = 4 in total. For all the other pixels
either the pixel value or the weight is zero, so 4 is the total. Since 4 is a
positive value, the activation is 1, and the circle is correctly recognized as
well.
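
The whole classifier fits in a few lines of code. Here is a minimal sketch in Python of the weight assignment described above; the pixel encodings follow the row-by-row convention introduced earlier.

# 5 x 5 images flattened row by row: 1 = colored pixel, 0 = white
cross  = [1,0,0,0,1,
          0,1,0,1,0,
          0,0,1,0,0,
          0,1,0,1,0,
          1,0,0,0,1]
circle = [0,0,1,0,0,
          0,1,0,1,0,
          1,0,0,0,1,
          0,1,0,1,0,
          0,0,1,0,0]

weights = [0.0] * 25
weights[12] = -1.0                # the center pixel (the 13th)
for i in (2, 10, 14, 22):         # middle of each of the four sides
    weights[i] = 1.0

def classify(image):
    z = sum(w * x for w, x in zip(weights, image))
    return "circle" if z > 0 else "cross"   # step activation

print(classify(cross))    # cross  (linear combination = -1)
print(classify(circle))   # circle (linear combination = 4)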

Happy or not?

We will now follow similar reasoning to build a classifier for smiley faces. You
can assign weights to the input pixels in the image by clicking on them.
Clicking once sets the weight to 1, and clicking again sets it to -1. The
activation 1 indicates that the image is classified as a happy face, which can
be correct or not, while activation –1 indicates that the image is classified as a
sad face.

Don't be discouraged by the fact that you will not be able to classify all the
smiley faces correctly: it is in fact impossible with our simple classifier! This is
one important learning objective: sometimes perfect classification just isn't
possible because the classifier is too simple. In this case the simple neuron
that uses a linear combination of the inputs is too simple for the task. Observe
how you can build classifiers that work well in different cases: some classify
most of the happy faces correctly while being worse for sad faces, or the other
way around.

III.Advanced neural network techniques


In the previous section, we have discussed the basic ideas behind most neural network
methods: multilayer networks, non-linear activation functions, and learning rules such as the
backpropagation algorithm.

They power almost all modern neural network applications. However, there
are some interesting and powerful variations of the theme that have led to
great advances in deep learning in many areas.

Convolutional neural networks (CNNs)

One area where deep learning has achieved spectacular success is image
processing. The simple classifier that we studied in detail in the previous
section is severely limited – as you noticed it wasn’t even possible to classify
all the smiley faces correctly. Adding more layers in the network and using
backpropagation to learn the weights does in principle solve the problem, but
another one emerges: the number of weights becomes extremely large and
consequently, the amount of training data required to achieve satisfactory
accuracy can become too large to be realistic.

Fortunately, a very elegant solution to the problem of too many weights exists:
a special kind of neural network, or rather, a special kind of layer that can be
included in a deep neural network. This special kind of layer is a so-
called convolutional layer. Networks including convolutional layers are
called convolutional neural networks (CNNs). Their key property is that they
can detect image features such as bright or dark (or specific color) spots,
edges in various orientations, patterns, and so on. These form the basis for
detecting more abstract features such as a cat’s ears, a dog’s snout, a
person’s eye, or the octagonal shape of a stop sign. It would normally be hard
to train a neural network to detect such features based on the pixels of the
input image, because the features can appear in different positions, different
orientations, and in different sizes in the image: moving the object or the
camera angle will change the pixel values dramatically even if the object itself
looks just the same to us. Learning to detect a stop sign in all these different conditions would require vast amounts of training data, because the network would only detect the sign in conditions where it has appeared in the training data. So, for example, a stop sign in the top right corner of the image would be detected only if the training data included an image with the stop sign in the top right corner. CNNs can recognize the object anywhere in the image, no matter where it has been observed in the training images.

Note

Why we need CNNs


CNNs use a clever trick to reduce the amount of training data required to detect
objects in different conditions. The trick basically amounts to using the same
input weights for many neurons – so that all of these neurons are activated by
the same pattern – but with different input pixels. We can for example have a set
of neurons that are activated by a cat’s pointy ear. When the input is a photo of a
cat, two neurons are activated, one for the left ear and another for the right. We
can also let the neuron’s input pixels be taken from a smaller or a larger area, so
that different neurons are activated by the ear appearing in different scales
(sizes), so that we can detect a small cat’s ears even if the training data only
included images of big cats.
The convolutional neurons are typically placed in the bottom layers of the network, which process the raw input pixels. Basic neurons (like the perceptron neuron discussed above) are placed in the higher layers, which process the output of the bottom layers. The bottom layers can usually be
trained using unsupervised learning, without a particular prediction task in
mind. Their weights will be tuned to detect features that appear frequently in
the input data. Thus, with photos of animals, typical features will be ears and
snouts, whereas in images of buildings, the features are architectural
components such as walls, roofs, windows, and so on. If a mix of various
objects and scenes is used as the input data, then the features learned by the
bottom layers will be more or less generic. This means that pre-trained
convolutional layers can be reused in many different image processing tasks.
This is extremely important since it is easy to get virtually unlimited amounts
of unlabeled training data – images without labels – which can be used to train
the bottom layers. The top layers are always trained by supervised machine
learning techniques such as backpropagation.
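
The weight-sharing idea itself can be sketched in a few lines of code. Below, the same small grid of weights (a so-called filter, or kernel) is slid over every position of an image, so the same pattern is detected no matter where it appears; the image and filter values here are illustrative toy numbers, not trained weights.

def convolve(image, kernel):
    # Slide the same small grid of weights (the kernel) over every
    # position of the image; sharing one set of weights means the same
    # pattern is detected wherever it appears.
    n, k = len(image), len(kernel)
    result = []
    for i in range(n - k + 1):
        row = []
        for j in range(n - k + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        result.append(row)
    return result

# A toy 5x5 image with a bright 2x2 spot, and a 2x2 "brightness" filter
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]
kernel = [[1, 1],
          [1, 1]]
for row in convolve(image, kernel):
    print(row)   # the largest responses appear where the bright spot is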

Do neural networks dream of electric sheep? Generative adversarial networks (GANs)

Having trained a neural network on data, we can use it for predictions. Since
the top layers of the network have been trained in a supervised manner to
perform a particular classification or prediction task, the top layers are really
useful only for that task. A network trained to detect stop signs is useless for
detecting handwritten digits or cats.

A fascinating result is obtained by taking the pre-trained bottom layers and studying what the features they have learned look like. This can be achieved
by generating images that activate a certain set of neurons in the bottom
layers. Looking at the generated images, we can see what the neural network
“thinks” a particular feature looks like, or what an image with a select set of
features in it would look like. Some even like to talk about the networks
“dreaming” or “hallucinating” images (see Google’s DeepDream system).

Note

Be careful with metaphors


However, we’d like to once again emphasize the problem with metaphors such as
dreaming when simple optimization of the input image is meant – remember the
suitcase words discussed in Chapter 1. The neural network doesn’t really dream,
and it doesn’t have a concept of a cat that it would understand in a similar sense
as a human understands. It is simply trained to recognize objects and it can
generate images that are similar to the input data that it is trained on.
To actually generate real-looking cats, human faces, or other objects (you’ll
get whatever you used as the training data), Ian Goodfellow, a researcher at
Google Brain at the time, proposed a clever combination of two neural
networks. The idea is to let the two networks compete against each other.
One of the networks is trained to generate images like the ones in the training
data – it is called the generative network. The other network’s task is to
separate images generated by the first network from real images from the
training data – this one is called the adversarial network. These two combined
then make up a generative adversarial network or a GAN.
The system trains the two models side by side. In the beginning of the training, the adversarial model has an easy task telling apart the real images from the training data and the clumsy attempts by the generative model. However, as the generative network slowly gets better and better, the adversarial model has to improve as well, and the cycle continues until eventually the generated images are almost indistinguishable from real ones. The GAN doesn't simply try to reproduce the images in the training data: that would be far too simple a strategy to beat the adversarial network. Rather, the system is trained so that it has to be able to generate new, realistic-looking images too.
The above images were generated by a GAN developed by NVIDIA in a
project led by Prof Jaakko Lehtinen (see this article for more).

Could you have recognized them as fakes?
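
The training procedure described above can be sketched in miniature. The following toy example in Python pits a one-number "generator" against a logistic-regression "discriminator"; all of the distributions, learning rates, and step counts are our own illustrative choices, and real GANs are incomparably larger and far trickier to train.

import math
import random

random.seed(0)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Real data the generator must learn to imitate: numbers around 4.0
def real_sample():
    return random.gauss(4.0, 0.5)

# Generator g(z) = a*z + b turns random noise z into a "fake" sample;
# discriminator d(x) = sigmoid(w*x + c) guesses whether x is real
a, b = 1.0, 0.0   # generator parameters
w, c = 0.0, 0.0   # discriminator (adversarial network) parameters
lr = 0.01

for step in range(5000):
    z = random.gauss(0.0, 1.0)
    fake, real = a * z + b, real_sample()

    # Discriminator step: push d(real) toward 1 and d(fake) toward 0
    s_real, s_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * ((1 - s_real) * real - s_fake * fake)
    c += lr * ((1 - s_real) - s_fake)

    # Generator step: adjust a and b so the discriminator rates the
    # fake sample as more likely to be real
    s_fake = sigmoid(w * fake + c)
    grad = (1 - s_fake) * w   # gradient of log d(fake) w.r.t. the fake
    a += lr * grad * z
    b += lr * grad

# b should have drifted roughly toward the real data's mean of 4.0
print(f"fake samples are now centered around {b:.2f}")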

The Rise of Large Language Models (LLMs)


As mentioned above, convolutional neural networks (CNNs) reduce the
number of learnable weights in a neural network so that the amount of training
data required to learn all of them doesn't grow astronomically large as we
keep building bigger and bigger networks. Another architectural innovation,
besides the idea of a CNN, that currently powers many state-of-the-art deep
learning models is called attention.

Attention mechanisms were originally introduced for machine translation, where they can selectively focus the attention of the model on certain words in the input text when generating a particular word in the output. This way the model doesn't have to pay attention to all of the input at the same time, which greatly simplifies the learning task. Attention mechanisms were soon found to be extremely useful far beyond machine translation.

In 2017, a team working at Google published the blockbuster article "Attention is All You Need", which introduced the so-called transformer architecture for deep neural networks. Unless you have been living on a desert island or on an otherwise strict media diet, you have most likely already heard about transformers (the neural network models, not the toy franchise). It's just that they may have been hiding inside an acronym: GPT (Generative Pretrained Transformer). As the title of the article by the Google team suggests, transformers heavily exploit attention mechanisms to get the most out of the available training data and computational resources.
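
The core attention computation is surprisingly compact. Here is a minimal sketch of dot-product attention in Python; the two-dimensional vectors standing in for word representations are illustrative toy values (real transformers use learned vectors with hundreds or thousands of dimensions, plus refinements such as scaling and multiple attention heads).

import math

def attention(query, keys, values):
    # Dot-product attention: score each input word against the query,
    # turn the scores into weights with a softmax, and return the
    # weighted average of the values.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    attn_weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(wt * v[d] for wt, v in zip(attn_weights, values))
            for d in range(dim)]

# Toy 2-dimensional vectors standing in for word representations
keys = values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]   # "pay attention to words similar to this one"
print(attention(query, keys, values))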

The most widely noted applications of transformers are found in large language models (LLMs). The best known ones are OpenAI's GPT series, including GPT-1 released in June 2018 and GPT-4 announced in March 2023, but no giant platform company wants to miss out: Google picks model names from Sesame Street and published BERT (Bidirectional Encoder Representations from Transformers) in October 2018, while Meta joined the party a bit later in February 2023, picking a name inspired by the animal world, LLaMA (Large Language Model Meta AI). And it's not just the platform companies that are driving the development: universities and other research organizations are contributing open source models with the goal of democratizing the technology.

Note

What's in an LLM?
LLMs are models that, given a piece of text like "The capital of Finland is", predict how the text is likely to continue. In this case, "Helsinki" or "a pocket-sized metropolis" would be likely continuations. LLMs are trained on large amounts of text such as the entire contents of Wikipedia or the CommonCrawl dataset which, at the time of writing this, contains a whopping 260 billion web pages.
In principle, one can view LLMs as basically nothing but extremely powerful predictive text entry techniques. However, with some further thinking, it becomes apparent that being able to predict the continuation of any text in a way that is indistinguishable from human writing is (or would be) quite a feat and encompasses many aspects of intelligence. The above example, which is based on the association between the words "the capital of Finland" and "Helsinki", is an example where the model has learned a fact about the world. If we were able to build models that associate the commonly agreed answers with a wide range of questions, it could be argued that such a model has learned a big chunk of so-called "world knowledge". Especially intriguing are instances where the model seems to exhibit some level of reasoning beyond memorization and statistical co-occurrence: currently, LLMs are able to do this only in a limited sense, and they can easily make trivial mistakes because they are based on "just" statistical machine learning. Intensive research and development efforts are directed at building deep learning models with more robust reasoning algorithms and databases of verified facts.
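
To illustrate the prediction task in its very simplest form, here is a bigram model in Python: it predicts the next word from just the previous one by counting. The tiny corpus is an invented toy example; real LLMs condition on long contexts with transformers, but the underlying task, predicting the likely continuation, is the same.

from collections import Counter, defaultdict

corpus = ("the capital of finland is helsinki . "
          "the capital of france is paris . "
          "helsinki is a pocket-sized metropolis .").split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1   # how often nxt follows prev

def predict_next(word):
    # Return the most frequently observed continuation after the word
    return counts[word].most_common(1)[0][0]

print(predict_next("of"))   # "finland": the most common word after "of"
print(predict_next("is"))   # a likely continuation of "... is"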

Note

ChatGPT: AI for the masses


A massive earthquake occurred in San Francisco on November 30, 2022. It was so powerful that hardly a person on the planet was unaffected, and yet, no seismometer detected it. This metaphorical "earthquake" was the launch of ChatGPT by OpenAI. Word of the online chatbot service that anyone could use free of charge quickly spread around the world, and after a mere five days it had more than a million registered users (compare this to the five years that it took the Elements of AI to reach the same number); in two months, the number of signups was 100 million. No other AI service, and probably no other service whatsoever, has become a household name so quickly.

The first version of ChatGPT was based on a GPT-3.5 model fine-tuned by supervised and reinforcement learning according to a large number of human-rated responses. The purpose of the fine-tuning process was to steer the model away from toxic and incorrect responses that the language model had picked up from its training data, and towards comprehensive and helpful responses.

It is not easy to say what caused the massive media frenzy and the unprecedented interest in ChatGPT by pretty much everyone, even those who hadn't paid much attention to AI thus far. Probably some of it is explained by the somewhat better quality of the output, due to the fine-tuning, and the easy-to-use chat interface, which enables the user not only to get one-off answers to isolated questions, like any of the earlier LLMs, but also to maintain a coherent dialogue in a specific context. In the same vein, the chat interface allows one to make requests like "explain this to a five year old" or "write that as a song in the style of Nick Cave." (Mr Cave, however, wasn't impressed [BBC]). In any case, ChatGPT succeeded in bumping the interest in AI to completely new levels.

It remains to be seen what the real "killer apps" for ChatGPT and other LLM-based solutions will be. We believe the most likely candidates are ones where the factual content comes from the user or from another system, and the language model is used to format the output in the form of language (either natural language or possibly formal language such as program code). We'll return to the expected impact of ChatGPT and other LLM-based applications in the final chapter.

After completing Chapter 5 you should be able to:

• Explain what a neural network is and where they are being successfully
used
• Understand the technical methods that underpin neural networks

I.About predicting the future


We will start by addressing what is known to be one of the hardest problems of all: predicting
the future.

You may be disappointed to hear this, but we don’t have a crystal ball that
would show us what the world will be like in the future and how AI will
transform our lives.

As scientists, we are often asked to provide predictions, and our refusal to provide any is faced with a roll of the eyes (“boring academics”). But in fact, we claim that anyone who professes to know the future of AI and the implications it will have on our society should be treated with suspicion.
The reality distortion field

Not everyone is quite as conservative about their forecasts, however. In the modern world where big headlines sell, and where you have to compress news into 280 characters, reserved (boring?) messages are lost, and simple and dramatic messages are magnified. In the public perception of AI, this is clearly true.

Note

From utopian visions to grim predictions


The media sphere is dominated by the extremes. We are beginning to see AI
celebrities, standing for one big idea and making oracle-like forecasts about the
future of AI. The media love their clear messages. Some promise us a utopian
future with exponential growth and trillion-dollar industries emerging out of
nowhere, true AI that will solve all problems we cannot solve by ourselves, and
where humans don’t need to work at all.

It has also been claimed that AI is a path to world domination. Others make even
more extraordinary statements according to which AI marks the end of
humanity (in about 20-30 years from now), life itself will be transformed in the
“Age of AI”, and that AI is a threat to our existence.
While some forecasts will probably get at least something right, others will likely be useful only as demonstrations of how hard it is to predict, and many don’t make much sense. What we would like to achieve is for you to be able to look at these and other forecasts and critically evaluate them.

On hedgehogs and foxes

The political scientist Philip E. Tetlock, author of Superforecasting: The Art and Science of Prediction, classifies people into two categories: those who have one big idea (“hedgehogs”), and those who have many small ideas (“foxes”). Tetlock carried out an experiment between 1984 and 2003 to study factors that could help us identify which predictions are likely to be accurate and which are not. One of the significant findings was that foxes tend to be clearly better at prediction than hedgehogs, especially when it comes to long-term forecasting.
Probably the messages that can be expressed in 280 characters are more
often big and simple hedgehog ideas. Our advice is to pay attention to
carefully justified and balanced information sources, and to be suspicious
about people who keep explaining everything using a single argument.

Predicting the future is hard but at least we can consider the past and present
AI, and by understanding them, hopefully be better prepared for the future,
whatever it turns out to be like.

AI winters

The history of AI, just like many other fields of science, has witnessed the
coming and going of various different trends. In philosophy of science, the
term used for a trend is paradigm. Typically, a particular paradigm is adopted
by most of the research community and optimistic predictions about progress
in the near future are provided. For example, in the 1960s, neural networks were widely believed to solve all AI problems by imitating the learning mechanisms found in nature, the human brain in particular. The next big thing was expert systems based on logic and human-coded rules, which were the dominant paradigm in the 1980s.

The cycle of hype

In the beginning of each wave, a number of early success stories tend to make everyone happy and optimistic. The success stories, even if they may be in restricted domains and in some ways incomplete, become the focus of public attention. Many researchers rush into AI – or at least into calling their research AI – in order to access the increased research funding. Companies also initiate and expand their efforts in AI for fear of missing out (FOMO).

So far, each time an all-encompassing, general solution to AI has been said to be within reach, progress has ended up running into insurmountable problems, which at the time were thought to be minor hiccups. In the case of neural networks in the 1960s, the hiccups were related to handling nonlinearities and to solving the machine learning problems associated with the increasing number of parameters required by neural network architectures. In the case of expert systems in the 1980s, the hiccups were associated with handling uncertainty and common sense. As the true nature of the remaining problems dawned after years of struggling and unfulfilled promises, pessimism about the paradigm accumulated and an AI winter followed: interest in the field faltered and research efforts were directed elsewhere.

Modern AI

Currently, roughly since the turn of the millennium, AI has been on the rise again. Modern AI methods tend to focus on breaking a problem into a number of smaller, isolated and well-defined problems and solving them one at a time. Modern AI is bypassing grand questions about the meaning of intelligence, the mind, and consciousness, and focusing on building practically useful solutions to real-world problems. Good news for all of us who can benefit from such solutions!

Another characteristic of modern AI methods, closely related to working in the complex and “messy” real world, is the ability to handle uncertainty, which we demonstrated by studying the uses of probability in AI in Chapter 3. Finally, the current upwards trend of AI has been greatly boosted by the comeback of neural networks and deep learning techniques capable of processing images and other real-world data better than anything we have seen before.

Note

So are we in a hype cycle?


Whether history will repeat itself, and the current boom will once again be followed by an AI winter, is a matter that only time can tell. Even if it does, and the progress towards better and better solutions slows down to a halt, the significance of AI in society is going to stay. Thanks to the focus on useful solutions to real-world problems, modern AI research yields fruit already today, rather than trying to solve the big questions about general intelligence first – which was where the earlier attempts failed.

Prediction 1: AI will continue to be all around us

As you recall, we started by motivating the study of AI by discussing prominent AI applications that affect all our lives. We highlighted three examples: self-driving vehicles, recommendation systems, and image and video processing. During the course, we have also discussed a wide range of other applications that contribute to the ongoing technological transition.

Note
AI making a difference
As a consequence of focusing on practicality rather than the big problems, we live our lives surrounded by AI (even if we may most of the time be happily unaware of it): the music we listen to, the products we buy online, the movies and series we watch, our routes of transportation, and even the news and information that we have available are all influenced more and more by AI. What is more, basically any field of science, from medicine and astrophysics to medieval history, is also adopting AI methods in order to deepen our understanding of the universe and of ourselves.

Prediction 2: the Terminator isn’t coming

One of the most pervasive and persistent ideas related to the future of AI is the Terminator. In case you have somehow missed the image of a brutal humanoid robot with a metal skeleton and glaring eyes... well, that’s what it is. The Terminator is a 1984 film by director James Cameron. In the movie, a global AI-powered defense system called Skynet becomes conscious of its existence and wipes most of humankind out of existence with nukes and advanced killer robots.

Note

Two doomsday scenarios


There are two alternative scenarios that are suggested to lead to the coming of the Terminator or other similarly terrifying forms of robot uprising. In the first, which is the story from the 1984 film, a powerful AI system simply becomes conscious and decides that it really, really dislikes humanity in general.

In the second alternative scenario, the robot army is controlled by an intelligent but not conscious AI system that is in principle under human control. The system can be programmed, for example, to optimize the production of paper clips. Sounds innocent enough, doesn’t it?

However, if the system possesses superior intelligence, it will soon reach the
maximum level of paper clip production that the available resources, such as
energy and raw materials, allow. After this, it may come to the conclusion that it
needs to redirect more resources to paper clip production. In order to do so, it
may need to prevent the use of the resources for other purposes even if they are
essential for human civilization. The simplest way to achieve this is to kill all
humans, after which a great deal more resources become available for the
system’s main task, paper clip production.

Why these scenarios are unrealistic

There are a number of reasons why both of the above scenarios are
extremely unlikely and belong to science fiction rather than serious
speculations of the future of AI.

Reason 1:

Firstly, the idea that a superintelligent, conscious AI that can outsmart humans
emerges as an unintended result of developing AI methods is naive. As you
have seen in the previous chapters, AI methods are nothing but automated
reasoning, based on the combination of perfectly understandable principles
and plenty of input data, both of which are provided by humans or systems
deployed by humans. To imagine that the nearest neighbor classifier, linear
regression, the AlphaGo game engine, or even a deep neural network could
become conscious and start evolving into a superintelligent AI mind requires a
(very) lively imagination.

Note that we are not claiming that building human-level intelligence would be
categorically impossible. You only need to look as far as the mirror to see a
proof of the possibility of a highly intelligent physical system. To repeat what
we are saying: superintelligence will not emerge from developing narrow AI
methods and applying them to solve real-world problems (recall the narrow vs
general AI from the section on the philosophy of AI in Chapter 1).

Reason 2:

Secondly, one of the favorite ideas of those who believe in superintelligent AI is the so-called singularity: a system that optimizes and “rewires” itself so that it can improve its own intelligence at an ever accelerating, exponential rate. Such a superintelligence would leave humankind so far behind that we would become like ants that can be exterminated without hesitation. The idea of an exponential intelligence increase is unrealistic for the simple reason that even if a system could optimize its own workings, it would keep facing more and more difficult problems that would slow down its progress, much as the progress of human scientists requires ever greater efforts and resources from the whole research community and indeed the whole society – resources which the superintelligent entity wouldn’t have access to. Human society still has the power to decide what we use technology, even AI technology, for. Much of this power is indeed given to us by technology, so that every time we make progress in AI technology, we become more powerful and better at controlling any potential risks due to it.

Note

The value alignment problem


The paper clip example is known as the value alignment problem: specifying the
objectives of the system so that they are aligned with our values is very hard.
However, suppose that we create a superintelligent system that could defeat
humans who tried to interfere with its work. It’s reasonable to assume that such
a system would also be intelligent enough to realize that when we say “make me
paper clips”, we don’t really mean to turn the Earth into a paper clip factory of a
planetary scale.

Separating stories from reality

All in all, the Terminator is a great story to make movies about but hardly a
real problem worth panicking about. The Terminator is a gimmick, an easy
way to get a lot of attention, a poster boy for journalists to increase click rates,
a red herring to divert attention away from perhaps boring, but real, threats
like nuclear weapons, lack of democracy, environmental catastrophes, and
climate change. In fact, the real threat the Terminator poses is the diversion of
attention from the actual problems, some of which involve AI, and many of
which don’t. We’ll discuss the problems posed by AI in what follows, but the
bottom line is: forget about the Terminator, there are much more important
things to focus on.

Exercise 23: What is the perception of AI?


For this exercise, we want you to think about how AI is portrayed. Do an online image search for the term “AI” and see what kinds of pictures come up. If you are using Google search, choose "Images" at the top of the screen.

What's the general impression you get about AI from the image search results? Is
this an accurate representation of AI? Why or why not?
Note: On using ChatGPT and similar models: We appreciate you putting an AI
tool into use. Feel free to consult it but note that since we are interested in your
own opinion, we expect you to write your final answers yourself.

Your answer:
I did a Google image search, and the results mainly depicted brains on top of
circuits or humanoid robots thinking. Based on these images, one would
conclude that AI is metallic hardware or robots that just think. However, we
should compare what we think of AI, not just what Google thinks. Referring
back to my definition of AI in Chapter 1: That is computer systems that
possess certain autonomy and adaptability, capable of attempting to imitate
both intelligent and non-intelligent behaviors of humans and other entities,
with the main goal of serving humans. After reading this definition, the first
graphic image that comes to my mind is a humanoid robot that serves
humans, whether in their daily tasks or in some daunting tasks that require
more advanced and safer approaches. The robot also has a "bubble" in its
head (or, I should say, its processor) and is thinking about its next move.
Considering this is my personal depiction, it somewhat matches with Google
images that show humanoid robots, but the ones that show some hardware
and chips all around are a bit of nonsense. In my opinion, a more accurate
representation of AI would be similar to the introductory story of the "Detroit:
Become Human" game where humanoid robots help humans and are a part
of society. The choices made by people or androids (humanoid robots in the
game) can change the whole outcome, which can result in peace or war
(hopefully the latter remains science fiction). Still, I believe this is a smooth way of showcasing AI in terms of human perception and adaptation. Thankfully, Google's search engine doesn't show Terminator-like robots first, which could add fuel to the fire and increase human fear. Despite these futuristic image results, which are not very precise in showcasing AI (chips rather than robots serving humans), people still remain fearful of its future capabilities. It would be even more devastating to see images where AI was portrayed as a threat to humanity.

Example answer

The Google image search with the query "AI" brought us almost exclusively
brains made of circuits. After scrolling down a bit, we also got some shining
white humanoid robots, often in a very pensive state. The color blue is dominant.
Based on this, the impression would be that AI is about tinkering with wires and
circuits, trying to build an electronic brain in some modern Frankenstein spirit.
Not much color, not much fun.

We don't think that this is an accurate representation of AI at all! AI is about solving practical, human problems in our everyday life: better music, more
interesting and important news, making new friends. Even the research side of
AI, which is not what most of the Google hits are about, almost never involves
working with hardware such as circuits and wires. It is mostly simply about
applying sound scientific principles to understand how we can push the limits of
our AI methods. It often takes a lot of talking to other researchers, writing on
scrap paper, and programming. Not really much different from any other
research.

II.The societal implications of AI


In the very beginning of this course, we briefly discussed the importance of AI in today’s and tomorrow’s society, but at that time we could do so only to a limited extent because we hadn’t introduced enough of the technical concepts and methods to ground the discussion in concrete terms.

Now that we have a better understanding of the basic concepts of AI, we are in a much better position to take part in a rational discussion about the implications of AI even in its current form.

Implication 1: Algorithmic bias

AI, and in particular machine learning, is being used to make important decisions in many sectors. This brings up the concept of algorithmic bias. What it means is the embedding of a tendency to discriminate according to ethnicity, gender, or other factors when making decisions about job applications, bank loans, and so on.

Note

Once again, it’s all about the data


The main reason for algorithmic bias is human bias in the data. For example,
when a job application filtering tool is trained on decisions made by humans, the
machine learning algorithm may learn to discriminate against women or
individuals with a certain ethnic background. Notice that this may happen even if
ethnicity or gender are excluded from the data since the algorithm will be able to
exploit the information in the applicant’s name or address.
Algorithmic bias isn’t a hypothetical threat conceived by academic
researchers. It’s a real phenomenon that is already affecting people today.

Online advertising

It has been noticed that online advertisers like Google tend to display ads for lower-paying jobs to women more often than to men. Likewise, doing a search with a name that sounds African American may produce an ad for a tool for accessing criminal records, which is less likely to happen otherwise.

Social networks

Since social networks base their content recommendations essentially on other users’ clicks, they can easily end up magnifying existing biases even if these are very minor to start with. For example, it was observed that when searching for professionals with female first names, LinkedIn would ask the user whether they actually meant a similar male name: searching for Andrea would result in the system asking “did you mean Andrew?” If people occasionally click Andrew’s profile, perhaps just out of curiosity, the system will boost Andrew even more in subsequent searches.

There are numerous other examples we could mention, and you have probably seen news stories about them. The main difficulty in the use of AI and machine learning instead of rule-based systems is their lack of transparency. Partially this is a consequence of the algorithms and the data being trade secrets that the companies are unlikely to open up for public scrutiny. And even if they did, it would often be hard to identify the part of the algorithm or the elements of the data that lead to discriminating decisions.

Note

Transparency through regulation?


A major step towards transparency is the European General Data Protection
Regulation (GDPR). It requires that all companies that either reside within the
European Union or that have European customers must:
• Upon request, reveal what data they have collected about any individual
(right of access)
• Delete any such data that is not required to keep with other obligations
when requested to do so (right to be forgotten)
• Provide an explanation of the data processing carried out on the
customer’s data (right to explanation)

The last point means, in other words, that companies such as Facebook and
Google, at least when providing services to European users, must explain
their algorithmic decision making processes. It is, however, still unclear what
exactly counts as an explanation. Does for example a decision reached by
using the nearest neighbor classifier (Chapter 4) count as an explainable
decision, or would the coefficients of a logistic regression classifier be better?
How about deep neural networks that easily involve millions of parameters trained using terabytes of data? The discussion about the technical implementation of the explainability of decisions based on machine learning is currently intense. In any case, the GDPR has the potential to improve the transparency of AI technologies.

Implication 2: Seeing is believing — or is it?

We are used to believing what we see. When we see a leader on TV stating that their country will engage in a trade war with another country, or when a well-known company spokesperson announces an important business decision, we tend to trust them more than when we read about the statement second-hand in news written by someone else.

Similarly, when we see photo evidence from a crime scene or from a demonstration of a new tech gadget, we put more weight on the evidence than on a written report explaining how things look.

Of course, we are aware of the possibility of fabricating fake evidence. People can be put in places they never visited, with people they never met, by photoshopping. It is also possible to change the way things look by simply adjusting lighting or pulling one’s stomach in in cheap before–after shots advertising the latest diet pill.

Note
AI is taking the possibilities of fabricating evidence to a
whole new level
Metaphysics Live is a system capable of doing face-swaps, de-aging and other
tricks in real time.

Descript has a tool for automatic imitation of a person’s voice from a few minutes
of sample recording. While the generated audio still has a notable robotic tone, it
makes a pretty good impression.

Implication 3: Changing notions of privacy

It has long been known that technology companies collect a lot of information about their users. Earlier it was mainly grocery stores and other retailers that collected buying data by giving their customers loyalty cards that enable the store to associate purchases with individual customers.

Note

Unprecedented data accuracy


The accuracy of the data that tech companies such as Facebook, Google, Amazon
and many others are collecting is way beyond the purchase data collected by
conventional stores: in principle, it is possible to record every click, every page
scroll, and the time you spend viewing any content. Websites can even access
your browsing history, so that unless you use the incognito mode (or the like)
after browsing for flights to Barcelona on one site, you will likely get
advertisements for hotels in Barcelona.
However, as such, the above kind of data logging is not yet AI. The use of AI leads to new kinds of threats to our privacy, which may be harder to avoid even if you are careful about revealing your identity.

Using data analysis to identify individuals

A good example of a hard-to-avoid issue is de-anonymization, breaking the anonymity of data that we may have thought to be safe. The basic problem is that when we report the results of an analysis, the results may be so specific that they make it possible to learn something about individual users whose data is included in the analysis. A classic example is asking for the average salary of people born in a given year and having a specific zip code. In many cases, this could be a very small group of people, often only one person, so you’d be potentially giving out data about a single person’s salary.

An interesting example of a more subtle issue was pointed out by researchers at the University of Texas at Austin. They studied a public dataset made available by Netflix containing 10 million movie ratings by some 500,000 anonymous users, and showed that many of the Netflix users can actually be linked to user accounts on the Internet Movie Database because they had rated several movies on both applications. Thus the researchers were able to de-anonymize the Netflix data. While you may not think it’s a big deal whether someone else knows how you rated the latest Star Wars movie, some movies may reveal aspects of our lives (such as politics or sexuality) which we should be entitled to keep private.

Other methods of identification

A similar approach could in principle be used to match user accounts in almost any service that collects detailed data about user behaviors. Another
example is typing patterns. Researchers at the University of Helsinki have
demonstrated that users can be identified based on their typing patterns: the
short intervals between specific keystrokes when typing text. This can mean
that if someone has access to data on your typing pattern (maybe you have
used their website and registered by entering your name), they can identify
you the next time you use their service even if you’d refuse to identify yourself
explicitly. They can also sell this information to whoever wants to buy it.

While many of the above examples have come, at least in part, as surprises – otherwise they could have been avoided – there is a lot of ongoing research trying to address them. In particular, an area called differential privacy aims to develop machine learning algorithms that can guarantee that the results are sufficiently coarse to prevent reverse engineering specific data points that went into them.
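
The zip-code example and the differential-privacy remedy can both be sketched in a few lines of Python. The salary figures below are invented for illustration, and the noise scale is an arbitrary choice; real differentially private systems calibrate the noise carefully to the query's sensitivity and a privacy budget.

import random

random.seed(1)

# Invented salaries, purely for illustration
salaries = {"alice": 52000, "bob": 61000, "carol": 58000}

def average_salary(names):
    matched = [salaries[n] for n in names]
    return sum(matched) / len(matched)

# A query that matches only one person reveals that person's exact salary
print(average_salary(["alice"]))          # 52000.0 -- a privacy leak

def laplace_noise(scale):
    # The difference of two exponentials gives a Laplace-distributed sample
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_average_salary(names, scale=5000):
    # Differential privacy, sketched: add noise calibrated so that any one
    # person's data has only a limited effect on the published result
    return average_salary(names) + laplace_noise(scale)

print(private_average_salary(["alice"]))  # a deliberately coarse answer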

Implication 4: Changing work

When an early human learned to use a sharp rock to crack open bones of dead animals to access a new source of nutrition, time and energy were released for other purposes such as fighting, finding a mate, and making more inventions. The invention of the steam engine in the 1700s tapped into an easily portable form of machine power that greatly improved the efficiency of factories as well as ships and trains. Automation has always been a path to efficiency: getting more with less. Especially since the mid 20th century, technological development has led to a period of unprecedented progress in automation. AI is a continuation of this progress.

Each step towards better automation changes working life. With a sharp rock, there was less need for hunting and gathering food; with the steam engine, there was less need for horses and horsemen; with the computer, there is less need for typists, manual accounting, and many other forms of data processing (and apparently more need for watching cat videos). With AI and robotics, there is even less need for many kinds of dull, repetitive work.

Note

A history of finding new things to do


In the past, every time one kind of work has been automated, people have found new kinds to replace it. The new kinds of work are less repetitive and routine, and more variable and creative. The issue with the current rate of advance of AI and other technologies is that during the career of an individual, the change in working life might be greater than ever before. It is conceivable that some jobs, such as driving a truck or a taxi, may disappear within a few years’ time span. Such an abrupt change could lead to mass unemployment as people don’t have time to train themselves for other kinds of work.

The most important preventive action to avoid huge societal issues such as this is to help young people obtain a wide-ranging education. This provides a basis for pursuing many different jobs and isn’t at high risk of becoming obsolete in the near future.

It is equally important to support life-long learning and learning at work, because there are going to be few of us who will do the same job throughout our entire careers. Cutting the hours per week would help offer work to more people, but the laws of economics tend to push people to work more rather than less unless public policy regulating the amount of work is introduced.
Because we can’t predict the future of AI, predicting the rate and extent of this development is extremely hard. There have been some estimates about the extent of job automation, ranging up to 47% of US jobs being at risk, as reported by researchers at the University of Oxford. The exact numbers such as these – 47%, not 45% or 49% – the complicated-sounding study designs used to get them, and the top universities that report them tend to make the estimates sound very reliable and precise (recall the point about estimating life expectancy using a linear model based on a limited amount of data). The illusion of accuracy to one percentage point is a fallacy. The above number, for example, is based on looking at a large number of job descriptions – perhaps licking the tip of your finger and putting it up to feel the wind – and using subjective grounds to decide which tasks are likely to be automated. It is understandable that people don’t take the trouble to read a 79-page report that includes statements such as "the task model assumes for tractability an aggregate, constant-returns-to-scale, Cobb-Douglas production function." However, if you don’t, then you should remain somewhat sceptical about the conclusions too. The real value in this kind of analysis is that it suggests which kinds of jobs are more likely to be at risk, not the actual numbers such as 47%. The tragedy is that the headlines reporting "nearly half of US jobs at risk of computerization" are noted, and the rest is not.

So then, what actually are the tasks that are more likely to be automated?
There are some clear signs concerning this that we can already observe:

• Autonomous robotics solutions such as self-driving vehicles, including cars, drones, boats and delivery robots, are just at the verge of major commercial applications. The safety of autonomous cars is hard to estimate, but the statistics suggest that it is probably not yet quite at the required level (the level of an average human driver). However, the progress has been incredibly fast and it is accelerating due to the increasing amount of available data.
• Customer-service applications such as helpdesks can be automated in a very cost-effective fashion. Currently the quality of service is not always something to cheer about, the bottlenecks being language processing (the system not being able to recognize spoken language or to parse the grammar) and the logic and reasoning required to provide the actual service. However, working applications in constrained domains (such as making restaurant or haircut reservations) sprout up constantly.

For one thing, it is hard to tell how soon we’ll have safe and reliable self-
driving cars and other solutions that can replace human work. In addition to
this, we mustn’t forget that a truck or taxi driver doesn’t only turn a wheel: they
are also responsible for making sure the vehicle operates correctly, they
handle the goods and negotiate with customers, they guarantee the safety of
their cargo and passengers, and take care of a multitude of other tasks that
may be much harder to automate than the actual driving.
As with earlier technological advances, there will also be new work that is
created because of AI. It is likely that in the future, a larger fraction of the
workforce will focus on research and development, and tasks that require
creativity and human-to-human interaction. If you’d like to read more on this
topic, see for example Abhinav Suri’s nice essay on Artificial Intelligence and
the Rise of Economic Inequality.

Exercise 24: Implications of AI


What kinds of articles (in newspapers, magazines, or other popular science outlets such as blogs) are being written about AI – and do you think they are realistic? Do an online search about AI related to one of your interests. Choose one of the articles and analyze it.

1. Mention the title of the article along with its author and where it was
published (as a URL if applicable) in your answer.
2. Explain the central idea in the article in your own words, using about a paragraph of text (multiple sentences).
3. Based on your understanding, how accurate are the AI-related statements
in the article? Explain your answer. Are the implications (if any)
realistic? Explain why or why not.

Note: On using ChatGPT and similar models: We appreciate you putting an AI tool to use. Feel free to consult it, but note that since we are interested in your own opinion, we expect you to write your final answers yourself.

Sample Answer:

Many of the articles that we studied were about the great promise of AI
in different areas such as health-care, finance, customer service,
transportation... you name it. A pattern that seems to repeat is that
Google, IBM, Microsoft, or some of the other big players in the field have
demonstrated a prototype product and the reporter is amazed by it. This
tends to be combined with an estimate of the US or global market of the
industry in question, which easily amounts to billions of euros.

The articles very rarely report anything about the actual techniques
underlying the solutions, which is quite understandable since many
readers wouldn't be able to digest any technical details. (You would!)
A few of the articles we reviewed contain statements about AI "reading
millions of pages" and "comprehending them", but to be honest, we were
actually expecting worse based on our Facebook feed. Perhaps the social
media recommendations we get (based on our clicks! makes you
wonder...) are of lower quality than what Google search can provide?

III.Summary
The most important decisions that determine how well our society can adapt to the changes
brought by AI aren’t technological. They are political.

Everything that we have learned about AI suggests that the future is bright.
We will get new and better services and increased productivity will lead to
positive overall outcomes – but only on the condition that we carefully
consider the societal implications and ensure that the power of AI is used for
the common good.

What we need to do to ensure a positive outcome

Still, we have a lot of work to do.

• We need to avoid algorithmic bias to be able to reduce discrimination instead of increasing it.
• We also need to learn to be critical about what we see, as seeing is no longer the same as believing – and develop AI methods that help us detect fraud rather than just making it easier to fabricate more real-looking falsehoods.
• We need to set up regulation to guarantee that people have the right to privacy, and that any violations of this right are strictly penalized.

We also need to find new ways to share the benefits with everyone, instead of creating an AI elite: those who can afford the latest AI technology and use it to gain an unprecedented economic advantage. This requires careful political judgment (note that by political judgment, we mean decisions about policy, which has little to do with who votes for whom in an election or the comings and goings of individual politicians and political parties).

Note

The importance of policy


The most important decisions that determine how well our society can adapt to
the evolution of work and to the changes brought by AI aren’t technological. They
are political.

The regulation of the use of AI must follow democratic principles, and everyone must have an equal say about the kind of future society we want to live in. The only way to make this possible is to make knowledge about technology freely available to all. Obviously there will always be experts in any given topic who know more about it than the rest of us, but we should at least have the possibility to critically evaluate what they are saying.
What you have learned with us supports this goal by providing you with the basic background on AI, so that we can have a rational discussion about AI and its implications.

Our role as individuals

As you recall, we started this course by motivating the study of AI with a discussion of prominent AI applications that affect all our lives. We highlighted three examples: self-driving cars, recommendation systems, and image and video processing. During the course, we have also discussed a wide range of other applications that contribute to the current technological transition.

Note

Hidden agenda
We also had a hidden agenda. We wanted to give you an opportunity to experience the thrill of learning, and the joy of eureka moments when something that may have seemed complicated and mysterious becomes simple and, if not self-evident, at least comprehensible. These are the moments when our curiosity is satisfied. But such satisfaction is temporary. Soon after we have found the answer to one question, we will ask the next. What then? And then?

If we have been successful, we have whetted your appetite for learning. We hope you will continue learning by finding other courses and further information about AI, as well as about other topics that interest you. To help you with your exploration, we have collected some pointers to AI material that we have found useful and interesting.

Now you are in a position to find out what is going on in AI, and what is being done to ensure its proper use. You should do so, and whenever you feel that there are risks we should discuss, or opportunities we should go after, don’t wait for someone else to react.
