Elements of AI
In our very first section, we’ll become familiar with the concept of AI by looking into its
definition and some examples.
To set the scene, so to speak, we’ll discuss what AI is, how it can be defined,
and what other fields or technologies are closely related. Before we do so,
however, we’ll highlight three applications of AI that illustrate different aspects
of AI. We’ll return to each of them throughout the course to deepen our
understanding.
The same technologies are also used in other autonomous systems such as
delivery robots, flying drones, and autonomous ships.
While the front page of the printed version of the New York Times or China
Daily is the same for all readers, the front page of the online version is different
for each user. The algorithms that determine the content that you see are
based on AI.
Implications: while many companies don’t want to reveal the details of their
algorithms, being aware of the basic principles helps you understand the
potential implications: these involve so-called filter bubbles, echo chambers,
troll factories, fake news, and new forms of propaganda.
The popularity of AI in the media is in part due to the fact that people have
started using the term when they refer to things that used to be called by other
names. You can see almost anything, from statistics and business analytics to
manually encoded if-then rules, being called AI. Why is this so? Why is the public
perception of AI so nebulous? Let’s look at a few reasons.
Even AI researchers have no exact definition of AI. Rather, the field is
constantly being redefined as some topics are classified as non-AI and new
topics emerge.
There’s an old (geeky) joke that AI is defined as “cool things that computers
can’t do.” The irony is that under this definition, AI can never make any
progress: as soon as we find a way to do something cool with a computer, it
stops being an AI problem. However, there is an element of truth in this
definition. Fifty years ago, for instance, automatic methods for search and
planning were considered to belong to the domain of AI. Nowadays such
methods are taught to every computer science student. Similarly, certain
methods for processing uncertain information are becoming so well
understood that they are likely to be moved from AI to statistics or probability
very soon.
Often the robothood of such creatures is only a thin veneer on top of a very
humanlike agent, which is understandable, as most fiction – even science
fiction – needs to be relatable to human readers, who would otherwise be
alienated by intelligence that is too different and strange. Most science fiction
is thus best read as a metaphor for the current human condition, and robots
can be seen as stand-ins for repressed sections of society, or perhaps our
search for the meaning of life.
It can be hard to appreciate how complicated all this is, but sometimes it
becomes visible when something goes wrong: the object you pick is much
heavier or lighter than you expected, or someone else opens a door just as
you are reaching for the handle, and then you can find yourself seriously out
of balance. Usually these kinds of tasks feel effortless, but that feeling belies
millions of years of evolution and several years of childhood practice.
While easy for you, grasping objects is extremely hard for a robot, and it is an
area of active study. Recent examples include the Boston Dynamics robots.
It has since turned out that playing chess is very well suited to computers,
which can follow fairly simple rules and compute many alternative move
sequences at a rate of billions of computations a second. Computers beat the
reigning human world champion in chess in the famous Deep Blue vs
Kasparov matches in 1997. Who could have imagined that the harder problem
would turn out to be grabbing the pieces and moving them on the board without
knocking it over? We will study the techniques that are used in playing games
like chess or tic-tac-toe in Chapter 2.
An attempt at a definition more useful than the “what computers can’t do yet”
joke would be to list properties that are characteristic of AI, in this case
autonomy and adaptivity.
Key terminology
Autonomy
The ability to perform tasks in complex environments without constant
guidance by a user.
Adaptivity
The ability to improve performance by learning from experience.
You may well say, for example, that a system is intelligent, perhaps because it
delivers accurate navigation instructions or detects signs of melanoma in
photographs of skin lesions. When we hear something like this, the word
“intelligent” easily suggests that the system is capable of performing any task
an intelligent person is able to perform: going to the grocery store and cooking
dinner, washing and folding laundry, and so on.
Likewise, when we say that a computer vision system understands images
because it is able to segment an image into distinct objects such as other
cars, pedestrians, buildings, the road, and so on, the word “understand” easily
suggests that the system also understands that even if a person is wearing a t-
shirt with a photo of a road printed on it, it is not okay to drive on that road
(and over the person).
Why you can say "a pinch of AI" but not "an AI"

Note
“AI” is not a countable noun
When discussing AI, we would like to discourage the use of AI as a countable
noun: one AI, two AIs, and so on. AI is a scientific discipline, like mathematics
or biology. This means that AI is a collection of concepts, problems, and
methods for solving them.
Because AI is a discipline, you shouldn’t say “an AI”, just like we don’t say “a
biology”. This point should also be quite clear when you try saying something
like “we need more artificial intelligences.” That just sounds wrong, doesn’t it?
(It does to us.)
Despite our discouragement, the use of AI as a countable noun is common.
Take, for instance, the headline Data from wearables helped teach an AI to
spot signs of diabetes, which is otherwise a pretty good headline, since it
emphasizes the importance of data and makes it clear that the system can
only detect signs of diabetes rather than making diagnoses and treatment
decisions. And you should definitely never ever say anything like Google’s
artificial intelligence built an AI that outperforms any made by humans, which
is one of the all-time most misleading AI headlines we’ve ever seen (note that
the headline is not by Google Research).
The use of AI as a countable noun is of course not a big deal if what is being
said otherwise makes sense, but if you’d like to talk like a pro, avoid saying
"an AI", and instead say "an AI method".
Answered
Note: You will only be able to submit the answer once, after which the correct
answers will be revealed, so take your time and re-read the material above if you
feel like it. That said, don't worry if you get some of them wrong – some of them
are debatable in any case because these kinds of things are rarely perfectly clear
cut. We are quite sure that if you just focus and do your best, you will have no
problems achieving a successful overall result in the end. Making mistakes is one
of the best opportunities to learn.
Spreadsheet that calculates sums and other pre-defined functions on given data
Yes
No
Kind of
II. Related fields
In addition to AI, there are several other closely related topics that are good to know at least
by name. These include machine learning, data science, and deep learning.
Key terminology
Machine learning
Systems that improve their performance in a given task with more and more
experience or data.
Deep learning is a subfield of machine learning, which itself is a subfield of AI,
which itself is a subfield of computer science. We will meet deep learning in
some more detail in Chapter 5, but for now let us just note that the “depth” of
deep learning refers to the complexity of a mathematical model, and that the
increased computing power of modern computers has allowed researchers to
increase this complexity to reach levels that appear not only quantitatively but
also qualitatively different from before. As you may have noticed, science often
involves a number of progressively more specialized subfields, subfields of
subfields, and so on. This enables researchers to zoom in on a particular topic
so that it is possible to keep up with the ever-increasing amount of knowledge
accrued over the years, and to produce new knowledge on the topic – or
sometimes to correct earlier knowledge to be more accurate.
Robotics means building and programming robots so that they can operate in
complex, real-world scenarios. In a way, robotics is the ultimate challenge of
AI since it requires a combination of virtually all areas of AI. For example:
Note
What is a robot?
In brief, a robot is a machine comprising sensors (which sense the environment)
and actuators (which act on the environment) that can be programmed to
perform sequences of actions. People used to science-fictional depictions of
robots will usually think of humanoid machines walking with an awkward gait
and speaking in a metallic monotone. Most real-world robots currently in use
look very different as they are designed according to the application. Most
applications would not benefit from the robot having human shape, just like we
don’t have humanoid robots to do our dishwashing but machines in which we
place the dishes to be washed by jets of water.
It may not be obvious at first sight, but any kind of vehicle that has at least
some level of autonomy and includes sensors and actuators also counts as
robotics. On the other hand, software-based solutions such as a customer service
chatbot, even if they are sometimes called “software robots”, aren’t counted as
(real) robotics.
Answered
Exercise 2: Taxonomy of AI
A taxonomy is a scheme for classifying many things that may be special cases of
one another. We have explained the relationships between a number of
disciplines or fields and pointed out, for example, that machine learning is
usually considered to be a subfield of AI.
Notice that a taxonomy does not need to be strictly hierarchical. A discipline can
be a subfield of more than one more general topic: for example, machine learning
can also be thought to be a subfield of statistics. In this case, the subfield concept
would be placed in the overlap between the more general topics.
Your task: Construct a taxonomy in the Euler diagram example given below
showing the relationships between the following things: AI, machine learning,
computer science, data science, and deep learning.
Note: This exercise is meant to inspire you to think about the different aspects of
AI and their role in various applications. As there are no clear-cut answers to
many of these questions, this exercise will not be included in the grading.
Nevertheless, we suggest that you do your best and try to answer as well as you
can, but don't worry if our answers differ from yours.
Autonomous car
Statistics
Robotics
Machine learning
III. Philosophy of AI
The very nature of the term “artificial intelligence” brings up philosophical questions about
whether intelligent behavior implies or requires the existence of a mind, and about the
extent to which consciousness is replicable as computation.
What Turing meant by the test is very similar to the aphorism by Forrest
Gump: “stupid is as stupid does”. Turing’s version would be “intelligent is as
intelligent says”. In other words, an entity is intelligent if it cannot be
distinguished from another intelligent entity by observing its behavior. Turing
just constrained the set of behaviors to discussion, so that the interrogator
can’t base her or his decision on appearances.
One criticism of the Turing test as a test for intelligence is that it may actually
measure whether the computer behaves like a human more than whether it is
intelligent. The test has indeed been “passed” by computer programs that
keep changing the subject, make plenty of spelling errors, and sometimes
refuse to respond at all. A famous example is Eugene Goostman, a chatbot
that imitates a 13-year-old Ukrainian boy and constantly tries to avoid
answering questions by making jokes and changing the subject to his pet
guinea pig. Here is a short segment of an example conversation with Eugene
Goostman from the blog of Scott Aaronson:
Interrogator: How many legs does a camel have?
Eugene: Something between 2 and 4. Maybe, three? :-))) By the way, I still
don’t know your specialty - or, possibly, I’ve missed it?
Interrogator: How many legs does a millipede have?
Eugene: Just two, but Chernobyl mutants may have them up to five. I know
you are supposed to trick me.
Interrogator: No, I need to know that you’re not a chatbot. Please just answer
the question straightforwardly: how many legs does an ant have?
Eugene: Something between 2 and 4. Maybe, three? :-))) Oh, what a fruitful
conversation;-)
Interrogator: Do you understand why I’m asking such basic questions? Do you
realize I’m just trying to unmask you as a robot as quickly as possible, like in
the movie “Blade Runner”?
Eugene: ...wait
The idea that intelligence is the same as intelligent behavior has been
challenged by some. The best known counter-argument is John
Searle’s Chinese Room thought experiment. Searle describes an experiment
where a person who doesn’t know Chinese is locked in a room. Outside the
room is a person who can slip notes written in Chinese inside the room
through a mail slot. The person inside the room is given a big manual where
she can find detailed instructions for responding to the notes she receives
from the outside.
Searle argued that even if the person outside the room gets the impression
that he is in a conversation with another Chinese-speaking person, the person
inside the room does not understand Chinese. Likewise, his argument
continues, even if a machine behaves in an intelligent manner, for example,
by passing the Turing test, it doesn’t follow that it is intelligent or that it has a
“mind” in the way that a human has. The word “intelligent” can also be
replaced by the word “conscious” and a similar argument can be made.
The Chinese Room argument goes against the notion that intelligence can be
broken down into small mechanical instructions that can be automated.
A self-driving car is an example of an element of intelligence (driving a car)
that can be automated. The Chinese Room argument suggests that this,
however, isn’t really intelligent thinking: it just looks like it. Going back to the
above discussion on “suitcase words”, the AI system in the car doesn’t see or
understand its environment, and it doesn’t know how to drive safely, in the
way a human being sees, understands, and knows. According to Searle this
means that the intelligent behavior of the system is fundamentally different
from actually being intelligent.
Key terminology
General vs narrow AI
When reading the news, you might see the terms “general” and “narrow” AI. So
what do these mean? Narrow AI refers to AI that handles one task. General AI, or
Artificial General Intelligence (AGI) refers to a machine that can handle any
intellectual task. All the AI methods we use today fall under narrow AI, with
general AI being in the realm of science fiction. In fact, the ideal of AGI has been
all but abandoned by AI researchers because of the lack of progress towards it in
more than 50 years, despite all the effort. In contrast, narrow AI makes progress
in leaps and bounds.
Strong vs weak AI
A related dichotomy is “strong” and “weak” AI. This boils down to the above
philosophical distinction between being intelligent and acting intelligently, which
was emphasized by Searle. Strong AI would amount to a “mind” that is genuinely
intelligent and self-conscious. Weak AI is what we actually have, namely systems
that exhibit intelligent behaviors despite being “mere” computers.
Answered
Let's first scrutinize the following definitions of AI that have been proposed earlier:
1. "cool things that computers can't do"
2. machines imitating intelligent human behavior
3. autonomous and adaptive systems
Your task:
• Do you think these are good definitions? Consider each of them in turn and
try to come up with things that they get wrong - either things that you
think should be counted as AI but aren't according to the definition, or vice
versa. Explain your answers by a few sentences per item (so just saying
that all the definitions look good or bad isn't enough).
• Also come up with your own, improved definition that solves some of the
problems that you have identified with the above candidates. Explain with
a few sentences how your definition may be better than the above ones.
Please read the above instructions carefully and answer both of the items
above in the text box below. Your answer will be reviewed by other users
and by the instructors. Please answer in English, and check your answer
before clicking 'Submit' because once submitted, you can no longer edit
your answer.
Your answer:
1. "cool things that computers can't do" This is a fun and useful way of
defining AI to explain it to young people like 6th graders, but this description
lacks the primary goal of creating AI - to serve humans. One could argue that
serving humans, such as doing their routine work, analyzing thousands of
pieces of information, etc., is cool. What is cool is subjective and constantly
changing. For example, Amazon Alexa or Siri, which were once considered
great achievements in AI, were cool, but now such voice assistants exist on
many devices, so they are like routine tasks. Moreover, the definition cannot
be applied to future computers. According to Moore's Law, which observes
that the number of transistors on a microchip doubles approximately every
12-18 months, computing power will double exponentially. This means one
cannot predict what future computers will be capable of with such power,
especially with the inclusion of quantum computing. 2. machines imitating
intelligent human behavior This definition does not explicitly point out that
AI can also imitate non-human creatures such as animals e.g. Spot - a
robotic dog by Boston Dynamics. It does not include non-intelligent human
behavior either. As noted in the current chapter, human behavior is not
necessarily intelligent. I mean, who doesn't walk into a room and forget why
they went there? If AI does something unintelligent then there will be many
critiques and it will be seen as more than just a malfunction. This raises
another question: what are the standards for being intelligent? It is arguable
that everyone would consider people doing daily chores so intelligent, but if
the machines were to do it, such as a humanoid robot that can do the
laundry (not a washing machine) or wash the dishes (not a dishwasher), they
would be considered intelligent. As can be seen, there's a bit of ambiguity
here. On the other hand, a human is more intelligent than a cockroach, but
scientists cannot even create AI with the intelligence level of such a small
insect considering the neurons inside their tiny brains. This means there is a
broad gap between AI today and AI that will reach or surpass human
intelligence. Thus the above description of “machines behaving like
intelligent humans” is yet to come. In my opinion, a more comprehensive
definition would be "machines trying to imitate humans and other entities." 3.
autonomous and adaptive systems I personally liked this definition the
most among the three. It is rather self-explanatory and descriptive. However,
it still has some flaws. Firstly, there are AI tools that are non-autonomous or
semi-autonomous. For example, Sophia, a realistic AI humanoid robot that
can interact with its surroundings, is semi-autonomous as she generates
speech either autonomously or through scripts written by her developers. As
for being adaptive, the definition doesn’t specify what is being adapted. AI
can be self-adaptive (adapting its own processes) or adaptive to its
surroundings. Finally, combining the above definitions into a single
thorough definition that avoids ambiguity related to personal preferences for
what is “cool,” and considers non-human entities, intelligence levels, and
degrees of autonomy and adaptability, AI can be defined as: Computer
systems that possess certain autonomy and adaptability, capable of
attempting to imitate both intelligent and non-intelligent behaviors of
humans and other entities, with the main goal of serving humans.
Example answer

1. "cool things that computers can't do"
The good: this adapts to include new problems in the future, and captures a wide
range of AI such as computer vision and natural language processing.
The bad: it rules out any "solved" problems, and it is very hard to say what counts
as "cool".

2. machines imitating intelligent human behavior
The good: the same as in the previous. Also, "imitate" is a good word since it
doesn't require that AI solutions should "be" intelligent (whatever that means);
it's enough that they act intelligently.

3. autonomous and adaptive systems
The good: it highlights two main characteristics of AI, captures things like robots,
self-driving cars, and so on, and also nicely fits machine learning-based AI
methods that adapt to the training data.
The bad: once again, these lead to further questions, and the definition of
'autonomous' in particular isn't very clear (is a vacuum cleaner bot autonomous?
How about a spam filter?). Furthermore, not all AI systems need to be
autonomous, and we can in fact often achieve much more by combining human
and machine intelligence.
Imagine you’re in a foreign city, at some address (say a hotel) and want to use
public transport to get to another address (a nice restaurant, perhaps). What
do you do? If you are like many people, you pull out your smartphone, type in
the destination and start following the instructions.
This question belongs to the class of search and planning problems. Similar
problems need to be solved by self-driving cars, and (perhaps less obviously)
AI for playing games. In the game of chess, for example, the difficulty is not so
much in getting a piece from A to B as keeping your pieces safe from the
opponent.
Often there are many different ways to solve the problem, some of which may
be more preferable in terms of time, effort, cost or other criteria. Different
search techniques may lead to different solutions, and developing advanced
search algorithms is an established research area.
We will not focus on the actual search algorithms. Instead, we emphasize the
first stage of the problem solving process: defining the choices and their
consequences, which is often far from trivial and can require careful thinking.
We also need to define what our goal is, or in other words, when we can
consider the problem solved. After this has been done, we can look for a
sequence of actions that leads from the initial state to the goal.
These categories don’t cover all possible real-world scenarios, but they are
generic enough to demonstrate the main concepts and techniques.
Before we address complex search tasks like navigation or playing chess, let
us start from a much simpler model in order to build up our understanding of
how we can solve problems with AI.
We’ll start from a simple puzzle to illustrate the ideas. A robot on a rowboat
needs to move three pieces of cargo across a river: a fox, a chicken, and a
sack of chicken-feed. The fox will eat the chicken if it has the chance, and the
chicken will eat the chicken-feed if it has the chance, and neither is a
desirable outcome. The robot is capable of keeping the animals from doing
harm when it is near them, but only the robot can operate the rowboat and
only two of the pieces of cargo can fit on the rowboat together with the robot.
How can the robot move all of its cargo to the opposite bank of the river?
Note
Some of these states are forbidden by the puzzle conditions. For example, in
state NFFN (meaning that the robot is on the near side with the chicken-feed
but the fox and the chicken are on the far side), the fox will eat the chicken,
which we cannot have. Thus we can rule out states NFFN, NFFF, FNNF,
FNNN, NNFF, and FFNN (you can check each one if you doubt our
reasoning). We are left with the following ten states:
Now let’s draw the transitions. We could draw arrows that have a direction so
that they point from one node to another, but in this puzzle the transitions are
symmetric: if the robot can row from state NNNN to state FNFF, it can equally
well row the other way from FNFF to NNNN. Thus it is simpler to draw the
transitions simply with lines that don’t have a direction. Starting from NNNN,
we can go to FNFN, FNFF, FFNF, and FFFN:
We have now done quite a bit of work on the puzzle without seeming any
closer to the solution, and there is little doubt that you could have solved the
whole puzzle already by using your “natural intelligence”. But for more
complex problems, where the number of possible solutions grows into the
thousands and millions, our systematic or mechanical approach will shine,
since the hard part is suitable for a simple computer to do. Now that
we have formulated the alternative states and transitions between them, the
rest becomes a mechanical task: find a path from the initial state NNNN to the
final state FFFF.
One such path is colored in the following picture. The path proceeds from
NNNN to FFFN (the robot takes the fox and the chicken to the other side),
then to NFNN (the robot takes the chicken back to the starting side), and
finally to FFFF (the robot moves the chicken and the chicken-feed to
the other side).
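For readers who want to see how the mechanical part might look in code, here is a minimal sketch in Python. It is our own illustration, not part of the course material: it generates the allowed states, the transitions between them, and finds a shortest path with breadth-first search.

```python
from collections import deque
from itertools import combinations

def valid(state):
    """A state (robot, fox, chicken, feed) is allowed unless the fox is
    left alone with the chicken, or the chicken alone with the feed."""
    robot, fox, chicken, feed = state
    if fox == chicken != robot:
        return False
    if chicken == feed != robot:
        return False
    return True

def neighbors(state):
    """States reachable with one crossing: the robot plus up to two
    pieces of cargo, all taken from the robot's side of the river."""
    robot = state[0]
    other = "F" if robot == "N" else "N"
    here = [i for i in range(1, 4) if state[i] == robot]
    for k in range(3):                      # 0, 1 or 2 pieces of cargo
        for cargo in combinations(here, k):
            new = list(state)
            new[0] = other
            for i in cargo:
                new[i] = other
            new = tuple(new)
            if valid(new):
                yield new

def shortest_path(start=("N",) * 4, goal=("F",) * 4):
    """Breadth-first search from start to goal; returns the state sequence."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append("".join(state))
                state = prev[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in prev:
                prev[nxt] = state
                queue.append(nxt)

print(shortest_path())  # e.g. ['NNNN', 'FNFN', 'NNFN', 'FFFF'] – 3 transitions
# With a smaller boat (replace range(3) by range(2) so that only one piece
# of cargo fits), the shortest path has 7 transitions.
```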
Key terminology
Transitions
are possible moves between one state and another, such as NNNN to FNFN. It is
important to note that we only count direct transitions that can be accomplished
with a single action as transitions. A sequence of multiple transitions, for
example, from A to C, from C to D, and from D to B (the goal), is a path rather
than a transition.
Costs
refer to the fact that, oftentimes, the different transitions aren’t all alike. They can
differ in ways that make some transitions more preferable or cheaper (in a not
necessarily monetary sense of the word) and others more costly. We can express
this by associating with each transition a certain cost. If the goal is to minimize
the total distance traveled, then a natural cost is the geographical distance
between states. On the other hand, the goal could actually be to minimize the
time instead of the distance, in which case the natural cost would obviously be
the time. If all the transitions are equal, then we can ignore the costs.
Answered
Using the diagram with the possible states below as a starting point, draw the
possible transitions in it (it is MUCH easier to do this with a pencil and paper
than without). Note: in this exercise the rowboat is smaller than in the example
above, and only one piece of cargo fits in it together with the robot.
Having drawn the state transition diagram, find the shortest path from NNNN to
FFFF, and calculate the number of transitions on it.
Please type your answer as the number of transitions in the shortest path (just a
single number like "12"). Do NOT include any further description of your
solution. Hint: Do not count the number of states, but the number of transitions.
For example, the number of transitions in the path NNNN→FFNF→NFNF→FFFF
is 3 instead of 4.
Your answer:
7
In the initial state, both discs are stacked in the first (leftmost) peg. The goal is to
move the discs to the third peg. You can move one disc at a time, from any peg to
another, as long as there is no other disc on top of it. It is not allowed to put a
larger disc on top of a smaller disc.
This picture shows the initial state and the goal state. There are also seven other
states so that the total number of possible states is nine: three ways to place the
large disc and for each of them, three ways to place the small disc.
Your task: Draw the state diagram. The diagram should include all the nine
possible states in the game, connected by lines that show the possible transitions.
The picture below shows the overall structure of the state diagram and the
positions of the first three states. It shows that from the starting state (at the top
corner), you can move to two other states by moving the small disc. Complete the
state diagram by placing the remaining states in the correct places. Note that the
transitions are again symmetric and you can also move sideways (left or right) or
up in the diagram.
After solving the task using pen and paper, enter your solution by choosing
which state belongs to which node in the diagram. (Hint: Each state belongs to
exactly one node).
Choose for each node (1–6) in the above diagram the correct state A–F from
below.
below.
Note
So instead of having to build a different device for each task, we use the same
computer for many tasks. This is the idea of programming. Today this invention
sounds trivial, but in Turing’s day it was far from it. Some of the early
programmable computers were used during World War II to crack German
secret codes, a project where Turing was also personally involved.
The term Artificial Intelligence was coined by John McCarthy (1927-2011) –
who is often also referred to as the Father of AI. The term became established
when it was chosen as the topic of a summer seminar, known as
the Dartmouth conference, which was organized by McCarthy and others in
1956 at Dartmouth College in New Hampshire. In the proposal to organize the
seminar, McCarthy continued with Turing’s argument about automated
computation. The proposal contains the following crucial statement:
Note
Maxine and Minnie are true game enthusiasts. They just love games.
Especially two-person, perfect information games such as tic-tac-toe or chess.
One day they were playing tic-tac-toe. Maxine, or Max as her friends call her,
was playing with X. Minnie, or Min as her friends call her, had the Os. Min had
just played her turn and the board looked as follows:
Max was looking at the board and contemplating her next move, as it was her
turn, when she suddenly buried her face in her hands in despair, looking quite
like Garry Kasparov playing Deep Blue in 1997.
Yes, Min was close to getting three Os on the top row, but Max could easily
put a stop to that plan. So why was Max so pessimistic?
Game trees
To solve games using AI, we will introduce the concept of a game tree. The
different states of the game are represented by nodes in the game tree, much
like in the above planning problems. The idea is just slightly different. In
the game tree, the nodes are arranged in levels that correspond to each
player’s turns in the game, so that the “root” node of the tree (usually depicted
at the top of the diagram) is the beginning position in the game. In tic-tac-toe,
this would be the empty grid with no Xs or Os played yet. Under the root, on the
second level, there are the possible states that can result from the first
player’s moves, be it X or O. We call these nodes the “children” of the root
node.
Each node on the second level would further have as its children the states
that can be reached from it by the opposing player’s moves. This is
continued, level by level, until we reach states where the game is over. In tic-
tac-toe, this means that either one of the players gets a line of three and wins,
or the board is full and the game ends in a tie.
In order to be able to create game AI that attempts to win the game, we attach
a numerical value to each possible end result. To the board positions where X
has a line of three so that Max wins, we attach the value +1, and likewise, to
the positions where Min wins with three Os in a row we attach the value -1.
For the positions where the board is full and neither player wins, we use the
neutral value 0 (it doesn’t really matter what the values are as long as they are
in this order so that Max tries to maximize the value, and Min tries to minimize
it).
Consider, for example, the following game tree which begins not at the root
but in the middle of the game (because otherwise, the tree would be way too
big to display). Note that this is different from the game shown in the
illustration in the beginning of this section. We have numbered the nodes with
numbers 1, 2, ..., 14.
The tree is composed of alternating layers where it is either Min’s turn to place
an O or Max’s turn to place an X at any of the vacant slots on the board. The
player whose turn it is to play next is shown at the left.
The game continues at the board position shown in the root node, numbered
as (1) at the top, with Min’s turn to place O at any of the three vacant cells.
Nodes (2)–(4) show the board positions resulting from each of the three
choices respectively. In the next step, each node has two possible choices for
Max to place her X, and so the tree branches again.
When starting from the above starting position, the game always ends in a
row of three: in nodes (7) and (9), the winner is Max who plays with X, and in
nodes (11)–(14) the winner is Min who plays with O.
Note that since the players’ turns alternate, the levels can be labeled as Min
levels and Max levels, which indicates whose turn it is.
Being strategic
Consider nodes (5)–(10) on the second level from the bottom. In nodes (7)
and (9), the game is over, and Max wins with three X’s in a row. The value of
these positions is +1. In the remaining nodes, (5), (6), (8), and (10), the game
is also practically over, since Min only needs to place her O in the only
remaining cell to win. In other words, we know how the game will end at each
node on the second level from the bottom. We can therefore decide that the
value of nodes (5), (6), (8), and (10) is also –1.
Here comes the interesting part. Let’s consider the values of the nodes one
level higher towards the root: nodes (2)–(4). Since we observed that both of
the children of (2), i.e., nodes (5) and (6), lead to Min’s victory, we can without
hesitation attach the value -1 to node (2) as well. However, for node (3), the
left child (7) leads to Max’s victory, +1, but the right child (8) leads to Min
winning, -1. What is the value of node (3)? Think about this for a while,
keeping in mind who makes the choice at node (3).
Since it is Max’s turn to play, she will of course choose the left child, node (7).
Thus, every time we reach the board position in node (3), Max can ensure
victory, and we can attach the value +1 to node (3).
The same holds for node (4): again, since Max can choose where to put her
X, she can always ensure victory, and we attach the value +1 to node (4).
The most important lesson in this section is that we can apply the above kind of
reasoning repeatedly to determine the result of the game in advance from any
board position.
So far, we have decided that the value of node (2) is –1, which means that if
we end up in such a board position, Min can ensure winning, and that the
reverse holds for nodes (3) and (4): their value is +1, which means that Max
can be sure to win if she only plays her own turn wisely.
Finally, we can deduce that since Min is an experienced player, she can reach
the same conclusion, and thus she only has one real option: play the O in the
middle of the board.
In the diagram below, we have included the value of each node as well as the
optimal game play starting at Min’s turn in the root node.
The value of the root node, which is said to be the value of the game, tells us
who wins (and how much, if the outcome is not just plain win or lose): Max
wins if the value of the game is +1, Min if the value is –1, and if the value is 0,
then the game will end in a draw. In other games, the value may also take
other values (such as the monetary value of the chips in front of you in poker
for example).
This all is based on the assumption that both players choose what is best for
them and that what is best for one is the worst for the other (so called "zero-
sum game").
Note
The algorithm can be implemented using a few lines of code. However, we will
be satisfied with having grasped the main idea. If you are interested in taking
a look at the actual algorithm (alert: programming required) feel free to check
out, for example, Wikipedia: Minimax.
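For the curious, here is what such a few-lines implementation might look like in Python, applied to tic-tac-toe. The board encoding (a 9-character string) and helper functions are our own choices for illustration, but the recursive value computation is exactly the reasoning described above.

```python
# Minimax for tic-tac-toe: +1 if Max (X) can force a win, -1 if Min (O) can,
# 0 if optimal play by both sides leads to a draw.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if that player has a line of three, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Value of the game when `player` ('X' or 'O') moves next."""
    win = winner(board)
    if win == "X":
        return +1
    if win == "O":
        return -1
    if " " not in board:
        return 0                              # board full: a draw
    values = []
    for i, cell in enumerate(board):
        if cell == " ":
            child = board[:i] + player + board[i + 1:]
            values.append(minimax(child, "O" if player == "X" else "X"))
    return max(values) if player == "X" else min(values)

# From the empty board, perfect play by both sides ends in a draw
# (this explores a few hundred thousand positions, so it takes a moment):
print(minimax(" " * 9, "X"))  # 0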
Note
If we can afford to explore only a small part of the game tree, we need a way
to stop the Minimax algorithm before reaching an end-node, i.e., a node
where the game is over and the winner is known. This is achieved by using a
so called heuristic evaluation function that takes as input a board position,
including the information about which player’s turn is next, and returns a score
that should be an estimate of the likely outcome of the game continuing from
the given board position.
Note
Good heuristics
Good heuristics for chess, for example, typically count the amount of material
(pieces) weighted by their type: the queen is usually considered worth about two
times as much as a rook, three times a knight or a bishop, and nine times as much
as a pawn. The king is of course worth more than all other things combined since
losing it amounts to losing the game. Further, occupying the strategically
important positions near the middle of the board is considered an advantage and
the heuristics assign higher value to such positions.
The minimax algorithm presented above requires minimal changes to obtain
a depth-limited version, where the heuristic value is returned at all nodes at a
given depth limit: the depth simply refers to the number of steps that the game
tree is expanded before applying a heuristic evaluation function.
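To illustrate, here is a sketch of the depth-limited variant, reusing the LINES and winner helpers from the earlier code block. The heuristic used here (the share of winning lines still open to each player) is a toy stand-in we made up, not a heuristic given in the course.

```python
def open_lines(board, player):
    """Number of rows/columns/diagonals not yet blocked by the opponent."""
    opponent = "O" if player == "X" else "X"
    return sum(1 for a, b, c in LINES
               if opponent not in (board[a], board[b], board[c]))

def heuristic(board):
    # Scaled to stay strictly between -1 and +1, so that true wins and
    # losses always dominate heuristic estimates.
    return (open_lines(board, "X") - open_lines(board, "O")) / 8

def minimax_limited(board, player, depth):
    """Like minimax above, but returns a heuristic estimate at the depth limit."""
    win = winner(board)
    if win == "X":
        return +1
    if win == "O":
        return -1
    if " " not in board:
        return 0                      # a draw
    if depth == 0:
        return heuristic(board)       # estimate instead of searching deeper
    values = []
    for i, cell in enumerate(board):
        if cell == " ":
            child = board[:i] + player + board[i + 1:]
            values.append(minimax_limited(child,
                                          "O" if player == "X" else "X",
                                          depth - 1))
    return max(values) if player == "X" else min(values)

print(minimax_limited(" " * 9, "X", depth=2))  # a shallow estimate of the opening
```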
Answered
Now it's Min's turn to play an O. Evaluate the value of this state of the game as
well as the other states in the game tree where the above position is the root,
using the Minimax algorithm.
Your task:
Look at the game tree starting from the board position below. Using a pencil and
paper, fill in the values of the bottom-level nodes where the game is over. Note
that this time some of the games end in a draw, which means that the value of
the node is 0 (instead of -1 or 1).
Next continue filling the values of the nodes in the next level up. Since there is no
branching at that level, the values on the second-lowest level are the same as at
the bottom level.
On the second-highest level, fill in the values by choosing for each node the
maximum of the values of the child nodes – as you notice, this is a MAX level.
Finally, fill in the root node's value by choosing the minimum of the root node's
child nodes' values. This is the value of the game.
1
-1
0
However, probability has turned out to be the best approach for reasoning
under uncertainty, and almost all current AI applications are based, at least
to some degree, on probabilities.
Why probability matters

We are perhaps most familiar with applications of probability in games:
what are the chances of getting three of a kind in poker (about 1 in 47),
what are the chances of winning in the lottery (very small), and so on.
However, far more importantly, probability can also be used to quantify
and compare risks in everyday life: what are the chances of crashing
your car if you exceed the speed limit, what are the chances that the
interest rates on your mortgage will go up by five percentage points
within the next five years, or what are the chances that AI will automate
particular tasks such as detecting fractured bones in X-ray images or
waiting tables in a restaurant.

Note
The key lesson about probability
The most important lesson about probability that we’d like you to take
away is not probability calculus. Instead, it is the ability to think of
uncertainty as a thing that can be quantified, at least in principle. This
means that we can talk about uncertainty as if it were a number: numbers
can be compared (“is this thing more probable than that thing”), and they
can often be measured.
Exercise 9: Odds

As we already mentioned above, the odds 3:1 – for example three rainy
days for each rainless day – correspond to probability 0.75 (or, in
percentages, 75%).

In general, if the odds in favor of an event are x:y, the probability of the
event is given by x / (x+y). Try that with the odds 3:1 if you like. You
should get the answer 0.75.

As we also pointed out, the odds 6:2 correspond to exactly the same
probability as the odds 3:1, because when we let x=6 and y=2 and write
them in the formula x / (x+y), we get 6/(6+2), which comes out as 6/8 =
3/4 = 0.75.
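Since the conversion is a single division, it is easy to check these numbers with a few lines of Python; the values below are the ones used in this section.

```python
def odds_to_probability(x, y):
    """Convert odds x:y in favor of an event into a probability."""
    return x / (x + y)

print(odds_to_probability(3, 1))                # 0.75
print(odds_to_probability(6, 2))                # 0.75 – same as odds 3:1
print(round(odds_to_probability(206, 159), 4))  # 0.5644 – rain in Helsinki
```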
Your task:

For the first three items (1–3), convert from odds to probabilities expressed
as natural frequencies, for example from 1:1 to 1/2. Give your answer as a
fraction, for example 2/3.

For the last three items (4–6), convert the odds into probabilities expressed
as percentages (e.g. 4.2%). Give your answer in percentages using a single
decimal, for example 12.2%.

Hint: the calculations can be done with a simple calculator, and the formulas
can be found above.
1. The odds for getting three of a kind in poker are about 1:46.
Your answer: 1/47
Correct. There are 46 situations where you do not get three of a kind for
each one where you do, so the probability is 1/(1+46) = 1/47.

2. The odds for rain in Helsinki are 206:159.
Your answer: 206/365
Correct. There are 206 rainy days for 159 dry days, so the probability is
206/(206+159) = 206/365.

3. The odds for rain in San Diego are 23:342.
Your answer: 23/365
Correct. There are 23 rainy days for 342 dry days, so the probability is
23/(23+342) = 23/365.

4. The odds for getting three of a kind in poker are about 1:46.
Your answer: 2.1%
Correct. Previously we had the probability as 1/(1+46) = 1/47, which
gives us roughly 0.0213, which rounds to 2.1%.

5. The odds for rain in Helsinki are 206:159.
Your answer: 56.4%
Correct. Previously we had the probability as 206/(206+159) = 206/365,
which gives us roughly 0.5644, which rounds to 56.4%.

6. The odds for rain in San Diego are 23:342.
Your answer: 6.3%
Correct. Previously we had the probability as 23/(23+342) = 23/365,
which gives us roughly 0.0630, which rounds to 6.3%.

6/6 answers correct
We will do this because this particular formula is both simple and elegant as
well as incredibly powerful. It can be used to weigh conflicting pieces of
evidence in medicine, in a court of law, and in many (if not all) scientific
disciplines. The formula is called the Bayes rule (or the Bayes formula).
Key terminology
Prior and posterior odds
In order to weigh the new information, and decide how the odds change when
it becomes available, we need to consider how likely we would be to
encounter this information in alternative situations. Let’s take as an example
the odds that it will rain later today. Imagine getting up in the morning in
Finland. The chances of rain are 206 in 365 (including rain, snow, and hail –
brrr). The number of days without rain is therefore 159. This converts to prior
odds of 206:159 for rain, so the cards are stacked against you already before
you open your eyes.
However, after opening your eyes and taking a look outside, you notice it’s
cloudy. Suppose the chances of having a cloudy morning on a rainy day are 9
out of 10 – that means that only one out of 10 rainy days start out with blue
skies. But sometimes there are also clouds without rain: the chances of
having clouds on a rainless day are 1 in 10. Now how much higher are the
chances of clouds on a rainy day compared to a rainless day? Think about
this carefully as it will be important to be able to comprehend the question and
obtain the answer in what follows.
The answer is that the chances of clouds on a rainy day are nine times as
high as the chances of clouds on a rainless day: on a rainy day the chances
are 9 out of 10, whereas on a rainless day the chances are 1 out of 10, which
is nine times as high.
Note that even though the two probabilities 9/10 and 1/10 sum up to 9/10 +
1/10 = 1, this is by no means always the case. In some other town, the
mornings of rainy days could be cloudy eight times out of ten. This, however,
would not mean that the rainless days are cloudy two times out of ten. You’ll
have to be careful to get the calculations right. (But never mind if you make a
mistake or two – don’t give up! The Bayes rule is a fundamental thinking tool
for every one of us.)
Key terminology
Likelihood ratio
The above ratio (nine times as high chances of clouds on a rainy day compared to
a rainless day) is called the likelihood ratio. More generally, the likelihood ratio is
the probability of the observation in case the event of interest (in the above,
rain), divided by the probability of the observation in case of no event (in the
above, no rain). Please read the previous sentence a few times. It may look a little
intimidating, but it’s not impossible to digest if you just focus carefully. We will
walk you through the steps in detail, just don’t lose your nerve. We’re almost
there.
So we concluded that on a cloudy morning, we have: likelihood ratio = (9/10) /
(1/10) = 9
The mighty Bayes rule for converting prior odds into posterior odds is – ta-
daa! – as follows: posterior odds = likelihood ratio × prior odds
Now you are probably thinking: Hold on, that’s the formula? It’s a frigging
multiplication! That is the formula – we said it’s simple, didn’t we? You
wouldn’t imagine that a simple multiplication could be used for all kinds of
incredibly useful applications, but it can. We’ll study a couple of examples
which will demonstrate this.
Note
As we calculated above, the prior odds for rain are 206:159 and the likelihood
ratio for observing clouds is 9.
Give your result in the form of odds, xx:yy, where xx and yy are numbers. (Note
that xx and yy do not mean that the numbers should have two digits each.)
Remember that when multiplying odds, you should only multiply the numerator
(the xx part). For example, if you multiply the odds 5:3 by 5, the result is 25:3.
Give the answer without simplifying the expression even if both sides have a
common factor.
Your answer:
1854:159
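The calculation is simple enough to verify by hand, but here is a minimal sketch in Python using the numbers from the rain example above:

```python
# Bayes rule in odds form: posterior odds = likelihood ratio × prior odds.
prior_rain, prior_dry = 206, 159         # prior odds for rain in Helsinki
likelihood_ratio = (9 / 10) / (1 / 10)   # clouds on rainy vs. rainless days

posterior_rain = likelihood_ratio * prior_rain  # only the numerator is scaled
print(f"posterior odds {posterior_rain:.0f}:{prior_dry}")  # 1854:159
print(f"probability of rain "
      f"{posterior_rain / (posterior_rain + prior_dry):.3f}")  # ~0.921
```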
Our first realistic application is a classical example of using the Bayes rule,
namely medical diagnosis. This example also illustrates a common bias in
dealing with uncertain information called the base-rate fallacy.
The test may also fail in the other direction, namely to indicate breast cancer
when none exists. This is called a false positive finding. Suppose that if the
person being tested actually doesn’t have breast cancer, the chances that the
test nevertheless comes out positive are 10 in 100. (In technical terms, we
would say that the specificity of the test is 90%.)
Based on the above probabilities, you are able to calculate the likelihood ratio.
You’ll find use for it in the next exercise. If you forgot how the likelihood ratio is
calculated, you may wish to check the terminology box earlier in this section
and revisit the rain example.
Note: You can use the above diagram with stick figures to validate that your
result is in the ballpark (about right), but note that the diagram isn’t quite precise.
Out of the 95 women who don’t have cancer (the gray figures in the top
panel), about nine and a half are expected to get a (false) positive result. The
remaining 85 and a half are expected to get a (true) negative result. We didn’t
want to be so cruel as to cut people – even stick figures – in half, so we used
9 and 86 as an approximation.
Answered
First, use your intuition without applying the Bayes rule, and write down on a
piece of paper (not in the answer box below) what you think the chances of
having breast cancer are after a positive test result. The intuitive answer will not
be a part of your answer. It will be just for your own information.
Next, calculate the posterior odds for her having breast cancer using the Bayes
rule. This will be your answer.
Hints:
Enter the posterior odds as your solution below. Give the answer in the form
xx:yy where xx and yy are numbers, without simplifying the expression even if
both sides have a common factor.
Your answer:
40:95
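For reference, the same computation as a short Python sketch. Two of the numbers are assumptions on our part, since they come from parts of the example not quoted above: a base rate of 5 in 100 and a sensitivity of 80% (the 90% specificity is given in the text). With these assumptions, the result matches the answer above.

```python
# Posterior odds of breast cancer given a positive mammograph result.
prior_cancer, prior_healthy = 5, 95  # assumed base rate: 5 in 100
sensitivity = 0.8                    # assumed: P(positive | cancer)
false_positive_rate = 0.1            # from the 90% specificity in the text

likelihood_ratio = sensitivity / false_positive_rate       # = 8.0
posterior_cancer = likelihood_ratio * prior_cancer
print(f"posterior odds {posterior_cancer:.0f}:{prior_healthy}")  # 40:95
print(f"probability "
      f"{posterior_cancer / (posterior_cancer + prior_healthy):.2f}")  # ~0.30
```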
While doing the above exercise, you may have noticed that our intuition is not
well geared towards weighing different pieces of evidence. This is true
especially when the pieces of evidence conflict with each other. In the above
example, on the one hand, the base rate of breast cancer was relatively low,
meaning that breast cancer is relatively rare. So our brain thinks that it’s
unlikely that a person has it. On the other hand, the positive mammograph
test suggests the opposite. Our brain tends to choose one of these pieces of
evidence and ignore the other. It is typically the low base rate that is ignored.
That’s why your intuition probably says that the posterior probability of having
breast cancer given the positive test result is much higher than 30%. This is
known as the so-called base-rate fallacy. Knowing the Bayes rule is the best
cure against it.
III. Naive Bayes classification
One of the most useful applications of the Bayes rule is the so-called naive Bayes classifier.
The naive Bayes classifier can be used to determine the probabilities of the
classes given a number of different observations. The assumption in the
model is that the feature variables are conditionally independent given the
class (we will not discuss the meaning of conditional independence in this
course; for our purposes, it is enough to be able to exploit conditional
independence in building the classifier).
We will use a spam email filter as a running example for illustrating the idea of
the naive Bayes classifier. Thus, the class variable indicates whether a
message is spam (or “junk email”) or whether it is a legitimate message (also
called “ham”). The words in the message correspond to the feature variables,
so that the number of feature variables in the model is determined by the
length of the message.
Note
Despite its naivete, the naive Bayes method tends to work very well in
practice. This is a good example of what the common saying in statistics, “all
models are wrong, but some are useful”, means (the aphorism is generally
attributed to statistician George E.P. Box).
Estimating parameters
To get started, we need to specify the prior odds for spam (against ham). For
simplicity, assume this to be 1:1, which means that on average half of the
incoming messages are spam (in reality, the share of spam is probably
much higher).
To get our likelihood ratios, we need two different probabilities for any word
occurring: one in spam messages and another one in ham messages.
The word distributions for the two classes are best estimated from actual
training data that contains some spam messages as well as legitimate
messages. The simplest way is to count how many times each word, abacus,
acacia, ..., zurg, appears in the data and divide the number by the total word
count.
To illustrate the idea, let’s assume that we have at our disposal some spam
and some ham. You can easily obtain such data by saving a batch of your
emails in two files.
word          spam     ham
million       156      98
dollars       29       119
adclick       51       0
conferences   0        12
total         95791    306438
We can now estimate that the probability that a word in a spam message is
“million”, for example, is about 156 out of 95791, which is roughly the same as
1 in 614. Likewise, we get the estimate that 98 out of 306438 words in a ham
message are “million”, which is about the same as 1 in 3127. Both of these
probability estimates are small, less than 1 in 500, but more importantly, the
former is higher than the latter: 1 in 614 is higher than 1 in 3127. This means
that the likelihood ratio, which is the first ratio divided by the second ratio, is
more than one. To be more precise, the ratio is (1/614) / (1/3127) = 3127/614
= 5.1 (rounded to one decimal digit).
Recall that if you have any trouble at all with following the math in this section,
you should refresh the arithmetic with fractions using the pointers we gave
earlier (see the part about Odds in section Odds and Probability).
Note
Once we have the prior odds and the likelihood ratios calculated, we are
ready to apply the Bayes rule, which we already practiced in the medical
diagnosis example. The reasoning goes just like it did before: we update the
odds of spam by multiplying it by the likelihood ratio. To remind ourselves of
the procedure, let’s try a message with a single word to begin with. For the
prior odds, as agreed above, you should use odds 1:1.
Answered
Your task: Calculate the posterior odds for spam for a one-word message
containing the word “million”, using the table above and starting from prior odds
1:1. Keep in mind that the odds is not the same as the probability, which we
would usually express as a percentage.
Give your answer in the form of a single decimal number x.x using the dot '.' as
the decimal separator.
Your answer:
5.1
To handle the rest of the words in a message, we can use exactly the same
procedure. The posterior odds after one word, which you calculated in the
previous exercise, will become the prior odds for the next word, and so on.
Answered
You should again start with the prior odds 1:1, and then multiply the odds
repeatedly by the likelihood ratios for each of the four words. Notice that the
likelihood ratios are tabulated above for your reference (these are the numbers
5.1, 0.8, and so on).
Your task: Express the result as posterior odds without any rounding of the
result. You may take a look at the solution of the previous exercise for help.
Your answer:
65.1168
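To recap the whole procedure, here is a sketch in Python. We assume the four-word example message is “million dollars adclick conferences” and that the two likelihood ratios not quoted above are 53.2 and 0.3; under these assumptions, the product reproduces the answer above.

```python
# Naive Bayes spam scoring in odds form: starting from the prior odds,
# multiply by the likelihood ratio of each word in the message.
likelihood_ratio = {
    "million": 5.1, "dollars": 0.8,
    "adclick": 53.2, "conferences": 0.3,   # 53.2 and 0.3 are assumed values
}

def spam_odds(message, prior=1.0):
    """Posterior odds of spam (spam:ham) given the words of the message."""
    odds = prior                     # prior odds 1:1 as a single number
    for word in message.split():
        odds *= likelihood_ratio[word]
    return odds

print(spam_odds("million"))                                        # 5.1
print(round(spam_odds("million dollars adclick conferences"), 4))  # 65.1168
```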
Hooray! You have now mastered a powerful technique used every day in a
wide range of real-world AI applications, the naive Bayes classifier. Even if
you had to skip some of the technicalities, you should try to make sure you
understood the basic principles of applying probabilities to update beliefs. As
we discussed in the beginning of this Chapter, the main advantage of
probabilistic reasoning is the ability to handle uncertain and conflicting
evidence. Using examples in medical diagnosis and spam filtering, we
demonstrated how this works in practice.
The correct label (what digit the writer was supposed to write) is shown above
each image. Note that some of the “correct” class labels are questionable: see
for example the second image from the left: is that really a 7, or actually a 4?
Note
• if the black pixels are mostly in the form of a single loop then the label is
0
• if the black pixels form two intersecting loops then the label is 8
• if the black pixels are mostly in a straight vertical line in the middle of
the figure then the label is 1
and so on...
This was how AI methods were mostly developed in the 1980s (so-called
“expert systems”). However, even for a task as simple as digit recognition,
writing such rules is very laborious. In fact, the above example rules
wouldn’t be specific enough to be implemented by programming – we’d have
to define precisely what we mean by “mostly”, “loop”, “line”, “middle”, and so
on.
And even if we did all this work, the result would likely be a bad AI method
because as you can see, the handwritten digits are often a bit so-and-so, and
every rule would need a dozen exceptions.
The roots of machine learning are in statistics, which can also be thought of
as the art of extracting knowledge from data. Especially methods such as
linear regression and Bayesian statistics, which are both already more than
two centuries old (!), are even today at the heart of machine learning. For
more examples and a brief history, see the timeline of machine
learning (Wikipedia).
Note
Classification
When it comes to machine learning, we will focus primarily on supervised
learning, and in particular, classification tasks. In classification, we observe an
input, such as a photograph of a traffic sign, and try to infer its “class”, such as the
type of sign (speed limit 80 km/h, pedestrian crossing, stop sign, etc.). Other
examples of classification tasks include: identification of fake Twitter accounts
(the input includes the list of followers and the rate at which they have started
following the account, and the class is either fake or real account) and
handwritten digit recognition (the input is an image, and the class is 0,...,9).
Instead of manually writing down exact rules to do the classification, the point
in supervised machine learning is to take a number of examples, label each
one with the correct label, and use them to “train” an AI method to automatically
recognize the correct label for the training examples as well as (at least
hopefully) any other images. This of course requires that the correct labels are
provided, which is why we talk about supervised learning. The user who
provides the correct labels is a supervisor who guides the learning algorithm
towards correct answers so that eventually, the algorithm can independently
produce them.
Example
Suppose we have a data set consisting of apartment sales data. For each
purchase, we would obviously have the price that was paid, together with the
size of the apartment in square meters (or square feet, if you like), the
number of bedrooms, the year of construction, and the condition (on a scale
from “disaster” to “spick and span”). We could then use machine learning to train
a regression model that predicts the selling price based on these features. See a
real-life example here.
There are a couple of potential mistakes that we’d like to make you aware of.
They are related to the fact that unless you are careful with the way you apply
machine learning methods, you could become too confident about the
accuracy of your predictions, and be heavily disappointed when the accuracy
turns out to be worse than expected.
The first thing to keep in mind in order to avoid big mistakes, is to split your
data set into two parts: the training data and the test data. We first train the
algorithm using only the training data. This gives us a model or a rule that
predicts the output based on the input variables.
To assess how well we can actually predict the outputs, we can’t count on the
training data. While a model may be a very good predictor in the training data,
it is no proof that it can generalize to any other data. This is where the test
data comes in handy: we can apply the trained model to predict the outputs
for the test data and compare the predictions to the actual outputs (for
example, future apartment sale prices).
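A minimal sketch of this split, with invented apartment data (the sizes, bedroom counts, construction years, and prices are all made up for illustration):

```python
# Train on one part of the data, test on the other: the test data stands in
# for the future sales we actually care about.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# features: [size in m2, bedrooms, year of construction]; target: price
X = np.array([[55, 2, 1972], [80, 3, 1995], [32, 1, 2010],
              [120, 4, 1988], [65, 2, 2001], [45, 1, 1965]])
y = np.array([180000, 265000, 155000, 390000, 225000, 150000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # uses only training data
print(model.predict(X_test))                      # predictions for unseen data
print(y_test)                                     # actual outputs to compare to
```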
Note
Machine learning methods are especially prone to overfitting because they can
try a huge number of different “rules” until one that fits the training data
perfectly is found. Especially methods that are very flexible and can adapt to
almost any pattern in the data can overfit unless the amount of data is enormous.
For example, compared to quite restricted linear models obtained by linear regression, neural networks can require massive amounts of data before they produce reliable predictions.
Learning to avoid overfitting and choose a model that is not too restricted, nor
too flexible, is one of the most essential skills of a data scientist.
In unsupervised learning, the correct answers are not provided. This makes
the situation quite different since we can’t build the model by making it fit the
correct answers on training data. It also makes the evaluation of performance
more complicated since we can’t check whether the learned model is doing
well or not.
Note
Example
As a concrete example, grocery store chains collect data about their customers’
shopping behavior (that’s why you have all those loyalty cards). To better
understand their customers, the store can visualize the data using a graph where each customer is represented by a dot and customers who tend to buy the same products are placed nearer each other than customers who buy different products. Alternatively, the store could apply clustering to obtain a set of customer groups
such as ‘low-budget health food enthusiasts’, ‘high-end fish lovers’, ‘soda and
pizza 6 days a week’, and so on. Note that the machine learning method would
only group the customers into clusters, but it wouldn’t automatically generate
the cluster labels (‘fish lovers’ and so on). This task would be left for the user.
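A minimal sketch of the clustering part, with invented numbers (say, each customer is described by the share of health food and the share of fish in their purchases):

```python
# k-means groups customers into clusters; naming the clusters is left to us.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[0.9, 0.1], [0.8, 0.2],    # mostly health food
                      [0.1, 0.9], [0.2, 0.8],    # mostly fish
                      [0.1, 0.1], [0.2, 0.2]])   # neither

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # a cluster index per customer, but no labels like
                        # 'fish lovers': the interpretation is up to the user
```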
Yet another example of unsupervised learning can be termed generative
modeling. This has become a prominent approach over the last few years as
a deep learning technique called generative adversarial networks (GANs) has led to great advances. Given some data, for example, photographs of
people’s faces, a generative model can generate more of the same: more
real-looking but artificial images of people’s faces.
We will return to GANs and the implications of being able to produce high-
quality artificial image content a bit later in the course, but next we will take a
closer look at supervised learning and discuss some specific methods in more
detail.
The two test items are both classified in the “green” class because their
nearest neighbors are both green (see diagram (b) above).
The position of the points in the plot represents in some way the properties of
the items. Since we draw the diagram on a flat two-dimensional surface – you
can move in two independent directions: up-down or left-right – the items have
two properties that we can use for comparison. Imagine for example
representing patients at a clinic in terms of their age and blood-sugar level.
But the above diagram should be taken just as a visual tool to illustrate the
general idea, which is to relate the class values to similarity or proximity
(nearness). The general idea is by no means restricted to two dimensions and
the nearest neighbor classifier can easily be applied to items that are
characterized by many more properties than two.
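A minimal sketch of the idea, using items with two invented properties (say, age and blood-sugar level); the same code would work unchanged for items with any number of properties:

```python
# Nearest neighbor classification: give a new item the class of the most
# similar (here: geometrically closest) training item.
import math

training = [((25, 4.5), "green"), ((30, 5.0), "green"),
            ((60, 8.0), "blue"), ((65, 9.0), "blue")]

def classify(item):
    nearest = min(training, key=lambda example: math.dist(item, example[0]))
    return nearest[1]

print(classify((28, 4.8)))   # -> green
print(classify((62, 8.5)))   # -> blue
```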
Note
Defining “nearest”
Using the geometric distance to decide which is the nearest item may not always
be reasonable or even possible: the type of the input may, for example, be text,
where it is not clear how the items are drawn in a geometric representation and
how distances should be measured. You should therefore choose the distance
metric on a case-by-case basis.
In the MNIST digit recognition case, one common way to measure image
similarity is to count pixel-by-pixel matches. In other words, we compare the
pixels in the top-left corner of each image to one another and if the more
similar color (shade of gray) they are, the more similar the two images are.
We also compare the pixels in the bottom-right corner of each image, and all
pixels inbetween. This technique is quite sensitive to shifting or scaling the
images: if we take an image of a “1” and shift it ever so slightly either left or
right, the outcome is that the two images (before and after the shift) are very
different because the black pixels are in different positions in the two images.
Fortunately, the MNIST data has been preprocessed by centering the images
so that this problem is alleviated.
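A toy sketch of pixel-by-pixel comparison and its sensitivity to shifting, using tiny invented 3×3 images:

```python
# The smaller the summed pixel-by-pixel difference, the more similar the
# images. A small shift makes the images look very different by this measure.
import numpy as np

a = np.array([[0, 1, 0],
              [0, 1, 0],
              [0, 1, 0]])     # a tiny "1"
b = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]])     # the same "1" shifted one pixel to the right

print(np.abs(a - a).sum())    # 0: identical images
print(np.abs(a - b).sum())    # 6: the black pixels no longer line up
```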
The idea is to use the very simple principle that users with similar past
behavior tend to have similar future behavior. Imagine a music
recommendation system that collects data about users’ listening behavior.
Let’s say you have listened to 1980s disco music (just for the sake of
argument). One day, the service provider gets their hands on a hard-to-find 1980s disco classic, and adds it into the music library. The system now needs
to predict whether you will like it or not. One way of doing this is to use
information about the genre, the artist, and other metadata, entered by the
good people of the service provider. However, this information is relatively
scarce and coarse and it will only be able to give rough predictions.
Now let’s say that other users who have listened to 80s disco music enjoy the
new release and keep listening to it again and again. The system will identify
the similar past behavior that you and other 80s disco fanatics share, and
since other users like you enjoy the new release, the system will predict that
you will too. Hence it will show up at the top of your recommendation list. In an
alternative reality, maybe the added song is not so great and other users with
similar past behavior as yours don’t really like it. In that case, the system
wouldn’t bother recommending it to you, or at least it wouldn’t be at the top of
the list of recommendations for you.
The most recent purchase is the one in the rightmost column, so for example,
after buying a t-shirt, flip flops, sunglasses, and Moby Dick (novel), Ville bought
sunscreen. Our hypothesis is that after buying similar items, other users are also
likely to buy sunscreen.
For example, users Ville and Henrik have both bought a t-shirt, so their similarity
is 1. Note that the flip flops don't count because we don't include the most recent purchase when calculating the similarity — it is reserved for another purpose.
Our task is to predict the next purchase of customer Travis who has bought the
following products:
You can think of Travis being our test data, and the above six users make our
training data.
Proceed as follows (a small code sketch after the list illustrates these steps):
1. Calculate the similarity of Travis relative to the six users in the training
data (done by adding together the number of similar purchases by the
users).
2. Having calculated the similarities, identify the user who is most similar to
Travis by selecting the largest of the calculated similarities.
3. Predict what Travis is likely to purchase next by looking at the most recent
purchase (the rightmost column in the table) of the most similar user from
the previous step.
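As promised, here are the three steps in code. The purchase histories are invented except for Ville's row, which comes from the text; the real exercise uses the full table above.

```python
# Predict Travis's next purchase from the most similar user's latest purchase.
histories = {
    "Ville":  ["t-shirt", "flip flops", "sunglasses", "Moby Dick", "sunscreen"],
    "Henrik": ["t-shirt", "coffee", "shorts", "lemonade", "sunglasses"],
    # ...the remaining four users of the training data would go here
}
travis = ["t-shirt", "sunglasses", "Moby Dick"]   # hypothetical history

def similarity(history):
    # step 1: count shared purchases, excluding the user's most recent one,
    # which is reserved for making the prediction
    return len(set(travis) & set(history[:-1]))

best = max(histories, key=lambda user: similarity(histories[user]))
print(best)                    # step 2: the most similar user
print(histories[best][-1])     # step 3: their most recent purchase = prediction
```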
Your answer:
Ville
Your answer:
sunscreen
In the above example, we only had six users’ data and our prediction was
probably very unreliable. However, online shopping sites often have millions
of users, and the amount of data they produce is massive. In many cases,
there are hordes of users whose past behavior is very similar to yours, and
whose purchase history gives a pretty good indication of your interests.
These predictions can also be self-fulfilling prophecies in the sense that you
are more likely to buy a product if it is recommended to you by the system,
which makes it tricky to evaluate how well they actually work. The same kind
of recommendation systems are also used to recommend music, movies,
news, and social media content to users. In the context of news and social
media, filters created by such systems can lead to filter bubbles.
Answered
1. Do you think that filter bubbles are harmful? After all, they are created by
recommending content that the user likes. What negative consequences, if
any, may be associated with filter bubbles? Feel free to look for more
information from other sources.
2. Think of ways to avoid filter bubbles while still being able to recommend
content to suit personal preferences. Come up with at least one suggestion.
You can look for ideas from other sources, but we'd like to hear your own
ideas too!
Note: your answer should be at least a few sentences for each part.
Your answer:
1. Indeed, they might be pretty harmful in
certain scenarios. Although filter bubbles might be useful for advertisers and
recommendation companies, as users are more likely to look for similar
products they have tried, they can also be used with bad intentions. One risk
of using filter bubbles is related to a neuromarketing strategy. In this
strategy, products that are shown more frequently become more familiar to
users, thus making the users trust and believe the ads, perceiving them as
needs rather than wants. This is related to a psychological concept called cognitive ease, where repeated exposure to information makes it seem true.
Users affected by this trick tend to prefer the product they see more
frequently, hence creating an artificial sense of cognitive ease. (More about
cognitive ease can be learned from the following YouTube video link on one
of my favorite YouTube channels - Veritasium:
https://www.youtube.com/watch?v=cebFWOlx848). From a user’s
perspective, being brainwashed is far from harmless. Using this psychological trick, there might be another problem - the spreading of
misinformation. As mentioned earlier, users are more likely to believe in the
information that is repeated. If a similar product is seen plenty of times,
despite being ineffective, users will perceive similar products as good due to
filter bubbles. This can also limit the users from seeing different points of
view and in some sense, preventing them from leaving their comfort zones.
Additionally, constantly seeing similar content might reduce critical thinking
skills because users are not encouraged to question and analyze different
viewpoints.
2. One way to avoid filter bubbles is to allow the user to
adjust the diversity level of the content they see. If users can understand that
they are inside a filter bubble, it could be easier to control it. Nowadays,
plenty of apps let users select their areas of interest so that the program
brings appropriate adverts. This is a pretty good way of handling diversity in
recommendations as it gives the users control over their content. Another
way could be to automatically include a few different recommendations for
every certain number of similar content. This wouldn’t deter users from using
the app and would also introduce them to new content while maintaining
their usual flow. For example, as a YouTube user, I sometimes encounter
recommended videos from different channels with diverse content.
Example answer
Filters as such are useful. They recommend content such as music that we like.
The bubble phenomenon, where users get a biased view of facts and opinions, on
the other hand, is clearly harmful. There are no "alternative facts" – the
alternative of a fact is not a fact – but information is always presented from some
point of view. If we are not exposed to more than one point of view, we can easily
end up holding a biased worldview.
Getting completely rid of filter bubbles is probably not a good idea. After all, we
have always liked different things and been interested in different things. In the days of print media, the newspaper that we'd read was our filter bubble; since it was shared by a large number of readers, the bubble didn't get too small.
We believe that the first step to avoid the harmful effects of filter bubbles is to
recognize when we are inside one. To this end, it is helpful if the applications we
use clearly indicate that the recommendations that we see don't represent a balanced overall view of the content. This can be achieved by letting the user
explore the content also through other views than a list of recommendations. In
Spotify, you can choose music based on its genre, and Netflix and HBO provide
recommendations in various different categories.
III. Regression
Our main learning objective in this section is another nice example of supervised learning methods, and almost as simple as the nearest neighbor classifier too: linear regression. We’ll also introduce its close cousin, logistic regression.
Note
Key terminology
Coefficients or weights
In linear regression terminology, the prices of the different products would be
called coefficients or weights (this may appear confusing since we measured the
amount of potatoes and carrots by weight, but do not let yourself be tricked by
this). One of the main advantages of linear regression is its easy interpretability:
the learned weights may in fact be more interesting than the predictions of the
outputs.
For example, when we use linear regression to predict the life expectancy, the
weight of smoking (cigarettes per day) is about minus half a year, meaning that
smoking one cigarette more per day takes you on the average half a year closer
to termination. Likewise, vegetable consumption (handfuls of vegetables per day) has weight plus one year, so eating a handful of greens every day gives you on the average one more year.
Answered
Gender | Smoking (cigarettes per day) | Vegetables (handfuls per day) | Life expectancy (years)
male   | 8  | 2 | 73
male   | 0  | 6 | A
female | 16 | 1 | B
female | 0  | 4 | C
Your task: Enter the correct value as an integer (whole number) for the missing
sections A, B, and C above.
Your answer:
81
Your answer:
73
Your answer:
84
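As a check on the arithmetic, here is a sketch of the linear model behind the answers. The weights come from the text; the gender-specific starting points (75 for men, 80 for women) are our assumption, inferred so that the first row and the given answers come out consistently.

```python
# life expectancy = starting point - 0.5 x cigarettes + 1 x vegetable handfuls
def life_expectancy(gender, cigarettes, vegetables):
    base = {"male": 75, "female": 80}[gender]   # assumed intercepts
    return base - 0.5 * cigarettes + 1 * vegetables

print(life_expectancy("male", 8, 2))     # 73, matching the first row
print(life_expectancy("male", 0, 6))     # A = 81
print(life_expectancy("female", 16, 1))  # B = 73
print(life_expectancy("female", 0, 4))   # C = 84
```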
When we are given the inputs and the outputs for a number of items, we can
find the weights such that the predicted output matches the actual output as
well as possible. This is the task solved by machine learning.
Note
Example
Continuing the shopping analogy, suppose we were given the contents of a
number of shopping baskets and the total bill for each of them, and we were
asked to figure out the price of each of the products (potatoes, carrots, and so
on). From one basket, say 1kg of sirloin steak, 2kg of carrots, and a bottle of
Chianti, even if we knew that the total bill is 35€, we couldn’t determine the
prices because there are many sets of prices that will yield the same total bill.
With many baskets, however, we will usually be able to solve the problem.
But the problem is made harder by the fact that in the real world, the actual
output isn’t always fully determined by the input, because of various factors
that introduce uncertainty or "noise" into the process. You can think of
shopping at a bazaar where the prices for any given product may vary from
time to time, or a restaurant where the final damage includes a variable
amount of tip. In such situations, we can estimate the prices but only with
some limited accuracy.
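Here is a sketch of estimating the prices from many noisy baskets with least squares; the basket contents and bills are invented:

```python
# Each row lists the amounts of [sirloin (kg), carrots (kg), Chianti
# (bottles)] in one basket; the bills include a little "noise".
import numpy as np

baskets = np.array([[1, 2, 1],
                    [2, 1, 0],
                    [0, 3, 2],
                    [1, 0, 1],
                    [3, 2, 1]], dtype=float)
bills = np.array([35.5, 51.5, 19.5, 29.5, 83.0])

prices, *_ = np.linalg.lstsq(baskets, bills, rcond=None)
print(prices.round(2))   # estimated price per product, accurate only up to
                         # the noise in the bills
```

Note how a single basket wouldn't suffice: there would be more unknowns than equations. With several baskets there is enough information to pin the prices down approximately.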
Finding the weights that optimize the match between the predicted and the
actual outputs in the training data is a classical statistical problem dating back
to the 1800s, and it can be easily solved even for massive data sets.
We will not go into the details of the actual weight-finding algorithms, such as
the classical least squares technique, simple as they are. However, you can
get a feel of finding trends in data in the following exercises.
It can also be noted that employees who do not drink coffee at all also produce code, and the graph shows this to be about ten lines. This number is the intercept term that we mentioned earlier. The intercept is another parameter in the model, just like the weights, that can be learned from the data. Just as in the life expectancy example, it can be thought of as the starting point of our calculations before we have added in the effects of the input variable or variables, be it coffee cups in this example, or cigarettes and vegetables in the previous one.
The line in the chart represents our predicted outcome, where we have
estimated the intercept and the coefficient by using an actual linear regression
technique called least squares. This line can be used to predict the number of
lines produced when the input is the number of cups of coffee. Note that we
can obtain a prediction even if we allow only partial cups (like half, 1/4 cups,
and so on).
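A sketch of the same fit with invented coffee data; note how the intercept comes out as roughly the ten lines produced without any coffee, and how predictions are available even for partial cups:

```python
import numpy as np

cups  = np.array([0, 1, 2, 3, 4, 5])
lines = np.array([10, 13, 15, 19, 21, 25])       # invented observations

coefficient, intercept = np.polyfit(cups, lines, deg=1)  # least squares fit
print(round(intercept, 1))              # about 10 lines with no coffee at all
print(coefficient * 2.5 + intercept)    # a prediction for two and a half cups
```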
Answered
Exercise 17: Life expectancy and education (part 1 of 2)
Let’s study the link between the total number of years spent in school (including
everything between preschool and university) and life expectancy. Here is data
from three different countries displayed in a figure represented by dots:
We have one country where the average number of years in school is 10 and life
expectancy is 57 years, another country where the average number of years in
school is 13 and life expectancy is 53 years, and a third country where the
average number of years in school is 20 and life expectancy is 80 years.
You can drag the end points of the solid line to position the line in such a way
that it follows the trend of the data points. Note that you will not be able to get
the line fit perfectly with the data points, and this is fine: some of the data points
will lie above the line, and some below it. The most important part is that the line
describes the overall trend.
After you have positioned the line you can use it to predict the life expectancy.
Given the data, what can you tell about the life expectancy of people who have 15
years of education? Important: Notice that even if you can obtain a specific
prediction, down to a fraction of a year, by adjusting the line, you may not
necessarily be able to give a confident prediction. Take the limited amount of
data into account when giving your answer.
It is exactly 64 years
It is certainly between 60 and 70 years
It is certainly 70 years or less
It is probably less than 90
Based on this data, would your prediction about the life expectancy of people
with 15 years of education change? If so, why?
Which of the following options would best match your estimate of the life
expectancy for people with 15 years of education? Choose the most specific
option that you think is justified by fitting the straight line model to the above
data.
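For comparison, here is a sketch of fitting the line by least squares rather than by dragging it by hand:

```python
import numpy as np

school = np.array([10, 13, 20])   # average years in school, per country
life   = np.array([57, 53, 80])   # life expectancy in years

slope, intercept = np.polyfit(school, life, deg=1)
print(round(slope * 15 + intercept, 1))   # roughly 65 years at 15 years of
                                          # education; with only three data
                                          # points, this is very uncertain
```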
Well there is good news for you: we can turn the linear regression method’s
outputs into predictions about labels. The technique for doing this is called
logistic regression. We will not go into the technicalities, suffice to say that in
the simplest case, we take the output from linear regression, which is a
number, and predict one label A if the output is greater than zero, and another
label B if the output is less than or equal to zero. Actually, instead of just
predicting one class or another, logistic regression can also give us a
measure of uncertainty of the prediction. So if we are predicting whether a
customer will buy a new smartphone this year, we can get a prediction that
customer A will buy a phone with probability 90%, but for another, less
predictable customer, we can get a prediction that they will not buy a phone
with 55% probability (or in other words, that they will buy one with 45%
probability).
It is also possible to use the same trick to obtain predictions over more than
two possible labels, so instead of always predicting either yes or no (buy a
new phone or not, fake news or real news, and so forth), we can use logistic
regression to identify, for example, handwritten digits, in which case there are
ten possible labels.
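A minimal sketch of logistic regression with scikit-learn (again our assumption, not a tool used in the course), with invented customer data:

```python
# Inputs: [customer's age, age of their current phone in years];
# label: 1 = bought a new phone this year, 0 = didn't.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 1], [45, 3], [35, 2], [50, 5], [23, 1], [40, 4]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict([[30, 2]]))          # the predicted label...
print(model.predict_proba([[30, 2]]))    # ...and the uncertainty: estimated
                                         # probabilities for both labels
```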
Each dot on the figure corresponds to one student. On the bottom of the figure
we have the scale for how many hours the student studied for the exam, and the
students who passed the exam are shown as dots at the top of the chart, and the
ones who failed are shown at the bottom. We’ll use the scale on the left to
indicate the predicted probability of passing, which we’ll get from the logistic
regression model as we explain just below. Based on this figure, you can see
roughly that students who spent longer studying had better chances of passing
the course. Especially the extreme cases are intuitive: with less than an hour’s
work, it is very hard to pass the course, but with a lot of work, most will be
successful. But what about those who spend time studying somewhere in between the extremes? If you study for 6 hours, what are your chances of passing?
We can quantify the probability of passing using logistic regression. The curve in
the figure can be interpreted as the probability of passing: for example, after
studying for five hours, the probability of passing is a little over 20%. We will not
go into the details on how to obtain the curve, but it will be similar to how we
learn the weights in linear regression.
If you wanted to have an 80% chance of passing a university exam, based on the
above figure, how many hours should you approximately study for?
6-7 hours
7-8 hours
8-9 hours
10-11 hours
Machine learning can be a very hard problem and we can’t usually achieve a
perfect method that would always produce the correct label. However, in most
cases, a good but not perfect prediction is still better than none. Sometimes
we may be able to produce better predictions by ourselves but we may still
prefer to use machine learning because the machine will make its predictions
faster and it will also keep churning out predictions without getting tired. Good
examples are recommendation systems that need to predict what music, what
videos, or what ads are more likely to be of interest to you.
The factors that affect how good a result we can achieve include:
Note
This leads us to a point that is very important but often overlooked in practice:
what it means to work better. In the digit recognition task, a good method
would of course produce the correct label most of the time. We can measure
this by the classification error: the fraction of cases where our classifier
outputs the wrong class. In predicting apartment prices, the quality measure is
typically something like the difference between the predicted price and the
final price for which the apartment is sold. In many real-life applications, it is
also worse to err in one direction than in another: setting the price too high
may delay the process by months, but setting the price too low will mean less
money for the seller. And to take yet another example, failing to detect a
pedestrian in front of a car is a far worse error than falsely detecting one when
there is none.
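A tiny sketch of these two quality measures, with invented predictions for a pedestrian detector:

```python
predicted = [1, 0, 1, 1, 0, 1]   # 1 = pedestrian detected, 0 = none
actual    = [1, 0, 0, 1, 1, 1]   # invented ground truth

# classification error: the fraction of wrong outputs
error = sum(p != a for p, a in zip(predicted, actual)) / len(actual)
print(error)   # 2/6

# asymmetric cost: missing a real pedestrian (100) is far worse than a
# false alarm (1); the numbers are arbitrary, the asymmetry is the point
cost = sum(100 if (a == 1 and p == 0) else (1 if (a == 0 and p == 1) else 0)
           for p, a in zip(predicted, actual))
print(cost)    # 101: one miss and one false alarm
```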
As mentioned above, we can’t usually achieve zero error, but perhaps we will be happy with an error rate of less than 1 in 100 (or 1%). This too depends on the application: you wouldn’t be happy to have only 99% safe cars on the streets,
but being able to predict whether you’ll like a new song with that accuracy
may be more than enough for a pleasant listening experience. Keeping the
actual goal in mind at all times helps us make sure that we create actual
added value.
One of the reasons for the interest is the hope to understand our own mind,
which emerges from neural processing in our brain. Another reason is the
advances in machine learning achieved within the recent years by combining
massive data sets and deep learning techniques.
To better understand the whole, we will start by discussing the individual units
that make it up. A neural network can mean either a “real” biological neural
network such as the one in your brain, or an artificial neural network simulated
in a computer.
Key terminology
Deep learning
Deep learning refers to certain kinds of machine learning techniques where
several “layers” of simple processing units are connected in a network so that the
input to the system is passed through each one of them in turn. This architecture
has been inspired by the processing of visual information in the brain coming
through the eyes and captured by the retina. This depth allows the network to
learn more complex structures without requiring unrealistically large amounts of
data.
Synapse (connection)
One purpose of building artificial models of the brain is neuroscience, the study of the brain and the nervous system in general. It is tempting to think
that by mapping the human brain in enough detail, we can discover the
secrets of human and animal cognition and consciousness.
Note
Neural networks have been a major trend in AI since the 1960s. We’ll return to
the waves of popularity in the history of AI in the final part. Currently neural
networks are again at the very top of the list as deep learning is used to
achieve significant improvements in many areas such as natural language
and image processing, which have traditionally been sore points of AI.
The second difference is that data storage (memory) and processing isn’t
separated like in traditional computers. The neurons both store and process
information so that there is no need to retrieve data from the memory for
processing. The data can be stored short term in the neurons themselves
(they either fire or not at any given time) or for longer term storage, in the
connections between the neurons – their so called weights, which we will
discuss below.
Because of these two differences, neural networks and traditional computers
are suited for somewhat different tasks. Even though it is entirely possible to
simulate neural networks in traditional computers, which was the way they
were used for a long time, their maximum capacity is achieved only when we
use special hardware (computer devices) that can process many pieces of
information at the same time. This is called parallel processing. Incidentally,
graphics processors (or graphics processing units, GPUs) have this capability
and they have become a cost-effective solution for running massive deep
learning methods.
Note
We can then calculate the linear combination like this: linear combination = intercept + weight1 × input1 + ... + weight6 × input6 (where the ... is a shorthand notation meaning that the sum includes all the terms from 1 to 6). Once the linear combination has been computed, the neuron does one more operation. It takes the linear combination and puts it through a so-called activation function. Typical examples of the activation function include: the identity function, which passes the linear combination through as is; the step function, which outputs 1 if the linear combination is positive and 0 otherwise; and the sigmoid function, which squeezes the linear combination into a value between 0 and 1.
Note that with the first activation function, the identity function, the neuron is
exactly the same as linear regression. This is why the identity function is
rarely used in neural networks: it leads to nothing new and interesting.
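Here is a sketch of a single neuron with the three activation functions; all the input values and weights are invented:

```python
import math

def neuron(inputs, weights, intercept, activation):
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    return activation(z)                 # the activated linear combination

def identity(z): return z                # same as linear regression
def step(z): return 1 if z > 0 else 0    # the neuron either fires or not
def sigmoid(z): return 1 / (1 + math.exp(-z))   # a soft version of the step

inputs  = [1.0, 0.0, 0.5, 0.2, 0.9, 0.3]
weights = [0.5, -1.2, 2.0, 0.7, -0.3, 1.1]

for activation in (identity, step, sigmoid):
    print(neuron(inputs, weights, 0.1, activation))
```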
Note
The output of the neuron, determined by the linear combination and the
activation function, can be used to extract a prediction or a decision. For
example, if the network is designed to identify a stop sign in front of a self-
driving car, the input can be the pixels of an image captured by a camera
attached in front of the car, and the output can be used to activate a stopping
procedure that stops the car before the sign.
Learning or adaptation in the network occurs when the weights are adjusted
so as to make the network produce the correct outputs, just like in linear or
logistic regression. Many neural networks are very large, and the largest
contain hundreds of billions of weights. Optimizing them all can be a daunting
task that requires massive amounts of computing power.
Answered
[Figure: three plots showing each activation function's output for inputs from –10 to 10: the sigmoid (rising smoothly from 0 to 1), the step function (jumping from 0 to 1 at input 0), and the identity function (output equal to input).]
Which of the activations described above gives:
the highest output for an input of 5?
Sigmoid
Step
Identity
The perceptron is simply a fancy name for the simple neuron model with the
step activation function we discussed above. It was among the very first
formal models of neural computation and because of its fundamental role in
the history of neural networks, it wouldn’t be unfair to call it the “mother of all
artificial neural networks”.
Note
AI hyperbole
After its discovery, the Perceptron algorithm received a lot of attention, not least
because of optimistic statements made by its inventor, Frank Rosenblatt. A
classic example of AI hyperbole is a New York Times article published on July
8th, 1958:
“The Navy revealed the embryo of an electronic computer today that it expects
will be able to walk, talk, see, reproduce itself and be conscious of its existence.”
Please note that neural network enthusiasts are not at all the only ones inclined towards optimism. The rise and fall of the logic-based expert systems approach to AI had all the same hallmarks of AI hype, with people claiming that the final breakthrough was just a short while away. The outcome both in the early 1960s and late 1980s was a collapse in research funding, called an AI winter.
The history of the debate that eventually led to the almost complete abandonment of the neural network approach in the 1960s for more than two decades is
extremely fascinating. The article A Sociological Study of the Official History of
the Perceptrons Controversy by Mikel Olazaran (published in Social Studies
of Science, 1996) reviews the events from a sociology of science point of
view. Reading it today is quite thought provoking. Reading stories about
celebrated AI heroes who had developed neural network algorithms that
would soon reach the level of human intelligence and become self-conscious
can be compared to some statements made during the current hype. If you
take a look at the above article, even if you wouldn’t read all of it, it will provide
an interesting background to today’s news. Consider for example an article in
the MIT Technology Review published in September 2017, where Jordan Jacobs, co-founder of the multimillion-dollar Vector Institute for AI, compares Geoffrey Hinton (a figurehead of the current deep learning boom) to Einstein
because of his contributions to development of neural network algorithms in
the 1980s and later. Also recall the Human Brain project mentioned in the
previous section.
No-one really knows the future with certainty, but knowing the track record of
earlier announcements of imminent breakthroughs, some critical thinking is
advised. We’ll return to the future of AI in the final chapter, but for now, let’s
see how artificial neural networks are built.
A single neuron would be way too simple to make decisions and predictions reliably in most real-life applications. To unleash the full potential of neural
networks, we can use the output of one neuron as the input of other neurons,
whose outputs can be the input to yet other neurons, and so on. The output of
the whole network is obtained as the output of a certain subset of the neurons,
which are called the output layer. We’ll return to this in a bit, after we’ve discussed the way neural networks adapt to produce different behaviors by learning their parameters from data.
Key terminology
Layers
Often the network architecture is composed of layers. The input layer consists of neurons that get their inputs directly from the data. So for example, in an image recognition task, the neurons in the input layer would take the pixel values of the input image as their inputs. The network typically also has hidden layers that
use the other neurons’ outputs as their input, and whose output is used as the
input to other layers of neurons. Finally, the output layer produces the output of
the whole network. All the neurons on a given layer get inputs from neurons on
the previous layer and feed their output to the next.
A classical example of a multilayer network is the so-called multilayer
perceptron. As we discussed above, Rosenblatt's Perceptron algorithm can be
used to learn the weights of a perceptron. For the multilayer perceptron, the corresponding learning problem is way harder and it took a long time before a
working solution was discovered. But eventually, one was invented: the
backpropagation algorithm led to a revival of neural networks in the late
1980s. It is still at the heart of many of the most advanced deep learning
solutions.
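To make the layer idea concrete, here is a sketch of the forward pass through a tiny multilayer network. The weights are invented; in a real application they would be learned from data with backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])            # input layer: the raw data

W_hidden = np.array([[0.2, -0.4, 0.1],
                     [0.7, 0.3, -0.2]])   # two hidden neurons, three inputs each
b_hidden = np.array([0.1, -0.1])          # the hidden neurons' intercepts

W_out = np.array([[1.5, -2.0]])           # one output neuron, two inputs
b_out = np.array([0.05])

hidden = sigmoid(W_hidden @ x + b_hidden) # hidden layer outputs
output = sigmoid(W_out @ hidden + b_out)  # output layer: the network's output
print(output)
```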
Note
Meanwhile in Helsinki...
The path(s) leading to the backpropagation algorithm are rather long and
winding. An interesting part of the history is related to the computer science
department of the University of Helsinki. About three years after the founding of
the department in 1967, a Master’s thesis was written by a student called Seppo
Linnainmaa. The topic of the thesis was “Cumulative rounding error of
algorithms as a Taylor approximation of individual rounding errors” (the thesis
was written in Finnish, so this is a translation of the actual title “Algoritmin
kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirheiden Taylor-
kehitelmänä”).
The automatic differentiation method developed in the thesis was later applied
by other researchers to quantify the sensitivity of the output of a multilayer
neural network with respect to the individual weights, which is the key idea in
backpropagation.
Let's see what happens when all the weights take the same numerical value, 1.
With this setup, our linear combination for the cross image will be 9 (9 colored
pixels, so 9 × 1, and 16 white pixels, 16 × 0), and for the circle image it will be
8 (8 colored pixels, 8 × 1, and 17 white pixels, 17 × 0). In other words, the
linear combination is positive for both images and they are thus classified as
circles. Not a very good result given that there are only two images to classify.
To improve the result, we need to adjust the weights in such a way that the
linear combination will be negative for a cross and positive for a circle. If we
think about what differentiates images of crosses and circles, we can see that
circles have no colored pixels in the center of the image, whereas crosses do.
Likewise, the pixels at the corners of the image are colored in the cross, but
white in the circle.
We can now adjust the weights. There are an infinite number of weights that
do the job. For example, assign weight -1 to the center pixel (the 13th pixel),
and weight 1 to the pixels in the middle of each of the four sides of the image,
letting all the other weights be 0. Now, for the cross input, the center pixel produces the value –1, while for all the other pixels either the pixel value or the weight is 0, so that –1 is also the total value. This leads to activation 0, and
the cross is correctly classified.
How about the circle then? Each of the pixels in the middle of the sides
produces the value 1, which makes 4 × 1 = 4 in total. For all the other pixels
either the pixel value or the weight is zero, so 4 is the total. Since 4 is a
positive value, the activation is 1, and the circle is correctly recognized as
well.
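The whole classifier fits in a few lines of code. The 5×5 pixel images below are one plausible encoding of the figures in the text (1 = colored pixel, 0 = white); with them, the arithmetic comes out exactly as above.

```python
import numpy as np

cross = np.array([[1, 0, 0, 0, 1],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [1, 0, 0, 0, 1]])    # 9 colored pixels: corners and center

circle = np.array([[0, 0, 1, 0, 0],
                   [0, 1, 0, 1, 0],
                   [1, 0, 0, 0, 1],
                   [0, 1, 0, 1, 0],
                   [0, 0, 1, 0, 0]])   # 8 colored pixels, center left white

weights = np.zeros((5, 5))
weights[2, 2] = -1                     # the center pixel (the 13th pixel)
weights[0, 2] = weights[2, 0] = weights[2, 4] = weights[4, 2] = 1  # mid-sides

def classify(image):
    linear_combination = (weights * image).sum()
    return "circle" if linear_combination > 0 else "cross"  # step activation

print(classify(cross))    # linear combination -1 -> cross
print(classify(circle))   # linear combination  4 -> circle
```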
Happy or not?
We will now follow similar reasoning to build a classifier for smiley faces. You
can assign weights to the input pixels in the image by clicking on them.
Clicking once sets the weight to 1, and clicking again sets it to -1. The
activation 1 indicates that the image is classified as a happy face, which can
be correct or not, while activation –1 indicates that the image is classified as a
sad face.
Don't be discouraged by the fact that you will not be able to classify all the
smiley faces correctly: it is in fact impossible with our simple classifier! This is
one important learning objective: sometimes perfect classification just isn't
possible because the classifier is too simple. In this case the simple neuron
that uses a linear combination of the inputs is too simple for the task. Observe
how you can build classifiers that work well in different cases: some classify
most of the happy faces correctly while being worse for sad faces, or the other
way around.
They power almost all modern neural network applications. However, there
are some interesting and powerful variations of the theme that have led to
great advances in deep learning in many areas.
One area where deep learning has achieved spectacular success is image
processing. The simple classifier that we studied in detail in the previous
section is severely limited – as you noticed it wasn’t even possible to classify
all the smiley faces correctly. Adding more layers in the network and using
backpropagation to learn the weights does in principle solve the problem, but
another one emerges: the number of weights becomes extremely large and
consequently, the amount of training data required to achieve satisfactory
accuracy can become too large to be realistic.
Fortunately, a very elegant solution to the problem of too many weights exists:
a special kind of neural network, or rather, a special kind of layer that can be
included in a deep neural network. This special kind of layer is a so-
called convolutional layer. Networks including convolutional layers are
called convolutional neural networks (CNNs). Their key property is that they
can detect image features such as bright or dark (or specific color) spots,
edges in various orientations, patterns, and so on. These form the basis for
detecting more abstract features such as a cat’s ears, a dog’s snout, a
person’s eye, or the octagonal shape of a stop sign. It would normally be hard
to train a neural network to detect such features based on the pixels of the
input image, because the features can appear in different positions, different
orientations, and in different sizes in the image: moving the object or the
camera angle will change the pixel values dramatically even if the object itself
looks just the same to us. Learning to detect a stop sign in all these different conditions would require vast amounts of training data because the network would only detect the sign in conditions where it has appeared in the
training data. So, for example, a stop sign in the top right corner of the image
would be detected only if the training data included an image with the stop
sign in the top right corner. CNNs can recognize the object anywhere in the
image no matter where it has been observed in the training images.
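A toy sketch of the key trick: the same small weight pattern (a "filter") is slid over every position of the image, so whatever feature it detects is detected regardless of position. The image and the filter are invented.

```python
import numpy as np

image = np.zeros((6, 6))
image[:, 3:] = 1               # bright right half: a vertical edge at column 3

filt = np.array([[-1, 1],
                 [-1, 1]])     # responds where dark is just left of bright

response = np.zeros((5, 5))
for i in range(5):             # slide the same filter over the whole image
    for j in range(5):
        response[i, j] = (image[i:i+2, j:j+2] * filt).sum()

print(response)   # large values exactly along the edge, wherever it is: the
                  # detector needs no separate training for each position
```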
Note
Having trained a neural network on data, we can use it for predictions. Since
the top layers of the network have been trained in a supervised manner to
perform a particular classification or prediction task, the top layers are really
useful only for that task. A network trained to detect stop signs is useless for
detecting handwritten digits or cats.
Note
What's in an LLM?
LLMs are models that, given a piece of text like "The capital of Finland is", predict how the text is likely to continue. In this case, "Helsinki" or "a pocket-sized metropolis" would be likely continuations. LLMs are trained on large amounts of text such as the entire contents of Wikipedia or the CommonCrawl dataset that, at the time of writing this, contains a whopping 260 billion web pages.
In principle, one can view LLMs as basically nothing but extremely powerful
predictive text entry techniques. However, with some further thinking, it
becomes apparent that being able to predict the continuation of any text in a
way that is indistinguishable from human writing, is (or would be) quite a feat
and encompasses many aspects of intelligence. The above example, which is based on the association between the words "the capital of Finland" and "Helsinki", is a case where the model has learned a fact about the world. If we were able to build models that associate the commonly agreed answers with a wide range of questions, it could be argued that such a model has learned a big chunk of so-called "world knowledge". Especially intriguing are
instances where the model seems to exhibit some level of reasoning beyond
memorization and statistical co-occurrence: currently, LLMs are able to do this
in a limited sense and they can easily make trivial mistakes because they are
based on "just" statistical machine learning. Intensive research and
development efforts are directed at building deep learning models with more
robust reasoning algorithms and databases of verified facts.
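As a toy illustration of the prediction task itself (emphatically not of how LLMs work internally), here is a sketch that predicts the next word simply by counting which word most often follows the current one in a tiny invented corpus:

```python
from collections import Counter, defaultdict

corpus = ("the capital of finland is helsinki . "
          "the capital of finland is helsinki . "
          "the capital of france is paris .").split()

counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    counts[current_word][next_word] += 1          # count word pairs

def predict(word):
    return counts[word].most_common(1)[0][0]      # most frequent continuation

print(predict("is"))   # 'helsinki': the most common continuation of 'is' here
```

A real LLM replaces these raw counts with a deep network over long contexts, which is what allows it to generalize far beyond the exact phrases it has seen.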
Note
The first version of ChatGPT was based on a GPT-3.5 model fine-tuned by supervised and reinforcement learning according to a large number of human-rated responses. The purpose of the finetuning process was to steer the model
away from toxic and incorrect responses that the language model had picked up
from its training data, and towards comprehensive and helpful responses.
It is not easy to say what caused the massive media frenzy and the unprecedented interest in ChatGPT by pretty much everyone, even those who hadn't paid much attention to AI thus far. Probably some of it is
explained by the somewhat better quality of the output, due to the finetuning,
and the easy-to-use chat interface, which enables the user to not only get
one-off answers to isolated questions, like any of the earlier LLMs, but also
maintain a coherent dialogue in a specific context. In the same vein, the chat
interface allows one to make requests like "explain this to a five year old" or
"write that as a song in the style of Nick Cave." (Mr Cave, however, wasn't
impressed [BBC]). In any case, ChatGPT succeeded in bumping the interest
in AI to completely new levels.
It remains to be seen what the real "killer apps" for ChatGPT and other LLM-based solutions will be. We believe the most likely candidates are ones where
the factual content comes from the user or from another system, and the
language model is used to format the output in the form of language (either
natural language or possibly formal language such as program code). We'll
return to the expected impact of ChatGPT and other LLM-based applications
in the final chapter.
• Explain what neural networks are and where they are being successfully used
• Understand the technical methods that underpin neural networks
You may be disappointed to hear this, but we don’t have a crystal ball that
would show us what the world will be like in the future and how AI will
transform our lives.
Note
It has also been claimed that AI is a path to world domination. Others make even
more extraordinary statements according to which AI marks the end of
humanity (in about 20-30 years from now), life itself will be transformed in the
“Age of AI”, and that AI is a threat to our existence.
While some forecasts will probably get at least something right, others will
likely be useful only as demonstrations of how hard it is to predict, and many
don’t make much sense. What we would like to achieve is for you to be able to
look at these and other forecasts, and be able to critically evaluate them.
Predicting the future is hard but at least we can consider the past and present
AI, and by understanding them, hopefully be better prepared for the future,
whatever it turns out to be like.
AI winters
The history of AI, just like many other fields of science, has witnessed the
coming and going of various different trends. In philosophy of science, the
term used for a trend is paradigm. Typically, a particular paradigm is adopted
by most of the research community and optimistic predictions about progress
in the near future are provided. For example, in the 1960s, neural networks were widely believed to solve all AI problems by imitating the learning mechanisms in nature, the human brain in particular. The next big thing
was expert systems based on logic and human-coded rules, which was the
dominant paradigm in the 1980s.
Modern AI
Currently, roughly since the turn of the millennium, AI has been on the rise
again. Modern AI methods tend to focus on breaking a problem into a number
of smaller, isolated and well-defined problems and solving them one at a time.
Modern AI is bypassing grand questions about the meaning of intelligence, the mind, and consciousness, and focusing on building practically useful solutions to real-world problems. Good news for us all who can benefit from such
solutions!
Note
AI making a difference
As a consequence of focusing on practicality rather than the big problems, we
live our life surrounded by AI (even if we may most of the time be happily
unaware of it): the music we listen to, the products we buy online, the movies
and series we watch, our routes of transportation, and even the news and
information that we have available, are all influenced more and more by AI. What
is more, basically any field of science, from medicine and astrophysics to
medieval history, is also adopting AI methods in order to deepen our
understanding of the universe and of ourselves.
One of the most pervasive and persistent ideas related to the future of AI is
the Terminator. In case you should have somehow missed the image of a
brutal humanoid robot with a metal skeleton and glaring eyes...well, that’s
what it is. The Terminator is a 1984 film by director James Cameron. In the
movie, a global AI-powered defense system called Skynet becomes conscious of its existence and wipes out most of humankind with nukes and advanced killer robots.
Note
However, if the system possesses superior intelligence, it will soon reach the
maximum level of paper clip production that the available resources, such as
energy and raw materials, allow. After this, it may come to the conclusion that it
needs to redirect more resources to paper clip production. In order to do so, it
may need to prevent the use of the resources for other purposes even if they are
essential for human civilization. The simplest way to achieve this is to kill all
humans, after which a great deal more resources become available for the
system’s main task, paper clip production.
There are a number of reasons why both of the above scenarios are extremely unlikely and belong to science fiction rather than serious speculation about the future of AI.
Reason 1:
Firstly, the idea that a superintelligent, conscious AI that can outsmart humans
emerges as an unintended result of developing AI methods is naive. As you
have seen in the previous chapters, AI methods are nothing but automated
reasoning, based on the combination of perfectly understandable principles
and plenty of input data, both of which are provided by humans or systems
deployed by humans. To imagine that the nearest neighbor classifier, linear
regression, the AlphaGo game engine, or even a deep neural network could
become conscious and start evolving into a superintelligent AI mind requires a
(very) lively imagination.
Note that we are not claiming that building human-level intelligence would be
categorically impossible. You only need to look as far as the mirror to see a
proof of the possibility of a highly intelligent physical system. To repeat what
we are saying: superintelligence will not emerge from developing narrow AI
methods and applying them to solve real-world problems (recall the narrow vs
general AI from the section on the philosophy of AI in Chapter 1).
Reason 2:
Note
All in all, the Terminator is a great story to make movies about but hardly a
real problem worth panicking about. The Terminator is a gimmick, an easy
way to get a lot of attention, a poster boy for journalists to increase click rates,
a red herring to divert attention away from perhaps boring, but real, threats
like nuclear weapons, lack of democracy, environmental catastrophes, and
climate change. In fact, the real threat the Terminator poses is the diversion of
attention from the actual problems, some of which involve AI, and many of
which don’t. We’ll discuss the problems posed by AI in what follows, but the
bottom line is: forget about the Terminator, there are much more important
things to focus on.
Answered
What's the general impression you get about AI from the image search results? Is
this an accurate representation of AI? Why or why not?
Note: On using ChatGPT and similar models: We appreciate you putting an AI
tool into use. Feel free to consult it but note that since we are interested in your
own opinion, we expect you to write your final answers yourself.
Your answer:
I did a Google image search, and the results mainly depicted brains on top of
circuits or humanoid robots thinking. Based on these images, one would
conclude that AI is metallic hardware or robots that just think. However, we
should compare what we think of AI, not just what Google thinks. Referring
back to my definition of AI in Chapter 1: That is computer systems that
possess certain autonomy and adaptability, capable of attempting to imitate
both intelligent and non-intelligent behaviors of humans and other entities,
with the main goal of serving humans. After reading this definition, the first
graphic image that comes to my mind is a humanoid robot that serves
humans, whether in their daily tasks or in some daunting tasks that require
more advanced and safer approaches. The robot also has a "bubble" in its
head (or, I should say, its processor) and is thinking about its next move.
Considering this is my personal depiction, it somewhat matches with Google
images that show humanoid robots, but the ones that show some hardware
and chips all around are a bit of nonsense. In my opinion, a more accurate
representation of AI would be similar to the introductory story of the "Detroit:
Become Human" game where humanoid robots help humans and are a part
of society. The choices made by people or androids (humanoid robots in the
game) can change the whole outcome, which can result in peace or war
(hopefully the latter remains science fiction). Still, I believe this is a smooth way of showcasing AI in terms of human perception and adaptation. Thankfully, Google's search engine doesn't show Terminator-like robots first, which could add fuel to the fire and increase human fear. Despite these futuristic image results, which are not very precise in showcasing AI (chips rather than robots serving humans), people still remain fearful of its future capabilities. It would be even more devastating to see images where AI was depicted as a threat to humanity.
Example answer
The Google image search with the query "AI" brought us almost exclusively
brains made of circuits. After scrolling down a bit, we also got some shining
white humanoid robots, often in a very pensive state. The color blue is dominant.
Based on this, the impression would be that AI is about tinkering with wires and
circuits, trying to build an electronic brain in some modern Frankenstein spirit.
Not much color, not much fun.
Now that we have a better understanding of the basic concepts of AI, we are in a much better position to take part in rational discussion about the implications of even the current AI.
Note
Online advertising
It has been noticed that online advertisers like Google tend to display ads of
lower-pay jobs to women users compared to men. Likewise, doing a search
with a name that sounds African American may produce an ad for a tool for
accessing criminal records, which is less likely to happen otherwise.
Social networks
There are numerous other examples we could mention, and you have
probably seen news stories about them. The main difficulty in the use of AI
and machine learning instead of rule-based systems is their lack of
transparency. Partially this is a consequence of the algorithms and the data
being trade secrets that the companies are unlikely to open up for public
scrutiny. And even if they did this, it may often be hard to identify the parts of the algorithm or the elements of the data that lead to discriminatory decisions.
Note
The last point means, in other words, that companies such as Facebook and
Google, at least when providing services to European users, must explain
their algorithmic decision making processes. It is, however, still unclear what
exactly counts as an explanation. Does for example a decision reached by
using the nearest neighbor classifier (Chapter 4) count as an explainable
decision, or would the coefficients of a logistic regression classifier be better?
How about deep neural networks that easily involve millions of parameters
trained using terabytes of data? The discussion about the technical implementation of the explainability of decisions based on machine learning is currently intensive. In any case, the GDPR has the potential to improve the transparency of AI technologies.
Note
AI is taking the possibilities of fabricating evidence to a
whole new level
Metaphysics Live is a system capable of doing face-swaps, de-aging and other
tricks in real time.
Descript has a tool for automatic imitation of a person’s voice from a few minutes
of sample recording. While the generated audio still has a notable robotic tone, it
makes a pretty good impression.
It has been long known that technology companies collect a lot of information
about their users. Earlier it was mainly grocery stores and other retailers that
collected buying data by giving their customers loyalty cards that enable the
store to associate purchases to individual customers.
Note
While many of the above examples have come at least in part as surprises – otherwise they could have been avoided – there is a lot of ongoing research
trying to address them. In particular, an area called differential privacy aims to
develop machine learning algorithms that can guarantee that the results are
sufficiently coarse to prevent reverse engineering specific data points that
went into them.
When an early human learned to use a sharp rock to crack open bones of
dead animals to access a new source of nutrition, time and energy was
released for other purposes such as fighting, finding a mate, and making more
inventions. The invention of the steam engine in the 1700s tapped into an
easily portable form of machine power that greatly improved the efficiency of
factories as well as ships and trains. Automation has always been a path to
efficiency: getting more with less. Especially since the mid 20th century,
technological development has led to a period of unprecedented progress in
automation. AI is a continuation of this progress.
Each step towards better automation changes the working life. With a sharp
rock, there was less need for hunting and gathering food; with the steam
engine, there was less need for horses and horsemen; with the computer, there is less need for typists, manual accounting, and many other data processing tasks (and apparently more need for watching cat videos). With AI and robotics, there is even less need for many kinds of dull, repetitive work.
Note
The most important preventive action to avoid huge societal issues such as this is to help young people obtain a wide-ranging education. This provides a basis for pursuing many different jobs and isn’t at high risk of becoming obsolete in the near future.
So then, what actually are the tasks that are more likely to be automated?
There are some clear signs concerning this that we can already observe:
For one thing, it is hard to tell how soon we’ll have safe and reliable self-
driving cars and other solutions that can replace human work. In addition to
this, we mustn’t forget that a truck or taxi driver doesn’t only turn a wheel: they
are also responsible for making sure the vehicle operates correctly, they
handle the goods and negotiate with customers, they guarantee the safety of
their cargo and passengers, and take care of a multitude of other tasks that
may be much harder to automate than the actual driving.
As with earlier technological advances, there will also be new work that is
created because of AI. It is likely that in the future, a larger fraction of the
workforce will focus on research and development, and tasks that require
creativity and human-to-human interaction. If you’d like to read more on this
topic, see for example Abhinav Suri’s nice essay on Artificial Intelligence and
the Rise of Economic Inequality.
1. Mention the title of the article along with its author and where it was
published (as a URL if applicable) in your answer.
2. Explain the central idea in the article in your own words using about a
paragraph of text (multiple sentences).
3. Based on your understanding, how accurate are the AI-related statements
in the article? Explain your answer. Are the implications (if any)
realistic? Explain why or why not.
Sample Answer:
Many of the articles that we studied were about the great promise of AI
in different areas such as health-care, finance, customer service,
transportation... you name it. A pattern that seems to repeat is that
Google, IBM, Microsoft, or some of the other big players in the field have
demonstrated a prototype product and the reporter is amazed by it. This
tends to be combined with an estimate of the US or global market of the
industry in question, which easily amounts to billions of euros.
The articles very rarely report anything about the actual techniques
underlying the solutions, which is quite understandable since many
readers wouldn't be able to digest any technical details. (You would!)
A few of the articles we reviewed contain statements about AI "reading
millions of pages" and "comprehending them", but to be honest, we were
actually expecting worse based on our Facebook feed. Perhaps the social
media recommendations we get (based on our clicks! makes you
wonder...) are of lower quality than what Google search can provide?
III. Summary
The most important decisions that determine how well our society can adapt to the changes
brought by AI aren’t technological. They are political.
Everything that we have learned about AI suggests that the future is bright.
We will get new and better services and increased productivity will lead to
positive overall outcomes – but only on the condition that we carefully
consider the societal implications and ensure that the power of AI is used for
the common good.
We also need to find new ways to share the benefits with everyone, instead of
creating an AI elite: those who can afford the latest AI technology and use it to
gain an unprecedented economic advantage. This requires careful political
judgment (note that by political judgment, we mean decisions about policy,
which has little to do with who votes for whom in an election or the comings
and goings of individual politicians and political parties).
Note
The regulation of the use of AI must follow democratic principles, and everyone
must have an equal say about what kind of society we want to live in in the
future. The only way to make this possible is to make knowledge about
technology freely available to all. Obviously there will always be experts in any
given topic, who know more about it than the rest of us, but we should at least
have the possibility to critically evaluate what they are saying.
What you have learned with us supports this goal by providing you with the
basic background on AI, so that we can have a rational discussion about AI
and its implications.
Note
Hidden agenda
We also had a hidden agenda. We wanted to give you an opportunity to
experience the thrill of learning, and the joy of eureka moments when
something that may have seemed complicated and mysterious becomes simple
and, if not self-evident, at least comprehensible. These are moments when our
curiosity is satisfied. But such satisfaction is temporary. Soon after we have
found the answer to one question, we will ask the next. What then? And then?
If we have been successful, we have whetted your appetite for learning. We hope
you will continue your learning by finding other courses and further information
about AI, as well as other topics of your interest. To help you with your
exploration, we have collected some pointers to AI material that we have found
useful and interesting.
Now you are in a position to find out what is going on in AI, and what is being
done to ensure its proper use. You should do so, and whenever you feel that
there are risks we should discuss, or opportunities we should pursue, don’t
wait for someone else to react.