Elements of AI
CHAPTER 1
How should we define AI?
As you have probably noticed, AI is currently a “hot topic”: media coverage and
public discussion about AI is almost impossible to avoid. However, you may
also have noticed that AI means different things to different people. For some,
AI is about artificial life-forms that can surpass human intelligence, and for
others, almost any data processing technology can be called AI.
To set the scene, so to speak, we’ll discuss what AI is, how it can be defined,
and what other fields or technologies are closely related. Before we do so,
however, we’ll highlight three applications of AI that illustrate different aspects
of AI. We’ll return to each of them throughout the course to deepen our
understanding.
Application 1. Self-driving cars
Self-driving cars require a combination of AI techniques of many kinds: search
and planning to find the most convenient route from A to B, computer vision
to identify obstacles, and decision making under uncertainty to cope with the
complex and dynamic environment. Each of these must work with almost
flawless precision in order to avoid accidents.
The same technologies are also used in other autonomous systems such as
delivery robots, flying drones, and autonomous ships.
Implications: road safety should eventually improve as the reliability of the
systems surpasses human level. The efficiency of logistics chains when moving
goods should improve. Humans move into a supervisory role, keeping an eye
on what’s going on while machines take care of the driving. Since
transportation is such a crucial element in our daily life, it is likely that there
are also some implications that we haven’t even thought about yet.
Application 2. Content recommendation
A lot of the information that we encounter in the course of a typical day is
personalized. Examples include Facebook, Twitter, Instagram, and other social
media content; online advertisements; music recommendations on Spotify;
movie recommendations on Netflix, HBO, and other streaming services. Many
online publishers such as newspapers’ and broadcasting companies’ websites
as well as search engines such as Google also personalize the content they
offer.
While the front page of the printed version of the New York Times or China
Daily is the same for all readers, the front page of the online version is different
for each user. The algorithms that determine the content that you see are
based on AI.
Implications: while many companies don’t want to reveal the details of their
algorithms, being aware of the basic principles helps you understand the
potential implications: these involve so called filter bubbles, echo-chambers,
troll factories, fake news, and new forms of propaganda.
Application 3. Image and video processing
Face recognition is already a commodity used in many customer, business, and
government applications such as organizing your photos according to people,
automatic tagging on social media, and passport control. Similar techniques
can be used to recognize other cars and obstacles around an autonomous car,
or to estimate wildlife populations, just to name a few examples.
AI can also be used to generate or alter visual content. Examples already in use
today include style transfer, by which you can adapt your personal photos to
look like they were painted by Vincent van Gogh, and computer generated
characters in motion pictures such as Avatar, the Lord of the Rings, and
popular Pixar animations where the animated characters replicate gestures
made by real human actors.
Implications: when such techniques advance and become more widely
available, it will be easy to create natural looking fake videos of events that are
impossible to distinguish from real footage. This challenges the notion that
“seeing is believing”.
What is, and what isn’t AI? Not an easy question!
Often the robothood of such creatures is only a thin veneer on top of a very
humanlike agent, which is understandable as most fiction – even science
fiction – needs to be relatable by human readers who would otherwise be
alienated by intelligence that is too different and strange. Most science fiction
is thus best read as metaphor for the current human condition, and robots
could be seen as stand-ins for repressed sections of society, or perhaps our
search for the meaning of life.
It can be hard to appreciate how complicated all this is, but sometimes it
becomes visible when something goes wrong: the object you pick is much
heavier or lighter than you expected, or someone else opens a door just as you
are reaching for the handle, and then you can find yourself seriously out of
balance. Usually these kinds of tasks feel effortless, but that feeling belies
millions of years of evolution and several years of childhood practice.
While easy for you, grasping objects is extremely hard for a robot, and it is an
area of active study. Recent examples include the Boston Dynamics robots.
It has since turned out that playing chess is very well suited to computers,
which can follow fairly simple rules and compute many alternative move
sequences at a rate of billions of computations a second. Computers beat the
reigning human world champion in chess in the famous Deep Blue vs Kasparov
matches in 1997. Who could have imagined that the harder problem would turn
out to be grabbing the pieces and moving them on the board without knocking it
over? We will study the techniques that are used in playing games like chess or
tic-tac-toe in Chapter 2.
Because AI is a discipline, you shouldn’t say “an AI”, just like we don’t say “a
biology”. This point should also be quite clear when you try saying something
like “we need more artificial intelligences.” That just sounds wrong, doesn’t it?
(It does to us.)
Despite our discouragement, the use of AI as a countable noun is common.
Take, for instance, the headline Data from wearables helped teach an AI to spot
signs of diabetes, which is otherwise a pretty good headline since it
emphasizes the importance of data and makes it clear that the system can only
detect signs of diabetes rather than making diagnoses and treatment
decisions. And you should definitely never ever say anything like Google’s
artificial intelligence built an AI that outperforms any made by humans, which
is one of the all-time most misleading AI headlines we’ve ever seen (note that
the headline is not by Google Research).
The use of AI as a countable noun is of course not a big deal if what is being
said otherwise makes sense, but if you’d like to talk like a pro, avoid saying "an
AI", and instead say "an AI method".
II. Related fields
In addition to AI, there are several other closely related topics that are good to know at
least by name. These include machine learning, data science, and deep learning.
Machine learning can be said to be a subfield of AI, which itself is a subfield
of computer science (such categories are often somewhat imprecise, and some
parts of machine learning could equally well, or perhaps even better, be said to belong to statistics).
Machine learning enables AI solutions that are adaptive. A concise definition
can be given as follows:
Key terminology
Machine learning
Systems that improve their performance in a given task with more and more
experience or data.
Deep learning is a subfield of machine learning, which itself is a subfield of AI,
which itself is a subfield of computer science. We will meet deep learning in
some more detail in Chapter 5, but for now let us just note that the “depth” of
deep learning refers to the complexity of a mathematical model, and that the
increased computing power of modern computers has allowed researchers to
increase this complexity to reach levels that appear not only quantitatively but
also qualitatively different from before. As you notice, science often involves a
number of progressively more special subfields, subfields of subfields, and so
on. This enables researchers to zoom into a particular topic so that it is
possible to catch up with the ever increasing amount of knowledge accrued
over the years, and produce new knowledge on the topic — or sometimes,
correct earlier knowledge to be more accurate.
Data science is a recent umbrella term (a term that covers several subdisciplines)
that includes machine learning and statistics, as well as certain aspects of computer
science such as algorithms, data storage, and web application development.
Data science is also a practical discipline that requires understanding of the
domain in which it is applied, for example, business or science: its purpose
(what "added value" means), basic assumptions, and constraints. Data science
solutions often involve at least a pinch of AI (but usually not as much as one
would expect from the headlines).
Robotics means building and programming robots so that they can operate in
complex, real-world scenarios. In a way, robotics is the ultimate challenge of AI
since it requires a combination of virtually all areas of AI. For example:
Computer vision and speech recognition for sensing the environment
Natural language processing, information retrieval, and reasoning under
uncertainty for processing instructions and predicting consequences of
potential actions
Cognitive modeling and affective computing (systems that respond to
expressions of human feelings or that mimic feelings) for interacting and
working together with humans
Many of the robotics-related AI problems are best approached by machine
learning, which makes machine learning a central branch of AI for robotics.
What is a robot?
In brief, a robot is a machine comprising sensors (which sense the
environment) and actuators (which act on the environment) that can be
programmed to perform sequences of actions. People used to science-fictional
depictions of robots will usually think of humanoid machines walking with an
awkward gait and speaking in a metallic monotone. Most real-world robots
currently in use look very different as they are designed according to the
application. Most applications would not benefit from the robot having human
shape, just like we don’t have humanoid robots to do our dishwashing but
machines in which we place the dishes to be washed by jets of water.
It may not be obvious at first sight, but any kind of vehicle that has at least
some level of autonomy and includes sensors and actuators is also counted as
robotics. On the other hand, software-based solutions such as a customer
service chatbot, even if they are sometimes called “software robots”, aren’t
counted as (real) robotics.
III. Philosophy of AI
The very nature of the term “artificial intelligence” brings up philosophical questions,
such as whether intelligent behavior implies or requires the existence of a mind, and to what
extent consciousness is replicable as computation.
The Turing test
Alan Turing (1912-1954) was an English mathematician and logician. He is
rightfully considered to be the father of computer science. Turing was
fascinated by intelligence and thinking, and the possibility of simulating them
by machines. Turing’s most prominent contribution to AI is his imitation game,
which later became known as the Turing test.
In the test, a human interrogator interacts with two players, A and B, by
exchanging written messages (in a chat). If the interrogator cannot determine
which player, A or B, is a computer and which is a human, the computer is said
to pass the test. The argument is that if a computer is indistinguishable from a
human in a general natural language conversation, then it must have reached
human-level intelligence.
What Turing meant by the test is very much similar to the aphorism by
Forrest Gump: “stupid is as stupid does”. Turing’s version would be “intelligent
is as intelligent says”. In other words, an entity is intelligent if it cannot be
distinguished from another intelligent entity by observing its behavior. Turing
just constrained the set of behaviors to a written discussion so that the interrogator
can’t base her or his decision on appearances.
One problem: does being human-like mean you are intelligent?
One criticism of the Turing test as a test for intelligence is that it may actually
measure whether the computer behaves like a human more than whether it is
intelligent. The test has indeed been “passed” by computer programs that
keep changing the subject, make plenty of spelling errors, and sometimes
refuse to respond at all. A famous example is Eugene Goostman, a 13-year-old
Ukrainian boy who constantly tries to avoid answering questions by making
jokes and changing the subject to his pet guinea pig. Here is a short segment of
an example conversation with Eugene Goostman from the blog of Scott
Aaronson:
Is Eugene a computer or a person?
Interrogator: How many legs does a camel have?
Eugene: Something between 2 and 4. Maybe, three? :-))) By the way, I still
don’t know your specialty - or, possibly, I’ve missed it?
Interrogator: How many legs does a millipede have?
Eugene: Just two, but Chernobyl mutants may have them up to five. I know
you are supposed to trick me.
Interrogator: No, I need to know that you’re not a chatbot. Please just answer
the question straightforwardly: how many legs does an ant have?
Eugene: Something between 2 and 4. Maybe, three? :-))) Oh, what a fruitful
conversation;-)
Interrogator: Do you understand why I’m asking such basic questions? Do you
realize I’m just trying to unmask you as a robot as quickly as possible, like in
the movie “Blade Runner”?
Eugene: ...wait
Eugene Goostman, who in reality is a computer program, fooled 10 out of 30
judges into thinking he is a real person.
This question belongs to the class of search and planning problems. Similar
problems need to be solved by self-driving cars, and (perhaps less obviously) AI
for playing games. In the game of chess, for example, the difficulty is not so
much in getting a piece from A to B as keeping your pieces safe from the
opponent.
Often there are many different ways to solve the problem, some of which
may be more preferable in terms of time, effort, cost or other criteria. Different
search techniques may lead to different solutions, and developing advanced
search algorithms is an established research area.
Now let’s draw the transitions. We could draw arrows that have a direction
so that they point from one node to another, but in this puzzle the transitions
are symmetric: if the robot can row from state NNNN to state FNFF, it can
equally well row the other way from FNFF to NNNN. Thus it is simpler to draw
the transitions simply with lines that don’t have a direction. Starting from
NNNN, we can go to FNFN, FNFF, FFNF, and FFFN:
We have now done quite a bit of work on the puzzle without seeming any
closer to the solution, and there is little doubt that you could have solved the
whole puzzle already by using your “natural intelligence”. But for more complex
problems, where the number of possible solutions grows into the thousands and
millions, our systematic or mechanical approach will shine since the hard
part will be suitable for a simple computer to do. Now that we have formulated
the alternative states and transitions between them, the rest becomes a
mechanical task: find a path from the initial state NNNN to the final state FFFF.
One such path is colored in the following picture. The path proceeds from
NNNN to FFFN (the robot takes the fox and the chicken to the other side),
thence to NFNN (the robot takes the chicken back to the starting side) and
finally to FFFF (the robot can now move the chicken and the chicken-feed to
the other side).
State space, transitions, and costs. To formalize a planning problem, we use
concepts such as the state space, transitions, and costs.
Key terminology
The state space means the set of possible situations. In the chicken-crossing
puzzle, the state space consisted of ten allowed states NNNN through to FFFF
(but not for example NFFF, which the puzzle rules don’t allow). If the task is to
navigate from place A to place B, the state space could be the set of locations
defined by their (x,y) coordinates that can be reached from the starting point
A. Or we could use a constrained set of locations, for example, different street
addresses so that the number of possible states is limited.
Transitions are possible moves between one state and another, such as NNNN
to FNFN. It is important to note that we only count direct transitions that can
be accomplished with a single action as transitions. A sequence of multiple
transitions, for example, from A to C, from C to D, and from D to B (the goal), is
a path rather than a transition.
Costs refer to the fact that the different transitions often aren’t all alike.
They can differ in ways that make some transitions more preferable or cheaper
(in a not necessarily monetary sense of the word) and others more costly. We
can express this by associating with each transition a certain cost. If the goal is
to minimize the total distance traveled, then a natural cost is the geographical
distance between states. On the other hand, the goal could actually be to
minimize the time instead of the distance, where the natural cost would
obviously be the time. If the transitions are equal, then we ignore the costs.
II. Solving problems with AI
Interlude on the history of AI: starting from search
AI is arguably as old as computer science. Long before we had computers,
people thought of the possibility of automatic reasoning and intelligence. As
we already mentioned in Chapter 1, one of the great thinkers who considered
this question was Alan Turing. In addition to the Turing test, his contributions
to AI, and more generally to computer science, include the insight that
anything that can be computed (= calculated using either numbers or other
symbols) can be automated.
Note
Helping win WWII
Turing designed a very simple device that can compute anything that is
computable. His device is known as the Turing machine. While it is a
theoretical model that isn’t practically useful, it lead Turing to the invention of
programmable computers: computers that can be used to carry out different
tasks depending on what they were programmed to do.
So instead of having to build a different device for each task, we use the same
computer for many tasks. This is the idea of programming. Today this invention
sounds trivial but in Turing’s days it was far from it. Some of the early
programmable computers were used during World War II to crack German
secret codes, a project where Turing was also personally involved.
The term Artificial Intelligence was coined by John McCarthy (1927-2011) –
who is often also referred to as the Father of AI. The term became established
when it was chosen as the topic of a summer seminar, known as
the Dartmouth conference, which was organized by McCarthy and others in
1956 at Dartmouth College in New Hampshire. In the proposal to organize the
seminar, McCarthy continued with Turing’s argument about automated
computation. The proposal contains the following crucial statement:
Note
John McCarthy’s key statement about AI
“The study is to proceed on the basis of the conjecture that every aspect of
learning or any other feature of intelligence can in principle be so precisely
described that a machine can be made to simulate it.”
In other words, any element of intelligence can be broken down into small
steps so that each of the steps is as such so simple and “mechanical” that it can
be written down as a computer program. This statement was, and is still today,
a conjecture, which means that we can’t really prove it to be true.
Nevertheless, the idea is absolutely fundamental when it comes to the way we
think about AI. For example, it shows that McCarthy wanted to bypass any
arguments in the spirit of Searle’s Chinese Room: intelligence is intelligence
even if the system that implements it is just a computer that mechanically
follows a program.
Why search and games became central in AI research
As computers developed to the level where it was feasible to experiment with
practical AI algorithms in the 1950s, the most distinctive AI problems (besides
cracking Nazi codes) were games. Games provided a convenient restricted
domain that could be formalized easily. Board games such as checkers, chess,
and recently quite prominently Go (an extremely complex strategy board game
originating from China at least 2500 years ago), have inspired countless
researchers, and continue to do so.
Closely related to games, search and planning techniques were an area where
AI led to great advances in the 1960s: algorithms with names such as the
Minimax algorithm or Alpha-Beta Pruning, which were developed then, are still
the basis for game playing AI, although of course more advanced variants have
been proposed over the years. In this chapter, we will study games and
planning problems on a conceptual level.
III. Search and games
In this section, we will study a classic AI problem: games. The simplest scenarios, which we
will focus on for the sake of clarity, are two-player, perfect-information games such as tic-
tac-toe and chess.
Example: playing tic-tac-toe
Maxine and Minnie are true game enthusiasts. They just love games. Especially
two-person, perfect information games such as tic-tac-toe or chess. One day
they were playing tic-tac-toe. Maxine, or Max as her friends call her, was
playing with X. Minnie, or Min as her friends call her, had the Os. Min had just
played her turn and the board looked as follows:
Max was looking at the board and contemplating her next move, as it was
her turn, when she suddenly buried her face in her hands in despair, looking
quite like Garry Kasparov playing Deep Blue in 1997. Yes, Min was close to
getting three Os on the top row, but Max could easily put a stop to that plan. So
why was Max so pessimistic?
Game trees
To solve games using AI, we will introduce the concept of a game tree. The
different states of the game are represented by nodes in the game tree, very
similar to the above planning problems. The idea is just slightly different. In the
game tree, the nodes are arranged in levels that correspond to each player’s
turns in the game so that the “root” node of the tree (usually depicted at the
top of the diagram) is the beginning position in the game. In tic-tac-toe, this
would be the empty grid with no Xs or Os played yet. Under root, on the
second level, there are the possible states that can result from the first player’s
moves, be it X or O. We call these nodes the “children” of the root node.
Each node on the second level, would further have as its children nodes the
states that can be reached from it by the opposing player’s moves. This is
continued, level by level, until reaching states where the game is over. In tic-
tac-toe, this means that either one of the players gets a line of three and wins,
or the board is full and the game ends in a tie.
Minimizing and maximizing value
In order to be able to create game AI that attempts to win the game, we attach
a numerical value to each possible end result. To the board positions where X
has a line of three so that Max wins, we attach the value +1, and likewise, to
the positions where Min wins with three Os in a row we attach the value -1.
For the positions where the board is full and neither player wins, we use the
neutral value 0 (it doesn’t really matter what the values are as long as they are
in this order so that Max tries to maximize the value, and Min tries to minimize it).
A sample game tree
Consider, for example, the following game tree which begins not at the root
but in the middle of the game (because otherwise, the tree would be way too
big to display). Note that this is different from the game shown in the
illustration in the beginning of this section. We have numbered the nodes with
numbers 1, 2, ..., 14.
The tree is composed of alternating layers where it is either Min’s turn to place
an O or Max’s turn to place an X at any of the vacant slots on the board. The
player whose turn it is to play next is shown at the left.
The game continues at the board position shown in the root node,
numbered as (1) at the top, with Min’s turn to place O at any of the three
vacant cells. Nodes (2)–(4) show the board positions resulting from each of the
three choices respectively. In the next step, each node has two possible choices
for Max to play X each, and so the tree branches again.
When starting from the above starting position, the game always ends in a row
of three: in nodes (7) and (9), the winner is Max who plays with X, and in nodes
(11)–(14) the winner is Min who plays with O.
Note that since the players’ turns alternate, the levels can be labeled as Min
levels and Max levels, which indicates whose turn it is.
Being strategic
Consider nodes (5)–(10) on the second level from the bottom. In nodes (7) and
(9), the game is over, and Max wins with three X’s in a row. The value of these
positions is +1. In the remaining nodes, (5), (6), (8), and (10), the game is also
practically over, since Min only needs to place her O in the only remaining cell
to win. In other words, we know how the game will end at each node on the
second level from the bottom. We can therefore decide that the value of
nodes (5), (6), (8), and (10) is also –1.
Here comes the interesting part. Let’s consider the values of the nodes one
level higher towards the root: nodes (2)–(4). Since we observed that both of
the children of (2), i.e., nodes (5) and (6), lead to Min’s victory, we can without
hesitation attach the value -1 to node (2) as well. However, for node (3), the left
child (7) leads to Max’s victory, +1, but the right child (8) leads to Min winning,
-1. What is the value of node (3)? Think about this for a while, keeping in mind
who makes the choice at node (3). Since it is Max’s turn to play, she will of
course choose the left child, node (7). Thus, every time we reach the board
position in node (3), Max can ensure victory, and we can attach the value +1 to
node (3).
The same holds for node (4): again, since Max can choose where to put her X,
she can always ensure victory, and we attach the value +1 to node (4).
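The bottom-up value propagation described above is exactly what the Minimax algorithm does. As a rough sketch of our own (not code from the course), here is how it could look in Python when the game tree is given as nested lists whose leaves are the end-result values:

def minimax(node, maximizing):
    # A node is either a number (the value of a finished game: +1 for a Max
    # win, -1 for a Min win, 0 for a tie) or a list of child nodes.
    # `maximizing` is True when it is Max's turn to move at this node.
    if isinstance(node, (int, float)):
        return node
    child_values = [minimax(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)

# The subtree rooted at node (3): Max to move, the left child (7) is a Max
# win (+1) and the right child (8) leads to a Min win (-1).
print(minimax([+1, -1], maximizing=True))    # 1

# The whole sample tree: Min to move at the root (1), children (2)-(4).
print(minimax([[-1, -1], [+1, -1], [+1, -1]], maximizing=False))    # -1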
Sounds good, can I go home now? As stated above, the Minimax algorithm
can be used to implement optimal game play in any deterministic, two-player,
perfect-information zero-sum game. Such games include tic-tac-toe, connect
four, chess, Go, etc. Rock-paper-scissors is not in this class of games since it
involves information hidden from the other player; nor are Monopoly or
backgammon, which are not deterministic. So as far as this topic is concerned,
is that all, folks, and can we go home now? The answer is that in theory, yes, but in
practice, no.
Note
The problem of massive game trees
In many games, the game tree is simply way too big to traverse in full. For
example, in chess the average branching factor, i.e., the average number of
children (available moves) per node is about 35. That means that to explore all
the possible scenarios up to only two moves ahead, we need to visit
approximately 35 x 35 = 1225 nodes – probably not your favorite pencil-and-
paper homework exercise. A look-ahead of three moves requires visiting 42875
nodes; four moves 1500625; and ten moves 2758547353515625 (that’s about
2.7 quadrillion) nodes. In Go, the average branching factor is estimated to be
about 250. Go means no-go for Minimax.
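If you want to verify the numbers above, the calculation is just repeated multiplication by the branching factor; here is a tiny Python check of our own:

branching_factor = 35                 # average number of available moves in chess
for moves_ahead in (2, 3, 4, 10):
    print(moves_ahead, branching_factor ** moves_ahead)
# prints: 2 1225, 3 42875, 4 1500625, 10 2758547353515625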
More tricks: Managing massive game trees
A few more tricks are needed to manage massive game trees. Many of them
were crucial elements in IBM’s Deep Blue computer defeating the chess world
champion, Garry Kasparov, in 1997.
If we can afford to explore only a small part of the game tree, we need a way
to stop the Minimax algorithm before reaching an end-node, i.e., a node where
the game is over and the winner is known. This is achieved by using a so
called heuristic evaluation function that takes as input a board position,
including the information about which player’s turn is next, and returns a score
that should be an estimate of the likely outcome of the game continuing from
the given board position.
Note
Good heuristics
Good heuristics for chess, for example, typically count the amount of material
(pieces) weighted by their type: the queen is usually considered worth about
two times as much as a rook, three times a knight or a bishop, and nine times
as much as a pawn. The king is of course worth more than all other things
combined since losing it amounts to losing the game. Further, occupying the
strategically important positions near the middle of the board is considered an
advantage and the heuristics assign higher value to such positions.
The Minimax algorithm presented above requires only minimal changes to obtain
a depth-limited version where the heuristic value is returned at all nodes at a given
depth limit: the depth simply refers to the number of steps that the game tree
is expanded before applying a heuristic evaluation function.
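As an illustration of how small the change is, here is a sketch of our own of a depth-limited Minimax; the children and heuristic functions are placeholders that would have to be supplied for the game at hand, and the toy example below reuses the nested-list tree format from the earlier sketch.

def minimax_depth_limited(state, depth, maximizing, children, heuristic):
    # children(state) returns the successor states (empty when the game is over);
    # heuristic(state) estimates the outcome from Max's point of view.
    successors = children(state)
    if depth == 0 or not successors:
        # Instead of searching to the end of the game, fall back on the
        # heuristic evaluation of the board position.
        return heuristic(state)
    values = [minimax_depth_limited(s, depth - 1, not maximizing, children, heuristic)
              for s in successors]
    return max(values) if maximizing else min(values)

# Toy demonstration on the nested-list tree from the earlier sketch:
def children(node):
    return node if isinstance(node, list) else []

def heuristic(node):
    # Crude placeholder: the exact value at leaves, "don't know" (0) elsewhere.
    return node if isinstance(node, (int, float)) else 0

tree = [[-1, -1], [+1, -1], [+1, -1]]
print(minimax_depth_limited(tree, depth=2, maximizing=False,
                            children=children, heuristic=heuristic))   # -1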
CHAPTER 3
I. Odds and probability
In the previous section, we discussed search and its applications in situations where there is perfect
information – such as in games like chess. However, in the real world things are rarely so
clear cut.
Instead of perfect information, there is a host of unknown possibilities, ranging
from missing information to deliberate deception.
Take a self-driving car for example — you can set the goal to get from A to B in
an efficient and safe manner that follows all laws. But what happens if the
traffic gets worse than expected, maybe because of an accident ahead?
Sudden bad weather? Random events like a ball bouncing in the street, or a
piece of trash flying straight into the car’s camera?
Despite its naivete, the naive Bayes method tends to work very well in
practice. This is a good example of what the common saying in statistics, “all models
are wrong, but some are useful”, means (the aphorism is generally attributed to
the statistician George E.P. Box).
Estimating parameters
To get started, we need to specify the prior odds for spam (against ham). For
simplicity, assume this to be 1:1, which means that on average half of the
incoming messages are spam (in reality, the amount of spam is probably much
higher).
To get our likelihood ratios, we need two different probabilities for any word
occurring: one in spam messages and another one in ham messages.
The word distributions for the two classes are best estimated from actual
training data that contains some spam messages as well as legitimate
messages. The simplest way is to count how many times each word, abacus,
acacia, ..., zurg, appears in the data and divide the number by the total word
count.
To illustrate the idea, let’s assume that we have at our disposal some spam and
some ham. You can easily obtain such data by saving a batch of your emails in
two files.
Assume that we have calculated the number of occurrences of the following
words (along with all other words) in the two classes of messages:
word spam ham
million 156 98
dollars 29 119
adclick 51 0
conferences 0 12
total 95791 306438
We can now estimate that the probability that a word in a spam message is
“million”, for example, is about 156 out of 95791, which is roughly the same as
1 in 614. Likewise, we get the estimate that 98 out of 306438 words in a ham
message, or roughly 1 in 3127, are “million”. Both of these
probability estimates are small, less than 1 in 500, but more importantly, the
former is higher than the latter: 1 in 614 is higher than 1 in 3127. This means
that the likelihood ratio, which is the first ratio divided by the second ratio, is
more than one. To be more precise, the ratio is (1/614) / (1/3127) = 3127/614
= 5.1 (rounded to one decimal digit).
Recall that if you have any trouble at all with following the math in this section,
you should refresh the arithmetic with fractions using the pointers we gave
earlier (see the part about Odds in section Odds and Probability).
Note
Zero means trouble
One problem with estimating the probabilities directly from the counts is that
zero counts lead to zero estimates. This can be quite harmful for the
performance of the classifier – it easily leads to situations where the posterior
odds are 0/0, which is nonsense. The simplest solution is to use a small lower
bound for all probability estimates. The value 1/100000, for instance, does the
job.
Using the above logic, we can determine the likelihood ratio for all possible
words without any of the estimates being zero, giving us the following likelihood ratios:
word likelihood ratio
million 5.1
dollars 0.8
adclick 53.2
conferences 0.3
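To tie the calculations together, the following Python sketch (our own illustration, using the counts and the 1:1 prior from above) computes the likelihood ratios with the 1/100000 lower bound and multiplies them into posterior odds for a message:

# Word counts from the training data (spam words total 95791, ham words 306438).
counts = {
    "million":     {"spam": 156, "ham": 98},
    "dollars":     {"spam": 29,  "ham": 119},
    "adclick":     {"spam": 51,  "ham": 0},
    "conferences": {"spam": 0,   "ham": 12},
}
TOTAL = {"spam": 95791, "ham": 306438}
FLOOR = 1 / 100000       # lower bound that keeps zero counts from causing trouble

def likelihood_ratio(word):
    # P(word | spam) / P(word | ham), with zero estimates replaced by FLOOR.
    p_spam = max(counts[word]["spam"] / TOTAL["spam"], FLOOR)
    p_ham = max(counts[word]["ham"] / TOTAL["ham"], FLOOR)
    return p_spam / p_ham

def posterior_odds(message, prior_odds=1.0):
    # Naive Bayes: multiply the prior odds by the likelihood ratio of each word.
    odds = prior_odds
    for word in message:
        odds *= likelihood_ratio(word)
    return odds

print(round(likelihood_ratio("million"), 1))              # about 5.1, as in the table
print(posterior_odds(["million", "dollars", "adclick"]))  # about 211: much greater than 1, so the message looks like spam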
The correct label (what digit the writer was supposed to write) is shown above
each image. Note that some of the "correct” class labels are questionable: see
for example the second image from left: is that really a 7, or actually a 4?
Note
MNIST – What’s that?
Every machine learning student knows about the MNIST dataset. Fewer know
what the acronym stands for. In fact, we had to look it up to be able to tell you
that the M stands for Modified, and NIST stands for National Institute of
Standards and Technology. Now you probably know something that an average
machine learning expert doesn’t!
In the most common machine learning problems, exactly one class value is
correct at a time. This is also true in the MNIST case, although as we said, the
correct answer may often be hard to tell. In this kind of problem, it is not
possible that an instance belongs to multiple classes (or none at all) at the
same time. What we would like to achieve is an AI method that can be given an
image like the ones above, and automatically spits out the correct label (a
number between 0 and 9).
Note
How not to solve the problem
An automatic digit recognizer could in principle be built manually by writing
down rules such as:
if the black pixels are mostly in the form of a single loop then the label is
0
if the black pixels form two intersecting loops then the label is 8
if the black pixels are mostly in a straight vertical line in the middle of the
figure then the label is 1
and so on...
This was how AI methods were mostly developed in the 1980s (so called
“expert systems”). However, even for such a simple task as digit recognition,
the task of writing such rules is very laborious. In fact, the above example rules
wouldn’t be specific enough to be implemented by programming – we’d have
to define precisely what we mean by “mostly”, “loop”, “line”, “middle”, and so
on.
And even if we did all this work, the result would likely be a bad AI method
because as you can see, the handwritten digits are often a bit so-and-so, and
every rule would need a dozen exceptions.
Three types of machine learning
The roots of machine learning are in statistics, which can also be thought of as
the art of extracting knowledge from data. Especially methods such as linear
regression and Bayesian statistics, which are both already more than two
centuries old (!), are even today at the heart of machine learning. For more
examples and a brief history, see the timeline of machine learning (Wikipedia).
The area of machine learning is often divided into subareas according to the
kinds of problems being attacked. A rough categorization is as follows:
Supervised learning: We are given an input, for example a photograph with a
traffic sign, and the task is to predict the correct output or label, for example
which traffic sign is in the picture (speed limit, stop sign, etc.). In the simplest
cases, the answers are in the form of yes/no (we call these binary classification
problems).
Unsupervised learning: There are no labels or correct outputs. The task is to
discover the structure of the data: for example, grouping similar items to form
“clusters”, or reducing the data to a small number of important “dimensions”.
Data visualization can also be considered unsupervised learning.
Reinforcement learning: Commonly used in situations where an AI agent like a
self-driving car must operate in an environment and where feedback about
good or bad choices is available with some delay. Also used in games where
the outcome may be decided only at the end of the game.
The categories are somewhat overlapping and fuzzy, so a particular method
can sometimes be hard to place in one category. For example, as the name
suggests, so-called semisupervised learning is partly supervised and partly
unsupervised.
Note
Classification
When it comes to machine learning, we will focus primarily on supervised
learning, and in particular, classification tasks. In classification, we observe an
input, such as a photograph of a traffic sign, and try to infer its “class”, such as
the type of sign (speed limit 80 km/h, pedestrian crossing, stop sign, etc.).
Other examples of classification tasks include: identification of fake Twitter
accounts (input includes the list of followers, and the rate at which they have
started following the account, and the class is either fake or real account) and
handwritten digit recognition (input is an image, class is 0,...,9).
For example, when we use linear regression to predict the life expectancy, the
weight of smoking (cigarettes per day) is about minus half a year, meaning that
smoking one cigarette more per day takes you on the average half a year closer
to termination. Likewise, the weight of vegetable consumption (handful of
vegetables per day) has weight plus one year, so eating a handful of greens
every day gives you on the average one more year.
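As a toy illustration of how such a model turns weights into predictions, consider the following sketch; the weights are the ones quoted above, while the baseline of 80 years is a made-up assumption purely for illustration:

# Hypothetical linear regression model for life expectancy. The weights follow
# the text; the baseline of 80 years is a made-up assumption for illustration.
BASELINE = 80.0
WEIGHTS = {
    "cigarettes_per_day": -0.5,          # each daily cigarette: about half a year less
    "vegetable_handfuls_per_day": 1.0,   # each daily handful of greens: about one year more
}

def predict_life_expectancy(features):
    # Linear regression prediction: baseline plus the weighted sum of the inputs.
    return BASELINE + sum(WEIGHTS[name] * value for name, value in features.items())

print(predict_life_expectancy({"cigarettes_per_day": 10,
                               "vegetable_handfuls_per_day": 1}))   # 80 - 5 + 1 = 76.0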
24 15 Pass
41 9.5 Pass
58 2 Fail
101 5 Fail
215 6 Pass
Based on the table, what kind of conclusion could you draw about the relationship
between the hours studied and passing the exam? We could think that, if we had
data from hundreds of students, we might be able to see how much studying is
needed in order to pass the course. We can present this data in a chart as you can
see below.
The limits of machine learning
To summarize, machine learning is a very powerful tool for building AI
applications. In addition to the nearest neighbor method, linear regression,
and logistic regression, there are literally hundreds, if not thousands, of
different machine learning techniques, but they all boil down to the same
thing: trying to extract patterns and dependencies from data and using them
either to gain understanding of a phenomenon or to predict future outcomes.
Machine learning can be a very hard problem and we can’t usually achieve a
perfect method that would always produce the correct label. However, in most
cases, a good but not perfect prediction is still better than none. Sometimes
we may be able to produce better predictions by ourselves but we may still
prefer to use machine learning because the machine will make its predictions
faster and it will also keep churning out predictions without getting tired. Good
examples are recommendation systems that need to predict what music, what
videos, or what ads are more likely to be of interest to you.
The factors that affect how good a result we can achieve include:
The hardness of the task: in handwritten digit recognition, if the digits
are written very sloppily, even a human can’t always guess correctly what
the writer intended
The machine learning method: some methods are far better for a
particular task than others
The amount of training data: from only a few examples, it is impossible
to obtain a good classifier
The quality of the data
Note
Data quality matters
In the beginning of this chapter, we emphasized the importance of having
enough data and the risks of overfitting. Another equally important factor is
the quality of the data. In order to build a model that generalizes well to data
outside of the training data, the training data needs to contain enough
information that is relevant to the problem at hand. For example, if you create
an image classifier that tells you what the image given to the algorithm is
about, and you have trained it only on pictures of dogs and cats, it will classify
everything it sees as either a dog or a cat. This would make sense if the
algorithm is used in an environment where it will only see cats and dogs, but
not if it is expected to see boats, cars, and flowers as well.
Note
Modeling the brain
The BRAIN Initiative led by American neuroscience researchers is pushing
forward technologies for imaging, modeling, and simulating the brain at a finer
and larger scale than before. Some brain research projects are very ambitious
in terms of objectives. The Human Brain Project promised in 2012 that “the
mysteries of the mind can be solved – soon”. After years of work, the Human
Brain Project was facing questions about when the billion euros invested by
the European Union will deliver what was promised, even though, to be fair,
some less ambitious milestones have been achieved.
However, even though we seem to be almost as far as ever from understanding the
mind and consciousness, there are clear milestones that have been achieved in
neuroscience. By better understanding of the structure and function of the
brain, we are already reaping some concrete rewards. We can, for instance,
identify abnormal functioning and try to help the brain avoid it and
reinstate normal operation. This can lead to life-changing new medical
treatments for people suffering from neurological disorders: epilepsy,
Alzheimer’s disease, problems caused by developmental disorders or damage
caused by injuries, and so on.
Note
Looking to the future: brain computer interfaces
One research direction in neuroscience is brain-computer interfaces that allow
interacting with a computer by simply thinking. The current interfaces are very
limited and they can be used, for example, to reconstruct on a very rough level
what a person is seeing, or to control robotic arms by thought. Perhaps some
day we will actually be able to implement a thought-reading machine that allows precise
instructions, but currently such machines belong to science fiction. It is also conceivable
that we could feed information into the brain by stimulating it by small
electrical pulses. Such stimulation is currently used for therapeutic purposes.
Feeding detailed information such as specific words, ideas, memories, or
emotions is at least currently science fiction rather than reality, but obviously
we know neither the limits of such technology, nor how hard it is to reach
them.
We’ve drifted a little astray from the topic of the course. In fact, another main
reason for building artificial neural networks has little to do with understanding
biological systems. It is to use biological systems as an inspiration to build
better AI and machine learning techniques. The idea is very natural: the brain is
an amazingly complex information processing system capable of a wide range
of intelligent behaviors (plus occasionally some not-so-intelligent ones), and
therefore, it makes sense to look for inspiration in it when we try to create
artificially intelligent systems.
Neural networks have been a major trend in AI since the 1960s. We’ll return to
the waves of popularity in the history of AI in the final part. Currently neural
networks are again at the very top of the list as deep learning is used to
achieve significant improvements in many areas such as natural language and
image processing, which have traditionally been sore points of AI.
What is so special about neural networks?
The case for neural networks in general as an approach to AI is based on a
similar argument as that for logic-based approaches. In the latter case, it was
thought that in order to achieve human-level intelligence, we need to simulate
higher-level thought processes, and in particular, manipulation of symbols
representing certain concrete or abstract concepts using logical rules.
The argument for neural networks is that by simulating the lower-level,
“subsymbolic” data processing on the level of neurons and neural networks,
intelligence will emerge. This all sounds very reasonable but keep in mind that
in order to build flying machines, we don’t build airplanes that flap their wings,
or that are made of bones, muscle, and feather. Likewise, in artificial neural
networks, the internal mechanism of the neurons is usually ignored and the
artificial neurons are often much simpler than their natural counterparts. The
electro-chemical signaling mechanisms between natural neurons are also
mostly ignored in artificial models when the goal is to build AI systems rather
than to simulate biological systems.
Compared to how computers traditionally work, neural networks have certain
special features:
Neural network key feature 1
For one, in a traditional computer, information is processed in a central
processor (aptly named the central processing unit, or CPU for short) which
can only focus on doing one thing at a time. The CPU can retrieve data to be
processed from the computer’s memory, and store the result in the memory.
Thus, data storage and processing are handled by two separate components of
the computer: the memory and the CPU. In neural networks, the system
consists of a large number of neurons, each of which can process information
on its own so that instead of having a CPU process each piece of information
one after the other, the neurons process vast amounts of information
simultaneously.
Neural network key feature 2
The second difference is that data storage (memory) and processing aren’t
separated as they are in traditional computers. The neurons both store and process
information so that there is no need to retrieve data from the memory for
processing. The data can be stored short term in the neurons themselves (they
either fire or not at any given time) or for longer term storage, in the
connections between the neurons – their so called weights, which we will
discuss below.
Because of these two differences, neural networks and traditional computers
are suited for somewhat different tasks. Even though it is entirely possible to
simulate neural networks in traditional computers, which was the way they
were used for a long time, their maximum capacity is achieved only when we
use special hardware (computer devices) that can process many pieces of
information at the same time. This is called parallel processing. Incidentally,
graphics processors (or graphics processing units, GPUs) have this capability
and they have become a cost-effective solution for running massive deep
learning methods.
II. How neural networks are built
As we said earlier, neurons are very simple processing units. Having discussed linear and
logistic regression in Chapter 4, the essential technical details of neural networks can be
seen as slight variations of the same idea.
Note
Weights and inputs
The basic artificial neuron model involves a set of adaptive parameters, called
weights like in linear and logistic regression. Just like in regression, these
weights are used as multipliers on the inputs of the neuron, which are added
up. The sum of the weights times the inputs is called the linear combination of
the inputs. You can probably recall the shopping bill analogy: you multiply the
amount of each item by its price per unit and add up to get the total.
If we have a neuron with six inputs (analogous to the amounts of the six
shopping items: potatoes, carrots, and so on), input1, input2, input3, input4,
input5, and input6, we also need six weights. The weights are analogous to the
prices of the items. We’ll call them weight1, weight2, weight3, weight4,
weight5, and weight6. In addition, we’ll usually want to include an intercept
term like we did in linear regression. This can be thought of as a fixed
additional charge due to processing a credit card payment, for example.
We can then calculate the linear combination like this: linear combination =
intercept + weight1 × input1 + ... + weight6 × input6 (where the ... is a
shorthand notation meaning that the sum includes all the terms from 1 to 6).
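In code, this shopping-bill style calculation is only a few lines; here is a minimal sketch of our own with made-up weights and inputs:

# A neuron with six inputs: the weights play the role of the prices, the
# inputs the role of the amounts, and the intercept is a fixed extra charge.
weights = [0.5, -1.0, 2.0, 1.5, -0.5, 1.0]   # made-up values for illustration
inputs = [1.0, 0.0, 2.0, 0.0, 3.0, 1.0]
intercept = 0.25

linear_combination = intercept + sum(w * x for w, x in zip(weights, inputs))
print(linear_combination)   # 0.25 + 0.5 + 4.0 - 1.5 + 1.0 = 4.25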
Activations and outputs
Once the linear combination has been computed, the neuron does one more
operation. It takes the linear combination and puts it through a so-called
activation function. Typical examples of the activation function include:
identity function: do nothing and just output the linear combination
step function: if the value of the linear combination is greater than zero,
send a pulse (ON), otherwise do nothing (OFF)
sigmoid function: a “soft” version of the step function
Note that with the first activation function, the identity function, the neuron is
exactly the same as linear regression. This is why the identity function is rarely
used in neural networks: it leads to nothing new and interesting.
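For concreteness, the three activation functions can be written out as follows (a sketch of our own; the sigmoid formula 1/(1 + e^(-z)) is the standard choice):

import math

def identity(z):
    return z                            # output the linear combination as such

def step(z):
    return 1.0 if z > 0 else 0.0        # send a pulse (ON) only when the input is positive

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # a "soft" version of the step function

# Applied to the linear combination 4.25 computed in the sketch above:
for activation in (identity, step, sigmoid):
    print(activation.__name__, activation(4.25))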
Note
How neurons activate
Real, biological neurons communicate by sending out sharp, electrical pulses
called “spikes”, so that at any given time, their outgoing signal is either on or
off (1 or 0). The step function imitates this behavior. However, artificial neural
networks tend to use activation functions that output a continuous numerical
activation level at all times, such as the sigmoid function. Thus, to use a
somewhat awkward figure of speech, real neurons communicate by something
similar to the Morse code, whereas artificial neurons communicate by
adjusting the pitch of their voice as if yodeling.
The output of the neuron, determined by the linear combination and the
activation function, can be used to extract a prediction or a decision. For
example, if the network is designed to identify a stop sign in front of a self-
driving car, the input can be the pixels of an image captured by a camera
attached in front of the car, and the output can be used to activate a stopping
procedure that stops the car before the sign.
Learning or adaptation in the network occurs when the weights are adjusted so
as to make the network produce the correct outputs, just like in linear or
logistic regression. Many neural networks are very large, and the largest
contain hundreds of billions of weights. Optimizing them all can be a daunting
task that requires massive amounts of computing power.
Perceptron: the mother of all ANNs
The perceptron is simply a fancy name for the simple neuron model with the
step activation function we discussed above. It was among the very first formal
models of neural computation and because of its fundamental role in the
history of neural networks, it wouldn’t be unfair to call it the “mother of all
artificial neural networks”.
It can be used as a simple classifier in binary classification tasks. A method for
learning the weights of the perceptron from data, called the Perceptron
algorithm, was introduced by the psychologist Frank Rosenblatt in 1957. We
will not study the Perceptron algorithm in detail. Suffice to say that it is just
about as simple as the nearest neighbor classifier. The basic principle is to feed
the network training data one example at a time. Each misclassification leads
to an update in the weights.
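To make the principle concrete, here is a simplified sketch of our own of a Perceptron-style learning loop; it is not Rosenblatt’s original formulation in every detail, but it shows how each misclassification nudges the weights:

def step(z):
    return 1 if z > 0 else 0

def train_perceptron(examples, n_inputs, learning_rate=0.1, epochs=10):
    # examples is a list of (inputs, correct_label) pairs with labels 0 or 1.
    weights = [0.0] * n_inputs
    intercept = 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            prediction = step(intercept + sum(w * x for w, x in zip(weights, inputs)))
            error = label - prediction        # 0 if correct, +1 or -1 if misclassified
            if error != 0:                    # only misclassifications change the weights
                intercept += learning_rate * error
                weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    return weights, intercept

# A toy task: output 1 only when both inputs are 1 (the logical AND function).
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, intercept = train_perceptron(data, n_inputs=2)
print([step(intercept + sum(w * x for w, x in zip(weights, inp))) for inp, _ in data])
# [0, 0, 0, 1] – the weights have been learned from the data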
Note
AI hyperbole
After its discovery, the Perceptron algorithm received a lot of attention, not
least because of optimistic statements made by its inventor, Frank Rosenblatt.
A classic example of AI hyperbole is a New York Times article published on July
8th, 1958:
“The Navy revealed the embryo of an electronic computer today that it expects
will be able to walk, talk, see, reproduce itself and be conscious of its
existence.”
Please note that neural network enthusiasts are not at all the only ones
inclined towards optimism. The rise and fall of the logic-based expert systems
approach to AI had all the same hallmark features of AI hype, and people
claimed that the final breakthrough was just a short while away. The outcome
both in the early 1960s and late 1980s was a collapse in the research funding
called an AI Winter.
The history of the debate that eventually led to the almost complete abandonment
of the neural network approach in the 1960s for more than two decades is
extremely fascinating. The article A Sociological Study of the Official History of
the Perceptrons Controversy by Mikel Olazaran (published in Social Studies of
Science, 1996) reviews the events from a sociology of science point of view.
Reading it today is quite thought provoking. Reading stories about celebrated
AI heroes who had developed neural network algorithms that would soon
reach the level of human intelligence and become self-conscious can be
compared to some statements made during the current hype. If you take a
look at the above article, even if you wouldn’t read all of it, it will provide an
interesting background to today’s news. Consider for example an article in the
MIT Technology Review published in September 2017, where Jordan Jacobs,
co-founder of the multimillion-dollar Vector Institute for AI, compares Geoffrey
Hinton (a figure-head of the current deep learning boom) to Einstein because
of his contributions to development of neural network algorithms in the 1980s
and later. Also recall the Human Brain project mentioned in the previous
section.
According to Hinton, “the fact that it doesn’t work is just a temporary
annoyance” (although according to the article, Hinton is laughing about the
above statement, so it’s hard to tell how serious he is about it). The Human
Brain project claims to be “close to a profound leap in our understanding of
consciousness”. Doesn’t that sound familiar?
No-one really knows the future with certainty, but knowing the track record of
earlier announcements of imminent breakthroughs, some critical thinking is
advised. We’ll return to the future of AI in the final chapter, but for now, let’s
see how artificial neural networks are built.
Putting neurons together: networks
A single neuron would be way too simple to make decisions and predictions
reliably in most real-life applications. To unleash the full potential of neural
networks, we can use the output of one neuron as the input of other neurons,
whose outputs can be the input to yet other neurons, and so on. The output of
the whole network is obtained as the output of a certain subset of the
neurons, which are called the output layer. We’ll return to this in a bit, after
we have discussed the way neural networks adapt to produce different behaviors by
learning their parameters from data.
Key terminology
Layers
Often the network architecture is composed of layers. The input layer consists
of neurons that get their inputs directly from the data. So, for example, in an
image recognition task, the input layer would use the pixel values of the input
image as its inputs. The network typically also has hidden
layers that use the other neurons’ outputs as their input, and whose output is
used as the input to other layers of neurons. Finally, the output layer produces
the output of the whole network. All the neurons on a given layer get inputs
from neurons on the previous layer and feed their output to the next.
A classical example of a multilayer network is the so-called multilayer
perceptron. As we discussed above, Rosenblatt's Perceptron algorithm can be
used to learn the weights of a perceptron. For multilayer perceptron, the
corresponding learning problem is way harder and it took a long time before a
working solution was discovered. But eventually, one was invented: the
backpropagation algorithm led to a revival of neural networks in the late
1980s. It is still at the heart of many of the most advanced deep learning solutions.
Note
Meanwhile in Helsinki...
The path(s) leading to the backpropagation algorithm are rather long and
winding. An interesting part of the history is related to the computer science
department of the University of Helsinki. About three years after the founding
of the department in 1967, a Master’s thesis was written by a student called
Seppo Linnainmaa. The topic of the thesis was “Cumulative rounding error of
algorithms as a Taylor approximation of individual rounding errors” (the thesis
was written in Finnish, so this is a translation of the actual title “Algoritmin
kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirheiden Taylor-
kehitelmänä”).
The automatic differentiation method developed in the thesis was later applied
by other researchers to quantify the sensitivity of the output of a multilayer
neural network with respect to the individual weights, which is the key idea in
backpropagation.
A simple neural network classifier
To give a relatively simple example of using a neural network classifier, we'll
consider a task that is very similar to the MNIST digit recognition task, namely
classifying images into two classes. We will first create a classifier to classify
whether an image shows a cross (x) or a circle (o). Our images are represented
here as pixels that are either colored or white, and the pixels are arranged in 5
× 5 grid. In this format our images of a cross and a circle (more like a diamond,
to be honest) look like this:
In order to build a neural network classifier, we need to formalize the problem
in a way where we can solve it using the methods we have learned. Our first
step is to represent the information in the pixels by numerical values that can
be used as the input to a classifier. Let's use 1 if the square is colored, and 0 if
it is white. Note that although the symbols in the above graphic are of different
color (green and blue), our classifier will ignore the color information and use
only the colored/white information. The 25 pixels in the image make the inputs
of our classifier.
To make sure that we know which pixel is which in the numerical
representation, we can decide to list the pixels in the same order as you'd read
text, so row by row from the top, and reading each row from left to right. The
first row of the cross, for example, is represented as 1,0,0,0,1; the second row
as 0,1,0,1,0, and so on. The full input for the cross is then:
1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1.
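As a small sketch (in Python, our own choice of language for illustration), this is all the encoding step amounts to: write the grid down row by row and flatten it into a single list of 25 numbers.

```python
# 1 = colored pixel, 0 = white pixel; rows listed from top to bottom
cross = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]

# flatten row by row, reading each row from left to right
cross_input = [pixel for row in cross for pixel in row]
print(cross_input)
# [1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
```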
We'll use the basic neuron model where the first step is to compute a linear
combination of the inputs. We thus need a weight for each of the input pixels,
which means 25 weights in total.
Finally, we use the step activation function. If the linear combination is
negative, the neuron activation is zero, which we decide to use to signify a
cross. If the linear combination is positive, the neuron activation is one, which
we decide to use to signify a circle.
Let's see what happens when all the weights take the same numerical value, 1.
With this setup, our linear combination for the cross image will be 9 (9 colored
pixels, so 9 × 1, and 16 white pixels, 16 × 0), and for the circle image it will be 8
(8 colored pixels, 8 × 1, and 17 white pixels, 17 × 0). In other words, the linear
combination is positive for both images and they are thus classified as circles.
Not a very good result given that there are only two images to classify.
To improve the result, we need to adjust the weights in such a way that the
linear combination will be negative for a cross and positive for a circle. If we
think about what differentiates images of crosses and circles, we can see that
circles have no colored pixels in the center of the image, whereas crosses do.
Likewise, the pixels at the corners of the image are colored in the cross, but
white in the circle.
We can now adjust the weights. There are an infinite number of weights that
do the job. For example, assign weight -1 to the center pixel (the 13th pixel),
and weight 1 to the pixels in the middle of each of the four sides of the image,
letting all the other weights be 0. Now, for the cross input, the center pixel
produces the value –1, while for all the other pixels either the pixel value or the
weight is 0, so that –1 is also the total value. This leads to activation 0, and the
cross is correctly classified.
How about the circle then? Each of the pixels in the middle of the sides
produces the value 1, which makes 4 × 1 = 4 in total. For all the other pixels
either the pixel value or the weight is zero, so 4 is the total. Since 4 is a positive
value, the activation is 1, and the circle is correctly recognized as well.
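The whole worked example can be condensed into a few lines of Python (a sketch of the reasoning above, not anything prescribed by the course). The circle pattern below is our reading of the diamond-like shape in the figure; only the colored/white pattern matters.

```python
def classify(pixels, weights):
    # linear combination of the inputs followed by the step activation:
    # activation 0 signifies a cross, activation 1 a circle
    z = sum(p * w for p, w in zip(pixels, weights))
    return 1 if z > 0 else 0

cross  = [1,0,0,0,1, 0,1,0,1,0, 0,0,1,0,0, 0,1,0,1,0, 1,0,0,0,1]
circle = [0,0,1,0,0, 0,1,0,1,0, 1,0,0,0,1, 0,1,0,1,0, 0,0,1,0,0]

# first attempt: all 25 weights equal to 1 -> both images are classified as circles
all_ones = [1] * 25
print(classify(cross, all_ones), classify(circle, all_ones))   # prints: 1 1

# improved weights: -1 for the center pixel (the 13th pixel, index 12),
# +1 for the pixels in the middle of each side (indices 2, 10, 14, 22), 0 elsewhere
weights = [0] * 25
weights[12] = -1
for i in (2, 10, 14, 22):
    weights[i] = 1
print(classify(cross, weights), classify(circle, weights))     # prints: 0 1
```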
Happy or not?
We will now follow similar reasoning to build a classifier for smiley faces. You
can assign weights to the input pixels in the image by clicking on them. Clicking
once sets the weight to 1, and clicking again sets it to -1. The activation 1
indicates that the image is classified as a happy face, which can be correct or
not, while activation –1 indicates that the image is classified as a sad face.
Don't be discouraged by the fact that you will not be able to classify all the
smiley faces correctly: it is in fact impossible with our simple classifier! This is
one important learning objective: sometimes perfect classification just isn't
possible because the classifier is too simple. In this case the simple neuron that
uses a linear combination of the inputs is too simple for the task. Observe how
you can build classifiers that work well in different cases: some classify most of
the happy faces correctly while being worse for sad faces, or the other way
around.
III. Advanced neural network techniques
In the previous section, we discussed the basic ideas behind most neural network
methods: multilayer networks, non-linear activation functions, and learning rules such as
the backpropagation algorithm.
They power almost all modern neural network applications. However, there
are some interesting and powerful variations of the theme that have led to
great advances in deep learning in many areas.
Convolutional neural networks (CNNs)
One area where deep learning has achieved spectacular success is image
processing. The simple classifier that we studied in detail in the previous
section is severely limited – as you noticed it wasn’t even possible to classify all
the smiley faces correctly. Adding more layers in the network and using
backpropagation to learn the weights does in principle solve the problem, but
another one emerges: the number of weights becomes extremely large and
consequently, the amount of training data required to achieve satisfactory
accuracy can become too large to be realistic.
Fortunately, a very elegant solution to the problem of too many weights exists:
a special kind of neural network, or rather, a special kind of layer that can be
included in a deep neural network. This special kind of layer is a so-
called convolutional layer. Networks including convolutional layers are
called convolutional neural networks (CNNs). Their key property is that they
can detect image features such as bright or dark (or specific color) spots, edges
in various orientations, patterns, and so on. These form the basis for detecting
more abstract features such as a cat’s ears, a dog’s snout, a person’s eye, or
the octagonal shape of a stop sign. It would normally be hard to train a neural
network to detect such features based on the pixels of the input image,
because the features can appear in different positions, different orientations,
and in different sizes in the image: moving the object or the camera angle will
change the pixel values dramatically even if the object itself looks just the
same to us. Learning to detect a stop sign in all these different
conditions would require vast amounts of training data, because the network
would only detect the sign in conditions where it has appeared in the training
data. So, for example, a stop sign in the top right corner of the image would be
detected only if the training data included an image with the stop sign in the
top right corner. CNNs can recognize the object anywhere in the image no
matter where it has been observed in the training images.
Note
Why we need CNNs
CNNs use a clever trick to reduce the amount of training data required to
detect objects in different conditions. The trick basically amounts to using the
same input weights for many neurons – so that all of these neurons are
activated by the same pattern – but with different input pixels. We can for
example have a set of neurons that are activated by a cat’s pointy ear. When
the input is a photo of a cat, two neurons are activated, one for the left ear and
another for the right. We can also let the neuron’s input pixels be taken from a
smaller or a larger area, so that different neurons are activated by the ear
appearing in different scales (sizes), so that we can detect a small cat’s ears
even if the training data only included images of big cats.
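To see the weight-sharing trick in the simplest possible terms, here is a small sketch in pure Python (with a made-up "corner" pattern standing in for a cat's pointy ear). The very same set of weights is slid over every position of the image, so the pattern is detected no matter where it appears.

```python
# a tiny 6 x 6 binary image with the same 2 x 2 "corner" pattern in two places
image = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# one shared set of weights (a 2 x 2 filter) that responds to the corner pattern
filt = [
    [1, 1],
    [1, -1],
]

def convolve(image, filt):
    # slide the same filter over every position of the image (no padding)
    h, w = len(image), len(image[0])
    fh, fw = len(filt), len(filt[0])
    out = []
    for i in range(h - fh + 1):
        row = []
        for j in range(w - fw + 1):
            s = sum(filt[a][b] * image[i + a][j + b]
                    for a in range(fh) for b in range(fw))
            row.append(s)
        out.append(row)
    return out

activations = convolve(image, filt)
# the filter responds most strongly (value 3) wherever the corner pattern occurs,
# regardless of its position in the image
print(max(max(row) for row in activations))
```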
The convolutional neurons are typically placed in the bottom layers of the
network, which process the raw input pixels. Basic neurons (like the
perceptron neuron discussed above) are placed in the higher layers, which
process the output of the bottom layers. The bottom layers can usually be
trained using unsupervised learning, without a particular prediction task in
mind. Their weights will be tuned to detect features that appear frequently in
the input data. Thus, with photos of animals, typical features will be ears and
snouts, whereas in images of buildings, the features are architectural
components such as walls, roofs, windows, and so on. If a mix of various
objects and scenes is used as the input data, then the features learned by the
bottom layers will be more or less generic. This means that pre-trained
convolutional layers can be reused in many different image processing tasks.
This is extremely important since it is easy to get virtually unlimited amounts of
unlabeled training data – images without labels – which can be used to train
the bottom layers. The top layers are always trained by supervised machine
learning techniques such as backpropagation.
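As one concrete illustration of reusing pre-trained convolutional layers, below is a hedged sketch using the Keras library (our own choice of library; the course doesn't name one). The pre-trained bottom layers are kept fixed and only a small task-specific top layer is trained with supervised learning. The variables train_images and train_labels are hypothetical placeholders for your own labeled data.

```python
import tensorflow as tf

# bottom layers: convolutional feature detectors pre-trained on a large, generic image set
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet")
base.trainable = False   # reuse the pre-trained layers as-is

# top layers: a small classifier trained with supervised learning for our own task
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. two classes of our own
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_images: array of shape (n, 96, 96, 3); train_labels: array of 0/1 labels
# (hypothetical placeholders for your own labeled data)
# model.fit(train_images, train_labels, epochs=5)
```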
The first version of ChatGPT was based on a GPT-3.5 model fine-tuned with
supervised and reinforcement learning using a large number of human-
rated responses. The purpose of the fine-tuning process was to steer the model
away from toxic and incorrect responses that the language model had picked
up from its training data, and towards comprehensive and helpful responses.
It is not easy to say what caused the massive media frenzy and the
unprecedented interest in ChatGPT from pretty much everyone, even
those who hadn't paid much attention to AI thus far. Probably some of it is
explained by the somewhat better quality of the output, due to the fine-tuning,
and the easy-to-use chat interface, which enables the user to not only get one-
off answers to isolated questions, like any of the earlier LLMs, but also
maintain a coherent dialogue in a specific context. In the same vein, the chat
interface allows one to make requests like "explain this to a five year old" or
"write that as a song in the style of Nick Cave." (Mr Cave, however, wasn't
impressed [BBC]). In any case, ChatGPT succeeded in bumping the interest in
AI to completely new levels.
CHAPTER 6
I. About predicting the future
We will start by addressing what is known to be one of the hardest problems of all:
predicting the future.
You may be disappointed to hear this, but we don’t have a crystal ball that
would show us what the world will be like in the future and how AI will
transform our lives. As scientists, we are often asked to provide predictions,
and our refusal to provide any is faced with a roll of the eyes (“boring
academics”). But in fact, we maintain that anyone who claims to know the future
of AI and the implications it will have on our society should be treated with
suspicion.
The reality distortion field
Not everyone is quite as conservative about their forecasts, however. In the
modern world where big headlines sell, and where you have to compress news
into 280 characters, reserved (boring?) messages are lost, and simple and
dramatic messages are magnified. In the public perception of AI, this is clearly
true.
Note
From utopian visions to grim predictions
The media sphere is dominated by the extremes. We are beginning to see AI
celebrities, standing for one big idea and making oracle-like forecasts about
the future of AI. The media love their clear messages. Some promise us
a utopian future with exponential growth and trillion-dollar industries
emerging out of nowhere, true AI that will solve all problems we cannot solve
by ourselves, and where humans don’t need to work at all.
It has also been claimed that AI is a path to world domination. Others make
even more extraordinary statements according to which AI marks the end of
humanity (in about 20-30 years from now), life itself will be transformed in the
“Age of AI”, and that AI is a threat to our existence.
While some forecasts will probably get at least something right, others will
likely be useful only as demonstrations of how hard it is to predict, and many
don’t make much sense. What we would like to achieve is for you to be able to
look at these and other forecasts, and be able to critically evaluate them.
On hedgehogs and foxes
The political scientist Philip E. Tetlock, author of Superforecasting: The Art and
Science of Prediction, classifies people into two categories: those who have one
big idea (“hedgehogs”), and those who have many small ideas (“foxes”).
Tetlock carried out an experiment between 1984 and 2003 to study factors
that could help us identify which predictions are likely to be accurate and
which are not. One of the significant findings was that foxes tend to be clearly
better at prediction than hedgehogs, especially when it comes to long-term
forecasting.
Probably the messages that can be expressed in 280 characters are more often
big and simple hedgehog ideas. Our advice is to pay attention to carefully
justified and balanced information sources, and to be suspicious about people
who keep explaining everything using a single argument.
Predicting the future is hard but at least we can consider the past and present
AI, and by understanding them, hopefully be better prepared for the future,
whatever it turns out to be like.
AI winters
The history of AI, just like many other fields of science, has witnessed the
coming and going of various different trends. In philosophy of science, the
term used for a trend is paradigm. Typically, a particular paradigm is adopted
by most of the research community and optimistic predictions about progress
in the near future are provided. For example, in the 1960s neural networks
were widely believed to solve all AI problems by imitating the learning
mechanisms found in nature, the human brain in particular. The next big thing
was expert systems based on logic and human-coded rules, which was the
dominant paradigm in the 1980s.
The cycle of hype
In the beginning of each wave, a number of early success stories tend to make
everyone happy and optimistic. The success stories, even if they may be in
restricted domains and in some ways incomplete, become the focus of public
attention. Many researchers rush into AI – or at least start calling their research AI –
in order to access the increased research funding. Companies also initiate and
expand their efforts in AI in the fear of missing out (FOMO).
So far, each time an all-encompassing, general solution to AI has been said to
be within reach, progress has ended up running into insurmountable problems,
which at the time were thought to be minor hiccups. In the case of neural
networks in the 1960s, the hiccups were related to handling nonlinearities and
to solving the machine learning problems associated with the increasing
number of parameters required by neural network architectures. In the case of
expert systems in the 1980s, the hiccups were associated with handling
uncertainty and common sense. As the true nature of the remaining problems
dawned after years of struggling and unfulfilled promises, pessimism about
the paradigm accumulated and an AI winter followed: interest in the field
faltered and research efforts were directed elsewhere.
Modern AI
Currently, roughly since the turn of the millennium, AI has been on the rise
again. Modern AI methods tend to focus on breaking a problem into a number
of smaller, isolated and well-defined problems and solving them one at a time.
Modern AI is bypassing grand questions about the meaning of intelligence, the
mind, and consciousness, and focusing on building practically useful solutions
to real-world problems. Good news for us all who can benefit from such
solutions!
Another characteristic of modern AI methods, closely related to working in the
complex and “messy” real world, is the ability to handle uncertainty, which we
demonstrated by studying the uses of probability in AI in Chapter 3. Finally, the
current upwards trend of AI has been greatly boosted by the comeback of
neural networks and deep learning techniques capable of processing images
and other real-world data better than anything we have seen before.
Note
So are we in a hype cycle?
Whether history will repeat itself, and the current boom will once again be
followed by an AI winter, is a matter that only time can tell. Even if it does, and
the progress towards better and better solutions slows to a halt, the
significance of AI in society is here to stay. Thanks to the focus on useful
solutions to real-world problems, modern AI research yields fruit already
today, rather than trying to solve the big questions about general intelligence
first – which was where the earlier attempts failed.
Prediction 1: AI will continue to be all around us
As you recall, we started by motivating the study of AI by discussing prominent
AI applications that affect all our lives. We highlighted three examples: self-
driving vehicles, recommendation systems, and image and video processing.
During the course, we have also discussed a wide range of other applications
that contribute to the ongoing technological transition.
Note
AI making a difference
As a consequence of focusing on practicality rather than the big problems, we
live our lives surrounded by AI (even if we may most of the time be happily
unaware of it): the music we listen to, the products we buy online, the movies
and series we watch, our routes of transportation, and even the news and
information that we have available, are all influenced more and more by AI.
What is more, basically any field of science, from medicine and astrophysics to
medieval history, is also adopting AI methods in order to deepen our
understanding of the universe and of ourselves.
However, if the system possesses superior intelligence, it will soon reach the
maximum level of paper clip production that the available resources, such as
energy and raw materials, allow. After this, it may come to the conclusion that
it needs to redirect more resources to paper clip production. In order to do so,
it may need to prevent the use of the resources for other purposes even if they
are essential for human civilization. The simplest way to achieve this is to kill all
humans, after which a great deal more resources become available for the
system’s main task, paper clip production.
Why these scenarios are unrealistic
There are a number of reasons why both of the above scenarios are extremely
unlikely and belong to science fiction rather than serious speculation about the
future of AI.
Reason 1:
Firstly, the idea that a superintelligent, conscious AI that can outsmart humans
emerges as an unintended result of developing AI methods is naive. As you
have seen in the previous chapters, AI methods are nothing but automated
reasoning, based on the combination of perfectly understandable principles
and plenty of input data, both of which are provided by humans or systems
deployed by humans. To imagine that the nearest neighbor classifier, linear
regression, the AlphaGo game engine, or even a deep neural network could
become conscious and start evolving into a superintelligent AI mind requires a
(very) lively imagination.
Note that we are not claiming that building human-level intelligence would be
categorically impossible. You only need to look as far as the mirror to see a
proof of the possibility of a highly intelligent physical system. To repeat what
we are saying: superintelligence will not emerge from developing narrow AI
methods and applying them to solve real-world problems (recall the narrow vs
general AI from the section on the philosophy of AI in Chapter 1).
Reason 2:
Secondly, one of the favorite ideas of those who believe in superintelligent AI
is the so-called singularity: a system that optimizes and “rewires” itself so that
it can improve its own intelligence at an ever accelerating, exponential rate.
Such superintelligence would leave humankind so far behind that we become
like ants that can be exterminated without hesitation. The idea of exponential
intelligence increase is unrealistic for the simple reason that even if a system
could optimize its own workings, it would keep facing more and more difficult
problems that would slow down its progress, much as the progress of human
science requires ever greater efforts and resources from the whole research
community and indeed the whole of society, which the superintelligent entity
wouldn’t have access to. Human society still has the power to decide what
we use technology, even AI technology, for. Much of this power is indeed given
to us by technology, so that every time we make progress in AI technology, we
become more powerful and better at controlling any potential risks due to it.
Note
The value alignment problem
The paper clip example is known as the value alignment problem: specifying
the objectives of the system so that they are aligned with our values is very
hard. However, suppose that we create a superintelligent system that could
defeat humans who tried to interfere with its work. It’s reasonable to assume
that such a system would also be intelligent enough to realize that when we
say “make me paper clips”, we don’t really mean to turn the Earth into a paper
clip factory on a planetary scale.
Separating stories from reality
All in all, the Terminator is a great story to make movies about but hardly a real
problem worth panicking about. The Terminator is a gimmick, an easy way to
get a lot of attention, a poster boy for journalists to increase click rates, a red
herring to divert attention away from perhaps boring, but real, threats like
nuclear weapons, lack of democracy, environmental catastrophes, and climate
change. In fact, the real threat the Terminator poses is the diversion of
attention from the actual problems, some of which involve AI, and many of
which don’t. We’ll discuss the problems posed by AI in what follows, but the
bottom line is: forget about the Terminator, there are much more important
things to focus on.
II. The societal implications of AI
In the very beginning of this course, we briefly discussed the importance of AI in today’s
and tomorrow’s society but at that time, we could do so only to a limited extent because
we hadn’t introduced enough of the technical concepts and methods to ground the
discussion on concrete terms.
Now that we have a better understanding of the basic concepts of AI, we are in
a much better position to take part in rational discussion about the
implications of AI as it already exists today.
Implication 1: Algorithmic bias
AI, and in particular, machine learning, is being used to make important
decisions in many sectors. This brings up the concept of algorithmic bias. What
it means is the embedding of a tendency to discriminate according to ethnicity,
gender, or other factors when making decisions about job applications, bank
loans, and so on.
Note
Once again, it’s all about the data
The main reason for algorithmic bias is human bias in the data. For example,
when a job application filtering tool is trained on decisions made by humans,
the machine learning algorithm may learn to discriminate against women or
individuals with a certain ethnic background. Notice that this may happen even
if ethnicity or gender are excluded from the data since the algorithm will be
able to exploit the information in the applicant’s name or address.
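The point about proxies can be demonstrated with a small synthetic example (a sketch with made-up data, using the scikit-learn library of our own choosing; nothing here refers to any real dataset). Even though the protected attribute is removed before training, the model picks up on a correlated proxy feature, and its decisions still differ systematically between the two groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# synthetic data: a protected attribute (0/1), a correlated proxy feature
# (think: postal code) and a genuinely relevant feature (think: qualifications)
protected = rng.integers(0, 2, size=n)
proxy = protected + rng.normal(0, 0.5, size=n)   # strongly correlated with the protected attribute
skill = rng.normal(0, 1, size=n)

# historical, biased decisions: based partly on skill, partly on the protected attribute
label = (skill + 1.5 * protected + rng.normal(0, 0.5, size=n) > 1.0).astype(int)

# train WITHOUT the protected attribute -- only the proxy and skill are used
X = np.column_stack([proxy, skill])
model = LogisticRegression().fit(X, label)
pred = model.predict(X)

# the acceptance rate still differs between the groups, because the proxy leaks the information
print("acceptance rate, group 0:", pred[protected == 0].mean())
print("acceptance rate, group 1:", pred[protected == 1].mean())
```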
Algorithmic bias isn’t a hypothetical threat conceived by academic researchers.
It’s a real phenomenon that is already affecting people today.
Online advertising
It has been noticed that online advertisers like Google tend to display ads for
lower-paying jobs to women more often than to men. Likewise, doing a search
with a name that sounds African American may produce an ad for a tool for
accessing criminal records, which is less likely to happen otherwise.
Social networks
Since social networks are basing their content recommendations essentially on
other users’ clicks, they can easily lead to magnifying existing biases even if
they are very minor to start with. For example, it was observed that when
searching for professionals with female first names, LinkedIn would ask the
user whether they actually meant a similar male name: searching for Andrea
would result in the system asking “did you mean Andrew”? If people
occasionally click Andrew’s profile, perhaps just out of curiosity, the system
will boost Andrew even more in subsequent searches.
There are numerous other examples we could mention, and you have probably
seen news stories about them. The main difficulty in the use of AI and machine
learning instead of rule-based systems is their lack of transparency. Partly
this is a consequence of the algorithms and the data being trade secrets that
the companies are unlikely to open up for public scrutiny. Even if they did,
it would often be hard to identify the part of the algorithm or the elements
of the data that lead to discriminatory decisions.
Note
Transparency through regulation?
A major step towards transparency is the European General Data Protection
Regulation (GDPR). It requires that all companies that either reside within the
European Union or that have European customers must:
Upon request, reveal what data they have collected about any individual
(right of access)
Upon request, delete any such data that they are not obliged to keep for
other reasons (right to be forgotten)
Provide an explanation of the data processing carried out on the
customer’s data (right to explanation)
The last point means, in other words, that companies such as Facebook and
Google, at least when providing services to European users, must explain their
algorithmic decision-making processes. It is, however, still unclear what exactly
counts as an explanation. Does for example a decision reached by using the
nearest neighbor classifier (Chapter 4) count as an explainable decision, or
would the coefficients of a logistic regression classifier be better? How about
deep neural networks that easily involve millions of parameters trained using
terabytes of data? The discussion about the technical implementation of the
explainability of decisions based on machine learning is currently intense.
In any case, the GDPR has potential to improve the transparency of AI
technologies.
Implication 2: Seeing is believing — or is it?
We are used to believing what we see. When we see a leader on TV stating
that their country will engage in a trade war with another country, or when a
well-known company spokesperson announces an important business
decision, we tend to trust them more than if we just read about the statement
second-hand in news written by someone else.
Similarly, when we see photo evidence from a crime scene or from a
demonstration of a new tech gadget, we put more weight on the evidence
than on a written report explaining how things look.
Of course, we are aware of the possibility of fabricating fake evidence. People
can be put in places they never visited, with people they never met, by
photoshopping. It is also possible to change the way things look by simply
adjusting the lighting or sucking in one’s stomach in cheap before–after shots
advertising the latest diet pill.
AI is taking the possibilities of fabricating evidence to a whole new level.
Metaphysics Live, for example, is a system capable of doing face-swaps, de-aging and other
tricks in real time.
The most important preventive action to avoid huge societal issues such as this
is to help young people obtain a wide-ranging education: one that provides a
basis for pursuing many different jobs and which isn’t at high risk of becoming
obsolete in the near future.