ARTIFICIAL INTELLIGENCE
LIBFEXDLBDSEAIS01
INTRODUCTION TO ARTIFICIAL
INTELLIGENCE
MASTHEAD
Publisher:
The London Institute of Banking & Finance
8th Floor, Peninsular House
36 Monument Street
London
EC3R 8LJ
United Kingdom
LIBFEXDLBDSEAIS01
Version No.: 001-2024-0327
TABLE OF CONTENTS
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Introduction
Signposts Throughout the Course Book
Learning Objectives
Unit 1
History of AI
Unit 2
Modern AI Systems
Unit 3
Reinforcement Learning
Unit 4
Natural Language Processing
Unit 5
Computer Vision
Appendix
List of References
List of Tables and Figures
INTRODUCTION
WELCOME
SIGNPOSTS THROUGHOUT THE COURSE BOOK
This course book contains the core content for this course. Additional learning materials
can be found on the learning platform, but this course book should form the basis for your
learning.
The content of this course book is divided into units, which are divided further into sec-
tions. Each section contains only one new key concept to allow you to quickly and effi-
ciently add new learning material to your existing knowledge.
At the end of each section of the digital course book, you will find self-check questions.
These questions are designed to help you check whether you have understood the con-
cepts in each section.
For all modules with a final exam, you must complete the knowledge tests on the learning
platform. You will pass the knowledge test for each unit when you answer at least 80% of
the questions correctly.
When you have passed the knowledge tests for all the units, the course is considered fin-
ished and you will be able to register for the final assessment. Please ensure that you com-
plete the evaluation prior to registering for the assessment.
Good luck!
LEARNING OBJECTIVES
In this course, you will get an introduction to the field of artificial intelligence.
The discipline of Artificial Intelligence originates from various fields of study such as cog-
nitive science and neuroscience. The coursebook starts with an overview of important
events and paradigms that have shaped the current understanding of artificial intelli-
gence. In addition, you will learn about the typical tasks and application areas of artificial
intelligence.
On the completion of this coursebook, you will understand the concepts behind reinforce-
ment learning, which are comparable to the human way of learning in the real world by
exploration and exploitation.
Moreover, you will learn about the fundamentals of natural language processing and com-
puter vision. Both are important for artificial agents to be able to interact with their envi-
ronment.
UNIT 1
HISTORY OF AI
STUDY GOALS
Introduction
This unit will discuss the history of artificial intelligence (AI). We will start with the histori-
cal developments of AI which date back to Ancient Greece. We will also discuss the recent
history of AI.
In the next step, we will learn about the AI winters. From a historical perspective, there
have been different hype cycles in the development of AI because not all requirements for
a performant system could be met at that time.
We will also examine expert systems and their development. The last section closes with a
discussion of the notable advances in artificial intelligence. This includes modern concepts and their use cases.
The figure above illustrates the milestones in AI which will be discussed in the following
sections.
Aristotle, Greek Philosopher (384–322 BCE)
Aristotle was the first to formalize human thinking in a way that made it possible to imitate it. To formalize logical conclusions, he fully enumerated all possible categorical syllogisms (Giles, 2016).
These early considerations about computing machinery are very important because prog-
ress in computing is a necessary precondition for any sort of development in AI.
René Descartes, French Philosopher (1596–1650)
The French philosopher Descartes believed that rationality and reason can be defined
using principles from mechanics and mathematics. The ability to formulate objectives
using equations is an important foundation for AI, as its objectives are defined mathemati-
cally. According to Descartes, rationalism and materialism are two sides of the same coin
(Bracken, 1984). This links to the methods used in AI where rational decisions are derived
in a mathematical way.
Thomas Hobbes specified Descartes’ theories about rationality and reason. In his work, he
identified similarities between human reasoning and computations of machines. Hobbes
described that, in rational decision-making, humans employ operations similar to calcu-
lus, such that they can be formalized in a way that is analogous to mathematics (Flasiński,
2016).
Hume made fundamental contributions to questions of logical induction and the concept
of causal reasoning (Wright, 2009). For example, he combined learning principles with
repeated exposure, which has had – among others – a considerable influence on the learning curve (Russell & Norvig, 2022).
Learning curve: The learning curve is a graphical representation of the ratio between a learning outcome and the time required to solve a new task.
Nowadays, many machine learning algorithms are based on the principle of deriving patterns or relations in data through repeated exposure.
Recent History of Artificial Intelligence
The recent history of AI started around 1956 when the seminal Dartmouth conference took
place. The term artificial intelligence was first coined at this conference and a definition of
the concept was proposed (Nilsson, 2009). In the following, we will discuss the key personalities, organizations, and concepts in the development of AI.
Key personalities
The recent history of AI normally starts with the pioneering Dartmouth conference in 1956
where the term “artificial intelligence” was first coined, and a definition of the term was
suggested.
During the decade of AI's inception, important personalities contributed to the discipline.
Alan Turing was an English computer scientist and mathematician who formalized and
mechanized rational thought processes. In 1950 he conceptualized the well-known Turing
Test. This test examines whether an AI can communicate with a human observer without the observer being able to distinguish whether they are conversing with a machine or another human. If the human cannot identify the AI as such, it is considered a real AI (Turing, 1950).
The American scientist John McCarthy studied automata. It was he who first coined the
term “artificial intelligence” during preparations for the Dartmouth conference (McCarthy
et al., 1955). In cooperation with the Massachusetts Institute of Technology (MIT) and
International Business Machines (IBM), he established AI as an independent field of study.
He was the inventor of the programming language Lisp in 1958 (McCarthy, 1960). For more
than 30 years LISP was used in a variety of applications of AI, such as fraud detection and
robotics. In the 1960s, he founded the Stanford Artificial Intelligence Laboratory which has
had a significant influence on research on implementing human capabilities, like reason-
ing, listening, and seeing, in machines (Feigenbaum, 2012).
American researcher Marvin Minsky, a founder of the MIT Artificial Intelligence Laboratory
in 1959, was another important participant in the Dartmouth conference. Minsky com-
bined insights from AI and cognitive science (Horgan, 1993).
With a background in linguistics and philosophy, Noam Chomsky is another scientist who
contributed to the development of AI. His works about formal language theory and the
development of the Chomsky hierarchy still play an important role in areas such as natural
language processing (NLP). Besides that, he is well known for his critical views on topics
such as social media.
Key institutions
The most influential institutions involved in the development of AI are Dartmouth College
and MIT. Since the Dartmouth conference, there have been several important conferences
at Dartmouth College discussing the latest developments in AI. Many of the early influen-
tial AI researchers have taught at MIT, making it a key institution for AI research. But also
companies such as IBM and Intel, and government research institutions, such as the
Defense Advanced Research Projects Agency (DARPA), have contributed much to AI by funding
research on the subject (Crevier, 1993).
Many research areas have been contributing to the development of artificial intelligence.
The most important areas are decision theory, game theory, neuroscience, and natural
language processing:
• In decision theory mathematical probability and economic utility are combined. This
provides the formal criteria for decision-making in AI regarding economic benefit and
dealing with uncertainty.
• Game theory is an important foundation for rational agents to learn strategies to solve
games. It is based on the research of the Hungarian–American mathematician John von Neumann (1903–1957) and the German–American economist and game theorist Oskar Morgenstern (1902–1977) (Leonard, 2010).
• The insights from neuroscience about how the brain works are increasingly used in arti-
ficial intelligence models, especially as the importance of artificial neural networks
(ANN) is increasing. Nowadays, there are many models in AI trying to emulate the way
the brain stores information and solves problems.
• Natural language processing (NLP) combines linguistics and computer science. The goal
of NLP is to process not only written language (text) but also spoken language (speech).
High-level programming languages are important to program AI. They are closer to human
language than low-level programming languages such as machine code or assembly lan-
guage and allow programmers to work independently from the hardware’s instruction
sets. Some of the languages that have been developed specifically for AI are Lisp, Prolog,
and Python:
• Lisp was developed by John McCarthy and is one of the oldest programming languages. The name comes from “list processing”, as Lisp is able to process character strings in a unique way (McCarthy, 1960). Even though it dates back to the late 1950s, it has not only been used for early AI programming but is still relevant today.
• Another early AI programming language is Prolog which was specially designed to prove
theorems and solve logical formulas.
• Nowadays, the general-purpose high-level programming language Python is the most
important programming language. As Python is open source, there exist extensive libra-
ries which help programmers to create applications in a very efficient way.
There are three important factors that have contributed to the recent progress in artificial
intelligence:
• Increasing availability of massive amounts of data, which are required to develop and
train AI algorithms.
• Large improvements in data processing capacity of computers.
• New insights from mathematics, cognitive science, philosophy, and machine learning.
These factors support the development of approaches that were previously impossible, be
it because of a lack of processing capability or a lack of training data.
1.2 AI Winter
The term “AI winter” first appeared in the 1980s. It was coined by AI researchers to
describe periods when interest, research activities, and funding of AI projects significantly
decreased (Crevier, 1993). The term might sound a bit dramatic. However, it reflects the
culture of AI, which is known for its excitement and exuberance.
Historically, the term has its origin in the expression “nuclear winter”, which refers to an after-effect of a hypothetical nuclear world war. It describes a state in which the atmosphere is filled with ash and sunlight cannot reach the Earth’s surface, meaning that temperatures would drop excessively and nothing would be able to grow. Therefore,
transferring this term to AI, it marks periods where interest and funding of AI technologies
were significantly reduced, causing a reduction in research activities. Downturns like this
are usually based on exaggerated expectations towards the capabilities of new technolo-
gies that cannot be realistically met.
There have been two AI winters. The first lasted approximately from 1974 to 1980 and the
second from 1987 to 1993 (Crevier, 1993).
During the cold war between the former Soviet Union and the United States (US), auto-
matic language translation was one of the major drivers to fund AI research activities
(Hutchins, 1997). As there were not enough translators to meet the demand, expectations
were high to automate this task. However, the promised outcomes in machine translation
could not be met. Early attempts to automatically translate language failed spectacularly.
One of the big challenges at that time was handling word ambiguities. For instance, the
English sentence “out of sight, out of mind” was translated into Russian as the equivalent
of “invisible idiot” (Hutchins, 1995).
When the Automatic Language Processing Advisory Committee evaluated the results of
the research that had been generously funded by the US, they concluded that machine translation was neither more accurate, nor faster, nor cheaper than employing humans (Automatic Language Processing Advisory Committee, 1966). Additionally, perceptrons – which
were at that time a popular model of neural-inspired AI – had severe shortcomings as even
simple logical functions, such as exclusive or (XOR), could not be represented in those
early systems.
The second AI winter started around 1987 when the AI community became more pessimistic about developments. One major reason for this was the collapse of the Lisp machine business, which led to the perception that the industry might end (Newquist, 1994). Moreover, it turned out that it was not possible to develop early successful examples of expert systems beyond a certain point. Those expert systems had been the main driver of the returned interest in AI systems after the first AI winter. The reason for the limitations was that the growth of fact databases was no longer manageable, and results were unreliable for unknown inputs, i.e., inputs on which the machines had not been trained.
Lisp machine: A Lisp machine is a type of computer that supports the Lisp language.
However, there are also arguments that there is no such thing as an AI winter, and that AI winters are myths spread by a few prominent researchers and organizations who had lost
money (Kurzweil, 2014). While the interest in Lisp machines and expert systems
decreased, AI was still deeply embedded in many other types of processing operations
such as credit card transactions.
There are several conditions that can cause AI winters. The three most important requirements for the success of artificial intelligence are powerful algorithms, sufficient computing capacity (memory and processing speed), and large amounts of training data.
The past AI winters occurred because not all requirements were met.
During the first AI winter, there were already powerful algorithms. However, for successful
results, it is necessary to process a huge amount of data. This requires a lot of memory
capacity as well as high processing speed. At the time, there were not enough data availa-
ble to properly train those algorithms. Therefore, the expectations of interested parties
and investors could not be met. As the funded research was unable to produce the prom-
ised results, the funding was stopped.
By the 1980s, computing capacity had increased enough to train the available algorithms on small data sets. However, as approaches from machine learning and deep learning became integral parts of AI in the late 1980s, there was a greater need for large data
sets to train AI systems, which became an issue. The lack of labeled training data – even
though computing capacity would have been available – created the perception that sev-
eral of the AI projects had failed.
Nowadays, all three aspects mentioned above are fully met. There is enough computa-
tional power to train the available algorithms on a large number of existing data sets. The
figure below summarizes the preconditions for AI to be successful.
However, the question of whether there might be another AI winter in the future can
hardly be answered. If a hyped concept gets a lot of funding but does not perform, it might
be defunded which could cause another AI winter. Nevertheless, nowadays AI technolo-
gies are embedded in many other fields of research. If low-performing projects are
defunded, there is always room for new developments. Therefore, everybody is free to
decide whether AI winters are simply a myth or if the concept really matters.
Expert systems are designed to help a non-expert user make decisions based on the
knowledge of an expert.
Expert systems are composed of a body of formalized expert knowledge from a specific
application domain, which is stored in the knowledge base. The inference engine uses the knowledge base to draw conclusions from the rules and facts it contains. It implements rules of logical reasoning to derive new facts, rules, and conclusions not explicitly
contained in the given corpus of the knowledge base. A user interface enables the non-
expert user to interact with the expert system to solve a given problem from the applica-
tion domain.
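The separation into knowledge base, inference engine, and user interface can be sketched in a few lines of Python. The facts, rules, and the tiny text interface below are invented for illustration and do not correspond to any particular expert system.

```python
# Minimal sketch of an expert system: a knowledge base of facts and rules,
# a forward-chaining inference engine, and a tiny text-based user interface.
# The domain (a toy animal classifier) and all rule names are illustrative only.

facts = {"has_fur", "says_meow"}          # knowledge base: known facts
rules = [                                  # knowledge base: if-then rules
    ({"has_fur"}, "is_mammal"),
    ({"is_mammal", "says_meow"}, "is_cat"),
]

def infer(facts, rules):
    """Forward chaining: apply rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

if __name__ == "__main__":
    # User interface: report every conclusion the inference engine derived.
    for fact in sorted(infer(facts, rules) - facts):
        print("Derived:", fact)   # -> Derived: is_cat / Derived: is_mammal
```

Note how the domain-specific knowledge (facts and rules) is kept separate from the general inference logic, which is exactly the design idea described above.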
Types of Expert Systems
With respect to the representation of knowledge, three approaches to expert systems can
be distinguished:
One of the initial insights gained from the attempt at general problem solving was that the
construction of a domain specific problem solver should—at least in principle—be easier
to achieve. This led the way to think about systems that combined domain-specific knowl-
edge with domain-dependent apposite reasoning patterns. Edward Feigenbaum, who
worked at Stanford University, the leading academic institution for the subject at the time,
defined the term expert system and built the first practical examples while leading the
Heuristic Programming Project (Kuipers & Prasad, 2021).
The first notable application was Dendral, a system for identifying organic molecules. In
the next step, expert systems were established to help with medical diagnoses of infec-
tious diseases based on given data and rules (Woods, 1973). The expert system that
evolved out of this was called MYCIN, which had a knowledge base of around 600 rules.
However, it took until the 1980s for expert systems to reach the height of research interest,
leading to the development of commercial applications.
The main achievement of expert systems was their role in pioneering the idea of a formal,
yet accessible representation of knowledge. This representation was explicit in the sense
that it was formulated as a set of facts and rules that were suitable for creation, inspec-
tion, and review by a domain expert. This approach thus clearly separates domain-specific
business logic from the general logic needed to run the program – the latter encapsulated
in the inference engine. In stark contrast, more conventional programming approaches
implicitly represent both internal control and business logic in the form of a program code
that is hard to read and understand by people who are not IT experts. At least in principle,
the approach championed by expert systems enabled even non-programmers to develop,
improve, and maintain a software solution. Moreover, it introduced the idea of rapid pro-
totyping since the fixed inference engine enabled the creation of programs for entirely dif-
ferent purposes simply by changing the set of underlying rules in the knowledge base.
However, a major downside of the classical expert system paradigm, which also finally led
to a sharp decline in its popularity, was also related to the knowledge base. As expert sys-
tems were engineered for a growing number of applications, many interesting use cases
required larger and larger knowledge bases to satisfactorily represent the domain in ques-
tion. This insight proved problematic in two different aspects:
1. Firstly, the computational complexity of inference grows faster than linearly with the number of facts and rules. This means that for many practical problems the sys-
tem’s answering times were prohibitively high.
2. Secondly, as a knowledge base grows, proving its consistency by ensuring that no
constituent parts contradict each other, becomes exceedingly challenging.
Additionally, rule-based systems in general lack the ability to learn from experience. Exist-
ing rules cannot be modified by the expert system itself. Updates of the knowledge base
can only be done by the expert.
In the early years, AI research was dominated by “symbolic” AI. In this approach, rules
from formal logic are used to formalize thought processes as manipulation of symbolic
representations of information. Accordingly, AI systems developed during this era deal
with the implementation of logical calculus. In most cases, this is done by implementing a
search strategy, where solutions are derived in a step-by-step procedure. The steps in this
procedure are either inferred logically from a preceding step or systematically derived
using backtracking of possible alternatives to avoid dead ends.
The early years were also the period where first attempts for natural language processing
were developed. The first approaches for language processing were focused on highly lim-
ited environments and settings. Therefore, it was possible to achieve initial successes. The
simplification of working environments – a “microworld” approach – also yielded good
results in the fields of computer vision and robot control.
In parallel, the first theoretical models of neurons were developed. The research focus was
on the interaction between those cells (i.e., computational units) to implement basic logi-
cal functions in networks.
The focus of the first wave of AI research was primarily on logical inference. In contrast,
the main topics of the second wave were driven by the attempt to solve the problem of
knowledge representation. This shift of focus was caused by the insight that in day-to-day situations intelligent behavior is not only based on logical inference but much
more on general knowledge about the way the world works. This knowledge-based way to
view intelligence was the origin of early expert systems. The main characteristic of these
technologies was that domain-relevant knowledge was systematically stored in data-
bases. Using these databases, a set of methods was developed to access that knowledge
in an efficient, effective way.
The emerging interest in AI after the first AI winter was also accompanied by an upturn in
governmental funding at the beginning of the 1980s with projects such as the Alvey
project in the UK and the Fifth Generation Computer project of the Japanese Government
(Russell & Norvig, 2022).
During the 1990s, there were some major advances of AI in games when the computer system “Deep Blue” became the first to beat Garry Kasparov, the world champion in chess at that time.
Aside from this notable but narrow success, AI methods have become widely used in the
development of real-world applications. Successful approaches in the subfields of AI have
gradually found their way into everyday life – often without being explicitly labeled as AI.
In addition, since the early 1990s, there has been a growing number of ideas from decision theory, mathematics, statistics, and operations research that have contributed significantly to AI becoming a rigorous and mature scientific discipline. Especially the paradigm of intelligent agents has become increasingly popular. In this context, the concept of intelligent agents from economic theory combines with the notions of objects and modularity from computer science and forms the idea of entities that can act intelligently. This allows the perspective to shift from AI as an imitation of human intelligence to the study of intelligent agents and a broader study of intelligence in general.
The advances in AI since the 1990s have been supported by a significant increase in data
storage and computational capacities. Along with this, during the rise of the internet,
there has been an incomparable increase in variety, velocity, and volume of generated
data, which also supported the AI boom.
In 2012, the latest upturn in interest in AI research started when deep learning was developed based on advances in connectionist machine learning models. The increase in
data processing and information storage capabilities combined with larger data corpora
brought theoretical advances in machine learning models into practice. With deep learn-
ing, new performance levels in many machine learning benchmark problems could be
achieved. This led to a revival of interest in well-established learning models, like rein-
forcement learning and created space for new ideas, like adversarial learning.
There are many fields of study that continuously contribute to AI research. The most influ-
ential fields will be described in the following.
Linguistics
Linguistics can be broadly described as the science of natural language. It deals with
exploring the structural (grammatical) and phonetic properties of interpersonal communi-
cation. To understand language, it is necessary to understand the context and the subject
matter in which it is used. In his book Syntactic Structures, Noam Chomsky (1957) made an
important contribution to linguistics and, therefore, to natural language processing. Since
our thoughts are so closely linked to language as a form of representation, one could take
it a step further and link creativity and thought to linguistic AI. For example, how is it pos-
sible that a child says something it has never said before? In AI, we understand natural
language as a medium of communication in a specific context. Therefore, language is
much more than just a representation of words.
Cognition
In the context of AI, cognition refers to different capabilities such as perception, reasoning, intelligence, learning and understanding, and thinking and comprehension. This is also reflected in the word “recognition”. A large part of our current under-
standing of cognition is a combination of psychology and computer science. In
psychology, theories and hypotheses are formed from observations with humans and ani-
mals. In computer science, behavior is modeled based on what has been observed in psy-
chology. When modeling the brain by a computer, we have the same principle of stimulus
and response as in the human brain. When the computer receives a stimulus, an internal
representation of that stimulus is made. The response to that stimulus can lead to the
original model being modified. Once we have a well-working computer model for a spe-
cific situation, the next step will be to find out how decisions are made. As decisions based
on AI are involved in more and more areas of our lives, it is important to have high trans-
parency about the reasoning process to an external observer. Therefore, explainability (the ability to explain how a decision has been made) is becoming increasingly important.
However, approaches based on deep learning still lack explainability.
Games
When relating games to AI, this includes much more than gambling or computer games.
Rather, games refer to learning, probability, and uncertainty. In the early twentieth cen-
tury, game theory was established as a mathematical field of study by Oskar Morgenstern and John von Neumann (Leonard, 2010). In game theory, a comprehensive taxonomy of games was developed and, in connection with this, some gaming strategies were proven to be optimal.
Another discipline related to game theory is decision theory. While game theory is more
about how the moves of one player affect the options of another player, decision theory
deals with usefulness and uncertainty, i.e., utility and probability. Both are not necessarily
about winning but more about learning, experimenting with possible options, and finding
out what works based on observations.
Games, like chess, checkers, and poker, are usually played for the challenge of winning or
for entertainment. Nowadays, machines can play better than human players. Until 2016,
people believed that the game of Go might be an unsolvable challenge for computers
because of its combinatorial complexity. The objective of the game is to surround the
most territory on a board with 19 horizontal and vertical lines. Even though the ruleset is
quite simple, the complexity comes from the large size of the game board and the result-
ing number of possible moves. This complexity makes it impossible to apply methods that
have been used for games like chess and checkers. However, in 2015 DeepMind developed
the system AlphaGo based on deep networks and reinforcement learning. This system was
the first to be able to beat Lee Sedol, one of the world’s best Go players (Silver et al., 2016).
Not long after AlphaGo, DeepMind developed the system AlphaZero (Silver et al., 2018). In
contrast to AlphaGo, which learned from records of past Go games, AlphaZero only
learns based on intensive self-play following the set of rules. This system turned out to be
even stronger than AlphaGo. It is also remarkable that AlphaZero even found some effec-
tive and efficient strategies, which had, so far, been missed by Go experts.
It has only been a few years since the term “Internet of things” (IoT) first came up. IoT con-
nects physical and virtual devices using technologies from information and communica-
tion technology. In our everyday lives, we are surrounded by a multitude of physical devi-
ces that are always connected, such as phones, smart home devices, cars, and wearables.
The communication between those devices produces a huge amount of data which links
IoT to AI. While IoT itself is only about connecting devices and collecting data, AI can help
add intelligent behavior to the interaction between those machines.
Having intelligent devices integrated into our everyday lives not only creates opportunities but also many new challenges. For instance, data about medication based on physical
measurements of a wearable device could be used positively, to remind a person about
medication intake, but also to decide about a possible increase in their health insurance
rate. Therefore, topics like the ethics of data use and privacy violations become increasingly important in light of the new fields of use of AI.
Quantum computing
The Future of AI
It is always highly speculative to try to assess the impact of a research area or new technology on the future, as future prospects will always be biased by previous experiences. Therefore, we do not attempt to predict the long-term future of AI. Nevertheless, we
want to examine the directions of developments in AI and the supporting technologies.
The Gartner hype curve is frequently used to evaluate the potential of new technologies
(Gartner, 2021). The hype curve is presented in a diagram where the y-axis represents the
expectations towards a new technology and time is plotted on the x-axis.
Figure 5: The Gartner Hype Cycle
The hype cycle has some similarities with the inverted U-shape of a normal distribution
except that the right end of the curve leads into an increasing slope that eventually flat-
tens out.
In 2021, the hype cycle for artificial intelligence showed the following trends (Gartner,
2021):
So far, none of the topics of AI has yet reached the plateau of productivity, the stage that reflects general acceptance of an area and the productive use of the related technologies.
SUMMARY
Research about artificial intelligence has been of interest for a long time.
The first theoretical thoughts about artificial intelligence date back to
Greek philosophers like Aristotle. Those early considerations were con-
tinued by philosophers like Hobbes and Descartes. Since the 1950s, it
has also become an important component of computer science and
made important contributions in areas such as knowledge representa-
tion in expert systems, machine learning, and modeling neural net-
works.
In the past decades, there have been several ups and downs in AI
research. They were caused by a cycle between innovations accompa-
nied by high expectations and disappointment when those expectations
could not be met, often because of technical limitations.
Over time, AI has been shaped by different paradigms from multiple dis-
ciplines. The most popular paradigm nowadays is deep learning. New
fields of applications like IoT or quantum computing offer a vast amount
of opportunities for how AI can be used. However, it remains to be seen how intelligent behavior will be implemented in machines in the future.
UNIT 2
MODERN AI SYSTEMS
STUDY GOALS
– explain the difference between narrow and general artificial intelligence systems.
– name the most important application areas for artificial intelligence.
– understand the importance of artificial intelligence for corporate activities.
2. MODERN AI SYSTEMS
Introduction
Artificial intelligence has become an integral part of our everyday life. There are several
examples where we do not even notice the presence of AI, be it in Google maps or smart
replies in Gmail.
There are two categories of AI that will be explained in the following: narrow and general AI.
Even though many people believe that we already have some sort of strong artificial intel-
ligence, current approaches are still implemented in a domain-specific way and lack the
necessary flexibility to be considered AGI. However, there is a large consensus that it is
only a matter of time until artificial intelligence will be able to outperform human intelli-
gence. Results from a survey of 352 AI researchers indicate that there is a 50 percent
chance that algorithms might reach that state by 2060 (Grace et al., 2017).
In the following, we will have a closer look at the underlying concepts of weak and strong
artificial intelligence.
The term artificial narrow intelligence (ANI), or weak AI, describes the current and foreseeable state of artificial intelligence. Systems
based on ANI can already solve complex problems or tasks faster than humans. However,
the capabilities of those systems are limited to the use cases for which they have been
designed. In contrast to the human brain, narrow systems cannot generalize from a spe-
cific task to a task from another domain.
For example, a particular device or system which can play chess will probably not be able to play another strategy game, like Go or Shogi, without being explicitly programmed to learn that game. Voice assistants such as Siri or Alexa can be seen as a sort of hybrid intelligence, combining several weak AIs. Those tools are able to translate natural language and to analyze those words with their databases in order to complete different tasks. However, they are only able to solve a limited number of problems for which their algorithms are suitable and for which they have been trained. For instance, currently, they would not be able to analyze pictures or optimize traffic.
In short, ANI includes the display of intelligence with regard to complex problem solving
and the display of intelligence relative to one single task.
The reference point against which AGI is measured and judged is the versatile cognitive ability of humans. The goal of AGI is not only to imitate the interpretation of sensory
input, but also to emulate the whole spectrum of human cognitive abilities. This includes
all abilities currently represented by ANI, as well as the ability of domain-independent
generalization. This means knowledge of one task can be applied to another in a different
domain. This might also include motivation and volition. Some philosophical sources go
one step further and require AGI to have some sort of consciousness or self-awareness
(Searle, 1980). Developing an AGI would require the following system capabilities:
Considering the current state of AGI, it is difficult to imagine developing a system that meets these requirements. Beyond these two types of AI, there is also the concept of superintelligence. This concept goes even further than current conceptions and describes the idea that an intelligent system can reach a level of cognition that goes beyond human capabilities, for example through a recursive cycle of self-improvement. However, this level of AI is above AGI and still very abstract.
The growing interest is also corroborated by an increase in research activities. According
to the annual AI Index (Zhang et al., 2021), from 2019 to 2020, the number of journal publi-
cations on AI grew by 34.5 percent. Since 2010, the number of AI papers has increased more than twenty-fold.
The most popular research topics have been natural language processing and computer
vision which are important for various areas of application.
In a global survey about the state of AI, McKinsey & Company (2021) identified the following industries as the main fields of AI adoption: High Tech/Telecom, Automotive and Assembly, Financial Services, Business, Legal and Professional Services, Healthcare/Pharma, and Consumer Goods/Retail. In the following section, we will have a closer look at these fields.
AI adoption: The use of AI capabilities, such as machine learning, in at least one business function is called AI adoption.
The figure above summarizes the most important domains in which AI is used.
Due to the constant increase of global network traffic and network equipment, there has
been a rapid growth of AI in telecommunication. In this area, AI can not only be used to
optimize and automate networks but also to ensure that the networks are healthy and secure.
Used for predictive maintenance, AI can help fix network issues even before they occur.
Moreover, network anomalies can be accurately predicted when using self-optimizing net-
works.
Big data makes it possible to easily detect network anomalies and therefore prevent frau-
dulent behavior within them.
Automotive and Assembly
In the past years, autonomous driving has become a huge research topic. It will drastically
transform the automotive industry in the next decades from a steel-driven to a software-
driven industry. Nowadays, cars are already equipped with many sensors to ensure the
driver’s safety, such as lane-keeping or emergency braking assistance.
Intelligent sensors can also detect technical problems with the car or risks related to the driver – such as fatigue or being under the influence of alcohol – and initiate appropriate
actions.
Like in high tech and telecommunication, in assembly processes, AI can be used for pre-
dictive maintenance and to fix inefficiencies in the assembly line. Moreover, using com-
puter vision, it is already possible to detect defects faster and more accurately than a
human.
Financial Services
Financial services offer numerous applications for artificial intelligence. Intelligent algo-
rithms enable financial institutions to detect and prevent fraudulent transactions and
money laundering much earlier than was previously possible. Computer vision algorithms
can be used to precisely identify counterfeit signatures by comparing them to scans of the
originals stored in a database.
Additionally, many banks and brokers already use robo-advising: based on a user's investment profile, accurate recommendations about future investments can be made
(D’Acunto et al., 2019). Portfolios can also be optimized based on AI applications.
Especially in industries where paperwork and repetitive tasks play an important role, AI
can help to make processes faster and more efficient.
Significant elements of routine workflows are currently being automated using robotic process automation (RPA), which can drastically reduce administrative costs. Systems in RPA do not necessarily have to be enabled with intelligent AI capabilities. However, methods such as natural language processing and computer vision can help enhance those processes with more intelligent business logic.
Robotic process automation: The automated execution of repetitive, manual, time-consuming, or error-prone tasks by software bots is described as robotic process automation.
The ongoing developments in big data technologies can help companies extract more information from their data. Predictive analytics can be used to identify current and future trends about the markets a company is in and react accordingly.
Another important use case is the reduction of risk and fraud, especially in legal, account-
ing, and consulting practices. Intelligent agents can help to identify potentially fraudulent
patterns, which will allow for earlier responses.
Healthcare and Pharma
In the last few years, healthcare and pharma have been the fastest growing area adopting
AI.
AI-based systems can help detect diseases based on the symptoms. For instance, recent
studies have been able to use AI–based systems to detect COVID–19 based on cough
recordings (Laguarta et al., 2020).
AI can offer many advantages not only in diagnostics. Intelligent agents can be used to
monitor patients according to their needs. Moreover, regarding medication, AI can help
find an optimal combination of prescriptions to avoid side effects.
Wearable devices – such as heart rate or body temperature trackers – can be used to con-
stantly observe the vital parameters of a person. Based on this data, an agent can give
advice about the wearer’s condition. Moreover, in case critical anomalies are detected, it is
possible to initiate an emergency call.
The consumer goods and retail industry focuses on predicting customer behavior. Web-
sites track how a customer’s profile changes based on their number of visits. This allows
for personal purchase predictions for each customer. This data can not only be used to
make personalized shopping recommendations but also to optimize the whole supply
chain and guide decisions about future research.
Market segmentation is, nowadays, no longer based on geographical regions such as province or country. Modern technologies make it possible to segment customers’ behavior on a street-by-street basis. This information can be used to fine-tune operations and decide whether
store locations should be kept or closed.
Evaluation of AI Systems
As the above-mentioned examples illustrate, the application areas for modern AI systems are almost unlimited. More and more companies manage to support their business models with AI or even create completely new ones. Therefore, it is important to carefully evaluate new systems. When evaluating AI systems, it is crucial that all data sets are independent from each other and follow a similar probability distribution.
To develop proper models for AI applications, the available data is split into three data
sets:
1. Training data set: As the name indicates, this data set is used to fit the parameters of
an algorithm during the training process.
2. Development set: This data set is often also referred to as a validation set. It is used to
evaluate the performance of the model developed using the training set and for fur-
ther optimization. It is important that the development set contains data that have
not been included in the training data.
3. Test set: Once the model is finalized using the training and the development set, the test set can be used for a final evaluation of the model. As with the development set, it is important that the data in the test set have not been used before. The test set is only used once to validate the model and to ensure that it is not overfitted.
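A minimal sketch of such a three-way split is shown below, assuming scikit-learn is available; the 60/20/20 ratio, the dummy data, and the random seed are illustrative choices rather than requirements from the text.

```python
# Illustrative 60/20/20 split into training, development (validation), and test sets.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(1000, 5)          # 1,000 samples with 5 features (dummy data)
y = np.random.randint(0, 2, 1000)    # binary labels, e.g., fraud / non-fraud

# First split off the training set (60%), then split the rest into dev and test (20% each).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_dev), len(X_test))   # -> 600 200 200
```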
When developing and tuning algorithms, metrics should be in place to evaluate how well they perform, both independently and compared to other systems. In a binary classification task,
accuracy, precision, recall, and F-score are metrics that are commonly used for this pur-
pose.
For example, fraud detection in financial services is a binary classification task: a financial transaction can either be categorized as fraud or not. Based on this, we will have four categories of classification results:
1. True positives (TP): identifies samples that were correctly classified as positive, i.e.
being fraudulent transactions
2. False positives (FP): all results that wrongly indicate a sample to be positive even
though it’s negative, i.e., a non-fraudulent transaction being categorized as fraud
3. True negatives (TN): marks classification results that were correctly classified as negative, i.e., non-fraudulent transactions that were also labeled as such
4. False negatives (FN): classification results that were wrongly classified as negative even though they should have been positive, i.e., fraudulent transactions that were classified as non-fraud
The classification results can be displayed in a confusion matrix, also known as error
matrix. This is shown in the table below.
Accuracy is an indicator for how many samples were classified correctly. It can be com-
puted as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
It measures which percentage of the total predictions was correct. Precision denotes the number of positive samples that were classified correctly in relation to all samples predicted in this class:
Precision = TP / (TP + FP)
Recall indicates how many of the positively detected samples were identified correctly in relation to the total number of samples that should have been identified as such:
Recall = TP / (TP + FN)
The F-score combines precision and recall into a single value:
F = 2 · (precision · recall) / (precision + recall)
In classification tasks with more than two classes, these metrics can be calculated for every class. In the end, the per-class values can be averaged to obtain one metric for all classes.
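The metrics above can be computed directly from the four counts of the confusion matrix. The counts in the following sketch are invented for a fraud-detection example; only the formulas come from the text.

```python
# Compute accuracy, precision, recall, and F-score from confusion-matrix counts.
# The counts below are invented for illustration (1,000 transactions in total).
TP, FP, TN, FN = 80, 20, 880, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_score   = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.960
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F-score:   {f_score:.3f}")    # 0.800
```

For more than two classes, the same quantities can be computed per class and then averaged, as described above.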
SUMMARY
There are two types of AI: narrow and general. Current AI systems all
belong to the category of ANI. ANI can solve complex problems faster
than humans. However, its capabilities are limited to the domain for
which it has been programmed. Even though the term ANI might suggest
a limitation, it is embedded in many areas of our lives. In contrast to
that, AGI (AI which has the cognitive abilities to transfer knowledge to
other areas of application) remains a theoretical construct, but is still an
important research topic.
The application areas for AI are almost unlimited. AI has had a signifi-
cant impact on today’s corporate landscape. Use cases, such as the opti-
mization of service operations, the enhancement of products based on
AI, and automation of manual processes, can help companies towards
optimizing their business functions. Those use cases stretch across a
wide range of industries, be it automotive and assembly, financial serv-
ices, healthcare and pharma, consumer goods, and many more.
UNIT 3
REINFORCEMENT LEARNING
STUDY GOALS
Introduction
Imagine you are lost in a labyrinth and have to find your way out. As you are there for the
first time, you do not know which way to choose to reach the door and leave. Moreover, there are dangerous fields in the labyrinth, and you should avoid stepping on them.
You will have four actions you can perform: move up, down, left, or right. As you do not
know the labyrinth, the only way to find your way out is to see what happens when you
perform random actions. During the learning process, you will find out that there are fields in the labyrinth that will reward you by letting you escape. However, there are also fields where you will receive a negative reward, as they are dangerous to step on. After some time, you will manage to find your way out without stepping on the dangerous fields, thanks to the experience you have gained walking around. This process of learning by reward is called reinforcement learning.
In this unit, you will learn more about the basic ideas of reinforcement learning and the
underlying principles. Moreover, you will get to know algorithms, such as Q-learning, that
can help you optimize the learning experience.
In supervised learning, a machine learns how to solve a problem based on a previously
labeled data set. Typical application areas for supervised learning are regression and clas-
sification problems such as credit risk estimation or spam detection. Training those kinds
of algorithms takes much effort because it requires a large amount of pre-labeled training
data.
Action (A): The set of all possible actions the agent can perform
Policy (π): The policy the agent applies to determine the next action based on the current state
Within the process of reinforcement learning, the agent starts in a certain state s_t ∈ S and applies an action a_t ∈ A(s_t) to the environment E, where A(s_t) is the set of actions available at state s_t. The environment reacts by returning a new state s_{t+1} and a reward r_{t+1} to the agent. In the next step, the agent will apply the next action a_{t+1} to the environment, which will again return a new state and a reward.
In the introductory example, you are acting as the agent in the labyrinth environment. The
actions you can perform are to move up, down, left, or right. After each move, you will
reach another state by moving to another field in the labyrinth. Each time you perform an
action, you will receive a reward from the environment. It will be positive if you reach the
door or negative if you step on a dangerous field. From your new position, the whole
learning cycle will start again. Your goal will be to maximize your reward. The process of
receiving a reward as a function of a state-action pair can be formalized as follows:
f(s_t, a_t) = r_{t+1}
The process of an action being selected from a given state, transitioning to a new state,
and receiving a reward happens repeatedly. For a sequence of discrete time steps
t = 0, 1, 2, …, starting at the state s_0 ∈ S, the agent-environment interaction will lead to a sequence:
s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, …
The goal of the agent is to maximize the reward it will receive during the learning process.
The cycle will continue until the agent ends in a terminal state. The total reward R after a
time T can be computed as the sum of rewards received at this point:
R_t = r_{t+1} + r_{t+2} + … + r_T
This reward is also referred to as the value V^π(s) in the state s using the strategy π. In our example, the maximum reward will be received once you reach the exit of the labyrinth. We will have a closer look at the value function in the next section.
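The interaction sequence s_0, a_0, r_1, s_1, … and the total reward R_t can be simulated with a few lines of Python. The 4x4 grid, the reward values, and the purely random policy below are illustrative assumptions modeled on the labyrinth example, not a prescribed environment.

```python
# Agent-environment loop with a random policy: take actions, observe states and
# rewards, and sum up the total reward. Grid layout and rewards are illustrative.
import random

GOAL, DANGER = (3, 3), (1, 2)              # exit of the labyrinth / dangerous field
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Environment: apply a move on a 4x4 grid and return (next_state, reward)."""
    row, col = state
    d_row, d_col = MOVES[action]
    nxt = (min(max(row + d_row, 0), 3), min(max(col + d_col, 0), 3))
    if nxt == GOAL:
        return nxt, 1.0                    # positive reward: the exit is reached
    if nxt == DANGER:
        return nxt, -1.0                   # negative reward: dangerous field
    return nxt, 0.0

state, total_reward = (0, 0), 0.0
while state != GOAL:                       # the episode ends in the terminal state
    action = random.choice(list(MOVES))    # random policy: pure exploration
    state, reward = step(state, action)
    total_reward += reward                 # R_t = r_{t+1} + r_{t+2} + ... + r_T
print("Total reward of the episode:", total_reward)
```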
3.2 Markov Decision Process and Value
Function
To be able to evaluate different paths in the labyrinth, we need a suitable approach to
compare interaction sequences. One method to formalize sequential decision-making is the Markov Decision Process (MDP). In the following, we will discuss how MDPs work.
MDPs are used to estimate the probability of a future event based on a sequence of possi-
ble events. If a present state holds all the relevant information about past actions, it is said
to have the “Markov property”. In reinforcement learning, the Markov property is critical
because all decisions and values are functions of the present state (Sutton & Barto, 2018),
i.e., decisions are made depending on the environment’s state.
When a task in reinforcement learning satisfies the Markov property, it can be modeled as
an MDP. The process representing the sequence of events in an MDP is called a Markov
chain. If the Markov property is satisfied, in every state of a Markov chain, the probability
that another state is reached depends solely on two factors: the transition probability of
reaching the next state and the present state. MDPs consist of the following components:
• States S
• Actions A
• Rewards for an action at a certain state: r_a = R(s, a, s′)
• Transition probabilities for the actions to move from one state to the next state: T_a(s, s′)
Because of the Markov property, the transition function depends only on the current state and action:
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)
The equation above states that the probability P of transitioning from state s_t to state s_{t+1} given an action a_t depends only on the current state s_t and action a_t and not on any previous states or actions.
The policy π gives the probability of selecting action a in state s: π(s, a) = P(a_t = a | s_t = s).
Using our labyrinth example, the position at which you stand offers no information about
the sequence of states you took to get there. However, your position in the labyrinth repre-
sents all the required information for the decision about your next state, which means it
has the Markov property.
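To make the components listed above concrete, the following sketch writes down a tiny, invented two-state MDP in Python. The state and action names, probabilities, and rewards are illustrative only, and the reward is simplified from R(s, a, s′) to R(s, a); the point is that the next-state distribution depends solely on the current state and action, i.e., the Markov property.

```python
# A hand-written two-state MDP. T[state][action] maps possible next states to their
# transition probabilities; R[state][action] is the immediate reward (simplified
# to R(s, a) here). All names and numbers are invented for illustration.
import random

T = {
    "corridor": {"move": {"corridor": 0.2, "exit": 0.8},
                 "wait": {"corridor": 1.0}},
    "exit":     {"wait": {"exit": 1.0}},
}
R = {"corridor": {"move": 0.0, "wait": 0.0}, "exit": {"wait": 1.0}}

def sample_next_state(state, action):
    """Markov property: the next-state distribution depends only on (state, action)."""
    next_states = list(T[state][action])
    probabilities = [T[state][action][s] for s in next_states]
    return random.choices(next_states, weights=probabilities, k=1)[0]

# Reward for the chosen action and a sampled next state ('exit' with probability 0.8).
print(R["corridor"]["move"], sample_next_state("corridor", "move"))
```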
The Value Function
Previously, we learned that the value of a state can be computed as the sum of rewards
received within the learning process. Additionally, a discount rate can be used to evaluate
the rewards of future actions at the present state. The discount rate indicates the likeli-
hood to reach a reward state in the future. This helps the agent select actions more pre-
cisely according to the expected reward. An action a_{t+1} will then be chosen to maximize the expected discounted return:
V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + … + γ^{T−1} r_T | s_t = s ] = E_π[ ∑_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]
where γ is the discount rate, with 0 ≤ γ ≤ 1, denoting the certainty of the expected return. A value of γ closer to 1 indicates a higher likelihood of future rewards. Especially in scenarios where the length of time the process will take is not known in advance, it is important to set γ < 1, as otherwise the value function will not converge.
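As a small illustration, the discounted return inside the expectation above can be computed for a concrete reward sequence; the rewards and the value of γ below are arbitrary example values.

```python
# Discounted return: r_{t+1} + gamma * r_{t+2} + gamma**2 * r_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each weighted by gamma raised to its delay."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: three moves with no reward, then reaching the exit (reward 1).
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))   # 0.9 ** 3 = 0.729
```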
The following figure illustrates which action the agent should optimally perform in the
respective states of the labyrinth to maximize the reward, i.e., trying to reach the exit and
avoid the dangerous field.
3.3 Temporal Difference and Q–Learning
So far, we have discussed model-based reinforcement learning. That means that an agent tries to understand the model of the environment. All decisions are based on a value function, which depends on the current state and the future state in which the agent will end up.
Q-Learning
One well-known algorithm based on temporal difference (TD) learning is Q-learning. After initialization, the agent will perform random actions, which are then evaluated. Based on the outcome of an action, the agent will adapt its behavior for the subsequent actions.
The goal of the Q-learning algorithm is to maximize the quality function Q(s, a), i.e., to maximize the cumulative reward while being in a given state s by predicting the best action a (van Otterlo & Wiering, 2012). During the learning process, Q(s, a) is iteratively updated using the Bellman equation:
Q(s, a) = r + γ max_{a′} Q(s′, a′)
Bellman equation: The Bellman equation computes the expected reward in an MDP of taking an action in a certain state. The reward is broken into the immediate and the total future expected reward.
All Q-values computed during the learning process are stored in the Q-matrix. In every iteration, the matrix is used to find the best possible action. When the agent has to perform a new action, it will look up the maximum Q-value of the state-action pair.
The Q-learning Algorithm
In the following, we will itemize the Q-learning algorithm. The algorithm consists of an ini-
tialization and an iteration phase. In the initialization phase, all values in the Q-table are
set to 0. In the iteration phase, the agent will perform the following steps:
1. Choose an action for the current state. In this phase there are two different strategies
that can be followed:
• Exploration: perform random actions in order to find out more information about
the environment
• Exploitation: perform actions based on the information which is already known
about the environment based on the Q-table. The goal is to maximize the return
2. Perform the chosen action
3. Evaluate the outcome and get the value of the reward. Based on the result the Q-table
will be updated.
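A minimal tabular Q-learning sketch of these steps is shown below. It uses a simplified one-dimensional "labyrinth" (a corridor with an exit and one dangerous field); the corridor layout, discount rate γ, exploration rate ε, and number of episodes are illustrative assumptions, and the update rule is the Bellman update given above.

```python
# Tabular Q-learning sketch for a 1-D corridor: states 0..4, the exit is state 4
# (reward +1) and state 0 is a dangerous field (reward -1). The layout, discount
# rate, exploration rate, and episode count are illustrative choices.
import random

N_STATES, EXIT, DANGER = 5, 4, 0
ACTIONS = [-1, +1]                         # move left / move right
GAMMA, EPSILON, EPISODES = 0.9, 0.2, 500

# Initialization phase: all values in the Q-table are set to 0.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment: move within the corridor and return (next_state, reward)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == EXIT else (-1.0 if nxt == DANGER else 0.0)
    return nxt, reward

for _ in range(EPISODES):
    state = 1                              # start next to the dangerous field
    while state != EXIT:                   # iterate until the terminal state
        # 1. Choose an action: exploration with probability EPSILON, else exploitation.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        # 2. Perform the chosen action.
        nxt, reward = step(state, action)
        # 3. Evaluate the outcome and update the Q-table with the Bellman update.
        Q[(state, action)] = reward + GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
        state = nxt

# After learning, the greedy policy points towards the exit (+1) in every
# non-terminal state; the terminal state keeps its initial values.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```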
SUMMARY
Reinforcement learning deals with finding the best strategy for how an
agent should behave in an environment to achieve a certain goal. The
learning process of that agent happens based on a reward system which
either rewards the agent for good decisions or punishes it for bad ones.
UNIT 4
NATURAL LANGUAGE PROCESSING
STUDY GOALS
Introduction
Natural language processing (NLP) is one of the major application domains in artificial
intelligence.
NLP can be divided into three subdomains: speech recognition, language understanding,
and language generation. Each will be addressed in the following sections. After an intro-
duction to NLP and its application areas, you will learn more about the basic NLP techni-
ques and how data vectorization works.
All these subdomains build on methods from artificial intelligence and form the basis for
the areas of application of NLP.
Historical Developments
Early NLP research dates back to the seventeenth century, when Descartes and Leibniz
conducted some early theoretical research about NLP (Schwartz, 2019). It became a tech-
nical discipline in the mid-1950s. The geopolitical tension between the former Soviet
Union and the United States led to an increased demand for English-Russian translators.
Therefore, attempts were made to outsource translation to machines. Even though the first
results were promising, machine translation turned out to be much more complex than
originally thought, especially as no significant progress could be seen. In 1964 the Auto-
matic Language Processing Advisory Committee classified the NLP technology as “hope-
less” and decided to temporarily stop the research funding in this area. This was seen as
the start of the NLP winter.
Almost 20 years after the NLP winter began, NLP started to regain interest. This was due to
the following three developments:
constellations. Moreover, the improved computing power made it possible to process much larger amounts of training data, which was now available because of the growing amount of electronic literature. This opened up great opportunities for the available algorithms to learn and improve.
One of the early pioneers in AI was the mathematician and computer scientist Alan Mathi-
son Turing. In his research, he formed the theoretical foundation of what became the
Turing test (Turing, 1950). In the test, a human test person uses a chat to interview two
chat partners: another human and a chatbot. Both try to convince the test person that
they are human. If the test person cannot identify which of their conversational partners is
human and which is the machine, the test is successfully passed. According to Turing, passing the test allows the assumption that the intellectual abilities of a computer are on the same level as those of the human brain.
The Turing test primarily addresses the natural language processing abilities of a machine.
Therefore, the Turing test has often been criticized as being too focused on functionality
and not on consciousness. One early attempt to pass the Turing test was made by Joseph Weizenbaum, who developed a computer program to simulate a conversation with a psy-
chotherapist (Weizenbaum, 1966). His computer program ELIZA was one of the first con-
versational AIs. To process the sentence entered by the user, ELIZA utilizes rule-based pat-
tern matching combined with a thesaurus. The publication received remarkable feedback from the community. Nevertheless, the simplicity of this approach was soon recognized and, in line with the community's expectations, ELIZA did not pass the Turing test.
In 2014, the chatbot “Eugene Goostman” was the first AI that seemed to have passed the Turing test. The chatbot pretended to be a 13-year-old boy from Ukraine who was not a native English speaker. This trick was used to explain why the bot did not know everything and sometimes made mistakes with the language. However, this trick was also the reason why the validity of the experiment was later questioned (Masnick, 2014).
Topic identification
As the name indicates, topic identification deals with the challenge of automatically finding the topics of a given text (May et al., 2015). This can be done either in a supervised or in an
unsupervised way. In supervised topic identification, a model can, for instance, be trained
on newspaper articles that have been labeled with topics, such as politics, sports, or cul-
ture. In an unsupervised setting, the topics are not known in advance. In this case, the
algorithm has to deal with topic modeling or topic discovery to find clusters with similar
topics.
Popular use cases for topic identification are, for instance, social media and brand moni-
toring, customer support, and market research. Topic identification can help find out what
people think about a brand or a product. Social media provides a tremendous amount of
text data that can be analyzed for these use cases. Customers can be grouped according
to their interests, and reactions to certain advertisements or marketing campaigns can be
easily analyzed. When it comes to market research, topic identification can help when
analyzing open answers in questionnaires. If those answers are pre-classified, it can
reduce the effort to analyze open answers.
Text summarization
are summarized to get a sentence rank. After sorting the sentences according to their rank,
it is easy to evaluate the importance of each one and create a summary from a predefined
number with the highest rank.
There are two major challenges when dealing with supervised extractive text summariza-
tion, as training requires a lot of hand-annotated text data. These are:
1. It is necessary that the annotations contain the words that have to be in the summary.
When humans summarize texts, they tend to do this in an abstract way. Therefore, it is
hard to find training data in the required format.
2. The decision about what information should be included in the summary is subjective
and depends on the focus of a task. While a product description would focus more on
the technical aspects of a text, a summary of the business value of a product will put
the emphasis on completely different aspects.
A typical use case for text summarization is presenting a user with a preview of the content of
search results or articles. This makes it easier to quickly analyze a huge amount of infor-
mation. Moreover, in question answering, text summarization techniques can be used to
help a user find answers to certain questions in a document.
Sentiment analysis
Sentiment analysis captures subjective aspects of texts (Nasukawa & Yi, 2003), such as
analyzing the author’s mood in a tweet on Twitter. Like topic identification, sentiment
analysis deals with text classification. The major difference between topic identification
and sentiment analysis is that topic identification focuses on objective aspects of the text
while sentiment analysis centers on subjective characteristics like moods and emotions.
The application areas for sentiment analysis are manifold. Customer sentiment analysis
has gained much traction as a research field lately. The ability to track customers’ senti-
ments over time can, for instance, give important insights about how customers react to
changes of a product/a service or how external factors like global crises influence custom-
ers’ perceptions. Social networks, such as Facebook, Instagram, and Twitter, provide huge
amounts of data about how customers feel about a product. Having a better understand-
ing of customers’ needs can help modify and optimize business processes accordingly.
Detecting emotions from user-generated content comes with some big challenges when
dealing with irony/sarcasm, negation, and multipolarity.
There is much sarcasm in user-generated content, especially in social media. Even for
humans, it can sometimes be hard to detect sarcasm, which makes it even more difficult
for a machine. Let us, for instance, look at the sentence
Only a few years back this would have been a straightforward sentence. Now, if said about
a modern smartphone, it is easy for a human to tell that this statement is sarcastic. While
there has been some recent success in sarcasm detection using methods from deep learn-
ing (Ghosh & Veale, 2016), dealing with sarcasm remains a challenging task.
Negation is another challenge when trying to detect a statement's sentiment. Negation can be explicit or implicit, and it can also be expressed through the morphology of a word, denoted by prefixes, such as “dis-” and “non-,” or suffixes, such as “-less”. Double negation is another language construct that can be easily misunderstood. While double negatives cancel each other out most of the time, in some contexts they can also intensify the negation.
Considering negation in the model used for sentiment analysis can help to significantly
increase the accuracy (Sharif et al., 2016).
Named entity recognition (NER) deals with the challenge of locating and classifying
named entities in an unstructured text. Those entities can then be assigned to categories
such as names, locations, time and date expressions, organizations, quantities, and many
more. NER plays an important role in understanding the content of a text. Especially for
text analysis and data organization, NER is a good starting point for further analysis. The
following figure shows an example of how entities can be identified from a sentence.
NER can be used in all domains where categorizing text can be advantageous. For
instance, tickets in customer support can be categorized according to their topics. Tickets
can then automatically be forwarded to a specialist. Also, if data has to be anonymized
due to privacy regulations, NER can help to save costs. It can identify personal data and
automatically remove it. Depending on the quality of the underlying data, manual cleanup
is no longer necessary. Another use case is to extract information from candidate resumes
in the application process. It can significantly decrease the workload of the HR depart-
ment, especially when there are many applicants (Zimmermann et al., 2016).
The biggest challenge in NER is that training a model requires a large amount of annotated data. The resulting model will always be focused on the specific task and the specific subset of entities on which it has been trained.
Translation
Machine translation (MT) is a subfield of NLP that combines several disciplines. Using
methods from artificial intelligence, computer science, information theory, and statistics,
in MT, text or speech is automatically translated from one language to another.
In the last decades, the quality of MT has significantly improved. In most cases, the quality
of machine translations is still not as good as those done by professional translators. How-
ever, combining MT and manual post-processing is nowadays often faster than translating
everything manually. Like in any other area of NLP, the output quality depends signifi-
cantly on the quality of the training data. Therefore, often domain-specific data is used.
While in the past the most commonly used method was statistical machine translation (SMT), neural machine translation (NMT) has become more popular (Koehn & Knowles, 2017).

Statistical machine translation: In statistical machine translation, translations are generated using statistical models that were built based on the analysis of bilingual text corpora.

Neural machine translation: In neural machine translation, an artificial neural network is used to learn a statistical model for MT.

MT can be used for text-to-text translations as well as speech-to-speech translations. Using MT for text can help quickly translate text documents or websites, assist professional translators in accelerating the translation process, or serve as part of a speech-to-speech translation system. As globalization progresses, MT becomes more important every day. In 2016, Google was translating over 100 billion words per day in more than 100 languages (Turovsky, 2016). The following figure shows how text-to-text translation is interlinked with speech-to-speech translation.
The quality of the output depends not only on the quality of the MT component, but also on the quality of the ASR and TTS components, which makes speech-to-speech translation challenging.
Nowadays, the two biggest challenges in MT are domain mismatch and under-resourced
languages. Domain mismatch means that words and sentences can have different transla-
tions based on the domain. Thus, it is important to use domain adaptation when developing
an MT system for a special use case.
For some combinations of languages in MT, there are no bilingual text corpora available
for source and target language. One approach to solving the problem of under-resourced
languages is to use pivot MT. In pivot MT, the source and target language are bridged using
a third language (Kim et al., 2019). When, for instance, translating from Khmer (Cambodia)
to Zulu (South Africa), a text will first be translated from Khmer to English and afterwards
from English to Zulu.
Chatbots
Chatbots are text-based dialogue systems. They allow interaction with a computer based
on text in natural language. Based on the input, the system will reply in natural language.
Sometimes, chatbots are used in combination with an avatar simulating a character or
personality. One of the most popular chatbots was ELIZA, which imitated a psychotherapist. Chatbots are often used in messenger apps, such as Facebook, or in website chats. Moreover, they form the basis for digital assistants like Alexa, Siri, or Google Assistant. According to their capabilities, chatbots can be grouped into three levels:
• Notification assistants (level 1): These chatbots only interact unidirectionally with the
user. They can be used for notifications about events or updates (i.e., push notifica-
tions).
• Frequently asked questions assistants (level 2): These bots can interact bidirectionally
with a user. They can interpret the user’s query and find an appropriate answer in a
knowledge base.
• Contextual assistants (level 3): These chatbots can not only interact bidirectionally, but
also be context-aware based on the conversation history.
In the future, it is likely that further levels of chatbots will evolve. A chatbot is based on
three components:
1. Natural language understanding (NLU): This component parses the input text and
identifies the intent and the entities of the user (user information).
2. Dialogue management component: The goal of this component is to interpret the
intent and entities identified by the NLU in context with the conversation and decide
the reaction of the bot.
3. Message generator component: Based on the output of the other components, the
task of this component is to generate the answer of the chatbot by either filling a pre-
defined template or by generating a free text.
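The interplay of these three components can be illustrated with a toy example. The following sketch is not from the course book; the intents, keywords, and answer templates are invented purely for illustration.

# Toy chatbot illustrating the three components (invented example).
INTENT_KEYWORDS = {"greeting": ["hello", "hi"],
                   "opening_hours": ["open", "hours"]}
TEMPLATES = {"greeting": "Hello! How can I help you?",
             "opening_hours": "We are open from 9 am to 5 pm.",
             "fallback": "Sorry, I did not understand that."}

def nlu(text):
    # Natural language understanding: map the input text to an intent.
    words = [w.strip("?!.,") for w in text.lower().split()]
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return intent
    return None

def dialogue_manager(intent, history):
    # Decide how the bot reacts, taking the conversation history into account.
    history.append(intent)
    return intent if intent is not None else "fallback"

def message_generator(action):
    # Generate the answer by filling a predefined template.
    return TEMPLATES[action]

history = []
print(message_generator(dialogue_manager(nlu("Hi there!"), history)))
print(message_generator(dialogue_manager(nlu("When are you open?"), history)))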
Chatbots can save a lot of time and money. Therefore, use cases are increasing continu-
ously. They are normally available 24/7 at comparatively low costs and can easily be
scaled if necessary. In customer service, they can not only reply to customers’ requests,
but also give product recommendations or make travel arrangements, such as hotel or
flight reservations. If a request is too complicated for a bot, there are usually interfaces to
forward requests to a human support team.
Rule-Based Techniques
Rule-based techniques for NLP use a set of predefined rules to tackle a given problem.
Those rules try to reproduce the way humans build sentences. A simple example for a
rule-based system is the extraction of single words from a text based on the very simple
rule “divide the text at every blank space”. Looking at terms like “New York” already illus-
trates how fast a seemingly simple problem can get complicated. Therefore, more com-
plex systems are based on linguistic structures using formal grammars.
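As a quick illustration of such a rule and its limits, the following snippet (the sentence is an invented example) applies the blank-space rule:

# Rule-based word extraction: "divide the text at every blank space".
text = "I love New York in the summer"
tokens = text.split(" ")
print(tokens)   # ['I', 'love', 'New', 'York', 'in', 'the', 'summer']
# The rule splits "New York" into two tokens, although it denotes one entity.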
The rule-based approach implies that, to build a system, humans have to be involved in
the process. Because of this, one of the major advantages of rule-based systems is the
explainability: As the rules have been designed by humans, it is easy to understand how a
task has been processed and to locate errors.
The major drawback of the rule-based approach is that it requires experts to build a set of
appropriate rules. Moreover, rule-based systems are normally built in a domain-specific
way, which makes it difficult to use a system in a domain for which it was not designed.
Statistical-Based Techniques
Since computational power has increased in the past decades, systems based on statisti-
cal methods – which are often subsumed under the term machine learning – have
replaced most of the early rule-based systems. Those methods follow a data-driven
approach. The models generated by statistical-based methods are trained with a huge
amount of training data to derive the rules of a given task. After that, the models can be
used to classify a set of unknown data to make predictions. In contrast to rule-based sys-
tems, statistical-based systems do not require expert knowledge about the domain. They
can easily be developed based on existing methods and improved by providing appropri-
ate data. Transferring the model to another domain is also much easier than for rule-based systems.
Tasks
In general, NLP tasks can be divided into four categories: syntax, semantics, discourse,
and speech. In the following, we will give an overview of those tasks.
Syntax
Syntactical tasks in NLP deal with the features of language such as categories, word boun-
daries, and grammatical functions. Typical tasks dealing with the syntax are tokenization
and part-of-speech (POS) tagging.
The goal of tokenization is to split a text into individual units such as words, sentences, or
sub-word units. For instance, the sentence “I enjoy studying artificial intelligence.” could
be tokenized into “I” “enjoy” “studying” “artificial” “intelligence” “.” .
POS tagging – also called grammatical tagging – goes one step further and adds grammat-
ical word functions and categories to the text. The following example illustrates how a
sentence can be analyzed using POS tagging.
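Tokenization and POS tagging can be tried out with standard NLP libraries. The sketch below uses NLTK as one possible tool; this is an assumption, as the course book does not prescribe a specific library, and the required resource names can differ between NLTK versions.

import nltk

# Tokenization and POS tagging with NLTK (illustrative; resource names such as
# "punkt" and "averaged_perceptron_tagger" may vary between NLTK versions).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "I enjoy studying artificial intelligence."
tokens = nltk.word_tokenize(sentence)
print(tokens)                 # ['I', 'enjoy', 'studying', 'artificial', 'intelligence', '.']
print(nltk.pos_tag(tokens))   # e.g., [('I', 'PRP'), ('enjoy', 'VBP'), ...]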
Syntactic ambiguity, i.e., words that cannot be clearly assigned to a category, is a big challenge in NLP.
which can be interpreted in many different ways. Two of the possible interpretations are
In the first interpretation, “like” is a comparative preposition, while in the second interpre-
tation it is a verb.
Semantics
The focus of semantic tasks is on the meaning of words and sentences. Understanding the
meaning of a text is essential for most application areas of NLP. In sentiment analysis, sub-
jective aspects of the language are analyzed. For instance, when analyzing posts on social
media, it is important to understand what the text means to identify whether it is positive
or negative. Named entity recognition (NER) is another research field where semantics are
important for correct classification results. Identifying entities, such as names, locations,
or dates, from a given text cannot be done without understanding the semantics of a text.
In topic identification, a given text is labeled with a topic. Therefore, it is important to
understand what the text is about. For instance, newspaper articles could be labeled with
topics such as “politics”, “culture”, or “weather”.
If NLP is used for answering questions, a computer needs to create an appropriate answer
to a certain question. Assuming a question answering algorithm was trained on this course
book, the algorithm might display this section when asked “What are the typical tasks in
NLP?” For this purpose, the semantics of this section must be interpreted correctly. Also,
in machine translation, understanding the correct meaning of a text is essential. Other-
wise, the translation will yield results that are hard to understand or even wrong.
The figure above illustrates how important it is to properly understand the semantics of a
text.
Discourse
Discourse deals with text that is longer than a single sentence. It is important for tasks,
such as topic identification and text summarization, where an algorithm produces a sum-
mary of a given text by extracting the most important sentences. Analyzing the discourse of a text involves several sub-tasks, like identifying the topic structure, analyzing coreferences, and analyzing the conversation structure.
Speech
Speech tasks in NLP are all about spoken language. Two sub-tasks can be distinguished: speech recognition, which converts spoken language into text, and speech synthesis, which generates spoken language from text. Both are important for conversational interfaces, such as voice assistants like Siri or Alexa. The following figure summarizes the typical tasks in NLP.
4.3 Vectorizing Data
In machine learning, algorithms only accept numeric input. Therefore, if we want to extract information from an unstructured text, it first has to be converted into a numerical format that the computer can process.
In the following, we want to introduce two approaches to how words can be embedded
into a semantic vector space: the bag-of-words approach, which is simple, and the more
powerful concept of neural word and sentence vectors.
Bag-of-Words
One of the easiest approaches to convert textual information into numbers is the bag-of-
words (BoW) model. Using BoW a text is represented by a vector that describes the num-
ber of word occurrences in a given text document. The term “bag” refers to the fact that
once the words are put into the unique set of words describing a text, all information
about the structure or order of the words in a text is discarded. To understand the BoW
approach, we will use the following example text:
In the first step, we need to identify all unique words from the text. For this purpose, we
use tokenization.
In the next step, we need to score the words in every sentence. As we know that our
vocabulary consists of 8 words, the resulting vector will have a length of 8. The BoW vec-
tors for the sentences above will look as follows:
• [1, 1, 1, 0, 0, 0, 0, 0]
• [1, 0, 0, 1, 1, 1, 1, 0]
• [0, 0, 1, 0, 1, 1, 1, 1]
There are different methods to score the words in the BoW model. In the above sentences, every word occurred at most once per sentence; therefore, the resulting vectors are a binary representation of the text. If the whole text from the above example were summarized in one vector using word counts, the result would look as follows:
[2,1,2,1,2,2,2,1].
As you will notice, this representation no longer contains any information about the origi-
nal order of the words.
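A bag-of-words representation can be built in a few lines of code. The following sketch is illustrative only; because the original example text is shown in a figure, two stand-in sentences are used instead.

# Minimal bag-of-words sketch (the sentences are stand-in examples).
sentences = ["time flies like an arrow", "fruit flies like a banana"]

# Step 1: build the vocabulary of all unique words (here via simple tokenization).
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})

# Step 2: score each sentence by counting word occurrences per vocabulary entry.
def bow_vector(sentence, vocabulary):
    words = sentence.split()
    return [words.count(term) for term in vocabulary]

for sentence in sentences:
    print(bow_vector(sentence, vocabulary))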
Limitations of Bag-of-Words
Taken together, the BoW model is simple, but this simplicity comes with some major disadvantages:
• Selection of vocabulary: The vocabulary of the model has to be selected very carefully.
The balance between the size of the model and sparsity must always be kept in mind.
The larger the vocabulary, the higher will be the sparsity of the vectors.
• Risk of high sparsity: For computational reasons, it is more difficult to model sparse rep-
resentation of data, as the complexity of time and space will increase with higher spar-
sity. Moreover, it is more difficult to make use of the data if only a little information is
contained in a large representational space.
• Loss of meaning: Using BoW, neither word order nor context nor sense are considered.
In our example, the different meanings of “like” (once being used as a preposition and
once as a verb) get completely lost. In situations like that, the BoW model does not per-
form well.
Word Vectors
To be able to embed words in a semantic vector space they can be represented as word
vectors. Linear operations can be applied to find word analogies and similarities. These
word similarities can, for instance, be based on the cosine similarity. Most importantly,
once words are transformed into word vectors, they can be used as an input for machine
learning models, like artificial neural networks and linear classifiers. In the following,
three vectorization methods will be presented: Word2Vec, TF-IDF, and GloVe.
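As a brief illustration of the cosine similarity mentioned above, the following sketch computes the similarity of two word vectors. The vectors are invented; in practice they would come from a trained embedding model.

import math

# Cosine similarity between two word vectors (the values are invented).
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_king = [0.8, 0.3, 0.1]
vec_queen = [0.7, 0.4, 0.2]
print(cosine_similarity(vec_king, vec_queen))   # close to 1 means similar words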
Word2Vec
The Word2Vec model is based on a simple neural network. The neural network generates
word embeddings based on only one hidden layer (Mikolov et al., 2013). A research mile-
stone was passed when Google Research published the model in 2013. The input layer of
the neural network expects a “one-hot vector.” The one-hot vector is a BoW vector for one
single word. This means that all indices of that vector are set to 0 except for the index of
the word, which is analyzed. This index is set to 1.
Training the neural network for Word2Vec requires a large text corpus. This could, for
instance, be a Wikipedia dump. When the training is performed, a fixed-length word win-
dow with length N is slid over the corpus. Typical values for N would, for example, be
N = 5 or N = 10.
Two different model architectures can be used for training:
1. Continuous Bag-of-Words (CBOW): This model can be used if the goal is to predict one
missing word in a fixed window in the context of the other N − 1 words. As an input
vector, we can either use the average or the sum of the one-hot vectors.
2. Skip-gram: If we have one word within a fixed window, with this model we can predict
the remaining N − 1 context words.
One important aspect of CBOW is that the prediction outcome is not influenced by the
order of the context words. In skip-gram, nearby context words are weighted more heavily
than more distant context words. While CBOW generally performs faster, the skip-gram
architecture is better suited for infrequent words.
When training Word2Vec, the goal is to maximize the probabilities for those words that
appeared in the fixed window of the analyzed sample from the data corpus used for train-
ing. The function we receive from this process is the objective function.
In an NLP task, the goal is usually not to find a model to predict the next word based on a
given text snippet, but to analyze the syntax and semantics of a given word or text. If we
remove the output layer of the model we generated before and look at the hidden layer
instead, we can extract the output vector from this layer. Neural networks usually develop
a strong abstraction and generalization ability on their last layers. It is, therefore, possible
to use the output vector of the hidden layer as an abstract representation of the features
from the input data. Thus, we can use it as an embedding vector for the word we want to
analyze.
Nowadays, there are many pre-trained Word2Vec models available for various languages
that can easily be adapted for specific NLP tasks.
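In practice, Word2Vec embeddings are usually trained or loaded with an existing library. The following sketch uses Gensim as one possible tool; this is an assumption (the course book does not name a library), parameter names can differ between Gensim versions, and the tiny corpus is far too small to yield meaningful embeddings.

from gensim.models import Word2Vec

# Illustrative only: a real corpus (e.g., a Wikipedia dump) would be required.
corpus = [["i", "enjoy", "studying", "artificial", "intelligence"],
          ["machine", "translation", "is", "a", "subfield", "of", "nlp"]]

# sg=1 selects the skip-gram architecture, sg=0 the CBOW architecture;
# window corresponds to the word window N described above.
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)

vector = model.wv["intelligence"]   # the embedding vector of a single word
print(vector.shape)                 # (50,)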
TF-IDF

In BoW, the word counts only reflect which words are contained in the document. Therefore, all words are given the same weight when analyzing a text, no matter their importance. Term frequency-inverse document frequency (TF-IDF) is a statis-
tical measure from information retrieval that tackles this problem and is one of the most
commonly used weighting schemes in information retrieval (Beel et al., 2016). In TF-IDF
the term frequency (TF) is combined with the inverse document frequency (IDF). The rele-
vance of a word increases with its frequency in a certain text but is compensated by the
word frequency in the whole data set. For the computation of TF-IDF we need the follow-
ing parameters:
• Term frequency (TF) reflects how often a term t occurs in a document d. The word order is not relevant in this case. The number of occurrences is weighted by the total number of terms in the document:

TF(t, d) = (number of occurrences of t in d) / (total number of terms in d)

• Inverse document frequency (IDF) tests the relevance of a particular term. As the name suggests, it is the inverse of the document frequency DF(t, D), logarithmically scaled:

IDF(t, D) = log(1 / DF(t, D))

Both components are multiplied to obtain the final score:

TF-IDF(t, D) = TF(t, d) · IDF(t, D)
High values of TF-IDF indicate words that occur often in a document while the number of
documents that contain the respective term is small compared to the total amount of
documents. Therefore, TF-IDF can help find terms in a document that are most important
in a text.
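The following sketch computes TF-IDF scores directly from the formulas above; the three mini-documents are invented, and the document frequency DF is taken as the fraction of documents that contain the term.

import math

# TF-IDF following the formulas above (the documents are invented examples).
documents = [["the", "cat", "sat"],
             ["the", "dog", "sat"],
             ["the", "cat", "ran", "away"]]

def tf(term, doc):
    # Occurrences of the term weighted by the total number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # DF is the fraction of documents containing the term; IDF is its log-scaled inverse.
    df = sum(1 for doc in docs if term in doc) / len(docs)
    return math.log(1 / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", documents[0], documents))   # 0.0 -> occurs in every document
print(tf_idf("cat", documents[0], documents))   # > 0 -> more distinctive term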
GloVe
Global Vectors for word representation (GloVe) is another vectorization method commonly
used in NLP. While Word2Vec is a predictive model, GloVe is an unsupervised approach
based on the counts of words. It was developed because Pennington et al. (2014) con-
cluded that the skip-gram approach in Word2Vec does not fully consider the statistical
information when it comes to word co-occurrences. Therefore, they combined the skip-gram approach with the benefits of matrix factorization. The GloVe model uses a co-occurrence matrix, which contains information about the word context. The developed model has been shown to outperform related models, especially for named entity recognition and similarity tasks (Pennington et al., 2014).

Matrix factorization: This is used to reduce a matrix into its components to simplify complex matrix operations.
Sentence Vectors
So far, we have learned how to represent words as vectors. However, various NLP tasks,
like question answering or sentiment analysis, require not only the analysis of a single
word but of a whole sentence or paragraph. Therefore, we also need a way to encode
a sequence of words to be able to process it with a learning algorithm.
One approach is to build an average of the vectors of a sentence from Word2Vec and use
the resulting vectors as input for a model. However, this method would come with the dis-
advantage that the word order is no longer included in the word encodings. For instance,
the sentences “I put the coffee in the cup” and “I put the cup in the coffee” contain the
same words. Only the word order makes the difference in the sentence.
To tackle the problem of dealing with text snippets of various lengths, there exist several
approaches. In the following sections, we will present a selection of those algorithms.
Please note that in the following the term “sentence” will also be used to represent a
whole paragraph of text, not only as a sentence in a strict grammatical way.
Skip-thought
In the skip-thought vectors approach (Kiros et al., 2015) the concept of the skip-gram
architecture we introduced previously in the section about the Word2Vec approach is
transferred to the level of sentences.
Like Word2Vec, skip-thought requires a large text corpus to train the model. In contrast to
Word2Vec, instead of using a sliding word window, skip-thought analyzes a triple of
consecutive sentences. The resulting model is a typical example of an encoder-decoder
architecture. The middle sentence from the triple is used as an input for the encoder. The
encoder produces an output, which is connected to the decoder. There are two ways to
optimize the model: the decoder can either be used to predict the following or the previ-
ous sentence of the sentence the encoder received.
There are some NLP tasks that do not require a prediction model. For those tasks, the
decoder part is no longer needed after the training and can be discarded. To get the vector
representation of the sentence, we can use the output vector of the encoder.
In case we use the model to only predict the following or the previous sentence, the result
is a uni-skip vector. When concatenating two uni-skip vectors so that one predicts the pre-
vious and the other predicts the next sentence, the result is called a bi-skip vector. If n-
dimensional uni-skip vectors are combined with n-dimensional bi-skip vectors, the result
will be a 2n-dimensional combine-skip vector. In a comparison of several skip-thought models, the combine-skip model has proven to perform slightly better.
There is a pre-trained English language model available to the public based on the Book-
Corpus dataset.
Universal sentence encoder

The universal sentence encoder (USE) is a family of models for sentence embedding that was developed by Google Research (Cer et al., 2018). There are two architecture variants of the USE. One variant uses a deep averaging network (DAN) (Iyyer et al., 2015) and is faster but less accurate, while the other variant utilizes a transformer model.
Also for USE, there are pre-trained models available to the public: one English model and
one multilingual model (Chidambaram et al., 2019). These models are both based on the
DAN architecture.
BERT

As the name indicates, BERT (Devlin et al., 2018) is based on the transformer architecture. Like USE, this model was introduced by Google Research. The language model is
available open-source and has been pre-trained on a large text corpus in two combined
and unsupervised ways: masked language model and next sentence prediction.
With the masked language model, a sentence is taken from the training set. In the next
step, about 15 percent of the words in that sentence are masked. For example, in the sen-
tence
the words “drink” and “milk” have been masked. The model is then trained to predict the
missing words in the sentence. The focus of the model is to understand the context of the
words. The processing of the text data is no longer done in a unidirectional way from either left to right or right to left.
Using next sentence prediction as a training method, the model receives a pair of two sen-
tences. The model's goal is to predict if the first sentence is followed by the second sen-
tence. Therefore, the resulting model focuses mainly on how a pair of sentences are
related.
Both models were trained together to minimize the combined loss function of the two
strategies.
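The masked language model behavior can be tried out with a pre-trained BERT model, for example via the Hugging Face transformers library. This is an assumption for illustration only; the course book does not name a specific tool, the example sentence is invented, and the exact predictions depend on the model and library version.

from transformers import pipeline

# Illustrative use of a pre-trained BERT model for masked word prediction.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The sentence is an invented example; [MASK] marks the word to be predicted.
for prediction in unmasker("I like to [MASK] a glass of milk."):
    print(prediction["token_str"], prediction["score"])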
SUMMARY
The use of NLP in computer science dates back to the 1950s. There is a
wide range of application areas for NLP, which include topics such as
question answering, sentiment analysis, named entity recognition, and
topic identification.
To be able to process language with computers, vectorization techni-
ques such as Bag-of-Words, word vectors, and sentence vectors are
used. However, these models come with some limitations. For example,
if Bag-of-Words is used, we lose all information about word order. There-
fore, this model can only be used if the word order is not crucial. More-
over, some models, including BERT, have limitations towards the input
length of a text (e.g., 256-word tokens). A larger paragraph of text can
only be embedded using tricks, like segmenting it into smaller parts.
Nevertheless, there has been huge progress in NLP in the past years as
computational power has been increasing drastically and larger data
corpora have become available to train language models.
UNIT 5
COMPUTER VISION
STUDY GOALS
Introduction
This unit will discuss the basic principles of computer vision. It starts with a definition of
the topic, the historical background, and an overview of the most important computer
vision tasks. After that, you will learn how an image can be represented as an array of pix-
els and how images can be modified using filters. We will illustrate how to detect features
in images, such as edges, corners, and blobs. This knowledge will be used to illustrate how
you can use calibration and deal with distortion.
Moreover, this unit addresses the topic of semantic segmentation, which can be used to
classify pixels into categories.
Historical Developments
Research in computer vision began in the 1960s at some of the pioneering universities for
robotics and AI, such as Stanford University, the Massachusetts Institute of Technology,
and Carnegie Mellon University. The goal of that early research was to mimic the visual
system of humans (Szeliski, 2022). Researchers tried to make robots more intelligent by
automating the process of image analysis using a camera attached to the computer. The
big difference between digital image processing and computer vision at that time was that
researchers tried to reconstruct the 3D structure from the real world to gain a better
understanding of the scene (Szeliski, 2022).
Early foundations of algorithms, such as line labeling, edge extraction, object representa-
tion and motion estimation date back to the 1970s (Szeliski, 2022). In the 1980s, there was
a shift of focus towards the quantitative aspects of computer vision and mathematical
analysis. Concepts, such as inference of shape from characteristics like texture, shading, contour models, and focus, evolved. In the 1990s, methods from photogrammetry were used to develop algorithms for sparse 3D reconstructions of scenes based on multiple images. The results led to a better understanding of camera calibration. Statistical methods, in particular eigenfaces, were used for facial recognition from pictures. Due to an increasing interaction between computer vision and computer graphics, there has been a significant change in methods like morphing, image-based modeling and rendering, image stitching, light-field rendering, and interpolation of views (Szeliski, 2022).

Photogrammetry: A group of contactless methods to derive the position and shape of physical objects directly from photographic images is called photogrammetry.
Typical Tasks
There are four major categories in computer vision: recognition tasks, motion analysis,
image restoration and geometry reconstruction. The following figure illustrates those
tasks.
Recognition tasks
There are different types of recognition tasks in computer vision. Typical tasks involve the
detection of objects, persons, poses, or images. Object recognition deals with the estima-
tion of different classes of objects that are contained in an image (Zou et al., 2019). For
instance, a very basic classifier could be used to detect whether there is a hazardous mate-
rial label on an image or not. A more specific classifier could additionally recognize information about the label type, such as “flammable” or “poison.” Object recognition
is also important in the area of autonomous driving to detect other vehicles or pedes-
trians.
In object identification tasks, objects or persons that are in an image are identified using
unique features (Barik & Mondal, 2010). For person identification, for example, a computer
vision system can use characteristics, such as fingerprint, face or handwriting. Facial rec-
ognition, for instance, uses biometric features from an image and compares them to the
biometric features of other images from a given database. Person identification is com-
monly used to verify the identity of a person for access control.
Pose estimation tasks play an important role in autonomous driving. The goal is to esti-
mate the orientation and/or position of a given object relative to the camera (Chen et al.,
2020). This can, for instance, be the distance to another vehicle ahead or an obstacle on
the road.
In classical odometry, motion sensors are used to estimate the change of the position of an object over time. Visual odometry, on the other hand, analyzes a sequence of images to gather information about the position and orientation of the camera (Aqel et al., 2016).
Autonomous cleaning bots can, for instance, use this information to estimate the location
in a specific room.
In tracking tasks, an object is located and followed in successive frames. A frame can be
defined as a single image in a longer sequence of images, such as videos or animations
(Yilmaz et al., 2006). This can, for instance, be the tracking of people, vehicles, or animals.
Image restoration deals with the process of recovering a blurry or noisy image to an image
of better and clearer quality. This can, for instance, be old photographs, but also movies
that were damaged over time. To recover the image quality, filters like median or low-pass filters can remove the noise (Dhruv et al., 2017). Nowadays, methods from image restoration can also be used to restore missing or damaged parts of an artwork.

Noise: In computer vision, noise refers to a quality loss of an image which is caused by a disturbed signal.

Geometry reconstruction tasks
In computer vision, there are five major challenges that must be tackled (Szeliski, 2022):
• The illumination of an object is very important. If lighting conditions change, this can
yield different results in the recognition process. For instance, red can easily be
detected as orange if the environment is bright.
• Differentiating similar objects can also be difficult in recognition tasks. If a system is
trained to recognize a ball it might also try to identify an egg as a ball.
• The size and aspect ratios of objects in images or videos pose another challenge in com-
puter vision. In an image, objects that are further away will appear to be smaller than
closer objects even if they are the same size.
• Algorithms must be able to deal with the rotation of an object. If we look, for instance, at a pencil on a table, it can either look like a line when viewed from the top or like a circle when viewed from a different perspective.
• The location of objects can vary. In computer vision, this effect is called translation. Going back to our example of the pencil, it should not make a difference to the algorithm whether the pencil is located in the center of a sheet of paper or next to it.
Because of these challenges, there is much research towards algorithms that are scale-, rotation-, and/or translation-invariant (Szeliski, 2022).
Pixels
Images are constructed as a two-dimensional pixel array (Lyra et al., 2011). A pixel is the
smallest unit of a picture. The word originates from the two terms “pictures” (pix) and
“element”(el) (Lyon, 2006). A pixel is normally represented as a single square with one
color. It becomes visible when zooming deep into a digital image. You can see an example
of the pixels of an image in the figure below.
The resolution of an image specifies the number of pixels it contains. The higher the resolution, the more detail the image can show. Conversely, if the resolution is low, the picture might look fuzzy or blurry.
Color representations
There are various ways to represent the color of a pixel as a numerical value. The easiest
way is to use monochrome pictures. In this case, the color of a pixel will be represented by
a single bit, being 0 or 1. In a true color image, a pixel will be represented by 24 bits.
The following table shows the most important color representations with the corresponding number of available colors (color depth).
One way to represent colors is the RGB color representation. We illustrate this using the
24-bit color representation. Using RGB, the 24 bits of a pixel are separated in three parts,
each 8 bits in length. Each of those parts represents the intensity of a color between 0 and
255. The first part represents red (R), the second green (G), and the last blue (B). Out of these three components, all other colors can be mixed additively. For instance, the color code RGB(0, 255, 0) will yield 100 percent green. If all values are set to 0, the resulting color will
be black. If all values are set to 255 it will be white. The figure below illustrates how the
colors are mixed in an additive way.
Figure 19: Additive Mixing of Colors
Another way to represent colors is the CMYK model. In contrast to the RGB representation, it is a subtractive color model composed of cyan, magenta, yellow, and key (black). The color values in CMYK range from 0 to 1. To convert colors from RGB to CMYK, the RGB values first have to be divided by 255. The values of cyan, magenta, yellow, and key can then be computed as follows:

K = 1 − max(R/255, G/255, B/255)

C = (1 − R/255 − K) / (1 − K)

M = (1 − G/255 − K) / (1 − K)

Y = (1 − B/255 − K) / (1 − K)
While the RGB is better suited for digital representation of images, CMYK is commonly
used for printed material.
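The conversion formulas above translate directly into a small function. The sketch below is illustrative only and explicitly handles pure black (R = G = B = 0), where the denominator 1 − K would otherwise be zero.

# Convert an RGB color (0-255 per channel) to CMYK (0-1 per channel).
def rgb_to_cmyk(r, g, b):
    r, g, b = r / 255, g / 255, b / 255
    k = 1 - max(r, g, b)
    if k == 1:                          # pure black: avoid division by zero
        return 0.0, 0.0, 0.0, 1.0
    c = (1 - r - k) / (1 - k)
    m = (1 - g - k) / (1 - k)
    y = (1 - b - k) / (1 - k)
    return c, m, y, k

print(rgb_to_cmyk(0, 255, 0))   # 100 percent green -> (1.0, 0.0, 1.0, 0.0)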
Images as functions
We will now discuss how an image can be built from single pixels. To do that, we need a
function that can map a two-dimensional coordinate (x,y) to a specific color value. On the
x-axis we begin on the left with a value of 0 and continue to the right until the maximum
width of an image is reached. On the y-axis, we begin with 0 at the top and reach the
height of an image at the bottom.
Let us look at the function f(x, y) for an 8-bit grayscale image. The function value f(42, 100) = 0 would mean that we have a black pixel 42 pixels to the right of and 100 pixels below the starting point. In a 24-bit image, the result of the function would be a triple of values indicating the RGB intensities of the specified pixel.
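This view of an image as a function maps directly onto a two-dimensional array. The sketch below uses NumPy and random pixel values purely for illustration.

import numpy as np

# An 8-bit grayscale image as a 2D array: f(x, y) maps a coordinate to a value.
height, width = 200, 300
image = np.random.randint(0, 256, size=(height, width), dtype=np.uint8)

x, y = 42, 100
value = image[y, x]    # arrays are indexed as [row, column], i.e., [y, x]
print(value)           # 0 would mean a black pixel at this position

# A 24-bit RGB image adds a third dimension holding the three color channels.
rgb_image = np.zeros((height, width, 3), dtype=np.uint8)
rgb_image[y, x] = (0, 255, 0)   # set the pixel at (x, y) to pure green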
Filters
Filters play an important role in computer vision when it comes to applying effects to an image, implementing techniques like smoothing or inpainting, or extracting useful information from an image, like the detection of corners or edges. A filter can be defined as a function that takes an image as input, applies modifications to that image, and returns the filtered image as output (Szeliski, 2022).
2D convolution
The convolution of an image I with a kernel k with a size of n and a center coordinate a can
be calculated as follows:
I′(x, y) = Σ_{i=1..n} Σ_{j=1..n} I(x + i − a, y + j − a) · k(i, j)

where I′(x, y) is the value of the resulting image I′ at position (x, y), while I is the original image. The center coordinate for a 3x3 convolution matrix is 2, for a 5x5 convolution
matrix 3 and so forth. To understand the process, we will use the following example of a
3x3 convolution. The kernel matrix used for the convolution is shown in the middle col-
umn of the figure.
Figure 20: 2D Image Convolution
The kernel matrix is moved over each position of the input image. In our input image, the current position is marked orange. In our example, we start with the center position of the image and multiply the image values at this position element-wise with the values of the kernel matrix. The resulting value for the center position of our filtered image is computed as follows:
0·41 + 0·26 + 0·86 + 0·27 + 0·42 + 1·47 + 0·44 + 0·88 + 0·41 = 47
In the next step, we shift the kernel matrix to the next position and compute the new value
of the filtered image:
0·26 + 0·86 + 0·41 + 0·42 + 0·47 + 1·93 + 0·88 + 0·41 + 0·24 = 93
The bottom row in our figure shows the result after all positions of the image have been
multiplied with the kernel matrix.
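The convolution described above can be implemented in a few lines. The following sketch is illustrative only: it skips the border pixels instead of padding them, uses 0-based indexing (so the center offset is n // 2 rather than the 1-based center coordinate mentioned above), and reuses the 3x3 neighborhood values from the worked example.

import numpy as np

# Naive 2D convolution as described above (borders are skipped, no padding).
def convolve2d(image, kernel):
    n = kernel.shape[0]        # kernel size (assumed square with odd size)
    a = n // 2                 # 0-based offset of the kernel center
    out = image.astype(float).copy()
    for y in range(a, image.shape[0] - a):
        for x in range(a, image.shape[1] - a):
            region = image[y - a:y + a + 1, x - a:x + a + 1]
            out[y, x] = np.sum(region * kernel)
    return out

# Kernel with a single 1 to the right of the center, as in the worked example.
kernel = np.array([[0, 0, 0],
                   [0, 0, 1],
                   [0, 0, 0]], dtype=float)
# 3x3 cut-out with the neighborhood values from the example above.
image = np.array([[41, 26, 86],
                  [27, 42, 47],
                  [44, 88, 41]], dtype=float)
print(convolve2d(image, kernel)[1, 1])   # 47.0, matching the first computation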