Day 3 - Regexps
Yared Y.
• Lexers are normally constructed by lexer generators,
that transform human-readable specifications of
tokens and white-space into efficient programs.
• Specifications are traditionally written using regular
expressions: An algebraic notation for describing sets
of strings.
• The generated lexers are in a class of extremely simple
programs called finite automata.
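As a rough illustration of what a generated lexer does, the sketch below hand-writes a tiny scanner in Python from a list of token specifications. The token names, the small keyword set, and the operator set are all hypothetical choices made for this example; a real lexer generator would compile such a specification into a finite automaton rather than call a regex library at run time.

```python
import re

# Hypothetical token specification: (token name, regular expression).
# KEYWORD is listed before IDENT so that "if" is not classified as an identifier.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while)\b"),
    ("IDENT",   r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INT",     r"[0-9]+"),            # assumed integer pattern: one or more digits
    ("OP",      r"[+\-*/=<>()]"),
    ("SKIP",    r"\s+"),               # white-space is recognized but not emitted
]

# One combined pattern with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Scan source left to right and return a list of (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("if x1 = 42"))
```

The specification stays human-readable (a list of names and patterns), while the scanning loop is generic and never needs to change when a token is added.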
Formal language elements
• A formal language is one that can be specified precisely
and is amenable for use with computers, whereas a
natural language is one which is normally spoken by
people.
• The syntax of Java is an example of a formal language.
Symbols, Alphabets, Strings, Language
• Let us assume that the source program is stored in a file. It
consists of a sequence of characters. Lexical analysis, i.e., the
scanner, reads this sequence from left to right and decomposes
it into a sequence of lexical units, called symbols.
• Example:
Using the alphabet {a, b, c}, the strings
"abc", "bac", and "aa" are all different strings.
In the binary alphabet {0, 1}, "101" and "011" are
different strings.
• For example, if Σ = {a, b, c}, then
"abc",
"aab", and
"cba" are strings over Σ.
Alphabet: {0, 1}
• Each of these strings has exactly one '1', which fits the definition of
the language.
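The two ideas on these slides, "a string over Σ" and "a string that fits the definition of a language", can be checked mechanically. The following is a minimal sketch; the function names are hypothetical, and the sample strings used for the "exactly one 1" language are assumptions, since none are listed here.

```python
def is_string_over(s, alphabet):
    """Return True if every symbol of s belongs to alphabet."""
    return all(symbol in alphabet for symbol in s)

def has_exactly_one_1(s):
    """Membership test for: strings over {0, 1} that contain exactly one '1'."""
    return is_string_over(s, {"0", "1"}) and s.count("1") == 1

SIGMA = {"a", "b", "c"}

for s in ["abc", "aab", "cba", "abd"]:
    print(s, is_string_over(s, SIGMA))

# Assumed sample strings for the binary language.
for s in ["1", "010", "0001", "11", "000"]:
    print(s, has_exactly_one_1(s))
```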
Example:
1. {0, 10, 1011}
2. { }
3. {Ɛ, 0, 00, 000, 0000, 00000, …}
4. The set of all strings of zeros and ones having an
even number of ones
• A language L over an alphabet Σ is a set of strings
composed of symbols from Σ.
• For example,
L = {"a", "ab", "abc"} is a language over the alphabet
Σ = {a, b, c}.
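The example languages above can be modelled in Python as a rough sketch: finite languages as sets of strings, and infinite languages as membership predicates. All names (LANG_1, all_zeros, and so on) are hypothetical labels for this illustration. Note that the empty language { } is different from {Ɛ}, which contains one string.

```python
# Finite languages can be written down directly as sets of strings.
LANG_1 = {"0", "10", "1011"}
LANG_2 = set()                    # the empty language: it contains no strings
LANG_ABC = {"a", "ab", "abc"}     # the language L over {a, b, c}

def all_zeros(s):
    """Language 3: strings made only of 0s (the empty string is included)."""
    return set(s) <= {"0"}

def even_number_of_ones(s):
    """Language 4: strings over {0, 1} with an even number of 1s."""
    return set(s) <= {"0", "1"} and s.count("1") % 2 == 0

print("10" in LANG_1)                 # membership in a finite language
print("" in LANG_2)                   # even the empty string is not in { }
print(all_zeros("0000"))
print(even_number_of_ones("1001"))
```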
Summary
• Symbol: The basic unit
• Alphabet: A set of symbols
• String: A sequence of symbols
• Set: A collection of distinct items
• Language: A set of strings
Language elements
• A string is a list of characters from a given alphabet.
• The elements of a string need not be unique, and the
order in which they are listed is important. For example,
“abc” and “cba” are different strings, as are “abb” and
“ab”.
• The string which consists of no characters is still a string
(of characters from the given alphabet), and we call it
the null string and designate it by Ɛ.
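Python's built-in strings behave the same way, which gives a quick check of these properties. In this small sketch, EPSILON is just a hypothetical name for the null string.

```python
EPSILON = ""  # the null string: a string with no characters

# Order matters, and repeated characters count.
print("abc" == "cba")            # False
print("abb" == "ab")             # False

# The null string is still a string, of length zero.
print(isinstance(EPSILON, str))  # True
print(len(EPSILON))              # 0
```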
Regular Expressions and Finite Automata
• For lexical analysis, specifications are traditionally written using
regular expressions: An algebraic notation for describing sets of
strings. The generated lexers are in a class of extremely simple
programs called finite automata.
• To identify the tokens, we need some method of describing the possible
tokens that can appear in the input stream.
• For this purpose, we introduce regular expressions, a notation that can
be used to describe essentially all the tokens of a programming language.
• Secondly, having decided what the tokens are, we need some
mechanism to recognize them in the input stream. This is done by
token recognizers, which are designed using transition diagrams and
finite automata.
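To make the idea of a token recognizer concrete, here is a minimal sketch of a finite automaton written as a transition table. It recognizes identifiers, taken here to be a letter or underscore followed by any number of letters, digits, or underscores. The state names and the char_class helper are hypothetical, chosen only for this illustration.

```python
import string

LETTERS = set(string.ascii_letters) | {"_"}
DIGITS = set(string.digits)

def char_class(ch):
    """Map a character to the input class used as an edge label in the diagram."""
    if ch in LETTERS:
        return "letter"
    if ch in DIGITS:
        return "digit"
    return "other"

# Transition table: (state, input class) -> next state.
# A missing entry means there is no transition, so the input is rejected.
TRANSITIONS = {
    ("start", "letter"): "in_ident",
    ("in_ident", "letter"): "in_ident",
    ("in_ident", "digit"): "in_ident",
}
ACCEPTING = {"in_ident"}

def accepts(text):
    """Run the automaton over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING

for candidate in ["x1", "_tmp", "1x", "a-b", ""]:
    print(repr(candidate), accepts(candidate))
```

Each entry in TRANSITIONS corresponds to one arrow in a transition diagram, and ACCEPTING holds the states drawn with a double circle.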
Regular expressions (regex)
• A regex is a sequence of characters that defines a
search pattern. This pattern can be used to match
strings or parts of strings.
• Regexes are used to define the patterns for various types
of tokens.
Example:
A keyword like `if` can be matched with the regex `if`.
An identifier (like variable names) can be matched with
a regex like `[a-zA-Z_][a-zA-Z0-9_]*`.
An integer literal can be matched with a regex like `[0-
• The set of all integer constants or the set of all variable
names are examples of sets of strings, where the
individual digits or letters used to form these constants
or names are taken from a particular alphabet, i.e., a
set of characters.
• A set of strings is called a language.
• For integers, the alphabet consists of the digits 0–9 and
for variable names the alphabet contains both letters
and digits (and perhaps a few other characters, such as
underscore).
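The token patterns mentioned above can be tried out with Python's re module. This is only a sketch: the keyword and identifier patterns are the ones given in the text, but the integer pattern `[0-9]+` is an assumption (one or more digits), and the classify function is a hypothetical helper.

```python
import re

KEYWORD_IF = re.compile(r"if")
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
# Assumption: one or more digits; the exact integer pattern is not taken from the text.
INTEGER = re.compile(r"[0-9]+")

def classify(lexeme):
    """Return the token kind whose pattern matches the whole lexeme, or None."""
    # The keyword is tested first, because "if" also matches the identifier pattern.
    if KEYWORD_IF.fullmatch(lexeme):
        return "KEYWORD"
    if IDENTIFIER.fullmatch(lexeme):
        return "IDENTIFIER"
    if INTEGER.fullmatch(lexeme):
        return "INTEGER"
    return None

for lexeme in ["if", "count_1", "123", "1abc"]:
    print(lexeme, classify(lexeme))
```

fullmatch is used rather than search so that the whole string must belong to the language described by the pattern, not just some part of it.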
Alphabets
• Any finite set of symbols. Typical examples of symbols
are letters, digits, and punctuation.
Σ = {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet.
Alphabet: Σ = {a, b, c}
Set of strings of length 3 (Σ³):
Σ³ = {aaa, aab, aac, aba, abb, abc, aca, acb, acc, baa,
bab, bac, bba, bbb, bbc, bca, bcb, bcc, caa, cab, cac, cba, cbb, cbc,
cca, ccb, ccc}
There are 3³ = 27 strings of length 3.
How to read it
Σⁿ has |Σ|ⁿ elements; for the binary alphabet {0, 1}, Σⁿ has 2ⁿ elements.
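The count can be verified by enumerating the strings. The sketch below uses itertools.product from the standard library; the function name strings_of_length is a hypothetical choice. For an alphabet with k symbols it produces k to the power n strings of length n.

```python
from itertools import product

def strings_of_length(alphabet, n):
    """Return all strings of length n over alphabet, in sorted order."""
    return ["".join(symbols) for symbols in product(sorted(alphabet), repeat=n)]

sigma = {"a", "b", "c"}
sigma_3 = strings_of_length(sigma, 3)

print(len(sigma_3))                               # 27
print(sigma_3[:4])                                # ['aaa', 'aab', 'aac', 'aba']
print(len(strings_of_length({"0", "1"}, 4)))      # 16
```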
Operations on Alphabets
• Union
Definition: The union of two alphabets Σ₁ and Σ₂ is the set
of symbols that belong to Σ₁, to Σ₂, or to both.
Notation: Σ₁ ∪ Σ₂.
• Positive closure (L⁺)
• Positive closure indicates one or more occurrences of input
symbols in a string, i.e., it excludes the empty string Ɛ (it is the set of
strings with one or more occurrences of input symbols).
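Both operations can be sketched in a few lines of Python. Union maps directly onto set union. The positive closure is an infinite set, so the hypothetical positive_closure function below lists only the strings up to a chosen maximum length; the two sample alphabets are assumptions made for illustration.

```python
from itertools import product

# Union of two alphabets (sample alphabets chosen for this example).
SIGMA_1 = {"a", "b", "c"}
SIGMA_2 = {"c", "d"}
print(sorted(SIGMA_1 | SIGMA_2))   # ['a', 'b', 'c', 'd']

def positive_closure(alphabet, max_length):
    """Return the strings of length 1..max_length over alphabet.

    This is a finite slice of the positive closure; lengths start at 1,
    so the empty string is never included.
    """
    result = []
    for n in range(1, max_length + 1):
        result.extend("".join(p) for p in product(sorted(alphabet), repeat=n))
    return result

print(positive_closure({"0", "1"}, 2))   # ['0', '1', '00', '01', '10', '11']
```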