PPT Week 2 -Lexical Analysis
MSCS 7103
Theories of Programming Languages
Regular Expressions
Lexical Analysis
Week 2 MSCS 7103: Theories of Programming Languages
Regular Expressions
Regular expressions
A way to specify sets of strings in order to describe tokens
Lexical analysis
Turns a stream of characters into a stream of tokens
Finite Automata
A machine used to recognize patterns in strings over some character set (alphabet), and so to decide whether a string belongs to a language
Languages
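Definition: Let Σ be a set of characters (an alphabet). A language over Σ is a set of strings of characters drawn from Σ.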
Examples:
Alphabet = English Characters
Language = C Programs
Note: The ASCII character set is different from the English character set.
Regular Expressions
A notation for specifying regular languages
Definition:
Base cases: ∅, ϵ, and a are regular expressions, where a ∈ Σ.
Induction: If A and B are regular expressions, then
A|B is a regular expression,
AB is a regular expression,
(A) is a regular expression,
A∗ is a regular expression.
Union
– L(A | B) = { s | s ∈ L(A) or s ∈ L(B) }
Examples:
– L('if' | 'then' | 'else') = { “if”, “then”, “else” }
– L('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9') = what?
– L( ('0'|'1') ('0'|'1') ) = { “00”, “01”, “10”, “11” }
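The inductive definition and the meaning function L can be sketched as a small Python data type. This is an illustrative sketch, not part of the slides: the class names (`Empty`, `Eps`, `Sym`, `Alt`, `Cat`, `Star`), the helper `lit`, and the length bound `n` are assumptions, the bound being there only so that an infinite language such as L(A∗) can be listed. Concatenation is taken with its usual meaning, L(AB) = { st | s ∈ L(A) and t ∈ L(B) }.

```python
from dataclasses import dataclass
from functools import reduce


class Regex:
    """Base class for the regular-expression forms in the definition."""


@dataclass(frozen=True)
class Empty(Regex):      # ∅: matches nothing
    pass


@dataclass(frozen=True)
class Eps(Regex):        # ϵ: matches only the empty string
    pass


@dataclass(frozen=True)
class Sym(Regex):        # a single character a ∈ Σ
    a: str


@dataclass(frozen=True)
class Alt(Regex):        # A|B
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Cat(Regex):        # AB
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):       # A∗
    inner: Regex


def lang(r, n):
    """Return the set of all strings of length <= n in L(r)."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Sym):
        return {r.a} if n >= 1 else set()
    if isinstance(r, Alt):
        return lang(r.left, n) | lang(r.right, n)
    if isinstance(r, Cat):
        return {s + t for s in lang(r.left, n) for t in lang(r.right, n)
                if len(s) + len(t) <= n}
    if isinstance(r, Star):
        base = lang(r.inner, n) - {""}
        result = frontier = {""}
        while frontier:
            # Extend the newest strings by one more repetition of the inner language.
            frontier = {s + t for s in frontier for t in base
                        if len(s) + len(t) <= n} - result
            result = result | frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")


def lit(text):
    """Hypothetical helper: a literal such as 'if' is a concatenation of symbols."""
    return reduce(Cat, (Sym(c) for c in text), Eps())


bit = Alt(Sym("0"), Sym("1"))
keywords = Alt(lit("if"), Alt(lit("then"), lit("else")))
print(sorted(lang(Cat(bit, bit), 2)))   # ['00', '01', '10', '11']
print(sorted(lang(keywords, 4)))        # ['else', 'if', 'then']
```

Parentheses need no class of their own here, since grouping is already expressed by how the objects are nested.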
More Examples
What regular expression describes the set L of binary strings that do not
contain 101 as a substring?
1001001110 ∈ L
00010010100 ∉ L
(0|ϵ)(1|000∗)∗(0|ϵ)
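One way to gain confidence in an answer like this is to compare it against the plain substring test on every short binary string. The sketch below does that with Python's `re` module; the function name `agrees_up_to` and the length limit of 10 are assumptions chosen for illustration, and a brute-force check of this kind is evidence, not a proof.

```python
import re
from itertools import product

# (0|ϵ)(1|000∗)∗(0|ϵ) in Python syntax: "0?" plays the role of (0|ϵ).
NO_101 = re.compile(r"0?(1|000*)*0?")


def agrees_up_to(max_len):
    """Compare the regex with the substring test on every binary string."""
    for length in range(max_len + 1):
        for chars in product("01", repeat=length):
            s = "".join(chars)
            in_language = "101" not in s
            if (NO_101.fullmatch(s) is not None) != in_language:
                return False
    return True


print(agrees_up_to(10))                       # True
print(bool(NO_101.fullmatch("1001001110")))   # True
print(bool(NO_101.fullmatch("00010010100")))  # False
```

`fullmatch` is used instead of `search` because the regular expression has to describe the whole string, not just some part of it.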
In practice (1):
No ϵ or ∅:
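Practical regex syntax has no symbols for ϵ or ∅, so both have to be expressed indirectly. The sketch below shows how this might look in Python's `re` module; the variable names are illustrative, and the never-matching pattern relies on lookahead, which is an engine feature and not part of the formal notation.

```python
import re

# (0|ϵ): the ? operator, or an empty alternative, stands in for ϵ.
optional_zero = re.compile(r"0?")
explicit_empty = re.compile(r"(0|)")

# ∅: a pattern that can never match. An empty negative lookahead always fails.
nothing = re.compile(r"(?!)")

for s in ["", "0", "00"]:
    print(repr(s),
          bool(optional_zero.fullmatch(s)),
          bool(explicit_empty.fullmatch(s)),
          bool(nothing.fullmatch(s)))
# '' True True False
# '0' True True False
# '00' False False False
```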
In practice (2):
Character classes allow us to write tedious expressions such as a|b|···|z more easily.
Examples:
“Recent” years:
199(6|7|8|9)|20(0(0|1|2|3|4|5|6|7|8|9)|1(0|1|2|3|4|5|6|7|8))
→ 199[6-9]|20(0[0-9]|1[0-8])
Identifier in C: (a|b|···|z|A|B|···|Z|_)(a|b|···|z|A|B|···|Z|0|1|···|9|_)∗
→ [a-zA-Z_][a-zA-Z0-9_]∗
Anything but a lowercase letter: [^a-z]
Any character: .
Digit, non-digit: \d, \D
Whitespace, non-whitespace: \s, \S
Word character, non-word character: \w, \W
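The character-class shorthands can be tried out directly. The sketch below uses Python's `re` module to check that the long and short forms of the "recent years" pattern accept the same years, and to exercise the other classes; the constant names and the tested year range are assumptions made for the example.

```python
import re

LONG = re.compile(
    r"199(6|7|8|9)|20(0(0|1|2|3|4|5|6|7|8|9)|1(0|1|2|3|4|5|6|7|8))")
SHORT = re.compile(r"199[6-9]|20(0[0-9]|1[0-8])")
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def same_on_years(first=1900, last=2100):
    """True if both forms accept exactly the same years in the range."""
    return all(
        (LONG.fullmatch(str(y)) is None) == (SHORT.fullmatch(str(y)) is None)
        for y in range(first, last + 1))


recent = [y for y in range(1900, 2101) if SHORT.fullmatch(str(y))]
print(same_on_years())                    # True
print(recent[0], recent[-1])              # 1996 2018

print(bool(IDENT.fullmatch("_tmp1")))     # True
print(bool(IDENT.fullmatch("1tmp")))      # False
print(bool(re.fullmatch(r"[^a-z]", "A"))) # True
print(re.findall(r"\d", "x = 42"))        # ['4', '2']
print(re.findall(r"\s", "x = 42"))        # [' ', ' ']
print(re.findall(r"\w", "x = 42"))        # ['x', '4', '2']
```

In Python 3, `\d`, `\s`, and `\w` are Unicode-aware for text patterns; passing `re.ASCII` restricts them to the ASCII ranges shown on the slide. The dot does not match a newline unless `re.DOTALL` is given.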
Summary:
Lexical Analysis
In English:
noun, verb, adjective, ...
In a programming language:
Identifier, Integer, Keyword, Whitespace, ...
Parser relies on token distinctions:
e.g., identifiers are treated differently than keywords
OpenPar: a left-parenthesis
Lexical Analyzer:
R = R1 | R2 | R3 | ...
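Combining one regular expression per token class into a single R can be sketched with named groups in Python's `re` module. The token class names Identifier, Integer, Keyword, Whitespace, and OpenPar come from the slides; the patterns themselves, the extra `ClosePar` class, and the function name `tokenize` are assumptions made for this example.

```python
import re

# One (token class, pattern) pair per rule Ri. The patterns are hypothetical.
RULES = [
    ("Keyword",    r"(?:if|then|else)\b"),
    ("Identifier", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("Integer",    r"[0-9]+"),
    ("Whitespace", r"\s+"),
    ("OpenPar",    r"\("),
    ("ClosePar",   r"\)"),
]

# R = R1 | R2 | R3 | ..., with one named group per token class.
R = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def tokenize(text):
    """Return a list of (token class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = R.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for token in tokenize("if (x1) then 42"):
    print(token)
```

Python's `|` takes the first alternative that matches, not the longest one, which is why the Keyword rule ends in `\b`: without it, an identifier such as `iffy` would be split after `if`.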
Lexing Example:
Ambiguities (1):
Example:
Ambiguities (2):
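The slide examples for the two kinds of ambiguity are not reproduced in this text. As an assumption about what they cover, the sketch below shows the two conventions commonly used by lexer generators: when several prefixes of the input match, take the longest one ("maximal munch"), and when two rules match the same longest prefix, take the rule listed first. The rule set and function names are hypothetical.

```python
import re

# Hypothetical rules, listed in priority order (an earlier rule wins a tie).
RULES = [
    ("Keyword",    re.compile(r"if|then|else")),
    ("Identifier", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
    ("Integer",    re.compile(r"[0-9]+")),
    ("Whitespace", re.compile(r"\s+")),
]


def next_token(text, pos):
    """Pick the longest match at pos; break ties by rule order."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # The strict > keeps the earlier rule when the lengths are equal.
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best = next_token(text, pos)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, lexeme, pos = best
        tokens.append((name, lexeme))
    return tokens


print(tokenize("if"))     # [('Keyword', 'if')]
print(tokenize("iffy"))   # [('Identifier', 'iffy')]
```

For `if`, both Keyword and Identifier match two characters, so priority decides. For `iffy`, Identifier matches four characters against Keyword's two, so the longer match wins. The sketch assumes each individual pattern already returns its own longest match, which holds for the greedy patterns used here.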
Summary:
To resolve ambiguities
To handle errors
Good algorithms known (next)
Requiring only a single pass over the input
And few operations per character (table lookup)
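A single left-to-right pass driven by a transition table can be sketched as follows. The automaton, its state names, and the character classes are assumptions made for illustration; they are not the algorithm from the next lecture, only a small hand-built example of table lookup per character.

```python
# Hypothetical DFA for three token classes.
START, IDENT, INT, WS = range(4)
ACCEPTING = {IDENT: "Identifier", INT: "Integer", WS: "Whitespace"}


def char_class(c):
    """Map a character to the column used in the transition table."""
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    if c.isspace():
        return "space"
    return "other"


# Transition table: (state, character class) -> next state.
# A missing entry means "no transition", which ends the current token.
TABLE = {
    (START, "letter"): IDENT,
    (START, "digit"): INT,
    (START, "space"): WS,
    (IDENT, "letter"): IDENT,
    (IDENT, "digit"): IDENT,
    (INT, "digit"): INT,
    (WS, "space"): WS,
}


def scan(text):
    """Scan once, left to right, with a constant number of lookups per character."""
    tokens, state, start, i = [], START, 0, 0
    while i < len(text):
        nxt = TABLE.get((state, char_class(text[i])))
        if nxt is not None:
            state = nxt
            i += 1
        elif state == START:
            raise ValueError(f"unexpected character {text[i]!r} at position {i}")
        else:
            # No transition: emit the token read so far and restart here.
            tokens.append((ACCEPTING[state], text[start:i]))
            state, start = START, i
    if state != START:
        tokens.append((ACCEPTING[state], text[start:]))
    return tokens


print(scan("x1 42  y"))
# [('Identifier', 'x1'), ('Whitespace', ' '), ('Integer', '42'), ('Whitespace', '  '), ('Identifier', 'y')]
```

Every non-start state in this small automaton is accepting, so the scanner never has to back up. A general scanner would also remember the last accepting state it passed through and return to it when the automaton gets stuck.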
Prepared by
Mary Ann F. Quioc, DIT