PPT Week 2 -Lexical Analysis

The document discusses regular expressions and their role in lexical analysis within programming languages. It defines regular expressions, finite automata, and the process of tokenizing input strings into syntactic categories. Additionally, it addresses the ambiguities in token recognition and the importance of using a systematic approach to resolve them.

Uploaded by

Edrian Rodriguez


Week 2

MSCS 7103
Theories of Programming Languages
Regular Expressions

Lexical Analysis
Week 2 MSCS 7103: Theories of Programming Languages


[Figure: Program Translation Flowchart]


Regular expressions
A way to specify sets of strings in order to describe tokens

Lexical analysis
Turns a stream of characters into a stream of tokens

Finite Automata
A machine that recognizes patterns in strings over some character set (alphabet), that is, decides whether a string belongs to a language

Languages

Definition:

Let Σ be a set of characters.

A language over Σ is a set of strings of characters drawn from Σ.

Σ is called the alphabet.
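
As a small illustration of the definition, a finite language can be modeled as a set of strings, together with a check that a string is drawn from the alphabet. The names SIGMA, LANGUAGE, and is_over_alphabet are assumptions made for this sketch and do not come from the slides.

```python
# Hypothetical names, for illustration only.
SIGMA = {"0", "1"}                 # the alphabet
LANGUAGE = {"", "0", "10", "110"}  # one (finite) language over SIGMA

def is_over_alphabet(s: str, sigma: set[str]) -> bool:
    """True if every character of s is drawn from sigma."""
    return all(ch in sigma for ch in s)

print(all(is_over_alphabet(s, SIGMA) for s in LANGUAGE))  # True
print(is_over_alphabet("102", SIGMA))                     # False
```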


Languages

Examples:
Alphabet = English Characters

Language = English Sentences

Note: Not every string of English characters is an English sentence.
Example: xayenb sbe'

Alphabet = ASCII characters

Language = C Programs
Note: The ASCII character set is different from the English character set.

Regular Expressions
A notation for specifying regular languages

Definition:
Base cases: ∅, ϵ, and a are regular expressions, where a ∈ Σ.
Induction: If A and B are regular expressions, then
A|B is a regular expression,
AB is a regular expression,
(A) is a regular expression,
A∗ is a regular expression.

Precedence: Kleene star (*), Concatenation, Union (|)

Parentheses indicate grouping
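
The precedence rule can be checked experimentally. The sketch below assumes Python's re module as a stand-in for the formal notation (it follows the same precedence for star, concatenation, and union); the sample strings are arbitrary.

```python
import re

# 'ab|c*' groups as (ab)|(c*): star binds tighter than concatenation,
# and concatenation binds tighter than union.
implicit = re.compile(r"ab|c*")
explicit = re.compile(r"(ab)|((c)*)")

samples = ["", "ab", "c", "ccc", "abc", "abab", "ac"]
for s in samples:
    assert bool(implicit.fullmatch(s)) == bool(explicit.fullmatch(s))

# Parentheses change the language: a(b|c)* accepts 'abcb', ab|c* does not.
grouped = re.compile(r"a(b|c)*")
print(bool(grouped.fullmatch("abcb")), bool(implicit.fullmatch("abcb")))  # True False
```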

Base Regular Expressions

• Single character: 'c'

– L('c') = { “c” } (for any c ∈ Σ)
• Concatenation: AB
– A and B are other regular expressions
– L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
• Example: L('i' 'f') = { “if” }
– We abbreviate 'i' 'f' as 'if'

Compound Regular Expressions

Union
– L(A | B) = { s | s ∈ L(A) or s ∈ L(B) }

Examples:
– L('if' | 'then' | 'else') = { “if”, “then”, “else” }
– L('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9') = what?
– L( ('0'|'1') ('0'|'1') ) = { “00”, “01”, “10”, “11” }
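
The set definitions of concatenation and union translate directly into code for finite languages. This is a minimal sketch; the function names union and concat are assumptions for illustration.

```python
def union(a: set[str], b: set[str]) -> set[str]:
    """L(A | B): strings that are in L(A) or in L(B)."""
    return a | b

def concat(a: set[str], b: set[str]) -> set[str]:
    """L(AB): every string of L(A) followed by every string of L(B)."""
    return {x + y for x in a for y in b}

bit = union({"0"}, {"1"})
print(sorted(concat(bit, bit)))   # ['00', '01', '10', '11']
print(concat({"i"}, {"f"}))       # {'if'}

# The ten-way union of single digits from the examples above.
digit: set[str] = set()
for d in "0123456789":
    digit = union(digit, {d})
print(sorted(digit))              # the ten one-character strings '0' .. '9'
```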


More Examples

(0|1)∗ all binary strings

(0|1)∗0 all binary strings that end in 0
(0|1)00∗ all binary strings that start with 0 or 1, followed by one or more 0s
0|1(0|1)∗ all binary numbers without leading 0s

What regular expression describes the set L of binary strings that do not contain 101 as a substring?
1001001110 ∈ L
00010010100 ∉ L
(0|ϵ)(1|000∗)∗(0|ϵ)
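
One way to gain confidence in this expression is to compare it with a direct substring test on every short binary string. The sketch below assumes Python's re module, writes ϵ as an empty alternative, and uses an arbitrary length bound; it is a brute-force check, not a proof.

```python
import re
from itertools import product

# The slide's expression, with the empty string written as an empty alternative.
NO_101 = re.compile(r"(0|)(1|000*)*(0|)")

def agrees_up_to(max_len: int) -> bool:
    """Compare the regex with the direct substring test on all short strings."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if bool(NO_101.fullmatch(s)) != ("101" not in s):
                return False
    return True

print(agrees_up_to(10))                             # True
print(bool(NO_101.fullmatch("1001001110")))         # True
print(bool(NO_101.fullmatch("00010010100")))        # False
```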

In practice (1):

No ϵ or ∅:

The empty string is written as an empty alternative: a(b|) instead of a(b|ϵ) to express the language {a, ab}.

The empty language is not very useful in practice.


In practice (1):

Additional repetition constructs:

R+ = RR∗: one or more repetitions of R

R? = (R|): zero or one repetition of R

R{n}, R{,n}, R{m,}, R{m,n}: exactly n, up to n, at least m, between m and n repetitions of R

Some capabilities beyond regular languages:
Allow, for example, recognition of languages such as αβα, for α, β ∈ Σ∗ (for example, through backreferences).
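
The repetition shorthands can be compared against their expansions, and a backreference shows one capability that goes beyond regular languages. This sketch assumes Python's re module; the pattern (a+)b\1 is an illustrative choice, matching strings of the form aⁿbaⁿ, which no regular expression in the formal sense describes.

```python
import re

# R+ = RR*, R? = (R|), R{2,3} = RR|RRR: shorthand and expansion agree.
pairs = [
    (r"(ab)+", r"(ab)(ab)*"),
    (r"(ab)?", r"(ab|)"),
    (r"a{2,3}", r"aa|aaa"),
]
samples = ["", "a", "aa", "aaa", "aaaa", "ab", "abab", "aba"]
for short, expanded in pairs:
    for s in samples:
        assert bool(re.fullmatch(short, s)) == bool(re.fullmatch(expanded, s))

# Backreference: \1 must repeat exactly the text that group 1 matched.
same_on_both_sides = re.compile(r"(a+)b\1")
print(bool(same_on_both_sides.fullmatch("aabaa")))   # True
print(bool(same_on_both_sides.fullmatch("aabaaa")))  # False
```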

In practice (2):
Character classes allow us to write tedious expressions such as a|b|···|z more easily.
Examples:
“Recent” years:
199(6|7|8|9)|20(0(0|1|2|3|4|5|6|7|8|9)|1(0|1|2|3|4|5|6|7|8))
→ 199[6-9]|20(0[0-9]|1[0-8])
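
That the character-class form describes the same language as the long form can be checked by testing both on a range of years. The sketch assumes Python's re module; the names LONG and SHORT and the tested range are arbitrary choices.

```python
import re

LONG = re.compile(
    r"199(6|7|8|9)|20(0(0|1|2|3|4|5|6|7|8|9)|1(0|1|2|3|4|5|6|7|8))"
)
SHORT = re.compile(r"199[6-9]|20(0[0-9]|1[0-8])")

# Both forms must agree on every candidate year.
years = [str(y) for y in range(1900, 2100)]
assert all(bool(LONG.fullmatch(y)) == bool(SHORT.fullmatch(y)) for y in years)

accepted = [y for y in years if SHORT.fullmatch(y)]
print(accepted[0], accepted[-1], len(accepted))  # 1996 2018 23
```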


In practice (2):
Identifier in C: (a|b|···|z|A|B|···|Z|_)(a|b|···|z|A|B|···|Z|0|1|···|9|_)∗
→ [a-zA-Z_][a-zA-Z0-9_]∗
Anything but a lowercase letter: [^a-z]
Any character: .
Digit, non-digit: \d, \D
Whitespace, non-whitespace: \s, \S
Word character, non-word character: \w, \W
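
A short sketch of these classes in use, assuming Python's re module. The re.ASCII flag is passed so that \d and \w stay restricted to ASCII digits and word characters; the sample strings are made up for illustration.

```python
import re

IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Candidate identifiers: the last three are rejected.
for s in ["x", "_tmp1", "count_2", "2fast", "a-b", ""]:
    print(repr(s), bool(IDENT.fullmatch(s)))

print(re.findall(r"\d+", "x1 = 42;", flags=re.ASCII))  # ['1', '42']
print(re.findall(r"\w+", "x1 = 42;", flags=re.ASCII))  # ['x1', '42']
print(re.fullmatch(r"[^a-z]", "Q") is not None)        # True
```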

Summary:

Regular expressions describe many useful languages

Given a string s and a regexp R, is s ∈ L(R)?

But a yes/no answer is not enough!

Instead: partition the input into lexemes

We will adapt regular expressions to this goal

Lexical Analysis

Lexical Analysis Definition:

What do we want to do? Example:

if (i == j)
    z = 0;
else
    z = 1;
The input is just a sequence of characters:
if (i == j)\n\tz = 0;\nelse\n\tz = 1;
Goal: partition input strings into substrings
– And classify them according to their role


Output of lexical analysis is a list of tokens

A token is a syntactic category

In English:
noun, verb, adjective, ...

In a programming language:
Identifier, Integer, Keyword, Whitespace, ...
Parser relies on token distinctions:
e.g., identifiers are treated differently than keywords

Tokens: correspond to sets of strings

Identifier: strings of letters or digits, starting with a letter

Integer: a non-empty string of digits

Keyword: “else” or “if” or “begin” or ...

Whitespace: a non-empty sequence of blanks, newlines, and/or tabs

OpenPar: a left-parenthesis

Lexical Analyzer:

An implementation must do two things:

Recognize substrings corresponding to tokens

Return the value or lexeme of the token

– The lexeme is the substring


Lexical Specifications (1):

Select a set of tokens

Number, Keyword, Identifier, ...
Write a regexp for the lexemes of each token
Number = digit+
Keyword = 'if' | 'else' | ...
Identifier = letter ( letter | digit ) *
OpenPar = '('
– ...
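
Such a specification could be written down as an ordered table of (token name, regexp) pairs. In this sketch, TOKEN_SPEC and tokens_matching are hypothetical names, only the rules shown on the slide are included, and "letter" and "digit" are assumed to mean ASCII letters and digits.

```python
import re

# Hypothetical rule table mirroring the slide.
DIGIT = r"[0-9]"
LETTER = r"[a-zA-Z]"
TOKEN_SPEC = [
    ("Number", rf"{DIGIT}+"),
    ("Keyword", r"if|else"),
    ("Identifier", rf"{LETTER}({LETTER}|{DIGIT})*"),
    ("OpenPar", r"\("),
]

def tokens_matching(lexeme: str) -> list[str]:
    """Names of all token classes whose regexp matches the whole lexeme."""
    return [name for name, rx in TOKEN_SPEC if re.fullmatch(rx, lexeme)]

print(tokens_matching("42"))  # ['Number']
print(tokens_matching("x1"))  # ['Identifier']
print(tokens_matching("if"))  # ['Keyword', 'Identifier']
print(tokens_matching("("))   # ['OpenPar']
```

Note that "if" matches two rules at once, so a specification like this needs a tie-breaking rule.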


Lexical Specifications (2):

Construct R, matching all lexemes for all tokens:

R = Keyword | Identifier | Number | ...

R = R1 | R2| R3 | ...

Fact: if s ∈ L(R) then s is a lexeme

Furthermore, s ∈ L(Rj) for some j
This j determines the token that is reported
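
One way to build R and recover j is to wrap each Rj in a named group and ask which group matched. The sketch assumes Python's re module (its Match.lastgroup attribute reports the name of the matched group); the rule names and patterns are illustrative.

```python
import re

# R = R1 | R2 | ... built from named groups, one per token class.
RULES = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Number", r"[0-9]+"),
]
R = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))

def classify(s: str):
    """Name of the first rule Rj with s in L(Rj), or None if s is not in L(R)."""
    m = R.fullmatch(s)
    return m.lastgroup if m else None

print(classify("if"))    # Keyword
print(classify("iffy"))  # Identifier
print(classify("42"))    # Number
print(classify("+"))     # None
```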


Lexical Specifications (3):

Let the input be x1 ... xn

Each xi is in the alphabet Σ
For 1 ≤ i ≤ n, check whether
– x1...xi ∈ L(R)
If so, it must be that
x1...xi ∈ L(Rj) for some j
Remove x1...xi from the input and restart
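
The prefix check in this procedure can be sketched as follows. The regexp R here is an assumed example (whitespace, integers, identifiers, plus), and matching_prefix_lengths is a hypothetical helper that lists every i for which the prefix is in L(R).

```python
import re

# Assumed example R: Whitespace | Integer | Identifier | Plus.
R = re.compile(r"[ \t\n]+|[0-9]+|[a-zA-Z][a-zA-Z0-9]*|\+")

def matching_prefix_lengths(text: str) -> list[int]:
    """All i with 1 <= i <= n such that text[:i] is in L(R)."""
    return [i for i in range(1, len(text) + 1) if R.fullmatch(text[:i])]

print(matching_prefix_lengths("f +3 +g"))  # [1]
print(matching_prefix_lengths("foo+3"))    # [1, 2, 3]
```

When more than one prefix qualifies, as in the second call, the procedure needs a rule for choosing among them.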


Lexing Example:

R = Whitespace | Integer | Identifier | Plus

Parse “f +3 +g”
“f” matches R, more precisely Identifier
“ ” matches R, more precisely Whitespace
“+” matches R, more precisely Plus
– ...

The token-lexeme pairs are

<Identifier, “f”>
<Whitespace, “ ”>
<Plus, “+”>
– ...
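
The whole example can be reproduced with a short loop that matches R at the current position, emits a pair, and advances. This sketch assumes Python's re module and concrete patterns for each token class; it relies on re trying alternatives left to right with greedy repetition, which is sufficient here because the four rules start with different characters.

```python
import re

RULES = [
    ("Whitespace", r"[ \t\n]+"),
    ("Integer", r"[0-9]+"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Plus", r"\+"),
]
R = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))

def lex(text: str) -> list[tuple[str, str]]:
    """Repeatedly match R at the current position, emit a pair, and advance."""
    pairs, pos = [], 0
    while pos < len(text):
        m = R.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at position {pos}: {text[pos]!r}")
        pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs

for pair in lex("f +3 +g"):
    print(pair)
```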

Ambiguities (1):

There are ambiguities in the algorithm

Example:

R = Whitespace | Integer | Identifier | Plus

Parse “foo+3”
“f” matches R, more precisely Identifier
But also “fo” matches R, and “foo”, but not “foo+”

How much input is used?

Maximal Munch rule: Pick the longest possible substring that matches R
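
Maximal munch can be written straight from its definition: try the longest candidate substring first and shrink until some rule matches. This is a deliberately naive sketch with assumed patterns for each token class (a practical lexer would use an automaton rather than rescanning); longest_match and lex are hypothetical names.

```python
import re

RULES = [
    ("Whitespace", re.compile(r"[ \t\n]+")),
    ("Integer", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("Plus", re.compile(r"\+")),
]

def longest_match(text: str, pos: int):
    """Return (token, end) for the longest substring text[pos:end] in L(R), or None."""
    for end in range(len(text), pos, -1):  # longest candidate first
        for name, rx in RULES:
            if rx.fullmatch(text, pos, end):
                return name, end
    return None

def lex(text: str) -> list[tuple[str, str]]:
    out, pos = [], 0
    while pos < len(text):
        hit = longest_match(text, pos)
        if hit is None:
            raise ValueError(f"no token matches at position {pos}")
        name, end = hit
        out.append((name, text[pos:end]))
        pos = end
    return out

print(lex("foo+3"))  # [('Identifier', 'foo'), ('Plus', '+'), ('Integer', '3')]
```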

Ambiguities (2):

R = Whitespace | 'new' | Integer | Identifier

Parse “new foo”

“new” matches R, more precisely 'new'
but also Identifier – which one do we pick?

In general, use the rule listed first.

So we must list 'new' (and other keywords) before Identifier.
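
The two rules combine naturally: prefer the longest match, and break ties by rule order. In this sketch the rule table, the token name New, and the helpers next_token and lex are assumptions for illustration; each rule's greedy match is used as its longest match, which holds for these simple patterns.

```python
import re

# Order matters: 'new' is listed before Identifier.
RULES = [
    ("Whitespace", re.compile(r"[ \t\n]+")),
    ("New", re.compile(r"new")),
    ("Integer", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
]

def next_token(text: str, pos: int):
    """Longest match wins; among equally long matches, the earliest rule wins."""
    best = None  # (length, name)
    for name, rx in RULES:
        m = rx.match(text, pos)
        # Strict '>' keeps the earlier rule when lengths are equal.
        if m and m.end() > pos and (best is None or m.end() - pos > best[0]):
            best = (m.end() - pos, name)
    return best

def lex(text: str) -> list[tuple[str, str]]:
    out, pos = [], 0
    while pos < len(text):
        best = next_token(text, pos)
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        length, name = best
        out.append((name, text[pos:pos + length]))
        pos += length
    return out

print(lex("new foo"))  # [('New', 'new'), ('Whitespace', ' '), ('Identifier', 'foo')]
print(lex("newt"))     # [('Identifier', 'newt')]
```

The second call shows the two rules interacting: "newt" is longer than "new", so maximal munch selects Identifier even though 'new' is listed first.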


Summary:

Regular expressions provide a concise notation for string patterns

Their use in lexical analysis requires small extensions

To resolve ambiguities
To handle errors
Good algorithms are known (next)
Requiring only a single pass over the input
And few operations per character (table lookup)
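
To give a flavor of the "table lookup" remark, here is a hand-built finite automaton for (0|1)∗0, one of the earlier examples. The state numbering and the transition table are assumptions made for this sketch; the slides defer the actual construction algorithms to the next lecture.

```python
# Hand-built DFA for (0|1)*0; state names and table are illustrative.
# State 0: nothing read yet, or last character was '1'.
# State 1: last character was '0'.
TRANSITIONS = {
    (0, "0"): 1, (0, "1"): 0,
    (1, "0"): 1, (1, "1"): 0,
}
START, ACCEPTING = 0, {1}

def accepts(s: str) -> bool:
    """Single left-to-right pass with one table lookup per character."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # character outside the alphabet {0, 1}
            return False
    return state in ACCEPTING

print(accepts("1010"), accepts("101"), accepts(""))  # True False False
```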

Prepared by
Mary Ann F. Quioc, DIT
