PPT Week 2 -Lexical Analysis

The document discusses regular expressions and their role in lexical analysis within programming languages. It defines regular expressions, finite automata, and the process of tokenizing input strings into syntactic categories. Additionally, it addresses the ambiguities in token recognition and the importance of using a systematic approach to resolve them.

Uploaded by

Edrian Rodriguez

Week 2

MSCS 7103
Theories of Programming Languages
Regular Expressions

Lexical Analysis
Week 2 MSCS 7103: Theories of Programming Languages

Regular Expressions

Program Translation Flowchart



Regular expressions
A way to specify sets of strings in order to describe tokens

Lexical analysis
Turns a stream of characters into a stream of tokens

Finite Automata
An abstract machine that recognizes patterns, that is, the strings of a language, over some character set (or alphabet)

Languages

Definition:

Let Σ be a set of characters.

A language over Σ is a set of strings of characters drawn from Σ.

Σ is called the alphabet.



Examples:
Alphabet = English Characters

Language = English Sentences


Note: Not every string of English characters is an English sentence.
Example: xayenb sbe'

Alphabet = ASCII characters

Language = C Programs
Note: ASCII character set is different from English character set.

Regular Expressions
A notation for specifying regular languages

Definition:
Base cases: ∅, ϵ, and a are regular expressions, where a ∈ Σ.
Induction: If A and B are regular expressions, then
A|B is a regular expression,
AB is a regular expression,
(A) is a regular expression,
A∗ is a regular expression.

Precedence (highest to lowest): Kleene star (∗), concatenation, union (|)


Parentheses indicate grouping
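
The precedence rule can be checked against a practical engine. The following is a minimal sketch, assuming Python's `re` module uses the same precedence as the definition above. The helper `same_language` is a hypothetical name, and it only compares two patterns on a finite sample of strings, so it is evidence rather than a proof.

```python
import re
from itertools import product

def same_language(p1, p2, alphabet="abc", max_len=4):
    """Compare two patterns on every string over `alphabet` up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p1, s)) != bool(re.fullmatch(p2, s)):
                return False
    return True

# ab|c* groups as (ab)|(c*): star binds tightest, union loosest.
print(same_language(r"ab|c*", r"(ab)|(c*)"))  # True
print(same_language(r"ab|c*", r"a(b|c)*"))    # False
print(same_language(r"ab|c*", r"(ab|c)*"))    # False
```

Parentheses are what turn the first pattern into either of the other two.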

Base Regular Expressions

• Single character: 'c'
– L('c') = { “c” } (for any c ∈ Σ)
• Concatenation: AB
– A and B are other regular expressions
– L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
• Example: L('i' 'f') = { “if” }
– We abbreviate 'i' 'f' as 'if'

Compound Regular Expressions

Union
– L(A | B) = { s | s ∈L(A) or s ∈L(B) }

Examples:
– L('if' | 'then' | 'else') = { “if”, “then”, “else” }
– L('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9') = what?
– L( ('0'|'1') ('0'|'1') ) = { “00”, “01”, “10”, “11” }
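
The set definitions of L for single characters, concatenation, and union can be written out directly for finite languages. This is a small sketch, assuming languages are modelled as Python sets of strings; the function names `char`, `concat`, and `union` are illustrative only, and Kleene star is left out because its language is infinite.

```python
def char(c):
    """L('c') = { "c" }"""
    return {c}

def concat(A, B):
    """L(AB) = { ab | a in L(A) and b in L(B) }"""
    return {a + b for a in A for b in B}

def union(A, B):
    """L(A | B) = { s | s in L(A) or s in L(B) }"""
    return A | B

bit = union(char("0"), char("1"))
print(sorted(concat(bit, bit)))       # ['00', '01', '10', '11']
print(concat(char("i"), char("f")))   # {'if'}
```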


More Examples

(0|1)∗: all binary strings
(0|1)∗0: all binary strings that end in 0
(0|1)00∗: all binary strings that start with 0 or 1, followed by one or more 0s
0|1(0|1)∗: all binary numbers without leading 0s

What regular expression describes the set L of binary strings that do not
contain 101 as a substring?
1001001110 ∈ L
00010010100 ∉ L
(0|ϵ)(1|000∗)∗(0|ϵ)
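
One way to gain confidence in this expression is to compare it with the plain substring test on every short binary string. The sketch below assumes Python's `re` module, writes ϵ as an empty alternative, and uses the hypothetical helper name `agrees_up_to`; an exhaustive check up to a fixed length is supporting evidence, not a proof.

```python
import re
from itertools import product

# The expression above, with the empty string written as an empty alternative.
NO_101 = re.compile(r"(0|)(1|000*)*(0|)")

def agrees_up_to(max_len):
    """True if the regexp and the substring test agree on all strings up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if bool(NO_101.fullmatch(s)) != ("101" not in s):
                return False
    return True

print(agrees_up_to(10))                         # True
print(bool(NO_101.fullmatch("1001001110")))     # True
print(bool(NO_101.fullmatch("00010010100")))    # False
```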

In practice (1):

No ϵ or ∅:

The empty string is represented as the empty string: a(b|) instead of a(b|ϵ) to express the language {a, ab}.

The empty language is not very useful in practice.


Additional repetition constructs:

R+ = RR∗: one or more repetitions of R

R? = (R|): zero or one repetition of R

R{n}, R{,n}, R{m,}, R{m,n}: exactly n, up to n, at least m, between m and n repetitions of R

Some capabilities beyond regular languages:
Allow, for example, recognition of languages of the form αβα, where the second α must repeat the first (α, β ∈ Σ∗), typically via backreferences.
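
These constructs can be tried out in a practical engine. The sketch below assumes Python's `re` syntax, which supports all of the forms listed above; the helper `matches` and the sample patterns are illustrative choices, not part of the slides.

```python
import re

def matches(pattern, s):
    return re.fullmatch(pattern, s) is not None

# For each construct, list the lengths n (0..5) for which "a" * n is accepted.
print([n for n in range(6) if matches(r"a+", "a" * n)])      # [1, 2, 3, 4, 5]
print([n for n in range(6) if matches(r"a?", "a" * n)])      # [0, 1]
print([n for n in range(6) if matches(r"a{3}", "a" * n)])    # [3]
print([n for n in range(6) if matches(r"a{,2}", "a" * n)])   # [0, 1, 2]
print([n for n in range(6) if matches(r"a{2,}", "a" * n)])   # [2, 3, 4, 5]
print([n for n in range(6) if matches(r"a{2,4}", "a" * n)])  # [2, 3, 4]

# Backreference: \1 must repeat exactly what group 1 matched.
print(matches(r"([ab]+)c\1", "abcab"))  # True
print(matches(r"([ab]+)c\1", "abcba"))  # False
```

The last pattern requires the same string on both sides of the c, which is the kind of condition that goes beyond regular languages.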

In practice (2):
Character classes allow us to write tedious expressions such as a|b| · · · |z more easily.
Examples:
“Recent” years:
199(6|7|8|9)|20(0(0|1|2|3|4|5|6|7|8|9)|1(0|1|2|3|4|5|6|7|8))
→ 199[6-9]|20(0[0-9]|1[0-8])
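
That the character-class version accepts the same years as the long version can be checked mechanically. This is a small sketch assuming Python's `re` module; the names `LONG`, `SHORT`, and `accepted` are illustrative, and the test range 1900 to 2099 is an arbitrary choice.

```python
import re

LONG = r"199(6|7|8|9)|20(0(0|1|2|3|4|5|6|7|8|9)|1(0|1|2|3|4|5|6|7|8))"
SHORT = r"199[6-9]|20(0[0-9]|1[0-8])"

def accepted(pattern):
    """Years between 1900 and 2099 whose decimal form matches the pattern."""
    return [y for y in range(1900, 2100) if re.fullmatch(pattern, str(y))]

print(accepted(LONG) == accepted(SHORT))          # True
print(accepted(SHORT)[0], accepted(SHORT)[-1])    # 1996 2018
```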


Identifier in C:
(a|b| · · · |z|A|B| · · · |Z|_)(a|b| · · · |z|A|B| · · · |Z|0|1| · · · |9|_)∗
→ [a-zA-Z_][a-zA-Z0-9_]∗
Anything but a lowercase letter: [^a-z]
Any character: . (in many engines, any character except a newline)
Digit, non-digit: \d, \D
Whitespace, non-whitespace: \s, \S
Word character, non-word character: \w, \W
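
The classes above behave as follows in one concrete engine. This sketch assumes Python's `re` module, where `\d`, `\s`, and `\w` also match non-ASCII Unicode characters by default for text patterns; the name `C_IDENT` and the sample strings are illustrative.

```python
import re

C_IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

print([s for s in ["x", "_tmp1", "2fast", "a-b"] if C_IDENT.fullmatch(s)])
# ['x', '_tmp1']

print(re.findall(r"\d+", "z = 10; y = 42"))        # ['10', '42']
print(re.split(r"\s+", "if (i == j)\n\tz = 0;"))   # ['if', '(i', '==', 'j)', 'z', '=', '0;']
print(bool(re.fullmatch(r"[^a-z]", "Q")))          # True
print(bool(re.fullmatch(r"[^a-z]", "q")))          # False
print(bool(re.fullmatch(r".", "+")))               # True: '.' is not limited to letters
```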

Summary:

Regular expressions describe many useful languages

Given a string s and a regexp R, is s ∈ L(R)?

But a yes/no answer is not enough!

Instead: partition the input into lexemes


We will adapt regular expressions to this goal

Lexical Analysis

Lexical Analysis Definition:

What do we want to do? Example:


if (i == j)
    z = 0;
else
    z = 1;
The input is just a sequence of characters:
if (i == j)\n\tz = 0;\nelse\n\tz = 1;
Goal: partition input strings into substrings
– And classify them according to their role


Output of lexical analysis is a list of tokens

A token is a syntactic category

In English:
noun, verb, adjective, ...

In a programming language:
Identifier, Integer, Keyword, Whitespace, ...
Parser relies on token distinctions:
e.g., identifiers are treated differently than keywords

Tokens: correspond to sets of strings

Identifier: strings of letters or digits, starting with a letter

Integer: a non-empty string of digits

Keyword: “else” or “if” or “begin” or ...

Whitespace: a non-empty sequence of blanks, newlines, and/or tabs

OpenPar: a left-parenthesis

Lexical Analyzer:

An implementation must do two things:

Recognize substrings corresponding to tokens

Return the value or lexeme of the token


– The lexeme is the substring


Lexical Specifications (1):

Select a set of tokens


Number, Keyword, Identifier, ...
Write a regexp for the lexemes of each token
Number = digit+
Keyword = 'if' | 'else' | ...
Identifier = letter ( letter | digit ) *
OpenPar = '('
– ...
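
A specification of this kind can be written down as a list of named patterns. The sketch below assumes Python's `re` syntax, ASCII letters and digits, and only the two keywords named above; `TOKEN_SPEC` and `classify` are hypothetical names.

```python
import re

# One regexp per token, mirroring the definitions above.
TOKEN_SPEC = [
    ("Number",     r"[0-9]+"),                  # digit+
    ("Keyword",    r"if|else"),                 # only the keywords listed above
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),    # letter (letter | digit)*
    ("OpenPar",    r"\("),
]

def classify(lexeme):
    """Return every token whose regexp matches the whole lexeme."""
    return [name for name, pat in TOKEN_SPEC if re.fullmatch(pat, lexeme)]

print(classify("42"))    # ['Number']
print(classify("x1"))    # ['Identifier']
print(classify("("))     # ['OpenPar']
print(classify("else"))  # ['Keyword', 'Identifier']
```

The last call shows that one lexeme can belong to more than one token's language.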


Lexical Specifications (2):

Construct R, matching all lexemes for all tokens:


R = Keyword | Identifier | Number | ...

R = R1 | R2| R3 | ...

Fact: if s ∈L(R) then s is a lexeme


Furthermore, s ∈L(Rj) for some j
This j determines the token that is reported


Lexical Specifications (3):

Let the input be x1 ... xn

Each xi is in the alphabet Σ
For 1 ≤ i ≤ n, check
– x1...xi ∈ L(R)
If so, it must be that
x1...xi ∈ L(Rj) for some j
Remove x1...xi from the input and restart
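
The prefix check can be sketched by testing every prefix against every Rj. This is a deliberately naive illustration, assuming Python's `re` module and a small rule set based on the Number, Identifier, and OpenPar definitions; `RULES` and `matching_prefixes` are hypothetical names.

```python
import re

RULES = [
    ("Number",     r"[0-9]+"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("OpenPar",    r"\("),
]

def matching_prefixes(text):
    """List (i, token) for every prefix x1...xi that lies in some L(Rj)."""
    hits = []
    for i in range(1, len(text) + 1):
        for name, pat in RULES:
            if re.fullmatch(pat, text[:i]):
                hits.append((i, name))
    return hits

print(matching_prefixes("(x"))    # [(1, 'OpenPar')]
print(matching_prefixes("x12("))  # [(1, 'Identifier'), (2, 'Identifier'), (3, 'Identifier')]
```

The second call shows that the check can succeed for several values of i, so the procedure as stated does not yet say which prefix to remove.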


Lexing Example:

R = Whitespace | Integer | Identifier | Plus

Parse “f +3 +g”
“f” matches R, more precisely Identifier
“ ” matches R, more precisely Whitespace
“+” matches R, more precisely Plus
– ...

The token-lexeme pairs are

<Identifier, “f”>
<Whitespace, “ ”>
<Plus, “+”>
– ...

Ambiguities (1):

There are ambiguities in the algorithm

Example:

R = Whitespace | Integer | Identifier | Plus


Parse “foo+3”
“f” matches R, more precisely Identifier
But also “fo” matches R, and “foo”, but not “foo+”

How much input is used?


Maximal Munch rule: Pick the longest possible substring that matches R
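
Maximal munch can be sketched by asking each rule for its match at the current position and keeping the longest one. The code assumes Python's `re` module and relies on greedy repetition giving the longest match for these simple patterns; it does not assume that `|` inside a single Python pattern picks the longest alternative. `RULES` and `longest_match` are illustrative names.

```python
import re

RULES = [
    ("Whitespace", re.compile(r"[ \t\n]+")),
    ("Integer",    re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Plus",       re.compile(r"\+")),
]

def longest_match(text, pos=0):
    """Return (token, lexeme) for the longest prefix of text[pos:] matched by any rule."""
    best = None
    for name, pat in RULES:
        m = pat.match(text, pos)  # anchored at pos, greedy within the rule
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(longest_match("foo+3"))     # ('Identifier', 'foo')
print(longest_match("foo+3", 3))  # ('Plus', '+')
print(longest_match("foo+3", 4))  # ('Integer', '3')
```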

Ambiguities (2):

R = Whitespace | 'new' | Integer | Identifier

Parse “new foo”


“new” matches R, more precisely 'new'
but also Identifier – which one do we pick?

In general, use the rule listed first.

So we must list 'new' (and other keywords) before Identifier.
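
Both tie-breaking rules fit into one small loop: take the longest match, and among equally long matches keep the rule listed first. The following is a minimal sketch under those two assumptions, using Python's `re` module; the token label `New` and the function `tokenize` are hypothetical names, not a definitive implementation.

```python
import re

# Order matters: 'new' is listed before Identifier so that it wins ties.
RULES = [
    ("Whitespace", re.compile(r"[ \t\n]+")),
    ("New",        re.compile(r"new")),
    ("Integer",    re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pat in RULES:
            m = pat.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

print(tokenize("new foo"))
# [('New', 'new'), ('Whitespace', ' '), ('Identifier', 'foo')]
print(tokenize("newest 42"))
# [('Identifier', 'newest'), ('Whitespace', ' '), ('Integer', '42')]
```

In the second input, "newest" is reported as an Identifier because the longer match takes precedence over rule order.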


Summary:

Regular expressions provide a concise notation for string patterns

Their use in lexical analysis requires small extensions

To resolve ambiguities
To handle errors
Good algorithms known (next)
Requiring only a single pass over the input
And few operations per character (table lookup)
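
The slides defer the algorithms themselves, so the following is only an illustration of what "table lookup" per character can look like: a hand-built deterministic finite automaton for (0|1)∗0, the binary strings that end in 0. The table, state numbering, and function name are assumptions made for this sketch.

```python
# State 0: no input yet, or the last symbol was '1'.
# State 1: the last symbol was '0' (accepting).
TRANSITIONS = {
    (0, "0"): 1, (0, "1"): 0,
    (1, "0"): 1, (1, "1"): 0,
}
START, ACCEPTING = 0, {1}

def accepts(s):
    """Single pass over the input, one table lookup per character."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # character outside the alphabet {0, 1}
            return False
    return state in ACCEPTING

print(accepts("10"))   # True
print(accepts("101"))  # False
print(accepts(""))     # False
```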

Prepared by
Mary Ann F. Quioc, DIT
