Introduction to Formal Languages

Ebook · 357 pages · 4 hours

About this ebook

This highly technical introduction to formal languages in computer science covers all areas of mainstream formal language theory, including such topics as operations on languages, context-sensitive languages, automata, decidability, syntax analysis, derivation languages, and more. Geared toward advanced undergraduates and graduate students, the treatment examines mathematical topics related to mathematical logic, set theory, and linguistics. All subjects are integral to the theory of computation.
Numerous worked examples appear throughout the book, and end-of-chapter exercises enable readers to apply theory and methods to real-life problems. Elegant mathematical proofs are provided for almost all theorems.
Language: English
Release date: Mar 17, 2015
ISBN: 9780486169378

    Book preview

    Introduction to Formal Languages - György E. Révész


    PREFACE

    Formal languages have been playing a fundamental role in the progress of computer science. Formal notations in mathematics, logic, and other sciences had been developed long before the advent of electronic computers. The evolution of formal notations was slow enough to allow the selection of those patterns which appeared to be the most favorable for their users, but the computer revolution has resulted in an unprecedented proliferation of artificial languages. Given time, I presume, natural selection would take effect, but for now we have to face this modern story of the Tower of Babel.

    The purpose of a scientific theory is to bring order to the chaos, and this is what formal language theory is trying to do—with partial success. Unfortunately, the theory itself sometimes appears to be in serious disorder, and nobody knows for sure which way it will turn next time. One important feature, however, is apparent so far: there is a close relationship between formal grammars and other abstract notions used in computer science, such as automata and algorithms. Indeed, since the results in one theory can often be translated into another, it seems to be an arbitrary decision as to which interpretation is primary.

    In this book, formal grammars are given preferential treatment because they are probably the most commonly known of the various theories among computer scientists. This is due to the success of context-free grammars in describing the syntax of programming languages. For this reason, I have also changed the proofs of some of the classical theorems which are usually shown with the aid of automata (e.g., Theorems 3.9 and 3.12). In this way, the introduction of the latter notion is deferred until it really becomes necessary; at the same time a more uniform treatment of the subject is presented. The connection between grammars and automata is also emphasized by the similarity of the notation for derivation and reduction, respectively.

    I did not try to account for all relevant results. Instead, I wanted to give a fairly coherent picture of mainstream formal language theory, which, of course, cannot be totally unbiased. In some cases, I have included less commonly known results to indicate particular directions in the development of the theory. This is especially true of Chapter 10, which is the first comprehensive presentation of its specific topic. I would like to take this opportunity to express my thanks to Johnson M. Hart for his contribution to this book as coauthor of Chapter 10, of which he wrote Sections 10.1, 10.4, 10.5, and 10.6. His enthusiasm and encouragement with respect to the writing of the entire book are greatly appreciated.

    Chapters 1 through 9 can be used as the primary text for graduate or senior undergraduate courses in formal language theory. Automata theory and complexity theory are not studied here in depth. Nevertheless, the text provides sufficient background in those areas that they can be explored more specifically in other courses later on. The same is true for computability theory, that is, the theory of algorithms. In the areas of compiler design and programming language semantics, I feel that my book can have a favorable influence on the way of thinking about these subjects, though less directly.

    Some of the theorems, e.g., Theorems 3.10, 3.12, 4.4, 5.2, and 8.10, may be skipped at the first reading of the book without impairing the understanding of the rest. Worked examples in the text usually form an integral part of the presentation. Exercises at the end of chapters or larger sections serve two purposes: they help in understanding the theory, and they illustrate some of its applications. I have made every effort to simplify the proofs included in the book without compromising their mathematical rigor. A proof is not just a tool for convincing the skeptics but also an aid for better understanding the true nature of the result. I have therefore refrained from presenting theorems without proofs except in a very few cases.

    György Révész

    CHAPTER

    ONE

    THE NOTION OF FORMAL LANGUAGE

    1.1 BASIC CONCEPTS AND NOTATIONS

    A finite nonvoid set of arbitrary symbols (such as the letters used for writing in some natural language or the characters available on the keyboard of a typewriter) is called a finite alphabet and will usually be denoted by V. The elements of V are called letters or symbols and the finite strings of letters are called words over V. The set of all words over V is denoted by V*. The empty word, which contains no letters, will be denoted by λ and is considered to be in V* for every V.

    The word obtained by writing two words one after the other is called the catenation of the given words. The catenation of words is an associative operation, but in general it is noncommutative. Thus, if P and Q are in V* then their catenation PQ is usually different from QP, except when V contains only one letter. But for every P, Q, and R in V* the catenation of PQ and R is the same as the catenation of P and QR. Therefore, the resulting word can be written as PQR without parentheses. The set V* is obviously closed with respect to catenation, that is, the result PQ is always in V* whenever both P and Q are in V*. The empty word λ plays the role of the unit element for catenation, namely λP = Pλ = P for every P. (We shall assume that P is in V* for some given V even if we do not mention that.)

    The length of P, denoted by |P|, is simply the number of letters of P. Hence |λ| = 0 and |PQ| = |P| + |Q| for every P and Q. Two words are equal if one is a letter-by-letter copy of the other. The word P is a part of Q if there are words P1 and P2 such that Q = P1PP2. Further, if P ≠ λ and P ≠ Q, then it is a proper part of Q, and if P1 = λ or P2 = λ then P is a head (initial part) or a tail of Q, respectively.

    For a positive integer i and for an arbitrary word P we denote by Pⁱ the i-times iterated catenation of P (that is, i copies of P written in one word). By convention P⁰ = λ for every P. If, for example, P = ab, then P³ = ababab. (Note that we need parentheses to distinguish (ab)³ = ababab from ab³ = abbb.)

    The mirror image of P, denoted by P⁻¹, is the word obtained by writing the letters of P in the reverse order. Thus, if P = a1a2 ··· an then P⁻¹ = anan−1 ··· a1. Clearly, (P⁻¹)⁻¹ = P and (P⁻¹)ⁱ = (Pⁱ)⁻¹ for i = 0, 1,….

    An arbitrary set of words of V* is called a language and is usually denoted by L. The empty language containing no words at all is denoted by Ø. It should not be confused with the language {λ} containing a single word λ. The set V* without λ is denoted by V+. A language L ⊆ V* is finite if it contains a finite number of words, otherwise it is infinite. The complete language V* is always denumerably infinite. (The set of all subsets of V*, that is, the set of languages over a finite alphabet, is nondenumerable.)
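    These operations are easy to make concrete in executable form. The following Python sketch (the function names are mine, not the book's notation) models words over V as ordinary strings, with the empty string standing for λ:

        # Words over V = {"a", "b"} modeled as Python strings; "" is the empty word.

        def catenation(p: str, q: str) -> str:
            """PQ: write the two words one after the other (associative, noncommutative)."""
            return p + q

        def power(p: str, i: int) -> str:
            """P^i: the i-times iterated catenation of P; P^0 = "" by convention."""
            return p * i

        def mirror(p: str) -> str:
            """P^-1: the letters of P written in reverse order."""
            return p[::-1]

        assert power("ab", 3) == "ababab"                        # (ab)^3
        assert "a" + power("b", 3) == "abbb"                     # ab^3
        assert len(catenation("ab", "ba")) == 2 + 2              # |PQ| = |P| + |Q|
        assert mirror(mirror("abba")) == "abba"                  # (P^-1)^-1 = P
        assert mirror(power("ab", 2)) == power(mirror("ab"), 2)  # (P^i)^-1 = (P^-1)^i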

    The above notion of language is fairly general but not extremely practical. It includes all written natural languages as well as the artificial ones, but it does not tell us how to define a particular language. Naturally, we want finite—and possibly concise—descriptions for the mostly infinite languages we are dealing with. In some cases we may have a finite characterization via some simple property. If, for instance, V = {a, b} then

    are all well-defined languages. Or let Na(P) and Nb(P) denote the number of occurrences of a and b, respectively, in P. Then

    L5 = {P|P ∈ {a, b}+ and Na(P) = Nb(P)}

    is also well-defined. But we need other, more specific tools to define more realistic languages. For this purpose the notion of generative grammar will be introduced as follows:

    Definition 1.1 A generative grammar G is an ordered four-tuple (VN, VT, S, F), where VN and VT are finite alphabets with VN ∩ VT = Ø, S is a distinguished symbol of VN, and F is a finite set of ordered pairs (P, Q) such that P and Q are in (VN ∪ VT)* and P contains at least one symbol from VN.

    The symbols of VN are called nonterminal symbols or variables and will usually be denoted by capital letters. The symbols of VT are called terminal symbols and will be denoted by small letters. According to Definition 1.1 the sets VN and VT are disjoint in every grammar. The nonterminal symbol S is called the initial symbol and is used to start the derivations of the words of the language.

    The ordered pairs in F are called rewriting rules or productions and will be written in the form P → Q, where the symbol → is, of course, not in VN ∪ VT. Productions are used to derive new words from given ones by replacing a part equal to the left-hand side of a rule by the right-hand side of the same rule. The precise definitions are given below.

    Definition 1.2: Derivation in one step Given a grammar G = (VN, VT, S, F) and two words X, Y ∈ (VN ∪ VT)*, we say that Y is derivable from X in one step, in symbols X ⇒G Y, iff there are words P1 and P2 in (VN ∪ VT)* and a production P → Q in F such that X = P1PP2 and Y = P1QP2.

    Definition 1.3: Derivation Given a grammar G = (VN, VT, S, F) and two words X, Y in (VN ∪ VT)*, we say that Y is derivable from X, in symbols X ⇒*G Y, iff X = Y or there is some word Z in (VN ∪ VT)* such that X ⇒*G Z and Z ⇒G Y. We write X ⇒+G Y for a derivation which involves at least one step. The subscript G will usually be omitted when the context makes it clear which grammar is used.
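    Definitions 1.1 through 1.3 translate directly into code. The sketch below (all names are illustrative; symbols are modeled as single characters) represents a grammar as a four-tuple, computes the one-step relation X ⇒ Y, and tests X ⇒* Y by breadth-first search; since derivability is undecidable for type 0 grammars in general, the search is cut off after a fixed number of steps:

        from typing import NamedTuple

        class Grammar(NamedTuple):
            vn: frozenset   # nonterminal symbols VN
            vt: frozenset   # terminal symbols VT
            s: str          # initial symbol S, a member of VN
            f: tuple        # productions F as pairs (P, Q), i.e. rules P -> Q

        def one_step(g: Grammar, x: str) -> set:
            """All Y with X => Y: replace one occurrence of a left-hand side P
            in X by the corresponding right-hand side Q."""
            result = set()
            for p, q in g.f:
                start = 0
                while (i := x.find(p, start)) != -1:
                    result.add(x[:i] + q + x[i + len(p):])
                    start = i + 1
            return result

        def derivable(g: Grammar, x: str, y: str, max_steps: int = 10) -> bool:
            """Bounded test for X =>* Y (X = Y counts, as in Definition 1.3)."""
            frontier, seen = {x}, {x}
            for _ in range(max_steps):
                if y in frontier:
                    return True
                frontier = {z for w in frontier for z in one_step(g, w)} - seen
                seen |= frontier
            return y in frontier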

    Definition 1.4 The language generated by G is defined as

    L(G) = {P | S ⇒* P and P ∈ VT*}

    or in other words

    L(G) = {P ∈ VT* | S ⇒* P}

    This means that the language generated by G contains exactly those words which are derivable from the initial symbol S and contain only terminal symbols.

    Nonterminal symbols are used only as intermediary (auxiliary) symbols in the course of derivations. A derivation terminates when no more nonterminals are left in the word. (Note that according to Definition 1.1 the left-hand side of a rewriting rule must contain at least one nonterminal symbol.) A derivation aborts if there are nonterminals left in the word but there is no rewriting rule in F that can be applied to it.
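    Definition 1.4 then suggests a bounded enumerator: collect every sentential form reachable from S that consists of terminal symbols only. This is a sketch, not a decision procedure (L(G) may be infinite, so the search is depth-limited); it reuses Grammar and one_step from the previous sketch:

        def language(g: Grammar, max_steps: int = 8) -> set:
            """Words P with S =>+ P and P in VT*, reachable within max_steps steps."""
            words, seen, frontier = set(), {g.s}, {g.s}
            for _ in range(max_steps):
                frontier = {y for x in frontier for y in one_step(g, x)} - seen
                seen |= frontier
                words |= {w for w in frontier if all(c in g.vt for c in w)}
            return words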

    Example 1.1 Let G = (VN, VT, S, F) be a grammar where VN = {S, A, B}, VT = {a, b}, and the rules in F are as follows:

    We will show that this grammar generates the language (see L5 above)

    L = {P | P ∈ {a, b}+ and Na(P) = Nb(P)}

    PROOF It is easy to see that in each word of L(G) the number of a’s must be the same as that of b’s. Namely, in every word derivable from S the sum of the a’s and A’s is equal to that of the b’s and B’s. So we have established the inclusion L(G) ⊆ L.

    The reverse inclusion will be shown by induction on the number of occurrences of a (or b) in the words of L.

    Basis: Both ab and ba are in L(G).

    Induction step: Assume that every word in L having at most n occurrences of a does belong to L(G), and let P ∈ L have n + 1 occurrences of a (|P| = 2n + 2).

    First consider the case when P = aⁱbX and X ∈ {a, b}+. If i > 1 then take the shortest head of X, denoted by U1, such that Nb(U1) > Na(U1). Clearly, U1 = W1b for some word W1 with Nb(W1) = Na(W1). (Note that W1 may be λ.) This process can be repeated with the rest of X until we get the decomposition of P of the form

    P = aⁱbW1b ··· Wi−1bWi

    where each Wj (for j = 1,…,i) is either λ or it is in L. For Wj = λ we use the rule B → b, and for Wj in L we use the rule B → bS, after which Wj is derivable from S by the induction hypothesis.

    Because of symmetry (the roles of a and b can simply be exchanged), the case P = bⁱaY need not be discussed separately; this completes the proof.
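    The preview omits the production list of Example 1.1, so the rules below are an assumption: one standard grammar for L5 that is at least consistent with the rules B → b and B → bS cited in the proof. With the enumerator above we can spot-check that every generated word balances its a's and b's:

        # Assumed rule set for Example 1.1 (not necessarily the book's exact F);
        # B -> b and B -> bS are the two rules quoted in the proof above.
        g1 = Grammar(
            vn=frozenset("SAB"),
            vt=frozenset("ab"),
            s="S",
            f=(("S", "aB"), ("S", "bA"),
               ("A", "a"), ("A", "aS"), ("A", "bAA"),
               ("B", "b"), ("B", "bS"), ("B", "aBB")),
        )

        assert {"ab", "ba"} <= language(g1)      # the basis of the induction
        for w in language(g1):
            assert w.count("a") == w.count("b")  # Na(P) = Nb(P) for every word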

    Example 1.2 Let G = ({S, X, Y},{a, b, c}, S, F) where the rules in F are:

    We show that this grammar generates the language

    {aⁿbⁿcⁿ | n = 1, 2,…}

    PROOF First we show by induction on i that aⁱXbⁱcⁱ is derivable from S for every i ≥ 1. For i = 1 we can derive aXbc directly from S. Such a derivation can be continued only by applying the rule Xb → bX i times, then using Xc → Ybcc and applying bY → Yb i times, which gives aⁱYbⁱ⁺¹cⁱ⁺¹. Now we have two possibilities: the rule aY → aaX yields

    aⁱ⁺¹Xbⁱ⁺¹cⁱ⁺¹

    which was to be shown first, while the rule aY → aa yields

    aⁱ⁺¹bⁱ⁺¹cⁱ⁺¹.

    It can be seen, further, that no other words are derivable in this grammar, since at every point exactly one rule can be applied, except for the choice at aY in the above derivation. But the application of the rule aY → aa always terminates the derivation, so a derivation that continues can do so in only one way, which completes the proof.
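    Example 1.2 can be checked the same way. The rules Xb → bX, Xc → Ybcc, bY → Yb, aY → aaX, and aY → aa are quoted in the proof above; the two start rules S → abc and S → aXbc are my assumption, since the preview omits the rule list:

        g2 = Grammar(
            vn=frozenset("SXY"),
            vt=frozenset("abc"),
            s="S",
            f=(("S", "abc"), ("S", "aXbc"),   # assumed start rules
               ("Xb", "bX"), ("Xc", "Ybcc"),  # rules cited in the proof
               ("bY", "Yb"), ("aY", "aaX"), ("aY", "aa")),
        )

        found = language(g2, max_steps=12)
        assert "abc" in found and "aabbcc" in found
        for w in found:                        # only words a^n b^n c^n appear
            n = w.count("a")
            assert w == "a" * n + "b" * n + "c" * n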

    1.2 THE CHOMSKY HIERARCHY OF LANGUAGES

    As can be seen from the definitions given above, every grammar generates a unique language, but the same language can be generated by several grammars.

    Definition 1.5 Two grammars are called weakly equivalent (or, more simply, equivalent) if they generate the same language.

    Naturally, the question arises whether we can recognize equivalent grammars just by looking at them. In other words, is there any similarity between two equivalent grammars? Unfortunately, it is impossible to solve this problem in general. We can, however, classify our generative grammars on the basis of the forms of their production rules. The classification given below was introduced by N. Chomsky, the founder of the whole theory.

    Definition 1.6 A generative grammar G = (VN, VT, S, F) is said to be of type i if it satisfies the corresponding restrictions in this list:

    i = 0:  No restrictions.

    i = 1:  Every rewriting rule in F has the form Q1AQ2 → Q1PQ2, with Q1, Q2, and P in (VN ∪ VT)*, A ∈ VN, and P ≠ λ, except possibly for the rule S → λ, which may occur in F, in which case S does not occur on the right-hand side of any rule.

    i = 2:  Every rule in F has the form A → P, where A ∈ VN and P ∈ (VN ∪ VT)*.

    i = 3:  Every rule in F has the form either A → PB or A → P, where A, B ∈ VN and P ∈ VT*.

    A language is said to be of type i if it is generated by a type i grammar. The class (or family) of type i languages will be denoted by ℒi.

    Type 0 grammars are often called phrase structure grammars, which refers to their linguistic origin. Type 1 grammars are called context-sensitive, since each of their rules allows for replacing an occurrence of a nonterminal symbol A by the corresponding word P only in the context Q1, Q2. By contrast, the rules of a type 2 grammar are called context-free, as they allow for such replacements in any context. Type 3 grammars are called regular or finite state (the latter expression refers to their connection with finite automata, which will be discussed later).
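    Definition 1.6 amounts to a rule-by-rule syntactic check, so the classification can be computed. The sketch below (reusing Grammar, g1, and g2 from the earlier sketches; the S → λ exception of the type 1 case is omitted for brevity) returns the highest type whose restrictions every rule satisfies. Note that it tests the literal Q1AQ2 → Q1PQ2 form, so a merely length-preserving rule like Xb → bX from Example 1.2 does not qualify as context-sensitive under this strict reading:

        def rule_type3(g, lhs, rhs):
            """A -> PB or A -> P with A, B in VN and P in VT*."""
            if len(lhs) != 1 or lhs not in g.vn or not rhs:
                return False
            if rhs[-1] in g.vn:                        # A -> PB
                return all(c in g.vt for c in rhs[:-1])
            return all(c in g.vt for c in rhs)         # A -> P

        def rule_type2(g, lhs, rhs):
            """A -> P with A a single nonterminal."""
            return len(lhs) == 1 and lhs in g.vn

        def rule_type1(g, lhs, rhs):
            """Q1AQ2 -> Q1PQ2 with A in VN and P nonempty
            (the S -> lambda exception is ignored in this sketch)."""
            for i, a in enumerate(lhs):
                if a in g.vn:
                    q1, q2 = lhs[:i], lhs[i + 1:]
                    if (len(rhs) - len(q1) - len(q2) >= 1
                            and rhs.startswith(q1) and rhs.endswith(q2)):
                        return True
            return False

        def grammar_type(g):
            """Highest i in {0,...,3} whose restrictions every rule in F satisfies."""
            for i, test in ((3, rule_type3), (2, rule_type2), (1, rule_type1)):
                if all(test(g, l, r) for l, r in g.f):
                    return i
            return 0

        assert grammar_type(g1) == 2   # rules like A -> bAA are context-free, not regular
        assert grammar_type(g2) == 0   # Xb -> bX is not of the Q1AQ2 -> Q1PQ2 form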

    We can illustrate the linguistic background with this example. Consider the following sentence:

    The tall boy quickly answered the mathematical question.

    Now, let us denote a sentence by S, a noun phrase by A, a verb phrase by B, an article by C, an adjective by D, a noun by E, an adverb by G, and a verb by F. Then take the following productions:

    where the + sign represents the space as a terminal symbol. It is easy to see that the given sentence can be derived from S with the aid of these rules. (Note that the words tall and mathematical can be exchanged in the sentence without destroying its syntax, though its meaning will be slightly different.) Thus, we can derive syntactically correct sentences with the aid of such grammars. But the English language, just like any other natural language, contains much more complicated sentences, too. It is, therefore, impossible to describe the complete syntax of a natural language using only such simple rules. Nevertheless, generative grammars are useful tools for studying the basic structures of sentences of natural languages. Artificial languages, on the other hand, can be defined in such a way that they are generated by generative grammars.
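    To see such a derivation mechanically, here is a hedged Python reconstruction; the book's actual production list is not reproduced in this preview, so the rules below are assembled from the category names in the text and should be read as an assumption. Enumerating the generated sentences also exhibits the variant with tall and mathematical exchanged:

        # Assumed productions, built from the categories named in the text
        # (S sentence, A noun phrase, B verb phrase, C article, D adjective,
        # E noun, F verb, G adverb); "+" marks the space between words.
        rules = {
            "S": ["A+B"],
            "A": ["C+D+E"],
            "B": ["G+F+A"],
            "C": ["the"],
            "D": ["tall", "mathematical"],
            "E": ["boy", "question"],
            "F": ["answered"],
            "G": ["quickly"],
        }

        def expand(symbol):
            """All terminal strings derivable from symbol (a finite set here)."""
            if symbol not in rules:              # a terminal word
                return [symbol]
            sentences = []
            for rhs in rules[symbol]:
                choices = [""]
                for part in rhs.split("+"):
                    choices = [c + (" " if c else "") + w
                               for c in choices for w in expand(part)]
                sentences.extend(choices)
            return sentences

        assert "the tall boy quickly answered the mathematical question" in expand("S")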

    Returning to the classification of formal languages, it is obvious from Definition 1.6 that every type 3 language is also of type 2 and every type 1 language is also of type 0; indeed, every language generated by a grammar is trivially of type 0 at the same time.

    This means that

    ℒ3 ⊆ ℒ2 and ℒ1 ⊆ ℒ0

    is also true and that we have here a proper hierarchy of language classes

    ℒ3 ⊂ ℒ2 ⊂ ℒ1 ⊂ ℒ0

    where all the inclusions are proper. The latter assertion is hardly trivial, since every language can be generated by several grammars which need not necessarily be of the same type.

    As far as applications are concerned, context-free grammars are widely used in computer science for defining programming languages like FORTRAN, ALGOL, COBOL, PASCAL, etc. Context-sensitive grammars
