Formal Languages

Grammars: Generating Languages

2+3*4 = 14, not 20. Not a convention - a consequence of the fact that the arithmetic-expression grammar in C++, Python, and JavaScript unambiguously gives multiplication tighter binding than addition. With an ambiguous grammar, the same source file would produce different numbers on different machines. Grammars are not theory for its own sake: they are the precise mechanism without which no compiler works.

  • **Compilers:** the GCC/Clang parser builds an AST from the C/C++ CFG; ambiguity = a compilation error or incorrect code
  • **Protocols:** HTTP, SMTP, DNS are described by grammars in RFC documents (ABNF); ambiguity in a protocol = a security vulnerability
  • **Linguistics:** "I saw the man with the telescope" - natural language ambiguity, formalizable through parse trees

Backus, Naur, and the Birth of BNF

1959. John Backus of IBM and Peter Naur of Denmark created a formal notation for describing ALGOL 60 - the first programming language whose syntax was nailed down by precise rules. Backus-Naur Form (BNF) is productions in exactly the sense covered here. Before BNF, languages were described in natural language: "a variable is a letter optionally followed by letters and digits." Ambiguity, disagreements, incompatible implementations. BNF wiped all that out in one stroke. Every RFC, every language standard, every protocol spec written today descends from that notation.

Предварительные знания

  • The Chomsky Hierarchy

Productions: Substitution Rules

A formal language is a set of strings. How does anyone specify that set? Listing every string is impossible - there are infinitely many. A **grammar** does the job: a set of rules that generates every string of the language and only those. The JSON parser, Python interpreter, and C++ compiler all sit on top of this exact pattern: a grammar describes a language more precisely and compactly than any other tool.

**A formal grammar** G = (V, Sigma, R, S), where: **V** is a finite set of non-terminals (variables), **Sigma** is the terminal alphabet (V ∩ Sigma = empty), **R** is a finite set of productions (substitution rules), **S in V** is the start non-terminal. **Terminals** are the symbols of the final language (a, b, 0, 1). **Non-terminals** are auxiliary variables (S, A, B) that will not appear in the final string.

ComponentNotationExample (G for a^n b^n)Analogy
Non-terminalsV{S}Variables in an equation
TerminalsSigma{a, b}Final symbols (letters)
ProductionsR{S->aSb, S->ab}Substitution rules
Start symbolSSStarting point of derivation

**Notation convention:** non-terminals go in UPPERCASE (S, A, B, E, T, F), terminals in lowercase (a, b, 0, 1) or special symbols (+, *, brackets). The vertical bar | bundles alternatives: S -> aSb | ab means "S becomes aSb OR ab".

Grammar: S -> 0S1 | 01. What language does it generate?

Derivation: From S to a String

A grammar defines rules. A **derivation** is a sequence of rule applications that walks the start symbol S into a string of terminals. Each step: pick a non-terminal, pick a rule, substitute. The process stops when no non-terminals remain. Recursive descent parsers in GCC/Clang literally compute a leftmost derivation in real time.

**Derivation** - a sequence S => alpha_1 => alpha_2 => ... => w, where each step => applies one production. **Leftmost derivation:** at each step the leftmost non-terminal is replaced. **Rightmost derivation:** the rightmost non-terminal. Notation: =>_lm (leftmost), =>_rm (rightmost), =>* (zero or more steps).

**The language of a grammar** L(G) is the set of every terminal string derivable from S: L(G) = {w in Sigma* : S =>* w}. Two grammars are **equivalent** when L(G_1) = L(G_2). The same language can have many different grammars.

ConceptNotationMeaning
Single derivation stepalpha => betaOne non-terminal replaced
Zero or more stepsalpha =>* betaTransitive closure
Leftmost derivation=>_lmAlways the leftmost non-terminal
Rightmost derivation=>_rmAlways the rightmost non-terminal
Language of the grammarL(G){w in Sigma* : S =>* w}

S -> AB, A -> a, B -> b. Leftmost derivation of 'ab'?

Sentential Form

Every step of a derivation produces a string mixing terminals and non-terminals. Take E+T*F: E, T, F are still non-terminals, while + and * are already terminals. Such intermediate strings are called **sentential forms**. Not abstraction for its own sake - every step of recursive descent parsing produces exactly a sentential form.

**A sentential form** is any string alpha in (V union Sigma)* derivable from S: S =>* alpha. It may contain both terminals and non-terminals. If alpha contains only terminals (alpha in Sigma*), it is called a **sentence** of the language. L(G) is the set of all sentences.

StringTypeNon-terminals?Further derivation
SSentential FormYes (S)Must continue
E+TSentential FormYes (E, T)Must continue
n+T*FSentential FormYes (T, F)Must continue
n+n*nSentenceNoDerivation complete
xyzNot a sentential form-Not derivable from S

**A parse tree** visualizes a derivation. Root = S, internal nodes = non-terminals, leaves = terminals. Each application of rule A -> X_1 X_2...X_k attaches child nodes X_1, X_2, ..., X_k to node A. Read the leaves left to right and the derived string falls out.

The string 'aAbB' in a grammar with V={S,A,B}, Sigma={a,b} is:

Grammar Ambiguity

The grammar E -> E+E | E*E | n admits TWO different parse trees for "n+n*n". One reads it as (n+n)*n, the other as n+(n*n). Not an academic curiosity - an ambiguous grammar produces a non-deterministic compiler. The same source file yields different results depending on which tree the parser picks. That is why every industrial language designs unambiguous grammars from day one.

An **ambiguous grammar** is one in which there exists a string w in L(G) with two or more **distinct parse trees** (equivalently, two distinct leftmost derivations). A grammar is **unambiguous** if every string has exactly one parse tree.

ProblemAmbiguous grammarUnambiguous alternative
Operator precedenceE -> E+E | E*E | nE -> E+T | T; T -> T*F | F; F -> n
AssociativityE -> E-E | nE -> E-T | T (left assoc.)
Dangling elseS -> if E then S else S | if E then SSplit into matched/unmatched
ListsL -> L,L | itemL -> L,item | item (left recursion)

**How to kill ambiguity?** For expressions: introduce **precedence** (separate non-terminals E, T, F for +, *, atoms) and **associativity** (left recursion E -> E+T for left associativity). For if-else: split into "matched" and "unmatched" statements.

**Not every ambiguous grammar is fixable.** Some **inherently ambiguous languages** exist - languages for which ALL grammars are ambiguous. Example: L = {a^i b^j c^k : i=j or j=k}. Strings with i=j=k satisfy both rules, and no grammar can eliminate the ambiguity.

Any ambiguous grammar can be rewritten as an unambiguous one while preserving the language

There exist inherently ambiguous languages - context-free languages for which NO unambiguous CFG exists. Example: L = {a^i b^j c^k : i = j or j = k}

Strings of the form a^n b^n c^n (where i=j=k) are covered by two rules: i=j and j=k. Any grammar must have two paths for such strings. This is proved using the pumping lemma for context-free languages.

The grammar E -> E-E | n gives two parse trees for "n-n-n": (n-n)-n and n-(n-n). How is the grammar made unambiguous with left associativity?

Key Ideas

  • **Grammar G = (V, Sigma, R, S)** - a set of substitution rules for generating strings of a language
  • **Derivation S =>* w** - a sequence of production applications; leftmost derivation replaces the leftmost non-terminal
  • **Sentential form** - an intermediate string in a derivation; a sentence is a sentential form with no non-terminals
  • **An ambiguous grammar** allows two parse trees for one string; fixed by introducing precedence and associativity

Related Topics

Grammars are a tool for generating languages; automata are a tool for recognizing them:

  • The Chomsky Hierarchy — The type of grammar determines the class of language it generates
  • Operations on Languages — Union, concatenation, and closure - the algebra over grammars
  • LL and LR Parsers — Practical implementation of parsing based on CFGs

Вопросы для размышления

  • JSON grammar: object -> { pairs } | {}, pairs -> pair , pairs | pair, pair -> string : value. Is this grammar unambiguous? Why?
  • Why does Python not have the dangling else problem, while C and Java do? How does Python's syntax eliminate the ambiguity?
  • Revisiting 2+3*4: if the grammar were ambiguous, the compiler could produce 14 OR 20. What other real problems does ambiguity cause in compilers?

Связанные уроки

  • fl-02-chomsky — The Chomsky hierarchy defines the grammar classes studied here
  • fl-12-cfg — Context-free grammars develop the productions concept introduced here
  • fl-17-ll-lr — LL/LR parsers are the practical implementation of CFG parsing
  • comp-10-cfg — CFG in compilers - direct application of formal grammars to real languages
  • nlp-01 — NLP uses the same formal grammars for syntactic parsing of natural language
  • fl-11-pumping-lemma — The pumping lemma proves the limits of what grammars can generate
  • ml-04
Grammars: Generating Languages

0

1

Sign In