lex (tokenize) --> parse (grammars) --> semantic analysis -->
--> back end (usually code generation)
CFG are used to define the structure of things:
- Recursive structures
- "Context-free" means -- recursive subtrees self-contained
(contrast with context sensitive)
- Programming language constructs
- n.b., 1960's,70's excitement over (block-) structured languages
- equally useful to programmer and to compiler designer
- Natural languages
- Textbook uses noun-phrase, verb-phase examples
- Markup languages
- XML
- SGML
- DTD's, XML Schema, RelaxNG
- Database languages
- SQL (structured query language for relational db's like Sybase, Oracle)
- OQL (object query language - for object databases)
- XPath and XQuery (for XML Databases)
- BNF for defining Network Protocols
- RFC 822 (e-mail)
- SOAP
- CORBA
- A lexical analyzer's (tokenizer's) input is a string: a sequence of chars
- A parser's input is a sequence of tokens
- Parser gets input from tokenizer
- Parser (usually) outputs a tree
- Parser can also just output Yes/No, in which case used as a validator, not compiler (transducer)
![]() | Caution |
|---|---|
Your text gives many examples of parsers that take a sequence of characters as input. Although it is possible to write grammars and use parsers in this way, in practice the tokenizer reads characters and outputs tokens; the parser reads tokens generated by the tokenizer. |
Loosely we can thing of these things as equivalent (in computational power)
- parsing using a state machine (a DFA or NFA) equipped with a stack
- this is formally known as a deterministic PDA (push down automata)
- Parsing top-down
- Parsing bottom-up (although maybe slightly more powerful)
- Parsing with recursive descent
using a set of recursive routines written in your favorite programming lang.
- Computing a derivation of an input string
Top-down parsing
Recursive descent
Lookahead
FIRST, FOLLOW sets
LL grammars
LL(1)-ness
- eliminating left-recursion
L -> L , id | id
- eliminating common prefixes
E -> id := E | id (E)
Bottom-up parsing
Examples
- JJTree
- YACC
Note: left recursion is good!
LR handles
State machines can recognize handles (i.e., regular set)
Classes of Languages
CFL, ACFL, DCFL, LR, LL, ...
Traditionally this course covers regular sets (Chomsky Type 0), context-free (Chomsky Type 1) and then semantic formalisms higher in the Chomsky hierarchy. These clean separations of concern have historically helped complier designers implement tokenizer, parser, and sematic modules that are well abstracted.
Recent theoretical attention in formal language circles has been drawn to tree-automata. Tree automata don't quite fit the traditional Chomsky hierarchy. In between, they are somewhat more complex than Type 0 but not quite powerful enough to be considered capable of defining Type 1 languages. Tree automata have been useful in defining the formal semantics for XML standards, n.b., RelaxNG and XML Schema.
The student is advised that unless there is strong interest, lectures will follow the text material, which covers only the regular sets and CFL's and (unfortunately) makes no mention of tree automata.