A token is a legal sequence of characters. A phrase is a potentially meaningful sequence of tokens. For example, the following token sequences are potentially meaningful:
Green ideas sleep furiously.
They are flying planes.
2 + 3 < 4
These token sequences are not potentially meaningful:
Mr. Jones opened the
2 + 3 <
But what is the meaning of "meaningful"? In many programming languages, there are three basic categories of meaning: calculations that produce answers (values), declarations that produce associations between names and values (bindings), and commands that update variables.
Phrases(L) = all legal phrases of language L
           = Expressions(L) U Declarations(L) U Commands(L)

Expressions(L) = phrases that describe calculations
   example: 3.14 * square(r)

Declarations(L) = phrases that introduce names
   example: float pi = 3.14

Commands(L) = phrases that control memory or execution order
   example: while(i < MAX) total += value[i++];
Computers will demand a more precise way of defining Phrases(L). For this we turn to formal grammars:
Phrases(L) = L[G]
           = all sequences of tokens that can be derived from grammar G
An EBNF (Extended Backus-Naur Form) grammar is a list of rules of the form:
<Phrase> ::= PATTERN | PATTERN | ...
where PATTERN is a regular expression involving tokens and variables.
The main difference between regular expressions and EBNF rules is the possibility of self-reference, or recursion: the right-hand side of an EBNF rule is a regular expression that may involve the non-terminal symbol on the left-hand side of the rule. For example, the language of balanced parentheses is given by the rule:
<Paren> ::= (<Paren>) | <Number>
Here are some sample phrases from this language:
L[<Paren>] = { 0, (0), ((0)), (((0))), etc. }
Compare this with the language of token sequences matching the pattern that simply says "a bunch of left parentheses, followed by a number, followed by a bunch of right parentheses":
L[(*<Number>)*] = { 0, (0, ((0, 0), 0)))), ((0, etc. }
It can be proved that no regular expression can demand that parentheses be balanced.
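A recursive procedure, by contrast, can check balance easily. Here is a minimal sketch in Java; the class, the isParen method, and its index-based interface are all hypothetical conveniences, not part of these notes:

import java.util.Arrays;
import java.util.List;

class ParenRecognizer {
    // Returns true if tokens[from..to] can be derived from <Paren>.
    static boolean isParen(List<String> tokens, int from, int to) {
        if (from > to) return false;
        if (from == to)                          // base case: a <Number> token
            return tokens.get(from).matches("\\d+");
        return tokens.get(from).equals("(")      // recursive case: (<Paren>)
            && tokens.get(to).equals(")")
            && isParen(tokens, from + 1, to - 1);
    }

    public static void main(String[] args) {
        List<String> ok  = Arrays.asList("(", "(", "0", ")", ")");
        List<String> bad = Arrays.asList("(", "(", "0", ")");
        System.out.println(isParen(ok, 0, ok.size() - 1));   // true
        System.out.println(isParen(bad, 0, bad.size() - 1)); // false
    }
}

Notice that the recursion in isParen mirrors the recursion in the rule; it is exactly this self-reference that regular expressions lack.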
Of course we can use EBNF rules instead of regular expressions:
<Digits> ::= <Digit> | <Digit><Digits>
But this is overkill compared to the non-recursive rule:
<Digits> ::= <Digit>+
Here's a grammar, G, for a programming language called Simple:
<Operator> ::= + | - | * | / | == | < | <= | > | >=
<Number> ::= (+ | -)?<Digit>+(.<Digit>+)?
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<Symbol> ::= <Letter>(<Letter> | <Digit>)*
<Expression> ::= <Expression> <Operator> <Expression> | <Number> | <Symbol>
<Declaration> ::= var <Symbol> = <Expression>
<Command> ::= <Assignment> | <Block> | <Selection> | <Iteration>
<Assignment> ::= <Symbol> = <Expression>
<Block> ::= { <Command> (;<Command>)* }
<Selection> ::= if (<Expression>) <Command> (else <Command>)?
<Iteration> ::= while(<Expression>) <Command>
<Phrase> ::= <Expression> | <Declaration> | <Command>
Here are some sample phrases taken from L[G]:
var x = 10
var accum = 0
while (0 < x) { accum = accum + x; x = x - 1 }
A parser uses the EBNF rules of a grammar to build a tree, called a parse tree, from a list of tokens. If this isn't possible, it throws a syntax error:
ParseTree parse(TokenList tokens, Grammar g) throws ParseError {
...
}
A tree can be represented as a nested list of the form:
(ROOT CHILD_1 CHILD_2 ... CHILD_n)
Where ROOT is the label of the root node of the tree, and CHILD_i is either a sub-tree or the label of a leaf node.
Parent nodes of parse trees are labeled by non-terminal symbols (i.e., symbols that appear on the left-hand side of an EBNF rule), leaf nodes are labeled by tokens, and child nodes are labeled by instances of the patterns that appear on the right-hand side of a rule.
For example, the parse tree of the declaration:
var x = y + 5
would look like this:
(<Phrase>
   (<Declaration>
      var
      (<Symbol> x)
      =
      (<Expression>
         (<Expression> (<Symbol> y))
         (<Operator> +)
         (<Expression> (<Number> (<Digit> 5))))))
The parse tree for the command:
while(x < y) x = x + 1
would look like this:
(<Phrase>
   (<Command>
      (<Iteration>
         while
         (
         (<Expression> "x < y")
         )
         (<Command>
            (<Assignment> "x = x + 1")))))
Notice that to save space we shortened some of the sub-trees by putting the leaves in quotes.
The expression:
3 + 2 * 5
is generated by the rules:
<Expression> ::= <Expression> <Operator> <Expression> | <Number>
<Operator> ::= + | * | - | /
But notice that this expression is ambiguous. It can be interpreted either as:
(3 + 2) * 5 = 25
or as:
3 + (2 * 5) = 13
Actually, these different interpretations correspond to different parse trees:
(<Expression>
   (<Expression>
      (<Expression> (<Number> 3))
      +
      (<Expression> (<Number> 2)))
   *
   (<Expression> (<Number> 5)))
versus:
(<Expression>
   (<Expression> (<Number> 3))
   +
   (<Expression>
      (<Expression> (<Number> 2))
      *
      (<Expression> (<Number> 5))))
The problem is that the rule itself is ambiguous. We could try to replace it with an unambiguous rule:
<Expression> ::= (<Expression> <Operator>)?<Number>
Now only the first tree is possible.
Notice that our new grammar implies that operators are left-to-right associative. This means that when evaluating any expression involving two or more operators, the operations are performed from left to right.
Most languages associate precedence levels with groups of operators. For example, it is common to declare that while all four operators (+, *, -, /) are left-to-right associative, * and / have higher precedence than + and -. Thus, the expression:
2 - 4 + 3 * 5 = (2 - 4) + (3 * 5) = (2 - 4) + 15 = -2 + 15 = 13
We can also adjust our grammar to assert that * has higher precedence than +:
<Expression> ::= <Sum>
<Sum> ::= (<Product> +)*<Product>
<Product> ::= (<Number> *)*<Number>
Now "3 + 2 * 5" can only be parsed as:
(<Expression>
   (<Sum>
      (<Product> (<Number> 3))
      +
      (<Product>
         (<Number> 2)
         *
         (<Number> 5))))
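Grammars of this shape translate directly into code. Here is a small sketch of a recursive-descent evaluator for the <Sum> and <Product> rules; the class and method names are hypothetical, and tokens are modeled as a mutable list of strings. Because parseSum calls parseProduct for its operands, * automatically binds tighter than +:

import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class SumProductEval {
    // <Sum> ::= (<Product> +)*<Product>
    static double parseSum(List<String> tokens) {
        double result = parseProduct(tokens);
        while (!tokens.isEmpty() && tokens.get(0).equals("+")) {
            tokens.remove(0); // discard "+"
            result += parseProduct(tokens);
        }
        return result;
    }

    // <Product> ::= (<Number> *)*<Number>
    static double parseProduct(List<String> tokens) {
        double result = Double.parseDouble(tokens.remove(0));
        while (!tokens.isEmpty() && tokens.get(0).equals("*")) {
            tokens.remove(0); // discard "*"
            result *= Double.parseDouble(tokens.remove(0));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens =
            new LinkedList<>(Arrays.asList("3", "+", "2", "*", "5"));
        System.out.println(parseSum(tokens)); // prints 13.0, not 25.0
    }
}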
Of course, we can always add parentheses to the language so that programmers can override the built-in precedence and associativity rules:
<Product> ::= (<Expression> *)*<Number>
<Expression> ::= (<Expression>) | <Sum>
Now the expression "(3 + 2) * 5" must be parsed as:
(<Expression>
   (<Sum>
      (<Product>
         (<Expression>
            '('
            (<Expression>
               (<Sum>
                  (<Product> (<Number> 3))
                  +
                  (<Product> (<Number> 2))))
            ')')
         *
         (<Number> 5))))
Sadly, it is a result of theoretical computer science that the following problem cannot be solved by a computer:
Given an arbitrary EBNF grammar G, find a new EBNF grammar G' such that
1. L[G] = L[G']
2. G' is unambiguous
Parse trees are cumbersome because they contain all of the syntactic details of the underlying language. In fact, phrases are cumbersome for the same reason. For example, C and Pascal both contain conditional commands:
if (C) S1 else S2
if C then S1 else S2
Semantically, both commands are identical; they differ only in minor syntactic details, or syntactic sugar coating.
The idea behind abstract syntax is to dispense with the syntactic sugar coating and simply represent phrases as abstract syntax trees (ASTs). An AST for expressions has the form:
(<Operator> <Expression>+)
For example, assuming * has higher precedence than +, the AST corresponding to "3 + 2 * 5" is:
(+ 3 (* 2 5))
An AST for a conditional command has the form:
(if <Expression> <Command> <Command>?)
Both the C and Pascal versions of our conditional statement have the same AST:
(if C S1 S2)
We can define a grammar for Simple ASTs:
<Phrase> ::= <Expression> | <Declaration> | <Command>
<Expression> ::= (<Operator> <Expression>+)
<Declaration> ::= (var <Symbol> <Expression>)
<Command> ::= <Assignment> | <Block> | <Selection> | <Iteration>
<Assignment> ::= (= <Symbol> <Expression>)
<Block> ::= (begin <Command>+)
<Selection> ::= (if <Expression> <Command> <Command>?)
<Iteration> ::= (while <Expression> <Command>)
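For example, the while command from our earlier Simple program, while (0 < x) { accum = accum + x; x = x - 1 }, has the AST:

(while (< 0 x)
   (begin
      (= accum (+ accum x))
      (= x (- x 1))))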
Some languages, like LISP and Omega, adopt ASTs as their official syntax. While this makes programs difficult to read and write (programmers are always balancing parentheses), it makes parsers easy to write (because the programmer has already done most of the work). In list-processing languages like LISP, the phrases of the language can also be treated as lists (i.e., as data). This means we can write LISP programs that analyze and generate other LISP programs. (This is called meta-programming.)
Given the following grammar:
<Exp> ::= <Exp> + <Exp> | 1
and the following expression:
e = 1 + 1 + 1
a. Write at least two distinct parse trees for e.
b. Rewrite these parse trees as abstract syntax trees.
c. Rewrite <Exp> to eliminate the ambiguity by forcing left-to-right associativity.
d. Rewrite <Exp> to eliminate the ambiguity by forcing right-to-left associativity.
Given the following grammar:
<Exp> ::= <Exp> (+ | *) <Exp> | 0 | 1 | 2
and the following expression:
e = 1 + 1 * 2
In this section we will assume a scanner provides us with a list of tokens. The List interface is declared in the java.util package. Here are the methods we will use:
interface List extends Collection {
Object get(int pos);
Object remove(int pos);
boolean isEmpty();
// etc.
}
There are many implementations of the List interface, but we will not assume any particular implementation in our code.
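For a parser, the two key operations are peeking at the front of the list (get) and consuming it (remove). A quick sketch, using raw List types to match the pre-generics style of these notes:

import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class ListDemo {
    public static void main(String[] args) {
        List tokens = new LinkedList(Arrays.asList("var", "x", "6", "*", "7"));
        String peeked   = (String) tokens.get(0);    // "var" is still in the list
        String consumed = (String) tokens.remove(0); // "var" is now gone
        System.out.println(peeked + " " + consumed + " " + tokens);
        // prints: var var [x, 6, *, 7]
    }
}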
In general, the left-hand side of each EBNF rule of a grammar G specifies a class of parse trees. We formalize each of these classes as a Java class, taking advantage of Java sub-classing where we can. For example, Simple has the following rule:
<Phrase> ::= <Expression> | <Command> | <Declaration>
We define the Java classes:
class Phrase { ... }
class Expression extends Phrase { ... }
class Command extends Phrase { ... }
class Declaration extends Phrase { ... }
Each of these classes may be equipped with fields corresponding to the non-terminal sub-phrases appearing on the right-hand side of the rule that defines it. For example, Simple declarations are specified by the EBNF rule:
<Declaration> ::= var <Symbol> <Expression>
where symbols are simply Java strings:
<Symbol> ::= <String>
For example, the declaration:
var x 6 * 7
declares a variable named x with an initial value of 42.
We elaborate the Declaration class as follows:
class Declaration extends Phrase {
private String name;
private Expression exp;
// etc.
}
Equip each class with a static recognizer method. This method determines if the next phrase in the token stream is probably a member of the class. None of these recognizers should actually modify the token stream. The idea is to peek at enough of the beginning of the stream to determine that the next phrase probably belongs to the class. For example:
class Expression extends Phrase {
    public static boolean next(List tokens) { ... }
    // etc.
}

class Command extends Phrase {
    public static boolean next(List tokens) { ... }
    // etc.
}
class Declaration extends Phrase {
    private String name;
    private Expression exp;
    public static boolean next(List tokens) {
        if (tokens.isEmpty()) return false;
        String token = (String)tokens.get(0);
        return token.equals("var");
    }
    // etc.
}
class Phrase {
    public static boolean next(List tokens) {
        return Expression.next(tokens) ||
               Command.next(tokens) ||
               Declaration.next(tokens);
    }
    // etc.
}
We say "probably belongs to the class" because we can't be too sure. For example, the following token string tests positive for "probably" being a declaration:
var 21 + 3
Because it begins with the reserved word "var", we know that it definitely cannot be an expression or a command, so we guess that it is a declaration. It is the job of the parser to decide for sure.
Equip each class with a parsing method that expects a list of tokens as input. In the case of an abstract class, this is a static method that returns a Phrase object; the abstract parsers simply delegate to the appropriate subclass parsers. In the case of a concrete class, the parse method is simply a constructor that initializes fields. The parsers actually remove from the token stream all tokens belonging to the phrase being parsed. In this way the token stream can be passed from one parser to the next with the assumption that its prefix contains the next unparsed phrase.
For example, assume Phrase, Expression, and Command are abstract classes while Declaration is a concrete class. Then:
abstract class Expression extends Phrase {
    public static boolean next(List tokens) { ... }
    public static Phrase parse(List tokens) throws ParseError {
        ...
    }
    // etc.
}

abstract class Command extends Phrase {
    public static boolean next(List tokens) { ... }
    public static Phrase parse(List tokens) throws ParseError {
        ...
    }
    // etc.
}
Typically, the parsing method of an abstract class is a static method that simply delegates to the appropriate parse method of a subclass:
abstract class Phrase {
    public static boolean next(List tokens) { ... }
    public static Phrase parse(List tokens) throws ParseError {
        if (Expression.next(tokens)) {
            return Expression.parse(tokens);
        } else if (Command.next(tokens)) {
            return Command.parse(tokens);
        } else if (Declaration.next(tokens)) {
            return new Declaration(tokens);
        } else {
            throw new ParseError("Unrecognized tokens");
        }
    }
    // etc.
}
The parse method of a concrete class is a constructor that pulls tokens out of the token stream in order to initialize its fields:
class Declaration extends Phrase {
    private String name;
    private Expression exp;
    public static boolean isSymbol(String s) { ... }
    public static boolean next(List tokens) { ... }
    public Declaration(List tokens) throws ParseError {
        tokens.remove(0); // discard keyword "var"
        if (tokens.isEmpty())
            throw new ParseError("not enough tokens");
        name = (String)tokens.remove(0);
        if (!isSymbol(name))
            throw new ParseError("illegal symbol: " + name);
        if (tokens.isEmpty())
            throw new ParseError("not enough tokens");
        exp = (Expression)Expression.parse(tokens);
    }
    // etc.
}
Notice that when a field is a phrase, the parser simply calls the parser of that class to do the dirty work. In this way, the declaration parser doesn't need to know anything about parsing expressions.
The static isSymbol() method might use the scanner or some other means to verify that name is a legal symbol. For example, legal symbols in Simple must begin with a letter and contain only digits and letters:
<Symbol> ::= <Letter>(<Letter> | <Digit>)*
A pretty printer is the inverse of a parser. For debugging purposes, it is often useful to override the toString() method inherited from Java's Object superclass in each concrete subclass of Phrase. For example:
class Declaration extends Phrase {
    private String name;
    private Expression exp;
    public static boolean isSymbol(String s) { ... }
    public static boolean next(List tokens) { ... }
    public Declaration(List tokens) throws ParseError { ... }
    public String toString() {
        String result = "var " + name + " = ";
        result += exp.toString();
        return result;
    }
    // etc.
}
Create the token list and pass it to the parser:
List tokens = Lex.scan(regExp, cmmd);
Phrase phrase = Phrase.parse(tokens);
We may consider phrase to be our parse tree.
Where is the recursion in our parser? Let's expand our language with some additional rules:
<Expression> ::= <NandExp> | <Symbol> | <Literal>
<Literal> ::= true | false
<NandExp> ::= nand <Expression> <Expression>
For example, an expression might take the form:
nand nand true x nand y false
(nand x y is logically equivalent to not (and x y). All logical operators, including not, and, or, xor, nor, if, and iff, can be implemented in terms of nand.)
We introduce a subclass for nand expressions as follows:
class NandExp extends Expression {
    private Expression operand1, operand2;
    public static boolean next(List tokens) { ... }
    public NandExp(List tokens) throws ParseError {
        tokens.remove(0); // discard keyword "nand"
        if (tokens.isEmpty())
            throw new ParseError("not enough tokens");
        operand1 = (Expression)Expression.parse(tokens);
        if (tokens.isEmpty())
            throw new ParseError("not enough tokens");
        operand2 = (Expression)Expression.parse(tokens);
    }
    // etc.
}
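How might the recognizers used by Expression.parse() tell the three kinds of expression apart? Here is one possible sketch, assuming one token of lookahead suffices; the class and method names are hypothetical, since the notes elide these bodies:

import java.util.List;

class Recognizers {
    static boolean nextNand(List tokens) {
        return !tokens.isEmpty() && "nand".equals(tokens.get(0));
    }

    static boolean nextLiteral(List tokens) {
        if (tokens.isEmpty()) return false;
        Object token = tokens.get(0);
        return "true".equals(token) || "false".equals(token);
    }

    static boolean nextSymbol(List tokens) {
        return !tokens.isEmpty()
            && tokens.get(0).toString().matches("[a-zA-Z][a-zA-Z0-9]*")
            && !nextLiteral(tokens)   // reserved words are not symbols
            && !nextNand(tokens);
    }
}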
We can now complete the parse method of the Expression class:
abstract class Expression extends Phrase {
    public static Phrase parse(List tokens) throws ParseError {
        if (Literal.next(tokens)) {
            return new Literal(tokens);
        } else if (Symbol.next(tokens)) {
            return new Symbol(tokens);
        } else if (NandExp.next(tokens)) {
            return new NandExp(tokens);
        } else {
            throw new ParseError("unrecognized expression");
        }
    }
    // etc.
}
Notice that Expression.parse() may call the NandExp() constructor, which, in turn, calls Expression.parse(). In other words, Expression.parse() indirectly calls itself.
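For example, parsing the token list [nand, true, false] unfolds roughly like this (a sketch of the call sequence):

Expression.parse([nand, true, false])
   new NandExp(...)                            // discards "nand"
      operand1 = Expression.parse([true, false])
                    new Literal(...)           // consumes "true"
      operand2 = Expression.parse([false])
                    new Literal(...)           // consumes "false"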
In object-oriented programming, a class that contains a method that creates an instance of another class is called a factory, the method is called a factory method, and the return type is called the product class:
class Product { ... }

class Factory {
    // factory method:
    Product makeProduct(Description desc) { ... }
    // etc.
}
We can consider phrases as products, token streams as product descriptions, and parsers as factory methods. We could have introduced parser classes to function as factories:
class PhraseParser {
    Phrase parse(List tokens) throws ParseError { ... }
}

class ExpressionParser {
    Phrase parse(List tokens) throws ParseError { ... }
}

class DeclarationParser {
    Phrase parse(List tokens) throws ParseError { ... }
}

class CommandParser {
    Phrase parse(List tokens) throws ParseError { ... }
}
Alternatively, we could have created a single factory class containing all of the parsers (this is called a kit):
class Parser {
    Phrase parsePhrase(List tokens) throws ParseError { ... }
    Phrase parseExpression(List tokens) throws ParseError { ... }
    Phrase parseDeclaration(List tokens) throws ParseError { ... }
    Phrase parseCommand(List tokens) throws ParseError { ... }
}
The first solution introduces too many classes that need to be maintained. The second solution is acceptable, but the Parser class depends on every subclass of Phrase, so any change to those classes may also require a change to the Parser class.
Since each class already contains factory methods that create instances of itself (namely, its constructors), why not stick with this convention and keep the factory methods with the product classes? In other words, each Phrase subclass is both a product and a factory.
Class A depends on class B:
A <= B
if changes to B may require changes to A. This relationship is reflexive and transitive:
A <= A
if A <= B and B <= C, then A <= C
If A extends B or contains a field, parameter, local variable, or return type of type B, then A depends on B.
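For illustration, here is a tiny (hypothetical) pair of classes exhibiting every one of these forms of dependency at once:

class B { }

class A extends B {          // inheritance: A <= B
    private B field;         // field of type B
    B transform(B param) {   // return type and parameter of type B
        B local = param;     // local variable of type B
        return local;
    }
}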
In our parser we have dependencies such as:
Expression <= Phrase
Command <= Phrase
Declaration <= Phrase
As a general principle of software engineering, we would like to have as few dependencies between our classes as possible. Suppose we put the parsers and recognizers in Phrase:
class Phrase {
boolean expNext(List tokens) { ... }
boolean decNext(List tokens) { ... }
// etc.
}
Notice that we now have a new dependency:
Phrase <= Expression
A phrase is a string that can be derived from a grammar. A token is a string that matches a pattern (i.e., a regular expression). The format we are using for grammars and regular expressions is very similar:
<Phrase> ::= <RegExp1> | <RegExp2> | ...
<Token> ::= <RegExp3> | <RegExp4> | ...
The difference is that <Phrase> may appear (directly or indirectly) in <RegExp1> and <RegExp2>, but <Token> may not appear (directly or indirectly) in <RegExp3> or <RegExp4>.
In other words, tokens can also be derived from grammars, just very simple, non-recursive grammars. Hence, tokens are simple phrases. However, there are phrases that cannot be described by matching patterns (for example, the language {1ⁿ0ⁿ} of n 1s followed by n 0s).
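A recursive grammar rule describes this language easily, though. For example (a standard construction, not from the notes):

<OneZero> ::= 1<OneZero>0 | 10

so that L[<OneZero>] = { 10, 1100, 111000, ... }.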