Parsing in Scala

EBNF Grammars

The syntax of a programming language is specified using an EBNF grammar (Extended Backus-Naur Form).

Similar to a regular expression, a grammar is a set of rules of the form:

PHRASE ::= PATTERN1 | PATTERN2 | ...

A string conforms to the PHRASE rule if it matches PATTERN1, PATTERN2, or any of the other alternatives.

A parser for a grammar G is a function of the form:

parser: String => Option[Tree]

If s is a string then parser(s) attempts to build a derivation tree showing how s can be derived from G. If it fails, then None is returned.
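As a minimal sketch of this signature, the snippet below treats a parse tree as a single numeric leaf. The names Tree, Leaf, and numberParser are hypothetical and chosen only for illustration; they are not part of any library.

```scala
// Hypothetical tree type: a single numeric leaf is enough for this sketch.
sealed trait Tree
case class Leaf(value: Double) extends Tree

// A parser in the sense above: a function from String to Option[Tree].
// It returns None when the input is not a number.
val numberParser: String => Option[Tree] =
  s => s.trim.toDoubleOption.map(Leaf(_))
```

Here numberParser("42") yields Some(Leaf(42.0)), while numberParser("abc") yields None.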

An EBNF grammar for SOP

Expressions in the SOP language (Sums Of Products) are nested sums and products of numbers.

Here's a sample session with an SOP interpreter:

-> 2
result = 2.0
-> 2 * 3
result = 6.0
-> 2 + 3
result = 5.0
-> 2 * 3 * 4 + 5 * 6 * 7
result = 234.0

Here's our first attempt at an SOP grammar:

EXPRESSION ::= NUMBER | EXPRESSION~OPERATOR~EXPRESSION
NUMBER ::= 0|[1-9][0-9]*(\.[0-9]+)?
OPERATOR ::= \+|\*

Here are the meanings of common EBNF operators:

::=      consists of
|        or
~        followed by
+        one or more of preceding
*        zero or more of preceding
?        preceding is optional
()       used for disambiguation and specifying operator scopes

The first rule states that an SOP expression consists of a number or two expressions separated by an operator.

The left side of a rule, called a non-terminal or head, is a name for a type of phrase. Notice that the right side of a rule, called the body, resembles a regular expression. In fact, the bodies of the NUMBER and OPERATOR rules are regular expressions. The only major difference is that the body of an EBNF rule can recursively refer to the head of the rule as in the example above (some limited forms of recursion are allowed in regular expressions) allowing phrases to be nested in other phrases.
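Because the body of the NUMBER rule is an ordinary regular expression, it can be tried out directly in Scala. The sketch below uses the form of the pattern with the decimal point escaped; numberPattern and isNumber are hypothetical names used only for this illustration.

```scala
// The NUMBER rule body, written as a Scala regular expression string.
val numberPattern = """0|[1-9][0-9]*(\.[0-9]+)?"""

// String.matches succeeds only if the whole string matches the pattern.
def isNumber(s: String): Boolean = s.matches(numberPattern)
```

With this pattern, "12", "3.14", and "0" are accepted. Inputs such as "007" and "0.5" are rejected, because the alternation splits the pattern into a lone 0 and a number that must start with a nonzero digit.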

Parsers

A parser based on this grammar is a function that takes a string as input and produces a tree as output. For example:

parser("12 + 10 * 2") = 12 + (10 * 2)

Note: 12 + (10 * 2) is a linear representation of the tree:

      +
     / \
   12   *
       / \
     10   2

The tree is derived by replacing non-terminals with their bodies:

EXPRESSION
   => (EXPRESSION) ~ OPERATOR ~ (EXPRESSION)
   => (NUMBER) ~ OPERATOR ~ (EXPRESSION)
   => 12 ~ OPERATOR ~ (EXPRESSION)
   => 12 ~ + ~ (EXPRESSION)
   => 12 ~ + ~ ((EXPRESSION) ~ OPERATOR ~ (EXPRESSION))
   => 12 ~ + ~ ((NUMBER) ~ OPERATOR ~ (EXPRESSION))
   => 12 ~ + ~ (10 ~ OPERATOR ~ (EXPRESSION))
   => 12 ~ + ~ (10 ~ * ~ (EXPRESSION))
   => 12 ~ + ~ (10 ~ * ~ NUMBER)
   => 12 ~ + ~ (10 ~ * ~ 2)

(Parentheses added to show tree-depth.)

Ambiguous Grammars

The enemy of a parser is ambiguity. A grammar is ambiguous if several different trees can be derived from the same expression. For example, we could also derive:

parser("12 + 10 * 2") = (12 + 10) * 2

This is the tree:

        *
       / \
      +   2
     / \
   12   10

Exercise: find this derivation.

Notice that bottom-up evaluation of 12 + (10 * 2) gives 32, while (12 + 10) * 2 gives 44, so the choice of tree changes the result.
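The difference between the two trees can be checked with a small evaluator. This is only a sketch: Expr, Num, Add, Mul, and eval are hypothetical names, not the classes used later in these notes.

```scala
// Hypothetical tree classes for this illustration.
sealed trait Expr
case class Num(value: Double) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

// Bottom-up evaluation: evaluate the children, then combine the results.
def eval(e: Expr): Double = e match {
  case Num(v)    => v
  case Add(l, r) => eval(l) + eval(r)
  case Mul(l, r) => eval(l) * eval(r)
}

val tree1 = Add(Num(12), Mul(Num(10), Num(2)))  // 12 + (10 * 2)
val tree2 = Mul(Add(Num(12), Num(10)), Num(2))  // (12 + 10) * 2
```

Evaluating tree1 gives 32.0 and evaluating tree2 gives 44.0, even though both trees come from the same string.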

To avoid ambiguity, grammar writers must try to eliminate choices in their rules:

EXPRESSION ::= SUM
SUM ::= PRODUCT~("+"~SUM)?
PRODUCT ::= TERM~("*"~TERM)*
TERM ::= NUMBER | "("~SUM~")"
NUMBER ::= 0|[1-9][0-9]*(\.[0-9]+)?

Notice that this grammar forces the parser to parse sums after products, giving products higher precedence than sums:

EXPRESSION
   => SUM
   => (PRODUCT) ~ + ~ (SUM)
   => (TERM) ~ + ~ (SUM)
   => NUMBER ~ + ~ (SUM)
   => 12 ~ + ~ (SUM)
   => 12 ~ + ~ (PRODUCT)
   => 12 ~ + ~ ((TERM) ~ * ~ (TERM))
   => 12 ~ + ~ ((NUMBER) ~ * ~ (TERM))
   => 12 ~ + ~ (10 ~ * ~ (TERM))
   => 12 ~ + ~ (10 ~ * ~ NUMBER)
   => 12 ~ + ~ (10 ~ * ~ 2)
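The precedence built into this grammar can also be seen in a hand-written evaluator with one function per rule. The sketch below is an assumption-laden illustration, not the approach used later in these notes: SopSketch and its members are hypothetical names, and the tokenizer uses a simplified number pattern and silently skips characters it does not recognize.

```scala
object SopSketch {
  // Simplified tokenizer: numbers, +, *, and parentheses.
  def tokenize(s: String): List[String] =
    """\d+(\.\d+)?|[+*()]""".r.findAllIn(s).toList

  // SUM ::= PRODUCT ~ ("+" ~ SUM)?
  def sum(ts: List[String]): (Double, List[String]) = {
    val (p, rest) = product(ts)
    rest match {
      case "+" :: tail =>
        val (s, rest2) = sum(tail)
        (p + s, rest2)
      case _ => (p, rest)
    }
  }

  // PRODUCT ::= TERM ~ ("*" ~ TERM)*
  def product(ts: List[String]): (Double, List[String]) = {
    val (first, rest) = term(ts)
    productTail(first, rest)
  }

  // Consumes the repeated ("*" ~ TERM) part, multiplying as it goes.
  def productTail(acc: Double, ts: List[String]): (Double, List[String]) =
    ts match {
      case "*" :: tail =>
        val (t, rest) = term(tail)
        productTail(acc * t, rest)
      case _ => (acc, ts)
    }

  // TERM ::= NUMBER | "(" ~ SUM ~ ")"
  def term(ts: List[String]): (Double, List[String]) = ts match {
    case "(" :: tail =>
      val (s, rest) = sum(tail)
      rest match {
        case ")" :: rest2 => (s, rest2)
        case _ => throw new IllegalArgumentException("expected )")
      }
    case n :: tail if n.head.isDigit => (n.toDouble, tail)
    case _ => throw new IllegalArgumentException("expected a number or (")
  }

  def evaluate(s: String): Double = {
    val (value, rest) = sum(tokenize(s))
    require(rest.isEmpty, "unexpected trailing input")
    value
  }
}
```

Because sum can only reach a number by going through product, every product is fully consumed before a + is considered. For example, SopSketch.evaluate("12 + 10 * 2") returns 32.0.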

A well-known result from computation theory shows that no algorithm can decide whether an arbitrary context-free grammar is ambiguous, so there is no general procedure for disambiguating grammars. This makes disambiguating a grammar more of an art than a science.

Compiler Compilers

A compiler-compiler is a function that generates a parser from a grammar:

cc(grammar) = parser

YACC and ANTLR are well-known examples of compiler-compilers.

Parsing in Scala using Combinators

Scala supports functional programming, and therefore allows programmers to implement combinators (functions that take simple functions as inputs and combine them into a more complicated output function). This is very useful for implementing parsers.

Recall that a parser is a function that transforms a string into a tree:

parser: String => Option[Tree]

For example:

parser("12 + 10 * 2") = Some(12 ~ + ~ (10 ~ * ~ 2))
parser("12 + * 10 2") = None (i.e., syntax error)

(This is a simplification. There is no Tree type in Scala; the result type is really Option[Any].)

Scala interprets each rule in a grammar as a parser function and interprets the EBNF operators as higher-order functions (called combinators) that combine smaller parser functions into larger ones.

def combinator(p1: String=>Option[Tree], p2: String=>Option[Tree]): String=>Option[Tree] = ...

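Scala regards the operators and quantifiers used in EBNF grammars (i.e., ~, |, *, and ?) as combinators:

EBNF         Scala
p1 ~ p2      p1 ~ p2
p1 | p2      p1 | p2
p*           rep(p)
p?           opt(p)

Notice that Scala replaces p* by rep(p) and p? by opt(p). We don't really need p+ since this is the same as p ~ rep(p), although the library also provides rep1(p) for it.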

For example, if p is a parser and L is the set of strings that p accepts (i.e., parses without returning None), then rep(p) is a parser that accepts L*.

Therefore grammars are parsers in Scala!
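To make the idea of combinators concrete, here is a hand-rolled sketch written with only the standard library. It is not how the Scala parser combinators library is implemented. It assumes a parser also returns the unconsumed input so that parsers can be chained, and MiniCombinators, P, lit, seq, and alt are hypothetical names.

```scala
object MiniCombinators {
  // Assumption: a composable parser returns its result plus the leftover input.
  type P = String => Option[(Any, String)]

  // A parser that accepts exactly the literal string s.
  def lit(s: String): P =
    in => if (in.startsWith(s)) Some((s, in.drop(s.length))) else None

  // p1 ~ p2: run p1, then run p2 on whatever p1 left over.
  def seq(p1: P, p2: P): P = in =>
    p1(in).flatMap { case (a, rest1) =>
      p2(rest1).map { case (b, rest2) => ((a, b), rest2) }
    }

  // p1 | p2: try p1 and fall back to p2 if p1 fails.
  def alt(p1: P, p2: P): P = in => p1(in).orElse(p2(in))

  // opt(p): always succeeds, producing Some(result) or None.
  def opt(p: P): P = in => p(in) match {
    case Some((a, rest)) => Some((Some(a), rest))
    case None            => Some((None, in))
  }

  // rep(p): apply p zero or more times, collecting the results in a List.
  // The length check stops the loop if p succeeds without consuming input.
  def rep(p: P): P = in => {
    def loop(acc: List[Any], rest: String): (List[Any], String) = p(rest) match {
      case Some((a, next)) if next.length < rest.length => loop(a :: acc, next)
      case _ => (acc.reverse, rest)
    }
    Some(loop(Nil, in))
  }
}
```

For instance, seq(lit("a"), rep(lit("b"))) applied to "abbc" returns Some((("a", List("b", "b")), "c")): the result of rep is a list, and "c" is the input left unconsumed.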

Example: Sums of Products in Scala

A more useful parser uses case blocks to transform trees into instances of classes representing expressions. These expressions can subsequently be executed, translated into assembly language, type checked, debugged, refactored, etc.

To demonstrate this approach let's build an interpreter for a language called SOP, which allows users to parse and execute arbitrary sums of products of numbers.

Here's a sample session with the SOP interpreter:

-> 2
result = 2.0
-> 2 * 3
result = 6.0
-> 2 + 3
result = 5.0
-> 2 * 3 * 4 + 5 * 6 * 7
result = 234.0
-> 2 * 3 * (4 + 5) * 6 * 7
result = 2268.0

Note that multiplication has higher precedence than addition, but that parentheses can be used to override this.

Here's the Scala version of the SOP grammar:

import scala.util.parsing.combinator._

class SOPParsers extends RegexParsers {
  def expression: Parser[Any] = sum
  def sum: Parser[Any] = product ~ opt("+" ~ sum)
  def product: Parser[Any] = term ~ rep("*" ~ term)
  def term: Parser[Any] = number | "(" ~ expression ~ ")"
  def number: Parser[Any] = """0|[1-9][0-9]*(\.[0-9]+)?""".r
}

Notes:

·       The Scala parser combinators library is in a separate jar file that must be downloaded and added to the project.

·       RegexParsers is a trait in this jar file that, in addition to the combinators above, also provides a combinator that converts regular expressions into parsers.

·       Parser[Any] is essentially the type String => Option[Any]. We use Any instead of Tree because trees can be too varied to fit in any one class.

·       This is a weird grammar: the rule for sum uses recursion, while the rule for product uses iteration. I do this to show the difference in how they are processed.

·       Escape characters still need to be used in regular expressions.

Consider the rule:

def product: Parser[Any] = term ~ rep("*" ~ term)

This says that the product parser is built from the term parser and a parser that only accepts the string "*", combined using the rep and ~ combinators.

Adding the parser jar file to the build path in IntelliJ

Download the latest version of Scala's parser combinator jar file.

Under the File menu, open the Project Structure dialog.

Click the + button under standard libraries and navigate to the location where you downloaded the jar file to select it.

Warning: Many indecipherable errors occur if the version of the parser combinators jar file is incompatible with the version of Scala. Try to match the version numbers as closely as possible. Look on the web for other versions if needed, or roll back the version of Scala. Build tools like Maven or sbt can automate this process.

Adding the parser jar file to the build path in Eclipse

Download the latest version of Scala's parser combinator jar file.

Open the Project Properties dialog.

Select Java Build Path. Select the Libraries tab. Click the Add External JARs button and navigate to the downloaded jar file to select it.

Warning: Many indecipherable errors occur if the version of the parser combinators jar file is incompatible with the version of Scala. Try to match the version numbers as closely as possible. Look on the web for other versions if needed, or roll back the version of Scala. Build tools like Maven or sbt can automate this process.
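For readers who prefer a build tool, a minimal sbt sketch might look like the following. The module coordinates are the ones under which the parser combinators library is published. The version number shown is only an example and is an assumption here; replace it with a release that matches your version of Scala.

```scala
// build.sbt (sketch): let sbt download the parser combinators jar.
// The version below is an example only; pick one compatible with your Scala version.
libraryDependencies += "org.scala-lang.modules" %% "scala-parser-combinators" % "2.3.0"
```

With a line like this in the build file, the jar does not need to be downloaded or added to the build path by hand.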

The SOP Console

To test our parser we will need a console:

import scala.io.StdIn

object console {

  val parsers = new SOPParsers

  // Parse cmmd and return the string form of the parse tree
  def execute(cmmd: String): String = {
    val result = parsers.parseAll(parsers.expression, cmmd)
    result match {
      case _: parsers.Failure => throw new Exception("syntax error")
      case _ =>
        val tree = result.get  // get the parse tree from the result
        tree.toString
    }
  }

  // Read-eval-print loop: runs until the user types quit
  def repl(): Unit = {
    var more = true
    while (more) {
      try {
        print("-> ")
        val cmmd = StdIn.readLine()
        // readLine returns null at end of input
        if (cmmd == null || cmmd == "quit") more = false
        else println(execute(cmmd))
      } catch {
        case e: Exception => println(e)
      }
    }
    println("bye")
  }

  def main(args: Array[String]): Unit = { repl() }

}

Notes:

·       parsers.parseAll(rule, string) = the result of parsing string with rule.

·       This result can be a failure (due to a syntax error).

·       Otherwise result.get returns the parse tree.

A sample session

-> 2
((2~List())~None)
-> 2 + 3
((2~List())~Some((+~((3~List())~None))))
-> 2 * 3
((2~List((*~3)))~None)
-> 2 * 3 * 4 * 5
((2~List((*~3), (*~4), (*~5)))~None)

This takes a little work to interpret. (Don't worry, we'll clean it up in version 2.0.)

Here's the grammar/parser again:

class SOPParsers extends RegexParsers {
  def expression: Parser[Any] = sum
  def sum: Parser[Any] = product ~ opt("+" ~ sum)
  def product: Parser[Any] = term ~ rep("*" ~ term)
  def term: Parser[Any] = number | "(" ~ expression ~ ")"
  def number: Parser[Any] = """0|[1-9][0-9]*(\.[0-9]+)?""".r
}

Here's the derivation of 2:

expression
   => sum
   => (product ~ None)
   => ((term ~ Nil) ~ None)
   => ((number ~ Nil) ~ None)
   => ((2 ~ Nil) ~ None)

Here's the derivation of 2 + 3:

expression
   => sum
   => (product ~ Some(+ ~ sum))
   => ((term ~ Nil) ~ Some(+ ~ sum))
   => ((number ~ Nil) ~ Some(+ ~ sum))
   => ((2 ~ Nil) ~ Some(+ ~ sum))
   => ((2 ~ Nil) ~ Some(+ ~ (product ~ None)))
   => ((2 ~ Nil) ~ Some(+ ~ ((term ~ Nil)  ~ None)))
   => ((2 ~ Nil) ~ Some(+ ~ ((number ~ Nil)  ~ None)))
   => ((2 ~ Nil) ~ Some(+ ~ ((3 ~ Nil)  ~ None)))

Here's the derivation of 2 * 3:

expression
   => sum
   => (product ~ None)
   => ((term ~ List(* ~ term)) ~ None)
   => ((number ~ List(* ~ term)) ~ None)
   => ((2 ~ List(* ~ term)) ~ None)
   => ((2 ~ List(* ~ number)) ~ None)
   => ((2 ~ List(* ~ 3)) ~ None)

Notice that the output of opt is an option, while the output of rep is a list.

SOP 2.0

The SOP Expression Hierarchy

We really want our parser to return objects representing expressions, so we begin by defining an expression hierarchy:

trait Expression {
  def execute: Double
}

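// operand2 defaults to null so that a Sum can be built from a single operand
case class Sum(operand1: Expression, operand2: Expression = null) extends Expression {
  def execute =
    if (operand2 == null) operand1.execute
    else operand1.execute + operand2.execute
}

// operands defaults to Nil so that a Product can be built from a single operand
case class Product(operand1: Expression, operands: List[Expression] = Nil) extends Expression {
  def execute =
    if (operands == Nil) operand1.execute
    else operand1.execute * operands.map(_.execute).reduce(_*_)
}

case class Number(value: Double) extends Expression {
  def execute = value
}

Notes:

·       The expression classes are case classes, which lets the parser build them without new and lets later code pattern-match on them.

·       Both Sum and Product constructors can take a single input, because the second parameter of each has a default value.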

·       Of course we could add other functions to the Expression trait:

   trait Expression {
      def execute: Double
      def typeCheck: Type
      def compile: List[Command]
   }
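As a sketch of how one more function could be added to the hierarchy, the snippet below adds a hypothetical show method that prints an expression back out. The classes are repeated in condensed form so that the snippet stands alone; show and example are names invented for this illustration.

```scala
trait Expression {
  def execute: Double
  def show: String  // hypothetical extra operation, not part of the original design
}

case class Number(value: Double) extends Expression {
  def execute = value
  def show = value.toString
}

case class Sum(operand1: Expression, operand2: Expression) extends Expression {
  def execute = operand1.execute + operand2.execute
  def show = "(" + operand1.show + " + " + operand2.show + ")"
}

case class Product(operand1: Expression, operands: List[Expression]) extends Expression {
  // Multiply the first operand by each of the remaining operands.
  def execute = operands.map(_.execute).foldLeft(operand1.execute)(_ * _)
  def show = (operand1 :: operands).map(_.show).mkString(" * ")
}

// 2 * 3 * 4 + 5 * 6 * 7, built by hand instead of by the parser
val example = Sum(
  Product(Number(2), List(Number(3), Number(4))),
  Product(Number(5), List(Number(6), Number(7))))
```

Calling example.execute gives 234.0, matching the sample session, and example.show rebuilds a readable form of the expression.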

SOP 2.0 parsers

import scala.util.parsing.combinator._

class SOP2Parsers extends RegexParsers {
  def expression: Parser[Expression] = sum
  def sum: Parser[Expression] = product ~ opt("+" ~ sum) ^^ {
    case p ~ None => p
    case p ~ Some("+" ~ s) => Sum(p, s)
  }
  def product: Parser[Expression] = term ~ rep("*" ~> term) ^^ {
    case t ~ Nil => t
    case t ~ terms => Product(t, terms)
  }
  def term: Parser[Expression] = number | "(" ~> expression <~ ")"
  def number: Parser[Number] = """0|[1-9][0-9]*(\.[0-9]+)?""".r ^^ {
    case num => Number(num.toDouble)
  }
}

Basically, the rule:

def sum: Parser[Expression] = product ~ opt("+" ~ sum) ^^ {
  case p ~ None => p
  case p ~ Some("+" ~ s) => Sum(p, s)
}

is roughly shorthand for the following pseudocode:

// pseudocode: a real Parser reads from an input stream and returns a ParseResult
def sum(str: String): Expression = {
  val sumParser = product ~ opt("+" ~ sum)
  val tree = sumParser(str)
  tree match {
    case p ~ None => p
    case p ~ Some("+" ~ s) => Sum(p, s)
  }
}

In other words, after the parser generates the ugly trees we saw earlier, each tree is passed to the match block, where it is converted into an object in the expression hierarchy.
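The effect of ^^ can be imitated with a plain function that transforms a parser's result. The sketch below is not the library's implementation. It assumes a parser returns its result together with the unconsumed input, and MapSketch, P, mapResult, digits, and number are hypothetical names.

```scala
object MapSketch {
  // Assumption: a parser returns its result together with the unconsumed input.
  type P[A] = String => Option[(A, String)]

  // Hypothetical stand-in for ^^: apply f to the result of p when p succeeds.
  def mapResult[A, B](p: P[A])(f: A => B): P[B] =
    in => p(in).map { case (a, rest) => (f(a), rest) }

  // A parser for a run of leading digits, returning the matched text.
  val digits: P[String] = in => {
    val matched = in.takeWhile(_.isDigit)
    if (matched.isEmpty) None else Some((matched, in.drop(matched.length)))
  }

  // Convert the matched text into a Double, much as the number rule does.
  val number: P[Double] = mapResult(digits)(_.toDouble)
}
```

Here MapSketch.number("42+1") returns Some((42.0, "+1")): the text matched by digits has been turned into a Double, and the rest of the input is passed along untouched.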

SOP 2.0 console

The only change needed to the console is that the parser returns an expression which must be executed:

object console {
  val parsers = new SOP2Parsers

  def execute(cmmd: String): String = {
    val result = parsers.parseAll(parsers.expression, cmmd)
    result match {
      case _: parsers.Failure => throw new Exception("syntax error")
      case _ =>
        val exp = result.get     // get the expression from the parse result
        val value = exp.execute  // execute the expression
        value.toString           // return string representation of result
    }
  }
  // etc.
}

Tutorial

I've added the following tutorial for parsers: Scala Parser Combinators: an Example.