Pattern Matching

Regular Expressions

Regular expressions are used to specify the tokens of a language. (They have many other uses as well.)

The language of regular expressions, Regex, is a language for describing lexical patterns. In other words, Regex is a meta-language for describing object languages:

If r is a regular expression, then L(r) = the set of all strings that match r

A string containing no meta-characters is a regular expression that matches itself:

L("abc") = {"abc"}, L("123") = {"123"}, L("$") = {"$"}, L("") = {""}, etc.

Regex meta-characters intended as object language literals must be preceded by the escape character (\):

L("\+") = {"+"}, L("\|") = {"|"}, L("\)") = {")"}, L("\\") = {"\"}, L("\"") = {"""}, etc.

If r and s are regular expressions, then so are r~s and r|s:

L(r~s) = L(r)L(s) = {uv | u is in L(r) and v is in L(s)} // concatenation

L(r|s) = L(r) U L(s) // alternation

For example:

L("abc"~"de") = {"abcde"}

L("a"|"b"|"c"|"d"|"e") = {"a", "b", "c", "d", "e"}
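For finite languages, the two set equations above can be checked directly. The following is a minimal sketch that models a language as a Set[String]; the helper names concat and union are illustrative and do not come from any library.

```scala
// Hypothetical helpers: a finite language is modeled as a Set[String].

// L(r~s) = {uv | u is in L(r) and v is in L(s)}
def concat(lr: Set[String], ls: Set[String]): Set[String] =
  for (u <- lr; v <- ls) yield u + v

// L(r|s) = L(r) U L(s)
def union(lr: Set[String], ls: Set[String]): Set[String] =
  lr union ls

val abcde = concat(Set("abc"), Set("de"))        // Set("abcde")
val pairs = concat(Set("a", "b"), Set("x", "y")) // Set("ax", "ay", "bx", "by")
val either = union(Set("a"), Set("b"))           // Set("a", "b")
```

This only works for finite languages; quantifiers such as + and * describe infinite sets that cannot be enumerated this way.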

Regular expressions can be followed by a quantifier (?, +, *):

L(r?) = L(r|"") // "" = the empty string

L(r+) = L(r|rr|rrr|...)

L(r*) = L(""|r|rr|rrr|...)

For example:

L(("0"|"1")+) = all non-empty binary strings
L(("\+"|"-")?("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9")+) = signed and unsigned decimal integers (leading zeros allowed)

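Quantifiers bind more tightly than concatenation, and concatenation binds more tightly than alternation. Parentheses can be used to override this precedence and resolve ambiguities: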

L("0"~"1"+) = {"01", "011", "0111", ... }, but L(("0"~"1")+) = {"01", "0101", "010101", ... }

L("0"|"1"~"2") = {"0", "12"}, but L(("0"|"1")~"2") = {"02", "12"}
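These precedence rules can be checked with String.matches, which is introduced in the next section. This is a small sketch in Java's concrete regex syntax, where concatenation is written by juxtaposition, with no ~ and no inner quotes. The val names are only for illustration.

```scala
// Quantifier applies to the nearest item unless parentheses say otherwise.
val plusOnOne   = "0111".matches("01+")    // true:  "0" followed by one or more "1"
val notRepeated = "0101".matches("01+")    // false: "+" applies only to "1"
val plusOnPair  = "0101".matches("(01)+")  // true:  "+" applies to the group "01"

// Concatenation binds more tightly than alternation.
val altLoose    = "12".matches("0|12")     // true:  either "0" or "12"
val altLoose2   = "02".matches("0|12")     // false
val altGrouped  = "02".matches("(0|1)2")   // true:  ("0" or "1") followed by "2"

// "*" admits the empty string, "+" does not.
val emptyStar   = "".matches("(0|1)*")     // true
val emptyPlus   = "".matches("(0|1)+")     // false
```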

There are many pre-defined abbreviations in java.util.regex.Pattern. For example:

"[abcde]" = "[a-e]" = "a"|"b"|"c"|"d"|"e"

"\\s" matches any whitespace character

Note: r~s (r followed by s) is usually written simply as rs in Regex. The ~ notation is more common in the grammar notation that follows.
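A few of the java.util.regex.Pattern abbreviations can be tried with String.matches (introduced in the next section). This is an illustrative sketch, not a complete list; the val names are hypothetical. In an ordinary Scala string literal each regex backslash is doubled.

```scala
// Character classes: a range in brackets matches any one character in the range.
val inRange    = "c".matches("[a-e]")          // true
val outOfRange = "f".matches("[a-e]")          // false

// Predefined classes from java.util.regex.Pattern.
val whitespace = " \t\n".matches("\\s+")       // true: space, tab, newline
val digits     = "2024".matches("\\d+")        // true: \d is a digit
val wordChars  = "snake_case1".matches("\\w+") // true: \w is a letter, digit, or underscore
```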

Pattern matching in Scala

Scala's (Java's) String class provides matches and split methods:

class String {
   def matches(regex: String): Boolean = true if this is in L(regex), false otherwise
   def split(regex: String): Array[String] = this split into tokens separated by matches of regex
   // etc.
}

Here are some examples:

A natural number is 0 or an unsigned positive integer, i.e., a string of digits beginning with a non-zero digit:

val natPattern = "0|[1-9][0-9]*"                //> natPattern  : String = 0|[1-9][0-9]*
"23".matches(natPattern)                        //> res0: Boolean = true
"023".matches(natPattern)                       //> res1: Boolean = false
"1".matches(natPattern)                         //> res2: Boolean = true
"0".matches(natPattern)                         //> res3: Boolean = true

An integer is 0 or a positive integer optionally preceded by a sign:

val intPattern = "0|(\\+|-)?[1-9][0-9]*"        //> intPattern  : String = 0|(\+|-)?[1-9][0-9]*
"23".matches(intPattern)                        //> res4: Boolean = true
"+23".matches(intPattern)                       //> res5: Boolean = true
"-23".matches(intPattern)                       //> res6: Boolean = true
"0".matches(intPattern)                         //> res7: Boolean = true
"+0".matches(intPattern)                        //> res8: Boolean = false

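Notice that the plus sign needed two escape characters. This is because the backslash is the escape character both in Regex and in Scala string literals. Regex needs \+ to treat the plus sign as a literal, and the string literal needs \\ to produce that single backslash. The same string-literal escape is what indicates that quotes should be taken literally: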

"\"Hello World\"" //> "Hello World"

All of this escaping of quotes and escape characters can be avoided by writing regular expressions as raw strings:

val intPattern = """0|(\+|-)?[1-9][0-9]*""" // raw strings bracketed by triple quotes
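As a quick check, the escaped and raw spellings denote the same string. The val names here are illustrative only.

```scala
// The same pattern written two ways.
val escapedInt = "0|(\\+|-)?[1-9][0-9]*"    // ordinary literal: each backslash doubled
val rawInt     = """0|(\+|-)?[1-9][0-9]*""" // raw literal: backslashes written once

val samePattern = escapedInt == rawInt      // true
val twoChars    = "\\+".length              // 2: a backslash and a plus sign
```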

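A floating point number is an integer followed by an optional decimal part. The integer pattern is parenthesized so that the decimal part can follow either of its alternatives (without the parentheses, "0.5" would not match):

val floatPattern = "(" + intPattern + ")" + """(\.[0-9]+)?"""  //> floatPattern  : String = (0|(\+|-)?[1-9][0-9]*)(\.[0-9]+)?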
"23".matches(floatPattern)                        //> res9: Boolean = true
"+23".matches(floatPattern)                       //> res10: Boolean = true
"-23.0001".matches(floatPattern)                  //> res11: Boolean = true

(Note: '.' is a meta-character representing any character, a wildcard. Used as a literal decimal point, it must be preceded by an escape character.)

An identifier is any alpha-numeric string beginning with a letter:

val idPattern = """[a-zA-Z][0-9a-zA-Z]*"""       //> idPattern  : String = [a-zA-Z][0-9a-zA-Z]*
"z26".matches(idPattern)                         //> res12: Boolean = true
"z".matches(idPattern)                           //> res13: Boolean = true
"2z".matches(idPattern)                          //> res14: Boolean = false
"zebra".matches(idPattern)                       //> res15: Boolean = true

In some languages a token is any string not containing white space characters:

val tokenPattern = """\S+"""                      //> tokenPattern  : String = \S+
"34^%$(F_=n".matches(tokenPattern)                //> res16: Boolean = true
"\"hello world\"".matches(tokenPattern)           //> res17: Boolean = false
"\"helloworld\"".matches(tokenPattern)            //> res18: Boolean = true

The split method is an easy way to extract tokens from a string:

"balance = balance + amount".split("\\s+")        //> res19: Array[String] = Array(balance, =, balance, +, amount)

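One edge case is worth knowing when tokenizing with split. If the input begins with a separator, Java's split returns an empty first token, while trailing empty tokens are dropped. A hedged sketch (the val names are illustrative):

```scala
val padded = "  balance = balance + amount  "

// Leading whitespace produces an empty first token; trailing whitespace does not.
val rawTokens = padded.split("\\s+").toList
// List("", "balance", "=", "balance", "+", "amount")

// Trimming first avoids the empty token.
val cleanTokens = padded.trim.split("\\s+").toList
// List("balance", "=", "balance", "+", "amount")
```
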
Extracting Matches I

Scala provides a Regex class (scala.util.matching.Regex):

class Regex {
   def findAllIn(text: String): Iterator[String] = iterator over the sequence of all matches
   def replaceAllIn(text: String, sub: String): String = result of replacing all matches by sub
   // etc.
}

For example, assume we want to look at each legal token in a program written in language L. Assume tokens in L consist of numbers, identifiers, and operator symbols. Then:

val operatorPattern = """\+|-|\*|/|=|<|>|!="""
val tokenPattern2 = (floatPattern + "|" + idPattern + "|" + operatorPattern).r

Where:

class String {
   def r: Regex = converts this into an instance of Regex
   // etc.
}

Using the iterator we can iterate through the list of valid tokens:

val tokens = tokenPattern2.findAllIn("12*x + 23.01 <= pi")
for(next <- tokens) print(next + ", ")  //> 12, *, x, +, 23.01, <, =, pi,

Regex.replaceAllIn is useful for form letters:

var letter = "Dear NAME1, I've decided to leave you and date NAME2 instead. I hope we can be friends, NAME1."
letter = ("NAME1".r).replaceAllIn(letter, "John")
letter = ("NAME2".r).replaceAllIn(letter, "Steve")

letter //> res21: String = Dear John, I've decided to leave you and date Steve instead. I hope we can be friends, John.
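One caveat with replaceAllIn: in the replacement string, $ and \ have special meaning ($1 refers to the first group, for instance). When the substituted text may contain those characters, it can be protected with Regex.quoteReplacement. A minimal sketch with made-up template text:

```scala
import scala.util.matching.Regex

val template = "You owe AMOUNT."

// quoteReplacement escapes "$" and "\" so they are inserted literally.
val bill = "AMOUNT".r.replaceAllIn(template, Regex.quoteReplacement("$5"))
// bill is "You owe $5."
```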

Problem

Regex.findAllIn simply skips illegal tokens. Write a function called scan that returns a list of all tokens, but throws an IllegalToken exception if an illegal token is detected:

def scan(text: String, pattern: Regex): List[String] /* or throws IllegalToken */ = ???

Problem

Write a function that converts dates in the form mm/dd/yyyy into dates in the European form dd/mm/yyyy.

Multi-Way Conditionals

Scala's match/case expression can be used to implement multi-way conditionals:

def eval(arg1: Int, op: String, arg2: Int) =
    op match {
       case "+" => arg1 + arg2
       case "-" => arg1 - arg2
       case "*" => arg1 * arg2
       case "/" => arg1 / arg2
       case _ => throw new Exception("Unrecognized operator: " + op)
    }

Extracting Matches II

A group is a sub-expression of a regular expression that is surrounded by parentheses.

For example, the following regular expression contains three groups.

val expPattern = """([0-9]+)\s*(\+|\*|-|/)\s*([0-9]+)""".r

Given a string that matches the pattern, we can extract the substrings that match the groups and declare them as variables or constants:

def eval(exp: String) = {
    val expPattern(arg1, op, arg2) = exp // defining constants arg1, op, and arg2
    op match {
       case "+" => arg1.toInt + arg2.toInt
       case "-" => arg1.toInt - arg2.toInt
       case "*" => arg1.toInt * arg2.toInt
       case "/" => arg1.toInt / arg2.toInt
       case _ => throw new Exception("Unrecognized operator: " + op)
    }
 }
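The val-style extraction above throws a MatchError when the string does not match the pattern. An alternative, sketched here under the assumption that a missing result is acceptable, is to use the regular expression as a case pattern and return an Option. The name safeEval is hypothetical.

```scala
import scala.util.matching.Regex

val expPattern: Regex = """([0-9]+)\s*(\+|\*|-|/)\s*([0-9]+)""".r

// A Regex used as a case pattern succeeds only if the whole string matches.
def safeEval(exp: String): Option[Int] = exp match {
  case expPattern(arg1, "+", arg2) => Some(arg1.toInt + arg2.toInt)
  case expPattern(arg1, "-", arg2) => Some(arg1.toInt - arg2.toInt)
  case expPattern(arg1, "*", arg2) => Some(arg1.toInt * arg2.toInt)
  case expPattern(arg1, "/", arg2) => Some(arg1.toInt / arg2.toInt)
  case _                           => None // not a well-formed expression
}

val sum     = safeEval("3 + 4") // Some(7)
val product = safeEval("12*3")  // Some(36)
val bad     = safeEval("3 ^ 4") // None
```

Division by zero still throws an ArithmeticException in this sketch; handling it is left out to keep the example short.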

Warning: Parentheses used to disambiguate a regular expression get interpreted as groups, even if that is not the intention. For example, suppose we want to allow for signed integers in the above example. We might try:

"""((\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((\+|-)?[0-9]+)""".r

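The parentheses around the sign do both: they define the scope of the "?" quantifier, but they are also groups. This pattern therefore has five groups, so it no longer matches an extraction of three variables such as expPattern(arg1, op, arg2).

One fix is to replace the parenthesized alternation with a character class: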

"""([\+-]?[0-9]+)\s*(\+|\*|-|/)\s*([\+-]?[0-9]+)""".r

A student pointed out an alternative. Non-grouping parentheses can be expressed as follows: (?: ... ). For example:

"""((?:\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((?:\+|-)?[0-9]+)""".r

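The difference between the two kinds of parentheses can be observed by counting groups. The sketch below uses the underlying java.util.regex.Pattern (available as Regex.pattern) and its Matcher.groupCount method; the names capturing, nonCapturing, groupCount, and describe are illustrative.

```scala
import scala.util.matching.Regex

val capturing    = """((\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((\+|-)?[0-9]+)""".r
val nonCapturing = """((?:\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((?:\+|-)?[0-9]+)""".r

// Number of capturing groups in a pattern.
def groupCount(r: Regex): Int = r.pattern.matcher("").groupCount()

val fiveGroups  = groupCount(capturing)    // 5: the sign parentheses are groups too
val threeGroups = groupCount(nonCapturing) // 3: (?: ... ) does not create a group

// Extraction into three variables only succeeds when there are three groups.
def describe(r: Regex, exp: String): String = exp match {
  case r(arg1, op, arg2) => s"$arg1 $op $arg2"
  case _                 => "no match"
}

val withCapturing    = describe(capturing, "-3 * +4")    // "no match"
val withNonCapturing = describe(nonCapturing, "-3 * +4") // "-3 * +4"
```
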
Case Classes

Extraction works because Scala's Regex class implements an extractor method (unapplySeq, the variable-length form of unapply). This method deconstructs a string that matches the regular expression into the substrings matched by its groups.
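To see what the extractor hands back, it can be called directly. In the Scala standard library the method that Regex defines is unapplySeq, which returns the group matches as an optional list. A short sketch reusing the expression pattern from the previous section:

```scala
val expPattern = """([0-9]+)\s*(\+|\*|-|/)\s*([0-9]+)""".r

// Some(list of group matches) if the whole string matches, None otherwise.
val hit  = expPattern.unapplySeq("3 + 4")  // Some(List("3", "+", "4"))
val miss = expPattern.unapplySeq("3 plus 4") // None
```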

We can automatically generate unapply methods in our own classes simply by qualifying a class declaration as a case class:

case class Exp(arg1: Int, op: Char, arg2: Int) {}

In addition to unapply, case classes also have:

·       A companion object with an apply method // so we can construct instances without using new

·       val fields for each constructor parameter

·       toString, equals, hashCode, and copy methods
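The generated members listed above can be exercised in a few lines. This is only an illustrative sketch using the Exp class from this section; the val names are made up.

```scala
case class Exp(arg1: Int, op: Char, arg2: Int)

val e1 = Exp(2, '+', 3)            // companion apply: no "new" needed
val e2 = e1.copy(op = '*')         // copy with one field changed
val text = e1.toString             // "Exp(2,+,3)"
val sameValue = e1 == Exp(2, '+', 3) // true: equals compares fields
val left = e1.arg1                 // 2: constructor parameters are val fields
val Exp(a, o, b) = e2              // unapply: a = 2, o = '*', b = 3
```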

The following examples show two ways to extract the fields of an expression into local variables:

def eval(e: Exp) =
   e match {
       case Exp(arg1, '+', arg2) => arg1 + arg2
       case Exp(arg1, '-', arg2) => arg1 - arg2
       case Exp(arg1, '*', arg2) => arg1 * arg2
       case Exp(arg1, '/', arg2) => arg1 / arg2
       case _  => throw new Exception("Unrecognized operator: " + e.op)
   }

def eval2(e:Exp) = {
    val Exp(arg1, op, arg2) = e // defining arg1, op, & arg2
    op match {
       case '+' => arg1 + arg2
       case '-' => arg1 - arg2
       case '*' => arg1 * arg2
       case '/' => arg1 / arg2
       case _ => throw new Exception("Unrecognized operator: " + e.op)
    }
}

Pattern-Driven Programming

In data-driven programming (e.g., polymorphism), the flow of control is determined by data, not programmers. More specifically, flow is determined by the class instantiated by the data:

Employee e = new Programmer(); // subsumption
e.print(); // calls Programmer.print

Data determines the flow of control in pattern-driven programming too. The difference is that the flow is determined by the lexical patterns instantiated by the data. We saw examples earlier with extraction and multi-way conditionals.
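As a small illustration of pattern-driven control, the sketch below chooses a branch according to which regular expression an input line matches. The two command forms and the names assignment, printCmd, and describe are invented for this example.

```scala
// Hypothetical command patterns for a tiny line-oriented language.
val assignment = """([a-zA-Z][0-9a-zA-Z]*)\s*=\s*([0-9]+)""".r
val printCmd   = """print\s+([a-zA-Z][0-9a-zA-Z]*)""".r

// The lexical pattern matched by the line determines which branch runs.
def describe(line: String): String = line match {
  case assignment(name, value) => s"assign $value to $name"
  case printCmd(name)          => s"print $name"
  case _                       => "unrecognized"
}

val first  = describe("x = 42")  // "assign 42 to x"
val second = describe("print x") // "print x"
val third  = describe("42 = x")  // "unrecognized"
```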

Prolog, Datalog, and Proplog interpreters use pattern-driven control. Goals are matched to appropriate facts and rules using a sophisticated pattern matching algorithm called unification.