Regular expressions are used to specify the tokens of a language. (They have many other uses as well.)
The language of regular expressions, Regex, is a language for describing lexical patterns. In other words, Regex is a meta-language for describing object languages:
If r is a regular expression, then L(r) = the set of all strings that match r
A string is a regular expression that matches itself:
L("abc") = {"abc"}, L("123") = {"123"}, L("$") = {"$"}, L("") = {""}, etc.
Regex meta-characters intended as object language literals must be preceded by the escape character (\):
L("\+") = {"+"}, L("\|") = {"|"}, L("\)") = {")"}, L("\\") = {"\"}, L("\"") = {"""}, etc.
If r and s are regular expressions, then so are r~s and r|s:
L(r~s) = L(r)L(s) = {uv | u is in L(r) and v is in L(s)} // concatenation
L(r|s) = L(r) U L(s) // union
For example:
L("abc"~"de") = {"abcde"}
L("a"|"b"|"c"|"d"|"e") = {"a", "b", "c", "d", "e"}
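The two set equations can be mirrored directly with finite sets of strings. The following is only a minimal sketch for illustration; concat and union are hypothetical helper names, not part of any regex library:

```scala
// L(r~s): every string of a followed by every string of b
def concat(a: Set[String], b: Set[String]): Set[String] =
  for (u <- a; v <- b) yield u + v

// L(r|s): a string belongs to the result if it belongs to either set
def union(a: Set[String], b: Set[String]): Set[String] =
  a ++ b

val abcde = concat(Set("abc"), Set("de"))      // Set("abcde")
val letters = union(Set("a", "b"), Set("c"))   // Set("a", "b", "c")
val pairs = concat(Set("0", "1"), Set("2"))    // Set("02", "12")
```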
Regular expressions can be followed by a quantifier (?, +, *):
L(r?) = L(r|"") // "" = the empty string
L(r+) = L(r|rr|rrr|...)
L(r*) = L(""|r|rr|rrr|...)
For example:
L(("0"|"1")+) = all non-empty binary strings
L(("\+"|"-")?("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9")+) = signed and unsigned integers
Parentheses can be used to resolve ambiguities:
L("0"~"1"+) = {"01", "011", "0111", ... }, but L(("0"~"1")+) = {"01", "0101", "010101", ... }
L("0"|"1"~"2") = {"0", "12"}, but L(("0"|"1")~"2") = {"02", "12"}
java.util.regex.Pattern defines many abbreviations, for example:
"[abcde]" = "[a-e]" = "a"|"b"|"c"|"d"|"e"
"\\s" matches any whitespace character
Note: r~s (r followed by s) is usually written simply as rs in Regex. The ~ notation is more common in the grammars that follow.
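The precedence rules and abbreviations above can be checked with the matches method of Java's String class, which is available on every Scala string. A small sketch, written in java.util.regex syntax, where concatenation is expressed by juxtaposition rather than with ~:

```scala
// A quantifier binds tighter than concatenation: 01+ means 0(1+)
val a1 = "0111".matches("01+")     // true
val a2 = "0101".matches("01+")     // false
val a3 = "0101".matches("(01)+")   // true

// Concatenation binds tighter than |: 0|12 means 0|(12)
val b1 = "12".matches("0|12")      // true
val b2 = "02".matches("0|12")      // false
val b3 = "02".matches("(0|1)2")    // true

// ? makes its operand optional; a character class abbreviates a choice
val c1 = "".matches("a?")          // true
val c2 = "c".matches("[a-e]")      // true
val c3 = " ".matches("\\s")        // true
```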
Scala's (Java's) String class provides matches and split methods:
class String {
  def matches(regex: String): Boolean =
    if (this in L(regex)) true else false
  def split(regex: String): Array[String] =
    this split into tokens separated by matches of regex
  // etc.
}
Here are some examples:
A natural number is 0 or an unsigned positive integer, i.e., a string of digits beginning with a non-zero digit:
val natPattern = "0|[1-9][0-9]*"   //> natPattern : String = 0|[1-9][0-9]*
"23".matches(natPattern)           //> res0: Boolean = true
"023".matches(natPattern)          //> res1: Boolean = false
"1".matches(natPattern)            //> res2: Boolean = true
"0".matches(natPattern)            //> res3: Boolean = true
An integer is 0 or a positive integer optionally preceded by a sign:
val intPattern = "0|(\\+|-)?[1-9][0-9]*"   //> intPattern : String = 0|(\+|-)?[1-9][0-9]*
"23".matches(intPattern)                   //> res4: Boolean = true
"+23".matches(intPattern)                  //> res5: Boolean = true
"-23".matches(intPattern)                  //> res6: Boolean = true
"0".matches(intPattern)                    //> res7: Boolean = true
"+0".matches(intPattern)                   //> res8: Boolean = false
Notice that the plus sign needed two escape characters. This is because the backslash is an escape character in two languages at once. In Regex, \+ stands for a literal plus sign. In an ordinary Scala string literal, a backslash must itself be escaped (\\), just as a quote must be escaped to be taken literally:
"\"Hello World\"" //> "Hello World"
All of this escaping of quotes and backslashes can be avoided by writing regular expressions as raw strings:
val intPattern = """0|(\+|-)?[1-9][0-9]*""" // raw strings bracketed by triple quotes
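As a quick check that the two notations denote the same string, the following sketch compares an ordinary literal with its raw counterpart (the val names are chosen only for this illustration):

```scala
val escaped = "0|(\\+|-)?[1-9][0-9]*"      // ordinary literal: each backslash is doubled
val raw     = """0|(\+|-)?[1-9][0-9]*"""   // raw literal: the backslash is written once

val same = escaped == raw                  // true: both literals denote the same string

// Quotes need no escaping inside a raw string either
val quoted = """say "hi" now"""
val sameQuoted = quoted == "say \"hi\" now"   // true
```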
A floating point number is an integer followed by an optional decimal part:
val floatPattern = "(" + intPattern + ")" + """(\.[0-9]+)?""" // parentheses keep the "0|" alternative from splitting the pattern
//> floatPattern : String = (0|(\+|-)?[1-9][0-9]*)(\.[0-9]+)?
"23".matches(floatPattern)         //> res9: Boolean = true
"+23".matches(floatPattern)        //> res10: Boolean = true
"-23.0001".matches(floatPattern)   //> res11: Boolean = true
(Note: '.' is a meta-character representing any character, a wildcard. Used as a literal decimal point, it must be preceded by an escape character.)
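The difference between the wildcard and the escaped decimal point shows up as soon as the input contains some other character in that position. A brief sketch, with pattern names chosen only for illustration:

```scala
val unescaped  = """[0-9]+.[0-9]+"""    // . matches any single character
val escapedDot = """[0-9]+\.[0-9]+"""   // \. matches only a decimal point

val r1 = "23x01".matches(unescaped)     // true: x is accepted by the wildcard
val r2 = "23x01".matches(escapedDot)    // false
val r3 = "23.01".matches(escapedDot)    // true
```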
An identifier is any alpha-numeric string beginning with a letter:
val idPattern = """[a-zA-Z][0-9a-zA-Z]*"""   //> idPattern : String = [a-zA-Z][0-9a-zA-Z]*
"z26".matches(idPattern)                     //> res12: Boolean = true
"z".matches(idPattern)                       //> res13: Boolean = true
"2z".matches(idPattern)                      //> res14: Boolean = false
"zebra".matches(idPattern)                   //> res15: Boolean = true
In some languages a token is any string not containing white space characters:
val tokenPattern = """\S+"""   //> tokenPattern : String = \S+
"34^%$(F_=n".matches(tokenPattern) //> res16: Boolean = true
"\"hello world\"".matches(tokenPattern) //> res17: Boolean = false
"\"helloworld\"".matches(tokenPattern) //> res18: Boolean = true
The split method is an easy way to extract tokens from a string:
"balance = balance + amount".split("\\s+") //> res19: Array[String] = Array(balance, =, balance, +, amount)
Scala provides a Regex class:
class Regex {
  def findAllIn(text: String): Iterator[String] =
    iterator over the sequence of all matches
  def replaceAllIn(text: String, sub: String): String =
    result of replacing all matches by sub
  // etc.
}
For example, assume we want to look at each legal token in a program written in language L. Assume tokens in L consist of numbers, identifiers, and operator symbols. Then:
val operatorPattern = """\+|-|\*|/|=|<|>|!="""
val tokenPattern2 = (floatPattern + "|" + idPattern + "|" + operatorPattern).r
Where:
class String {
  def r: Regex =
    converts this into an instance of Regex
  // etc.
}
Using the iterator we can iterate through the list of valid tokens:
val tokens = tokenPattern2.findAllIn("12*x + 23.01 <= pi")
for(next <- tokens) print(next + ", ")   //> 12, *, x, +, 23.01, <, =, pi,
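Two properties of the iterator returned by findAllIn are easy to miss: text that matches no alternative is silently skipped, and the iterator can be traversed only once. The sketch below uses a deliberately tiny, hypothetical token language (words and numbers only) to show both:

```scala
import scala.util.matching.Regex

// Hypothetical token language for this illustration: words or numbers only
val simpleToken: Regex = """[a-zA-Z]+|[0-9]+""".r

val found = simpleToken.findAllIn("x1 = 42 + y")
val tokenList = found.toList     // List("x", "1", "42", "y"): "=", "+" and spaces are skipped
val exhausted = !found.hasNext   // true: the iterator has been consumed
```

Converting to a List (or another strict collection) is a convenient way to keep the tokens around for more than one pass.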
Regex.replaceAllIn is useful for form letters:
var letter = "Dear NAME1, I've decided to leave you and date NAME2 instead. I hope we can be friends, NAME1."
letter = ("NAME1".r).replaceAllIn(letter, "John")
letter = ("NAME2".r).replaceAllIn(letter, "Steve")
letter   //> res21: String = Dear John, I've decided to leave you and date Steve instead. I hope we can be friends, John.
Regex.findAllIn simply skips illegal tokens. Write a function called scan that returns a list of all tokens, but throws an IllegalToken exception if an illegal token is detected:
def scan(text: String, pattern: Regex): List[String] /* or throws IllegalToken */ = ???
Write a function that converts dates of the form mm/dd/yyyy into the European form dd/mm/yyyy.
Scala's match/case expression can be used to implement multi-way conditionals:
def eval(arg1: Int, op: String, arg2: Int) =
  op match {
    case "+" => arg1 + arg2
    case "-" => arg1 - arg2
    case "*" => arg1 * arg2
    case "/" => arg1 / arg2
    case _ => throw new Exception("Unrecognized operator: " + op)
  }
A group is a sub-expression of a regular expression that is surrounded by parentheses.
For example, the following regular expression contains three groups.
val expPattern = """([0-9]+)\s*(\+|\*|-|/)\s*([0-9]+)""".r
Given a string that matches the pattern, we can extract the substrings that match the groups and declare them as variables or constants:
def eval(exp: String) = {
  val expPattern(arg1, op, arg2) = exp // defining constants arg1, op, and arg2
  op match {
    case "+" => arg1.toInt + arg2.toInt
    case "-" => arg1.toInt - arg2.toInt
    case "*" => arg1.toInt * arg2.toInt
    case "/" => arg1.toInt / arg2.toInt
    case _ => throw new Exception("Unrecognized operator: " + op)
  }
}
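The val extraction in eval throws a scala.MatchError when exp does not match the pattern. Using the pattern in a case clause instead lets the caller handle bad input. The following is a sketch of one possible variant, not a replacement for eval; the name evalOpt and the choice of Option[Int] as the result type are assumptions made for this illustration:

```scala
val expPattern = """([0-9]+)\s*(\+|\*|-|/)\s*([0-9]+)""".r

// Hypothetical variant of eval: None instead of an exception on bad input
def evalOpt(exp: String): Option[Int] =
  exp match {
    case expPattern(arg1, op, arg2) =>
      val (a, b) = (arg1.toInt, arg2.toInt)
      op match {
        case "+" => Some(a + b)
        case "-" => Some(a - b)
        case "*" => Some(a * b)
        case "/" => if (b != 0) Some(a / b) else None   // avoid division by zero
        case _   => None
      }
    case _ => None   // exp as a whole does not match the pattern
  }

val ok = evalOpt("12 * 3")     // Some(36)
val bad = evalOpt("12 ** 3")   // None
```

Numbers too large for an Int would still make toInt throw; the sketch does not handle that case.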
Warning: Parentheses used to disambiguate a regular expression get interpreted as groups, even if that is not the intention. For example, suppose we want to allow for signed integers in the above example. We might try:
"""((\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((\+|-)?[0-9]+)""".r
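The parentheses around each sign do more than set the scope of the "?" quantifier: they are also groups. The pattern now contains five groups instead of three, so it can no longer be used to extract just arg1, op, and arg2.
One fix is to replace the inner parentheses with a character class: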
"""([\+-]?[0-9]+)\s*(\+|\*|-|/)\s*([\+-]?[0-9]+)""".r
A student pointed out an alternative. Non-grouping parentheses can be expressed as follows: (?: ... ). For example:
"""((?:\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((?:\+|-)?[0-9]+)""".r
Extraction works because Scala's Regex class implements an extractor method, unapplySeq (the variable-length form of unapply). Given a string that matches the pattern, this method returns the list of substrings that matched the groups.
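The extractor can also be called directly, which makes the effect of extra parentheses visible. On Regex it is named unapplySeq, and it returns one string per group when the whole input matches. A short sketch (the val names are chosen only for this illustration):

```scala
val expPattern  = """([0-9]+)\s*(\+|\*|-|/)\s*([0-9]+)""".r
val fiveGroups  = """((\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((\+|-)?[0-9]+)""".r
val threeGroups = """((?:\+|-)?[0-9]+)\s*(\+|\*|-|/)\s*((?:\+|-)?[0-9]+)""".r

val plain   = expPattern.unapplySeq("3 + 4")      // Some(List("3", "+", "4"))
val five    = fiveGroups.unapplySeq("-3*+4")      // Some(List("-3", "-", "*", "+4", "+"))
val three   = threeGroups.unapplySeq("-3*+4")     // Some(List("-3", "*", "+4"))
val noMatch = expPattern.unapplySeq("3 plus 4")   // None
```

A pattern such as expPattern(arg1, op, arg2) succeeds only when this list has exactly three elements, which is why the five-group version cannot be used with three variables.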
We can automatically generate unapply methods in our own classes simply by qualifying a class declaration as a case class:
case class Exp(arg1: Int, op: Char, arg2: Int) {}
In addition to unapply, case classes also have:
· A companion object with an apply method // so we can construct instances without using new
· Val fields for each constructor parameter
· toString, equals, hashCode, and copy methods
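Each of the generated members listed above can be exercised in a line or two. A brief sketch using the Exp class from the text (the val names are chosen only for illustration):

```scala
case class Exp(arg1: Int, op: Char, arg2: Int)

val e1 = Exp(3, '+', 4)                 // companion apply: no `new` needed
val left = e1.arg1                      // val field: 3
val e2 = e1.copy(op = '*')              // a new Exp that differs only in op
val sameValue = e1 == Exp(3, '+', 4)    // true: equals compares field by field
val text = e1.toString                  // "Exp(3,+,4)"
val Exp(a, op, b) = e2                  // unapply: a = 3, op = '*', b = 4
```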
The following examples show two ways to extract the fields of an expression into local variables:
def eval(e: Exp) =
  e match {
    case Exp(arg1, '+', arg2) => arg1 + arg2
    case Exp(arg1, '-', arg2) => arg1 - arg2
    case Exp(arg1, '*', arg2) => arg1 * arg2
    case Exp(arg1, '/', arg2) => arg1 / arg2
    case _ =>
      throw new Exception("Unrecognized operator: " + e.op)
  }
def eval2(e: Exp) = {
  val Exp(arg1, op, arg2) = e // defining arg1, op, & arg2
  op match {
    case '+' => arg1 + arg2
    case '-' => arg1 - arg2
    case '*' => arg1 * arg2
    case '/' => arg1 / arg2
    case _ => throw new Exception("Unrecognized operator: " + e.op)
  }
}
In data-driven programming (e.g., polymorphism), the flow of control is determined by data, not programmers. More specifically, flow is determined by the class instantiated by the data:
Employee e = new Programmer(); // subsumption
e.print(); // calls Programmer.print
Data determines the flow of control in pattern-driven programming too. The difference is that the lexical patterns instantiated by the data determine the flow. We saw examples earlier with extraction and multi-way conditionals.
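The two styles can be put side by side in Scala. This is only a sketch: Employee, Programmer, Manager, title, and describe are hypothetical names, and title stands in for the print method of the Java fragment so that the result can be inspected:

```scala
// Data-driven: the class of the object selects the method body
trait Employee { def title: String }
class Programmer extends Employee { def title: String = "programmer" }
class Manager extends Employee { def title: String = "manager" }

val e: Employee = new Programmer   // subsumption
val viaClass = e.title             // "programmer": Programmer.title is called

// Pattern-driven: the lexical shape of the string selects the branch
val number = """[0-9]+""".r
val word = """[a-zA-Z]+""".r

def describe(token: String): String =
  token match {
    case number() => "number"   // a pattern with no groups extracts nothing
    case word()   => "word"
    case _        => "other"
  }

val viaPattern = describe("42")    // "number"
```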
Prolog, Datalog, and Proplog interpreters use pattern-driven control. Goals are matched to appropriate facts and rules using a sophisticated pattern matching algorithm called unification.