Regular Expressions

The language of regular expressions, Regex, is a language for describing text patterns.

If r is a regular expression, then L(r) = the set of all strings that match r.

Regular Expressions (in Java)

An ordinary character (one that is not a regex meta-character) is a regular expression that matches itself:

L(a) = {a}, L(1) = {1}, L(%) = {%}, L( ) = { } (a single space), etc.

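Regex meta-characters intended as literals must be preceded by the escape character (\). Inside a Java string literal the backslash must itself be escaped by a second backslash, so the patterns below are written as they appear between the quotes in Java source (the double quote is not a meta-character; it is escaped only for the Java compiler):

L(\\+) = {+}, L(\\|) = {|}, L(\\)) = {)}, L(\\\\) = {\}, L(\") = {"}, etc.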
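
The escaping rules above can be checked directly with String.matches. This is a minimal sketch; the class EscapeDemo and the helper matchesAll are names made up for illustration and are not part of the notes.

```java
class EscapeDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesAll(String input, String regex) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Java source "\\+" is the regex \+ which matches a literal plus sign.
        System.out.println(matchesAll("+", "\\+"));   // true
        System.out.println(matchesAll("|", "\\|"));   // true
        System.out.println(matchesAll(")", "\\)"));   // true
        // Java source "\\\\" is the regex \\ which matches one backslash.
        System.out.println(matchesAll("\\", "\\\\")); // true
        // The quote is escaped for the Java compiler only, not for the regex engine.
        System.out.println(matchesAll("\"", "\""));   // true
        // Unescaped, + is a quantifier: the regex a+ does not match the text "a+".
        System.out.println(matchesAll("a+", "a+"));   // false
    }
}
```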

If r and s are regular expressions, then so are rs and r|s:

L(rs) = L(r)L(s) = {u~v | u is in L(r) and v is in L(s)} // concatenation (u~v is u followed by v)

L(r|s) = L(r) U L(s) // union or choice

For example:

L(abcde) = {abcde}

L(a|b|c|d|e) = {a, b, c, d, e}
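
For small finite languages, the two set definitions can be computed directly and compared with what the regex engine accepts. The sketch below assumes L(a|b) = {a, b} and L(c|d) = {c, d}; the class LanguageOps and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class LanguageOps {
    // L(r)L(s): every string of L(r) followed by every string of L(s).
    static Set<String> concat(Set<String> lr, Set<String> ls) {
        Set<String> result = new TreeSet<>();
        for (String u : lr) {
            for (String v : ls) {
                result.add(u + v);
            }
        }
        return result;
    }

    // L(r) U L(s): strings that belong to either language.
    static Set<String> union(Set<String> lr, Set<String> ls) {
        Set<String> result = new TreeSet<>(lr);
        result.addAll(ls);
        return result;
    }

    public static void main(String[] args) {
        Set<String> lr = new TreeSet<>(Arrays.asList("a", "b")); // L(a|b)
        Set<String> ls = new TreeSet<>(Arrays.asList("c", "d")); // L(c|d)
        System.out.println(concat(lr, ls)); // [ac, ad, bc, bd]
        System.out.println(union(lr, ls));  // [a, b, c, d]
        // Each concatenated string is accepted by the regex (a|b)(c|d).
        for (String w : concat(lr, ls)) {
            System.out.println(w + " " + w.matches("(a|b)(c|d)"));
        }
    }
}
```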

Regular expressions can be followed by a quantifier (?, +, *):

L(r?) = L(r|"") // "" = the empty string

L(r+) = L(r|rr|rrr|...)

L(r*) = L(""|r|rr|rrr|...)

For example:

L((0|1)+) = all non-empty binary strings

L((\\+|-)?(0|1|2|3|4|5|6|7|8|9)+) = signed and unsigned decimal integers
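
The quantifier examples can be tried out with String.matches. This is a sketch only; QuantifierDemo and its constants are illustrative names, and the plus sign is written \\+ because the pattern lives in a Java string literal.

```java
class QuantifierDemo {
    static final String BINARY = "(0|1)+";
    static final String INTEGER = "(\\+|-)?(0|1|2|3|4|5|6|7|8|9)+";

    public static void main(String[] args) {
        System.out.println("0110".matches(BINARY));  // true
        System.out.println("".matches(BINARY));      // false: + needs at least one repetition
        System.out.println("".matches("(0|1)*"));    // true: * also allows zero repetitions
        System.out.println("".matches("a?"));        // true: ? allows the empty string
        System.out.println("aa".matches("a?"));      // false: ? allows at most one a
        System.out.println("-42".matches(INTEGER));  // true
        System.out.println("42".matches(INTEGER));   // true: the sign is optional
        System.out.println("+-42".matches(INTEGER)); // false: at most one sign
    }
}
```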

Parentheses override the default precedence (quantifiers bind tightest, then concatenation, then |):

L(01+) = {01, 011, 0111, ... }, but L((01)+) = {01, 0101, 010101, ... }

L(0|12) = {0, 12}, but L((0|1)2) = {02, 12}
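
A short check of the four patterns above, as a sketch; the class name GroupingDemo and the helper accepts are made up for illustration.

```java
class GroupingDemo {
    // Hypothetical helper: true if s is in L(regex).
    static boolean accepts(String regex, String s) {
        return s.matches(regex);
    }

    public static void main(String[] args) {
        // Without parentheses, + applies only to the 1.
        System.out.println(accepts("01+", "0111"));   // true
        System.out.println(accepts("01+", "0101"));   // false
        // With parentheses, + applies to the whole group 01.
        System.out.println(accepts("(01)+", "0101")); // true
        System.out.println(accepts("(01)+", "0111")); // false
        // Concatenation binds tighter than |, so 0|12 means 0 or 12.
        System.out.println(accepts("0|12", "12"));    // true
        System.out.println(accepts("0|12", "02"));    // false
        System.out.println(accepts("(0|1)2", "02"));  // true
    }
}
```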

Many predefined abbreviations (character classes) are documented in java.util.regex.Pattern, for example:

[abcde] = [a-e] = a|b|c|d|e

\\s matches any single whitespace character (space, tab, newline, etc.)
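
The two abbreviations above behave as follows. This is a minimal sketch in which the class name CharClassDemo is an assumption made for illustration.

```java
class CharClassDemo {
    public static void main(String[] args) {
        // [abcde] and [a-e] describe the same set of one-character strings.
        System.out.println("c".matches("[abcde]")); // true
        System.out.println("c".matches("[a-e]"));   // true
        System.out.println("f".matches("[a-e]"));   // false
        // \\s accepts a space, a tab, or a newline, but not a letter.
        System.out.println(" ".matches("\\s"));     // true
        System.out.println("\t".matches("\\s"));    // true
        System.out.println("\n".matches("\\s"));    // true
        System.out.println("x".matches("\\s"));     // false
        // Abbreviations combine with concatenation like any other regex.
        System.out.println("a b".matches("[a-e]\\s[a-e]")); // true
    }
}
```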

String.matches & String.split

Java's String class provides matches and split methods:

class String {
   boolean matches(String regex) { ... }
   String[] split(String regex) { ... }
   // etc.
}
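
The demo below exercises matches only, so here is a small sketch of split, which treats every substring matching the regex as a delimiter. The class SplitDemo and the helpers words and fields are hypothetical names, not part of the String API.

```java
import java.util.Arrays;

class SplitDemo {
    // Split on runs of whitespace, after trimming the ends.
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    // Split on commas, ignoring whitespace around each comma.
    static String[] fields(String line) {
        return line.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("  the quick  brown\tfox ")));
        // [the, quick, brown, fox]
        System.out.println(Arrays.toString(fields("a, b ,c")));
        // [a, b, c]
        System.out.println(Arrays.toString("3+4*5".split("\\+|\\*")));
        // [3, 4, 5]
    }
}
```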

Demo

public class Demo1 {
  
   public static String intRegEx = "0|(\\+|-)?[1-9][0-9]*";
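   // optional sign, then 0 or a non-zero integer part, then an optional fraction (so 0.5 also matches)
   public static String floatRegEx = "(\\+|-)?(0|[1-9][0-9]*)(\\.[0-9]+)?";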
   public static String nameRegEx = "[a-zA-Z][0-9a-zA-Z]*";
  
   public static void test1() {
      System.out.println("-32".matches(intRegEx));
      System.out.println("-32".matches(floatRegEx));
      System.out.println("3.14".matches(floatRegEx));
      System.out.println("3.14".matches(intRegEx));
      System.out.println("HelloMars29".matches(nameRegEx));
      System.out.println("Hello Mars".matches(nameRegEx));
   }
}

Output:

true
true
true
false
true
false

Patterns and Matchers

The JDK contains classes representing regular expressions and matcher machines:

java.util.regex.Pattern
java.util.regex.Matcher

(Be sure to import java.util.regex.* in your programs.)

Regular expressions written as strings can be compiled into regular expression objects called patterns.

Given an input string, a pattern's matcher method returns a device called a matcher.

A matcher allows users to extract and replace substrings of the input that match the pattern.
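
The compile, matcher, find sequence can be sketched as follows. Besides group(), a matcher also reports where each match starts and ends (the end index is exclusive). The class FindDemo and the helper describe are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Hypothetical helper: list each match as text@start-end.
    static String describe(String regex, String input) {
        Pattern pattern = Pattern.compile(regex); // string -> pattern
        Matcher m = pattern.matcher(input);       // pattern + input -> matcher
        StringBuilder sb = new StringBuilder();
        while (m.find()) {                        // advance to the next match
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.group()).append('@')
              .append(m.start()).append('-').append(m.end());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("[0-9]+", "ab12cd345")); // 12@2-4 345@6-9
    }
}
```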

Demos

Test1 compiles a pattern for matching floats and creates a matcher for a mathematical formula. The matcher is then used to extract the floats that appear in the formula:

   public static void test1() {
      // optional sign, then 0 or a non-zero integer part, then an optional fraction
      String floatRegEx = "(\\+|-)?(0|[1-9][0-9]*)(\\.[0-9]+)?";
      Pattern floatPattern = Pattern.compile(floatRegEx);
      Matcher m = floatPattern.matcher("pi = 3.14 && e = 2.718");
      while(m.find()) {
         System.out.println(m.group());  
      }
      m.reset(); // reset the matcher to go again
   }

The output:

3.14
2.718

Test2 shows how substrings of substrings can be extracted. Notice that dateRegEx groups month, day, and year digits with parentheses. A matcher for a date string can extract the substrings matching these groups:

   public static void test2() {
      String dateRegEx = "([0-9][0-9])/([0-9][0-9])/([0-9][0-9])";
      Pattern datePattern = Pattern.compile(dateRegEx);
      Matcher m = datePattern.matcher("02/04/15");
      if (m.matches()) {
         int month = Integer.parseInt(m.group(1));
         int day = Integer.parseInt(m.group(2));
         int year = Integer.parseInt(m.group(3));
         System.out.println("date = " + day + "/" + month + "/" + year);
      } else {
         System.out.println("no matches");
      }
   }

Output:

date = 4/2/15

Test3 shows how a matcher can replace substrings with alternate text:

   public static void test3() {
      Pattern damnPattern = Pattern.compile("damn");
      String unclean = "Frankly my dear, I don't give a damn.";
      Matcher m = damnPattern.matcher(unclean);
      String clean = m.replaceAll("darn");
      System.out.println(clean);
   }

Output:

Frankly my dear, I don't give a darn.
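
Groups and replacement can also be combined: in the replacement string passed to Matcher.replaceAll, $1, $2, ... stand for the text captured by the corresponding group. The sketch below reuses the date pattern from Test2 to turn month/day/year into day/month/year; the class SwapDemo and the method monthDayToDayMonth are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SwapDemo {
    // Rewrite every mm/dd/yy date in the text as dd/mm/yy.
    static String monthDayToDayMonth(String text) {
        Pattern datePattern =
            Pattern.compile("([0-9][0-9])/([0-9][0-9])/([0-9][0-9])");
        Matcher m = datePattern.matcher(text);
        // $2 = day group, $1 = month group, $3 = year group
        return m.replaceAll("$2/$1/$3");
    }

    public static void main(String[] args) {
        System.out.println(monthDayToDayMonth("Due 02/04/15, revised 12/25/15."));
        // Due 04/02/15, revised 25/12/15.
    }
}
```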