The language of regular expressions, Regex, is a language for describing text patterns.
If r is a regular expression, then L(r) = the set of all strings that match r.
A character is a regular expression that matches itself:
L(a) = {a}, L(1) = {1}, L($) = {$}, L(\s) = { }, etc.
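Regex meta-characters intended as object-language literals must be preceded by the escape character (\). Inside a Java string literal the backslash is itself an escape character, so it must in turn be escaped by a second backslash. The double quote is not a regex meta-character and needs only Java's own escape:
L(\\+) = {+}, L(\\|) = {|}, L(\\)) = {)}, L(\\\\) = {\}, L(\") = {"}, etc.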
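The double escaping described above can be checked directly in Java. The following is a minimal sketch; the class name EscapeSketch and the helper fullMatch are illustrative names, not part of these notes, and Pattern.quote is shown only as one convenient alternative to escaping by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class EscapeSketch {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The Java literal "\\+" is the regex \+ : a literal plus sign.
        System.out.println(fullMatch("\\+", "+"));   // true
        // The Java literal "\\\\" is the regex \\ : a single literal backslash.
        System.out.println(fullMatch("\\\\", "\\")); // true
        // Unescaped, | means choice, so the three-character string a|b does not match.
        System.out.println(fullMatch("a|b", "a|b")); // false
        // Pattern.quote builds a regex that matches its argument literally.
        System.out.println(fullMatch(Pattern.quote("a|b"), "a|b")); // true
    }
}
```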
If r and s are regular expressions, then so are rs and r|s:
L(rs) = L(r)L(s) = {u~v | u is in L(r) and v is in L(s)} // concatenation
L(r|s) = L(r) U L(s) // union or choice
For example:
L(abcde) = {abcde}
L(a|b|c|d|e) = {a, b, c, d, e}
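To make the set definitions concrete, the sketch below computes concatenation and union on small finite languages represented as sets of strings. It illustrates the definitions only and is not how a regex engine works; LanguageOps, setOf, concat, and union are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class modelling finite languages as sorted sets of strings.
final class LanguageOps {
    static Set<String> setOf(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // L(r)L(s): every string of a followed by every string of b.
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>();
        for (String u : a) {
            for (String v : b) {
                out.add(u + v);
            }
        }
        return out;
    }

    // L(r) U L(s): every string that is in a or in b.
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        Set<String> a = setOf("a"), b = setOf("b"), c = setOf("c");
        System.out.println(concat(concat(a, b), c)); // [abc]
        System.out.println(union(union(a, b), c));   // [a, b, c]
        System.out.println(concat(union(a, b), c));  // [ac, bc]
    }
}
```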
Regular expressions can be followed by a quantifier (?, +, *):
L(r?) = L(r|"") // "" = the empty string
L(r+) = L(r|rr|rrr|...)
L(r*) = L(""|r|rr|rrr|...)
For example:
L((0|1)+) = all non-empty binary strings
L((\+|-)?(0|1|2|3|4|5|6|7|8|9)+) = signed and unsigned integers
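The two quantifier examples above can be tried with String.matches, which succeeds only when the whole string matches. This is a small sketch; the class and method names are made up for illustration.

```java
// Hypothetical demo class for the quantifier examples.
final class QuantifierSketch {
    static final String BINARY = "(0|1)+";
    static final String SIGNED_INT = "(\\+|-)?(0|1|2|3|4|5|6|7|8|9)+";

    static boolean isBinary(String s) {
        return s.matches(BINARY);
    }

    static boolean isInteger(String s) {
        return s.matches(SIGNED_INT);
    }

    public static void main(String[] args) {
        System.out.println(isBinary("1011"));       // true
        System.out.println(isBinary(""));           // false: + needs at least one repetition
        System.out.println("".matches("(0|1)*"));   // true: * allows zero repetitions
        System.out.println(isInteger("42"));        // true: the sign is optional
        System.out.println(isInteger("-42"));       // true
        System.out.println(isInteger("+007"));      // true: leading zeros are allowed
        System.out.println(isInteger("-"));         // false: at least one digit is required
    }
}
```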
Parentheses can be used to resolve ambiguities:
L(01+) = {01, 011, 0111, ... }, but L((01)+) = {01, 0101, 010101, ... }
L(0|12) = {0, 12}, but L((0|1)2) = {02, 12}
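The effect of the parentheses can be confirmed with a few whole-string matches. In this sketch, GroupingSketch and accepts are illustrative names only.

```java
// Hypothetical demo class showing how grouping changes what a quantifier or choice applies to.
final class GroupingSketch {
    // True when the whole string s is in L(regex).
    static boolean accepts(String regex, String s) {
        return s.matches(regex);
    }

    public static void main(String[] args) {
        // In 01+ the quantifier applies to the 1 only.
        System.out.println(accepts("01+", "0111"));   // true
        System.out.println(accepts("01+", "0101"));   // false
        // In (01)+ the quantifier applies to the whole group 01.
        System.out.println(accepts("(01)+", "0101")); // true
        // In 0|12 the choice is between 0 and 12.
        System.out.println(accepts("0|12", "12"));    // true
        System.out.println(accepts("0|12", "02"));    // false
        // In (0|1)2 the choice is between 0 and 1, followed by 2.
        System.out.println(accepts("(0|1)2", "02"));  // true
    }
}
```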
There are many predefined abbreviations, documented in java.util.regex.Pattern:
[abcde] = [a-e] = a|b|c|d|e
\\s matches any whitespace character
Java's String class provides matches and split methods:
class String {
    boolean matches(String regex) { ... }
    String[] split(String regex) { ... }
    // etc.
}
public class Demo1 {
    public static String intRegEx = "0|(\\+|-)?[1-9][0-9]*";
    public static String floatRegEx = "0|(\\+|-)?[1-9][0-9]*(\\.[0-9]+)?";
    public static String nameRegEx = "[a-zA-Z][0-9a-zA-Z]*";

    public static void test1() {
        System.out.println("-32".matches(intRegEx));
        System.out.println("-32".matches(floatRegEx));
        System.out.println("3.14".matches(floatRegEx));
        System.out.println("3.14".matches(intRegEx));
        System.out.println("HelloMars29".matches(nameRegEx));
        // The space is not matched by nameRegEx, so this prints false.
        System.out.println("Hello Mars".matches(nameRegEx));
    }
}
Output:
true
true
true
false
true
false
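Demo1 exercises matches only. The split method can be sketched in the same spirit: the regex describes the separator, and the pieces between separators are returned. The class name SplitSketch, the helper fields, and the sample strings are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical demo class for String.split.
final class SplitSketch {
    // Splits a line on commas, ignoring whitespace around each comma.
    static String[] fields(String line) {
        return line.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("red , green,blue")));
        // [red, green, blue]
        System.out.println(Arrays.toString("to be  or not".split("\\s+")));
        // [to, be, or, not]
    }
}
```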
The JDK contains classes representing regular expressions and matcher machines:
java.util.regex.Pattern
java.util.regex.Matcher
(Be sure to import java.util.regex.* in your programs.)
Regular expressions written as strings can be compiled into regular expression objects called patterns.
Given an input string, a pattern's matcher method returns a device called a matcher.
A matcher allows users to extract and replace substrings of the input that match the pattern.
Test1 creates a pattern for matching floats. Given a mathematical formula, it creates a matcher, which is then used to extract the floats that appear in the formula:
public static void test1() {
    String floatRegEx = "0|(\\+|-)?[1-9][0-9]*(\\.[0-9]+)?";
    Pattern floatPattern = Pattern.compile(floatRegEx);
    Matcher m = floatPattern.matcher("pi = 3.14 && e = 2.718");
    while (m.find()) {
        System.out.println(m.group());
    }
    m.reset(); // reset the matcher to go again
}
The output:
3.14
2.718
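Besides the matched text, a matcher also reports where each match was found, through its start and end methods. The sketch below reuses the float regex from Test1; the class name FindSketch and the helper locate are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that records the position of every match.
final class FindSketch {
    static final Pattern FLOAT =
        Pattern.compile("0|(\\+|-)?[1-9][0-9]*(\\.[0-9]+)?");

    // Describes each match as text@[start,end), where end is exclusive.
    static List<String> locate(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = FLOAT.matcher(input);
        while (m.find()) {
            hits.add(m.group() + "@[" + m.start() + "," + m.end() + ")");
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(locate("pi = 3.14 && e = 2.718"));
        // [3.14@[5,9), 2.718@[17,22)]
    }
}
```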
Test2 shows how substrings of substrings can be extracted. Notice that dateRegEx groups month, day, and year digits with parentheses. A matcher for a date string can extract the substrings matching these groups:
public static void test2() {
    String dateRegEx = "([0-9][0-9])/([0-9][0-9])/([0-9][0-9])";
    Pattern datePattern = Pattern.compile(dateRegEx);
    Matcher m = datePattern.matcher("02/04/15");
    if (m.matches()) {
        int month = Integer.parseInt(m.group(1));
        int day = Integer.parseInt(m.group(2));
        int year = Integer.parseInt(m.group(3));
        System.out.println("date = " + day + "/" + month + "/" + year);
    } else {
        System.out.println("no matches");
    }
}
Output:
date = 4/2/15
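Groups can also be given names, which some readers find easier to follow than numbered groups. The following sketch is a variation on Test2 using Java's (?<name>...) syntax and Matcher.group(String); the class name NamedGroupSketch and the method dayFirst are illustrative and not part of the original demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation of Test2 that uses named groups instead of numbered ones.
final class NamedGroupSketch {
    static final Pattern DATE = Pattern.compile(
        "(?<month>[0-9][0-9])/(?<day>[0-9][0-9])/(?<year>[0-9][0-9])");

    // Returns day/month/year, or null when the input is not a mm/dd/yy date.
    static String dayFirst(String date) {
        Matcher m = DATE.matcher(date);
        if (!m.matches()) {
            return null;
        }
        int month = Integer.parseInt(m.group("month"));
        int day = Integer.parseInt(m.group("day"));
        int year = Integer.parseInt(m.group("year"));
        return day + "/" + month + "/" + year;
    }

    public static void main(String[] args) {
        System.out.println(dayFirst("02/04/15")); // 4/2/15
        System.out.println(dayFirst("2/4/15"));   // null: each field needs two digits
    }
}
```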
Test3 shows how a matcher can replace substrings with alternate text:
public static void test3() {
    Pattern damnPattern = Pattern.compile("damn");
    String unclean = "Frankly my dear, I don't give a damn.";
    Matcher m = damnPattern.matcher(unclean);
    String clean = m.replaceAll("darn");
    System.out.println(clean);
}
Output:
Frankly my dear, I don't give a darn.
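The replacement text passed to replaceAll may refer back to groups of the match with $1, $2, and so on. As a minimal sketch, the example below combines the date regex from Test2 with replaceAll to swap month and day wherever a date occurs; SwapSketch, dayFirst, and the sample sentence are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class combining groups with replaceAll.
final class SwapSketch {
    static final Pattern DATE =
        Pattern.compile("([0-9][0-9])/([0-9][0-9])/([0-9][0-9])");

    // Rewrites every mm/dd/yy date in the text as dd/mm/yy.
    static String dayFirst(String text) {
        // $1, $2, $3 stand for the text matched by groups 1, 2, and 3.
        return DATE.matcher(text).replaceAll("$2/$1/$3");
    }

    public static void main(String[] args) {
        System.out.println(dayFirst("Due 02/04/15, not 12/31/14."));
        // Due 04/02/15, not 31/12/14.
    }
}
```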