Regular Expressions and State Machines

What's a Language?

A set of strings.

A formal language is a set of strings that is defined by some formal mechanism

What's a REGEX?

Regex is short for Regular Expression. This is an expression (a string, really) for matching a set of strings according to certain syntax rules.

Regular Expression and NFA by Example


There are various syntaxes used for regular expressions, Grep, Java 4 and later, Javascript,
Perl, PHP, SQL, all have their own variations.   THere are also a couple of POSIX standard regex syntaxes that are used in many applications.

Example regular expression (a.k.a., regex)


(letter)(letter + digit)*
+
('-' + epsilon) digit digit*

Explanation of regular expression notation used in the above example
- '+' means 'or'
- '*' means Kleene Closure; note no length restriction
- letter stands for A+B+C+...+Z+a+b+c+...+z
- digit stands for 0+1+2+...+9
- epsilon means empty string

Informally, the language defined by this regular expression is,

The set of string that are either identifers (initial letter followed by zero or more digits or letters) OR integer literals (one or more digits, optionally prefixed with a minus sign).


Regular expressions are used in:
- Perl programming
- advanced search patterns (text search such as in editors or web search engines)
- Grep, awk, and other Unix utilities

Converting regular expression to NFA

[ draw the NFA on the board as a state diagram ]

Transitions and final states


d(0, '-') -> 1
d(0, digit) -> 2
d(0, letter) -> 3

d(1, digit) -> 2

d(2, digit) -> 2

d(3, letter) -> 4
d(3, digit) -> 4

d(4, letter) -> 4
d(4, digit) -> 4

f = {2,3,4}


If the machine sees an input char and there is no directed edge out of
the current state labeled such, it fails.

Real world setting would add these edges, to continue processing after whitespace (ws)

d(0, ws) -> 0
d(2, ws) -> 0
d(3, ws) -> 0
d(4, ws) -> 0

With ws edges, the machine enters and leaves final states,
an acceptor would stop at the end of the input string and announce T/F
whether the whole string is accepted.
Compiler people modify this model an emit many tokens for one big input string,
the contents of the entire source code file.

Emitting tokens when leaving final state - gives longest match.

Different strategies for accepting tokens: greedy vs. reluctant

A new improved example regular set example, minimal state NFA
[ not done in class ]

d(0, '-') -> 1
d(0, digit) -> 2
d(0, letter) -> 4
d(0, ws) -> 0

d(1, digit) -> 2

d(2, digit) -> 2
d(2, ws) -> 0

d(4, letter) -> 4
d(4, digit) -> 4
d(4, ws) -> 0

f = {2,4}

Incremental processing inside the IDE

Picture this. Your favorite IDE colorizes input that you type into the editing buffer. Identifers (initial letter followed by zero or more digits or letters) are colored blue. Integer literals (one or more digits, optionally prefixed with a minus sign) are colored red. How does the IDE do this?


[ note, this example uses the minimal NFA, above ]

By recording the NFA state(s) for each character.

 5 1   f o o     - 3 1
0 2 2 0 4 4 4 0 0 1 2 2

Now user types/inserts letter j :


 j 5 1   f o o     - 3 1
0 4 4 4 0 4 4 4 0 0 1 2 2
  ^^^^^
  only 3 character states change, the rest of the buffer's character states
  remain the same.

propagation rule for incremental lexical analysis says we work left to right
updating states until there is no change.  There is no need to re-scan the
entire buffer, only the propagation from the insertion point.


Color Table
 0 - white
 1 - red
 2 - red
 3 - 
 4 - blue

The IDE colorizes by looking up each character state in the color table.
(also the color table could be user-defined)

[Caution]Caution
The material below is not covered during lecture, thus is material that students need not know for the mid-term and final exams. It is here for people who are interested.

Basics for Finite Automata Proofs

Construct DFA from NFA using power set of states method.

Examples of Non-regular sets

Remember a language is a set (of strings) so we can say, "regular set" or "regular language" and this would mean the same thing.

Example of a non-regular set:

a^nb^n

In other words, this set,

{"", "ab", "aabb", "aaabbb", ... }

Intuitive reasonings -- this language cannot be regular, because there cannot be enough states to count an arbitrary number of a's.

For the mathematically inclined, we have techniques for proving that a set is not regular. The pumping lemma (for regular sets) is one way to prove that a language is not regular. Formally, shows that no possible FA could be written to accept the language. Beginners are often troubled by the pumping lemma for regular sets because it involves establishing a contradiction on a statement that involves nested "forall" and "exists" expressions.

Pumping Lemma for Regular Sets


Let L be define as,


a^n b^n

Then pick any string, w, that is in the set L,

w = x y z

Remember that w can be any string, and it can be broken up into x y z in any

Three cases:

1. y is all a's
2. y is in a+b+
3. y is all b's

Now for each possible case, show that pumping (meaning repeating 0 or more y's) "screws things up", because the
resulting string would not be in L.

1. pump 0 times, xz has fewer a's so it's not in L
2. pump twice, xyyz looks like a's, b's, a's, b's and cannot be in L.
3. pump 0 times, xz has fewer b's so it's not in L

Since all cases we can produce strings not in L, L is not regular.

A real world example

The set of all Java programs is not a regular set.

Consider these tiny Java programs, which compile but are otherwise useless,

class X1 {
}

class X1 {
    class X2 {
    }
}

class X1 {
    class X2 {
        class X3 {
        }
    }
}

Intuitively, if we map the string "class X* {" to the sting "a" and we map the string "}" to the string "b" we get the language L, defined in the previous example.

Intuitively, since L is non-regular, we should observe Java is non-regular.

For the mathematically inclined, we have another technique for proving that a language is non-regular, by defining a homomorphism. Here's how we could do that formally for the set of all Java programs.

Let h be a homomorphism mapping characters from Java programs to either "a" or "b"

h("{") = "a"
h("}") = "b"
h(anything else) =  empty string.

Then h(Java) = L. Because regular sets are closed under homomorphism, Java is not a regular set.