A set of strings.
A formal language is a set of strings that is defined by some formal mechanism
Regex is short for Regular Expression. This is an expression (a string, really) for matching a set of strings according to certain syntax rules.
There are various syntaxes used for regular expressions, Grep, Java 4 and later, Javascript,
Perl, PHP, SQL, all have their own variations. THere are also a couple of POSIX standard regex syntaxes that are used in many applications.
Example regular expression (a.k.a., regex)
(letter)(letter + digit)*
+
('-' + epsilon) digit digit*
The set of string that are either identifers (initial letter followed by zero or more digits or letters) OR integer literals (one or more digits, optionally prefixed with a minus sign).
Regular expressions are used in:
- Perl programming
- advanced search patterns (text search such as in editors or web search engines)
- Grep, awk, and other Unix utilities
Converting regular expression to NFA
[ draw the NFA on the board as a state diagram ]
Transitions and final states
d(0, '-') -> 1
d(0, digit) -> 2
d(0, letter) -> 3
d(1, digit) -> 2
d(2, digit) -> 2
d(3, letter) -> 4
d(3, digit) -> 4
d(4, letter) -> 4
d(4, digit) -> 4
f = {2,3,4}
d(0, ws) -> 0
d(2, ws) -> 0
d(3, ws) -> 0
d(4, ws) -> 0
d(0, '-') -> 1
d(0, digit) -> 2
d(0, letter) -> 4
d(0, ws) -> 0
d(1, digit) -> 2
d(2, digit) -> 2
d(2, ws) -> 0
d(4, letter) -> 4
d(4, digit) -> 4
d(4, ws) -> 0
f = {2,4}
Picture this. Your favorite IDE colorizes input that you type into the editing buffer. Identifers (initial letter followed by zero or more digits or letters) are colored blue. Integer literals (one or more digits, optionally prefixed with a minus sign) are colored red. How does the IDE do this?
[ note, this example uses the minimal NFA, above ]
By recording the NFA state(s) for each character.
5 1 f o o - 3 1
0 2 2 0 4 4 4 0 0 1 2 2
Now user types/inserts letter j :
j 5 1 f o o - 3 1
0 4 4 4 0 4 4 4 0 0 1 2 2
^^^^^
only 3 character states change, the rest of the buffer's character states
remain the same.
propagation rule for incremental lexical analysis says we work left to right
updating states until there is no change. There is no need to re-scan the
entire buffer, only the propagation from the insertion point.
Color Table
0 - white
1 - red
2 - red
3 -
4 - blue
The IDE colorizes by looking up each character state in the color table.
(also the color table could be user-defined)
![]() | Caution |
|---|---|
| The material below is not covered during lecture, thus is material that students need not know for the mid-term and final exams. It is here for people who are interested. |
Remember a language is a set (of strings) so we can say, "regular set" or "regular language" and this would mean the same thing.
Example of a non-regular set:
a^nb^n
In other words, this set,
{"", "ab", "aabb", "aaabbb", ... }
Intuitive reasonings -- this language cannot be regular, because there cannot be enough states to count an arbitrary number of a's.
For the mathematically inclined, we have techniques for proving that a set is not regular. The pumping lemma (for regular sets) is one way to prove that a language is not regular. Formally, shows that no possible FA could be written to accept the language. Beginners are often troubled by the pumping lemma for regular sets because it involves establishing a contradiction on a statement that involves nested "forall" and "exists" expressions.
Let L be define as,
a^n b^n
w = x y z
The set of all Java programs is not a regular set.
Consider these tiny Java programs, which compile but are otherwise useless,
class X1 {
}
class X1 {
class X2 {
}
}
class X1 {
class X2 {
class X3 {
}
}
}
Intuitively, if we map the string "class X* {" to the sting "a" and we map the string "}" to the string "b" we get the language L, defined in the previous example.
Intuitively, since L is non-regular, we should observe Java is non-regular.
For the mathematically inclined, we have another technique for proving that a language is non-regular, by defining a homomorphism. Here's how we could do that formally for the set of all Java programs.
Let h be a homomorphism mapping characters from Java programs to either "a" or "b"
h("{") = "a"
h("}") = "b"
h(anything else) = empty string.
Then h(Java) = L. Because regular sets are closed under homomorphism, Java is not a regular set.