awk, another is the use of the grep utility, and yet another is in shell scripts. Regular expressions are used to match patterns of characters.
The simplest way to match a pattern of characters is to write down the characters themselves. For example, to match the characters main, you can use the regular expression:
main
By the way, blanks are treated just like any other character in a regular expression and regular expressions are case-sensitive (uppercase and lowercase letters are different).
A regular expression is one or more characters that describe the sequence of characters that we want to match.
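As a minimal sketch of literal matching, the following uses Python's re module as a stand-in for the Unix utility you are working with (an assumption; the helper name matches and the sample lines are made up for illustration). It shows that the pattern is case-sensitive, that it is found even inside a longer word, and that blanks count as ordinary characters.

```python
import re

def matches(pattern: str, line: str) -> bool:
    """Return True if the pattern occurs anywhere in the line (a grep-style search)."""
    return re.search(pattern, line) is not None

# Hypothetical sample lines, one per entry, as if read from a file.
lines = ["int main(void)", "int Main(void)", "the domain name", "m a i n"]

# Keep only the lines that contain the literal characters main.
hits = [line for line in lines if matches("main", line)]
print(hits)

# A blank inside the pattern must be matched by a blank in the line.
print(matches("a b", "a b c"), matches("a b", "ab c"))
```

Note that the line containing domain is selected too, because the characters main appear inside it.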
If this were the only way to express character patterns, regular expressions would not be very useful. What makes regular expressions useful is the ability to do things such as matching the word main but not when it appears as part of another word like domain or mainEvent.
Certain characters have special meanings when used in regular expressions. These are:
\ ^ $ . [ ] * + ? ( ) |
We'll explain them along the way. If you need to match a pattern that includes some of these characters, then you can force the character NOT to be treated as a metacharacter by placing the \ character in front of it.
For example, to match the characters $main, you must use the regular expression:
\$main
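The effect of the backslash can be checked with a short sketch. Python's re module is used here as an assumed stand-in for the Unix tools, and the sample strings are invented for illustration. Without the backslash, the $ keeps its special meaning and the pattern finds nothing in these strings.

```python
import re

# Hypothetical sample strings.
lines = ["$main", "main", "x$main y"]

# With the backslash, $ is an ordinary character that must appear in the text.
escaped_hits = [s for s in lines if re.search(r"\$main", s)]

# Without the backslash, $ is a metacharacter, so the literal $ is never matched.
unescaped_hits = [s for s in lines if re.search(r"$main", s)]

print(escaped_hits)
print(unescaped_hits)
```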
How would you match the following patterns? (Write down the regular expressions.)
[In these exercises, you will not be able to test directly to see if your regular expression is formed correctly. Use the Unix utility or command that led you to this module to do that. You may have to create some simple files to help you.]
hw1
hw2*
/hw1
#include
In a regular expression, the . (period or dot) character matches any single character. If you want to match the character patterns:
message1
message2
message8
messageX
you can simply use the regular expression:
message.
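Here is a small sketch of the dot in action, again assuming Python's re module in place of the Unix utility. The last candidate is added for illustration: it does not match, because the dot needs one character to stand in for.

```python
import re

# The four patterns from the text, plus one string with nothing after "message".
candidates = ["message1", "message2", "message8", "messageX", "message"]

# The dot matches exactly one character, whatever it is.
dot_hits = [s for s in candidates if re.search(r"message.", s)]
print(dot_hits)
```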
If you want to match a character pattern but only if it occurs at the beginning of a line, then place the ^ character before the pattern. For example, to match the word apple but only if it appears at the beginning of the line, you would use the regular expression:
^apple
If you want to match a character pattern but only if it occurs at the end of a line, then place a $ character after the pattern. For example, to match the word event but only if it occurs at the end of a line, you would use the regular expression:
event$
Note: The regular expression:
^$
will match any empty line, that is, a line containing no characters at all (not even blanks).
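The two anchors can be tried out with the sketch below. It assumes Python's re module and applies each pattern one line at a time, the way grep does; the sample text is made up for illustration.

```python
import re

# Hypothetical file contents; the fifth line is empty.
text = "apple pie\npineapple\nmain event\nevent log\n\nlast"
lines = text.split("\n")

# ^apple matches only when apple is at the start of the line.
starts = [s for s in lines if re.search(r"^apple", s)]

# event$ matches only when event is at the end of the line.
ends = [s for s in lines if re.search(r"event$", s)]

# ^$ matches a line with nothing between its start and its end.
empty_count = sum(1 for s in lines if re.search(r"^$", s))

# A line holding only blanks is not matched by ^$.
blanks_only = re.search(r"^$", "   ") is not None

print(starts, ends, empty_count, blanks_only)
```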
Write down a single regular expression that will match each of the following patterns but only if they occur at the beginning of a line. Repeat the exercise if you only want a match at the end of a line.
file_A file_X file23
The '*' character is used to match zero or more occurrences of the preceding character or regular expression.
For example, the regular expression:
file2*
will match any of the following:
file
file2
file22
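A short sketch of this behavior follows, assuming Python's re module as a stand-in. It uses fullmatch so that the whole string has to fit the pattern; the names fil and file3 are added for illustration. Because zero occurrences are allowed, a search for file2* also succeeds inside file3, where it matches just the characters file.

```python
import re

star = re.compile(r"file2*")

# The three strings from the text, plus one that is too short.
names = ["file", "file2", "file22", "fil"]

# fullmatch requires the entire string to fit the pattern.
full = [s for s in names if star.fullmatch(s)]

# search looks for the pattern anywhere; zero 2's is acceptable.
found_in_file3 = star.search("file3") is not None

print(full, found_in_file3)
```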
Write down a single regular expression that will match any of the following:
sn2a3
sn22a3
sn2b3
sn22b3
If you wanted to match the patterns
LwindowMargin RwindowMargin
but not the patterns
TwindowMargin BwindowMargin
you cannot use the tools we have so far except to list the acceptable patterns. The left and right bracket characters allow us to handle this case. These brackets are used to enclose the definition of a set of characters that we wish to match in a regular expression.
For example, to match either of the letters L or R, we can use the regular expression:
[LR]
We can now use this to write the regular expression needed above:
[LR]windowMargin
The brackets can be used anywhere in a regular expression. For example, to match any pattern that starts with the characters icon, followed by one of the digits 1, 2, or 3, and then the characters file, we could use the regular expression:
icon[123]file
There are some shortcuts that we can use with the brackets. We can specify a range of characters (successive ASCII codes) by using the - symbol. For example, if the numerical part of the previous example could be any digit from 0 to 9, then the appropriate regular expression would be:
icon[0-9]file
You can have several components within the brackets. For example:
[a-z123]
will match any lowercase letter or the digits 1, 2, or 3.
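The bracket expressions above can be checked with the following sketch. It assumes Python's re module in place of the Unix utility, and the test strings that do not appear in the text (icon7file, icon12file, and so on) are invented for illustration. Each pair of brackets stands for exactly one character.

```python
import re

margin = re.compile(r"[LR]windowMargin")
icon = re.compile(r"icon[0-9]file")
mixed = re.compile(r"[a-z123]")

# A set matches one character taken from the listed choices.
print(margin.fullmatch("LwindowMargin") is not None)   # L is in the set
print(margin.fullmatch("TwindowMargin") is not None)   # T is not

# A range matches one character, so two digits in a row do not fit.
print(icon.fullmatch("icon7file") is not None)
print(icon.fullmatch("icon12file") is not None)

# Ranges and single characters can be mixed inside one set.
print([c for c in "q2Q4" if mixed.fullmatch(c)])
```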
Write a single regular expression that will match all of the following:
figure3a figure4x figure9b
but not the following:
figure3A figureA3 figurex7
Write a single regular expression that will match all of the following:
Message34
message3a
message5x
but not the following:
MessageA5
messagexx
Message3A
How would you change the answers to the previous two exercises if you required that the patterns should only match if they occurred at the beginning of a line?
Write a single regular expression that will match any of the following:
sn23x56
sn234xy88
sn10x32
sn9545x452
There is one more shortcut that you can use. Inside the brackets (but not outside), when the ^ character comes first, it is interpreted to mean that you want the complement of the characters that follow within the brackets, that is, everything except the characters shown.
For example, if you want to match any single character that is not a digit, then you can use:
[^0-9]
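The two roles of the ^ character can be compared in a brief sketch. Python's re module is assumed as a stand-in, and the sample strings are made up. Inside the brackets the ^ negates the set; outside the brackets it still anchors the match to the beginning of the line.

```python
import re

non_digit = re.compile(r"[^0-9]")

# Collect every single character that is not a digit.
letters_only = non_digit.findall("a1b22c")
print(letters_only)

# Outside the brackets ^ is the line anchor; inside it negates the set.
starts_non_digit = re.compile(r"^[^0-9]")
print(starts_non_digit.search("7up") is not None)
print(starts_non_digit.search("up7") is not None)
```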
Write a regular expression that will recognize patterns such as the following:
sn345x111
sn344y231
where the first two letters are always sn, the next three characters must be digits, the next character is a letter that must not be anything between a and r, and the final three characters are digits.
Some Unix utilities allow extensions to what we have seen so far for working with regular expressions. For example, in awk you can use parentheses to group portions of regular expressions. If you want to allow zero or more repetitions of the pattern a4, then you could use the regular expression:
(a4)*
Likewise, in awk, you can use + to match the previous character or regular (sub)expression one or more times (instead of zero or more times).
Also in awk, you can use ? to match the previous character or regular (sub)expression zero or one time only.
Finally, in awk, you can use the | character as a logical "or" for matching either of the regular (sub)expressions on either side.
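The four extensions can be sketched as follows. Python's re module happens to support the same grouping, +, ?, and | operators, so it is used here as an assumed stand-in for awk; all of the sample patterns and strings apart from (a4)* are invented for illustration.

```python
import re

def fits(pattern: str, s: str) -> bool:
    """Return True if the whole string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

# Parentheses make * apply to the group a4 rather than to the 4 alone.
print([s for s in ["xy", "xa4y", "xa4a4y", "xa44y"] if fits(r"x(a4)*y", s)])

# + requires at least one occurrence of the preceding item.
print([s for s in ["ac", "abc", "abbc"] if fits(r"ab+c", s)])

# ? allows the preceding item to appear once or not at all.
print([s for s in ["color", "colour", "colouur"] if fits(r"colou?r", s)])

# | accepts either alternative.
print([s for s in ["cat", "dog", "cow"] if fits(r"cat|dog", s)])
```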
Here are some useful regular expressions that you may wish to use:
[A-Za-z][A-Za-z]*
This will match any string of one or more letters.
[+\-][0-9][0-9]*
This will match any integer with a preceding + or -.
.*
This will match any string of characters, including the empty string.
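If you would like to see what these three expressions accept, the sketch below tries each one on a few strings. It assumes Python's re module as a stand-in for the Unix tools (which may treat the backslash inside brackets differently), and the test strings are made up for illustration.

```python
import re

word = re.compile(r"[A-Za-z][A-Za-z]*")
signed = re.compile(r"[+\-][0-9][0-9]*")
anything = re.compile(r".*")

# One or more letters: a digit or an empty string does not fit.
print([s for s in ["Hello", "a1", ""] if word.fullmatch(s)])

# A sign followed by at least one digit.
print([s for s in ["+42", "-7", "42", "+"] if signed.fullmatch(s)])

# .* accepts every string here, even the empty one.
print([s for s in ["", "any thing 123!"] if anything.fullmatch(s) is not None])

# Searching a longer string picks out each run of letters.
runs = word.findall("abc123def")
print(runs)
```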
How would you use the extensions to simplify the first two regular expressions shown in this section?
These pages were developed by John Avila SJSU CS Dept.