Unix Lab



Regular Expressions


Regular expressions appear in several places in Unix: in the awk utility, in the grep utility, and in shell scripts. They are used to match patterns of characters.


The simplest Regular Expressions

The simplest way to match a pattern of characters is to write the characters themselves. For example, to match the characters main, use the regular expression:

main

By the way, blanks are treated just like any other character in a regular expression and regular expressions are case-sensitive (uppercase and lowercase letters are different).
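As a rough illustration of literal matching, here is a minimal sketch in Python, whose re module handles plain characters the same way as described here. The sample lines are made up, and the list comprehension only imitates how grep keeps lines that contain a match.

```python
import re

# Made-up sample lines, one string per line of a file.
lines = ["int main(void)", "the domain name", "Main menu", "ma in"]

# Keep every line that contains the pattern anywhere in it.
matches = [line for line in lines if re.search("main", line)]

# A blank in the pattern must be matched by a blank in the text.
blank_matches = [line for line in lines if re.search("ma in", line)]

print(matches)        # "Main menu" is absent: matching is case-sensitive
print(blank_matches)
```

Note that "the domain name" is selected too, because main occurs inside domain.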

A regular expression is one or more characters that describe the sequence of characters that we want to match.

If this were the only way to express character patterns, regular expressions would not be very useful. What makes regular expressions useful is the ability to do things such as matching the word main but not when it appears as part of another word like domain or mainEvent.


Metacharacters

Certain characters have special meanings when used in regular expressions. These are:


  \  ^  $  .  [  ]  *  +  ?  (  )  |

We'll explain them along the way. If you need to match a pattern that includes some of these characters, then you can force the character NOT to be treated as a metacharacter by placing the \ character in front of it.

For example, to match the characters $main, you must use the regular expression:

\$main
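The effect of the backslash can be checked with a short Python sketch. The sample lines are invented, and the comment about the unescaped form describes Python's re module specifically; other tools may treat an unescaped $ in the middle of a pattern differently.

```python
import re

# Made-up sample lines.
lines = ["echo $main", "int main(void)", "price: $5"]

# \$ matches a literal dollar sign; the raw string keeps the backslash intact.
escaped = [line for line in lines if re.search(r"\$main", line)]

# In Python's re, an unescaped $ is the end-of-line anchor,
# so nothing can follow it and this pattern finds nothing.
unescaped = [line for line in lines if re.search("$main", line)]

print(escaped)
print(unescaped)
```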

How would you match the following patterns? (Write down the regular expressions.)

[In these exercises, you will not be able to test directly to see if your regular expression is formed correctly. Use the Unix utility or command that led you to this module to do that. You may have to create some simple files to help you.]

hw1

hw2*

/hw1

#include


Using the ., ^, $, and * metacharacters

In a regular expression, the . (period or dot) character matches any single character. If you want to match the character patterns:

message1

message2

message8

messageX



You can simply use the regular expression:

message.
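A small Python sketch, using invented sample lines, shows what the dot will and will not accept. It assumes Python's re module, where the dot behaves as described here for single-line text.

```python
import re

# Made-up sample lines.
lines = ["message1", "messageX", "message", "my message 2"]

# The dot stands for exactly one character, whatever it is.
matches = [line for line in lines if re.search("message.", line)]

print(matches)
```

The bare line message is rejected because nothing follows it, while "my message 2" is accepted because the dot matches the blank.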

If you want to match a character pattern but only if it occurs at the beginning of a line, then place the ^ character before the pattern. For example, to match the word apple but only if it appears at the beginning of the line, you would use the regular expression:

^apple

If you want to match a character pattern but only if it occurs at the end of a line, then place a $ character after the pattern. For example, to match the word event but only if it occurs at the end of a line, you would use the regular expression:

event$

Note: The regular expression ^$ will match any empty line (a line containing no characters at all, not even blanks).
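The two anchors can be tried out with the following Python sketch. Each string stands for one line of a file without its trailing newline; the sample lines are made up for illustration.

```python
import re

# Made-up sample lines; the last two are an empty line and a line of blanks.
lines = ["apple pie", "green apple", "main event", "eventually", "", "   "]

# ^ anchors the pattern to the beginning of the line.
at_start = [line for line in lines if re.search("^apple", line)]

# $ anchors the pattern to the end of the line.
at_end = [line for line in lines if re.search("event$", line)]

# ^$ leaves no room for any character between the two anchors.
empty = [line for line in lines if re.search("^$", line)]

print(at_start)
print(at_end)
print(empty)
```

The line of three blanks is not selected by ^$, since it still contains characters.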

Write down a single regular expression that will match each of the following patterns but only if they occur at the beginning of a line. Repeat the exercise if you only want a match at the end of a line.


file_A

file_X

file23

The '*' character is used to match zero or more occurrences of the preceding character or regular expression.

For example, the regular expression:

file2*

will match any of the following:

file file2 file22
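To see the zero-or-more behaviour, the sketch below uses Python's re.fullmatch, which requires the whole string to match, alongside re.search, which accepts a match anywhere. The candidate strings are made up for illustration.

```python
import re

# Made-up candidate strings.
candidates = ["file", "file2", "file22", "file3", "fil"]

# Whole-string matching: only "file" followed by zero or more 2s.
whole = [s for s in candidates if re.fullmatch("file2*", s)]

# Substring matching, as grep does: "file3" also qualifies,
# because it contains "file" followed by zero 2s.
anywhere = [s for s in candidates if re.search("file2*", s)]

print(whole)
print(anywhere)
```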

Write down a single regular expression that will match any of the following:



sn2a3

sn22a3

sn2b3

sn22b3




Using [ ] to match sets of characters

If you wanted to match the patterns


LwindowMargin

RwindowMargin

but not the patterns

TwindowMargin

BwindowMargin

you cannot use the tools we have so far except to list the acceptable patterns. The left and right bracket characters allow us to handle this case. These brackets are used to enclose the definition of a set of characters that we wish to match in a regular expression.

For example, to match either of the letters L or R, we can use the regular expression:

[LR]

We can now write the regular expression needed for the case above:

[LR]windowMargin

The brackets can be used anywhere in a regular expression. For example, to match any pattern that starts with the characters icon, followed by one of the digits 1, 2, or 3, and then the characters file, we could use the regular expression:

icon[123]file

There are some shortcuts that we can use with the brackets. We can specify a range of characters (successive ASCII codes) by using the - symbol. For example, if the numerical part of the previous example could be any digit from 0 to 9, then the appropriate regular expression would be:

icon[0-9]file

You can have several components within the brackets. For example:

[a-z123]

will match any lowercase letter or the digits 1, 2, or 3.
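The bracket expressions above can be checked with a short Python sketch. It assumes Python's re module, which treats simple sets and ranges as described here; the test strings other than those in the text are made up.

```python
import re

names = ["LwindowMargin", "RwindowMargin", "TwindowMargin", "BwindowMargin"]
# Only L or R is allowed in the first position.
margins = [s for s in names if re.search("[LR]windowMargin", s)]

icons = ["icon1file", "icon7file", "iconXfile", "icon12file"]
# An explicit set of three digits.
small_set = [s for s in icons if re.search("icon[123]file", s)]
# A range covering every digit; still exactly one character.
digit_range = [s for s in icons if re.search("icon[0-9]file", s)]

chars = ["q", "2", "7", "Q"]
# A range and individual characters combined in one set.
mixed = [c for c in chars if re.fullmatch("[a-z123]", c)]

print(margins)
print(small_set)
print(digit_range)
print(mixed)
```

A bracket expression always stands for a single character, which is why icon12file is rejected by both icon patterns.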

Write a single regular expression that will match all of the following:


figure3a

figure4x

figure9b

but not the following:

figure3A

figureA3

figurex7

Write a single regular expression that will match all of the following:

Message34

message3a

message5x

but not the following:

MessageA5

messagexx

Message3A

How would you change the answers to the previous two exercises if you required that the patterns should only match if they occurred at the beginning of a line?

Write a single regular expression that will match any of the following:

sn23x56

sn234xy88

sn10x32

sn9545x452

There is one more shortcut that you can use. When the ^ character is the first character inside the brackets (but not outside them), it means that you want the complement of the characters that follow within the brackets, that is, any character except the ones shown.

For example, if you want to match any single character that is not a digit, then you can use:

[^0-9]
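A brief Python sketch may help show what the complemented set selects on its own and when combined with the anchors and * introduced earlier. The sample lines are made up for illustration.

```python
import re

# Made-up sample lines; the last one is an empty line.
lines = ["12345", "room 101", "abc", ""]

# [^0-9] needs one non-digit character somewhere in the line.
has_non_digit = [line for line in lines if re.search("[^0-9]", line)]

# Anchored and repeated, it demands that every character be a non-digit.
no_digits_at_all = [line for line in lines if re.search("^[^0-9]*$", line)]

print(has_non_digit)
print(no_digits_at_all)
```

The line "room 101" is selected by [^0-9] even though it contains digits, because it also contains letters and a blank.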

Write a regular expression that will recognize patterns such as the following:

sn345x111

sn344y231

where the first two letters are always sn, the next three characters must be digits, the next character is a letter that must not be anything between a and r, and the final three characters are digits.


Extensions

Some Unix utilities support extensions to what we have seen so far. In awk, for instance, you can use parentheses to group portions of a regular expression. If you want to allow zero or more repetitions of the pattern a4, then you could use the regular expression:

(a4)*

Likewise, in awk, you can use + to mean match the previous character or regular (sub)expression one or more times (instead of zero or more times).

Also in awk, you can use ? to mean match the previous character or regular (sub)expression zero or one time only.

Finally, in awk, you can use the | character as a logical "or" for matching either of the regular (sub)expressions on either side.
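The four extensions can be tried in Python, whose re module supports the same grouping, +, ?, and | operators. This is only a sketch: apart from (a4)*, the patterns and test strings are made up for illustration, and re.fullmatch is used so that each string must match in its entirety.

```python
import re

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Parentheses group a4 so that * repeats the pair, not just the 4.
grouped = whole_matches("(a4)*", ["", "a4", "a4a4", "a44"])

# + requires at least one b.
one_or_more = whole_matches("ab+c", ["ac", "abc", "abbc"])

# ? makes the u optional.
optional = whole_matches("colou?r", ["color", "colour", "colouur"])

# | accepts either alternative.
either = whole_matches("cat|dog", ["cat", "dog", "cow"])

print(grouped)
print(one_or_more)
print(optional)
print(either)
```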


Some useful regular expressions

Here are some useful regular expressions that you may wish to use:

[A-Za-z][A-Za-z]*
This will match any string of one or more letters.

[+-][0-9][0-9]*
This will match any integer with a preceding + or -. (The - is placed last inside the brackets so that it is taken literally rather than as a range.)

.*
This will match any string of characters, including the empty string.
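The behaviour of these three expressions can be checked with the Python sketch below. The test strings are made up, re.fullmatch is used so that the whole string must match, and the sign set is written [+-] with the hyphen last so that it is taken literally.

```python
import re

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# One letter followed by zero or more letters.
words = whole_matches("[A-Za-z][A-Za-z]*", ["main", "X", "abc123", ""])

# A sign, then one digit followed by zero or more digits.
signed = whole_matches("[+-][0-9][0-9]*", ["+42", "-7", "42", "+", "-3a"])

# Zero or more of any character.
anything = whole_matches(".*", ["", "anything at all", "123"])

print(words)
print(signed)
print(anything)
```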

How would you use the extensions to simplify the first two regular expressions shown in this section?



These pages were developed by John Avila SJSU CS Dept.