awk is a language for processing text files: it looks for patterns and carries out an action whenever a pattern is found. For example, suppose you have the following text file (based on one from an online source that we will use later), where each line contains a name, a phone number, and a field that identifies the location of the phone.
Joe 408 555-5553 cell
Joe 415 555-8755 home
Sue Ann 202 555-3412 home
Minh 619 555-7685 office
Jane 408 555-1675 office
Jane 408 555-5543 cell
Renee 415 555-0542 cell
Julie 650 555-2912 cell
Sam 408 555-1234 home
Ray 408 555-6699 home
Bill 503 555-6480 cell
Bill 503 555-1100 office
Darin 202 555-3430 office
Fred 415 555-2127 cell
Although this file is relatively small, imagine a similar file hundreds of times larger. Now suppose you wish to list just those lines that have 408 in the second column.
You could write a Java program to do this, but awk does it with very little work on your part. For example, we can issue the following awk command at the Unix prompt (BBS-list is the name of the file):
awk '$2 == "408"' BBS-list
The output will be:
Joe 408 555-5553 cell
Jane 408 555-1675 office
Jane 408 555-5543 cell
Sam 408 555-1234 home
Ray 408 555-6699 home
For now, don't worry about the details of the command line. The point is that this one-line command replaces a program you would otherwise have to write.
In this module you will learn how to use awk to do this and other tasks. Tutorial materials are available online, and we will give you questions and directions that let you work through some of these materials.
An awk program can be as simple as one rule or can be made up of many rules. A rule is a pair of items: a pattern followed by an action. Actions are enclosed in braces { }. The pattern describes what we are looking for in a line of text; the action is what we want to do when a line matches. In the example above, if 408 is the second string on a line, we want to print that line. Testing for the "408" string is the pattern part. That example has no action part because we just want the default action, which is to display the line.
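To make the pattern/action split concrete, here is a minimal sketch that runs the same pattern twice, once with the default action and once with the action written out. The file name bbs-sample.txt is a made-up scratch name, and the data is just the first few records of the listing above.

```shell
# Write a few records from the listing above to a scratch file
# (bbs-sample.txt is a hypothetical name used only for this sketch).
cat > bbs-sample.txt <<'EOF'
Joe 408 555-5553 cell
Joe 415 555-8755 home
Sue Ann 202 555-3412 home
Minh 619 555-7685 office
Jane 408 555-1675 office
EOF

# Pattern only: the default action displays each matching line.
awk '$2 == "408"' bbs-sample.txt

# The same pattern with the action spelled out in braces.
awk '$2 == "408" { print }' bbs-sample.txt

# Each command prints:
#   Joe 408 555-5553 cell
#   Jane 408 555-1675 office
```

Both commands should print the same two lines, since { print } with no arguments is what awk does by default when a rule has no action.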
In awk programs, neither the pattern part nor the action part is required, although a rule must have at least one of them.
If the awk program is a single rule, we can supply the program to awk as part of the command line, as we did above. In that case we type:
awk 'rule' fileName
where rule is a pattern-action pair and fileName is the name of the text file to be scanned by awk.
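As a rough illustration of the command-line form, the sketch below passes a one-rule program in single quotes. The quotes matter because they stop the shell from interpreting characters such as $ and the braces before awk sees them. The file name bbs-sample.txt is a hypothetical scratch file holding a few records from BBS-list.

```shell
# Scratch copy of a few BBS-list records (hypothetical file name).
cat > bbs-sample.txt <<'EOF'
Joe 408 555-5553 cell
Joe 415 555-8755 home
Sue Ann 202 555-3412 home
Minh 619 555-7685 office
Jane 408 555-1675 office
EOF

# General form:  awk 'rule' fileName
# The single quotes keep the shell from touching $2 and the braces.
awk '$2 == "415" { print }' bbs-sample.txt

# Prints:
#   Joe 415 555-8755 home
```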
Check out a simple example and determine what happens if the pattern part of a rule is missing. What happens if the action part is missing?
Two files are used in the examples below. One is called BBS-list (which we saw earlier) and the other is called inventory-shipped. Copy these files to a subdirectory under your home directory; you will find them in /handouts/cs46blab. Verify that the awk commands produce the same results for you.
What happens if several rules each produce a match for a particular line?
Use awk to print every line of the inventory-shipped file that contains a 7.
We've seen how to embed little "throw away" awk programs in the command line. Check out How to run awk programs. Then do the following exercise:
Create an awk program in a separate file. Write the program to print every line in the inventory-shipped file that contains the string Feb or Mar.
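For reference, running a program kept in its own file might look like the sketch below, which uses the -f option of awk on BBS-list-style data rather than on inventory-shipped. The names office.awk and bbs-sample.txt are hypothetical and chosen only for this illustration.

```shell
# Scratch copy of a few BBS-list records (hypothetical file name).
cat > bbs-sample.txt <<'EOF'
Joe 408 555-5553 cell
Joe 415 555-8755 home
Sue Ann 202 555-3412 home
Minh 619 555-7685 office
Jane 408 555-1675 office
EOF

# Put the awk program in its own file (office.awk is a hypothetical name).
cat > office.awk <<'EOF'
# Print every record that contains the string "office".
/office/ { print }
EOF

# -f tells awk to read the program from a file instead of the command line.
awk -f office.awk bbs-sample.txt

# Prints:
#   Minh 619 555-7685 office
#   Jane 408 555-1675 office
```

Keeping the program in a file avoids shell quoting entirely and makes longer programs easier to edit.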
What awk reads are records. Usually each record is one line of a text file, but only because the default record separator is the newline character found at the end of each line. Later, if you wish, you can find out how to change the record separator, but we shall restrict ourselves to the default value.
Each record is made up of fields. A field is separated from the next by a field separator. By default, blanks and tabs both end a field. For example, in the BBS-list file, almost every record contains 4 fields (the Sue Ann record has 5, because the name itself contains a blank). In awk these fields are known as $1, $2, $3, and $4. If you look back at the first example, you will notice the pattern
$2 == "408"
which asks that the second field match the three-character string 408.
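One way to see how awk splits a record is to print the field count next to the first two fields. The sketch below uses NF, a standard awk built-in variable that holds the number of fields in the current record; it is not covered in the text above. The file name bbs-sample.txt is a hypothetical scratch file.

```shell
# Scratch copy of a few BBS-list records (hypothetical file name).
cat > bbs-sample.txt <<'EOF'
Joe 408 555-5553 cell
Joe 415 555-8755 home
Sue Ann 202 555-3412 home
Minh 619 555-7685 office
Jane 408 555-1675 office
EOF

# For every record, print the number of fields, then $1 and $2.
awk '{ print NF, $1, $2 }' bbs-sample.txt

# Prints:
#   4 Joe 408
#   4 Joe 415
#   5 Sue Ann
#   4 Minh 619
#   4 Jane 408
```

Note the third line: because the name Sue Ann contains a blank, that record has an extra field and its $2 is Ann rather than an area code.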
In awk, the action { print $n } means to print just the nth field of the record matching the pattern. If n is 0 (zero), then the entire line is printed. Use the command-line form of calling awk to list the third field of those lines of the file inventory-shipped in which the letter r appears.
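As an illustration on BBS-list-style data (not on inventory-shipped), the sketch below prints a single field and then the whole record for the same pattern. The file name bbs-sample.txt is a hypothetical scratch file.

```shell
# Scratch copy of a few BBS-list records (hypothetical file name).
cat > bbs-sample.txt <<'EOF'
Joe 408 555-5553 cell
Joe 415 555-8755 home
Sue Ann 202 555-3412 home
Minh 619 555-7685 office
Jane 408 555-1675 office
EOF

# Print only the third field of records whose second field is 408.
awk '$2 == "408" { print $3 }' bbs-sample.txt
# Prints:
#   555-5553
#   555-1675

# $0 stands for the entire record.
awk '$2 == "408" { print $0 }' bbs-sample.txt
# Prints:
#   Joe 408 555-5553 cell
#   Jane 408 555-1675 office
```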
We've seen a few simple examples of how to tell awk what pattern we're looking for. In this section we'll learn more about the rules. For example, we would like to specify a pattern by saying something like: look for a 'd' as the first character of the line. Or, in another case, look for lines in which the third field is larger than 500.
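The two kinds of pattern just described can be sketched on BBS-list-style data: a regular expression anchored to the start of the line, and a numeric comparison on a field. The file name bbs-sample.txt is a hypothetical scratch file, and the sample deliberately leaves out the Sue Ann record, whose extra blank would put a non-numeric value in $2.

```shell
# Scratch copy of a few BBS-list records (hypothetical file name).
cat > bbs-sample.txt <<'EOF'
Joe 408 555-5553 cell
Joe 415 555-8755 home
Minh 619 555-7685 office
Jane 408 555-1675 office
Bill 503 555-6480 cell
EOF

# Regular-expression pattern: ^ anchors the match to the start of the line,
# so this selects lines whose first character is J.
awk '/^J/' bbs-sample.txt
# Prints:
#   Joe 408 555-5553 cell
#   Joe 415 555-8755 home
#   Jane 408 555-1675 office

# Comparison pattern: select lines whose second field is larger than 500.
awk '$2 > 500' bbs-sample.txt
# Prints:
#   Minh 619 555-7685 office
#   Bill 503 555-6480 cell
```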
First read the section on regular expressions.
Then read the section on patterns and come back and do the following exercises.
Copy the file directory.ls into your directory from /handouts/cs46blab. Notice that this is just the long listing of a directory, showing all the files and subdirectories in that directory. If the first character on a line is a 'd', then the entry corresponds to a subdirectory. Write a simple awk program to print the names of the subdirectories in this long listing.
Write another awk program to process the directory.ls file. In this program, list the names of all the Java source files.
Write an awk program to list all files that have been updated since November 15, 1999.
Find the total amount of disk space taken up by the files listed in directory.ls.
List just the directories appearing in the directory.ls file.
List the files shown in directory.ls that are bigger than 1700 bytes.
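Some of these exercises need a value that is built up across records and reported once at the end, which the text above does not show. The sketch below illustrates that idea on BBS-list-style data rather than on directory.ls: it counts the cell-phone records and prints the count from an END rule, a standard awk feature whose action runs after the last record has been read. The file name bbs-sample.txt and the variable name count are hypothetical.

```shell
# Scratch copy of a few BBS-list records (hypothetical file name).
cat > bbs-sample.txt <<'EOF'
Joe 408 555-5553 cell
Joe 415 555-8755 home
Minh 619 555-7685 office
Jane 408 555-1675 office
Bill 503 555-6480 cell
EOF

# First rule: add 1 to count for every record whose fourth field is "cell".
# END rule: runs once, after all records have been read.
awk '$4 == "cell" { count = count + 1 } END { print count }' bbs-sample.txt

# Prints:
#   2
```

An awk variable starts out empty and is treated as zero in arithmetic, so count needs no explicit initialization here.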
These pages were developed by John Avila, SJSU CS Dept.