Unix Lab

The awk language

How to use awk

awk is a language for processing text files: it looks for patterns and carries out the appropriate action whenever a pattern is found. For example, suppose you have the following text file (it is based on one from a source on the network that we will use later), in which each line contains a name, a phone number, and a field that identifies the location of the phone.

File: BBS-list

Joe          408 555-5553     cell
Joe          415 555-8755     home
Sue Ann      202 555-3412     home
Minh         619 555-7685     office
Jane         408 555-1675     office
Jane         408 555-5543     cell
Renee        415 555-0542     cell
Julie        650 555-2912     cell
Sam          408 555-1234     home
Ray          408 555-6699     home
Bill         503 555-6480     cell
Bill         503 555-1100     office
Darin        202 555-3430     office
Fred         415 555-2127     cell

Although this file is relatively small, imagine a similar file hundreds of times larger. Now assume you wish to list just those lines that have a 408 in the second column.

You could write a Java program to do this, but awk does it with very little work on your part. We can issue an awk command at the Unix prompt that looks like this (BBS-list is the name of the file):

awk '$2 == "408"' BBS-list

The output will be:

Joe          408 555-5553     cell
Jane         408 555-1675     office
Jane         408 555-5543     cell
Sam          408 555-1234     home
Ray          408 555-6699     home

For now, don't worry about the details of the command line. The point is that this one line replaces a program you would otherwise have to write.

In this module you will learn how to use awk to do this and other tasks. Tutorial materials are available on the net and we will give you questions and directions that will let you work through some of these materials.

Pattern-action pairs

An awk program can be as simple as one rule or can be made up of many rules. A rule consists of a pair of items: the first is a pattern, the second is an action. Actions are enclosed in braces { }. The pattern part describes what we're looking for in a line of text. The action part is what we want to do when a line matches that pattern. In the example above, if the second field on a line is "408", then we want to print that line. Matching the "408" string is the pattern part. That example has no action part because we just want the default action, which is to display the line.

In awk programs, either the pattern part or the action part of a rule may be omitted, but not both.

If the awk program is a simple rule, we can supply the program to awk as part of the command line as we did above. When this happens we type:

awk 'rule' fileName

where rule is a pattern-action pair and fileName is the name of the text file to be scanned by awk.
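
As a minimal sketch of a complete rule given on the command line, the following pairs the pattern from the first example with an explicit action. The file name bbs-sample is made up for this illustration; it holds just three lines copied from BBS-list.

```shell
# Build a tiny sample file (bbs-sample is a hypothetical name used only here).
cat > bbs-sample <<'EOF'
Joe          408 555-5553     cell
Minh         619 555-7685     office
Sam          408 555-1234     home
EOF

# Pattern: the second field equals "408".
# Action:  print the first field (the name).
awk '$2 == "408" { print $1 }' bbs-sample
```

Run against these three lines, the command should print Joe and Sam, one name per line. The single quotes keep the shell from interpreting the $ signs and braces before awk sees them.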

Check out a simple example and determine what happens if the pattern part of a rule is missing. What happens if the action part is missing?

Two files are used in the examples below. One is called BBS-list (which we saw earlier) and the other is called inventory-shipped. Copy these files to a subdirectory under your home directory. They will be found in /handouts/cs46blab. Verify that the awk commands produce the same results for you.

What happens if several rules each produce a match for a particular line? Use awk to print every line of the inventory-shipped file that contains a 7.

How to run larger awk programs

We've seen how to embed little "throw away" awk programs in the command line. Check out How to run awk programs. Then do the following exercise:

Create an awk program in a separate file. Write the program to print every line in the inventory-shipped file that contains the string Feb or Mar.
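
The mechanics can be sketched as follows (this is an illustration, not a solution to the exercise). The program is saved in its own file and passed to awk with the -f option. The names office.awk and bbs-sample are invented for this sketch.

```shell
# Build a tiny sample file (bbs-sample is a hypothetical name used only here).
cat > bbs-sample <<'EOF'
Joe          408 555-5553     cell
Minh         619 555-7685     office
Jane         408 555-1675     office
EOF

# Put the awk program in its own file (office.awk is also a made-up name).
# Inside the file no shell quoting is needed around the rule.
cat > office.awk <<'EOF'
# Print the name on every line whose fourth field is "office".
$4 == "office" { print $1 }
EOF

# -f tells awk to read the program from the named file.
awk -f office.awk bbs-sample
```

With this sample data the command should print Minh and Jane. Keeping the program in a file makes it easier to edit and rerun than a long command line.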

awk input

awk reads its input as a sequence of records. Usually each record is one line from a text file. This is only because the default record separator is the newline character found at the end of each line in a text file.

Later, if you wish, you can find out how to change the record separation character, but we shall restrict ourselves to the default value.

Each record is made up of fields. A field is separated from another field by a field separation character. The defaults are blanks and tabs; either one signals the end of a field. For example, in the BBS-list file, most records contain 4 fields (the line for Sue Ann is an exception: the blank inside the name splits it into two fields, giving 5). In awk these fields are known as $1, $2, $3, and $4. If you look back at the first example, you will notice the pattern

$2 == "408"

which asks that the second field equal the three-character string 408.

In awk, the action { print $n } means to print just the nth field of the record matching the pattern. If n is 0 (zero) then the entire line is printed. Use the command line form of calling awk and list the third field of those lines of the file inventory-shipped in which the letter r appears.
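
A short sketch of field numbering, using two lines copied from BBS-list (the file name bbs-sample is made up for this illustration):

```shell
# Build a tiny sample file (bbs-sample is a hypothetical name used only here).
cat > bbs-sample <<'EOF'
Joe          408 555-5553     cell
Sue Ann      202 555-3412     home
EOF

# Third field of Joe's line: the phone number.
awk '$1 == "Joe" { print $3 }' bbs-sample

# On the Sue Ann line the blank inside the name ends the first field,
# so $1 is "Sue" and $2 is "Ann", not the area code.
awk '$1 == "Sue" { print $2 }' bbs-sample

# $0 stands for the whole record, so this prints Joe's line unchanged.
awk '$1 == "Joe" { print $0 }' bbs-sample
```

The first command should print 555-5553, the second Ann, and the third the complete Joe line. The second result is a reminder that fields are split on blanks and tabs, whatever the columns look like to a human reader.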

How to specify patterns

We've seen a few simple examples of how to tell awk what pattern we're looking for. In this section we'll learn more about the rules. For example, we would like to specify a pattern by saying something like: look for a 'd' as the first character of the line. Or, in another case, look for lines in which the third field is larger than 500.
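
To give a feel for these two kinds of pattern before the reading, here is a minimal sketch on a few lines copied from BBS-list. The file name bbs-sample, the letter J, and the threshold 600 are chosen only for this illustration.

```shell
# Build a tiny sample file (bbs-sample is a hypothetical name used only here).
cat > bbs-sample <<'EOF'
Joe          408 555-5553     cell
Minh         619 555-7685     office
Julie        650 555-2912     cell
Bill         503 555-6480     cell
EOF

# Regular-expression pattern: ^ anchors the match to the start of the line,
# so this selects lines whose first character is J.
awk '/^J/' bbs-sample

# Comparison pattern: selects lines whose second field, taken as a number,
# is larger than 600.
awk '$2 > 600' bbs-sample
```

With this sample data the first command should print the Joe and Julie lines, and the second the Minh and Julie lines. Neither rule has an action, so the default action of displaying the matching line applies.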

First read the section on regular expressions.

Then read the section on patterns and come back and do the following exercises.

Copy the file directory.ls into your directory from /handouts/cs46blab. Notice that this is just the long listing of a directory showing all the files and subdirectories in that directory. If the first character on a line is a 'd', then the entry corresponds to a subdirectory. Write a simple awk program to print the names of the subdirectories in this long listing.

Write another awk program to process the directory.ls file. In this program, list the names of all the Java source files.

Write an awk program to list all files that have been updated since November 15, 1999.

Find the total amount of disk space taken up by the files listed in directory.ls.

List just the directories appearing in the directory.ls file.

List the files shown in directory.ls that are bigger than 1700 bytes.

These pages were developed by John Avila SJSU CS Dept.