2023-04-15

AWK Programming - Part 1

Introduction, patterns, actions, and rules

AWK is a powerful text-processing language and utility, and a standard feature of most UNIX-like operating systems. It is named after its creators: Alfred Aho (A), Peter Weinberger (W), and Brian Kernighan (K). If you want to be an effective user of UNIX-like systems, some knowledge of AWK (together with some shell scripting) is absolutely necessary.

I will be using the GNU project's implementation of the AWK programming language, gawk version 5.1.0, throughout this series. As with most of its software, the GNU project offers comprehensive documentation for gawk. You can read it here.

I came to know that on my Debian 11 bullseye system, the default installation for awk was mawk, and not gawk. Hmmm…

Anyway, let's get started.

Invoking awk

The most common ways of calling awk are as follows:

awk 'program' file1 file2 ...
awk -f program-file file1 file2 ...

In the first case you specify the program as a command line argument to the awk executable. In the second case you put the program in a file and pass that file with the -f option to the executable.

Note that if you have multiple script files, you can provide multiple -f options. In that case all the script files will be concatenated before processing the input data files. (But why do this!!!)
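One case where multiple -f options can be handy is keeping reusable functions in their own file. The following is only a minimal sketch: the file names lib.awk and main.awk and the shout function are made up for illustration.

```shell
# lib.awk (hypothetical name): a small reusable function
cat > lib.awk <<'EOF'
function shout(s) { return toupper(s) "!" }
EOF

# main.awk (hypothetical name): a rule that calls the function from lib.awk
cat > main.awk <<'EOF'
{ print shout($0) }
EOF

# awk treats the two files as one program
printf 'hello\nworld\n' | awk -f lib.awk -f main.awk
# prints:
# HELLO!
# WORLD!
```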

Also, if you want to create an executable awk script, put #!/bin/awk -f (or whatever the location of awk is on your system) as the shebang line. Do not pass any option other than -f there. When the operating system (Linux, for example) creates a process from the shebang line, it passes everything that comes after the executable name as a single command line argument. Extra options therefore reach awk glued together as one argument, which can create problems that are hard to detect.
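Here is a rough sketch of creating and running such a script. It assumes awk is on your PATH and looks up its location with command -v instead of hard-coding /bin/awk; the file name number.awk is made up for illustration.

```shell
# find where awk lives on this system (often /usr/bin/awk)
awk_path=$(command -v awk)

# write a two-line script: the shebang line, then a single rule
# that prefixes every input line with its line number
printf '#!%s -f\n{ print NR ": " $0 }\n' "$awk_path" > number.awk

# make it executable and run it like any other command
chmod +x number.awk
printf 'a\nb\n' | ./number.awk
# prints:
# 1: a
# 2: b
```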

If you don't provide any input files, data is read from stdin. For example:

ls -l | awk '{print}'

Also, you can mix stdin and input files (not sure why or when this might be useful!!!).

cat ~/.bashrc | awk '{print}' before.txt - after.txt

This will print the contents of before.txt, then .bashrc, and finally after.txt.

Notice the minus - in the list of file names. It means stdin.

Rules - pattern, action pairs

An awk program is a collection of one or more rules. Each rule consists of a pattern and an associated action.

The following pseudocode shows the flow of an awk program:

# iterate over each line of the input file
for line in currentFile.lines:
    # iterate over each rule in the awk program
    for rule in program.rules:
        # match the line against the pattern in the rule
        if match(rule.pattern, line):
            # execute the action if the match is successful
            execute(rule.action, line)

Consider the following fruits.txt file:

apples
oranges
pineapples

And the following command:

awk '/apples/ {print}' fruits.txt

Running this will produce the following output:

apples
pineapples

The 1st and the 3rd lines matched the /apples/ pattern, and hence were printed. In fact, the text between the slashes is a regular expression (regexp), and a line matches if the regexp matches anywhere in it.
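To tie this back to the pseudocode above, here is a small sketch with two rules. Every line is tested against both rules, in the order they appear in the program. The second pattern, /pine/, is made up for this illustration.

```shell
# recreate the fruits.txt file from the example above
printf 'apples\noranges\npineapples\n' > fruits.txt

# two rules; each input line is checked against both of them
awk '
/apples/ { print "rule 1: " $0 }
/pine/   { print "rule 2: " $0 }
' fruits.txt
# prints:
# rule 1: apples
# rule 1: pineapples
# rule 2: pineapples
```

Notice that pineapples is printed twice, once by each rule, while oranges matches neither rule and is not printed at all.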

The BEGIN and END special patterns

BEGIN and END are special patterns in awk. The action associated with the BEGIN pattern is executed once, before any input is read. Similarly, the action associated with the END pattern is executed once, after all the input files have been processed.
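A minimal sketch of both special patterns together with an ordinary rule. The variable name count is my own choice, and the input is the same three fruit names used earlier.

```shell
printf 'apples\noranges\npineapples\n' | awk '
BEGIN    { print "start" }                  # runs once, before any input
/apples/ { count++ }                        # runs for every matching line
END      { print count " lines matched" }   # runs once, after all input
'
# prints:
# start
# 2 lines matched
```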

We will get back to this pattern-action pair concept soon. But let's learn about records and fields first.

Records, fields, $0, $1, … $NF

Awk works on a file, one record at a time.

What constitutes a record in a file depends on the value of the special variable RS, known as the record separator. The default value of RS is the newline character, which means each line of the input file is a record. The special variable NR stores the number of the current record.
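To see RS in action, here is a small sketch that assumes a made-up input where records are separated by semicolons instead of newlines. RS is set in a BEGIN action so that it takes effect before the first record is read.

```shell
# the input has no newlines at all; records are split on ";"
printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }'
# prints:
# 1: a
# 2: b
# 3: c
```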

For example, the following command prints the even-numbered lines of file.txt:

awk 'NR % 2 == 0 {print}' file.txt

Similar to records, what constitutes a field in a record depends on the value of the FS (field separator) variable. By default, fields are separated by whitespace (spaces or tabs): a run of consecutive whitespace characters counts as a single separator, and leading and trailing whitespace is ignored.

In a record, $1 refers to the first field, $2 to the second, and so on. $0 refers to the entire record. The special variable NF stores the number of fields in the current record. So, $NF can be used to refer to the last field, $(NF-1) to the second-to-last field, and so on.
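The following sketch puts fields and FS together. The sample inputs are made up for illustration. The -F option sets FS from the command line, and the commas in print separate the printed values with a single space.

```shell
# default FS: leading blanks are ignored, runs of blanks separate fields
printf '  alpha   beta\tgamma\n' | awk '{ print NF, $1, $NF, $(NF-1) }'
# prints: 3 alpha gamma beta

# FS set to a colon with the -F option
printf 'root:x:0:0:root:/root:/bin/bash\n' | awk -F: '{ print $1, $NF }'
# prints: root /bin/bash
```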