AWK Programming - Part 1
Introduction, patterns, actions, and rules
AWK is a powerful text-processing language and utility that ships with most UNIX-like operating systems. It is named after its creators: Alfred Aho (A), Peter Weinberger (W), and Brian Kernighan (K). If you want to be an effective user of UNIX-like systems, some knowledge of AWK (together with some shell scripting) is close to essential.
I will be using gawk version 5.1.0, the GNU project's implementation of the AWK programming language, throughout this series. As with most of its software, the GNU project offers comprehensive documentation for gawk. You can read it here.
I found out that on my Debian 11 (bullseye) system, the default awk was mawk, not gawk. Hmmm…
Anyways let's get started.
Invoking awk
The most common ways of calling awk are as follows:
awk 'program' file1 file2 ...
awk -f program-file file1 file2 ...
In the first case you specify the program as a command line argument to the awk executable. In the second case you put the program in a file and pass that file to the executable with the -f option.
Note that if you have multiple script files, you can provide multiple -f options. In that case all the script files are concatenated into a single program before the input data files are processed. (But why do this!!!)
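One possible reason to use several -f options is to keep reusable helpers in a file of their own. The sketch below is only an illustration: the file names lib.awk and main.awk and the function twice are made up for this demo, and it uses a user-defined function, which is ordinary awk but has not been covered yet in this series.

```shell
# Hypothetical file names (lib.awk, main.awk), created in a temp directory for the demo
dir=$(mktemp -d)

# lib.awk: a small helper function that could be shared between scripts
cat > "$dir/lib.awk" <<'EOF'
function twice(x) { return 2 * x }
EOF

# main.awk: a rule that calls the helper defined in lib.awk
cat > "$dir/main.awk" <<'EOF'
{ print twice($1) }
EOF

# awk treats the two files as one program
multi_f_out=$(printf '1\n2\n3\n' | awk -f "$dir/lib.awk" -f "$dir/main.awk")
echo "$multi_f_out"

rm -r "$dir"
```

This prints 2, 4 and 6 on separate lines, since each input line is doubled by the function that lives in the other file.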
Also, if you want to create an executable awk script, put #!/bin/awk -f (or whatever the location of awk is on your system) as the shebang line. Make sure you do not pass any command line option other than -f there. On Linux, for example, when the kernel starts a process from a shebang line, everything that comes after the executable name is passed to the process as a single command line argument, so "-f" plus a second option would arrive as one unrecognizable argument. This can create problems that are hard to detect.
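As a rough sketch of the executable-script approach, the snippet below builds a tiny script and runs it. The script name print-lines is hypothetical, and the awk location is looked up with command -v rather than assumed to be /bin/awk. It also assumes the temp directory allows executing files.

```shell
dir=$(mktemp -d)
awk_path=$(command -v awk)   # wherever awk lives on this system

# Write the script: a shebang line with only -f, followed by the awk program
printf '#!%s -f\n{ print }\n' "$awk_path" > "$dir/print-lines"
chmod +x "$dir/print-lines"

# The script can now be run like any other command
shebang_out=$(printf 'a\nb\n' | "$dir/print-lines")
echo "$shebang_out"

rm -r "$dir"
```

The script simply echoes its input, so the output here is the two lines a and b.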
If you don't provide any input files, data is read from stdin. For example:
ls -l | awk '{print}'
Also, you can mix stdin and input files (not sure why or when this might be useful!!!).
cat ~/.bashrc | awk '{print}' before.txt - after.txt
This will print the contents of before.txt, then .bashrc, and finally after.txt.
Notice the minus (-) in the list of file names. It stands for stdin.
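To try the ordering out without touching real files, here is a self-contained sketch. The two files are throwaway stand-ins created in a temp directory, and the stdin text is made up for the demo.

```shell
dir=$(mktemp -d)
printf 'before\n' > "$dir/before.txt"
printf 'after\n'  > "$dir/after.txt"

# "-" marks the position where stdin is read, between the two files
mixed_out=$(printf 'from stdin\n' | awk '{print}' "$dir/before.txt" - "$dir/after.txt")
echo "$mixed_out"

rm -r "$dir"
```

The output is before, from stdin, after, in that order, which matches the position of - in the argument list.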
Rules - pattern, action pairs
An awk program is a collection of one or more rules. Each rule consists of a pattern and an associated action.
The following pseudocode shows the flow of an awk program:
# iterate over each line of the input file
for line in currentFile.lines:
    # iterate over each rule in the awk program
    for rule in program.rules:
        # match the line against the pattern in the rule
        if match(rule.pattern, line):
            # execute the action if the match is successful
            execute(rule.action, line)
Consider the following fruits.txt file:
apples
oranges
pineapples
awk '/apples/ {print}' fruits.txt
Running this will produce the following output:
apples
pineapples
The 1st and the 3rd lines matched the apples pattern, and hence were printed. In fact, the apples pattern is a regexp: /apples/ matches any line that contains the text apples, which is why pineapples matched too.
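The flow described in the pseudocode (every rule is tried against every line) can be seen with a program that has two rules. This is a small sketch using the same fruit names; the second pattern /pine/ and the "rule N:" labels are additions for the demo, and $0 stands for the whole current line.

```shell
dir=$(mktemp -d)
printf 'apples\noranges\npineapples\n' > "$dir/fruits.txt"

# Two rules: each line is tested against both patterns, in program order
rules_out=$(awk '/apples/ {print "rule 1: " $0}
                 /pine/   {print "rule 2: " $0}' "$dir/fruits.txt")
echo "$rules_out"

rm -r "$dir"
```

The line pineapples matches both patterns, so it is printed twice, once by each rule, while oranges matches neither and is skipped.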
The BEGIN and END special patterns
BEGIN and END are special patterns in awk. The action associated with the BEGIN pattern is executed once, before any input is read. Similarly, the action associated with the END pattern is executed once, after all the input has been processed.
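A minimal sketch of BEGIN and END around an ordinary rule: the variable name count and the printed messages are made up for this demo, and the input is fed through stdin.

```shell
# BEGIN runs before the first line, END after the last one
begin_end_out=$(printf 'apples\noranges\npineapples\n' | awk '
    BEGIN    { print "start" }
    /apples/ { count++ }                       # runs once per matching line
    END      { print count " matching lines" }
')
echo "$begin_end_out"
```

Here the BEGIN action prints start before any line is examined, and the END action reports the total once the input is exhausted.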
We will get back to this pattern-action pair concept soon. But let's learn about records and fields first.
Records, fields, $0, $1, … $NF
Awk works on a file, one record at a time.
What constitutes a record in a file depends on the value of the special variable RS, known as the record separator. The default value of RS is the newline character, which means each line of the input file is a record. The special variable NR stores the current record number.
For example, the following command will print the even-numbered lines of file.txt:
awk 'NR % 2 == 0 {print}' file.txt
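To see that a record does not have to be a line, here is a small sketch that sets RS to a semicolon in a BEGIN rule. The input a;b;c is made up for the demo and deliberately has no trailing newline, so the last record stays clean.

```shell
# With RS set to ";", each semicolon-separated chunk becomes one record
rs_out=$(printf 'a;b;c' | awk 'BEGIN {RS = ";"} {print NR ": " $0}')
echo "$rs_out"
```

NR counts records rather than lines, so a single line of input yields records numbered 1 to 3.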
Similar to records, what constitutes a field in a record depends on the value of the FS (field separator) variable. By default, fields are separated by runs of whitespace (spaces or tabs): repeated whitespace counts as a single separator, and leading and trailing whitespace is ignored.
In a record, $1 refers to the first field, $2 to the second, and so on.
$0 refers to the entire record. The special variable NF stores the number of fields in the current record. So, $NF can be used to refer to the last field, $(NF-1) to the second-to-last field, and so on.
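The field variables can be tried out with a couple of one-liners. This is a sketch with made-up input: the first command relies on the default whitespace splitting, and the second sets FS to a colon in a BEGIN rule. Items separated by commas in print are joined with a single space.

```shell
# Default FS: runs of spaces and tabs separate fields; outer whitespace is ignored
default_out=$(printf '  alpha   beta\tgamma  \n' | awk '{print NF, $1, $NF, $(NF-1)}')
echo "$default_out"

# FS set to ":", so the record is split on colons instead
colon_out=$(printf 'alice:x:1000:/home/alice\n' | awk 'BEGIN {FS = ":"} {print $1, $NF}')
echo "$colon_out"
```

The first command reports three fields despite the uneven spacing, and the second picks out the first and last colon-separated fields.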