Awk - A useful little language

Raunak Ramakrishnan - Jun 1 '18 - Dev Community

Awk is a small but capable programming language for processing text. It was developed by Aho, Weinberger, and Kernighan at Bell Labs.

Julia Evans made an awesome comic introducing awk:
[Image: AWK comic]

Awk scans its input as a sequence of lines and splits each line into fields. The default field separator is whitespace, but you can change it to another character.
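As a minimal sketch of field splitting, the commands below feed Awk a made-up line and print one field from it. The sample text and the shell variable names are assumptions chosen for illustration; $2 and $3 refer to the second and third fields, and -F sets the separator.

```shell
# Default separator: runs of spaces or tabs separate the fields.
default_split=$(printf 'alpha  beta\tgamma\n' | awk '{ print $2 }')
echo "$default_split"   # beta

# Custom separator: -F: splits on colons instead of whitespace.
custom_split=$(printf 'alpha:beta:gamma\n' | awk -F: '{ print $3 }')
echo "$custom_split"    # gamma
```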

An awk program is a sequence of pattern-action pairs: for each line, Awk checks whether the line matches the pattern and, if it does, performs the associated action on that line. Awk can be used interactively or to run saved programs.

Here is what Awk does, written in Python-like pseudocode:

initialize()  # Run the BEGIN block, if there is one
for line in input_lines:  # Awk divides the file / input into a list of lines
    for condition, action in conditions:  # A program is a list of condition-action pairs
        if condition(line):  # Match the line against the condition
            action(line)  # Perform the action on a match
finalize()  # Run the END block, if there is one
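The same flow can be sketched as a real Awk program. This is only an illustration on made-up input: the three fruit names and the count variable are assumptions, not part of the census example used later.

```shell
# Count the input lines that start with the letter a.
a_lines=$(printf 'apple\nbanana\navocado\n' | awk '
BEGIN { count = 0 }   # runs once, before any input is read
/^a/  { count++ }     # runs for every line that matches the pattern
END   { print count } # runs once, after the last line
')
echo "$a_lines"   # 2
```

The BEGIN rule here is optional, since Awk starts numeric variables at 0; it is spelled out only to mirror the pseudocode.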

Here are some small snippets of Awk:

1. Hello World!

You can run awk programs inline or through a file:

awk 'BEGIN{ print "Hello, World!"}'

Alternatively, you can save this to a file hello.awk:

BEGIN{ print "Hello, World!"}

Then run it as awk -f hello.awk

2. Reading a CSV and printing a specific column

Let's now do something useful! Download this CSV, which contains 2010 census data by zip code for the city of Los Angeles.

Print the first 3 lines of the CSV: head -n 3 2010_Census_Populations_by_Zip_Code.csv

Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4

We will print just the Total Population column using awk -F, '{print $2}' 2010_Census_Populations_by_Zip_Code.csv

The -F, option sets the field separator to a comma, since that is what separates the fields in a CSV file. $n refers to the value of the nth field, so $2 is the second column.
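To try this without downloading the file, here is a small sketch that inlines the three lines shown above and prints two columns at once. The shell variable names sample and zip_and_age are assumptions for illustration.

```shell
# The first three lines of the CSV, inlined so the sketch runs on its own.
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Print the zip code and the median age; the comma in print inserts a space.
zip_and_age=$(printf '%s\n' "$sample" | awk -F, '{ print $1, $3 }')
echo "$zip_and_age"
# Zip Code Median Age
# 91371 73.5
# 90001 26.6
```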

3. Computing some statistics

Awk allows the use of variables and functions. Let's see how to use them by computing the total population in the entire city.

# total.awk
{s += $2}
END {print "Total population:", s}

Variables are initialized by default to 0 (or to the empty string when used as text), so they need no declaration. Here, we use the variable s to hold the running total.

Running this script as awk -F, -f total.awk 2010_Census_Populations_by_Zip_Code.csv, we get the output: Total population: 10603988
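A quick way to see why the header line does not disturb the sum: when a non-numeric string such as "Total Population" is used in arithmetic, Awk treats it as 0. The sketch below runs the same accumulation on the three sample lines from earlier; the variable names sample, n, and sum_and_count are assumptions for illustration.

```shell
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# s and n are never declared; both start at 0.
# The header contributes 0 to s but is still counted in n.
sum_and_count=$(printf '%s\n' "$sample" | awk -F, '{ s += $2; n++ } END { print s, n }')
echo "$sum_and_count"   # 57111 3
```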

Special variables and built-in functions

Awk uses some special variables and functions to make your programs more compact:

  • NF : Number of fields in the current line
  • NR : Number of lines read so far, i.e. the current line number
  • $0 : The entire input line
  • length : Built-in function that returns the number of characters in a string
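As a small sketch of these built-ins together, the command below prints the line number, the field count, and the length of each line, followed by the line itself. The two input lines are made up for illustration.

```shell
report=$(printf 'one two three\nfour five\n' | awk '{ print NR, NF, length($0), $0 }')
echo "$report"
# 1 3 13 one two three
# 2 2 9 four five
```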

Now, we will compute the average household size, which is the total population divided by the total number of households. The columns of interest are $2 and $6.
We also want the average population per zip code. Our script:

# stats.awk
{ s += $2; h += $6;}
END {print "Total population:", s, "\nTotal households:", h, "\nAverage household size:", s/h, "\nAverage population per zip code:", s/NR}

In the END block, NR gives us the total number of lines read. But we do not want to count the header line, so we use the tail command to skip the first line with tail -n +2. Running tail -n +2 2010_Census_Populations_by_Zip_Code.csv | awk -F, -f stats.awk gives us:

Total population: 10603988
Total households: 3497698
Average household size: 3.0317
Average population per zip code: 33241.3
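An alternative to tail, sketched here under the assumption that the header is always the first line, is to skip it inside Awk with the pattern NR > 1 and divide by NR - 1 at the end. The example runs on the three sample lines from earlier rather than the full file, so the numbers differ from the output above; the shell variable names are assumptions.

```shell
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

stats=$(printf '%s\n' "$sample" | awk -F, '
NR > 1 { s += $2; h += $6 }   # accumulate on every line except the header
END    { print s, h, NR - 1 } # NR - 1 is the number of data lines
')
echo "$stats"   # 57111 12972 2
```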

4. Pattern matching

We have done some useful things with awk so far, but we have ignored its biggest strength: pattern matching. We can match on field values, regular expressions, and line numbers.

  • Print every 2nd line : NR%2 == 0 {print $0}. Here $0 stands for the entire line.
  • Print all zip codes with population > 100,000 : $2 > 100000 {print $1}
  • Print all zip codes with population > 10,000 and average household size > 4 : $2 > 10000 && $7 > 4 { print $1}. We can combine conditions using && and ||, which stand for logical AND and OR respectively.
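Here is a hedged sketch of two such patterns on the three sample lines from earlier. The regular expression /^900/ is a made-up example, and the NR > 1 guard is an addition: the header fields are not numbers, so Awk compares them as strings and the header line can slip through a test like $2 > 10000.

```shell
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Field-value pattern, with the header excluded explicitly.
big_households=$(printf '%s\n' "$sample" | awk -F, 'NR > 1 && $2 > 10000 && $7 > 4 { print $1 }')
echo "$big_households"   # 90001

# Regex pattern: zip codes that start with 900.
starts_with_900=$(printf '%s\n' "$sample" | awk -F, '$1 ~ /^900/ { print $1 }')
echo "$starts_with_900"  # 90001
```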

Further reading

There is a lot more to Awk. Here are some references:

  • The best resource for learning Awk is The AWK Programming Language, written by the same trio who created the language. This book goes above and beyond a typical programming language tutorial and teaches you how to use your Awk superpowers to build versatile systems like a relational database, a parser, an interpreter, etc.

  • The GNU Awk manual, Effective AWK Programming, is a thorough reference.
