Welcome to this tutorial on awk, a powerful text processing tool for parsing and manipulating text files in Unix-like operating systems.
Introduction
Awk was developed at Bell Labs in the 1970s and is often used for one-liner programs and manipulating text files. It is particularly useful for working with tabular data, such as CSV or BED files.
Syntax
Awk scripts are organized as:
awk 'pattern { action; other action }' file
This means that every time the pattern is true, the actions inside the braces are executed. If no pattern is specified, the action is performed for every line in the input file.
For example, the following command will print every line in the input file:
awk '{print}' file | head
The two most important patterns in awk are BEGIN and END, which cause their actions to run before any input is read and after the last line has been processed, respectively.
awk 'BEGIN{sum=0} {sum+=1} END {print sum}' file
This line initializes a variable sum at the start of the script, adds 1 to it every line, and then prints its value at the end.
Note that if a variable hasn't been initialized, it is treated as 0 in numeric expressions and an empty string in string expressions. Awk will not print an error in this case.
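A minimal sketch of this behavior (no input file needed, since a BEGIN block runs before any input is read):

```shell
# "unset" has never been assigned: it behaves as 0 in arithmetic
# and as an empty string when used as a string
awk 'BEGIN{print unset + 1; print "[" unset "]"}'
# prints: 1
#         []
```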
Input and Output
In awk, input is split into records and fields. By default, records are separated by newline characters, and each record is subdivided into fields (columns) as determined by the field separator (FS).
There are several built-in variables in awk that are useful for parsing text. The fields of each record are referred to by $number, so the first column would be $1, the second would be $2, etc. $0 refers to the entire record.
For example, to print the second column of each line in the input file, we would use:
awk '{print $2}' file | head
To print the second column followed by the first column, we would use:
awk '{print $2, $1}' file | head
Note that when the different fields are separated with commas in the print statement, they are joined by the output field separator (the OFS variable, described below), which is by default a space. If the comma is omitted between fields (e.g., awk '{print $2 $1}'), they are concatenated without a separator.
We can also print strings using quotation marks:
awk '{print "First column: " $1}' file | head
This will print the text "First column: " followed by the value in the first field for every line in the input file.
Built-in Variables
Awk has several built-in variables that are very useful for parsing text, including:
FS: field separator (default: whitespace)
OFS: output field separator, i.e. what character separates fields when printing (default: a single space)
RS: record separator, i.e. what character records are split on (default: newline)
ORS: output record separator (default: newline)
NR: number of records read so far; in an END block, the total number of records (by default, lines)
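A quick sketch of how these variables work together, using a small inline sample instead of a file:

```shell
# Read comma-separated input, write tab-separated output, and prefix
# each record with its line number via NR
printf 'a,b\nc,d\n' | awk 'BEGIN{FS=","; OFS="\t"} {print NR, $1, $2}'
# prints: 1	a	b
#         2	c	d
```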
Conditionals and Pattern Matching
Like other programming languages, awk allows the use of conditional statements with if and else.
awk '{if(condition) action; else other_action}'
awk uses the following conditional operators:
== equal to
!= not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
&& AND
|| OR
! NOT
For example, the following script will print the line if the first field is equal to "apple":
awk '{if($1 == "apple") print}' file
In addition, awk supports string matching with regular expressions, using the following operators:
~ matches
!~ does not match
For example, the following script will print the line if the first field contains the string "apple":
awk '{if($1 ~ /apple/) print}' file
Note that for pattern matching, the regular expression must be enclosed in slashes.
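These operators combine naturally with the conditional operators above. A small sketch, using a made-up fruit list as sample input:

```shell
# Print lines whose first field does NOT contain "apple" and whose
# second field is greater than 10
printf 'apple 5\npear 20\napplesauce 30\nplum 8\n' | awk '{if($1 !~ /apple/ && $2 > 10) print}'
# prints: pear 20
```

Note that "applesauce 30" is excluded: the !~ operator rejects any field containing "apple" as a substring.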
Functions
Awk has several built-in functions for string manipulation, arithmetic, and more.
For example, the sub() function substitutes the first occurrence of a string in a field with another string (gsub() replaces all occurrences):
awk '{sub("old", "new", $1); print}' file
The length() function returns the length of a string:
awk '{print length($1)}' file
The sqrt() function returns the square root of a number:
awk '{print sqrt($1)}' file
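Putting these three functions together on a single made-up input line:

```shell
# sub() swaps "foo" for "bar" in the first field, length() measures
# the result, and sqrt() takes the square root of the second field
echo "foo 16" | awk '{sub("foo", "bar", $1); print $1, length($1), sqrt($2)}'
# prints: bar 3 4
```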
Examples
Here are a few examples of common tasks that can be done with awk:
Count the number of lines in a file:
awk 'END{print NR}' file
Extract the first column from a tab-delimited file:
awk 'BEGIN{FS="\t"} {print $1}' file
Extract the second and fourth columns from a comma-separated file:
awk 'BEGIN{FS=","} {print $2, $4}' file
Convert a comma-separated file to a tab-delimited file:
awk 'BEGIN{FS=","; OFS="\t"} {print}' file
Replace all occurrences of "old" with "new" in the first field of a tab-delimited file (gsub() replaces every occurrence, and setting OFS to tab keeps the output tab-delimited):
awk 'BEGIN{FS=OFS="\t"} {gsub("old", "new", $1); print}' file
Print the maximum value in the first column (initializing max with the first value, so that negative numbers are handled correctly):
awk 'NR==1 {max=$1} {if($1 > max) max=$1} END{print max}' file
I hope this tutorial has helped you understand the basics of awk and how it can be used for text processing tasks. Happy coding!
Exercises
Here are some exercises and solutions to help you practice using awk:
Exercise 1
Write an awk script that counts the number of lines in a file that contain the string "apple".
Solution 1
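One possible solution, using a regular expression as a line filter (sample input shown inline for illustration):

```shell
# Count lines containing "apple"; adding 0 prints 0 rather than an
# empty string when there are no matches
printf 'apple pie\ncherry\ngreen apple\n' | awk '/apple/ {count++} END{print count + 0}'
# prints: 2
```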
Exercise 2
Write an awk script that extracts the second column from a tab-delimited file and converts it to uppercase.
Solution 2
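One possible solution, using awk's built-in toupper() function (sample input shown inline for illustration):

```shell
# Print the second tab-separated column in uppercase
printf 'a\thello\nb\tworld\n' | awk 'BEGIN{FS="\t"} {print toupper($2)}'
# prints: HELLO
#         WORLD
```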
Exercise 3
Write an awk script that calculates the average value of the third column in a comma-separated file.
Solution 3
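One possible solution, accumulating a running sum and dividing by the record count in the END block (sample input shown inline for illustration):

```shell
# Average the third comma-separated column; the NR > 0 guard avoids
# dividing by zero on empty input
printf '1,2,10\n3,4,20\n5,6,30\n' | awk 'BEGIN{FS=","} {sum += $3} END{if(NR > 0) print sum/NR}'
# prints: 20
```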
Exercise 4
Write an awk script that replaces all occurrences of "apple" with "banana" in the first column of a tab-delimited file.
Solution 4
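One possible solution. Since the exercise asks for all occurrences, gsub() is used rather than sub(), which would replace only the first match per field (sample input shown inline for illustration):

```shell
# Replace every "apple" in the first field; FS=OFS="\t" keeps the
# output tab-delimited when $0 is rebuilt after modifying $1
printf 'apple-apple\tred\npear\tgreen\n' | awk 'BEGIN{FS=OFS="\t"} {gsub("apple", "banana", $1); print}'
# prints: banana-banana (tab) red
#         pear (tab) green
```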
Exercise 5
Write an awk script that extracts the first three columns from a space-delimited file, reverses the order of the fields, and prints them separated by tabs.
Solution 5
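One possible solution, letting OFS insert the tabs (sample input shown inline for illustration):

```shell
# Print the first three space-delimited fields in reverse order,
# joined by tabs via OFS
printf 'a b c d\ne f g h\n' | awk 'BEGIN{OFS="\t"} {print $3, $2, $1}'
# prints: c (tab) b (tab) a
#         g (tab) f (tab) e
```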
I hope these exercises help you practice using awk and become more comfortable with the tool. If you have any questions, feel free to ask!
Bioinformatics one-liners
Count number of sequences in a FASTQ file:
zcat example.fastq.gz | awk 'END{print NR/4}'
Note: this is safer than counting header lines with grep, since quality lines can also begin with the "@" character and would be miscounted.
Convert a multi-line FASTA to single-line:
awk '/^>/ { if(NR>1) print ""; printf("%s\n",$0); next; } { printf("%s",$0);} END {printf("\n");}' example.fasta
Only print annotations on a specific scaffold (chr2) that fall between 5Mb and 6Mb from a BED annotation file.
awk 'BEGIN{FS="\t";OFS="\t"} {if($1 == "chr2" && $2 >=5000000 && $2 <= 6000000) print}' example.bed
Note: when we specify that we only want annotations from chr2, we're using an exact match (== "chr2") rather than a pattern match (~ /chr2/), which would also match scaffold names such as "chr22".
Only print lines of GFF annotation file that match the string "exon" in their third column.
awk 'BEGIN{FS="\t"} {if($3 ~ /exon/) print $0}' example.gff3
Convert from GFF (genome feature file) to BED file
grep -v '^#' example.gff3 | awk 'BEGIN{FS="\t"; OFS="\t"} {print $1,$4-1,$5}'
Note: remember that BED and GFF files use different coordinate systems: BED start coordinates are 0-based, half-open, while GFF is 1-based, inclusive! Also, we first use grep to skip the header lines in the GFF file.