Welcome to this tutorial on awk, a powerful text processing tool for parsing and manipulating text files in Unix-like operating systems.
Introduction
Awk was developed at Bell Labs in the 1970s and is often used for one-liner programs and manipulating text files. It is particularly useful for working with tabular data, such as CSV or BED files.
Syntax
Awk scripts are organized as:
awk 'pattern { action; other action }' file
This means that every time the pattern is true, the actions inside the braces are executed. If no pattern is specified, the action is performed for every line in the input file.
For example, the following command will print every line in the input file:
awk '{print}' file | head
The two most important patterns in awk are BEGIN and END, which cause their actions to run before any input is read and after the last line has been processed, respectively.
awk 'BEGIN{sum=0} {sum+=1} END {print sum}' file
This line initializes a variable sum at the start of the script, adds 1 to it every line, and then prints its value at the end.
Note that if a variable hasn't been initialized, it is treated as 0 in numeric expressions and an empty string in string expressions. Awk will not print an error in this case.
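A minimal sketch of this behavior (no input file needed, since a BEGIN block runs before any input is read):

```shell
# "unset" has never been assigned: it behaves as 0 in arithmetic
# and as an empty string when used as a string
awk 'BEGIN{print unset + 1; print "[" unset "]"}'
# prints: 1
#         []
```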
Input and Output
In awk, input is split into records and fields. By default, records are separated by newline characters, and each record is subdivided into fields (columns) as determined by the field separator (FS).
There are several built-in variables in awk that are useful for parsing text. The fields of each record are referred to by $number, so the first column would be $1, the second would be $2, etc. $0 refers to the entire record.
For example, to print the second column of each line in the input file, we would use:
awk '{print $2}' file | head
To print the second column followed by the first column, we would use:
awk '{print $2, $1}' file | head
Note that when the different fields are separated with commas in the print statement, they are joined by the output field separator (the OFS variable, described below), which is by default a space. If the comma is omitted between fields (e.g., awk '{print $2 $1}'), they are concatenated without a separator.
We can also print strings using quotation marks:
awk '{print "First column: " $1}' file | head
This will print the text "First column: " followed by the value in the first field for every line in the input file.
Built-in Variables
Awk has several built-in variables that are very useful for parsing text, including:
FS: field separator (default: whitespace)
OFS: output field separator, i.e. what character separates fields when printing (default: a single space)
RS: record separator, i.e. what character records are split on (default: newline)
ORS: output record separator (default: newline)
NR: number of records read so far; in an END block, the total number of records (by default, lines)
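A quick sketch of how these variables work together, using a small inline sample instead of a file:

```shell
# Read comma-separated input, write tab-separated output, and prefix
# each record with its line number via NR
printf 'a,b\nc,d\n' | awk 'BEGIN{FS=","; OFS="\t"} {print NR, $1, $2}'
# prints: 1	a	b
#         2	c	d
```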
Conditionals and Pattern Matching
Like other programming languages, awk allows the use of conditional statements with if and else.
awk '{if(condition) action; else other_action}'
awk uses the following conditional operators:
== equal to
!= not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
&& AND
|| OR
! NOT
For example, the following script will print the line if the first field is equal to "apple":
awk '{if($1 == "apple") print}' file
In addition, awk supports string matching with regular expressions, using the following operators:
~ matches
!~ does not match
For example, the following script will print the line if the first field contains the string "apple":
awk '{if($1 ~ /apple/) print}' file
Note that for pattern matching, the regular expression must be enclosed in slashes.
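These operators combine naturally with the conditional operators above. A small sketch, using a made-up fruit list as sample input:

```shell
# Print lines whose first field does NOT contain "apple" and whose
# second field is greater than 10
printf 'apple 5\npear 20\napplesauce 30\nplum 8\n' | awk '{if($1 !~ /apple/ && $2 > 10) print}'
# prints: pear 20
```

Note that "applesauce 30" is excluded: the !~ operator rejects any field containing "apple" as a substring.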
Functions
Awk has several built-in functions for string manipulation, arithmetic, and more.
For example, the sub() function substitutes the first occurrence of a string in a field with another string (gsub() replaces all occurrences):
awk '{sub("old", "new", $1); print}' file
The length() function returns the length of a string:
awk '{print length($1)}' file
The sqrt() function returns the square root of a number:
awk '{print sqrt($1)}' file
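Putting these three functions together on a single made-up input line:

```shell
# sub() swaps "foo" for "bar" in the first field, length() measures
# the result, and sqrt() takes the square root of the second field
echo "foo 16" | awk '{sub("foo", "bar", $1); print $1, length($1), sqrt($2)}'
# prints: bar 3 4
```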
Examples
Here are a few examples of common tasks that can be done with awk:
Count the number of lines in a file:
awk 'END{print NR}' file
Extract the first column from a tab-delimited file:
awk 'BEGIN{FS="\t"} {print $1}' file
Extract the second and fourth columns from a comma-separated file:
awk 'BEGIN{FS=","} {print $2, $4}' file
Convert a comma-separated file to a tab-delimited file:
awk 'BEGIN{FS=","; OFS="\t"} {print}' file
Replace all occurrences of "old" with "new" in the first field of a tab-delimited file (gsub() replaces every occurrence, and setting OFS to tab keeps the output tab-delimited):
awk 'BEGIN{FS=OFS="\t"} {gsub("old", "new", $1); print}' file
Print the maximum value in the first column (initializing max with the first value, so that negative numbers are handled correctly):
awk 'NR==1 {max=$1} {if($1 > max) max=$1} END{print max}' file
I hope this tutorial has helped you understand the basics of awk and how it can be used for text processing tasks. Happy coding!
Exercises
Here are some exercises and solutions to help you practice using awk:
Exercise 1
Write an awk script that counts the number of lines in a file that contain the string "apple".
Solution 1
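One possible solution, using a regular expression as a line filter (sample input shown inline for illustration):

```shell
# Count lines containing "apple"; adding 0 prints 0 rather than an
# empty string when there are no matches
printf 'apple pie\ncherry\ngreen apple\n' | awk '/apple/ {count++} END{print count + 0}'
# prints: 2
```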
Exercise 2
Write an awk script that extracts the second column from a tab-delimited file and converts it to uppercase.
Solution 2
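One possible solution, using awk's built-in toupper() function (sample input shown inline for illustration):

```shell
# Print the second tab-separated column in uppercase
printf 'a\thello\nb\tworld\n' | awk 'BEGIN{FS="\t"} {print toupper($2)}'
# prints: HELLO
#         WORLD
```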
Exercise 3
Write an awk script that calculates the average value of the third column in a comma-separated file.
Solution 3
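One possible solution, accumulating a running sum and dividing by the record count in the END block (sample input shown inline for illustration):

```shell
# Average the third comma-separated column; the NR > 0 guard avoids
# dividing by zero on empty input
printf '1,2,10\n3,4,20\n5,6,30\n' | awk 'BEGIN{FS=","} {sum += $3} END{if(NR > 0) print sum/NR}'
# prints: 20
```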
Exercise 4
Write an awk script that replaces all occurrences of "apple" with "banana" in the first column of a tab-delimited file.
Solution 4
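One possible solution. Since the exercise asks for all occurrences, gsub() is used rather than sub(), which would replace only the first match per field (sample input shown inline for illustration):

```shell
# Replace every "apple" in the first field; FS=OFS="\t" keeps the
# output tab-delimited when $0 is rebuilt after modifying $1
printf 'apple-apple\tred\npear\tgreen\n' | awk 'BEGIN{FS=OFS="\t"} {gsub("apple", "banana", $1); print}'
# prints: banana-banana (tab) red
#         pear (tab) green
```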
Exercise 5
Write an awk script that extracts the first three columns from a space-delimited file, reverses the order of the fields, and prints them separated by tabs.
Solution 5
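One possible solution, letting OFS insert the tabs (sample input shown inline for illustration):

```shell
# Print the first three space-delimited fields in reverse order,
# joined by tabs via OFS
printf 'a b c d\ne f g h\n' | awk 'BEGIN{OFS="\t"} {print $3, $2, $1}'
# prints: c (tab) b (tab) a
#         g (tab) f (tab) e
```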
I hope these exercises help you practice using awk and become more comfortable with the tool. If you have any questions, feel free to ask!
Bioinformatics one-liners
Count number of sequences in a FASTQ file:
zcat example.fastq.gz | awk 'END{print NR/4}'
Note: this is safer than counting header lines with grep, since quality lines can also begin with the "@" character and would be miscounted.
Convert a multi-line FASTA to single-line:
awk '/^>/ { if(NR>1) print ""; printf("%s\n",$0); next; } { printf("%s",$0);} END {printf("\n");}' example.fasta
Only print annotations on a specific scaffold (chr2) that fall between 5Mb and 6Mb from a BED annotation file.
awk 'BEGIN{FS="\t";OFS="\t"} {if($1 == "chr2" && $2 >=5000000 && $2 <= 6000000) print}' example.bed
Note: when we specify that we only want annotations from chr2, we're using an exact match (== "chr2") rather than a pattern match (~ /chr2/), which would also match scaffold names such as "chr22".
Only print lines of GFF annotation file that match the string "exon" in their third column.
awk 'BEGIN{FS="\t"} {if($3 ~ /exon/) print $0}' example.gff3
Convert from GFF (genome feature file) to BED file
grep -v '^#' example.gff3 | awk 'BEGIN{FS="\t"; OFS="\t"} {print $1,$4-1,$5}'
Note: remember that BED and GFF files use different coordinate systems: BED start coordinates are 0-based, half-open, while GFF is 1-based, inclusive! Also, we first use grep to skip the header lines in the GFF file.