
Intro to Linux shell commands



Linux/Unix shell commands can be useful for quickly solving bioinformatics problems without extensive coding. They make it easy to parse and manipulate data without writing data structures or file-handling code. The examples provided below demonstrate some of the most useful shell commands for bioinformatics processing. The syntax shown is for bash, although the commands can be modified to work in other shells. The following commands will be explained:

  1. A for loop in bash: This command is used to iterate over a list of items and perform an action on each item in the list.

  2. grep: This command allows you to search for a specific pattern in a file or multiple files. For example, you can use grep '^>' file.fasta to search for lines that start with the '>' character in a FASTA file. You can also use the -v flag to return lines that do not match the pattern, or the -c flag to count the number of matching lines.

  3. awk: This powerful command allows you to perform actions on specific fields in a tabular file. For example, you can use awk '{print $1}' file.txt to print the first field of each line in a tab-delimited file. You can also print several fields at once, e.g. awk '{print $1, $3, $5}' file.txt (awk has no field-range syntax; $3-5 would be interpreted as arithmetic).

  4. sed: This command allows you to perform text manipulation on a file or input stream. For example, you can use sed 's/pattern/replacement/g' file.txt to replace all occurrences of a specific pattern in a file with a replacement string.

  5. cut: This command allows you to extract specific fields from a tabular file. For example, you can use cut -f1,3 file.txt to extract the first and third fields from each line in a tab-delimited file.

  6. sort: This command allows you to sort a file by specific fields. For example, you can use sort -k2 file.txt to sort a tab-delimited file by the second field. You can also specify the -n flag to sort numerically, or the -r flag to sort in reverse order.

  7. uniq: This command allows you to remove duplicate lines from a file. For example, you can use uniq file.txt to remove duplicate lines in a file. You can also use the -c flag to count the number of occurrences of each unique line.

  8. wc: This command allows you to count the number of lines, words, and characters in a file or from standard input. For example, you can use wc -l file.txt to count the number of lines in a file.

  9. join: This command allows you to join the lines of two files based on a common field. Note that the "join" command requires that both input files are sorted based on the common field. You can use the "sort" command to sort the files if they are not already sorted.

  10. screen: This command allows you to keep interactive shells running even after you have disconnected from them, allowing you to log into a remote system, do interactive work in a shell, and disconnect while retaining the shell session. This can be especially useful if you have a long-running task that you need to keep running even after you have logged out.


These commands can be used to loop over inputs, search for patterns, perform actions on specific fields, manipulate text, extract specific fields, sort data, remove duplicate lines, count lines and words, join files on a common field, and keep shell sessions running. Together, they can make bioinformatics processing more efficient and streamline data manipulation.


The following examples demonstrate some of the most useful shell commands for bioinformatics processing. While the syntax is written for bash, these examples can typically be adapted for use in other shells. These examples are not exhaustive, but rather meant to provide a starting point for those interested in using shell commands for bioinformatics tasks.


Looping using for

for loops can be used in combination with other bash commands. For example, if we wanted to create a new directory for each chromosome and move the corresponding data file into it, we could use the mkdir and mv commands like this:

for chr in {1..22}; do
    mkdir chr$chr
    mv data_chr$chr.txt chr$chr/
done

This is a for loop that iterates over the values 1 through 22, setting the shell variable $chr to each value in succession. The loop runs the commands between do and done for each value of $chr. The first command creates a new directory called "chr$chr" (where $chr is the current value of the loop variable). The second command moves the file "data_chr$chr.txt" into the newly created directory. For example, when $chr is 1, the loop creates a directory called "chr1" and moves the file "data_chr1.txt" into it; when $chr is 2, it creates "chr2" and moves "data_chr2.txt" into it, and so on. When the loop has finished running, there will be 22 directories, each containing one of the data files.


Another common use of for loops in bioinformatics is to loop over a list of samples. For example, if we had a list of samples in a file called samples.txt and we wanted to run ./command on each one, we could use:

for sample in $(cat samples.txt); do ./command $sample; done

It’s also possible to nest for loops. For example, if we wanted to loop over both chromosomes and samples, we could use:

for chr in {1..22}; do 
    for sample in $(cat samples.txt); do 
        ./command $sample chr$chr
    done
done

We can use a for loop to loop over a group of files with a specific file extension in our current directory. For example, if we wanted to run a command called "./command" on every file with a ".txt" extension, we could use the following loop:

for file in *.txt; do ./command "$file"; done

This will iterate over every ".txt" file in the current directory and run "./command" on each one. The loop variable "$file" will be set to the name of the current file being processed.
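The loop variable can also be reshaped with bash parameter expansion. As a small sketch (the ".out" output naming is just an assumption), the following strips the ".txt" extension to build an output file name for each input:

for file in *.txt; do ./command "$file" > "${file%.txt}.out"; done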


Finally, it’s important to note that the for loop construct is just one way to loop over a set of values in the shell. There are also while loops and until loops, which can be useful in certain situations. However, for loops are generally the most commonly used and are a good starting point for those new to shell scripting. With a little practice, you’ll be able to use loops to streamline your workflows and make your bioinformatics analyses more efficient.
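As an illustration of the while form mentioned above, here is a minimal sketch that reads samples.txt line by line (equivalent to the for/cat pattern shown earlier):

while read -r sample; do
    ./command "$sample"
done < samples.txt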


Searching files with grep

Grep is a powerful tool for searching for strings or patterns within files. It allows you to search for specific strings or patterns, such as specific words or regular expressions, within a file or group of files.

To search for a specific string within a single file, you can use the following syntax:

grep "string" "filename"

For example, if you want to search for the string "string" within the file "file.txt", you would use the following command:

grep string file.txt

This will print out all lines in the file that contain the string "string".

You can also search for a string within multiple files by using a wildcard in place of the filename. For example, to search for the string "string" within all files with names ending in ".txt" in the current directory, you can use the command:

grep string *.txt

This will print out all lines in the specified files that contain the string "string", each preceded by the name of the file in which the line appears.

Grep also allows you to use regular expressions in your searches. Regular expressions are a way of specifying patterns in strings, and can be very powerful for searching for specific types of matches. For instance, the "^" character anchors a match to the beginning of the line. To search for lines that start with "chr" in a file called "file.vcf", you can use the command:

 grep "^chr" file.vcf

This will print out only those lines that begin with "chr".
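The -c and -v flags mentioned in the introduction combine naturally with such patterns. For example, assuming a FASTA file called file.fasta, the following commands count the sequences (header lines start with '>') and print everything except the header lines, respectively:

grep -c '^>' file.fasta
grep -v '^>' file.fasta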

Overall, grep is a valuable tool for searching for specific strings or patterns within files. It can save you time and effort when you need to find specific information within large datasets or groups of files. With a little practice and familiarity with regular expressions, you can use grep to quickly find the information you need.


Manipulating files with awk

Awk is a powerful tool for extracting specific fields or columns from text-based genomic files. It can be used to extract a single field, multiple fields, or only lines with specific field values.


To extract a single field from a file, use the following syntax:

awk '{ print $n }' filename

Where n is the number of the field you want to extract. For example, if we wanted to extract the fourth field (column) from a file called data.txt, we would use:

awk '{ print $4 }' data.txt

This would print the contents of the fourth field for each line in the file.


We can also redirect the output of the awk command to a file or pipe it to another command. For example, to save the output to a file called data_positions.txt, we would use:

awk '{ print $4 }' data.txt > data_positions.txt

Or, to send the output to another command called "./command", we would use:

awk '{ print $4 }' data.txt | ./command

Awk is actually a programming language, and the code inside the curly brackets ('{ ... }') is run on each line of the input file. By default, awk splits each line into fields that are separated by white space and assigns them to special variables named $1, $2, $3, etc. We can also specify a different delimiter using the -F option. For example, to split fields on the ',' character in a .csv file, we would use:

awk -F, '{ print }' data.csv

Note that a print command with no arguments simply prints the original line.
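Combining -F with a field selection works the same way. For example, to print the second column of a hypothetical comma-separated file:

awk -F, '{ print $2 }' data.csv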


In addition to extracting specific fields, we can also use awk to extract only lines with certain field values. For example, if we had a file called data.vcf and wanted to extract only lines with a FILTER value of "PASS" (the seventh field in a VCF file), we would use:

awk '$7 == "PASS" { print }' data.vcf

The boolean expression '$7 == "PASS"' is evaluated for each line, and only lines for which the expression is true are printed. We can also use more complex boolean expressions to specify multiple conditions. For example, to extract only lines with a chromosome value of 5 and a FILTER value of "PASS", we would use:

awk '$1 == 5 && $7 == "PASS"' data.vcf

Note that when no action block is given, awk simply prints the lines for which the boolean expression is true.


Awk can also be used to manipulate and format the output of the fields. For example, we can use the print command to produce lines with arbitrary text that includes field values from the input file. For example, to print "SNP [field 2] is on chromosome [field 1] at position [field 4]", we would use:

awk '{ print "SNP",$2,"is on chromosome",$1,"at position",$4 }' data.txt

We can also use the printf function to specify the format of the output more precisely. For example, to print the fields with a tab character between them, we would use:

awk '{ printf "%s\t%s\t%s\n", $1, $2, $4 }' data.txt

Awk is a versatile tool that can be used in many different contexts in bioinformatics, from extracting specific fields from large datasets to manipulating and formatting the output of other commands.


Manipulating files with sed

Sed is a powerful command-line utility for editing text files. The basic syntax for using sed is:

sed 's/pattern/replacement/' input_file > output_file

Where pattern is a regular expression that specifies the text you want to search for, and replacement is the text you want to replace it with. The s at the beginning of the command stands for "substitute," indicating that we want to perform a substitution. The input_file is the file that you want to edit, and the output_file is the file where the edited version of the input file will be saved. Note that this only replaces the first occurrence of the pattern on each line.
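As a small, hypothetical example, the following replaces the first occurrence of "chr" on each line of input.txt with "chromosome" and writes the result to output.txt:

sed 's/chr/chromosome/' input.txt > output.txt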


Sed allows you to use flags to modify the behavior of the substitution command. Some common flags include:

  • g: Perform the substitution globally (i.e., replace all instances of the pattern, not just the first one).

  • i: Ignore case when searching for the pattern.

  • w file: Write lines on which a substitution was made to the specified file, in addition to the normal output.

  • n (a number): Only replace the nth occurrence of the pattern on each line.

For example, to replace all instances of the word "pattern" with the word "replacement" globally, ignoring case, you could use the following command:

sed 's/pattern/replacement/gi' input_file > output_file

Sed has many advanced features that can be used to perform more complex manipulations of text files, such as grouping and back-referencing, as the following example demonstrates:

sed 's/\([A-Z]\+\) \([A-Z]\+\)/\2 \1/' input_file > output_file

This command will swap the first two uppercase words on every line of the input file and save the edited version in the output file. The \( \) symbols group parts of the regular expression, and \1 and \2 refer back to those groups (the \+ quantifier, meaning "one or more", is a GNU sed extension).


You can also use sed with regular expressions. Regular expressions are a powerful tool for matching and manipulating patterns in text. Sed allows you to use regular expressions in your search patterns and replacements.


For example, to replace all numbers in the input file with a blank space, you could use the following command:

sed 's/[0-9][0-9]*/ /g' input_file > output_file

This command uses the regular expression [0-9][0-9]* to match any sequence of one or more digits and replaces it with a blank space. (Using [0-9]* alone would also match the empty string at every position and insert unwanted spaces.) The g flag specifies that the substitution should be performed globally (i.e., on all instances of the pattern).


Sed can also be used in combination with other command-line tools using pipes. For example, to sort the input file and then delete all lines containing the word "apple," you could use the following command:

sort input_file | sed '/apple/d' > output_file

Overall, sed is a powerful and flexible command-line utility for editing text files. In this section, we covered the basic syntax of sed and provided some examples of its usage.


Extracting fields using cut

The cut command is a useful tool for extracting specific fields or columns from a file. By default, cut assumes that fields are separated by tab characters, but the -d option can be used to specify a different delimiter.
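For example, to extract the first and third comma-separated fields from a hypothetical file called data.csv, you could use:

cut -d',' -f1,3 data.csv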

To extract a range of fields, use the -f option followed by the range of fields you want to extract. For example, to extract fields 1 through 5 from a file called "data.txt", you would use the following command:

cut -f1-5 data.txt

You can also specify multiple ranges of fields to extract, separated by commas. For example, to extract fields 1 through 5 and 10 through 20 from "data.txt", use the following command:

cut -f1-5,10-20 data.txt

To extract all fields starting from a specific field to the end of each line, use the -f option followed by the starting field and a dash. For example, to extract all fields starting from field 6, use the following command:

cut -f6- data.txt

Keep in mind that cut does not allow you to extract fields in a different order than they appear in the original file. If you want to do this, you can use the awk command instead.
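For instance, a small awk sketch that prints the third field before the first, something cut cannot do:

awk '{ print $3, $1 }' data.txt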


Sorting files using sort

The program sort can be used to sort the contents of a file. It can sort on multiple fields, which are specified with the -k option. The -k option takes the field (or field range) to use as the sort key, and appending the n modifier makes that key sort numerically. For example, the following command sorts the contents of the file sites.txt first by the second column and then by the third column, both numerically:

sort -k2,2n -k3,3n sites.txt > sort.txt

This can be useful in bioinformatics when working with data that needs to be sorted by chromosome number and physical position, for example. Sorting data in this way can also be useful when working with other shell tools like uniq, join, and comm.
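As a rough sketch of that chromosome/position use case (the file name and column layout here are assumptions), you might sort a file whose first column is the chromosome name and whose second column is the position like this:

sort -k1,1 -k2,2n regions.txt > regions.sorted.txt

Note that a plain lexical sort places chr10 before chr2; GNU sort's -V (version sort) option can order such names more naturally.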


Remove duplicates using uniq

To get the unique lines from a file using the uniq command, you can use the following syntax:

uniq <input_file>

This will return a list of the unique lines from the input file, with consecutive duplicates removed.


You can also use the uniq command in combination with the sort command to get a list of unique lines from unsorted input. For example:

sort <input_file> | uniq

This will first sort the lines of the input file, and then return a list of the unique lines.


The uniq command also has several options that can be useful for different tasks.

One option is -c, which causes the uniq command to count the number of copies of each unique line in the input. For example:

sort <input_file> | uniq -c

This will return a list of the unique lines in the input file, along with a count of how many copies of each line exist in the input.
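A common follow-up is to pipe these counts back into sort to rank the most frequent lines, as in this sketch:

sort <input_file> | uniq -c | sort -rn | head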


Suppose we have a set of genes in “data1.txt” and “data2.txt” with ids in the first column. To get the combined list of all gene ids appearing in either file, also known as the set union, we can do the following:

cat data1.txt data2.txt | awk '{print $1}' | sort | uniq > union

Another option is -d, which causes the uniq command to only print lines that appear more than once in the input. Provided each gene id appears at most once per file, this can be used to perform set intersection on two files, as shown in the following example.


We could determine the set of genes that are shared in common between these two files in the following way:

cat data1.txt data2.txt | awk '{print $1}' | sort | uniq -d > intersection

The -u option, on the other hand, causes the uniq command to only print lines that appear once in the input. This can be used to perform set subtraction.

awk '{print $1}' data1.txt | cat - intersection | sort | uniq -u > data1-only

In summary, the uniq command is a useful tool for getting unique lines from a file and performing set operations on multiple files. It has several options that can be used to customize its behavior for different tasks.


Joining files using join

The join command allows you to combine information from two different files by matching fields between the two files. The input files must be sorted by the field that you want to join on.


To use join, you can specify the field to join on using the -j option. For example, to join on field 1 in both files, you can use -j 1. If you want to join on different fields in each file, you can use the -1 and -2 options, followed by the field numbers in each file.


By default, join only prints lines that have a matching field in both files. However, you can use the -a option followed by a file number (1 or 2) to print all lines from that file, including those that do not match any lines in the other file.


Here's an example of how you might use join to combine information from two files, "data1.txt" and "data2.txt", each containing unique gene ids in the first column plus additional data fields:

sort -k 1 data1.txt > data1.sort.txt
sort -k 1 data2.txt > data2.sort.txt
join -j 1 data1.sort.txt data2.sort.txt > data1_data2.txt

If you want to retain all the information from one of the input files, you can use the -a option as described above. For example, to retain all the information from "data1.txt" and add additional information from "data2.txt", you could do the following:

sort -k 1 data1.txt > data1.sort.txt
sort -k 1 data2.txt > data2.sort.txt
join -j 1 -a 1 -1 1 -2 1 data1.sort.txt data2.sort.txt > data1_data2.txt

You can use the awk command to fill in any unmatched lines with a placeholder value, such as "NA". For example:

join -j 1 -a 1 -1 1 -2 1 data1.sort.txt data2.sort.txt | awk '{ if (NF < 3) { print $0, "NA" } else { print } }' > all.txt

This will fill in unmatched lines with "NA" if there are fewer than 3 fields on those lines.


Counting using wc

The wc command is a simple program that allows you to count the number of lines, words, and characters in a file. It can be used on its own or in combination with other commands through the use of pipes.


To use wc, simply enter the command followed by the name of the file you want to analyze. For example, to get a count of the number of lines, words, and characters in a file called data.txt, you would enter the following command:

wc data.txt

You can also specify which type of count you want to see by using options. For example, to see only the line count, you can use the -l option like this:

wc -l data.txt

You can also use wc in combination with other commands, such as set operations, by piping the output to wc -l. This allows you to get counts without creating new files.
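For example, the size of the gene intersection from the uniq section could be counted directly, without writing an intermediate file (reusing the same hypothetical data1.txt and data2.txt):

cat data1.txt data2.txt | awk '{print $1}' | sort | uniq -d | wc -l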


For example, you can use the wc -l command in combination with the grep command like this:

grep "pattern" file.txt | wc -l

This will search for the specified pattern in the file and then use wc -l to count the number of lines that contain the pattern.


Overall, the wc command is a simple and useful tool for counting lines, words, and characters in a file, and it can be easily combined with other commands for more advanced analysis.


Retain shell sessions using screen

Screen is a terminal multiplexer that allows you to run multiple shell sessions within a single terminal window, or even after you have disconnected from a remote system. This can be especially useful if you have a long-running task that you need to keep running even after you have logged out.

To use screen, simply type "screen" on the command line. This will start a new shell within screen. You can then work as normal within this shell. When you are ready to disconnect, but keep the shell running in the background, press "Ctrl-a d". This will disconnect you from the screen session, but leave it running in the background.

To reattach to a running screen session, use the command "screen -r". If you have multiple screen sessions running, you can use "screen -ls" to list them, and then use "screen -r -d [pid.tty.host]" to reattach to a specific session. The -d option detaches the session from any other terminal it may be attached to.
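A typical session might therefore look like the following sketch (the long-running command is just a placeholder):

screen                      # start a new shell inside screen
./long_running_command      # do your interactive or long-running work
# press Ctrl-a d to detach; the session keeps running in the background
screen -r                   # later, reattach to the session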

While working within a screen session, you can use "Ctrl-a" to issue commands to screen itself. Just be aware that "Ctrl-a" will not move your cursor to the beginning of the line as it does in a standard bash session. For more information on using screen, you can consult the Screen User's Manual.


These are just a few examples of the many shell tricks that can be useful for bioinformatics processing. With a little bit of practice and experimentation, you can use these commands to quickly and efficiently process large datasets and extract the information you need.
