Unix pipes connect the standard output of one command to the standard input of the next, so several commands can be chained into a pipeline. This is useful for filtering and transforming data, or for combining simple commands into a more complex operation. Pipes are created with the | operator and let data flow between processes without being written to disk.
Suppose we have a file called input.txt containing a list of names, one name per line. We want to extract only the names that start with the letter "A" and convert them to uppercase.
We can use the following pipeline to accomplish this task:
cat input.txt | grep "^A" | tr a-z A-Z
This pipeline consists of three commands connected by pipes:
cat input.txt reads the contents of the input.txt file and writes it to the standard output.
grep "^A" reads the standard input and filters it for lines that start with the letter "A". It writes the matching lines to the standard output.
tr a-z A-Z reads the standard input and converts all lowercase letters to uppercase. It writes the resulting text to the standard output.
The final output of the pipeline will be the names in input.txt that start with "A" in uppercase.
This is just one example of how Unix pipes can be used to create a pipeline of multiple commands. You can use pipes with a wide variety of commands to create pipelines that fit your specific needs.
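To try the pipeline without an existing file, here is a minimal sketch that builds a small sample input.txt in a temporary directory (the names are made up for illustration). As a variation, it lets grep read the file directly, which saves the extra cat process, and uses the POSIX character classes with tr.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Hypothetical sample data: one name per line.
printf '%s\n' Alice bob Anna Carol andrew > "$workdir/input.txt"

# grep reads the file itself; tr upper-cases whatever grep lets through.
result=$(grep '^A' "$workdir/input.txt" | tr '[:lower:]' '[:upper:]')
printf '%s\n' "$result"
```

Only Alice and Anna pass the filter, because grep is case-sensitive and "andrew" starts with a lowercase letter.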
Redirection
In addition to pipes, Unix shells can connect a command's standard input and output to files using the < and > operators. The < operator makes a command read its input from a file, while the > operator writes the output of a command to a file, creating the file if necessary and overwriting any existing contents. For example, the following command redirects the output of the ls command to a file called output.txt:
ls > output.txt
Subsequently, we can use the < operator to redirect the input of the wc command from the output.txt file:
wc -l < output.txt
In this example, the wc command counts the number of lines read from its standard input, the output.txt file.
You can use the < operator with any command that reads from the standard input, not just the wc command. This can be useful for redirecting the input of a command from a file instead of typing it directly on the command line.
For directing standard output, you can also use the >> operator to append the output of a command to an existing file, rather than overwriting the contents of the file.
Suppose we have a file called log.txt that contains a log of some events. We want to add a new entry to the log file using the echo command.
To append the output of the echo command to the log.txt file, we can use the following command:
$ echo "This is a new log entry" >> log.txt
This will add the text "This is a new log entry" to the end of the log.txt file, without overwriting the contents of the file.
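The difference between >, >> and < can be seen side by side in a short sketch. It uses a log.txt inside a temporary directory, and the log entries are placeholders.

```shell
workdir=$(mktemp -d)
log="$workdir/log.txt"

echo "first entry"  >  "$log"   # > creates the file (or truncates it)
echo "second entry" >> "$log"   # >> appends to the end
echo "third entry"  >> "$log"

# wc reads the file through stdin, so it prints only the count, no file name.
# tr strips the padding that some wc implementations add around the number.
count=$(wc -l < "$log" | tr -d ' ')
echo "log.txt has $count lines"
```

Replacing either >> with > would leave only the most recent entry in the file.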
In addition to standard input and output, every process has a standard error stream (stderr) that is separate from the standard output stream (stdout). By default both are printed to the terminal, but the > operator and pipes affect only stdout, so error messages still appear on the terminal. To redirect stderr, use the 2> and 2>> operators; to send it through a pipe, first merge it into stdout with 2>&1. For example, the following command redirects the stderr of the ls command to a file called error.txt:
ls 2> error.txt
In Bash, you can also use the &> operator to redirect both stdout and stderr to the same file (the portable equivalent is > file 2>&1). For example:
ls &> stdout_and_error.txt
This will redirect both the standard output and the standard error of the ls command to the stdout_and_error.txt file.
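To see the two streams being separated, the following sketch defines a small helper function, both, that writes one line to each stream. The helper and the file names are made up for this illustration.

```shell
workdir=$(mktemp -d)

# Hypothetical helper that writes one line to stdout and one to stderr.
both() {
    echo "regular output"
    echo "error output" >&2
}

both > "$workdir/out.txt" 2> "$workdir/err.txt"   # one file per stream
both > "$workdir/all.txt" 2>&1                    # both streams, portable form

# A pipe carries only stdout; 2>&1 before the | sends stderr along too.
piped=$(both 2>/dev/null | wc -l | tr -d ' ')
merged=$(both 2>&1 | wc -l | tr -d ' ')
echo "stdout only: $piped line(s), both streams: $merged line(s)"
```

The order matters in the portable form: > file 2>&1 first points stdout at the file and then makes stderr a copy of it.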
Anonymous pipes
The pipes created with the | operator are anonymous pipes. An anonymous (or unnamed) pipe is a one-way communication channel between processes. It is created with the pipe system call and is only accessible to the process that created it and to its descendants, which inherit the file descriptors. Some shells, such as Bash, have a feature called process substitution that allows you to use anonymous pipes in a more flexible way.
Process substitution runs a list of commands with its input or output connected to a pipe, and replaces the expression on the command line with a file name that refers to that pipe (a /dev/fd entry where available, otherwise a temporary FIFO, which is a type of named pipe). The outer command then opens this name as if it were a regular file. The notation in Bash is <(command list) when the outer command should read the output of the list, and >(command list) when whatever the outer command writes to that file should become the standard input of the list.
For example, the following command uses process substitution to hand the output of two find commands to wc as if they were two files, and wc counts the number of lines in each:
wc -l <(find / -mindepth 1 -maxdepth 1 -type d) <(find /opt -mindepth 1 -maxdepth 1 -type d)
This command will output the number of directories directly inside / and /opt, one count per line, followed by a total.
Overall, anonymous pipes can be useful for simple interprocess communication, but they have a few limitations. They are only accessible to processes that have a reference to them, and they do not have a name in the filesystem, making it difficult to use them for communication between unrelated processes. Process substitution can be a useful feature for working with anonymous pipes in a more flexible way, allowing you to use them in situations where you would normally need to use a named pipe.
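Process substitution is most helpful with commands that insist on file arguments. As a sketch, comm needs two sorted files, and the example below feeds it two sort commands instead of temporary files. The file names and contents are made up, and the snippet assumes Bash (or another shell with process substitution), since the feature is not part of POSIX sh.

```shell
# Requires Bash, zsh or ksh: process substitution is not available in plain sh.
workdir=$(mktemp -d)
printf '%s\n' pear apple fig   > "$workdir/a.txt"
printf '%s\n' fig banana apple > "$workdir/b.txt"

# comm expects two sorted files; each <(...) expands to a path such as
# /dev/fd/63 that comm opens like any other file.
# -12 hides the lines unique to either file and keeps the common ones.
common=$(comm -12 <(sort "$workdir/a.txt") <(sort "$workdir/b.txt"))
printf '%s\n' "$common"
```

Running echo <(true) in Bash prints the substituted path, which is a quick way to see what the outer command actually receives.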
Named pipes
A named pipe, also known as a FIFO (first in, first out), is a special file that acts as a communication channel between processes. It is similar to an anonymous pipe, but it has a name in the filesystem, allowing it to be used for communication between unrelated processes. Named pipes are created using the mkfifo command and have the same characteristics as any other file, such as ownership, permissions, and metadata.
Several processes can open the same named pipe for reading or writing, but the data still flows in one direction only, from the writers to the readers, and each byte is delivered to exactly one reader. Two-way communication therefore needs two named pipes, one per direction. In Linux, you can create a named pipe using the mkfifo command or the mknod command with the letter "p" to indicate that it is a named pipe. For example:
mkfifo pipe1
mknod pipe2 p
Named pipes can be used in combination with anonymous pipes to create more complex applications. For example, you can use a named pipe as a buffer to transfer data between processes that are running at different speeds, or as a way to synchronize access to shared resources between processes.
To use a named pipe, you can read from and write to it using the same commands and techniques that you would use with a regular file. For example, you can use the cat command to read from a named pipe, or the echo command to write to it. Keep in mind that opening a named pipe blocks until another process opens the other end, so a reader and a writer must both be running. You can also use the tee command to split the output of a command and write it to both a named pipe and a regular file at the same time.
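A minimal sketch of a writer and a reader talking through a named pipe might look like this. The pipe lives in a temporary directory and the message is a placeholder; the writer runs in the background because it cannot proceed until a reader opens the pipe.

```shell
workdir=$(mktemp -d)
fifo="$workdir/pipe1"
mkfifo "$fifo"

# The writer blocks until a reader opens the pipe, so run it in the background.
echo "hello through the FIFO" > "$fifo" &

# The reader opens the other end; cat returns once the writer closes the pipe.
message=$(cat "$fifo")
wait    # collect the background writer
echo "received: $message"

# A named pipe stays in the filesystem until it is removed.
rm -r "$workdir"
```

In practice the two ends would usually be separate programs or terminals; running both in one script just keeps the example self-contained.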
In conclusion, named pipes are a useful tool for interprocess communication in Unix-based systems. They provide a flexible way to transfer data between unrelated processes and can be used for a variety of purposes, from simple one-way communication to two-way exchanges built from a pair of pipes and synchronization between processes.
The tee command
The tee command is a utility that reads from standard input and writes the output to standard output and one or more files. It can be useful for splitting the output of a command and writing it to both a file and the terminal at the same time.
Here is an example of using the tee command:
$ ls -l | tee directory_list.txt
total 64
drwxr-xr-x 5 user staff 160 Jan 1 12:00 Desktop
drwxr-xr-x 3 user staff 96 Jan 1 12:00 Documents
drwxr-xr-x 3 user staff 96 Jan 1 12:00 Downloads
drwxr-xr-x 3 user staff 96 Jan 1 12:00 Library
drwxr-xr-x 3 user staff 96 Jan 1 12:00 Movies
drwxr-xr-x 3 user staff 96 Jan 1 12:00 Music
drwxr-xr-x 3 user staff 96 Jan 1 12:00 Pictures
drwxr-xr-x 2 user staff 64 Jan 1 12:00 Public
In this example, the ls -l command lists the files and directories in the current directory and their attributes. The output of this command is passed to tee, which writes it to both the terminal and the file directory_list.txt. If you open the directory_list.txt file, you will see that it contains the same output as the terminal.
You can also use the -a option to append the output to the file instead of overwriting it. For example:
$ ls -l | tee -a directory_list.txt
This will append the output of the ls -l command to the end of the directory_list.txt file, rather than overwriting it.
In addition to writing the output to a file, you can also use the tee command to send the output to another command through process substitution. For example:
$ ls -l | tee >(wc -l > count.txt)
In this example, the output of the ls -l command is passed to tee, which writes it to the terminal and also to a process substitution using the >() syntax. The process substitution runs the wc -l command, which counts the number of lines in its input. The output of this command is then written to the file count.txt.
You can also use the tee command to split the output of a command and write it to multiple files at the same time. For example:
$ ls -l | tee directory_list.txt >(grep "^-" > files.txt) >(grep "^d" > directories.txt)
In this example, the output of the ls -l command is passed to tee, which writes it to the file directory_list.txt and also to two process substitutions using the >() syntax. The first process substitution runs the grep "^-" command, which filters the input for lines that start with a dash (indicating a regular file). The output of this command is written to the file files.txt. The second process substitution runs the grep "^d" command, which filters the input for lines that start with the letter "d" (indicating a directory). The output of this command is written to the file directories.txt.
Overall, the tee command is a useful tool for splitting the output of a command and writing it to multiple places at the same time. It can be used in combination with pipes and process substitutions to create more complex shell scripts and pipelines.
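tee does not have to sit at the end of a pipeline. As a small sketch with made-up input, it can also save an intermediate result to a file while the data continues on to the next command:

```shell
workdir=$(mktemp -d)

# Keep a copy of the de-duplicated list while the pipeline goes on to count it.
count=$(printf '%s\n' b a c a | sort -u | tee "$workdir/unique.txt" | wc -l | tr -d ' ')
echo "unique items: $count"
cat "$workdir/unique.txt"
```

This is handy for debugging long pipelines, since the saved file shows exactly what the later commands received.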
Together, pipes and redirection are useful tools for working with the Unix command-line interface and can help you create efficient and powerful shell scripts.
Bioinformatics one-liners
Here are a few examples of common tasks that can be done with Unix pipes.
Reverse complement a sequence (I use that a lot when I need to design primers)
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
Explanation:
The echo command prints the input string (the DNA sequence) to the standard output.
The rev command reads from the standard input and reverses the order of the characters in each line.
The tr command translates the characters in the input string, replacing A with T, C with G, T with A, and G with C. The result is the reverse complement of the input sequence.
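The one-liner only translates uppercase bases. A possible variation, sketched here as a small shell function with the hypothetical name revcomp, also handles lowercase input and leaves any other character (such as N) unchanged:

```shell
# Hypothetical helper: reverse complement of each line read from stdin.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

result=$(echo 'ATTGCTATGCTNNNT' | revcomp)
echo "$result"    # prints ANNNAGCATAGCAAT

lower=$(echo 'aacg' | revcomp)
echo "$lower"     # prints cgtt
```

Because both rev and tr work line by line on standard input, the same function can be applied to a file with one sequence per line, for example revcomp < sequences.txt (a made-up file name).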
Get the sequence length distribution from a FASTQ file
zcat reads.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
Explanation:
The zcat command decompresses the reads.fastq.gz file and writes the uncompressed data to the standard output.
The awk command processes the input from zcat one line at a time.
The NR variable holds the current line number of the input. Each FASTQ record spans four lines, so the NR%4 == 2 condition is true for the second line of every record, which is the sequence line.
The lengths array counts the number of sequences of each length: for each sequence line, awk increments the entry for that line's length. Finally, the END block prints each length and its count once all input has been read.
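To check the one-liner on known data, the sketch below writes a tiny, made-up FASTQ file with three reads and runs the same awk program on it. Two details differ from the original and are choices made for this example: gzip -dc is used in place of zcat (the two are equivalent for .gz files, and zcat on some systems expects a .Z suffix), and the result is piped through sort -n, because awk visits array keys in no particular order.

```shell
workdir=$(mktemp -d)

# Made-up FASTQ with three reads of lengths 4, 6 and 4.
printf '%s\n' \
    '@read1' 'ACGT'   '+' 'IIII' \
    '@read2' 'ACGTAC' '+' 'IIIIII' \
    '@read3' 'TTGA'   '+' 'IIII' | gzip -c > "$workdir/reads.fastq.gz"

# Count sequence lengths, then sort numerically by length.
dist=$(gzip -dc "$workdir/reads.fastq.gz" |
    awk 'NR % 4 == 2 { lengths[length($0)]++ }
         END { for (l in lengths) print l, lengths[l] }' |
    sort -n)
printf '%s\n' "$dist"
```

Each output line holds a read length followed by the number of reads of that length, here "4 2" and "6 1".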
Subsample from fastq files
The seqtk sample command can be used to subsample reads from a fastq file. By specifying the same seed (-s100) for both files of a paired-end run, we ensure that the same reads are selected from each file, so the two outputs stay in sync. The seqtk tool can be found at https://github.com/lh3/seqtk.
To subsample 1045174 reads from reads_1.fastq and save the output to reads-ss_1.fastq.gz, we can use the following one-liner:
seqtk sample -s100 reads_1.fastq 1045174 | gzip -c > reads-ss_1.fastq.gz
To subsample 1045174 reads from reads_2.fastq and save the output to reads-ss_2.fastq.gz, we can use the following one-liner:
seqtk sample -s100 reads_2.fastq 1045174 | gzip -c > reads-ss_2.fastq.gz
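One way to confirm that two subsampled files are still in sync is to compare their read names in order. The sketch below does this on two tiny, made-up FASTQ files that stand in for the uncompressed seqtk output. It assumes that mates share a name and differ only by a trailing /1 or /2; other header styles would need a different clean-up step in the hypothetical names helper.

```shell
workdir=$(mktemp -d)

# Made-up paired files; real subsampled files would take their place.
printf '%s\n' '@r1/1' 'ACGT' '+' 'IIII' '@r2/1' 'GGCA' '+' 'IIII' > "$workdir/reads-ss_1.fastq"
printf '%s\n' '@r1/2' 'TTAA' '+' 'IIII' '@r2/2' 'CCGT' '+' 'IIII' > "$workdir/reads-ss_2.fastq"

# Hypothetical helper: print read names without the leading @ and the
# /1 or /2 mate suffix (an assumption about the header format).
names() {
    awk 'NR % 4 == 1 { sub(/^@/, ""); sub(/\/[12]$/, ""); print }' "$1"
}

names "$workdir/reads-ss_1.fastq" > "$workdir/names_1.txt"
names "$workdir/reads-ss_2.fastq" > "$workdir/names_2.txt"

# cmp -s is silent and only reports through its exit status.
if cmp -s "$workdir/names_1.txt" "$workdir/names_2.txt"; then
    status="in sync"
else
    status="out of sync"
fi
echo "paired files are $status"
```

For gzipped files, the names helper could read from gzip -dc instead of opening the file directly.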