SAM (Sequence Alignment/Map) format is a tab-delimited text file that stores information about the alignment of DNA sequences to a reference genome. It has 11 mandatory fields and includes header lines that are denoted with an @ at the start of the line. SAM files are human-readable but can be quite large. A more compact and easier-to-work-with alternative to SAM is the Binary Alignment/Map (BAM) format, which is a binary-compressed version of SAM. Most pipelines use BAM over SAM.
SAM/BAM contains the following columns:
Column | Description |
1 | Read name |
2 | SAM flags a.k.a. Bitwise flag |
3 | Reference name |
4 | Leftmost mapping position |
5 | MAPQ quality score |
6 | CIGAR string |
7 | Name of 2nd read in pair |
8 | Position of 2nd read in pair |
9 | Length of mapping segment |
10 | Sequence of segment |
11 | Phred33 quality score at each position |
SAM flags are found in the second column of a BAM/SAM file and provide information about the mapping of a read. The flag value is an integer that is the sum of several decimal values, each of which corresponds to a specific piece of information about the mapping (see table below). For example, a flag value of 99 (1+2+32+64) for a paired-end mapping dataset means that the read is mapped along with its mate (1 and 2) and that the mate is on the reverse strand (32) and is the first read in the pair (64) e.a. in the correct orientation (FR).
Integer value of SAM flag | Description |
1 | Read is paired |
2 | Read mapped in proper pair |
4 | Read unmapped |
8 | Mate is unmapped |
16 | Read on reverse strand |
32 | Mate on reverse strand |
64 | First read in pair |
128 | Second read in pair |
256 | Not primary alignment |
512 | Alignment fails quality checks |
1024 | PCR or optical duplicate |
2048 | Supplementary alignment |
SAMtools
SAMtools is a suite of programs that are useful for processing mapped reads and for downstream analysis. It has a wide range of functions, including sorting, indexing, and converting between formats. One of the most commonly used functions is samtools view, which allows you to view the contents of a BAM/SAM file in SAM format. You can specify a specific chromosome or subregion to only print alignments in that window, or you can use a BED file to specify multiple regions of interest.
There are several other common tasks that can be performed using SAMtools, including filtering to remove unmapped reads, calculating coverage, and extracting sequences from a BAM/SAM file.
In summary, SAM/BAM format and SAMtools are essential tools for bioinformaticians working with mapped sequence data. They allow for efficient processing and analysis of large datasets, and are commonly used in a wide range of applications.
Filtering
SAMtools has several options for filtering reads in a BAM/SAM file. One option is to remove unmapped reads using the -F 4 flag:
samtools view -b -h -F 4 -o filtered.bam sorted.bam
This creates a new BAM file called "filtered.bam" that includes only mapped reads.
You can also use the -f flag to only include reads that meet certain criteria. For example, to include only properly paired reads, you can use the -f 2 flag:
samtools view -b -h -f 2 -o filtered.bam sorted.bam
There are many other options for filtering reads based on flags, mapping quality, and other criteria. You can find more information in the SAMtools manual (http://www.htslib.org/doc/samtools.html).
Calculating coverage
Coverage and depth are important measures of the quality and reliability of mapped reads. There are several ways to calculate these metrics, and SAMtools provides several useful tools for doing so.
One of these tools is samtools coverage, which calculates coverage for each contig or scaffold in a BAM/SAM file and outputs several summary statistics as a table. To use this tool, simply pass the path to your sorted BAM file:
samtools coverage sorted.bam
You can also specify a specific region or contig using the -r flag:
samtools coverage sorted.bam -r 1
This will only calculate coverage for the specified region or contig.
As a quick way to visualize coverage, you can use the -m option to create a histogram of coverage over a specific region:
samtools coverage -m sorted.bam -r 1:1-1000
This will create a histogram showing the distribution of coverage over the specified region (chromosome 1, positions 1-1000 in this example).
In addition to samtools coverage, there are several other tools and approaches you can use to calculate coverage and depth. These include using bedtools genomecov to calculate coverage for a specific BED file, or using samtools depth to calculate coverage at each base position in a BAM/SAM file.
To calculate per base coverage, you can use the samtools depth command:
samtools depth sorted.bam
This will output a tab-delimited file with three columns: the reference name, the position, and the depth of coverage.
You can also calculate coverage for specific regions by using the -r flag and specifying the region(s) of interest:
samtools depth -r 1:100-200 sorted.bam
This will output the coverage for the region on chromosome 1 from position 100 to 200.
SAMtools stats
SAMtools stats is a useful function that provides some quick summary statistics about your mapping reads. To use this function, simply pass the path to your sorted BAM file:
samtools stats sorted.bam
This will generate a large amount of information, including summary statistics, per-sequence statistics, and coverage statistics. If you just want to see the summary statistics, you can use grep and cut to filter the output:
samtools stats data/file.sorted.bam | grep ^SN | cut -f 2-
This will print only the summary statistics, which can be useful for getting a high-level overview of your mapping data.
Other tricks
There are several other tricks you can use with SAMtools that can be useful in certain situations. One of these is the ability to convert a BAM file to FASTQ or FASTA format. To do this, you can use the samtools fastq or samtools fasta commands:
samtools fastq sorted.bam
This will output the sequences from the BAM file in FASTQ format, which can be useful for subsequent downstream analysis.
You can also use the samtools merge command to combine multiple sorted BAM/SAM files. This can be useful if you have done multiple rounds of mapping and want to combine the data into a single file:
samtools merge file.bam file2.bam ...
By default, the headers will also be merged.
In conclusion, SAMtools is a powerful suite of tools that can be used for a wide range of tasks related to mapping reads and performing downstream analysis. It includes functions for viewing, sorting, filtering, and extracting sequences from BAM/SAM files, as well as tools for calculating coverage and generating summary statistics. These tools can be valuable resources for bioinformaticians working with mapped sequence data. More information can be found in the SAMtools manual found at http://www.htslib.org/doc/samtools.html.
Comments