In this post we will look into a data format that most sequencing facilities return to you after sequencing your samples. These files contain all the necessary information for your downstream analysis and understanding its contents will help you determine how to best proceed. This post assumes you have Linux installed and are familiar with the command line. If not, have a look at the "Linux and Bioinformatics" post to get you started.
After you handed in your DNA or RNA sample, most sequencing facilities will provide you with your high-throughput sequencing data in a format called FASTQ. The filenames often contain the extension ".fastq" or ".fq" frequently followed by a ".gz" at the end indicating that the FASTQ file was compressed using GNU zip (an open source file compression program) or gzip for short.
In the scientific community these files are frequently shared on the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/view/SRR17176235?show=reads) and NCBI Sequence Read Archive (SRA; https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR17176235) and can be downloaded as SRA file or directly as FASTQ files. If you already have FASTQ files lying around, you can use those. If not you can download a FASTQ test set using the commands below.
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR171/035/SRR17176235/SRR17176235_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR171/035/SRR17176235/SRR17176235_2.fastq.gz
After downloading these files we can use a program called zless (short for GNU zip less) to have a look inside these FASTQ files.
zless SRR17176235_1.fastq.gz
As you can see, the FASTQ file format is a simple text based format based on an earlier format called FASTA which was extended to include base quality metrics from sequencing machines.
The FASTQ format usually contains 4 lines for each sequence read output by the sequencing workflow.
Line 1 - sequence identifier
The first line of begins with the '@' character and is followed by a sequence identifier (ID) and an optional description. This line usually contains information specific for the sequencing technology used. In the case of Illumina reads, the default header contains the instrument used, the flowcell ID, lane, etc.
If the file was submitted to the NCBI SRA the header was stripped and replaced with a new identifier (e.g. SRR17176235.1). The NCBI SRA stores NGS information, not format. This hasn't always been the case, but due to the use of non standard sequence identifiers and descriptions it was decided to replace them all together with an unique SRA identifier.
In case you have FASTQ files containing the default Illumina sequence identifiers you can get some useful information from them. Below is an example of an Illumina Casava 1.8 formatted FASTQ file.
In the example above, we can see that the sequence ID starts with a letter D followed by 5 numbers. If we lookup this number in the table with Illumina instrument IDs below, we see that this ID belongs to a HiSeq 2500 sequencer. It is known that a HiSeq 2500 sequencer tends to have an increase of the error rates towards the end of the reads. In this case it would make sense to trim (remove) the low quality ends of the reads using a sequence trimming program. More on this in a latter post. If however the identifier would show that a NovaSeq was used, because of the slightly different error profile, it is better to discard reads when a certain fraction of bases with low quality score are identified.
Some sequencing facilities remove the instrument ID from the sequence identifier line. In the case the flow cell ID is still given, it is possible to use that to identify the sequencer used (see table below). In the example above the flow cell ID is H77FVBCXY and that corresponds to a Rapid Run (2-lane) v2 flow cell used in a HiSeq 2500 sequencer.
Line 2 - Sequence
The line following the sequence identifier contains the raw sequence letters. The FASTQ format uses the following standard IUB/IUPAC conventions for nucleic acid sequences as alphabetic characters. Although in practice only A, C, G, T and N are used in FASTQ files.
Line 3 - Quality score identifier
The quality score identifier line starts with a ‘+’ character and marks the end of the sequence. This line can optionally contain the same sequence identifier as found in line 1.
Line 4 - Base quality scores
After the line containing the quality score identifier, the line encoding the quality values for the raw sequence starts. This line contains the same number of symbols as there are letters in the sequence. Each symbol corresponds to an encoded quality score.
In the past two different quality score encodings were used (see table below). These are the Phred+33 and Phred+64 encoding. The only difference between these encodings is the range of symbols used to encode the Phred quality score. Using the Phred+33 encoding, scores are encoded with the symbol '!' (indicating quality score 0) to '~' (indicating quality score 93). While the Phred+64 encoding uses an '@' to encode a 0 quality score all the way up to letter 'i' for score 41. Some of the downstream analysis programs can estimate the quality encoding used by looking for symbols from '!' to '?', indicating a Phred+33 encoding.
These Phred scores represent the likelihood of the base being called wrong. A Phred score of 10 indicates that there is a 1 in 10 chance the base is called incorrectly (e.a. 90% base call accuracy). The higher the Phred score, the higher the base calling accuracy. Below is a table with commonly seen Phred scores.
In the SRR17176235_1 example, the sequences start with an N and an associated quality score of '#'. The symbol '#' encodes a Phred score of 2 meaning that there is a 63% chance the base was called incorrectly. In the gray box below are the formulas needed to convert from and to Phred quality scores.
The mathematical formula to calculate the Phred quality score (Q) is as follows. Q = -10 log10(P), where P is the probability of an incorrect base call. As noted above, since the score is in minus log scale, the higher the Phred score, the more unlikely that the base is called wrong (e.a. lower P). The following formula can be used to convert from the Phred score back to the base-calling error P = 10^-Q/10.
In another post, I will dive deeper into the contents of the FASTQ file and how to remove unwanted sequences, such as adapter sequences and poly G stretched. If you would like to know more about this subject, or you would like to suggest a topic for a future post, just drop a message below.
コメント