Accurately determining which species a microbial strain belongs to is important for the registration of strains for enzyme production, novel foods and agricultural use. In addition, it can aid knowledge discovery by identifying closely related strains.
Strain identification
In the past strain identification was done either classically based on morphology and physiology or using ribosomal RNA gene sequences such as variable regions within the 16S rDNA gene or inter transcribed spacer ( ITS). The resolution (ability to distinguish related species) of those markers is relatively low. This has led the misidentification of many closely related strains, including many medically important species.
At present, strain identification of closely related species is frequently done using a combination of sequences from several housekeeping genes. This technique is called multi locus sequence typing (MLST) and was first developed in 1998 to identify Neisseria meningitidis. Although the resolution of MLST is higher, it is unable to identify related strains which limits it use for epidemiological investigations. Resolution can be further improved by choosing more genes or more variable genes such as in multi-virulence-locus sequence typing. Other options include the use of RFLP (restriction fragment length polymorphism), RAPD (random-amplified polymorphic DNA) and AFLP (amplified fragment length polymorphism).
With the drop in sequence pricing and advent of efficient algorithms, whole genome-genome comparisons have become a viable option for high resolution strain identification. Originally, whole genome-genome comparisons were done by calculating the mean nucleotide identity of orthologous gene pairs (ANI) between two microbial genomes. More recently, however, ANI scores are calculated using the whole genome (gANI) to delineate species-level diversity. A typical threshold for species boundary is 95% gANI.
The gANI score can be calculated using either an alignment based (e.g. blast, nucmer) or hash based algorithm. Hash based methods are much faster and tend to scale better to an increasing number of comparisons. Fastani is a program using a hash based approach to estimate gANI scores and is several orders of magnitude faster than blast.
Fastani for gANI
Fastani uses the following algorithm to estimate the gANI score. First, it cuts the query genome into L-sized regions (3000 bp by default). It then uses mashmap to map the query region to the reference genome. In the background mashmap uses a concept called minimizers (more on this below). Subsequently, it maps the identified region in the reference genome to the query genome and chooses the best bidirectional hit. This helps reduce errors due to genomic repeats. When the reference and query region are identified, fastani estimates the gANI score using minhash and sketches.
Source: https://www.nature.com/articles/s41467-018-07641-9
To better understand what fastani does, we need to dive a bit deeper into the underlying algorithms. To identify the reference position, mashmap (the mapping engine) uses minimizers. Minimizers are an efficient way of summarizing longer sequences while maintaining some information about their contiguity. Typically, fixed size subsequences (a.k.a k-mers) and their reverse complement are selected across a larger sequence and ordered, the smallest subsequence (either lexicographically or by hash value) is the minimizer. For example, if we take the following sequence and make 7 bp long subsequences we first get the sequence ATATATC and its reverse complement. After all these subsequences are identified, sorted in lexicographical order, the subsequece AAAAGTT is the smallest sequence and therefore the minimizer.
Instead of remembering only the smallest subsequence (the minimizer; AAAAGTT), mashmap remembers the S-smallest subsequences known as the sketch. Fastani then uses the bidirectional best hit and associated sketches and calculates the minhash to estimate the average nucleotide identity for that region. The minhash is basically the number of common hash items divided by the total number of hashes. This is the same as the Jaccard similarity coefficient or Jaccard index. As you can see below on the left, the mashmap similarity scores and the Smith-Waterman identity have a high concordance. Fastani then averages the sum of similarity scores over all the bidirectional regions and reports the gANI score.
Source: https://www.nature.com/articles/s41467-018-07641-9
Fastani was validated using 90K prokaryotic species and there is a clear border between intra-species (95%) and inter-species (<83% ) gANI values (see above right). It should be noted that the input genome assemblies (both reference and query) need to have sufficient quality (N50 ≥10 Kbp). No gANI output is reported for a genome pair if the value is much below 80% or there are less than 50 bidirectional hits. So far, fastani has not been validated for eukaryotes and should be used with caution.
Running fastani
Fastani can be downloaded at https://github.com/ParBLiSS/FastANI and once installed it can be run as follows. The query.txt and reference.txt contain lists of all the genomics fasta files to compare and should be identical for all versus all comparisons.
fastANI -t 10 --rl query.txt --ql reference.txt -o fastani.tsv
Fastani can be parallelized for multi-core processors using the -t option. Parallelization is done by splitting the reference genome in several equal-size parts and searching the query genome against each part of the reference database independently.
The output file (fastani.tsv) will contain tab delimited rows with the query genome, reference genome, gANI value, count of bidirectional fragment mappings and total query fragments.
If you would like to know more about this subject, or you would like to suggest a topic for a future post, just drop a message below.
References
Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9
Comments