top of page

How to quickly find circular sequence matches



In certain cases it is desirable to quickly search circular sequences for complete matches (e.g. with a hash function/dictionary). The problem is that a circular sequence can exist in different forms in a database because it can be "rotated". For example, if we have a circular DNA sequence of ACGT, we can store it as ACGT, but also as CGTA, GTAC or TACG. So if the sequence is n in length, it can be rotated n times and there are up to n different forms of the same sequence. A solution could be to rotate each sequence n times and find this sequence for each rotation. But this is not efficient and can take a long time.


An easier way is to determine the lexicographically smallest rotation. Lexicographic simply means in alphabetical order (from A-Z). For example, both ACGT and GTAC are rotations of the same sequence. The A comes before G (and any other character in the sequence) in the alphabet, so that is the lexicographically smallest rotation ACGT and ACGT.


Another example is the lexicographically smallest rotation of GAATAA and AAGAAT. The one-letter prefixes of that sequence are A, G and T. However, there are many A's and no unambiguous minimum rotation. Then if we look at the two-letter prefixes, in lexicographical order, they are AA, AG, AT, GA, and TA, but there are still two instances of AA, so we need to consider even longer prefixes. The lexicographically smallest three-letter prefix is AAG and there is only one instance of it, so that is the beginning of the canonical rotation (AAGAAT and AAGAAT).


Fortunately, there is a linear-running algorithm to calculate that rotation, namely Booth's algorithm. Below a short Python example and more information can be found on wiki (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation)


def least_rotation(s):
    # Get the length of the input string
    n = len(s)
    
    # Initialize the failure array with -1
    f = [-1] * (2 * n)
    
    # Initialize the starting position
    k = 0
    
    # Iterate through the string, starting at index 1
    for j in range(1, 2 * n):
        # Calculate the value of i using the failure array
        i = f[j - k - 1]
        
        # While the characters at j % n and (k + i + 1) % n do not match
        while i != -1 and s[j % n] != s[(k + i + 1) % n]:
            # Check if the character at j % n is less than the character at (k + i + 1) % n
            if s[j % n] < s[(k + i + 1) % n]:
                # If it is, update the starting position to j - i - 1
                k = j - i - 1
                
            # Update the value of i using the failure array
            i = f[i]
        
        # If the characters still do not match
        if i == -1 and s[j % n] != s[(k + i + 1) % n]:
            # Check if the character at j % n is less than the character at (k + i + 1) % n
            if s[j % n] < s[(k + i + 1) % n]:
                # If it is, update the starting position to j
                k = j
                
            # Set the value of the failure array at j - k to -1
            f[j - k] = -1
        else:
            # Otherwise, set the value of the failure array at j - k to i + 1
            f[j - k] = i + 1
    
    # Construct the rotated string by concatenating the suffix of the string starting at index k with the prefix of the string up to index k
    rotated_s = s[k:] + s[:k]
    
    # Return the lexicographically minimal rotated string
    return rotated_s



Comentarios


bottom of page