A suffix array is a sorted list of all the suffixes in a string. Consider the string “ball”, its suffixes are as follows:
- ball (suffix starting from index 0)
- all (suffix starting from index 1)
- ll (index 2)
- l (index 3)
After sorting it in lexicographical order, we will get the suffix array, i.e.,
- all (index 1)
- ball (index 0 )
- l (index 3)
- ll (index 2)
First approach
But how do we write a program that can do this for us? One straightforward approach is to generate all the suffixes first and then sort them. It is quite simple to code. However, it is not efficient. Its time complexity is O(n2logn)
, i.e., O(n) to generate the suffixes and O(nlogn)
to sort them, where n is the length of the string.
def naiveSuffixArray(str): sa = [] #suffix array n = len(str) #length of array #generate suffixes for i in range(0, n): sa.append(str[i:n]) #sorting them sa = sorted(sa) return sa str = "banana" SA = naiveSuffixArray(str) print(SA)
[‘a’, ‘ana’, ‘anana’, ‘banana’, ‘na’, ‘nana’]
A better and faster approach
Let’s now talk about a better approach with O’s time complexity (n(logn)2)
. This approach works by sorting the first K
occurrences in suffixes, where K
is a power of 2 and less than 2×n
. For example, it sorts the suffixes based on the first two characters, then four, etc. It does that by keep ranks and next ranks of the suffixes.
To better understand how this algorithm works, let’s apply it on the string “ball” to generate its suffix array.
First, sort the suffixes according to the first two characters by assigning rank and next rank to each suffix. Usually, the rank is obtained using str[i]-"a"
, where str[i]
denotes the starting character of each suffix. The next rank is equal to str[i+1]-"a"
, where str[i+1]
is the second character in the suffix. If it does not exist, the next rank becomes -1.
Index | Suffix | (rank, next rank) |
0 | ball | (1, 0) |
1 | all | (0, 11) |
2 | ll | (11, 11) |
3 | l | (11, -1) |
After sorting it by the rank and next rank, we get:
Index | Suffix | (rank, next rank) |
1 | all | (0, 11) |
0 | ball | (1, 0) |
3 | l | (11, -1) |
2 | ll | (11, 11) |
Now let’s sort the suffixes according to the first four characters. Here, the ranks and next ranks get reassigned. To do that, assign 0 to the rank of the first suffix in the list, then Iterate through the remaining suffixes and check if the rank and the next rank of the previous suffix and the current suffix are equal. If they are, assign the same rank to the current suffix as the previous one. Otherwise, increment the previous rank by 1.
For the next rank’s value, extract the suffix from the i+2 position, where i is the first character in the suffix. Find that suffix in the list and place its rank value as the next rank of the current suffix. If i+1 is out of range, then set -1 as the next rank.
Index | Suffix | (rank, next_rank) |
1 | all | (0, 2) |
0 | ball | (1, 3) |
3 | l | (2, -1) |
2 | ll | (3, -1) |
As you can observe, the list won’t change after sorting, and since we have considered all the characters, the above-given list is the resulting suffix array.
Index | Suffix |
1 | all |
0 | ball |
3 | l |
2 | ll |
Implementation
def suffixArray(str): n = len(str) #sorting by first two characters #get the index, rank and next rank for each suffix suffixes = [ {"index":i, "rank": ord(str[i])-ord('a'), "next_rank": (ord(str[i+1])-ord('a')) if (i!=n-1) else -1 } for i in range(0, n)] #array of dictionary #sort the suffixes suffixes = sorted(suffixes, key = lambda i: (i['rank'], i['next_rank'])) #create an array to keep track of indices of the suffixes array index = [0]*n #sort for k=4, 8, ... characters k=4 while k<(2*n): rank=0 #rank 0 to the suffix 0 prev_rank = suffixes[0]['rank'] #saving the pervious rank of the first suffix suffixes[0]['rank'] = rank #assigning new ranks to suffixes for i in range(1, n): if (suffixes[i]['rank']==prev_rank and suffixes[i]['next_rank']==suffixes[i-1]['next_rank']): prev_rank = suffixes[i]['rank'] suffixes[i]['rank'] = rank else: prev_rank = suffixes[i]['rank'] rank = rank + 1 suffixes[i]['rank'] = rank #assigning next ranks for i in range(0,n): next_index = int(suffixes[i]['index']+k/2) index = [ind for ind, suffix in enumerate(suffixes) if suffix["index"] == next_index] suffixes[i]['next_rank'] = suffixes[index[0]]['rank'] if index else -1 #sort the suffixes suffixes = sorted(suffixes, key = lambda i: (i['rank'], i['next_rank'])) k = 2*k return suffixes #testing str = "ball" suffixes = suffixArray(str) SA = [] SA_index = [] for i in range(0, len(suffixes)): SA_index.append(suffixes[i]['index']) SA.append(str[suffixes[i]['index']:]) print(SA_index) print(SA)
Output
[1, 0, 3, 2] [‘all’, ‘ball’, ‘l’, ‘ll’]