Implementation of Suffix Arrays

A suffix array is a sorted list of all the suffixes in a string. Consider the string “ball”, its suffixes are as follows:

  • ball (suffix starting from index 0)
  • all (suffix starting from index 1)
  • ll (index 2)
  • l (index 3)

After sorting it in lexicographical order, we will get the suffix array, i.e.,

  • all (index 1)
  • ball (index 0 )
  • l (index 3)
  • ll (index 2)

First approach

But how do we write a program that can do this for us? One straightforward approach is to generate all the suffixes first and then sort them. It is quite simple to code. However, it is not efficient. Its time complexity is O(n2logn), i.e., O(n) to generate the suffixes and O(nlogn) to sort them, where n is the length of the string.  

def naiveSuffixArray(str):
  sa = [] #suffix array
  n = len(str) #length of array
  #generate suffixes
  for i in range(0, n):
    sa.append(str[i:n])
  #sorting them
  sa = sorted(sa)
  return sa

str = "banana"
SA = naiveSuffixArray(str)
print(SA)
[‘a’, ‘ana’, ‘anana’, ‘banana’, ‘na’, ‘nana’]

A better and faster approach

Let’s now talk about a better approach with O’s time complexity (n(logn)2). This approach works by sorting the first K occurrences in suffixes, where K is a power of 2 and less than 2×n. For example, it sorts the suffixes based on the first two characters, then four, etc. It does that by keep ranks and next ranks of the suffixes.

To better understand how this algorithm works, let’s apply it on the string “ball” to generate its suffix array.

First, sort the suffixes according to the first two characters by assigning rank and next rank to each suffix. Usually, the rank is obtained using str[i]-"a", where str[i] denotes the starting character of each suffix. The next rank is equal to str[i+1]-"a", where str[i+1] is the second character in the suffix. If it does not exist, the next rank becomes -1.

Index Suffix (rank, next rank)
0 ball (1, 0)
1 all (0, 11)
2 ll (11, 11)
3 l (11, -1)

After sorting it by the rank and next rank, we get:

Index Suffix (rank, next rank)
1 all (0, 11)
0 ball (1, 0)
3 l (11, -1)
2 ll (11, 11)

Now let’s sort the suffixes according to the first four characters. Here, the ranks and next ranks get reassigned. To do that, assign 0 to the rank of the first suffix in the list, then Iterate through the remaining suffixes and check if the rank and the next rank of the previous suffix and the current suffix are equal. If they are, assign the same rank to the current suffix as the previous one. Otherwise, increment the previous rank by 1.

For the next rank’s value, extract the suffix from the i+2 position, where i is the first character in the suffix. Find that suffix in the list and place its rank value as the next rank of the current suffix. If i+1 is out of range, then set -1 as the next rank.

Index Suffix (rank, next_rank)
1 all (0, 2)
0 ball (1, 3)
3 l (2, -1)
2 ll (3, -1)

As you can observe, the list won’t change after sorting, and since we have considered all the characters, the above-given list is the resulting suffix array.

Index Suffix
1 all
0 ball
3 l
2 ll

Implementation

def suffixArray(str):
  n = len(str) 
  #sorting by first two characters
  #get the index, rank and next rank for each suffix
  suffixes = [ {"index":i, "rank": ord(str[i])-ord('a'), "next_rank":  (ord(str[i+1])-ord('a')) if (i!=n-1) else -1 } for i in range(0, n)] #array of dictionary

  #sort the suffixes
  suffixes = sorted(suffixes, key = lambda i: (i['rank'], i['next_rank']))
  #create an array to keep track of indices of the suffixes array
  index = [0]*n
  #sort for k=4, 8, ... characters
  k=4
  while k<(2*n):
    rank=0 #rank 0 to the suffix 0
    prev_rank = suffixes[0]['rank'] #saving the pervious rank of the first suffix
    suffixes[0]['rank']  = rank  
    #assigning new ranks to suffixes
    for i  in range(1, n):
      if (suffixes[i]['rank']==prev_rank and suffixes[i]['next_rank']==suffixes[i-1]['next_rank']):
        prev_rank = suffixes[i]['rank']
        suffixes[i]['rank'] = rank
      else:
        prev_rank = suffixes[i]['rank']
        rank = rank + 1
        suffixes[i]['rank'] = rank
    #assigning next ranks
    for i in range(0,n):
      next_index = int(suffixes[i]['index']+k/2)
      index =  [ind for ind, suffix in enumerate(suffixes) if suffix["index"] == next_index]
      suffixes[i]['next_rank'] = suffixes[index[0]]['rank'] if  index else -1
    #sort the suffixes
    suffixes = sorted(suffixes, key = lambda i: (i['rank'], i['next_rank']))
    k = 2*k
  return suffixes
  


#testing 
str = "ball"
suffixes = suffixArray(str)
SA = []
SA_index = []
for i in range(0, len(suffixes)):
  SA_index.append(suffixes[i]['index'])
  SA.append(str[suffixes[i]['index']:])

print(SA_index)
print(SA)

Output

[1, 0, 3, 2] [‘all’, ‘ball’, ‘l’, ‘ll’]

 

Back to: Data Structure > Data Structure - Arrays

Leave a Reply

Your email address will not be published. Required fields are marked *