Approximate String Matching

(or Fuzzy String Matching) is a method to find an approximate match between a pattern and a string.

Approximate matching means to be able to convert a pattern to match a string (or a string to match a pattern) with the minimal number of changes.

An example of this is the match between the pattern “coat” and the string “cost”. There is a single substitution which is substituting ‘a’ with ‘s’.

Levenshtein distance

(or edit distance) is the minimal number of edits required in a string to match another string.

Those edits can be insertion, deletion, or substitution. In a more general case we can consider all edits as substitutions, i.e. deletion can be a substitution with null, and insertion can be a substitution of null with a character. The Levenshtein distance in the previous example is 1.

Online vs Offline approximate string matching

Online means that we can preprocess the pattern before matching, but not the string.

Offline means exactly the opposite, we can preprocess the string before matching to make the search faster for any pattern.

Approximate String Matching Algorithms

Classical Dynamic Programming Algorithm(Needleman & Wunsch)

Like any other dynamic programming we build a data structure (Table in this case) that holds the old results that are interesting to us.

In this table is of size (n+1)x(m+1) where n is the string length, and m is the pattern length.

The following example should make things clear.

Here we are trying to match the string “start” with the pattern “star” as an example.

Our score function returns 0 if the 2 characters match, and -1 if any edit is required.

S 0 -1 -2 -3 -4
T -1 0 -1 -2 -3
A -2 -1 0 -1 -2
R -3 -2 -1 0 -1

Our objective in the table above is to reach the maximum score can be reached in the bottom right corner cell, that is the final score of matching the 2 strings.

According to our scoring function here in this example any score cannot be more than 0.

When filling any cell in the table we write max(T[i-1,j-1]+S(i,j), T[i-1,j]+d, T[I,j-1]+d) where d is the penalty for an edit, and S(i,j) returns d if the characters in i of the string and j of the pattern are different, and the match value(0 in our case) for the matching characters. d is -1 in put case here.

To find the matching in the strings we follow the largest score from the final score place (the bottom right corner) to the start place (the top left corner).

This will give us the matching of

where _ is used instead of Null, which is the removal edit.

This algorithm can be parallelized by filling diagonal cells simultaneously, where none of them depends on the other.

NFA Approach

The original NFA is defined by a (k+1)×(m+1) matrix of binary states (active=match or inactive=nomatch), let the states be called sij . States of the first column, si0, are always active, signifying that the empty string is always matched. Subsequent columns are labeled in order by the characters of the pattern.
The text is scanned once, character by character from start to end. For every new character of text, the entire NFA is updated in parallel according to the following rules, for each state sij , i, j > 0:

match: If s(i−1)j is active and the current text character matches the pattern character of
column i, then sij becomes active.
replace: If s(i−1)(j−1) is active then sij becomes active.
insert: If si(j−1) is active then sij becomes active.
delete: If s(i−1)(j−1) becomes active by any rule, then so does sij .

and similarly for i = 0, j > 0 but with only the first rule. If a state in the last column becomes active, then there is a corresponding match between the text and the pattern.
It turns out it is not necessary to encode and update the entire original NFA; it suffices to consider the (m − k) diagonals that start at some s0i, 0 < i <= m − k and extend downwards to the right

The reason for this is that the lower left triangle of the NFA is always active and the upper right triangle such that if any state there becomes active then there will be at least one match. Ignoring the upper right triangle loses some information about how short a match can be made but this is generally considered to be of no concern at this point (formally we are only addressing the question of wether there is some match within edit-distance k).

This is what happens as the pattern “tit” is matched against the text “tat” while allowing for a maximum edit distance of one (meaning the number of rows in the NFA is two).

The NFA encoded in this way can be updated with a relatively small number of basic integer operations,
such as are almost always well implemented in the instruction set of relevant processors.

CUDA Implementation

An obvious and trivial way to parallelize the string matching task is to simply divide the text into many separate pieces. There are two immediate drawbacks to doing this on a single machine: On one hand the searches have to overlap by the allowed edit distance and on the other hand it seems difficult to handle memory access in an efficient manner. The architecture of the T1 processor from Nvidia and the language CUDA rather suggest a natural way to extend the bit-parallelism of the NFA algorithm.
To understand why, though, we first need to know something about the architecture:

Code on the T1 is executed in blocks of up to 512 threads, this is done virtually in parallel though formally speaking only 16 or 32 operations in a block at a time are ever done strictly in parallel (8 blocks may be processed concurrently on different cores). 16 KB of shared memory is attached to each core in such a way that accessing it (with some care) is essentially no more expensive than an operation on a register. This memory is shared between threads in a block but not between blocks. Further it is a readily supported matter to synchronize between threads in a block but much more complicated to do the same between blocks. In addition to the shared memory there is 4GB of global memory that is relatively slow to access. Though each thread (processor core) operates with individual 32-bit words, on a higher level it is possible to think about the operations of a block as being on single huge words, e.g. of size 512 x 32 = 16384.This is also how we choose to implement the NFA algorithm; each block in CUDA represents an NFA and is responsible for a separate portion of the text. Since the portions have to overlap, it is on one hand desirable to have as few as possible. On the other hand we need at the least 8 blocks to take advantage of all cores in the T1 processor and more blocks still to allow the scheduler its full potential. Experimentally we find that with little dependence on other parameters, a grid size of about 300 is optimal.
Each block reserves 512 32-bit words (uint) shared memory for the NFA (actually padded by an additional uint on each side), 512 words as a buffer for text and 4 x 512 words as a buffer for the compiled pattern (512 words each for the possible characters A, C, G and T). This amounts to about 12 KB of the available 16 KB. Program execution proceeds roughly as follows:

1: (CPU) Get data, parameters and pattern to RAM and GPU global memory.
2: (CPU) Compile the pattern and put it in global memory.
3: (GPU) Execute Kernel per block:

4: Initialize masks and constants.
5: Load compiled pattern global ! shared.
6: while more text do

7: Load text to buffer global ! shared.
8: for each character in text buffer do
9: Subroutine: NFA Update

10: Check for match…
11: end for

12: end while

13: (CPU) Get result from GPU.

Though the loops unfortunately add some overhead in themselves, obviously the most critical part is (if not the memory operations) the subroutine “NFA Update”. An inspection of the .ptx file reveals that this part is implemented with approximately 30 4-cycle instructions and it is difficult to imagine that this could be much reduced unless a completely different  approach was considered. Several considerations were taken wrt. memory latency and it now appears that all read latency is efficiently masked by arithmetic operations. The text is compressed tightly into integers, each of the 4 DNA characters represented by a two bit combination so as to minimize the amount of data that needs to be moved from global to shared  memory. Access to the compiled pattern and the NFA is always done in sequences so that all the 16 memory banks are used and no conflict occurs. Access to the text buffer is always to the same position by all threads so as to enable broadcasting.

Experimental Results

Sequential and parallel implementations of the NFA algorithm were compared using a sample set of
DNA from the fruit fly as obtained from This set is about 160MB in uncompressed
fasta format and contains slightly more than 162 million DNA characters (A, C, G or T). An arbitrary
substring of 1024 characters was selected from the text and the algorithms run for edit distances in the
interval [1, 15]. The result is presented as a graph in the image below (note the difference between the axes for
the curves!).

For k = 15 the GPU was able to find all matches in 8.5 seconds whereas it took the CPU 406 seconds, i.e. 48 times longer.

The notable irregularities in both curves in figure 4 correspond to places where the number of NFA diagonals packed into each 32-bit computer word, changes. Since m is fixed, only k affects the number of words used and for certain intervals of k, this remains constant. E.g. k 2 [8, 9] implies that we pack three diagonals into each word whereas k 2 [10, 15] implies two.

The present implementation assumes that each word contains two or more full diagonals of the NFA, meaning k < 16. The reason for this is that we wish to make the sub routine “Update NFA” as simple as possible. For larger k (as for certain cases of k in the interval [1, 15]) it is possible to pack the diagonals more tightly (use almost every bit of the words) by using a more complicated scheme. Then, however, the sub routine has to be modified slightly; the best way is probably to have different subroutines for several different cases (e.g. for the cases of one diagonal per word, several diagonal per word or several words per diagonal). For very large k this is unlikely to be of much interest as there appear to be faster algorithms known.


This algorithm’s main aim is to minimize memory traffic; since for the recent past and the foreseeable future, offchip memory bandwidth most probably will be the constraining resource.

It’s based on a well known algorithm called Aho-Corasick algorithm.

This algorithm does online approximate string matching using a dictionary not a single pattern.

This algorithms originally works with an NFA representing whether the string is in the pattern dictionary or not, but when it fails it goes into an alternative path for finding how much it matches.

The NFA is actually a tree, and it makes use of common prefixes between patterns.

For example this tree tests for the patterns “testing” and “testcase”

thanks to this property, the number of states in a keyword tree is sub-linear with respect to the total number of characters in the input dictionary.

Now what if the NFA fails?

We need to add fail transitions, which will decrease non-determinism, actually it will remove it if we handled all possible fails.

The following image shows how fail transitions added.

Both NFA and DFA have pros and cons.

The NFA is super efficient in memory, but the search requires a lot of logical steps to be able to find the correct path with the best match.

The DFA is much more efficient in the number of logical steps, the next state transition can be resolved directly by accessing a memory location with a single memory access and minimal runtime, but each state requires a fully populated array of transitions, one transition for each character in the alphabet (with the ASCI character set and a 32-bit address space, we need 1 KB of memory per state). This space inefficiency can unfortunately lead to a severe run-time overhead. In fact, dictionaries with a few hundreds of thousands of keywords will require several Gbytes of memory.

To intuitively understand the pros and cons of each approach, we have evaluated these aspects using three dictionaries: a textual dictionary with approximately 190k keywords, a binary and textual dictionary of the same size, and a smaller textual dictionary that contains the 20k most common keywords of the English dictionary. We distinguish two cases: heavy-hitting, when the input contains a high percentage of words in the dictionary, and the opposite case, low-hitting. We first consider the behavior of the NFA, counting the number of steps that are required per input character and the number of fail steps per character in the following figure

We can see that, on average, for each character we have to search 5 to 6 entries, and we have a fail transition every three characters in low hitting mode. The NFA search becomes more efficient in heavy hitting mode, because it is following the original transitions of the keyword tree, therefore reducing the number of fail transitions and search at each state (we parse more often states that are further away from the root and have fewer valid transitions). To putthis in perspective, to sustain a scanning rate of 1 Gbit/sec, we only have 24 clock cycles per input byte, on a 3 GHz processor. Searching 5 or 6 entries with a naive sequential algorithm would probably require tens of clock cycles. And a main memory request generated by a fail transitions would also require tens of clock cycles in the most aggressive memory design, and hundreds of clock cycles in a commodity processor.

In the figure above we can see that the scanning performance of a DFA of a small dictionary (only 600 keywords) can go from more than 2 Gbits/sec down to 60 Mbits/sec on a state of art Intel Xeon processor, based on the hitting pattern. With heavy hitting, the access pattern is randomized across the DFA, generating a memory traffic that cannot be handled efficiently  by commodity processors. Multi-threading can help, as shown in [19]: using 128 HW threads, the memory latency can be hidden to a point that each thread is able to fully  pipeline its memory requests. Therefore, NFAs and DFAs in their original form are not scalable solutions in both space and speed.

SCAMPI tries to take the advantage of each of the NFA and DFA, and to do that it depends on 3 main ideas:

  1. Dividing the states into 2 groups, 1 contains the most used states and is called the cache, the other groups is for the uncached states. The purpose of this is to access the memory a maximum of once per state transition.
  2. Dividing the uncached states into a number of CAMs, those CAMs should be small rather than 1 large and monolithic CAM.
    CAMs = content-addressable memory. i.e. accessing memory by their data instead of address.
  3. An arsenal on locality-enhancing techniques to allocate these small CAMs in a way that is memory efficient and minimizes the number of main memory accesses.

The cached states have the structure of an NFA, and the uncached states have the structure of a DFA.

When a state fails it makes a transition to a cached state.

This is how it looks after caching.

And here is how it looks after dividing the states into CAMs.

For example, in the specific case in the next figure, the parsing of the string testasting, requires 13 steps, counting the fail transitions, with a string of 10 characters, and only two main memory requests, the transitions from state 4 to 5, and from state 10 to 11. All the other transitions can be resolved with cached information, thanks to the data layout that maps the information of states 5-6 and 11-12-13 in single units of memory transfer (e.g., a cache line or a DMA).


•Onsjö, Aono – 2009 – Online Approximate String Matching with CUDA.
•Petrini, Agarwal, Pasetto – 2009 – SCAMPI: a Scalable CAM-based Algorithm for Multiple Pattern Inspection.