Approximate String Matching
(or Fuzzy String Matching) is a method to find an approximate match between a pattern and a string.
Approximate matching means to be able to convert a pattern to match a string (or a string to match a pattern) with the minimal number of changes.
An example of this is the match between the pattern “coat” and the string “cost”. There is a single substitution which is substituting ‘a’ with ‘s’.
Levenshtein distance
(or edit distance) is the minimal number of edits required in a string to match another string.
Those edits can be insertion, deletion, or substitution. In a more general case we can consider all edits as substitutions, i.e. deletion can be a substitution with null, and insertion can be a substitution of null with a character. The Levenshtein distance in the previous example is 1.
Online vs Offline approximate string matching
Online means that we can preprocess the pattern before matching, but not the string.
Offline means exactly the opposite, we can preprocess the string before matching to make the search faster for any pattern.
Approximate String Matching Algorithms
Classical Dynamic Programming Algorithm(Needleman & Wunsch)
Like any other dynamic programming we build a data structure (Table in this case) that holds the old results that are interesting to us.
In this table is of size (n+1)x(m+1) where n is the string length, and m is the pattern length.
The following example should make things clear.
Here we are trying to match the string “start” with the pattern “star” as an example.
Our score function returns 0 if the 2 characters match, and -1 if any edit is required.
| S | T | A | R | T | |
| S | 0 | -1 | -2 | -3 | -4 |
| T | -1 | 0 | -1 | -2 | -3 |
| A | -2 | -1 | 0 | -1 | -2 |
| R | -3 | -2 | -1 | 0 | -1 |
Our objective in the table above is to reach the maximum score can be reached in the bottom right corner cell, that is the final score of matching the 2 strings.
According to our scoring function here in this example any score cannot be more than 0.
When filling any cell in the table we write max(T[i-1,j-1]+S(i,j), T[i-1,j]+d, T[I,j-1]+d) where d is the penalty for an edit, and S(i,j) returns d if the characters in i of the string and j of the pattern are different, and the match value(0 in our case) for the matching characters. d is -1 in put case here.
To find the matching in the strings we follow the largest score from the final score place (the bottom right corner) to the start place (the top left corner).
This will give us the matching of
START
STAR_
where _ is used instead of Null, which is the removal edit.
This algorithm can be parallelized by filling diagonal cells simultaneously, where none of them depends on the other.
NFA Approach
The original NFA is defined by a (k+1)×(m+1) matrix of binary states (active=match or inactive=nomatch), let the states be called sij . States of the first column, si0, are always active, signifying that the empty string is always matched. Subsequent columns are labeled in order by the characters of the pattern.
The text is scanned once, character by character from start to end. For every new character of text, the entire NFA is updated in parallel according to the following rules, for each state sij , i, j > 0:
and similarly for i = 0, j > 0 but with only the first rule. If a state in the last column becomes active, then there is a corresponding match between the text and the pattern.
It turns out it is not necessary to encode and update the entire original NFA; it suffices to consider the (m − k) diagonals that start at some s0i, 0 < i <= m − k and extend downwards to the right
The reason for this is that the lower left triangle of the NFA is always active and the upper right triangle such that if any state there becomes active then there will be at least one match. Ignoring the upper right triangle loses some information about how short a match can be made but this is generally considered to be of no concern at this point (formally we are only addressing the question of wether there is some match within edit-distance k).
This is what happens as the pattern “tit” is matched against the text “tat” while allowing for a maximum edit distance of one (meaning the number of rows in the NFA is two).
The NFA encoded in this way can be updated with a relatively small number of basic integer operations,
such as are almost always well implemented in the instruction set of relevant processors.
CUDA Implementation
An obvious and trivial way to parallelize the string matching task is to simply divide the text into many separate pieces. There are two immediate drawbacks to doing this on a single machine: On one hand the searches have to overlap by the allowed edit distance and on the other hand it seems difficult to handle memory access in an efficient manner. The architecture of the T1 processor from Nvidia and the language CUDA rather suggest a natural way to extend the bit-parallelism of the NFA algorithm.
To understand why, though, we first need to know something about the architecture:
Code on the T1 is executed in blocks of up to 512 threads, this is done virtually in parallel though formally speaking only 16 or 32 operations in a block at a time are ever done strictly in parallel (8 blocks may be processed concurrently on different cores). 16 KB of shared memory is attached to each core in such a way that accessing it (with some care) is essentially no more expensive than an operation on a register. This memory is shared between threads in a block but not between blocks. Further it is a readily supported matter to synchronize between threads in a block but much more complicated to do the same between blocks. In addition to the shared memory there is 4GB of global memory that is relatively slow to access. Though each thread (processor core) operates with individual 32-bit words, on a higher level it is possible to think about the operations of a block as being on single huge words, e.g. of size 512 x 32 = 16384.This is also how we choose to implement the NFA algorithm; each block in CUDA represents an NFA and is responsible for a separate portion of the text. Since the portions have to overlap, it is on one hand desirable to have as few as possible. On the other hand we need at the least 8 blocks to take advantage of all cores in the T1 processor and more blocks still to allow the scheduler its full potential. Experimentally we find that with little dependence on other parameters, a grid size of about 300 is optimal.
Each block reserves 512 32-bit words (uint) shared memory for the NFA (actually padded by an additional uint on each side), 512 words as a buffer for text and 4 x 512 words as a buffer for the compiled pattern (512 words each for the possible characters A, C, G and T). This amounts to about 12 KB of the available 16 KB. Program execution proceeds roughly as follows:
1: (CPU) Get data, parameters and pattern to RAM and GPU global memory.
2: (CPU) Compile the pattern and put it in global memory.
3: (GPU) Execute Kernel per block:
4: Initialize masks and constants.
5: Load compiled pattern global ! shared.
6: while more text do
7: Load text to buffer global ! shared.
8: for each character in text buffer do
9: Subroutine: NFA Update
10: Check for match…
11: end for
12: end while
13: (CPU) Get result from GPU.
Though the loops unfortunately add some overhead in themselves, obviously the most critical part is (if not the memory operations) the subroutine “NFA Update”. An inspection of the .ptx file reveals that this part is implemented with approximately 30 4-cycle instructions and it is difficult to imagine that this could be much reduced unless a completely different approach was considered. Several considerations were taken wrt. memory latency and it now appears that all read latency is efficiently masked by arithmetic operations. The text is compressed tightly into integers, each of the 4 DNA characters represented by a two bit combination so as to minimize the amount of data that needs to be moved from global to shared memory. Access to the compiled pattern and the NFA is always done in sequences so that all the 16 memory banks are used and no conflict occurs. Access to the text buffer is always to the same position by all threads so as to enable broadcasting.
Experimental Results
Sequential and parallel implementations of the NFA algorithm were compared using a sample set of
DNA from the fruit fly as obtained from www.fruitfly.org. This set is about 160MB in uncompressed
fasta format and contains slightly more than 162 million DNA characters (A, C, G or T). An arbitrary
substring of 1024 characters was selected from the text and the algorithms run for edit distances in the
interval [1, 15]. The result is presented as a graph in the image below (note the difference between the axes for
the curves!).
For k = 15 the GPU was able to find all matches in 8.5 seconds whereas it took the CPU 406 seconds, i.e. 48 times longer.
The notable irregularities in both curves in figure 4 correspond to places where the number of NFA diagonals packed into each 32-bit computer word, changes. Since m is fixed, only k affects the number of words used and for certain intervals of k, this remains constant. E.g. k 2 [8, 9] implies that we pack three diagonals into each word whereas k 2 [10, 15] implies two.
The present implementation assumes that each word contains two or more full diagonals of the NFA, meaning k < 16. The reason for this is that we wish to make the sub routine “Update NFA” as simple as possible. For larger k (as for certain cases of k in the interval [1, 15]) it is possible to pack the diagonals more tightly (use almost every bit of the words) by using a more complicated scheme. Then, however, the sub routine has to be modified slightly; the best way is probably to have different subroutines for several different cases (e.g. for the cases of one diagonal per word, several diagonal per word or several words per diagonal). For very large k this is unlikely to be of much interest as there appear to be faster algorithms known.
SCAMPI
This algorithm’s main aim is to minimize memory traffic; since for the recent past and the foreseeable future, offchip memory bandwidth most probably will be the constraining resource.
It’s based on a well known algorithm called Aho-Corasick algorithm.
This algorithm does online approximate string matching using a dictionary not a single pattern.
This algorithms originally works with an NFA representing whether the string is in the pattern dictionary or not, but when it fails it goes into an alternative path for finding how much it matches.
The NFA is actually a tree, and it makes use of common prefixes between patterns.
For example this tree tests for the patterns “testing” and “testcase”
thanks to this property, the number of states in a keyword tree is sub-linear with respect to the total number of characters in the input dictionary.
Now what if the NFA fails?
We need to add fail transitions, which will decrease non-determinism, actually it will remove it if we handled all possible fails.
The following image shows how fail transitions added.
Both NFA and DFA have pros and cons.
The NFA is super efficient in memory, but the search requires a lot of logical steps to be able to find the correct path with the best match.
The DFA is much more efficient in the number of logical steps, the next state transition can be resolved directly by accessing a memory location with a single memory access and minimal runtime, but each state requires a fully populated array of transitions, one transition for each character in the alphabet (with the ASCI character set and a 32-bit address space, we need 1 KB of memory per state). This space inefficiency can unfortunately lead to a severe run-time overhead. In fact, dictionaries with a few hundreds of thousands of keywords will require several Gbytes of memory.
To intuitively understand the pros and cons of each approach, we have evaluated these aspects using three dictionaries: a textual dictionary with approximately 190k keywords, a binary and textual dictionary of the same size, and a smaller textual dictionary that contains the 20k most common keywords of the English dictionary. We distinguish two cases: heavy-hitting, when the input contains a high percentage of words in the dictionary, and the opposite case, low-hitting. We first consider the behavior of the NFA, counting the number of steps that are required per input character and the number of fail steps per character in the following figure
We can see that, on average, for each character we have to search 5 to 6 entries, and we have a fail transition every three characters in low hitting mode. The NFA search becomes more efficient in heavy hitting mode, because it is following the original transitions of the keyword tree, therefore reducing the number of fail transitions and search at each state (we parse more often states that are further away from the root and have fewer valid transitions). To putthis in perspective, to sustain a scanning rate of 1 Gbit/sec, we only have 24 clock cycles per input byte, on a 3 GHz processor. Searching 5 or 6 entries with a naive sequential algorithm would probably require tens of clock cycles. And a main memory request generated by a fail transitions would also require tens of clock cycles in the most aggressive memory design, and hundreds of clock cycles in a commodity processor.
In the figure above we can see that the scanning performance of a DFA of a small dictionary (only 600 keywords) can go from more than 2 Gbits/sec down to 60 Mbits/sec on a state of art Intel Xeon processor, based on the hitting pattern. With heavy hitting, the access pattern is randomized across the DFA, generating a memory traffic that cannot be handled efficiently by commodity processors. Multi-threading can help, as shown in [19]: using 128 HW threads, the memory latency can be hidden to a point that each thread is able to fully pipeline its memory requests. Therefore, NFAs and DFAs in their original form are not scalable solutions in both space and speed.
SCAMPI tries to take the advantage of each of the NFA and DFA, and to do that it depends on 3 main ideas:
- Dividing the states into 2 groups, 1 contains the most used states and is called the cache, the other groups is for the uncached states. The purpose of this is to access the memory a maximum of once per state transition.
- Dividing the uncached states into a number of CAMs, those CAMs should be small rather than 1 large and monolithic CAM.
CAMs = content-addressable memory. i.e. accessing memory by their data instead of address. - An arsenal on locality-enhancing techniques to allocate these small CAMs in a way that is memory efficient and minimizes the number of main memory accesses.
The cached states have the structure of an NFA, and the uncached states have the structure of a DFA.
When a state fails it makes a transition to a cached state.
This is how it looks after caching.
And here is how it looks after dividing the states into CAMs.
For example, in the specific case in the next figure, the parsing of the string testasting, requires 13 steps, counting the fail transitions, with a string of 10 characters, and only two main memory requests, the transitions from state 4 to 5, and from state 10 to 11. All the other transitions can be resolved with cached information, thanks to the data layout that maps the information of states 5-6 and 11-12-13 in single units of memory transfer (e.g., a cache line or a DMA).
References












May 15, 2010 at 2:37 pm
Hey Zakria,
I wanted to ask about NFA and DFA? i have no previous information about this terms.
Thanks in advance,
Ismail
May 17, 2010 at 8:16 am
Hey Ismail
Both DFA & NFA are used in knowing whether a string is valid (accepted) or not.
DFA stands for Deterministic Finite Automaton, and is a set of sets and arcs like a graph, but the arcs are directed, and has an input character on it that represents the character just read in a string.
Start at the start state and read a character, and move on the arc that matches the character read.
And when you finish the string you look whether the state you are in is an accept state or not.
NFA stands for Non-deterministic Finite Automaton.
The NFA is the same except that you can at some point backtrack because there is not arc that matches the character in hand. If you fail after all backtracking you fail to match the staring.
Look at those links for more:
http://en.wikipedia.org/wiki/Deterministic_finite-state_machine
http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine
May 21, 2010 at 11:29 am
ah, manta ma2akhadtesh ToC.
May 15, 2010 at 7:53 pm
Hey Zakaria,
I have a question regarding the use of approximate string matching on GPGPUs. From the different algorithms that you have explained, which of them do you think is employed by search engines ? When you search in a search engine for example as you type your are given suggestions. Also when you submit your string you get suggestions whether you actually meant another similar / close string.
Also, if it is more efficient to use GPUs. Why do you think that most of if not all of search engines make use of CPUs. For example Google’s cluster of computers are made of Pentium 3s!
Thanks,
Ahmed Labib
May 17, 2010 at 8:43 am
1st of all, I think search engines use different search algorithms & techniques. However I don’t know any of them.
But I do know that search engines index websites in order to be able to find fast results.
The same indexing idea is used by windows nowadays to find quickly results, and be able to auto-complete search strings.
Concerning the P3 Google machines.
Google uses off the shelf components, which are cheap old components that are efficient if compared their price to performance and features. And there’s no general purpose GPUs nowadays that are off the shelf, it’s a very new idea that’s not even yet deployed. So they can’t use only GPUs, they need CPUs.
May 17, 2010 at 10:29 am
By the way
None of the algorithms mentioned he is used in search engines for searching the web I think.
Since all the algorithms here are online searches, but the pattern can not be pre-processed in search engines since it’s the search string the user enters.
May 18, 2010 at 5:57 am
Hey Ziko,
Regarding the DP algorithm, what is the advantage of the GPGPU in the first algorithm (Approximate String Matching using the classical Dynamic Programming Algorithm) ?
Peace,
Mostafa Magdi
May 19, 2010 at 1:31 pm
The classical DP algorithm was designed far before GPUs, GPGPUs, and any multicore processing units.
However you can take advantage of multicore GPGPUs to compute it ultra fast by letting every core calculate a cell among the cells that can be calculated simultaneously, like diagonal cells that can be calculated using only the previous diagonal, or the 1st row & column.
May 18, 2010 at 1:37 pm
Hi Ziko,
Great blog post on an interesting topic. It is most unfortunate that I couldn’t attend the presentation itself. Had a little question: In the NFA approach, how are we going to find the match string?
Best regards,
Amasis
May 19, 2010 at 1:55 pm
The NFA approach is not intended to find the matching string, it only finds whether there is an accepted string within the given threshold of Levenshtein distance.
Even in the enhanced NFA approach you don’t complete your search once you find the string to be accepted, so you don’t know the Levenshtein distance of the best match.
May 21, 2010 at 11:02 am
Dear Zakariya,
What makes you think the the memory access is the bottle neck in those algorithms?
May 22, 2010 at 8:57 am
Dear Yahia
According to experiments and statistics done and mentioned in the SCAMPI
paper in the references, memory access is the bottleneck for their
algorithm, and they were trying to minimize the number of memory access
times for that reason using cache and others.
And also in the NFA algorithm paper, they were trying to optimize memory
access for the same reason.
And if you look at the delay of a single memory access and compare it to
the delay of an operation on any kind of processor you will notice the
difference.
May 21, 2010 at 11:32 am
Hey Zakariya,
I am sorry I was not there for your presentation. I had one question though, I don’t get what was wrong with the classical DP algorithm? What led to developing more complex algorithms instead of the classical DP?
May 22, 2010 at 9:18 am
Hey Majid
The classical DP algorithm is the simplest among those algorithms, and it does a great job.
However sometimes you have other limitations constraining you, like memory for example.
The DP algorithm space complexity & time complexity are mxn, this can be OK for time complexity in most applications (not all of them), but space complexity is not OK with most applications; since memory access has a huge delay, and therefore you will need to waste a lot of time accessing memory.
And also the complexity of DP algorithm can be catastrophic when searching a dictionary like in SCAMPI.
And the DP time and space complexity can be shrunk if features like finding the best matching string is not required, and also if you only need to find whether the strings match within a threshold like the NFA approach.
The DP algorithm space complexity can be reduced if you don’t need to get the matching string, using a modified one that removes the old results to store the new ones, however you will still have the number of memory accesses.
May 22, 2010 at 11:11 am
Dear Zakria,
I’m sorry but I didn’t get how are we going to get the best match string after finding the score in the Dynamic Programming algorithm.
Regards,
Ahmed Attia.
May 26, 2010 at 5:35 am
Dear Attia
You look for the best score adjacent to the final score, if it’s above it, then you have an insertion, if it’s beside it you have a deletion, if it’s diagonal then you have a substitution.
Then you follow the same from the best score in the bottom right corner till you reach the start cell at the top left corner.
Using those edits you can get the string that matches the pattern.
May 29, 2010 at 10:00 pm
hey Zakaria
hope you are fine. What do you mean by pre-processing the pattern or string in the online\offline matching ?
May 30, 2010 at 1:24 am
Hey Alaa
Pre-processing the pattern\string mean to be able to use it in constructing a kind of data structure before seeing the string\pattern respectively.
For example in the NFA approach, we use the pattern to construct an NFA, and the size of NFA depends on the size of the pattern. Afterwards we can read the string character by character.
Usually the pre-processed part (most probably the pattern) is much shorter than the other part (most probably the string).
July 19, 2011 at 1:39 am
Where can I get the source code for what is described in this post?
July 23, 2011 at 7:38 pm
I did not implement it but I just read papers about those algorithms.
However if you wanna implement this in something like Cuda for NVidia cards or Stream for ATI