Fast approximate string matching with large edit distances in Big Data



1 million times faster spelling correction for edit distance 3
After my blog post 1000x Faster Spelling Correction reached more than 50,000 views, I revisited both the algorithm and the implementation to see whether they could be further improved.

While the basic idea of Symmetric Delete spelling correction algorithm remains unchanged the implementation has been significantly improved to unleash the full potential of the algorithm.

This results in 10 times faster spelling correction, 5 times faster dictionary generation, and 2 to 7 times lower memory consumption in v3.0 compared to v1.6.

Compared to Peter Norvig’s algorithm it is now 1,000,000 times faster for edit distance=3 and 10,000 times faster for edit distance=2.

In Norvig’s tests, 76% of spelling errors had an edit distance of 1, and 98.9% of spelling errors were covered by edit distance 2. For simple spelling correction of natural language with edit distance 2, the accuracy is good enough and the performance of Norvig’s algorithm is sufficient.

The speed of our algorithm enables edit distance 3 for spell checking and thus improves the accuracy by 1%. Beyond the accuracy improvement, the speed advantage of our algorithm is useful for automatic spelling correction in large corpora as well as in search engines, where many requests need to be processed in parallel.

Billion times faster approximate string matching for edit distance > 4
But the true potential of the algorithm lies in edit distances > 3 and beyond spell checking.

An algorithm that is many orders of magnitude faster opens up new application fields for approximate string matching and provides scaling sufficient for big data and real time. Our algorithm enables fast approximate string and pattern matching with long strings or feature vectors, huge alphabets, large edit distances, in very large databases, with many concurrent processes and real-time requirements.

Application fields:

  • Spelling correction in search engines, with many parallel requests
  • Automatic spelling correction in large corpora
  • Genome data analysis
  • Matching DNA sequences
  • Browser fingerprint analysis
  • Real-time image recognition (search by image, autonomous cars, medicine)
  • Face recognition
  • Iris recognition
  • Speech recognition
  • Voice recognition
  • Feature recognition
  • Fingerprint identification
  • Signature recognition
  • Plagiarism detection (in music/in text)
  • Optical character recognition
  • Audio fingerprinting
  • Fraud detection
  • Address deduplication
  • Misspelled names recognition
  • Spectroscopy-based chemical and biological material identification
  • File revisioning
  • Spam detection
  • Similarity search
  • Similarity matching
  • Approximate string matching
  • Fuzzy string matching
  • Fuzzy string comparison
  • Fuzzy string search
  • Pattern matching
  • Data cleaning
  • and many more

Edit distance metrics
While we are using the Damerau-Levenshtein distance for spelling correction, for other applications it could easily be exchanged with the Levenshtein distance or other similar edit distances by simply modifying the respective function.

In our algorithm the speed of the edit distance calculation has only a very small influence on the overall lookup speed. That’s why we are using only a basic implementation rather than a more sophisticated variant.
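
For illustration, a basic implementation of this kind could look like the following C# sketch of the restricted Damerau-Levenshtein distance (optimal string alignment). This is a simplified sketch, not the exact function from the SymSpell repository:

    // Basic restricted Damerau-Levenshtein (optimal string alignment) distance.
    // Simplified sketch; requires: using System;
    static int DamerauLevenshtein(string a, string b)
    {
        // d[i, j] = edit distance between the first i chars of a and the first j chars of b
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,           // deletion
                    d[i, j - 1] + 1),          // insertion
                    d[i - 1, j - 1] + cost);   // substitution
                // transposition of two adjacent characters
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
            }
        return d[a.Length, b.Length];
    }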

Benchmark
Because of all the applications of approximate string matching beyond spell checking, we extended the benchmark to lookups with higher edit distances. That’s where the power of the Symmetric Delete algorithm truly shines and outperforms other solutions. With previous spell checking algorithms the required time explodes with larger edit distances.

Below are the results of a benchmark of our Symmetric Delete algorithm and Peter Norvig’s algorithm for different edit distances, each with 1000 lookups:

input term             | best correction | edit distance | maximum edit distance | SymSpell ms/1000 lookups | Peter Norvig ms/1000 lookups | factor
marsupilamimarsupilami | no correction*  | >20           | 9                     | 568,568,000              | –                            | –
marsupilamimarsupilami | no correction   | >20           | 8                     | 161,275,000              | –                            | –
marsupilamimarsupilami | no correction   | >20           | 7                     | 37,590,000               | –                            | –
marsupilamimarsupilami | no correction   | >20           | 6                     | 5,528,000                | –                            | –
marsupilamimarsupilami | no correction   | >20           | 5                     | 679,000                  | –                            | –
marsupilamimarsupilami | no correction   | >20           | 4                     | 46,592                   | –                            | –
marsupilami            | no correction   | >4            | 4                     | 459                      | –                            | –
marsupilami            | no correction   | >4            | 3                     | 159                      | 159,421,000                  | 1:1,000,000
marsupilami            | no correction   | >4            | 2                     | 31                       | 257,597                      | 1:8,310
marsupilami            | no correction   | >4            | 1                     | 4                        | 359                          | 1:90
hzjuwyzacamodation     | accomodation    | 10            | 10                    | 7,598,000                | –                            | –
otuwyzacamodation      | accomodation    | 9             | 9                     | 1,727,000                | –                            | –
tuwyzacamodation       | accomodation    | 8             | 8                     | 316,023                  | –                            | –
uwyzacamodation        | accomodation    | 7             | 7                     | 78,647                   | –                            | –
wyzacamodation         | accomodation    | 6             | 6                     | 19,599                   | –                            | –
yzacamodation          | accomodation    | 5             | 5                     | 2,963                    | –                            | –
zacamodation           | accomodation    | 4             | 4                     | 727                      | –                            | –
acamodation            | accomodation    | 3             | 3                     | 180                      | 173,232,000                  | 1:962,000
acomodation            | accomodation    | 2             | 2                     | 33                       | 397,271                      | 1:12,038
hous                   | house           | 1             | 1                     | 24                       | 161                          | 1:7
house                  | house           | 0             | 1                     | 1                        | 3                            | 1:3

* Correct or unknown word, i.e. a word that is not in the dictionary and for which there are also no suggestions within the maximum edit distance. This is quite a common case (e.g. rare words, new words, domain-specific words, foreign words, names); in applications beyond spelling correction (e.g. fingerprint recognition) it might even be the default case.

For the benchmark we used the C# implementation of our SymSpell as well as a faithful C# port by Lorenzo Stoakes of Peter Norvig’s algorithm, which has been extended to support edit distance 3. Using C# implementations for both algorithms allows us to focus solely on the algorithms themselves and should exclude language-specific bias.

Dictionary corpus:
The English text corpus used to generate the dictionary for the above benchmarks has a size of 6.18 MB and contains 1,105,286 terms, 29,157 of them unique; the longest term has 18 characters.
The dictionary size and the number of indexed terms have almost no influence on the average lookup time of O(1).

Speed gain
The speed advantage grows exponentially with the edit distance:

  • For an edit distance of 1 it is 1 order of magnitude faster,
  • for an edit distance of 2 it is 4 orders of magnitude faster,
  • for an edit distance of 3 it is 6 orders of magnitude faster,
  • for an edit distance of 4 it is 8 orders of magnitude faster.

Computational complexity and findings from benchmark
Our algorithm is constant time (O(1)), i.e. independent of the dictionary size (but dependent on the average term length and the maximum edit distance), because our index is based on a hash table, which has an average search time complexity of O(1).

Precalculation cost
In our algorithm we need auxiliary dictionary entries with precalculated deletes and their suggestions. While the number of auxiliary entries is significant compared to the 29,157 original entries, the dictionary size grows only sub-linearly with the edit distance, roughly log(ed).
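
The precalculated deletes can be generated with a simple recursion. Below is an illustrative C# sketch of that step (a simplified sketch, not the exact code from the repository); it produces the auxiliary entries counted in the table that follows:

    // Generate all variants of a term with up to editDistanceMax characters deleted.
    // Each delete becomes an auxiliary dictionary entry pointing back to the original term.
    // Simplified sketch; requires: using System.Collections.Generic;
    static HashSet<string> GenerateDeletes(string term, int editDistanceMax)
    {
        var deletes = new HashSet<string>();
        if (editDistanceMax == 0 || term.Length <= 1) return deletes;
        for (int i = 0; i < term.Length; i++)
        {
            string delete = term.Remove(i, 1);
            if (deletes.Add(delete))                      // recurse only for new variants
                foreach (string d in GenerateDeletes(delete, editDistanceMax - 1))
                    deletes.Add(d);
        }
        return deletes;
    }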

maximum edit distance | number of dictionary entries (including precalculated deletes)
20                    | 11,715,602
15                    | 11,715,602
10                    | 11,639,067
9                     | 11,433,097
8                     | 10,952,582
7                     | 10,012,557
6                     | 8,471,873
5                     | 6,389,913
4                     | 4,116,771
3                     | 2,151,998
2                     | 848,496
1                     | 223,134

The precalculation costs consist of additional memory usage and creation time for the auxiliary delete entries in the dictionary:

cost                     | maximum edit distance | SymSpell | Peter Norvig | factor
memory usage             | 1                     | 32 MB    | 229 MB       | 1:7.2
memory usage             | 2                     | 87 MB    | 229 MB       | 1:2.6
memory usage             | 3                     | 187 MB   | 230 MB       | 1:1.2
dictionary creation time | 1                     | 3341 ms  | 3640 ms      | 1:1.1
dictionary creation time | 2                     | 4293 ms  | 3566 ms      | 1:0.8
dictionary creation time | 3                     | 7962 ms  | 3530 ms      | 1:0.4

Due to an efficient implementation those costs are negligible for edit distances <=3:

  • 7 times less memory requirement and a similar dictionary creation time (ed=1).
  • 2 times less memory requirement and a similar dictionary creation time (ed=2).
  • similar memory requirement and a 2 times higher dictionary creation time (ed=3).

Source code
The C# implementation of our Symmetric Delete Spelling Correction algorithm is released on GitHub as Open Source under the GNU Lesser General Public License (LGPL).

C# (original)
https://github.com/wolfgarbe/symspell

Ports
The following third-party ports to other programming languages have not been tested by me, so I cannot say whether they are exact ports, are error-free, provide identical results, or are as fast as the original algorithm:

C++ (third party port)
https://github.com/erhanbaris/SymSpellPlusPlus

Go (third party port)
https://github.com/heartszhang/symspell
https://github.com/sajari/fuzzy

Java (third party port)
https://github.com/gpranav88/symspell

Javascript (third party port)
https://github.com/itslenny/SymSpell.js
https://github.com/dongyuwei/SymSpell
https://github.com/IceCreamYou/SymSpell
https://github.com/Yomguithereal/mnemonist/blob/master/symspell.js

Python (third party port)
https://github.com/ppgmg/spark-n-spell-1/blob/master/symspell_python.py

Ruby (third party port)
https://github.com/PhilT/symspell

Swift (third party port)
https://github.com/Archivus/SymSpell

Comparison to other approaches and common misconceptions

A Trie as standalone spelling correction
Why don’t you use a Trie instead of your algorithm?
Tries have a search performance comparable to our approach. But a Trie is a prefix tree, which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but not applicable for spell checking. If your typing error is e.g. in the first letter, then you have no common prefix, hence the Trie will not work for spelling correction.

A Trie as replacement for the hash table
Why don’t you use a Trie for the dictionary instead of the hash table?
Of course you could replace the hash table with a Trie (which here is just an arbitrary lookup component with O(1) speed for a *single* lookup) at the cost of added code complexity, but without a performance gain.
A hash table is slower than a Trie only if there are collisions, which are unlikely in our case. For a maximum edit distance of 2, an average word length of 5, and 100,000 dictionary entries, we need to additionally store (and hash) 1,500,000 deletes. With a 32-bit hash (4,294,967,296 possible distinct hashes) the collision probability seems negligible.
With a good hash function, even similarity of terms (locality) should not lead to increased collisions, unless specifically desired, e.g. with locality-sensitive hashing.
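
As a rough back-of-the-envelope check (my own estimate, not a figure from the original benchmark): by the birthday approximation, hashing n = 1,500,000 keys into m = 2^32 slots yields about n^2 / (2m) ≈ 2.25·10^12 / 8.6·10^9 ≈ 260 expected colliding key pairs, i.e. well below 0.1% of the entries, and a hash table resolves such collisions internally anyway.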

BK-Trees
Would BK-trees be an alternative option?
Yes, but BK-trees have a search time of O(log dictionary_size), whereas our algorithm is constant time (O(1)), i.e. independent of the dictionary size.

Ternary search tree
Why don’t you use a ternary search tree?
The lookup time in a ternary search tree is O(log n), while it is only O(1) in our solution. Also, while a ternary search tree could be used for the dictionary lookup instead of a hash table, it doesn’t address the generation of spelling error candidates. And the tremendous reduction of the number of spelling error candidates to be looked up in the dictionary is the true innovation of our Symmetric Delete Spelling Correction algorithm.

Precalculation
Does the speed advantage simply come from precalculation of candidates?
No! The speed is a result of the combination of all three components outlined below:

  • Pre-calculation, i.e. the generation of possible spelling error variants (deletes only) and storing them at index time, is just the first precondition.
  • A fast index access at search time by using a hash table with an average search time complexity of O(1) is the second precondition.
  • But only our Symmetric Delete Spelling Correction on top of this brings that O(1) speed to spell checking, because it allows a tremendous reduction of the number of spelling error candidates to be pre-calculated (generated and indexed); see the sketch after this list.
  • Applying pre-calculation to Norvig’s approach would not be feasible, because pre-calculating all possible delete + transpose + replace + insert candidates of all terms would result in huge time and space consumption.
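
To make the interplay of these components concrete, here is a simplified C# sketch of the lookup side of the Symmetric Delete idea, reusing the GenerateDeletes and DamerauLevenshtein sketches from above. It is illustrative only, not the exact repository code:

    // dictionary maps each original term and each of its precalculated deletes
    // to the dictionary terms it was derived from
    static List<string> LookupCandidates(string input, int editDistanceMax,
        Dictionary<string, List<string>> dictionary)
    {
        var results = new List<string>();
        var variants = new HashSet<string> { input };          // the input itself ...
        foreach (string d in GenerateDeletes(input, editDistanceMax))
            variants.Add(d);                                    // ... plus its deletes

        foreach (string variant in variants)
            if (dictionary.TryGetValue(variant, out List<string> terms))   // O(1) hash lookup
                foreach (string term in terms)
                    // matching deletes only yield candidates; verify with the true edit distance
                    if (DamerauLevenshtein(input, term) <= editDistanceMax && !results.Contains(term))
                        results.Add(term);
        return results;
    }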

Correction vs. Completion
How can I add auto completion similar to Google’s Autocompletion?
There is a difference between correction and suggestion/completion!

Correction: find the correct word for a word which contains errors. Missing letters/errors can be at the start, middle, or end of the word. We can only find words at or below the maximum edit distance, as the computational complexity depends on the edit distance.

Suggestion/completion: find the complete word for an already typed substring (a prefix!). Missing letters can only be at the end of the word. We can find words/word combinations of any length, as the computational complexity is independent of edit distance and word length.

The code above implements only correction, not suggestion/completion!
It still finds suggestions/completions at or below the maximum edit distance, i.e. it starts to show words only if there are <= 2 letters missing (for maximum edit distance=2). Nevertheless, the code can be extended to handle both correction and suggestion/completion. During dictionary creation you then also have to add all substrings (prefixes only!) of a word to the dictionary when you add a new word. All substring entries of a specific term then have to contain a link to the complete term; see the sketch below. Alternatively, for suggestion/completion you could use a completely different algorithm/structure such as a Trie, which inherently lists all complete words for a given prefix.
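
A hypothetical sketch of that prefix-indexing extension in C# (the names AddWordWithPrefixes and prefixIndex are my own for illustration, not part of the released code):

    // When adding a word to the dictionary, also index all of its prefixes and link
    // them back to the complete term, so that an already typed prefix can be completed.
    static void AddWordWithPrefixes(string word, Dictionary<string, List<string>> prefixIndex)
    {
        for (int len = 1; len <= word.Length; len++)
        {
            string prefix = word.Substring(0, len);
            if (!prefixIndex.TryGetValue(prefix, out List<string> terms))
                prefixIndex[prefix] = terms = new List<string>();
            if (!terms.Contains(word)) terms.Add(word);         // link prefix -> complete term
        }
    }

At lookup time the typed prefix is then looked up in prefixIndex (optionally combined with the delete-based correction above) to retrieve the linked complete terms.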


17 thoughts on “Fast approximate string matching with large edit distances in Big Data”

  1. Hi! Thank you very much for your code. I’m working on a personal project to standardize street addresses. I have a dictionary (street name with error as the key, correct street name as the value that I need to return). I can’t get my problem to work with your code. Please could you help me understand what I need to edit in your code?
    Thank you very much!

  2. Look for the line: CreateDictionary(“big.txt”,””);
    The “big.txt” text file should contain the correct street names, one per line. That’s all.
    Our algorithm will then automatically create all possible errors within the maximum edit distance: private static int editDistanceMax=2;
    You can call Correct(word,””); where word should contain the possibly misspelled street name.

  3. I came across your name on the web as I was searching for articles on data cleansing.

    We are involved in a project to audit electricity meters in the greater Ekurhuleni Metropolitan Municipality in Gauteng, South Africa. We do data collection by sending field workers into each house in the municipality to inspect electricity meters.

    The field workers carry tablets that have a mobile application used to open a FORM that they complete when they do meter inspection. When they do inspection, the Field Workers (FWs) open a blank FORM and fill in the address and all other data that must be collected. They also take photos and GPS coordinates of the location where the meter is located.

    At this point in time we have finished phase 1 of the project and we are sitting with 42000 records.

    The problem we are facing is that, of the 42,000 records in our “dirty” Meter Audit database, only 5,000 match the street names correctly. This poses a major challenge, as we cannot export this data into any other system until we get it clean.

    I want to find out if there is any way you can help us clean this data? We do now have a clean addresses database that we acquired only after we finished phase 1. We want to use the clean database to match with our dirty DB and clean them. I can send you the spreadsheet of the “dirty” Meter Audit database as well the correct and clean addresses database.

    Can you help?

    ———————————-
    Best Regards
    Fikile Kentane
    Mobile: +27-81-31-0063
    Office: +27-450-3213
    Email: fikilek@innopen.co.za

  4. Hi! Thank you very much for your solution. Just one question.
    We would like to use this solution to correct “company names”, we have a big list (one per line), for example:

    – a21, Inc.
    – Aaron’s, Inc.
    – Abbott Laboratories
    – AbbVie
    – Abercrombie & Fitch
    – ABM Industries
    – ABS Capital Partners
    etc etc

    as you can see we DO NOT have one word “to check”, but one (or more) words.
    How can we do it ? Thank you so much!

    P.S. We are implementing that code in java (https://github.com/gpranav88/symspell/blob/master/SymSpell.java)

  5. Hi,
    I came across this article when i was searching algorithms related to Fuzzy Search.

    The project I am working on needs to validate the accuracy of the values; because we are using OCR to read the receipts, at times the values are not clear. I wanted to share an example, just wondering if your algorithm will help us with this issue.
    We have a list of products which our client needs to find.
    Example:
    Getorade Orange.
    but when we generate using OCR if the value is not clear it might come as
    Getorae Orang
    So my question is: does your algorithm give us a percentage of accuracy?
    I apologize if I have asked something off topic.

    Thanks.
    Syed.

  6. @Syed
    Our algorithm gives you a measure of similarity (Damerau-Levenshtein edit distance).
    For your input term (or input phrase “Getorae Orang”) it finds (very quickly) all similar terms (or phrases) from the dictionary.
    For every similar term (or phrase) found in the dictionary it gives you the Damerau-Levenshtein edit distance http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance to your input term (look for suggestion.distance in the source code).
    The edit distance describes how many characters have been added, deleted, altered or transposed between the input term and the dictionary term.

    In line 35..38 you may specify whether you want only the best match or all matches within a certain edit distance (number of character operations difference):
    35 private static int verbose = 0;
    36 //0: top suggestion
    37 //1: all suggestions of smallest edit distance
    38 //2: all suggestions <= editDistanceMax (slower, no early termination)

    If your product list contains phrases (with white space) you may add those phrases to the dictionary with CreateDictionaryEntry("Getorade Orange", "").

  7. @Damiano
    You may add company names (containing white space) to the dictionary with CreateDictionaryEntry(“Abbott Laboratories”, “”).

  8. @Fikile
    You may add the correct street names (containing white space) from the clean database to the dictionary with CreateDictionaryEntry(“example street”, “”).
    Then with Correct(“misspelled street”,””); you will get the correct street name from the dictionary.

  9. @Wolf
    Thank you for the reply. I have a doubt in the algorithm which i wanted to ask.

    CreateDictionary(“D:\\ProductList.txt”,””);

    My doubt is: in CreateDictionary we pass the whole product list which we have received, but I did not understand where we pass the string which we have recovered from the receipt so that it can be compared with the dictionary. We have added all the products to a text file and called it as above, and I did not understand where we pass the string that needs to be compared with the dictionary.

    Thank you.

  10. @Syed
    Unfortunately you did not follow my comment and blog post http://blog.faroo.com/2015/09/29/how-to-correct-company-names-street-names-product-names/ .

    You should NOT use CreateDictionary(); . Comment it out!
    Instead use CreateDictionaryEntry("product name", ""); in a loop:

    string line;
    System.IO.StreamReader file = new System.IO.StreamReader("D:\\ProductList.txt", System.Text.Encoding.UTF8);
    while ((line = file.ReadLine()) != null) CreateDictionaryEntry(line, "");
    file.Close();

    Then use Correct("misspelled product name",""); to display the correct product name from the dictionary on the console.

    For individual post-processing use the following code:

    List<suggestItem> suggestions = null;

    //check term in dictionary for closest matches;
    //sort by ascending edit distance, then by descending word frequency
    suggestions = Lookup("misspelled product name", "", editDistanceMax);

    //do whatever you want with suggestion.term and suggestion.distance
    foreach (var suggestion in suggestions)
        Console.WriteLine(suggestion.term + " " + suggestion.distance.ToString());

  11. Is there any way of finding a port of this algorithm in either Swift or Obj-c? The one linked above leads to an empty project (I’m trying to use this for a custom iOS keyboard)

  12. “Edit Distance 3” is not a “large edit distance”. This approach does not scale – there is no way to make this work with, for instance, edit distance 6

  13. @BlueRaja Just have a look at the table with the benchmark results above: a single lookup with edit distance=6 is executed within 19 milliseconds! For that edit distance our algorithm is about 12 orders of magnitude faster than Norvig’s. So what’s your point?

  14. Your speed metric appeared to be a bit confusing.
    Am I correct – for “marsupilamimarsupilami” 568,568,000ms/1000lookups meaning roughly 10min lookup per word?

    Due to the fact I am dealing with Japanese word segmentation, want to mention:
    Space is not a clear indication of a word boundary in any language. Space (like a dot) is just a cue.
    So I can imagine some weird corrections for marsupilamimarsupilami
    Mars up, I lame…
    I am soup…
    Maybe it opposes some common sense, and is not needed at all (your Peter Norvig quote: 99% quality achieved, hooray)
    What do you think of spell correction using your algorithm for languages lacking such obvious cues?

    My algorithm is running at 100000char/second.
    It is O(n) unsupervised (not relying on any cues, such as spaces/punctuation/script changes) lattice building, where n = string length to be parsed.
    I’m thinking of introducing an error margin to it. It means that it will be possible to calculate edit distances and rank them on-line taking additional O(log m) m = word count in a dict. In other words: “on-line any-distance”.
    This can be easily adapted for any language.

    Some other characteristics of my algorithm are: self-index static dictionary containing 365000 japanese words (writings and readings) occupying 1.5 times space of source data (~5Mb), running on js client side (chrome extension).

    My original intention was to bring word segmentation to a client side with linear time complexity algorithm occupying reasonable space. This is done.
    I never planned to deal with spell correction and I don’t see any struggle in this field (99%…)
    Do you think it’s worth an effort to compile some tests and put a demo somewhere?

  15. Your speed metric appeared to be a bit confusing.
    Am I correct – for “marsupilamimarsupilami” 568,568,000ms/1000lookups meaning roughly 10min lookup per word?

    yes, 10 minutes per lookup for marsupilamimarsupilami (string length=22, maximum editdistance=9, no correction in the dictionary),
    1.7 seconds per lookup for tuwyzacamodation (string length=16, maximum editdistance=9, accomodation as best correction in the dictionary)

    Due to the fact I am dealing with Japanese word segmentation, want to mention:
    Space is not a clear indication of a word boundary in any language. Space (like a dot) is just a cue.
    So I can imagine some weird corrections for marsupilamimarsupilami
    Mars up, I lame…
    I am soup…
    Maybe it opposes some common sense, and is not needed at all (your Peter Norvig quote: 99% quality achieved, hooray)
    What do you think of spell correction using your algorithm for languages lacking such obvious cues?

    The SymSpell code (as it is) is intended to work on a word basis. So it is a preprocessing task to transform a text into separate words, using whatever transformation logic you think is appropriate.

    In Languages where words are not separated by white spaces (e.g. CJK languages), you have to do word segmentation first (break a string into separate words), before you can do spelling correction:
    http://blog.faroo.com/2009/06/09/lightweight-chinese-word-segmentation/

    Most word segmentation algorithms are based on dictionaries.
    If you want to word segment a misspelled text, you have to use approximate string matching in the dictionary. This can be efficiently done with Symspell’s Symmetric Delete spelling correction algorithm.

    Also the Symspell code (as it is) does not support compound splitting (i.e. inserting a missing space), but it can be modified to do so:
    see http://blog.faroo.com/2012/06/24/1000x-faster-spelling-correction-source-code-released/#comment-961469

    My algorithm is running at 100000char/second.
    It is O(n) unsupervised (not relying on any cues, such as spaces/punctuation/script changes) lattice building, where n = string length to be parsed.
    I’m thinking of introducing an error margin to it. It means that it will be possible to calculate edit distances and rank them on-line taking additional O(log m) m = word count in a dict. In other words: “on-line any-distance”.
    This can be easily adapted for any language.

    Do you mean spelling correction/approximate string search with arbitrary maximum edit distance?

    Some other characteristics of my algorithm are: self-index static dictionary containing 365000 japanese words (writings and readings) occupying 1.5 times space of source data (~5Mb), running on js client side (chrome extension).

    If the size of the dictionary is an issue (especially in RAM) it can be drastically reduced by using a bloom filter or cuckoo filter.

    My original intention was to bring word segmentation to a client side with linear time complexity algorithm occupying reasonable space. This is done.
    I never planned to deal with spell correction and I don’t see any struggle in this field (99%…)
    Do you think it’s worth an effort to compile some tests and put a demo somewhere?

    Of course it’s worth it, if your algorithm is faster and/or requires less space than SymSpell! The 99% is for spelling correction, but as mentioned in my blog post above there are countless application fields for approximate string search beyond spelling correction. And for Big Data, performance and scaling are key. So go ahead, and please post a link to your benchmark here.
