Fast approximate string matching with large edit distances in Big Data

1000x faster

1 million times faster spelling correction for edit distance 3
After my blog post 1000x times faster spelling correction got more than 50.000 views I revisited both algorithm and implementation to see if it could be further improved.

While the basic idea of Symmetric Delete spelling correction algorithm remains unchanged the implementation has been significantly improved to unleash the full potential of the algorithm.

This results in a 10 times faster spelling correction and 5 times faster dictionary generation and 2…7 times less memory consumption in v3.0 compared to v1.6 .

Compared to Peter Norvig’s algorithm it is now 1,000,000 times faster for edit distance=3 and 10,000 times faster for edit distance=2.

In Norvig’s tests 76% of spelling errors had an edit distance 1. 98.9% of spelling errors got covered with edit distance 2. For simple spelling correction of natural language with edit distance 2 the accuracy is good enough and the performance Norvig’s algorithm is sufficient.

The speed of our algorithm enables edit distance 3 for spell checking and thus improves the accuracy by 1%. Beyond the accuracy improvement the speed advantage of our algorithm is useful for automatic spelling correction in large corpora as well as in search engines, where many requests in parallel need to be processed.

Billion times faster approximate string matching for edit distance > 4
But the true potential of the algorithm lies in edit distances > 3 and beyond spell checking.

The many orders of magnitude faster algorithm opens up new application fields for approximate string matching and a scaling sufficient for big data and real-time. Our algorithm enables fast approximate string and pattern matching with long strings or feature vectors, huge alphabets, large edit distances, in very large data bases, with many concurrent processes and real time requirements.

Application fields:

  • Spelling correction in search engines, with many parallel requests
  • Automatic Spelling correction in large corpora
  • Genome data analysis,
  • Matching DNA sequences
  • Browser fingerprint analysis
  • Realtime Image recognition (search by image, autonomous cars, medicine)
  • Face recognition
  • Iris recognition
  • Speech recognition
  • Voice recognition
  • Feature recognition
  • Fingerprint identification
  • Signature Recognition
  • Plagiarism detection (in music /in text)
  • Optical character recognition
  • Audio fingerprinting
  • Fraud detection
  • Address deduplication
  • Misspelled names recognition
  • Spectroscopy based chemical and biological material identification
  • File revisioning
  • Spam detection
  • Similarity search,
  • Similarity matching
  • Approximate string matching,
  • Fuzzy string matching,
  • Fuzzy string comparison,
  • Fuzzy string search,
  • Pattern matching,
  • Data cleaning
  • and many more

Edit distance metrics
While we are using the Damerau-Levenshtein distance for spelling correction for other applications it could be easily exchanged with the Levenshtein distance or similar other edit distances by simply modifying the respective function.

In our algorithm the speed of the edit distance calculation has only a very small influence on the overall lookup speed. That’s why we are using only a basic implementation rather than a more sophisticated variant.

Benchmark
Because of all the applications for approximate string matching beyond spell check we extended the benchmark to lookups with higher edit distances. That’s where the power of the symmetric delete algorithm truly shines and excels other solutions. With previous spell checking algorithms the required time explodes with larger edit distances.

Below are the results of a benchmark of our Symmetric Delete algorithm and Peter Norvig’s algorithm for different edit distances, each with 1000 lookups:

input term best correction edit distance maximum edit distance SymSpell
ms per 1000 lookups
Peter Norvig
ms per 1000 lookups
factor
marsupilamimarsupilami no correction* >20 9 568,568,000
marsupilamimarsupilami no correction >20 8 161,275,000
marsupilamimarsupilami no correction >20 7 37,590,000
marsupilamimarsupilami no correction >20 6 5,528,000
marsupilamimarsupilami no correction >20 5 679,000
marsupilamimarsupilami no correction >20 4 46,592
marsupilami no correction >4 4 459
marsupilami no correction >4 3 159 159,421,000 1:1,000,000
marsupilami no correction >4 2 31 257,597 1:8,310
marsupilami no correction >4 1 4 359 1:90
hzjuwyzacamodation accomodation 10 10 7,598,000
otuwyzacamodation accomodation 9 9 1,727,000
tuwyzacamodation accomodation 8 8 316,023
uwyzacamodation accomodation 7 7 78,647
wyzacamodation accomodation 6 6 19,599
yzacamodation accomodation 5 5 2,963
zacamodation accomodation 4 4 727
acamodation accomodation 3 3 180 173,232,000 1:962,000
acomodation accomodation 2 2 33 397,271 1:12,038
hous hous 1 1 24 161 1:7
house house 0 1 1 3 1:3

*Correct or unknown word, which is not in the dictionary and there are also no suggestions within an edit distance of <=maximum edit distance. This is a quite common case (e.g. rare words, new words, domain specific words, foreign words, names), in applications beyond spelling correction (e.g. fingerprint recognition) it might be the default case.

For the benchmark we used the C# implementation of our SymSpell as well as a faithful C# port from Lorenzo Stoakes of Peter Norvig’s algorithm, which has been extended to support edit distance 3. The use of C# implementations for both cases allows to focus solely on the algorithm and should exclude language specific bias.

Dictionary corpus:
The English text corpus used to generate the dictionary used in the above benchmarks has a size 6.18 MByte, 1,105,286 terms, 29,157 unique terms, longest term with 18 characters.
The dictionary size and the number of indexed terms have almost no influence on the average lookup time of o(1).

Speed gain
The speed advantage grows exponentially with the edit distance:

  • For an edit distance=1 it’s 1 order of magnitude faster,
  • for an edit distance=2 it’s 4 orders of magnitude faster,
  • for an edit distance=3 it’s 6 orders of magnitude faster.
  • for an edit distance=4 it’s 8 orders of magnitude faster.

Computational complexity and findings from benchmark
Our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size (but depending on the average term length and maximum edit distance), because our index is based on a Hash Table which has an average search time complexity of O(1).

Precalculation cost
In our algorithm we need auxiliary dictionary entries with precalculated deletes and their suggestions. While the number of the auxiliary entries is significant compared to the 29,157 original entries the dictionary size grows only sub-linear with edit distance: log(ed)

maximum edit distance number of dictionary entries (including precalculated deletes)
20 11,715,602
15 11,715,602
10 11,639,067
9 11,433,097
8 10,952,582
7 10,012,557
6 8,471,873
5 6,389,913
4 4,116,771
3 2,151,998
2 848,496
1 223,134

The precalculation costs consist of additional memory usage and creation time for the auxiliary delete entries in the dictionary:

cost maximum edit distance SymSpell Peter Norvig factor
memory usage 1 32 MB 229 MB 1:7.2
memory usage 2 87 MB 229 MB 1:2.6
memory usage 3 187 MB 230 MB 1:1.2
dictionary creation time 1 3341 ms 3640 ms 1:1.1
dictionary creation time 2 4293 ms 3566 ms 1:0.8
dictionary creation time 3 7962 ms 3530 ms 1:0.4

Due to an efficient implementation those costs are negligible for edit distances <=3:

  • 7 times less memory requirement and a similar dictionary creation time (ed=1).
  • 2 times less memory requirement and a similar dictionary creation time (ed=2).
  • similar memory requirement and a 2 times higher dictionary creation time (ed=3).

Source code
The C# implementation of our Symmetric Delete Spelling Correction algorithm is released on GitHub as Open Source under the GNU Lesser General Public License (LGPL).

C# (original)
https://github.com/wolfgarbe/symspell

Ports
The following third party ports to other programming languages have not been tested by myself whether they are an exact port, error free, provide identical results or are as fast as the original algorithm:

C++ (third party port)
https://github.com/erhanbaris/SymSpellPlusPlus

Go (third party port)
https://github.com/heartszhang/symspell
https://github.com/sajari/fuzzy

Java (third party port)
https://github.com/gpranav88/symspell

Javascript (third party port)
https://github.com/itslenny/SymSpell.js
https://github.com/dongyuwei/SymSpell
https://github.com/IceCreamYou/SymSpell
https://github.com/Yomguithereal/mnemonist/blob/master/symspell.js

Python (third party port)
https://github.com/ppgmg/spark-n-spell-1/blob/master/symspell_python.py

Ruby (third party port)
https://github.com/PhilT/symspell

Swift (third party port)
https://github.com/Archivus/SymSpell

Comparison to other approaches and common misconceptions

A Trie as standalone spelling correction
Why don’t you use a Trie instead of your algorithm?
Tries have a comparable search performance to our approach. But a Trie is a prefix tree, which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but not applicable for spell checking. If your typing error is e.g. in the first letter, than you have no common prefix, hence the Trie will not work for spelling correction.

A Trie as replacement for the hash table
Why don’t you use a Trie for the dictionary instead of the hash table?
Of course you could replace the hash table with a Trie (that is just a arbitrary lookup component of O(1) speed for a *single* lookup) at the cost of added code complexity, but without performance gain.
A HashTable is slower than a Trie only if there are collisions, which are unlikely in our case. For a maximum edit distance of 2 and an average word length of 5 and 100,000 dictionary entries we need to additionally store (and hash) 1,500,000 deletes. With a 32 bit hash (4,294,967,296 possible distinct hashes) the collision probability seems negligible.
With a good hash function even a similarity of terms (locality) should not lead to increased collisions, if not especially desired e.g. with Locality sensitive hashing.

BK-Trees
Would be BK-Trees an alternative option?
Yes, but BK-Trees have a search time of O(log dictionary_size), whereas our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size.

Ternary search tree
Why don’t you use a ternary search tree?
The lookup time in a Ternary Search Tree is O(log n), while it is only 0(1) in our solution. Also, while a Ternary Search Tree could be used for the dictionary lookup instead of a hash table, it doesn’t address the spelling error candidate generation. And the tremendous reduction of the number of spelling error candidates to be looked-up in the dictionary is the true innovation of our Symmetric Delete Spelling Correction algorithm.

Precalculation
Does the speed advantage simply comes from precalulation of candidates?
No! The speed is a result of the combination of all three components outlined below:

  • Pre-calculation, i.e. the generation of possible spelling error variants (deletes only) and storing them at index time is just the first precondition.
  • A fast index access at search time by using a hash table with an average search time complexity of O(1) is the second precondition.
  • But only our Symmetric Delete Spelling Correction on top of this allows to bring this O(1) speed to spell checking, because it allows a tremendous reduction of the number of spelling error candidates to be pre-calculated (generated and indexed).
  • Applying pre-calculation to Norvig’s approach would not be feasible because pre-calculating all possible delete + transpose + replace + insert candidates of all terms would result in a huge time and space consumption.

Correction vs. Completion
How can I add auto completion similar to Google’s Autocompletion?
There is a difference between correction and suggestion/completion!

Correction: Find the correct word for a word which contains errors. Missing letters/errors can be on start/middle/end of the word. We can find only words equal/below the maximum edit distance, as the computational complexity is dependent from the edit distance.

Suggestion/completion: Find the complete word for an already typed substring (prefix!). Missing letters can be only at the end of the word. We can find words/word combinations of any length, as the computational complexity is independent from edit distance and word length.

The code above implements only correction, but not suggestion/completion!
It still finds suggestions/completions equal/below the maximum edit distance, i.e. it starts to show words only if there are <= 2 letters missing (for maximum edit distance=2). Nevertheless the code can be extended to handle both correction and suggestion/completion. During the process of dictionary creation you have to add also all substrings (prefixes only!) of a word to the dictionary, when you are adding a new word to the dictionary. All substring entries of a specific term then have to contain a link to the complete term. Alternatively, for suggestion/completion you could use a completely different algorithm/structure like a Trie, which inherently lists all complete words for a given prefix.