**Update:** SymSpell C# implementation released as Open Source.

**Update2:** SymSpell 100,000 times faster for edit distance=3.

**Update3:** Spelling correction is now also part of FAROO search.

**Update4:** SymSpell source code now on GitHub.

**Update5:** Improved implementation now 1,000,000 times faster for edit distance=3.

**Update6:** SymSpellCompound: Compound aware automatic spelling correction.

Recently I answered a question on Quora about spelling correction for search engines. When I described our algorithm I was pointed to Peter Norvig’s page where he outlined his approach.

Both algorithms are based on Edit distance (Damerau-Levenshtein distance).

Both try to find the dictionary entries with smallest edit distance from the query term.

If the edit distance is 0 the term is spelled correctly, if the edit distance is <=2 the dictionary term is used as spelling suggestion.
But our way to search the dictionary is different, resulting in a significant performance gain and language independence.
Three ways to search for minimum edit distance in a dictionary:
**1. Naive approach**

The obvious way of doing this is to compute the edit distance from the query term to each dictionary term, before selecting the string(s) of minimum edit distance as spelling suggestion. This exhaustive search is inordinately expensive.

Source: Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze: Introduction to Information Retrieval.

The performance can be significantly improved by terminating the edit distance calculation as soon as a treshold of 2 or 3 has been reached.

**2. Peter Norvig**

Generate all possible terms with an edit distance <=2 *(deletes + transposes + replaces + inserts)* from the query term and search them in the dictionary.

For a word of length n, an alphabet size a, an edit distance d=1, there will be n deletions, n-1 transpositions, a*n alterations, and a*(n+1) insertions, for a total of 2n+2an+a-1 terms at search time.

Source: Peter Norvig: How to Write a Spelling Corrector.

This is much better than the naive approach, but still expensive at search time (114,324 terms for n=9, a=36, d=2) and language dependent (because the alphabet is used to generate the terms, which is different in many languages and huge in Chinese: **a=70,000** Unicode Han characters)

**3. Symmetric Delete Spelling Correction (FAROO)**

Generate terms with an edit distance <=2 *(deletes only)* from each dictionary term and add them together with the original term to the dictionary. This has to be done only once during a pre-calculation step.

Generate terms with an edit distance <=2 *(deletes only)* from the input term and search them in the dictionary.

For a word of length n, an alphabet size of a, an edit distance of 1, there will be just n deletions, for a total of n terms at search time.

This is **three orders of magnitude less expensive** (36 terms for n=9 and d=2) and **language independent** (the alphabet is not required to generate deletes).

The cost of this approach is the pre-calculation time and storage space of x deletes for every original dictionary entry, which is acceptable in most cases.

The number x of deletes for a single dictionary entry depends on the maximum edit distance: x=n for edit distance=1, x=n*(n-1)/2 for edit distance=2, x=n!/d!/(n-d)! for edit distance=d (combinatorics: k out of n combinations without repetitions, and k=n-d),

E.g. for a maximum edit distance of 2 and an average word length of 5 and 100,000 dictionary entries we need to additionally store 1,500,000 deletes.

**Remark 1:** During the precalculation, different words in the dictionary might lead to same delete term: delete(sun,1)==delete(sin,1)==sn.

While we generate only one new dictionary entry (sn), inside we need to store both original terms as spelling correction suggestion (sun,sin)

**Remark 2:** There are four different comparison pair types:

- dictionary entry==input entry,
- delete(dictionary entry,p1)==input entry
- dictionary entry==delete(input entry,p2)
- delete(dictionary entry,p1)==delete(input entry,p2)

The last comparison type is required for replaces and transposes only. But we need to check whether the suggested dictionary term is really a replace or an adjacent transpose of the input term to prevent false positives of higher edit distance (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn).

**Remark 3:** Instead of a dedicated spelling dictionary we are using the search engine index itself. This has several benefits:

- It is dynamically updated. Every newly indexed word, whose frequency is over a certain threshold, is automatically used for spelling correction as well.
- As we need to search the index anyway the spelling correction comes at almost no extra cost.
- When indexing misspelled terms (i.e. not marked as a correct in the index) we do a spelling correction on the fly and index the page for the correct term as well.

**Remark 4:** We have implemented query suggestions/completion in a similar fashion. This is a good way to prevent spelling errors in the first place. Every newly indexed word, whose frequency is over a certain threshold, is stored as a suggestion to all of its prefixes (they are created in the index if they do not yet exist). As we anyway provide an instant search feature the lookup for suggestions comes also at almost no extra cost. Multiple terms are sorted by the number of results stored in the index.

**Reasoning**

In our algorithm we are exploiting the fact that the edit distance between two terms is symmetrical:

- We can generate all terms with an edit distance <2 from the query term (trying to reverse the query term error) and checking them against all dictionary terms,
- We can generate all terms with an edit distance <2 from each dictionary term (trying to create the query term error) and check the query term against them.
- We can combine both and meet in the middle, by transforming the correct dictionary terms to erroneous strings, and transforming the erroneous input term to the correct strings.

Because adding a char on the dictionary is equivalent to removing a char from the input string and vice versa, we can on both sides restrict our transformation to deletes only.

**We are using variant 3, because the delete-only-transformation is language independent and three orders of magnitude less expensive.**

**Where does the speed come from?**

**Pre-calculation**, i.e. the generation of possible spelling error variants (deletes only) and storing them at index time is the first precondition.- A fast index access at search time by
**using a hash table**with an average search time complexity of O(1) is the second precondition. - But
**only our Symmetric Delete Spelling Correction**on top of this allows to bring this O(1) speed to spell checking, because it allows a tremendous reduction of the number of spelling error candidates to be pre-calculated (generated and indexed). **Applying pre-calculation to Norvig’s approach would not be feasible**because pre-calculating all possible delete + transpose + replace + insert candidates of all terms would result in a huge time and space consumption.

**Computational Complexity**

Our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size (but depending on the average term length and maximum edit distance), because our index is based on a Hash Table which has an average search time complexity of O(1).

**Comparison to other approaches**

BK-Trees have a search time of O(log dictionary_size), whereas our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size.

Tries have a **comparable search performance** to our approach. But a Trie is a prefix tree, which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but **not applicable for spell checking**. If your typing error is e.g. in the first letter, than you have no common prefix, hence the Trie will not work for spelling correction.

**Application**

Possible application fields of our algorithm are those of fast approximate dictionary string matching: spell checkers for word processors and search engines, correction systems for optical character recognition, natural language translation based on translation memory, record linkage, de-duplication, matching DNA sequences, fuzzy string searching and fraud detection.

———

BTW, by using a similar principle our web search is three orders of magnitude more efficient as well. While Google touches 1000 servers for every query, we need to query just one (server/peer).

That’s not because of DHT! Vice versa, because even for a complex query in a web scale index only one of the servers needs to be queried, it enables the use of DHT for web search.

Our algorithm improves the efficiency of central servers in a data center to the same extent.

Pingback: Quora

Pingback: 1000x Faster Spelling Correction: Source Code released « FAROO Blog

Pingback: Spelling correction, Query completion and Instant search « FAROO Blog

Just for the sake of discussion, another option is using a bk-tree, which can be done as long as your distance function maintains triangular inequality (which I believe the do distance function does.)

BK-Trees have a search time of O(log n) for n=dictionary_size.

Our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size (but depending on the average term length and maximum edit distance).

Thanks Wolf. I’m glad i found this, i’d been researching this issue on and off for a while, your approach makes perfect sense. I benchmarked it against our current methods and it indeed performed well. We released our code for a similar approach written in golang: https://github.com/sajari/fuzzy

This symmetric deletion method allows corrections of up to edit distance 4. eg:

“abcd” and “cdef” have edit distance 4, but by apply two deletions to each term, you can get them to match: abcd => cd and cdef => cd .

This means that you are going to end up considering far more correction candidates than Norvig’s algorithm, which can be very costly.

@Hamish : Thanks for sharing!

@Jason: see Remark 2 of the above blog post. In comparison type 4 (applying deletes on both dictionary term and input term) we need to calculate

the true Damerau–Levenshtein distance in order to prevent false positives of higher edit distance (e.g. false positives of edit distance=4).

But this does NOT mean that we have to calculate all candidates for edit distance 4:

1. with symmetric deletion method we need to calculate much less candidates (at search time) and need much less comparisons for the same edit distance than we would have with Norvig’s algorithm (only deletes instead of deletes + transposes + replaces + inserts).

2. of those fewer candidates only a fraction (type 4) requires the calculation of the true Damerau-Levenshtein distance to exclude false positives.

3. our benchmark of both algorithms in the follow-up post ( http://blog.faroo.com/2012/06/24/1000x-faster-spelling-correction-source-code-released/ ) proves that our algorithm is 3 to 5 orders of magnitude faster for the same edit distance.

I think your index lookups will have complexity O(log(n)) right? So the algorithmic complexity is O(log(n)) not O(1). Same as a bk-tree.

Thanks for sharing. How does your method perform against a Trie-based spellchecker? I think the later is pretty fast.

The search index as you call it it’s actually a hash map, the problem with it is that in worst case of a search you will end up with an O(n). Even so, if you just compare some words, it’s an efficient way to implement a hash map.

@Adam: No, our index is based on a Hash Table which has an average search time complexity of O(1). http://en.wikipedia.org/wiki/Hash_table

With regards to point 3, number of combinations, it seems to me that:

With max edit distance 2 the number of combinations to consider is not: ((n * n-1)/2), but it’s: ((n * n-1)/2) + n + 1.

You also have to add the + n to still take into account that although you’re taking edit distance 2, you should still consider the cases with only 1 deletion, hence the + n.

And the case of 0 deletions, hence the +1.

Other than that, thanx for this blog post, I’m definitely gonna try this approach.

@Alexandru: Yes, the index is using a hash table / hash map, which is the base for an average search time complexity of O(1). But our Symmetric Delete Spelling Correction on top of this allows to bring this O(1) speed to spell checking / fuzzy string matching / approximate string matching / similarity search (without previously generating and indexing all possible delete + transpose + replace + insert candidates of all terms, which would consume too much time and space).

@ R. Berson: A Trie has a comparable search performance to our approach. But a Trie is a prefix tree ( http://en.wikipedia.org/wiki/Trie ), which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but not for spell checking. If your typing error is in the first letter, than you have no common prefix, hence the Trie will not work for spelling correction.

@Sander: Yes, you are right.

I would to say this is an excellent, smart solution. A comparable solution would be to use ternary search tree. But it is complex to implement a ternary search tree.

There is a huge “if” to say a HashTable has time complexity of O(1). With so many “similar” words (edit distance <= 2) in the dictionary, it is expected that there are many many collisions.

Generally, a HashTable is slower than a Trie. The big disadanvantage of a Trie-based method is the huge memory consumption (26x dictionary size for English), although a HashTable also need a lot of extra memory to reduce collisions.

With a little trick, you can easily search a Trie for words with maximal edit distance of X. For example:

#define ALPHABET_SIZE = 26;

struct TrieNode {

struct TrieNode *child[ALPHABET_SIZE];

string word;

TrieNode() {

for(int i = 0; i < ALPHABET_SIZE; ++i) child[i] = NULL;

};

}

// a search algorithm for matches between a Trie and a target word with maximal // edit distance of X.

void searchX(TrieNode *root, string target, int index, int X, vector& res) {

int n = target.size();

if(root == NULL || index == n) return;

if(root != NULL && !root->word.empty() && index+X >= n-1) {

res.push_back(root->word);

return;

}

searchX(root, target, index+1, X-1, res); // delete

for(int i = 0; i child[i], target, index, X-1, res); // insert

if(i == target[index]-‘a’)

searchX(root->child[i], target, index+1, X, res); // match

else

searchX(root->child[i], target, index+1, X-1, res); // replace

}

return;

}

Thank you very much for your detailed comment and code.

>> A comparable solution would be to use ternary search tree. But it is complex to implement a ternary search tree.The lookup time in a Ternary Search Tree is O(log n), while it is only 0(1) in our solution.

Also, while a Ternary Search Tree could be used for the dictionary lookup instead of a hash table, it doesn’t address the spelling error candidate generation.

And the tremendous reduction of the number of spelling error candidates to be looked-up in the dictionary is the true innovation of our Symmetric Delete Spelling Correction algorithm.

>> There is a huge “if” to say a HashTable has time complexity of O(1). With so many “similar” words (edit distance <= 2) in the dictionary, it is expected that there are many many collisions.For a maximum edit distance of 2 and an average word length of 5 and 100,000 dictionary entries we need to additionally store (and hash) 1,500,000 deletes.

With a 32 bit hash (4,294,967,296 possible distinct hashes) the collision probability seems negligible.

With a good hash function a similarity of terms (locality) should not lead to increased collisions, if not especially desired e.g. with http://en.wikipedia.org/wiki/Locality-sensitive_hashing

>> Generally, a HashTable is slower than a Trie. The big disadanvantage of a Trie-based method is the huge memory consumption (26x dictionary size for English), although a HashTable also need a lot of extra memory to reduce collisions.A HashTable is slower than a Trie only if there are collisions, which are unlikely in this case as mentioned above.

Of course you could replace the hash table with a Trie (that is just a arbitrary lookup component of O(1) speed for a *single* lookup) at the cost of added code complexity, but without performance gain.

>> With a little trick, you can easily search a Trie for words with maximal edit distance of X. For example … a search algorithm for matches between a Trie and a target wordI think your code is an implementation of Peter Norvigs Algorithm with deletes + transposes + replaces + inserts (described in the blog post above) on top of a Trie. As mentioned above replacing the hash table with a Trie is ok, but adds code complexity without performance gain.

But our Symmetric Delete Spelling Correction on top of a hash table (or trie) allows a tremendous reduction of the number of spelling error candidates to be looked-up: 36 candidate lookups vs. 114,324 candidate lookups in Norvigs Algorithm (for word length=9, alphabet size=36, edit distance=2)

See also the paragraph: Where does the speed come from? in the blog post above.

For a benchmark of Norvigs Algorithm and our Symmetric Delete Spelling Correction see http://blog.faroo.com/2012/06/24/1000x-faster-spelling-correction-source-code-released/ Our algorithm is 3 orders of magnitude faster for an edit distance=2 and 5 orders of magnitude faster for an edit distance=3.

I think your algorithm explained in this post is basically the same

as one described in this paper (2007/02):

> Thomas Bocek et al. Fast similarity search in large dictionaries. 2007.

> http://fastss.csg.uzh.ch/ifi-2007.02.pdf

Especially please take a look on the section 3.4 ‘FastSS with Candidates’

of the paper, on which the researchers explain a variant of their algorithm

(‘FastSSwC’). Here I quote a bit from their explanation:

> 3.4.1 Indexing

> For each indexed word and a given k only the words in the

> deletion neighborhood are stored, pointing to the original

> word, and no additional information. The space bound is the

> same as for FastSS, O(nm^k). Ud(fest,2) contains fest, est,

> fst, fet, fes, fe, fs, ft, es, et, st pointing to fest.

For your interest, below is the link to the researchers’ website,

which includes the summary of their work, related papers, and their

implementations etc.

> http://fastss.csg.uzh.ch/

I hope my comment above will be of some interest to you.

Thank you very much for the reference and hat tip to the FastSSwC authors. Indeed it looks similar, although we are using the Damerau-Levenshtein edit distance, which additionaly supports adjacent transpositions as a very frequent case in spelling correction, unlike FastSSwC which uses Levenshtein edit distance.

Also FastSSwC is calculating the Levenshtein distance for every match, while we calculate the Damerau-Levenshtein distance only if for the match there have been characters deleted both on the dictionary term and the input term, making it faster.

@Wolf, Why are you only calculating the Damerau-Levenshtein distance if there is a deletion in both? Shouldn’t you do so in all cases? If you have the input term “Fast” and a dictionary entry “FastSS”, the edit distance would be 2. In this case, the input term has no deletions in a match to the dictionary entry.

@Jake

For each of the pre-calculated deletes in the dictionary we know how many characters have been deleted.

We even don’t need to store this number, it can be easily calculated from: length of original dictionary term – length of delete term (there is always a pointer from the delete term to the original term).

In all cases where a deletion has been done only on one side (either on the dictionary term or on the input term) the number of deletes equals the Dammerau-Levenshtein edit distance. Hence there is no need to do execute the expensive Dammerau-Levenshtein algorithm.

But when the deletion has been done both on the dictionary term and on the input term, the number of deletes doesn’t necessarily equals the Dammerau-Levenshtein edit distance:

input term “afstss” and dictionary term “fastss”

-> a(f)stss == (f)astss

2 deletes, but Dammerau-Levenshtein edit distance = 1

input term “astssx” and dictionary term “fastss”

-> astss(x) == (f)astss

2 deletes, but Dammerau-Levenshtein edit distance = 2

Hence we need to do execute the expensive Dammerau-Levenshtein algorithm in order to calculate it.

@Wolf, Thanks.

So basically, you know by the length of the difference of the terms and the fact that you had an exact match.

There was a paper I read recently that had an improvement to insertion time into this hash, which involves splitting words in half when they are larger than a certain length. I didn’t get around to really understanding the speedup, but I thought I’d post it in case you were interested. http://arxiv.org/abs/1008.1191

Pingback: How to correct company names, street names & product names? | FAROO Blog

@Wolf in http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/#comment-625518 :

“If your typing error is in the first letter, than you have no common prefix”

What makes you say this? If there’s a root node in the trie you should be able to find those errors as well. The root node (which doesn’t contain a character) serves as a common prefix.

@TWiStErRob :

The main idea is to generate and index (precalculate) all possible defective variants (generating all possible errors by applying deletes + transposes + replaces + inserts within the maximum edit distance) at index time, so that you don’t have to do this at search time.

But to precalculate all variants would require huge time and memory.

So the point of the described algorithm is to generate much less variants by using only deletes instead of deletes, inserts, replacements and transpositions.

If you are

storing all those precalculated variantsin a hash table or Trie, then you are able to match an input term with a typing error in the first letter to a dictionary term. There is no big performance difference whether you use our Symmetric Delete Spelling Correctionon topof a hash table or Trie.But you can’t use a Trie as a

standalonemethod to match an input term with a typing error in the first letter to a dictionary term.If you are

only storing the pure dictionary terms(without generating and storing all defective variants)you can’t match an input term with a typing error in the first letter(but you can e.g. match an input term where the last letter is missing – e.g. for query completion).Example:@Wolf, I am working on an Android app to recognize hand-written Java code. OCR causes a lot of the wrong detection and I want to correct it using your SpellChecker – SymSpell. I have observed that it takes a lot of time to form dictionary as the editDistanceMax increases. It takes a lot of time. I don’t want to compute dictionary every time.

I want to compute dictionary only once with maxEditDistance of 10 and store it in local storage, and then perform the lookup. Can you please help me how to modify your code so that I compute dictionary only once and then perform the lookup from the dictionary stored in local storage ?

@Partho Mandal: In order to prevent der precalculation cost of SymSpell on every start of your App you may use

serialization.In the

developer buildof your app after line 391 CreateDictionary(“big.txt”,””);serializetheDictionaryandwordlistto a file.Then in the

final buildinstead of CreateDictionary just deserialize Dictionary and wordlist from the file.Either use Java serialization or Protocol Buffers, Flat buffers or Cap’n Proto for serialization.

I’m talking about this: https://github.com/wolfgarbe/symspell

In my tries, the results were not good at all. A wordlist containing 1300000 words consumes 4.5GB memory! What about perfomance? A simle BK-Tree implementation that consumes 350mb memory a with larger word list outperformed this code (I haven’t measured but at least 10 times)

And the worst thing is everything is static.

Which implementation of SymSpell (C# or port) did you use for your test?

Which implementation of BK-Tree did you use for your test (link to github)?

At what maximum edit distance did you get the 10 times faster results?

BK-Trees have a search time of

O(log dictionary_size), whereas SymSpell isconstant time(O(1)time ), i.e. independent of the dictionary size.Of course the final performance also depends on the constant factors. Those constant factors might be better for BK-Tree than in SymSpell.

Benchmark:CompetitorsPeter Norvig’s algorithm (C# port from Lorenzo Stoakes)

BK-Tree1 (C# implementation of tgriffith)

BK-Tree2 (C# implementation of TarasRoshko)

BK-Tree3 (C# implementation of Xenopax)

SymSpell v4.1 (C# implementation)

Test set-up 129,157 unique terms: Using English text corpus to generate dictionary.maximum edit distance=3Loop with 1000 lookups for “acamodation” (edit distance=3 to correct word “accommodation”)

Results 1:Norvig’s algorithm: 173,232,000 ms

BK-Tree1: 273,693 ms

BK-Tree2: 158,000 ms

BK-Tree3: 121,000 ms

SymSpell: 155 ms

Test set-up 229,157 unique terms: Using English text corpus to generate dictionary.maximum edit distance=2Loop with 1000 lookups for “acomodation” (edit distance=2 to correct word “accommodation”)

Results 2:Norvig’s algorithm: 397,271 ms

BK-Tree1: 177,174 ms

BK-Tree2: 72,000 ms

BK-Tree3: 48,620 ms

SymSpell: 26 ms

Test set-up 3500,000 unique terms: Using Google Books Ngram data to generate dictionary.maximum edit distance=2Loop with 1000 lookups for “acomodation” (edit distance=2 to correct word “accommodation”)

Results 3:BK-Tree1: 1,243,000 ms

BK-Tree2: 647,000 ms

BK-Tree3: 398,800 ms

SymSpell: 47 ms

SymSpell is

780 times fasterthan the fastest BK-Tree (for 29,000 dictionary entries and edit distance=3).SymSpell is

1,870 times fasterthan the fastest BK-Tree (for 29,000 dictionary entries and edit distance=2).SymSpell is

8,485 timesfaster than the fastest BK-Tree (for 500,000 dictionary entries and edit distance=2).Trade

5…10x spacefor1000…10,000x speed(BK-Tree vs. SymSpell).@koray: My results contradict your findings that a BK-Tree is 100 times faster than SymSpell (6 orders of magnitude difference in claims).To reproduce your results please add more context (which BK-Tree implementation, maximum edit distance, Levenshtein variant).

Levenshtein distance variantsThere are three different levenshtein distances:

Levenshtein distance: adjacent transposition (AC->CA) counted as 2 edits

Restricted Damerau-Levenshtein distance (Optimal string alignment algorithm): adjacent transposition counted as 1 edit, but substrings can’t be edited more than once: ed(“CA” , “ABC”) =3

True Damerau-Levenshtein distance: adjacent transposition counted as 1 edit, substrings can be edited more than once: ed(“CA” , “ABC”) =2

SymSpell uses the True Damerau-Levenshtein distance, which is

best suited for spelling correction, butmore computationally intensive. This should be taken into account when benchmarking, in order to not compare apples to oranges. The BK-Tree1 und BK-Tree3 codes which originally used only the Levenshtein Distance have been modified to use True Dammerau-Levenshtein Edit Distance (which made them 5 times slower).