1000x Faster Spelling Correction: Source Code released

Share onTweet about this on TwitterShare on FacebookShare on Google+Share on RedditBuffer this pageShare on LinkedInShare on VK

In a followup to our recent post 1000x Faster Spelling Correction algorithm we are releasing today a C# implementation of our Symmetric Delete Spelling Correction algorithm as Open Source:

Update1: The source code is now also on GitHub.
Update2: Improved implementation now 1,000,000 times faster for edit distance=3.

// SymSpell: 1000x faster through Symmetric Delete spelling correction algorithm
// The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup 
// for a given Damerau-Levenshtein distance. It is three orders of magnitude faster and language independent.
// Opposite to other algorithms only deletes are required, no transposes + replaces + inserts.
// Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term.
// Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
// Copyright (C) 2012 Wolf Garbe, FAROO Limited
// Version: 1.6
// Author: Wolf Garbe <wolf.garbe@faroo.com>
// Maintainer: Wolf Garbe <wolf.garbe@faroo.com>
// URL: http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
// Description: http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
// License:
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU Lesser General Public License, 
// version 3.0 (LGPL-3.0) as published by the Free Software Foundation.
// http://www.opensource.org/licenses/LGPL-3.0
// Usage: single word + Enter:  Display spelling suggestions
//        Enter without input:  Terminate the program

using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Collections.Generic;
using System.IO;
using System.Diagnostics;

static class SymSpell
    private static int editDistanceMax=2;
    private static int verbose = 0;
    //0: top suggestion
    //1: all suggestions of smallest edit distance 
    //2: all suggestions <= editDistanceMax (slower, no early termination)

    private class dictionaryItem
        public string term = "";
        public List<editItem> suggestions = new List<editItem>();
        public int count = 0;

        public override bool Equals(object obj)
            return Equals(term, ((dictionaryItem)obj).term);
        public override int GetHashCode()
            return term.GetHashCode(); 

    private class editItem
        public string term = "";
        public int distance = 0;

        public override bool Equals(object obj)
            return Equals(term, ((editItem)obj).term);
        public override int GetHashCode()
            return term.GetHashCode();

    private class suggestItem
        public string term = "";
        public int distance = 0;
        public int count = 0;

        public override bool Equals(object obj)
            return Equals(term, ((suggestItem)obj).term);
        public override int GetHashCode()
            return term.GetHashCode();

    private static Dictionary<string, dictionaryItem> dictionary = new Dictionary<string, dictionaryItem>();

    //create a non-unique wordlist from sample text
    //language independent (e.g. works with Chinese characters)
    private static IEnumerable<string> parseWords(string text)
        return Regex.Matches(text.ToLower(), @"[\w-[\d_]]+")
                    .Select(m => m.Value);

    //for every word there all deletes with an edit distance of 1..editDistanceMax created and added to the dictionary
    //every delete entry has a suggestions list, which points to the original term(s) it was created from
    //The dictionary may be dynamically updated (word frequency and new words) at any time by calling createDictionaryEntry
    private static bool CreateDictionaryEntry(string key, string language)
        bool result = false;
        dictionaryItem value;
        if (dictionary.TryGetValue(language+key, out value))
            //already exists:
            //1. word appears several times
            //2. word1==deletes(word2) 
            value = new dictionaryItem();
            dictionary.Add(language+key, value);

        //edits/suggestions are created only once, no matter how often word occurs
        //edits/suggestions are created only as soon as the word occurs in the corpus, 
        //even if the same term existed before in the dictionary as an edit from another word
        if (string.IsNullOrEmpty(value.term))
            result = true;
            value.term = key;

            //create deletes
            foreach (editItem delete in Edits(key, 0, true))
                editItem suggestion = new editItem();
                suggestion.term = key;
                suggestion.distance = delete.distance;

                dictionaryItem value2;
                if (dictionary.TryGetValue(language+delete.term, out value2))
                    //already exists:
                    //1. word1==deletes(word2) 
                    //2. deletes(word1)==deletes(word2) 
                    if (!value2.suggestions.Contains(suggestion)) AddLowestDistance(value2.suggestions, suggestion);
                    value2 = new dictionaryItem();
                    dictionary.Add(language+delete.term, value2);
        return result;

    //create a frequency disctionary from a corpus
    private static void CreateDictionary(string corpus, string language)
        if (!File.Exists(corpus))
            Console.Error.WriteLine("File not found: " + corpus);

        Console.Write("Creating dictionary ...");
        long wordCount = 0;
        foreach (string key in parseWords(File.ReadAllText(corpus)))
            if (CreateDictionaryEntry(key, language)) wordCount++;
        Console.WriteLine("\rDictionary created: " + wordCount.ToString("N0") + " words, " + dictionary.Count.ToString("N0") + " entries, for edit distance=" + editDistanceMax.ToString());

    //save some time and space
    private static void AddLowestDistance(List<editItem> suggestions, editItem suggestion)
        //remove all existing suggestions of higher distance, if verbose<2
        if ((verbose < 2) && (suggestions.Count > 0) && (suggestions[0].distance > suggestion.distance)) suggestions.Clear();
        //do not add suggestion of higher distance than existing, if verbose<2
        if ((verbose == 2) || (suggestions.Count == 0) || (suggestions[0].distance >= suggestion.distance)) suggestions.Add(suggestion);

    //inexpensive and language independent: only deletes, no transposes + replaces + inserts
    //replaces and inserts are expensive and language dependent (Chinese has 70,000 Unicode Han characters)
    private static List<editItem> Edits(string word, int editDistance, bool recursion)
        List<editItem> deletes = new List<editItem>();
        if (word.Length > 1)
            for (int i = 0; i < word.Length; i++)
                editItem delete = new editItem();
                delete.term=word.Remove(i, 1);
                if (!deletes.Contains(delete))
                    //recursion, if maximum edit distance not yet reached
                    if (recursion && (editDistance < editDistanceMax)) 
                        foreach (editItem edit1 in Edits(delete.term, editDistance,recursion))
                            if (!deletes.Contains(edit1)) deletes.Add(edit1); 

        return deletes;

    private static int TrueDistance(editItem dictionaryOriginal, editItem inputDelete, string inputOriginal)
        //We allow simultaneous edits (deletes) of editDistanceMax on on both the dictionary and the input term. 
        //For replaces and adjacent transposes the resulting edit distance stays <= editDistanceMax.
        //For inserts and deletes the resulting edit distance might exceed editDistanceMax.
        //To prevent suggestions of a higher edit distance, we need to calculate the resulting edit distance, if there are simultaneous edits on both sides.
        //Example: (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn for editDistanceMaxe=1)
        //Two deletes on each side of a pair makes them all equal, but the first two pairs have edit distance=1, the others edit distance=2.

        if (dictionaryOriginal.term == inputOriginal) return 0; else
        if (dictionaryOriginal.distance == 0) return inputDelete.distance;
        else if (inputDelete.distance == 0) return dictionaryOriginal.distance;
        else return DamerauLevenshteinDistance(dictionaryOriginal.term, inputOriginal);//adjust distance, if both distances>0

    private static List<suggestItem> Lookup(string input, string language, int editDistanceMax)
        List<editItem> candidates = new List<editItem>();

        //add original term
        editItem item = new editItem();
        item.term = input;
        item.distance = 0;
        List<suggestItem> suggestions = new List<suggestItem>();
        dictionaryItem value;

        while (candidates.Count>0)
            editItem candidate = candidates[0];

            //save some time
            //early termination
            //suggestion distance=candidate.distance... candidate.distance+editDistanceMax                
            //if canddate distance is already higher than suggestion distance, than there are no better suggestions to be expected
            if ((verbose < 2)&&(suggestions.Count > 0)&&(candidate.distance > suggestions[0].distance)) goto sort;
            if (candidate.distance > editDistanceMax) goto sort;  

            if (dictionary.TryGetValue(language+candidate.term, out value))
                if (!string.IsNullOrEmpty(value.term))
                    //correct term
                    suggestItem si = new suggestItem();
                    si.term = value.term;
                    si.count = value.count;
                    si.distance = candidate.distance;

                    if (!suggestions.Contains(si))
                        //early termination
                        if ((verbose < 2) && (candidate.distance == 0)) goto sort;     

                //edit term (with suggestions to correct term)
                dictionaryItem value2;
                foreach (editItem suggestion in value.suggestions)
                    //save some time 
                    //skipping double items early
                    if (suggestions.Find(x => x.term == suggestion.term) == null)
                        int distance = TrueDistance(suggestion, candidate, input);
                        //save some time.
                        //remove all existing suggestions of higher distance, if verbose<2
                        if ((verbose < 2) && (suggestions.Count > 0) && (suggestions[0].distance > distance)) suggestions.Clear();
                        //do not process higher distances than those already found, if verbose<2
                        if ((verbose < 2) && (suggestions.Count > 0) && (distance > suggestions[0].distance)) continue;

                        if (distance <= editDistanceMax)
                            if (dictionary.TryGetValue(language+suggestion.term, out value2))
                                suggestItem si = new suggestItem();
                                si.term = value2.term;
                                si.count = value2.count;
                                si.distance = distance;

            }//end foreach

            //add edits 
            if (candidate.distance < editDistanceMax)
                foreach (editItem delete in Edits(candidate.term, candidate.distance,false))
                    if (!candidates.Contains(delete)) candidates.Add(delete);
        }//end while

        sort: suggestions = suggestions.OrderBy(c => c.distance).ThenByDescending(c => c.count).ToList();
        if ((verbose == 0)&&(suggestions.Count>1))  return suggestions.GetRange(0, 1); else return suggestions;

    private static void Correct(string input, string language)
        List<suggestItem> suggestions = null;
        //Benchmark: 1000 x Lookup
        Stopwatch stopWatch = new Stopwatch();
        for (int i = 0; i < 1000; i++)
            suggestions = Lookup(input,language,editDistanceMax);
        //check in dictionary for existence and frequency; sort by edit distance, then by word frequency
        suggestions = Lookup(input, language, editDistanceMax);

        //display term and frequency
        foreach (var suggestion in suggestions)
            Console.WriteLine( suggestion.term + " " + suggestion.distance.ToString() + " " + suggestion.count.ToString());
        if (verbose == 2) Console.WriteLine(suggestions.Count.ToString() + " suggestions");

    private static void ReadFromStdIn()
        string word;
        while (!string.IsNullOrEmpty(word = (Console.ReadLine() ?? "").Trim()))

    public static void Main(string[] args)
        //e.g. http://norvig.com/big.txt , or any other large text corpus

    // Damerau–Levenshtein distance algorithm and code 
    // from http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    public static Int32 DamerauLevenshteinDistance(String source, String target)
        Int32 m = source.Length;
        Int32 n = target.Length;
        Int32[,] H = new Int32[m + 2, n + 2];

        Int32 INF = m + n;
        H[0, 0] = INF;
        for (Int32 i = 0; i <= m; i++) { H[i + 1, 1] = i; H[i + 1, 0] = INF; }
        for (Int32 j = 0; j <= n; j++) { H[1, j + 1] = j; H[0, j + 1] = INF; }

        SortedDictionary<Char, Int32> sd = new SortedDictionary<Char, Int32>();
        foreach (Char Letter in (source + target))
            if (!sd.ContainsKey(Letter))
                sd.Add(Letter, 0);

        for (Int32 i = 1; i <= m; i++)
            Int32 DB = 0;
            for (Int32 j = 1; j <= n; j++)
                Int32 i1 = sd[target[j - 1]];
                Int32 j1 = DB;

                if (source[i - 1] == target[j - 1])
                    H[i + 1, j + 1] = H[i, j];
                    DB = j;
                    H[i + 1, j + 1] = Math.Min(H[i, j], Math.Min(H[i + 1, j], H[i, j + 1])) + 1;

                H[i + 1, j + 1] = Math.Min(H[i + 1, j + 1], H[i1, j1] + (i - i1 - 1) + 1 + (j - j1 - 1));

            sd[ source[ i - 1 ]] = i;
        return H[m + 1, n + 1];

The implementation supports now edit distances of any size (default=2).

With previous spell checking algorithms the required time explodes with larger edit distances. They try to omit this with early termination when suggestions of smaller edit distances are found.

We did a quick benchmark with 1000 lookups:

Term Best correction Edit distance Faroo
Peter Norvig
marsupilami no correction* >3 1,772 165,025,000 93,129
acamodation accommodation 3 1,874 175,622,000 93,715
acomodation accommodation 2 162 348,191 2,149
hous house 1 71 179 2
house house 0 0 17 n/a

*Correct word, but not in dictionary and there are also no corrections within an edit distance of <=3. This is a quite common case (e.g. rare words, new words, domain specific words, foreign words, names).

The speed advantage grows exponentially with the edit distance:
For an edit distance=1 it’s the same order of magnitude,
for an edit distance=2 it’s 3 orders of magnitude faster,
for an edit distance=3 it’s 5 orders of magnitude faster.

Source code
The C# implementation of our Symmetric Delete Spelling Correction algorithm is released on GitHub as Open Source under the GNU Lesser General Public License (LGPL).

C# (original)

The following third party ports to other programming languages have not been tested by myself whether they are an exact port, error free, provide identical results or are as fast as the original algorithm:

C++ (third party port)

Go (third party port)

Java (third party port)

Javascript (third party port)

Python (third party port)

Ruby (third party port)

Swift (third party port)

Share onTweet about this on TwitterShare on FacebookShare on Google+Share on RedditBuffer this pageShare on LinkedInShare on VK

45 thoughts on “1000x Faster Spelling Correction: Source Code released

  1. Pingback: 1000x Faster Spelling Correction algorithm « FAROO Blog

  2. Pingback: Spelling correction, Query completion and Instant search « FAROO Blog

  3. while converting this code to VB.NET I am facing the following problems
    Line no. 98: What is m? and what is the meaning of =>? (in VB => gets converted to >=)
    Line no. 278: What is m?
    Line no. 314: What is c?

    An early response will be highly appretiated.


  4. 98: m => m.Value
    VB: Function(m) m.Value

    278: x => x.term == suggestion.term
    VB: Function(x) x.term = suggestion.term

    314: c => c.distance
    VB: Function(c) c.distance

    Those are all Lambda expressions:
    http://msdn.microsoft.com/en-us/library/bb397687.aspx (C#)
    http://msdn.microsoft.com/en-us/library/bb531253.aspx (Visual Basic, fully supported from Visual Basic 2010)

    You might use a C# to VB Converter to translate

  5. While I totally agree to Pharao that the code shoud be open soource especially to port it to other platforms (ie linux servers) I see the problem that evil people will find ways to inject spam, censor or fool the ranking programmatically….

    Maybe this could be made bulletproof – if not consider a closed source linux port and an independent security team to review its result…

    Respect for this great project!

  6. Thx. There is something wrong with the wordpress sourcecode formatter.
    Should be: sd [ source [ i – 1 ] ] = i;

  7. @wolf – Great work and thanks for sharing this code.

    Java is not my best suit, so its taking little more time for me to digest your code. I do have couple questions for you:
    – Does this code work out of box or does it have any dependencies?
    – How do I test this code? Download and install JDK and run the code? or do I need to setup anymore stuff to run the code?

    Thanks in advance!

  8. The dictionary cannot be serialized even on small corpus (5mb) 🙂
    SerializationException was unhandled
    The internal array cannot expand to greater than Int32.MaxValue elements.

    As far as I can see, no practical use apart from looking through algorythms’ implementation. To make use of the code one must write some persistance layer for the precalculated dictionary =)

    Just a note for next readers, thanks for the code anyway =)

  9. Not sure what happened with your serialization. For the sample corpus of 6MB ( http://norvig.com/big.txt ) we get 29,157 unique words and 2,151,998 pre-calculated dictionary entries for a maximum edit distance of 3.
    2,151,998 is three orders of magnitude below 2,147,483,647 (Int32.MaxValue).

    The code was meant to be simple and comprehensible, just to explain and benchmark our algorithm. Depending on your application you can:

    1. just go with the one-time pre-calculation and without persistence, e.g. if you have a long time running server,
    2. use a more efficient serialization library, e.g. http://code.google.com/p/protobuf-net/
    3. use a NoSQL database to implement a persistent dictionary without size restrictions, e.g. MongoDB, LevelDB, Couchbase etc. (that’s what we are doing in production with a Terabyte corpus).
  10. Pingback: Coder Read #20140114: Poor Man’s Firebase

  11. I’m trying to make use of this cool algorithm for my iPhone app. Basically, I’m trying to implement autocorrection as the user type letters/words from their keyboard. I have a dictionary of about 300k words in a txt file with one word per line. I’d like to be able to use that dictionary with this algorithm and return 2 or 3 words to suggest the users as they type. Can you give me a general idea on how i can use this algorithm to achieve what I’m looking for?


    p.s. I can read the dictionary from a file in to an array in memory pretty fast, about 0.4second.

  12. Have a look at line 357 CreateDictionary(“big.txt”,”en”); in order to read the 300k words from you txt file. Just change “big.txt” to the path where your dictionary is located.
    Then comment out line 358 ReadFromStdIn(); Instead call Correct(word,”en”); from your keyboard loop, where word is the substring the user has just typed.
    In line 337-342 currently the suggested corrections are printed to the console. Change this according to your requirements.

  13. Do you have any performance number on how long did it take to read in your big.txt file? I see the big.txt is pretty big. With the search being this fast, the only concern I have left is how long it’ll take to read in the dictionary file.

    I’m going to port this to Objective-C so it can work in iPhone. You don’t happen to have an objective-C version of this, do you?

    Thanks for getting back so quickly.

  14. 40 seconds on an Intel Core 2 Duo with 2.53 GHz for big.txt, which contains more than 1 million words, of which are 29,157 unique. This leads to 2 million spelling error variants (deletes only) for a maximum edit distance of 3, taking 500 MB in RAM.
    Besides the time to read the dictionary also the memory space required to store those spelling error variants might be a concern. You may reduce this heavily if you set the maximum edit distance to 2 (in line 34).

    Of course you could do the spelling correction also server side, then memory and time for the dictionary shouldn’t be an issue at all.

    We don’t have an objective-C version. Would be nice if you could share and publish your port.

  15. Hmm… that maybe an issue. I need to be able to search as soon as the user start typing on the keyboard, which can be 2 second or less. I don’t have the option of doing the spelling correction on the server side.

  16. With maximum edit distance=2 everything should be fine. If you want to send your dictionary file (zip’ed) to info@faroo.com I can do a quick space/time benchmark for you (on a PC). Also, you need to generate the spelling error variants only once, not every time the user starts the app, by storing them to IOS Core Data and making them persistent.

  17. Hi.
    Thank you for your helpful code.
    I am not good with C# at all, but I really need your code to auto correct some text files.
    The goal is to send a text file from matlab, and then get back the corrected as a text file.
    The language of the text files is German.
    on line 357, when i change the big.txt to the directory of my own dictionary, do i need to make any other changes? maybe the “en”?
    Could you help me to modifythe code to read the input from a text and also save the output in a text file as well?
    Thank you for your support and great code.

  18. 34 private static int editDistanceMax=2;
    35 private static int verbose = 0; private static string output = "";

    354 public static void Main(string[] args)
    355 {
    356 //e.g. http://norvig.com/big.txt , or any other large text corpus
    357 CreateDictionary("big.txt","de");
    358 //ReadFromStdIn();
    foreach (string word in parseWords(File.ReadAllText("input.txt")))
    List suggestions = Lookup(word,"de",editDistanceMax);
    //punctuation marks and case of corrected word gets lost
    if (suggestions.Count==0) output+=word+" "; else output+=suggestions.First()+" ";
    359 }

  19. Hello…
    I want to ask something…
    I’ve compiled this source code in my visual studio 2010, but after debug, the result just “File canot found : big.txt”

    May you explain whats wrong with that?

  20. hi, I want to Know how I will binding this code with platform contains text box to work as google to autocompletion and correction inserted words,iam trying it by visual studio2010,it is run but without platform,plz can you help me
    with my best regarde

  21. hi,
    is this code using as autocorrection and completion program.and how make binding this code with textbox in form using visual studio 2010 to be like google smart autocorrection and completion.
    plz replay with my best regarde

  22. Comment out line 358 ReadFromStdIn(); Instead call Correct(textBox1.Text,”en”); from your textBox1.TextChanged() event, where textBox1.Text is the substring the user has just typed.
    In line 337-342 currently the suggested corrections are printed to the console. Change this according to your requirements, e.g. you could populate comboBox1.items with the suggested corrections.

  23. Hello Wolf,

    Has John Tran, or anyone else for that matter, made any progress on the Objective-C port?

    By the way, thanks for making the source available to all! Greatly appreciated.

  24. I don’t know what is the current state of the port or whether it has been finished. I’m always happy to set a link to any port.

  25. hi
    how i can add Auto completion for insertion words to get a number of suggestion in list box to choose target words as google Autocompletion,

  26. There is a difference between correction and suggestion/completion!

    Correction: Find the correct word for a word which contains errors. Missing letters/errors can be on start/middle/end of the word. We can find only words equal/below the maximum edit distance, as the computational complexity is dependent from the edit distance.

    Suggestion/completion: Find the complete word for an already typed substring (prefix!). Missing letters can be only at the end of the word. We can find words/word combinations of any length, as the computational complexity is independent from edit distance and word length.

    The code above implements only correction, but not suggestion/completion!
    It still finds suggestions/completions equal/below the maximum edit distance, i.e. it starts to show words only if there are <= 2 letters missing (for maximum edit distance=2). Nevertheless the code can be extended to handle both correction and suggestion/completion. During the process of dictionary creation you have to add also all substrings (prefixes only!) of a word to the dictionary, when you are adding a new word to the dictionary. All substring entries of a specific term then have to contain a link to the complete term. Alternatively, for suggestion/completion you could use a completely different algorithm/structure like a Trie, which inherently lists all complete words for a given prefix.

  27. thank you for answer ,but my target is to create a program like smart google by using levenshtiene distance to Auto completion and Auto correction at the same time ,this is the requirement?

  28. Levenshtein distance is not required for auto completion, it has nothing to do with it. (Damerau-)Levenshtein distance is only required as a similarity measure for spelling error correction.
    How to combine both auto correction and auto completion in a single algorithm is described in my previous comment!

  29. private static IEnumerable parseWords(string text)
    return Regex.Matches(text.ToLower(), @”[\w-[\d_]]+”)
    .Select(m => m.Value);

    what is the use of this class of function ?

  30. parseWords(string text) creates a non-unique word list from text (corpus = large text sample which is correctly spelled).
    This word list is then used to create a frequency dictionary. The dictionary is later used as reference for spell checking and spelling suggestions.

  31. Hi
    I will know where we can calculate the ranking distance for each word and how i can make this code to get all words in list order based on ranking distance or cost?

  32. Set private static int verbose = 2; (see line 35), in order to include all suggestions.
    Then call Correct(string input, string language) from your code (see line 318).
    Then you get a list of suggestions (see line 335), sorted by edit distance, and then by word frequency.
    The sorting is done in line 314. You may modify this line to sort according whatever you want (e.g. a custom cost function).

  33. Pingback: Data cleaning of company names, street names & product names | FAROO Blog

  34. UPDATE: SymSpellCompound: compound aware automatic spelling correction

    Take the input string of length n.

    1. do a spell checking of the whole string

    2. If there is no match within the maximimum edit distance:
    do a spell checking of all substrings(0,p) with p=n-maximum edit distance;p>=maximum edit distance;p=p-maximum edit distance until there is a match within the maximimum edit distance

    3. apply the algorithm recursively to the remainder of the string

    To detect the missing space in stackoverflow you need 5 spell checks with a maximum edit distance of 2:
    stackov -> stack
    erflow -> overflow

    As SymSpell is extremely fast, an additional constant factor doesn’t hurt the resulting performance to much.

  35. The python code for the above algorithm is not available. can anyone help?
    How do i implement edit distance 3 in python ?
    D-L distance in python?

Leave a Reply

Your email address will not be published. Required fields are marked *