Sub-millisecond compound aware automatic spelling correction

Source: https://www.flickr.com/photos/theredproject/3968278028

SymSpell just got better. SymSpellCompound now supports compound-aware automatic spelling correction of multi-word input strings. It is built on top of SymSpell's 1 million times faster spelling correction algorithm.

Examples

More info on this Blog post on Medium.
The source code is released as Open Source on Github.

Elias-Fano: quasi-succinct compression of sorted integers in C#

Introduction

This blog post explores Elias-Fano encoding, which allows very efficient compression of sorted lists of integers, in the context of information retrieval (IR).

Elias-Fano encoding is quasi-succinct, which means it is almost as good as the theoretically best possible compression scheme for sorted integers. While it can be used to compress any sorted list of integers, we will use it for compressing posting lists of inverted indexes.

While gap compression has been around for over 30 years, and some of the foundations of Elias-Fano encoding even date back to a 1972 publication by Peter Elias, Elias-Fano encoding itself was only published in 2012. Because it is a rather recent development, not much actual implementation code is available beyond the papers. That’s why I want to contribute my implementation of this beautiful and efficient algorithm as Open Source.

Elias-Fano compression uses delta coding, also known as gap compression. It is an invertible transformation that maps large absolute integers to smaller gaps (deltas), which require fewer bits. The list of integers is sorted and then the delta values (gaps) between consecutive values are calculated. As the deltas are never larger than the absolute values, and usually much smaller, we can encode them with fewer bits and thus achieve compression (this is true for any delta compression).

Elias-Fano, like any gap compression, requires the list to be sorted. It is therefore applicable only when the original order of the elements carries no meaning and can be discarded.

Elias-Fano encodes gaps up to the average distance with fewer bits than the rare outliers above it, by splitting the encoding of each gap into low bits and high bits (with n the number of integers and u their upper bound):

  • The low bits, l = log2(u/n) per element, are stored explicitly.
  • The remaining high bits are stored in unary coding.

This requires at most 2 + log2(u/n) bits per element and is quasi-succinct: less than half a bit per element away from the succinct bound!

Inverted index and posting lists
While Elias-Fano encoding can be used to compress any sorted list of integers, a typical application in information retrieval is compressing the posting lists of inverted indexes at the core of a search engine. Hence, here is a short recap of both posting lists and inverted indexes:

An Inverted Index is the central data structure of most information retrieval systems and search engines. It maps terms of a vocabulary V to their locations of occurrences in a collection of documents:

  • A dictionary maps each term of the vocabulary to a separate posting list.
  • A posting list is a list of the document ids (DocID) of all documents where the term appears in the text.

A document id is the index number of a document within a directory of all documents. A document id is usually represented by an integer, hence posting lists are lists of integers. A 32-bit unsigned integer allows the addressing of 4,294,967,296 (about 4.3 billion) documents, while a 64-bit unsigned integer allows the addressing of 18,446,744,073,709,551,616 (about 18 quintillion) documents.

Of course we could use posting lists with URLs instead of DocIDs. But URLs (77 bytes on average) take much more memory than DocIDs (4 or 8 bytes), and a document is referred to from the posting lists of all terms (300 terms/document on average) contained within the document text.
For 1 billion pages it would be 300 * 77 bytes * 1 billion = 23 TB for posting lists of URLs vs. (77 + 300 * 4) bytes * 1 billion = 1,277 GB for posting lists of DocIDs (with each URL stored only once in the document directory), which is a factor of 18.

Index time and retrieval time

At index time the inverted index is created. After a crawler fetches the documents (web pages) from the web, they are parsed into single terms (any HTML markup is stripped). Duplicate terms are removed (unless you create a positional index, where the position of every occurrence of a word within a page is stored). During this step the term frequency (number of occurrences of a term within a document) can also be counted for TF/IDF ranking.

For each term of a document the document id is inserted into the posting list of that term. Static inverted indexes are built once (e.g. with MapReduce) and never updated. Dynamic inverted indexes can be updated incrementally in real time or in batches (e.g. with MapReduce).

At search time for each term contained in the query the corresponding posting list is retrieved from the inverted index.

Boolean queries are performed by intersecting the posting lists of multiple terms, so that only those DocIDs are added to the result list which occur in all of the posting lists (AND) or in any of the posting lists (OR).
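The AND and OR operations described above can be sketched as a linear merge of two sorted posting lists. This is a minimal illustration rather than the implementation of any particular search engine; the class and method names (PostingListSketch, Intersect, Union) are hypothetical, and the lists are assumed to be sorted and free of duplicates.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of the released source code.
public static class PostingListSketch
{
    // AND: DocIDs that occur in both sorted, duplicate-free posting lists.
    public static List<uint> Intersect(IReadOnlyList<uint> a, IReadOnlyList<uint> b)
    {
        var result = new List<uint>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] == b[j]) { result.Add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;   // advance the list with the smaller DocID
            else j++;
        }
        return result;
    }

    // OR: DocIDs that occur in at least one of the two posting lists.
    public static List<uint> Union(IReadOnlyList<uint> a, IReadOnlyList<uint> b)
    {
        var result = new List<uint>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] == b[j]) { result.Add(a[i]); i++; j++; }
            else if (a[i] < b[j]) result.Add(a[i++]);
            else result.Add(b[j++]);
        }
        while (i < a.Count) result.Add(a[i++]);   // copy the remaining tail
        while (j < b.Count) result.Add(b[j++]);
        return result;
    }
}
```

Both operations touch every element at most once, so the cost is linear in the combined length of the two lists.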

Performance and scaling

Implementing an inverted index seems pretty straightforward.
But when it comes to billions of documents, an index which has to be updated in real time, and queries of many concurrent users that require very low response times, we have to give more thought to the data structures and the implementation.

Low memory consumption and fast response time are key performance indicators of inverted indexes and information retrieval systems (the latter with additional KPIs such as precision, recall and ranking). Posting lists are responsible for most of the memory consumption in an inverted index (apart from the storage of the documents themselves). Therefore the reduction of the memory consumption of posting lists is paramount. Some believe that index compression is one of the main differences between a “toy” indexer and one that works on real-world collections.

Posting list compression

This can be achieved by different posting list compression algorithms. All of the compression algorithms listed below use an invertible transformation that maps the large integers of the DocIDs to smaller integers, which require fewer bits.

Posting list compression reduces the size, thus either less memory is required for a certain number of documents or more documents can be indexed in a certain amount of memory. Also, by reducing the size of a posting list, storing and accessing the posting list in much faster RAM becomes feasible instead of storing and retrieving the posting list from slower hard disk or SSD. This leads to faster indexing and query response times.

Posting list compression comes at the cost of additional compression time (index time) and decompression time (query time). When comparing the performance of different compression algorithms and implementations, the triple of compression ratio, compression time and decompression time should always be considered.

For efficient query processing and intersection (with techniques such as skipping) the compression algorithm should support direct access with only partial decompression of the posting list.

Posting list compression algorithms

bitstuffing/bitpacking
Instead of being fixed-size (32 or 64 bits per value), integer values can have any size. The number of bits per DocID is chosen as small as possible, but large enough that the largest DocID can still be encoded. Storing 17-bit integers with bitpacking achieves a 47% reduction compared to unsigned 32-bit integers (17/32 ≈ 53% of the original size). There is a speed penalty when the number of bits per DocID is not a multiple of 8 and values therefore cross byte boundaries.
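To make the idea concrete, here is a minimal bitpacking sketch. It is an illustration under stated assumptions, not the layout used by any specific library: values are packed into 64-bit words, lowest bits first, the bit width is at most 32, and every value is assumed to fit into the chosen width. All names (BitPackingSketch, BitsNeeded, Pack, Get) are hypothetical.

```csharp
// Hypothetical helper, not part of the released source code.
public static class BitPackingSketch
{
    // Smallest bit width (at least 1) that can still represent maxValue.
    public static int BitsNeeded(uint maxValue)
    {
        int bits = 1;
        while (bits < 32 && (maxValue >> bits) != 0) bits++;
        return bits;
    }

    // Packs the values back to back using 'bits' bits each (1..32).
    // Assumption: every value fits into 'bits' bits.
    public static ulong[] Pack(uint[] values, int bits)
    {
        var packed = new ulong[((long)values.Length * bits + 63) / 64];
        long bitPos = 0;
        foreach (uint v in values)
        {
            int word = (int)(bitPos >> 6), offset = (int)(bitPos & 63);
            packed[word] |= (ulong)v << offset;
            // The value crosses a word border: store the spilled high bits in the next word.
            if (offset + bits > 64) packed[word + 1] |= (ulong)v >> (64 - offset);
            bitPos += bits;
        }
        return packed;
    }

    // Direct access to the value at 'index' without unpacking anything else.
    public static uint Get(ulong[] packed, int bits, int index)
    {
        long bitPos = (long)index * bits;
        int word = (int)(bitPos >> 6), offset = (int)(bitPos & 63);
        ulong v = packed[word] >> offset;
        if (offset + bits > 64) v |= packed[word + 1] << (64 - offset);
        return (uint)(v & ((1UL << bits) - 1));
    }
}
```

For a largest DocID of 131,071 the sketch chooses 17 bits, which corresponds to the 47% reduction mentioned above.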

binary packing/frame of reference (FOR)
Similar to bitpacking, but the posting list is partitioned into smaller blocks first. Then each block is compressed separately. Adapting the number of bits to the range of each block individually allows a more effective compression, but comes at the cost of increased overhead as minimum value, length of block, and number of bits/DocID need to be stored for each block.

Patched frame of reference (PFOR)
Similar to frame of reference, but within each block those DocIDs are identified that, as outliers, unnecessarily expand the value range, which leads to more bits/DocID and thus prevents effective compression. The outlier DocIDs are then encoded separately.

delta coding / gap compression
The DocIDs of a posting list are sorted and then the delta values (gaps) between consecutive DocIDs are calculated. As the deltas are never larger than the absolute values, and usually much smaller, we can encode them with fewer bits and thus achieve compression.
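A minimal sketch of delta coding and its inverse, assuming a sorted array of DocIDs. The names DeltaCodingSketch, Encode and Decode are hypothetical and only serve the illustration; the first gap is taken relative to 0.

```csharp
// Hypothetical helper, not part of the released source code.
public static class DeltaCodingSketch
{
    // Replaces each DocID by its distance to the previous DocID (the first one relative to 0).
    public static uint[] Encode(uint[] sortedDocIds)
    {
        var gaps = new uint[sortedDocIds.Length];
        uint previous = 0;
        for (int i = 0; i < sortedDocIds.Length; i++)
        {
            gaps[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return gaps;
    }

    // Inverse transformation: a running sum of the gaps restores the DocIDs.
    public static uint[] Decode(uint[] gaps)
    {
        var docIds = new uint[gaps.Length];
        uint current = 0;
        for (int i = 0; i < gaps.Length; i++)
        {
            current += gaps[i];
            docIds[i] = current;
        }
        return docIds;
    }
}
```

For example, the DocIDs 1000, 1003, 1010, 1100 become the gaps 1000, 3, 7, 90. The gaps alone save nothing; the saving comes from storing them with a variable number of bits afterwards.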

Elias-Fano coding

The most efficient of the Elias compression family, and quasi-succinct, which means it is almost as good as the theoretically best possible compression scheme. It can still be further improved by splitting the posting list into blocks and compressing them individually (partitioned Elias-Fano coding).
It compresses gaps of sorted integers (DocIDs): given n (number of DocIDs) and u (maximum DocID value = number of indexed docs) we have a strictly increasing sequence 0 ≤ x0 < x1 < x2 < … < xn-1 ≤ u, i.e. no duplicate DocIDs are allowed and all deltas are strictly positive (no zero allowed).

Elias-Fano encodes gaps up to the average distance with fewer bits than the rare outliers above it, by splitting the encoding of each gap into low bits and high bits:

  • The low bits, l = log2(u/n) per element, are stored explicitly.
  • The remaining high bits are stored in unary coding.

This requires at most 2 + log2(u/n) bits per element and is quasi-succinct: less than half a bit per element away from the succinct bound! The compression ratio depends highly (and solely) on the average delta between DocIDs/items (delta = gap = value range/number of values = number of indexed docs/posting list length):

  • 1 billion docs / 10 million DocIDs = 100 (delta) = 8.6 bits/DocID max (8.38 real)
  • 1 billion docs / 100 million DocIDs = 10 (delta) = 5.3 bits/DocID max (4.76 real)
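The split into low and high bits can be sketched as follows. This follows the textbook formulation of Elias-Fano (low bits of every value stored verbatim, high parts stored as unary-coded increments) and is not the released implementation. For readability the bits are kept in an unpacked uint[] and a List<bool> instead of real bit arrays, and l is taken as floor(log2(u/n)). All names are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical, deliberately unoptimized illustration of the encoding scheme.
public sealed class EliasFanoSketch
{
    public readonly int LowBitCount;                          // l = floor(log2(u / n))
    public readonly uint[] LowBits;                           // low l bits of every value
    public readonly List<bool> HighBits = new List<bool>();   // unary-coded high parts, one bool per bit

    public EliasFanoSketch(uint[] sortedValues, uint universe)
    {
        int n = sortedValues.Length;
        int l = 0;
        while (n > 0 && ((ulong)n << (l + 1)) <= universe) l++;   // largest l with n * 2^l <= u
        LowBitCount = l;

        uint lowMask = (uint)((1UL << l) - 1);
        LowBits = new uint[n];
        uint previousHigh = 0;
        for (int i = 0; i < n; i++)
        {
            LowBits[i] = sortedValues[i] & lowMask;
            uint high = sortedValues[i] >> l;
            // Unary coding: one 0 bit per increment of the high part, then a 1 bit for the element.
            for (uint h = previousHigh; h < high; h++) HighBits.Add(false);
            HighBits.Add(true);
            previousHigh = high;
        }
    }

    public uint[] Decode()
    {
        var result = new uint[LowBits.Length];
        uint high = 0;
        int i = 0;
        foreach (bool bit in HighBits)
        {
            if (bit) { result[i] = (high << LowBitCount) | LowBits[i]; i++; }
            else high++;
        }
        return result;
    }

    // Size the encoding would occupy if both parts were stored as packed bits.
    public long SizeInBits()
    {
        return (long)LowBits.Length * LowBitCount + HighBits.Count;
    }
}
```

For the seven values 2, 3, 5, 7, 11, 13, 24 with u = 24 the sketch uses l = 1 and 26 bits in total, about 3.7 bits per element, compared to 2 + log2(24/7) ≈ 3.8.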

Papers:
http://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pdf
http://shonan.nii.ac.jp/seminar/029/wp-content/uploads/sites/12/2013/07/Sebastiano_Shonan.pdf
http://www.di.unipi.it/~ottavian/files/elias_fano_sigir14.pdf
http://hpc.isti.cnr.it/hpcworkshop2014/PartitionedEliasFanoIndexes.pdf

Implementation specifics

Because the algorithm itself is quite straightforward, but is used for huge posting lists, the optimization potential lies in a careful implementation rather than in optimizing the algorithm itself.
This includes reusing predefined arrays instead of dynamically creating and growing Lists, avoiding if/then branches so that the processor can execute the code efficiently, using basic types instead of objects, using plain variables instead of indexed array cells, and generally shaving the cost off every single operation.

Algorithm-wise, a translation table is used to decode the high bits of up to 8 DocIDs, which may be contained within a single byte, in parallel.
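One way such a translation table could look is sketched below. This is an assumption about the general technique, not the table layout of the released code: for every possible byte value the positions of its set bits are precomputed, so the unary-coded high bits can be decoded byte by byte instead of bit by bit. The bit order (lowest bit first) and all names are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of a byte-wise lookup table for unary-coded high bits.
public static class UnaryLookupSketch
{
    // For every possible byte value: the positions (0..7) of its set bits, lowest first.
    public static readonly byte[][] SetBitPositions = BuildTable();

    private static byte[][] BuildTable()
    {
        var table = new byte[256][];
        for (int b = 0; b < 256; b++)
        {
            var positions = new List<byte>();
            for (byte bit = 0; bit < 8; bit++)
                if ((b & (1 << bit)) != 0) positions.Add(bit);
            table[b] = positions.ToArray();
        }
        return table;
    }

    // Decodes the high parts of all elements from a unary bit stream (0 = increment, 1 = element).
    public static List<uint> DecodeHighParts(byte[] highBits)
    {
        var result = new List<uint>();
        for (int i = 0; i < highBits.Length; i++)
        {
            // High part of the k-th element = number of 0 bits before its 1 bit
            // = (bit index of the 1 bit) - k, where k is the number of elements decoded so far.
            foreach (byte p in SetBitPositions[highBits[i]])
                result.Add((uint)(i * 8 + p - result.Count));
        }
        return result;
    }
}
```

A byte can contain up to 8 set bits, so a single table lookup yields the high parts of up to 8 elements.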

Posting List Compression Benchmark

The benchmark evaluates how well the Elias-Fano algorithm and our implementation perform for different posting list sizes, numbers of indexed documents and average deltas with respect to the key performance indicators (KPI) compression ratio, compression time and decompression time.

We are using synthetic data for the following reasons: even for web-scale Big Data it is easy and fast to obtain and to exchange without legal restrictions, its properties are easier to understand and to adapt to specific requirements, and it does not need to be stored but can be recreated on demand or on the fly. As the creation of massive test data is often faster than loading it from disk, data creation has less influence on the benchmark. Creation on the fly makes huge test sets possible that would not fit into RAM as a whole.

number of DocIDs (posting list length) | indexed docs | delta (*) | uncompressed size in bytes (**) | compressed size in bytes | bits/DocID calculated | bits/DocID measured | compression ratio | compression time | decompression time
10 | 1 billion | 100,000,000 | 40 | 41 | 28.50 | 32.80 | 0.98 | 0 ms | 0 ms
100 | 1 billion | 10,000,000 | 400 | 315 | 25.25 | 25.20 | 1.27 | 0 ms | 0 ms
1,000 | 1 billion | 1,000,000 | 4,000 | 2,686 | 21.93 | 21.49 | 1.49 | 0 ms | 0 ms
10,000 | 1 billion | 100,000 | 40,000 | 22,610 | 18.61 | 18.09 | 1.77 | 0 ms | 0 ms
100,000 | 1 billion | 10,000 | 400,000 | 184,855 | 15.29 | 14.79 | 2.16 | 1 ms | 1 ms
1,000,000 | 1 billion | 1,000 | 4,000,000 | 1,436,895 | 11.97 | 11.50 | 2.78 | 12 ms | 7 ms
10,000,000 | 1 billion | 100 | 40,000,000 | 10,134,762 | 8.64 | 8.11 | 3.95 | 99 ms | 80 ms
100,000,000 | 1 billion | 10 | 400,000,000 | 59,448,464 | 5.32 | 4.76 | 6.73 | 1,013 ms | 795 ms
1,000,000,000 | 1 billion | 1 | 4,000,000,000 | 125,000,006 | 2.00 | 1.00 | 32.00 | 6,298 ms | 6,748 ms

(*) Delta d is the distance between two DocIDs in a sorted posting list of a certain term. Delta d depends on the length l of the posting list and the number of indexed pages (delta = gap = value range/number of values = number of indexed docs/posting list length). This also means that the term occurs every d pages, e.g. if delta d=10 then the term occurs on every 10th page. Delta is the only factor which determines the compression ratio (compressibility).

(**) 32 bit unsigned integer = 4 Byte/DocID.

Hardware: Intel Core i7-6700HQ (4 core, up to 3.50 GHz) 16 GB DDR4 RAM
Software: Windows 10 64-Bit, .NET Framework 4.6.1
Tests were executed in a single thread; multiple threads would be used in a multi-user/multi-query scenario.

Index compression estimation

The compression ratio highly (and only) depends on the average delta between DocIDs/values (delta = gap = value range/number of values = number of indexed docs/posting list length). For frequent terms the average delta between DocIDs is smaller and the compression ratio higher (few bits/DocID), for rare terms the average delta between DocIDs is higher and the compression ratio lower (more bits/DocID). Therefore we need to know the term frequency (and thus the average delta between DocIDs of that term) for every term of the whole corpus to be indexed.

In order to calculate the compression ratio and the size of the whole compressed index (= the sum of all compressed posting lists, and not only the size of a single posting list) we have to take into account the distribution of the lengths of the posting lists or, respectively, the distribution of the deltas within the posting lists. Word frequencies in natural language follow Zipf’s law.

Zipf’s Law, Heaps’ Law and Long tail

Zipf’s Law states that the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. In an English corpus the word “the” is the most frequently occurring word, which accounts for 6% of all word occurrences. The second-place word “of” accounts for 3% of words (1/2 of “the”), followed by “and” with 2% (1/3 of “the”).

The probability Pr of a term of rank r can be calculated with the following formula:
Pr = P1 * 1/r, where P1 is the probability of the most frequent term, which is between 0.06 and 0.1 in English depending on the corpus. Phil Goetz states P1 as a function of the vocabulary V (the number of distinct words): P1 = 1/ln(1.78*V)
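The two formulas above can be sketched in a few lines. The names ZipfSketch, MostFrequentTermProbability and TermProbability are hypothetical. As a plausibility note: the constant 1.78 appears consistent with e^γ ≈ 1.781 (γ being the Euler-Mascheroni constant), which is what results if the probabilities P1/r of all V ranks are required to sum to 1 and the harmonic sum is approximated by ln V + γ.

```csharp
using System;

// Hypothetical helper illustrating the Zipf formulas from the text.
public static class ZipfSketch
{
    // P1 = 1 / ln(1.78 * V), the estimate attributed to Phil Goetz in the text.
    public static double MostFrequentTermProbability(double vocabularySize)
    {
        return 1.0 / Math.Log(1.78 * vocabularySize);
    }

    // Pr = P1 / r for a term of rank r (r >= 1).
    public static double TermProbability(int rank, double p1)
    {
        return p1 / rank;
    }
}
```

For a vocabulary of 1 million distinct words this gives P1 ≈ 0.07, which lies within the range of 0.06 to 0.1 stated above.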

In the Oxford English Corpus the following probabilities are observed (with P1≈0.09):

Vocabulary size | % of content | Examples
10 | 25% | the, of, and, to, that, have
100 | 50% | from, because, go, me, our, well, way
1000 | 75% | girl, win, decide, huge, difficult, series
7000 | 90% | tackle, peak, crude, purely, dude, modest
50,000 | 95% | saboteur, autocracy, calyx, conformist
>1,000,000 | 99% | laggardly, endobenthic, pomological

The vocabulary size v is the number of terms with rank r <= v. All probabilities derived from Zipf’s law are approximations which differ between different corpora and languages.

Zipf’s Law is based on the harmonic series (1 + 1/2 + 1/3 + . . . + 1/n). The divergence of the harmonic series was already proved in 1360 by Nicole Oresme and later by Jakob Bernoulli. That means that the sum of the series H(n) → ∞ for n → ∞.
An approximation for the sum is H(n) ≈ ln n + γ, where γ is the Euler-Mascheroni constant with a value of 0.5772156649…

Heaps’ law is an empirical law which describes the number of distinct words (Vocabulary V) in a text (document or set of documents) as a function of the text length:
V = Kn^b
where the vocabulary V is the number of distinct words in an instance text of size n. K and b are free parameters determined empirically. With English text corpora, typically K is between 10 and 100, and b is between 0.4 and 0.6.
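As a small worked example of Heaps’ law, the following sketch evaluates V = K * n^b. The class and method names are hypothetical, and the parameter values in the usage note are merely picked from the ranges given above, not measured.

```csharp
using System;

// Hypothetical helper illustrating Heaps' law.
public static class HeapsLawSketch
{
    // V = K * n^b: expected number of distinct words in a text of n words.
    public static double VocabularySize(double textLength, double k, double b)
    {
        return k * Math.Pow(textLength, b);
    }
}
```

With K = 10 and b = 0.5 a text of 1 million words would be expected to contain about 10 * 1,000 = 10,000 distinct words.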

Zipf’s law on word frequency and Heaps’ law on the growth of distinct words are observed in the Indo-European language family, but they do not hold for languages like Chinese, Japanese and Korean.

The long tail is the name for a long-known feature of some statistical distributions (such as Zipf, power laws, Pareto distributions and general Lévy distributions). In “long-tailed” distributions a high-frequency or high-amplitude population is followed by a low-frequency or low-amplitude population which gradually “tails off” asymptotically. The events at the far end of the tail have a very low probability of occurrence.

As a rule of thumb, for such population distributions the majority of occurrences (more than half, and where the Pareto principle applies, 80%) are accounted for by the first 20% of items in the distribution. What is unusual about a long-tailed distribution is that the most frequently occurring 20% of items represent less than 50% of occurrences; or in other words, the least frequently occurring 80% of items are more important as a proportion of the total population.

Sources
https://en.wikipedia.org/wiki/Long_tail
https://en.wikipedia.org/wiki/Zipf%27s_law
https://moz.com/blog/illustrating-the-long-tail
https://blogemis.com/2015/09/26/zipfs-law-and-the-math-of-reason/
http://mathworld.wolfram.com/ZipfsLaw.html
http://www.cs.sfu.ca/CourseCentral/456/jpei/web%20slides/L06%20-%20Text%20statistics.pdf

Synthetic posting list creation

The distribution of the lengths of the posting lists follows Zipf’s law. But we have to distinguish between positional and non-positional posting lists:

  • Positional posting lists can contain multiple postings per document. Frequent terms occur multiple times per document, and each occurrence is stored together with its position within the document. This structure is helpful for supporting phrase and proximity queries.
  • Non-positional posting lists store only one posting per document. They record in which documents the term occurs at least once; the positions of the occurrences within the document are not stored.

For the creation of synthetic posting lists we need to calculate the length l of the posting list for every term. For positional posting lists we can use the following formula:

Posting list length (for term of rank r): l = postings * MostFrequentTermProbability / r

where

  • postings = indexedDocs * uniqueTermsPerDoc
  • indexedDocs is the number of all documents in the corpus to be indexed
  • uniqueTermsPerDoc is about 300
  • MostFrequentTermProbability is about 0.06 in an English corpus
  • r is the rank of the term in the frequency table. The term frequencies in the table are distributed according to Zipf’s law.
  • postings is the number of all DocIDs in the index, which is the same as the sum of all posting list lengths in the index.
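The formula for positional posting lists translates directly into code. A minimal sketch, in which the class and method names are hypothetical and the parameter names simply mirror the terms used above:

```csharp
// Hypothetical helper illustrating the positional posting list length formula.
public static class PositionalPostingListSketch
{
    // l = postings * MostFrequentTermProbability / r, with postings = indexedDocs * uniqueTermsPerDoc.
    public static double Length(long indexedDocs, int uniqueTermsPerDoc,
                                double mostFrequentTermProbability, int rank)
    {
        double postings = (double)indexedDocs * uniqueTermsPerDoc;
        return postings * mostFrequentTermProbability / rank;
    }
}
```

For 1 billion indexed documents, 300 terms per document and a probability of 0.06, the term of rank 100 gets a posting list of about 180 million postings.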

For non-positional posting lists we can use the following formulas:

Posting list length (for term with rank r): l = Pterm_r_in_doc * indexedDocs

where

  • probability of the term with rank r (Zipf’s law): Pterm_r = MostFrequentTermProbability / r
  • probability of a term with any rank other than r: Pterm_not_r = 1 - Pterm_r
  • probability that the term with rank r does not occur in a doc (with t terms per doc): Pterm_r_not_in_doc = Pterm_not_r ^ t
  • probability that the term with rank r occurs at least once in a doc: Pterm_r_in_doc = 1 - Pterm_r_not_in_doc

The posting list length is bounded by l <= indexedDocs. This is because even if frequent terms occur multiple times within a document, in a non-positional index they are indexed only once per document. For each posting list we then create DocIDs up to the calculated posting list length. The value of each DocID is randomly selected between 1 and indexedDocs. We have to prevent duplicate DocIDs within a posting list, e.g. by using a hash set to check whether a DocID already exists or not.
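The non-positional formulas and the duplicate-free random DocID generation can be sketched as follows. This is only an illustration of the steps described above, not the released ZipDistributedPostingListGenerator; all names are hypothetical. The rejection loop is simple but becomes slow when the requested length approaches indexedDocs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the non-positional formulas and DocID generation.
public static class NonPositionalPostingListSketch
{
    // l = (1 - (1 - P1 / r) ^ t) * indexedDocs
    public static double ExpectedLength(int indexedDocs, int termsPerDoc, double p1, int rank)
    {
        double pTerm = p1 / rank;                               // Zipf probability of the term
        double pNotInDoc = Math.Pow(1.0 - pTerm, termsPerDoc);  // term is absent from all t positions
        return (1.0 - pNotInDoc) * indexedDocs;
    }

    // Creates 'length' distinct random DocIDs in the range 1..indexedDocs, sorted ascending.
    public static uint[] RandomPostingList(int length, int indexedDocs, Random random)
    {
        if (length < 0 || length > indexedDocs || indexedDocs == int.MaxValue)
            throw new ArgumentException("length must be between 0 and indexedDocs");
        var seen = new HashSet<uint>();                          // rejects duplicate DocIDs
        while (seen.Count < length)
            seen.Add((uint)random.Next(1, indexedDocs + 1));     // upper bound is exclusive
        var docIds = new uint[seen.Count];
        seen.CopyTo(docIds);
        Array.Sort(docIds);
        return docIds;
    }
}
```

For 1 billion documents, 300 terms per document and a probability of 0.06, the term of rank 100 is expected in roughly 165 million documents, while the term of rank 1 is expected in practically every document.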

Real-world posting list data

While we are using synthetic data it is also possible to use real-world data for testing. There are several data sets available, although for some you have to pay:

Wikipedia dump

Gov2: TREC 2004 Terabyte Track test collection, consisting of 25 million .gov web pages crawled in early 2004 (24,622,347 docs, 35,636,425 terms, 5,742,630,292 postings)

ClueWeb09: ClueWeb 2009 TREC Category B collection, consisting of 50 million English web pages crawled between January and February 2009 (50,131,015 docs, 92,094,694 terms, 15,857,983,641 postings)

The last two data sets are also available for free in a processed, anonymized form without term names.

While in synthetic data the DocIDs are usually random, in real-world data sets the cluster properties of DocIDs (some terms are more dense in some parts of the collection than in others, because the pages of a domain have been indexed consecutively) can be exploited. This may lead to additional compression.

Stop words and the resolving power of terms

H. P. Luhn wrote in 1958 in the IBM Journal about the “resolving power of significant words”, featuring a word frequency diagram with the word frequencies distributed according to Zipf’s law. There he defined a lower and an upper cut-off limit for word frequencies, where only within that “sweet spot” the words were significant and had resolving or discriminatory power in queries. The terms outside the two limits would be excluded as non-significant, being too common or too rare. For the 20 most frequent terms this is very easy to comprehend: they appear in almost all documents of the collection, and the results would stay the same whether or not those terms are in the query.

Excluding the most frequent words as irrelevant resembles the concept of stop words. If we look at the 100 most common words in English we can immediately see their low resolving power. If we exclude the 100 most common words we lose almost nothing in result quality, but can significantly improve indexing performance and save space (50% for the Oxford English Corpus).

For 1 billion documents with 300 unique terms each we would spare 50 billion DocIDs to be indexed. The posting list for the most frequent term “the” alone would contain 1 billion DocIDs, and the posting list for the 100th most frequent term “us” would still contain 180 million DocIDs.

Of course we have to be careful when dealing with meaningful combinations of frequent words such as “The Who” or “Take That”.

Index Compression Benchmark

The benchmark evaluates how well the Elias-Fano algorithm and our implementation perform for different numbers of indexed documents with respect to the key performance indicators (KPI) compression ratio, compression time and decompression time. This time we are benchmarking the whole index (all documents from a corpus are indexed) instead of single posting lists.

Again we are using synthetic data for the reasons stated above.

indexed pages | vocabulary | uncompressed size (**) | compressed size | bits/DocID calculated | bits/DocID measured | compression ratio | compression time | decompression time
1 million | 1 billion | 1,200 MB
10 million | 1 billion | 12 GB
100 million | 1 billion | 120 GB
1 billion | 1 billion | 1,200 GB

(*) average word length, vocabulary, including/excluding 100 most frequent words (stop words). Do not contribute to meaningful results (paper)

(**) Uncompressed index size = 300 unique words/page * number of indexed pages * byte/DocID (32 bit unsigned integer = 4 Byte/DocID)

Hardware: Intel Core i7-6700HQ (6MB Cache, up to 3.50 GHz) 16 GB DDR4 RAM
Software: Windows 10 64-Bit, .NET Framework 4.6.1

Compressed intersection

Over 70% of web queries contain multiple query terms. For those Boolean queries, intersecting the posting lists of all query terms is required. When posting lists are compressed, they need to be decompressed before or during the intersection.

Naive approach: decompress the whole posting lists for each query term and keep them during the intersection in RAM. This leads to high decompression time and memory consumption.

Improved approach: decompress only the currently compared items of the posting lists on the fly and discard them immediately after comparison. Terminate the decompression and intersection as soon as top-k ranked results are retrieved.
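The improved approach can be sketched with lazily decoded posting lists. To keep the example short, simple delta coding stands in for Elias-Fano here, and "top-k" is simplified to "the first k matches in DocID order", which only corresponds to ranked results under the assumption that DocIDs were assigned in rank order. All names are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical illustration: intersection with on-the-fly decompression and early termination.
public static class CompressedIntersectionSketch
{
    // Lazily decodes a delta-coded posting list: only the current DocID is kept in memory.
    private static IEnumerable<uint> DecodeLazily(uint[] gaps)
    {
        uint docId = 0;
        foreach (uint gap in gaps)
        {
            docId += gap;
            yield return docId;
        }
    }

    // Intersects two delta-coded posting lists and stops as soon as k matches are found.
    public static List<uint> IntersectTopK(uint[] gapsA, uint[] gapsB, int k)
    {
        var result = new List<uint>();
        using (var a = DecodeLazily(gapsA).GetEnumerator())
        using (var b = DecodeLazily(gapsB).GetEnumerator())
        {
            bool hasA = a.MoveNext(), hasB = b.MoveNext();
            while (hasA && hasB && result.Count < k)
            {
                if (a.Current == b.Current)
                {
                    result.Add(a.Current);
                    hasA = a.MoveNext();
                    hasB = b.MoveNext();
                }
                else if (a.Current < b.Current) hasA = a.MoveNext();
                else hasB = b.MoveNext();
            }
        }
        return result;
    }
}
```

Neither list is ever fully decompressed: each decoded DocID is discarded as soon as the comparison moves on, and the remaining part of both lists is not decoded at all once k results have been collected.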

Github

The C# implementation of the Elias-Fano compression is released on GitHub as Open Source under the GNU Lesser General Public License (LGPL):
https://github.com/wolfgarbe/EliasFanoCompression

  • EliasFanoInitTable
  • EliasFanoCompress
  • EliasFanoDecompress
  • SortedRandomIntegerListGenerator: generates a sorted list of random integers from 2 parameters: number of items (length of posting list), range of items (number of indexed pages)
  • ZipDistributedPostingListGenerator: generates a complete set of posting lists with Zipfian-distributed lengths (word frequency)

Very fast Data cleaning of product names, company names & street names

The correction of product names, company names, street names & addresses is a frequent task of data cleaning and deduplication. Often those names are misspelled, either due to OCR errors or mistakes of the human data collectors.

What makes this task different from ordinary spelling correction is that those names often consist of multiple words, white space and punctuation. For large data or even Big Data applications, speed is also very important.

Our algorithm supports both requirements and is up to 1 million times faster compared to conventional approaches (see benchmark). The C# source code is available as Open Source (see another blog post and GitHub). A simple modification of the original source code will add support for names with multiple words, white space and punctuation:

Instead of CreateDictionary("big.txt",""); (line 357), which parses a given text file into single words, simply use CreateDictionaryEntry("company/street/product name", "") to add company, street & product names to the dictionary.

Then with Correct("misspelled street",""); you will get the correct street name from the dictionary. In lines 35..38 you may specify whether you want only the best match or all matches within a certain edit distance (number of character operations difference):

35 private static int verbose = 0;
36 //0: top suggestion
37 //1: all suggestions of smallest edit distance
38 //2: all suggestions <= editDistanceMax (slower, no early termination)

For every similar term (or phrase) found in the dictionary the algorithm gives you the Damerau-Levenshtein edit distance to your input term (look for suggestion.distance in the source code). The edit distance describes how many characters have been added, deleted, altered or transposed between the input term and the dictionary term. This is a measure of similarity between the input term (or phrase) and similar terms (or phrases) found in the dictionary.
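To illustrate how such a distance is computed, here is a minimal sketch of the optimal string alignment variant of the Damerau-Levenshtein distance (adjacent transpositions count as one operation, and no substring is edited more than once). It is a textbook dynamic programming formulation and not the function from the released source code; the names are hypothetical.

```csharp
using System;

// Hypothetical textbook implementation, not the function from the released source code.
public static class EditDistanceSketch
{
    public static int DamerauLevenshtein(string s, string t)
    {
        // d[i, j] = distance between the first i characters of s and the first j characters of t
        var d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;   // i deletions
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;   // j insertions

        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                int best = Math.Min(Math.Min(d[i - 1, j] + 1,      // deletion
                                             d[i, j - 1] + 1),     // insertion
                                    d[i - 1, j - 1] + cost);       // match or substitution
                // Transposition of two adjacent characters
                if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1])
                    best = Math.Min(best, d[i - 2, j - 2] + 1);
                d[i, j] = best;
            }
        }
        return d[s.Length, t.Length];
    }
}
```

The full matrix is kept for clarity; an optimized version would only need the last three rows.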

Fast approximate string matching with large edit distances in Big Data

1000x faster

1 million times faster spelling correction for edit distance 3
After my blog post on 1000x faster spelling correction got more than 50,000 views, I revisited both the algorithm and the implementation to see if they could be further improved.

While the basic idea of the Symmetric Delete spelling correction algorithm remains unchanged, the implementation has been significantly improved to unleash the full potential of the algorithm.

This results in 10 times faster spelling correction, 5 times faster dictionary generation and 2 to 7 times less memory consumption in v3.0 compared to v1.6.

Compared to Peter Norvig’s algorithm it is now 1,000,000 times faster for edit distance=3 and 10,000 times faster for edit distance=2.

In Norvig’s tests 76% of spelling errors had an edit distance of 1, and 98.9% of spelling errors were covered with edit distance 2. For simple spelling correction of natural language with edit distance 2 the accuracy is good enough and the performance of Norvig’s algorithm is sufficient.

The speed of our algorithm enables edit distance 3 for spell checking and thus improves the accuracy by 1%. Beyond the accuracy improvement the speed advantage of our algorithm is useful for automatic spelling correction in large corpora as well as in search engines, where many requests in parallel need to be processed.

Billion times faster approximate string matching for edit distance > 4
But the true potential of the algorithm lies in edit distances > 3 and beyond spell checking.

Being many orders of magnitude faster, the algorithm opens up new application fields for approximate string matching and scales sufficiently for big data and real-time use. Our algorithm enables fast approximate string and pattern matching with long strings or feature vectors, huge alphabets and large edit distances, in very large databases, with many concurrent processes and real-time requirements.

Application fields:

  • Spelling correction in search engines, with many parallel requests
  • Automatic Spelling correction in large corpora
  • Genome data analysis
  • Matching DNA sequences
  • Browser fingerprint analysis
  • Realtime Image recognition (search by image, autonomous cars, medicine)
  • Face recognition
  • Iris recognition
  • Speech recognition
  • Voice recognition
  • Feature recognition
  • Fingerprint identification
  • Signature Recognition
  • Plagiarism detection (in music /in text)
  • Optical character recognition
  • Audio fingerprinting
  • Fraud detection
  • Address deduplication
  • Misspelled names recognition
  • Spectroscopy based chemical and biological material identification
  • File revisioning
  • Spam detection
  • Similarity search
  • Similarity matching
  • Approximate string matching
  • Fuzzy string matching
  • Fuzzy string comparison
  • Fuzzy string search
  • Pattern matching
  • Data cleaning
  • and many more

Edit distance metrics
While we are using the Damerau-Levenshtein distance for spelling correction, for other applications it could easily be exchanged for the Levenshtein distance or similar edit distances by simply modifying the respective function.

In our algorithm the speed of the edit distance calculation has only a very small influence on the overall lookup speed. That’s why we are using only a basic implementation rather than a more sophisticated variant.

Benchmark
Because of all the applications for approximate string matching beyond spell checking, we extended the benchmark to lookups with higher edit distances. That’s where the power of the symmetric delete algorithm truly shines and outperforms other solutions. With previous spell checking algorithms the required time explodes with larger edit distances.

Below are the results of a benchmark of our Symmetric Delete algorithm and Peter Norvig’s algorithm for different edit distances, each with 1000 lookups:

input term | best correction | edit distance | maximum edit distance | SymSpell (ms per 1000 lookups) | Peter Norvig (ms per 1000 lookups) | factor
marsupilamimarsupilami | no correction* | >20 | 9 | 568,568,000
marsupilamimarsupilami | no correction | >20 | 8 | 161,275,000
marsupilamimarsupilami | no correction | >20 | 7 | 37,590,000
marsupilamimarsupilami | no correction | >20 | 6 | 5,528,000
marsupilamimarsupilami | no correction | >20 | 5 | 679,000
marsupilamimarsupilami | no correction | >20 | 4 | 46,592
marsupilami | no correction | >4 | 4 | 459
marsupilami | no correction | >4 | 3 | 159 | 159,421,000 | 1:1,000,000
marsupilami | no correction | >4 | 2 | 31 | 257,597 | 1:8,310
marsupilami | no correction | >4 | 1 | 4 | 359 | 1:90
hzjuwyzacamodation | accommodation | 10 | 10 | 7,598,000
otuwyzacamodation | accommodation | 9 | 9 | 1,727,000
tuwyzacamodation | accommodation | 8 | 8 | 316,023
uwyzacamodation | accommodation | 7 | 7 | 78,647
wyzacamodation | accommodation | 6 | 6 | 19,599
yzacamodation | accommodation | 5 | 5 | 2,963
zacamodation | accommodation | 4 | 4 | 727
acamodation | accommodation | 3 | 3 | 180 | 173,232,000 | 1:962,000
acomodation | accommodation | 2 | 2 | 33 | 397,271 | 1:12,038
hous | house | 1 | 1 | 24 | 161 | 1:7
house | house | 0 | 1 | 1 | 3 | 1:3

*Correct or unknown word that is not in the dictionary, with no suggestions within the maximum edit distance. This is quite a common case (e.g. rare words, new words, domain-specific words, foreign words, names); in applications beyond spelling correction (e.g. fingerprint recognition) it might be the default case.

For the benchmark we used the C# implementation of SymSpell as well as a faithful C# port by Lorenzo Stoakes of Peter Norvig’s algorithm, which has been extended to support edit distance 3. Using C# implementations in both cases allows us to focus solely on the algorithm and should exclude language-specific bias.

Dictionary corpus:
The English text corpus used to generate the dictionary for the above benchmarks has a size of 6.18 MByte and contains 1,105,286 terms, of which 29,157 are unique; the longest term has 18 characters.
The dictionary size and the number of indexed terms have almost no influence on the average lookup time of O(1).

Speed gain
The speed advantage grows exponentially with the edit distance:

  • For an edit distance=1 it’s 1 order of magnitude faster,
  • for an edit distance=2 it’s 4 orders of magnitude faster,
  • for an edit distance=3 it’s 6 orders of magnitude faster,
  • for an edit distance=4 it’s 8 orders of magnitude faster.

Computational complexity and findings from benchmark
Our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size (but depending on the average term length and maximum edit distance), because our index is based on a Hash Table which has an average search time complexity of O(1).

Precalculation cost
In our algorithm we need auxiliary dictionary entries with precalculated deletes and their suggestions. While the number of auxiliary entries is significant compared to the 29,157 original entries, the dictionary size grows only sub-linearly with the edit distance: log(ed)

| maximum edit distance | number of dictionary entries (including precalculated deletes) |
|---|---|
| 20 | 11,715,602 |
| 15 | 11,715,602 |
| 10 | 11,639,067 |
| 9 | 11,433,097 |
| 8 | 10,952,582 |
| 7 | 10,012,557 |
| 6 | 8,471,873 |
| 5 | 6,389,913 |
| 4 | 4,116,771 |
| 3 | 2,151,998 |
| 2 | 848,496 |
| 1 | 223,134 |

The precalculation costs consist of additional memory usage and creation time for the auxiliary delete entries in the dictionary:

| cost | maximum edit distance | SymSpell | Peter Norvig | factor |
|---|---|---|---|---|
| memory usage | 1 | 32 MB | 229 MB | 1:7.2 |
| memory usage | 2 | 87 MB | 229 MB | 1:2.6 |
| memory usage | 3 | 187 MB | 230 MB | 1:1.2 |
| dictionary creation time | 1 | 3341 ms | 3640 ms | 1:1.1 |
| dictionary creation time | 2 | 4293 ms | 3566 ms | 1:0.8 |
| dictionary creation time | 3 | 7962 ms | 3530 ms | 1:0.4 |

Due to an efficient implementation those costs are negligible for edit distances <=3:

  • 7 times less memory requirement and a similar dictionary creation time (ed=1).
  • 2 times less memory requirement and a similar dictionary creation time (ed=2).
  • similar memory requirement and a 2 times higher dictionary creation time (ed=3).

Source code
The C# implementation of our Symmetric Delete Spelling Correction algorithm is released on GitHub as Open Source under the GNU Lesser General Public License (LGPL).

C# (original)
https://github.com/wolfgarbe/symspell

Ports
I have not tested whether the following third-party ports to other programming languages are exact ports, error-free, provide identical results, or are as fast as the original algorithm:

C++ (third party port)
https://github.com/erhanbaris/SymSpellPlusPlus

Go (third party port)
https://github.com/heartszhang/symspell
https://github.com/sajari/fuzzy

Java (third party port)
https://github.com/gpranav88/symspell

JavaScript (third party port)
https://github.com/itslenny/SymSpell.js
https://github.com/dongyuwei/SymSpell
https://github.com/IceCreamYou/SymSpell
https://github.com/Yomguithereal/mnemonist/blob/master/symspell.js

Python (third party port)
https://github.com/ppgmg/spark-n-spell-1/blob/master/symspell_python.py

Ruby (third party port)
https://github.com/PhilT/symspell

Swift (third party port)
https://github.com/Archivus/SymSpell

Comparison to other approaches and common misconceptions

A Trie as standalone spelling correction
Why don’t you use a Trie instead of your algorithm?
Tries have a search performance comparable to our approach. But a Trie is a prefix tree, which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but not applicable to spell checking. If the typing error is, e.g., in the first letter, then there is no common prefix, hence the Trie will not work for spelling correction.

A Trie as replacement for the hash table
Why don’t you use a Trie for the dictionary instead of the hash table?
Of course you could replace the hash table with a Trie (it is just an arbitrary lookup component of O(1) speed for a *single* lookup) at the cost of added code complexity, but without performance gain.
A HashTable is slower than a Trie only if there are collisions, which are unlikely in our case. For a maximum edit distance of 2 and an average word length of 5 and 100,000 dictionary entries we need to additionally store (and hash) 1,500,000 deletes. With a 32 bit hash (4,294,967,296 possible distinct hashes) the collision probability seems negligible.
With a good hash function even a similarity of terms (locality) should not lead to increased collisions, if not especially desired e.g. with Locality sensitive hashing.
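As a rough back-of-the-envelope estimate for the numbers above, assume an ideal hash function that spreads the k stored keys (original terms plus deletes) uniformly over all N = 2^32 hash values. This uniformity is an assumption for illustration, not a measured property of the implementation. The chance that a given key shares its hash value with any other stored key is then approximately:

```latex
p \approx \frac{k}{N}
  = \frac{100{,}000 + 1{,}500{,}000}{2^{32}}
  = \frac{1{,}600{,}000}{4{,}294{,}967{,}296}
  \approx 0.00037 \approx 0.037\%
```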

BK-Trees
Would BK-Trees be an alternative option?
Yes, but BK-Trees have a search time of O(log dictionary_size), whereas our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size.

Ternary search tree
Why don’t you use a ternary search tree?
The lookup time in a Ternary Search Tree is O(log n), while it is only O(1) in our solution. Also, while a Ternary Search Tree could be used for the dictionary lookup instead of a hash table, it doesn’t address the generation of spelling error candidates. And the tremendous reduction of the number of spelling error candidates to be looked up in the dictionary is the true innovation of our Symmetric Delete Spelling Correction algorithm.

Precalculation
Does the speed advantage simply come from the precalculation of candidates?
No! The speed is a result of the combination of all three components outlined below:

  • Pre-calculation, i.e. the generation of possible spelling error variants (deletes only) and storing them at index time is just the first precondition.
  • A fast index access at search time by using a hash table with an average search time complexity of O(1) is the second precondition.
  • But only our Symmetric Delete Spelling Correction on top of this allows us to bring this O(1) speed to spell checking, because it allows a tremendous reduction of the number of spelling error candidates to be pre-calculated (generated and indexed).
  • Applying pre-calculation to Norvig’s approach would not be feasible because pre-calculating all possible delete + transpose + replace + insert candidates of all terms would result in a huge time and space consumption.

Correction vs. Completion
How can I add auto completion similar to Google’s Autocompletion?
There is a difference between correction and suggestion/completion!

Correction: Find the correct word for a word which contains errors. Missing letters/errors can be at the start, in the middle, or at the end of the word. We can only find words within the maximum edit distance, as the computational complexity depends on the edit distance.

Suggestion/completion: Find the complete word for an already typed substring (prefix!). Missing letters can be only at the end of the word. We can find words/word combinations of any length, as the computational complexity is independent of edit distance and word length.

The code above implements only correction, but not suggestion/completion!
It still finds suggestions/completions equal/below the maximum edit distance, i.e. it starts to show words only if there are <= 2 letters missing (for maximum edit distance=2). Nevertheless the code can be extended to handle both correction and suggestion/completion. During the process of dictionary creation you have to add also all substrings (prefixes only!) of a word to the dictionary, when you are adding a new word to the dictionary. All substring entries of a specific term then have to contain a link to the complete term. Alternatively, for suggestion/completion you could use a completely different algorithm/structure like a Trie, which inherently lists all complete words for a given prefix.
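The prefix extension described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the implementation: the class PrefixCompletion and its members AddWord and Complete are hypothetical names, and the sketch keeps the prefixes in a separate hash table instead of the spelling dictionary.

```csharp
using System;
using System.Collections.Generic;

static class PrefixCompletion
{
    // Maps every prefix of an indexed word to the complete words it was created from.
    private static readonly Dictionary<string, List<string>> prefixIndex =
        new Dictionary<string, List<string>>();

    // Adds a word together with all of its prefixes (including the word itself).
    public static void AddWord(string word)
    {
        for (int length = 1; length <= word.Length; length++)
        {
            string prefix = word.Substring(0, length);
            List<string> completions;
            if (!prefixIndex.TryGetValue(prefix, out completions))
            {
                completions = new List<string>();
                prefixIndex.Add(prefix, completions);
            }
            // every prefix entry links back to the complete term
            if (!completions.Contains(word)) completions.Add(word);
        }
    }

    // Returns all indexed words starting with the typed prefix (empty list if none).
    public static List<string> Complete(string prefix)
    {
        List<string> completions;
        return prefixIndex.TryGetValue(prefix, out completions)
            ? completions
            : new List<string>();
    }
}
```

After AddWord("house") and AddWord("hour"), a call to Complete("hou") returns both words, regardless of how many letters are still missing. The price is one additional entry per prefix, which is the same kind of space-for-time trade as the precalculated deletes.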

1000x Faster Spelling Correction: Source Code released

In a followup to our recent post 1000x Faster Spelling Correction algorithm we are releasing today a C# implementation of our Symmetric Delete Spelling Correction algorithm as Open Source:

Update1: The source code is now also on GitHub.
Update2: Improved implementation now 1,000,000 times faster for edit distance=3.

// SymSpell: 1000x faster through Symmetric Delete spelling correction algorithm
//
// The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup 
// for a given Damerau-Levenshtein distance. It is three orders of magnitude faster and language independent.
// Opposite to other algorithms only deletes are required, no transposes + replaces + inserts.
// Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term.
// Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
//
// Copyright (C) 2012 Wolf Garbe, FAROO Limited
// Version: 1.6
// Author: Wolf Garbe <wolf.garbe@faroo.com>
// Maintainer: Wolf Garbe <wolf.garbe@faroo.com>
// URL: http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
// Description: http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
//
// License:
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU Lesser General Public License, 
// version 3.0 (LGPL-3.0) as published by the Free Software Foundation.
// http://www.opensource.org/licenses/LGPL-3.0
//
// Usage: single word + Enter:  Display spelling suggestions
//        Enter without input:  Terminate the program

using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Collections.Generic;
using System.IO;
using System.Diagnostics;

static class SymSpell
{
    private static int editDistanceMax=2;
    private static int verbose = 0;
    //0: top suggestion
    //1: all suggestions of smallest edit distance 
    //2: all suggestions <= editDistanceMax (slower, no early termination)

    private class dictionaryItem
    {
        public string term = "";
        public List<editItem> suggestions = new List<editItem>();
        public int count = 0;

        public override bool Equals(object obj)
        {
            return Equals(term, ((dictionaryItem)obj).term);
        }
     
        public override int GetHashCode()
        {
            return term.GetHashCode(); 
        }       
    }

    private class editItem
    {
        public string term = "";
        public int distance = 0;

        public override bool Equals(object obj)
        {
            return Equals(term, ((editItem)obj).term);
        }
     
        public override int GetHashCode()
        {
            return term.GetHashCode();
        }       
    }

    private class suggestItem
    {
        public string term = "";
        public int distance = 0;
        public int count = 0;

        public override bool Equals(object obj)
        {
            return Equals(term, ((suggestItem)obj).term);
        }
     
        public override int GetHashCode()
        {
            return term.GetHashCode();
        }       
    }

    private static Dictionary<string, dictionaryItem> dictionary = new Dictionary<string, dictionaryItem>();

    //create a non-unique wordlist from sample text
    //language independent (e.g. works with Chinese characters)
    private static IEnumerable<string> parseWords(string text)
    {
        return Regex.Matches(text.ToLower(), @"[\w-[\d_]]+")
                    .Cast<Match>()
                    .Select(m => m.Value);
    }

    //for every word, all deletes with an edit distance of 1..editDistanceMax are created and added to the dictionary
    //every delete entry has a suggestions list, which points to the original term(s) it was created from
    //The dictionary may be dynamically updated (word frequency and new words) at any time by calling CreateDictionaryEntry
    private static bool CreateDictionaryEntry(string key, string language)
    {
        bool result = false;
        dictionaryItem value;
        if (dictionary.TryGetValue(language+key, out value))
        {
            //already exists:
            //1. word appears several times
            //2. word1==deletes(word2) 
            value.count++;
        }
        else
        {
            value = new dictionaryItem();
            value.count++;
            dictionary.Add(language+key, value);
        }

        //edits/suggestions are created only once, no matter how often word occurs
        //edits/suggestions are created only as soon as the word occurs in the corpus, 
        //even if the same term existed before in the dictionary as an edit from another word
        if (string.IsNullOrEmpty(value.term))
        {
            result = true;
            value.term = key;

            //create deletes
            foreach (editItem delete in Edits(key, 0, true))
            {
                editItem suggestion = new editItem();
                suggestion.term = key;
                suggestion.distance = delete.distance;

                dictionaryItem value2;
                if (dictionary.TryGetValue(language+delete.term, out value2))
                {
                    //already exists:
                    //1. word1==deletes(word2) 
                    //2. deletes(word1)==deletes(word2) 
                    if (!value2.suggestions.Contains(suggestion)) AddLowestDistance(value2.suggestions, suggestion);
                }
                else
                {
                    value2 = new dictionaryItem();
                    value2.suggestions.Add(suggestion);
                    dictionary.Add(language+delete.term, value2);
                }
            }
        }
        return result;
    }

    //create a frequency dictionary from a corpus
    private static void CreateDictionary(string corpus, string language)
    {
        if (!File.Exists(corpus))
        {
            Console.Error.WriteLine("File not found: " + corpus);
            return;
        }

        Console.Write("Creating dictionary ...");
        long wordCount = 0;
        foreach (string key in parseWords(File.ReadAllText(corpus)))
        {
            if (CreateDictionaryEntry(key, language)) wordCount++;
        }
        Console.WriteLine("\rDictionary created: " + wordCount.ToString("N0") + " words, " + dictionary.Count.ToString("N0") + " entries, for edit distance=" + editDistanceMax.ToString());
    }

    //save some time and space
    private static void AddLowestDistance(List<editItem> suggestions, editItem suggestion)
    {
        //remove all existing suggestions of higher distance, if verbose<2
        if ((verbose < 2) && (suggestions.Count > 0) && (suggestions[0].distance > suggestion.distance)) suggestions.Clear();
        //do not add suggestion of higher distance than existing, if verbose<2
        if ((verbose == 2) || (suggestions.Count == 0) || (suggestions[0].distance >= suggestion.distance)) suggestions.Add(suggestion);
    }

    //inexpensive and language independent: only deletes, no transposes + replaces + inserts
    //replaces and inserts are expensive and language dependent (Chinese has 70,000 Unicode Han characters)
    private static List<editItem> Edits(string word, int editDistance, bool recursion)
    {
        editDistance++;
        List<editItem> deletes = new List<editItem>();
        if (word.Length > 1)
        {
            for (int i = 0; i < word.Length; i++)
            {
                editItem delete = new editItem();
                delete.term=word.Remove(i, 1);
                delete.distance=editDistance;
                if (!deletes.Contains(delete))
                {
                    deletes.Add(delete);
                    //recursion, if maximum edit distance not yet reached
                    if (recursion && (editDistance < editDistanceMax)) 
                    {
                        foreach (editItem edit1 in Edits(delete.term, editDistance,recursion))
                        {
                            if (!deletes.Contains(edit1)) deletes.Add(edit1); 
                        }
                    }                   
                }
            }
        }

        return deletes;
    }

    private static int TrueDistance(editItem dictionaryOriginal, editItem inputDelete, string inputOriginal)
    {
        //We allow simultaneous edits (deletes) of editDistanceMax on both the dictionary and the input term. 
        //For replaces and adjacent transposes the resulting edit distance stays <= editDistanceMax.
        //For inserts and deletes the resulting edit distance might exceed editDistanceMax.
        //To prevent suggestions of a higher edit distance, we need to calculate the resulting edit distance, if there are simultaneous edits on both sides.
        //Example: (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn for editDistanceMax=1)
        //Two deletes on each side of a pair make them all equal, but the first two pairs have edit distance=1, the others edit distance=2.

        if (dictionaryOriginal.term == inputOriginal) return 0; else
        if (dictionaryOriginal.distance == 0) return inputDelete.distance;
        else if (inputDelete.distance == 0) return dictionaryOriginal.distance;
        else return DamerauLevenshteinDistance(dictionaryOriginal.term, inputOriginal);//adjust distance, if both distances>0
    }

    private static List<suggestItem> Lookup(string input, string language, int editDistanceMax)
    {
        List<editItem> candidates = new List<editItem>();

        //add original term
        editItem item = new editItem();
        item.term = input;
        item.distance = 0;
        candidates.Add(item);
 
        List<suggestItem> suggestions = new List<suggestItem>();
        dictionaryItem value;

        while (candidates.Count>0)
        {
            editItem candidate = candidates[0];
            candidates.RemoveAt(0);

            //save some time
            //early termination
            //suggestion distance=candidate.distance... candidate.distance+editDistanceMax                
            //if candidate distance is already higher than suggestion distance, then there are no better suggestions to be expected
            if ((verbose < 2)&&(suggestions.Count > 0)&&(candidate.distance > suggestions[0].distance)) goto sort;
            if (candidate.distance > editDistanceMax) goto sort;  

            if (dictionary.TryGetValue(language+candidate.term, out value))
            {
                if (!string.IsNullOrEmpty(value.term))
                {
                    //correct term
                    suggestItem si = new suggestItem();
                    si.term = value.term;
                    si.count = value.count;
                    si.distance = candidate.distance;

                    if (!suggestions.Contains(si))
                    {
                        suggestions.Add(si);
                        //early termination
                        if ((verbose < 2) && (candidate.distance == 0)) goto sort;     
                    }
                }

                //edit term (with suggestions to correct term)
                dictionaryItem value2;
                foreach (editItem suggestion in value.suggestions)
                {
                    //save some time 
                    //skipping double items early
                    if (suggestions.Find(x => x.term == suggestion.term) == null)
                    {
                        int distance = TrueDistance(suggestion, candidate, input);
                     
                        //save some time.
                        //remove all existing suggestions of higher distance, if verbose<2
                        if ((verbose < 2) && (suggestions.Count > 0) && (suggestions[0].distance > distance)) suggestions.Clear();
                        //do not process higher distances than those already found, if verbose<2
                        if ((verbose < 2) && (suggestions.Count > 0) && (distance > suggestions[0].distance)) continue;

                        if (distance <= editDistanceMax)
                        {
                            if (dictionary.TryGetValue(language+suggestion.term, out value2))
                            {
                                suggestItem si = new suggestItem();
                                si.term = value2.term;
                                si.count = value2.count;
                                si.distance = distance;

                                suggestions.Add(si);
                            }
                        }
                    }
                }
            }//end if

            //add edits 
            if (candidate.distance < editDistanceMax)
            {
                foreach (editItem delete in Edits(candidate.term, candidate.distance,false))
                {
                    if (!candidates.Contains(delete)) candidates.Add(delete);
                }
            }
        }//end while

        sort: suggestions = suggestions.OrderBy(c => c.distance).ThenByDescending(c => c.count).ToList();
        if ((verbose == 0)&&(suggestions.Count>1))  return suggestions.GetRange(0, 1); else return suggestions;
    }

    private static void Correct(string input, string language)
    {
        List<suggestItem> suggestions = null;
    
        /*
        //Benchmark: 1000 x Lookup
        Stopwatch stopWatch = new Stopwatch();
        stopWatch.Start();
        for (int i = 0; i < 1000; i++)
        {
            suggestions = Lookup(input,language,editDistanceMax);
        }
        stopWatch.Stop();
        Console.WriteLine(stopWatch.ElapsedMilliseconds.ToString());
        */
        
        //check in dictionary for existence and frequency; sort by edit distance, then by word frequency
        suggestions = Lookup(input, language, editDistanceMax);

        //display term and frequency
        foreach (var suggestion in suggestions)
        {
            Console.WriteLine( suggestion.term + " " + suggestion.distance.ToString() + " " + suggestion.count.ToString());
        }
        if (verbose == 2) Console.WriteLine(suggestions.Count.ToString() + " suggestions");
    }

    private static void ReadFromStdIn()
    {
        string word;
        while (!string.IsNullOrEmpty(word = (Console.ReadLine() ?? "").Trim()))
        {
            Correct(word,"en");
        }
    }

    public static void Main(string[] args)
    {
        //e.g. http://norvig.com/big.txt , or any other large text corpus
        CreateDictionary("big.txt","en");
        ReadFromStdIn();
    }

    // Damerau–Levenshtein distance algorithm and code 
    // from http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    public static Int32 DamerauLevenshteinDistance(String source, String target)
    {
        Int32 m = source.Length;
        Int32 n = target.Length;
        Int32[,] H = new Int32[m + 2, n + 2];

        Int32 INF = m + n;
        H[0, 0] = INF;
        for (Int32 i = 0; i <= m; i++) { H[i + 1, 1] = i; H[i + 1, 0] = INF; }
        for (Int32 j = 0; j <= n; j++) { H[1, j + 1] = j; H[0, j + 1] = INF; }

        SortedDictionary<Char, Int32> sd = new SortedDictionary<Char, Int32>();
        foreach (Char Letter in (source + target))
        {
            if (!sd.ContainsKey(Letter))
                sd.Add(Letter, 0);
        }

        for (Int32 i = 1; i <= m; i++)
        {
            Int32 DB = 0;
            for (Int32 j = 1; j <= n; j++)
            {
                Int32 i1 = sd[target[j - 1]];
                Int32 j1 = DB;

                if (source[i - 1] == target[j - 1])
                {
                    H[i + 1, j + 1] = H[i, j];
                    DB = j;
                }
                else
                {
                    H[i + 1, j + 1] = Math.Min(H[i, j], Math.Min(H[i + 1, j], H[i, j + 1])) + 1;
                }

                H[i + 1, j + 1] = Math.Min(H[i + 1, j + 1], H[i1, j1] + (i - i1 - 1) + 1 + (j - j1 - 1));
            }

            sd[ source[ i - 1 ]] = i;
        }
        return H[m + 1, n + 1];
    }
}

Updated:
The implementation now supports edit distances of any size (default=2).

Benchmark:
With previous spell checking algorithms the required time explodes with larger edit distances. They try to avoid this with early termination when suggestions of smaller edit distances are found.

We did a quick benchmark with 1000 lookups:

| Term | Best correction | Edit distance | Faroo (ms/1000) | Peter Norvig (ms/1000) | Factor |
|---|---|---|---|---|---|
| marsupilami | no correction* | >3 | 1,772 | 165,025,000 | 93,129 |
| acamodation | accommodation | 3 | 1,874 | 175,622,000 | 93,715 |
| acomodation | accommodation | 2 | 162 | 348,191 | 2,149 |
| hous | house | 1 | 71 | 179 | 2 |
| house | house | 0 | 0 | 17 | n/a |

*Correct word, but not in dictionary and there are also no corrections within an edit distance of <=3. This is a quite common case (e.g. rare words, new words, domain specific words, foreign words, names).

The speed advantage grows exponentially with the edit distance:
For an edit distance=1 it’s the same order of magnitude,
for an edit distance=2 it’s 3 orders of magnitude faster,
for an edit distance=3 it’s 5 orders of magnitude faster.

1000x Faster Spelling Correction algorithm

Update: SymSpell C# implementation released as Open Source.
Update2: SymSpell 100,000 times faster for edit distance=3.
Update3: Spelling correction is now also part of FAROO search.
Update4: SymSpell source code now on GitHub.
Update5: Improved implementation now 1,000,000 times faster for edit distance=3.
Update6: SymSpellCompound: Compound aware automatic spelling correction.

Recently I answered a question on Quora about spelling correction for search engines. When I described our algorithm I was pointed to Peter Norvig’s page where he outlined his approach.

Both algorithms are based on Edit distance (Damerau-Levenshtein distance).
Both try to find the dictionary entries with smallest edit distance from the query term.
If the edit distance is 0 the term is spelled correctly, if the edit distance is <=2 the dictionary term is used as spelling suggestion. But our way to search the dictionary is different, resulting in a significant performance gain and language independence.

Three ways to search for minimum edit distance in a dictionary:

1. Naive approach
The obvious way of doing this is to compute the edit distance from the query term to each dictionary term, before selecting the string(s) of minimum edit distance as spelling suggestion. This exhaustive search is inordinately expensive.
Source: Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze: Introduction to Information Retrieval.

The performance can be significantly improved by terminating the edit distance calculation as soon as a threshold of 2 or 3 has been reached.

2. Peter Norvig
Generate all possible terms with an edit distance <=2 (deletes + transposes + replaces + inserts) from the query term and search them in the dictionary.
For a word of length n, an alphabet size a, an edit distance d=1, there will be n deletions, n-1 transpositions, a*n alterations, and a*(n+1) insertions, for a total of 2n+2an+a-1 terms at search time.
Source: Peter Norvig: How to Write a Spelling Corrector.

This is much better than the naive approach, but still expensive at search time (114,324 terms for n=9, a=36, d=2) and language dependent (because the alphabet is used to generate the terms, which is different in many languages and huge in Chinese: a=70,000 Unicode Han characters).
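To make the count for a single edit concrete, the formula quoted above can be evaluated for the example values n=9 and a=36. This is only a worked calculation for d=1; the d=2 figure in the text is not re-derived here.

```latex
\underbrace{n}_{\text{deletions}}
+ \underbrace{(n-1)}_{\text{transpositions}}
+ \underbrace{a\,n}_{\text{alterations}}
+ \underbrace{a\,(n+1)}_{\text{insertions}}
= 2n + 2an + a - 1

n = 9,\ a = 36:\quad 2 \cdot 9 + 2 \cdot 36 \cdot 9 + 36 - 1 = 18 + 648 + 35 = 701
```

Only the n = 9 deletions are independent of the alphabet size a; the remaining candidates grow with a.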

3. Symmetric Delete Spelling Correction (FAROO)
Generate terms with an edit distance <=2 (deletes only) from each dictionary term and add them together with the original term to the dictionary. This has to be done only once during a pre-calculation step.
Generate terms with an edit distance <=2 (deletes only) from the input term and search them in the dictionary.
For a word of length n, an alphabet size of a, an edit distance of 1, there will be just n deletions, for a total of n terms at search time.

This is three orders of magnitude less expensive (36 terms for n=9 and d=2) and language independent (the alphabet is not required to generate deletes).
The cost of this approach is the pre-calculation time and storage space of x deletes for every original dictionary entry, which is acceptable in most cases.

The number x of deletes for a single dictionary entry depends on the maximum edit distance: x=n for edit distance=1, x=n*(n-1)/2 for edit distance=2, x=n!/d!/(n-d)! for edit distance=d (combinatorics: k out of n combinations without repetitions, and k=n-d). For a maximum edit distance the counts of all smaller edit distances add up.
E.g. for a maximum edit distance of 2, an average word length of 5 and 100,000 dictionary entries we need to additionally store 1,500,000 deletes.
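The example figure can be reproduced by summing the per-distance counts up to the maximum edit distance. The notation x(n, d_max) is introduced here only for illustration. The result is an upper bound, because repeated letters within a word and different words with shared deletes produce fewer distinct entries.

```latex
x(n, d_{\max}) = \sum_{d=1}^{d_{\max}} \binom{n}{d},
\qquad \binom{n}{d} = \frac{n!}{d!\,(n-d)!}

x(5, 2) = \binom{5}{1} + \binom{5}{2} = 5 + 10 = 15

100{,}000 \cdot 15 = 1{,}500{,}000
```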

Remark 1: During the precalculation, different words in the dictionary might lead to same delete term: delete(sun,1)==delete(sin,1)==sn.
While we generate only one new dictionary entry (sn), inside we need to store both original terms as spelling correction suggestions (sun, sin).
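Remark 1 can be illustrated with a minimal sketch for edit distance 1. The class DeleteIndexSketch and its methods are hypothetical names chosen for this example, and the structure is simplified compared to the actual dictionary, which also stores word counts and distances.

```csharp
using System;
using System.Collections.Generic;

static class DeleteIndexSketch
{
    // All distinct strings obtained by deleting exactly one character from word.
    public static HashSet<string> Deletes1(string word)
    {
        var deletes = new HashSet<string>();
        for (int i = 0; i < word.Length; i++) deletes.Add(word.Remove(i, 1));
        return deletes;
    }

    // Maps each delete to the list of original terms it was created from.
    public static Dictionary<string, List<string>> BuildIndex(IEnumerable<string> words)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string word in words)
        {
            foreach (string delete in Deletes1(word))
            {
                List<string> originals;
                if (!index.TryGetValue(delete, out originals))
                {
                    originals = new List<string>();
                    index.Add(delete, originals);
                }
                // one entry per delete, but several original terms may point to it
                if (!originals.Contains(word)) originals.Add(word);
            }
        }
        return index;
    }
}
```

For the words sun and sin, BuildIndex creates a single entry for the delete sn that lists both sun and sin, while the entry for un lists only sun.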

Remark 2: There are four different comparison pair types:

  1. dictionary entry==input entry,
  2. delete(dictionary entry,p1)==input entry
  3. dictionary entry==delete(input entry,p2)
  4. delete(dictionary entry,p1)==delete(input entry,p2)

The last comparison type is required for replaces and transposes only. But we need to check whether the suggested dictionary term is really a replace or an adjacent transpose of the input term to prevent false positives of higher edit distance (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn).
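The following sketch shows why a shared delete alone is not sufficient for the last comparison type. It is an illustration under stated assumptions: ComparisonPairSketch, SharesDelete and OsaDistance are hypothetical names, and the verification uses the restricted Damerau-Levenshtein (optimal string alignment) distance as a stand-in for the edit distance function of the implementation.

```csharp
using System;
using System.Collections.Generic;

static class ComparisonPairSketch
{
    // All distinct strings obtained by deleting exactly one character from word.
    public static HashSet<string> Deletes1(string word)
    {
        var deletes = new HashSet<string>();
        for (int i = 0; i < word.Length; i++) deletes.Add(word.Remove(i, 1));
        return deletes;
    }

    // True if one delete on each side turns both terms into the same string
    // (comparison type 4 for a maximum edit distance of 1).
    public static bool SharesDelete(string dictionaryTerm, string inputTerm)
    {
        return Deletes1(dictionaryTerm).Overlaps(Deletes1(inputTerm));
    }

    // Restricted Damerau-Levenshtein (optimal string alignment) distance:
    // inserts, deletes, replaces and adjacent transposes each cost 1.
    public static int OsaDistance(string s, string t)
    {
        int[,] d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
                // adjacent transposition
                if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
            }
        }
        return d[s.Length, t.Length];
    }
}
```

For the dictionary term bank, SharesDelete is true for bnak, bink, xban and baxn alike. OsaDistance is 1 for bnak (transpose) and bink (replace), but 2 for xban and baxn, so for a maximum edit distance of 1 the latter two have to be rejected after the distance check.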

Remark 3: Instead of a dedicated spelling dictionary we are using the search engine index itself. This has several benefits:

  1. It is dynamically updated. Every newly indexed word, whose frequency is over a certain threshold, is automatically used for spelling correction as well.
  2. As we need to search the index anyway the spelling correction comes at almost no extra cost.
  3. When indexing misspelled terms (i.e. terms not marked as correct in the index) we do a spelling correction on the fly and index the page for the correct term as well.

Remark 4: We have implemented query suggestions/completion in a similar fashion. This is a good way to prevent spelling errors in the first place. Every newly indexed word whose frequency is over a certain threshold is stored as a suggestion for all of its prefixes (they are created in the index if they do not yet exist). As we provide an instant search feature anyway, the lookup for suggestions also comes at almost no extra cost. Multiple terms are sorted by the number of results stored in the index.
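The prefix-based suggestion scheme could look roughly like the following sketch. It assumes an in-memory table and uses term frequency as a stand-in for the number of results; the names `build_prefix_index` and `suggest` and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def build_prefix_index(term_frequencies, threshold):
    """Store every sufficiently frequent term under each of its prefixes."""
    index = defaultdict(list)
    for term, frequency in term_frequencies.items():
        if frequency <= threshold:
            continue  # only terms over the threshold become suggestions
        for end in range(1, len(term) + 1):
            index[term[:end]].append((frequency, term))
    for suggestions in index.values():
        # Highest frequency first, ties broken alphabetically.
        suggestions.sort(key=lambda pair: (-pair[0], pair[1]))
    return index


def suggest(index, prefix, limit=5):
    """Return up to `limit` completions for a prefix with a single hash lookup."""
    return [term for _, term in index.get(prefix, [])[:limit]]


frequencies = {"search": 120, "seek": 80, "sea": 5, "server": 200}
prefix_index = build_prefix_index(frequencies, threshold=10)
print(suggest(prefix_index, "se"))  # ['server', 'search', 'seek']
```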

Reasoning
In our algorithm we are exploiting the fact that the edit distance between two terms is symmetrical:

  1. We can generate all terms with an edit distance <=2 from the query term (trying to reverse the query term error) and check them against all dictionary terms,
  2. We can generate all terms with an edit distance <=2 from each dictionary term (trying to create the query term error) and check the query term against them.
  3. We can combine both and meet in the middle, by transforming the correct dictionary terms to erroneous strings, and transforming the erroneous input term to the correct strings.
    Because adding a character to the dictionary term is equivalent to removing a character from the input string and vice versa, we can restrict our transformation to deletes only on both sides.

We are using variant 3, because the delete-only-transformation is language independent and three orders of magnitude less expensive.
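The meet-in-the-middle idea can be illustrated for a maximum edit distance of 1. The sketch below is a simplified assumption-laden example (function names are hypothetical): it only tests whether the two sides share a string, which yields candidates that still have to be verified against the true edit distance.

```python
def single_deletes(term):
    """All strings obtained by deleting exactly one character from term."""
    return {term[:i] + term[i + 1:] for i in range(len(term))}


def meets_in_the_middle(dictionary_term, input_term):
    """True if the terms are equal or share a string after at most one delete per side."""
    dictionary_side = {dictionary_term} | single_deletes(dictionary_term)
    input_side = {input_term} | single_deletes(input_term)
    return not dictionary_side.isdisjoint(input_side)


print(meets_in_the_middle("bank", "bnk"))    # missing character in the input
print(meets_in_the_middle("bank", "banks"))  # extra character in the input
print(meets_in_the_middle("bank", "bink"))   # replaced character
print(meets_in_the_middle("bank", "bnak"))   # adjacent transpose
```

All four error types are found using deletes alone. The pair bank/xban also meets in the middle (via ban), which is why the verification step from Remark 2 is needed.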

Where does the speed come from?

  • Pre-calculation, i.e. the generation of possible spelling error variants (deletes only) and storing them at index time is the first precondition.
  • A fast index access at search time by using a hash table with an average search time complexity of O(1) is the second precondition.
  • But only our Symmetric Delete Spelling Correction on top of this makes it possible to bring this O(1) speed to spell checking, because it tremendously reduces the number of spelling error candidates to be pre-calculated (generated and indexed).
  • Applying pre-calculation to Norvig’s approach would not be feasible because pre-calculating all possible delete + transpose + replace + insert candidates of all terms would result in a huge time and space consumption.
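To give a feel for the difference in candidate counts, the following sketch counts the raw candidates for a single word at edit distance 1, before removing duplicates. The function names and the alphabet size of 26 are assumptions chosen for this example.

```python
def all_edits_count(n, a):
    """Raw candidates at edit distance 1 when all four edit operations are used."""
    deletes = n
    transposes = n - 1
    replaces = a * n
    inserts = a * (n + 1)
    return deletes + transposes + replaces + inserts


def delete_only_count(n):
    """Raw candidates at edit distance 1 when only deletes are used."""
    return n


n, a = 9, 26
print(all_edits_count(n, a), delete_only_count(n))  # 511 9
```

The gap widens quickly at edit distance 2, because the full set of edits has to be applied again to every candidate of distance 1.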

Computational Complexity
Our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size (but depending on the average term length and maximum edit distance), because our index is based on a Hash Table which has an average search time complexity of O(1).

Comparison to other approaches
BK-Trees have a search time of O(log dictionary_size), whereas our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size.
Tries have a comparable search performance to our approach. But a Trie is a prefix tree, which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but not applicable for spell checking. If your typing error is e.g. in the first letter, then you have no common prefix, hence the Trie will not work for spelling correction.

Application
Possible application fields of our algorithm are those of fast approximate dictionary string matching: spell checkers for word processors and search engines, correction systems for optical character recognition, natural language translation based on translation memory, record linkage, de-duplication, matching DNA sequences, fuzzy string searching and fraud detection.

———

BTW, by using a similar principle our web search is three orders of magnitude more efficient as well. While Google touches 1000 servers for every query, we need to query just one (server/peer).
That’s not because of DHT! It is the other way around: because only one of the servers needs to be queried, even for a complex query in a web-scale index, the use of DHT for web search becomes possible.
Our algorithm improves the efficiency of central servers in a data center to the same extent.