Very fast Data cleaning of product names, company names & street names

The correction of product names, company names, street names & addresses is a frequent task of data cleaning and deduplication. Often those names are misspelled, either due to OCR errors or mistakes of the human data collectors.

The difference is that those names often consist of multiple words, white space and punctuation. For large data or even Big data applications also speed is very important.

Our algorithm supports both requirements and is up to 1 million times faster compared to conventional approaches (see benchmark). The C# source code is available as Open Source in another Blog post and GitHub). A simple modification of the original source code will add support of names with multiple words, white space and punctuation:

Instead of 357 CreateDictionary("big.txt",""); which parses the a given text file into single words simply use CreateDictionaryEntry("company/street/product name", "") to add company, street & product names to the dictionary.

Then with Correct("misspelled street",""); you will get the correct street name from the dictionary. In line 35..38 you may specify whether you want only the best match or all matches within a certain edit distance (number of character operations difference):

35 private static int verbose = 0;
36 //0: top suggestion
37 //1: all suggestions of smallest edit distance
38 //2: all suggestions <= editDistanceMax (slower, no early termination)

For every similar term (or phrase) found in the dictionary the algorithm gives you the Damerau-Levenshtein edit distance to your input term (look for suggestion.distance in the source code). The edit distance describes how many characters have been added, deleted, altered or transposed between the input term and the dictionary term. This is a measure of similarity between the input term (or phrase) and similar terms (or phrases) found in the dictionary.