Very fast data cleaning of product names, company names & street names


The correction of product names, company names, street names & addresses is a frequent task in data cleaning and deduplication. Often those names are misspelled, either due to OCR errors or mistakes by the human data collectors.

The difference from ordinary single-word spelling correction is that those names often consist of multiple words, white space and punctuation. For large data or even Big Data applications, speed is also very important.

Our algorithm supports both requirements and is up to 1 million times faster than conventional approaches (see the benchmark). The C# source code is available as open source in another blog post and on GitHub. A simple modification of the original source code adds support for names with multiple words, white space and punctuation:

Instead of CreateDictionary("big.txt","") in line 357, which parses a given text file into single words, simply use CreateDictionaryEntry("company/street/product name", "") to add company, street & product names to the dictionary.

Then with Correct("misspelled street",""); you will get the correct street name from the dictionary. In lines 35..38 you may specify whether you want only the best match or all matches within a certain edit distance (the number of character operations by which the terms may differ):

35 private static int verbose = 0;
36 //0: top suggestion
37 //1: all suggestions of smallest edit distance
38 //2: all suggestions <= editDistanceMax (slower, no early termination)
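
For illustration, a minimal sketch of the modified usage, assuming it runs inside the Main() of the original program; CreateDictionaryEntry() and Correct() are the methods from the original source code, while the sample names are made up:

// Build the dictionary from known-good names instead of parsing a text corpus.
CreateDictionaryEntry("Alpine Communications", "");
CreateDictionaryEntry("Main Street", "");
CreateDictionaryEntry("Acme Corporation", "");

// Look up a misspelled name; with verbose = 0 this yields only the best match.
Correct("Alpne Comunications", "");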

For every similar term (or phrase) found in the dictionary the algorithm reports the Damerau-Levenshtein edit distance to your input term (look for suggestion.distance in the source code). The edit distance is the number of characters that have to be inserted, deleted, substituted or transposed to turn the input term into the dictionary term. It is a measure of similarity between the input term (or phrase) and the similar terms (or phrases) found in the dictionary.
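
As a worked example, here is a self-contained C# sketch of the restricted Damerau-Levenshtein distance (optimal string alignment), the same measure reported via suggestion.distance; this is a plain textbook implementation, not the optimized code from the post:

using System;

class DamerauLevenshteinDemo
{
    // Restricted Damerau-Levenshtein distance: the minimum number of
    // insertions, deletions, substitutions and adjacent transpositions
    // needed to turn string a into string b.
    static int Distance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,         // deletion
                    d[i, j - 1] + 1),        // insertion
                    d[i - 1, j - 1] + cost); // substitution
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1); // transposition
            }

        return d[a.Length, b.Length];
    }

    static void Main()
    {
        // Two characters are missing ('i' and 'm'): distance 2.
        Console.WriteLine(Distance("Alpne Comunications", "Alpine Communications"));
    }
}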


4 thoughts on “Very fast data cleaning of product names, company names & street names”

  1. Hi,
    I have two tables, one with company names and one with provider names.
    Both tables contain the same names but with some differences (e.g. "Alpine Communications" in my provider name table and "Alpine long distance communication" in my company name table). I want to find the distance between the records.

    Note: my tables contain 50k records.

    Can anyone please explain how to do it?

  2. Solution 1:
    Create the dictionary from the first table, treating each line as a single word (with CreateDictionaryEntry()).
    Look up the second table in the dictionary, treating each line as a single word (with Lookup()).
    This will return the match with the smallest (character) edit distance.
    This works on character level and will be slow if the edit distance between the two lines is high: e.g. 10 seconds at edit distance 3 for 50k records; 1000 seconds at edit distance 6 for 50k records. A minimal sketch follows below.
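
    A minimal sketch of Solution 1, assuming the CreateDictionaryEntry() and Lookup() methods and the suggestion fields (term, distance) from the original source code; the file names are illustrative, and the usual usings (System, System.IO) are assumed:

        // Build the dictionary from table 1, one entry per line.
        foreach (string companyName in File.ReadLines("companies.txt"))
            CreateDictionaryEntry(companyName, "");

        // Match each line of table 2 against the dictionary.
        foreach (string providerName in File.ReadLines("providers.txt"))
        {
            // Lookup returns suggestions ordered by edit distance.
            var suggestions = Lookup(providerName, "", editDistanceMax);
            if (suggestions.Count > 0)
                Console.WriteLine(providerName + " -> " + suggestions[0].term
                    + " (distance " + suggestions[0].distance + ")");
        }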

    Solution 2:
    Create an additional translation table from all unique words contained in both tables: first column, the unique word; second column, a single unique Unicode character (no special meaning, just unique).
    Translate both the company names and the provider names to Unicode strings by using the translation table (one word -> one Unicode char). After that, each line of the two tables holds a single term consisting of those Unicode characters.
    Create the dictionary from the first table, using the translated single terms (with CreateDictionaryEntry()).
    Look up the second table in the dictionary, using the translated single terms (with Lookup()).
    This will return the match with the smallest (word) edit distance. The match then has to be translated back into the original record, using the translation table.
    This works on word level and will be very fast, even if the two records differ by several words (each with many characters).
    But spelling errors within words are treated as if they were two completely different words. A sketch of the translation step follows below.
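
    A sketch of the translation step in Solution 2, e.g. as static members next to the original code's methods; mapping words to consecutive code points in a Unicode Private Use Area is an assumption, any scheme producing one unique character per word works (usings: System.Collections.Generic, System.Text):

        // Map each unique word to one unique Unicode character, so that
        // word-level differences become character-level edit operations.
        static Dictionary<string, char> wordToChar = new Dictionary<string, char>();
        static char nextChar = '\uE000'; // Private Use Area (an assumption)

        static string Translate(string name)
        {
            var sb = new StringBuilder();
            foreach (string word in name.Split(' '))
            {
                char c;
                if (!wordToChar.TryGetValue(word, out c))
                    wordToChar[word] = c = nextChar++;
                sb.Append(c);
            }
            return sb.ToString();
        }

        // "Alpine Communications"              -> 2 characters
        // "Alpine long distance communication" -> 4 characters
        // Their character edit distance (3) is the word edit distance of the records.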

    Solution 3:
    Create the dictionary from the first table, splitting each line into single words (with CreateDictionaryEntry()). For each word you additionally have to store the record/line numbers where the word occurs (there can be multiple line numbers per word).
    Look up the second table in the dictionary, splitting each line into single words (with Lookup()). You have to look up each word separately.
    To look up one record from the second table, look up all single words of that record separately and collect all returned record/line numbers of those words.
    The record/line number that occurs most frequently identifies the most similar record from the first table.
    This is quite fast and allows both spelling errors on character level and missing-word errors. A sketch of the index and vote count follows below.
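
    A sketch of Solution 3, again assuming CreateDictionaryEntry(), Lookup() and the suggestion fields from the original source; the inverted index and the vote count are one possible realization of "most frequent record number" (statement-level code, e.g. inside Main(); usings: System.IO, System.Collections.Generic):

        // Index table 1: dictionary entries per word, plus word -> record numbers.
        var wordToRecords = new Dictionary<string, List<int>>();
        string[] table1 = File.ReadAllLines("companies.txt"); // illustrative file name
        for (int record = 0; record < table1.Length; record++)
            foreach (string word in table1[record].Split(' '))
            {
                CreateDictionaryEntry(word, "");
                List<int> records;
                if (!wordToRecords.TryGetValue(word, out records))
                    wordToRecords[word] = records = new List<int>();
                records.Add(record);
            }

        // Match one line of table 2: look up each word, vote for the records it occurs in.
        var votes = new Dictionary<int, int>();
        foreach (string word in "Alpine long distance communication".Split(' '))
            foreach (var s in Lookup(word, "", editDistanceMax))
                foreach (int record in wordToRecords[s.term])
                    votes[record] = votes.ContainsKey(record) ? votes[record] + 1 : 1;

        // The record with the most votes is the most similar record of table 1.
        int bestRecord = -1, bestCount = 0;
        foreach (var kv in votes)
            if (kv.Value > bestCount) { bestRecord = kv.Key; bestCount = kv.Value; }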

  3. Hi,
    I have a table with roughly 41k records and I changed my code according to the suggestions made in this post. Unfortunately, every time I pass a line to the method CreateDictionaryEntry(line, "") my system runs out of memory because of the edit variants generated for the passed line.
    Is there any workaround for this?

  4. To what number did you set editDistanceMax (default = 2)? How much RAM is free on your machine? The required memory grows with the maximum edit distance! Do you use the C# code or one of the ports to other languages (Java, JavaScript, …)? At what record number does the error occur (it sounds strange if it really happens from record 1 on)? Do you use compounds (together with one of the 3 solutions proposed in the comment section)? Perhaps you can post your code (all of the modifications you have made to the original code), then I could have a look.
