Very fast data cleaning of product names, company names & street names


The correction of product names, company names, street names & addresses is a frequent task in data cleaning and deduplication. Often those names are misspelled, whether due to OCR errors or mistakes made by the people collecting the data.

Unlike ordinary single-word spelling correction, those names often consist of multiple words, white space and punctuation. And for large data or even Big Data applications, speed is essential as well.

Our algorithm supports both requirements and is up to 1 million times faster than conventional approaches (see benchmark). The C# source code is available as open source in another blog post and on GitHub. A simple modification of the original source code adds support for names with multiple words, white space and punctuation:

Instead of CreateDictionary("big.txt",""); in line 357, which parses a given text file into single words, simply use CreateDictionaryEntry("company/street/product name", "") to add company, street & product names to the dictionary.

Then Correct("misspelled street",""); returns the correct street name from the dictionary. In lines 35..38 you can specify whether you want only the best match or all matches within a certain edit distance (the number of character operations by which two terms differ):

35 private static int verbose = 0;
36 //0: top suggestion
37 //1: all suggestions of smallest edit distance
38 //2: all suggestions <= editDistanceMax (slower, no early termination)

For every similar term (or phrase) found in the dictionary, the algorithm gives you the Damerau-Levenshtein edit distance to your input term (see suggestion.distance in the source code). The edit distance describes how many characters have been added, deleted, altered or transposed between the input term and the dictionary term, and thus serves as a measure of similarity between the input term (or phrase) and the similar terms (or phrases) found in the dictionary.
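
Putting the pieces together, a minimal sketch of the adapted flow looks like this (the method names are those from the original post; the exact suggestion type and its term/distance fields are assumed here and may differ slightly in the source):

// Add known-good names to the dictionary (instead of parsing big.txt);
// each multi-word name becomes one dictionary entry.
CreateDictionaryEntry("alpine communications", "");
CreateDictionaryEntry("main street", "");
CreateDictionaryEntry("market square", "");

// Look up a misspelled name. With verbose = 0 only the top suggestion
// is returned; editDistanceMax caps the allowed difference.
var suggestions = Lookup("markt sqare", "", editDistanceMax);
foreach (var suggestion in suggestions)
    Console.WriteLine(suggestion.term + " (distance " + suggestion.distance + ")");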


8 thoughts on “Very fast data cleaning of product names, company names & street names”

  1. Hi,
    I have two tables, one with company names and one with provider names.
    Both tables contain the same names but with some variations (e.g. "Alpine Communications" in my provider name table and "Alpine long distance communication" in my company name table). I want to find the distance between the records.

    Note: my table contains 50k records.

    Can anyone please explain how to do it?

  2. Solution 1:
    Create the dictionary from the first table, treating each line as a single word (with CreateDictionaryEntry()).
    Look up the second table in the dictionary, treating each line as a single word (with Lookup()).
    This returns the match with the smallest (character) edit distance.
    This works on the character level and will be slow if the edit distance between the two lines is high: e.g. 10 seconds at edit distance 3 for 50k records; 1,000 seconds at edit distance 6 for 50k records.

    Solution 2:
    Create an additional translation table from all unique words contained in both tables. First column: the unique word; second column: a single unique Unicode character (no special meaning, just unique).
    Translate both the company names and the provider names to Unicode strings using the translation table (one word -> one Unicode character). After that, each line of the two tables contains a single term consisting of those Unicode characters.
    Create the dictionary from the first table, using the translated single term (with CreateDictionaryEntry()).
    Look up the second table in the dictionary, using the translated single term (with Lookup()).
    This returns the match with the smallest (word) edit distance. The match then has to be translated back into the original record using the translation table (a code sketch of this translation step follows after Solution 3).
    This works on the word level and will be very fast, even if the two records differ by several words (each with many characters).
    But a spelling error within a word makes it look like a completely different word.

    Solution 3:
    Create the dictionary from the first table, splitting each line into single words (with CreateDictionaryEntry()). For each word you additionally have to store the record/line numbers where the word occurs (there can be multiple line numbers per word).
    Look up the second table in the dictionary, splitting each line into single words (with Lookup()). You have to look up each word separately.
    To look up one record from the second table, look up all single words of that record separately and collect all returned record/line numbers of those words.
    The line/record number that occurs most frequently identifies the most similar record from the first table.
    This is quite fast and tolerates both spelling errors on the character level and missing words.
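
    A minimal sketch of the translation step from Solution 2 (the class and its method names are illustrative, not part of the original source):

    using System.Collections.Generic;
    using System.Text;

    class WordTranslator
    {
        // word -> unique Unicode character, plus the reverse for translating back
        private readonly Dictionary<string, char> wordToChar = new Dictionary<string, char>();
        private readonly Dictionary<char, string> charToWord = new Dictionary<char, string>();
        private char next = '\uE000'; // Private Use Area: never collides with real text

        // "alpine long distance communication" -> one Unicode char per word
        public string Encode(string name)
        {
            var sb = new StringBuilder();
            foreach (string word in name.ToLower().Split(' '))
            {
                if (!wordToChar.TryGetValue(word, out char c))
                {
                    c = next++;
                    wordToChar[word] = c;
                    charToWord[c] = word;
                }
                sb.Append(c);
            }
            return sb.ToString();
        }

        // Translate a matched term back into the original words.
        public string Decode(string encoded)
        {
            var words = new List<string>();
            foreach (char c in encoded) words.Add(charToWord[c]);
            return string.Join(" ", words);
        }
    }

    Feed Encode(line) of the first table into CreateDictionaryEntry(), look up Encode(line) of the second table with Lookup(), and Decode() the returned term; the reported edit distance then counts whole words instead of characters.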

  3. Hi,
    I have a table with roughly 41k records and I changed my code according to the suggestions made in this post. Unfortunately, every time I pass a line to the method CreateDictionaryEntry(line, "") my system runs out of memory because of the edits generated for that line.
    Is there any workaround for this?

  4. To what number did you set editDistanceMax (default = 2)? How much free RAM does your machine have? The required memory grows with the maximum edit distance! Do you use the C# code or one of the ports to other languages (Java, JavaScript, …)? At what record number does the error occur (it would sound strange if it really happened every time from record 1 on)? Do you use compounds (together with one of the three solutions proposed in the comment section)? Perhaps you can post your code (all of the modifications you have made to the original code); then I could have a look.

  5. Hi,

    I thought I’d also post a question since you seem to be extremely willing to help and offer advice.

    I would like to match social media posts (short text) to a database of movies/TV shows. The database contains information on movie or TV show names, characters and actors. If enough evidence is found in the input text, then I want the algorithm to classify the text to the movie it belongs to, or do nothing if there is not enough evidence.

    Since users are prone to misspelling and can spell things in a variety of fashions, I've been looking to implement some sort of fuzzy string matching algorithm. I've tried a similar solution from SeatGeek (FuzzyWuzzy) and while it does a pretty great job, it's proved to be way too slow.

    Do you think that your algorithm would be suitable for this kind of application? I've read your adjustment for matching phrases, but that's still matching phrases to phrases. I would like to match phrases/names to sentences, which might take a lot more time, and I think the algorithm would need major adjustments. Any advice would be much appreciated.

    Also, thank you for making this publicly available!

  6. @Brian:
    1. Create a dictionary from the database (movie or TV show names, characters and actors). Use CreateDictionaryEntry() for each of those database keywords.
    For each word (dictionary entry) you additionally have to store the database record numbers (the ID of the movie or TV show, character or actor) where the word occurs (there can be multiple record numbers per unique word/dictionary entry).

    2. Split the input text into separate words. Look up each word in the dictionary separately. Collect all returned record numbers/IDs of those words.

    3. After all words have been looked up, count the frequency of each record number/ID.
    The record number/ID that occurs most frequently identifies the most similar record from the database.

    This is quite fast and tolerates spelling errors on the character level, missing words and out-of-order words (see the sketch below).
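
    A rough sketch of this voting step (the class and its names are illustrative; in a real setup each input word would first be corrected fuzzily via Lookup() before it votes):

    using System.Collections.Generic;
    using System.Linq;

    class RecordMatcher
    {
        // Inverted index: word -> IDs of all records (movies/shows) containing it.
        private readonly Dictionary<string, HashSet<int>> index =
            new Dictionary<string, HashSet<int>>();

        public void Add(int recordId, string text)
        {
            foreach (string word in text.ToLower().Split(' '))
            {
                if (!index.TryGetValue(word, out var ids))
                    index[word] = ids = new HashSet<int>();
                ids.Add(recordId);
            }
        }

        // The record hit by the most input words wins, if it clears minVotes.
        public int? BestMatch(string input, int minVotes)
        {
            var votes = new Dictionary<int, int>();
            foreach (string word in input.ToLower().Split(' '))
                if (index.TryGetValue(word, out var ids))
                    foreach (int id in ids)
                    {
                        votes.TryGetValue(id, out int v);
                        votes[id] = v + 1;
                    }
            if (votes.Count == 0) return null;
            var best = votes.OrderByDescending(kv => kv.Value).First();
            return best.Value >= minVotes ? best.Key : (int?)null;
        }
    }

    The minVotes threshold implements the "do nothing if there is not enough evidence" requirement.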

  7. @Wolf: Thank you for the quick and informative response. The only thing that worries me with this approach is that I'd be getting a lot of false positives, especially by not taking word order into account, which matters quite a lot.

    If we take "The Walking Dead" for example, the words "the" and "walking" could easily be found in a lot of posts without having anything to do with the actual show. Someone could for example say "Walking down the street on the day of the dead.", which would indicate a perfect match, but it is of course far from that.

    Basically I want to take order into account and do the matching on a "multi word" level. I also want to count the same words written together as a full match, because hashtags like #TheWalkingDead are very common, which I believe won't be possible with the word-by-word approach. Taking only the initials into account (TwD or GoT for GameOfThrones) is also interesting, but that's more of a down-the-line addition.

    Anyway, I'm off to study some C# so I can implement your algorithm. If you have any further suggestions, they would be much appreciated.

  8. @Brian: You don't necessarily need to study C# to use the algorithm. There are several third-party ports to other programming languages available.

    There are quite a few options to increase precision:
    1. Capital letters: create dictionary entries only for terms that start with capital letters, or give them a higher rank.
    2. Bi-grams: also create dictionary entries for bi-grams (two consecutive words) and give bi-grams a higher priority (see the sketch after this list).
    3. Proximity/order of words: for each dictionary entry also store the positions of the terms in the description: give a higher rank if the words are closer together, if their order is the same as in the query, and if the position difference is the same as in the query (e.g. for a phrase: difference = 1).
    4. Stop words (the 100 most frequent words): allow stop words only within bi-grams.
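
    A small sketch of option 2 (CreateDictionaryEntry() as in the original post; actually storing a higher rank per bi-gram entry is an extension you would add yourself):

    // Add the single words and the bi-grams of a title as dictionary entries.
    static void AddTitleEntries(string title)
    {
        string[] words = title.ToLower().Split(' ');
        for (int i = 0; i < words.Length; i++)
        {
            CreateDictionaryEntry(words[i], "");
            // bi-gram of two consecutive words, to be given a higher priority
            if (i + 1 < words.Length)
                CreateDictionaryEntry(words[i] + " " + words[i + 1], "");
        }
    }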
