Near Duplicate Detection


FAROO has now implemented robust Near Duplicate Detection.
It promotes original content and diversity in the results, while filtering out scraped or syndicated duplicates.

We had to solve two challenges:

  1. Robustness: the algorithm has to identify duplicates even when they appear within a different template, with a different menu, header or footer.
  2. Web scale: every single new web page needs to be compared to the whole corpus of existing web pages.
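
FAROO's actual algorithm is not described here. As a rough illustration of how these two challenges are commonly approached, the sketch below uses word shingles with MinHash signatures (so a shared article body dominates over a differing menu or footer) and a banded LSH index (so a new page is only compared against a few candidate pages, not the whole corpus). All names (`shingles`, `minhash`, `LSHIndex`), the parameters (3-word shingles, 64 hashes, 16 bands of 4 rows) and the sample pages are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash64(shingle, seed):
    """Seeded 64-bit hash of a shingle (BLAKE2b with the seed as salt)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set, num_hashes=64):
    """MinHash signature: per seeded hash, the minimum value over all shingles."""
    return [min(_hash64(s, seed) for s in shingle_set) for seed in range(num_hashes)]


class LSHIndex:
    """Banded LSH index: pages sharing at least one band become candidates."""

    def __init__(self, bands=16, rows=4):
        self.bands = bands
        self.rows = rows
        self.buckets = defaultdict(set)

    def _keys(self, signature):
        for b in range(self.bands):
            yield (b, tuple(signature[b * self.rows:(b + 1) * self.rows]))

    def add(self, doc_id, signature):
        for key in self._keys(signature):
            self.buckets[key].add(doc_id)

    def candidates(self, signature):
        found = set()
        for key in self._keys(signature):
            found |= self.buckets.get(key, set())
        return found


# Hypothetical pages: the same article body inside two different templates,
# plus one unrelated page.
body = " ".join(f"word{i}" for i in range(200))
page_a = "Home News Contact " + body + " Copyright SiteA"
page_b = "Menu Blog About " + body + " Imprint SiteB"
other = " ".join(f"term{i}" for i in range(200))

index = LSHIndex()
index.add("a", minhash(shingles(page_a)))
index.add("other", minhash(shingles(other)))

# Look up a newly crawled page: only candidates need an exact comparison.
near_duplicates = index.candidates(minhash(shingles(page_b)))
similarity = jaccard(shingles(page_a), shingles(page_b))
```

In this toy example the two templated pages share 198 of 208 distinct shingles, so their exact Jaccard similarity is about 0.95, while the unrelated page shares none. A real system would likely also strip boilerplate before shingling and keep the index in distributed storage; neither is shown here.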
 

The screenshots below illustrate the effect of Near Duplicate Detection. See the difference between Google and FAROO results:

FAROO – with Near Duplicate Detection


 

Google – without Near Duplicate Detection

