Near Duplicate Detection

FAROO has now implemented robust Near Duplicate Detection.
This promotes original content and diversity in results, while filtering out scraped or syndicated duplicates.

We had to solve two challenges:

  1. Robustness: the algorithm has to identify duplicates even when they appear within a different template, with a different menu, header or footer.
  2. Web scale: every single new web page has to be compared against the whole corpus of existing web pages.
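
This post does not say which algorithm FAROO uses, so the following is only an illustrative sketch of one well-known approach, SimHash fingerprints combined with a block index, and not a description of FAROO's implementation. All names (`shingles`, `simhash`, `FingerprintIndex`) and parameters (3-word shingles, 64-bit fingerprints, a threshold of 3 differing bits) are assumptions chosen for the example. The sketch also assumes that the main text has already been separated from the template (menu, header, footer), which is a separate problem it does not address.

```python
import hashlib
import re
from collections import defaultdict

BITS = 64  # fingerprint width; matches the 8-byte hash used below


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text):
    """Compute a 64-bit SimHash fingerprint: similar texts share most bits."""
    counts = [0] * BITS
    for shingle in shingles(text):
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintIndex:
    """Finds stored fingerprints within max_distance bits of a query.

    The fingerprint is cut into max_distance + 1 blocks. Two fingerprints
    that differ in at most max_distance bits must agree exactly on at
    least one block, so only documents sharing a block are compared,
    instead of the whole corpus.
    """

    def __init__(self, max_distance=3):
        self.max_distance = max_distance
        self.blocks = max_distance + 1
        self.block_bits = BITS // self.blocks
        self.tables = [defaultdict(list) for _ in range(self.blocks)]

    def _keys(self, fp):
        mask = (1 << self.block_bits) - 1
        return [(fp >> (i * self.block_bits)) & mask for i in range(self.blocks)]

    def add(self, doc_id, fp):
        for table, key in zip(self.tables, self._keys(fp)):
            table[key].append((doc_id, fp))

    def find(self, fp):
        matches = set()
        for table, key in zip(self.tables, self._keys(fp)):
            for doc_id, other in table.get(key, ()):
                if hamming(fp, other) <= self.max_distance:
                    matches.add(doc_id)
        return sorted(matches)
```

In such a setup, a newly crawled page would be fingerprinted once, looked up with `find`, and added with `add` only if no near duplicate is returned. The threshold of 3 bits is a placeholder; a real system would tune it against labelled duplicate pairs.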

The screenshots below illustrate the effect of Near Duplicate Detection. Compare the FAROO results with the Google results:

FAROO – with Near Duplicate Detection


Google – without Near Duplicate Detection
