FAROO has now implemented robust Near Duplicate Detection.
This promotes original content and diversity in results, while filtering out scraped or syndicated duplicates.
We had to solve two challenges:
- a robust algorithm that identifies duplicates even if they appear within a different template, with a different menu, header or footer.
- doing it at web scale, i.e. every single new web page needs to be compared to the whole corpus of existing web pages.
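This post does not say which algorithm FAROO uses. As a rough illustration of how both challenges are commonly approached, here is a minimal sketch based on word shingling and a SimHash-style fingerprint, a well-known technique that is not necessarily FAROO's. The function names (`shingles`, `simhash`, `hamming`) and the parameters (3-word shingles, 64-bit fingerprints) are assumptions made for this example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts yield a single shingle, or none if the text is empty.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, bits=64):
    """Compute a SimHash-style fingerprint: texts sharing most shingles get similar bits."""
    counts = [0] * bits
    for sh in shingles(text):
        # Hash each shingle to a 64-bit integer (blake2b with an 8-byte digest).
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would then be treated as near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. Because the fingerprint is dominated by the shingles of the main content, a different menu, header or footer tends to flip only a few bits. And because each page is reduced to one small integer, a new page can be checked against an index of fingerprints instead of being compared with the full text of every existing page.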
The screenshots below illustrate the effect of Near Duplicate Detection. See the difference between Google and FAROO results:
FAROO – with Near Duplicate Detection
Google – without Near Duplicate Detection