Codename Locust: A collaborative crawler swarm

While FAROO already contains a distributed crawler, each peer still crawls the web independently, without coordination.
The next generation is a swarm of collaborating crawlers.
They graze the web in the shortest possible time, with no overlap and no blind spots.

The new species exhibits the following behavior:

  • No overlap and no blind spots, even under churn.
  • Dynamic task sharing that scales with a growing number of users at low communication cost.
  • Complete crawling and re-crawling: Detects crawling completion and switches to re-crawl mode.
  • Politeness: low impact on websites by both the individual crawler and the swarm.
  • Low impact on the peer and its workgroup.
  • Limited crawl queue size.
  • Exploiting geographic proximity.
  • Exploiting interest and linguistic proximity: what matters is not the absolute size of the index, but the overlap between user interest and indexed content.
  • Relevance-based crawl prioritization.
  • Spam reduction.
  • Spider trap proof.
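
To make the "no overlap, no blind spots, even under churn" goal concrete, here is a minimal sketch of one possible way to split the work. It uses rendezvous (highest-random-weight) hashing to give every host exactly one responsible peer. This illustrates the idea only and is not a description of how Locust works internally; the peer IDs and the function names `owner_of` and `assign` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def _score(peer_id: str, host: str) -> int:
    """Deterministic pseudo-random weight for a (peer, host) pair."""
    digest = hashlib.sha256(f"{peer_id}|{host}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_of(url: str, peers: list[str]) -> str:
    """Return the single peer responsible for the host of `url`.

    Every peer that knows the same peer list computes the same answer
    locally, so no messages are needed to agree on who crawls what.
    """
    if not peers:
        raise ValueError("swarm has no peers")
    host = urlsplit(url).hostname or ""
    # The peer with the highest weight for this host owns it.
    return max(peers, key=lambda peer: _score(peer, host))


def assign(urls: list[str], peers: list[str]) -> dict[str, list[str]]:
    """Split `urls` among `peers`; each URL lands with exactly one peer."""
    shares: dict[str, list[str]] = {peer: [] for peer in peers}
    for url in urls:
        shares[owner_of(url, peers)].append(url)
    return shares


if __name__ == "__main__":
    swarm = ["peer-a", "peer-b", "peer-c"]  # hypothetical peer IDs
    pages = [f"http://site{i}.example/index.html" for i in range(10)]
    for peer, share in assign(pages, swarm).items():
        print(peer, len(share))
```

In this sketch, when a peer leaves, only the hosts it owned move to other peers and every other assignment stays put, so churn creates neither duplicate work nor gaps. Partitioning by host rather than by URL also means each website is visited by a single peer at a time, which fits the politeness goal.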

A small swarm is already harvesting the green leaves from the web. With the next release we will set free the rest of the breed.