While FAROO already contains a distributed crawler, each peer still crawls the web independently and uncoordinated.
The next generation is a swarm of collaborating crawlers.
They graze the web in the shortest possible time, with no overlap and without leaving any blind spots.
The new species exhibits the following behavior:
- No overlap and no blind spots, even under churn.
- Dynamic task sharing for a growing number of users, with low communication cost.
- Complete crawling and re-crawling: Detects crawling completion and switches to re-crawl mode.
- Politeness: low impact on websites by both the individual crawler and the swarm.
- Low impact on the peer and its workgroup.
- Limited crawl queue size.
- Exploiting geographic proximity.
- Exploiting interest and linguistic proximity: what matters is not the absolute size of the index, but the overlap between user interest and indexed content.
- Relevance-based crawl prioritization.
- Spam reduction.
- Spider trap proof.
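
To illustrate the first property in the list (no overlap and no blind spots, even under churn), here is a minimal sketch of one well-known way such a partition can be achieved. It uses rendezvous (highest-random-weight) hashing to give every host exactly one owning peer. This is an assumption chosen for illustration, not a description of how FAROO actually coordinates its swarm; the names `score`, `owner` and `assignment`, as well as the peer and host identifiers, are hypothetical.

```python
import hashlib
from typing import Dict, Sequence


def score(peer_id: str, host: str) -> int:
    """Deterministic pseudo-random weight for a (peer, host) pair."""
    digest = hashlib.sha256(f"{peer_id}|{host}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(host: str, peers: Sequence[str]) -> str:
    """Return the single peer responsible for crawling `host`.

    Every peer that knows the same peer list computes the same owner,
    so no host is crawled twice (no overlap) and none is left out
    (no blind spot). Hypothetical helper, not FAROO's actual scheme.
    """
    if not peers:
        raise ValueError("no peers available")
    return max(peers, key=lambda p: score(p, host))


def assignment(hosts: Sequence[str], peers: Sequence[str]) -> Dict[str, str]:
    """Map every host to its owning peer."""
    return {h: owner(h, peers) for h in hosts}


if __name__ == "__main__":
    hosts = [f"site{i}.example" for i in range(10)]
    peers = ["peer-a", "peer-b", "peer-c"]
    before = assignment(hosts, peers)

    # Simulate churn: peer-b leaves the swarm.
    after = assignment(hosts, [p for p in peers if p != "peer-b"])

    # Only hosts that belonged to the departed peer change owner.
    moved = [h for h in hosts if before[h] != after[h]]
    print("reassigned after churn:", moved)
```

With this kind of scheme, a peer leaving or joining only affects the hosts it owned or will own; all other assignments stay put, which keeps the communication cost of rebalancing low. Assigning whole hosts rather than single URLs to one peer would also make per-site politeness easier to enforce, though that is again a design assumption of this sketch.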
A small swarm is already harvesting the green leaves of the web. With the next release, we will set the rest of the breed free.