Codename Locust: A collaborative crawler swarm


While FAROO already contains a distributed crawler, each peer still crawls the web independently and without coordination.
The next generation is a swarm of collaborating crawlers.
They graze the web in the shortest possible time, with no overlap and no blind spots.

The new species exhibits the following behavior:

  • No overlap and no blind spots, even under churn.
  • Dynamic task sharing that scales to a growing number of users at low communication cost.
  • Complete crawling and re-crawling: Detects crawling completion and switches to re-crawl mode.
  • Politeness: low impact on websites by both the individual crawler and the swarm.
  • Low impact on the peer and workgroup.
  • Limited crawl queue size.
  • Exploiting geographic proximity.
  • Exploiting interest and linguistic proximity: what matters is not the absolute size of the index, but the overlap between user interest and indexed content.
  • Relevance-based crawl prioritization.
  • Spam reduction.
  • Spider-trap proof.
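
This post does not describe how the swarm splits the work, so the following is only an illustrative sketch of one well-known way to get "no overlap and no blind spots, even under churn": rendezvous (highest-random-weight) hashing. Every host is owned by exactly one peer, and when a peer leaves, only that peer's hosts are reassigned. The peer IDs, host names and function names are hypothetical and not part of FAROO.

```python
import hashlib
from typing import Dict, Iterable, Sequence


def score(peer_id: str, host: str) -> int:
    """Deterministic pseudo-random weight for a (peer, host) pair."""
    digest = hashlib.sha256(f"{peer_id}|{host}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(host: str, peers: Sequence[str]) -> str:
    """Return the single peer responsible for crawling this host.

    Every peer that knows the same peer list computes the same answer
    locally, so no per-host messages need to be exchanged.
    """
    if not peers:
        raise ValueError("no peers available")
    return max(peers, key=lambda peer_id: score(peer_id, host))


def assignment(hosts: Iterable[str], peers: Sequence[str]) -> Dict[str, str]:
    """Map every host to exactly one peer (no overlap, no blind spot)."""
    return {host: owner(host, peers) for host in hosts}


if __name__ == "__main__":
    hosts = [f"site{i}.example" for i in range(10)]  # hypothetical hosts
    peers = ["peer-a", "peer-b", "peer-c"]           # hypothetical peer IDs

    before = assignment(hosts, peers)
    # Simulate churn: peer-b goes offline.
    after = assignment(hosts, [p for p in peers if p != "peer-b"])

    moved = [h for h in hosts if before[h] != after[h]]
    print("hosts reassigned after peer-b left:", moved)
```

Assigning whole hosts rather than single URLs to one peer also makes it easier to keep the swarm polite, because only one crawler talks to a given site at a time. This scheme assumes peers agree on the current peer list; how that agreement would be reached at low communication cost is outside the sketch.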

A small swarm is already harvesting the green leaves of the web. With the next release we will set the rest of the breed free.


4 thoughts on “Codename Locust: A collaborative crawler swarm”

  1. Pingback: Goodbye 2008, Welcome 2009! « FAROO Blog

  2. Wolf, I am very interested in any moves you make related to real-time search (I mean “immediate” in layman’s terms, not strict sub-second real time in tech terms). I am very interested in how Twitter reveals stuff that Google totally misses. Can your engine key into the Twitter stream? Bernard

  3. @Bernard:

    Given a significant number of users, we can index almost everything instantly, including Twitter streams. This is because the p2p client instantly indexes every web page a user visits.
    As soon as somebody visits a Twitter page in their browser, it is indexed and included in search results.
    And the distributed crawler swarm mentioned in this post will keep the Twitter streams updated in the index once they are discovered.

    So our p2p crawling in principle also means real-time search.
    But of course, this does not yet work efficiently, simply because our user base is too small. This is why we will focus on distribution this year.

    I read your post on real time search with great interest.
    News search shows a specific phenomenon: at the same pace as search engines index the news, people lose interest in it.
    Or, put differently, once the news is indexed it is no longer news (there’s nothing older than yesterday’s newspaper).
    The integration of social media shifts the crossing point of these two curves and may improve news search significantly.
    Especially compelling is the use of Twitter for “breaking news” reporting, discovery and ranking in combination with a traditional (or p2p) search engine.

  4. Pingback: FAROO - Real-time Social Discovery & Search « FAROO Blog
