Revisited: Deriving crawler start points from visited pages by monitoring HTTP traffic

User-Driven Crawling

Yesterday Charles Knight of AltSearchEngines pointed me to an interesting article at BNET, “Cisco Files to Patent to Enter the Search Engine Business”.

The title of the patent application (no. 20090313241) is “Seeding search engine crawlers using intercepted network traffic”.

That caught my eye, as it describes pretty much the same idea that FAROO has already been using for some years.

Two years ago, in our blog post “New active, community directed crawler”, we already outlined how our “crawler start points are derived from visited pages”.

Since the initial FAROO release in 2005, we have also been using HTTP monitoring to detect the URLs of visited web pages, by intercepting the TCP network traffic with raw sockets.
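
For illustration only, here is a minimal Python sketch of that idea; it is not FAROO's actual client code, and it assumes Linux, root privileges, plain unencrypted HTTP on port 80 and requests that fit into a single TCP segment (no stream reassembly):

```python
import socket
import struct

ETH_P_IP = 0x0800  # EtherType for IPv4


def extract_url(payload: bytes):
    """Pull the visited URL out of a plain-text HTTP request payload."""
    lines = payload.decode("ascii", errors="ignore").split("\r\n")
    if not lines or not lines[0].startswith(("GET ", "POST ")):
        return None
    parts = lines[0].split(" ")
    if len(parts) < 2:
        return None
    path = parts[1]
    host = next((l[5:].strip() for l in lines if l.lower().startswith("host:")), None)
    return "http://" + host + path if host else None


def sniff():
    # AF_PACKET raw socket (Linux only, requires root): receives both incoming
    # and outgoing frames, so the browser's own HTTP requests are visible.
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_IP))
    while True:
        frame, _ = sock.recvfrom(65535)
        ip = frame[14:]                           # skip 14-byte Ethernet header
        if len(ip) < 20 or ip[9] != 6:            # IP protocol 6 = TCP
            continue
        ihl = (ip[0] & 0x0F) * 4                  # IP header length in bytes
        dst_port = struct.unpack("!H", ip[ihl + 2:ihl + 4])[0]
        if dst_port != 80:                        # plain, unencrypted HTTP only
            continue
        data_offset = (ip[ihl + 12] >> 4) * 4     # TCP header length in bytes
        url = extract_url(ip[ihl + data_offset:])
        if url:
            print("visited:", url)                # hand over to the crawler as a start point


if __name__ == "__main__":
    sniff()
```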

Instant crawling of all visited web pages and the links they contain has been part of FAROO since the same time.

In 2007 this was even the subject of research in the diploma thesis “Analysis of the growth of a decentralized Peer-to-Peer search engine index” by Britta Jerichow at Cologne University of Applied Sciences. Although both the crawler and the index architecture have since been improved substantially, the thesis already validated, both theoretically and experimentally, the principal feasibility of our approach.

As early as 2001, in a publication (in German), I outlined the idea of a distributed peer-to-peer search engine in which the users, as the source of the web's growing content, also ensure its findability, including fully automated content ranking by the users.

Application Fields:

Deriving crawler start points from visited pages is not only important for discovering and crawling blind spots in the web. Those blind spots are formed by web pages that are not connected to the rest of the web and thus can't be found just by traversing links.

But there are four much more important application fields for user-driven crawling:

  • First is real-time search. Even for the big incumbents in the search engine market, it is impossible to crawl the whole web (100 billion pages?) within minutes in order to discover new content in a timely manner (a billion new pages per day). Only if the crawler is selectively directed to newly created pages does web-scale real-time search become feasible and efficient, instead of looking for the needle in the haystack.

    By aggregating and analyzing all web pages visited by our users for discovery and implicit voting, we utilize the “wisdom of crowds”.
    Our users are our scouts: they bring in their collective intelligence and point the crawler to where new pages emerge (see the sketch after this list).

    We published this back in 2007 in the AltSearchEngines Great Debate: Peer-to-Peer (P2P) Search: “FAROO also uses user powered crawling. Pages which are changing often like, for example, news, are visited frequently by users. And with FAROO they are therefore also re-indexed more often. So the FAROO users implicitly control the distributed crawler in a way that frequently changing pages are kept fresh in the distributed index, while preventing unnecessary traffic on rather static pages.”

  • Second is attention-based ranking, used by FAROO since 2005. Meanwhile, many Twitter-based real-time search engines also rank their results according to the number of votes or mentions of a URL.
    It has proved to be an efficient ranking method for real-time search, superior to link analysis, since there are no incoming links yet when the content is created.
    While most real-time search engines use explicit voting, we showed in our blog post “The limits of tweet based web search” that implicit voting by analyzing visited web pages is much more effective (also illustrated in the sketch below).
  • Third is indexing the deep web (sometimes also referred to as the hidden web). It consists of web pages that are created solely on demand from a database, e.g. when a user searches for a specific product or service. Because there are no incoming links from the web, those pages can't be discovered and crawled by normal search engines, although they are starting to work on alternative ways to index the hidden web, which is much bigger than the visible web.
  • Fourth is personalization and behaviourally targeted online advertising, based on click streams identified from network traffic. This technique got some buzz when it was tested in the UK by Phorm.
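
To make the first two application fields more concrete, here is a minimal sketch of how URLs reported by users could feed a crawl frontier and at the same time accumulate implicit votes for attention-based ranking. The class and method names (UserDrivenCrawler, report_visit, attention_score) are illustrative assumptions, not FAROO's actual data structures:

```python
import heapq
import time
from collections import Counter


class UserDrivenCrawler:
    """Illustrative sketch: visited URLs reported by clients both seed the
    crawl frontier and accumulate implicit votes used as an attention signal."""

    def __init__(self):
        self.votes = Counter()   # implicit votes: how often a URL was visited
        self.frontier = []       # max-heap of (-priority, url) for the crawler

    def report_visit(self, url: str):
        self.votes[url] += 1
        # Visit count is the main priority, recency a small tie-breaker, so
        # frequently changing pages such as news are re-crawled first and stay
        # fresh, while rather static pages cause little unnecessary traffic.
        # (Duplicate frontier entries are tolerated in this sketch.)
        priority = self.votes[url] + time.time() * 1e-9
        heapq.heappush(self.frontier, (-priority, url))

    def next_to_crawl(self):
        # Next start point for the crawler, highest priority first.
        return heapq.heappop(self.frontier)[1] if self.frontier else None

    def attention_score(self, url: str) -> int:
        # Ranking signal for real-time search: number of implicit votes.
        return self.votes[url]


if __name__ == "__main__":
    crawler = UserDrivenCrawler()
    for visited in ["http://news.example.com/",
                    "http://blog.example.com/post",
                    "http://news.example.com/"]:
        crawler.report_visit(visited)
    print(crawler.next_to_crawl())                               # http://news.example.com/
    print(crawler.attention_score("http://news.example.com/"))   # 2
```

In a real deployment the frontier and the vote counts would of course not live in memory on a single node, but be distributed over the peer-to-peer network, just like the index itself.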

Beyond search, of course, there is an even wider application field of prioritizing / routing / throttling / blocking / intercepting / logging traffic and users depending on the monitored URL, both by ISPs and other authorities.

Conclusion:

All in all, this looks to me like there is some evidence of prior art.

Well, this is not the first time that somebody has come up with an idea that FAROO was already using some time before. See also “BrowseRank? Welcome to the club!”. And given the level of innovation in our distributed search engine architecture, which breaks with almost every legacy paradigm, this will probably not be the last time.

That’s why we publish our ideas early, even if they sometimes inspire our competition ;-) This prevents smart solutions from being locked up by the patents of big companies, which are the only ones with enough resources to patent every idea that comes to their mind.

In contrast to every web page discovered, which makes the web more accessible, every patent issued locks up one more piece of the ingenuity of our human species.
