FAROO 3.0 with Next Generation Protocol

It has been rather quiet on the blog recently, but for a very good reason: all hands were needed on deck to push forward our most important project so far. Over the last months we did a full redesign of our p2p protocol and storage layer, all aimed at higher speed and better efficiency.

Earlier this week we finished the implementation and rolled out the update. It was a great moment to see everything working smoothly on the new protocol.

All our performance predictions have come true, and the search latency is truly unique for such a massively distributed and highly dynamic p2p search grid.
The search speed increased tenfold. The storage efficiency doubled and the storage speed increased tenfold as well. The message size and bandwidth requirements have been cut in half.

Now we are reaching a latency below one second, even if the search results are provided from the other side of the planet, from peers that have been selected in real time out of a million highly dynamic peers, fully distributed, without the support of any centralized entity.

That’s why we decided to shift into the next gear and give instant search a try. Search results now appear instantly while you are still typing your query.

To benefit from all these improvements you will need to upgrade to the new version. We are still putting on the final touches, but the new FAROO 3.0 release is imminent.

Six degrees of distribution in search

Crisis reveals character, and this is especially true for distributed systems. Everything beyond the standard case may lead to a crisis if not considered beforehand.


The network-wide scale adds a new dimension to everything, completely changing the perspective and putting many centralized approaches into question.

Joining peers, updates and recovery look different from a bird’s eye view than they do from the ground:

  • When the network grows, the bootstrapping algorithm needs to scale with it.
  • Even if the whole system fails and all peers want to reconnect at the same time, the system should be able to recover gracefully.
  • Every system needs to evolve over time, hence software distribution is required to work at large scale, perhaps frequently or immediately.

That’s why it is important to look at the scaling of all operational aspects, not only at the main search functionality. The weakest element defines the overall scalability and reliability of a system.

The benefits of a distributed architecture (such as low cost, high availability and autonomy) can only be fully realized if the operational side is fully distributed as well.

There should be no centralized element anywhere that can fail, be attacked or blocked as a single point of failure, or that simply does not scale. Not for crawling, not for indexing & search, not for ranking and discovery, not for bootstrap, and not for update.

Let’s have a closer look at those six degrees of distribution:

Distributed Crawling

Sometimes only the crawler is distributed, while the index and the search engine are still centralized. An example is the Grub distributed crawler, once used by Wikia Search of Wikipedia Founder Jimmy Wales.

A distributed crawler by itself provides only limited benefit. Transferring the crawled pages back to a central server doesn’t save much bandwidth compared to what the server would need to download the pages itself. Additionally there is overhead for reliably distributing the workload across the unreliable crawler elements.

The benefits of such a hybrid approach lie rather in applications beyond a search engine: only selected information is transferred back (like scraped email addresses), and the spider is harder for the webmaster to detect and block, as the load comes from different IPs.

Distributed crawling lives up to its promises only as part of a fully distributed search engine architecture, where the crawlers are not controlled by a single instance but crawl autonomously, led solely by the wisdom of the crowd of its users. Huge network-wide effects can be achieved by utilizing the geographic or contextual proximity between distributed index and crawler parts.

With FAROO’s user-powered crawling, pages that change often (e.g. news) are also re-indexed more often. So the FAROO users implicitly steer the distributed crawler in a way that keeps frequently changing pages fresh in the distributed index, while preventing unnecessary traffic on rather static pages.
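To make the idea concrete, here is a minimal sketch of how observed visit frequency could drive the re-crawl interval of a page. The constants and function names are illustrative assumptions, not FAROO’s actual implementation:

```python
import time
from collections import defaultdict

# Hypothetical sketch: observed user visits per URL drive the re-crawl interval.
# Frequently visited (and typically frequently changing) pages such as news get
# short intervals; rarely visited, rather static pages are re-crawled seldom.

MIN_INTERVAL = 10 * 60          # never re-crawl more often than every 10 minutes
MAX_INTERVAL = 7 * 24 * 3600    # re-crawl rarely visited pages at most once a week

visits = defaultdict(int)       # URL -> number of observed user visits
last_crawl = {}                 # URL -> timestamp of the last crawl

def record_visit(url):
    visits[url] += 1

def recrawl_interval(url):
    # The more attention a page gets, the shorter its re-crawl interval.
    return max(MIN_INTERVAL, MAX_INTERVAL / (1 + visits[url]))

def due_for_recrawl(url, now=None):
    now = now if now is not None else time.time()
    return now - last_crawl.get(url, 0.0) >= recrawl_interval(url)
```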

Distributed Discovery

Even for the big incumbents in the search engine market it is impossible to crawl the whole web (100 billion pages?) within minutes in order to discover new content in a timely manner (a billion new pages per day). Only if the crawler is selectively directed to the newly created pages does web scale real-time search become feasible and efficient, instead of looking for the needle in the haystack.

By aggregating and analyzing all web pages visited by our users for discovery, we utilize the “wisdom of crowds”. Our users are our scouts. They bring in their collective intelligence and steer the crawler to where new pages emerge. In addition to instantly indexing all visited web pages, our active, community directed crawler also derives its crawler start points from discovered pages.

Beyond real-time search this is also important for discovering and crawling blind spots in the web. Those blind spots are formed by web pages which are not connected to the rest of the web and thus can’t be found just by traversing links.

Distributed discovery also helps with indexing the deep web (sometimes also referred to as the hidden web). It consists of web pages that are created solely on demand from a database, when a user searches for a specific product or service. Because there are no incoming links from the web, those pages can’t be discovered and crawled by normal search engines, although they are starting to work on alternative ways to index the hidden web, which is much bigger than the visible web.

Distributed Index & Search

Storing web scale information is not so much of a problem. What is expensive are the huge data centers required for answering millions of queries in parallel. The resulting costs of billions of dollars can be avoided only with a fully decentralized search engine like FAROO.

Incumbents already envision 10 million servers. A distributed index scales naturally, as more users also provide the additional infrastructure required for their queries. It also benefits from the increase in hardware resources, doubling every two years according to Moore’s Law.

Recycling unused computer resources is also much more sustainable than building new giant data centers, which consume more energy than a whole city.

The indexes of all big search engines are distributed across hundreds of thousands of computers within huge data centers. But by distributing the search index to the edge of the network, where both users and content already reside, the data no longer have to travel back and forth to a central search instance, which is consequently eliminated. This not only prevents a single point of failure, but also combines the index distribution across multiple computers with leveraging the geographic proximity normally achieved by spreading multiple data centers across the globe.
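As a rough illustration of how a search term can be mapped to the peers responsible for its index partition, here is a consistent-hashing sketch. The ring layout and replication factor are assumptions chosen for illustration, not necessarily how FAROO’s overlay works:

```python
import hashlib
from bisect import bisect_left

# Illustrative DHT-style index partition: each peer owns a position on a hash
# ring, and a search term is stored on / routed to the peers that follow its
# hash on the ring. Replication to several successors tolerates churn.

def ring_hash(value):
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

class IndexRing:
    def __init__(self, peer_ids):
        self.ring = sorted((ring_hash(p), p) for p in peer_ids)
        self.keys = [k for k, _ in self.ring]

    def peers_for(self, term, replicas=3):
        """Return the peers that hold the index entry for `term`."""
        i = bisect_left(self.keys, ring_hash(term)) % len(self.ring)
        return [self.ring[(i + j) % len(self.ring)][1] for j in range(replicas)]

ring = IndexRing(f"peer-{n}" for n in range(50))
print(ring.peers_for("distributed search"))
```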

Last but not least, a distributed index is the only architecture where privacy is inherent in the system, as opposed to the policy based approaches of centralized search engines, where the privacy policy might be subject to change.

Zooming in from the macroscopic view, every distributed layer has its own challenges again. For the index, peers usually do not behave like they should: they are overloaded, there is user activity, the resource quota is exhausted, they are behind a NAT, their dynamic IP has changed, or they just quit.

Those challenges have been perfectly summarized in “The Eight Fallacies of Distributed Computing”. Yet going into all the details and our solutions would certainly go beyond the scope of this post.

Distributed Ranking

An additional benefit is a distributed attention based ranking, utilizing the wisdom of crowds. Monitoring the browsing habits of the users and aggregating those “implicit” votes across the whole web promises a more democratic and timely ranking (important for real-time search).

While most real-time search engines use explicit voting, we showed in our blog post “The limits of tweet based web search” that implicit voting by analyzing visited web pages is much more effective (by two orders of magnitude!).

This also eliminates the shortcomings of a Wikipedia-like approach, where content is contributed in a highly distributed way but the audit is still centralized. Implicit voting automatically involves everybody in a truly democratic ranking. The groups of adjudicators and users become identical and therefore pull together for optimum results.
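A minimal sketch of such attention based ranking, assuming every page visit counts as one implicit vote and older votes decay so that fresh attention dominates. The half-life value is an illustrative assumption:

```python
import time
from collections import defaultdict

# Sketch of attention based ranking from implicit votes: every observed page
# visit counts as one vote, and votes decay exponentially so that fresh
# attention dominates - important for real-time search.

HALF_LIFE = 6 * 3600            # assumed: attention halves every 6 hours
visit_log = defaultdict(list)   # URL -> timestamps of observed visits

def record_visit(url, ts=None):
    visit_log[url].append(ts if ts is not None else time.time())

def attention_score(url, now=None):
    now = now if now is not None else time.time()
    return sum(0.5 ** ((now - ts) / HALF_LIFE) for ts in visit_log[url])

def ranked(urls):
    return sorted(urls, key=attention_score, reverse=True)
```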

Distributed Bootstrap

The first time a new peer wants to connect to the p2p network, it has to contact known peers (super peers, root peers, bootstrap peers, rendezvous peers) to learn the addresses of the other peers. This is called the bootstrap process.

The addresses of the known peers are either shipped in a list together with the client software or they are loaded dynamically from web caches.

A new peer then stores the addresses of the peers it learned about from the super peer. The next time it can connect directly to those addresses, without contacting the super peer first.

But if a peer has been offline for some time, most of the addresses it stored become invalid, because they are dynamic IP addresses. If the peer fails to connect to the p2p network using the stored addresses, it starts the bootstrap process via the super peers again.

Scaling
During strong network growth many peers access the super peers in order to connect to the p2p network. The super peers then become the bottleneck and a single point of failure in an otherwise fully decentralized system. If super peers become overloaded, no new peers can join the system, which prevents further network growth.

Recovery
If the whole p2p network breaks down due to a web wide incident and all peers try to reconnect at the same time, this puts an extreme load on the super peers.
This would prevent a fast recovery, as peers would fail to connect but keep trying and causing additional load.
Those problems have been experienced in practice in the Skype network.

Security
Another issue is that the super peers make the whole p2p network vulnerable because of their centralized nature. Both blocking and observing the whole p2p network become possible just by blocking or observing the few super peer nodes.

FAROO is using a fully distributed bootstrap algorithm (a rough sketch follows the list below), which

  • eliminates the super peers as the last centralized element, bottleneck and single point of failure in an otherwise distributed system.
  • provides organic scaling for the bootstrap procedure as well.
  • ensures a fast recovery in case of a system wide incident.
  • makes the p2p network immune to the blocking or monitoring of super peers.
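FAROO has not published the details of this algorithm, but as a rough sketch of how bootstrapping without super peers can work, a peer might keep a persistent cache of addresses it has seen, try a random subset with backoff, and refresh the cache via peer exchange. Everything below is an assumption for illustration:

```python
import random
import time

# Hypothetical super-peer-free bootstrap: try random addresses from a locally
# persisted peer cache with exponential backoff (avoiding a reconnect storm
# after a network-wide outage), and refresh the cache by exchanging peer lists
# as soon as one connection succeeds.

def bootstrap(peer_cache, try_connect, exchange_peers, attempts=8):
    """peer_cache: list of known addresses; try_connect/exchange_peers: callbacks."""
    delay = 1.0
    for _ in range(attempts):
        for address in random.sample(peer_cache, min(16, len(peer_cache))):
            if try_connect(address):
                # Learn fresh addresses from the neighbour, so the next start
                # does not depend on any fixed, centralized entry point.
                peer_cache.extend(exchange_peers(address))
                return address
        time.sleep(delay)
        delay = min(delay * 2, 60.0)   # back off instead of hammering the network
    return None
```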

Distributed Update

The distributed system automatically becomes smarter just through the increasing relevance of the collected attention data.
But you may want to refine the underlying algorithms, improve the efficiency of the p2p overlay, extend the data model, or add new functions. And the example of Windows shows that it might be necessary to apply security patches network wide, frequently and immediately. Updating p2p clients requires a very efficient software update distribution.

10 million peers and a 5 MByte client would require distributing 50 terabytes for a full network update. Even over a 100 Mbit/s network connection a central update would take about 50 days, even if you manage to distribute the updates evenly over time.
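For reference, the arithmetic behind that estimate, using the round numbers from the text:

```python
peers = 10_000_000
client_mb = 5
total_tb = peers * client_mb / 1_000_000          # 50 TB for one full network update

link_mbit = 100                                   # central server uplink
seconds = total_tb * 1e12 * 8 / (link_mbit * 1e6)
print(total_tb, round(seconds / 86_400))          # 50 TB, roughly 46 days
```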

FAROO instead uses a distributed, cell-division-like update, where all peers pass on the DNA of a new version to each other within minutes. Of course there is some signature machinery to ensure the integrity of the network.
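The post leaves the mechanism open, but a minimal sketch of such an epidemic update could look as follows. The HMAC stands in for whatever real signature scheme is used, and the fan-out of 4 is an assumed parameter:

```python
import hashlib
import hmac

# Illustrative "cell division like" update: a peer only installs and forwards a
# version whose signature verifies, then passes it on to a handful of
# neighbours, so the update spreads exponentially through the network instead
# of radiating from one central update server.

FANOUT = 4

def verify(key, package, signature):
    expected = hmac.new(key, package, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def handle_update(key, package, signature, neighbours, install, send):
    if not verify(key, package, signature):   # reject tampered packages
        return
    install(package)                          # apply the update locally
    for peer in neighbours[:FANOUT]:          # then pass the "DNA" on
        send(peer, package, signature)
```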

Divide and Conquer

By consistently distributing every function we ensured true scalability of the whole system, while eliminating every single point of failure.
Our peers are not outposts of a centralized system, but rather part of a distributed cyborg (combining the power of users, algorithms & resources) living in the net.

This is a system which works on a quiet sunny day, but also on a stormy one. It would even be suitable for extreme mobile scenarios, where peers are scattered across a battlefield or carried by a rescue team.

The system recovers autonomously from a disaster. Even if there is no working central instance left, the surviving peers find each other and form a powerful, working distributed system again once they awake. If you have seen the Terminator reassembling after being run over by a truck, you get the idea 😉

In biology, organisms naturally deal with the rise and fall of their cells, which as simple elements form superior systems.
We believe that evolution works in search too, and that the future belongs to the multicellulars 😉

Revisited: Deriving crawler start points from visited pages by monitoring HTTP traffic

User Driven Crawling

Yesterday Charles Knight of AltSearchEngines pointed me at an interesting article at BNET “Cisco Files to Patent to Enter the Search Engine Business”.

The title of the mentioned patent application no. 20090313241 is “Seeding search engine crawlers using intercepted network traffic”.

That caught my eye, as it describes pretty much the same idea that FAROO has already been using for some years.

In our blog post “New active, community directed crawler” we already outlined two years ago how our “crawler start points are derived from visited pages”.

We have also been using HTTP monitoring to detect the URLs of visited web pages, by intercepting the TCP network traffic with raw sockets, since the initial FAROO release in 2005.
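As an illustration of the URL-extraction step only (the raw-socket capture itself is platform specific and omitted here), a sketch that reconstructs the visited URL from a captured HTTP request could look like this:

```python
import re

# Sketch of the URL extraction step of HTTP monitoring: given the payload of a
# captured TCP segment containing an HTTP request, reconstruct the visited URL
# from the request line and the Host header. Encrypted (HTTPS) traffic is out
# of scope for this kind of monitoring.

REQUEST_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n", re.MULTILINE)
HOST_HEADER = re.compile(rb"^Host: (\S+)\r\n", re.MULTILINE | re.IGNORECASE)

def extract_url(payload):
    request = REQUEST_LINE.search(payload)
    host = HOST_HEADER.search(payload)
    if request and host:
        return "http://" + host.group(1).decode() + request.group(1).decode()
    return None

print(extract_url(b"GET /blog/ HTTP/1.1\r\nHost: www.faroo.com\r\n\r\n"))
```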

Instant crawling of all visited web pages and their contained links has been part of FAROO since the same time.

In 2007 this was even the subject of research in the diploma thesis “Analysis of the growth of a decentralized Peer-to-Peer search engine index” by Britta Jerichow at Cologne University of Applied Sciences. Although both the crawler and index architecture have since been improved substantially, the paper already validated, both theoretically and experimentally, the principal feasibility of our approach.

As early as 2001, in a publication (in German), I outlined the idea of a distributed peer-to-peer search engine, in which the users, as the source of the growth of the web’s content, also assure its findability, including a fully automated content ranking by the users.

Application Fields:

Deriving crawler start points from visited pages is not only important for discovering and crawling blind spots in the web. Those blind spots are formed by web pages which are not connected to the rest of the web and thus can’t be found just by traversing links.

But there are four much more important application fields for user driven crawling:

  • First is real-time search. Even for the big incumbents in the search engine market it is impossible to crawl the whole web (100 billion pages?) within minutes in order to discover new content in a timely manner (a billion new pages per day). Only if the crawler is selectively directed to the newly created pages does web scale real-time search become feasible and efficient, instead of looking for the needle in the haystack.

    By aggregating and analyzing all web pages visited by our users for discovery and implicit voting, we utilize the “wisdom of crowds”.
    Our users are our scouts. They bring in their collective intelligence and steer the crawler to where new pages emerge.

    We published this back in 2007 at the AltSearchEngines Great Debate: Peer-to-Peer (P2P) Search: “FAROO also uses user powered crawling. Pages which are changing often like, for example, news, are visited frequently by users. And with FAROO they are therefore also re-indexed more often. So the FAROO users implicitly control the distributed crawler in a way that frequently changing pages are kept fresh in the distributed index, while preventing unnecessary traffic on rather static pages.”

  • The second is attention based ranking, used by FAROO since 2005. Meanwhile many Twitter based real-time search engines also rank their results according to the number of votes or mentions of a URL.
    It has proven to be an efficient ranking method for real-time search, superior to link analysis, as there are no incoming links yet when the content is created.
    While most real-time search engines use explicit voting, we showed in our blog post “The limits of tweet based web search” that implicit voting by analyzing visited web pages is much more effective.
  • Third is indexing the deep web (sometimes also referred to as the hidden web). It consists of web pages that are created solely on demand from a database, when a user searches for a specific product or service. Because there are no incoming links from the web, those pages can’t be discovered and crawled by normal search engines, although they are starting to work on alternative ways to index the hidden web, which is much bigger than the visible web.
  • Fourth is personalization and behaviourally targeted online advertising, based on click streams identified from network traffic. This technique got some buzz when it was tested in the UK by Phorm.

Beyond search, of course, there is an even wider application field of prioritizing / routing / throttling / blocking / intercepting / logging traffic and users depending on the monitored URL, both by ISPs and other authorities.

Conclusion:

After all, this looks to me as if there were some evidence of prior art.

Well, this is not the first time that somebody came across an idea which was already used some time before by FAROO. See also “BrowseRank? Welcome to the club!”. And given the level of innovation in our distributed search engine architecture, which breaks with almost every legacy paradigm, this will probably not be the last time.

That’s why we publish our ideas early, even if they sometimes inspire our competition 😉 This prevents smart solutions from being locked up by patents of big companies, which are the only ones with enough resources to patent every idea that comes to their mind.

While every web page discovered makes the web more accessible, every patent issued locks away one more piece of the ingenuity of our human species.

Real-time search – all or nothing at all

On the entry of incumbents into the real-time search space

The discovery of topical, fresh and novel information has always been an important aspect of search.
But the perception of what counts as recent, and the appreciation for the real-time aspect, have changed dramatically with the popularity of services like Twitter.

As a result a wave of Real-time Search engine startups emerged.

This was caused and supported by several factors:

  1. People really care about real-time information.
  2. The big players were hesitant to integrate real-time features.
  3. The free Twitter API provided even small startups with sufficient data from day one.
  4. The limited focus on recent and popular information kept infrastructure cost moderate, temporarily lowering the market entry barrier to search.

But it was obvious from the very beginning that a separate real-time search space would exist only for a limited time frame.
Now the incumbents have entered the arena and are closing the real-time gap, pushed by the sustained interest of users in real-time search.

This confirms the importance of real-time to search. It also makes real-time an essential part of search; search without that aspect will be considered incomplete.
But the same is true vice versa, and this poses a huge challenge to real-time search startups.
Only services with a unified and holistic vision of both real-time and general web search will persist in the long run, now that integrated solutions from strong brands exist.

Search is about user experience and scaling. So far scaling has been synonymous with cost. Google envisions 10 million servers. What about startups whose war chest is too small for such a brute force approach of copying the Internet to a central system?

A decentralized P2P architecture with its organic scaling characteristics is the adequate answer for a truly web scale approach to search.
People powered search, a smart distributed architecture combined with the wisdom of crowds, liberates search from the primacy of money. With FAROO there is an app for that 😉

The limits of tweet based web search…

…and how to overcome them by utilizing the implicit web.

Many of the recent Real-time search engines are based on Twitter. They use the URLs enclosed in tweets for discovery and ranking of new and popular pages.
It might be worth having a closer look at the quantitative structure of the underlying foundation, to explore the feasibility and limits of this approach.

Recently there has been an interesting visualization of Twitter stats. It essentially proves that, as with other social services, only a small fraction of the users actively contributes. This lack of representativeness may even put promising ideas like the “wisdom of crowds” into question.

But there is another factor: even those people who do contribute publish only a small fraction of the information they know.

Both factors make up the huge difference in efficiency between implicit and explicit voting. Explicit voting requires the user to actively express their interest, e.g. by tweeting a link. For implicit voting no extra user action is required – if a user visits a web page, this is already counted as a vote.

A short calculation:

Twitter now has 44.5 million users and provides about 20,000 tweets per minute. If every second tweet contains a URL, this would be 10,000 URLs per minute.

According to Nielsen the number of visited Web Pages per Person per Month is 1,591.

The 44.5 million users visit 1.6 million web pages per minute, while explicitly voting for only 10,000 per minute.

Implicit voting and discovery provide 160 times more attention data than explicit voting.

This means that 280,000 users with implicit voting could provide the same amount of information as 44.5 million users with explicit voting. Or that implicit discovery finds as many web pages in one day as explicit discovery does in half a year.
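The same calculation, written out with the numbers given above:

```python
users = 44_500_000
tweets_per_minute = 20_000
explicit_urls_per_minute = tweets_per_minute / 2     # every 2nd tweet carries a URL

pages_per_person_per_month = 1_591                   # Nielsen figure
minutes_per_month = 30 * 24 * 60
implicit_urls_per_minute = users * pages_per_person_per_month / minutes_per_month

ratio = implicit_urls_per_minute / explicit_urls_per_minute
print(round(implicit_urls_per_minute), round(ratio), round(users / ratio))
# about 1.6 million visited pages per minute, roughly 160 times more attention
# data, and parity already at about 280,000 users with implicit voting
```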

This drastically shows the limits of a web search which is based solely on explicit votes and mentions, and what potential can be leveraged by using the implicit web.

Beyond the mainstream
This becomes even more important if we look beyond mainstream topics or the English language.
Then it is simply impossible to achieve the critical mass of explicit votes needed for a statistically significant attention based ranking or popularity based discovery.

Time and Votes are precious
Time is also a crucial factor, especially for real time search.
We want to discover a new page as soon as possible. And we want to assess almost instantly how popular this new page becomes.
If we fail to obtain a reliable ranking within a short time, the page will be buried in the steady stream of insignificant noise.
But both goals conflict with the fact that the number of votes is proportional to the observation time. For new pages the small number of explicit votes is not sufficiently representative to provide a reliable ranking.

Again the much higher frequency of implicit votes helps us.

Relevance vs. Equality
But we can also improve on explicit votes. We just shouldn’t treat them all as equal – because they are not.
Some voters we trust more than others, and with some we share more common interests than with others – for the very same reason that we follow some people and not others.
This helps us get more value and meaning out of the very first vote.
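A minimal sketch of such non-equal votes, where each vote is weighted by how much the voter is trusted. How trust is derived (follows, shared interests, past accuracy) is left open, and the weights are purely illustrative:

```python
# Sketch of non-equal explicit votes: each vote is weighted by the trust placed
# in the voter, so a single vote from a trusted, topically close source already
# carries meaning. The default weight for unknown voters is an assumption.

DEFAULT_TRUST = 0.1

def weighted_score(voters, trust):
    """voters: ids that voted for a page; trust: voter id -> weight in [0, 1]."""
    return sum(trust.get(voter, DEFAULT_TRUST) for voter in voters)

trust = {"alice": 0.9, "bob": 0.4}
print(weighted_score(["alice", "bob", "stranger"], trust))   # 1.4 instead of 3 equal votes
```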

FAROO is going in this direction by combining real-time search with a peer-to-peer infrastructure.

A holistic approach
The discovery of topical, fresh and novel information has always been an important aspect of search. But the perception of what counts as recent has changed dramatically with the popularity of services like Twitter, and has led to real-time search engines.

Real-time search shouldn’t be separate, but part of a unified and distributed web search approach.

The era of pure document centered search is over. The equally important role of users and conversations, both as targets of search and as contributors to discovery and ranking, should be reflected in an adequate infrastructure.

A distributed infrastructure
As long as both the sources and the recipients of information are distributed, the natural design for search is distributed. P2P provides an efficient alternative to the ubiquitous concentration and centralization in search.

A peer-to-peer client allows the implicit discovery and attention ranking of every visited web page. This is important, as the majority of pages, also in real-time search, belong to the long tail. They appear once or not at all in the Twitter stream, and can’t be discovered and ranked through explicit voting.

In real time search the amount of index data is limited, because only recent documents with high attention and reputation need to be indexed. This allows a centralized infrastructure at moderate cost. But as soon as search moves beyond the short head of real time search and aims to fully index the long tail of the whole web, then a distributed peer-to-peer architecture provides a huge cost advantage.

Edit
There is an interesting reaction from the TRENDPRENEUR blog, which further explores the topic: Link voting: real-time respect

FAROO – Real-time Social Discovery & Search

The discovery of topical, fresh and novel information has always been an important aspect of search. Often recent events in sports, culture and economics are triggering the demand for more information.

But the perception of what counts as recent has changed dramatically with the popularity of services like Twitter.
Once an index was considered up to date if pages were re-indexed once a week; under the term “real-time search” documents are now expected to appear in search results within minutes of their creation.

There are two main challenges:

  • First, the discovery of relevant, changed documents, as the brute force approach of re-indexing the whole web every minute is not feasible.
  • Second, those documents need to be ranked right away when they appear. With the dramatically increased number of participants in content creation in social networks, blogging and micro-blogging, the amount of noise has increased as well. To make real-time search feasible, it is necessary to separate the relevant documents from this increased stream of noise. Traditional ranking methods based on links fail, as new documents naturally have no history and no record of incoming links. Ranking based on the absolute number of votes again penalizes new documents, which is the opposite of what we want for real-time search.

The answer to both challenges is a crowd sourced approach to search, where the users discover and rank new and relevant documents.

This sounds a lot like FAROO’s P2P architecture of instant, user driven crawling and attention based ranking (see also). And in fact all the required genes for real-time search have been inherent parts of FAROO’s P2P architecture long before real-time search became so ubiquitously popular.

Really utilizing the wisdom of crowds and delivering competitive results requires a large user base. But we will unleash the power of our approach right now by opening up in several ways:

  • First, with the introduction of attention connectors to other social services we are now able to leverage a much more representative base of attention data for discovery and ranking. We do deep link crawling for all discovered links and use the number of votes among other parameters for ranking.
  • And second, by providing browser based access to our real-time search service we are removing all installation hurdles and platform barriers. Our p2p client additionally offers enhanced privacy, personalized results and continuous search.

So, apart from social discovery and attention based ranking, how does FAROO differ from other real-time search services?

Social Noise Filter
We analyze the trust and reputation of the original source and of the recommending middleman, as well as the attention and popularity of the information among the final consumers, in order to separate the relevant documents from the constant real-time stream of noise.
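As an illustration only (the actual weighting is not disclosed), such a filter could combine the three signals into a single score, for example geometrically, so that a missing signal pulls the score down:

```python
# Illustrative social noise filter: combine the reputation of the original
# source, the trust in the recommending "middleman", and the measured attention
# among the final consumers into one relevance score. The geometric mean is an
# assumed combination rule; a document scores high only if all signals agree.

def noise_filter_score(source_reputation, middleman_trust, consumer_attention):
    """All inputs normalized to [0, 1]."""
    return (source_reputation * middleman_trust * consumer_attention) ** (1.0 / 3.0)

print(noise_filter_score(0.8, 0.6, 0.9))   # ≈ 0.76
```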

Social Tagging
There is nothing as powerful as the human brain for categorizing information. We again use the collective intelligence of the users and aggregate the tags from all users and all connected services for a specific document. Of course you can search for tags and use them as filters in the faceted search.

Rich Visual Preview
A picture is worth a thousand words. Whenever possible a teaser picture from the article is shown in front of the text summary, not just a thumbnail of the whole webpage.
The author is displayed if available, and can be used for filtering.

Sentiment Detection
It’s not just the pure news, but also the emotions that engage us and make information stand out. FAROO detects and visualizes which kinds of sentiment have been triggered in the conversation.

RSS and ATOM result feeds
You can subscribe to the result streams, applying any combination of the faceted search filters. So you can get notified and browse through the news in your preferred web or client based feed reader.

Multi Language support
Real-time search services are still dominated by English content. But meanwhile the country with the most internet users is China, and due to the long tail the vast majority of Internet users speak languages other than English. So language-indifferent voting, ranking and searching is certainly not appropriate. Multi-language search results come together with a localized user interface.

Faceted Search
Our faceted search lets you navigate a multi-dimensional information space by combining text search with a progressive narrowing of choices in each dimension. This helps to cope with the increasing flow of information by narrowing, drilling down, refining and filtering.
Faceted search also provides a simple statistical overview of the current and recent activities in different languages, sources and topics.

Architecture and Approach
But the most significant difference is that for us real-time search is just one part of a much broader, unified and distributed web search approach.

We believe that the era of document centered search is over. The equally important role of users and conversations, both as targets of search and as contributors to discovery and ranking, should be reflected in an adequate infrastructure.

As long as both the sources and the recipients of information are distributed, the natural design for search is distributed, despite the increasing tendencies to incapacitate the collective force of users by removing the distributed origins of the internet through cloud services and cloud based operating systems. P2P provides an efficient alternative to those concentration and centralization tendencies in search.

In the longer perspective, with a growing peer-to-peer user base, real-time search based on a client approach with implicit discovery and attention ranking is superior to explicit mentions, as every visited web page is covered. This is important, as the majority of links, also in real-time search, belong to the long tail. They appear once or not at all in the Twitter stream, and can’t be discovered and ranked by popularity through explicit voting.

In real time search the amount of index data is limited, because only recent documents with high attention and reputation need to be indexed. This allows a centralized infrastructure at moderate cost. But as soon as search moves beyond the short head of real time search and aims to fully index the long tail of the whole web, then our distributed peer-to-peer architecture provides a huge cost advantage.

Scaling & Market Entry Barrier

In web search we have three different types of scaling issues:

1. Search load grows with user number
P2P scales organically, as every additional user also provides additional infrastructure

2. With the growth of the internet, more documents need to be indexed (requiring more index space)
P2P scales, as the average hard disk size of the users grows, and the number of users who might provide disk space grows as well

3. With the growth of the internet, more documents need to be crawled in the same amount of time
P2P scales as the average bandwidth per user grows, and the number of users who might take part in crawling grows as well.
Additionally P2P users help to smarten up the crawling by discovering the most relevant and recently changed documents.

For market dominating incumbents, scaling in web search is not so much of a problem.
For now they solve it with money alone, derived from a quasi advertising monopoly and their giant existing user base. But this brute force approach of replicating the whole internet into one system doesn’t leave the Internet unchanged. It bears the danger that one day the original is replaced by its copy.

But for small companies the huge infrastructure costs pose an effective market entry barrier. Unlike other services, where the infrastructure requirements are proportional to the number of users, in web search you have to index the whole internet from the first user on in order to provide competitive search results.
This is where P2P comes in, effectively reducing the infrastructure costs and lowering the market entry barrier.

EDIT:
Try our beta at search.faroo.com or see the screencast: