The limits of tweet-based web search…

…and how to overcome them by using the implicit web.

Many of the recent real-time search engines are based on Twitter. They use the URLs included in tweets to discover and rank new and popular pages.
It is worth taking a closer look at the numbers behind this foundation, to explore the feasibility and limits of the approach.

Recently there has been an interesting visualization of Twitter stats. It essentially shows that, as with other social services, only a small fraction of the users actively contribute. This lack of representativeness may even call promising ideas like the “wisdom of crowds” into question.

But there is another factor: even those people who do contribute publish only a small fraction of the information they know.

Both factors account for the huge difference in efficiency between implicit and explicit voting. Explicit voting requires users to actively express their interest, e.g. by tweeting a link. Implicit voting requires no extra user action: a user visiting a web page already counts as a vote.

A short calculation:

Twitter now has 44.5 million users and produces about 20,000 tweets per minute. If every second tweet contains a URL, this amounts to 10,000 URLs per minute.

According to Nielsen, the number of web pages visited per person per month is 1,591.

The 44.5 million users thus visit about 1.6 million web pages per minute (44.5 million × 1,591 pages, spread over the roughly 43,200 minutes of a month), while explicitly voting for only 10,000 per minute.

Implicit voting and discovery thus provide about 160 times more attention data than explicit voting.

This means that 280,000 users with implicit voting could provide the same amount of information as 44.5 million users with explicit voting. Or that implicit discovery finds as many web pages in one day as explicit discovery does in about half a year.
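
The short calculation above can be reproduced with a few lines of Python. This is only a sketch: the input figures are the ones quoted in the text, while the 30-day month and all variable names are assumptions made for illustration.

```python
# Figures quoted in the text
TWITTER_USERS = 44_500_000
TWEETS_PER_MINUTE = 20_000
URL_SHARE = 0.5                    # "every second tweet contains a URL"
PAGES_PER_USER_PER_MONTH = 1_591   # Nielsen figure quoted in the text

# Assumption: a 30-day month
MINUTES_PER_MONTH = 30 * 24 * 60

# Explicit votes: URLs tweeted per minute
explicit_votes_per_minute = TWEETS_PER_MINUTE * URL_SHARE

# Implicit votes: pages visited per minute by the same user base
implicit_votes_per_minute = (
    TWITTER_USERS * PAGES_PER_USER_PER_MONTH / MINUTES_PER_MONTH
)

# How much more attention data implicit voting yields
ratio = implicit_votes_per_minute / explicit_votes_per_minute

# Users needed to match Twitter's explicit votes with implicit votes alone
users_needed = TWITTER_USERS / ratio

print(f"explicit votes per minute: {explicit_votes_per_minute:,.0f}")
print(f"implicit votes per minute: {implicit_votes_per_minute:,.0f}")
print(f"implicit / explicit ratio: {ratio:.0f}")
print(f"users needed with implicit voting: {users_needed:,.0f}")
```

Without rounding, the ratio comes out slightly above the 160 used in the text, so the number of users needed lands slightly below 280,000. The order of magnitude is the same.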

This shows dramatically the limits of a web search based solely on explicit votes and mentions, and the potential that can be leveraged by using the implicit web.

Beyond the mainstream
This becomes even more important if we look beyond mainstream topics or the English language.
Then it is simply impossible to reach the critical mass of explicit votes needed for a statistically significant attention-based ranking or popularity-based discovery.

Time and Votes are precious
Time is also a crucial factor, especially for real-time search.
We want to discover a new page as soon as possible. And we want to assess almost instantly how popular this new page is becoming.
If we fail to produce a reliable ranking within a short time, the page will be buried in the steady stream of insignificant noise.
But both goals conflict with the fact that the number of votes is proportional to the observation time. For new pages, the small number of explicit votes is not sufficiently representative to provide a reliable ranking.

Again, the much higher frequency of implicit votes helps us.

Relevance vs. Equality
But we can also improve on explicit votes. We just should not treat them as equal, because they are not.
We trust some of them more than others, and with some voters we share more common interests than with others. This is the very same reason why we follow some people and not others.
This helps us to get more value and meaning out of the very first vote.
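
The idea of unequal votes can be illustrated with a minimal sketch. It is not FAROO's actual ranking: the function `rank_urls`, the trust weights, the default weight for unknown voters, and the example URLs are all hypothetical and chosen only to show the principle.

```python
from collections import defaultdict


def rank_urls(votes, trust, default_trust=0.1):
    """Rank URLs by the summed trust of the users who voted for them.

    votes: iterable of (user, url) pairs, one per explicit vote
    trust: dict mapping user -> weight (hypothetical, e.g. derived
           from whom we follow or share interests with)
    default_trust: assumed weight for voters we know nothing about
    """
    scores = defaultdict(float)
    for user, url in votes:
        scores[url] += trust.get(user, default_trust)
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: one vote from a trusted user
# versus two votes from unknown users
votes = [
    ("alice", "http://example.com/a"),
    ("bob", "http://example.com/b"),
    ("carol", "http://example.com/b"),
]
trust = {"alice": 1.0}

ranking = rank_urls(votes, trust)
for url, score in ranking:
    print(f"{score:.1f}  {url}")
```

With these assumed weights, the single vote from a trusted user outranks two votes from unknown users, so a page can be ranked meaningfully from its very first vote.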

FAROO is heading in this direction by combining real-time search with a peer-to-peer infrastructure.

A holistic approach
The discovery of topical, fresh and novel information has always been an important aspect of search. But the perception of what counts as recent has changed dramatically with the popularity of services like Twitter, and this has led to real-time search engines.

Real-time search shouldn’t be separated, but should be part of a unified and distributed web search approach.

The era of purely document-centered search is over. Users and conversations play an equally important role, both as targets of search and as contributors to discovery and ranking, and this should be reflected in an adequate infrastructure.

A distributed infrastructure
As long as both the sources and the recipients of information are distributed, the natural design for search is distributed. P2P provides an efficient alternative to the ubiquitous concentration and centralization in search.

A peer-to-peer client allows implicit discovery and attention ranking of every visited web page. This is important, as the majority of pages, in real-time search too, belong to the long tail. They appear once or not at all in the Twitter stream, and can’t be discovered and ranked through explicit voting.

In real-time search the amount of index data is limited, because only recent documents with high attention and reputation need to be indexed. This allows a centralized infrastructure at moderate cost. But as soon as search moves beyond the short head of real-time search and aims to fully index the long tail of the whole web, a distributed peer-to-peer architecture provides a huge cost advantage.

Edit
There is an interesting reaction from the TRENDPRENEUR blog that explores the topic further: Link voting: real-time respect
