Real-time search – all or nothing at all

On the entry of incumbents into the real-time search space

The discovery of topical, fresh and novel information has always been an important aspect of search.
But the perception of what “recent” means, and the appreciation for the real-time aspect, have changed dramatically with the popularity of services like Twitter.

As a result, a wave of Real-time Search engine startups emerged.

This was driven and supported by several factors:

  1. People really care about real-time information.
  2. The big players were hesitant to integrate real-time features.
  3. The free Twitter API provided even small startups with sufficient data from day one.
  4. The limited focus on recent and popular information kept infrastructure cost moderate, temporarily lowering the market entry barrier to search.

But it was obvious from the very beginning that a separate real-time search space would exist only for a limited time.
Now the incumbents have entered the arena and are closing the real-time gap, pushed by the sustained interest of users in real-time search.

This confirms the importance of real-time to search. It also makes real-time an essential part of search: search without that aspect will be considered incomplete.
But the same is true vice versa. And this poses a huge challenge to real-time search startups.
Only services with a unified and holistic vision of both real-time and general web search will persist in the long run, now that integrated solutions from strong brands exist.

Search is about user experience and scaling. So far, scaling has been synonymous with cost. Google envisions 10 million servers. What about startups whose war chest is too small for such a brute-force approach of copying the Internet into a central system?

A decentralized P2P architecture, with its organic scaling characteristics, is the adequate answer for a truly web-scale approach to search.
People-powered search, a smart distributed architecture combined with the wisdom of crowds, liberates search from the primacy of money. With FAROO there is an app for that 😉

The limits of tweet based web search…

…and how to overcome them by utilizing the implicit web.

Many of the recent Real-time Search engines are based on Twitter. They use the URLs enclosed in tweets for the discovery and ranking of new and popular pages.
It is worth having a closer look at the numbers behind this foundation to explore the feasibility and limits of the approach.

Recently there has been an interesting visualization of Twitter stats. It essentially shows that, as with other social services, only a small fraction of the users actively contributes. This lack of representativeness may even put promising ideas like the “wisdom of crowds” into question.

But there is another factor: even those people who do contribute publish only a small fraction of the information they know.

Both factors account for the huge difference in efficiency between implicit and explicit voting. Explicit voting requires the user to actively express interest, e.g. by tweeting a link. Implicit voting requires no extra user action: if a user visits a web page, this already counts as a vote.

A short calculation:

Twitter now has 44.5 million users and produces about 20,000 tweets per minute. If every second tweet contains a URL, this amounts to 10,000 URLs per minute.

According to Nielsen, the number of web pages visited per person per month is 1,591.

Those 44.5 million users therefore visit about 1.6 million web pages per minute, while explicitly voting for only 10,000 per minute.

Implicit voting and discovery thus provide about 160 times more attention data than explicit voting.

This means that 280,000 users with implicit voting could provide the same amount of information as 44.5 million users with explicit voting. Or that implicit discovery finds as many web pages in one day as explicit discovery does in half a year.

This shows drastically the limits of a web search based solely on explicit votes and mentions, and the potential that can be unlocked by utilizing the implicit web.
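The back-of-envelope arithmetic can be reproduced in a few lines (a minimal sketch; the figures are the Twitter and Nielsen numbers quoted above, and the 50% URL share is the assumption stated in the text):

```python
# Reproduce the back-of-envelope comparison of implicit vs. explicit voting.
users = 44_500_000             # Twitter users
tweets_per_minute = 20_000
url_share = 0.5                # assumption: every second tweet contains a URL

explicit_votes_per_minute = tweets_per_minute * url_share        # 10,000

pages_per_person_per_month = 1_591                               # Nielsen
minutes_per_month = 30 * 24 * 60                                 # 43,200

implicit_votes_per_minute = users * pages_per_person_per_month / minutes_per_month
print(f"{implicit_votes_per_minute:,.0f} implicit votes/min")    # ~1.6 million

ratio = implicit_votes_per_minute / explicit_votes_per_minute
print(f"{ratio:.0f}x more implicit than explicit attention data")  # ~160x

# Equivalently: users/ratio, roughly the 280,000 stated above, implicitly
# voting users match the attention data of all 44.5 million explicit voters.
print(f"{users / ratio:,.0f} implicit users equal the explicit crowd")
```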

Beyond the mainstream
This becomes even more important if we look beyond mainstream topics or the English language.
There it is simply impossible to achieve the critical mass of explicit votes needed for statistically significant attention-based ranking or popularity-based discovery.

Time and Votes are precious
Time is also a crucial factor, especially for real time search.
We want to discover a new page as soon as possible. And we want to assess almost instantly how popular this new page becomes.
If we fail to produce a reliable ranking within a short time, the page will be buried in the steady stream of insignificant noise.
But both goals conflict with the fact that the number of votes is proportional to the observation time. For new pages the small number of explicit votes is not representative enough to provide a reliable ranking.

Again, the much higher frequency of implicit votes helps us.

Relevance vs. Equality
But we can also improve on explicit votes. We just shouldn’t treat them as equal, because they are not.
Some of them we trust more than others, and with some voters we share more common interests than with others: the very same reason why we follow some people and not others.
This helps us to get more value and meaning out of the very first vote.
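A minimal sketch of what unequal votes could look like; the trust and affinity weights and their multiplicative combination are illustrative assumptions, not FAROO’s actual PeerRank formula:

```python
# Toy sketch: weight each vote by how much we trust the voter and how much
# interest we share with them, instead of counting all votes equally.
# Weights and combination are illustrative assumptions.

def weighted_score(voters, trust, affinity, default=0.1):
    """voters: ids of users who voted for a page.
    trust / affinity: user id -> weight in [0, 1]."""
    return sum(trust.get(u, default) * affinity.get(u, default) for u in voters)

trust    = {"alice": 0.9, "bob": 0.3}
affinity = {"alice": 0.8, "bob": 0.2}

# A single vote from a trusted, like-minded user outweighs several votes
# from unknown or distant sources:
print(weighted_score(["alice"], trust, affinity))            # 0.72
print(weighted_score(["bob", "u1", "u2"], trust, affinity))  # 0.08
```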

FAROO is going in this direction by combining real-time search with a peer-to-peer infrastructure.

A holistic approach
The discovery of topical, fresh and novel information has always been an important aspect of search. But the perception of what “recent” means has changed dramatically with the popularity of services like Twitter, and has led to Real-time Search engines.

Real-time search shouldn’t be separated, but part of a unified and distributed web search approach.

The era of pure document-centered search is over. The equally important role of users and conversation, both as a target of search and as contributors to discovery and ranking, should be reflected in an adequate infrastructure.

A distributed infrastructure
As long as both the sources and the recipients of information are distributed, the natural design for search is distributed. P2P provides an efficient alternative to the ubiquitous concentration and centralization in search.

A peer-to-peer client allows the implicit discovery and attention-based ranking of every visited web page. This is important, as the majority of pages, also in real-time search, belongs to the long tail. They appear once or not at all in the Twitter stream and can’t be discovered and ranked through explicit voting.

In real-time search the amount of index data is limited, because only recent documents with high attention and reputation need to be indexed. This allows a centralized infrastructure at moderate cost. But as soon as search moves beyond the short head of real-time search and aims to fully index the long tail of the whole web, a distributed peer-to-peer architecture provides a huge cost advantage.

Edit
There is an interesting reaction from the TRENDPRENEUR blog, which further explores the topic: Link voting: real-time respect

FAROO – Real-time Social Discovery & Search

The discovery of topical, fresh and novel information has always been an important aspect of search. Often recent events in sports, culture and economics are triggering the demand for more information.

But the perception of what “recent” means has changed dramatically with the popularity of services like Twitter.
Once an index was considered up to date if pages were re-indexed once a week; under the term “real-time search”, documents are now expected to appear in search results within minutes of their creation.

There are two main challenges:

  • First, the discovery of relevant, changed documents, since a brute-force approach of re-indexing the whole web every minute is not feasible.
  • Second, those documents need to be ranked right away when they appear. With the dramatically increased number of participants in content creation in social networks, blogging and micro-blogging, the amount of noise has increased as well. To make real-time search feasible, it’s necessary to separate the relevant documents from this increased stream of noise. Traditional ranking methods based on links fail, as new documents naturally have no history and no record of incoming links. Ranking based on the absolute number of votes again penalizes new documents, which is the opposite of what we want for real-time search (see the sketch after this list).
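One way around the new-document penalty, sketched here purely as an illustration rather than as our actual ranking formula, is to rank by vote rate instead of absolute vote count:

```python
# Sketch: rank by vote *rate* instead of absolute vote count, so that new
# documents are not penalized for their short history. Illustrative only.
import time

def vote_rate(votes, created_at, now=None):
    """Votes per hour since the document appeared."""
    age_hours = max(((now or time.time()) - created_at) / 3600, 0.1)
    return votes / age_hours

now = time.time()
# 30 votes within the last hour beat 200 votes spread over two days:
print(vote_rate(30, now - 1 * 3600, now))    # 30.0 votes/hour
print(vote_rate(200, now - 48 * 3600, now))  # ~4.2 votes/hour
```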

The answer to both challenges is a crowd-sourced approach to search, where the users discover and rank new and relevant documents.

This sounds very much like FAROO’s P2P architecture of instant, user-driven crawling and attention-based ranking (see also). And in fact all the required genes for real-time search have been inherent parts of FAROO’s P2P architecture, long before real-time search became so ubiquitously popular.

Really utilizing the wisdom of crowds and delivering competitive results requires a large user base. But we will unleash the power of our approach right now by opening up in several ways:

  • First, with the introduction of attention connectors to other social services, we are now able to leverage a much more representative base of attention data for discovery and ranking. We do deep-link crawling for all discovered links and use the number of votes, among other parameters, for ranking.
  • And second, by providing browser-based access to our real-time search service we are removing all installation hurdles and platform barriers. Our P2P client additionally offers enhanced privacy, personalized results and continuous search.

So, apart from Social Discovery and Attention Based Ranking, how does FAROO differ from other real-time search services?

Social Noise Filter
We analyze the trust and reputation of the original source and of the recommending middleman, as well as the attention and popularity of information among the final consumers, in order to separate the relevant documents from the constant real-time stream of noise.
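As an illustration only (the actual weighting is not published here), such a filter could combine the three signals into a single score:

```python
# Illustrative sketch of a social noise filter: combine the reputation of
# the original source, the trust of the recommending middleman and the
# attention of the final consumers into one score. The multiplicative
# combination and the threshold are assumptions for illustration.

def noise_filter(source_reputation, middleman_trust, consumer_attention,
                 threshold=0.2):
    score = source_reputation * middleman_trust * consumer_attention
    return score, score >= threshold

# A reputable source, relayed by a trusted account, read by many users:
print(noise_filter(0.9, 0.8, 0.7))   # (0.504, True)  -> relevant
# An unknown source amplified by a spammy account:
print(noise_filter(0.2, 0.1, 0.5))   # (0.01, False) -> filtered as noise
```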

Social Tagging
There is nothing as powerful as the human brain for categorizing information. We again use the collective intelligence of the users and aggregate the tags from all users and all connected services for a specific document. Of course, you are able to search for tags and use them as filters in the faceted search.
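A minimal sketch of the aggregation step, assuming tags arrive as one list per user or connected service:

```python
# Sketch: merge the tags that different users and connected services
# assigned to one document, so collective labels emerge.
from collections import Counter

def aggregate_tags(tag_lists):
    """tag_lists: one list of tags per user or connected service."""
    return Counter(tag for tags in tag_lists for tag in tags).most_common()

print(aggregate_tags([
    ["p2p", "search"],        # tags from user A
    ["search", "realtime"],   # tags from user B
    ["search", "p2p"],        # tags from a connected service
]))
# [('search', 3), ('p2p', 2), ('realtime', 1)]
```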

Rich Visual Preview
A picture is worth a thousand words. Whenever possible, a teaser picture from the article is shown in front of the text summary, not just a thumbnail of the whole web page.
The author is displayed if available and can be used for filtering.

Sentiment Detection
It’s not just the pure news, but also the emotions that involve us and make information stand out. FAROO detects and visualizes which kinds of sentiments have been triggered in the conversation.

RSS and ATOM result feeds
You can subscribe to the result streams, applying any combination of the faceted search filters. So you can get notified and browse through the news in your preferred web- or client-based feed reader.

Multi Language support
Real-time search services are still dominated by English content. But meanwhile the country with the most internet users is China, and due to the long tail the vast majority of Internet users speak languages other than English. So language-indifferent voting, ranking and searching is certainly not appropriate. Multi-language search results come together with a localized user interface.

Faceted Search
Our faceted search enables navigating a multi-dimensional information space by combining text search with a progressive narrowing of choices in each dimension. This helps to cope with the increasing flow of information through narrowing, drilling down, refining and filtering.
Faceted search also provides a simple statistical overview of the current and recent activities in different languages, sources and topics.
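A toy sketch of the underlying mechanics, with illustrative documents and facet fields (not FAROO’s actual schema), showing how each narrowing step also yields the per-facet statistics:

```python
# Toy faceted narrowing: combine a text query with facet filters and count
# the remaining values per facet for the statistical overview.
from collections import Counter

docs = [
    {"text": "p2p search scales", "lang": "en", "source": "blog"},
    {"text": "realtime p2p news", "lang": "en", "source": "twitter"},
    {"text": "wyszukiwanie p2p",  "lang": "pl", "source": "blog"},
]

def faceted_search(query, **filters):
    hits = [d for d in docs
            if query in d["text"]
            and all(d[k] == v for k, v in filters.items())]
    facets = {f: Counter(d[f] for d in hits) for f in ("lang", "source")}
    return hits, facets

hits, facets = faceted_search("p2p", lang="en")   # narrow to English
print(len(hits), facets["source"])                # 2 Counter({'blog': 1, 'twitter': 1})
```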

Architecture and Approach
But the most significant difference is that, for us, real-time search is just one part of a much broader, unified and distributed web search approach.

We believe that the era of document-centered search is over. The equally important role of users and conversation, both as a target of search and as contributors to discovery and ranking, should be reflected in an adequate infrastructure.

As long as both the sources and the recipients of information are distributed, the natural design for search is distributed, despite the increasing tendency to incapacitate the collective force of users by removing the distributed origins of the internet through cloud services and cloud-based operating systems. P2P provides an efficient alternative to these concentration and centralization tendencies in search.

In the longer perspective, with an increased peer-to-peer user base, real-time search based on a client approach with implicit discovery and attention ranking is superior to explicit mentions, as every visited web page is covered. This is important, as the majority of links, also in real-time search, belongs to the long tail. They appear once or not at all in the Twitter stream and can’t be discovered and ranked by popularity through explicit voting.

In real-time search the amount of index data is limited, because only recent documents with high attention and reputation need to be indexed. This allows a centralized infrastructure at moderate cost. But as soon as search moves beyond the short head of real-time search and aims to fully index the long tail of the whole web, our distributed peer-to-peer architecture provides a huge cost advantage.

Scaling & Market Entry Barrier

In web search we have three different types of scaling issues:

1. Search load grows with the number of users.
P2P scales organically, as every additional user also provides additional infrastructure (see the toy model after this list).

2. With the growth of the internet, more documents need to be indexed (requiring more index space).
P2P scales, as the average hard disk size of the users grows, and the number of users who might provide disk space grows as well.

3. With the growth of the internet, more documents need to be crawled in the same time.
P2P scales, as the average bandwidth per user grows, and the number of users who might take part in crawling grows as well.
Additionally, P2P users help to smarten up the crawling by discovering the most relevant and recently changed documents.
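A toy model of the organic scaling argument; all constants are illustrative assumptions:

```python
# Toy model: in a centralized design, server capacity must be bought to
# match user growth; in a P2P design every joining user brings capacity
# along with his load. Constants are illustrative assumptions.

QUERIES_PER_USER_PER_DAY = 10
QUERIES_PER_SERVER_PER_DAY = 100_000   # capacity of one central server

def central_servers_needed(users):
    # central infrastructure grows linearly with the user base
    return users * QUERIES_PER_USER_PER_DAY / QUERIES_PER_SERVER_PER_DAY

def p2p_capacity_surplus(users, capacity_per_peer=15):
    # each peer can serve more queries than it issues, so the surplus
    # grows with the same n that drives the load
    return users * (capacity_per_peer - QUERIES_PER_USER_PER_DAY)

for n in (10_000, 10_000_000):
    print(f"{n:>10} users: {central_servers_needed(n):>6.0f} central servers, "
          f"P2P surplus {p2p_capacity_surplus(n):,} queries/day")
```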

For market-dominating incumbents, scaling in web search is not much of a problem.
For now they solve it simply with money, derived from a quasi-monopoly on advertising and a giant existing user base. But this brute-force approach of replicating the whole internet into one system doesn’t leave the Internet unchanged. It bears the danger that one day the original is replaced by its copy.

But for small companies the huge infrastructure costs pose an effective market entry barrier. Unlike other services, where the infrastructure requirements are proportional to the number of users, web search requires indexing the whole internet from the first user on in order to provide competitive search results.
This is where P2P comes in, effectively reducing the infrastructure costs and lowering the market entry barrier.

EDIT:
Try our beta at search.faroo.com or see the screencast:

FAROO at TMT.Communities’09

Gosia of FAROO was a speaker and special guest at the TMT.Communities’09 conference in Warsaw, Poland.

The conference took place on July 18th at the Warsaw Stock Exchange in the Chamber of Listings, and was held under the motto “Generation C”.

Here a short excerpt from the conference web site:
“What is Generation C? It’s a group of people all over the world aged 15 to 45 choosing a digitally-enhanced lifestyle and thus empowering hardware, application and service providers but also grassroots organizations like creative commons or Piratbyrån. In the world of Generation C it’s all about content, communication and cooperation. And since the content is digital it doesn’t exist without a proper medium and your favorite device. It all adds up to a digital world where people are the most important component individually and form powerful and influential communities all together.”

We used the chance to evangelize the power of P2P search again 😉


Wyszukiwanie P2P – demokratyzacja wyszukiwania (Peer-to-peer Search – Democratic Search)


Warsaw Stock Exchange, Chamber of Listings

Lightweight Chinese Word Segmentation

If you are building an internet search engine, you know that there are a lot of different languages and character sets to consider. You might try to keep your algorithms language-independent, and with Unicode the character set problem seems to be solved.

But this is still not sufficient for some languages, whose internet population is meanwhile larger than that of the U.S.

In the Chinese language, words are not separated by white space.
But words are an important unit in information retrieval: many operations, such as indexing and search, are based on words. Therefore the word segmentation of unsegmented Chinese text is essential for a truly international search engine.

An additional challenge is building a lightweight word segmentation algorithm that can be integrated into a distributed P2P search client. While decent recall and precision are a prerequisite, small size and high speed are essential. The small size is required to keep the installation package of the P2P search client compact and the memory requirements low, while high speed is necessary for real-time crawling. Especially the required small size is a challenge, as many segmentation approaches are based on large dictionaries.

FAROO’s lightweight word segmentation algorithm handles full- and half-width as well as traditional and simplified characters.
Having your search terms in one character form, you still find all documents in the opposite character form.
This also applies to term highlighting. Of course, documents with mixed Latin and Chinese characters are also properly processed.
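A minimal sketch of how lightweight segmentation can work in principle: normalize character widths first, then apply dictionary-based forward maximum matching. The tiny dictionary is illustrative; FAROO’s actual algorithm is not published here:

```python
# Minimal sketch: fold full-width characters to half-width, then run
# forward maximum matching against a dictionary. The toy dictionary is
# illustrative; a real system would also map traditional to simplified
# forms, e.g. via a conversion table.

DICTIONARY = {"搜索", "引擎", "搜索引擎", "我们", "开发"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def normalize(text):
    """Fold full-width ASCII (U+FF01..U+FF5E) to half-width."""
    return "".join(chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E else c
                   for c in text)

def segment(text):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character."""
    text, words, i = normalize(text), [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in DICTIONARY:
                words.append(text[i:i + size])
                i += size
                break
    return words

print(segment("我们开发搜索引擎"))   # ['我们', '开发', '搜索引擎']
```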

The Sleeping Power

A plea for restoring end-to-end connectivity

When the internet was born, it was truly decentralized. The most natural, core function was that users could communicate directly with each other.

But then an unholy alliance of unfounded security fears, technology naysayers, and advocates of centralized technology & walled gardens degraded the Internet, virtually removing its end-to-end connectivity. Well, in theory you can still connect from one public IP to another, but in a world where almost every user is behind a router this has ceased to work.

You may say there is port forwarding, but it requires configuring the router manually, which is simply beyond the average user. Wait, there is UPnP, which lets the application configure the router automatically! Great, but this functionality is disabled by default in most routers. Enabling it manually requires configuring the router, which is beyond the average user. Here you go again! And then there are STUN, STUNT, TURN and ICE, more hole-punching hacks than standards, all operating in a gray area of specifications and differing router implementations, or again requiring auxiliary constructions in the form of additional, centralized traversal servers. But with IPv6 all will be better, right? In the IPv6 address space there are enough addresses for every atom on the surface of the Earth. But before this becomes effective, there are already proposals for IPv6-to-IPv6 NAT.
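For illustration, here is the core idea behind UDP hole punching, the technique underlying the STUN family, as a minimal sketch; the addresses are placeholders, and it assumes both peers have learned each other’s public endpoint from some rendezvous server:

```python
# Minimal sketch of UDP hole punching: both peers send packets to each
# other's public endpoint simultaneously, so each NAT creates an outbound
# mapping that lets the other side's packets in. Addresses are placeholders.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 40000))          # local port the NAT will map

peer = ("203.0.113.7", 40000)          # other peer's public endpoint,
                                       # assumed learned via a rendezvous server

sock.sendto(b"punch", peer)            # outbound packet opens our NAT mapping
sock.settimeout(5.0)
try:
    data, addr = sock.recvfrom(1024)   # succeeds once the peer punched too
    print("direct connection with", addr, data)
except socket.timeout:
    print("symmetric NAT or strict firewall: fall back to a relay (TURN)")
```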

It’s unbelievable, after 30 years, the Internet has almost completely lost its end-to-end connectivity.

Don’t let them fool you. Neither the limited address space in IPv4 nor security is a valid reason to remove end-to-end connectivity. As long as the operating system asks the user for confirmation whether to allow inbound access to this computer, for this specific application, everything should be fine.

Over time, people forgot about the decentralized origins of the Internet and got used to a centralized architecture. There the users connect to a centralized service provider and can communicate solely through this middleman, on whom they now depend and whom they have to pay in one way or another. The current move toward the cloud is only the next step into a fully centralized system, controlled by a few big players, manifesting monopolies, and imposing additional borders and taxes. Due to the lack of standards it removes the rest of the independence from users and small companies.

Well, of course, the users still have plenty of unused resources (disk space, bandwidth, processor cycles) that they have already paid for, and which would be more than sufficient to serve as infrastructure for all kinds of services. Together, they are far more powerful than all those big guys out there. Utilizing their own resources would prevent the users from having to pay a second time, and would make them independent from providers who lock them and their data into walled gardens.

Just somebody “forgot” to standardize the way all those users could unite their forces.

In such a system people would own their data; they could grant or revoke access at will. They wouldn’t be exposed to unsolicited data mining, and their communication couldn’t be blocked, censored, inspected or monitored. There simply wouldn’t be central instances where providers are held as deputies for the interests of monopolistic, encrusted industries or political interests.

The average user does not feel the loss, because he never bothered with that technical stuff. He just doesn’t know about the potential applications, and the healthy competition to the big centralized incumbents, that he is going to miss due to the connectivity restrictions.
Social networks, instant messaging, micro-blogging: all those naturally decentralized services are still forced into a centralized corset, keeping the users dependent on divided and walled communities.

We believe that the sleeping power of the masses can be unleashed by overcoming their artificial isolation …

Codename Locust: A collaborative crawler swarm

While FAROO already contains a distributed crawler, each peer is still crawling the web independently and uncoordinated.
The next generation is a swarm of collaborating crawlers.
They graze the web in the shortest possible time, with no overlap, while leaving no blind spot.

The new species exhibits the following behavior:

  • No overlap and no blind spots, even under churn (see the consistent-hashing sketch after this list).
  • Dynamic task sharing for growing user number with low communication cost.
  • Complete crawling and re-crawling: Detects crawling completion and switches to re-crawl mode.
  • Politeness: low impact on websites by both the individual crawler and the swarm.
  • Low impact to the peer and workgroup.
  • Limited crawl queue size.
  • Exploiting geographic proximity.
  • Exploiting interest and linguistic proximity: not the absolute size of the index is important, but the overlap between user interest and indexed content.
  • Relevance based crawling prioritization.
  • Spam reduction.
  • Spider trap proof.
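One established way to get the no-overlap, no-blind-spot property under churn is consistent hashing of the URL space across peers; here is a minimal sketch of that idea (not necessarily the exact scheme used by Locust):

```python
# Sketch: partition the URL space across crawler peers with consistent
# hashing. Every URL is owned by exactly one peer (no overlap, no blind
# spots), and when a peer joins or leaves, only the URLs adjacent to it
# on the hash ring are reassigned (graceful behavior under churn).
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class CrawlRing:
    def __init__(self, peers):
        self.ring = sorted((h(p), p) for p in peers)
        self.keys = [k for k, _ in self.ring]

    def owner(self, url: str) -> str:
        # first peer clockwise from the URL's position on the ring
        i = bisect.bisect(self.keys, h(url)) % len(self.ring)
        return self.ring[i][1]

ring = CrawlRing(["peer-a", "peer-b", "peer-c"])
print(ring.owner("http://example.com/page1"))

# If a peer leaves, only the URLs it owned change hands; all other
# assignments stay stable:
smaller = CrawlRing(["peer-a", "peer-c"])
print(smaller.owner("http://example.com/page1"))
```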

A small swarm is already harvesting the green leaves from the web. With the next release we will set free the rest of the breed.

BrowseRank? Welcome to the club!

Microsoft Research just published a paper “BrowseRank: Letting Web Users Vote for Page Importance” at the SIGIR (Special Interest Group on Information Retrieval) conference this week in Singapore.

This paper describes a method for computing page importance, referred to as BrowseRank.

FAROO has been doing something very similar with its attention based PeerRank for some time already.

FAROO’s “If users spend a long time on a page, visit it often, put it to bookmarks or print it out, this page goes up in ranking”
sounds very similar to
Microsoft’s “The more visits of the page made by the users and the longer time periods spent by the users on the page, the more likely the page is important.”, doesn’t it?
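In spirit, both formulations boil down to something like the following toy score; the weights are illustrative, and neither PeerRank nor BrowseRank is actually this simple:

```python
# Toy attention score in the spirit of both quotes: pages rise in ranking
# with the number of visits, the time users spend on them, and deliberate
# signals like bookmarking or printing. Weights are illustrative.

def attention_score(visits, avg_dwell_seconds, bookmarks=0, prints=0):
    return (1.0 * visits
            + 0.02 * visits * avg_dwell_seconds   # total reading time
            + 5.0 * bookmarks                     # strong deliberate signals
            + 5.0 * prints)

print(attention_score(visits=100, avg_dwell_seconds=120))            # 340.0
print(attention_score(visits=100, avg_dwell_seconds=5, bookmarks=2)) # 120.0
```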

Also the term “implicit voting” used in the paper caused a kind of déjà vu: “we are voting automatically on the fly, implicit without manual action.” from our blog post Attention economy, the implicit web and myware.

A very significant difference, though, is that FAROO maintains the privacy of the user because it calculates the PeerRank in a decentralized manner, while Microsoft would collect all click streams of all users on a central server.

It’s great to see that Microsoft’s research paper also confirms that attention-based ranking can outperform PageRank, both for relevancy and for spam suppression.

This is certainly an excellent technical paper, but from a scientific publication I would expect previously existing applications of user behavior data for ranking search results to be mentioned in the chapter ‘Related Work’.

FAROO at the CHORUS P2P Workshop

FAROO joined the CHORUS P2P Workshop 1P2P4mm, which was co-located with the InfoScale 2008 conference.

This first workshop on peer to peer architectures for multimedia retrieval (1p2p4mm) took place in Vico Equense, Naples, Italy, on June 6 2008. The workshop was arranged by the CHORUS Coordination Action to discuss what challenges must be met and what bottlenecks must be addressed by research and engineering efforts in the near future.

We had a great and intense discussion on the true benefits of P2P for search, and on building a joint P2P platform and better connections between the academic and web 2.0 communities as possible measures to reach critical mass (in terms of number of users) and gain traction as a serious alternative approach.

For more information and the position papers of the participants please visit the workshop homepage.