No Limits – FAROO iPad / iPhone App v2.0

After 6 months and some incremental updates, we just finished our next major release, with many improvements and new features you will love.

Well, this is not merely an update; it is a complete rewrite. Thanks to a new, flexible architecture, all of the previous limitations are gone.

While some of the better newsreaders for iOS are able to provide a dozen sources per page, we decided to remove all those limitations once and for all, both for iPhone and iPad.

The FAROO App now supports an unlimited number of feeds. The animation stays smooth as ever.

But that’s not all. Here is the complete list of improvements:

  1. New architecture allows an unlimited number of streams and items per stream.
  2. New Google Reader Synchronization.
  3. New Archive function, which lets you store articles for later reading.
  4. New RSS Feed search assistant to search for new feeds by person, topic or blog.
  5. New gorgeous animated front page.
  6. New sleek animated transition between stream view and text view.
  7. New central menu for the new functions.
  8. New help function.

As always, this is a Universal App, running on both the iPhone and the iPad. And it is FREE. Get it from the App Store right now.

Revisited: Deriving crawler start points from visited pages by monitoring HTTP traffic

User Driven Crawling

Yesterday Charles Knight of AltSearchEngines pointed me to an interesting article at BNET, “Cisco Files to Patent to Enter the Search Engine Business”.

The title of the mentioned patent application, no. 20090313241, is “Seeding search engine crawlers using intercepted network traffic”.

That caught my eye, as it describes pretty much the same idea that FAROO has already been using for some years.

In our blog post “New active, community directed crawler” we already outlined two years ago how our “Crawler start points are derived from visited pages”.

We have also been using HTTP monitoring to detect the URLs of visited web pages, by intercepting the TCP network traffic with raw sockets, since the initial FAROO release in 2005.

Instant crawling of all visited web pages and the links they contain has been part of FAROO since then as well.
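As a rough illustration of the idea (not FAROO's actual implementation), the sketch below rebuilds the visited URL from the text of an intercepted plain HTTP request and adds it to a crawl frontier. The function names `url_from_request` and `seed_frontier` are hypothetical, and capturing the traffic itself (for example with raw sockets) is deliberately left out.

```python
from collections import deque
from typing import Optional


def url_from_request(payload: bytes) -> Optional[str]:
    """Rebuild the visited URL from a plain-text HTTP/1.x GET request.

    Returns None if the payload is not a GET request with a Host header.
    Assumption: the request line carries a path, not an absolute URL.
    """
    head = payload.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or parts[0] != "GET":
        return None
    path = parts[1]
    host = None
    for line in lines[1:]:
        # Split at the first colon only, so "Host: example.com:8080" works.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host = value.strip()
            break
    if host is None:
        return None
    return "http://" + host + path


def seed_frontier(frontier: deque, seen: set, payload: bytes) -> None:
    """Add the URL of an intercepted request to the crawl frontier once."""
    url = url_from_request(payload)
    if url is not None and url not in seen:
        seen.add(url)
        frontier.append(url)


frontier, seen = deque(), set()
request = b"GET /blog/post.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
seed_frontier(frontier, seen, request)
seed_frontier(frontier, seen, request)  # the duplicate visit is ignored
print(list(frontier))
```

A crawler would then take URLs from the frontier, fetch them, and follow the links they contain.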

In 2007 this was even the subject of research in the diploma thesis “Analysis of the growth of a decentralized Peer-to-Peer search engine index” by Britta Jerichow at Cologne University of Applied Sciences. Although both the crawler and the index architecture have been improved substantially since then, the thesis already validated the basic feasibility of our approach, both theoretically and experimentally.

Already in a publication from 2001 (in German) I outlined the idea of a distributed peer-to-peer search engine in which the users, as the source of the growth of web content, also ensure its findability, including fully automated content ranking by the users.

Application Fields:

Deriving crawler start points from visited pages is not only important for discovering and crawling blind spots in the web. Those blind spots are formed by web pages that are not connected to the rest of the web, so they can’t be found just by traversing links.

But there are four much more important application fields for user driven crawling:

  • First is real-time search. Even for the big incumbents in the search engine market, it is impossible to crawl the whole web (100 billion pages?) within minutes in order to discover new content in a timely manner (billion pages per day). Only if the crawler is selectively directed to the newly created pages does web-scale real-time search become feasible and efficient, instead of looking for the needle in the haystack.

    By aggregating and analyzing all visited web pages of our users for discovery and implicit voting, we utilize the “wisdom of crowds”.
    Our users are our scouts. They bring in their collective intelligence and direct the crawler to where new pages emerge.

    We published this back in 2007 at AltSearchEngines Great Debate: Peer-to-Peer (P2P) Search: ” FAROO also uses user powered crawling. Pages which are changing often like, for example, news, are visited frequently by users. And with FAROO they are therefore also re-indexed more often. So the FAROO users implicitly control the distributed crawler in a way that frequently changing pages are kept fresh in the distributed index, while preventing unnecessary traffic on rather static pages.”

  • The second is attention based ranking, used by FAROO since 2005. Meanwhile, many Twitter based real-time search engines also rank their results according to the number of votes or mentions of a URL.
    It proved to be an efficient ranking method for real-time search, superior to link analysis, as there are no incoming links yet when the content is created.
    While most real-time search engines use explicit voting, we showed in our blog post “The limits of tweet based web search” that implicit voting by analyzing visited web pages is much more effective.
  • Third is indexing the deep web (sometimes also referred to as the hidden web). It consists of web pages that are created solely on demand from a database, when a user searches for a specific product or service. Because there are no incoming links from the web, those pages can’t be discovered and crawled by normal search engines, although these are starting to work on alternative ways to index the hidden web, which is much bigger than the visible web.
  • Fourth is personalization and behaviourally targeted online advertising, based on click streams identified from network traffic. This technique got some buzz when it was tested in the UK by Phorm.

Beyond search, of course, there is an even wider application field of prioritizing / routing / throttling / blocking / intercepting / logging traffic and users depending on the monitored URL, by both ISPs and other authorities.

Conclusion:

All in all, it looks to me as if there is some evidence of prior art.

Well, this is not the first time that somebody has come up with an idea that FAROO had already been using for some time. See also “BrowseRank? Welcome to the club!”. And given the level of innovation in our distributed search engine architecture, which breaks with almost every legacy paradigm, this will probably not be the last time.

That’s why we publish our ideas early, even if they sometimes inspire our competition 😉 This prevents smart solutions from being locked up by the patents of big companies, which are the only ones with enough resources to patent every idea that comes to their mind.

Whereas every web page detected makes the web more accessible, every patent issued locks away one more piece of the ingenuity of our human species.

The limits of tweet based web search…

…and how to overcome them by utilizing the implicit web.

Many of the recent real-time search engines are based on Twitter. They use the URLs enclosed in tweets for discovery and ranking of new and popular pages.
It is worth taking a closer look at the numbers behind this foundation, to explore the feasibility and limits of the approach.

Recently there has been an interesting visualization of Twitter stats. It essentially shows that, as with other social services, only a small fraction of the users is actively contributing. This lack of representativeness may even call promising ideas like the “wisdom of crowds” into question.

But there is another fact: even those people who do contribute publish only a small fraction of the information they know.

Both factors make up the huge difference in efficiency between implicit and explicit voting. Explicit voting requires the user to actively express his interest, e.g. by tweeting a link. For implicit voting no extra user action is required: if a user visits a web page, this is already counted as a vote.

A short calculation:

Twitter now has 44.5 million users and provides about 20,000 Tweets per minute. If every second tweet contains a URL this would be 10,000 URLs per minute.

According to Nielsen the number of visited Web Pages per Person per Month is 1,591.

The 44.5 million users visit about 1.6 million web pages per minute, while explicitly voting for only 10,000 per minute.

Implicit voting and discovery provides 160 times more attention data than explicit.

This means that 280,000 users with implicit voting could provide the same amount of information as 44.5 million users with explicit voting. Or that implicit discovery finds as many web pages in one day as explicit discovery does in half a year.
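The calculation can be retraced with a few lines of Python. All figures are the ones quoted in the text; the only added assumption is a 30-day month for converting the Nielsen figure to pages per minute.

```python
USERS = 44.5e6                      # Twitter users (figure from the text)
TWEETS_PER_MINUTE = 20_000          # figure from the text
URL_SHARE = 0.5                     # "every second tweet contains a URL"
PAGES_PER_USER_MONTH = 1_591        # Nielsen figure from the text
MINUTES_PER_MONTH = 30 * 24 * 60    # assumption: a 30-day month

# Explicit votes: URLs tweeted per minute.
explicit_per_minute = TWEETS_PER_MINUTE * URL_SHARE

# Implicit votes: pages visited per minute by the same user base.
implicit_per_minute = USERS * PAGES_PER_USER_MONTH / MINUTES_PER_MONTH

ratio = implicit_per_minute / explicit_per_minute

# Users needed with implicit voting to match all explicit votes.
equivalent_users = USERS / ratio

print(f"explicit votes per minute: {explicit_per_minute:,.0f}")
print(f"implicit votes per minute: {implicit_per_minute:,.0f}")
print(f"ratio implicit/explicit:   {ratio:.0f}")
print(f"equivalent implicit users: {equivalent_users:,.0f}")
```

Without intermediate rounding this gives roughly 1.64 million page visits per minute, a ratio of about 164 and about 272,000 users, which the text rounds to 1.6 million, 160 and 280,000. A ratio of that size also means that one day of implicit discovery corresponds to about 160 days, or roughly half a year, of explicit discovery.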

This drastically shows the limits of a web search that is based solely on explicit votes and mentions, and the potential that can be leveraged by using the implicit web.

Beyond the mainstream
This becomes even more important if we look beyond mainstream topics or the English language.
Then it’s simply impossible to achieve the critical mass of explicit votes needed for a statistically significant attention based ranking or popularity based discovery.

Time and Votes are precious
Time is also a crucial factor, especially for real time search.
We want to discover a new page as soon as possible. And we want to assess almost instantly how popular this new page becomes.
If we fail to provide a reliable ranking within a short time, the page will remain buried in a steady stream of insignificant noise.
But both goals conflict with the fact that the number of votes is proportional to the observation time. For new pages the small number of explicit votes is not sufficiently representative to provide a reliable ranking.

Again the much higher frequency of implicit votes helps us.

Relevance vs. Equality
But we can also improve on explicit votes. We just should not treat them as equal – because they are not.
Some of them we trust more than others, and with some we share more common interests than with others. It is the very same reason why we follow some people and not others.
This helps us to get more value and meaning out of the very first vote.
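One way to picture unequal votes is a trust-weighted sum. The following is only a minimal sketch: the `trust` weights, the `default_trust` value and the function name `weighted_scores` are hypothetical, and the text does not say how FAROO actually derives or applies such weights.

```python
from typing import Dict, Iterable, Tuple


def weighted_scores(
    votes: Iterable[Tuple[str, str]],
    trust: Dict[str, float],
    default_trust: float = 0.1,
) -> Dict[str, float]:
    """Sum votes per URL, weighting each vote by the voter's trust value.

    votes: (voter, url) pairs.
    trust: hypothetical per-voter weights, e.g. higher for people the
    searching user follows or shares interests with.
    """
    scores: Dict[str, float] = {}
    for voter, url in votes:
        scores[url] = scores.get(url, 0.0) + trust.get(voter, default_trust)
    return scores


votes = [
    ("alice", "http://a.example/"),
    ("bob", "http://b.example/"),
    ("carol", "http://b.example/"),
]
trust = {"alice": 1.0}  # the user follows alice; bob and carol are unknown
print(weighted_scores(votes, trust))
```

With these example weights, a single vote from a trusted person outweighs two votes from unknown users, so even the very first vote can carry meaning.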

FAROO is going in this direction by combining real-time search with a peer-to-peer infrastructure.

A holistic approach
The discovery of topical, fresh and novel information has always been an important aspect of search. But the perception of what is recent has changed dramatically with the popularity of services like Twitter, and has led to real-time search engines.

Real-time search shouldn’t be treated separately, but as part of a unified and distributed web search approach.

The era of purely document centered search is over. The equally important role of users and conversation, both as a target of search and as contributors to discovery and ranking, should be reflected in an adequate infrastructure.

A distributed infrastructure
As long as both the sources and the recipients of information are distributed, the natural design for search is distributed. P2P provides an efficient alternative to the ubiquitous concentration and centralization in search.

A peer-to-peer client allows the implicit discovery and attention ranking of every visited web page. This is important, as the majority of pages belong to the long tail, in real-time search as well. They appear once or not at all in the Twitter stream, and can’t be discovered and ranked through explicit voting.

In real time search the amount of index data is limited, because only recent documents, with high attention and reputation need to be indexed. This allows a centralized infrastructure at moderate cost. But as soon as search moves beyond the short head of real time search and aims to fully index the long tail of the whole web, then a distributed peer-to-peer architecture provides a huge cost advantage.

Edit
There is an interesting reaction from the TRENDPRENEUR blog, which further explores the topic: Link voting: real-time respect

FAROO introduces Continuous Search

About 40 percent of the searches people make on the Internet are duplicate queries they have made at least once before.

Now FAROO assists in the time-consuming task of staying up to date, and alerts you in real time about relevant news. Based on attention data, FAROO automatically detects queries with long-term relevance to the user. Unlike other solutions, no extra action is required from the user.

This also serves as a smart discovery search, automatically providing the user with updates in his fields of interest.
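How queries with long-term relevance might be picked out of a user's own search history can be sketched as follows. This is an illustrative assumption, not FAROO's actual detection method: the function name `recurring_queries`, the log format and the `min_days` threshold are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Set, Tuple


def recurring_queries(
    log: Iterable[Tuple[date, str]], min_days: int = 3
) -> List[str]:
    """Return queries that were issued on at least `min_days` distinct days.

    Queries are normalized (trimmed, lower-cased) before counting, and
    repeated searches on the same day count only once.
    """
    days: Dict[str, Set[date]] = defaultdict(set)
    for day, query in log:
        days[query.strip().lower()].add(day)
    return sorted(q for q, seen in days.items() if len(seen) >= min_days)


log = [
    (date(2009, 6, 1), "p2p search"),
    (date(2009, 6, 1), "P2P search"),   # same day, counted once
    (date(2009, 6, 3), "p2p search"),
    (date(2009, 6, 8), "p2p search "),
    (date(2009, 6, 8), "weather berlin"),
]
print(recurring_queries(log))
```

Queries returned by such a filter are candidates for being re-run automatically, so that the user is alerted when new results appear.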

Additionally, a list of current Hot Topics and related images is displayed. In many cases this provides good visual feedback on breaking events or topics dominating the news.

Currently we are using Twitter data for update detection and real time search, as our index is not yet comprehensive enough.
But in the long run our own p2p data of all visited web pages will provide even more relevant results.

The new set of features will be available with our next release.

BrowseRank? Welcome to the club!

Microsoft Research just published a paper “BrowseRank: Letting Web Users Vote for Page Importance” at the SIGIR (Special Interest Group on Information Retrieval) conference this week in Singapore.

This paper describes a method for computing page importance, referred to as BrowseRank.

FAROO has been doing something very similar with its attention based PeerRank for some time already.

FAROO’s “If users spend a long time on a page, visit it often, put it to bookmarks or prints it out, this page goes up in ranking.”
sounds very similar to
Microsoft’s “The more visits of the page made by the users and the longer time periods spent by the users on the page, the more likely the page is important.”, doesn’t it?
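Both quotes describe the same signal: more visits and longer time on a page mean a higher rank. A toy version of that signal is sketched below. It is neither FAROO's PeerRank nor the model from the Microsoft paper; the function name `attention_ranking` and the choice of summed dwell time as the score are assumptions made only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def attention_ranking(
    visits: Iterable[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Rank URLs by total dwell time in seconds, summed over all visits.

    Each visit is a (url, seconds_on_page) pair, so a page scores higher
    both when it is visited more often and when users stay longer.
    """
    totals: Dict[str, float] = defaultdict(float)
    for url, seconds in visits:
        totals[url] += seconds
    # Highest total attention first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


visits = [
    ("http://news.example/story", 120.0),
    ("http://spam.example/", 2.0),
    ("http://news.example/story", 90.0),
    ("http://spam.example/", 1.0),
    ("http://blog.example/post", 60.0),
]
for url, seconds in attention_ranking(visits):
    print(url, seconds)
```

In this example the spam page is visited as often as the news story but ranks last, because visitors leave it almost immediately.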

Also the term implicit voting used in the paper caused a kind of Déjà vu: “we are voting automatically on the fly, implicit without manual action.” from our blog post Attention economy, the implicit web and myware.

A very significant difference, though, is that FAROO maintains the privacy of the user because it calculates the PeerRank in a decentralized manner, while Microsoft would collect all click streams of all users on a central server.

It’s great to see that Microsoft’s research paper also confirms that attention based ranking is able to outperform PageRank both for relevancy and for spam suppression.

This is certainly an excellent technical paper, but from a scientific publication I would expect previously existing applications of user behavior data for ranking search results to be mentioned in the chapter ‘Related Work’.

Unconference & BoF at Web 2.0 Expo

“The Social Side of Search”, a Micro-Unconference initiated by FAROO, took place on April 25 in the Oracle Booth at the Web 2.0 Expo in San Francisco.

The day before we presented FAROO at the Birds of a Feather (BoF) Session “People Powered Search”:

People powered search:

Social networks are very successful; there are social networks for nearly everything. Only in search are you still on your own? Searching together is natural: asking friends, family, experts …

  • Why don’t we have social search yet (on a large scale)? Chicken-and-egg problem: many users are required for it to be useful. First your plain search must be competitive, then you can add features -> costs!
  • What could be the benefits? Personalization, benefit from search experience of your community …
  • What could be the risks? Spammers, Edit wars, Privacy, locked in community -> alternate opinions get filtered out

Examples of how searching together could benefit – a lot of different flavours:

  • Providing Infrastructure (using P2P technology)
  • Directing the crawler (websites which often appear in results are crawled more frequently/deeper)
  • User generated Ranking (using attention data)
  • Annotating results
  • Editing results
  • Creating results
  • Bringing users with similar search interests together ( FAROO Social Search )
  • Collaborative Searching: Partitioning search among users
  • Personalization using your Social Graph
  • Many more …

Collaboration not only between users, but also between social search projects:

Today’s social networks have one problem: walled gardens (possible workaround: open social, friendfeed api). Would it be possible to define a standard/protocol so that all social search initiatives work together from the beginning?

  • Can users take their profile with them?
  • Can users take their attention data/query stream with them?
  • Are the privacy settings standardized?
  • Can the different search projects exchange index and usage data and use them together, to join their forces? Intense discussion on this topic at Alternative Search Engines Day, a conference hosted by Charles Knight

 

Of course, beyond the unconference, the Web 2.0 Expo was also a great place to meet interesting people and see what others are working toward.


San Francisco day …

… and night

Echo in the blogosphere

For a p2p model it is essential to share a common vision with your users. Therefore it’s always interesting to see how your ideas are discussed and perceived.

A very encouraging and profound example is the ReadWriteWeb blog post “Could P2P Search Change the Game?” by Bernard Lunn.

 
Additionally here is a short roundup of selected previous blog posts:

FAROO: The Social Side of Search

FAROO gets exciting new social search network functions.

Combining the two mega trends search and social networks, we try to harness the wisdom of crowds and network effects for search.

Why social search? Because searching today means being alone with your question. There is no conversation at all.
But in the real world you are successful if you don’t just silently plow through heaps of documents, but also ask your colleagues and get hints from your friends.

Many social networks are entertaining. But when it comes to search, to recommending and connecting you to people who are working on the same topic at the same time, you are on your own again. Because social networks help you to stay in touch with people you already know, they barely help you find the right (yet unknown to you) people at the right time.

Why is context sensitive search advertising so successful? Because you are presented with the right ads at just the moment when you are interested in a specific topic. Now how would it be if you were presented not with ads, but with like-minded people? You get the idea. That’s what FAROO’s social search is all about.

And we reconcile social search AND privacy. Unlock the collective intelligence of a search community of peers without sacrificing your privacy. No registration is required to use the social features. You simply use an arbitrary alias or nickname.

We just bring people with the same interests together. You can immediately communicate with like-minded people, profit from their search experience, follow their discoveries and exchange ideas. But you may decide later whether you want to become friends and when to reveal your identity.

FAROO’s Social Search is pure opt-in. It can bring you into conversation. But only if you like.

The social features are still in alpha stage, but here is a Sneak Peek.

Screenshot

What do you think about it?