FAROO at TMT.Communities’09

Gosia of FAROO was a speaker and special guest at the TMT.Communities’09 conference in Warsaw, Poland.

The conference took place on July 18th at the Warsaw Stock Exchange, in the Chamber of Listings, and was held under the motto “Generation C”.

Here is a short excerpt from the conference web site:
“What is Generation C? It’s a group of people all over the world aged 15 to 45 choosing a digitally-enhanced lifestyle and thus empowering hardware, application and service providers but also grassroots organizations like Creative Commons or Piratbyrån. In the world of Generation C it’s all about content, communication and cooperation. And since the content is digital it doesn’t exist without a proper medium and your favorite device. It all adds up to a digital world where people are the most important component individually and form powerful and influential communities all together.”

We used the chance to evangelize the power of P2P search again 😉

Wyszukiwanie P2P – demokratyzacja wyszukiwania (P2P Search – The Democratization of Search)

Warsaw Stock Exchange, Chamber of Listings

Lightweight Chinese Word Segmentation

If you are building an internet search engine, you know that there are many different languages and character sets to consider. You might try to keep your algorithms language-independent, and with Unicode the character problems seem to be solved.

But this is still not sufficient for some languages whose internet population is by now larger than that of the U.S.

In Chinese, words are not separated by whitespace.
But words are an important unit in information retrieval. Many operations, such as indexing and search, are based on words. Therefore, segmenting unsegmented Chinese text into words is essential for a truly international search engine.

An additional challenge is to make the word segmentation algorithm lightweight enough to be integrated into a distributed P2P search client. While decent recall and precision are prerequisites, small size and high speed are essential. The small size is required to keep the installation package of the P2P search client compact and its memory requirements low, while the high speed is necessary for real-time crawling. The small size is the harder requirement, as many segmentation approaches are based on large dictionaries.
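To make the size problem concrete, here is a minimal sketch of forward maximum matching, a common dictionary-based baseline. It is not FAROO's algorithm; the function name `segment`, the `max_len` parameter and the five-entry toy dictionary are assumptions made for illustration only.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative baseline only)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback,
            # so unknown characters never stall the loop.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real one would hold a very large word list.
TOY_DICTIONARY = {"我们", "喜欢", "搜索", "引擎", "搜索引擎"}

print(segment("我们喜欢搜索引擎", TOY_DICTIONARY))
# → ['我们', '喜欢', '搜索引擎']
```

The quality of such a segmenter depends directly on how many words the dictionary contains, which is exactly what conflicts with a compact installation package and low memory use.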

FAROO’s lightweight word segmentation algorithm handles full-width and half-width as well as traditional and simplified characters.
If you enter your search terms in one character form, you will still find all documents written in the other character form.
The same applies to term highlighting. Documents that mix Latin and Chinese characters are, of course, processed properly as well.
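One way to get this kind of form-independent matching is to fold both the query and the document text into a single canonical form before comparing them. The sketch below assumes this approach; it is not taken from FAROO's implementation. It uses Unicode NFKC normalization for the width variants and a tiny hypothetical lookup table (`TOY_TRAD_TO_SIMP`) for traditional to simplified characters. A real table would cover thousands of characters.

```python
import unicodedata

# Hypothetical four-entry mapping, for illustration only.
TOY_TRAD_TO_SIMP = str.maketrans({"網": "网", "頁": "页", "詞": "词", "檢": "检"})


def normalize_term(text):
    """Fold width variants and traditional characters into one canonical form."""
    # NFKC maps full-width Latin letters and digits to their ASCII counterparts.
    folded = unicodedata.normalize("NFKC", text)
    # Map traditional characters to simplified ones via the toy table.
    return folded.translate(TOY_TRAD_TO_SIMP)


print(normalize_term("ＦＡＲＯＯ"))                          # → FAROO
print(normalize_term("網頁") == normalize_term("网页"))      # → True
```

If the index and the query are both passed through the same function, a term typed in either character form matches documents in both.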