If you are building an internet search engine, you know that there are many different languages and character sets to consider. You might try to keep your algorithms language-independent, and with Unicode the character problems seem to be solved.
But this is still not sufficient for some languages, whose internet population has meanwhile grown larger than that of the U.S.
In Chinese, words are not separated by whitespace.
But words are an important unit in information retrieval. Many operations, such as indexing and search, are based on words. Segmenting unsegmented Chinese text into words is therefore essential for a truly international search engine.
An additional challenge is making the word segmentation algorithm lightweight enough to be integrated into a distributed p2p search client. While decent recall and precision are a prerequisite, small size and high speed are essential. The small size keeps the installation package of the p2p search client compact and its memory requirements low, while the high speed is necessary for real-time crawling. The small size is a particular challenge, as many segmentation approaches are based on large dictionaries.
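To illustrate why dictionary size matters, here is a minimal sketch of forward maximum matching, one of the classic dictionary-based segmentation approaches. It is not FAROO's algorithm; the function name `segment`, the `max_word_len` parameter, and the toy dictionary are assumptions made purely for illustration. Its quality depends directly on how many words the dictionary contains, which is exactly what makes this family of approaches heavy.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position, greedily take the
    longest substring that is found in the dictionary."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until there is a hit.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary (illustrative only; real dictionaries hold
# hundreds of thousands of entries).
TOY_DICTIONARY = {"搜索", "引擎", "搜索引擎", "中文", "分词"}

print(segment("中文搜索引擎分词", TOY_DICTIONARY))
# → ['中文', '搜索引擎', '分词']
```

Every word missing from the dictionary falls apart into single characters, so a small dictionary hurts recall and precision, while a large one inflates the download and the memory footprint.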
FAROO’s lightweight word segmentation algorithm handles full-width and half-width as well as traditional and simplified characters.
Search terms entered in one character form also find all documents written in the opposite character form.
The same applies to term highlighting. Documents with mixed Latin and Chinese characters are, of course, processed properly as well.
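One way to get this behavior is to fold both the query and the document to a common character form before comparing them. The following is a rough sketch of that idea, not FAROO's actual implementation. It assumes Unicode NFKC normalization for the full-width to half-width mapping and a tiny, hypothetical `TRAD_TO_SIMP` table for the traditional to simplified mapping (a real table has thousands of entries). The names `fold_char`, `fold`, and `highlight` are made up for this example.

```python
import unicodedata

# Hypothetical, tiny traditional -> simplified table for illustration only.
TRAD_TO_SIMP = str.maketrans({"網": "网", "頁": "页", "體": "体", "漢": "汉", "語": "语"})


def fold_char(ch):
    """Fold one character to a common form without changing string length."""
    # NFKC maps full-width Latin letters and digits to their half-width forms.
    folded = unicodedata.normalize("NFKC", ch).translate(TRAD_TO_SIMP).lower()
    # Keep the original if folding would change the length,
    # so that offsets in the folded and original text stay aligned.
    return folded if len(folded) == 1 else ch


def fold(text):
    return "".join(fold_char(ch) for ch in text)


def highlight(document, term, open_tag="[", close_tag="]"):
    """Mark every occurrence of term in document, matching on folded forms
    but emitting the document's original characters."""
    folded_doc, folded_term = fold(document), fold(term)
    if not folded_term:
        return document
    out, pos = [], 0
    while True:
        hit = folded_doc.find(folded_term, pos)
        if hit == -1:
            break
        end = hit + len(folded_term)
        out.append(document[pos:hit])
        out.append(open_tag + document[hit:end] + close_tag)
        pos = end
    out.append(document[pos:])
    return "".join(out)


# Simplified query, traditional document with full-width Latin letters.
print(highlight("ＦＡＲＯＯ 搜尋漢語網頁", "汉语"))
# → ＦＡＲＯＯ 搜尋[漢語]網頁
```

Because folding is done character by character and preserves length, a match position in the folded text can be used directly to highlight the original text, regardless of which character form the document uses.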