How do you know what language a user's query is written in? How do you search languages like Japanese that don’t put spaces between words? And how do you identify the languages used in an unstructured, multilingual document? Multilingual search is a hard problem.
Most of the search players use the same basic technology, provided by Basistech. Check out their customer list: Google, Amazon, MSN, Yahoo, Endeca, Peoplesoft, Overture; you name it, they have them.
The technology, called the Rosette Linguistic Platform, “helps your applications unlock the meaning of unstructured text by determining the language and encoding of a given document, converting the text to Unicode so that it can be processed, identifying the basic linguistic features and structure, and locating key concepts like the names of people and places.”
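To make the first two steps in that quote concrete, here is a minimal sketch of "figure out the encoding, convert to Unicode, then guess the writing system". It is not Rosette's API or method: it uses only the Python standard library, the function names `decode_to_unicode` and `dominant_script` are made up for illustration, and the trial-decoding order is an assumption. A real product would use statistical models rather than this kind of guesswork.

```python
import unicodedata
from collections import Counter

# Hypothetical candidate list; the order matters because the first encoding
# that decodes without error wins. latin-1 accepts any byte, so it goes last.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "euc_jp", "latin-1")


def decode_to_unicode(raw, candidates=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the input")


def dominant_script(text):
    """Guess the main writing system from Unicode character names.

    The first word of a character's Unicode name (LATIN, ARABIC, HIRAGANA,
    CJK, ...) is used as a rough stand-in for its script.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    raw = "東京".encode("shift_jis")
    text, encoding = decode_to_unicode(raw)
    print(encoding, dominant_script(text))
```

Note that a script is not a language: Latin letters could be English, French or Turkish, and telling those apart needs word or character statistics on top of this.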
In other words, it deals with Asian and Arabic language search problems and does entity extraction (pulling out the names of people, places, companies and the like). It identifies individual words in languages such as Japanese that do not use spaces between words, breaks compound words into their individual components, and identifies parts of speech such as verbs and adjectives.
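To give a feel for what "identifying individual words" without spaces involves, here is a toy sketch of one classic approach, greedy longest-match against a word list. This is a swapped-in textbook technique, not a description of how Rosette works; the `segment` function and the tiny lexicons are invented for the example, and real systems rely on large dictionaries plus statistical models to resolve ambiguity.

```python
def segment(text, lexicon):
    """Split text into tokens by greedy longest-match against a lexicon.

    At each position the longest lexicon entry that matches is taken;
    if nothing matches, a single character is emitted as its own token.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Hypothetical mini-lexicons, just enough for the two examples.
    japanese = {"東京", "東京都", "に", "住む"}
    print(segment("東京都に住む", japanese))

    german = {"fußball", "welt", "meisterschaft"}
    print(segment("fußballweltmeisterschaft", german))
```

The same routine handles both cases from the paragraph above: unspaced Japanese text and a German compound word. Its weakness is that the greedy choice can be wrong when a shorter word followed by another word would have been the correct reading, which is one reason commercial tools go well beyond this.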
Once it has done its job of making sense of the languages it finds, the vendor's own search technology takes over.
They have a pretty cool demo that explains what the technology does – here’s a screenshot: