Like people everywhere, those in Japan rely on search engines every day to tap the ocean of information that is the World Wide Web.
But despite the familiarity of Google, Yahoo and other popular search engines, what goes on behind the scenes once we enter our search terms is less understood by the general public.
Even more befuddling is the question of just how these online services manage to locate and index Web sites written in the three scripts used in Japanese.
Following are some questions and answers about search engines and how they parse one of the world’s most complex writing systems:
How does a search engine work?
A search engine is an information-retrieval system that searches documents on the World Wide Web based on specific keywords, producing a list of documents containing those keywords.
Search engines use small programs known as Web crawlers, spiders or bots to go out across the Internet and copy documents they find for later processing.
Words in target documents are stored in long data lists called indexes. Also, copies of the original documents themselves are stored on server computers for retrieval should the original Web pages be updated or deleted. This is known as caching.
Search engines are able to quickly locate Web pages containing a user’s search terms by scanning the indexes rather than exhaustively looking at every word of every document stored in their vast archives of cached documents.
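To make the idea of an index concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular search engine is built: the page names and their text are invented, and words are split on spaces, which works for English but, as discussed later, not for Japanese.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages for the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical cached pages, keyed by an invented document ID.
documents = {
    "page1": "ramen shops in Tokyo",
    "page2": "Tokyo train map",
    "page3": "ramen recipes at home",
}
index = build_index(documents)
```

A query such as search(index, "tokyo ramen") only has to look up two entries in the index and intersect them; it never rereads the pages themselves.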
The search engine ranks the documents it finds according to proprietary methods of Web-site analysis. Google, for example, says on its Web site that it looks at how a document is linked to the rest of the Web, using the “collective intelligence of the Web to determine a page’s importance.” Looking inside a document, it also “factors in fonts, subdivisions and the precise location of each word,” as well as content of neighboring pages to turn up relevant pages.
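Google's actual ranking method is proprietary, but the general idea of scoring a page by the links pointing to it can be sketched with a simplified PageRank-style calculation. Everything below is an assumption made for illustration: the four pages, their links, the damping value of 0.85 and the number of iterations.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by their incoming links, PageRank-style.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link structure: three pages link to "c", and "c" links to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = link_rank(links)
```

In this toy Web, page "c" ends up with the highest score because three pages link to it, and page "a" outranks "b" and "d" because its single incoming link comes from the highly ranked "c".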
Why can searches in Japanese be frustrating?
Anyone looking for a document containing a keyword in Japanese must know beforehand in which of the language’s three writing systems (the hiragana and katakana phonetic syllabaries, or kanji) it is written.
For example, a basic Google search for “ramen” in hiragana finds some 3 million documents, but misses tens of millions of Web pages where the word is written in katakana only. Meanwhile, a literature buff who is unsure of the archaic kanji for “wagahai” (an old way of saying “I”) is likely to miss out on many pages discussing Natsume Soseki’s classic novel “Wagahai wa Neko de Aru” (“I am a Cat”).
When the user is looking for a phrase, rather than individual words, entering all the possible combinations of hiragana, katakana and new and old kanji can be prohibitively time-consuming. Making matters worse, the three scripts are sometimes used in unconventional form for humorous effect, throwing search engines off the trail of many a quirky blog.
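One piece of this problem, the hiragana-versus-katakana mismatch, can be handled mechanically, because the two syllabaries sit at a fixed distance from each other in Unicode. The following Python sketch is an illustration under that assumption, with invented function names; it does nothing for kanji, which would require a dictionary of readings.

```python
KATAKANA_START = 0x30A1  # small "a" in katakana
KATAKANA_END = 0x30F6    # small "ke" in katakana
KANA_OFFSET = 0x60       # distance between matching katakana and hiragana

def fold_kana(text):
    """Convert katakana to hiragana; leave every other character unchanged."""
    folded = []
    for ch in text:
        code = ord(ch)
        if KATAKANA_START <= code <= KATAKANA_END:
            folded.append(chr(code - KANA_OFFSET))
        else:
            # Kanji, hiragana, the long-vowel mark and Latin letters pass through.
            folded.append(ch)
    return "".join(folded)

def same_word(a, b):
    """Treat two spellings as equal if they match after kana folding."""
    return fold_kana(a) == fold_kana(b)
```

If both the indexed text and the query are passed through fold_kana, "ramen" typed in hiragana would also match pages that spell it in katakana.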
OK, it’s tough for us humans. But is it any easier from the computer’s point of view?
Unfortunately, no. Experts say far more computing power is required to search Japanese text than English.
One reason is that Japanese doesn’t use spaces. With no spaces to separate words, search engines attempting to index a document must work out for themselves where words begin and end. Imagine having to figure out that “amanaplanacanalpanama” means “A man, a plan, a canal: Panama,” and you get the idea. Another problem is that Japanese writers commonly spell the same word in several different ways.
So the programmers at Yahoo Japan designed their search engine to assume a user wants to find the term “hikkoshi” (moving, or relocation) whether the query is in the correct kanji or kana, or in one of the many common but orthographically incorrect variations. Yahoo Japan’s search engine also tries to figure out when someone is using an archaic kanji or an uncommon katakana construction, but the company is quick to acknowledge the coverage is only partial.
Also, broadband-service provider NTT Resonant Inc., which operates the well-known Japanese portal Goo, is trying to improve indexing by building giant databases of names for people, places and organizations.
Perhaps more impressive from a computing standpoint, the company is also programming its search engine to determine the grammatical role played by each word in each document scanned. This also improves indexing, and thus the quality of search results, according to search services manager Masayuki Sugizaki.
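The word-boundary problem described above can be illustrated with a toy example in Python. This is a sketch under simplifying assumptions, not the method used by Yahoo Japan or Goo: it relies on a tiny invented vocabulary and simply prefers the split with the fewest words, whereas real systems also weigh grammar and word frequency.

```python
def segment(text, vocabulary):
    """Split unspaced text into known words, preferring the fewest words.

    Returns a list of words, or None if the text cannot be fully split.
    """
    max_len = max(len(word) for word in vocabulary)
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

# Invented vocabulary; "can" and "pan" are distractors that lead to dead ends.
vocabulary = {"a", "man", "plan", "canal", "panama", "can", "pan"}
words = segment("amanaplanacanalpanama", vocabulary)
```

Even in this small case the program has to try and discard false starts such as "can" and "pan" before it arrives at the intended reading, which hints at why indexing unspaced text takes more computing power.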
At this point, how does an English-language search compare with one in Japanese in terms of the reliability and ranking of documents it finds?
Sugizaki said that because English is so widely used around the world, Web pages in that language are far more interlinked than those in Japanese.
Because ranking is based so much on interlinkage on the Web, Japanese-language search engines still have less information to go on when trying to guide users to documents judged by others as worthy. Sugizaki said searches in Japanese are more likely than searches in English to turn up blog pages.
Is Japan, a country of great technological innovation, trying in any way to redefine the Web search?
Yes. This year, the government set out to take the lead in next-generation search engines with a project to collect consumer behavior data.
The idea is to take the Web search beyond typing keywords or phrases into a search box and pushing the enter key. In the new concept, computers will keep track of people’s behavior and act on that data, for example by notifying a wine buff with a GPS-equipped mobile phone that bottles of Beaujolais Nouveau are on sale around the corner.
This year, the government allocated ¥4.6 billion for the project, which consists of 10 partnerships with corporations and government-affiliated organizations with expertise in related search engine technologies.
Participants hope the move gives Japan an edge over South Korea and Taiwan, whose high-tech proficiency is starting to catch up with Japan’s. However, as exciting as it may sound to retailers, the newfangled robo-search engines raise obvious privacy concerns.
So a government committee has been tasked with determining what kind of personal data, and how much, should be made available to the next generation of search technology.