Efficient Chinese Search with Elasticsearch

By Damien Alexandre

If you have played with Elasticsearch, you already know that analysis and tokenization are the most important steps when indexing content. Without them, relevance suffers: your results are poorly sorted and your users unhappy.

Even with English content, you can lose relevance with bad stemming, miss documents when elision is not handled properly, and so on. It gets worse when you index another language: the default analyzers are not all-purpose.

When dealing with Chinese documents, everything gets even more complex, even when considering only Mandarin, the official language of China and the most spoken language worldwide. Let’s dig into Chinese content tokenization and look at the best ways of handling it with Elasticsearch.

Chinese characters are logograms: each represents a word or a morpheme (the smallest meaningful unit of language). Put together, their meaning can change and they can represent a whole new word. Another difficulty is that there are no spaces between words, making it very hard for a computer to know where a word starts or ends.

There are tens of thousands of Chinese characters, although in practice reading written Chinese requires knowing between three and four thousand. Let’s see an example: the word “volcano” (火山) is in fact the combination of:

  • 火: fire
  • 山: mountain

Our tokenizer must be clever enough to avoid separating those two logograms, because the meaning is changed when they are not together.
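
To make the problem concrete, here is a minimal Python sketch comparing a naive per-character split with a greedy dictionary lookup (forward maximum matching). The tiny dictionary and the function name are made up for this illustration; this is not how any of the analyzers discussed in this article is actually implemented.

```python
# Toy dictionary, invented for this example: a real segmenter ships with
# hundreds of thousands of entries.
TOY_DICTIONARY = {"火山", "火", "山", "手机", "机票"}


def longest_match_tokens(text, dictionary=TOY_DICTIONARY):
    """Greedy forward maximum matching: at each position, take the longest known word."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - position), 0, -1):
            candidate = text[position:position + size]
            if size == 1 or candidate in dictionary:
                # Unknown single characters are emitted as-is.
                tokens.append(candidate)
                position += size
                break
    return tokens


print(list("火山"))                  # naive split: one token per logogram
print(longest_match_tokens("火山"))  # dictionary-aware split keeps the word whole
```

The naive split yields “fire” and “mountain” as two unrelated tokens, while the dictionary lookup keeps “volcano” in one piece.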

Another difficulty is the spelling variants used:

  • simplified Chinese: 书法 ;
  • traditional Chinese, more complex and richer: 書法 ;
  • and pinyin, a Romanized form of Mandarin: shū fǎ.

Analyzing Chinese content

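At the time of this writing, here are the solutions available with Elasticsearch:

  • the default chinese analyzer ;
  • the paoding plugin ;
  • the cjk analyzer ;
  • the smartcn plugin ;
  • the ICU plugin.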

These analyzers are very different and we will compare how well they perform with a simple test word: 手机.
It means “Cell phone” and is composed of two logograms, which mean “hand” and “machine” respectively. The 机 logogram also composes a lot of other words:

  • 机票: plane ticket
  • 机器人: robot
  • 机枪: machine gun
  • 机遇: opportunity

Our tokenization must not split those logograms, because if I search for “Cell phone”, I do not want any documents about Rambo owning a machine gun and looking bad-ass.

We are going to test our solutions with the great _analyze API:

curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=paoding_analyzer1' -d '手机'
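
If you would rather script these calls, the same request can be sketched in Python with the standard library only. The helper name and its default host are my own assumptions; the URL shape simply mirrors the curl command above (the query-string style used by Elasticsearch 1.x).

```python
import urllib.parse
import urllib.request


def build_analyze_request(index, analyzer, text, host="http://localhost:9200"):
    """Build (without sending) the same request as the curl command above."""
    query = urllib.parse.urlencode({"analyzer": analyzer})
    url = "{}/{}/_analyze?{}".format(host, index, query)
    # The text to analyze travels as the raw UTF-8 request body.
    return urllib.request.Request(url, data=text.encode("utf-8"), method="GET")


request = build_analyze_request("chinese_test", "paoding_analyzer1", "手机")
print(request.get_method(), request.full_url)

# To actually send it (assumption: an Elasticsearch 1.x node listening locally):
#   import json
#   with urllib.request.urlopen(request) as response:
#       print([token["token"] for token in json.load(response)["tokens"]])
```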

Also, did I mention this awesome cheat sheet for Elasticsearch yet?

The default Chinese analyzer

Already available on your Elasticsearch instance, this analyzer uses the ChineseTokenizer class of Lucene, which simply separates every logogram into its own token. So we get two tokens: 手 and 机.

The Elasticsearch standard analyzer produces the exact same output. For this reason, the chinese analyzer is deprecated, soon to be replaced by standard, and you should avoid it.
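
As a rough picture of why this is a problem, the following Python sketch mimics the one-token-per-logogram behavior (it is an imitation of the output, not Lucene's code) and shows that “cell phone” and “machine gun” end up sharing a token:

```python
def single_character_tokens(text):
    """Emit one token per non-space character, ignoring word boundaries."""
    return [char for char in text if not char.isspace()]


cell_phone = single_character_tokens("手机")
machine_gun = single_character_tokens("机枪")
print(cell_phone, machine_gun)

# Tokens the two words have in common: a search on one can match the other.
shared = set(cell_phone) & set(machine_gun)
print(shared)
```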

The paoding plugin

Paoding is almost an industry standard and is known as an elegant solution. Sadly, the plugin for Elasticsearch is unmaintained and I only managed to make it work on version 1.0.1, after some modifications. Here is how to install it manually:

git clone git@github.com:damienalexandre/elasticsearch-analysis-paoding.git /tmp/elasticsearch-analysis-paoding
    cd /tmp/elasticsearch-analysis-paoding
    mvn clean package
    sudo /usr/share/elasticsearch/bin/plugin -url file:/tmp/elasticsearch-analysis-paoding/target/releases/elasticsearch-analysis-paoding-1.2.2.zip -install elasticsearch-analysis-paoding

    # Copy all the dictionary config files to the ES config path. Make sure the permissions are right: ES needs to write in /etc/elasticsearch/config/paoding!
    sudo cp -r config/paoding /etc/elasticsearch/config/

After this clumsy installation process (to be done on all your nodes), we now have a new paoding tokenizer and two collectors: max_word_len and most_word. No analyzer is exposed by default so we have to declare a new one:

PUT /chinese_test
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
          "tokenizer": {
            "paoding1": {
              "type": "paoding",
              "collector": "most_word"
            },
            "paoding2": {
              "type": "paoding",
              "collector": "max_word_len"
            }
          },
          "analyzer": {
            "paoding_analyzer1": {
              "type": "custom",
              "tokenizer": "paoding1",
              "filter": ["standard"]
            },
            "paoding_analyzer2": {
              "type": "custom",
              "tokenizer": "paoding2",
              "filter": ["standard"]
            }
          }
        }
      }
    }

Both configurations provide good results on our test word, each producing a single clean token. Behavior is also very good with more complex sentences.

The cjk analyzer

A very straightforward analyzer: it simply splits text into bi-grams (pairs of adjacent characters). Applied to Latin text, the same technique would turn “Batman” into a list of meaningless tokens: Ba, at, tm, ma, an. For Asian languages, it is a good and very simple solution, at the price of a bigger index and sometimes imperfectly relevant results.

In our case, with a two-logogram word, only 手机 is indexed, which looks good. But if we take a longer word like 元宵节 (Lantern festival), two tokens are generated: 元宵 and 宵节, meaning respectively lantern and Xiao Festival.
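
The bi-gram idea itself fits in a few lines of Python. This sketch only illustrates the technique on raw strings; the real analyzer is a Lucene component and does more than this.

```python
def bigrams(text):
    """Return every pair of adjacent characters in text."""
    if len(text) < 2:
        # Nothing to pair up: keep a lone character as its own token.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(bigrams("手机"))    # a two-character word stays whole
print(bigrams("元宵节"))  # a three-character word yields two overlapping tokens
```

A word of n characters always produces n - 1 tokens, which is where the bigger index comes from.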

The smart chinese plugin

Very easy to install thanks to the guys at Elasticsearch maintaining it:

bin/plugin -install elasticsearch/elasticsearch-analysis-smartcn/2.3.0

It exposes a new smartcn analyzer, as well as the smartcn_tokenizer tokenizer, using the SmartChineseAnalyzer from Lucene.

It takes a probabilistic approach to find the optimal separation of words, using a hidden Markov model trained on a large number of texts. A training dictionary is therefore already embedded, and it is quite good on common text: our example is properly tokenized.

The ICU plugin

Another official plugin. Elasticsearch supports the “International Components for Unicode” libraries.

bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1

This plugin is also recommended if you deal with any language other than English. I use it all the time for French content!

It exposes an icu_tokenizer tokenizer that we will use, as well as a lot of great analysis tools like icu_normalizer, icu_folding, icu_collation, etc.

It works with a dictionary for Chinese and Japanese texts, containing word frequency information used to deduce logogram groups. On 手机, everything works as expected, but on 元宵节, two tokens are produced: 元宵 and 节. That’s because lantern and festival are more common than Lantern festival.
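
Here is a hedged Python sketch of how word frequencies can drive such a decision. It scores every possible split by the product of its word frequencies and keeps the best one. The frequencies are invented for this example, and the real ICU dictionary and algorithm differ in their details.

```python
import math

# Hypothetical relative frequencies, invented for this illustration only.
TOY_FREQUENCIES = {"元宵": 0.02, "节": 0.05, "元宵节": 0.0005, "元": 0.01, "宵": 0.001}


def best_segmentation(text, frequencies=TOY_FREQUENCIES, max_len=4):
    """Return the split whose words have the highest combined probability."""
    # best[i] holds (log-probability, tokens) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word not in frequencies or best[start][0] == -math.inf:
                continue  # unknown word, or no valid split reaches `start`
            score = best[start][0] + math.log(frequencies[word])
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    # An empty list means the text cannot be covered by known words.
    return best[len(text)][1]


print(best_segmentation("元宵节"))

# With a higher (still hypothetical) frequency for the full word, it wins instead.
boosted = dict(TOY_FREQUENCIES, 元宵节=0.005)
print(best_segmentation("元宵节", boosted))
```

With these numbers, 0.02 × 0.05 = 0.001 beats 0.0005, so the two-token split is preferred, which matches the behavior described above.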

Results breakdown

Analyzer | 手机 (cell phone) | 元宵节 (Lantern festival) | 元宵節 (Lantern festival with traditional)
chinese | [手] [机] | [元] [宵] [节] | [元] [宵] [節]
paoding most_word | [手机] | [元宵] [元宵节] | [元宵] [節]
paoding max_word_len | [手机] | [元宵节] | [元宵] [節]
cjk | [手机] | [元宵] [宵节] | [元宵] [宵節]
smartcn | [手机] | [元宵节] | [元宵] [節]
icu_tokenizer | [手机] | [元宵] [节] | [元宵節]

These tests have been done with Elasticsearch 1.3.2 except for Paoding under ES 1.0.1.

From my point of view, paoding and smartcn get the best results. The chinese tokenizer is very bad and the icu_tokenizer is a bit disappointing on 元宵节, but handles traditional Chinese very well.

Support for traditional Chinese

As stated in the introduction, you may have to deal with traditional Chinese, either in your documents or in users’ search requests. You need a normalization step to convert traditional input into simplified Chinese, because plugins like smartcn or paoding can’t handle it correctly.

You can do so from your application or try to handle it inside Elasticsearch directly with the elasticsearch-analysis-stconvert plugin. It can convert words between traditional and simplified Chinese, in both directions. Sadly, you will have to compile it manually, much like the paoding plugin shown above.
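
On the application side, the normalization step could look like this minimal Python sketch. The conversion table is a tiny hand-written one covering only characters used in this article; a real application would need a complete conversion table or a dedicated library.

```python
# Tiny hand-written table, for illustration only: it covers just the
# traditional characters that appear in this article.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({"節": "节", "書": "书"})


def to_simplified(text):
    """Replace known traditional characters by their simplified form."""
    return text.translate(TRADITIONAL_TO_SIMPLIFIED)


print(to_simplified("元宵節"))
print(to_simplified("書法"))
```

Running the same function on both the documents at index time and the user input at search time keeps the two sides consistent.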

The last solution is to use cjk: if you can’t tokenize input correctly, you still have a good chance of catching the documents you need, and you can then improve relevance with a signal based on the icu_tokenizer, which is quite good too.

Going further with Chinese?

There is no perfect one-size-fits-all solution for analyzing with Elasticsearch, regardless of the content you deal with, and that’s true for Chinese as well. You have to compose and build your own analyzer with the information you get. For example, I’m going with cjk and smartcn tokenization on my search fields, using multi-fields and the multi-match query.
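
As a sketch of that last setup, here is how the index body and the search body could be built in Python. The type name (article), the field name (title) and the boost are hypothetical, and the mapping uses the Elasticsearch 1.x syntax current at the time of this writing (string fields with sub-fields); it assumes the smartcn plugin is installed.

```python
import json

# Index body: "title" is analyzed with smartcn, and its "cjk" sub-field
# indexes the same text with the cjk analyzer (hypothetical names).
index_body = {
    "mappings": {
        "article": {
            "properties": {
                "title": {
                    "type": "string",
                    "analyzer": "smartcn",
                    "fields": {
                        "cjk": {"type": "string", "analyzer": "cjk"}
                    },
                }
            }
        }
    }
}


def build_search_body(user_input):
    """Search both variants of the field, favoring the smartcn one."""
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                "fields": ["title^2", "title.cjk"],
            }
        }
    }


print(json.dumps(index_body, ensure_ascii=False, indent=2))
print(json.dumps(build_search_body("手机"), ensure_ascii=False))
```

The smartcn field rewards clean word matches, while the cjk sub-field acts as a safety net when the segmentation fails.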

To learn more about Chinese, I recommend Chineasy, which is a great way to get some basic reading skills! Learning such a rich language is not easy and you should also read this article before going for it, just so you know what you’re getting into! 快乐编码

Comments
TaylorRen

Wow, you must have been programming for Chinese search very often!

Stomme_poes

Hey cool, I remember this example from BasisTech http://www.basistech.com/text-analytics/rosette/base-linguistics/
(under tokenisation, where the word "student" erroneously appears in the text "Beijing University Biology Department", I thought through improper bi-gramming).

When learning about tokenisation in Sphinx, I ran across BasisTech. Their plugin for CJK languages is the only option so far that I've heard of for Sphinx, and apparently they also work with ES. However, unlike most ES or Sphinx plugins, BasisTech Rosetta is proprietary.

Mittineague

Sorry, but I don't even recognize that as being a word. Seems more like a "making text search find glyphs representing concepts" thing to me.

Stomme_poes

flyGOmachines!! (airplanes)

Mountainsky is that stuff they mine in places like Colorado and sell on nature calendars. It's real pretty. Compare with plainssky, which they don't mine in Kansas because nobody wants it.

TaylorRen

The case "student" appears in text "Beijing University Biology Department" is totally probable. But normally, such mistakes should be avoided.
