Efficient Chinese Search with Elasticsearch

Key Takeaways

Analysis and tokenization are essential steps when indexing content in Elasticsearch, particularly for languages other than English. For Chinese, the process is more complex because the script is made of logograms and there are no spaces between words.
Several solutions are available for analyzing Chinese content in Elasticsearch, including the default Chinese analyzer, the paoding plugin, the cjk analyzer, the smart Chinese analyzer, and the ICU plugin. Each has its strengths and weaknesses and should be chosen based on specific needs.
Paoding and smartcn are considered the most effective analyzers for Chinese content. However, it’s important to note that no one-size-fits-all solution exists for analyzing with Elasticsearch – the optimal solution depends on the specific content and requirements.
Traditional Chinese requires a normalization step that converts it into simplified Chinese before plugins like smartcn or paoding can handle it properly. This can be done within the application or directly in Elasticsearch using the elasticsearch-analysis-stconvert plugin. The cjk analyzer can also be used if proper tokenization of input isn’t possible.

If you have played with Elasticsearch, you already know that analysis and tokenization are the most important steps when indexing content: without them, relevance suffers, your results are poorly sorted and your users are unhappy.

Even with English content, you can lose relevance with bad stemming, miss documents when elision is not handled properly, and so on. It gets worse when you index another language: the default analyzers are not all-purpose.

When dealing with Chinese documents, everything is even more complex, even when considering only Mandarin, the official language of China and the most spoken language worldwide. Let’s dig into Chinese content tokenization and look at the best ways of doing it with Elasticsearch.

What is so hard about Chinese search?

Chinese characters are logograms: each one represents a word or a morpheme (the smallest meaningful unit of language). Put together, their meaning can change and represent a whole new word. Another difficulty is that there is no space between words or sentences, making it very hard for a computer to know where a word starts or ends.

There are tens of thousands of Chinese characters, although in practice reading written Chinese requires knowing between three and four thousand of them. Let’s see an example: the word “volcano” (火山) is in fact the combination of:

火: fire
山: mountain

Our tokenizer must be clever enough to avoid separating those two logograms, because the meaning is changed when they are not together.
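
To make the problem concrete, here is a minimal sketch of dictionary-based segmentation using greedy forward maximum matching. It is not how any of the analyzers discussed below is implemented; the `max_match` function and the tiny `TOY_DICTIONARY` are hypothetical and exist only to show why a tokenizer needs word knowledge to keep 火山 together.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; real dictionaries hold many thousands of words.
TOY_DICTIONARY = {"火山", "手机", "机枪"}

print(max_match("火山", TOY_DICTIONARY))  # ['火山']
print(max_match("火山", set()))           # ['火', '山']
```

With an empty dictionary the word falls apart into 火 (fire) and 山 (mountain), which is exactly the behavior we want to avoid.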

Another difficulty is the spelling variants used:

simplified Chinese: 书法;
traditional Chinese, more complex and richer: 書法;
and pinyin, a Romanized form of Mandarin: shū fǎ.

Analyzing Chinese content

At the time of this writing, here are the solutions available with Elasticsearch:

the default Chinese analyzer, based on deprecated classes from Lucene 4;
the paoding plugin, sadly not maintained but based on very good dictionaries;
the cjk analyzer, which makes bi-grams of your content;
the smart chinese analyzer, distributed as an officially supported plugin;
and finally the ICU plugin and its tokenizer.

These analyzers are very different and we will compare how well they perform with a simple test word: 手机.
It means “Cell phone” and is composed of two logograms, which mean “hand” and “machine” respectively. The 机 logogram also composes a lot of other words:

机票: plane ticket
机器人: robot
机枪: machine gun
机遇: opportunity

Our tokenization must not split those logograms, because if I search for “Cell phone”, I do not want any documents about Rambo owning a machine gun and looking bad-ass.

We are going to test our solutions with the great _analyze API:

curl -XGET 'http://localhost:9200/chinese_test/_analyze?analyzer=paoding_analyzer1' -d '手机'
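
When comparing several analyzers, it helps to reduce each _analyze response to its list of tokens. The sketch below assumes the response is a JSON object with a `tokens` array whose entries carry a `token` field; `extract_tokens` is a hypothetical helper and `SAMPLE_RESPONSE` is a hand-written illustration of that shape, not output captured from a cluster.

```python
import json


def extract_tokens(response_body):
    """Return the token strings from an _analyze JSON response body."""
    payload = json.loads(response_body)
    return [entry["token"] for entry in payload.get("tokens", [])]


# Hand-written sample in the shape of an _analyze response (assumption).
SAMPLE_RESPONSE = json.dumps({
    "tokens": [
        {"token": "手机", "start_offset": 0, "end_offset": 2,
         "type": "word", "position": 1}
    ]
})

print(extract_tokens(SAMPLE_RESPONSE))  # ['手机']
```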

Also, did I mention this awesome cheat sheet for Elasticsearch yet?

The default Chinese analyzer

Already available on your Elasticsearch instance, this analyzer uses the ChineseTokenizer class of Lucene, which only separates all logograms into tokens. So we are getting two tokens: 手 and 机.

The Elasticsearch standard analyzer produces the exact same output. For this reason, the chinese analyzer is deprecated, will soon be replaced by standard, and should be avoided.

The paoding plugin

Paoding is almost an industry standard and is known as an elegant solution. Sadly, the plugin for Elasticsearch is unmaintained and I only managed to make it work on version 1.0.1, after some modifications. Here is how to install it manually:

git clone git@github.com:damienalexandre/elasticsearch-analysis-paoding.git /tmp/elasticsearch-analysis-paoding
    cd /tmp/elasticsearch-analysis-paoding
    mvn clean package
    sudo /usr/share/elasticsearch/bin/plugin -url file:/tmp/elasticsearch-analysis-paoding/target/releases/elasticsearch-analysis-paoding-1.2.2.zip -install elasticsearch-analysis-paoding

    # Copy all the dic config files to the ES config path and check the permissions: ES needs to write in /etc/elasticsearch/config/paoding!
    sudo cp -r config/paoding /etc/elasticsearch/config/

After this clumsy installation process (to be done on all your nodes), we now have a new paoding tokenizer and two collectors: max_word_len and most_word. No analyzer is exposed by default so we have to declare a new one:

PUT /chinese_test
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
          "tokenizer": {
            "paoding1": {
              "type": "paoding",
              "collector": "most_word"
            },
            "paoding2": {
              "type": "paoding",
              "collector": "max_word_len"
            }
          },
          "analyzer": {
            "paoding_analyzer1": {
              "type": "custom",
              "tokenizer": "paoding1",
              "filter": ["standard"]
            },
            "paoding_analyzer2": {
              "type": "custom",
              "tokenizer": "paoding2",
              "filter": ["standard"]
            }
          }
        }
      }
    }

Both configurations provide good results, with a clean and unique token. Behavior is also very good with more complex sentences.

The cjk analyzer

A very straightforward analyzer: it simply turns CJK text into overlapping bi-grams. Applied to Latin letters, the same idea would turn “Batman” into a list of meaningless tokens: Ba, at, tm, ma, an. For Asian languages, this is a good and very simple solution, at the price of a bigger index and sometimes imperfect relevance.

In our case, with a two-logogram word, only 手机 is indexed, which looks good. But with a longer word like 元宵节 (Lantern festival), two tokens are generated: 元宵 and 宵节, meaning lantern and Xiao Festival respectively.
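
The difference between the per-character split of the default chinese analyzer and the bi-grams of cjk can be reproduced in a few lines. This is a simplified sketch that assumes the input contains only CJK characters; the `unigrams` and `bigrams` helpers are hypothetical and ignore everything else the real analyzers do (lowercasing, width normalization, mixed scripts).

```python
def unigrams(text):
    """One token per character, like the default chinese analyzer."""
    return list(text)


def bigrams(text):
    """Overlapping pairs of characters, like the cjk analyzer.
    A lone character is kept as a single token."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(unigrams("手机"))   # ['手', '机']
print(bigrams("手机"))    # ['手机']
print(bigrams("元宵节"))  # ['元宵', '宵节']
```

A word of n characters yields n - 1 bi-grams, which is where the bigger index comes from.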

The smart chinese plugin

Very easy to install thanks to the guys at Elasticsearch maintaining it:

bin/plugin -install elasticsearch/elasticsearch-analysis-smartcn/2.3.0

It exposes a new smartcn analyzer, as well as the smartcn_tokenizer tokenizer, using the SmartChineseAnalyzer from Lucene.

It uses a probabilistic approach to find the optimal separation of words, based on a Hidden Markov Model trained on a large body of text. The embedded training dictionary is quite good on common text: our example is properly tokenized.

The ICU plugin

Another official plugin. Elasticsearch supports the “International Components for Unicode” libraries.

bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1

This plugin is also recommended if you deal with any language other than English. I use it all the time for French content!

It exposes an icu_tokenizer tokenizer that we will use, as well as a lot of great analysis tools like icu_normalizer, icu_folding, icu_collation, etc.

It works with a dictionary for Chinese and Japanese texts, containing information about word frequency to deduce logogram groups. On 手机, everything is fine and works as expected, but on 元宵节, two tokens are produced: 元宵 and 节 – that’s because lantern and festival are more common than Lantern festival.
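
The effect of word frequency on segmentation can be illustrated with a toy model. The sketch below is not the ICU algorithm: it is a simple dynamic-programming search for the split with the highest product of word probabilities, and the counts in `TOY_FREQUENCIES` are invented for the example. It only shows how a rare compound can lose against its more frequent parts.

```python
import math


def best_segmentation(text, frequencies, max_len=4):
    """Return the split of `text` that maximizes the sum of log-probabilities
    of its words, using counts from `frequencies`."""
    total = sum(frequencies.values())
    # best[i] holds (score, tokens) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            count = frequencies.get(word)
            if count is None:
                if len(word) > 1:
                    continue  # unknown multi-character strings are not words
                count = 1     # unknown single characters get a tiny pseudo-count
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(text)][1]


# Invented counts, for illustration only.
TOY_FREQUENCIES = {"元宵": 500, "节": 5000, "元宵节": 20,
                   "手机": 3000, "手": 800, "机": 600}

print(best_segmentation("手机", TOY_FREQUENCIES))    # ['手机']
print(best_segmentation("元宵节", TOY_FREQUENCIES))  # ['元宵', '节']
```

Raising the count of 元宵节 far enough makes the whole word win, which is why the quality of the frequency data matters so much for this family of tokenizers.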

Results breakdown

Analyzer	手机 (cell phone)	元宵节 (Lantern festival)	元宵節 (Lantern festival, traditional Chinese)
chinese	[手] [机]	[元] [宵] [节]	[元] [宵] [節]
paoding most_word	[手机]	[元宵] [元宵节]	[元宵] [節]
paoding max_word_len	[手机]	[元宵节]	[元宵] [節]
cjk	[手机]	[元宵] [宵节]	[元宵] [宵節]
smartcn	[手机]	[元宵节]	[元宵] [節]
icu_tokenizer	[手机]	[元宵] [节]	[元宵節]

These tests have been done with Elasticsearch 1.3.2 except for Paoding under ES 1.0.1.

From my point of view, paoding and smartcn get the best results. The chinese tokenizer is very bad and the icu_tokenizer is a bit disappointing on 元宵节, but handles traditional Chinese very well.

Support for traditional Chinese

As stated in the introduction, you may have to deal with traditional Chinese, either in your documents or in users’ search requests. You need a normalization step that converts those traditional inputs into simplified Chinese, because plugins like smartcn or paoding can’t handle them correctly.

You can do so from your application, or try to handle it inside Elasticsearch directly with the elasticsearch-analysis-stconvert plugin. It can convert words in both directions, from traditional to simplified Chinese and back. Sadly, you will have to compile it manually, much like the paoding plugin shown above.
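
If you normalize in the application, the idea boils down to a character mapping applied before indexing and before searching. The sketch below is deliberately tiny: `TOY_T2S` is a hypothetical three-entry table and `to_simplified` a hypothetical helper, whereas a real conversion table is far larger and cannot always be reduced to one-to-one character replacement.

```python
# Hypothetical, tiny traditional-to-simplified table (illustration only).
TOY_T2S = str.maketrans({"節": "节", "書": "书", "機": "机"})


def to_simplified(text):
    """Replace the traditional characters known to the toy table."""
    return text.translate(TOY_T2S)


print(to_simplified("元宵節"))  # 元宵节
print(to_simplified("書法"))    # 书法
```

Once the input is normalized this way, 元宵節 reaches the tokenizer as 元宵节 and can be segmented like any simplified text.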

The last solution is to use cjk: even if you can’t tokenize input correctly, you still have a good chance of catching the documents you need, and you can then improve relevance with a signal based on the icu_tokenizer, which is quite good too.

Going further with Chinese?

There is no perfect one-size-fits-all solution for analyzing with Elasticsearch, regardless of the content you deal with, and that’s true for Chinese as well. You have to compose and build your own analyzer with the information you get. For example, I’m going with cjk and smartcn tokenization on my search fields, using multi-fields and the multi-match query.
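
As a rough sketch of that setup, the index body and query below analyze the same text twice, once with smartcn and once with cjk, and search both with a multi-match query. The `title` field, the `title.cjk` sub-field and the boost value are assumptions made for the example. The mapping uses the syntax of recent Elasticsearch versions (typeless mappings, `text` fields); on the 1.x versions tested in this article the field type would be `string` and the mapping would sit under a type name.

```python
import json

# Index body: `title` analyzed with smartcn, plus a `title.cjk` sub-field
# analyzed with cjk (field names are hypothetical).
INDEX_BODY = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "smartcn",
                "fields": {
                    "cjk": {"type": "text", "analyzer": "cjk"}
                }
            }
        }
    }
}


def build_search(text):
    """Search both sub-fields; the word-level field gets a higher weight."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": ["title^2", "title.cjk"]
            }
        }
    }


print(json.dumps(build_search("手机"), ensure_ascii=False, indent=2))
```

The cjk sub-field acts as a safety net for words smartcn splits badly, while the boosted smartcn field keeps precise matches on top.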

To learn more about Chinese I recommend Chineasy, which is a great way to get some basic reading skills! Learning such a rich language is not easy and you should also read this article before going for it, just so you know what you’re getting into! 快乐编码！

Frequently Asked Questions (FAQs) on Efficient Chinese Search with Elasticsearch

How does Elasticsearch handle Chinese language search?

Elasticsearch is a powerful search engine that can handle multiple languages, including Chinese. Out of the box, the standard analyzer splits Chinese text into single characters and the cjk analyzer splits it into bi-grams. For word-level segmentation, the official Smart Chinese Analysis (smartcn) plugin breaks Chinese sentences into individual words, which are then indexed and searchable. It targets Simplified Chinese, so Traditional Chinese input is best normalized first.

What is the role of the Smart Chinese Analysis plugin in Elasticsearch?

The Smart Chinese Analysis plugin is a key component when dealing with Chinese text in Elasticsearch. It uses a Hidden Markov Model to segment Chinese text into separate words, which are then indexed. This step is essential because, unlike English, Chinese text does not have spaces between words. The plugin does not convert characters into Pinyin; Romanized search requires a separate analysis step.

How can I install the Smart Chinese Analysis plugin in Elasticsearch?

Installing the Smart Chinese Analysis plugin is straightforward. On recent versions of Elasticsearch, you can use the plugin management utility by running the command bin/elasticsearch-plugin install analysis-smartcn in your Elasticsearch installation directory. The plugin must be installed on every node, and each node needs a restart for the change to take effect.

How can I configure Elasticsearch to use the Smart Chinese Analysis plugin?

Once the Smart Chinese Analysis plugin is installed, you can configure Elasticsearch to use it by defining a custom analyzer in your index settings. This custom analyzer should use the smartcn_tokenizer and smartcn_stop filter. You can then use this custom analyzer when indexing your Chinese text.
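
A minimal sketch of such index settings follows, written as a Python dictionary that serializes to the JSON body of the index creation request. The analyzer name `my_smartcn` is hypothetical, and the sketch assumes a plugin version that ships the smartcn_stop filter.

```python
import json

# Custom analyzer combining the smartcn tokenizer and stop word filter.
# `my_smartcn` is a made-up name; pick any name you like.
SMARTCN_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_smartcn": {
                    "type": "custom",
                    "tokenizer": "smartcn_tokenizer",
                    "filter": ["smartcn_stop"]
                }
            }
        }
    }
}

print(json.dumps(SMARTCN_SETTINGS, indent=2))
```

Declaring the analyzer yourself, rather than using the packaged smartcn analyzer, leaves room to add further token filters to the chain later.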

Can Elasticsearch handle both Simplified and Traditional Chinese?

Yes, but not with a single tokenizer used as-is. As the results above show, smartcn segments Simplified Chinese correctly but splits Traditional Chinese words poorly. To support both, normalize Traditional input to Simplified (in your application or with the elasticsearch-analysis-stconvert plugin) before tokenizing, or rely on the icu_tokenizer or the cjk analyzer, which cope with Traditional characters directly.

How does Elasticsearch handle Pinyin in Chinese text?

The Smart Chinese Analysis plugin does not convert Chinese characters into Pinyin, the Romanized form of Mandarin. If you want users to find Chinese content by typing Pinyin, you have to add that conversion yourself, either in your application or with a dedicated community plugin such as elasticsearch-analysis-pinyin.

Can I use Elasticsearch for Chinese language search without the Smart Chinese Analysis plugin?

Yes. The built-in standard analyzer indexes each logogram separately and the built-in cjk analyzer indexes bi-grams, so documents can be found without any plugin. Relevance is usually better with real word segmentation, though, which is what the Smart Chinese Analysis plugin or the icu_tokenizer adds on top of the core functionality.

How can I improve the accuracy of Chinese language search in Elasticsearch?

Improving the accuracy of Chinese language search in Elasticsearch involves fine-tuning your index settings and query parameters. For example, you can use the minimum_should_match parameter in your queries to control the number of words that must match in the search results. You can also use the boost parameter to give more weight to certain fields in your documents.
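
As an illustration of those two parameters, the sketch below builds a query body that requires most of the analyzed words to match and weights one field above another. The `title` and `body` field names, the `75%` threshold and the boost of `2.0` are assumptions to be tuned against your own data.

```python
def build_match_query(text, minimum_should_match="75%", title_boost=2.0):
    """Match `text` on two hypothetical fields, boosting the title."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {
                        "query": text,
                        "boost": title_boost,
                        "minimum_should_match": minimum_should_match}}},
                    {"match": {"body": {
                        "query": text,
                        "minimum_should_match": minimum_should_match}}}
                ]
            }
        }
    }


query = build_match_query("元宵节 手机")
print(query["query"]["bool"]["should"][0]["match"]["title"]["boost"])  # 2.0
```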

How does Elasticsearch handle Chinese stop words?

The Smart Chinese Analysis plugin includes a stop word filter, which removes common Chinese words that do not carry much meaning. This filter improves the efficiency of the search process by reducing the number of words that need to be indexed and searched. You can customize this stop word list to suit your specific needs.

Can I use Elasticsearch for other Asian languages?

Yes, Elasticsearch supports multiple languages, including other Asian languages like Japanese and Korean. The built-in cjk analyzer works for all of them, and dedicated analysis plugins (kuromoji for Japanese, nori for Korean) provide word-level segmentation, much like the Smart Chinese Analysis plugin does for Chinese.