- When implementing Chinese full-text search in Elasticsearch, many developers naturally reach for the IK analysis plugin (elasticsearch-analysis-ik) from GitHub
- A common mapping, following the plugin's documentation examples, indexes with ik_max_word and analyzes queries with ik_smart:
{
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
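Such a mapping might be applied when creating the index, roughly like this (a sketch: the index name test is assumed from the queries below, and a node with the IK plugin installed is required):

```shell
# Create the test index; the title field indexes with ik_max_word
# but analyzes queries with ik_smart (the mapping shown above).
curl -X PUT 127.0.0.1:9200/test -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}'
```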
- Let’s examine a test index containing two documents: “打火车” (literally “hit the train”) and “火车” (“train”)
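The two documents could have been indexed with something like this (a sketch; the ids match the responses that follow):

```shell
# Index the two test documents; ?refresh makes them immediately searchable.
curl -X PUT '127.0.0.1:9200/test/_doc/Video_1?refresh' \
  -H 'Content-Type: application/json' -d '{"id": 1, "title": "打火车"}'
curl -X PUT '127.0.0.1:9200/test/_doc/Video_2?refresh' \
  -H 'Content-Type: application/json' -d '{"id": 2, "title": "火车"}'
```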
curl 127.0.0.1:9200/test/_search | jq
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_score": 1,
"_source": {
"id": 1,
"title": "打火车"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "Video_2",
"_score": 1,
"_source": {
"id": 2,
"title": "火车"
}
}
]
}
}
- Searching for “打火车” yields an unexpected ranking:
curl '127.0.0.1:9200/test/_search?q=打火车' | jq
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.21110919,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "Video_2",
"_score": 0.21110919,
"_source": {
"id": 2,
"title": "火车"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_score": 0.160443,
"_source": {
"id": 1,
"title": "打火车"
}
}
]
}
}
- Surprisingly, the document containing only “火车” scores higher than the exact match “打火车”. The term vectors of Video_1 show why:
curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"title": {
"field_statistics": {
"sum_doc_freq": 3,
"doc_count": 2,
"sum_ttf": 3
},
"terms": {
"打火": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 2
}
]
},
"火车": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 1,
"end_offset": 3
}
]
}
}
}
}
}
- At index time, ik_max_word split “打火车” into “打火” (“strike a light”) and “火车” (“train”), with no standalone term “打”. At search time, ik_smart analyzes the query “打火车” into “打” and “火车”, so “打” matches nothing: both documents match only on “火车”, and BM25 length normalization favors the shorter document “火车”
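Since both documents match the query only on the shared term “火车” (df = 2 out of N = 2 documents), the score gap comes purely from field length. A quick sketch with Elasticsearch's default BM25 parameters (k1 = 1.2, b = 0.75) reproduces the two scores:

```shell
# Reproduce the two scores with the BM25 formula (a sketch; k1/b are
# the Elasticsearch defaults, the other numbers come from this index):
#   N = 2 docs, the matched term 火车 appears in both (df = 2),
#   tf = 1 in each doc, average field length = (2 + 1) / 2 = 1.5.
# The query term 打 matches no indexed term, so it contributes 0.
awk 'BEGIN {
  k1 = 1.2; b = 0.75
  N = 2; df = 2; tf = 1; avgdl = 1.5
  idf = log(1 + (N - df + 0.5) / (df + 0.5))
  for (dl = 1; dl <= 2; dl++) {
    score = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    printf "field length %d -> score %.4f\n", dl, score
  }
}'
```

Field length 1 (“火车”) gives ≈ 0.2111 and field length 2 (“打火车”) gives ≈ 0.1604, matching the scores in the response above up to float rounding.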
- Solution: use the same analyzer (ik_smart here) for both indexing and searching, so that query terms line up with indexed terms:
{
"properties": {
"title": {
"type": "text",
"analyzer": "ik_smart",
"search_analyzer": "ik_smart"
}
}
}
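The difference between the two analyzers can also be checked directly with the _analyze API (again assuming a local node with the IK plugin); the token comments below follow from the term vectors shown in this post:

```shell
# How ik_max_word splits the string: 打火, 火车 (no standalone 打)
curl 127.0.0.1:9200/_analyze -H 'Content-Type: application/json' -d '
{"analyzer": "ik_max_word", "text": "打火车"}'

# How ik_smart splits it: 打, 火车
curl 127.0.0.1:9200/_analyze -H 'Content-Type: application/json' -d '
{"analyzer": "ik_smart", "text": "打火车"}'
```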
- The analyzer of an existing field cannot be changed in place, so the index was deleted, re-created with the new mapping, and the documents re-indexed. Term vectors confirm the new tokenization:
curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"title": {
"field_statistics": {
"sum_doc_freq": 3,
"doc_count": 2,
"sum_ttf": 3
},
"terms": {
"打": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 1
}
]
},
"火车": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 1,
"end_offset": 3
}
]
}
}
}
}
}
- The same search now ranks the exact match first:
curl '127.0.0.1:9200/test/_search?q=打火车' | jq
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.77041256,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_score": 0.77041256,
"_source": {
"id": 1,
"title": "打火车"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "Video_2",
"_score": 0.21110919,
"_source": {
"id": 2,
"title": "火车"
}
}
]
}
}