- Today, our product team reported an issue: why does searching for “be” in our system return no relevant results?
- Our database does contain videos with phrases like “i want to be xxx”.
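For reference, the symptom looks roughly like this (a minimal sketch; the `videos` index and `title` field are illustrative names, not our actual schema):

```json
POST /videos/_search
{
  "query": {
    "match": { "title": "be" }
  }
}

// Returns zero hits: "be" is dropped as a stop word during analysis,
// so it never matches documents whose titles contain "i want to be xxx".
```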
Stop Words
We use the IK Analyzer. After checking the project’s GitHub issues, we learned that this behavior is caused by stop words. Reference: Stop words explanation
Solution: Remove the specific stop word
- Navigate to the Elasticsearch root directory (a Docker container in my case):

```bash
cd /usr/share/elasticsearch
```

- Access the IK Analyzer configuration (located in the `config` directory for newer versions):

```bash
cd config/analysis-ik
```
- Check the English stop words file:
```text
# cat stopword.dic
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
```
- Remove the “be” line from stopword.dic
- Restart Elasticsearch
- Reindex the documents. Searches for “be” will now return relevant results. (A scripted sketch of these steps follows below.)
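Roughly, the whole fix can be scripted as follows. This is only a sketch: the container name `elasticsearch` and the index names `videos`/`videos_v2` are assumptions, and the destination index is expected to already exist with the same IK-based mappings.

```bash
# Inside the container: drop the "be" line from the stop words file
docker exec elasticsearch bash -c \
  "sed -i '/^be$/d' /usr/share/elasticsearch/config/analysis-ik/stopword.dic"

# Restart Elasticsearch so the IK Analyzer reloads its dictionaries
docker restart elasticsearch

# Reindex existing documents so that "be" is analyzed and indexed again
# (copying into a fresh index here; adjust index names to your setup)
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "videos" },
  "dest":   { "index": "videos_v2" }
}'
```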
Custom Dictionary
- Analyze current tokenization:
```json
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "q宠大乱斗"
}

// Response:
{
  "tokens": [
    {"token": "q", "type": "ENGLISH", ...},
    {"token": "宠大", "type": "CN_WORD", ...},
    {"token": "大乱斗", "type": "CN_WORD", ...},
    {"token": "大乱", "type": "CN_WORD", ...},
    {"token": "斗", "type": "CN_CHAR", ...}
  ]
}
```
- Modify `IKAnalyzer.cfg.xml`:

```xml
<!-- Before -->
<entry key="ext_dict"></entry>

<!-- After -->
<entry key="ext_dict">custom_words.dic</entry>
```
- Create custom dictionary file:
```text
# cat custom_words.dic
q宠
```
- Restart Elasticsearch
- Verify the new tokenization by re-running the same _analyze request:

```json
// Response:
{
  "tokens": [
    {"token": "q宠", "type": "CN_WORD", ...},
    {"token": "q", "type": "ENGLISH", ...},
    {"token": "宠大", "type": "CN_WORD", ...},
    {"token": "大乱斗", "type": "CN_WORD", ...},
    {"token": "大乱", "type": "CN_WORD", ...},
    {"token": "斗", "type": "CN_CHAR", ...}
  ]
}
```
Now the custom term “q宠” is properly recognized in tokenization results.
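To make the new term searchable end to end, the IK analyzer also has to be applied to the relevant field and existing documents reindexed. A minimal mapping sketch, assuming a hypothetical `videos` index with a `title` field (the names are illustrative, not from our actual setup):

```json
PUT /videos
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

With such a mapping, a document titled “q宠大乱斗” is indexed with the “q宠” token, so a match query for “q宠” returns it.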