
Stop Words and Custom Dictionaries in Elasticsearch

Customize Your Search Dictionary

  • Today, our product team reported an issue: why does searching for “be” return no relevant results in our system?
  • There are indeed videos in our database containing phrases like “i want to be xxx”.

Stop Words

We use the IK Analyzer. After checking the plugin’s GitHub issues, we learned that the culprit is stop words: common words such as “be” are dropped during analysis, so they never make it into the index. Reference: Stop words explanation
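You can confirm the behavior with a quick _analyze request. The sketch below is only illustrative, assuming Elasticsearch is reachable at localhost:9200 and the field uses ik_max_word; if “be” is listed in the stop words file, it should be missing from the returned tokens.

# Sketch: check how ik_max_word tokenizes a phrase containing "be"
# (assumes Elasticsearch at localhost:9200; adjust host/auth to your setup)
curl -s -X POST "localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
    "analyzer": "ik_max_word",
    "text": "i want to be xxx"
  }'
# If "be" is treated as a stop word, it will not appear among the tokens,
# which is why it can never be matched at search time.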

Solution: Remove the specific stop word

  1. Navigate to the Elasticsearch root directory (a Docker container in my case)
  2. cd /usr/share/elasticsearch
  3. Open the IK Analyzer configuration directory (located under config in newer versions)
  4. cd config/analysis-ik
  5. Check the English stop words file:
# cat stopword.dic 
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
  6. Remove the target stop word “be” from stopword.dic
  7. Restart Elasticsearch
  8. Reindex the documents; searches for “be” will now return relevant results (a rough shell sketch of these steps follows below)
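Here is a rough shell sketch of steps 6–8, assuming the container is named elasticsearch and the index is called videos (both are placeholder names), with _reindex used to re-analyze existing documents into a fresh index:

# Drop the line "be" from the English stop words file inside the container
docker exec elasticsearch sed -i '/^be$/d' /usr/share/elasticsearch/config/analysis-ik/stopword.dic

# Restart so the IK Analyzer reloads its dictionaries
docker restart elasticsearch

# Copy documents into a new index so they are re-analyzed without the stop word
# (create "videos_v2" with the same mappings first; both index names are placeholders)
curl -s -X POST "localhost:9200/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{
    "source": { "index": "videos" },
    "dest":   { "index": "videos_v2" }
  }'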

Custom Dictionary

  1. Analyze current tokenization:
POST /_analyze
{
     "analyzer": "ik_max_word",
     "text": "q宠大乱斗"
}

// Response:
{
	"tokens": [
		{"token": "q", "type": "ENGLISH", ...},
		{"token": "宠大", "type": "CN_WORD", ...},
		{"token": "大乱斗", "type": "CN_WORD", ...},
		{"token": "大乱", "type": "CN_WORD", ...},
		{"token": "斗", "type": "CN_CHAR", ...}
	]
}
  2. Modify IKAnalyzer.cfg.xml:
<!-- Before -->
<entry key="ext_dict"></entry>

<!-- After -->
<entry key="ext_dict">custom_words.dic</entry>
  3. Create the custom dictionary file:
# cat custom_words.dic
q宠
  4. Restart Elasticsearch
  5. Verify the new tokenization (the whole procedure is scripted in the sketch at the end of this post):
{
	"tokens": [
		{"token": "q宠", "type": "CN_WORD", ...},
		{"token": "q", "type": "ENGLISH", ...},
		{"token": "宠大", "type": "CN_WORD", ...},
		{"token": "大乱斗", "type": "CN_WORD", ...},
		{"token": "大乱", "type": "CN_WORD", ...},
		{"token": "斗", "type": "CN_CHAR", ...}
	]
}

Now the custom term “q宠” is properly recognized in tokenization results.
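For reference, the whole custom-dictionary change can be scripted roughly like this (a sketch assuming the same Docker container name elasticsearch and the default analysis-ik paths shown earlier; adjust to your deployment):

IK_CONF=/usr/share/elasticsearch/config/analysis-ik

# Append the new term to the custom dictionary (one term per line, UTF-8)
docker exec elasticsearch sh -c "echo 'q宠' >> $IK_CONF/custom_words.dic"

# IKAnalyzer.cfg.xml must already point ext_dict at custom_words.dic (see above);
# restart so the IK Analyzer loads the extra dictionary
docker restart elasticsearch

# Re-check the tokenization
curl -s -X POST "localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{ "analyzer": "ik_max_word", "text": "q宠大乱斗" }'

As with the stop word change, documents indexed before the dictionary update keep their old tokens, so reindex them if existing data needs to be searchable by the new term.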