
Stop Words and Custom Dictionaries in Elasticsearch

Customize Your Search Dictionary

  • Today, our product team reported an issue: why does searching for “be” return no relevant results in our system?
  • There are indeed videos in our database containing phrases like “i want to be xxx”.

Stop Words

We use the IK Analyzer. After checking the plugin’s GitHub issues, we learned that the culprit is stop words: common words such as “be” are dropped during analysis, so they never make it into the index. Reference: Stop words explanation
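You can confirm the behavior with a quick _analyze request. The sketch below is only illustrative, assuming Elasticsearch is reachable at localhost:9200 and the field uses ik_max_word; if “be” is listed in the stop words file, it should be missing from the returned tokens.

# Sketch: check how ik_max_word tokenizes a phrase containing "be"
# (assumes Elasticsearch at localhost:9200; adjust host/auth to your setup)
curl -s -X POST "localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
    "analyzer": "ik_max_word",
    "text": "i want to be xxx"
  }'
# If "be" is treated as a stop word, it will not appear among the tokens,
# which is why it can never be matched at search time.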

Solution: Remove the specific stop word

  1. Navigate to the Elasticsearch root directory (a Docker container in my case)
  2. cd /usr/share/elasticsearch
  3. Open the IK Analyzer configuration directory (located under config in newer versions)
  4. cd config/analysis-ik
  5. Check the English stop words file:
# cat stopword.dic 
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
  6. Remove the target stop word “be” from stopword.dic
  7. Restart Elasticsearch
  8. Reindex the documents; searches for “be” will now return relevant results (a rough shell sketch of these steps follows below)
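Here is a rough shell sketch of steps 6–8, assuming the container is named elasticsearch and the index is called videos (both are placeholder names), with _reindex used to re-analyze existing documents into a fresh index:

# Drop the line "be" from the English stop words file inside the container
docker exec elasticsearch sed -i '/^be$/d' /usr/share/elasticsearch/config/analysis-ik/stopword.dic

# Restart so the IK Analyzer reloads its dictionaries
docker restart elasticsearch

# Copy documents into a new index so they are re-analyzed without the stop word
# (create "videos_v2" with the same mappings first; both index names are placeholders)
curl -s -X POST "localhost:9200/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{
    "source": { "index": "videos" },
    "dest":   { "index": "videos_v2" }
  }'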

Custom Dictionary

  1. Analyze current tokenization:
POST /_analyze
{
     "analyzer": "ik_max_word",
     "text": "q宠大乱斗"
}

// Response:
{
	"tokens": [
		{"token": "q", "type": "ENGLISH", ...},
		{"token": "宠大", "type": "CN_WORD", ...},
		{"token": "大乱斗", "type": "CN_WORD", ...},
		{"token": "大乱", "type": "CN_WORD", ...},
		{"token": "斗", "type": "CN_CHAR", ...}
	]
}
  2. Modify IKAnalyzer.cfg.xml:
<!-- Before -->
<entry key="ext_dict"></entry>

<!-- After -->
<entry key="ext_dict">custom_words.dic</entry>
  3. Create the custom dictionary file:
# cat custom_words.dic
q宠
  4. Restart Elasticsearch
  5. Verify the new tokenization (the whole procedure is scripted in the sketch at the end of this post):
{
	"tokens": [
		{"token": "q宠", "type": "CN_WORD", ...},
		{"token": "q", "type": "ENGLISH", ...},
		{"token": "宠大", "type": "CN_WORD", ...},
		{"token": "大乱斗", "type": "CN_WORD", ...},
		{"token": "大乱", "type": "CN_WORD", ...},
		{"token": "斗", "type": "CN_CHAR", ...}
	]
}

Now the custom term “q宠” is properly recognized in tokenization results.
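For reference, the whole custom-dictionary change can be scripted roughly like this (a sketch assuming the same Docker container name elasticsearch and the default analysis-ik paths shown earlier; adjust to your deployment):

IK_CONF=/usr/share/elasticsearch/config/analysis-ik

# Append the new term to the custom dictionary (one term per line, UTF-8)
docker exec elasticsearch sh -c "echo 'q宠' >> $IK_CONF/custom_words.dic"

# IKAnalyzer.cfg.xml must already point ext_dict at custom_words.dic (see above);
# restart so the IK Analyzer loads the extra dictionary
docker restart elasticsearch

# Re-check the tokenization
curl -s -X POST "localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{ "analyzer": "ik_max_word", "text": "q宠大乱斗" }'

As with the stop word change, documents indexed before the dictionary update keep their old tokens, so reindex them if existing data needs to be searchable by the new term.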