
06 Custom Analyzers

Analysis (Tokenization)

Analysis (a concept, not an API) is the process of converting full text into a series of tokens; it is also called tokenization. Analysis is carried out by an analyzer: you can use one of Elasticsearch's built-in analyzers or define your own. Besides being applied when data is written, an analyzer can also be applied to the query string at search time.

Analysis consists of three parts. Take Hello a World, the world is beautiful as an example:

  1. Character Filter: preprocesses the raw text, e.g. stripping HTML tags or mapping & to 'and'.
  2. Tokenizer: splits the text into tokens according to rules; English is typically split on whitespace.
  3. Token Filter: performs normalization: removing stop words (a, an, the, is, are, ...), lowercasing, synonym expansion, singular/plural conversion, and so on (see the sketch below).
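
The three stages can be exercised ad hoc through the _analyze API, which accepts a char_filter / tokenizer / filter combination directly. A minimal sketch using the built-in html_strip, standard, lowercase, and stop components (the <b> tags are added here only to give the character filter something to strip):

GET _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "<b>Hello a World</b>, the world is beautiful"
}
# expected tokens: hello, world, world, beautiful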

What is normalization?

Normalization refines the analyzer's output.

The tokens an analyzer produces usually include terms that are unhelpful, ambiguous, or merely inflected variants. For example, I think dogs is human's best friend. is tokenized as: i, think, dogs, human's, best, friend. Here i rarely appears as a search term; dogs is the plural of dog, and dog is the more common search condition; human's is a possessive form that seldom shows up in queries. There is little value in Elasticsearch maintaining all of these terms, and this is where normalization comes in.

For example: when searching for china, should the condition cn match? When searching for dogs, should the condition dog match? If abbreviations (cn) or singular/plural variants (dog & dogs) can still find the intended results, the search engine feels that much more user-friendly.

Normalization exists to improve recall, i.e. to strengthen the engine's ability to find relevant results.

Normalization works in concert with the analyzer to accomplish this.
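
The singular/plural case can be checked with the built-in english analyzer, whose stemmer and possessive filter handle exactly the example sentence above. A quick illustration (the expected tokens are what the english analyzer should produce, not output taken from the original text):

GET _analyze
{
  "analyzer": "english",
  "text": "I think dogs is human's best friend"
}
# expected tokens: i, think, dog, human, best, friend
# ("is" is an English stop word; "dogs" is stemmed to "dog";
#  the possessive "'s" is stripped from "human's")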

1. Built-in Analyzers

Analyzer              Behavior
Standard Analyzer     Default analyzer; splits on word boundaries, lowercases
Simple Analyzer       Splits on non-letter characters (symbols are dropped), lowercases
Stop Analyzer         Lowercases and removes stop words (the, a, this, ...)
Whitespace Analyzer   Splits on whitespace; does not lowercase
Keyword Analyzer      No tokenization; the whole input is emitted as a single token
Pattern Analyzer      Splits on a regular expression, \W+ (non-word characters) by default

Different fields hold different kinds of values: some are full text, others are exact values.

A date field, for example, holds an exact value and is not analyzed.

Default analyzer behavior (Standard Analyzer)

  1. standard tokenizer: splits on word boundaries
  2. standard token filter: does nothing
  3. lowercase token filter: lowercases all letters
  4. stop token filter (disabled by default): removes stop words such as a, the, it
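
These pieces can be combined by hand through the _analyze API; a minimal sketch that reproduces the standard analyzer with the stop filter switched on:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "The quick brown fox"
}
# expected tokens: quick, brown, fox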

Built-in analyzer examples

Standard Analyzer

GET _analyze
{
  "analyzer": "standard",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}

Simple Analyzer

GET _analyze
{
  "analyzer": "simple",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 11
    }
  ]
}

Stop Analyzer

GET _analyze
{
  "analyzer": "stop",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 11
    }
  ]
}

Whitespace Analyzer

GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown-foxes",
      "start_offset" : 16,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 11
    }
  ]
}

Keyword Analyzer

GET _analyze
{
  "analyzer": "keyword",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2 Running quick brown-foxes leap over lazy dog in the summer evening",
      "start_offset" : 0,
      "end_offset" : 68,
      "type" : "word",
      "position" : 0
    }
  ]
}

Pattern Analyzer

GET _analyze
{
  "analyzer": "pattern",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 12
    }
  ]
}
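
The pattern analyzer's regular expression is configurable. A small sketch (the index name pattern_test and analyzer name my_csv_analyzer are invented here) that splits on commas instead:

PUT /pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_csv_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

GET /pattern_test/_analyze
{
  "analyzer": "my_csv_analyzer",
  "text": "red,green,blue"
}
# expected tokens: red, green, blue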

2. Modifying Analyzer Settings

# Enable English stop words (token filter)
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {  #这个相当于定义个一个自定义分词器,名称为"es_std"
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}
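
For comparison: the first request (plain standard) should return every term: a, dog, is, in, the, house. The second, using es_std, should drop the English stop words and return only dog and house.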

3. Building Your Own Custom Analyzer
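
Note that my_index already exists from the previous section, so the PUT below would fail with a resource_already_exists_exception; drop the old index first:

DELETE /my_index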

# Define the analysis chain used by the index
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
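
Tracing the pipeline by hand: html_strip removes <a>, the &_to_and mapping rewrites tom&jerry as tom and jerry, the standard tokenizer splits the text, lowercase folds HAHA, and my_stopwords drops the and a. The expected tokens are: tom, and, jerry, are, friend, in, house, haha.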


PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
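
To see the analyzer working end to end, a sketch (the document id and query text are invented here) that indexes one document and then matches it through the & -> and mapping:

PUT /my_index/_doc/1
{
  "content": "tom&jerry are good friends"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "tom and jerry"
    }
  }
}
# the match query is analyzed with my_analyzer as well (a field's
# search analyzer defaults to its index analyzer), so both sides
# produce the tokens tom / and / jerry and document 1 should match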