
06 Custom Analyzers

Analysis (Tokenization)

Analysis (a concept, not an API) is the process of converting full text into a series of tokens; it is also called tokenization. Analysis is carried out by an analyzer: you can use one of Elasticsearch's built-in analyzers or define your own. Besides being applied when data is written, an analyzer can also be applied to the query string at search time.

Analysis consists of three parts. Take Hello a World, the world is beautiful as an example:

  1. Character Filter: preprocesses the raw text, e.g. stripping HTML tags or mapping & to 'and'.
  2. Tokenizer: splits the text into tokens according to rules; English is typically split on whitespace.
  3. Token Filter: performs normalization: removing stop words (a, an, the, is, are, ...), lowercasing, synonym expansion, singular/plural conversion, and so on (see the sketch below).
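
The three stages can be exercised ad hoc through the _analyze API, which accepts a char_filter / tokenizer / filter combination directly. A minimal sketch using the built-in html_strip, standard, lowercase, and stop components (the <b> tags are added here only to give the character filter something to strip):

GET _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "<b>Hello a World</b>, the world is beautiful"
}
# expected tokens: hello, world, world, beautiful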

What is normalization?

Normalization refines the analyzer's output.

The tokens an analyzer produces usually include terms that are unhelpful, ambiguous, or merely inflected variants. For example, I think dogs is human's best friend. is tokenized as: i, think, dogs, human's, best, friend. Here i rarely appears as a search term; dogs is the plural of dog, and dog is the more common search condition; human's is a possessive form that seldom shows up in queries. There is little value in Elasticsearch maintaining all of these terms, and this is where normalization comes in.

For example: when searching for china, should the condition cn match? When searching for dogs, should the condition dog match? If abbreviations (cn) or singular/plural variants (dog & dogs) can still find the intended results, the search engine feels that much more user-friendly.

Normalization exists to improve recall, i.e. to strengthen the engine's ability to find relevant results.

Normalization works in concert with the analyzer to accomplish this.
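
The singular/plural case can be checked with the built-in english analyzer, whose stemmer and possessive filter handle exactly the example sentence above. A quick illustration (the expected tokens are what the english analyzer should produce, not output taken from the original text):

GET _analyze
{
  "analyzer": "english",
  "text": "I think dogs is human's best friend"
}
# expected tokens: i, think, dog, human, best, friend
# ("is" is an English stop word; "dogs" is stemmed to "dog";
#  the possessive "'s" is stripped from "human's")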

1. Built-in Analyzers

Analyzer              Behavior
Standard Analyzer     Default analyzer; splits on word boundaries, lowercases
Simple Analyzer       Splits on non-letter characters (symbols are dropped), lowercases
Stop Analyzer         Lowercases and removes stop words (the, a, this, ...)
Whitespace Analyzer   Splits on whitespace; does not lowercase
Keyword Analyzer      No tokenization; the whole input is emitted as a single token
Pattern Analyzer      Splits on a regular expression, \W+ (non-word characters) by default

Different fields hold different kinds of values: some are full text, others are exact values.

A date field, for example, holds an exact value and is not analyzed.

Default analyzer behavior (Standard Analyzer)

  1. standard tokenizer: splits on word boundaries
  2. standard token filter: does nothing
  3. lowercase token filter: lowercases all letters
  4. stop token filter (disabled by default): removes stop words such as a, the, it
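
These pieces can be combined by hand through the _analyze API; a minimal sketch that reproduces the standard analyzer with the stop filter switched on:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "The quick brown fox"
}
# expected tokens: quick, brown, fox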

Built-in analyzer examples

Standard Analyzer

GET _analyze
{
  "analyzer": "standard",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}

Simple Analyzer

GET _analyze
{
  "analyzer": "simple",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 11
    }
  ]
}

Stop Analyzer

GET _analyze
{
  "analyzer": "stop",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 11
    }
  ]
}

Whitespace Analyzer

GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown-foxes",
      "start_offset" : 16,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 11
    }
  ]
}

Keyword Analyzer

GET _analyze
{
  "analyzer": "keyword",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2 Running quick brown-foxes leap over lazy dog in the summer evening",
      "start_offset" : 0,
      "end_offset" : 68,
      "type" : "word",
      "position" : 0
    }
  ]
}

Pattern Analyzer

GET _analyze
{
  "analyzer": "pattern",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

# Output:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "in",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "the",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "summer",
      "start_offset" : 54,
      "end_offset" : 60,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "evening",
      "start_offset" : 61,
      "end_offset" : 68,
      "type" : "word",
      "position" : 12
    }
  ]
}
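
The pattern analyzer's regular expression is configurable. A small sketch (the index name pattern_test and analyzer name my_csv_analyzer are invented here) that splits on commas instead:

PUT /pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_csv_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

GET /pattern_test/_analyze
{
  "analyzer": "my_csv_analyzer",
  "text": "red,green,blue"
}
# expected tokens: red, green, blue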

2. Modifying Analyzer Settings

# Enable English stop words (token filter)
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {  #这个相当于定义个一个自定义分词器,名称为"es_std"
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}
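
For comparison: the first request (plain standard) should return every term: a, dog, is, in, the, house. The second, using es_std, should drop the English stop words and return only dog and house.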

3. Building Your Own Custom Analyzer
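
Note that my_index already exists from the previous section, so the PUT below would fail with a resource_already_exists_exception; drop the old index first:

DELETE /my_index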

# Define the analysis chain used by the index
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
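
Tracing the pipeline by hand: html_strip removes <a>, the &_to_and mapping rewrites tom&jerry as tom and jerry, the standard tokenizer splits the text, lowercase folds HAHA, and my_stopwords drops the and a. The expected tokens are: tom, and, jerry, are, friend, in, house, haha.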


PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
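
To see the analyzer working end to end, a sketch (the document id and query text are invented here) that indexes one document and then matches it through the & -> and mapping:

PUT /my_index/_doc/1
{
  "content": "tom&jerry are good friends"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "tom and jerry"
    }
  }
}
# the match query is analyzed with my_analyzer as well (a field's
# search analyzer defaults to its index analyzer), so both sides
# produce the tokens tom / and / jerry and document 1 should match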