06 Custom Analyzers
Analysis (a concept rather than a concrete component) is the process of converting full text into a series of terms, i.e. tokenization. Analysis is performed by an analyzer; you can use one of Elasticsearch's built-in analyzers or define your own. Besides analyzing text when documents are written, an analyzer can also be applied to the query string at search time.
An analyzer is made up of three parts. Take the text Hello a World, the world is beautiful as an example (see the sketch after this list):
- Character Filter: pre-processes the raw text, e.g. stripping HTML tags or converting & to 'and'.
- Tokenizer: splits the text into terms according to rules, e.g. splitting on whitespace for English.
- Token Filter: performs normalization on the terms: removing stop words (a, an, the, is, are, etc.), lowercasing, synonym conversion, singular/plural conversion.
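The _analyze API lets you assemble these three parts on the fly and inspect what each stage produces. A minimal sketch, where the particular combination of html_strip, the standard tokenizer, and the lowercase/stop filters is chosen purely for illustration:
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>Hello a World, the world is beautiful</p>"
}
# Roughly expected tokens: hello, world, world, beautiful
# (the <p> tag is stripped by the character filter, the tokenizer splits on
#  word boundaries, and the token filters lowercase the terms and drop a/the/is)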
What is normalization?
Normalization refines the output of tokenization.
The raw token stream often contains terms that are unnecessary, ambiguous, or merely inflected forms. For example, I think dogs is human's best friend. tokenizes into: i, think, dogs, human's, best, friend. Here, i is rarely useful as a search term; dogs is just the plural of dog, and dog is the more common search term; human's is a possessive form that normally does not appear as a search condition. There is little point in Elasticsearch maintaining all of these variants, and that is where normalization comes in.
For example: if a document contains china, can it be found by searching for cn? If it contains dogs, can it be found by searching for dog? When abbreviations (cn) or singular/plural variants (dog/dogs) can both reach the desired results, the search engine feels far more user-friendly.
normalization improves recall, i.e. it strengthens the ability of a search to find relevant documents.
normalization works together with the analyzer to accomplish this.
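The built-in english analyzer is a quick way to see normalization in action: it drops stop words, strips possessives, and stems plurals, so a search for dog can hit documents containing dogs. A small sketch using the sentence above (the token list is approximate):
GET _analyze
{
  "analyzer": "english",
  "text": "I think dogs is human's best friend."
}
# Roughly expected tokens: think, dog, human, best, friend
# ("i" and "is" are removed as stop words, "dogs" is stemmed to "dog",
#  and the possessive "'s" is stripped)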
| Analyzer | Behavior |
|---|---|
| Standard Analyzer | The default analyzer: splits on word boundaries, lowercases |
| Simple Analyzer | Splits on non-letter characters (symbols are discarded), lowercases |
| Stop Analyzer | Lowercases and removes stop words (the, a, this, ...) |
| Whitespace Analyzer | Splits on whitespace, does not lowercase |
| Keyword Analyzer | Does not tokenize; the whole input becomes a single token |
| Pattern Analyzer | Splits on a regular expression, \W+ (non-word characters) by default |
Different fields are treated differently: some are full text, others are exact values.
For example, a date field is an exact value and is not analyzed.
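A minimal mapping sketch of the contrast (test_index and its field names are hypothetical, purely for illustration):
PUT /test_index
{
  "mappings": {
    "properties": {
      "title":      { "type": "text" },
      "tag":        { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}
# "title" is full text and is run through an analyzer at index and search time;
# "tag" and "created_at" are exact values and are matched as a whole, not analyzed.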
The standard analyzer is composed of:
- standard tokenizer: splits on word boundaries
- standard token filter: does nothing
- lowercase token filter: converts all letters to lowercase
- stop token filter (disabled by default): removes stop words such as a, the, it
GET _analyze
{
  "analyzer": "standard",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
# Output:
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "dog",
"start_offset" : 43,
"end_offset" : 46,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "in",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "the",
"start_offset" : 50,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "summer",
"start_offset" : 54,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "evening",
"start_offset" : 61,
"end_offset" : 68,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}
GET _analyze
{
  "analyzer": "simple",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
# Output:
{
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dog",
"start_offset" : 43,
"end_offset" : 46,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 47,
"end_offset" : 49,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 50,
"end_offset" : 53,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 54,
"end_offset" : 60,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 61,
"end_offset" : 68,
"type" : "word",
"position" : 11
}
]
}
GET _analyze
{
  "analyzer": "stop",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
# Output:
{
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dog",
"start_offset" : 43,
"end_offset" : 46,
"type" : "word",
"position" : 7
},
{
"token" : "summer",
"start_offset" : 54,
"end_offset" : 60,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 61,
"end_offset" : 68,
"type" : "word",
"position" : 11
}
]
}
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
# Output:
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "Running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "brown-foxes",
"start_offset" : 16,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dog",
"start_offset" : 43,
"end_offset" : 46,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 47,
"end_offset" : 49,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 50,
"end_offset" : 53,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 54,
"end_offset" : 60,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 61,
"end_offset" : 68,
"type" : "word",
"position" : 11
}
]
}
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
# Output:
{
"tokens" : [
{
"token" : "2 Running quick brown-foxes leap over lazy dog in the summer evening",
"start_offset" : 0,
"end_offset" : 68,
"type" : "word",
"position" : 0
}
]
}
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
# Output:
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 4
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 5
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 6
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 7
},
{
"token" : "dog",
"start_offset" : 43,
"end_offset" : 46,
"type" : "word",
"position" : 8
},
{
"token" : "in",
"start_offset" : 47,
"end_offset" : 49,
"type" : "word",
"position" : 9
},
{
"token" : "the",
"start_offset" : 50,
"end_offset" : 53,
"type" : "word",
"position" : 10
},
{
"token" : "summer",
"start_offset" : 54,
"end_offset" : 60,
"type" : "word",
"position" : 11
},
{
"token" : "evening",
"start_offset" : 61,
"end_offset" : 68,
"type" : "word",
"position" : 12
}
]
}
# Enable English stop words (token filter).
# The "es_std" entry below defines a custom analyzer named es_std, built on the standard analyzer.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
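The two requests above should differ only in the stop words: the built-in standard analyzer keeps them, while es_std drops them. Roughly:
# standard: a, dog, is, in, the, house
# es_std:   dog, house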
# Define the analysis chain used by the index.
# (If my_index already exists from the example above, run DELETE /my_index first,
#  otherwise this PUT will fail because the index already exists.)
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
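With the mapping in place, the content field is analyzed with my_analyzer both when documents are indexed and when a match query on that field is parsed. A small usage sketch (the document ID and the query term are chosen arbitrarily):
PUT /my_index/_doc/1
{
  "content": "tom&jerry are a friend in the house, <a>, HAHA!!"
}
GET /my_index/_search
{
  "query": {
    "match": {
      "content": "HOUSE"
    }
  }
}
# The query string goes through my_analyzer as well, so "HOUSE" is lowercased
# to "house" and should match the token produced at index time.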