
How to build an Elasticsearch query that takes into account the distance between words?

  •  0
  • user3668129  · Tech Community  · 2 years ago

    I'm running elasticsearch:7.6.2

    I have an index with 4 simple documents:

        PUT demo_idx/_doc/1
        {
          "content": "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
        }
    
        PUT demo_idx/_doc/2
        {
          "content": "Distributed tmp nature, simple REST APIs, speed, and scalability"
        }
    
        PUT demo_idx/_doc/3
        {
          "content": "Distributed nature, simple REST APIs, speed, and scalability"
        }
    
        PUT demo_idx/_doc/4
        {
          "content": "Distributed tmp tmp nature"
        }
    

    I want to search for the text: distributed nature and get the following results, in this order:

    Doc id: 3 
    Doc id: 1
    Doc id: 2
    Doc id: 4
    

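    i.e. documents with an exact match (doc 3 and doc 1) should be shown before documents that match with a small slop (doc 2), and documents that match with a large slop (doc 4) should be shown last.

    I read this post: How to build an Elasticsearch query that will take into account the distance between words and the exactitude of the word, but it didn't help me.

    I tried the following search query: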

    "query": {
                "bool": {
                    "must":
                        [{
                            "match_phrase": {
                                "content": {
                                    "query": query,
                                    "slop": 2
                                }
                            }
                        }]
                }
            }
    

    But it doesn't give me the required results.

    I got the following results:

    Doc id: 3  ,Score: 0.22949813
    Doc id: 4  ,Score: 0.15556586
    Doc id: 1  ,Score: 0.15401536 
    Doc id: 2  ,Score: 0.14397088
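
    The scores above can be reproduced by hand, which shows why doc 4 outranks doc 1. The following is a minimal sketch, not Elasticsearch's actual code: it assumes the default BM25 parameters (k1 = 1.2, b = 0.75), a single shard, token counts as produced by the standard analyzer, and a sloppy phrase frequency of 1 / (1 + distance). The names `DOCS`, `phrase_score` and `rank` are hypothetical.

```python
import math

K1, B = 1.2, 0.75  # BM25 defaults (assumed unchanged in the index settings)

# Assumed token count of each document and the number of position moves
# needed to line up the phrase "distributed nature" in it.
DOCS = {
    "1": {"length": 19, "distance": 0},
    "2": {"length": 9, "distance": 1},
    "3": {"length": 8, "distance": 0},
    "4": {"length": 4, "distance": 2},
}

def idf(doc_count, doc_freq):
    """BM25 inverse document frequency of a single term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def phrase_score(length, distance, avg_length, phrase_idf):
    """Score one sloppy phrase match; a wider gap lowers the phrase frequency."""
    freq = 1.0 / (1.0 + distance)
    length_norm = K1 * (1 - B + B * length / avg_length)
    return (K1 + 1) * phrase_idf * freq / (freq + length_norm)

def rank(docs):
    """Return (doc_id, score) pairs sorted from best to worst."""
    n = len(docs)
    avg_length = sum(d["length"] for d in docs.values()) / n
    # Both query terms occur in every document, so each has doc_freq == n.
    phrase_idf = 2 * idf(n, n)
    scores = {
        doc_id: phrase_score(d["length"], d["distance"], avg_length, phrase_idf)
        for doc_id, d in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for doc_id, score in rank(DOCS):
        print(f"Doc id: {doc_id}, score: {score:.8f}")
```

    Under these assumptions, doc 4 is penalized for its slop but is also the shortest document, and the length normalization outweighs the slop penalty. Doc 1 matches exactly but is the longest document, so it drops to third place.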
    

    How can I write the query to get the desired results?

        1
  •  1
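  •   Bhavya    2 years ago

    You can rank documents that exactly match "Distributed nature" higher by using a bool should clause. The first clause boosts the score of documents that match "Distributed nature" exactly, without any slop; the second clause also matches documents where the two words are up to 2 positions apart.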

    POST demo_idx/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match_phrase": {
                "content": {
                  "query": "Distributed nature"
                }
              }
            },
            {
              "match_phrase": {
                "content": {
                  "query": "Distributed nature",
                  "slop": 2
                }
              }
            }
          ]
        }
      }
    }
    

    The search response will be:

    "hits" : [
          {
            "_index" : "demo_idx",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 0.45899627,
            "_source" : {
              "content" : "Distributed nature, simple REST APIs, speed, and scalability"
            }
          },
          {
            "_index" : "demo_idx",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.30803072,
            "_source" : {
              "content" : "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
            }
          },
          {
            "_index" : "demo_idx",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 0.15556586,
            "_source" : {
              "content" : "Distributed tmp tmp nature"
            }
          },
          {
            "_index" : "demo_idx",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 0.14397088,
            "_source" : {
              "content" : "Distributed tmp nature, simple REST APIs, speed, and scalability"
            }
          }
        ]
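
    The scores in this response are consistent with a bool should query adding up the scores of its matching clauses. The following is a minimal sketch under stated assumptions (default BM25 with k1 = 1.2 and b = 0.75, one shard, standard-analyzer token counts, sloppy phrase frequency of 1 / (1 + distance)); `DOCS`, `clause_score` and `should_rank` are hypothetical names, not part of any Elasticsearch API.

```python
import math

K1, B = 1.2, 0.75  # BM25 defaults (assumed)

# doc_id -> (assumed token count, position moves needed to match the phrase)
DOCS = {"1": (19, 0), "2": (9, 1), "3": (8, 0), "4": (4, 2)}

def clause_score(length, distance, slop, avg_length, phrase_idf):
    """Score one match_phrase clause, or 0.0 if the gap exceeds its slop."""
    if distance > slop:
        return 0.0
    freq = 1.0 / (1.0 + distance)
    length_norm = K1 * (1 - B + B * length / avg_length)
    return (K1 + 1) * phrase_idf * freq / (freq + length_norm)

def should_rank(docs, slops=(0, 2)):
    """Sum the clause scores per document and sort from best to worst."""
    n = len(docs)
    avg_length = sum(length for length, _ in docs.values()) / n
    # Every document contains both terms, so each term has doc_freq == n.
    phrase_idf = 2 * math.log(1 + 0.5 / (n + 0.5))
    totals = {
        doc_id: sum(
            clause_score(length, distance, slop, avg_length, phrase_idf)
            for slop in slops
        )
        for doc_id, (length, distance) in docs.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for doc_id, score in should_rank(DOCS):
        print(f"Doc id: {doc_id}, score: {score:.8f}")
```

    Exact matches (docs 1 and 3) are counted by both clauses, so their scores double and they move to the top. Docs 2 and 4 only match the slop clause, so their scores and their relative order are the same as with the single match_phrase query.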
    

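    Update 1:

    To avoid the effect of the "field length" factor on the scoring of the search query, you need to disable the "norms" parameter for the "content" field using the Update Mapping API.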

    PUT demo_idx/_mapping
    {
      "properties": {
        "content": {
          "type": "text",
          "norms": false
        }
      }
    }
    

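    After this, index the documents again, because norms are not removed instantly.

    Now run the search query again; the search response will be in the order you expect.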
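
    To see why removing the length factor changes the order, the following sketch scores the same bool should query while treating every document as if it had the same length. This is a simplification made for illustration: the exact constant Lucene uses when norms are disabled is not modeled, so only the relative order is meaningful, not the score values. `DISTANCES` and `score_without_norms` are hypothetical names.

```python
K1 = 1.2  # BM25 default (assumed)

# doc_id -> position moves needed to match "distributed nature" (assumed)
DISTANCES = {"1": 0, "2": 1, "3": 0, "4": 2}

def score_without_norms(distance, slops=(0, 2)):
    """Sum the clause scores with a length factor that is equal for all docs.

    The IDF and boost are the same for every document here, so they are
    left out; the result is only usable for comparing documents.
    """
    total = 0.0
    for slop in slops:
        if distance <= slop:
            freq = 1.0 / (1.0 + distance)  # sloppy phrase frequency
            total += freq / (freq + K1)    # length term fixed to a constant
    return total

def rank_without_norms(distances):
    """Return (doc_id, relative score) pairs sorted from best to worst."""
    scores = {doc_id: score_without_norms(d) for doc_id, d in distances.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for doc_id, score in rank_without_norms(DISTANCES):
        print(f"Doc id: {doc_id}, relative score: {score:.4f}")
```

    In this sketch docs 1 and 3 get identical scores, followed by doc 2 and then doc 4. How Elasticsearch orders the two tied exact matches is not modeled here.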