
Fields that are not in the mapping are included in the search results returned by ElasticSearch

  • Tuan Nguyen  · 12 years ago

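    I want to use the Tire gem as an ElasticSearch client to index PDF attachments. In the mapping, I exclude the attachment field from _source so that the attachment is not stored in the index and is not returned in search results: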

    mapping :_source => { :excludes => ['attachment_original'] } do
      indexes :id, :type => 'integer'
      indexes :folder_id, :type => 'integer'
      indexes :attachment_file_name
      indexes :attachment_updated_at, :type => 'date'
      indexes :attachment_original, :type => 'attachment'
    end 
    

    When I run the following curl command, I still see the attachment content in the search results:

    curl -X POST "http://localhost:9200/user_files/user_file/_search?pretty=true" -d '{
      "query": {
        "query_string": {
          "query": "rspec"
        }
      }
    }'
    

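    I have already posted this issue in a separate thread, but I have just noticed that the search results include not only the attachment but every other field as well, including fields that are not mapped, as you can see: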

    {
      "took": 20,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.025427073,
        "hits": [
          {
            "_index": "user_files",
            "_type": "user_file",
            "_id": "5",
            "_score": 0.025427073,
            "_source": {
              "user_file": {
                "id": 5,
                "folder_id": 1,
                "updated_at": "2012-08-16T11:32:41Z",
                "attachment_file_size": 179895,
                "attachment_updated_at": "2012-08-16T11:32:41Z",
                "attachment_file_name": "hw4.pdf",
                "attachment_content_type": "application/pdf",
                "created_at": "2012-08-16T11:32:41Z",
                "attachment_original": "JVBERi0xLjQKJeLjz9MKNyA"
              }
            }
          }
        ]
      }
    }
    

    attachment_file_size and attachment_content_type are not defined in the mapping, yet they are returned in the search results:

    {
      "id": 5,
      "folder_id": 1,
      "updated_at": "2012-08-16T11:32:41Z",
      "attachment_file_size": 179895, <---------------------
      "attachment_updated_at": "2012-08-16T11:32:41Z",
      "attachment_file_name": "hw4.pdf",
      "attachment_content_type": "application/pdf", <---------------
      "created_at": "2012-08-16T11:32:41Z",
      "attachment_original": "JVBERi0xLjQKJeLjz9MKNyA"
    }
    

    Here is my complete implementation:

      include Tire::Model::Search
      include Tire::Model::Callbacks
    
      def self.search(folder, params)
        tire.search() do
          query { string params[:query], default_operator: "AND"} if params[:query].present?
          #filter :term, folder_id: folder.id
          #highlight :attachment_original, :options => {:tag => "<em>"}
          raise to_curl # debugging aid: abort and show the equivalent curl command
        end
      end
    
      mapping :_source => { :excludes => ['attachment_original'] } do
        indexes :id, :type => 'integer'
        indexes :folder_id, :type => 'integer'
        indexes :attachment_file_name
        indexes :attachment_updated_at, :type => 'date'
        indexes :attachment_original, :type => 'attachment'
      end
    
      def to_indexed_json
        to_json(:methods => [:attachment_original])
      end
    
      def attachment_original
        if attachment_file_name.present?
          path_to_original = attachment.path
          # Read the file in binary mode so the PDF bytes are encoded unchanged.
          Base64.encode64(File.open(path_to_original, 'rb') { |f| f.read })
        end
      end
    

    Can anyone help me figure out why all of the fields are included in _source?

    Edit: Here is the output of localhost:9200/user_files/_mapping

    {
      "user_files": {
        "user_file": {
          "_source": {
            "excludes": [
              "attachment_original"
            ]
          },
          "properties": {
            "attachment_content_type": {
              "type": "string"
            },
            "attachment_file_name": {
              "type": "string"
            },
            "attachment_file_size": {
              "type": "long"
            },
            "attachment_original": {
              "type": "attachment",
              "path": "full",
              "fields": {
                "attachment_original": {
                  "type": "string"
                },
                "author": {
                  "type": "string"
                },
                "title": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                },
                "date": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "keywords": {
                  "type": "string"
                },
                "content_type": {
                  "type": "string"
                }
              }
            },
            "attachment_updated_at": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "created_at": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "folder_id": {
              "type": "integer"
            },
            "id": {
              "type": "integer"
            },
            "updated_at": {
              "type": "date",
              "format": "dateOptionalTime"
            }
          }
        }
      }
    }
    

    As you can see, for some reason all of the fields are included in the mapping!

    1 answer  |  last activity 7 years ago
  •   Community Dai    7 years ago

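    In your to_indexed_json, you include the attachment_original method, so it is sent to Elasticsearch. That is also why all the other attributes end up in the mapping, and therefore in the source: to_json is called without an :only option, so it serializes every attribute of the model, and Elasticsearch by default adds a mapping for any field it has not seen before.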
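    As a minimal sketch of the idea above, the serialized document can be limited to the fields declared in the mapping. UserFileDocument and MAPPED_FIELDS are hypothetical names used only for illustration; the class is a plain-Ruby stand-in for the model, not Tire's API, and it covers only the unmapped-fields part of the problem.

```ruby
require 'json'

# Fields declared in the mapping block from the question.
MAPPED_FIELDS = %w[
  id folder_id attachment_file_name attachment_updated_at attachment_original
].freeze

# Hypothetical stand-in for the model; the real one would be an
# ActiveRecord class that includes Tire::Model::Search.
class UserFileDocument
  def initialize(attributes)
    @attributes = attributes
  end

  # Serialize only the mapped fields, so unmapped attributes such as
  # attachment_file_size are never sent to Elasticsearch.
  def to_indexed_json
    @attributes.select { |name, _value| MAPPED_FIELDS.include?(name) }.to_json
  end
end

doc = UserFileDocument.new(
  'id' => 5,
  'folder_id' => 1,
  'attachment_file_name' => 'hw4.pdf',
  'attachment_file_size' => 179_895,
  'attachment_content_type' => 'application/pdf'
)
puts doc.to_indexed_json
```

    In the actual model, something like to_json(:only => [:id, :folder_id, :attachment_file_name, :attachment_updated_at], :methods => [:attachment_original]) should have a similar effect, assuming the standard Rails serialization options; that has not been verified against this setup.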

    See the question "ElasticSearch & Tire: Using Mapping and to_indexed_json" for more information on the topic.

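    It looks like Tire does send the correct mapping JSON to Elasticsearch. My advice is to use Tire.configure { logger STDERR, level: "debug" } to inspect what is happening, and to try to pinpoint the problem at the raw level.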
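    One way to look at the raw level is to compare the fields declared in the mapping block with the ones Elasticsearch reports from _mapping. The sketch below assumes the response has the index/type/properties layout shown in the question; dynamically_mapped_fields and DECLARED_FIELDS are hypothetical helper names, and the JSON is a shortened copy of the question's output.

```ruby
require 'json'

# Fields declared explicitly in the model's mapping block.
DECLARED_FIELDS = %w[
  id folder_id attachment_file_name attachment_updated_at attachment_original
].freeze

# Return the property names present in the _mapping response that were
# not declared in the model, i.e. fields Elasticsearch added on its own.
def dynamically_mapped_fields(mapping_json, index, type)
  properties = JSON.parse(mapping_json).dig(index, type, 'properties') || {}
  properties.keys - DECLARED_FIELDS
end

# Shortened copy of the _mapping output from the question.
mapping_json = <<~MAPPING
  {
    "user_files": {
      "user_file": {
        "properties": {
          "attachment_content_type": { "type": "string" },
          "attachment_file_name": { "type": "string" },
          "attachment_file_size": { "type": "long" },
          "folder_id": { "type": "integer" },
          "id": { "type": "integer" }
        }
      }
    }
  }
MAPPING

puts dynamically_mapped_fields(mapping_json, 'user_files', 'user_file').inspect
```

    Any name this returns was not declared in the model, which suggests it arrived through the indexed JSON rather than through the mapping Tire sent.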