
ElasticSearch group by multiple fields

  •  57
  • Pavel  · Tech community  · 11 years ago

    The only close thing I have found is: Multiple group-by in Elasticsearch

    Basically, I am trying to get the ES equivalent of the following MySQL query:

    select gender, age_range, count(distinct profile_id) as count 
    FROM TABLE group by age_range, gender
    

    Age and gender on their own are easy to get:

    {
      "query": {
        "match_all": {}
      },
      "facets": {
        "ages": {
          "terms": {
            "field": "age_range",
            "size": 20
          }
        },
        "gender_by_age": {
          "terms": {
            "fields": [
              "age_range",
              "gender"
            ]
          }
        }
      },
      "size": 0
    }
    

    Which gives:

    {
      "ages": {
        "_type": "terms",
        "missing": 0,
        "total": 193961,
        "other": 0,
        "terms": [
          {
            "term": 0,
            "count": 162643
          },
          {
            "term": 3,
            "count": 10683
          },
          {
            "term": 4,
            "count": 8931
          },
          {
            "term": 5,
            "count": 4690
          },
          {
            "term": 6,
            "count": 3647
          },
          {
            "term": 2,
            "count": 3247
          },
          {
            "term": 1,
            "count": 120
          }
        ]
      },
      "total_gender": {
        "_type": "terms",
        "missing": 0,
        "total": 193961,
        "other": 0,
        "terms": [
          {
            "term": 1,
            "count": 94799
          },
          {
            "term": 2,
            "count": 62645
          },
          {
            "term": 0,
            "count": 36517
          }
        ]
      }
    }
    

    But now I need something like this:

    [breakdown_gender] => Array
        (
            [1] => Array
                (
                    [0] => 264
                    [1] => 1
                    [2] => 6
                    [3] => 67
                    [4] => 72
                    [5] => 40
                    [6] => 23
                )
    
            [2] => Array
                (
                    [0] => 153
                    [2] => 2
                    [3] => 21
                    [4] => 35
                    [5] => 22
                    [6] => 11
                )
    
        )
    

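    Note that 0,1,2,3,4,5,6 are "mappings" for the age ranges, so they actually mean something :) and are not just numbers. For example, gender[1] (which is "male") breaks down into age range [0] (which is "under 18") with a count of 264.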

    4 answers  |  last updated 2 years ago
        1
  •  107
  •   Joe    9 years ago

    Starting with version 1.0 of ElasticSearch, the new aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2 and field3:

    {
      "aggs": {
        "agg1": {
          "terms": {
            "field": "field1"
          },
          "aggs": {
            "agg2": {
              "terms": {
                "field": "field2"
              },
              "aggs": {
                "agg3": {
                  "terms": {
                    "field": "field3"
                  }
                }
              }          
            }
          }
        }
      }
    }
    

    Of course, this can go on for as many fields as you like.
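
    For the query in the question, the same nesting can be generated rather than written by hand. The sketch below is only an illustration: `build_group_by` and the aggregation name `distinct_count` are made-up names, and it assumes your Elasticsearch version offers the `cardinality` aggregation as a stand-in for `count(distinct profile_id)`. Keep in mind that cardinality counts are approximate.

```python
import json


def build_group_by(fields, distinct_field=None, size=20):
    """Return a search body that nests one terms aggregation per field.

    `build_group_by` and its parameters are illustrative names, not part
    of any Elasticsearch client library.
    """
    body = {"size": 0, "aggs": {}}
    level = body  # dict that receives the next "aggs" entry
    for field in fields:
        terms = {"terms": {"field": field, "size": size}}
        level.setdefault("aggs", {})[field] = terms
        level = terms
    if distinct_field is not None:
        # Approximate distinct count inside the innermost bucket.
        level.setdefault("aggs", {})["distinct_count"] = {
            "cardinality": {"field": distinct_field}
        }
    return body


body = build_group_by(["gender", "age_range"], distinct_field="profile_id")
print(json.dumps(body, indent=2))
```

    Each age_range bucket nested under a gender bucket then carries a `distinct_count` value next to its `doc_count`.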

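    Update:
    For completeness, here is the output of the query above. Further below is Python code that generates the aggregation query and flattens the result into a list of dictionaries.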

    {
      "aggregations": {
        "agg1": {
          "buckets": [{
            "doc_count": <count>,
            "key": <value of field1>,
            "agg2": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [{
                    "doc_count": <count>,
                    "key": <value of field3>
                  },
                  {
                    "doc_count": <count>,
                    "key": <value of field3>
                  }, ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [{
                    "doc_count": <count>,
                    "key": <value of field3>
                  },
                  {
                    "doc_count": <count>,
                    "key": <value of field3>
                  }, ...
                  ]
                }
              }, ...
              ]
            }
          },
          {
            "doc_count": <count>,
            "key": <value of field1>,
            "agg2": {
              "buckets": [{
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [{
                    "doc_count": <count>,
                    "key": <value of field3>
                  },
                  {
                    "doc_count": <count>,
                    "key": <value of field3>
                  }, ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [{
                    "doc_count": <count>,
                    "key": <value of field3>
                  },
                  {
                    "doc_count": <count>,
                    "key": <value of field3>
                  }, ...
                  ]
                }
              }, ...
              ]
            }
          }, ...
          ]
        }
      }
    }
    

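    The following Python code performs the group-by given a list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch, thanks to this).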

    def group_by(es, fields, include_missing):
        """Group documents by `fields`; return one dict per combination found.

        `es` is an Elasticsearch client. Each aggregation is named after its field.
        """
        current_level_terms = {'terms': {'field': fields[0]}}
        agg_spec = {fields[0]: current_level_terms}
    
        if include_missing:
            # A sibling "missing" aggregation catches documents without the field.
            current_level_missing = {'missing': {'field': fields[0]}}
            agg_spec[fields[0] + '_missing'] = current_level_missing
    
        # Nest each remaining field under the previous level.
        for field in fields[1:]:
            next_level_terms = {'terms': {'field': field}}
            current_level_terms['aggs'] = {
                field: next_level_terms,
            }
    
            if include_missing:
                next_level_missing = {'missing': {'field': field}}
                current_level_terms['aggs'][field + '_missing'] = next_level_missing
                # The "missing" branch gets the same sub-aggregations as the terms branch.
                current_level_missing['aggs'] = {
                    field: next_level_terms,
                    field + '_missing': next_level_missing,
                }
                current_level_missing = next_level_missing
    
            current_level_terms = next_level_terms
    
        agg_result = es.search(body={'aggs': agg_spec})['aggregations']
        return get_docs_from_agg_result(agg_result, fields, include_missing)
    
    
    def get_docs_from_agg_result(agg_result, fields, include_missing):
        """Recursively flatten nested buckets into a list of dicts."""
        current_field = fields[0]
        buckets = agg_result[current_field]['buckets']
        if include_missing:
            # The "missing" aggregation is a single bucket without a 'key'.
            buckets.append(agg_result[(current_field + '_missing')])
    
        if len(fields) == 1:
            # Innermost level: emit one record per non-empty bucket.
            return [
                {
                    current_field: bucket.get('key'),
                    'doc_count': bucket['doc_count'],
                }
                for bucket in buckets if bucket['doc_count'] > 0
            ]
    
        result = []
        for bucket in buckets:
            records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
            # bucket.get('key') is None for the "missing" bucket.
            value = bucket.get('key')
            for record in records:
                record[current_field] = value
            result.extend(records)
    
        return result
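
    If you would rather have the nested array shape from the question than a flat list, the same response can be pivoted directly. This is a rough sketch that runs on a hand-written sample instead of a live cluster: `pivot` is a hypothetical helper, the sample uses only a subset of the counts from the question, and it assumes the aggregations are named after their fields, as group_by does.

```python
def pivot(agg_result, fields):
    """Turn nested terms buckets into nested dicts keyed by bucket key."""
    field, rest = fields[0], fields[1:]
    out = {}
    for bucket in agg_result[field]["buckets"]:
        if rest:
            # Descend into the sub-aggregation stored on this bucket.
            out[bucket["key"]] = pivot(bucket, rest)
        else:
            out[bucket["key"]] = bucket["doc_count"]
    return out


# Hand-written sample shaped like the "aggregations" part of a response.
sample = {
    "gender": {
        "buckets": [
            {"key": 1, "doc_count": 331, "age_range": {"buckets": [
                {"key": 0, "doc_count": 264},
                {"key": 3, "doc_count": 67},
            ]}},
            {"key": 2, "doc_count": 174, "age_range": {"buckets": [
                {"key": 0, "doc_count": 153},
                {"key": 3, "doc_count": 21},
            ]}},
        ]
    }
}

breakdown_gender = pivot(sample, ["gender", "age_range"])
print(breakdown_gender)
```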
    
        2
  •  24
  •   Frederik Struck-Schøning    2 years ago

    Since you only have 2 fields, a simple approach is to run two queries with a single facet each. For males:

    {
        "query" : {
          "term" : { "gender" : "Male" }
        },
        "facets" : {
            "age_range" : {
                "terms" : {
                    "field" : "age_range"
                }
            }
        }
    }
    

    For females:

    {
        "query" : {
          "term" : { "gender" : "Female" }
        },
        "facets" : {
            "age_range" : {
                "terms" : {
                    "field" : "age_range"
                }
            }
        }
    }
    

    Alternatively, you can do it in a single query with a facet filter (see this link for further information):

    {
        "query" : {
           "match_all": {}
        },
        "facets" : {
            "age_range_male" : {
                "terms" : {
                    "field" : "age_range"
                },
                "facet_filter":{
                    "term": {
                        "gender": "Male"
                    }
                }
            },
            "age_range_female" : {
                "terms" : {
                    "field" : "age_range"
                },
                "facet_filter":{
                    "term": {
                        "gender": "Female"
                    }
                }
            }
        }
    }
    

    Update:

    Since facets are about to be removed, here is the solution with aggregations:

    {
      "query": {
        "match_all": {}
      },
      "aggs": {
        "male": {
          "filter": {
            "term": {
              "gender": "Male"
            }
          },
          "aggs": {
            "age_range": {
              "terms": {
                "field": "age_range"
              }
            }
          }
        },
        "female": {
          "filter": {
            "term": {
              "gender": "Female"
            }
          },
          "aggs": {
            "age_range": {
              "terms": {
                "field": "age_range"
              }
            }
          }
        }
      }
    }
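
    With more than two values, writing one filter aggregation per value by hand gets tedious. The following sketch generates the same structure and reads the counts back. `filter_aggs` and `read_filter_aggs` are hypothetical helper names, `"size": 0` is an addition that only suppresses the hits, and the sample response uses made-up counts.

```python
def filter_aggs(filter_field, values, group_field):
    """Build one filter aggregation per value, each with a terms sub-aggregation."""
    aggs = {}
    for value in values:
        # Aggregation name is the lower-cased value, e.g. "male".
        aggs[str(value).lower()] = {
            "filter": {"term": {filter_field: value}},
            "aggs": {group_field: {"terms": {"field": group_field}}},
        }
    return {"query": {"match_all": {}}, "size": 0, "aggs": aggs}


def read_filter_aggs(aggregations, group_field):
    """Map each filter aggregation name to {bucket key: doc_count}."""
    return {
        name: {b["key"]: b["doc_count"] for b in agg[group_field]["buckets"]}
        for name, agg in aggregations.items()
    }


body = filter_aggs("gender", ["Male", "Female"], "age_range")

# Made-up response fragment with the shape a filter aggregation returns.
sample = {
    "male": {"doc_count": 3, "age_range": {"buckets": [
        {"key": 0, "doc_count": 2}, {"key": 3, "doc_count": 1}]}},
    "female": {"doc_count": 1, "age_range": {"buckets": [
        {"key": 0, "doc_count": 1}]}},
}
counts = read_filter_aggs(sample, "age_range")
print(counts)
```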
    
        3
  •  0
  •   Frederik Struck-Schøning    2 years ago

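    I know this doesn't answer the question, but I found this page while looking for a way to do a multi-terms aggregation. In the end I found information about this functionality in the documentation. Maybe it will help somebody... multi_terms aggregation: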

          "aggs": {
            "lat_lng": {
              "multi_terms": {
                "terms": [{
                  "field": "lat"
                },{
                  "field": "lng"
                }]
              }
            }
          }
    

    The result will be close to:

        ...
            {
              "key" : [
                "43.00861889999999",
                "-78.8186202"
              ],
              "key_as_string" : "43.00861889999999|-78.8186202",
              "doc_count" : 6
            },
        ...
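
    Because each multi_terms bucket carries its key as a list, it is easy to turn the buckets into rows on the client side. A small sketch, assuming the bucket shape shown above (the sample bucket is copied from it) and a cluster recent enough to support multi_terms; `multi_terms_rows` is a hypothetical helper name.

```python
def multi_terms_rows(buckets, names):
    """Map each multi_terms bucket to a dict of field name -> key part."""
    rows = []
    for bucket in buckets:
        # bucket["key"] lists the values in the order the terms were declared.
        row = dict(zip(names, bucket["key"]))
        row["doc_count"] = bucket["doc_count"]
        rows.append(row)
    return rows


buckets = [
    {
        "key": ["43.00861889999999", "-78.8186202"],
        "key_as_string": "43.00861889999999|-78.8186202",
        "doc_count": 6,
    },
]
rows = multi_terms_rows(buckets, ["lat", "lng"])
print(rows)
```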
    
        4
  •  0
  •   Frederik Struck-Schøning    2 years ago

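    I tried grouping profiles by the organization's yearly revenue and then distributing the counts further across industries, using the following query.

    Example: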

    {
    "size": 0,
    "aggs": {
        "categories": {
            "filter": {
                "exists": {
                    "field": "organization_industries"
                }
            },
            "aggs": {
                "names": {
                    "terms": {
                        "field": "organization_revenue_in_thousands_int.keyword",
                        "size": 200,
                        "order": {
                            "_key": "desc"
                        }
                    },
                    "aggs": {
                        "industry_stats": {
                            "terms": {
                                "field": "organization_industries.keyword"
                            }
                        }
                    }
                }
            }
        }
    }
    }
    

    Output:

        "aggregations": {
        "categories": {
            "doc_count": 195161605,
            "names": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 19226983,
                "buckets": [
                    {
                        "key": "99900",
                        "doc_count": 1742,
                        "industry_stats": {
                            "doc_count_error_upper_bound": 0,
                            "sum_other_doc_count": 0,
                            "buckets": [
                                {
                                    "key": "internet",
                                    "doc_count": 1605
                                },
                                {
                                    "key": "investment management",
                                    "doc_count": 81
                                },
                                {
                                    "key": "biotechnology",
                                    "doc_count": 54
                                },
                                {
                                    "key": "computer & network security",
                                    "doc_count": 2
                                }
                            ]
                        }
                    },                
                    {
                        "key": "998000",
                        "doc_count": 71,
                        "industry_stats": {
                            "doc_count_error_upper_bound": 0,
                            "sum_other_doc_count": 0,
                            "buckets": [
                                {
                                    "key": "finance",
                                    "doc_count": 48
                                },
                                {
                                    "key": "information technology & services",
                                    "doc_count": 23
                                }
                            ]
                        }
                    }
                ]
            }
        }
        }
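
    One thing worth noticing in this output: the revenue field is a keyword, so `"_key": "desc"` orders the buckets as strings, which is why "99900" comes before "998000". If numeric order matters, one option is to re-sort on the client. The sketch below does that on a trimmed copy of the output above; `rows_by_numeric_revenue` is a hypothetical helper, and it assumes every revenue key parses as an integer.

```python
def rows_by_numeric_revenue(aggregations):
    """Flatten revenue -> industry buckets, ordered by revenue as a number."""
    revenue_buckets = aggregations["categories"]["names"]["buckets"]
    # Assumption: every key is an integer stored as a string.
    ordered = sorted(revenue_buckets, key=lambda b: int(b["key"]), reverse=True)
    return [
        (int(rev["key"]), ind["key"], ind["doc_count"])
        for rev in ordered
        for ind in rev["industry_stats"]["buckets"]
    ]


# Trimmed copy of the output shown above.
aggregations = {"categories": {"doc_count": 195161605, "names": {"buckets": [
    {"key": "99900", "doc_count": 1742, "industry_stats": {"buckets": [
        {"key": "internet", "doc_count": 1605},
        {"key": "investment management", "doc_count": 81},
    ]}},
    {"key": "998000", "doc_count": 71, "industry_stats": {"buckets": [
        {"key": "finance", "doc_count": 48},
        {"key": "information technology & services", "doc_count": 23},
    ]}},
]}}}

rows = rows_by_numeric_revenue(aggregations)
for row in rows:
    print(row)
```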