代码之家  ›  专栏  ›  技术社区  ›  clstaudt

高效查询pandas.DataFrame中的大量布尔列

  •  0
  • clstaudt  · 技术社区  · 5 年前

    在我的dataframe中,每行代表一组代码,一个热编码,因此dataframe有大量的布尔列。

    enter image description here

    我想选择包含代码子集的所有行,即给定列集的值为真的所有行。

    一组示例可能是:

    code_selection = {"H045027", "S100031", "G121001", "S456005", "M743110"} 
    

    我的第一次尝试依赖于 DataFrame.query

    def filter_codeset_1(codesets_onehot, code_selection):
        """Return only code sets that contain all of the codes in the code selection"""
        query_string = " & ".join(code_selection)
        return codesets_onehot.query(query_string)
    

    MemoryError                               Traceback (most recent call last)
    <ipython-input-86-8fb45d40b678> in <module>
    ----> 1 filtered = filter_codeset(codesets_onehot, code_selection)
    
    <ipython-input-71-ca3fccfa21ba> in filter_codeset(codesets_onehot, code_selection)
          2     """Return only code sets that contain all of the codes in the code selection"""
          3     query_string = " & ".join(code_selection)
    ----> 4     return codesets_onehot.query(query_string)
    
    ~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
       2845         kwargs['level'] = kwargs.pop('level', 0) + 1
       2846         kwargs['target'] = None
    -> 2847         res = self.eval(expr, **kwargs)
       2848 
       2849         try:
    
    ~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
       2960             kwargs['target'] = self
       2961         kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    -> 2962         return _eval(expr, inplace=inplace, **kwargs)
       2963 
       2964     def select_dtypes(self, include=None, exclude=None):
    
    ~/anaconda3/lib/python3.7/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
        294         eng = _engines[engine]
        295         eng_inst = eng(parsed_expr)
    --> 296         ret = eng_inst.evaluate()
        297 
        298         if parsed_expr.assigner is None:
    
    ~/anaconda3/lib/python3.7/site-packages/pandas/core/computation/engines.py in evaluate(self)
         74 
         75         # make sure no names in resolvers and locals/globals clash
    ---> 76         res = self._evaluate()
         77         return _reconstruct_object(self.result_type, res, self.aligned_axes,
         78                                    self.expr.terms.return_type)
    
    ~/anaconda3/lib/python3.7/site-packages/pandas/core/computation/engines.py in _evaluate(self)
        121             truediv = scope['truediv']
        122             _check_ne_builtin_clash(self.expr)
    --> 123             return ne.evaluate(s, local_dict=scope, truediv=truediv)
        124         except KeyError as e:
        125             # python 3 compat kludge
    
    ~/anaconda3/lib/python3.7/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
        814     expr_key = (ex, tuple(sorted(context.items())))
        815     if expr_key not in _names_cache:
    --> 816         _names_cache[expr_key] = getExprNames(ex, context)
        817     names, ex_uses_vml = _names_cache[expr_key]
        818     arguments = getArguments(names, local_dict, global_dict)
    
    ~/anaconda3/lib/python3.7/site-packages/numexpr/necompiler.py in getExprNames(text, context)
        705 
        706 def getExprNames(text, context):
    --> 707     ex = stringToExpression(text, {}, context)
        708     ast = expressionToAST(ex)
        709     input_order = getInputOrder(ast, None)
    
    ~/anaconda3/lib/python3.7/site-packages/numexpr/necompiler.py in stringToExpression(s, types, context)
        282         else:
        283             flags = 0
    --> 284         c = compile(s, '<expr>', 'eval', flags)
        285         # make VariableNode's for the names
        286         names = {}
    
    MemoryError: 
    

    对于更具可伸缩性的实现(在不超过几秒钟的时间内用数百个代码集查询数十万行),我的选项是什么?应该可以非常有效地做到这一点,因为基本上每行都需要选择一组固定的布尔值并与之连接 and

    以下是备选实施方案,包括答案中建议的实施方案:

    def filter_codeset_2(codesets_onehot, code_selection):
        column_mask = codesets_onehot.columns.isin(code_selection)
        return codesets_onehot[codesets_onehot.apply(lambda row: row[column_mask].all(), axis=1)]
    

    似乎有效,但需要更长的时间:墙壁时间:1分钟22秒

    def filter_codesets_3(codesets_onehot, code_selection):
        codesets_onehot = codesets_onehot.reset_index(drop=True)
        return codesets_onehot.loc[[set(codesets_onehot.columns[i]) == code_selection for i in codesets_onehot.values],:]
    

    给出空结果需要更长时间:墙壁时间:1min 5s

    def filter_codesets_4(codesets_onehot, code_selection):
        columns_of_interest = list(code_selection)
        len_coi = len(columns_of_interest)
        return codesets_onehot.loc[codesets_onehot[columns_of_interest].sum(axis=1) == len_coi]
    

    这可以工作,大约和第一个版本一样快:墙时间:28.7秒。它的优点是可以查询更大的集合,而不会出现内存错误。

    def filter_codesets_5(codesets_onehot, code_selection):
        return codesets_onehot[codesets_onehot[list(code_selection)].all(1)]
    

    有效,简单明了,耗时:30秒。我想只有熊猫很难跑完这段时间。

    3 回复  |  直到 5 年前
        1
  •  3
  •   cs95    5 年前

    再考虑一下,似乎只需选择感兴趣的列并调用 DataFrame.all .

    df_filtered = df[df[list(code_selection)].all(1)]
    

    我们可以打电话更快 np.ndarray.all 而不是 DataFrame.all .

    df_filtered = df[df[list(code_selection)].values.all(1)]
    

    numba :

    from numba import njit, prange
    
    @njit(parallel=True)
    def get_mask(v, pos):
        mask = [True] * v.shape[0]
        for i in prange(v.shape[0]):
            for j in pos:
                mask[i] &= v[i, j]
    
        return np.array(mask)
    

    演出

    np.random.seed(0)
    df = pd.DataFrame(np.random.choice(2, (100000, 1000), p=[0.1, 0.9]))
    code_selection = set(np.random.choice(df.columns, 20))
    

    %timeit df[df[list(code_selection)].all(1)]
    %timeit df[df[list(code_selection)].values.all(1)]
    %timeit df[get_mask(df.values, df.columns.get_indexer(code_selection))]
    
    61.2 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    52.6 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    36.1 ms ± 460 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
        2
  •  2
  •   Shabbir Khan    5 年前

    我会这样做-

    data = [
        [True, False, True],
        [False, True, False],
        [True, True, True],
        [True, True, False],
        [False, True, True]
    ]
    
    df = pd.DataFrame(data, columns=['a', 'b', 'c'])
    
    columns_of_interest = ['b', 'c']
    len_coi = len(columns_of_interest)
    
    df.loc[df[columns_of_interest].sum(axis=1) == len_coi]
    

        3
  •  1
  •   yatu Sayali Sonawane    5 年前

    df.loc[[set(df.columns[i]) == code_selection for i in df.values],:]
    

    如果索引不起作用,请先尝试删除该索引:

    df = df.reset_index(drop=True)
    df.loc[[set(df.columns[i]) == code_selection for i in df.values],:]