Pandas string matching optimization

  •  0
  • Dawny33  · Tech Community  · 6 years ago

    Currently, I use the following line to do string matching on a column of my pandas DataFrame:

    input_supplier = input_supplier[input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)]
    

    However, this operation takes a long time. The shape of the pandas DataFrame is (8098977, 16).

    Is there any way to optimize this particular operation?

    2 replies  |  as of 6 years ago
        1
  •  1
  •   It_is_Chris    6 years ago

    len(df3)
    
    9599904
    
    import re
    import time

    # Approach 1: store the match result in a helper column, then filter on it
    # (case-sensitive match)
    start_time = time.time()
    search = ['Emma','Ryan','Gerald','Billy','Helen']
    df3['search'] = df3['First'].str.contains('|'.join(search))
    new_df = df3[df3['search'] == True]
    end_time = time.time()
    print(f'Elapsed time was {(end_time - start_time)} seconds')
    
    Elapsed time was 6.525546073913574 seconds
    

    # Approach 2: filter directly with the boolean mask (case-insensitive match)
    start_time = time.time()
    search = ['Emma','Ryan','Gerald','Billy','Helen']
    input_supplier = df3[df3['First'].str.contains('|'.join(search), flags=re.IGNORECASE)]
    end_time = time.time()
    print(f'Elapsed time was {(end_time - start_time)} seconds')
    
    Elapsed time was 11.464462518692017 seconds
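
    The second run above differs from the first in two ways at once: it skips the helper column and it adds flags=re.IGNORECASE, so the timings alone do not show which change costs more. One alternative worth timing, assuming the search terms are literal strings, is to lowercase the column once and run a case-sensitive search against a lowercased pattern. The sketch below is minimal, and the tiny df and its values are made up for illustration. Whether this beats re.IGNORECASE on the real data is something to measure, not assume.

```python
import re
import pandas as pd

# Hypothetical sample data, standing in for the real column
df = pd.DataFrame({"First": ["Emma", "RYAN", "Walter", "helen", "Gus"]})
search = ["Emma", "Ryan", "Gerald", "Billy", "Helen"]

# Escape each term so it is matched literally, and lowercase it
pattern = "|".join(re.escape(s.lower()) for s in search)

# Lowercase the column once, then run a plain case-sensitive search;
# na=False treats missing values as non-matches
mask = df["First"].str.lower().str.contains(pattern, na=False)
result = df[mask]
print(result["First"].tolist())
```

    Using the mask directly (df[mask]) also avoids writing a helper column into the DataFrame.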
    

        2
  •  0
  •   b2002    6 years ago

    import numpy as np
    import pandas as pd
    import re
    
    names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
                      'Saul', 'Billy', 'Jesse', 'Helen'] * 1000000)
    input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])
    
    len(input_supplier)
    10000000
    
    category = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']
    

    %%timeit
    input_supplier['search'] = \
        input_supplier['Category Level - 3'].str.contains('|'.join(category))
    df1 = input_supplier[input_supplier['search'] == True]
    
    4.42 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    %%timeit
    df2 = input_supplier[input_supplier['Category Level - 3'].str.contains(
        '|'.join(category), flags=re.IGNORECASE)]
    
    5.45 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    %%timeit
    lcase_vals = [x.lower() for x in input_supplier['Category Level - 3']]
    category_lcase = [x.lower() for x in category]
    df3 = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]
    
    2.02 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    %%timeit
    col_vals = input_supplier['Category Level - 3'].values
    df4 = input_supplier.iloc[np.where(np.isin(col_vals, category))[0]]
    
    623 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
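
    Note that np.isin tests for exact equality, whereas str.contains matches substrings, so the last two timings answer a slightly different question from the original one. If substring matching is required and the column holds relatively few distinct values (an assumption about the data), one option is to run the regex only on the unique values and then select rows with isin. This is a hedged sketch: filter_by_unique and the demo frame are hypothetical names introduced here.

```python
import re
import pandas as pd

def filter_by_unique(df, column, terms):
    """Case-insensitive substring filter that runs the regex once per
    distinct value instead of once per row."""
    # Escape the terms so they are matched literally
    pattern = "|".join(re.escape(t) for t in terms)
    # Distinct non-null values of the column
    uniques = pd.Series(df[column].dropna().unique())
    # Regex match on the (hopefully small) set of distinct values
    matching = uniques[uniques.str.contains(pattern, flags=re.IGNORECASE)]
    # Exact-membership lookup on the full column
    return df[df[column].isin(matching)]

# Hypothetical sample data
demo = pd.DataFrame({"Category Level - 3": [
    "Emma Foods", "gus", "RYAN & Co", "Saul", "Emma Foods"]})
out = filter_by_unique(demo, "Category Level - 3", ["Emma", "Ryan"])
print(out["Category Level - 3"].tolist())
```

    The gain depends on how many distinct values the column has. If nearly every row is unique, this does the same regex work plus an extra isin pass.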