
Optimizing string matching in pandas

performance pandas python-3.x python

Dawny33 · 6 years ago

Currently I have the following line, which does string matching on a column of my pandas DataFrame:

input_supplier = input_supplier[input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)]

However, this operation takes a long time. The DataFrame has shape (8098977, 16).

Is there any way to optimize this particular operation?

2 answers | as of 6 years ago

It_is_Chris 6 years ago

import re
import time

# df3 is the answerer's test DataFrame; its construction is not shown.
len(df3)

9599904

# Creating a column then filtering
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
df3['search'] = df3['First'].str.contains('|'.join(search))
new_df = df3[df3['search'] == True]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')

Elapsed time was 6.525546073913574 seconds

# Filtering directly on the mask, with case-insensitive matching
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
input_supplier = df3[df3['First'].str.contains('|'.join(search), flags=re.IGNORECASE)]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')

Elapsed time was 11.464462518692017 seconds
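A note on the pattern used above: `'|'.join(search)` is interpreted as a regular expression, so a search term containing metacharacters such as `+` or `.` would not be matched literally. The sketch below escapes each term with `re.escape` and uses `case=False`, the keyword form of case-insensitive matching in `Series.str.contains`. The frame `df_demo` and its values are made up for illustration and are not the answerer's data.

```python
import re
import pandas as pd

# Hypothetical stand-in for df3; the real data is not shown in the answer.
df_demo = pd.DataFrame(
    {'First': ['Emma', 'emma-lou', 'Ryan', 'Walter', None, 'C++ Bill']}
)

search = ['Emma', 'Ryan', 'C++']

# Escape each term so characters such as '+' are matched literally.
pattern = '|'.join(re.escape(term) for term in search)

# case=False ignores case; na=False treats missing values as non-matches,
# which keeps the mask purely boolean and safe to index with.
mask = df_demo['First'].str.contains(pattern, case=False, na=False)
matches = df_demo[mask]
print(matches['First'].tolist())
```

Indexing with the mask directly, as here, avoids storing an extra `search` column when it is not needed afterwards.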

b2002 6 years ago

import numpy as np
import pandas as pd
import re

names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
                  'Saul', 'Billy', 'Jesse', 'Helen'] * 1000000)
input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])

len(input_supplier)
10000000

category = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']

%%timeit
input_supplier['search'] = \
    input_supplier['Category Level - 3'].str.contains('|'.join(category))
df1 = input_supplier[input_supplier['search'] == True]

4.42 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df2 = input_supplier[input_supplier['Category Level - 3'].str.contains(
    '|'.join(category), flags=re.IGNORECASE)]

5.45 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
lcase_vals = [x.lower() for x in input_supplier['Category Level - 3']]
category_lcase = [x.lower() for x in category]
df3 = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]

2.02 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
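Note that `np.isin` compares whole values for equality, whereas `str.contains` finds substrings, so the two filters select the same rows only when each cell holds exactly one of the category names (as in this benchmark data). A small sketch of the difference, using a made-up Series `s`; `Series.str.lower()` followed by `Series.isin` is used here as the pandas counterpart of the lowercase comparison:

```python
import pandas as pd

# Made-up values for illustration only.
s = pd.Series(['Emma', 'EMMA', 'Emma Jones', 'Ryan', 'Walter'])
category = ['Emma', 'Ryan']

# Substring, case-insensitive: 'Emma Jones' is selected.
substring_mask = s.str.contains('|'.join(category), case=False)

# Exact, case-insensitive: 'Emma Jones' is not selected.
exact_mask = s.str.lower().isin([c.lower() for c in category])

print(s[substring_mask].tolist())
print(s[exact_mask].tolist())
```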

%%timeit
col_vals = input_supplier['Category Level - 3'].values
df4 = input_supplier.iloc[np.where(np.isin(col_vals, category))[0]]

623 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
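When the column holds few distinct values relative to its length (the benchmark frame above has 10 distinct names in 10,000,000 rows), one possible way to keep the substring and case-insensitive semantics of the original question is to run `str.contains` on the distinct values only and then select rows with `isin`. This is a sketch that was not proposed or timed in either answer; the helper name `filter_contains_via_uniques` and the `demo` frame are hypothetical, and any gain depends on how repetitive the real column is.

```python
import re
import pandas as pd

def filter_contains_via_uniques(df, column, pattern, flags=re.IGNORECASE):
    """Return the rows of df whose `column` value matches `pattern` (regex search)."""
    # Run the regex once per distinct non-null value instead of once per row.
    uniques = pd.Series(df[column].dropna().unique())
    matching = uniques[uniques.str.contains(pattern, flags=flags)]
    # isin is an exact-match lookup against the values that passed the regex.
    return df[df[column].isin(matching)]

# Made-up data: repeated values plus a missing entry.
demo = pd.DataFrame(
    {'Category Level - 3': ['Emma', 'emma b', 'Gus', 'RYAN', 'Gus', None] * 3}
)
result = filter_contains_via_uniques(demo, 'Category Level - 3', 'emma|ryan')
print(len(result))
```

If nearly every value in the column is distinct, this does the same regex work as the direct filter plus an extra `isin` pass, so it is worth timing on the real data before adopting it.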
