
pandas: counting occurrences of expressions based on substrings

pandas-groupby pandas python-3.x python

campo · Tech Community · 6 years ago

Suppose I have a DataFrame like the following:

import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'jack', 'jill', 'al', 'zoe', 'jenn', 'ringo', 'paul', 'george', 'lisa'],
    'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried', 'really worried',
                         'excited', 'not that worried', 'not that excited', 'nervous', 'nervous'],
})

      how do you feel?    name
0              excited    john
1          not excited    jack
2  excited and nervous    jill
3              worried      al
4       really worried     zoe
5              excited    jenn
6     not that worried   ringo
7     not that excited    paul
8              nervous  george
9              nervous    lisa

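I am interested in counts, but split into three categories: "excited", "worried", and "nervous".

The catch is that "excited and nervous" should be grouped with "excited". In fact, any string containing "excited" should be included in that group, except for strings like "not that excited" and "not excited". The same logic applies to "worried" and "nervous". (Note that "excited and nervous" actually belongs to both the "excited" and the "nervous" groups.)

As you can see, a typical groupby will not work here; the string search has to be flexible.

I have a solution, but I am wondering whether any of you can find a better way that is more Pythonic and/or uses more appropriate methods that I may not know about.

Here is my solution:

Define a function that returns the count of rows that contain the desired substring and do not contain a substring negating that sentiment.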

def get_perc(df, column_label, str_include, str_exclude):
    # Keep rows that contain str_include but not str_exclude (case-insensitive).
    col = df[column_label]
    data = col[~col.str.contains(str_exclude, case=False)
               & col.str.contains(str_include, case=False)]

    # Number of matching rows.
    num = data.count()

    return num

Then call this function inside a loop, passing in the various str.contains arguments, and collect the results in another DataFrame.

groups=['excited', 'worried', 'nervous']
column_label='how do you feel?'  # must match the column name in df exactly

data=pd.DataFrame([], columns=['group','num'])
for str_include in groups:
    num=get_perc(df, column_label, str_include, 'not|neither')
    tmp=pd.DataFrame([{'group': str_include,'num': num}])
    data=pd.concat([data, tmp], ignore_index=True)  # keep a running 0..n-1 index


data

      group    num
0   excited      3
1   worried      2
2   nervous      3

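Is there a cleaner way you can think of? I tried using a regular expression in str.contains to avoid needing two boolean Series and the & operator. However, I could not do it without capture groups, which means I would have to use str.extract, and that does not seem to let me select the data in the same way.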
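For reference, one way the two conditions could be folded into a single str.contains call is a negative lookahead combined with non-capturing groups, which avoids capture groups altogether. This is only a sketch under assumptions: the helper name count_group and the variable feel are made up for illustration, and the pattern treats "not" and "neither" as plain substrings, so a word such as "nothing" would also be excluded.

```python
import pandas as pd

feel = pd.Series(
    ['excited', 'not excited', 'excited and nervous', 'worried', 'really worried',
     'excited', 'not that worried', 'not that excited', 'nervous', 'nervous'],
    name='how do you feel?',
)

def count_group(series, include, exclude='not|neither'):
    # Hypothetical helper, not from the original post.
    # ^(?!.*(?:exclude)) rejects any string with a negation word anywhere in it;
    # .*(?:include) then requires the wanted substring. Neither group captures.
    pattern = rf'^(?!.*(?:{exclude})).*(?:{include})'
    return int(series.str.contains(pattern, case=False).sum())

groups = ['excited', 'worried', 'nervous']
counts = {g: count_group(feel, g) for g in groups}
print(counts)  # {'excited': 3, 'worried': 2, 'nervous': 3}
```

Because the pattern is anchored with ^, the lookahead is evaluated once from the start of each string, so a negation word anywhere in the string rules it out.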

Any help is much appreciated.

2 replies | as of 6 years ago

Zero 6 years ago

You could:

Method 1

Ignore the rows containing not, then
get the relevant groups from the indicator columns.

In [140]: col = 'how do you feel?'

In [141]: groups = ['excited', 'worried', 'nervous']

In [142]: df.loc[~df[col].str.contains('not '), col].str.get_dummies(sep=' ')[groups].sum()
Out[142]:
excited    3
worried    2
nervous    3
dtype: int64

Method 2

In [162]: dfs = df['how do you feel?'].str.get_dummies(sep=' ')

In [163]: dfs.loc[~dfs['not'].astype(bool), groups].sum()
Out[163]:
excited    3
worried    2
nervous    3
dtype: int64
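To make the two methods above easier to follow, here is a small sketch of the intermediate frame that str.get_dummies produces. It assumes the same data as in the question; the variable names feel, dummies, and keep are only illustrative.

```python
import pandas as pd

feel = pd.Series(
    ['excited', 'not excited', 'excited and nervous', 'worried', 'really worried',
     'excited', 'not that worried', 'not that excited', 'nervous', 'nervous'],
    name='how do you feel?',
)
groups = ['excited', 'worried', 'nervous']

# One 0/1 indicator column per distinct space-separated word.
dummies = feel.str.get_dummies(sep=' ')
print(dummies.columns.tolist())

# Rows whose 'not' indicator is 0 are the non-negated answers.
keep = dummies['not'] == 0
result = dummies.loc[keep, groups].sum()
print(result)
```

Since the split is on spaces, this matches whole words only, and a row such as "excited and nervous" contributes to both the excited and the nervous columns.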

harvpan 6 years ago

You can simply provide a mapping and then group by the new Series that the mapping produces.

map_dict = {'excited and nervous':'excited', 'not that excited':'not excited', 
            'really worried':'worried', 'not that worried':'not worried'}
df.groupby(df['how do you feel?'].replace(map_dict)).size()

Output:

how do you feel?
excited        3
nervous        2
not excited    2
not worried    1
worried        2
dtype: int64
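Note that replace assigns each row exactly one label, so "excited and nervous" is counted under excited only, which is why nervous shows 2 here rather than 3. If a row should count toward every group it mentions, one possible variation (a sketch only, not part of the answer above, and assuming pandas 0.25 or later for Series.explode) is to find all group words per row and count them after exploding:

```python
import pandas as pd

feel = pd.Series(
    ['excited', 'not excited', 'excited and nervous', 'worried', 'really worried',
     'excited', 'not that worried', 'not that excited', 'nervous', 'nervous'],
    name='how do you feel?',
)

# Drop negated answers first, as in the question.
not_negated = feel[~feel.str.contains('not|neither', case=False)]

# findall gives a list of matched group words per row;
# explode turns each list element into its own row before counting.
counts = (not_negated.str.findall('excited|worried|nervous')
          .explode()
          .value_counts())
print(counts)
```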