
Python pandas: counting regex matches in a string, including compound terms

  •  2
  • ccsv  · Tech Community  · 9 years ago

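    I have a dictionary of regular expressions, keyed by topic, and I want to count how many matches each topic's regex finds in a string, including terms that are compound words.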

    import pandas as pd
    
    
    terms = {'animals':"(fox|russian brown deer|bald eagle|arctic fox)",
    'people':'(John Adams|Rob|Steve|Superman|Super man)',
    'games':'(basketball|basket ball|bball)'
    }
    
    df=pd.DataFrame({
    'Score': [4,6,2,7,8],
    'Foo': ['Superman was looking for a russian brown deer.', 'John adams started to play basket ball with rob yesterday before steve called him','Basketball or bball is a sport played by Steve afterschool','The bald eagle flew pass the arctic fox three times','The fox was sptted playing basket ball?']
    })
    

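    To count the matches I can use code similar to the question "Python pandas count number of Regex matches in a string", but that approach splits the string on whitespace and then counts single words, so compound terms are never matched. Is there an alternative that also counts compound terms joined by spaces?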

    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    
    
    
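    for k, v in terms.items():
        # raw strings keep \s and \. as regex escapes
        df1[k] = df1.Foo.str.contains(r'(?i)(^|\s)' + v + r'($|\s|\.|,|\?)')
    
    
    # sum only the boolean topic columns, not the token column 'Foo'
    df2 = df1.groupby('index')[list(terms)].sum().astype(int)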
    
    
    df = pd.concat([df,df2], axis=1)
    print(df)
    

    The final result should look like this:

                                                     Foo  Score  animals  people  games
    0     Superman was looking for a russian brown deer.      4        1       1      0
    1  John adams started to play basket ball with ro...      6        0       3      1
    2  Basketball or bball is a sport played by Steve...      2        0       1      2
    3  The bald eagle flew pass the arctic fox three ...      7        3       0      0
    4            The fox was sptted playing basket ball?      8        1       0      1
    

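    Note that in row 3 the word fox and the term arctic fox should each be counted once in the animals column (two in total, which together with bald eagle gives 3).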
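
    The counting rule in that note can be illustrated with the standard `re` module alone. This is only a sketch of the requirement, not the asker's code: the variable names `single_pass` and `per_term` are made up here. It contrasts one pass over the whole alternation, where matches cannot overlap, with one search per alternative.

```python
import re

animals = "(fox|russian brown deer|bald eagle|arctic fox)"
row3 = "The bald eagle flew pass the arctic fox three times"

# One pass with the whole alternation: matches cannot overlap, so the text
# 'arctic fox' is consumed once and the shorter 'fox' is not counted again.
single_pass = len(re.findall(animals, row3, flags=re.IGNORECASE))

# One search per alternative: each term is tested independently, so
# 'arctic fox' contributes to both 'fox' and 'arctic fox'.
per_term = sum(
    bool(re.search(re.escape(term), row3, flags=re.IGNORECASE))
    for term in animals.strip("()").split("|")
)

print(single_pass, per_term)
```

    Only the per-term count reaches the 3 expected for row 3, which is why the terms have to be tested one at a time rather than through a single combined pattern.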

    1 answer  |  as of 7 years ago
        1
  •  0
  •   Sergey Bushmanov    9 years ago

    See whether this is what you are after:

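    import re

    for k in terms:
        df[k] = 0
        # strip the parentheses, then test each alternative on the whole sentence
        for words in re.sub("[()]", "", terms[k]).split('|'):
            # case-insensitive substring test; each True adds 1 to the topic count
            mask = df.Foo.str.contains(words, case=False)
            df[k] += mask
    df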
    
    
                                                     Foo  Score  people  animals  games
    0     Superman was looking for a russian brown deer.      4       1        1      0
    1  John adams started to play basket ball with ro...      6       3        0      1
    2  Basketball or bball is a sport played by Steve...      2       1        0      2
    3  The bald eagle flew pass the arctic fox three ...      7       0        3      0
    4            The fox was sptted playing basket ball?      8       0        1      1
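
    One caveat with the loop above: `str.contains` does a substring test, so a short term such as Rob would also be found inside Robert or problem. A possible variant, offered as a sketch rather than as part of the original answer, wraps each term in `\b` word boundaries. The two-row frame and the second sentence are invented here purely to show the difference.

```python
import re
import pandas as pd

terms = {
    'animals': "(fox|russian brown deer|bald eagle|arctic fox)",
    'people': '(John Adams|Rob|Steve|Superman|Super man)',
    'games': '(basketball|basket ball|bball)',
}

# Hypothetical sample: row 0 is from the question, row 1 is made up.
df = pd.DataFrame({
    'Score': [7, 1],
    'Foo': ['The bald eagle flew pass the arctic fox three times',
            'Robert has a problem with Steve'],
})

for topic, pattern in terms.items():
    df[topic] = 0
    for term in pattern.strip('()').split('|'):
        # \b keeps 'Rob' from matching inside 'Robert' or 'problem'
        bounded = r'\b' + re.escape(term) + r'\b'
        found = df['Foo'].str.contains(bounded, case=False, regex=True)
        df[topic] += found.astype(int)

print(df[['animals', 'people', 'games']])
```

    Like the answer's loop, this counts each term at most once per row. If repeated occurrences of the same term should add up, `str.count` with the same bounded pattern and `flags=re.IGNORECASE` could be swapped in for `str.contains`.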