
Python NLTK: show the frequency of common phrases (ngrams) from a dataframe text field using BigramCollocationFinder

  • PineNuts0  ·  asked 5 years ago

    I have the following sample of a tokenized dataframe:

    No  category    problem_definition_stopwords
    175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
    211 1438       ['galley', 'work', 'table', 'stuck']
    912 2698       ['cloth', 'stuck']
    572 2521       ['stuck', 'coffee']
    

    I successfully ran the code below to get the ngram phrases.

    import nltk
    from nltk.collocations import BigramCollocationFinder

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_documents(df['problem_definition_stopwords'])

    # only bigrams that appear 1+ times
    finder.apply_freq_filter(1)

    # return the 10 n-grams with the highest PMI
    finder.nbest(bigram_measures.pmi, 10)
    

    The result, the top 10 by PMI, is shown below:

    [('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]
    

    I'd like those results in a dataframe, with frequency counts showing how often each of these bigrams occurs.

    Example of the desired output:

    ngram                    frequency
    'brewing', 'properly'    1
    'galley', 'work'         1
    'maker', 'brewing'       1
    'properly', '2'          1
    ...                      ...
    

    How can I do the above in Python?

    1 Answer  |  5 years ago

  • blacksite  ·  answered 5 years ago

    This should do it...

    First, set up your dataset (or one like it):

    import pandas as pd
    from nltk.collocations import *
    import nltk.collocations
    from nltk import ngrams
    from collections import Counter
    
    s = pd.Series(
        [
            ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
            ['galley', 'work', 'table', 'stuck'],
            ['cloth', 'stuck'],
            ['stuck', 'coffee']
        ]
    )
    
    finder = BigramCollocationFinder.from_documents(s.values)
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    
    # only bigrams that appear 1+ times
    finder.apply_freq_filter(1) 
    
    # return the 10 n-grams with the highest PMI
    result = finder.nbest(bigram_measures.pmi, 10)
    

    Use nltk.ngrams to re-create the list of ngrams:

    ngram_list = [pair for row in s for pair in ngrams(row, 2)]
    

    Use collections.Counter to count the number of times each ngram appears across the entire corpus:

    counts = Counter(ngram_list).most_common()
    
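    As a sanity check, the same bigram list can be built without NLTK by zipping each row against itself shifted by one. A minimal sketch using the sample rows from above (the variable names here are just illustrative):

```python
from collections import Counter

rows = [
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
]

# zip(row, row[1:]) pairs each token with its successor, i.e. the bigrams
ngram_list = [pair for row in rows for pair in zip(row, row[1:])]
counts = Counter(ngram_list).most_common()
# ('420', '420') occurs twice; every other bigram occurs once
```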

    Build a dataframe that looks like the one you want:

    pd.DataFrame.from_records(counts, columns=['gram', 'count'])
                       gram  count
    0            (420, 420)      2
    1       (coffee, maker)      1
    2      (maker, brewing)      1
    3   (brewing, properly)      1
    4         (properly, 2)      1
    5              (2, 420)      1
    6        (galley, work)      1
    7         (work, table)      1
    8        (table, stuck)      1
    9        (cloth, stuck)      1
    10      (stuck, coffee)      1
    

    Then you can filter so you only look at the results returned by your finder.nbest call:

    df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
    df[df['gram'].isin(result)]
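    As a side note, the finder already tracks these counts itself: BigramCollocationFinder stores a FreqDist (a Counter subclass) of bigram frequencies in its ngram_fd attribute, so the dataframe can be built straight from it. A sketch, using the same sample rows (note that from_documents chains the rows into one token stream, so depending on your NLTK version the counts may also include bigrams that span row boundaries):

```python
import pandas as pd
from nltk.collocations import BigramCollocationFinder

docs = [
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
]

finder = BigramCollocationFinder.from_documents(docs)

# ngram_fd maps each bigram to its observed count
freq_df = pd.DataFrame(finder.ngram_fd.most_common(),
                       columns=['ngram', 'frequency'])
```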