
NLP: Generate a dataframe of collocated trigrams grouped by the value in a column

  •  1
  • PineNuts0  · Tech Community  · 5 years ago

    I have the sample dataframe below. Assume that each letter actually stands for a word. For example, a = 'ant' and b = 'boy'.

    id  words
    1   [a, b, c, d, e, f, g]
    1   [h, I, o]
    1   
    1   [a, b, c]
    2   [e, f, g, m, n, q, r, s]
    2   [w, j, f]
    3   [l, t, m, n, q, s, a]
    3   [c, d, e, f, g]
    4   
    4   [f, g, z]
    

    Code to create the sample dataframe above:

    import pandas as pd 
    
    d = {'id': [1, 1, 1, 1, 2, 2, 3, 3, 4, 4], 'words': [['a', 'b', 'c', 'd', 'e', 'f', 'g'], ['h', 'I', 'o'], '', ['a', 'b', 'c'], ['e', 'f', 'g', 'm', 'n', 'q', 'r', 's'], ['w', 'j', 'f'], ['l', 't', 'm', 'n', 'q', 's', 'a'], ['c', 'd', 'e', 'f', 'g'], '',  ['f', 'g', 'z']]}
    
    df = pd.DataFrame(data=d)
    

    I ran the following NLP code on it. It gives me a count of each 3-word combination that occurs consecutively in the 'words' field.

    import nltk  # needed for the nltk.collocations reference below
    from nltk.collocations import *
    from nltk import ngrams
    from collections import Counter
    
    
    trigram_measures = nltk.collocations.BigramAssocMeasures()
    
    finder = BigramCollocationFinder.from_documents(df['words'])
    
    finder.nbest(trigram_measures.pmi, 100) 
    
    s = pd.Series(df['words'])
    
    # Flatten every row into its runs of 3 consecutive words
    ngram_list = [pair for row in s for pair in ngrams(row, 3)]
    
    # Count each trigram, most frequent first
    counts = Counter(ngram_list).most_common()
    
    df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
    
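To make the counting step concrete, here is a minimal sketch of what the trigram extraction does to each row, written with the standard library only. The helper name `trigrams` and the three sample rows are illustrative assumptions, not part of the original code; the helper is meant to mirror `ngrams(row, 3)` without padding.

```python
from collections import Counter


def trigrams(tokens):
    """Return every run of three consecutive tokens (hypothetical helper)."""
    tokens = list(tokens)
    # Zipping the list with itself shifted by 1 and 2 gives a sliding window of size 3.
    return list(zip(tokens, tokens[1:], tokens[2:]))


# Illustrative rows; the empty string stands in for a row with no words.
rows = [['a', 'b', 'c', 'd'], '', ['a', 'b', 'c']]

trigram_counts = Counter(gram for row in rows for gram in trigrams(row))
print(trigram_counts.most_common())
# [(('a', 'b', 'c'), 2), (('b', 'c', 'd'), 1)]
```

Rows with fewer than three words, including the empty rows, simply contribute no trigrams.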

    Suppose the output looks like the sample below (the values are made up):

    gram                          count 
    a, b, c                       13
    c, d, e                       9
    g, h, i                       6
    q, r, s                       1
    

    The problem is that I want the output split by the 'id' field. The sample output I want is below (the values are made up and random):

    id   gram                          count 
    1    a, b, c                       13
    1    c, d, e                       9
    1    g, h, i                       6
    1    q, r, s                       1
    2    a, b, c                       6
    2    w, j, f                       3
    3    l, t, m                       4
    3    e, f, g                       2
    4    f, g, z                       1
    

    How can I do this, that is, get the results per 'id' field?

    1 Answer  |  as of 5 years ago
  •  0
  •   Dani Mesejo    5 years ago

    You can group by 'id' and apply the counting function to each group:

    import nltk
    import pandas as pd
    
    from nltk.collocations import *
    from nltk import ngrams
    from collections import Counter
    
    d = {'id': [1, 1, 1, 1, 2, 2, 3, 3, 4, 4],
         'words': [['a', 'b', 'c', 'd', 'e', 'f', 'g'], ['h', 'I', 'o'], '', ['a', 'b', 'c'],
                   ['e', 'f', 'g', 'm', 'n', 'q', 'r', 's'], ['w', 'j', 'f'], ['l', 't', 'm', 'n', 'q', 's', 'a'],
                   ['c', 'd', 'e', 'f', 'g'], '', ['f', 'g', 'z']]}
    
    df = pd.DataFrame(data=d)
    
    
    def counts(x):
        # x is the 'words' Series for a single id; x.name holds that id
        trigram_measures = nltk.collocations.BigramAssocMeasures()
        finder = BigramCollocationFinder.from_documents(x)
        finder.nbest(trigram_measures.pmi, 100)
    
        s = pd.Series(x)
    
        # Runs of 3 consecutive words from every row in this group
        ngram_list = [pair for row in s for pair in ngrams(row, 3)]
    
        c = Counter(ngram_list).most_common()
    
        # Prefix each (gram, count) pair with the group id
        return pd.DataFrame([(x.name, ) + element for element in c], columns=['id', 'gram', 'count'])
    
    
    output = df.groupby('id', as_index=False).words.apply(counts).reset_index(drop=True)
    print(output)
    

        id       gram  count
    0    1  (a, b, c)      2
    1    1  (h, I, o)      1
    2    1  (b, c, d)      1
    3    1  (d, e, f)      1
    4    1  (c, d, e)      1
    5    1  (e, f, g)      1
    6    2  (g, m, n)      1
    7    2  (q, r, s)      1
    8    2  (m, n, q)      1
    9    2  (n, q, r)      1
    10   2  (f, g, m)      1
    11   2  (w, j, f)      1
    12   2  (e, f, g)      1
    13   3  (t, m, n)      1
    14   3  (q, s, a)      1
    15   3  (e, f, g)      1
    16   3  (d, e, f)      1
    17   3  (m, n, q)      1
    18   3  (c, d, e)      1
    19   3  (n, q, s)      1
    20   3  (l, t, m)      1
    21   4  (f, g, z)      1
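For comparison, the same per-id counts can probably be reproduced without NLTK or a groupby, by keeping one `Counter` per id. This is only a sketch under assumptions: the helper `trigrams` and the names `per_id` and `records` are hypothetical, and the data is copied from the question.

```python
from collections import Counter, defaultdict

ids = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4]
words = [['a', 'b', 'c', 'd', 'e', 'f', 'g'], ['h', 'I', 'o'], '', ['a', 'b', 'c'],
         ['e', 'f', 'g', 'm', 'n', 'q', 'r', 's'], ['w', 'j', 'f'],
         ['l', 't', 'm', 'n', 'q', 's', 'a'], ['c', 'd', 'e', 'f', 'g'], '',
         ['f', 'g', 'z']]


def trigrams(tokens):
    """Return every run of three consecutive tokens (hypothetical helper)."""
    tokens = list(tokens)
    return list(zip(tokens, tokens[1:], tokens[2:]))


# One Counter per id, updated row by row.
per_id = defaultdict(Counter)
for row_id, row in zip(ids, words):
    per_id[row_id].update(trigrams(row))

# Flatten to (id, gram, count) records, most frequent gram first within each id.
records = [
    (row_id, gram, count)
    for row_id in sorted(per_id)
    for gram, count in per_id[row_id].most_common()
]

for record in records:
    print(record)
```

If a dataframe is needed, the records could be passed to `pd.DataFrame.from_records(records, columns=['id', 'gram', 'count'])`, the same constructor used in the question. The order of grams with equal counts may differ from the output shown above.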