
How do I build a frequency count table of items in a pandas DataFrame?

  •  1
  • Kristada673  ·  6 years ago

    Suppose I have the following data in a CSV file, example.csv:

    Word    Score
    Dog     1
    Bird    2
    Cat     3
    Dog     2
    Dog     3
    Dog     1
    Bird    3
    Cat     1
    Bird    1
    Cat     3
    

    I want to count how often each score occurs for each word. The expected output is:

            1   2   3
    Dog     2   1   1
    Bird    1   1   1
    Cat     1   0   2
    

    My code is as follows:

    import pandas as pd

    x1 = pd.read_csv(r'path\to\example.csv')
    
    def getUniqueWords(allWords) :
        uniqueWords = [] 
        for i in allWords:
            if not i in uniqueWords:
                uniqueWords.append(i)
        return uniqueWords
    
    unique_words = getUniqueWords(x1['Word'])
    unique_scores = getUniqueWords(x1['Score'])
    
    scores_matrix = [[0 for x in range(len(unique_words))] for x in range(len(unique_scores)+1)]   
    # The '+1' is because Python indexing starts from 0; so if a score of 0 is present in the data, the 0 index will be used for that. 
    
    for i in range(len(unique_words)):
        temp = x1[x1['Word']==unique_words[i]]
        for j, word in temp.iterrows():
            scores_matrix[i][j] += 1  # Supposed to store the count for word i with score j
    

    But this produces the following error:

    IndexError                                Traceback (most recent call last)
    <ipython-input-123-141ab9cd7847> in <module>()
         19     temp = x1[x1['Word']==unique_words[i]]
         20     for j, word in temp.iterrows():
    ---> 21         scores_matrix[i][j] += 1
    
    IndexError: list index out of range
    

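    Also, even if I could fix this error, scores_matrix would not carry any headers (Dog, Bird, Cat as the row index and 1, 2, 3 as the column index). I would like to be able to look up the count for each word and each score, along these lines: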

    scores_matrix['Dog'][1]
    >>> 2
    
    scores_matrix['Cat'][2]
    >>> 0
    

    So, how can I fix these two problems?

    1 answer
  •  3
  •   jezrael    6 years ago

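    Use groupby with sort=False, then either value_counts or size, followed by unstack (the two assignments below are alternatives that give the same table):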

    df1 = df.groupby('Word', sort=False)['Score'].value_counts().unstack(fill_value=0)
    

    df1 = df.groupby(['Word','Score'], sort=False).size().unstack(fill_value=0)
    
    print (df1)
    Score  1  2  3
    Word          
    Dog    2  1  1
    Bird   1  1  1
    Cat    1  0  2
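
    The two groupby variants above can be checked against each other on the question's data. This is only a minimal sketch: it assumes the rows of example.csv are typed in directly rather than read from disk, and the names by_value_counts and by_size are illustrative, not from the original answer.

```python
import pandas as pd

# Same rows as example.csv in the question, built inline
# (assumption: no file on disk, so pd.read_csv is not used here).
df = pd.DataFrame({
    'Word':  ['Dog', 'Bird', 'Cat', 'Dog', 'Dog', 'Dog', 'Bird', 'Cat', 'Bird', 'Cat'],
    'Score': [1, 2, 3, 2, 3, 1, 3, 1, 1, 3],
})

# Variant 1: count Score values inside each Word group, then pivot Score into columns.
by_value_counts = (
    df.groupby('Word', sort=False)['Score']
      .value_counts()
      .unstack(fill_value=0)
)

# Variant 2: count rows per (Word, Score) pair, then pivot Score into columns.
by_size = (
    df.groupby(['Word', 'Score'], sort=False)
      .size()
      .unstack(fill_value=0)
)

# Compare the two tables label by label; fill_value=0 covers pairs that never occur.
print(by_value_counts.to_dict() == by_size.to_dict())
```

    In both variants, unstack moves the Score level of the index into the columns, and fill_value=0 supplies the count for combinations such as (Cat, 2) that do not appear in the data.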
    

    If the row order does not matter, use crosstab:

    df1 = pd.crosstab(df['Word'], df['Score'])
    print (df1)
    Score  1  2  3
    Word          
    Bird   1  1  1
    Cat    1  0  2
    Dog    2  1  1
    

    Finally, select a single count by its labels with DataFrame.loc:

    print (df1.loc['Cat', 2])
    0
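
    If crosstab is preferred but the words should still appear in the order they first occur in the data, one option is to reindex its result. A minimal sketch, with the data again typed inline as an assumption; the reindex step and the name counts are additions for illustration and are not part of the original answer.

```python
import pandas as pd

# Same rows as example.csv in the question, built inline (assumption).
df = pd.DataFrame({
    'Word':  ['Dog', 'Bird', 'Cat', 'Dog', 'Dog', 'Dog', 'Bird', 'Cat', 'Bird', 'Cat'],
    'Score': [1, 2, 3, 2, 3, 1, 3, 1, 1, 3],
})

# crosstab counts every (Word, Score) pair and sorts the row labels alphabetically.
counts = pd.crosstab(df['Word'], df['Score'])

# Restore the order in which each word first appears in the data.
counts = counts.reindex(df['Word'].unique())

# Label-based lookup: row label first, then column label.
print(counts.loc['Dog', 1])
print(counts.loc['Cat', 2])
```

    Here counts.loc['Dog', 1] plays the role of scores_matrix['Dog'][1] from the question: the row label selects the word and the column label selects the score.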