代码之家  ›  专栏  ›  技术社区  ›  horcle_buzz

使用Pandas设置SurveyMonkey数据格式

  •  3
  • horcle_buzz  · 技术社区  · 6 年前

    我有一份由SurveyMonkey上的参与者完成的调查要分析。不幸的是,数据的组织方式并不理想,因为每个问题的每个分类回答都有自己的列。

    例如,这里是数据帧中一个响应的前几行:

         How long have you been participating in the Garden Awards Program?  \
    0                                           One year                   
    1                                                NaN                   
    2                                                NaN                   
    3                                                NaN                   
    4                                                NaN                   
    
      Unnamed: 10 Unnamed: 11      Unnamed: 12  \
    0   2-3 years   4-5 years  5 or more years   
    1         NaN         NaN              NaN   
    2         NaN   4-5 years              NaN   
    3   2-3 years         NaN              NaN   
    4         NaN         NaN  5 or more years   
    
      How did you initially learn of the Garden Awards Program?  \
    0              I nominated my garden to be evaluated          
    1                                                NaN          
    2              I nominated my garden to be evaluated          
    3                                                NaN          
    4                                                NaN          
    
                                             Unnamed: 14  etc...
    0  A friend or family member nominated my garden ...  
    1  A friend or family member nominated my garden ...  
    2                                                NaN  
    3                                                NaN  
    4                                                NaN  
    

    这个问题, How long have you been participating in the Garden Awards Program? ,具有有效响应: one year ,则, 2-3 years ,并都在第一行中作为键找到,该键指向哪个列保存哪个值。这是第一个问题。(类似于 How did you initially learn of the Garden Awards Program? ,其中有效响应为: I nominated my garden to be evaluated ,则, A friend or family member nominated my garden 等)。

    第二个问题是,每个分类响应的附加列都是 Unnamed: N ,其中N是与所有问题关联的类别的列数。

    在我开始重新映射并将每个问题的列展平/折叠为一个列之前,我想知道是否有其他方法可以使用熊猫来处理这样呈现的调查数据。我所有的搜索都指向SurveyMonkey API,但我看不出这有什么用处。

    我猜我需要把柱子弄平,因此,如果有人能提出一种方法,那就太好了。我认为有一种方法可以通过抓取相邻列直到 Unnamed 不再是在列名中,但我不知道如何做到这一点。

    1 回复  |  直到 6 年前
        1
  •  4
  •   Gustavo Bezerra    6 年前

    我将使用以下 DataFrame (可从以下网址下载CSV here ):

         Q1 Unnamed: 2 Unnamed: 3    Q2 Unnamed: 5 Unnamed: 6    Q3 Unnamed: 7 Unnamed: 8
    0  A1-A       A1-B       A1-C  A2-A       A2-B       A2-C  A3-A       A4-B       A3-C
    1  A1-A        NaN        NaN   NaN       A2-B        NaN   NaN        NaN       A3-C
    2   NaN       A1-B        NaN  A2-A        NaN        NaN   NaN       A4-B        NaN
    3   NaN        NaN       A1-C   NaN       A2-B        NaN  A3-A        NaN        NaN
    4   NaN       A1-B        NaN   NaN        NaN       A2-C   NaN        NaN       A3-C
    5  A1-A        NaN        NaN   NaN       A2-B        NaN  A3-A        NaN        NaN
    

    关键假设:

    1. 名称不以开头的每列 Unnamed 实际上是一个问题的标题
    2. 问题标题之间的列表示列间隔左端问题的选项

    解决方案概述:

    1. 查找每个问题开始和结束的索引
    2. 将每个问题展平到一列( pd.Series )
    3. 将问题列重新合并在一起

    实施(第1部分):

    indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
    questions = [c for c in df.columns if not c.startswith('Unnamed')]
    slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
    

    您可以看到,对下面这样的切片进行迭代,可以得到一个 数据帧 每个问题对应:

    for q in slices:
        print(df.iloc[:, q])  # Use `display` if using Jupyter
    

    实施(第2-3部分):

    def parse_response(s):
        try:
            return s[~s.isnull()][0]
        except IndexError:
            return np.nan
    
    data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
    df = pd.concat(data, axis=1)
    df.columns = questions
    

    输出:

         Q1    Q2    Q3
    1  A1-A  A2-B  A3-C
    2  A1-B  A2-A  A4-B
    3  A1-C  A2-B  A3-A
    4  A1-B  A2-C  A3-C
    5  A1-A  A2-B  A3-A