代码之家  ›  专栏  ›  技术社区  ›  Datacrawler

将列标题转换为新列并保留每列的值

  •  1
  • Datacrawler  · 技术社区  · 6 年前

    升级版 :已将数据帧粘贴为页面底部的示例。

    我原来的xls文件如下所示:

    enter image description here

    我需要两个动作来让它看起来像下面这样:

    enter image description here

    首先,我需要用上面单元格中显示的值填充空行值。这是通过以下功能实现的:

    def get_csv():
        #Read csv file
        df = pd.read_excel('test.xls')
        df = df.fillna(method='ffill')
        return df
    

    其次,我用了 stack 具有 set_index :

    df = (df.set_index(['Country', 'Gender', 'Arr-Dep'])
             .stack()
             .reset_index(name='Value')
             .rename(columns={'level_3':'Year'}))
    

    我想知道是否有更简单的方法。是否有将数据框架、excel等转换为所需格式的库?

    Country  Gender                          Direction   1974   1975   1976  
    0   Austria    Male  IN  13728   8754   9695   
    1       NaN     NaN  OUT  17977  12271   9899   
    2       NaN  Female  IN   8541   6465   6447   
    3       NaN     NaN  OUT   8450   7190   6288   
    4       NaN   Total  IN  22269  15219  16142   
    5       NaN     NaN  OUT  26427  19461  16187   
    6   Belgium    Male  IN   2412   2245   2296   
    7       NaN     NaN  OUT   2800   2490   2413   
    8       NaN  Female  IN   2105   2022   2057   
    9       NaN     NaN  OUT   2100   2113   2004   
    10      NaN   Total  IN   4517   4267   4353   
    11      NaN     NaN  OUT   4900   4603   4417
    
    2 回复  |  直到 6 年前
        1
  •  2
  •   jezrael    6 年前

    使用替代解决方案 melt ,但如果需要相同的列顺序,如 stacked DataFrame 是必要的补充 sort_values

    df1 = (df.ffill()
             .melt(id_vars=['Country','Gender','Direction'],var_name="Date",value_name='Value')
    )
    

    print (df1)
        Country  Gender Direction  Date  Value
    0   Austria    Male        IN  1974  13728
    1   Austria    Male       OUT  1974  17977
    2   Austria  Female        IN  1974   8541
    3   Austria  Female       OUT  1974   8450
    4   Austria   Total        IN  1974  22269
    5   Austria   Total       OUT  1974  26427
    6   Belgium    Male        IN  1974   2412
    7   Belgium    Male       OUT  1974   2800
    8   Belgium  Female        IN  1974   2105
    9   Belgium  Female       OUT  1974   2100
    10  Belgium   Total        IN  1974   4517
    11  Belgium   Total       OUT  1974   4900
    12  Austria    Male        IN  1975   8754
    13  Austria    Male       OUT  1975  12271
    14  Austria  Female        IN  1975   6465
    15  Austria  Female       OUT  1975   7190
    16  Austria   Total        IN  1975  15219
    17  Austria   Total       OUT  1975  19461
    18  Belgium    Male        IN  1975   2245
    19  Belgium    Male       OUT  1975   2490
    20  Belgium  Female        IN  1975   2022
    21  Belgium  Female       OUT  1975   2113
    22  Belgium   Total        IN  1975   4267
    23  Belgium   Total       OUT  1975   4603
    24  Austria    Male        IN  1976   9695
    25  Austria    Male       OUT  1976   9899
    26  Austria  Female        IN  1976   6447
    27  Austria  Female       OUT  1976   6288
    28  Austria   Total        IN  1976  16142
    29  Austria   Total       OUT  1976  16187
    30  Belgium    Male        IN  1976   2296
    ...
    ...
    

    df1 = (df.ffill()
             .melt(id_vars=['Country','Gender','Direction'],var_name="Date", value_name='Value')
             .sort_values(['Country', 'Gender','Direction'])
             .reset_index(drop=True))
    

    print (df1)
        Country  Gender Direction  Date  Value
    0   Austria  Female        IN  1974   8541
    1   Austria  Female        IN  1975   6465
    2   Austria  Female        IN  1976   6447
    3   Austria  Female       OUT  1974   8450
    4   Austria  Female       OUT  1975   7190
    5   Austria  Female       OUT  1976   6288
    6   Austria    Male        IN  1974  13728
    7   Austria    Male        IN  1975   8754
    8   Austria    Male        IN  1976   9695
    9   Austria    Male       OUT  1974  17977
    10  Austria    Male       OUT  1975  12271
    11  Austria    Male       OUT  1976   9899
    12  Austria   Total        IN  1974  22269
    13  Austria   Total        IN  1975  15219
    14  Austria   Total        IN  1976  16142
    15  Austria   Total       OUT  1974  26427
    16  Austria   Total       OUT  1975  19461
    17  Austria   Total       OUT  1976  16187
    18  Belgium  Female        IN  1974   2105
    19  Belgium  Female        IN  1975   2022
    20  Belgium  Female        IN  1976   2057
    21  Belgium  Female       OUT  1974   2100
    22  Belgium  Female       OUT  1975   2113
    23  Belgium  Female       OUT  1976   2004
    24  Belgium    Male        IN  1974   2412
    25  Belgium    Male        IN  1975   2245
    26  Belgium    Male        IN  1976   2296
    27  Belgium    Male       OUT  1974   2800
    28  Belgium    Male       OUT  1975   2490
    29  Belgium    Male       OUT  1976   2413
    30  Belgium   Total        IN  1974   4517
    ...
    ...
    
        2
  •  2
  •   piRSquared    6 年前

    stack

    我喜欢你的方法。我会用几种方法来改变它。

    1. 使用特定方法向上填充 ffill
    2. 在堆叠之前重命名列轴,以避免以后重命名列(个人首选项)

    df.ffill().set_index(
        ['Country', 'Gender', 'Direction']
    ).rename_axis('Year', 1).stack().reset_index(name='Value')
    
        Country  Gender Direction  Year  Value
    0   Austria    Male        IN  1974  13728
    1   Austria    Male        IN  1975   8754
    2   Austria    Male        IN  1976   9695
    3   Austria    Male       OUT  1974  17977
    4   Austria    Male       OUT  1975  12271
    5   Austria    Male       OUT  1976   9899
    ...
    

    纽比

    我想建立一个自定义的方法。这应该很快。

    def cstm_ffill(s):
      i = np.flatnonzero(s.notna())
      i = np.concatenate([[0], i, [len(s)]])
      d = np.diff(i)
      a = s.values[i[:-1].repeat(d)]
      return a
    
    def cstm_melt(df):
      c = cstm_ffill(df.Country)
      g = cstm_ffill(df.Gender)
      d = cstm_ffill(df.Direction)
      y = df.columns[3:].values
    
      k = len(y)
    
      i = np.column_stack([c, g, d])
      v = np.column_stack([*map(df.get, y)]).ravel()
    
      df_ = pd.DataFrame(
          np.column_stack([i.repeat(k, axis=0), v, np.tile(y, len(i))]),
          columns=['Country', 'Gender', 'Direction', 'Year', 'Value']
      )
      return df_
    
    cstm_melt(df)
    
        Country  Gender Direction   Year Value
    0   Austria    Male        IN  13728  1974
    1   Austria    Male        IN   8754  1975
    2   Austria    Male        IN   9695  1976
    3   Austria    Male       OUT  17977  1974
    4   Austria    Male       OUT  12271  1975
    5   Austria    Male       OUT   9899  1976
    ...