
Pandas DataFrame merge gives unexpected values

  •  1
  • Paweł Szczur MKer  · Tech Community  · 6 years ago

    I have two simple DataFrames:

    import pandas as pd

    # homes_in and homes are existing DataFrames indexed by (State, RegionName)
    a = homes_in.copy()
    b = homes.copy()

    # Flag every row, then keep only the flag column
    a['have'] = True
    b['have'] = True

    a = a['have'].to_frame()
    b = b['have'].to_frame()

    print(a.shape)
    print(b.shape)

    # Turn the index levels into ordinary columns so they can serve as merge keys
    a.reset_index(inplace=True)
    b.reset_index(inplace=True)
    idx_cols = ['State', 'RegionName']

    c = pd.merge(a, b, how='outer', on=idx_cols, suffixes=('_a', '_b'))
    print(c.shape)
    print(sum(c['have_a']))
    print(sum(c['have_b']))
    

    Output:

    (10730, 1)
    (10592, 1)
    (10730, 4)
    10730
    10730
    

    where a.head() is:

                        have
    State RegionName        
    NY    New York      True
    CA    Los Angeles   True
    IL    Chicago       True
    PA    Philadelphia  True
    AZ    Phoenix       True
    

    The question is: why is every value in the have_a and have_b columns True, even though b has fewer rows than a?

    I tried to reproduce the behavior with fake data, but could not:

    col = ['first', 'second', 'third']
    a = pd.DataFrame.from_records([('a','b',1), ('a','c',1), ('a','d', 1)], columns=col)
    b = pd.DataFrame.from_records([('a','b',2), ('a','c',2)], columns=col)
    pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
    
    1 answer  |  up to 6 years ago
  •  1
  •   jezrael    6 years ago

    I think there are duplicated key pairs in your data:

    import pandas as pd

    col = ['first', 'second', 'third']
    # a contains the key pair (a, c) twice; b contains every key pair once
    a = pd.DataFrame.from_records([('a','b',True), ('a','c',True), ('a','c', True)], columns=col)
    b = pd.DataFrame.from_records([('a','b',True), ('a','c',True)], columns=col)
    c = pd.merge(a, b, how='outer', on=['first', 'second'])
    print(a)
      first second  third
    0     a      b   True
    1     a      c   True <- duplicated key (a, c)
    2     a      c   True <- duplicated key (a, c)
    
    print(b)
      first second  third
    0     a      b   True
    1     a      c   True
    
    print(c)
      first second  third_x  third_y
    0     a      b     True     True
    1     a      c     True     True
    2     a      c     True     True
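
As a complement to the example above, here is a minimal sketch of how the merge itself can be asked to check for duplicated keys, using the `validate` argument of `pd.merge`. The helper name `keys_are_unique` is hypothetical and only used for illustration; the data is the same toy data as above.

```python
import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True), ('a', 'c', True)], columns=col)
b = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True)], columns=col)


def keys_are_unique(left, right, keys):
    """Hypothetical helper: True if merging on `keys` is one-to-one."""
    try:
        # validate='one_to_one' makes pandas raise MergeError when the
        # key columns are not unique in the left or the right frame
        pd.merge(left, right, how='outer', on=keys, validate='one_to_one')
    except pd.errors.MergeError:
        return False
    return True


print(keys_are_unique(a, b, ['first', 'second']))
```

With the duplicated `(a, c)` pair in `a`, the check reports that the merge is not one-to-one, which is the situation suspected in the question's data.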
    

    You can find the duplicated rows with duplicated:

    print(a[a.duplicated(['first','second'], keep=False)])
      first second  third
    1     a      c   True
    2     a      c   True
    
    print(b[b.duplicated(['first','second'], keep=False)])
    Empty DataFrame
    Columns: [first, second, third]
    Index: []
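
To see how often each key pair occurs, rather than only which rows are duplicated, one option is to count rows per key. This is a small sketch on the same toy data; the variable names `counts`, `dup_counts` and `n_extra` are illustrative. It matters for the merge because each key contributes (rows in left) × (rows in right) rows to the result.

```python
import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True), ('a', 'c', True)], columns=col)
keys = ['first', 'second']

# Number of rows per key pair; any count above 1 is a duplicated key
counts = a.groupby(keys).size()
dup_counts = counts[counts > 1]
print(dup_counts)

# Number of rows drop_duplicates(keys) would remove
# (every occurrence of a key after the first one)
n_extra = int(a.duplicated(keys).sum())
print(n_extra)
```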
    

    The solution is to remove the duplicates with drop_duplicates:

    # Keep only the first row for each (first, second) key pair
    a = a.drop_duplicates(['first','second'])
    b = b.drop_duplicates(['first','second'])
    
    c = pd.merge(a, b, how='outer', on=['first', 'second'])
    print(a)
      first second  third
    0     a      b   True
    1     a      c   True
    
    print(b)
      first second  third
    0     a      b   True
    1     a      c   True
    
    print(c)
      first second  third_x  third_y
    0     a      b     True     True
    1     a      c     True     True
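
Once the keys are unique, a possible way to check which rows of an outer merge found a partner is the `indicator=True` option of `pd.merge`. The sketch below reuses the fake data from the question, where `('a', 'd')` exists only in `a`; the `_a`/`_b` suffixes follow the question's naming.

```python
import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records(
    [('a', 'b', 1), ('a', 'c', 1), ('a', 'd', 1)], columns=col)
b = pd.DataFrame.from_records(
    [('a', 'b', 2), ('a', 'c', 2)], columns=col)

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only' or 'right_only' for every result row
c = pd.merge(a, b, how='outer', on=['first', 'second'],
             suffixes=('_a', '_b'), indicator=True)
print(c)

# Rows that exist in only one of the two frames
unmatched = c[c['_merge'] != 'both']
print(unmatched)
```

Here the `('a', 'd')` row is marked `left_only` and its `third_b` value is missing. If every row is marked `both`, as the all-True columns in the question suggest, then every key in the result was found in both frames.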