代码之家  ›  专栏  ›  技术社区  ›  gogasca

熊猫多次合并会生成一个x和y列

  •  0
  • gogasca  · 技术社区  · 6 年前

    我有两个数据帧,我在其中找到基于列(TLD)的公共匹配,如果找到匹配(在源和目标中的列之间),我将列(UUID)的值从源复制到目标数据帧。

    现在我还需要检查不同的列是否匹配。(公司名称)

    数据帧1:源

       uuid           website company_name           tld
    0     11  www.facebook.com     facebook  facebook.com
    1     22     www.yahoo.com    yahoo inc     yahoo.com
    2     33    www.google.com       Google    google.com
    3     44     www.cisco.com        Cisco     cisco.com
    

    数据帧2:目的地

      id  website           company_name           tld  match uuid
    0  a  www.facebook.com      facebook  facebook.com  False  NaN
    1  b         www.y.com     Yahoo Inc         y.com  False  NaN
    2  c         www.g.com        Google         g.com  False  NaN
    3  d         www.g.com    Google Inc         g.com  False  NaN
    4  e  www.facebook.com  Facebook Inc  facebook.com  False  NaN
    

    查找 matches 以下内容:

    destination.loc[destination.tld.isin(source.tld),'match'] = True
    destination = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
    

    上面将uuid列从源复制到uuid列中 在目标数据帧中。

     id           website company_name           tld  match  uuid
    0  a  www.facebook.com     facebook  facebook.com   True  11
    1  b         www.y.com     YahooInc         y.com  False  NaN
    2  c         www.g.com       Google         g.com  False  NaN
    3  d         www.g.com    GoogleInc         g.com  False  NaN
    4  e  www.facebook.com  FacebookInc  facebook.com   True  11
    

    现在,我需要检查公司名称是否匹配,以及是否具有以下内容:

     id           website company_name           tld  match  uuid
    0  a  www.facebook.com     facebook  facebook.com   True  11
    1  b         www.y.com     YahooInc         y.com  False  NaN
    2  c         www.g.com       Google         g.com  True   33
    3  d         www.g.com    GoogleInc         g.com  False  NaN
    4  e  www.facebook.com  FacebookInc  facebook.com   True  11
    

    当我试图添加:

    destination.loc[destination.company_name.isin(source.company_name), 'match'] = True
    destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')
    

    我得到一个重复的uuid列:uuid和uuid

    id           website  company_name           tld  match uuid_x uuid_y
    0  a  www.facebook.com      facebook  facebook.com   True     11     11
    1  b         www.y.com     Yahoo Inc         y.com  False    NaN    NaN
    2  c         www.g.com        Google         g.com   True    NaN     33
    3  d         www.g.com    Google Inc         g.com  False    NaN    NaN
    4  e  www.facebook.com  Facebook Inc  facebook.com   True     11    NaN
    

    最终代码

    destination.loc[destination.tld.isin(source.tld),'match'] = True
    destination = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
    destination.loc[destination.company_name.isin(source.company_name), 'match'] = True
    destination = destination.merge(source[['company_name', 'uuid']], on='company_name', how='left')
    
    1 回复  |  直到 6 年前
        1
  •  1
  •   jezrael    6 年前

    match m1 m2 combine_first

    m1 = destination.tld.isin(source.tld)
    m2 = destination.company_name.isin(source.company_name)
    destination['match'] = m1 | m2
    destination1 = destination.merge(source[['tld', 'uuid']], on='tld', how='left')
    destination = destination.merge(source[['company_name','uuid']],on='company_name',how='left')
    
    destination['uuid'] = destination['uuid'].combine_first(destination1['uuid'])
    print (destination)
      id           website  company_name           tld  match  uuid
    0  a  www.facebook.com      facebook  facebook.com   True  11.0
    1  b         www.y.com     Yahoo Inc         y.com  False   NaN
    2  c         www.g.com        Google         g.com   True  33.0
    3  d         www.g.com    Google Inc         g.com  False   NaN
    4  e  www.facebook.com  Facebook Inc  facebook.com   True  11.0