
Pandas DataFrame merge produces unexpected values

numpy pandas python

Paweł Szczur MKer · Tech Community · 7 years ago

I have two simple DataFrames:

a = homes_in.copy()
b = homes.copy()

a['have'] = [True,]*a.shape[0]
b['have'] = [True,]*b.shape[0]

a = a['have'].to_frame()
b = b['have'].to_frame()

print(a.shape)
print(b.shape)

a.reset_index(inplace=True)
b.reset_index(inplace=True)
idx_cols = ['State', 'RegionName']

c = pd.merge(a, b, how='outer', left_on=idx_cols, right_on=idx_cols, suffixes=['_a', '_b'])
print(c.shape)
print(sum(c['have_a']))
print(sum(c['have_b']))

Output:

(10730, 1)
(10592, 1)
(10730, 4)
10730
10730

where a.head() is:

                    have
State RegionName        
NY    New York      True
CA    Los Angeles   True
IL    Chicago       True
PA    Philadelphia  True
AZ    Phoenix       True

The problem: every value in columns have_a and have_b is True, even though b has fewer rows than a.

I tried to reproduce the behavior with fake data, but failed:

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a','b',1), ('a','c',1), ('a','d', 1)], columns=col)
b = pd.DataFrame.from_records([('a','b',2), ('a','c',2)], columns=col)
pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])

1 answer | as of 7 years ago

jezrael · 7 years ago

I think there are duplicates in the key columns:

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a','b',True), ('a','c',True), ('a','c', True)], columns=col)
b = pd.DataFrame.from_records([('a','b',True), ('a','c',True)], columns=col)
c = pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
print (a)
  first second  third
0     a      b   True
1     a      c   True <-duplicates a,c
2     a      c   True <-duplicates a,c

print (b)
  first second  third
0     a      b   True
1     a      c   True

print (c)
  first second  third_x  third_y
0     a      b     True     True
1     a      c     True     True
2     a      c     True     True
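
To make the link to the question explicit, here is a minimal sketch of why neither have column ends up with missing values. The frames `left` and `right` are hypothetical stand-ins (the real `homes_in` and `homes` data is not shown in the question), built so that one frame repeats a key and the other contains every key once.

```python
import pandas as pd

keys = ['first', 'second']

# Hypothetical data: `left` repeats the key ('a', 'c'); `right` has each key once.
left = pd.DataFrame({'first': ['a', 'a', 'a'],
                     'second': ['b', 'c', 'c'],
                     'have': [True, True, True]})
right = pd.DataFrame({'first': ['a', 'a'],
                      'second': ['b', 'c'],
                      'have': [True, True]})

merged = pd.merge(left, right, how='outer', on=keys, suffixes=('_a', '_b'))

# Every key in `left` also exists in `right`, so no row is left unmatched:
# the row count follows the frame with repeated keys and no NaN appears.
print(merged.shape)
print(merged['have_a'].isna().sum(), merged['have_b'].isna().sum())
```

If the real data behaves the same way, the shorter frame can still match every row of the longer one, which would explain why both sums equal the row count of the result.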

You can find the duplicates with:

print (a[a.duplicated(['first','second'], keep=False)])
  first second  third
1     a      c   True
2     a      c   True

print (b[b.duplicated(['first','second'], keep=False)])
Empty DataFrame
Columns: [first, second, third]
Index: []
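
As an alternative check, the sketch below counts rows per key and also asks merge itself to verify key uniqueness through its `validate` argument. It reuses the sample frames from this answer; the `unique_keys` flag is only an illustrative name, and `validate` requires a pandas version that supports it.

```python
import pandas as pd
from pandas.errors import MergeError

keys = ['first', 'second']
col = ['first', 'second', 'third']
a = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True), ('a', 'c', True)], columns=col)
b = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True)], columns=col)

# Number of rows per key; any count above 1 marks a duplicated key.
counts = a.groupby(keys).size()
print(counts[counts > 1])

# validate='one_to_one' makes merge raise MergeError when the keys
# are not unique on both sides.
try:
    pd.merge(a, b, how='outer', on=keys, validate='one_to_one')
    unique_keys = True
except MergeError as err:
    unique_keys = False
    print('merge keys are not unique:', err)
```

Running the merge with `validate` during debugging turns a silent row multiplication into an explicit error.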

The solution is to remove the duplicates with drop_duplicates:

a = a.drop_duplicates(['first','second'])
b = b.drop_duplicates(['first','second'])

c = pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
print (a)
  first second  third
0     a      b   True
1     a      c   True

print (b)
  first second  third
0     a      b   True
1     a      c   True

print (c)
  first second  third_x  third_y
0     a      b     True     True
1     a      c     True     True
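
Applied to frames shaped like the ones in the question, where State and RegionName form the index, the same idea might look like the sketch below. The data is made up for illustration, `indicator=True` is an addition that is not part of the original code, and whether keeping the first duplicate is correct depends on what the duplicated rows mean in the real data.

```python
import pandas as pd

idx_cols = ['State', 'RegionName']

# Hypothetical stand-ins for the question's frames; the real data is not shown.
a = pd.DataFrame({'State': ['NY', 'CA', 'CA'],
                  'RegionName': ['New York', 'Los Angeles', 'Los Angeles'],
                  'have': True}).set_index(idx_cols)
b = pd.DataFrame({'State': ['NY'],
                  'RegionName': ['New York'],
                  'have': True}).set_index(idx_cols)

# Keep only the first row for each (State, RegionName) index entry.
a = a[~a.index.duplicated(keep='first')]
b = b[~b.index.duplicated(keep='first')]

# indicator=True adds a _merge column telling which frame each row came from.
c = pd.merge(a.reset_index(), b.reset_index(), how='outer',
             on=idx_cols, suffixes=('_a', '_b'), indicator=True)
print(c)
```

With unique keys on both sides, rows present in only one frame show up with a missing value in the other frame's have column, and the _merge column labels them as left_only or right_only.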
