您可以使用
merge()
使用
left_on
和
right_on
参数指定键字段,然后检查NaN值以查找哪些行不在表中:
import pandas as pd
# Create dataframes to test with
table_a = pd.DataFrame({
"value": [1, 2, 3, 4, 5],
"employee_id": [100, 200, 300, 400, 500]
})
print "Table A:\n", table_a
table_b = pd.DataFrame({
"value": [1, 2, 3, 4, 5],
"user_id": [100, 200, 300, 1000, 2000],
"age": [40, 50, 60, 70, 80]
})
print "\nTable B:\n", table_b
# Merge table A (left) on employee_id, and table B (right) on user_id
merged = table_a.merge(table_b, left_on="employee_id", right_on="user_id", how="outer", suffixes=("_tableA", "_tableB"))
print "\nMerged:\n", merged
# Table A-columns with NaNs are not present in table B
only_in_table_a = merged.loc[merged.value_tableB.isnull()]
print "\nOnly in table A:\n", only_in_table_a
# Table B-columns with NaNs are not present in table A
only_in_table_b = merged.loc[merged.value_tableA.isnull()]
print "\nOnly in table B:\n", only_in_table_b
# Rows with no NaNs are in both tables
in_both = merged.dropna(subset=["employee_id", "user_id"])
print "\nIn both:\n", in_both
这将产生:
Table A:
employee_id value
0 100 1
1 200 2
2 300 3
3 400 4
4 500 5
Table B:
age user_id value
0 40 100 1
1 50 200 2
2 60 300 3
3 70 1000 4
4 80 2000 5
Merged:
employee_id value_tableA age user_id value_tableB
0 100.0 1.0 40.0 100.0 1.0
1 200.0 2.0 50.0 200.0 2.0
2 300.0 3.0 60.0 300.0 3.0
3 400.0 4.0 NaN NaN NaN
4 500.0 5.0 NaN NaN NaN
5 NaN NaN 70.0 1000.0 4.0
6 NaN NaN 80.0 2000.0 5.0
Only in table A:
employee_id value_tableA age user_id value_tableB
3 400.0 4.0 NaN NaN NaN
4 500.0 5.0 NaN NaN NaN
Only in table B:
employee_id value_tableA age user_id value_tableB
5 NaN NaN 70.0 1000.0 4.0
6 NaN NaN 80.0 2000.0 5.0
In both:
employee_id value_tableA age user_id value_tableB
0 100.0 1.0 40.0 100.0 1.0
1 200.0 2.0 50.0 200.0 2.0
2 300.0 3.0 60.0 300.0 3.0