我有一个大熊猫数据框“dfTagTuple”,大约有5600000行,比如:
index Source Target Weight
0 a b 2.0
1 a d 1.2
2 a b 2.0
3 a d 1.2
4 a b 2.0
5 a d 1.2
6 a b 2.0
7 a d 1.2
8 b d 0.3
9 b d 0.3
10 b d 0.3
11 b d 0.3
12 b d 0.3
13 b d 0.3
14 c l 0.8
以及源/目标的唯一值列表(~91.000)。
对于那个唯一列表中的每个值,我需要column.Source==value-like所在的行
df = dfTagTuple.loc[dfTagTuple["Source"] == "a"]
然后我需要将行的数量限制在X的顶部(比率,这里是0.2=20%),所以权重最大的节点,将它们添加到list/dataframe中,然后从最终结果构建一个dataframe
=对于每个节点,保持连接的顶部X%。
最终结果应为:
index Source Target Weight
0 a b 2.0
8 b d 1.3
10 b f 0.5
16 c l 0.8
如果有人知道在SQL中如何工作,我还可以将数据帧推送到SQLite中,并推送到“gettopxpersourcevalue”中?
目前的代码:
keepRows = []
ratio = 0.2
dfTagTupleNodes = dfTagTuple["Source"].to_frame()
dfTagTupleNodes.drop_duplicates(inplace=True)
for row in dfTagTupleNodes.itertuples():
df = dfTagTuple.loc[dfTagTuple["Source"] == row.Source]
df.sort_values(by=['Weight'], ascending=False, inplace=True)
keepRowAmount = int((len(df.index) * ratio))
if keepRowAmount == 0:
keepRowAmount = 1
dfKeep = df[:keepRowAmount]
for edge in dfKeep.itertuples():
keepRows.append([edge.Source, edge.Target, edge.Weight])
dfTagTupleTopX = pd.DataFrame(keepRows, columns=["Source", "Target", "Weight"])