Get the top X% of edges per node in a large edge DataFrame?

edge limit append dataframe performance

Vega · Tech Community · 5 years ago

I have a large pandas DataFrame "dfTagTuple" with about 5,600,000 rows, like:

index Source Target Weight
0     a      b      2.0
1     a      d      1.2
2     a      b      2.0
3     a      d      1.2
4     a      b      2.0
5     a      d      1.2
6     a      b      2.0
7     a      d      1.2
8     b      d      0.3
9     b      d      0.3
10    b      d      0.3
11    b      d      0.3
12    b      d      0.3
13    b      d      0.3
14    c      l      0.8

and a list of the unique Source/Target values (~91,000).

For each value in that unique list, I need the rows where the Source column equals that value, like:

df = dfTagTuple.loc[dfTagTuple["Source"] == "a"]

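Then I need to limit those rows to the top X (a ratio, here 0.2 = 20%), i.e. the edges with the highest Weight, append them to a list/DataFrame, and build a DataFrame from the final result.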

= For each node, keep the top X% of its edges.
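To pin down the rule, here is a minimal sketch of how many edges a node keeps, assuming truncation toward zero and a floor of one edge (which is what the current code further down does). The function name `keep_count` is made up for illustration.

```python
def keep_count(n_edges: int, ratio: float) -> int:
    """Number of edges to keep for a node that has n_edges outgoing edges.

    Truncates n_edges * ratio toward zero, but never drops below 1,
    so every node keeps at least its single heaviest edge.
    """
    return max(1, int(n_edges * ratio))


# A node with 8 edges keeps int(1.6) = 1, a node with 1 edge keeps the floor of 1.
counts = {n: keep_count(n, 0.2) for n in (1, 6, 8, 10)}
print(counts)
```

With ratio = 0.2 this gives 1 edge for nodes with 1, 6 or 8 edges and 2 edges for a node with 10 edges.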

The final result should be:

index Source Target Weight
0     a      b      2.0 # keep, Source.a=7x, top20=7*0,2=1, highest Weight
8     b      d      1.3 # keep, Source.b=8x, top20=8*0,2=2, highest Weight
10    b      f      0.5 # keep, Source.b=8x, top20=8*0,2=2, highest Weight
16    c      l      0.8 # keep, Source.c=1x, top20=1*0,2=0=1, highest Weight

If anyone knows how this would work in SQL, I could also push the DataFrame into SQLite and run something like a "gettopxpersourcevalue" query there?
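For the SQLite route, one possible sketch uses window functions, assuming the bundled SQLite is version 3.25 or newer (older builds do not support `OVER`). The table name `edges` and the small sample frame are assumptions for illustration, not the real data.

```python
import sqlite3

import pandas as pd

# Small stand-in for dfTagTuple (same shape as the sample in the post).
edges = pd.DataFrame({
    "Source": ["a"] * 8 + ["b"] * 6 + ["c"],
    "Target": ["b", "d"] * 4 + ["d"] * 6 + ["l"],
    "Weight": [2.0, 1.2] * 4 + [0.3] * 6 + [0.8],
})

con = sqlite3.connect(":memory:")
edges.to_sql("edges", con, index=False)

# rn: rank of the edge inside its Source group, heaviest first.
# n:  number of edges in that Source group.
# Keep rows whose rank is within max(1, trunc(n * ratio)).
query = """
SELECT Source, Target, Weight
FROM (
    SELECT Source, Target, Weight,
           ROW_NUMBER() OVER (PARTITION BY Source ORDER BY Weight DESC) AS rn,
           COUNT(*)     OVER (PARTITION BY Source)                      AS n
    FROM edges
)
WHERE rn <= MAX(1, CAST(n * :ratio AS INTEGER))
ORDER BY Source, Weight DESC
"""

top = pd.read_sql_query(query, con, params={"ratio": 0.2})
con.close()
print(top)
```

Ties on Weight are broken arbitrarily by `ROW_NUMBER()`, so if a deterministic pick matters, add a second column to the `ORDER BY` inside the window.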

Current code:

import pandas as pd

keepRows = []
ratio = 0.2  # fraction of each node's edges to keep

# One row per distinct Source node
dfTagTupleNodes = dfTagTuple["Source"].to_frame().drop_duplicates()

for row in dfTagTupleNodes.itertuples():
    # All edges leaving this node, heaviest first.
    # sort_values returns a new frame, which avoids sorting a slice in place.
    df = dfTagTuple.loc[dfTagTuple["Source"] == row.Source]
    df = df.sort_values(by=["Weight"], ascending=False)
    # Keep at least one edge per node
    keepRowAmount = max(1, int(len(df.index) * ratio))
    dfKeep = df[:keepRowAmount]

    for edge in dfKeep.itertuples():
        keepRows.append([edge.Source, edge.Target, edge.Weight])

dfTagTupleTopX = pd.DataFrame(keepRows, columns=["Source", "Target", "Weight"])
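The loop above rescans the whole frame once per node, which is likely the bottleneck at ~91,000 nodes. A possible vectorized alternative, sketched here under the same rule (truncate, keep at least one), sorts once and ranks within each group. The function name `top_ratio_per_source` and the sample frame `edges` are made up for illustration.

```python
import pandas as pd


def top_ratio_per_source(df: pd.DataFrame, ratio: float = 0.2) -> pd.DataFrame:
    """Keep the heaviest `ratio` share of edges per Source, at least one each."""
    # Sort once so that, inside every Source group, heavier edges come first.
    ordered = df.sort_values("Weight", ascending=False, kind="stable")
    grouped = ordered.groupby("Source", sort=False)["Weight"]
    rank = grouped.cumcount()          # 0-based position inside the group
    size = grouped.transform("size")   # group size, repeated on every row
    keep = (size * ratio).astype(int).clip(lower=1)
    kept = ordered[rank < keep]
    return kept.sort_values(["Source", "Weight"], ascending=[True, False])


# Small stand-in for dfTagTuple (same shape as the sample in the post).
edges = pd.DataFrame({
    "Source": ["a"] * 8 + ["b"] * 6 + ["c"],
    "Target": ["b", "d"] * 4 + ["d"] * 6 + ["l"],
    "Weight": [2.0, 1.2] * 4 + [0.3] * 6 + [0.8],
})

result = top_ratio_per_source(edges, ratio=0.2)
print(result)
```

The original row index is preserved in `result`, so the kept edges can still be traced back to the input frame.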

0 replies | as of 5 years ago