代码之家  ›  专栏  ›  技术社区  ›  Vega

获取大型边缘数据帧中每个节点的前X%边缘?

  •  0
  • Vega  · 技术社区  · 5 年前

    我有一个大熊猫数据框“dfTagTuple”,大约有5600000行,比如:

    index Source Target Weight
    0     a      b      2.0
    1     a      d      1.2
    2     a      b      2.0
    3     a      d      1.2
    4     a      b      2.0
    5     a      d      1.2
    6     a      b      2.0
    7     a      d      1.2
    8     b      d      0.3
    9     b      d      0.3
    10    b      d      0.3
    11    b      d      0.3
    12    b      d      0.3
    13    b      d      0.3
    14    c      l      0.8
    

    以及源/目标的唯一值列表(~91.000)。

    对于那个唯一列表中的每个值,我需要column.Source==value-like所在的行

    df = dfTagTuple.loc[dfTagTuple["Source"] == "a"]
    

    然后我需要将行的数量限制在X的顶部(比率,这里是0.2=20%),所以权重最大的节点,将它们添加到list/dataframe中,然后从最终结果构建一个dataframe

    =对于每个节点,保持连接的顶部X%。

    最终结果应为:

    index Source Target Weight
    0     a      b      2.0 # keep, Source.a=7x, top20=7*0,2=1, highest Weight
    8     b      d      1.3 # keep, Source.b=8x, top20=8*0,2=2, highest Weight
    10    b      f      0.5 # keep, Source.b=8x, top20=8*0,2=2, highest Weight
    16    c      l      0.8 # keep, Source.c=1x, top20=1*0,2=0=1, highest Weight
    

    如果有人知道在SQL中如何工作,我还可以将数据帧推送到SQLite中,并推送到“gettopxpersourcevalue”中?

    目前的代码:

    keepRows = []
    ratio = 0.2
    
    dfTagTupleNodes = dfTagTuple["Source"].to_frame()
    dfTagTupleNodes.drop_duplicates(inplace=True)
    
    for row in dfTagTupleNodes.itertuples():
        df = dfTagTuple.loc[dfTagTuple["Source"] == row.Source]
        df.sort_values(by=['Weight'], ascending=False, inplace=True)
        keepRowAmount = int((len(df.index) * ratio))
        if keepRowAmount == 0:
            keepRowAmount = 1
        dfKeep = df[:keepRowAmount]
    
        for edge in dfKeep.itertuples():
           keepRows.append([edge.Source, edge.Target, edge.Weight])
    
    dfTagTupleTopX = pd.DataFrame(keepRows, columns=["Source", "Target", "Weight"])
    
    0 回复  |  直到 5 年前