
Concatenating multiple rows in PySpark

  • asked by lpt · 6 years ago

    I need to combine the following data into a single row:

    vector_no_stopw_df.select("filtered").show(3, truncate=False)
    
    
    
      +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |filtered                                                                                                                                                                                                                          |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[, problem, population]                                                                                                                                                                                                           |
    |[tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries]                        |
    |[many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    

    so that it looks like this:

    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |filtered                                                                                                                                                                                                                          |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[, problem, population,tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries,many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    

    I know this should be trivial, but I can't find a solution. I tried concat_ws and it didn't work either.

    The concat_ws call I executed, vector_no_stopw_df.select(concat_ws(',', vector_no_stopw_df.filtered)).collect(), produces the following — one concatenated string per row, rather than a single merged row:

    [Row(concat_ws(,, filtered)='one,big,advantages,economist,long,time,council,economic,advisers,,years,ago,ive,gotten,know,follow,lot,people,thinking,,started,cea,august,,finished,july,,,first,academic,year,,fellow,senior,economists,paul,krugman,,lawrence,summers'),
     Row(concat_ws(,, filtered)='isnt,going,happen,anytime,soon,meantime,,tax,system,puts,place,much,higher,marginal,rates,people,acknowledge,people,keep,focusing,federal,income,taxes,alone,,marginal,rates,top,around,,percent,leaves,state'),
     Row(concat_ws(,, filtered)=',,
    

    Here is the solution, in case anyone else needs it:

    I used Python's itertools library.

    # Collect all rows to the driver as a list of Row objects
    vector_no_stopw_df_count = vector_no_stopw_df.select("filtered").collect()
    vector_no_stopw_df_count[0].filtered  # inspect the first row's array
    # Pull the "filtered" array out of each Row
    vector_no_stopw_list = [i.filtered for i in vector_no_stopw_df_count]
    

    Flatten the list:

    from itertools import chain

    # chain.from_iterable concatenates the per-row lists into one flat list
    flattenlist = list(chain.from_iterable(vector_no_stopw_list))
    flattenlist[:20]
    

    Result:

    ['',
     'problem',
     'population',
     'tyler',
     'notes',
     'global',
     'population',
     'increase',
     'sharply',
     'next',
     'century',
     '',
     'almost',
     'growth',
     'occurring',
     'relatively',
     'underdeveloped',
     'africa',
     'south',
     'asia']
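
    If the result is needed as a single-row Spark DataFrame again (rather than a plain Python list), the flattened list can be wrapped back up. A minimal sketch, assuming an active SparkSession named spark:

    # Wrap the flattened Python list into a one-row DataFrame with a
    # single array<string> column named "filtered"
    single_row_df = spark.createDataFrame([(flattenlist,)], ["filtered"])
    single_row_df.show(1, truncate=False)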
    
    1 Answer  |  6 years ago
  • answered by karthikr · 6 years ago

    In a way, you are looking for the reverse of explode .

    You can use collect_list for this:

    import pyspark.sql.functions as F

    df.groupBy(<somecol>).agg(F.collect_list('filtered').alias('aggregated_filters'))
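
    A minimal end-to-end sketch of this idea applied to the asker's DataFrame. Since there is no grouping key here, a global agg collapses everything into one row; F.flatten (assuming Spark 2.4+) then merges the collected array-of-arrays into a single flat array:

    import pyspark.sql.functions as F

    # collect_list gathers every row's "filtered" array into one list of
    # arrays; flatten merges that list of arrays into a single array
    single_row = vector_no_stopw_df.agg(
        F.flatten(F.collect_list("filtered")).alias("filtered")
    )
    single_row.show(1, truncate=False)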