
Aggregating text based on row clusters

  •  0
  • syb0rg  ·  6 years ago

    I have a pandas DataFrame like this:

                                                    text  is_from_me
    0                              Happy birthday bud!!!           1
    1                                        Thanks man!           0
    2  Definitely would've come back had I thought ab...           1
    3                                         Your good            0
    4                                          Okay haha           1
    5                                    Have a good one           1
    6                   Yea you too. What are you up to?           0
    7                      No hw like I'm doing all day            1
    8                                        Just got up           1
    9     Same here. I went to the football game last...           0
    10                  I think I saw that in your story           1
    11                                               Win?          1
    12                               Lost in last second           0
    13                                     Aw, that sucks          1
    14                      Means it was a good game tho?          1
    15  Really good game. They were on the 1/2 yard li...          0
    16                                               Dang          1
    

    I'm trying to produce the following:

                                                   input    output
    0                              Happy birthday bud!!!    Thanks man!  
    2                                        Thanks man!    Definitely would've come back had I thought ab...
    3  Definitely would've come back had I thought ab...    Your good
    4                                          Your good    Okay haha\nHave a good one
    6                         Okay haha\nHave a good one    Yea you too. What are you up to?
    7                   Yea you too. What are you up to?    No hw like I'm doing all day\nJust got up
    9          No hw like I'm doing all day\nJust got up    Same here. I went to the football game last...
    10    Same here. I went to the football game last...    I think I saw that in your story\nWin?
    12            I think I saw that in your story\nWin?    Lost in last second
    13                               Lost in last second    Aw, that sucks\nMeans it was a good game tho?
    15     Aw, that sucks\nMeans it was a good game tho?    Really good game. They were on the 1/2 yard li...
    16 Really good game. They were on the 1/2 yard li...    Dang
    

    I can get partway there with this code:

    pd.concat([df['text'].reset_index(drop=True), df['text'].shift(-1).reset_index(drop=True)], axis=1)
    

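    However, this does not group rows by is_from_me. Consecutive rows with the same is_from_me value should be combined into a single row, with a newline separating the original strings. This is a simple example; in practice more than 2 rows may be grouped into one.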

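    I've tried to find a simple way to define this grouping, but all I've come up with is a convoluted for loop that sort of gets the job done. Can I write an aggregation function to accomplish this?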

    2 answers  |  as of 6 years ago
        1
  •  1
  •   Vivek Kalyanarangan    6 years ago

    Use:

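    # A new group starts whenever is_from_me differs from the previous row
    groups = (df.is_from_me != df.is_from_me.shift()).cumsum()
    # Join the texts of each group with newlines, then pair each group with the next one
    input_ = df.groupby(groups)['text'].apply(lambda x: '\n'.join(x))
    output = input_.shift(-1)
    pd.concat([input_, output], axis=1)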
    

    Output

        text    text
    is_from_me      
    1   Happy birthday bud!!!   Thanks man!
    2   Thanks man! Definitely would've come back had I thought ab...
    3   Definitely would've come back had I thought ab...   Your good
    4   Your good   Okay haha\nHave a good one
    5   Okay haha\nHave a good one  Yea you too. What are you up to?
    6   Yea you too. What are you up to?    No hw like I'm doing all day\nJust got up
    7   No hw like I'm doing all day\nJust got up   Same here. I went to the football game last...
    8   Same here. I went to the football game last...  I think I saw that in your story\nWin?
    9   I think I saw that in your story\nWin?  Lost in last second
    10  Lost in last second Aw, that sucks\nMeans it was a good game tho?
    11  Aw, that sucks\nMeans it was a good game tho?   Really good game. They were on the 1/2 yard li...
    12  Really good game. They were on the 1/2 yard li...   Dang
    13  Dang    NaN
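
The approach above can be sketched end to end as follows. This is a minimal, self-contained illustration rather than the answerer's exact code: the sample rows are a subset of the question's data, and the names run_id, messages, and pairs are hypothetical. It also assumes the trailing row with no reply (the NaN row) should be dropped and the columns named input and output, as in the question's desired result.

```python
import pandas as pd

# Sample data: a subset of the rows from the question
df = pd.DataFrame({
    "text": [
        "Happy birthday bud!!!",
        "Thanks man!",
        "Okay haha",
        "Have a good one",
        "Yea you too. What are you up to?",
    ],
    "is_from_me": [1, 0, 1, 1, 0],
})

# Run id: increases by one each time is_from_me differs from the previous row,
# so consecutive rows from the same sender share an id
run_id = (df["is_from_me"] != df["is_from_me"].shift()).cumsum()

# One row per run, with that run's texts joined by newlines
messages = df.groupby(run_id)["text"].agg("\n".join)

# Pair each run with the following run; the last run has no reply, so drop it
pairs = pd.DataFrame({"input": messages, "output": messages.shift(-1)}).dropna()
pairs = pairs.reset_index(drop=True)

print(pairs)
```

Using shift(-1) pairs each message with the reply that follows it, which matches the input/output layout requested in the question.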
    
        2
  •  1
  •   Yuca    6 years ago

    You can use DataFrame.groupby. The output looks ugly, but it should be what you need.

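    # Group consecutive rows with the same is_from_me value and collect each group's values into a tuple
    a = df.groupby([df.is_from_me.diff().ne(0).cumsum()]).agg(lambda x: tuple(x))
    a['output'] = a['text']
    # The previous group's text becomes the input for the current group
    a['input'] = a.shift()['text']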
    

    Output

                 input  \
    is_from_me                                                      
    1                                                         NaN   
    2                                    (Happy birthday bud!!!,)   
    3                                              (Thanks man!,)   
    4           (Definitely would've come back had I thought a...   
    5                                                (Your good,)   
    6                                (Okay haha, Have a good one)   
    7                         (Yea you too. What are you up to?,)   
    8                 (No hw like I'm doing all day, Just got up)   
    9           (Same here. I went to the football game last...,)   
    10                   (I think I saw that in your story, Win?)   
    11                                     (Lost in last second,)   
    12            (Aw, that sucks, Means it was a good game tho?)   
    13          (Really good game. They were on the 1/2 yard l...   
    
                                                           output  
    is_from_me                                                     
    1                                    (Happy birthday bud!!!,)  
    2                                              (Thanks man!,)  
    3           (Definitely would've come back had I thought a...  
    4                                                (Your good,)  
    5                                (Okay haha, Have a good one)  
    6                         (Yea you too. What are you up to?,)  
    7                 (No hw like I'm doing all day, Just got up)  
    8           (Same here. I went to the football game last...,)  
    9                    (I think I saw that in your story, Win?)  
    10                                     (Lost in last second,)  
    11            (Aw, that sucks, Means it was a good game tho?)  
    12          (Really good game. They were on the 1/2 yard l...  
    13                                                    (Dang,)
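
If the tuples are inconvenient, one possible follow-up is to join each tuple into a newline-separated string and drop the first row, which has no preceding message. The sketch below is an assumption-laden illustration, not part of the original answer: it uses a subset of the question's rows, and the names key, grouped, joined, and result are hypothetical.

```python
import pandas as pd

# Sample data: a subset of the rows from the question
df = pd.DataFrame({
    "text": [
        "Happy birthday bud!!!",
        "Thanks man!",
        "Okay haha",
        "Have a good one",
        "Yea you too. What are you up to?",
    ],
    "is_from_me": [1, 0, 1, 1, 0],
})

# Same grouping key as in the answer: a new group starts wherever the
# difference from the previous row is non-zero (the first row's NaN counts too)
key = df["is_from_me"].diff().ne(0).cumsum()

# Collect each group's texts into a tuple, then join each tuple with newlines
grouped = df.groupby(key)["text"].agg(tuple)
joined = grouped.map("\n".join)

# Previous group is the input, current group is the output;
# the first group has no predecessor, so its row is dropped
result = pd.DataFrame({"input": joined.shift(), "output": joined}).dropna()

print(result)
```

Here shift() looks backward rather than forward, so the index labels refer to the group that holds the reply instead of the group that holds the prompt.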