Regex with positive lookahead still splits the string in the wrong place with strsplit()

  •  0
  • Ju Ko  · Tech Community  · 6 years ago

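    I am trying to split a character vector containing messages right in front of each date-time indicator.

    I was thinking of using strsplit() with a regular expression and perl = TRUE.

    Here is some sample data: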

    TEST <- c("05.10.17, 09:26 - Person One: How about we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
    

    This is what I have tried so far:

    Cut <- unlist(strsplit(TEST,"(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
    Cut
    

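    According to this website, the regex should cut the string right in front of each date-time indicator. However, the result I get looks like this, with the first character of each message split off: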

     [1] "0"                                                                                   
     [2] "5.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
     [3] "0"                                                                                   
     [4] "5.10.17, 09:27 - Person One: I could bring some beer\n"                              
     [5] "0"                                                                                   
     [6] "5.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
     [7] "0"                                                                                   
     [8] "5.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
     [9] "0"                                                                                   
    [10] "5.10.17, 09:27 - Person Two: ???\n"
    [11] "0"                                                                                   
    [12] "5.10.17, 09:28 - Person Two: You guys have history?\n"                               
    [13] "0"                                                                                   
    [14] "5.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
    

    This is what the result should look like:

     [1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                                                                                   
     [2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                         
     [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"                                                                                   
     [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
     [5] "05.10.17, 09:27 - Person Two: ???\n"                                                                                   
     [6] "05.10.17, 09:28 - Person Two: You guys have history?\n"  
     [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
    

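    Note: I cannot split the data at the line breaks, because some messages contain one or more of them in the middle of the message.

    3 replies  |  up to 6 years ago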
        1
  •  2
  •   Onyambu    6 years ago

    You just need to create a split marker wherever \n is followed by a date, then split on that marker.

     strsplit(gsub("(.*?\\n)(\\d+[.]\\d+[.]\\d+)","\\1SPLITHERE\\2",TEST),"SPLITHERE")
    [[1]]
    [1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
    [2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
    [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
    [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
    [5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
    [6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
    [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
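
    The marker idea can also be written with a lookahead on the full date-time pattern, so that a newline inside a message is left alone. This is only a sketch: the shortened input `x`, the helper names `stamp`, `marker`, `marked` and `msgs` are made up for illustration, and it assumes the marker text never occurs in a message.

```r
# Shortened, hypothetical input: the first message contains an embedded newline.
x <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# Date-time pattern taken from the question.
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2}"

# Assumption: this marker string does not appear anywhere in the messages.
marker <- "SPLITHERE"

# Put the marker after every newline that is directly followed by a date-time stamp.
marked <- gsub(paste0("\\n(?=", stamp, ")"), paste0("\n", marker), x, perl = TRUE)

# Split on the literal marker; the newlines stay attached to their messages.
msgs <- strsplit(marked, marker, fixed = TRUE)[[1]]
print(msgs)
```

    Because the split happens on a literal, non-empty marker, the zero-length match issue from the question does not come into play here.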
    

    You can also use regmatches from base R:

     regmatches(TEST,gregexpr(".*?\\n",TEST))
    [[1]]
    [1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
    [2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
    [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
    [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
    [5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
    [6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
    [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
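
    Note that `.*?\\n` ends a match at every newline, so a message with an embedded line break would come out in several pieces. A possible variation, sketched below under the assumption that every message starts with the date-time pattern from the question, matches from one stamp up to the next stamp or the end of the string. The input `x` and the names `stamp`, `pat` and `msgs` are illustrative only.

```r
# Shortened, hypothetical input: the first message contains an embedded newline.
x <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# Date-time pattern taken from the question.
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2}"

# (?s) lets "." match newlines; the lazy ".*?" stops right before the
# next stamp, or at the very end of the string (\z).
pat <- paste0("(?s)", stamp, ".*?(?=", stamp, "|\\z)")

msgs <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
print(msgs)
```

    A message whose text itself contains something that looks like a date-time stamp would still be cut at that point, so the pattern may need tightening for real chat exports.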
    
        2
  •  1
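  •   Gilles San Martin    6 years ago

    You can add the whitespace character class \\s in front of your positive lookahead, so that the match is no longer zero-length.

    I slightly changed your example so that it matches your question more precisely (i.e. I added a \n inside a message).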

    > TEST <- c("05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
    > unlist(strsplit(TEST,"\\s(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
    
    ## [1] "05.10.17, 09:26 - Person One: How about\n we chill on sunday"                         
    ## [2] "05.10.17, 09:27 - Person One: I could bring some beer"                                
    ## [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards"    
    ## [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-"                                
    ## [5] "05.10.17, 09:27 - Person Two: ???"                                                    
    ## [6] "05.10.17, 09:28 - Person Two: You guys have history?"                                 
    ## [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
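
    If the trailing newlines should stay attached, as in the desired output of the question, one option is to pair the lookahead with a lookbehind instead of consuming the whitespace. strsplit() appears to emit a single character as a token whenever an empty match sits at the very start of the text still to be scanned; a lookbehind for \n cannot succeed there, so that case is avoided. The sketch below first reproduces the one-character tokens on a made-up string and then applies the lookbehind; `toy`, `quirk`, `x`, `stamp` and `msgs` are illustrative names, and the shortened input is an assumption rather than the original data.

```r
# Toy input (hypothetical): a lookahead-only split yields one-character tokens.
toy <- "thisIsMyString"
quirk <- strsplit(toy, "(?=[A-Z])", perl = TRUE)[[1]]
print(quirk)

# Shortened, hypothetical input: the first message contains an embedded newline.
x <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# Date-time pattern taken from the question.
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2}"

# Split at the empty position that has a newline behind it and a stamp ahead of it.
msgs <- strsplit(x, paste0("(?<=\\n)(?=", stamp, ")"), perl = TRUE)[[1]]
print(msgs)
```

    Nothing is consumed by this pattern, so pasting the pieces back together should reproduce the input exactly.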
    
        3
  •  1
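  •   Shenglin Chen    6 years ago
    # Caveat: "(0)" is part of the match, so the leading "0" of each date is consumed,
    # and [2:7] keeps six pieces, so the last of the seven messages appears to be dropped.
    strsplit(TEST, '(?<=\\\n|^)(0)', perl = TRUE)[[1]][2:7]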