
Extracting text strings in R [closed]

  •  -1
  • abaporu  · 6 years ago

    I have a column like this:

    > PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS)
    
    > PREFI.(S): RUTH SEIXAS|ADV.(A/S): LOPES SOUTO (47706/RS)|RECDO.(A/S): MARTINS (64285/RS)
    

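    I want to: 1) split the values on "|"; 2) keep only the text between ")" or ":" and the next non-letter character or the end of the line.

    The result would be: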

     NETWORK SA 
     JOHN SMITH
     AND OTHER
     CLAUDIA TRROMMER
     LOUISE RUTH
    

    I think I got the first part working:

    docs <- str_split(processos$partes,"\\|")
    

    But I can't figure out the second part, even after several attempts with regex lookbehind/lookahead.

    1 answer  |  up to 6 years ago
  •  1
  •   Retired Data Munger    6 years ago

    A solution using tidyverse and stringr functions:

    > library(tidyverse)
    
    > x <- "
    + > PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS) ..." ... [TRUNCATED] 
    
    > # split on "|"
    > xs <- str_split(x, "\\|")[[1]]
    
    > # extract the data
    > str_extract_all(xs, "\\):[ a-zA-Z]*") %>%
    +   unlist() %>%
    +   sub("^..", "", .)  # get rid of "):"
    [1] " NETWORK SA"       "JOHN SMITH SANT"   " CLAUDIA TRROMMER"
    [4] " LOUISE "          " RUTH SEIXAS"      " LOPES SOUTO "    
    [7] " MARTINS "
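
    The pattern above only matches text that follows "):", so " AND OTHER" (which follows a bare ")") is missed and the results keep their surrounding spaces. A minimal sketch of one possible variation, assuming a lookbehind on either ")" or ":" is what the question intends: it uses base R only, `extract_names` is a hypothetical helper name, and the input is the two sample rows from the question.

```r
# The two sample rows from the question.
x <- c(
  "PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS)",
  "PREFI.(S): RUTH SEIXAS|ADV.(A/S): LOPES SOUTO (47706/RS)|RECDO.(A/S): MARTINS (64285/RS)"
)

# Hypothetical helper: split on "|", then take every run of letters and
# spaces that immediately follows ")" or ":".
extract_names <- function(x) {
  # fixed = TRUE treats "|" literally, so no escaping is needed
  parts <- unlist(strsplit(x, "|", fixed = TRUE))
  # (?<=[):]) is a lookbehind, so the delimiter is not part of the match;
  # lookbehind requires perl = TRUE in base R
  hits <- regmatches(parts, gregexpr("(?<=[):])[ A-Za-z]+", parts, perl = TRUE))
  out <- trimws(unlist(hits))
  # drop any match that consisted only of spaces
  out[nzchar(out)]
}

party_names <- extract_names(x)
print(party_names)
```

    Because the character class holds only letters and spaces, the apostrophe in SANT'ANNA still ends the match, so that name comes back as "JOHN SMITH SANT", just as in the output above. The question does not say how such names should be cut, so that behaviour is left unchanged here.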