
Regex: extracting part of a URL and creating new columns in R

  •  1
  • Cina  · Tech Community  · 6 years ago

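    I have records of URLs and want to extract parts of each URL into new columns. In my example, I want the number after "groups" as group_id and the number after "discussion_topics" as discussion_id. The data frame (df) looks like this: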

     user  url
        1      "https://test.com/groups/3276/discussion_topics/3939"
        2      "https://test.com/groups/34/discussion_topics/11"
        3      "https://test.com/groups/3276"
        4      "https://test.com/groups/other"
    

    I would like a result like this:

    user  group_id  discussion_id
    1      3276       3939
    2      34         11
    3      3276       NA
    4      NA         NA
    

    How can I do this with regular expressions in R? Thanks.

    3 replies  |  latest 6 years ago
        1
  •  3
  •   Onyambu    6 years ago
    dat$group_id=as.numeric(sub(".*/groups/(\\d+).*|.*","\\1",dat$url))
    dat$discussion=as.numeric(sub(".*/discussion_topics/(\\d+).*|.*","\\1",dat$url))
    dat
      user                                                 url group_id discussion
    1    1 https://test.com/groups/3276/discussion_topics/3939     3276       3939
    2    2     https://test.com/groups/34/discussion_topics/11       34         11
    3    3                        https://test.com/groups/3276     3276         NA
    4    4                       https://test.com/groups/other       NA         NA
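
    The trailing `|.*` in each pattern is what produces the NA values: when the first alternative cannot match, `.*` matches the whole string with an empty capture group. A minimal sketch of that step in isolation, assuming a standalone character vector named urls (a name introduced here only for illustration):

```r
# Two of the example URLs: one with a numeric group id, one without.
urls <- c("https://test.com/groups/3276/discussion_topics/3939",
          "https://test.com/groups/other")

# First alternative: match the whole string and capture the digits after /groups/.
# Second alternative (.*): matches anything, leaving capture group 1 empty.
ids <- sub(".*/groups/(\\d+).*|.*", "\\1", urls)
ids              # "3276" ""

# The empty string converts to NA, so non-matching rows need no special handling.
as.numeric(ids)  # 3276 NA
```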
    
        2
  •  1
  •   AndS.    6 years ago

    # \K drops the prefix matched so far, so only the digits after "groups/" or "topics/" are returned
    df$group_id <- as.numeric(regmatches(df$url, gregexpr(".*groups/\\K\\d+", df$url, perl=TRUE)))
    df$discussion <- as.numeric(regmatches(df$url, gregexpr(".*topics/\\K\\d+", df$url, perl=TRUE)))
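
    The `\K` used here and the lookbehind `(?<=...)` used in the stringi answer are two ways of saying "require this prefix but leave it out of the match". A small sketch comparing them in base R with perl = TRUE; the variable names url, with_k and with_lookbehind are introduced only for this illustration:

```r
url <- "https://test.com/groups/34/discussion_topics/11"

# \K: match "groups/", then discard it from the reported match.
with_k <- regmatches(url, regexpr("groups/\\K\\d+", url, perl = TRUE))

# Lookbehind: assert that "groups/" precedes the digits without consuming it.
with_lookbehind <- regmatches(url, regexpr("(?<=groups/)\\d+", url, perl = TRUE))

with_k           # "34"
with_lookbehind  # "34"
```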
    
        3
  •  1
  •   Manuel Bickel    6 years ago

    A solution using stringi, followed by a microbenchmark against the other two answers:

    library(stringi)
    extract_info = function(x) {
      x$group = stri_extract_all_regex(x$url, "(?<=groups/)\\d+")
      x$topic = stri_extract_all_regex(x$url, "(?<=discussion_topics/)\\d+")
      x
    }
    extract_info(dat)
    #    user                                                 url group topic
    # 1    1 https://test.com/groups/3276/discussion_topics/3939  3276  3939
    # 2    2     https://test.com/groups/34/discussion_topics/11    34    11
    # 3    3                        https://test.com/groups/3276  3276    NA
    # 4    4                       https://test.com/groups/other    NA    NA
    
    extract_info2 = function(dat) {
      dat$group_id=as.numeric(sub(".*/groups/(\\d+).*|.*","\\1",dat$url))
      dat$discussion=as.numeric(sub(".*/discussion_topics/(\\d+).*|.*","\\1",dat$url))
      dat
    }
    
    extract_info3 = function(df) {
      df$group_id <- as.numeric(regmatches(df$url, gregexpr(".*groups/*\\K.\\d+", df$url, perl=TRUE)))
      df$discussion <- as.numeric(regmatches(df$url, gregexpr(".*topics/*\\K.\\d+", df$url, perl=TRUE)))
      df
    }
    
    microbenchmark::microbenchmark(
      extract_info(dat)
      ,extract_info2(dat)
      ,extract_info3(dat)
    )
    # Unit: microseconds
    #            expr     min      lq     mean   median       uq      max neval
    # extract_info(dat)  152.769 160.269 172.1629 170.5325 176.0590  300.011   100
    # extract_info2(dat)  99.872 106.386 120.9876 117.2415 125.7285  226.981   100
    # extract_info3(dat) 285.799 301.984 378.7235 308.8925 323.3000 6684.297   100
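
    The timings above come from a four-row data frame, where fixed per-call overhead may dominate. A rough sketch of how one might repeat the comparison on a larger input using base R only; the vector urls, the repetition count, and the helpers by_sub and by_regexpr are assumptions made for this illustration, and no timings are claimed here:

```r
# Hypothetical larger input: the four example URLs repeated 25,000 times.
urls <- rep(c("https://test.com/groups/3276/discussion_topics/3939",
              "https://test.com/groups/34/discussion_topics/11",
              "https://test.com/groups/3276",
              "https://test.com/groups/other"), times = 25000)

# sub() with an empty-match fallback, as in the first answer.
by_sub <- function(x) as.numeric(sub(".*/groups/(\\d+).*|.*", "\\1", x))

# regexpr() with \K; positions without a match keep their initial NA.
by_regexpr <- function(x) {
  m <- regexpr("groups/\\K\\d+", x, perl = TRUE)  # -1 where nothing matches
  out <- rep(NA_real_, length(x))
  out[m > 0] <- as.numeric(regmatches(x, m))      # regmatches() returns matches only
  out
}

# Check that both approaches agree before looking at timings.
stopifnot(identical(by_sub(urls), by_regexpr(urls)))

system.time(by_sub(urls))
system.time(by_regexpr(urls))
```

    Checking that the results are identical first keeps the comparison honest: a faster variant that silently returns different values is not a fair competitor.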