
Extract cities from a vector of strings in R

  •  3
  • Marco C  · 6 years ago

    I have a column in my dataset db, say db$affiliation, that looks like this:

    **db$affiliation**
    [1] "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"                               
    [2] "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS."                                                
    [3] "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."   
    [4] ...
    

    I would like to create a column in the same dataset that contains only the city names found in db$affiliation, e.g.

     **db$cities**
     [1] LOS ANGELES
     [2] TWENTE
     [3] BANGKOK
     [4] ...
    

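    If more than one city name is available, I would like the command to return only the last one; if no city name is available, I would like NA. How can I do this?

    I think I could use world.cities$name from data(world.cities) in the maps package, but I can't figure out how.

    I even tried splitting the db$affiliation column, e.g.: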

    library(splitstackshape) # provides cSplit()
    db$affiliation <- gsub("\\[[^\\]]*\\]", "", db$affiliation, perl=TRUE) # remove content within brackets 
    db$affiliation[2] # check the separator
    db <- cSplit(db, 'affiliation', sep=c(", "), type.convert=FALSE) # split after comma 
    

    The result (truncated after affiliation_3) is:

        affiliation_1            affiliation_2                  affiliation_3 
    [1] UNIV CALIF LOS ANGELES   DEPT GEOG                      LOS ANGELES  
    [2] UNIV TWENTE              DEPT WATER ENGN & MANAGEMENT   DRIENERLOLAAN            
    [3] CHULALONGKORN UNIV       FAC ARCHITECTURE               BANGKOK 
    

    And then I tried:

    db$cities <- lapply(db$affiliation_1, function(x)x[which(x %in% world.cities$name)])
    

    But I get an empty column.

    Thanks for your help!

    3 Answers
        1
  •  2
  •   Prem    6 years ago

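    Many city names match within the sample strings, so you may want to reconsider whether you still want the "last city" when more than one city is found in affiliation.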

    library(maps)
    data(world.cities)
    
    # sample data
    df <- data.frame(affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
                                     "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
                                     "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
                                     "Prem"), stringsAsFactors = FALSE)
    
    # fetch each city and its respective country from the 'affiliation' column
    cities_country <- lapply(gsub("\\[|\\]|[,;]|\\.", "", df$affiliation), function(x) {
      # TRUE for every city name that occurs anywhere in the cleaned string
      hits <- sapply(world.cities$name, grepl, x, ignore.case = TRUE)
      paste(as.character(world.cities$name[hits]),
            as.character(world.cities$country.etc[hits]),
            sep = "_")
    })
    # replace empty matches with NA
    df$cities_country <- lapply(cities_country, function(x) if (identical(x, character(0))) NA_character_ else x)
    df
    

    The output is:

    affiliation
    1                                                                 [SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA
    2 [VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.
    3                                                               [ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.
    4                                                                                                                                           Prem
                                                                                                                                                                                                                                                                                                cities_country
    1                                                                      Al_Norway, Alle_Switzerland, Allen_Philippines, Allen_USA, Angeles_Costa Rica, Angeles_Philippines, Cali_Colombia, Cot_Costa Rica, Li_Norway, Los Angeles_Chile, Los Angeles_USA, Os_Kyrgyzstan, Os_Norway, U_Micronesia, Usa_Japan
    2 Ae_Marshall Islands, Ede_Netherlands, Ede_Nigeria, Enschede_Netherlands, Hede_China, Ine_Marshall Islands, Laa_Austria, Lola_Guinea, Man_Ivory Coast, Mana_French Guiana, Manage_Belgium, Nagem_Luxembourg, Ob_Russia, Ola_Panama, Po_Burkina Faso, U_Micronesia, Van_Turkey, Wa_Ghana, We_New Caledonia
    3                                                                                                                                     Aila_Estonia, Al_Norway, Anan_Japan, Ba_Fiji, Bangkok_Thailand, Hit_Iraq, Ila_Nigeria, Ilan_Taiwan, Long_Thailand, Nan_Thailand, Tsu_Japan, U_Micronesia, Ula_Turkey
    4                                                                                                                                                                                                                                                                                                       NA
    

    (Note: in the output above I kept every city that matched and, for convenience, suffixed each one with its respective country.)
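    The substring matching above also picks up fragments such as "Al" or "Os". One possible refinement, sketched here under assumptions, is to match whole words only and scan from the end of the string, so the last city is found first. The function name `last_city_ngram`, the `max_words` parameter, and the three-entry `cities` vector are hypothetical and used for illustration only; in practice one would pass `world.cities$name`. The sketch assumes ASCII city names, because every other character is treated as a word separator.

```r
# Hypothetical helper: return the last whole-word city match in each string.
last_city_ngram <- function(text, city_names, max_words = 3L) {
  lookup <- toupper(city_names)
  vapply(text, function(x) {
    # Split into upper-case words; anything outside A-Z and 0-9 is a separator.
    words <- strsplit(toupper(x), "[^A-Z0-9]+")[[1]]
    words <- words[nzchar(words)]
    # Walk backwards; at each word try the longest phrase ending there first.
    for (end in rev(seq_along(words))) {
      for (n in seq(min(max_words, end), 1L)) {
        phrase <- paste(words[(end - n + 1L):end], collapse = " ")
        hit <- match(phrase, lookup)
        if (!is.na(hit)) return(city_names[hit])
      }
    }
    NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Illustrative lookup table (an assumption, not the full world.cities data).
cities <- c("Los Angeles", "Enschede", "Bangkok")
affiliation <- c(
  "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
  "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
  "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
  "Prem"
)
result <- last_city_ngram(affiliation, cities)
result
```

    Whole-word matching does not remove every false positive. With the full world.cities table, "USA" in the first row would still match the city listed as Usa_Japan in the output above, so the lookup table probably needs filtering as well.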

        2
  •  1
  •   RolandASc    6 years ago

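    Judging from the few rows you show, you could probably do something like the following (note that your letter case did not line up: your data is upper case, while world.cities$name is not):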

    tmpVec <- sapply(strsplit(db$affiliation, split = ","), function(x) {
      cleanVec <- toupper(trimws(x))
      cleanVec[max(which(cleanVec %in% toupper(maps::world.cities$name)))]
    })
    

    Or add a bit more code to the function to avoid the ugly warnings that max() raises when nothing matches.
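    A minimal sketch of such a warning-free variant follows. The function name `last_city` and the small `cities` vector are hypothetical; the idea is the same as above, but an explicit length check returns NA when no comma-separated field matches, and `vapply()` guarantees a character vector. In practice `maps::world.cities$name` could be passed as `city_names`.

```r
# Hypothetical helper: last comma-separated field that is exactly a city name.
last_city <- function(affiliation, city_names) {
  city_names <- toupper(city_names)
  vapply(strsplit(affiliation, split = ",", fixed = TRUE), function(parts) {
    parts <- toupper(trimws(parts))
    hits <- which(parts %in% city_names)
    # No match: return NA instead of calling max() on an empty vector.
    if (length(hits) == 0L) NA_character_ else parts[max(hits)]
  }, character(1))
}

# Illustrative lookup table (an assumption, not the full world.cities data).
cities <- c("Los Angeles", "Enschede", "Bangkok")
affiliation <- c(
  "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
  "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
  "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
  "Prem"
)
cities_found <- last_city(affiliation, cities)
cities_found
```

    Because only whole fields are compared, the second row yields NA: the city sits inside the field "NL-7500 AE ENSCHEDE" rather than in a field of its own.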

        3
  •  1
  •   jazzurro    6 年前

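    Let me leave part of a solution. Based on my own search, the text in square brackets seems to be people's names. For example, I found that Sutee Anantsuksomsri is a real name. This suggests that we probably want to remove the text in the brackets.

    After removing the text in square brackets, I split the strings into words with unnest_tokens() from the tidytext package. Note that the function converts all letters to lower case. If you don't want that, you can turn it off by specifying to_lower = FALSE. First, I split each city name into words and assigned an ID number to each city. Second, I cleaned your data; as mentioned, I used gsub(). Then I applied unnest_tokens() to the data and kept only the words that appear in cities, using filter(). The result so far is shown below. Obviously, you still have more work to do. I have left the sample data, mydf, below. I hope you can take it from here.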

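    library(maps)     # world.cities
    library(dplyr)    # %>%, mutate(), filter()
    library(tidytext) # unnest_tokens()

    data(world.cities)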
    
    cities <- world.cities %>%
              mutate(id = 1:n()) %>%
              unnest_tokens(input = name, output = word, token = "words")
    
    temp <- mydf %>%
            mutate(affiliation = gsub(x = affiliation, pattern = "\\[.*\\]", replacement = "")) %>%          
            unnest_tokens(input = affiliation, output = word, token = "words") %>%
            filter(word %in% cities$word)
    
    
       id     word
    1   1      los
    2   1  angeles
    3   1      los
    4   1  angeles
    5   1       ca
    6   1      usa
    7   2    water
    8   2       ae
    9   2 enschede
    10  3  bangkok
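
    As a possible next step, the filtered tokens can be collapsed to one word per id. The sketch below is an assumption about how one might continue, not part of the answer's method: it rebuilds the `temp` result printed above by hand and keeps the last token of each id with base R's `tapply()`. The name `last_word` is hypothetical.

```r
# `temp` reproduces the filtered tokens printed above.
temp <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  word = c("los", "angeles", "los", "angeles", "ca", "usa",
           "water", "ae", "enschede", "bangkok"),
  stringsAsFactors = FALSE
)

# Keep the last matching token within each id (rows are assumed to be in text order).
last_word <- tapply(temp$word, temp$id, function(w) w[length(w)])
last_word
```

    For id 1 this returns "usa" rather than a city from the string, and multi-word names such as "los angeles" are split apart, which illustrates why more work is still needed after this step.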
    

    Data

    mydf <- structure(list(id = 1:3, affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA", 
    "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.", 
    "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
    )), .Names = c("id", "affiliation"), row.names = c(NA, -3L), class = "data.frame")