See which vectors in one list are contained in the vectors of another list (finding matching person names)

sapply match list r

Saewon Park · 6 years ago

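I have a list of vectors of person names, where each vector holds only a first and last name, and another list of vectors where each vector holds a first, middle, and last name. I need to match the two lists to find the people who appear in both. Because the names are not in a consistent order (in some vectors the first name comes first, in others the last name does), I would like to match them by finding which vector in the second list (full names) contains all the values of a vector in the first list (first and last names only).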

What I have done so far:

#reproducible example
first_last_names_list <- list(c("boy", "boy"),
                       c("bob", "orengo"),
                       c("kalonzo", "musyoka"),
                       c("anami", "lisamula"))

full_names_list <- list(c("boy", "juma", "boy"), 
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"))

First, I tried to write a function that checks whether one vector is contained in another (based largely on the answer here):

my_contain <- function(values,x){
    tx <- table(x)
    tv <- table(values)
    z <- tv[names(tx)] - tx
    if(all(z >= 0 & !is.na(z))){
       paste(x, collapse = " ")
       }
    }

#values would be the longer vector (from full_names_list)
#and x would be the shorter vector (from first_last_names_list)
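
A quick check of what my_contain() returns may make the next step easier to follow. This is only a sketch with made-up inputs, and the function is repeated so that the snippet runs on its own. It returns the collapsed short name when every value of x (duplicates included) is present in values, and NULL otherwise.

```r
# Repeated from above so the snippet runs on its own
my_contain <- function(values, x){
    tx <- table(x)
    tv <- table(values)
    z <- tv[names(tx)] - tx
    if(all(z >= 0 & !is.na(z))){
       paste(x, collapse = " ")
       }
    }

# Illustrative inputs (the second full name is made up for this check)
hit  <- my_contain(c("boy", "juma", "boy"),   c("boy", "boy"))  # both "boy"s are present
miss <- my_contain(c("boy", "juma", "smith"), c("boy", "boy"))  # only one "boy" is present

hit            # "boy boy"
is.null(miss)  # TRUE
```

Because the non-matching case yields NULL rather than FALSE, any comparison against the result has to cope with a zero-length value, which is why the sapply() call below wraps the comparison in any().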

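Then I tried to put this function inside sapply() so that I could work with lists, and that is where I got stuck. I can get it to check whether one vector is contained in a list of vectors, but I don't know how to check every vector in one list to see whether it is contained in any vector of the second list.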

#testing with the first vector from first_last_names_list. 
#Need to make it run through all the vectors from first_last_names_list.

sapply(1:length(full_names_list),
   function(i) any(my_contain(full_names_list[[i]], 
                              first_last_names_list[[1]]) == 
                              paste(first_last_names_list[[1]], collapse = " ")))

#[1]  TRUE FALSE FALSE FALSE

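Finally, although this may be asking too much in one question, it would be great if someone could give me some pointers on how to use agrep() for fuzzy matching to account for typos in the names. If not, that's fine too, since I need to get the matching part right first anyway.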

4 Answers | last activity 6 years ago

Onyambu · 6 years ago

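Since you are dealing with lists, it is best to collapse them into character vectors so that they can be handled with regular expressions. You just need to sort each name in ascending order first. Once that is done, you can match them easily: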

lst = sapply(first_last_names_list, function(x) paste0(sort(x), collapse = " "))
lst1 = gsub("\\s|$", ".*", lst)
lst2 = sapply(full_names_list, function(x) paste(sort(x), collapse = " "))
(lst3 = Vectorize(grep)(lst1, list(lst2), value = TRUE, ignore.case = TRUE))
               boy.*boy.*             bob.*orengo.*        kalonzo.*musyoka.*         anami.*lisamula.* 
           "boy boy juma"        "bob james orengo" "kalonzo musyoka stephen" "anami lisamula silverse"

Now, if you want to link first_last_names_list and full_names_list:

setNames(full_names_list[ match(lst3,lst2)],sapply(first_last_names_list[grep(paste0(names(lst3),collapse = "|"),lst1)],paste,collapse=" "))
$`boy boy`
[1] "boy"  "juma" "boy" 

$`bob orengo`
[1] "james"  "bob"    "orengo"

$`kalonzo musyoka`
[1] "stephen" "kalonzo" "musyoka"

$`anami lisamula`
[1] "lisamula" "silverse" "anami"

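Here the names come from first_last_names_list and the elements come from full_names_list. It would be better for you to work with character vectors rather than lists.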

Mike S · 6 years ago

Edit: I have revised the solution to satisfy the constraint that a repeated name such as "john john" should not match "john smith".

apply(sapply(first_last_names_list, unlist), 2, function(x){
        any(sapply(full_names_list, function(y) sum(unlist(y) %in% x) >= length(x)))
    })

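This solution still uses %in% and the apply functions, but now, for each first_last name, it counts how many words in each name of the full_names list match. If that count is greater than or equal to the number of words in the first_last item under consideration (always two words in your example, but the code works for any number), it returns TRUE. The resulting logical values are then combined with any() to return a single vector showing whether each first/last name matches any full name.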

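For example, "john john" will not match "john smith random", because only 1 of the 3 words in "john smith random" matches. However, it will match "john adam john", because 2 of the 3 words in "john adam john" match, and 2 equals the length of "john john". It will also match "john john john john john", because 5 of the 5 words match, which is greater than 2.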
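
The counting rule described above can be checked on those three cases. This is a minimal sketch: count_match is a hypothetical helper name that wraps the inner expression from the solution, and the input vectors are made up for illustration.

```r
# Hypothetical helper wrapping the inner test used in the solution above:
# TRUE when at least length(short) words of `full` appear in `short`
count_match <- function(short, full) sum(full %in% short) >= length(short)

jj <- c("john", "john")

r1 <- count_match(jj, c("john", "smith", "random"))  # 1 matching word  -> FALSE
r2 <- count_match(jj, c("john", "adam", "john"))     # 2 matching words -> TRUE
r3 <- count_match(jj, rep("john", 5))                # 5 matching words -> TRUE

# Because the rule counts words, a full name that repeats one word of the
# short name also passes, even though "orengo" is absent here
r4 <- count_match(c("bob", "orengo"), c("bob", "bob", "james"))  # 2 matching words -> TRUE

c(r1, r2, r3, r4)  # FALSE TRUE TRUE TRUE
```

The last case may be worth checking against real data before relying on the count alone.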

tiQu · 6 years ago

Instead of my_contain, try

x %in% values
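
As a rough sketch of how this could be applied to both lists: x %in% values returns one logical value per element of x, so wrapping it in all() gives a single yes/no answer, and two nested sapply() calls compare every pair. The helper name contains_all and the use of all() are assumptions added for illustration. Note that %in% ignores duplicates, so c("boy", "boy") would also count as contained in a full name with only one "boy".

```r
first_last_names_list <- list(c("boy", "boy"),
                              c("bob", "orengo"),
                              c("kalonzo", "musyoka"),
                              c("anami", "lisamula"))

full_names_list <- list(c("boy", "juma", "boy"),
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"))

# Hypothetical helper: TRUE when every element of x occurs somewhere in values
contains_all <- function(values, x) all(x %in% values)

# Logical matrix: rows are full names, columns are first/last names
hits <- sapply(first_last_names_list, function(x)
    sapply(full_names_list, contains_all, x = x))

# Index of the matching full name for each first/last name
matched <- apply(hits, 2, which)
matched  # 1 3 2 4
```

With the sample data each column of hits has exactly one TRUE, so apply() returns a plain vector; with several or no matches per name it would return a list instead.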

You could also unlist them and use data frames. I'm not sure whether you have considered that, but it might make things easier:

# unlist to vectors
fl <- unlist(first_last_names_list)
fn <- unlist(full_names_list)

# grab individual names and convert to dfs; 
# assumptions: first_last_names_list only contains 2-element vectors
#              full_names_list only contains 3-element vectors
first_last_df <- data.frame(first_fl = fl[c(TRUE, FALSE)], last_fl = fl[c(FALSE, TRUE)])
full_name_df <- data.frame(first_fn = fn[c(TRUE, FALSE, FALSE)], mid_fn = fn[c(FALSE, TRUE, FALSE)], last_fn = fn[c(FALSE, FALSE, TRUE)])
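
One possible way to continue from these data frames is a merge() on the name columns, once in the same order and once with first and last swapped. This is only a sketch: the merge step and the names same_order and swapped are assumptions, not part of the answer above, and stringsAsFactors = FALSE is added so the columns stay character on older R versions.

```r
first_last_names_list <- list(c("boy", "boy"),
                              c("bob", "orengo"),
                              c("kalonzo", "musyoka"),
                              c("anami", "lisamula"))

full_names_list <- list(c("boy", "juma", "boy"),
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"))

fl <- unlist(first_last_names_list)
fn <- unlist(full_names_list)

first_last_df <- data.frame(first_fl = fl[c(TRUE, FALSE)],
                            last_fl  = fl[c(FALSE, TRUE)],
                            stringsAsFactors = FALSE)
full_name_df <- data.frame(first_fn = fn[c(TRUE, FALSE, FALSE)],
                           mid_fn   = fn[c(FALSE, TRUE, FALSE)],
                           last_fn  = fn[c(FALSE, FALSE, TRUE)],
                           stringsAsFactors = FALSE)

# Positions 1 and 3 of the full name equal first and last, in that order
same_order <- merge(first_last_df, full_name_df,
                    by.x = c("first_fl", "last_fl"),
                    by.y = c("first_fn", "last_fn"))

# Positions 1 and 3 of the full name equal last and first (swapped)
swapped <- merge(first_last_df, full_name_df,
                 by.x = c("first_fl", "last_fl"),
                 by.y = c("last_fn", "first_fn"))

nrow(same_order)  # 1 ("boy boy")
nrow(swapped)     # 2 ("boy boy" and "anami lisamula")
```

With the sample data this only picks up names whose first and last parts sit in positions 1 and 3 of the full name; "bob orengo" and "kalonzo musyoka" are missed because they occupy positions 2 and 3.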

tiQu · 6 years ago

Or you could do this:

first_last_names_list <- list(c("boy", "boy"),
                          c("bob", "orengo"),
                          c("kalonzo", "musyoka"),
                          c("anami", "lisamula")) 

full_names_list <- list(c("boy", "juma", "boy"), 
                    c("stephen", "kalonzo", "musyoka"),
                    c("james", "bob", "orengo"),
                    c("lisamula", "silverse", "anami"),
                    c("musyoka", "jeremy", "kalonzo")) # added just to test

# create copies of full_names_list without middle name; 
# one list with matching name order, one with inverted order
full_names_short <- lapply(full_names_list,function(x){x[c(1,3)]})
full_names_inv <- lapply(full_names_list,function(x){x[c(3,1)]})

# check if names in full_names_list match either
full_names_list[full_names_short %in% first_last_names_list | full_names_inv %in% first_last_names_list]

In this case %in% does what you want it to do: it checks whether the complete name vectors match.