Summary
library(purrr)
library(stringr)
female$Count <- map_int(female$Var2,
function(x){sum(str_detect(female_df$MH2, pattern = x))})
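Introduction
There are several ways to count how many rows contain each word or phrase. Based on the answers and discussion in this thread so far, the overall strategy has two parts: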
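- Use a function to loop through each word or phrase, such as lapply and sapply from base R, or the map functions from the purrr package.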
- Use a function to count or detect whether a specific pattern (word or phrase) is in a string, such as grep and grepl from base R, or str_detect and str_which from stringr.
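Since the OP has a large amount of data to process, I ran an analysis to compare combinations of functions from base R, purrr, and stringr. In total I investigated eight combinations, built from the choice between sapply or map_int, grep or str_which, and grepl or str_detect.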
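To make the second part of the strategy concrete, here is a minimal base R sketch on a toy vector (rows and phrase are made-up names, not the OP's data). grepl returns one logical value per string, so the matches are counted with sum; grep returns the positions of the matching strings, so they are counted with length. str_detect and str_which from stringr play the same two roles.

```r
# Toy data for illustration only: three strings and one phrase to look for
rows <- c("oximetry, adverse effects", "imipramine poisoning", "oximetry")
phrase <- "oximetry"

hits_logical <- grepl(phrase, rows)  # one TRUE/FALSE per string
hits_index   <- grep(phrase, rows)   # positions of the matching strings

count_from_logical <- sum(hits_logical)   # each TRUE counts as 1
count_from_index   <- length(hits_index)  # number of matching positions
```

Both counts agree, which is why each detect-style function in the combinations is paired with sum and each which-style function with length.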
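Data preparation
Here I create two data frames, female and female_df, based on the OP's example. Note that I set stringsAsFactors = FALSE to ensure each column is in character format.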
# Create the example data frame: female
female <- data.frame(Var2 = c("myocardial infarction drug therapy",
                              "imipramine poisoning",
                              "oximetry",
                              "thrombosis drug therapy",
                              "angioedema chemically induced"),
                     stringsAsFactors = FALSE)
# Create the example data frame: female_df
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                "angioedema chemically induced, angioedema chemically induced, oximetry",
                                "abo blood group system, imipramine poisoning, adverse effects",
                                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                "thrombosis drug therapy"),
                        stringsAsFactors = FALSE)
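As a small illustration of why the stringsAsFactors setting matters, the sketch below builds the same one-column data frame twice (words, df_chr and df_fct are hypothetical names). Before R 4.0.0 the default was stringsAsFactors = TRUE, so setting it explicitly keeps the column type the same across R versions.

```r
words <- c("oximetry", "imipramine poisoning")

# Same input, two settings
df_chr <- data.frame(Var2 = words, stringsAsFactors = FALSE)
df_fct <- data.frame(Var2 = words, stringsAsFactors = TRUE)

class(df_chr$Var2)  # character vector, ready to use as patterns
class(df_fct$Var2)  # factor, which stores integer codes plus levels
```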
microbenchmark is a package for evaluating the performance of code.
# Load packages
library(purrr)
library(stringr)
library(microbenchmark)
Combinations of functions
Combination 1
This is 卢斯·泰尔斯's answer. It uses sapply and grepl.
sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
Combination 2
This is Dave2e's answer. It uses sapply and grep.
sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
Combination 3
This uses map_int and str_detect.
map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1
Combination 4
This uses map_int and str_which.
map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1
Combination 5
This uses map_int and grepl.
map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
[1] 1 1 2 2 1
Combination 6
This uses map_int and grep.
map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})
[1] 1 1 2 2 1
Combination 7
This uses sapply and str_detect.
sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
Combination 8
This uses sapply and str_which.
sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
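All of these combinations are valid answers. For example, the result of any of them can be assigned to a new column by writing female$Count <- in front of the call, as in the summary.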
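One difference visible in the outputs above is that sapply keeps the phrases as names, while map_int returns a plain integer vector. A quick sketch of how two results could be compared, using only base R and a shortened copy of the example data (var2 and mh2 are stand-ins for female$Var2 and female_df$MH2):

```r
# Shortened stand-ins for the example columns
var2 <- c("imipramine poisoning", "oximetry", "thrombosis drug therapy")
mh2 <- c("oximetry, hydrogen peroxide adverse effects",
         "angioedema chemically induced, oximetry",
         "abo blood group system, imipramine poisoning",
         "isoenzymes, thrombosis drug therapy",
         "thrombosis drug therapy")

c1 <- sapply(var2, function(x) sum(grepl(pattern = x, mh2)))  # as in Combination 1
c2 <- sapply(var2, function(x) length(grep(x, mh2)))          # as in Combination 2

# Drop the names before comparing, so only the counts are checked
same_counts <- identical(unname(c1), unname(c2))
```

The same unname() step would be needed before comparing a sapply result with a map_int result.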
Microbenchmark
Here I benchmark the eight combinations, evaluating each one 30000 times.
m <- microbenchmark(
C1 = {sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
C2 = {sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})},
C3 = {map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
C4 = {map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
C5 = {map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
C6 = {map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})},
C7 = {sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
C8 = {sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
times = 30000L
)
print(m)
Unit: microseconds
 expr     min      lq     mean   median       uq       max neval
   C1 166.144 200.784 1503.780 2192.261 2401.063 184228.81 30000
   C2 163.578 198.860 1420.937 1460.653 2280.465 144553.22 30000
   C3 189.238 231.575 1502.319  790.305 2386.309 146455.85 30000
   C4 200.784 246.329 1461.714 1224.909 2306.125 184189.04 30000
   C5 150.107 185.388 1452.586 1970.630 2376.687  32124.08 30000
   C6 148.824 184.105 1398.312 1921.556 2259.937 145843.88 30000
   C7 205.916 251.461 1516.979  851.246 2408.119 146305.10 30000
   C8 215.538 264.932 1481.538 1508.764 2324.727 229709.16 30000
Judging by the median time, the combination of map_int and str_detect (C3) is the fastest of the eight in this run.
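To make the comparison easier to read, the median column can be sorted. The sketch below simply retypes the medians printed in the benchmark output into a data frame (medians and ranked are made-up names); with the m object still in the session, summary(m) should expose the same columns directly.

```r
# Median times (microseconds) copied from the benchmark output
medians <- data.frame(
  expr   = paste0("C", 1:8),
  median = c(2192.261, 1460.653, 790.305, 1224.909,
             1970.630, 1921.556, 851.246, 1508.764),
  stringsAsFactors = FALSE
)

# Sort from fastest to slowest by median time
ranked <- medians[order(medians$median), ]
ranked$expr
```

The two str_detect variants (C3 and C7) come first on the median, although the mean times of all eight combinations are close to each other.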