
Regex to replace wiki citations in R

stringr r regex


Selva · 8 years ago

What regex would strip the citation markers from the text of a Wikipedia article?

Sample input:

 text <- "[76][note 7] just like traditional Hinduism regards the Vedas "

Expected output:

"just like traditional Hinduism regards the Vedas"

What I tried:

> text <- "[76][note 7] just like traditional Hinduism regards the Vedas "
> library(stringr)
> str_replace_all(text, "\\[ \\d+ \\]", "")
[1] "[76][note 7] just like traditional Hinduism regards the Vedas "

4 answers

MFR · 8 years ago

Try this:

text <- "[76][note 7] just like traditional Hinduism regards the Vedas "
library(stringr)
str_replace_all(text, "\\[[^\\]]*\\]\\s*", "")

Output:

 "just like traditional Hinduism regards the Vedas "

Hector Buelta · 8 years ago

This regex is one option:

(?!.*\]).*

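The lookaround (the parenthesized block, a negative lookahead) effectively puts the pointer right after the last "]". The rest of the expression, ".*", then matches the text you want up to the end of the line. The match includes the leading space, but trimming that is simple in whichever language you use.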
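Because this pattern matches the text to keep rather than the citations to delete, it fits an extract call better than a replace call. Here is a minimal sketch in base R. Two things in it are assumptions and not part of the answer above: the use of regexpr with regmatches, and perl = TRUE, which is passed because lookaheads are a Perl-style feature.

```r
text <- "[76][note 7] just like traditional Hinduism regards the Vedas "

# (?!.*\]) fails at every position that still has a "]" somewhere after it,
# so the first successful match starts right after the last "]".
m <- regexpr("(?!.*\\]).*", text, perl = TRUE)
kept <- regmatches(text, m)

# The match keeps the surrounding spaces, so trim them afterwards.
result <- trimws(kept)
print(result)
```

Note that this only helps when every citation comes before the text to keep. A citation in the middle of the sentence would cause everything before it to be dropped.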

R. Schifini · 8 years ago

This should do it:

trimws(sub("\\[.*\\]", "",text))

Result:

[1] "just like traditional Hinduism regards the Vedas"

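This pattern looks for an opening bracket ( \\[ ), a closing bracket ( \\] ), and everything in between ( .* ).

By default .* is greedy: it matches as much as it can, running past any intermediate closing and opening brackets until it reaches the last closing bracket. That match is replaced with an empty string.

Finally, the trimws function removes the whitespace at the start and end of the result.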
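The three steps described above can be looked at one at a time. This is a small sketch that reuses the sample input from the question. The intermediate variable names are only for illustration.

```r
text <- "[76][note 7] just like traditional Hinduism regards the Vedas "

# What the greedy pattern actually matches: the first "[" through the last "]".
matched <- regmatches(text, regexpr("\\[.*\\]", text))

# sub() replaces that single match; the spaces around the sentence remain.
stripped <- sub("\\[.*\\]", "", text)

# trimws() then drops the leading and trailing spaces.
result <- trimws(stripped)
print(result)
```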

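Edit: removing citations anywhere in the sentence

If the sentence has citations in more than one place, the pattern and the function change as shown below. Each match is replaced with a single space, not an empty string, so that the words on either side of a citation stay separated. perl = TRUE is passed so that the lazy quantifier follows PCRE semantics.

trimws(gsub(" ?\\[.*?\\] ", " ", text, perl = TRUE))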

For example, if the sentences are:

text1 <- "[76][note 7] just like traditional Hinduism [34] regards the Vedas "
text2 <- "[76][note 7] just like traditional Hinduism[34] regards the Vedas "

The respective results will be:

[1] "just like traditional Hinduism regards the Vedas"
[1] "just like traditional Hinduism regards the Vedas"

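Pattern changes:

.*? changes the regexp from greedy to lazy. That is, it tries to match the shortest possible text, stopping at the first closing bracket that lets the rest of the pattern match.

The leading " ?" (a space followed by a question mark) tries to match an optional space before the opening bracket.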
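To see why the lazy quantifier matters, compare both patterns on a sentence with a citation in the middle. This is a sketch under two assumptions that are choices made here for illustration: perl = TRUE, so the lazy quantifier follows PCRE semantics, and a single space as the replacement, so the neighbouring words are not glued together.

```r
text1 <- "[76][note 7] just like traditional Hinduism [34] regards the Vedas "

# Greedy: .* runs to the LAST "]", so the words between citations are lost.
greedy <- trimws(sub("\\[.*\\]", "", text1))

# Lazy: .*? stops at the first "]" that is followed by a space, so each
# citation group is matched separately and replaced with one space.
lazy <- trimws(gsub(" ?\\[.*?\\] ", " ", text1, perl = TRUE))

print(greedy)
print(lazy)
```

A citation at the very end of the string, with no space after it, would not be matched by this pattern, because the pattern requires a trailing space.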

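Wiktor Stribiżew · 8 years ago

Your \\[ \\d+ \\] does not work because there are spaces in the pattern. Besides, even with the spaces removed, the expression would only match [ + digits + ] and would not match [note 7]-like substrings.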
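Both points can be checked directly. The sketch below uses base R gsub with perl = TRUE instead of the stringr call from the question. That substitution is an assumption made here to keep the example free of dependencies.

```r
text <- "[76][note 7] just like traditional Hinduism regards the Vedas "

# Original attempt: the literal spaces inside the pattern must appear in the
# input, so "[76]" does not match and the text comes back unchanged.
with_spaces <- gsub("\\[ \\d+ \\]", "", text, perl = TRUE)

# Without the spaces, "[76]" is removed but "[note 7]" is left in place.
without_spaces <- gsub("\\[\\d+\\]", "", text, perl = TRUE)

print(with_spaces)
print(without_spaces)
```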

Here is a base R solution using gsub with a TRE regex (no perl=TRUE is required):

text <- "[76][note 7] just like traditional Hinduism regards the Vedas "
trimws(gsub("\\[[^]]+]", "", text))
## Or to remove only those [] that contain digits/word + space + digits
trimws(gsub("\\[(?:[[:alnum:]]+[[:blank:]]*)?[0-9]+]", "", text))

See the R demo.

Pattern explanation:

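\\[ - a literal [ (it must be escaped outside a character class)
(?:[[:alnum:]]+[[:blank:]]*)? - an optional sequence (optional because of the ? quantifier at the end): 1 or more alphanumeric characters followed by 0 or more spaces or tabs
[0-9]+ - 1 or more digits
] - a literal ] (no need to escape it outside a character class)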

The trimws call removes leading/trailing whitespace.
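The difference between the two patterns shows up when the text contains brackets that are not citations. The sentence below is a hypothetical example that is not in the original question. It contains an editorial "[sic]", which only the broad pattern removes.

```r
# Hypothetical sentence: two citations plus an editorial "[sic]".
sample_text <- "[76][note 7] quoted text [sic] here"

# Broad pattern: removes every bracketed group, including "[sic]".
broad <- trimws(gsub("\\[[^]]+]", "", sample_text))

# Narrow pattern: only brackets that end in digits are treated as citations.
narrow <- trimws(gsub("\\[(?:[[:alnum:]]+[[:blank:]]*)?[0-9]+]", "", sample_text))

print(broad)
print(narrow)
```

Which of the two is preferable depends on whether the text can contain bracketed material that should survive.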

regex demo (note that the PCRE option is selected there because it supports POSIX character classes; do not use that site to test TRE regex patterns!).