Same regex, but different results in pandas vs. R

stringr pandas regex r python

Asked 5 years ago

Consider this simple regex, which is meant to extract titles:

(\w[\w-]+){2,}

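Running it in Python (pandas) and in R (stringr) gives completely different results!

In R, the stringr extraction works as expected: note that 'this-is-a-very-nice-test' is parsed correctly.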

library(stringr)
> str_extract_all('stackoverflow.stack.com/read/this-is-a-very-nice-test', 
+                 regex('(\\w[-\\w]+){2,}'))
[[1]]
[1] "stackoverflow"            "stack"                    "read"                     "this-is-a-very-nice-test"

import pandas as pd
myseries = pd.Series({'text' : 'stackoverflow.stack.com/read/this-is-a-very-nice-test'})

myseries.str.extractall(r'(\w[-\w]+){2,}')
Out[51]: 
             0
     match    
text 0      ow
     1      ck
     2      ad
     3      st

What is going on here?

Wiktor Stribiżew 5 years ago

The (\w[-\w]+){2,} regex contains a repeated capturing group:

A repeated capturing group captures only the last iteration.
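To make the last-iteration behavior concrete, here is a minimal sketch using Python's standard re module (chosen for illustration only; the question itself uses pandas and stringr). It compares the whole match, group(0), with what the repeated group retains, group(1). The variable names are illustrative.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
pattern = re.compile(r'(\w[-\w]+){2,}')

# group(0) is the whole match; group(1) is only what the repeated
# capturing group held on its final iteration.
whole_matches = [m.group(0) for m in pattern.finditer(s)]
last_iterations = [m.group(1) for m in pattern.finditer(s)]

print(whole_matches)    # ['stackoverflow', 'stack', 'read', 'this-is-a-very-nice-test']
print(last_iterations)  # ['ow', 'ck', 'ad', 'st']
```

The second list is the same set of values that appears in the pandas output in the question.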

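See the regex demo. You get these last-iteration values from pandas .extractall because this method expects "a regular expression pattern with capturing groups" and returns "a DataFrame with one row for each match, and one column for each group".

As opposed to pandas extractall, R stringr::str_extract_all omits all captured substrings from its result and returns only the whole matches.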
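One possible fix on the pandas side, offered as a sketch rather than as the answerer's exact wording: make the repeated group non-capturing and wrap the whole repetition in a single capturing group, so that extractall has a group spanning the full match. The names result and titles are illustrative.

```python
import pandas as pd

myseries = pd.Series({'text': 'stackoverflow.stack.com/read/this-is-a-very-nice-test'})

# (?:...) repeats without capturing; the outer (...) captures the
# entire repeated run, so column 0 holds the full match.
result = myseries.str.extractall(r'((?:\w[-\w]+){2,})')
titles = result[0].tolist()

print(titles)  # ['stackoverflow', 'stack', 'read', 'this-is-a-very-nice-test']
```

This keeps the original {2,} quantifier, so short tokens such as 'com' are still skipped, matching the stringr result shown in the question.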

Mahmoud Elshahat 5 years ago

After changing the "{2,}" part to "{1,}", it works as expected:

import re
s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
out = re.findall(r'(\w[-\w]+){1,}', s)
print(out)

Output:

['stackoverflow', 'stack', 'com', 'read', 'this-is-a-very-nice-test']

Edit: an explanation from a Python perspective:

In the previous example, {2,} sets m=2 and leaves n unbounded, which means the pattern must repeat at least 2 times.
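A small sketch of how the two quantifiers differ on single tokens, using the standard re module; the variable names are illustrative. With {2,} the token has to be split into at least two iterations and the group keeps only the last one, whereas with {1,} a single iteration can cover the whole token. Note that {1,} also changes what matches: three-character tokens such as 'com' are now accepted.

```python
import re

# {2,}: the engine must fit at least two iterations into 'stack',
# so the group ends up holding only the final piece.
captured_with_2 = re.fullmatch(r'(\w[-\w]+){2,}', 'stack').group(1)

# {1,}: one greedy iteration can consume the whole token.
captured_with_1 = re.fullmatch(r'(\w[-\w]+){1,}', 'stack').group(1)

# Each iteration needs at least 2 characters, so {2,} needs at least 4.
com_matches_with_2 = re.fullmatch(r'(\w[-\w]+){2,}', 'com') is not None
com_matches_with_1 = re.fullmatch(r'(\w[-\w]+){1,}', 'com') is not None

print(captured_with_2, captured_with_1)        # ck stack
print(com_matches_with_2, com_matches_with_1)  # False True
```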
