
Slight regex confusion: behavior of $ when using the multiline flag

regex python

DeepSpace · 6 years ago

!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
not interesting arbitrary text d
!

As you might have guessed, I want to extract a and c. The interesting2 c line is optional, but I only need a if there is also a c (per section).

With the regex !\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>.*?))?$\n(?=!)) I get:

a and c from the first two sections, but (understandably) a and c\nnot interesting arbitrary text d from the last section. See regex101.
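The result described above can be reproduced with a short Python sketch. The flags are an assumption: the pattern only makes sense with re.DOTALL (so that .*? can cross line breaks) and re.MULTILINE (so that $ matches at line ends). The names TEXT, PATTERN and pairs are illustrative only.

```python
import re

# Sample text from the question: three sections delimited by "!" lines.
TEXT = (
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "not interesting arbitrary text d\n"
    "!\n"
)

# Assumed flags: MULTILINE so $ matches before each newline,
# DOTALL so .*? may span several lines.
PATTERN = re.compile(
    r"!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>.*?))?$\n(?=!))",
    re.MULTILINE | re.DOTALL,
)

# One (a, c) pair per matched section.
pairs = [(m.group("a"), m.group("c")) for m in PATTERN.finditer(TEXT)]
for a, c in pairs:
    print(repr(a), repr(c))
```

In the last section the lazy group c keeps growing until the \n(?=!) at the end can succeed, which is why it swallows the following line as well.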

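I suspect this is not the most efficient regex for this case, since it takes 438 steps on this small text, so I am open to any other, more efficient solution that gives the correct result.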

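If I change the regex to !\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n(?=!)) (that is, \w+ instead of .*? in capture group c), c can no longer run past the end of its line (since \w does not match \n).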
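A minimal sketch of the \w+ variant, again assuming the re.MULTILINE | re.DOTALL flags and using illustrative names (TEXT, PATTERN_W, pairs_w). It is only meant to show what Python's re module does with this exact sample text, not how the pattern behaves in general.

```python
import re

# Same sample text as in the question.
TEXT = (
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "not interesting arbitrary text d\n"
    "!\n"
)

# \w+ cannot match "\n", so group c is confined to a single line.
PATTERN_W = re.compile(
    r"!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n(?=!))",
    re.MULTILINE | re.DOTALL,
)

pairs_w = [(m.group("a"), m.group("c")) for m in PATTERN_W.finditer(TEXT)]
for a, c in pairs_w:
    print(repr(a), repr(c))
```

Run this way, c never spans two lines. In the last section, however, the engine backtracks out of the optional group and lets the preceding .*? absorb the remaining lines, so c comes back as None there rather than as "c".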

What I don't understand is how to use $ to capture only up to the end of the interesting2 c line in the last section.


2 answers

Aran-Fey Kevin 6 years ago

"how to use $ to capture only up to the end of the interesting2 c line in the last section"

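That's because you can't: $ is only an anchor. It asserts a position at the end of the string (or, if the regex is in multiline mode, before a newline). It is completely unnecessary for matching a line of text.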
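A small sketch of what "only an anchor" means in practice. The string sample and the variable names are made up for illustration.

```python
import re

sample = "ab\ncd"

# Every match of a bare $ is the empty string: it marks a position only.
multiline_hits = [(m.start(), m.group())
                  for m in re.finditer(r"$", sample, re.MULTILINE)]
default_hits = [(m.start(), m.group())
                for m in re.finditer(r"$", sample)]

# $ consumes nothing, so the newline still has to be matched explicitly
# before the pattern can continue on the next line.
with_newline = re.search(r"b$\nc", sample, re.MULTILINE)
without_newline = re.search(r"b$c", sample, re.MULTILINE)

print(multiline_hits)
print(default_hits)
print(with_newline is not None, without_newline is not None)
```

With re.MULTILINE the anchor matches before the newline and at the end of the string; without it, only at the end of the string. In both cases the matched text is empty.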

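The reason your regex doesn't work is simple: it is missing something that matches the optional extra line. As I said, $ is only an anchor and does not consume any text. So for your (?=!) lookahead to succeed, group c has to match all the text up to the ! character. To prevent that, you have to add something that matches the last line, for example .*? or [^\n]*.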

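In this particular case, though, it is not as simple as adding .*? before the lookahead. Why? Because the group is optional, adding .*? ends up preventing the c group from matching at all: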

!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n.*?(?=!))
                            ^  ^                              ^
                            |  |                              this .*? would grow
                            |  |                              and consume the
                            |  |                              "interesting2 c"
                            |  this group is optional, so it would be skipped
                            this .*? would match the empty string
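The effect sketched in the diagram can be checked directly. This is a minimal sketch that assumes the re.MULTILINE | re.DOTALL flags; TEXT, PATTERN_EXTRA and pairs_extra are illustrative names.

```python
import re

# Sample text from the question.
TEXT = (
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "not interesting arbitrary text d\n"
    "!\n"
)

# Same pattern as in the diagram, with an extra .*? before the lookahead.
PATTERN_EXTRA = re.compile(
    r"!\n(interesting1 (?P<a>.*?)$.*?(?:interesting2 (?P<c>\w+))?$\n.*?(?=!))",
    re.MULTILINE | re.DOTALL,
)

pairs_extra = [(m.group("a"), m.group("c"))
               for m in PATTERN_EXTRA.finditer(TEXT)]
for a, c in pairs_extra:
    print(repr(a), repr(c))
```

Every section still matches, but the optional group is skipped each time and the trailing .*? consumes the interesting2 line, so c is None in all three matches.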

So it is best to rewrite the regex from scratch:

!\ninteresting1 (?P<a>.*)(?:\n[^!].*)*\ninteresting2 (?P<c>.*)

The logic is very simple:

!\ninteresting1 (?P<a>.*) matches and captures a
(?:\n[^!].*)* skips any number of lines that do not start with !
\ninteresting2 (?P<c>.*) matches and captures c

This differs slightly from your regex: it only produces a match if both interesting1 and interesting2 are present in a section. See also the online demo.
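A minimal sketch of the rewritten pattern in Python. No flags are needed, because . must not cross line breaks here. The extra fourth section (interesting1 x without an interesting2 line) is a hypothetical addition to the question's text, included only to show that such a section produces no match.

```python
import re

# The question's three sections, plus one hypothetical section that has
# no "interesting2" line.
TEXT = (
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "!\n"
    "interesting1 a\n"
    "not interesting b\n"
    "interesting2 c\n"
    "not interesting arbitrary text d\n"
    "!\n"
    "interesting1 x\n"      # hypothetical section without interesting2
    "not interesting y\n"
    "!\n"
)

REWRITTEN = re.compile(
    r"!\ninteresting1 (?P<a>.*)"   # section start, capture a
    r"(?:\n[^!].*)*"               # skip lines that do not start with "!"
    r"\ninteresting2 (?P<c>.*)"    # capture c up to the end of its line
)

# findall returns one (a, c) tuple per match because there are two groups.
results = REWRITTEN.findall(TEXT)
print(results)
```

Because . does not match a newline without re.DOTALL, each capture stops at the end of its own line, and the [^!] check keeps the skipping loop from running into the next section.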

yoonghm 6 years ago

I use this:

import re

text=\
"""
!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
!
interesting1 a
not interesting b
interesting2 c
not interesting d
!
"""

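# Match lines starting with "interesting1 " or "interesting2 " and capture the
# single letter that follows; re.MULTILINE lets ^ match at every line start.
pa = re.compile(r'^interesting[12] ([a-zA-Z])', re.MULTILINE)
m = pa.findall(text)
print(m)  # flat list of captured letters, not grouped by section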