
Regex to match up to the last line that starts with `- [`

substring python-3.x string regex python

singmotor · 6 years ago

I have a block of body text that contains a GitHub Markdown task list, formatted like this:

**HEADERONE**
- [x] Logged In
- [ ] Logged Out
- [x] Spun Around
- [x] Did the hokey pokey

But that list is surrounded by other, similar junk:

A body paragraph about other things. Lorem ipsom and all that

**HEADERONE**
- [x] Logged In
- [ ] Logged Out
- [x] Spun Around
- [x] Did the hokey pokey

Maybe a link here www.go_ogle.com 

Another list that isn't important
- [ ] Thing one
- [ ] Thing two
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo

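I could chop up the string programmatically after grabbing it, but I'm curious whether there is a really clean way to grab just my list. The header is always the same, so matching from **HEADERONE** up to the first run of two newlines works fine. Matching from **HEADERONE** up to the end of the last line that starts with - [ would be magic, though.

I am using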

\*\*HEADERONE\*\*[^*]*?(?=\n{2})

But while that works in Regex101, re.search("\*\*HEADERONE\*\*[^*]*?(?=\n{2})",body) returns nothing for some reason. So I switched to

\*\*HEADERONE\*\*[\S\s]*?(?=\n{2})

But that grabs too much and includes the second list. Any ideas?

3 answers | 6 years ago

Wiktor Stribiżew 6 years ago

Although replacing (?=\n{2}) with (?=(?:\r\n){2}) would fix the issue, because the input has CRLF line endings, I suggest using a more precise pattern:

import re

# s is the string that holds the body text
m = re.search(r'^\*\*HEADERONE\*\*(?:\r?\n-\s*\[[^][]*].*)*', s, re.M)
if m:
    print(m.group())

See the regex demo and the Python demo.
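The CRLF effect mentioned above can be reproduced with a small sketch. The sample `body` below is made up for illustration and is assumed to use Windows line endings; the two patterns are the question's original lookahead and the CRLF variant suggested in this answer.

```python
import re

# Hypothetical sample body with CRLF line endings (an assumption for this demo).
body = (
    "A body paragraph\r\n\r\n"
    "**HEADERONE**\r\n"
    "- [x] Logged In\r\n"
    "- [ ] Logged Out\r\n"
    "\r\n"
    "Another list\r\n"
    "- [ ] Thing one\r\n"
)

# The question's pattern: the lookahead wants two consecutive "\n" characters.
lf_pattern = r"\*\*HEADERONE\*\*[^*]*?(?=\n{2})"
# The variant from this answer: the lookahead wants two consecutive "\r\n" pairs.
crlf_pattern = r"\*\*HEADERONE\*\*[^*]*?(?=(?:\r\n){2})"

lf_match = re.search(lf_pattern, body)      # None: "\r\n\r\n" never contains "\n\n"
crlf_match = re.search(crlf_pattern, body)  # matches the header plus its two list lines

print(lf_match)
print(repr(crlf_match.group()))
```

With CRLF input a blank line is the sequence `\r\n\r\n`, so the two `\n` characters are never adjacent and the first lookahead cannot succeed.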

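Explanation

^ - start of a line ( re.M redefines the behavior of the ^ anchor)
\*\*HEADERONE\*\* - the literal string **HEADERONE**
(?:\r?\n-\s*\[[^][]*].*)* - zero or more consecutive repetitions of:
- \r?\n - a CRLF or LF line ending
- - - a hyphen
- \s* - 0+ whitespace characters
- \[ - a [ char
- [^][]* - 0+ chars other than ] and [
- ] - a ] char
- .* - the rest of the line.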
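As a sketch, the same pattern can be written with `re.X` (verbose mode) so that each piece of the explanation sits next to the part of the regex it describes. The names `PATTERN`, `sample`, and `block` are illustrative, and the sample text is a shortened, made-up version of the question's input.

```python
import re

# The pattern from this answer, spelled out in verbose mode.
PATTERN = re.compile(
    r"""
    ^\*\*HEADERONE\*\*      # the literal header at the start of a line
    (?:                     # one list line, repeated zero or more times:
        \r?\n               #   a CRLF or LF line ending
        -\s*                #   a hyphen and optional whitespace
        \[[^][]*]           #   a checkbox such as [x] or [ ]
        .*                  #   the rest of the line
    )*
    """,
    re.M | re.X,
)

# Hypothetical sample with LF line endings.
sample = (
    "A body paragraph\n\n"
    "**HEADERONE**\n"
    "- [x] Logged In\n"
    "- [ ] Logged Out\n"
    "\n"
    "Another list that isn't important\n"
    "- [ ] Thing one\n"
)

match = PATTERN.search(sample)
block = match.group() if match else None
print(block)
```

One detail worth knowing: on CRLF input `.*` also consumes the `\r` before each `\n`, so the matched text can end with a trailing `\r`. Calling `.rstrip()` on the result is one way to drop it.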

In addition, here is a non-regex way to collect all the matching blocks from a file:

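res = []
tmp = []
inblock = False
for line in f:  # f is a handle to the open file, or use s.splitlines() to split the string s into lines
    line = line.rstrip()  # drop the trailing line break so the comparisons below work
    if line == '**HEADERONE**':
        if tmp:  # a header right after a block: store the previous block first
            res.append("\n".join(tmp))
        tmp = [line]
        inblock = True
    elif inblock and line.startswith("- ["):
        tmp.append(line)
    else:
        if tmp:
            res.append("\n".join(tmp))
            tmp = []
        inblock = False
if tmp:  # store a block that runs to the end of the input
    res.append("\n".join(tmp))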

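See the Python demo online. Basically, once **HEADERONE** is found, every following line that starts with - [ is appended to tmp, and the collected lines are then joined and added to the res list.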

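dawg 6 years ago

You can match from \*\*HEADERONE\*\* up to the first blank line like this (with the re.M flag, so that ^ and $ match at line boundaries):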

^(\*\*HEADERONE\*\*[\s\S]*?)^\s*$

Demo

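The [\s\S]*? matches any characters, including newlines, up to the first blank line. If the block might run to the end of the string without a blank line after it, that test can easily be added to the pattern: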

^(\*\*HEADERONE\*\*[\s\S]*?)(?:^\s*$|\Z)

Demo
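A minimal sketch of how the second pattern might be used from Python. It assumes `re.M` is passed, since the `^` and `$` anchors only match at line boundaries with that flag; the helper name `first_header_block` and the sample strings are made up for illustration.

```python
import re

# The pattern from this answer, compiled with re.M.
BLOCK_RE = re.compile(r"^(\*\*HEADERONE\*\*[\s\S]*?)(?:^\s*$|\Z)", re.M)


def first_header_block(text):
    """Return the first **HEADERONE** block, or None if there is none."""
    m = BLOCK_RE.search(text)
    # Group 1 can include the line break before the blank line, so strip it.
    return m.group(1).rstrip() if m else None


# Hypothetical input where a blank line ends the block.
with_blank_line = (
    "Intro paragraph\n\n"
    "**HEADERONE**\n"
    "- [x] Logged In\n"
    "- [ ] Logged Out\n"
    "\n"
    "Another list\n"
    "- [ ] Thing one\n"
)
# Hypothetical input where the block runs to the end of the string.
at_end_of_string = "**HEADERONE**\n- [x] Logged In"

print(first_header_block(with_blank_line))
print(first_header_block(at_end_of_string))
```

The `\Z` alternative is what lets the second input match even though no blank line follows the list.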

If you want a non-regex Python way to get the block, and blocks are separated by two or more newlines, you can do this:

print('\n'.join(block for block in s.replace('\r\n', '\n').split('\n\n') if block.lstrip().startswith('**HEADERONE**')))

Try it online

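Or, if you have a file:

print('\n'.join(block for block in fo.read().split('\n\n') if block.lstrip().startswith('**HEADERONE**')))

Here fo is the file opened in text mode with universal newlines, so \r\n is already translated to \n on read. The answer specified the 'U' mode flag for this; in Python 3, universal newlines is the default for text mode.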

DYZ 6 years ago

regex = r'\*\*HEADERONE\*\*(?:\n.+)+'
#^^^ HEADER followed by ONE newline and some other stuff
results = re.findall(regex, text)
print(results[0])
#**HEADERONE**
#- [x] Logged In
#- [ ] Logged Out
#- [x] Spun Around
#- [x] Did the hokey pokey
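The pattern above expects LF line endings: with CRLF input the header is followed by `\r`, so `\n.+` finds nothing and `results[0]` would raise an IndexError. Since the first answer notes that the question's input uses CRLF, one possible adaptation is to normalize the line endings first. This is only a sketch; the helper name `find_header_blocks` and the sample text are assumptions for illustration.

```python
import re


def find_header_blocks(text, header="**HEADERONE**"):
    """Return every run of non-empty lines that starts with `header`."""
    # Convert CRLF to LF so that "\n.+" sees plain newlines.
    normalized = text.replace("\r\n", "\n")
    # re.escape turns the asterisks in the header into literal characters.
    regex = re.escape(header) + r"(?:\n.+)+"
    return re.findall(regex, normalized)


# Hypothetical sample with CRLF line endings.
crlf_text = (
    "Intro paragraph\r\n\r\n"
    "**HEADERONE**\r\n"
    "- [x] Logged In\r\n"
    "- [ ] Logged Out\r\n"
    "\r\n"
    "Another list\r\n"
    "- [ ] Thing one\r\n"
)

blocks = find_header_blocks(crlf_text)
print(blocks[0] if blocks else "no block found")
```

Checking that the list is non-empty before indexing avoids the IndexError when the header is missing.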