
Multiline pattern matching in Python

  •  0
  • ohho  · Tech Community  · 15 years ago

    A message generated periodically by a computer (simplified):

    Hello user123,
    
    - (604)7080900
    - 152
    - minutes
    
    Regards
    

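    Using Python, how can I extract "(604)7080900", "152" and "minutes" (i.e. any text following a leading "- " pattern) from between the two empty lines (the empty lines being the \n\n after "Hello user123" and the \n before "Regards")? It would be even better if the resulting strings were stored in an array. Thanks!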

    Edit: the number of lines between the two empty lines is not fixed.

    Second edit:

    For example:

    hello
    
    - x1
    - x2
    - x3
    
    - x4
    
    - x6
    morning
    - x7
    
    world
    

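    x1, x2 and x3 are good because those lines are enclosed by two empty lines; x4 is good for the same reason. x6 is not good because no empty line follows it, and x7 is not good because no empty line precedes it. x2 is good (unlike x6 and x7) because the line before it is a good line and the line after it is also a good line.

    These conditions may not have been clear when I posted the question: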

    a continuous run of good lines between 2 empty lines
    
    good line must have leading "- "
    good line must follow an empty line or follow another good line
    good line must be followed by an empty line or followed by another good line
    

    Thanks

    4 answers  |  most recent 13 years ago
        1
  •  3
  •   Thomas Wouters    15 years ago

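    The simplest approach is to iterate over the lines (assuming you have a list of lines or a file, or you split the string into a list of lines) until you see an empty line, then check that each following line starts with '- ' (using the startswith string method) and slice that prefix off, storing the results until you reach another empty line. For example: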

    # if you have a single string, split it into lines.
    L = s.splitlines()
    # if you (now) have a list of lines, grab an iterator so we can continue
    # iteration where it left off.
    it = iter(L)
    # Alternatively, if you have a file, just use that directly instead
    # (left commented out so it does not replace the iterator above):
    # it = open(....)
    
    # Find the first empty line:
    for line in it:
        # Treat lines of just whitespace as empty lines too. If you don't want
        # that, do 'if line == ""'.
        if not line.strip():
            break
    # The data starts here; collect the extracted values in a list.
    data = []
    for line in it:
        if not line.rstrip():
            # End of data.
            break
        if line.startswith('- '):
            # line[2:] drops the leading '- ' (line[:2] would keep only the prefix).
            data.append(line[2:].rstrip())
        else:
            # misformed data?
            raise ValueError("misformed line %r" % (line,))
    

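    Edit: now that you have spelled out what you want, here is an updated version of the loop. Instead of looping twice, it collects data until it hits a 'bad' line, and saves or discards the collected lines when it reaches a block separator. It needs no explicit iterator because it never restarts the iteration, so you can just pass it a list of lines (or any iterable):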

    def getblocks(L):
        # The list of good blocks (as lists of lines.) You can also make this
        # a flat list if you prefer.
        data = []
        # The list of good lines encountered in the current block
        # (but the block may still become bad.)
        block = []
        # Whether the current block is bad.
        bad = 1
        for line in L:
            # Not in a 'good' block, and encountering the block separator.
            if bad and not line.rstrip():
                bad = 0
                block = []
                continue
            # In a 'good' block and encountering the block separator.
            if not bad and not line.rstrip():
                # Save 'good' data. Or, if you want a flat list of lines,
                # use 'extend' instead of 'append' (also below.)
                data.append(block)
                block = []
                continue
            if not bad and line.startswith('- '):
                # A good line in a 'good' (not 'bad' yet) block; save the line,
                # minus
                # '- ' prefix and trailing whitespace.
                block.append(line[2:].rstrip())
                continue
            else:
                # A 'bad' line, invalidating the current block.
                bad = 1
        # Don't forget to handle the last block, if it's good
        # (and if you want to handle the last block.)
        if not bad and block:
            data.append(block)
        return data
    

    Here it is in action:

    >>> L = """hello
    ...
    ... - x1
    ... - x2
    ... - x3
    ...
    ... - x4
    ...
    ... - x6
    ... morning
    ... - x7
    ...
    ... world""".splitlines()
    >>> print(getblocks(L))
    [['x1', 'x2', 'x3'], ['x4']]
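
The same rules can also be sketched with itertools.groupby, which groups consecutive blank and non-blank lines. This is only an illustrative alternative, not part of the original answer: the name good_blocks is made up here, and unlike getblocks it does not accept a final block that runs to the end of the input without a trailing blank line.

```python
from itertools import groupby

def good_blocks(lines):
    """Return runs of '- ' lines that have a blank line on both sides."""
    # Group consecutive lines by whether they are blank (whitespace-only).
    groups = [(blank, list(run))
              for blank, run in groupby(lines, key=lambda ln: not ln.strip())]
    result = []
    for i, (blank, run) in enumerate(groups):
        if blank:
            continue
        # Groups alternate blank/non-blank, so a non-blank run is enclosed
        # exactly when it is neither the first nor the last group.
        enclosed = 0 < i < len(groups) - 1
        if enclosed and all(ln.startswith('- ') for ln in run):
            # Strip the '- ' prefix and any trailing whitespace.
            result.append([ln[2:].rstrip() for ln in run])
    return result

text = """hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world
"""
print(good_blocks(text.splitlines()))  # [['x1', 'x2', 'x3'], ['x4']]
```

Because a whole non-blank run is judged at once, a single bad line such as "morning" discards every "- " line in the same run.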
    
        2
  •  4
  •   YOU    15 years ago
    >>> import re
    >>>
    >>> x="""Hello user123,
    ...
    ... - (604)7080900
    ... - 152
    ... - minutes
    ...
    ... Regards
    ... """
    >>>
    >>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x)
    [('(604)7080900', '152', 'minutes')]
    >>>
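
The pattern above is tied to exactly three lines ending in "minutes". Since the question says the number of lines is not fixed, here is a hedged sketch of a variable-length variant: a lookbehind and a lookahead require a blank line on each side of a run of "- " lines. The names BLOCK and extract are invented for this illustration, and the sketch assumes "\n" line endings and blank lines that are truly empty (no stray spaces).

```python
import re

# One or more whole "- ..." lines, preceded by a blank line ("\n\n" just
# before the run) and followed by a blank line ("\n" right after the run).
BLOCK = re.compile(r'(?<=\n\n)(?:- .*\n)+(?=\n)')

def extract(text):
    """Return one list of values per enclosed block of '- ' lines."""
    return [[line[2:].rstrip() for line in block.splitlines()]
            for block in BLOCK.findall(text)]

x = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
print(extract(x))  # [['(604)7080900', '152', 'minutes']]
```

A run that is immediately followed by a non-blank line, or that sits at the very end of the text with no blank line after it, is not matched.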
    
        3
  •  1
  •   SilentGhost    15 years ago
    >>> s = """Hello user123,
    
    - (604)7080900
    - 152
    - minutes
    
    Regards
    """
    >>> import re
    >>> re.findall(r'^- (.*)', s, re.M)
    ['(604)7080900', '152', 'minutes']
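
Note that this pattern looks at each line on its own, so it does not enforce the blank-line conditions from the question's second edit. A small check on that example (the variable names here are just for illustration) shows x6 and x7 being picked up as well:

```python
import re

s = """hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world
"""
# re.M makes ^ match at the start of every line, so every "- " line
# matches regardless of what surrounds it.
flat = re.findall(r'^- (.*)', s, re.M)
print(flat)  # ['x1', 'x2', 'x3', 'x4', 'x6', 'x7']
```

This is fine for the simplified message in the original question, where every "- " line is wanted.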
    
        4
  •  1
  •   remosu    15 years ago
    s = """Hello user123,
    
    - (604)7080900
    - 152
    - minutes
    
    Regards  
    
    Hello user124,
    
    - (604)8576576
    - 345
    - minutes
    - seconds
    - bla
    
    Regards"""
    

    Do this:

    result = []
    for data in s.split('Regards'): 
        result.append([v.strip() for v in data.split('-')[1:]])
    del result[-1] # remove empty list at end
    

    And you get:

    >>> result
    [['(604)7080900', '152', 'minutes'],
    ['(604)8576576', '345', 'minutes', 'seconds', 'bla']]
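
One caveat: splitting on '-' also splits values that contain a hyphen themselves, such as a phone number written as 604-7080900. A minimal sketch of a line-based variant follows, assuming each message still ends with the word "Regards"; the function name fields_per_message and its closing parameter are hypothetical, not from the answer above.

```python
def fields_per_message(text, closing='Regards'):
    """Return one list of '- ' values per message, keeping inner hyphens."""
    result = []
    for chunk in text.split(closing):
        # Only a leading "- " marks a field; hyphens inside the value stay.
        fields = [line[2:].strip() for line in chunk.splitlines()
                  if line.startswith('- ')]
        if fields:
            result.append(fields)
    return result

sample = """Hello user123,

- 604-7080900
- 152

Regards

Hello user124,

- x-y

Regards"""
print(fields_per_message(sample))  # [['604-7080900', '152'], ['x-y']]
```

Skipping empty chunks replaces the `del result[-1]` step, so the function also works when the text does not end with "Regards".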