
Is there a generator version of `string.split()` in Python?

  •  100
  • Manoj Govindan  · 14 years ago

    string.split() returns a list. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

    14 replies  |  until 14 years ago
        1
  •  87
  •   ninjagecko    4 years ago

    re.finditer uses fairly little memory overhead.

    import re
    
    def split_iter(string):
        return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
    

    Demo:

    >>> list( split_iter("A programmer's RegEx test.") )
    ['A', "programmer's", 'RegEx', 'test']
    

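    Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (about 1 GB), then iterated over the generator with a for loop.

    More general version:

    To relate this more closely to str.split, here is a more general version: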
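    The memory claim can be checked roughly with the standard tracemalloc module. This is only a minimal sketch and not the answer author's actual test: the helper names (peak_bytes, consume_lazy, consume_eager) and the roughly 1 MB test string are assumptions chosen so the example runs quickly.

```python
import re
import tracemalloc


def split_iter(string):
    # Same idea as the answer: yield one word at a time.
    return (m.group(0) for m in re.finditer(r"[A-Za-z']+", string))


def consume_lazy(text):
    # Iterate with a plain for loop so no list of fragments is ever built.
    count = 0
    for _ in split_iter(text):
        count += 1
    return count


def consume_eager(text):
    # str.split materialises every fragment in one list.
    return len(text.split())


def peak_bytes(consume, text):
    # Peak memory allocated by Python objects while `consume` runs.
    tracemalloc.start()
    consume(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


text = "word " * 200_000  # created before tracing starts, so it is not counted
lazy_peak = peak_bytes(consume_lazy, text)
eager_peak = peak_bytes(consume_eager, text)
print(f"lazy peak:  {lazy_peak} bytes")
print(f"eager peak: {eager_peak} bytes")
```

    The exact numbers depend on the interpreter, so only the relative difference between the two peaks is meaningful.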

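    import re
    
    def splitStr(string, sep=r"\s+"):
        # warning: does not yet work if sep is a lookahead like `(?=b)`
        if sep=='':
            return (c for c in string)
        else:
            return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
    
    # alternatively, more verbosely, as a separate generator function
    # (a `yield` inside splitStr itself would turn it into a generator
    # function and its `return` values would be lost):
    def splitStrVerbose(string, sep=r"\s+"):
        regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
        for match in re.finditer(regex, string):
            fragment = match.group(1)
            yield fragment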
    
    

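    The idea is that ((?!pat).)* "negates" a group by ensuring that it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string).

    Demo: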
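    The ((?!pat).)* idiom can be tried on its own. A minimal sketch follows; the helper name take_until is hypothetical and not part of the answer. Note that `.` does not match a newline unless re.DOTALL is used, so the prefix also stops at the first newline.

```python
import re


def take_until(pattern, text):
    """Return the longest prefix of `text` (up to a newline) in which
    `pattern` never starts to match."""
    # Each step first checks that `pattern` does NOT match here (the
    # lookahead consumes nothing), then consumes exactly one character.
    negated = re.compile(f'(?:(?!{pattern}).)*')
    return negated.match(text).group(0)


print(take_until(r'\.\.\.', 'ab.c...d'))  # ab.c
print(take_until(',', 'no comma here'))   # no comma here
```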

    >>> splitStr('.......A...b...c....', sep='...')
    <generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>
    
    >>> list(splitStr('A,b,c.', sep=','))
    ['A', 'b', 'c.']
    
    >>> list(splitStr(',,A,b,c.,', sep=','))
    ['', '', 'A', 'b', 'c.', '']
    
    >>> list(splitStr('.......A...b...c....', r'\.\.\.'))
    ['', '', '.A', 'b', 'c', '.']
    
    >>> list(splitStr('   A  b  c. '))
    ['', 'A', 'b', 'c.', '']
    
    

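    (Note that str.split has an ugly behavior: it special-cases sep=None by first doing the equivalent of str.strip to remove leading and trailing whitespace. The code above deliberately does not do that; see the last example, where sep is "\s+".)

    (A simpler pattern such as r'(.*?)($|,)' applied to ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''], which has an extraneous empty string at the end compared with str.split.)

    (If you want to implement this yourself for higher performance (regexes are heavyweight, but most importantly they run in C), you would write some code (with ctypes? not sure how to get generators working with it?) following this pseudocode for fixed-length delimiters: hash your delimiter of length L. While scanning the string, keep a running hash of the last L characters using a rolling hash algorithm with O(1) update time. Whenever the hash might equal that of your delimiter, manually check whether the past few characters really are the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem like overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets (if there are few of them), or working from disk with some disk-backed bytestring view object, buying more RAM, etc.)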
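    The pseudocode above can be sketched in pure Python as a Rabin-Karp style rolling hash. This is only an illustration of the idea, not a fast implementation: the function name rolling_hash_split and the base and modulus values are assumptions, and a Python-level loop will be much slower than str.find or a regex.

```python
def rolling_hash_split(text, sep, base=257, mod=(1 << 61) - 1):
    """Lazily split `text` on the fixed-length string `sep` (illustrative sketch)."""
    L = len(sep)
    if L == 0:
        raise ValueError("empty separator")
    n = len(text)
    if n < L:
        yield text
        return

    def full_hash(s, lo):
        # Hash of the L characters s[lo:lo + L], computed from scratch.
        h = 0
        for k in range(lo, lo + L):
            h = (h * base + ord(s[k])) % mod
        return h

    high = pow(base, L - 1, mod)   # weight of the character leaving the window
    sep_hash = full_hash(sep, 0)
    win_hash = full_hash(text, 0)
    start = 0                      # start of the fragment not yet yielded
    i = 0                          # left edge of the current window
    while i + L <= n:
        # Equal hashes only suggest a match, so compare the characters too.
        if win_hash == sep_hash and text[i:i + L] == sep:
            yield text[start:i]
            start = i = i + L
            if i + L <= n:
                win_hash = full_hash(text, i)
            continue
        if i + L < n:
            # O(1) update: drop text[i], append text[i + L].
            win_hash = ((win_hash - ord(text[i]) * high) * base
                        + ord(text[i + L])) % mod
        i += 1
    yield text[start:]


print(list(rolling_hash_split("][brackets][all][][over][here][", "][")))
```

    On these inputs the fragments agree with str.split called with the same separator, including the empty strings at the edges.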

        2
  •  18
  •   Eli Collins    8 years ago

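    The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it is not needed.

    [Edit 2016-8-2: updated this to optionally support regex separators]

    import re
    
    def isplit(source, sep=None, regex=False):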
        """
        generator version of str.split()
    
        :param source:
            source string (unicode or bytes)
    
        :param sep:
            separator to split on.
    
        :param regex:
            if True, will treat sep as regular expression.
    
        :returns:
            generator yielding elements of string.
        """
        if sep is None:
            # mimic default python behavior
            source = source.strip()
            sep = "\\s+"
            if isinstance(source, bytes):
                sep = sep.encode("ascii")
            regex = True
        if regex:
            # version using re.finditer()
            if not hasattr(sep, "finditer"):
                sep = re.compile(sep)
            start = 0
            for m in sep.finditer(source):
                idx = m.start()
                assert idx >= start
                yield source[start:idx]
                start = m.end()
            yield source[start:]
        else:
            # version using str.find(), less overhead than re.finditer()
            sepsize = len(sep)
            start = 0
            while True:
                idx = source.find(sep, start)
                if idx == -1:
                    yield source[start:]
                    return
                yield source[start:idx]
                start = idx + sepsize
    

    This can be used however you like...

    >>> print(list(isplit("abcb", "b")))
    ['a', 'c', '']
    

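    While there is a little cost in seeking within the string each time find() or slicing is performed, this should be minimal, since strings are represented as contiguous arrays in memory.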
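    One practical consequence of the str.find() approach is that a consumer can stop early and the rest of the string is never scanned. A minimal sketch, assuming a compact stand-in splitter (find_split is a hypothetical name, not the isplit function above) and an arbitrary test string:

```python
from itertools import islice


def find_split(source, sep):
    """Hypothetical compact lazy splitter built on str.find with a start offset."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = source.find(sep, start)   # search resumes where the last match ended
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)


big = ",".join(str(n) for n in range(1_000_000))

# Only the first three fields are located; the remainder is left untouched.
first_three = list(islice(find_split(big, ","), 3))
print(first_three)  # ['0', '1', '2']
```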

        3
  •  10
  •   Bernd Petersohn    14 years ago

    This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

    import re
    
    def itersplit(s, sep=None):
        exp = re.compile(r'\s+' if sep is None else re.escape(sep))
        pos = 0
        while True:
            m = exp.search(s, pos)
            if not m:
                if pos < len(s) or sep is not None:
                    yield s[pos:]
                break
            if pos < m.start() or sep is not None:
                yield s[pos:m.start()]
            pos = m.end()
    
    
    sample1 = "Good evening, world!"
    sample2 = " Good evening, world! "
    sample3 = "brackets][all][][over][here"
    sample4 = "][brackets][all][][over][here]["
    
    assert list(itersplit(sample1)) == sample1.split()
    assert list(itersplit(sample2)) == sample2.split()
    assert list(itersplit(sample3, '][')) == sample3.split('][')
    assert list(itersplit(sample4, '][')) == sample4.split('][')
    

    Edit:

        4
  •  10
  •   c z    7 years ago

    • str.split (default = 0.3461570239996945)
    • manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
    • re.finditer
    • str.find
    • itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
    • str.split(..., maxsplit=1)

    The recursive answers (string.split with maxsplit = 1) failed to complete in a reasonable time.

    Testing was done using timeit on:

    the_text = "100 " * 9999 + "100"
    
    def test_function( method ):
        def fn( ):
            total = 0
    
            for x in method( the_text ):
                total += int( x )
    
            return total
    
        return fn
    

    This raises another question: why is str.split so much faster despite its memory usage?

        5
  •  6
  •   Oleh Prypin    12 years ago

    I will just copy the docstring of the main str_split function:


    str_split(s, *delims, empty=None)
    

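    Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for that). This is a generator function.

    When only one delimiter is supplied, the string is simply split by it. empty is then True by default.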

    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'
    

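    When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.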

    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''
    

    When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except that this function is a generator.

    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
    

    import string
    
    def _str_split_chars(s, delims):
        "Split the string `s` by characters contained in `delims`, including the \
        empty parts between two consecutive delimiters"
        start = 0
        for i, c in enumerate(s):
            if c in delims:
                yield s[start:i]
                start = i+1
        yield s[start:]
    
    def _str_split_chars_ne(s, delims):
        "Split the string `s` by longest possible sequences of characters \
        contained in `delims`"
        start = 0
        in_s = False
        for i, c in enumerate(s):
            if c in delims:
                if in_s:
                    yield s[start:i]
                    in_s = False
            else:
                if not in_s:
                    in_s = True
                    start = i
        if in_s:
            yield s[start:]
    
    
    def _str_split_word(s, delim):
        "Split the string `s` by the string `delim`"
        dlen = len(delim)
        start = 0
        try:
            while True:
                i = s.index(delim, start)
                yield s[start:i]
                start = i+dlen
        except ValueError:
            pass
        yield s[start:]
    
    def _str_split_word_ne(s, delim):
        "Split the string `s` by the string `delim`, not including empty parts \
        between two consecutive delimiters"
        dlen = len(delim)
        start = 0
        try:
            while True:
                i = s.index(delim, start)
                if start!=i:
                    yield s[start:i]
                start = i+dlen
        except ValueError:
            pass
        if start<len(s):
            yield s[start:]
    
    
    def str_split(s, *delims, empty=None):
        """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.
    
    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'
    
    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''
    
    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t  bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
        if len(delims)==1:
            f = _str_split_word if empty is None or empty else _str_split_word_ne
            return f(s, delims[0])
        if len(delims)==0:
            delims = string.whitespace
        # validate before the delimiters are merged: after ''.join() every
        # element is a single character and the check could never fail
        if any(len(d)>1 for d in delims):
            raise ValueError("Only 1-character multiple delimiters are supported")
        delims = set(delims) if len(delims)>=4 else ''.join(delims)
        f = _str_split_chars if empty else _str_split_chars_ne
        return f(s, delims)
    

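    This function works in Python 3, and an easy, though rather ugly, fix can be applied to make it work in both Python 2 and 3. The first lines of the function should be changed to: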

    def str_split(s, *delims, **kwargs):
        """...docstring..."""
        empty = kwargs.get('empty')
    
        6
  •  3
  •   Ignacio Vazquez-Abrams    14 years ago

    One could be written using itertools.takewhile().

    Edit:

    A very simple, half-baked implementation:

    import itertools
    import string
    
    def isplitwords(s):
      i = iter(s)
      while True:
        r = []
        for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
          r.append(c)
        else:
          if r:
            yield ''.join(r)
            continue
          else:
            # a bare return ends the generator; raising StopIteration here
            # becomes a RuntimeError on Python 3.7+ (PEP 479)
            return
    
        7
  •  3
  •   David Webb    14 years ago

    split()

    If you wanted to write one, it would be fairly easy:

    import string
    
    def gsplit(s,sep=string.whitespace):
        word = []
    
        for c in s:
            if c in sep:
                if word:
                    yield "".join(word)
                    word = []
            else:
                word.append(c)
    
        if word:
            yield "".join(word)
    
        8
  •  3
  •   dshepherd    9 years ago

    import re
    
    def isplit(string, delimiter = None):
        """Like string.split but returns an iterator (lazy)
    
        Multiple character delimiters are not handled.
        """
    
        if delimiter is None:
            # Whitespace delimited by default
            delim = r"\s"
    
        elif len(delimiter) != 1:
            raise ValueError("Can only handle single character delimiters",
                            delimiter)
    
        else:
            # Escape, in case it's "\", "*" etc.
            delim = re.escape(delimiter)
    
        return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
    

    Here are the tests I used (in both Python 3 and Python 2):

    # Wrapper to make it a list
    def helper(*args,  **kwargs):
        return list(isplit(*args, **kwargs))
    
    # Normal delimiters
    assert helper("1,2,3", ",") == ["1", "2", "3"]
    assert helper("1;2;3,", ";") == ["1", "2", "3,"]
    assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]
    
    # Whitespace
    assert helper("1 2 3") == ["1", "2", "3"]
    assert helper("1\t2\t3") == ["1", "2", "3"]
    assert helper("1\t2 \t3") == ["1", "2", "3"]
    assert helper("1\n2\n3") == ["1", "2", "3"]
    
    # Surrounding whitespace dropped
    assert helper(" 1 2  3  ") == ["1", "2", "3"]
    
    # Regex special characters
    assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
    assert helper(r"1*2*3", "*") == ["1", "2", "3"]
    
    # No multi-char delimiters allowed
    try:
        helper(r"1,.2,.3", ",.")
        assert False
    except ValueError:
        pass
    

    The re module reportedly does "the right thing" for unicode whitespace, but I have not actually tested it.

    Also available as a gist.

        9
  •  3
  •   reubano    9 years ago

    If you would also like to be able to read an iterator (as well as return one), try this:

    import itertools as it
    
    def iter_split(string, sep=None):
        sep = sep or ' '
        groups = it.groupby(string, lambda s: s != sep)
        return (''.join(g) for k, g in groups if k)
    

    Usage:

    >>> list(iter_split(iter("Good evening, world!")))
    ['Good', 'evening,', 'world!']
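    Because this version accepts any iterator of characters, it can be fed from a stream that is read in chunks rather than from a string held in memory. A minimal sketch, assuming a hypothetical iter_chars helper and a small io.StringIO standing in for a large file:

```python
import io
import itertools as it


def iter_chars(stream, chunk_size=4096):
    """Hypothetical helper: yield characters from a text stream, one chunk at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield from chunk


def iter_split(chars, sep=' '):
    # Same groupby idea as the answer: runs of non-separator characters are
    # joined into words, runs of separators are dropped.
    groups = it.groupby(chars, lambda c: c != sep)
    return (''.join(g) for k, g in groups if k)


stream = io.StringIO("Good evening, world!")
words = list(iter_split(iter_chars(stream, chunk_size=5)))
print(words)  # ['Good', 'evening,', 'world!']
```

    Words that straddle a chunk boundary are still joined correctly, because groupby only ever sees a flat sequence of characters.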
    
        10
  •  3
  •   blacksite    5 years ago

    more_itertools.split_at offers an analog to str.split for iterators.

    >>> import more_itertools as mit
    
    
    >>> list(mit.split_at("abcdcba", lambda x: x == "b"))
    [['a'], ['c', 'd', 'c'], ['a']]
    
    >>> "abcdcba".split("b")
    ['a', 'cdc', 'a']
    

    more_itertools is a third-party package.

        11
  •  2
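  •   Veltzer Doron    6 years ago

    I wanted to show how to use the finditer solution to return a generator for a given delimiter, and then use the pairwise recipe from itertools to build a previous/next iteration that yields the actual words, as the original split method does.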


    from more_itertools import pairwise
    import re
    
    string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
    delimiter = " "
    # split according to the given delimiter including segments beginning at the beginning and ending at the end
    for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
        print(string[prev.end(): curr.start()])
    

    1. This is quite efficient.
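    The same idea can be written without the third-party import. This is a minimal sketch, assuming Python 3.7 or later for the handling of empty regex matches; pairwise here is the classic itertools recipe built on tee, and finditer_split is a hypothetical wrapper name. Runs of consecutive delimiters are collapsed by the [...]+ pattern, so the result matches str.split(delimiter) only when delimiters do not repeat.

```python
import re
from itertools import tee


def pairwise(iterable):
    """The classic itertools recipe: s -> (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def finditer_split(string, delimiter=" "):
    # Boundaries are: start of string, each run of delimiters, end of string.
    pattern = "^|[{0}]+|$".format(re.escape(delimiter))
    boundaries = re.finditer(pattern, string)
    # Each word lies between the end of one boundary and the start of the next.
    for prev, curr in pairwise(boundaries):
        yield string[prev.end():curr.start()]


string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
print(list(finditer_split(string)))
```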
        12
  •  1
  •   Tavy    4 years ago

    The dumbest method, without regex/itertools:

    def isplit(text, split='\n'):
        while text != '':
            end = text.find(split)
    
            if end == -1:
                yield text
                text = ''
            else:
                yield text[:end]
                text = text[end + len(split):]  # skip the whole separator, not just one character
    
        13
  •  1
  •   David Rissato Cruz    3 years ago

    from typing import Iterable
    
    def str_split(text: str, separator: str) -> Iterable[str]:
        i = 0
        n = len(text)
        while i <= n:
            j = text.find(separator, i)
            if j == -1:
                j = n
            yield text[i:j]
            i = j + len(separator)  # also correct for multi-character separators
    
        14
  •  0
  •   travelingbones    11 years ago
    def split_generator(f,s):
        """
        f is a string, s is the single character we split on.
        This produces a generator rather than a possibly
        memory intensive list. 
        """
        i=0
        j=0
        while j<len(f):
            if i>=len(f):
                yield f[j:]
                j=i
            elif f[i] != s:
                i=i+1
            else:
                yield f[j:i]  # yield the fragment itself, not a one-element list
                j=i+1
                i=i+1
    
        15
  •  0
  •   Narcisse Doudieu Siewe    5 years ago

    Here is a simple answer:

    def gen_str(some_string, sep):
        # sep must be a single character
        j=0
        for i,s in enumerate(some_string):
            if s == sep:
               yield some_string[j:i]
               j=i+1
        # emit the final fragment (empty if the string ends with sep),
        # which matches what str.split(sep) returns
        yield some_string[j:]
    
        16
  •  0
  •   Apalala    3 years ago

    import re
    
    def isplit(text, sep=None, maxsplit=-1):
        if not isinstance(text, (str, bytes)):
            raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
        if sep in ('', b''):
            raise ValueError('empty separator')
    
        if maxsplit == 0 or not text:
            yield text
            return
    
        regex = (
            re.escape(sep) if sep is not None
            else [br'\s+', r'\s+'][isinstance(text, str)]
        )
        # note: re.split builds the complete list before anything is yielded
        yield from re.split(regex, text, maxsplit=max(0, maxsplit))
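    Since re.split returns a full list, the function above is a generator only in form. A lazy variant that still honours maxsplit could be built on re.finditer. The sketch below is one possible approach rather than the answer author's code: lazy_split is a hypothetical name, and it assumes a literal, non-empty separator and does not cover the sep=None whitespace case.

```python
import re


def lazy_split(text, sep, maxsplit=-1):
    """Hypothetical lazy counterpart: yield pieces of `text` split on literal `sep`."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    splits = 0
    for m in re.finditer(re.escape(sep), text):
        # A negative maxsplit means "no limit", as in str.split.
        if 0 <= maxsplit <= splits:
            break
        yield text[start:m.start()]
        start = m.end()
        splits += 1
    # Whatever follows the last split point is returned unsplit.
    yield text[start:]


print(list(lazy_split("a,b,c", ",", maxsplit=1)))  # ['a', 'b,c']
```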