
How do I split a string into substrings of a given length without breaking sentences?

  •  1
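  • artem  · Tech Community  · 6 years ago

    I have a string containing a large text and need to split it into multiple substrings of length <= N characters (as close to N as possible; N is always larger than the longest sentence), but I also must not break sentences.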

    For example, if N=80 and the given text is:

    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel.

    the expected substrings are:
    

    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam."
    "Nam sit amet iaculis lacus, non sagittis nulla."
    "Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
    "Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
    

    I would also like this to work with both English and Russian text.

    How can I do this?

    2 answers  |  latest 6 years ago
        1
  •  1
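  •   Joe Iddon    6 years ago

    The steps I would take:

    • Initialize a list to store the lines, and a variable, line, to hold the string for the line currently being built.
    • Split the paragraph into sentences: this requires .split('.'), dropping the trailing empty sentence (""), stripping leading and trailing whitespace (.strip()), and adding the full stop back.
    • Loop over the sentences and:
      • if the sentence fits on the current line, add it to that line
      • otherwise, append the current line string to the list of lines and set the current line string to the current sentence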

    para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
    max_len = 80
    lines = []
    line = ''
    # split('.') drops the full stops, so strip each piece and add one back
    for sentence in (s.strip() + '.' for s in para.split('.') if s.strip()):
        if not line:                                    # nothing on the line yet
            line = sentence
        elif len(line) + 1 + len(sentence) <= max_len:  # fits => add a space then this sentence
            line += ' ' + sentence
        else:                                           # can't fit => start a new line
            lines.append(line)
            line = sentence
    if line:                                            # don't lose the last line
        lines.append(line)
    

    which gives lines as:

    [
     "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam.",
     "Nam sit amet iaculis lacus, non sagittis nulla.",
     "Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
     "Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
    ]
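
    The question also asks about Russian text and, implicitly, about sentences that end in something other than a full stop. Here is a minimal sketch of the same greedy packing wrapped in a function. The name split_text and its max_len parameter are made up for illustration. It assumes a sentence ends with ".", "!", "?" or "…" followed by whitespace, so abbreviations such as "e.g." will be split incorrectly. The regular expression does not depend on the alphabet, so Cyrillic text is handled the same way as Latin text.

```python
import re

# Hypothetical helper (not part of the original answer).
# Assumption: a sentence ends with one of . ! ? … followed by whitespace.
_SENTENCE_END = re.compile(r'(?<=[.!?…])\s+')


def split_text(text, max_len):
    """Greedily pack whole sentences into chunks of at most max_len characters.

    A single sentence longer than max_len is kept whole in its own chunk;
    the question guarantees this never happens (N exceeds the longest sentence).
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks = []
    current = ''
    for sentence in sentences:
        if not current:
            # First sentence of a new chunk.
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_len:
            # Fits, counting the joining space.
            current += ' ' + sentence
        else:
            # Does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == '__main__':
    for chunk in split_text('Привет, мир! Как дела? Все хорошо.', 25):
        print(repr(chunk))
```

    With the sample paragraph from the question and max_len=80 this should reproduce the four expected lines. For text with many abbreviations, a dedicated sentence tokenizer would probably be more reliable than this regular expression.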
    
        2
  •  1
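  •   wizzwizz4    6 years ago

    This starts a new line once the current one has reached the given length, rather than before it would exceed it, so min_length is a lower bound on each line and not an upper bound. The lengths include spaces, because I split naively rather than with a regular expression or anything similar.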

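    def get_sentences(text, min_length):
        # Naive split on ". "; put back the separator that split() removed.
        # The last piece keeps its own trailing "." and is left unchanged.
        pieces = text.split(". ")
        sentences = [piece + ". " for piece in pieces[:-1]] + [pieces[-1]]
        current_line = ""
        for sentence in sentences:
            if len(current_line) >= min_length:  # long enough => emit it and start a new line
                yield current_line
                current_line = sentence
            else:
                current_line += sentence
        yield current_line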
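
    To see how a minimum-length rule differs from the <= N rule in the question, here is a quick check on the sample text. The generator is restated in a form that runs on its own; the sample variable and the printed lengths are only for illustration.

```python
def get_sentences(text, min_length):
    # Naive split on ". "; the last piece already ends with ".".
    pieces = text.split(". ")
    sentences = [piece + ". " for piece in pieces[:-1]] + [pieces[-1]]
    current_line = ""
    for sentence in sentences:
        if len(current_line) >= min_length:
            yield current_line
            current_line = sentence
        else:
            current_line += sentence
    yield current_line


sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Integer in tellus quam. "
    "Nam sit amet iaculis lacus, non sagittis nulla. "
    "Nam blandit quam eget velit maximus, eu consectetur sapien sodales. "
    "Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
)

lines = list(get_sentences(sample, 80))
for line in lines:
    # Lengths include the trailing space kept by the naive split.
    print(len(line), repr(line))
```

    Every line except possibly the last is at least 80 characters long, and a line can run well past 80, because a sentence is always appended before the length is checked. If the hard limit from the question matters, the check has to happen before appending.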