
Extracting 25 words on either side of a word from text

regex-lookarounds python-3.x regex python

JimminyCricket · 5 years ago

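I have the text below, and I am trying to use this pattern to extract 25 words on each side of the match. The challenge is that the matches overlap, so the Python regex engine returns only one match. I would appreciate it if anyone could help fix this.

Text

2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average

I tried the following pattern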

pattern = r'(?<=outlook\s)((\w+.*?){25})'

This produces one match, whereas I need two matches, regardless of whether one overlaps the other.

I basically need two matches.

2 answers

Patrick Artner · 5 years ago

I would not use regex at all: the Python module re does not handle overlapping ranges...
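To make the limitation concrete: `re.findall` and `re.finditer` return non-overlapping matches, so text consumed by one match cannot be part of the next. The following is a minimal sketch on a toy string (the `sample` text and the three-word window are illustrative assumptions, not the question's data). The lookahead variant is a commonly used workaround for capturing trailing words, shown only for comparison and not as part of this answer's approach.

```python
import re

# Toy input, chosen so that the two keyword windows overlap.
sample = "one outlook two outlook three four"

# Consuming pattern: the keyword plus up to three following words.
# The first match swallows the second "outlook", so only one match comes back.
consuming = re.findall(r"outlook(?:\s+\w+){0,3}", sample)

# Putting the trailing words inside a zero-width lookahead leaves them unconsumed,
# so scanning resumes right after each keyword and both windows are captured.
lookahead = re.findall(r"outlook(?=((?:\s+\w+){0,3}))", sample)

print(consuming)
print(lookahead)
```

Python's `re` lookbehind must be fixed-width, so the same trick does not directly capture a variable number of words before the keyword, which is one reason a split-based approach is simpler here.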

text = """2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"""

lookfor = "outlook"

# split text at spaces
splitted = text.lower().split()

# get the position in splitted where the words match (remove .,-?! for comparison) 
positions = [i for i,w in enumerate(splitted) if lookfor == w.strip(".,-?!")]


# printing here, you can put those slices in a list for later usage
for p in positions:    # positions is: [1, 8, 21]
    print( ' '.join(splitted[max(0,p-26):p+26]) )
    print()

Output:

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs.

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs. revenues - based on the revenues from the fourth quarter of 2014, the

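By iterating over the split words you get the positions, and you can then slice the split list. Make sure the slice starts at 0 when p-26 is below zero, otherwise you will get no output (a negative start such as -4 counts from the end of the list). Note that a start of p-26 takes up to 26 words before the match while the end p+26 takes 25 words after it; use p-25 for exactly 25 words on each side.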
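The split-and-slice idea above could also be wrapped in a reusable function. This is a minimal sketch under stated assumptions: the name `context_windows`, its parameters, and the toy `sample` string are hypothetical and not part of the original answer. It returns the words before and after each occurrence separately and keeps the original casing.

```python
def context_windows(text, keyword, n=25):
    """Return (before, match, after) for each occurrence of keyword in text."""
    words = text.split()
    windows = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.strip(".,-?!").lower() == keyword.lower():
            # Clamp the start at 0 so a negative index does not wrap around.
            before = words[max(0, i - n):i]
            # Slicing past the end of a list is safe and simply yields fewer words.
            after = words[i + 1:i + 1 + n]
            windows.append((before, word, after))
    return windows


# Toy input with a window of two words so the overlap is easy to see.
sample = "a b outlook c d e outlook f"
result = context_windows(sample, "outlook", n=2)
for before, match, after in result:
    print(" ".join(before), "[" + match + "]", " ".join(after))
```

Because each window is sliced independently from the same word list, overlapping windows are not a problem.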

A l w a y s S u n n y · 5 years ago

A non-regex way:

string = "2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"
words = string.split()
starting25 = " ".join(words[:25])
ending25 = " ".join(words[-25:])
print(starting25)
print("\n")
print(ending25)