I need to match several tokens in a document and get the value and position of each matched token. For non-Unicode text I use this regex:
r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
with finditer, and it works.
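As a quick sanity check of the non-Unicode case (the sample text and word here are my own, not from the code below):

```python
import re

# ASCII case: \b word boundaries locate the whole token and its offsets.
text = "These are oranges and apples, but not pinapples"
word = "apples"
pattern = re.compile(r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)")

spans = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
print(spans)  # the "apples" inside "pinapples" is not matched
```

Note that the `\b` boundaries prevent a match inside "pinapples", because there is no word boundary between "pin" and "apples".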
For Unicode text I have to use a word-boundary-like workaround such as
u"(\s|^)%s(\s|$)" % word
instead. This works in most cases, but it fails when the same word occurs twice in a row, as in "कहते कहते".
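The failure is easy to reproduce in isolation: `(\s|^)` and `(\s|$)` actually consume the surrounding whitespace, so the first match swallows the separator that the second occurrence would need, and finditer reports only one match for two adjacent occurrences (a minimal demonstration using the sentence from the question):

```python
import re

sentence = "तुम मुझे दोस्त कहते कहते हो"   # "कहते" occurs twice, back to back
word = "कहते"

# The whitespace-based "boundary" consumes the space between the two
# occurrences, so the scanner cannot match the second one.
matches = list(re.finditer(r"(\s|^)%s(\s|$)" % word, sentence))
print(len(matches))  # only 1 match, although the word occurs twice
```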
Here is code that reproduces the problem:
import re
import json

document = "These are oranges and apples and and pears, but not pinapples\nThese are oranges and apples and pears, but not pinapples"
# the Unicode (Hindi) document that shows the problem overrides the ASCII one
document = "तुम मुझे दोस्त कहते कहते हो"
sentences = []
seen = {}
lines = document.splitlines()
for index, line in enumerate(lines):
    print("Line:%d %s" % (index, line))
    rgx = re.compile("([\w][\w']*\w)")
    tokens = rgx.findall(line)
    # hard-coded token list for the Hindi sentence
    tokens = ["तुम", "मुझे", "दोस्त", "कहते", "कहते", "हो"]
    print("Tokens:", tokens)
    sentence = {}
    items = []
    for index_word, word in enumerate(tokens):
        my_regex = u"(\s|^)%s(\s|$)" % word
        r = re.compile(my_regex, flags=re.I | re.X | re.UNICODE)
        item = {}
        for m in r.finditer(document):
            token = m.group()
            characterOffsetBegin = m.start()
            characterOffsetEnd = characterOffsetBegin + len(m.group()) - 1
            print("word:%s characterOffsetBegin:%d characterOffsetEnd:%d" % (token, characterOffsetBegin, characterOffsetEnd))
            found = -1
            if word in seen:
                found = seen[word]
            # only accept a match that starts after the last recorded
            # occurrence of this word, then move on to the next token
            if characterOffsetBegin > found:
                seen[word] = characterOffsetBegin
                item['index'] = index_word + 1
                item['word'] = token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    sentence['text'] = line
    sentence['tokens'] = items
    sentences.append(sentence)

print(json.dumps(sentences, indent=4, sort_keys=True))

print("------ testing ------")
text = ''
for sentence in sentences:
    for token in sentence['tokens']:
        text = text + document[token['characterOffsetBegin']:token['characterOffsetEnd'] + 1] + " "
    text = text + '\n'
print(text)
Specifically, for the token कहते I get the same match twice, instead of the next occurrence:
word: कहते characterOffsetBegin:20 characterOffsetEnd:25
word: कहते characterOffsetBegin:20 characterOffsetEnd:25
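For reference, one workaround I would consider (this is my own suggestion, not part of the code above) is to replace the space-consuming groups with zero-width lookarounds: `(?<!\S)word(?!\S)` asserts that the token is not glued to other non-space characters without consuming the separator, so adjacent duplicates each get their own match, and `m.start()`/`m.end()` then give the token's own offsets rather than including the surrounding whitespace:

```python
import re

sentence = "तुम मुझे दोस्त कहते कहते हो"
word = "कहते"

# Zero-width assertions: nothing outside the token itself is consumed,
# so the space between the two occurrences stays available to both matches.
pattern = re.compile(r"(?<!\S)%s(?!\S)" % re.escape(word))
spans = [(m.start(), m.end()) for m in pattern.finditer(sentence)]
print(spans)  # two distinct matches, one per occurrence
```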