
Python regex: position and value of matches in Unicode text

  • loretoparisi  · Tech Community  · 6 years ago

    I have to match multiple occurrences of tokens in a document and get the value and the position of each matched token.

    For non-Unicode text I am using the regex r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)" with finditer, and it works.
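    As a minimal sketch of the ASCII case described above (the sample sentence and variable names here are assumptions for illustration, not the full script), finditer yields match objects that carry both the matched value and its offsets:

```python
import re

# Hypothetical sample sentence, shortened from the question's document.
document = "These are oranges and apples and and pears"
word = "and"

# The question's ASCII pattern: word boundaries around the escaped word.
pattern = re.compile(r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)")

# Each match object exposes the value (group) and position (start/end).
# The end offset is made inclusive, following the question's convention.
matches = [(m.group(), m.start(), m.end() - 1) for m in pattern.finditer(document)]
print(matches)
# → [('and', 18, 20), ('and', 29, 31), ('and', 33, 35)]
```

    Note that the two consecutive "and" tokens are both found here, because \b is a zero-width assertion and does not consume the space between them.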

    For Unicode text I must use a word-boundary-like solution such as u"(\s|^)%s(\s|$)" % word . This works in most cases, but not when I have two consecutive identical words, as in "तुम मुझे दोस्त कहते कहते हो".

    This is the code to reproduce the issue.

    import re
    import json
    
    # an input document of sentences
    document="These are oranges and apples and and pears, but not pinapples\nThese are oranges and apples and pears, but not pinapples"
    
    
    # uncomment to test UNICODE
    document="तुम मुझे दोस्त कहते कहते हो"
    
    sentences=[] # sentences
    seen = {} # maps each token to the offset where it was last seen
    
    # split into sentences
    lines=document.splitlines()
    
    for index,line in enumerate(lines):
    
      print("Line:%d %s" % (index,line))
    
      # split token that are words
      # LP: (for Simon ;P we do not care of punct at all!
      rgx = re.compile(r"([\w][\w']*\w)")
      tokens=rgx.findall(line)
    
      # uncomment to test UNICODE
      tokens=["तुम","मुझे","दोस्त","कहते","कहते","हो"]
    
      print("Tokens:",tokens)
    
      sentence={} # a sentence
      items=[] # word tokens
    
      # for each token word
      for index_word,word in enumerate(tokens):
    
        # uncomment to test UNICODE
        my_regex = u"(\s|^)%s(\s|$)"  % word
        #my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
        r = re.compile(my_regex, flags=re.I | re.X | re.UNICODE)
    
        item = {}
        # for each matched token in sentence
        for m in r.finditer(document):
    
          token=m.group()
          characterOffsetBegin=m.start()
          characterOffsetEnd=characterOffsetBegin+len(m.group()) - 1 # LP: start from 0
    
          print ("word:%s characterOffsetBegin:%d characterOffsetEnd:%d" % (token, characterOffsetBegin, characterOffsetEnd) )
    
          found=-1
          if word in seen:
            found=seen[word]
    
          if characterOffsetBegin > found:
            # store the offset where this word was last seen
            seen[word] = characterOffsetBegin
            item['index']=index_word+1 #// word index starts from 1
            item['word']=token
            item['characterOffsetBegin'] = characterOffsetBegin
            item['characterOffsetEnd'] = characterOffsetEnd
            items.append(item)
            break
    
      sentence['text']=line
      sentence['tokens']=items
      sentences.append(sentence)
    
    print(json.dumps(sentences, indent=4, sort_keys=True))
    
    print("------ testing ------")
    text=''
    for sentence in sentences:
      for token in sentence['tokens']:
        # LP: we get the token from a slice in original text
        text = text + document[token['characterOffsetBegin']:token['characterOffsetEnd']+1] + " "
      text = text + '\n'
    print(text)
    

    Specifically, for the token कहते I get the same match twice instead of the next occurrence.

    word: कहते  characterOffsetBegin:20 characterOffsetEnd:25
    word: कहते  characterOffsetBegin:20 characterOffsetEnd:25
    
    1 answer  |  as of 6 years ago
  •   Wiktor Stribiżew    6 years ago

    For non-Unicode text, you can use a better regex like

    my_regex = r"(?<!\w){}(?!\w)".format(re.escape(word))
    

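    Yours won't work if word starts with a non-word character. The (?<!\w) negative lookbehind fails the match if there is a word character immediately to the left of the current position, and the (?!\w) negative lookahead fails the match if there is a word character immediately to the right of the current position.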
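    To illustrate the difference, here is a small sketch comparing the two patterns on a token that starts with a non-word character. The helper find_spans, the token "#tag", and the sample text are hypothetical and only serve the comparison:

```python
import re

def find_spans(template, word, text):
    """Return (value, start, end_exclusive) for each match of word in text."""
    rx = re.compile(template.format(re.escape(word)))
    return [(m.group(), m.start(), m.end()) for m in rx.finditer(text)]

# Hypothetical sample text for illustration.
text = "use #tag here, not pinapples"

boundary = r"\b(?=\w){}\b(?!\w)"   # pattern from the question
lookaround = r"(?<!\w){}(?!\w)"    # pattern suggested in this answer

# The question's pattern requires a word character at the start of the
# token, so a token beginning with "#" can never match.
no_hit = find_spans(boundary, "#tag", text)

# The lookarounds only require that no word character touches the token.
hit = find_spans(lookaround, "#tag", text)

# A token embedded in a longer word ("pinapples") is still rejected.
embedded = find_spans(lookaround, "apples", text)

print(no_hit, hit, embedded)
# → [] [('#tag', 4, 8)] []
```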

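    The second problem, with the Unicode text regex, is that the second group consumes the whitespace after the word, so that whitespace is no longer available for a subsequent match. Lookarounds are convenient here: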

    my_regex = r"(?<!\S){}(?!\S)".format(re.escape(word))
    

    See this Python demo online .

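    The (?<!\S) negative lookbehind fails the match if there is a non-whitespace character immediately to the left of the current position, and the (?!\S) negative lookahead fails the match if there is a non-whitespace character immediately to the right of the current position.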
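    The effect on the sentence from the question can be checked with a short sketch. The variable names are assumptions, and re.escape is applied to both patterns for safety, which the question's original pattern did not do:

```python
import re

document = "तुम मुझे दोस्त कहते कहते हो"
word = "कहते"

# Pattern from the question: the groups consume the surrounding whitespace,
# so the space between the two tokens is used up by the first match.
consuming = re.compile(r"(\s|^)%s(\s|$)" % re.escape(word))

# Pattern from the answer: lookarounds only assert and consume nothing.
lookaround = re.compile(r"(?<!\S){}(?!\S)".format(re.escape(word)))

consumed_spans = [m.span() for m in consuming.finditer(document)]
token_spans = [m.span() for m in lookaround.finditer(document)]

# Number of occurrences found by each pattern.
print(len(consumed_spans), len(token_spans))

# With lookarounds, each span slices back to exactly the token,
# with no leading or trailing whitespace included.
for start, end in token_spans:
    print(document[start:end], start, end - 1)
```

    Because the lookaround match contains only the token itself, m.start() and m.end() - 1 can be used directly as characterOffsetBegin and characterOffsetEnd.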