
spaCy does not tokenize the period

  • Dave  · 4 years ago

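    How can I fix or adjust the fact that spaCy does not split off the sentence-final period when the last "word" is a non-word that contains a period?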

    >>> import spacy
    >>> nlp = spacy.load('en_core_web_md')
    >>> doc = nlp("The Eiffel Tower is located at 48.86N 2.29E.")
    >>> print(doc[-1])
    2.29E.
    >>> print(nlp("The Eiffel Tower is very beautiful.")[-1])
    .
       
    

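    I am trying to extract lat/lon references from a document (named entity recognition), but I cannot find a way to make the extracted entity correspond to the text "48.86N 2.29E" without the final period.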

    I would like to keep all the other usual (English) tokenization rules unchanged.

  • Raqib  · 4 years ago

    You need to register a custom suffix with the tokenizer. This can be done as follows:

    import re
    import spacy
    from spacy.tokenizer import Tokenizer
    
    # Treat a period at the very end of a whitespace-delimited chunk as a suffix.
    suffix_re = re.compile(r'''\.$''')
    
    def custom_tokenizer(nlp):
        # Note: this builds a new Tokenizer with only a suffix rule, so spaCy's
        # default prefix, infix, and exception rules are no longer applied.
        return Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
    
    nlp = spacy.load("en_core_web_sm")
    nlp.tokenizer = custom_tokenizer(nlp)
    
    doc = nlp("The Eiffel Tower is very beautiful.")
    print([t.text for t in doc])
    
    doc2 = nlp("The Eiffel Tower is located at 48.86N 2.29E.")
    print([t.text for t in doc2])
    
    doc3 = nlp("The Eiffel Tower, Norte Dame and Champs Elysee are located at 48.86N 2.29E.")
    print([t.text for t in doc3])
    

    Output:

    ['The', 'Eiffel', 'Tower', 'is', 'very', 'beautiful', '.']
    ['The', 'Eiffel', 'Tower', 'is', 'located', 'at', '48.86N', '2.29E', '.']
    ['The', 'Eiffel', 'Tower,', 'Norte', 'Dame', 'and', 'Champs', 'Elysee', 'are', 'located', 'at', '48.86N', '2.29E', '.']
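
    Note that in the last output 'Tower,' keeps its comma, because the replacement tokenizer only knows the single suffix rule. Since the question asks to keep the other English rules, a possible alternative is to extend the default suffix list instead of replacing the tokenizer. The following is a minimal sketch, not a definitive fix. It assumes that `spacy.blank("en")` is an acceptable stand-in for the loaded model (it should use the same default English tokenizer rules), and the extra pattern `(?<=[0-9][NSEW])\.` is a hypothetical rule written for coordinates such as "2.29E.", where the period follows a digit and a single compass letter.

```python
import spacy
from spacy.util import compile_suffix_regex

# Assumption: a blank English pipeline stands in for the loaded model here;
# swap in spacy.load("en_core_web_sm") if the model is installed.
nlp = spacy.blank("en")

# Hypothetical extra rule: a period preceded by a digit and a compass letter.
# compile_suffix_regex anchors every pattern at the end of the string.
extra_suffixes = [r"(?<=[0-9][NSEW])\."]

# Keep all default suffix rules and append the new one.
suffixes = list(nlp.Defaults.suffixes) + extra_suffixes
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("The Eiffel Tower, Notre Dame and Champs Elysee are located at 48.86N 2.29E.")
tokens = [t.text for t in doc]
print(tokens)
```

    With this approach the default prefix, infix, and exception rules stay in place, so the comma after "Tower" should be split off as usual while the final period is separated from "2.29E". The lookbehind is deliberately narrow so that abbreviations ending in a single capital letter are not affected. Widen or tighten it to match the coordinate formats in your own data.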