
Using regular expressions with the spaCy phrase matcher [duplicate]

  • horcle_buzz  · Tech Community  · 6 years ago

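    I am using a regular expression with spaCy to match numbers and extract them into a pandas DataFrame.

    Problem: pandas receives the extracted numbers, but each value overwrites the previous one instead of being appended. How can I fix this?

    (Original code credit: 亚龙贡)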

    from __future__ import unicode_literals
    import spacy
    import re
    import pandas as pd
    from datetime import date
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
    doc = nlp("This is a sample number: 11. This is second sample number: 1145.")
    NUM_PATTERN = re.compile(r"\d+")
    for match in re.finditer(NUM_PATTERN, doc.text):
        start, end = match.span()
        Number = doc.char_span(start, end)
        print Number
    pandas_attributes = [Number,]
    df = pd.DataFrame(pandas_attributes,
                      columns=['Number'])
    print df
    

    Output:

    11
    1145
      Number
    0   1145
    
    Expected output:
          Number
    0      11
    1      1145
    

    I am also trying to match multiple patterns against a single text.

    from __future__ import unicode_literals
    import spacy
    import re
    import pandas as pd
    from datetime import date
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
    doc = nlp("This is a sample-number: 11. This is second sample number: 1145.")
    NUM_PATTERN = re.compile(r"\d+")
    HYPH_PATTERN = re.compile(r'\w+(?:-)\w+')
    
    for match in re.finditer(NUM_PATTERN, doc.text):
        start, end = match.span()
        Number = doc.char_span(start, end)
        print Number
    
    for match in re.finditer(HYPH_PATTERN, doc.text):
        start, end = match.span()
        Hyph_word = doc.char_span(start, end)
        print Hyph_word
    
    pandas_attributes = [Number,Hyph_word]
    df = pd.DataFrame(pandas_attributes,
                      columns=['Number','Hyphenword'])
    print df
    

    Current output:

    Output:
    11
    1145
    sample-number
    
    AssertionError: 2 columns passed, passed data had 3 columns
    
    Expected output:
    Number  Hyphen_word
    11      sample-number
    1145  
    

    Edit 2: output

                    Number Hyphenword
    0                 (11)     (1145)
    1  (sample, -, number)       None
    
    Expected output:
    
        Number   Hyphenword
    0        11   sample-number
    1      1145   None
    
  •   jezrael    7 years ago

    You need to append each value to a list inside the loop:

    L = []
    for match in re.finditer(NUM_PATTERN, doc.text):
        start, end = match.span()
        L.append(doc.char_span(start, end))
    

    Then pass the list to the DataFrame constructor:

    df = pd.DataFrame(L,columns=['Number'])
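
    To see why the original loop kept only the last value, it may help to compare the two approaches without spaCy or pandas. The following is a minimal sketch that uses only `re` on the sample sentence from the question. The names `last_number` and `numbers` are illustrative and do not appear in the original code, and the matches are stored as plain strings via `match.group()` rather than as spaCy spans.

```python
import re

text = "This is a sample number: 11. This is second sample number: 1145."
NUM_PATTERN = re.compile(r"\d+")

# Overwriting: the name is rebound on every iteration,
# so only the final match is still available after the loop.
last_number = None
for match in NUM_PATTERN.finditer(text):
    last_number = match.group()

# Appending: every match is kept, in order of appearance.
numbers = []
for match in NUM_PATTERN.finditer(text):
    numbers.append(match.group())

print(last_number)  # 1145
print(numbers)      # ['11', '1145']
```

    Building the DataFrame from `numbers` instead of from the loop variable is what yields one row per match.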
    

    You can also append tuples that hold multiple values:

    L = []
    for x in range(3):
        Number = x + 1
        Val = x + 4
        L.append((Number, Val))
    
    print(L)
    # [(1, 4), (2, 5), (3, 6)]
    
    df = pd.DataFrame(L, columns=['Number', 'Val'])
    print(df)
    #    Number  Val
    # 0       1    4
    # 1       2    5
    # 2       3    6
    

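    For multiple patterns, collect each pattern's matches in its own list and append that list to an outer list: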

    PATTERNS = [NUM_PATTERN, HYPH_PATTERN]
    
    pandas_attributes = []
    for pat in PATTERNS:
        L = []
        for match in re.finditer(pat, doc.text):
            start, end = match.span()
            L.append(doc.char_span(start, end))
        pandas_attributes.append(L) 
    
    df = pd.DataFrame(pandas_attributes,
                      index=['Number','Hyphenword']).T
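
    The patterns in the question produce different numbers of matches (two numbers, one hyphenated word), so the columns need padding. One possible alternative to transposing a list of lists is sketched below, under a few assumptions: it stores the matched strings from `match.group()` instead of spaCy `Span` objects, so cells print as plain text rather than as token tuples, and it wraps each list in a `pd.Series` so that pandas aligns the columns on the index and fills the gap with `NaN`. The dictionary layout of `PATTERNS` and the name `matches` are illustrative choices, not part of the original answer.

```python
import re

import pandas as pd

text = "This is a sample-number: 11. This is second sample number: 1145."

# Column name -> compiled pattern (same regexes as in the question).
PATTERNS = {
    "Number": re.compile(r"\d+"),
    "Hyphenword": re.compile(r"\w+-\w+"),
}

# One list of matched strings per pattern; the lists may differ in length.
matches = {
    name: [m.group() for m in pattern.finditer(text)]
    for name, pattern in PATTERNS.items()
}

# Wrapping each list in a Series lets pandas pad the shorter column with NaN.
df = pd.DataFrame(
    {name: pd.Series(values, dtype="object") for name, values in matches.items()}
)
print(df)
```

    If the spaCy spans themselves are needed later (for example, for token attributes), they could be kept in a separate list alongside the strings.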