Is there a good, better, or more direct way to get the chunking result from an NLTK tree?

  • Lerner Zhang  · 6 years ago

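    I want to chunk a sentence into groups taken at a certain height of the parse tree. The original word order should be preserved, and every original word should appear in the result.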

    import nltk 
    height = 2
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    
    pattern = """NP: {<DT>?<JJ>*<NN>}
    VBD: {<VBD>}
    IN: {<IN>}"""
    NPChunker = nltk.RegexpParser(pattern) 
    result = NPChunker.parse(sentence)
    
    In [29]: nltk.Tree.fromstring(str(result)).pretty_print()
                                 S                                      
                _________________|_____________________________          
               NP                        VBD       IN          NP       
       ________|_________________         |        |      _____|____     
    the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN
    

    My current approach is somewhat brute-force, as shown below:

    In [30]: [list(map(lambda x: x[0], _tree.leaves())) for _tree in result.subtrees(lambda x: x.height()==height)]
    Out[30]: [['the', 'little', 'yellow', 'dog'], ['barked'], ['at'], ['the', 'cat']]
    

    1 answer

  • alvas  · 6 years ago

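    No, NLTK has no built-in function that returns the subtrees at a specific depth. For a depth-first traversal to start from, see the related question:

    How to Traverse an NLTK Tree object?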

    To be more efficient, you can walk the tree from the top and descend only as far as the required height, for example:

    import nltk 
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    
    pattern = """NP: {<DT>?<JJ>*<NN>}
    VBD: {<VBD>}
    IN: {<IN>}"""
    NPChunker = nltk.RegexpParser(pattern) 
    result = NPChunker.parse(sentence)
    
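    def traverse_tree(tree, depth=float('inf')):
        """ 
        Iterate over the direct children of `tree` in order and
        yield the leaves of each subtree whose height is at most `depth`.
        """
        for subtree in tree:
            # Bare leaves (tuples or strings) are not Tree objects; skip them.
            if isinstance(subtree, nltk.Tree):
                # Subtrees taller than `depth` are skipped, not descended into.
                if subtree.height() <= depth:
                    yield subtree.leaves()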
    
    
    list(traverse_tree(result, 2))
    

    [[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')],
     [('barked', 'VBD')],
     [('at', 'IN')],
     [('the', 'DT'), ('cat', 'NN')]]
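
    The threshold of 2 in the call above works because `Tree.height()` counts a leaf as height 1, so a chunk that contains only words has height 2. A minimal sketch to check this, assuming the same sentence written out as a bracketed string (the `flat` and `heights` names are only for illustration):

```python
from nltk.tree import Tree

# The same chunked sentence as above, written as a bracketed string.
flat = Tree.fromstring(
    "(S (NP the/DT little/JJ yellow/JJ dog/NN) "
    "(VBD barked/VBD) (IN at/IN) (NP the/DT cat/NN))"
)

# subtrees() yields the tree itself first, then its subtrees in pre-order.
heights = [(t.label(), t.height()) for t in flat.subtrees()]
print(heights)
# [('S', 3), ('NP', 2), ('VBD', 2), ('IN', 2), ('NP', 2)]
```

    Every chunk sits at height 2 here, so filtering on height 2 returns all four groups.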
    

    Another example:

    x = """(S
      (NP the/DT 
          (AP little/JJ yellow/JJ)
           dog/NN)
      (VBD barked/VBD)
      (IN at/IN)
      (NP the/DT cat/NN))"""
    
    list(traverse_tree(nltk.Tree.fromstring(x), 2))
    

    [Output]:

    [['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]
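
    The first NP is missing from this output because it contains a nested AP, which makes its height 3. If every word has to be kept, as the question asks, one possible variant is to descend into subtrees that are too tall and to emit words that hang directly under them as single-word chunks. This is only a sketch under that assumption; `chunks_at_height` is a hypothetical helper name, not part of NLTK:

```python
from nltk.tree import Tree


def chunks_at_height(tree, height=2):
    """Yield lists of leaves in sentence order so that every leaf is covered."""
    for child in tree:
        if not isinstance(child, Tree):
            # A bare word directly under a tall node becomes its own chunk.
            yield [child]
        elif child.height() <= height:
            # Small enough: emit the whole subtree as one chunk.
            yield child.leaves()
        else:
            # Too tall: descend and chunk its children instead.
            yield from chunks_at_height(child, height)


nested = Tree.fromstring("""(S
  (NP the/DT
      (AP little/JJ yellow/JJ)
       dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))""")

chunks = list(chunks_at_height(nested, 2))
print(chunks)
# [['the/DT'], ['little/JJ', 'yellow/JJ'], ['dog/NN'],
#  ['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]
```

    The trade-off is that the tall NP is split into three pieces instead of being returned whole, but the word order is preserved and no word is lost.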