
Is there a good, better, or more direct way to get chunking results from an NLTK tree?

chunking depth-first-search nltk nlp python

Lerner Zhang · Tech Community · 6 years ago

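I want to chunk this sentence so that the groups sit at a certain height in the tree. The original order should be preserved, and every original word should be included.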

import nltk 
height = 2
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(sentence)

In [29]: nltk.Tree.fromstring(str(result)).pretty_print()
                             S                                      
            _________________|_____________________________          
           NP                        VBD       IN          NP       
   ________|_________________         |        |      _____|____     
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN

My approach is somewhat brute-force, as shown here:

In [30]: [list(map(lambda x: x[0], _tree.leaves())) for _tree in result.subtrees(lambda x: x.height()==height)]
Out[30]: [['the', 'little', 'yellow', 'dog'], ['barked'], ['at'], ['the', 'cat']]

1 answer | last active 6 years ago

alvas · 6 years ago

No, NLTK has no built-in function that returns the subtrees at a specific depth.

See the related question "How to Traverse an NLTK Tree object?" for a depth-first traversal.

To be more efficient, you can traverse depth-first and recurse only when the current depth is less than the required one, e.g.:

import nltk 
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(sentence)

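def traverse_tree(tree, depth=float('inf')):
    """
    Traverse the tree depth-first and yield the leaves of every
    subtree whose height is at most `depth`.
    """
    for subtree in tree:
        # Bare leaves (strings or tuples) are not Trees and are skipped.
        if isinstance(subtree, nltk.tree.Tree):
            if subtree.height() <= depth:
                yield subtree.leaves()
            else:
                # Too tall: recurse to find shorter subtrees inside it.
                yield from traverse_tree(subtree, depth)


list(traverse_tree(result, 2))

[[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')],
 [('barked', 'VBD')],
 [('at', 'IN')],
 [('the', 'DT'), ('cat', 'NN')]]

Another example:

x = """(S
  (NP the/DT 
      (AP little/JJ yellow/JJ)
       dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))"""

list(traverse_tree(nltk.Tree.fromstring(x), 2))

[out]:

[['little/JJ', 'yellow/JJ'], ['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]

The bare leaves the/DT and dog/NN hang directly off the height-3 NP rather than inside a height-2 subtree, so no chunk contains them.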
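The question asks that every original word be kept, in order. Below is a minimal sketch of a variant that also emits bare leaves as single-item chunks, so words sitting directly under a taller subtree are not lost. The name `chunks_with_leaves` is hypothetical and not part of NLTK; the sketch relies only on `Tree.fromstring()`, `Tree.height()` and `Tree.leaves()`.

```python
from nltk.tree import Tree


def chunks_with_leaves(tree, height=2):
    """Yield chunks in sentence order so that every leaf appears exactly once.

    Hypothetical helper for illustration, not an NLTK function.
    """
    for child in tree:
        if not isinstance(child, Tree):
            # A bare leaf under a tall subtree becomes its own chunk.
            yield [child]
        elif child.height() <= height:
            # Short enough: the whole subtree is one chunk.
            yield child.leaves()
        else:
            # Too tall: descend and chunk its children instead.
            yield from chunks_with_leaves(child, height)


x = """(S
  (NP the/DT
      (AP little/JJ yellow/JJ)
       dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))"""

tree = Tree.fromstring(x)
chunks = list(chunks_with_leaves(tree, 2))
print(chunks)
# [['the/DT'], ['little/JJ', 'yellow/JJ'], ['dog/NN'], ['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]
```

Flattening the chunks gives back `tree.leaves()` unchanged, which is a quick way to check that no word was dropped or reordered.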
