How to pretty print an nltk tree object?

  •  2
  • Lerner Zhang  · Tech community  · 6 years ago

    import nltk 
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    
    pattern = """NP: {<DT>?<JJ>*<NN>}
    VBD: {<VBD>}
    IN: {<IN>}"""
    NPChunker = nltk.RegexpParser(pattern) 
    result = NPChunker.parse(sentence)
    

    Source: https://stackoverflow.com/a/31937278/3552975

    I can't figure out why I can't pretty-print result with:

    result.pretty_print()
    

    The error says TypeError: not all arguments converted during string formatting. I am using Python 3.5 and nltk 3.3.

    1 answer  |  latest 6 years ago
        1
  •  9
  •   alvas    6 years ago

    If you are looking for a bracketed parse output, you can use Tree.pprint():

    >>> import nltk 
    >>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    >>> 
    >>> pattern = """NP: {<DT>?<JJ>*<NN>}
    ... VBD: {<VBD>}
    ... IN: {<IN>}"""
    >>> NPChunker = nltk.RegexpParser(pattern) 
    >>> result = NPChunker.parse(sentence)
    >>> result.pprint()
    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      (VBD barked/VBD)
      (IN at/IN)
      (NP the/DT cat/NN))
    

    But most probably you are looking for:

                                 S                                      
                _________________|_____________________________          
               NP                        VBD       IN          NP       
       ________|_________________         |        |      _____|____     
    the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN
    

    Let's dig into the code of Tree.pretty_print() https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L692 :

    def pretty_print(self, sentence=None, highlight=(), stream=None, **kwargs):
        """
        Pretty-print this tree as ASCII or Unicode art.
        For explanation of the arguments, see the documentation for
        `nltk.treeprettyprinter.TreePrettyPrinter`.
        """
        from nltk.treeprettyprinter import TreePrettyPrinter
        print(TreePrettyPrinter(self, sentence, highlight).text(**kwargs),
              file=stream)
    

    It creates a TreePrettyPrinter object, https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L50 :

    class TreePrettyPrinter(object):
        def __init__(self, tree, sentence=None, highlight=()):
            if sentence is None:
                leaves = tree.leaves()
                if (leaves and not any(len(a) == 0 for a in tree.subtrees())
                        and all(isinstance(a, int) for a in leaves)):
                    sentence = [str(a) for a in leaves]
                else:
                    # this deals with empty nodes (frontier non-terminals)
                    # and multiple/mixed terminals under non-terminals.
                    tree = tree.copy(True)
                    sentence = []
                    for a in tree.subtrees():
                        if len(a) == 0:
                            a.append(len(sentence))
                            sentence.append(None)
                        elif any(not isinstance(b, Tree) for b in a):
                            for n, b in enumerate(a):
                                if not isinstance(b, Tree):
                                    a[n] = len(sentence)
                                    sentence.append('%s' % b)
            self.nodes, self.coords, self.edges, self.highlight = self.nodecoords(
                    tree, sentence, highlight)
    

    The line that raises the error appears to be sentence.append('%s' % b) https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L97

    The question is why it raises the TypeError:

    TypeError: not all arguments converted during string formatting
    

    If we look carefully, print('%s' % b) seems to work for most basic Python types:

    # String
    >>> x = 'abc'
    >>> type(x)
    <class 'str'>
    >>> print('%s' % x)
    abc
    
    # Integer
    >>> x = 123
    >>> type(x)
    <class 'int'>
    >>> print('%s' % x)
    123
    
    # Float 
    >>> x = 1.23
    >>> type(x)
    <class 'float'>
    >>> print('%s' % x)
    1.23
    
    # Boolean
    >>> x = True
    >>> type(x)
    <class 'bool'>
    >>> print('%s' % x)
    True
    

    Surprisingly, it even works on a list!

    >>> x = ['abc', 'def']
    >>> type(x)
    <class 'list'>
    >>> print('%s' % x)
    ['abc', 'def']
    

    But it gets stumped by a tuple!!

    >>> x = ('DT', 123)
    >>> x = ('abc', 'def')
    >>> type(x)
    <class 'tuple'>
    >>> print('%s' % x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: not all arguments converted during string formatting
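
    The reason a tuple behaves differently is that the % operator treats a tuple on its right-hand side as the list of arguments, not as a single value, so a two-item tuple offers two values to a single %s. A minimal sketch of that behaviour and of a few ways around it (the variable names are only for illustration):

```python
leaf = ("the", "DT")

# A bare tuple on the right of % is unpacked into the format arguments,
# so one %s placeholder is asked to consume two values and fails.
try:
    "%s" % leaf
    message = None
except TypeError as err:
    message = str(err)

# Wrapping the tuple in a one-element tuple passes it as a single value.
as_single_value = "%s" % (leaf,)

# str() and str.format() never unpack their argument.
via_str = str(leaf)
via_format = "{}".format(leaf)

# Supplying one placeholder per item also works.
joined = "%s/%s" % leaf

print(message)
print(as_single_value, via_str, via_format, joined)
```

    A list does not trigger this because only tuples are unpacked into format arguments; any other object is formatted as one value.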
    

    So if we go back to the code at https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L95 :

    if not isinstance(b, Tree):
        a[n] = len(sentence)
        sentence.append('%s' % b)
    

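    Since we know that sentence.append('%s' % b) cannot handle a tuple, adding a check for the tuple type, joining the items of the tuple in some way, and converting the result to str will produce a nice pretty_print: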

    if not isinstance(b, Tree):
        a[n] = len(sentence)
        if type(b) == tuple:
            b = '/'.join(b)
        sentence.append('%s' % b)
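
    The same check can be tried outside nltk. The sketch below is not part of nltk; leaf_label is a hypothetical helper name, and it applies str() to each item because '/'.join(b) on its own would raise a TypeError for a tuple with a non-string item such as ('DT', 123).

```python
def leaf_label(leaf, sep="/"):
    """Hypothetical helper: render a tree leaf as a single display string."""
    if isinstance(leaf, tuple):
        # Convert every item first so that non-string items do not break join().
        return sep.join(str(part) for part in leaf)
    # Anything that is not a tuple is safe to pass to % as one value.
    return "%s" % leaf


labels = [leaf_label(b) for b in [("the", "DT"), "cat", 42, ("DT", 123)]]
print(labels)  # ['the/DT', 'cat', '42', 'DT/123']
```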
    

                                 S
                _________________|_____________________________
               NP                        VBD       IN          NP
       ________|_________________         |        |      _____|____
    the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN
    

    Is it still possible to get the pretty print without changing the nltk code?

    Let's look at what result, i.e. a Tree object, looks like:

    Tree('S', [Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]), Tree('VBD', [('barked', 'VBD')]), Tree('IN', [('at', 'IN')]), Tree('NP', [('the', 'DT'), ('cat', 'NN')])])
    

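    It looks like the leaves are kept as a list of tuples of strings, e.g. [('the', 'DT'), ('cat', 'NN')], so we could do some hack to turn them into a list of strings, e.g. ['the/DT', 'cat/NN'], so that Tree.pretty_print() will play nice.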

    Since we know that Tree.pprint() already joins each tuple of strings into the form we want, i.e.

    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      (VBD barked/VBD)
      (IN at/IN)
      (NP the/DT cat/NN))
    

    we can simply output the bracketed parse as a string, then re-read it into a Tree object with Tree.fromstring():

    from nltk import Tree
    Tree.fromstring(str(result)).pretty_print()
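
    A more direct variant of the same hack is to join the tuple leaves on a copy of the tree instead of going through a string. This is only a sketch: stringify_leaves is a hypothetical helper, not an nltk function, and it relies on Tree.copy(deep=True) and Tree.treepositions('leaves'). It may also be more forgiving for tokens that contain parentheses or spaces, which a bracketed string would probably not survive.

```python
from nltk import Tree


def stringify_leaves(tree, sep="/"):
    """Hypothetical helper: copy `tree` and join each tuple leaf into one string."""
    new_tree = tree.copy(deep=True)  # leave the original tree untouched
    for pos in new_tree.treepositions("leaves"):
        leaf = new_tree[pos]
        if isinstance(leaf, tuple):
            new_tree[pos] = sep.join(str(part) for part in leaf)
    return new_tree


# A small hand-built tree with (word, tag) leaves, shaped like the chunker output.
chunked = Tree("S", [
    Tree("NP", [("the", "DT"), ("cat", "NN")]),
    Tree("VBD", [("barked", "VBD")]),
])

printable = stringify_leaves(chunked)
printable.pretty_print()
```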
    

    Finally:

    import nltk
    from nltk import Tree
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    
    pattern = """NP: {<DT>?<JJ>*<NN>}
    VBD: {<VBD>}
    IN: {<IN>}"""
    NPChunker = nltk.RegexpParser(pattern) 
    result = NPChunker.parse(sentence)
    
    Tree.fromstring(str(result)).pretty_print()
    

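    [out]:

                                 S
                _________________|_____________________________
               NP                        VBD       IN          NP
       ________|_________________         |        |      _____|____
    the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN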