
NLTK usage examples [closed]

  •  72
  • Mat  · Tech community  · 16 years ago

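    I'm playing around with the Natural Language Toolkit (NLTK).

    Its documentation (Book, HOWTO) is quite extensive, and the examples are sometimes a little advanced.

    I am looking for basic usage examples, along the lines of the NLTK articles on a blog.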

    3 answers  |  active until 9 years ago
        1
  •  28
  •   Mat    16 years ago

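    Here is my own practical example, for the benefit of anyone else who looks up this question (please excuse the sample text, it was the first thing I found on Wikipedia):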

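    import nltk
    import pprint

    # Built lazily on first use, because training the tagger is slow.
    tokenizer = None
    tagger = None

    def init_nltk():
        """Create the tokenizer and train a unigram tagger on the Brown corpus.

        The corpus data must be installed first: nltk.download('brown').
        """
        global tokenizer
        global tagger
        # A token is a run of word characters, or a run of punctuation.
        tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
        tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())

    def tag(text):
        """Return the (token, tag) pairs for text, sorted by tag."""
        if tokenizer is None:
            init_nltk()
        tokenized = tokenizer.tokenize(text)
        tagged = tagger.tag(tokenized)
        # Unknown words are tagged None; `or ''` sorts them first, which is
        # where Python 2's cmp() placed them, and keeps Python 3 from raising.
        tagged.sort(key=lambda pair: pair[1] or '')
        return tagged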
    
    def main():
        text = """Mr Blobby is a fictional character who featured on Noel
        Edmonds' Saturday night entertainment show Noel's House Party,
        which was often a ratings winner in the 1990s. Mr Blobby also
        appeared on the Jamie Rose show of 1997. He was designed as an
        outrageously over the top parody of a one-dimensional, mute novelty
        character, which ironically made him distinctive, absurd and popular.
        He was a large pink humanoid, covered with yellow spots, sporting a
        permanent toothy grin and jiggling eyes. He communicated by saying
        the word "blobby" in an electronically-altered voice, expressing
        his moods through tone of voice and repetition.
    
        There was a Mrs. Blobby, seen briefly in the video, and sold as a
        doll.
    
        However Mr Blobby actually started out as part of the 'Gotcha'
        feature during the show's second series (originally called 'Gotcha
        Oscars' until the threat of legal action from the Academy of Motion
        Picture Arts and Sciences[citation needed]), in which celebrities
        were caught out in a Candid Camera style prank. Celebrities such as
        dancer Wayne Sleep and rugby union player Will Carling would be
        enticed to take part in a fictitious children's programme based around
        their profession. Mr Blobby would clumsily take part in the activity,
        knocking over the set, causing mayhem and saying "blobby blobby
        blobby", until finally when the prank was revealed, the Blobby
        costume would be opened - revealing Noel inside. This was all the more
        surprising for the "victim" as during rehearsals Blobby would be
        played by an actor wearing only the arms and legs of the costume and
        speaking in a normal manner.[citation needed]"""
        tagged = tag(text)
        # Deduplicate, then sort again by tag because set() discards the order.
        unique = list(set(tagged))
        unique.sort(key=lambda pair: pair[1] or '')
        pprint.pprint(unique)
    
    if __name__ == '__main__':
        main()
    

    Output:

    [('rugby', None),
     ('Oscars', None),
     ('1990s', None),
     ('",', None),
     ('Candid', None),
     ('"', None),
     ('blobby', None),
     ('Edmonds', None),
     ('Mr', None),
     ('outrageously', None),
     ('.[', None),
     ('toothy', None),
     ('Celebrities', None),
     ('Gotcha', None),
     (']),', None),
     ('Jamie', None),
     ('humanoid', None),
     ('Blobby', None),
     ('Carling', None),
     ('enticed', None),
     ('programme', None),
     ('1997', None),
     ('s', None),
     ("'", "'"),
     ('[', '('),
     ('(', '('),
     (']', ')'),
     (',', ','),
     ('.', '.'),
     ('all', 'ABN'),
     ('the', 'AT'),
     ('an', 'AT'),
     ('a', 'AT'),
     ('be', 'BE'),
     ('were', 'BED'),
     ('was', 'BEDZ'),
     ('is', 'BEZ'),
     ('and', 'CC'),
     ('one', 'CD'),
     ('until', 'CS'),
     ('as', 'CS'),
     ('This', 'DT'),
     ('There', 'EX'),
     ('of', 'IN'),
     ('inside', 'IN'),
     ('from', 'IN'),
     ('around', 'IN'),
     ('with', 'IN'),
     ('through', 'IN'),
     ('-', 'IN'),
     ('on', 'IN'),
     ('in', 'IN'),
     ('by', 'IN'),
     ('during', 'IN'),
     ('over', 'IN'),
     ('for', 'IN'),
     ('distinctive', 'JJ'),
     ('permanent', 'JJ'),
     ('mute', 'JJ'),
     ('popular', 'JJ'),
     ('such', 'JJ'),
     ('fictional', 'JJ'),
     ('yellow', 'JJ'),
     ('pink', 'JJ'),
     ('fictitious', 'JJ'),
     ('normal', 'JJ'),
     ('dimensional', 'JJ'),
     ('legal', 'JJ'),
     ('large', 'JJ'),
     ('surprising', 'JJ'),
     ('absurd', 'JJ'),
     ('Will', 'MD'),
     ('would', 'MD'),
     ('style', 'NN'),
     ('threat', 'NN'),
     ('novelty', 'NN'),
     ('union', 'NN'),
     ('prank', 'NN'),
     ('winner', 'NN'),
     ('parody', 'NN'),
     ('player', 'NN'),
     ('actor', 'NN'),
     ('character', 'NN'),
     ('victim', 'NN'),
     ('costume', 'NN'),
     ('action', 'NN'),
     ('activity', 'NN'),
     ('dancer', 'NN'),
     ('grin', 'NN'),
     ('doll', 'NN'),
     ('top', 'NN'),
     ('mayhem', 'NN'),
     ('citation', 'NN'),
     ('part', 'NN'),
     ('repetition', 'NN'),
     ('manner', 'NN'),
     ('tone', 'NN'),
     ('Picture', 'NN'),
     ('entertainment', 'NN'),
     ('night', 'NN'),
     ('series', 'NN'),
     ('voice', 'NN'),
     ('Mrs', 'NN'),
     ('video', 'NN'),
     ('Motion', 'NN'),
     ('profession', 'NN'),
     ('feature', 'NN'),
     ('word', 'NN'),
     ('Academy', 'NN-TL'),
     ('Camera', 'NN-TL'),
     ('Party', 'NN-TL'),
     ('House', 'NN-TL'),
     ('eyes', 'NNS'),
     ('spots', 'NNS'),
     ('rehearsals', 'NNS'),
     ('ratings', 'NNS'),
     ('arms', 'NNS'),
     ('celebrities', 'NNS'),
     ('children', 'NNS'),
     ('moods', 'NNS'),
     ('legs', 'NNS'),
     ('Sciences', 'NNS-TL'),
     ('Arts', 'NNS-TL'),
     ('Wayne', 'NP'),
     ('Rose', 'NP'),
     ('Noel', 'NP'),
     ('Saturday', 'NR'),
     ('second', 'OD'),
     ('his', 'PP$'),
     ('their', 'PP$'),
     ('him', 'PPO'),
     ('He', 'PPS'),
     ('more', 'QL'),
     ('However', 'RB'),
     ('actually', 'RB'),
     ('also', 'RB'),
     ('clumsily', 'RB'),
     ('originally', 'RB'),
     ('only', 'RB'),
     ('often', 'RB'),
     ('ironically', 'RB'),
     ('briefly', 'RB'),
     ('finally', 'RB'),
     ('electronically', 'RB-HL'),
     ('out', 'RP'),
     ('to', 'TO'),
     ('show', 'VB'),
     ('Sleep', 'VB'),
     ('take', 'VB'),
     ('opened', 'VBD'),
     ('played', 'VBD'),
     ('caught', 'VBD'),
     ('appeared', 'VBD'),
     ('revealed', 'VBD'),
     ('started', 'VBD'),
     ('saying', 'VBG'),
     ('causing', 'VBG'),
     ('expressing', 'VBG'),
     ('knocking', 'VBG'),
     ('wearing', 'VBG'),
     ('speaking', 'VBG'),
     ('sporting', 'VBG'),
     ('revealing', 'VBG'),
     ('jiggling', 'VBG'),
     ('sold', 'VBN'),
     ('called', 'VBN'),
     ('made', 'VBN'),
     ('altered', 'VBN'),
     ('based', 'VBN'),
     ('designed', 'VBN'),
     ('covered', 'VBN'),
     ('communicated', 'VBN'),
     ('needed', 'VBN'),
     ('seen', 'VBN'),
     ('set', 'VBN'),
     ('featured', 'VBN'),
     ('which', 'WDT'),
     ('who', 'WPS'),
     ('when', 'WRB')]
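
    Some of the odd-looking tokens in the output above, such as '.[' and a lone 's', come from the tokenizer pattern rather than from the tagger. As a rough sketch, assuming RegexpTokenizer in its default mode simply returns every match of the pattern, the same splitting can be reproduced with the standard library alone. The tokenize helper is illustrative and not part of NLTK.

```python
import re

# Same pattern as the answer's RegexpTokenizer: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
TOKEN_PATTERN = re.compile(r'\w+|[^\w\s]+')


def tokenize(text):
    """Illustrative helper (not an NLTK function): split text into tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == '__main__':
    # The apostrophe becomes its own token, leaving a stray 's'.
    print(tokenize("the show's second series"))
    # Adjacent punctuation such as '.[' is kept together as one token.
    print(tokenize('a normal manner.[citation needed]'))
    # Hyphenated words are split into three tokens.
    print(tokenize('one-dimensional'))
```

    Because punctuation runs stay together, a token like '.[' is unlikely to appear in the training corpus, which is one reason it ends up untagged.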
    
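    The many None tags in the output above are words the tagger never saw in its training data: a unigram tagger only remembers the most frequent tag for each training word and has nothing to say about the rest. The sketch below shows that idea in plain Python. It is not NLTK's implementation; the training sentences are made up, and train_unigram and tag_tokens are hypothetical helper names.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(model, tokens, default=None):
    """Look up each token; tokens missing from the model get `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]


if __name__ == '__main__':
    # Made-up training data with Brown-style tags, for illustration only.
    training = [
        [('the', 'AT'), ('show', 'NN'), ('was', 'BEDZ'), ('popular', 'JJ')],
        [('to', 'TO'), ('show', 'VB'), ('the', 'AT'), ('costume', 'NN')],
        [('they', 'PPSS'), ('show', 'VB'), ('a', 'AT'), ('video', 'NN')],
    ]
    model = train_unigram(training)
    tokens = ['the', 'Blobby', 'show', 'was', 'popular']
    # 'Blobby' was never seen in training, so it is tagged None.
    print(tag_tokens(model, tokens))
    # With a default tag, unseen words are guessed to be nouns instead.
    print(tag_tokens(model, tokens, default='NN'))
```

    This also shows why 'show' comes out as VB above even where it is used as a noun: a unigram tagger ignores context. In NLTK itself a comparable fallback is available through the backoff argument, for example nltk.UnigramTagger(train, backoff=nltk.DefaultTagger('NN')).
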
        2
  •  18
  •   Pete Mancini    13 years ago

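    NLP in general is very useful, so you may want to broaden your search to general applications of text analytics. I used NLTK to help with MOSS 2010 by generating a file taxonomy from extracted concept maps. It worked really well. It does not take long before the files start to cluster in useful ways.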

        3
  •  14
  •   Jacob    15 years ago

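    streamhacker.com (thanks for the mention, I get a fair amount of click traffic from this question). What specifically are you trying to do? NLTK has a lot of tools for various tasks, but it is somewhat lacking in clear information on what to use those tools for and how best to use them. It is also oriented towards academic problems, so translating the pedagogical ...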