
Word frequency algorithm for natural language processing

  •  32
  • Mark McDonald  · 16 years ago

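    Without getting a degree in information retrieval, I'd like to know whether any algorithms exist for counting how often words occur in a given body of text. The goal is to get a "general feel" for what people are saying across a set of textual comments, along the lines of Wordle.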

    I'd like to:

    • ignore articles, pronouns, etc. ('a', 'an', 'the', 'him', 'them', etc.)
    • preserve proper nouns
    • ignore hyphenation, except for the soft kind

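    Reaching for the stars, these would be peachy:

    • handling stemming and plurals (e.g. like, likes, liked, liking all match the same result)
    • grouping adjectives (adverbs, etc.) with their subjects ("great service" as opposed to "great", "service")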

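    I've attempted some basic stuff using WordNet, but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.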

    8 answers  |  until 16 years ago
        2
  •  16
  •   underspecified    16 years ago

    Look at part-of-speech taggers. Here is the first sentence of the question run through one (tree-tagger-english):

    $ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english 
    # Word  POS     base form (lemma)
    Without IN  without
    getting VVG get
    a   DT  a
    degree  NN  degree
    in  IN  in
    information NN  information
    retrieval   NN  retrieval
    ,   ,   ,
    I   PP  I
    'd  MD  will
    like    VV  like
    to  TO  to
    know    VV  know
    if  IN  if
    there   EX  there
    exists  VVZ exist
    any DT  any
    algorithms  NNS algorithm
    for IN  for
    counting    VVG count
    the DT  the
    frequency   NN  frequency
    that    IN/that that
    words   NNS word
    occur   VVP occur
    in  IN  in
    a   DT  a
    given   VVN give
    body    NN  body
    of  IN  of
    text    NN  text
    .   SENT    .
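
Output in this three-column shape can be turned into a frequency table fairly directly. The following is only a minimal sketch, not the answerer's code: the function name `count_lemmas` and the particular set of tags to skip are assumptions chosen for illustration, and the input is assumed to be whitespace-separated word / tag / base-form lines like those above.

```python
from collections import Counter

# Tags to ignore (determiners, pronouns, prepositions, punctuation, ...).
# This particular selection is an assumption; adjust it to the tagger's tag set.
SKIP_TAGS = {"DT", "PP", "PP$", "IN", "IN/that", "TO", "MD", "EX", "CC", ",", "SENT"}

def count_lemmas(tagger_output, skip_tags=SKIP_TAGS):
    """Count base forms from 'word POS lemma' lines, skipping unwanted tags."""
    counts = Counter()
    for line in tagger_output.splitlines():
        fields = line.split()
        # Ignore blank lines, comment/header lines and anything malformed.
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        _word, tag, lemma = fields
        if tag not in skip_tags:
            counts[lemma] += 1
    return counts

sample = """
algorithms  NNS algorithm
the DT  the
words   NNS word
occur   VVP occur
word    NN  word
"""
print(count_lemmas(sample).most_common())
```

Counting the base form rather than the surface word is what merges "words" and "word" into one entry, which covers the plural-handling wish from the question.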
    

    TreeTagger
    GENIA Tagger
    Stanford POS Tagger

    See also: n-grams, UNIX for Poets, and Foundations of Statistical Natural Language Processing.
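
For the "great service" grouping in the question, counting n-grams (runs of n adjacent words) is a simple starting point. A rough sketch, assuming plain English text and a deliberately naive tokenizer; `ngram_counts` is a hypothetical helper name, not something from the references above:

```python
from collections import Counter
import re

def ngram_counts(text, n=2):
    """Count n-grams over lowercased word tokens (naive regex tokenizer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered views of the token list yields each run of n words
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

comments = "Great service and great food. The great service made my day."
bigrams = ngram_counts(comments, n=2)
print(bigrams.most_common(3))
```

Note that this sketch lets n-grams run across sentence boundaries ("food the"); splitting into sentences first, or keeping only adjective-noun pairs with the help of a tagger, would be the next refinement.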

        3
  •  4
  •   unmounted    16 years ago

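    >>> import string
    >>> from urllib.request import urlopen  # Python 3; the original used urllib2
    >>> # download the text; decoding as UTF-8 is an assumption about the file
    >>> devilsdict = urlopen('http://www.gutenberg.org/files/972/972.txt').read().decode('utf-8', errors='replace')
    >>> workinglist = devilsdict.split()
    >>> # strip leading and trailing punctuation from each token
    >>> cleanlist = [item.strip(string.punctuation) for item in workinglist]
    >>> results = {}
    >>> skip = {'a', 'the', 'an'}  # words to leave out of the count
    >>> for item in cleanlist:
    ...     if item not in skip:
    ...         results[item] = results.get(item, 0) + 1
    ...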
    
    >>> results
    {'': 17, 'writings': 3, 'foul': 1, 'Sugar': 1, 'four': 8, 'Does': 1, "friend's": 1, 'hanging': 4, 'Until': 1, 'marching': 2 ...
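
The output above shows two rough edges: tokens that were pure punctuation are counted as the empty string, and a capitalized "The" would slip past the skip list. A possible variation, sketched under the assumption that comparing against the stop list case-insensitively is acceptable (`word_counts` and `STOP_WORDS` are names made up for this example):

```python
from collections import Counter
import string

STOP_WORDS = {"a", "an", "the"}  # same small list as above; extend as needed

def word_counts(text, stop_words=STOP_WORDS):
    """Count words, dropping empty tokens and stop words in any casing."""
    counts = Counter()
    for token in text.split():
        word = token.strip(string.punctuation)
        # Compare lowercased, but count the original casing so that
        # capitalized words (e.g. proper nouns) stay distinct.
        if word and word.lower() not in stop_words:
            counts[word] += 1
    return counts

sample = "The Devil, -- the devil's advocate; a friend of The Devil."
print(word_counts(sample).most_common(2))
```

Keeping the original casing in the keys is a crude way to preserve proper nouns; it also means a sentence-initial "Great" and a mid-sentence "great" are counted separately.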
    

        5
  •  2
  •   Justin Bozonier    16 years ago

        6
  •  1
  •   graffic    16 years ago
        7
  •  0
  •   tafseer    15 years ago