Word frequency algorithm for natural language processing

word-frequency nlp algorithm

Mark McDonald · Tech Community · 16 years ago

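Without getting a degree in information retrieval, I'd like to know whether there are any algorithms for counting how often words occur in a given body of text. The goal is to get a "general feel" for what people are saying across a set of textual comments, along the lines of Wordle.

What I'd like:

ignore articles, pronouns, etc. ('a', 'an', 'the', 'him', 'them', etc.)
preserve proper nouns
ignore hyphenation, except for the soft kind

Reaching for the stars, these would be peachy:

handle stemming and plurals (e.g. like, likes, liked, liking match the same result)
group adjectives (adverbs, etc.) with their subjects ("great service" as opposed to "great", "service")

I've attempted some basic stuff using WordNet, but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.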

8 replies | last activity 16 years ago

Aleksandar Dimitrov 16 years ago

Lucene NLTK

tf-idf
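
For readers who have not met tf-idf before, here is a minimal sketch of one common textbook variant (term frequency times log inverse document frequency). The function name `tf_idf` and the sample comments are made up for illustration; libraries such as Lucene use their own weighting details, so treat this as an outline of the idea rather than what any particular tool computes.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents is a list of token lists. This uses
    tf = count / len(doc) and idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical comments, already tokenized.
comments = [
    "the service was great".split(),
    "the food was cold".split(),
    "the staff was great".split(),
]
scores = tf_idf(comments)
for doc_scores in scores:
    # Show the two highest-scoring terms of each comment.
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1])[:2])
```

Words that occur in every comment (here 'the' and 'was') score zero under this variant, which gives a crude form of stop-word suppression without a hand-written list.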

underspecified 16 years ago

part-of-speech taggers

$ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english 
# Word  POS     surface form
Without IN  without
getting VVG get
a   DT  a
degree  NN  degree
in  IN  in
information NN  information
retrieval   NN  retrieval
,   ,   ,
I   PP  I
'd  MD  will
like    VV  like
to  TO  to
know    VV  know
if  IN  if
there   EX  there
exists  VVZ exist
any DT  any
algorithms  NNS algorithm
for IN  for
counting    VVG count
the DT  the
frequency   NN  frequency
that    IN/that that
words   NNS word
occur   VVP occur
in  IN  in
a   DT  a
given   VVN give
body    NN  body
of  IN  of
text    NN  text
.   SENT    .
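
As a rough illustration of how output like the above could feed a frequency count, the sketch below reads rows of word, POS tag and base form, keeps only rows whose tag looks like a noun, verb or adjective, and counts the base forms. The tag-prefix filter and the function name `lemma_counts` are assumptions of this example, and the two rows marked hypothetical are not part of the tagger output shown above.

```python
from collections import Counter

# Rows copied from the tagger output above: word, POS tag, base form.
ROWS_FROM_ANSWER = [
    "getting VVG get",
    "a DT a",
    "degree NN degree",
    "like VV like",
    "algorithms NNS algorithm",
    "the DT the",
    "words NNS word",
    "occur VVP occur",
]
# Hypothetical extra rows, added only to show inflected forms merging.
HYPOTHETICAL_ROWS = [
    "likes VVZ like",
    "liked VVD like",
]

# Assumption: tags starting with N, V or J mark content words.
CONTENT_PREFIXES = ("N", "V", "J")

def lemma_counts(rows):
    counts = Counter()
    for row in rows:
        parts = row.split()
        if len(parts) != 3:
            continue  # skip blank, header or malformed rows
        word, tag, lemma = parts
        if tag.startswith(CONTENT_PREFIXES):
            counts[lemma] += 1
    return counts

counts = lemma_counts(ROWS_FROM_ANSWER + HYPOTHETICAL_ROWS)
print(counts.most_common())
```

Counting the third column instead of the first is what folds like/likes/liked together, and filtering on the tag drops articles and prepositions without a stop-word list.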

TreeTagger
GENIA Tagger
Stanford POS Tagger

n-grams UNIX for Poets Foundations of Statistical Natural Language Processing
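
A minimal sketch of n-gram counting, which is one way to get at pairs like "great service" instead of the two words separately. The helper name `ngrams` and the sample text are made up for this example; it does no part-of-speech filtering, so it counts every adjacent pair.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical sample text, tokenized on whitespace.
tokens = "great service great food great service".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

Combining this with POS tags (for example, keeping only adjective plus noun pairs) would be one way to narrow the result to phrases of the "great service" kind.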

unmounted 16 years ago

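>>> import string
>>> from urllib.request import urlopen  # Python 3 replacement for urllib2
>>> # Download "The Devil's Dictionary" from Project Gutenberg as text
>>> devilsdict = urlopen('http://www.gutenberg.org/files/972/972.txt').read().decode('utf-8', errors='replace')
>>> workinglist = devilsdict.split()
>>> # Strip leading and trailing punctuation from every token
>>> cleanlist = [item.strip(string.punctuation) for item in workinglist]
>>> results = {}
>>> skip = {'a', 'the', 'an'}  # words to ignore
>>> for item in cleanlist:
...     if item not in skip:
...         results[item] = results.get(item, 0) + 1
...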

>>> results
{'': 17, 'writings': 3, 'foul': 1, 'Sugar': 1, 'four': 8, 'Does': 1, "friend's": 1, 'hanging': 4, 'Until': 1, 'marching': 2 ...
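
The `'': 17` entry above comes from tokens that consisted only of punctuation. A possible variation, sketched here with `collections.Counter`, drops those empty tokens and compares against the skip list case-insensitively while keeping the original capitalization (so proper nouns survive). The stop-word set, the function name `word_frequencies` and the sample sentence are illustrative assumptions, not a complete solution.

```python
import string
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "him", "them"}

def word_frequencies(text, stop_words=STOP_WORDS):
    # Strip surrounding punctuation from each whitespace-separated token.
    tokens = (raw.strip(string.punctuation) for raw in text.split())
    return Counter(
        token for token in tokens
        # Skip empty tokens and stop words, ignoring case for the lookup.
        if token and token.lower() not in stop_words
    )

sample = "The service was great. The staff, however -- the staff was slow."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

`Counter.most_common(n)` returns the n most frequent words, which is usually what a Wordle-style overview needs.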

naspinski 16 years ago

http://naspinski.net/post/Findingcounting-Keywords-out-of-a-Text-Document.aspx

Justin Bozonier 16 years ago

graffic 16 years ago

Programming Collective Intelligence

tafseer 15 years ago

Dim 6 years ago

spacy

navigating the parse tree

prodigy