Filtering trigrams with NLTK

collocation nltk nlp python

gd13 · 6 years ago

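I want to find the trigrams in a corpus, with the restriction that at least two of the three words in each trigram are not proper nouns. Here is my code so far.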

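from nltk.corpus import stopwords
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

def collocation_finder(text, window_size):
    ign = stopwords.words('english')

    # Build the finder, then drop rare trigrams, short words and stopwords
    finder = TrigramCollocationFinder.from_words(text, window_size)
    finder.apply_freq_filter(2)
    finder.apply_word_filter(lambda w: len(w) < 2 or w.lower() in ign)
    # The filter in question: meant to enforce the proper-noun restriction
    finder.apply_word_filter(lambda w: next(iter(w)) in propernouns)

    trig_mes = TrigramAssocMeasures()
    # Get trigrams based on raw frequency
    collocs = finder.nbest(trig_mes.raw_freq, 10)
    scores = finder.score_ngrams(trig_mes.raw_freq)

    return collocs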

Here propernouns is a list of all the proper nouns in the corpus.

The problem is my last word filter, the one that is supposed to make sure the restriction holds. It does not do that. Any ideas?

1 answer | 6 years ago

Hassan Voyeau · 6 years ago

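This should do what you want. An n-gram filter sees all three words at once and drops the trigram whenever the function returns True, so this removes every trigram containing two or more proper nouns: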

finder.apply_ngram_filter(lambda w1, w2, w3: sum([w1 in propernouns, w2 in propernouns, w3 in propernouns]) >= 2)
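
For readability, the same predicate could be pulled out into a named helper. This is only a sketch of the idea: make_proper_noun_filter and max_proper are hypothetical names, not part of NLTK, and the proper nouns are copied into a set on the assumption that the list is large enough for lookup speed to matter.

```python
def make_proper_noun_filter(propernouns, max_proper=1):
    """Build a predicate suitable for finder.apply_ngram_filter.

    The returned function answers "should this trigram be dropped?".
    It returns True when more than `max_proper` of the three words are
    proper nouns, i.e. when fewer than two words are ordinary words.
    (Hypothetical helper; the names are illustrative only.)
    """
    proper = set(propernouns)  # set membership is O(1) per lookup

    def should_drop(w1, w2, w3):
        # Count how many of the three words are proper nouns
        count = sum(w in proper for w in (w1, w2, w3))
        return count > max_proper

    return should_drop


# Toy check, independent of any corpus
drop = make_proper_noun_filter(["Alice", "London", "Paris"])
print(drop("Alice", "visited", "London"))  # two proper nouns
print(drop("Alice", "likes", "tea"))       # one proper noun
```

With a real finder, finder.apply_ngram_filter(make_proper_noun_filter(propernouns)) would be called before finder.nbest(...), in place of the second apply_word_filter call. A word filter cannot express this restriction because it only ever sees one word at a time.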
