
gensim Doc2Vec trims and removes the vocabulary

  • Nicolò Gasparini  · 6 years ago

    I'm trying to create a simple Doc2Vec model:

     from gensim.models import doc2vec
     from gensim.models.doc2vec import Doc2Vec

     # Five tiny tagged documents; no word appears more than four times in total
     sentences = []
     sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'rosse', u'con', u'tacco'], tags=[1]))
     sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'blu'], tags=[2]))
     sentences.append(doc2vec.TaggedDocument(words=[u'scarponcini', u'Emporio', u'Armani'], tags=[3]))
     sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'marca', u'italiana'], tags=[4]))
     sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'bianche', u'senza', u'tacco'], tags=[5]))

     model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
     model.build_vocab(sentences)
    

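    But I end up with an empty vocabulary. With some debugging I saw that inside build_vocab() the dictionary is actually created by vocabulary.scan_vocab(), but it is then emptied by the subsequent vocabulary.prepare_vocab() call. Digging deeper, this is the function that causes the problem: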

    def keep_vocab_item(word, count, min_count, trim_rule=None):
        """Check that should we keep `word` in vocab or remove.
    
        Parameters
        ----------
        word : str
            Input word.
        count : int
            Number of times that word contains in corpus.
        min_count : int
            Frequency threshold for `word`.
        trim_rule : function, optional
            Function for trimming entities from vocab, default behaviour is `vocab[w] <= min_reduce`.
    
        Returns
        -------
        bool
            True if `word` should stay, False otherwise.
    
        """
        default_res = count >= min_count
    
        if trim_rule is None:
            return default_res # <-- ALWAYS RETURNS FALSE
        else:
            rule_res = trim_rule(word, count, min_count)
            if rule_res == RULE_KEEP:
                return True
            elif rule_res == RULE_DISCARD:
                return False
            else:
                return default_res  
    

    Does anyone understand what is going wrong?

    1 Answer
  • Nicolò Gasparini  · 6 years ago

    I found the answer myself: the default value of min_count is 5, and none of my words occurs 5 or more times. I only had to change this line of code:

    model = Doc2Vec(min_count=0, alpha=0.025, min_alpha=0.025)
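To see why the default threshold empties the vocabulary, the frequency check can be reproduced without gensim. The following is only a minimal sketch: `surviving_vocab` and `corpus` are hypothetical names introduced for illustration, and the function mirrors just the `count >= min_count` comparison from `keep_vocab_item` quoted in the question, not the rest of gensim's vocabulary handling.

```python
from collections import Counter

# The same five documents as in the question, as plain token lists.
corpus = [
    ["scarpe", "rosse", "con", "tacco"],
    ["scarpe", "blu"],
    ["scarponcini", "Emporio", "Armani"],
    ["scarpe", "marca", "italiana"],
    ["scarpe", "bianche", "senza", "tacco"],
]


def surviving_vocab(docs, min_count):
    """Hypothetical helper: keep only words with count >= min_count.

    This mirrors the default rule in keep_vocab_item (no trim_rule given).
    """
    counts = Counter(word for doc in docs for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# The most frequent word, "scarpe", occurs only 4 times, so a threshold of 5
# discards every word.
print(surviving_vocab(corpus, 5))        # {}
# With a threshold of 0 (or 1) every distinct word is kept.
print(len(surviving_vocab(corpus, 0)))   # 12
```

On a corpus this small, any threshold above 4 leaves nothing to train on, which is why lowering `min_count` resolves the empty vocabulary.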