
gensim Doc2Vec trims and removes the vocabulary

vocabulary doc2vec gensim python

Nicolò Gasparini · 6 years ago

I am trying to create a simple doc2vec model:

 from gensim.models import doc2vec
 from gensim.models.doc2vec import Doc2Vec

 sentences = []
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'rosse', u'con', u'tacco'], tags=[1]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'blu'], tags=[2]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarponcini', u'Emporio', u'Armani'], tags=[3]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'marca', u'italiana'], tags=[4]))
 sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'bianche', u'senza', u'tacco'], tags=[5]))

 model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
 model.build_vocab(sentences)

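But I end up with an empty vocabulary. With some debugging, I saw that inside build_vocab() the dictionary is actually created by the vocabulary.scan_vocab() function, but it is then emptied by the vocabulary.prepare_vocab() call that follows. Digging deeper, this is the function that causes the problem: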

def keep_vocab_item(word, count, min_count, trim_rule=None):
    """Check that should we keep `word` in vocab or remove.

    Parameters
    ----------
    word : str
        Input word.
    count : int
        Number of times that word contains in corpus.
    min_count : int
        Frequency threshold for `word`.
    trim_rule : function, optional
        Function for trimming entities from vocab, default behaviour is `vocab[w] <= min_reduce`.

    Returns
    -------
    bool
        True if `word` should stay, False otherwise.

    """
    default_res = count >= min_count

    if trim_rule is None:
        return default_res # <-- ALWAYS RETURNS FALSE
    else:
        rule_res = trim_rule(word, count, min_count)
        if rule_res == RULE_KEEP:
            return True
        elif rule_res == RULE_DISCARD:
            return False
        else:
            return default_res

Does anyone understand what is going wrong here?

1 answer

Nicolò Gasparini · 6 years ago

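I found the answer myself: the default value of min_count is 5, and none of my words occurs 5 or more times, so every word fails the count >= min_count check and is discarded. I only needed to change this line of code: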

model = Doc2Vec(min_count=0, alpha=0.025, min_alpha=0.025)
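To see why the default threshold empties the vocabulary, here is a minimal sketch that applies the same `count >= min_count` rule from `keep_vocab_item` to the five documents in the question. It uses only the standard library rather than gensim itself, and `surviving_vocab` is a hypothetical helper name introduced for illustration, not part of gensim.

```python
from collections import Counter

# The five tokenized documents from the question.
documents = [
    ["scarpe", "rosse", "con", "tacco"],
    ["scarpe", "blu"],
    ["scarponcini", "Emporio", "Armani"],
    ["scarpe", "marca", "italiana"],
    ["scarpe", "bianche", "senza", "tacco"],
]


def surviving_vocab(docs, min_count):
    """Hypothetical helper: keep only words with count >= min_count.

    This mirrors the default branch of keep_vocab_item (trim_rule is None);
    it is an illustration, not gensim's actual vocabulary code.
    """
    counts = Counter(word for doc in docs for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# The most frequent word, "scarpe", appears only 4 times, so nothing
# reaches a threshold of 5.
print(surviving_vocab(documents, min_count=5))
# Lowering the threshold to 1 keeps every distinct word.
print(sorted(surviving_vocab(documents, min_count=1)))
```

With such a tiny corpus, any threshold above 4 discards everything, which is why lowering `min_count` fixes the empty vocabulary.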
