
How do I create a Keras Embedding layer from a pre-trained word embedding dataset?

word-embedding word2vec keras tensorflow python

Marsellus Wallace · Tech Community · 7 years ago

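How do I load pre-trained word embeddings into a Keras Embedding layer?

I downloaded glove.6B.50d.txt (from the glove.6B.zip archive at https://nlp.stanford.edu/projects/glove/ ), but I am not sure how to add it to a Keras Embedding layer. See: https://keras.io/layers/embeddings/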

4 replies | as of 7 years ago

Marsellus Wallace 7 years ago

You need to pass the embedding matrix to the Embedding layer as follows:

Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

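vocabLen : the number of tokens in your vocabulary
embDim : the dimension of the embedding vectors (50 in this example)
embeddingMatrix : the embedding matrix built from the glove.6B.50d.txt file
isTrainable : whether the embeddings should be trainable or the layer should be frozen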

glove.6B.50d.txt is a list of space-separated values: a word token followed by its 50 embedding values. For example: the 0.418 0.24968 -0.41242 ...
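
To make the file format concrete, here is a minimal sketch of parsing a single line. The helper name parse_glove_line is hypothetical, and the sample line is cut down to the three values quoted above rather than the full 50.

```python
def parse_glove_line(line):
    """Split one GloVe text line into (token, vector of floats)."""
    parts = line.rstrip().split(" ")
    token = parts[0]                        # first field is the word
    vector = [float(x) for x in parts[1:]]  # remaining fields are the values
    return token, vector

# Shortened sample line: only the first 3 values instead of 50.
sample = "the 0.418 0.24968 -0.41242\n"
token, vector = parse_glove_line(sample)
print(token, vector)
```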

Creating the pretrainedEmbeddingLayer from the GloVe file:

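# Imports needed by the snippets below
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Prepare Glove File
def readGloveFile(gloveFile):
    with open(gloveFile, 'r', encoding='utf-8') as f:  # read as UTF-8 regardless of the platform default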
        wordToGlove = {}  # map from a token (word) to a Glove embedding vector
        wordToIndex = {}  # map from a token to an index
        indexToWord = {}  # map from an index to a token 

        for line in f:
            record = line.strip().split()
            token = record[0] # take the token (word) from the text line
            wordToGlove[token] = np.array(record[1:], dtype=np.float64) # associate the Glove embedding vector with that token (word)

        tokens = sorted(wordToGlove.keys())
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1  # index 0 is reserved for masking/padding in Keras
            wordToIndex[tok] = kerasIdx # associate an index with a token (word)
            indexToWord[kerasIdx] = tok # associate a token (word) with an index. Note: inverse of the dictionary above

    return wordToIndex, indexToWord, wordToGlove

# Create Pretrained Keras Embedding Layer
def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
    vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
    embDim = next(iter(wordToGlove.values())).shape[0]  # works with any glove dimensions (e.g. 50)

    embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros
    for word, index in wordToIndex.items():
        embeddingMatrix[index, :] = wordToGlove[word] # create embedding: word index to Glove word embedding

    embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
    return embeddingLayer

# usage
wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
model = Sequential()
model.add(pretrainedEmbeddingLayer)
...
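
The embedding layer consumes integer indices rather than raw words, so the input text has to be mapped through wordToIndex first. The following is only a sketch under stated assumptions: the helper encode_tokens is hypothetical, unknown words are simply dropped, and sequences are right-padded with the reserved index 0. Note that Keras only treats 0 as a mask value if the layer is created with mask_zero=True, which the code above does not do.

```python
def encode_tokens(tokens, word_to_index, max_len):
    """Map tokens to integer indices, drop unknown tokens, pad with 0."""
    indices = [word_to_index[t] for t in tokens if t in word_to_index]
    indices = indices[:max_len]                       # truncate long inputs
    return indices + [0] * (max_len - len(indices))   # right-pad with index 0

# Toy vocabulary standing in for the wordToIndex dict returned by readGloveFile.
toy_word_to_index = {"cat": 1, "sat": 2, "the": 3}
encoded = encode_tokens(["the", "cat", "sat", "quietly"], toy_word_to_index, max_len=5)
print(encoded)  # [3, 1, 2, 0, 0]
```

A batch of such fixed-length lists can then be converted to an integer array and fed to the model.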

Bhushan Pant 7 years ago

There is a great blog post describing how to create an embedding layer from pre-trained word vector embeddings:

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

The code for that post can be found here:

https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py

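Another good blog post on the same topic: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

janluke 4 years ago

A few years ago I wrote a small package named embfile, for working with "embedding files" (though I only published it in 2020). The use case I wanted to cover is building a pre-trained embedding matrix to initialize an Embedding layer, and I wanted to do it by loading only the needed word vectors as quickly as possible.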

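It supports several formats:

.txt (with or without a "header row")
.bin, the Google Word2Vec format
.vvm, a custom format I use (it is just a TAR file containing vocabulary, vectors and metadata in separate files, so that the vocabulary can be read entirely in a fraction of a second and the vectors can be accessed randomly).

The package is extensively documented and tested. There are also examples that show how to use it with Keras.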

import embfile

with embfile.open(EMBEDDING_FILE_PATH) as f:

    emb_matrix, word2index, missing_words = embfile.build_matrix(
        f, 
        words=vocab,     # this could also be a word2index dictionary
        start_index=1,   # leave the first row as zeros
    )

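This function also handles the initialization of words that are not in the file's vocabulary. By default, it fits a normal distribution on the found vectors and uses it to generate new random vectors (this is what AllenNLP does). I am not sure this feature is still useful: nowadays you can generate embeddings for unknown words with FastText or other tools.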
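
As a rough illustration of that idea (not embfile's actual implementation), the sketch below fits an independent normal distribution per embedding dimension and samples a vector for a missing word. Fitting per dimension is an assumption, and the function names and toy data are hypothetical.

```python
import random
import statistics

def fit_normal_per_dimension(vectors):
    """Return (means, stdevs), one value per embedding dimension."""
    columns = list(zip(*vectors))  # transpose: one tuple per dimension
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return means, stdevs

def sample_vector(means, stdevs, rng):
    """Draw one random vector from the fitted per-dimension normals."""
    return [rng.gauss(m, s) for m, s in zip(means, stdevs)]

# Toy 2-dimensional vectors standing in for the vectors found in the file.
found = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
means, stdevs = fit_normal_per_dimension(found)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
oov_vector = sample_vector(means, stdevs, rng)
print(means, oov_vector)
```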

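Keep in mind that txt and bin files are essentially sequential files and require a full scan (unless all the words you are looking for are found before the end). That is why I use vvm files, which provide random access to the vectors. Simply indexing the sequential files would also solve the problem, but embfile does not have that feature. You can, however, convert a sequential file to vvm (which is similar to creating an index and packing everything into a single file).

MuzaffarShaikh 3 years ago

I was looking for something similar and found this blog post, which answers the question. It clearly explains how to create the embedding_matrix and pass it to the Embedding() layer: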
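
The indexing idea mentioned above can be sketched in plain Python, independently of embfile: scan a text embedding file once, remember the byte offset of each word's line, then seek directly to a line when its vector is needed. The function names and the tiny demo file are hypothetical.

```python
import os
import tempfile

def build_offset_index(path):
    """Scan the file once and record the byte offset where each word's line starts."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            word = line.split(b" ", 1)[0].decode("utf-8")
            index[word] = offset
    return index

def read_vector(path, index, word):
    """Jump straight to the word's line and parse its vector."""
    with open(path, "rb") as f:
        f.seek(index[word])
        parts = f.readline().decode("utf-8").split()
    return [float(x) for x in parts[1:]]

# Tiny made-up embedding file, only for demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the 0.1 0.2\ncat 0.3 0.4\nsat 0.5 0.6\n")
    demo_path = tmp.name

offsets = build_offset_index(demo_path)
cat_vector = read_vector(demo_path, offsets, "cat")
print(cat_vector)  # [0.3, 0.4]
os.remove(demo_path)
```

The index still costs one full scan to build, but it can be saved and reused so later lookups avoid rescanning the file.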

GloVe Embeddings for deep learning in Keras.