I am trying to tokenize IMDB movie reviews with the TensorFlow (Keras) Tokenizer. I want the vocabulary to contain at most 10,000 words, and for unseen words I use a default out-of-vocabulary (OOV) token.
type(X), X.shape, X[:3]
(pandas.core.series.Series, (25000,),
0 first think another disney movie might good it...
1 put aside dr house repeat missed desperate hou...
2 big fan stephen king s work film made even gre...
Name: SentimentText, dtype: object)
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=10000, oov_token='xxxxxxx')
# fit on the input data
tokenizer.fit_on_texts(X)
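As a quick sanity check, the cap itself does seem to be registered on the tokenizer (as far as I can tell, num_words is simply kept as a plain attribute on the Tokenizer instance):

# Confirm the vocabulary cap passed to the constructor was stored
print(tokenizer.num_words)  # 10000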
When I check how many words the tokenizer's dictionary contains, I get:
X_dict = tokenizer.word_index
list(enumerate(X_dict.items()))[:10]
[(0, ('xxxxxxx', 1)),
(1, ('s', 2)),
(2, ('movie', 3)),
(3, ('film', 4)),
(4, ('not', 5)),
(5, ('it', 6)),
(6, ('one', 7)),
(7, ('like', 8)),
(8, ('i', 9)),
(9, ('good', 10))]
print(len(X_dict))
Out: 74120
Why do I get 74,120 words instead of 10,000?
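For reference, here is the follow-up check I have in mind, as a minimal sketch. It assumes that num_words is only applied when texts are converted to sequences, not when word_index is built:

# If the cap is applied at conversion time, no index in the output
# sequences should reach 10000, even though word_index is much larger.
seqs = tokenizer.texts_to_sequences(X)
max_index = max(i for seq in seqs for i in seq)
print(max_index)  # expected to stay below 10000 under that assumption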