I am trying to tokenize IMDB movie reviews with the TensorFlow (Keras) Tokenizer. I want the vocabulary to contain at most 10,000 words, and for unseen words I use a default out-of-vocabulary (OOV) token.
type(X), X.shape, X[:3]
(pandas.core.series.Series, (25000,),
0 first think another disney movie might good it...
1 put aside dr house repeat missed desperate hou...
2 big fan stephen king s work film made even gre...
Name: SentimentText, dtype: object)
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=10000, oov_token='xxxxxxx')
# fit on the input data
tokenizer.fit_on_texts(X)
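As a quick sanity check, the cap itself does seem to be registered on the tokenizer (as far as I can tell, num_words is simply kept as a plain attribute on the Tokenizer instance):

# Confirm the vocabulary cap passed to the constructor was stored
print(tokenizer.num_words)  # 10000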
When I check how many words the tokenizer's dictionary contains, I get:
X_dict = tokenizer.word_index
list(enumerate(X_dict.items()))[:10]
[(0, ('xxxxxxx', 1)),
(1, ('s', 2)),
(2, ('movie', 3)),
(3, ('film', 4)),
(4, ('not', 5)),
(5, ('it', 6)),
(6, ('one', 7)),
(7, ('like', 8)),
(8, ('i', 9)),
(9, ('good', 10))]
print(len(X_dict))
Out: 74120
Why do I get 74,120 words instead of 10,000?
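For reference, here is the follow-up check I have in mind, as a minimal sketch. It assumes that num_words is only applied when texts are converted to sequences, not when word_index is built:

# If the cap is applied at conversion time, no index in the output
# sequences should reach 10000, even though word_index is much larger.
seqs = tokenizer.texts_to_sequences(X)
max_index = max(i for seq in seqs for i in seq)
print(max_index)  # expected to stay below 10000 under that assumption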