
Whitespace error when trying to tokenize text in Keras

keras numpy python

rmahesh · 6 years ago

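I have a DataFrame with two columns. The first column (content_cleaned) contains rows of sentences. The second column (meaningful) contains the associated binary label.

The problem I am running into is that whitespace seems to cause trouble when I try to tokenize the text in the content_cleaned column. Here is my code so far: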

df = pd.read_csv(pathname, encoding = "ISO-8859-1")
df = df[['content_cleaned', 'meaningful']]
df = df.sample(frac=1)

#Transposed columns into numpy arrays 
X = np.asarray(df[['content_cleaned']])
y = np.asarray(df[['meaningful']])

#Split into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21) 

# Create tokenizer
tokenizer = Tokenizer(num_words=100) #No row has more than 100 words.

#Tokenize the predictors (text)
X_train = tokenizer.sequences_to_matrix(X_train.astype(np.int32), mode="binary")
X_test = tokenizer.sequences_to_matrix(X_test.astype(np.int32), mode="binary")

#Convert the labels to the binary
encoder = LabelBinarizer()
encoder.fit(y_train) 
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)

The line of code highlighted by the error is:

X_train = tokenizer.sequences_to_matrix(X_train.astype(np.int32), mode="binary")

The error message is:

invalid literal for int() with base 10: "STX's better than reported quarter is likely to bode well for WDC results."

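The sentence after "base 10:" is one of the rows from the text column. It is an example of the kind of sentence I am trying to tokenize.

I suspect this is a NumPy issue, but it could also be a mistake in the way I am tokenizing this array of text.

Any help would be appreciated!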

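1 Answer

mcemilg · 6 years ago

You are not tokenizing the text. The sequences_to_matrix method does not tokenize text; it converts a list of sequences into matrices. There are many ways to tokenize text data, so if you want to use the Keras tokenizer, you can do it like this: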

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Tip: num_words is not the maximum length of your sentences.
# It is the maximum number of words to keep in the vocabulary.
tokenizer = Tokenizer(num_words=100)

# Build the word index from the training texts only.
# Fitting on the test data as well would make your score misleading.
# Note: this assumes X_train is a 1-D sequence of strings, e.g. built from
# df['content_cleaned'] rather than df[['content_cleaned']].
tokenizer.fit_on_texts(X_train)

# Now convert the texts to sequences of word indices
X_train_encoded = tokenizer.texts_to_sequences(X_train)
X_test_encoded = tokenizer.texts_to_sequences(X_test)

# Pad the sequences so that they all have the same length
max_len = 100
X_train = pad_sequences(X_train_encoded, maxlen=max_len)
X_test = pad_sequences(X_test_encoded, maxlen=max_len)
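
If the three steps above feel like a black box, here is a rough pure-Python sketch of what fitting, encoding, and padding do. The helper names build_word_index, encode, and pad are made up for illustration and are not part of Keras. The real Tokenizer also strips punctuation, which this sketch skips, so treat it as a simplified stand-in rather than the library's implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, 1 = most frequent."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, word_index, num_words):
    """Hypothetical helper: replace words by their rank.

    Unknown words and words ranked num_words or higher are dropped.
    """
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


def pad(sequences, maxlen):
    """Hypothetical helper: left-pad with zeros, keep the last maxlen items."""
    return [([0] * (maxlen - len(seq)) + seq)[-maxlen:] for seq in sequences]


train_texts = ["the quarter was good", "the results were good"]
test_texts = ["the quarter was bad"]

# Fit on the training texts only, then reuse the same index for the test texts.
word_index = build_word_index(train_texts)

X_train_demo = pad(encode(train_texts, word_index, num_words=100), maxlen=5)
X_test_demo = pad(encode(test_texts, word_index, num_words=100), maxlen=5)

# A small num_words shrinks the vocabulary, not the sentence length limit.
X_small_vocab = encode(train_texts, word_index, num_words=3)

print(word_index)
print(X_train_demo)
print(X_test_demo)
print(X_small_vocab)
```

Note that "bad" never appears in the training texts, so it is simply dropped from the test sequence. That is the price of fitting on the training data only.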

Hope this helps. There is a good tutorial on preprocessing text with Keras here.
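
As for the error message itself: it most likely has nothing to do with whitespace. A minimal sketch, assuming the astype(np.int32) cast ends up calling Python's built-in int() on each string (which the wording of the message suggests), reproduces the same text without Keras or NumPy:

```python
sentence = ("STX's better than reported quarter is likely "
            "to bode well for WDC results.")

message = None
try:
    # A whole sentence is not a valid base-10 integer literal.
    int(sentence)
except ValueError as exc:
    message = str(exc)

print(message)
```

So the cast to integers fails before sequences_to_matrix ever sees the data. The raw strings need to go through fit_on_texts and texts_to_sequences first.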
