As you said, you never store the data back. Let's create a function that does all the cleaning, then use map on the column (this is a bit more efficient than appending to a list inside a for loop).
import re
import nltk
import emoji  # note: emoji>=2.0 removed UNICODE_EMOJI in favor of emoji.EMOJI_DATA

nltk.download('words')  # needed once for the English word list
words = set(nltk.corpus.words.words())

def cleaner(tweet):
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)  # Remove @mentions
    tweet = re.sub(r"(?:@|https?://|www)\S+", "", tweet)  # Remove http links
    tweet = " ".join(tweet.split())  # Collapse extra whitespace
    tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI)  # Remove emojis
    tweet = tweet.replace("#", "").replace("_", " ")  # Remove hashtag sign but keep the text
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                     if w.lower() in words or not w.isalpha())  # Keep English words and non-alphabetic tokens
    return tweet
trump_df['tweet'] = trump_df['tweet'].map(cleaner)
trump_df.to_csv('') #specify location
This overwrites the tweet column with its cleaned version.
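To illustrate what the cleaning steps do, here is a minimal sketch using only the standard library (it skips the nltk and emoji steps, and the sample tweet is made up for demonstration):

```python
import re

def cleaner_sketch(tweet):
    # Remove @mentions
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)
    # Remove http/https/www links
    tweet = re.sub(r"(?:https?://|www\.)\S+", "", tweet)
    # Remove hashtag sign but keep the text
    tweet = tweet.replace("#", "").replace("_", " ")
    # Collapse extra whitespace
    return " ".join(tweet.split())

print(cleaner_sketch("Thanks @realDonaldTrump! #MAGA https://example.com rally"))
# → Thanks ! MAGA rally
```

The same function would be passed to map exactly like cleaner above.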
As mentioned, I think the approach above is a bit more efficient, but the alternative is to create a list before the for loop and fill it with each cleaned tweet:
clean_tweets = []
for tweet in trump_df['tweet']:
tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
##Here's where all the cleaning takes place
clean_tweets.append(tweet)
trump_df['tweet'] = clean_tweets
trump_df.to_csv('') #Specify location