
Cleaning Twitter data with pandas in Python

    Abhishek Rai · 4 years ago

    I'm trying to clean Twitter data held in a pandas DataFrame, and I seem to be missing a step. After processing all the tweets, I think I'm missing the part where the cleaned tweets overwrite the old ones? When I save the file, I see no changes to the tweets. What am I missing?

    import pandas as pd
    import re
    import emoji
    import nltk
    nltk.download('words')
    words = set(nltk.corpus.words.words())
    
    trump_df = pd.read_csv('new_Trump.csv')
    for tweet in trump_df['tweet']:
        tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
        tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet) #Remove http links
        tweet = " ".join(tweet.split())
        tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI) #Remove Emojis
        tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
        tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) \
             if w.lower() in words or not w.isalpha()) #Remove non-english tweets (not 100% success)
        print(tweet)
    trump_df.to_csv('new_Trump.csv')
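The core problem can be reproduced without any of the cleaning logic: reassigning the loop variable only rebinds a local name, it never writes back into the DataFrame. A minimal sketch (with hypothetical toy data):

```python
import pandas as pd

df = pd.DataFrame({"tweet": ["  hello  ", "  world  "]})

for tweet in df["tweet"]:
    tweet = tweet.strip()  # rebinds the local name 'tweet' only; df is untouched

# The column still holds the original, un-stripped values.
print(df["tweet"].tolist())
```

This is why saving the CSV afterwards shows no changes.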
    
    1 Answer

    Celius Stingher · 4 years ago

    As you say, you never store the data back. Let's create a function that does all the cleaning and apply it with `map`; this is also a bit more efficient than appending each cleaned tweet to a list inside a loop (the alternative shown further below).

    def cleaner(tweet):
        tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
        tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet) #Remove http links
        tweet = " ".join(tweet.split())
        tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI) #Remove Emojis
        tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
        tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) \
             if w.lower() in words or not w.isalpha())
        return tweet
    trump_df['tweet'] = trump_df['tweet'].map(lambda x: cleaner(x))
    trump_df.to_csv('') #specify location
    

    This overwrites the `tweet` column with its cleaned version.
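To see the `map` pattern end to end, here is a self-contained sketch with a deliberately simplified `cleaner` (only the @mention and hashtag steps, on hypothetical toy data), since the full version needs the `emoji` and `nltk` setup from the question:

```python
import re
import pandas as pd

def cleaner(tweet):
    # Simplified stand-in for the answer's cleaner: drop @mentions,
    # strip '#' signs, and collapse whitespace.
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)
    return " ".join(tweet.replace("#", "").split())

df = pd.DataFrame({"tweet": ["@user hello #world", "plain text"]})
df["tweet"] = df["tweet"].map(cleaner)
print(df["tweet"].tolist())  # ['hello world', 'plain text']
```

`df["tweet"].map(cleaner)` is equivalent to the lambda form in the answer; passing the function directly is the more idiomatic spelling.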

    As mentioned, I think the approach above is slightly more efficient, but the alternative is to create a list, fill it with each cleaned tweet inside a `for` loop, and assign it back:

    clean_tweets = []
    for tweet in trump_df['tweet']:
        tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
        ##Here's where all the cleaning takes place
        clean_tweets.append(tweet)
    trump_df['tweet'] = clean_tweets
    trump_df.to_csv('') #Specify location
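One pitfall with either approach: since the question reads and re-writes the same CSV, each `to_csv` call without `index=False` prepends a spurious index column on every round trip. A sketch using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"tweet": ["a", "b"]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # without index=False, an extra unnamed index column is written
buf.seek(0)

reloaded = pd.read_csv(buf)
print(list(reloaded.columns))  # ['tweet']
```

With the default `index=True`, reloading would instead yield columns `['Unnamed: 0', 'tweet']`.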