
Cleaning Twitter data with pandas in Python

    Abhishek Rai · 4 years ago

    I'm trying to clean Twitter data held in a pandas DataFrame, and I seem to be missing a step. After processing all the tweets, I think I'm missing the part where the cleaned tweets overwrite the old ones? When I save the file, I see no changes to the tweets. What am I missing?

    import pandas as pd
    import re
    import emoji
    import nltk
    nltk.download('words')
    words = set(nltk.corpus.words.words())
    
    trump_df = pd.read_csv('new_Trump.csv')
    for tweet in trump_df['tweet']:
        tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
        tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet) #Remove http links
        tweet = " ".join(tweet.split())
        tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI) #Remove Emojis
        tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
        tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) \
             if w.lower() in words or not w.isalpha()) #Remove non-english tweets (not 100% success)
        print(tweet)
    trump_df.to_csv('new_Trump.csv')
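The core problem can be reproduced without any of the cleaning logic: reassigning the loop variable only rebinds a local name, it never writes back into the DataFrame. A minimal sketch (with hypothetical toy data):

```python
import pandas as pd

df = pd.DataFrame({"tweet": ["  hello  ", "  world  "]})

for tweet in df["tweet"]:
    tweet = tweet.strip()  # rebinds the local name 'tweet' only; df is untouched

# The column still holds the original, un-stripped values.
print(df["tweet"].tolist())
```

This is why saving the CSV afterwards shows no changes.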
    
    1 Answer

    Celius Stingher · 4 years ago

    As you say, you never store the data back. Let's create a function that does all the cleaning and apply it with `map`; this is also a bit more efficient than appending each cleaned tweet to a list inside a loop (the alternative shown further below).

    def cleaner(tweet):
        tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
        tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet) #Remove http links
        tweet = " ".join(tweet.split())
        tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI) #Remove Emojis
        tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
        tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) \
             if w.lower() in words or not w.isalpha())
        return tweet
    trump_df['tweet'] = trump_df['tweet'].map(lambda x: cleaner(x))
    trump_df.to_csv('') #specify location
    

    This overwrites the `tweet` column with its cleaned version.
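To see the `map` pattern end to end, here is a self-contained sketch with a deliberately simplified `cleaner` (only the @mention and hashtag steps, on hypothetical toy data), since the full version needs the `emoji` and `nltk` setup from the question:

```python
import re
import pandas as pd

def cleaner(tweet):
    # Simplified stand-in for the answer's cleaner: drop @mentions,
    # strip '#' signs, and collapse whitespace.
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)
    return " ".join(tweet.replace("#", "").split())

df = pd.DataFrame({"tweet": ["@user hello #world", "plain text"]})
df["tweet"] = df["tweet"].map(cleaner)
print(df["tweet"].tolist())  # ['hello world', 'plain text']
```

`df["tweet"].map(cleaner)` is equivalent to the lambda form in the answer; passing the function directly is the more idiomatic spelling.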

    As mentioned, I think the approach above is slightly more efficient, but the alternative is to create a list, fill it with each cleaned tweet inside a `for` loop, and assign it back:

    clean_tweets = []
    for tweet in trump_df['tweet']:
        tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
        ##Here's where all the cleaning takes place
        clean_tweets.append(tweet)
    trump_df['tweet'] = clean_tweets
    trump_df.to_csv('') #Specify location
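One pitfall with either approach: since the question reads and re-writes the same CSV, each `to_csv` call without `index=False` prepends a spurious index column on every round trip. A sketch using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"tweet": ["a", "b"]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # without index=False, an extra unnamed index column is written
buf.seek(0)

reloaded = pd.read_csv(buf)
print(list(reloaded.columns))  # ['tweet']
```

With the default `index=True`, reloading would instead yield columns `['Unnamed: 0', 'tweet']`.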