
CSV file with labels

  •  Camilla8  ·  asked 6 years ago

    As shown here: Python Tf idf algorithm , I use this code to get the word frequencies over a set of documents.

    import pandas as pd
    import csv
    import os
    from sklearn.feature_extraction.text import TfidfVectorizer
    from nltk import word_tokenize
    from nltk.stem.porter import PorterStemmer
    import codecs
    
    def tokenize(text):
        # word-tokenize the text and stem every token with the Porter stemmer
        tokens = word_tokenize(text)
        stemmer = PorterStemmer()
        return [stemmer.stem(token) for token in tokens]
    
    with codecs.open("book1.txt",'r','utf-8') as i1,\
            codecs.open("book2.txt",'r','utf-8') as i2,\
            codecs.open("book3.txt",'r','utf-8') as i3:
        # your corpus
        t1=i1.read().replace('\n',' ')
        t2=i2.read().replace('\n',' ')
        t3=i3.read().replace('\n',' ')
    
        text = [t1,t2,t3]
        # word tokenize and stem
        text = [" ".join(tokenize(txt.lower())) for txt in text]
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(text).todense()
        # transform the matrix to a pandas df
        matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
        # sum over each document (axis=0)
        top_words = matrix.sum(axis=0).sort_values(ascending=False)
    
        top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")
    

    With the last line I create a csv file that lists all the words and their frequencies. Is there a way to label them, so that I can see whether a word belongs only to the third document or to all of them? My goal is to delete from the csv file all the words that appear only in the third document ( book3 ).
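
    For illustration, here is a minimal sketch of the kind of labelling I mean, reusing the matrix and top_words computed above; the column names and the file name dict_labelled.csv are just placeholders:

    # mark, for every word, the documents in which it has a nonzero tf-idf weight
    presence = matrix.gt(0)   # docs-by-words boolean table
    doc_labels = presence.apply(
        lambda col: ",".join("book%d" % (i + 1) for i, present in enumerate(col) if present),
        axis=0)
    # combine the summed weights with the document labels and write them out
    labelled = pd.DataFrame({"frequency": top_words, "documents": doc_labels})
    labelled.to_csv('dict_labelled.csv', index=True, encoding="utf-8")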

    1 reply  ·  6 years ago
  •   Gabriel  ·  answered 6 years ago

    You can use the isin() attribute to filter the top_words of the third book out of the top_words of the whole corpus.
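
    To make the masking step concrete, here is a tiny toy example (the words, values and the names corpus_words / book3_words are made up) of how Index.isin() builds the boolean mask used below:

    import pandas as pd

    corpus_words = pd.Series([3.0, 2.0, 1.0], index=["whale", "ship", "sea"])  # whole corpus
    book3_words = pd.Series([0.5, 0.4], index=["ship", "sea"])                 # third book only

    mask = ~corpus_words.index.isin(book3_words.index)  # True for words absent from the third book
    print(corpus_words[mask])                           # only "whale" survives the filter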

    (In the example below, I used three books downloaded from http://www.gutenberg.org/ )

    import codecs
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    # import nltk
    # nltk.download('punkt')
    from nltk import word_tokenize
    from nltk.stem.porter import PorterStemmer
    
    def tokenize(text):
        # word-tokenize the text and stem every token with the Porter stemmer
        tokens = word_tokenize(text)
        stemmer = PorterStemmer()
        return [stemmer.stem(token) for token in tokens]
    
    with codecs.open("56732-0.txt",'r','utf-8') as i1,\
            codecs.open("56734-0.txt",'r','utf-8') as i2,\
            codecs.open("56736-0.txt",'r','utf-8') as i3:
        # your corpus
        t1=i1.read().replace('\n',' ')
        t2=i2.read().replace('\n',' ')
        t3=i3.read().replace('\n',' ')
    
    text = [t1,t2,t3]
    # word tokenize and stem
    text = [" ".join(tokenize(txt.lower())) for txt in text]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(text).todense()
    # transform the matrix to a pandas df
    matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
    # sum over each document (axis=0)
    top_words = matrix.sum(axis=0).sort_values(ascending=False)
    
    # top_words for the 3rd book alone
    text = [" ".join(tokenize(t3.lower()))]
    matrix = vectorizer.fit_transform(text).todense()
    matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
    top_words3 = matrix.sum(axis=0).sort_values(ascending=False)
    
    # Mask out words in t3
    mask = ~top_words.index.isin(top_words3.index)
    # Filter those words from top_words
    top_words = top_words[mask]
    
    top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")
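
    Note that the mask above removes every word that occurs in the third book at all. If you only want to drop the words that are exclusive to book 3 (as stated in the question), a minimal variant is to build top_words for books 1 and 2 and keep the words that also appear there; this replaces the masking step above (applied to the unfiltered top_words), and the names text12, top_words12 and dict_filtered.csv are just placeholders:

    # top_words for books 1 and 2 alone (same recipe as for the 3rd book above)
    text12 = [" ".join(tokenize(t1.lower())), " ".join(tokenize(t2.lower()))]
    matrix12 = pd.DataFrame(vectorizer.fit_transform(text12).todense(),
                            columns=vectorizer.get_feature_names())
    top_words12 = matrix12.sum(axis=0)

    # keep a word only if it also occurs in book 1 or book 2,
    # i.e. drop the words that appear exclusively in book 3
    top_words_filtered = top_words[top_words.index.isin(top_words12.index)]
    top_words_filtered.to_csv('dict_filtered.csv', index=True, float_format="%f", encoding="utf-8")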