
CSV file with labels

  •  Camilla8  ·  asked 6 years ago

    As shown here: Python Tf idf algorithm , I use this code to get the word frequencies over a set of documents.

    import pandas as pd
    import csv
    import os
    from sklearn.feature_extraction.text import TfidfVectorizer
    from nltk import word_tokenize
    from nltk.stem.porter import PorterStemmer
    import codecs
    
    def tokenize(text):
        # word-tokenize the text and stem every token with the Porter stemmer
        tokens = word_tokenize(text)
        stemmer = PorterStemmer()
        return [stemmer.stem(token) for token in tokens]
    
    with codecs.open("book1.txt",'r','utf-8') as i1,\
            codecs.open("book2.txt",'r','utf-8') as i2,\
            codecs.open("book3.txt",'r','utf-8') as i3:
        # your corpus
        t1=i1.read().replace('\n',' ')
        t2=i2.read().replace('\n',' ')
        t3=i3.read().replace('\n',' ')
    
        text = [t1,t2,t3]
        # word tokenize and stem
        text = [" ".join(tokenize(txt.lower())) for txt in text]
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(text).todense()
        # transform the matrix to a pandas df
        matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
        # sum over each document (axis=0)
        top_words = matrix.sum(axis=0).sort_values(ascending=False)
    
        top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")
    

    With the last line I create a csv file that lists all the words and their frequencies. Is there a way to label them, so that I can see whether a word belongs only to the third document or to all of them? My goal is to delete from the csv file all the words that appear only in the third document ( book3 ).
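
    For illustration, here is a minimal sketch of the kind of labelling I mean, reusing the matrix and top_words computed above; the column names and the file name dict_labelled.csv are just placeholders:

    # mark, for every word, the documents in which it has a nonzero tf-idf weight
    presence = matrix.gt(0)   # docs-by-words boolean table
    doc_labels = presence.apply(
        lambda col: ",".join("book%d" % (i + 1) for i, present in enumerate(col) if present),
        axis=0)
    # combine the summed weights with the document labels and write them out
    labelled = pd.DataFrame({"frequency": top_words, "documents": doc_labels})
    labelled.to_csv('dict_labelled.csv', index=True, encoding="utf-8")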

    1 reply  ·  6 years ago
  •   Gabriel  ·  answered 6 years ago

    You can use the isin() attribute to filter the top_words of the third book out of the top_words of the whole corpus.
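
    To make the masking step concrete, here is a tiny toy example (the words, values and the names corpus_words / book3_words are made up) of how Index.isin() builds the boolean mask used below:

    import pandas as pd

    corpus_words = pd.Series([3.0, 2.0, 1.0], index=["whale", "ship", "sea"])  # whole corpus
    book3_words = pd.Series([0.5, 0.4], index=["ship", "sea"])                 # third book only

    mask = ~corpus_words.index.isin(book3_words.index)  # True for words absent from the third book
    print(corpus_words[mask])                           # only "whale" survives the filter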

    (In the example below, I used three books downloaded from http://www.gutenberg.org/ )

    import codecs
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    # import nltk
    # nltk.download('punkt')
    from nltk import word_tokenize
    from nltk.stem.porter import PorterStemmer
    
    def tokenize(text):
        # word-tokenize the text and stem every token with the Porter stemmer
        tokens = word_tokenize(text)
        stemmer = PorterStemmer()
        return [stemmer.stem(token) for token in tokens]
    
    with codecs.open("56732-0.txt",'r','utf-8') as i1,\
            codecs.open("56734-0.txt",'r','utf-8') as i2,\
            codecs.open("56736-0.txt",'r','utf-8') as i3:
        # your corpus
        t1=i1.read().replace('\n',' ')
        t2=i2.read().replace('\n',' ')
        t3=i3.read().replace('\n',' ')
    
    text = [t1,t2,t3]
    # word tokenize and stem
    text = [" ".join(tokenize(txt.lower())) for txt in text]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(text).todense()
    # transform the matrix to a pandas df
    matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
    # sum over each document (axis=0)
    top_words = matrix.sum(axis=0).sort_values(ascending=False)
    
    # top_words for the 3rd book alone
    text = [" ".join(tokenize(t3.lower()))]
    matrix = vectorizer.fit_transform(text).todense()
    matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
    top_words3 = matrix.sum(axis=0).sort_values(ascending=False)
    
    # Mask out words in t3
    mask = ~top_words.index.isin(top_words3.index)
    # Filter those words from top_words
    top_words = top_words[mask]
    
    top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")
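
    Note that the mask above removes every word that occurs in the third book at all. If you only want to drop the words that are exclusive to book 3 (as stated in the question), a minimal variant is to build top_words for books 1 and 2 and keep the words that also appear there; this replaces the masking step above (applied to the unfiltered top_words), and the names text12, top_words12 and dict_filtered.csv are just placeholders:

    # top_words for books 1 and 2 alone (same recipe as for the 3rd book above)
    text12 = [" ".join(tokenize(t1.lower())), " ".join(tokenize(t2.lower()))]
    matrix12 = pd.DataFrame(vectorizer.fit_transform(text12).todense(),
                            columns=vectorizer.get_feature_names())
    top_words12 = matrix12.sum(axis=0)

    # keep a word only if it also occurs in book 1 or book 2,
    # i.e. drop the words that appear exclusively in book 3
    top_words_filtered = top_words[top_words.index.isin(top_words12.index)]
    top_words_filtered.to_csv('dict_filtered.csv', index=True, float_format="%f", encoding="utf-8")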