
KNN text classification based on TF-IDF scores

  •  Somdip Dey  ·  5 years ago

    I have a CSV file (corpus.csv) containing graded abstracts (text) in the corpus, in the following format:

    Institute,    Score,    Abstract
    ----------------------------------------------------------------------
    UoM,    3.0,    Hello, this is abstract one
    UoM,    3.2,    Hello, this is abstract two and yet counting.
    UoE,    3.1,    Hello, yet another abstract but this is a unique one.
    UoE,    2.2,    Hello, please no more abstract.

    I am trying to create a KNN classification program in Python that can take a user-input abstract, e.g. "This is a new unique abstract", classify it against the closest abstracts in the corpus (CSV), and return the score/grade of the predicted abstract. How can I do that?

    I have the following code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from nltk.corpus import stopwords
    import numpy as np
    import pandas as pd
    from csv import reader,writer
    import operator as op
    import string
    
    #Read data from corpus
    r = reader(open('corpus.csv','r'))
    abstract_list = []
    score_list = []
    institute_list = []
    row_count = 0
    for row in list(r)[1:]:
        institute,score,abstract = row
        if len(abstract.split()) > 0:
          institute_list.append(institute)
          score = float(score)
          score_list.append(score)
          abstract = abstract.translate(str.maketrans('', '', string.punctuation)).lower()  # strip punctuation, then lowercase
          abstract_list.append(abstract)
          row_count = row_count + 1
    
    print("Total processed data: ", row_count)
    
    #Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn -->
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,4),
                         min_df = 0, stop_words = 'english', sublinear_tf=True)
    response = vectorizer.fit_transform(abstract_list)
    feature_names = vectorizer.get_feature_names()
    

    In the code above, how can I use the features from the TF-IDF computation for KNN classification? (probably using sklearn's KNeighborsClassifier)

    P.S. The class for this use case is the corresponding score/grade of the abstract.

    1 Answer

  •   Roee Anuar  ·  5 years ago

    KNN is a classification algorithm - this means you must have a class attribute. KNN can use the output of TF-IDF as the input matrix - TrainX, but you still need TrainY - the class for each row in your data. However, you can use a KNN regressor instead, with your score as the class variable:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from nltk.corpus import stopwords
    import numpy as np
    import pandas as pd
    from csv import reader,writer
    import operator as op
    import string
    from sklearn import neighbors
    
    #Read data from corpus
    r = reader(open('corpus.csv','r'))
    abstract_list = []
    score_list = []
    institute_list = []
    row_count = 0
    for row in list(r)[1:]:
        institute,score,abstract = row[0], row[1], row[2]
        if len(abstract.split()) > 0:
          institute_list.append(institute)
          score = float(score)
          score_list.append(score)
          abstract = abstract.translate(str.maketrans('', '', string.punctuation)).lower()  # strip punctuation, then lowercase
          abstract_list.append(abstract)
          row_count = row_count + 1
    
    print("Total processed data: ", row_count)
    
    #Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn -->
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,4),
                         min_df = 0, stop_words = 'english', sublinear_tf=True)
    response = vectorizer.fit_transform(abstract_list)
    classes = score_list
    feature_names = vectorizer.get_feature_names()
    
    # 1-nearest-neighbour regression: the target is the abstract's score
    clf = neighbors.KNeighborsRegressor(n_neighbors=1)
    clf.fit(response, classes)
    # Predicting on the training matrix returns each abstract's own score,
    # because with k=1 each training point is its own nearest neighbour
    clf.predict(response)
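
    To score a new, unseen abstract (the use case described in the question), transform it with the same fitted vectorizer and feed the result to predict. A minimal sketch, assuming the vectorizer and clf fitted above and a made-up input string:

    # Hypothetical user-input abstract (not from the corpus)
    new_abstract = "This is a new unique abstract"

    # Apply the same preprocessing as the training abstracts
    new_abstract = new_abstract.translate(str.maketrans('', '', string.punctuation)).lower()

    # transform() (not fit_transform()) reuses the vocabulary and IDF weights learned above
    new_vector = vectorizer.transform([new_abstract])

    # The regressor predicts from the score(s) of the nearest abstract(s) in the corpus
    predicted_score = clf.predict(new_vector)
    print("Predicted score: ", predicted_score[0])

    With n_neighbors=1 the prediction is simply the score of the single closest abstract; a larger n_neighbors would return the average score of the k closest abstracts.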