
Calculating top-n word pair co-occurrences from a document-term matrix

  •  1
  • Jane Sully  · 6 years ago

    I created a bag-of-words model using gensim. Although the real one is much longer, here is the format that gensim outputs when creating a bag-of-words document-term matrix from tokenized texts:

    id2word = corpora.Dictionary(texts)
    corpus = [id2word.doc2bow(text) for text in texts]
    
    [[(0, 2),
      (1, 1),
      (2, 1),
      (3, 1),
      (4, 11),
      (385, 1),
      (386, 2),
      (387, 3),
      (388, 1),
      (389, 1),
      (390, 1)],
     [(4, 31),
      (8, 2),
      (13, 2),
      (16, 2),
      (17, 2),
      (26, 1),
      (28, 4),
      (29, 1),
      (30, 1)]]
    

    This is a sparse-matrix representation, and as I understand it other libraries represent document-term matrices in a similar way. If the document-term matrix were dense (i.e., the zero entries were present as well), I know I would just need A.T @ A, since A has dimensions (number of documents × number of terms), so multiplying the two gives the term co-occurrences. Ultimately, I want to get the top n co-occurrences (i.e., the top n term pairs that appear together in the same texts). How can I achieve this? I am not wedded to gensim for creating the BoW model; if another library such as sklearn can do this more easily, I am very open to it. I would appreciate any advice/help/code on this problem - thanks!
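
    For illustration (not part of the original question), here is a toy dense example of why A.T @ A counts co-occurrences, using a made-up 2-documents-by-3-terms matrix:

    import numpy as np
    
    # Hypothetical dense document-term matrix: 2 documents x 3 terms
    A = np.array([[2, 1, 0],
                  [1, 3, 1]])
    
    # Entry (i, j) of A.T @ A sums count_i * count_j over the documents,
    # i.e. how strongly terms i and j co-occur
    A.T @ A
    # array([[ 5,  5,  1],
    #        [ 5, 10,  3],
    #        [ 1,  3,  1]])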

    1 Answer  |  6 years ago
        1
  •  2
  •   KRKirov    6 years ago

    Edit: below is how to carry out the matrix multiplication in question. Disclaimer: this may not be feasible for a very large corpus.

    sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    
    Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
    Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
    docs = [Doc1, Doc2]
    
    # Instantiate CountVectorizer and apply it to docs
    cv = CountVectorizer()
    doc_cv = cv.fit_transform(docs)
    
    # Display tokens
    cv.get_feature_names_out()  # cv.get_feature_names() on sklearn < 1.0
    
    # Display tokens (dict keys) and their numerical encoding (dict values)
    cv.vocabulary_
    
    # Matrix multiplication of the term matrix
    token_mat = doc_cv.toarray().T @ doc_cv.toarray()
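
    In token_mat, entry (i, j) is the co-occurrence count of tokens i and j, and cv.vocabulary_ maps each word to its index. A small usage sketch (the word pair is just an example taken from the two documents above):

    # Look up the co-occurrence count of a specific word pair
    i, j = cv.vocabulary_['australian'], cv.vocabulary_['open']
    token_mat[i, j]   # 1*3 + 1*1 = 4 for these two documents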
    

    Gensim:

    import gensim as gs
    import numpy as np
    
    cp = [[(0, 2),
      (1, 1),
      (2, 1),
      (3, 1),
      (4, 11),
      (7, 1),
      (11, 2),
      (13, 3),
      (22, 1),
      (26, 1),
      (30, 1)],
     [(4, 31),
      (8, 2),
      (13, 2),
      (16, 2),
      (17, 2),
      (26, 1),
      (28, 4),
      (29, 1),
      (30, 1)]]
    
    # Convert to dense vectors and perform the matrix multiplication; the
    # dense length must cover the largest token id across *all* documents
    vocab_size = max(t_id for doc in cp for t_id, _ in doc) + 1
    mat_1 = gs.matutils.sparse2full(cp[0], vocab_size).reshape(1, -1)
    mat_2 = gs.matutils.sparse2full(cp[1], vocab_size).reshape(1, -1)
    mat = np.append(mat_1, mat_2, axis=0)
    mat_product = mat.T @ mat
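
    To get the top-n pairs the question asks for (a sketch, not part of the original answer): mat_product is symmetric and its diagonal holds self-products, so keep one triangle without the diagonal and sort. The id2word dictionary from the question would map the ids back to tokens.

    # Top-n term pairs from the symmetric co-occurrence matrix
    pairs = np.triu(mat_product, k=1)              # upper triangle, diagonal excluded
    n = 5
    flat_idx = np.argsort(pairs, axis=None)[::-1][:n]
    top_pairs = [np.unravel_index(ix, pairs.shape) for ix in flat_idx]
    # Each entry is a (term_id_i, term_id_j) pair; map ids back with id2word[i]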
    

    For words that occur consecutively (co-occurrence in the sense of adjacency), you can prepare a list of bigrams for a set of documents and then count bigram occurrences with Python's Counter. Below is an example using nltk.

    import nltk
    from nltk.util import ngrams
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    from collections import Counter
    
    stop_words = set(stopwords.words('english'))
    
    # Get the tokens from the built-in collection of presidential inaugural
    # speeches (may require nltk.download('inaugural'), nltk.download('stopwords')
    # and nltk.download('wordnet') on first use)
    tokens = nltk.corpus.inaugural.words()
    
    # Further text preprocessing; note the stop-word check runs before
    # lowercasing, so capitalized stop words such as "I" slip through
    # (visible as ('i', 'shall') in the output below)
    tokens = [t.lower() for t in tokens if t not in stop_words]
    word_l = WordNetLemmatizer()
    tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]
    
    # Create bigram list and count bigrams
    bi_grams = list(ngrams(tokens, 2)) 
    counter = Counter(bi_grams)
    
    # Show the most common bigrams
    counter.most_common(5)
    Out[36]: 
    [(('united', 'state'), 153),
     (('fellow', 'citizen'), 116),
     (('let', 'u'), 99),
     (('i', 'shall'), 96),
     (('american', 'people'), 40)]
    
    # Query the occurrence of a specific bigram
    counter[('great', 'people')]
    Out[37]: 7