
A lemmatizing function using a hash dictionary does not work with the tm package in R

  • Jacek Kotowski  ·  7 years ago

    I would like to lemmatize Polish text using a large external dictionary (in the format of the txt variable below). I am not lucky enough to have a Polish option in the popular text-mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by Dmitriy Selivanov works with plain vectors of texts.

    Unfortunately, it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

    library(hashmap)
    library(data.table)
    txt = 
      "Abadan  Abadanem
      Abadan  Abadanie
      Abadan  Abadanowi
      Abadan  Abadanu
      abadańczyk  abadańczycy
      abadańczyk  abadańczykach
      abadańczyk  abadańczykami
      "
    dt = fread(txt, header = F, col.names = c("lemma", "word"))
    lemma_hm = hashmap(dt$word, dt$lemma)
    
    lemma_hm[["Abadanu"]]
    #"Abadan"
    
    
    lemma_tokenizer = function(x, lemma_hashmap, 
                               tokenizer = text2vec::word_tokenizer) {
      tokens_list = tokenizer(x)
      for(i in seq_along(tokens_list)) {
        tokens = tokens_list[[i]]
        replacements = lemma_hashmap[[tokens]]
        ind = !is.na(replacements)
        tokens_list[[i]][ind] = replacements[ind]
      }
      tokens_list
    }
    texts = c("Abadanowi abadańczykach OutOfVocabulary", 
              "abadańczyk Abadan OutOfVocabulary")
    lemma_tokenizer(texts, lemma_hm)
    
    #[[1]]
    #[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
    #[[2]]
    #[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"
    

    Now I would like to apply it to a tm corpus "docs". Here is the syntax I would use with tm:

    docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm"))
    

    I also tried defining a tokenizer to pass to DocumentTermMatrix:

    LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")
    
    docsTDM <-
      DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer))
    

    It throws me an error:

     Error in lemma_hashmap[[tokens]] : 
      attempt to select more than one element in vectorIndex 
    

    4 Answers  |  7 years ago
        1
  •   Damiano Fantini    7 years ago

    I see two problems. 1) Your custom function returns a list, while it should return a vector of strings; 2) a wrong lemma_hashmap argument is being passed.

    A quick fix for the first problem is to use paste() and sapply() before returning the function result.

    lemma_tokenizer = function(x, lemma_hashmap, 
                               tokenizer = text2vec::word_tokenizer) {
      tokens_list = tokenizer(x)
      for(i in seq_along(tokens_list)) {
        tokens = tokens_list[[i]]
        replacements = lemma_hashmap[[tokens]]
        ind = !is.na(replacements)
        tokens_list[[i]][ind] = replacements[ind]
      }
    
      # paste together, return a vector
      sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
    }
    

    We can run the same example as in your post.

    texts = c("Abadanowi abadańczykach OutOfVocabulary", 
              "abadańczyk Abadan OutOfVocabulary")
    lemma_tokenizer(texts, lemma_hm)
    [1] "Abadan abadańczyk OutOfVocabulary" "abadańczyk Abadan OutOfVocabulary"
    

    As for the second problem, pass the hashmap object lemma_hm itself, not the string "lemma_hm", when mapping over the corpus:

    docs <- SimpleCorpus(VectorSource(texts))
    out <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
    out[[1]]$content
    [1] "Abadan abadańczyk OutOfVocabulary"
    
        2
  •   Marcin    7 years ago
        3
  •   Ken Benoit    7 years ago

    Try using quanteda's dictionary() together with a paste() operation.

    txt <-  
        "Abadan  Abadanem
    Abadan  Abadanie
    Abadan  Abadanowi
    Abadan  Abadanu
    abadańczyk  abadańczycy
    abadańczyk  abadańczykach
    abadańczyk  abadańczykami"
    
    list_temp <- strsplit(readLines(textConnection(txt)), "\\s+")
    list_temp2 <- lapply(list_temp, "[", 2)
    names(list_temp2) <- sapply(list_temp, "[", 1)
    
    library("quanteda")
    polish_lemma_dict <- dictionary(list_temp2)
    # Dictionary object with 7 key entries.
    # - Abadan:
    #   - abadanem
    # - Abadan:
    #   - abadanie
    # - Abadan:
    #   - abadanowi
    # - Abadan:
    #   - abadanu
    # - abadańczyk: 
    #   - abadańczycy
    # - abadańczyk:
    #   - abadańczykach
    # - abadańczyk:
    #   - abadańczykami
    
    texts <- c("Abadanowi abadańczykach OutOfVocabulary", 
               "abadańczyk Abadan OutOfVocabulary")
    

    The texts can now be tokenized with quanteda and the lemmas looked up using tokens_lookup():

    require(magrittr)
    texts %>%
        tokens() %>%
        tokens_lookup(dictionary = polish_lemma_dict, exclusive = FALSE, capkeys = FALSE) %>%
        as.character() %>%
        paste(collapse = " ")
    # [1] "Abadan abadańczyk OutOfVocabulary abadańczyk Abadan OutOfVocabulary"
    
        4
  •   Jacek Kotowski    7 years ago

    Here is the complete, admittedly imperfect code that I used, drawing on the answers. Thanks go to many people; I list all the sources at the bottom. I realise it is crude, but it catches mice for me, i.e. I can use a txt dictionary of lemmas and my stopwords to classify Polish texts. Thanks to Damiano Fantini, Dmitriy Selivanov and many others.

    #----1. Set up. ----
    setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
    
    
    library(readtext)
    library(tm)
    library(proxy)
    library(stringi)
    library(stringr)
    library(hashmap)
    library(data.table)
    library(text2vec)
    
    # For reading n-grams
    library(RWeka) #(*)
    BigramTokenizer <- 
               function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3)) #(*)
    
    
    #----2. Read data. ----
    stopwordsPL <- as.vector(str_split(readLines("polish.stopwords.text",encoding = "UTF-8"), pattern = " ",simplify = T))
    
    
    docs <- VCorpus(DirSource(pattern="txt"))
    titles <- rownames(summary(docs))
    
    docs <- tm_map(docs, removeWords, words=stopwordsPL)
    docs <- tm_map(docs, tolower)
    docs <- tm_map(docs, function(x) stri_trans_general(x, "Latin-ASCII"))
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, stripWhitespace)
    
    # for English texts it would be simpler
    # docs <- tm_map(docs, removeWords, stopwords("english")) #can add other words to remove
    # docs <- tm_map(docs, stemDocument, "english")
    
    #====3. Lemmatize ====
    # # Dictionary from http://www.lexiconista.com/datasets/lemmatization/
    # lemmadict_file = "lemmatization-pl.text"
    # dt = fread(file= lemmadict_file, header = F, col.names = c("lemma", "word"), data.table=T, encoding="UTF-8")
    # # I threw away Polish letters, maybe changing locales may help.
    # dt$lemma <- stri_trans_general(dt$lemma, "Latin-ASCII;lower")
    # dt$word <- stri_trans_general(dt$word, "Latin-ASCII;lower")
    # dt <- unique(dt)
    # 
    # # Creating hash dictionary
    # lemma_hm = hashmap(dt$word, dt$lemma)
    # 
    # # Test if it works
    # lemma_hm[["mnozyl"]]
    # # [1] "mnozyc"
    # 
    # save_hashmap(lemma_hm, file="lemma_hm", overwrite = TRUE, compress = TRUE)
    
    lemma_hm <- load_hashmap(file="lemma_hm")
    
    lemma_tokenizer = function(x, lemma_hashmap, 
                               tokenizer = text2vec::word_tokenizer) {
      tokens_list = tokenizer(x)
      for(i in seq_along(tokens_list)) {
        tokens = tokens_list[[i]]
        replacements = lemma_hashmap[[tokens]]
        ind = !is.na(replacements)
        tokens_list[[i]][ind] = replacements[ind]
      }
      # paste together, return a vector
      sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
    }
    
    docs <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
    docs <- tm_map(docs, PlainTextDocument)
    
    #====4. Create document term matrix====
    
    docsTDM <-
      DocumentTermMatrix(docs, control = list(wordLengths = c(5, 25),tokenize = BigramTokenizer))  #  tokenize=LemmaTokenizer, tokenize = BigramTokenizer (*)
    
    
    docsTDM$dimnames
    
    #====5. Remove sparse and common words====
    
    docsTDM <- removeSparseTerms(docsTDM, .90)
    
    # https://stackoverflow.com/questions/25905144/removing-overly-common-words-occur-in-more-than-80-of-the-documents-in-r
    
    removeCommonTerms <- function (x, pct) 
    {
      stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
                is.numeric(pct), pct > 0, pct < 1)
      # work on a TermDocumentMatrix so that rows correspond to terms
      m <- if (inherits(x, "DocumentTermMatrix")) 
        t(x)
      else x
      # keep only terms that occur in fewer than pct of the documents
      t <- table(m$i) < m$ncol * (pct)
      termIndex <- as.numeric(names(t[t]))
      if (inherits(x, "DocumentTermMatrix")) 
        x[, termIndex]
      else x[termIndex, ]
    }
    
    
    docsTDM <-
      removeCommonTerms(docsTDM, .8) #remove terms that are in >=80% of the documents
    docsTDM$dimnames
    
    
    #====6. Cluster data (hclust). ====
    
    
    docsdissim <- dist(as.matrix(docsTDM), method = "cosine")
    
    docsdissim2 <- as.matrix(docsdissim)
    dim(docsdissim2)
    
    rownames(docsdissim2) <- titles
    colnames(docsdissim2) <- titles
    
    h <- hclust(docsdissim, method = "ward.D2")
    
    plot(h, labels = titles, sub = "")
    
    # Library hclust with p-values (pvclust)
    
    library(pvclust)
    
    h_pv <- pvclust(docsdissim2, method.hclust = "ward.D2", method.dist ="correlation")
    
    plot(h_pv)
    
    data.frame(cutree(tree = h_pv$hclust, k = 4))
    
    
    # pvclust provides two types of p-values: AU (Approximately Unbiased) p-value and BP (Bootstrap Probability) value. 
    # AU p-value, which is computed by multiscale bootstrap resampling, is a better approximation to unbiased p-value 
    # than BP value computed by normal bootstrap resampling.
    # AU p-value > 0.95 we can assume the clusters exist and may stably be 
    # observed if we increase the number of observations. 
    # (http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/)
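
    # Optionally (a sketch, not part of the original run): pvclust's pvrect() and
    # pvpick() can highlight and extract the clusters whose AU p-value exceeds a
    # threshold on the dendrogram plotted above.
    pvrect(h_pv, alpha = 0.95)   # rectangles around clusters with AU > 0.95
    pvpick(h_pv, alpha = 0.95)   # list of the members of those clusters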
    
    #==== Literature:====
    # Original article:
    # http://www.rexamine.com/2014/06/text-mining-in-r-automatic-categorization-of-wikipedia-articles/
    
    # Updates to make it work after some functions became obsolete:
    # https://stackoverflow.com/questions/34423823/r-automatic-categorization-of-wikipedia-articles
    # https://stackoverflow.com/questions/34372166/error-using-termdocumentmatrix-and-dist-functions-in-r
    #
    # Based on that:
    # http://brazenly.blogspot.co.uk/2015/02/r-categorization-clustering-of.html
    #
    # Sparse terms:
    # https://stackoverflow.com/questions/28763389/how-does-the-removesparseterms-in-r-work
    
    # Lemmatizing function:
    # https://stackoverflow.com/questions/46122591/a-lemmatizing-function-using-a-hash-dictionary-does-not-work-with-tm-package-in
    # https://stackoverflow.com/questions/45762559/lemmatization-using-txt-file-with-lemmes-in-r/45790325#45790325