
Counting words in a text based on a lexicon

r
  •  Banjo  ·  6 years ago

    I have data like this:

    library(dplyr)
    
    glimpse(samp)
    Observations: 10
    Variables: 2
    $ text <chr> "@VirginAmerica What @dhepburn said.", "@VirginAmerica plus you've ...
    $ airline_sentiment <chr> "neutral", "positive", "neutral", "negative", "negative", "negative...
    

    I want to compare the words in the text variable with the words in a lexicon, i.e. I want to count how often each word from the lexicon occurs in the text.

    The lexicon looks like this:

    library(lexicon)
    hash_sentiment_sentiword[1:5]
                 x     y
    1:    365 days -0.50
    2:    366 days  0.25
    3:         3tc -0.25
    4:  a fortiori  0.25
    5: a good deal  0.25
    

    I know there are functions like str_detect. But with those I only get TRUE/FALSE values.
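
    To illustrate (a minimal sketch): str_detect only reports whether a pattern occurs, whereas stringr::str_count at least returns how often it occurs per string — which is still not the lexicon-based table I am after:

    library(stringr)
    # str_detect() only says whether a pattern occurs:
    str_detect("great flight, great crew", "great")
    # [1] TRUE
    # str_count() returns how often it occurs:
    str_count("great flight, great crew", "great")
    # [1] 2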

    The result should look like this (pseudocode):

       text     x        y      n    
    1. word 1   word 1   score  2
    2. word 2   word 2   score  1
    3. word 3   word 3   score  10
    4. word 4   word 4   score  0
    5. word 5   word 5   score  0
    ...
    

    text: a word from the text column of samp; x and y: the x and y columns of hash_sentiment_sentiword; n: how often a word from x occurs in the text. For example, the word "awesome" is in x and occurs once in the text, so for "awesome" n should be 1. "country" is not in x but does appear in the text, so its n is 0.

    Here is a small dput():

    dput(samp)
    
    structure(list(text = c("@VirginAmerica Thanks!", "@VirginAmerica SFO-PDX schedule is still MIA.", 
    "@VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo", 
    "@VirginAmerica  I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me. HELP!", 
    "I <U+2764><U+FE0F> flying @VirginAmerica. <U+263A><U+FE0F><U+0001F44D>", 
    "@VirginAmerica you know what would be amazingly awesome? BOS-FLL PLEASE!!!!!!! I want to fly with only you."
    ), airline_sentiment = c("positive", "negative", "positive", 
    "negative", "positive", "positive")), row.names = 15:20, class = "data.frame")
    
    2 Answers  |  6 years ago
        1
  •   phiver  ·  6 years ago

    One way of doing this is with tidytext; there are about as many approaches as there are text-mining packages. I chose tidytext because you are using dplyr, and it works nicely with it. I am using an inner join to join the lexicon to your data. If you want to keep the words that have no match in the lexicon, change it to a left join (a variant is sketched after the output below).

    library(tidytext)
    library(dplyr)
    samp %>% 
      unnest_tokens(text, output = "words", token = "tweets") %>% 
      inner_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>% 
      group_by(words, y) %>% 
      summarise(n = n()) 
    
    # A tibble: 20 x 3
    # Groups:   words [?]
       words          y     n
       <chr>      <dbl> <int>
     1 about      0.25      1
     2 amazingly  0.125     1
     3 cross     -0.75      1
     4 due        0.25      1
     5 excited    0         1
     6 first      0.375     1
     7 fly       -0.5       1
     8 fully      0.375     1
     9 help       0.208     1
    10 know       0.188     1
    11 large     -0.25      1
    12 last      -0.208     1
    13 lax       -0.375     1
    14 on         0.125     1
    15 please     0.125     1
    16 side      -0.125     1
    17 still     -0.107     1
    18 thanks     0         1
    19 virgin     0.25      1
    20 want       0.125     1
    
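    As noted above, to keep the words that have no match in the lexicon, swap the inner_join for a left_join; unmatched words then get NA as their y score (a sketch reusing the same pipeline):

    samp %>% 
      # token = "tweets" follows the answer above; recent tidytext
      # versions may no longer ship this tokenizer
      unnest_tokens(text, output = "words", token = "tweets") %>% 
      # left_join keeps words without a lexicon match (y becomes NA)
      left_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>% 
      group_by(words, y) %>% 
      summarise(n = n())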

    Extra information on tidytext: tidy text mining with R

    CRAN task view: Natural Language Processing

    Other packages: quanteda, qdap, sentimentr, udpipe

        2
  •   niko  ·  6 years ago

    Here is a base R solution:

    # create an array of all the words in samp$text
    # optional: use regex to remove punctuation symbols (this can be refined)
    textWords <- unlist(strsplit(gsub('[[:punct:]]','',samp$text,perl=TRUE), ' '))
    # count occurrences of each word and store them in a data frame
    occurences <- unique(data.frame(text = textWords, 
                                    n = as.integer(ave(textWords, textWords, FUN = length)), 
                                    stringsAsFactors = FALSE))
    
    # get words of x with scores y
    xWordsList <- setNames(strsplit(lexicon::hash_sentiment_sentiword$x, ' '), 
                           lexicon::hash_sentiment_sentiword$y)
    
    # create the result data frame
    res <- data.frame(x = unlist(xWordsList), y = rep(names(xWordsList), lengths(xWordsList)))
    rm(xWordsList) # removing as object is rather large and not needed anymore
    
    # subset to keep only x elements which are in text
    res <- res[res$x %in% textWords,]
    # match occurences
    res$n <- vapply(1:nrow(res), 
                    function (k) occurences$n[occurences$text == res$x[k]], 
                    integer(1))
    rownames(res) <- 1:nrow(res)
    
    # a glimpse at the result
    head(res)
    #       x      y n
    # 1 great 0.3125 1
    # 2    in -0.125 1
    # 3 about   0.25 1
    # 4    of  0.125 1
    # 5    of -0.125 1
    # 6    to  0.125 4
    

    This can be enhanced in various places (e.g. via .subset2 or by refining the regex). Also note that I omitted the text column in res because, by definition, it is identical to the x column.
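
    For instance, the row-by-row vapply() lookup can be replaced by a single vectorized match() call (a sketch reusing the objects defined above; the result is identical because res$x was already restricted to words present in occurences$text):

    # vectorized alternative to the vapply() loop above:
    # match() finds, for each res$x, the position of its (unique)
    # row in occurences, so n can be taken in one subsetting step
    res$n <- occurences$n[match(res$x, occurences$text)]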