Predicting a numeric variable from a text variable using word embeddings in R

  • John  · Tech Community  · 2 years ago

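    I have a text variable containing movie reviews and another variable containing ratings. I would like to predict the ratings from the review text.

    Here is some example data: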

    library(tibble)  # provides tibble()

    movie_reviews <- c("I really loved the movie plot", "This movie really sucked", "I really found this movie thought provoking", "ahh what a boring movie", "A wonderful movie, with a wonderful end", "Great action movie: Very thrilling", "Worst movie ever, it never stopped being cheesy", "Enjoying, feelgood movie for the entire family", "I will definitely watch this movie again")

    movie_ratings <- c(8, 2, 6, 3, 9, 8.5, 3.5, 9.5, 7.5)

    movie_df <- tibble(movie_reviews, movie_ratings)

    Thanks a lot.

    1 answer  |  as of 2 years ago
  •   Oscar Kjell    2 years ago

    You can use the text package for this.

    library(text)

    # Create word embedding representations of your text
    help(textEmbed)
    reviews_embeddings <- textEmbed(movie_df,
                                    model = "bert-base-uncased", # Select the model you want from huggingface
                                    layers = 11:12) # Select which layers you want to use

    # Train a model that predicts the numeric variable from the word embeddings, using ridge regression
    # (depending on the installed version of text, the embeddings may instead
    # be stored under reviews_embeddings$texts$movie_reviews)
    reviews_rating_model <- textTrain(reviews_embeddings$movie_reviews,
                                      movie_df$movie_ratings)
    # See the results
    reviews_rating_model

    Results

    $results
    
        Pearson's product-moment correlation
    
    data:  predy_y$predictions and predy_y$y
    t = 5.621, df = 7, p-value = 0.0003991
    alternative hypothesis: true correlation is greater than 0
    95 percent confidence interval:
     0.6785761 1.0000000
    sample estimates:
          cor 
    0.9047823
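
    To make the printed test easier to read, here is a minimal base R sketch that recomputes the t statistic, the one-sided p-value, and the lower confidence bound from the reported correlation and degrees of freedom. It assumes the output comes from a standard one-sided Pearson correlation test (as the header suggests) and copies `r` and `df` from the output above; the variable names are only illustrative and nothing here calls the text package.

```r
# Values copied from the output above
r  <- 0.9047823   # correlation between predicted and observed ratings
df <- 7           # degrees of freedom (n - 2)
n  <- df + 2      # number of reviews

# t statistic for testing a Pearson correlation against zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# One-sided p-value (alternative: true correlation is greater than 0)
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

# Lower bound of the one-sided 95% interval via the Fisher z-transform
ci_lower <- tanh(atanh(r) - qnorm(0.95) / sqrt(n - 3))

print(c(t = t_stat, p = p_value, ci_lower = ci_lower))
```

    The three values should agree with the printed output up to rounding. Note that the interval runs from roughly 0.68 to 1, which reflects how few reviews the example contains.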