
Spanish characters in an R term frequency matrix built with tm and quanteda

  •  3
  • Beep  · 6 years ago

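    I am trying to learn how to do text analysis with Twitter data, and I have run into a problem when creating a term frequency matrix. I can create the corpus from Spanish text (with special characters) without any problem.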

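    However, when I create the term frequency matrix (with either the quanteda or the tm library), the Spanish characters are not displayed as expected: instead of canción, I see canciÃ³n.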

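    Any suggestions on how to get the term frequency matrix to store the text with the correct characters?

    Thank you for any help.

    Note: I would prefer to use the quanteda library, because I will ultimately create a wordcloud and I think I understand that library's approach better. I am also using a Windows machine.

    I tried Encoding(tw2) <- "UTF-8", with no luck.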

    library(dplyr)
    library(tm)
    library(quanteda)
    
    #' Creating a character with special Spanish characters:
    tw2 <- "RT @None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción  . https://t."
    
    
    # Clean the tweet: lowercase it, then remove retweet markers, mentions,
    # punctuation, http links and extra spaces:
    clean_tw2 <- tolower(tw2)
    clean_tw2 = gsub("&amp", "", clean_tw2)
    clean_tw2 = gsub("(rt|via)((?:\\b\\W*@\\w+)+)", "", clean_tw2)
    clean_tw2 = gsub("@\\w+", "", clean_tw2)
    clean_tw2 = gsub("[[:punct:]]", "", clean_tw2)
    clean_tw2 = gsub("http\\w+", "", clean_tw2)
    clean_tw2 = gsub("[ \t]{2,}", " ", clean_tw2)  # collapse runs of spaces/tabs into one space so words are not joined
    clean_tw2 = gsub("^\\s+|\\s+$", "", clean_tw2) 
    
    # creates a vector with common stopwords, and other words which I want removed.
    myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
    clean_tw2 <- (removeWords(clean_tw2,myStopwords))
    
    # If we print clean_tw2 we see that all the characters are displayed as expected.
    clean_tw2
    
    #'Create Corpus Using quanteda library
    corp_quan<-corpus(clean_tw2)
    # The corpus created via quanteda, displays the characters as expected.
    corp_quan$documents$texts
    
    #'Create Corpus Using tm library
    corp_td<-Corpus(VectorSource(clean_tw2))
    #' Remove common words from spanish from the Corpus.
    #' If we inspect the corp_td, we see that the characters and words are displayed correctly
    inspect(corp_td)
    
    # Create the DFM with quanteda library.
    tdm_quan<-dfm(corp_quan)
    # Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
    tdm_quan
    
    # Create the TDM with tm library
    tdm_td<-TermDocumentMatrix(corp_td)
    
    # Here we see that the Spanish characters are displayed incorrectly (e.g. canción = canciÃ), and "si" is missing.
    tdm_td$dimnames$Terms
    
    2 answers  |  last active 6 years ago
        1
  •  1
  •   phiver    6 years ago

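    It looks like quanteda (and tm) loses the encoding when creating the DFM on the Windows platform. In this tidytext question, the same problem occurs with unnest_tokens. That works fine now, and quanteda's tokens also works fine. If I enforce UTF-8 or latin1 encoding on the @Dimnames$features of the dfm, you get the correct results.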

    ....
    previous code
    .....
    tdm_quan<-dfm(corp_quan)
    # Here we see that the Spanish characters are displayed incorrectly, for example: canción = canciÃ³n
    tdm_quan
    Document-feature matrix of: 1 document, 8 features (0% sparse).
    1 x 8 sparse Matrix of class "dfm"
           features
    docs    enmascarados si masduro chingÃ¡n quieres aguantas canciÃ³n t
      text1            1  2       1        1       1        1        1 1
    

    If you then do the following:

    Encoding(tdm_quan@Dimnames$features) <- "UTF-8"
    tdm_quan
    Document-feature matrix of: 1 document, 8 features (0% sparse).
    1 x 8 sparse Matrix of class "dfm"
           features
    docs    enmascarados si masduro chingán quieres aguantas canción t
      text1            1  2       1       1       1        1       1 1
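
    To illustrate why re-declaring the encoding helps, here is a minimal base R sketch that does not use quanteda at all. It assumes the feature names hold valid UTF-8 bytes that are merely missing their encoding mark, which is consistent with the output above but is not confirmed here. The variable names (word, bytes, garbled, feat) are made up for the illustration.

```r
# "canción" written with an escape, so this script itself stays pure ASCII.
word <- "canci\u00f3n"

# The raw UTF-8 bytes: the "ó" is stored as the two bytes c3 b3.
bytes <- charToRaw(enc2utf8(word))

# Misreading those bytes as latin1 turns each byte into its own character,
# which is how a garbled form such as "canciÃ³n" can arise.
garbled <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")
nchar(garbled)  # one character longer than the original word

# A string holding the same bytes but without an encoding mark
# (assumed to resemble the dfm feature names on Windows).
feat <- rawToChar(bytes)
Encoding(feat)

# Declaring the encoding changes only the mark, not the bytes.
Encoding(feat) <- "UTF-8"
identical(feat, word)
```

    The extra character in the garbled form also matches the wider chingÃ¡n and canciÃ³n columns in the first printout compared with the second.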
    
        2
  •  1
  •   Ken Benoit    6 years ago

    Let me guess... are you using Windows? It works fine on macOS:

    clean_tw2
    ## [1] "enmascarados si masduro chingán   si quieres   aguantas canción"
    Encoding(clean_tw2)
    ## [1] "UTF-8"
    dfm(clean_tw2)
    ## Document-feature matrix of: 1 document, 7 features (0% sparse).
    ## 1 x 7 sparse Matrix of class "dfm"
    ##        features
    ## docs    enmascarados si masduro chingán quieres aguantas canción
    ##   text1            1  2       1       1       1        1       1
    

    My system info:

    sessionInfo()
    # R version 3.4.4 (2018-03-15)
    # Platform: x86_64-apple-darwin15.6.0 (64-bit)
    # Running under: macOS High Sierra 10.13.4
    # 
    # Matrix products: default
    # BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
    # LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
    # 
    # locale:
    # [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
    # 
    # attached base packages:
    # [1] stats     graphics  grDevices utils     datasets  methods   base     
    # 
    # other attached packages:
    # [1] tm_0.7-3       NLP_0.1-11     dplyr_0.7.4    quanteda_1.1.6
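
    To compare a Windows session against the one above, a small base R check along these lines may help. It is only a diagnostic sketch: the string is a hypothetical stand-in for the cleaned tweet, and whether converting the input with enc2utf8() is enough to avoid the garbled features on Windows is an assumption that has not been tested here.

```r
# Summarise the locale settings that influence how unmarked strings are read.
loc_info <- l10n_info()
is_utf8_locale <- isTRUE(loc_info[["UTF-8"]])
Sys.getlocale("LC_CTYPE")  # the character-type locale of this session
is_utf8_locale             # whether R considers the session locale to be UTF-8

# Hypothetical stand-in for the cleaned tweet from the question.
clean_tw2 <- "enmascarados si masduro ching\u00e1n si quieres aguantas canci\u00f3n"

# enc2utf8() converts to UTF-8 and marks non-ASCII strings accordingly.
clean_tw2 <- enc2utf8(clean_tw2)
Encoding(clean_tw2)
```

    If the locale is not UTF-8, checking Encoding() on the text before and after each processing step shows where the mark is lost.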