
Why does featnames(myDFM) contain features of more than one or two tokens?

  • Doug Fir  · 7 years ago

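    I am working with a large corpus of 1M documents and applied several transformations while creating a document-feature matrix (dfm) from it: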

    library(quanteda)
    corpus_dfm <- dfm(tokens(corpus1M), # where corpus1M is already a corpus via quanteda::corpus()
                      remove = stopwords("english"),
                      #what = "word", #experimented if adding this made a difference
                      remove_punct = T,
                      remove_numbers = T,
                      remove_symbols = T,
                      ngrams = 1:2,
                      dictionary = lut_dict,
                      stem = TRUE)
    

    Then I looked at the resulting features:

    dimnames(corpus_dfm)$features
    [1] "abandon"                                      
    [2] "abandoned auto"                               
    [3] "abandoned vehicl"
    ...
    [8] "accident hit and run"
    ...
    [60] "assault no weapon aggravated injuri" 
    

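    Why are these features longer than the unigrams and bigrams requested with ngrams = 1:2? Stemming appears to have worked, but the tokens look like sentences rather than words.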

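    I also tried tokenizing with tokens(corpus1M, what = "word") inside the dfm() call, but nothing changed.

    I then tried a tiny reproducible example: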

    library(tidyverse) # just for the pipe here
    example_text <- c("the quick brown fox",
                      "I like carrots",
                      "the there that etc cats dogs") %>% corpus
    

    Then, if I apply the same dfm() call as above:

    > dimnames(corpus_dfm)$features
    [1] "etc."
    

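    This is surprising, because nearly all of the words have been removed. That is not even what happened with the full corpus, so now I am more confused.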

    How do I create a dfm in quanteda that contains only tokens of one or two words, with stopwords removed?

    1 answer

  • Ken Benoit  · 7 years ago

    First question: why are the feature names in the dfm so long?

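    Answer: because applying the dictionary in the dfm() call replaces the matched unigrams and bigrams with the dictionary keys, and many of the keys in your dictionary consist of several words. For example: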

    lut_dict[70:72]
    # Dictionary object with 3 key entries.
    # - assault felony:
    #     - asf
    # - assault misdemeanor:
    #     - asm
    # - assault no weapon aggravated injury:
    #     - anai
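
    The dictionary printed above maps short codes to multi-word keys. As a minimal sketch of what that does to feature names, here is a made-up two-entry dictionary (toy_dict, its entries, and the two sample texts are hypothetical, not taken from lut_dict). It uses tokens_lookup(), which as far as I can tell performs the same value-to-key substitution that the dictionary argument triggers inside dfm():

```r
library(quanteda)

# Hypothetical lookup dictionary: multi-word keys, short codes as values
toy_dict <- dictionary(list(
  "assault felony" = "asf",
  "abandoned auto" = "aba"
))

toks <- tokens(c(d1 = "asf reported downtown",
                 d2 = "aba on main street"))

# Each matched token is replaced by its dictionary key;
# with the default exclusive = TRUE, unmatched tokens are dropped
toy_dfm <- dfm(tokens_lookup(toks, dictionary = toy_dict))

# The feature names are the multi-word keys, not single words
featnames(toy_dfm)
```

    So a feature such as "assault no weapon aggravated injuri" is a (stemmed) dictionary key rather than a five-word n-gram.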
    

    Second question: in the reproducible example, why are nearly all of the words gone?

    Answer: because the only dictionary value that matched a feature in the dfm belongs to the "etc." key.

    corpus_dfm2 <- dfm(tokens(example_text), # example_text is the small corpus from the question
                      remove = stopwords("english"),
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE,
                      dictionary = lut_dict,
                      ngrams = 1:2,
                      stem = TRUE, verbose = TRUE)
    corpus_dfm2
    # Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
    # 3 x 1 sparse Matrix of class "dfmSparse"
    #        features
    # docs    etc.
    #   text1    0
    #   text2    0
    #   text3    1
    
    lut_dict["etc."]
    # Dictionary object with 1 key entry.
    # - etc.:
    #     - etc
    

    Without the dictionary argument, the same call keeps the unigram and bigram features:

    dfm(tokens(example_text),   # the "tokens" is not necessary here
        remove = stopwords("english"),
        remove_punct = TRUE,
        remove_numbers = TRUE,
        remove_symbols = TRUE,
        ngrams = 1:2,
        stem = TRUE)
    # Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
    # 3 x 18 sparse Matrix of class "dfmSparse"
    #        features
    # docs    quick brown fox the_quick quick_brown brown_fox like carrot i_like
    #   text1     1     1   1         1           1         1    0      0      0
    #   text2     0     0   0         0           0         0    1      1      1
    #   text3     0     0   0         0           0         0    0      0      0
    #        features
    # docs    like_carrot etc cat dog the_there there_that that_etc etc_cat cat_dog
    #   text1           0   0   0   0         0          0        0       0       0
    #   text2           1   0   0   0         0          0        0       0       0
    #   text3           0   1   1   1         1          1        1       1       1
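
    The call above relies on dfm() doing the stopword removal, n-gram formation, and stemming itself. A sketch of the same steps written out with the tokens_*() functions follows; I believe this is the route newer quanteda versions expect, but treat that as an assumption. The variable names are illustrative. Note one deliberate difference: stopwords are removed before the n-grams are formed, so bigrams such as the_quick do not appear.

```r
library(quanteda)

example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs")

# Tokenize and strip punctuation, numbers, and symbols
toks <- tokens(example_text,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Drop stopwords first, so they cannot end up inside a bigram
toks <- tokens_remove(toks, pattern = stopwords("english"))

# Stem the remaining words, then build unigrams and bigrams
toks <- tokens_wordstem(toks)
toks <- tokens_ngrams(toks, n = 1:2)

clean_dfm <- dfm(toks)
featnames(clean_dfm)
```

    Because the stopwords are dropped outright, two words that were separated by a stopword can end up next to each other and form a bigram; tokens_remove() has a padding argument that is meant for that case.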
    

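    If you want to keep the features that are not matched, replace dictionary with thesaurus. In the output below, the "etc" token has been replaced by the upper-case "ETC." key: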

    dfm(tokens(example_text), 
        remove = stopwords("english"),
        remove_punct = TRUE,
        remove_numbers = TRUE,
        remove_symbols = TRUE,
        thesaurus = lut_dict,
        ngrams = 1:2,
        stem = TRUE)
    # Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
    # 3 x 18 sparse Matrix of class "dfmSparse"
    #        features
    # docs    quick brown fox the_quick quick_brown brown_fox like carrot i_like
    #   text1     1     1   1         1           1         1    0      0      0
    #   text2     0     0   0         0           0         0    1      1      1
    #   text3     0     0   0         0           0         0    0      0      0
    #        features
    # docs    like_carrot cat dog the_there there_that that_etc etc_cat cat_dog ETC.
    #   text1           0   0   0         0          0        0       0       0    0
    #   text2           1   0   0         0          0        0       0       0    0
    #   text3           0   1   1         1          1        1       1       1    1
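
    The same keep-the-unmatched behaviour can also be sketched at the token level. The example below assumes that tokens_lookup() with exclusive = FALSE is the counterpart of the thesaurus argument; toy_dict is a hypothetical one-entry dictionary that mirrors the "etc." key shown earlier, not the real lut_dict.

```r
library(quanteda)

# Hypothetical one-entry dictionary mirroring the "etc." key
toy_dict <- dictionary(list("etc." = "etc"))

toks <- tokens("the there that etc cats dogs")

# exclusive = FALSE keeps unmatched tokens and swaps each match for its key
kept <- tokens_lookup(toks, dictionary = toy_dict, exclusive = FALSE)

as.character(kept)
```

    The unmatched tokens such as "cats" and "dogs" survive, and only the "etc" token is replaced by its key.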