
Replacing multiple ngrams in quanteda

  • spindoctor  · 6 years ago

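    In my text of news articles, I would like to convert several different ngrams that refer to the same political party into a single acronym. I want to do this so that sentiment dictionaries do not confuse the words in a party's name (Liberal Party) with the same word used in a different context (liberal helping).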

    I can do this below with str_replace_all, and I know about the tokens_compound() function in quanteda, but it does not seem to do exactly what I need.

    library(stringr)
    text<-c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
    text1<-str_replace_all(text, '(liberal party)|liberals', 'olp')
    text2<-str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')
    

    Is there a way to do the same thing within quanteda?

    Here is some expanded example code that illustrates the problem better:

    library("quanteda")

    text <- c('a text about some political parties called the new democratic party 
    the new democrats and the liberal party and the liberals. I would like the 
    word democratic to be counted in the dfm but not the words new democratic. 
    The same goes for liberal helpings but not liberal party')
    partydict <- dictionary(list(
        olp = c("liberal party", "liberals"),
        ndp = c("new democrats", "new democratic party"),
        sentiment = c('liberal', 'democratic')
    ))
    
    dfm(text, dictionary = partydict)
    

    This example counts democratic both on its own and inside new democratic, which makes sense, but I would like them counted separately.

    1 answer

  • Ken Benoit  · 6 years ago

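    You want the function tokens_lookup(), applied with a dictionary that defines the canonical party labels as keys and lists all of the ngram variants of each party name as values. Setting exclusive = FALSE keeps the tokens that are not matched, so the lookup in effect substitutes the canonical party label for every variant.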

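    In the example below, I have modified your input text a little to show that the party names are combined differently from phrases that use "liberal" on its own rather than "liberal party".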

    library("quanteda")
    
    text<-c('a text about some political parties called the new democratic party 
             which is conservative the new democrats and the liberal party and the 
             liberals which are liberal helping poor people')
    toks <- tokens(text)
    
    partydict <- dictionary(list(
        olp = c("liberal party", "the liberals"),
        ndp = c("new democrats", "new democratic party")
    ))
    
    (toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
    ## tokens from 1 document.
    ## text1 :
    ##  [1] "a"            "text"         "about"        "some"         "political"    "parties"     
    ##  [7] "called"       "the"          "NDP"          "which"        "is"           "conservative"
    ## [13] "the"          "NDP"          "and"          "the"          "OLP"          "and"         
    ## [19] "OLP"          "which"        "are"          "liberal"      "helping"      "poor"        
    ## [25] "people"   
    

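    So this has replaced the variants of the party names with the party keys. Building a dfm from these new tokens and then applying a sentiment dictionary counts "liberal" in "liberal helping" without confusing it with the "liberal" in the party name.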

    sentdict <- dictionary(list(
        left = c("liberal", "left"),
        right = c("conservative")
    ))
    
    # Build a dfm from the replaced tokens, then map sentiment words to their keys
    dfm_lookup(dfm(toks2), dictionary = sentdict, exclusive = FALSE)
    ## Document-feature matrix of: 1 document, 19 features (0% sparse).
    ## 1 x 19 sparse Matrix of class "dfm"
    ##        features
    ## docs    olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
    ##  text1   2   2 1    1     1    1         1       1      1   3     2  1     1   2    1   1       1
    ##        features
    ## docs    poor people
    ##  text1    1      1
    

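    Two additional notes:

    1. If you do not want the keys uppercased in the replacement tokens, set capkeys = FALSE.

    2. You can set different match types with the valuetype argument, including valuetype = "regex". (Note that the regular expression in your example may not be formed as intended, since the scope of the | operator in the ndp example would match "new democrats" or "new" followed by "democratic party". But with tokens_lookup() you will not need to worry about that.)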
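As a rough sketch of the two notes above, the snippet below combines capkeys = FALSE with valuetype = "regex". The sample sentence and the dictionary name partydict_regex are made up for illustration and are not part of the original example. It assumes that each whitespace-separated element of a dictionary value is matched against one token, so every element is anchored with ^ and $ to avoid partial matches such as "news".

```r
library("quanteda")

# Hypothetical input, shortened from the example text above
toks <- tokens("the new democratic party and the new democrats oppose the liberal party")

# Hypothetical dictionary: each element of a multi-word value is a regular
# expression for a single token, anchored so that it must match the whole token
partydict_regex <- dictionary(list(
    olp = "^liberal$ ^party$",
    ndp = c("^new$ ^democrats$", "^new$ ^democratic$ ^party$")
))

# capkeys = FALSE keeps the keys in lower case in the replaced tokens;
# exclusive = FALSE keeps the tokens that did not match any value
toks_regex <- tokens_lookup(toks, partydict_regex,
                            valuetype = "regex",
                            exclusive = FALSE,
                            capkeys = FALSE)
print(toks_regex)
```

For fixed phrases like these, the default valuetype = "glob" with plain values, as in the main example, is simpler; regular expressions are mainly useful when the variants follow a pattern.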