Calculating the difference between two subgroups, column by column, with NAs in R

  •  2
  • Dyllan  · 7 years ago

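    I am trying to calculate the absolute difference between two subgroups, column by column, in data that contain NAs. More specifically, I am working on a project where I want to measure the degree of partisan division in legislative roll call votes in R, that is, how differently Republicans and Democrats vote on each roll call. The equation I am trying to compute with my data is: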

    Roll Call Partisanship=|Democratic Aye % - GOP Aye %|
    

    My data are structured as follows:

    Legislator   Party   Vote1   Vote2   Vote3  Vote4  Vote5   Vote6   Vote7
    Allen           R     yes     no      NA     no     yes     yes     no
    Barber          D     NA      no      no     yes    no      yes     no
    Cale            D     no      NA      yes    yes    yes     no      yes
    Devin           R     no      no      no     yes    yes     yes     yes
    Egan            R     yes     yes     yes    NA     no      no      no
    Floyd           R     yes     no      yes    no     yes     no      yes
    

    Here is the R code to create this table:

    Legislator=c("Allen", "Barber", "Cale", "Devin", "Egan", "Floyd")
    Party=c("R", "D", "D", "R", "R", "R")
    vote1=c("yes", "NA", "no", "no", "yes", "yes")
    vote2=c("no", "no", "NA", "no", "yes", "no")
    vote3=c("NA", "no", "yes", "no", "yes", "yes")
    vote4=c("no", "yes", "yes", "yes", "NA", "no")
    vote5=c("yes", "no", "yes", "yes", "no", "yes")
    vote6=c("yes", "yes", "no", "yes", "no", "no")
    vote7=c("no", "no", "yes", "yes", "no", "yes")
    
    rollcall=cbind(Legislator, Party, vote1, vote2, vote3, vote4, vote5, vote6, vote7)
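
One detail in the code above is worth illustrating: a quoted `"NA"` is an ordinary string, not a missing value, so `is.na()` will not detect it until it is converted. A minimal sketch using only `vote1` from the question (the names `raw_vote1` and `fixed_vote1` are made up for this example):

```r
# vote1 as entered in the question: "NA" is a character string here
raw_vote1 <- c("yes", "NA", "no", "no", "yes", "yes")
sum(is.na(raw_vote1))    # no element is treated as missing yet

# Convert the literal string "NA" into a real missing value
fixed_vote1 <- raw_vote1
fixed_vote1[fixed_vote1 == "NA"] <- NA
sum(is.na(fixed_vote1))  # one element is now missing
```

The same replacement works on the whole character matrix produced by `cbind()`, which is what the first answer does after converting it to a data.table.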
    

    Using the equation above, I would like to create a matrix that looks like this:

    RollCall  Partisanship
    Vote1     0.75
    Vote2     0.25
    Vote3     0.17
    Vote4     0.67
    Vote5     0.25
    Vote6     0.00
    Vote7     0.00
    

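    Does anyone have suggestions on how I could calculate these scores in R? In particular, I am having trouble with the NAs. I want legislators who did not vote on a roll call to be left out of the calculation for that roll call only. However, if I use na.omit, those legislators are dropped from the calculations for every roll call. Any suggestions?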
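
To make the problem concrete, here is a small sketch of what `na.omit()` does to this data, assuming the `"NA"` strings have already been turned into real missing values. The object names `complete_rows` and `votes_cast` are introduced only for this illustration.

```r
# The question's data, with real NA values instead of the string "NA"
rollcall <- data.frame(
  Legislator = c("Allen", "Barber", "Cale", "Devin", "Egan", "Floyd"),
  Party = c("R", "D", "D", "R", "R", "R"),
  vote1 = c("yes", NA, "no", "no", "yes", "yes"),
  vote2 = c("no", "no", NA, "no", "yes", "no"),
  vote3 = c(NA, "no", "yes", "no", "yes", "yes"),
  vote4 = c("no", "yes", "yes", "yes", NA, "no"),
  vote5 = c("yes", "no", "yes", "yes", "no", "yes"),
  vote6 = c("yes", "yes", "no", "yes", "no", "no"),
  vote7 = c("no", "no", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# na.omit() works row-wise: any legislator with a single NA is dropped entirely
complete_rows <- na.omit(rollcall)
nrow(complete_rows)          # only Devin and Floyd remain
table(complete_rows$Party)   # no Democrats are left at all

# What is actually wanted is a per-roll-call count of legislators who voted
votes_cast <- colSums(!is.na(rollcall[, -(1:2)]))
votes_cast
```

With only two Republicans left after `na.omit()`, the Democratic Aye share cannot be computed for any roll call, which is why the missing values need to be handled one column at a time.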

    2 Answers  |  as of 7 years ago
        1
  •  1
  •   mtoto    7 years ago

    Here is a data.table solution:

    library(data.table)
    # convert your matrix to a data.table
    dt <- data.table(rollcall)
    # replace "NA"'s by actual NA's
    dt[dt == "NA"] <- NA
    
    # get your data in long format and calculate summary statistics
    dt_long <- melt(dt, id.vars = "Party", measure = patterns("^vote"))
    dt_long <- dt_long[!is.na(value),.(votes = sum(value=="yes") / .N), .(Party,variable)]
    
    # spread the result to arrive at expected format
    dcast(dt_long, variable ~ Party, value.var = "votes")[,.(Partisanship = abs(D - R)), "variable"]
    #  variable Partisanship
    #1:    vote1    0.7500000
    #2:    vote2    0.2500000
    #3:    vote3    0.1666667
    #4:    vote4    0.6666667
    #5:    vote5    0.2500000
    #6:    vote6    0.0000000
    #7:    vote7    0.0000000
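
For comparison, the same per-column logic can be sketched in base R without reshaping. This is not part of the answer above, just an illustration under the assumption that the data are held in a data frame with real `NA` values; the helper `aye_share` and the objects `dem`, `gop` and `partisanship` are names invented for this sketch.

```r
# The question's data, with real NA values instead of the string "NA"
rollcall <- data.frame(
  Legislator = c("Allen", "Barber", "Cale", "Devin", "Egan", "Floyd"),
  Party = c("R", "D", "D", "R", "R", "R"),
  vote1 = c("yes", NA, "no", "no", "yes", "yes"),
  vote2 = c("no", "no", NA, "no", "yes", "no"),
  vote3 = c(NA, "no", "yes", "no", "yes", "yes"),
  vote4 = c("no", "yes", "yes", "yes", NA, "no"),
  vote5 = c("yes", "no", "yes", "yes", "no", "yes"),
  vote6 = c("yes", "yes", "no", "yes", "no", "no"),
  vote7 = c("no", "no", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

vote_cols <- grep("^vote", names(rollcall), value = TRUE)

# Share of "yes" among the legislators who actually voted on this roll call;
# na.rm = TRUE drops missing votes for this column only
aye_share <- function(x) mean(x == "yes", na.rm = TRUE)

dem <- sapply(rollcall[rollcall$Party == "D", vote_cols], aye_share)
gop <- sapply(rollcall[rollcall$Party == "R", vote_cols], aye_share)

partisanship <- data.frame(
  RollCall = vote_cols,
  Partisanship = abs(dem - gop),
  row.names = NULL
)
print(partisanship)
```

One caveat: if every member of a party is missing on some roll call, `mean(..., na.rm = TRUE)` returns `NaN` for that column, so such cases would need separate handling.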
    
        2
  •  0
  •   Indrajeet Patil    7 years ago

    Here is a solution using dplyr (uglier than the one already posted, but it took a while to write, so I am posting it anyway):

    # setting up the data
    # **note that I've changed "NA" entries to NA **
    
    Legislator <- c("Allen", "Barber", "Cale", "Devin", "Egan", "Floyd")
    Party <- c("R", "D", "D", "R", "R", "R")
    vote1 <- c("yes", NA, "no", "no", "yes", "yes")
    vote2 <- c("no", "no", NA, "no", "yes", "no")
    vote3 <- c(NA, "no", "yes", "no", "yes", "yes")
    vote4 <- c("no", "yes", "yes", "yes", NA, "no")
    vote5 <- c("yes", "no", "yes", "yes", "no", "yes")
    vote6 <- c("yes", "yes", "no", "yes", "no", "no")
    vote7 <- c("no", "no", "yes", "yes", "no", "yes")
    
    rollcall <- as.data.frame(base::cbind(Legislator, Party, vote1, vote2, vote3, vote4, vote5, vote6, vote7))
    
    # converting to long format
    library(tidyr)
    #> Warning: package 'tidyr' was built under R version 3.4.2
    rollcall_long <- tidyr::gather(rollcall, vote, response, vote1:vote7, factor_key = TRUE)
    
    # compute frequency table
    library(dplyr)
    
    vote_frequency <- rollcall_long %>% 
      dplyr::filter(!is.na(response)) %>% # remove NAs
      dplyr::group_by(Party, vote, response) %>% # compute frequency by these grouping variables
      dplyr::summarize(counts = n()) %>% # get the count of each response
      dplyr::mutate(perc = counts / sum(counts)) %>% # compute its percentage
      dplyr::arrange(vote, response, Party) %>% # arrange it properly
      dplyr::filter(response == "yes") %>% # select only yes responses ("Ayes")
      dplyr::select(-counts, -response)  # remove counts and response variables
    
    # compute Partisanship score
    Partisanship_df <- tidyr::spread(vote_frequency, Party, perc)
    Partisanship_df[is.na(Partisanship_df)] <- 0 # replacing NA with 0 because NA here represents that not a single "yes" was found
    Partisanship_df$Partisanship <- abs(Partisanship_df$D - Partisanship_df$R)
    
    # removing unnecessary columns
    Partisanship_df %>% dplyr::select(-c(R, D))
    #> # A tibble: 7 x 2
    #> # Groups: vote [7]
    #>   vote  Partisanship
    #> * <fct>        <dbl>
    #> 1 vote1        0.750
    #> 2 vote2        0.250
    #> 3 vote3        0.167
    #> 4 vote4        0.667
    #> 5 vote5        0.250
    #> 6 vote6        0    
    #> 7 vote7        0
    

    Created on 2018-01-20 by the reprex package (v0.1.1.9000).
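
The answer above uses `gather()` and `spread()`, which tidyr has since superseded with `pivot_longer()` and `pivot_wider()`. A shorter sketch with those functions follows. It assumes tidyr 1.0 or later and dplyr 1.0 or later are installed; the column names `aye` and `Partisanship` and the object `partisanship_tbl` are choices made for this example.

```r
library(dplyr)
library(tidyr)

# The question's data, with real NA values instead of the string "NA"
rollcall <- data.frame(
  Legislator = c("Allen", "Barber", "Cale", "Devin", "Egan", "Floyd"),
  Party = c("R", "D", "D", "R", "R", "R"),
  vote1 = c("yes", NA, "no", "no", "yes", "yes"),
  vote2 = c("no", "no", NA, "no", "yes", "no"),
  vote3 = c(NA, "no", "yes", "no", "yes", "yes"),
  vote4 = c("no", "yes", "yes", "yes", NA, "no"),
  vote5 = c("yes", "no", "yes", "yes", "no", "yes"),
  vote6 = c("yes", "yes", "no", "yes", "no", "no"),
  vote7 = c("no", "no", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

partisanship_tbl <- rollcall %>%
  # one row per legislator and roll call
  pivot_longer(cols = starts_with("vote"), names_to = "vote", values_to = "response") %>%
  group_by(vote, Party) %>%
  # Aye share among those who voted; NAs are dropped within each group only
  summarise(aye = mean(response == "yes", na.rm = TRUE), .groups = "drop") %>%
  # one column per party
  pivot_wider(names_from = Party, values_from = aye) %>%
  mutate(Partisanship = abs(D - R)) %>%
  select(vote, Partisanship)

print(partisanship_tbl)
```

Because the Aye share is computed as a mean rather than by counting "yes" rows, a party with no "yes" votes gets 0 directly, so the step that replaces `NA` with 0 after spreading is not needed in this variant.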