Summarising data with na.rm = TRUE

  •  4
  • Werner RDyego  · Tech Community  · 6 years ago

    Consider the following example, which uses a dplyr summarise pipeline to find the minimum DATE for each CHAR:

    library('tidyverse')
    library('lubridate')
    
    temp <- data.frame(
      CHAR = c(
        'A',
        'B',
        'C'
      ),
      DATE = c(
        '20090101',
        '20100101',
        NA
      ) %>% ymd(), # Turn character strings to dates
      stringsAsFactors = FALSE
    ) %>% group_by(
      CHAR
    ) %>% summarise(
      DATE = min(DATE, na.rm = TRUE) # Extract minimum date
    ) %>% ungroup()
    

    Check whether the minimum is NA using is.na:

    temp %>% mutate(
      DATE_lgl = DATE %>% is.na() # Identify dates that are missing/NA
    )
    

    Output

    # A tibble: 3 x 3
      CHAR  DATE       DATE_lgl
      <chr> <date>     <lgl>   
    1 A     2009-01-01 FALSE   
    2 B     2010-01-01 FALSE   
    3 C     NA         FALSE   
    

    DATE_lgl incorrectly shows FALSE where DATE is NA. Why is this happening?

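    Removing na.rm = TRUE fixes the problem, but that does not work for the following configuration, where na.rm = TRUE is needed to drop the missing entries: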

    temp <- data.frame(
      CHAR = c(
        'A',
        'B',
        'C',
        'C'
      ),
      DATE = c(
        '20090101',
        '20100101',
        NA,
        '20110101'
      ) %>% ymd(), # Turn character strings to dates
      stringsAsFactors = FALSE
    ) %>% group_by(
      CHAR
    ) %>% summarise(
      DATE = min(DATE, na.rm = TRUE) # Extract minimum date
    ) %>% ungroup()
    

    > sessionInfo()
    R version 3.5.0 (2018-04-23)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    
    Matrix products: default
    
    locale:
    [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
    [4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
     [1] bindrcpp_0.2.2  lubridate_1.7.4 forcats_0.3.0   stringr_1.3.1   dplyr_0.7.5     purrr_0.2.5    
     [7] readr_1.1.1     tidyr_0.8.1     tibble_1.4.2    ggplot2_2.2.1   tidyverse_1.2.1
    
    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.17     cellranger_1.1.0 pillar_1.2.3     compiler_3.5.0   plyr_1.8.4       bindr_0.1.1     
     [7] tools_3.5.0      jsonlite_1.5     nlme_3.1-137     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1 
    [13] rlang_0.2.1      psych_1.8.4      cli_1.0.0        rstudioapi_0.7   yaml_2.1.19      parallel_3.5.0  
    [19] haven_1.1.1      xml2_1.2.0       httr_1.3.1       hms_0.4.2        grid_3.5.0       tidyselect_0.2.4
    [25] glue_1.2.0       R6_2.2.2         readxl_1.1.0     foreign_0.8-70   modelr_0.1.2     reshape2_1.4.3  
    [31] magrittr_1.5     scales_0.5.0     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
    [37] utf8_1.1.4       stringi_1.1.7    lazyeval_0.2.1   munsell_0.4.3    broom_0.4.4      crayon_1.3.4    
    
    2 replies  |  as of 6 years ago
        1
  •  4
  •   CPak    6 years ago

    The problem is that you are evaluating

    min(NA, na.rm=TRUE)
    # Inf
    

    for row 3, which results in

    dput(temp$DATE[3])
    # structure(Inf, class = "Date")
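
To make the mechanism concrete, here is a minimal base R sketch of what happens to an all-NA group. The variable names all_missing and smallest are made up for illustration; the only assumption is that the group contains a single NA date, as group C does in the question.

```r
# A group whose only date is NA, like group C in the question.
all_missing <- as.Date(NA)

# na.rm = TRUE drops the NA, so min() is left with no values at all.
# suppressWarnings() hides the "no non-missing arguments to min" warning.
smallest <- suppressWarnings(min(all_missing, na.rm = TRUE))

unclass(smallest)      # the underlying number is Inf, not NA
is.na(smallest)        # FALSE, which is why DATE_lgl came out FALSE
is.infinite(smallest)  # TRUE
```

The value keeps its Date class, so it sits in the DATE column like any other date even though it is not a real one.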
    

    Add is.finite to your mutate

    temp %>% 
       mutate(DATE_lgl = is.finite(DATE) | is.na(DATE))  # FALSE only for the non-finite (Inf) date
    
     # A tibble: 3 x 3
     #   CHAR  DATE       DATE_lgl
     #  <chr> <date>     <lgl>   
     # 1 A     2009-01-01 TRUE    
     # 2 B     2010-01-01 TRUE    
     # 3 C     NA         FALSE
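
Note that the flag above is TRUE for usable dates, which is the opposite of the question's convention (TRUE = missing). If the question's convention is preferred, one possible sketch is to negate is.finite, which is FALSE for both NA and Inf. The vectors valid, all_na, dates and is_missing below are illustrative names, not part of the original code.

```r
valid  <- as.Date(c("2009-01-01", "2010-01-01"))

# What summarise() produced for the all-NA group: Inf stored as a Date.
all_na <- suppressWarnings(min(as.Date(NA), na.rm = TRUE))

# Two real dates, the Inf "date", and a genuine NA for comparison.
dates <- c(valid, all_na, as.Date(NA))

# is.finite() is FALSE for NA as well as Inf, so negating it flags both.
is_missing <- !is.finite(dates)
is_missing
```

In the pipeline this would read mutate(DATE_lgl = !is.finite(DATE)).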
    

    That it prints as NA may be a limitation of how the Date class is printed

    as.Date(Inf, origin="1970-01-01")
    # NA
    dput(as.Date(Inf, origin="1970-01-01"))
    # structure(Inf, class = "Date")
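
An alternative that avoids creating the Inf value in the first place is to return a real NA when a group has no non-missing dates. The helper below, min_or_na, is a hypothetical name introduced here for illustration; it is a sketch, not something from dplyr or lubridate.

```r
# Hypothetical helper: minimum ignoring NAs, or NA if nothing is left.
min_or_na <- function(x) {
  if (all(is.na(x))) {
    x[NA_integer_]          # a length-one NA that keeps the class of x
  } else {
    min(x, na.rm = TRUE)
  }
}

dates <- as.Date(c("2009-01-01", "2010-01-01", NA, "2011-01-01"))

min_or_na(dates[3])     # all-NA group: a genuine NA Date
min_or_na(dates[3:4])   # NA plus a real date: the real date
```

Inside the question's pipeline this would be used as summarise(DATE = min_or_na(DATE)), after which is.na(DATE) behaves as expected for group C.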
    
        2
  •  2
  •   www    6 years ago

    A workaround is to convert the Date column to character and then test whether it is NA.

    temp %>% mutate(
      DATE_lgl = is.na(as.character(DATE))
    )
    
    # # A tibble: 3 x 3
    #   CHAR  DATE       DATE_lgl
    #   <chr> <date>     <lgl>   
    # 1 A     2009-01-01 FALSE   
    # 2 B     2010-01-01 FALSE   
    # 3 C     NA         TRUE