
How can I automatically fill a data frame with hourly values and pad the gaps with zeros?

  •  0
  • Mus mzuba  · 6 years ago

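    I have some data extracted from an Elasticsearch system that shows employee availability over a date range, broken into hourly slots.

    Employees are not available 24 hours a day, but I would like to show the data over a full 24-hour period, with 0 filling the cells that have no data.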

    I could pad the missing hours with 0s by hand, but I would really like to know whether there is a better way.

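    Note that the initial data frame is not always the same size, because different dates return different numbers of hourly values (a 7-hour slot, a 3-hour slot, a 12-hour slot, and so on).

    Also note that when the query returns no hits for an hour, that hour is missing from the result altogether rather than appearing as a row with no associated data (see the gap between 18:00 and 21:00).

    Currently, the whole data frame looks like this: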

       hour                associate_count minutes_covered
       <dttm>                        <int>           <dbl>
    1  2018-08-06 10:00:00               2              37
    2  2018-08-06 11:00:00               2              60
    3  2018-08-06 12:00:00               2              42
    4  2018-08-06 13:00:00               1              56
    5  2018-08-06 14:00:00               2              60
    6  2018-08-06 15:00:00               2              60
    7  2018-08-06 16:00:00               2              60
    8  2018-08-06 17:00:00               1              52
    9  2018-08-06 18:00:00               1               0 # NOTE THAT THERE IS A 3-HOUR GAP HERE UNTIL THE NEXT HIT
    10 2018-08-06 21:00:00               1              10
    

    The data behind the data frame:

    df <- structure(list(hour = structure(c(1533546000, 1533549600, 1533553200, 
    1533556800, 1533560400, 1533564000, 1533567600, 1533571200, 1533574800
    ), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"), 
        associate_count = c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L), 
        minutes_covered = c(37, 60, 42, 56, 60, 60, 60, 52, 0)), row.names = c(NA, 
    -9L), class = c("tbl_df", "tbl", "data.frame"))
    

    How can I get the data to look like this?

                   hour associate_count minutes_covered
    1  2018-08-06 00:00               0               0
    2  2018-08-06 01:00               0               0
    3  2018-08-06 02:00               0               0
    4  2018-08-06 03:00               0               0
    5  2018-08-06 04:00               0               0
    6  2018-08-06 05:00               0               0
    7  2018-08-06 06:00               0               0
    8  2018-08-06 07:00               0               0
    9  2018-08-06 08:00               0               0
    10 2018-08-06 09:00               0               0
    11 2018-08-06 10:00               2              37
    12 2018-08-06 11:00               2              60
    13 2018-08-06 12:00               2              42
    14 2018-08-06 13:00               1              56
    15 2018-08-06 14:00               2              60
    16 2018-08-06 15:00               2              60
    17 2018-08-06 16:00               2              60
    18 2018-08-06 17:00               1              52
    19 2018-08-06 18:00               1               0
    20 2018-08-06 19:00               0               0
    21 2018-08-06 20:00               0               0
    22 2018-08-06 21:00               1              10
    23 2018-08-06 22:00               0               0
    24 2018-08-06 23:00               0               0
    
    3 answers  |  latest 6 years ago
        1
  •  2
  •   Calum You    6 years ago

    You can use tidyr::complete:

    library(tidyverse)
    library(lubridate)
    df <- structure(list(hour = structure(c(1533546000, 1533549600, 1533553200, 1533556800, 1533560400, 1533564000, 1533567600, 1533571200, 1533574800), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"), associate_count = c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L), minutes_covered = c(37, 60, 42, 56, 60, 60, 60, 52, 0)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
    
    my_complete <- function(df, start_date, end_date){
      # Parse the bounds in the data's own time zone. ymd_hms() defaults to UTC,
      # which would put the hourly grid one hour off London midnight during BST.
      start_hour <- str_c(start_date, " 00:00:00") %>% ymd_hms(tz = tz(df$hour))
      end_hour <- str_c(end_date, " 00:00:00") %>% ymd_hms(tz = tz(df$hour))
      df %>%
        complete(
          hour = seq(from = start_hour, to = end_hour, by = "hour"),
          fill = list(associate_count = 0L, minutes_covered = 0)
        )
    }
    # The range includes midnight of the end date, hence 25 rows.
    my_complete(df, "2018-08-06", "2018-08-07")
    #> # A tibble: 25 x 3
    #>    hour                associate_count minutes_covered
    #>    <dttm>                        <int>           <dbl>
    #>  1 2018-08-06 00:00:00               0               0
    #>  2 2018-08-06 01:00:00               0               0
    #>  3 2018-08-06 02:00:00               0               0
    #>  4 2018-08-06 03:00:00               0               0
    #>  5 2018-08-06 04:00:00               0               0
    #>  6 2018-08-06 05:00:00               0               0
    #>  7 2018-08-06 06:00:00               0               0
    #>  8 2018-08-06 07:00:00               0               0
    #>  9 2018-08-06 08:00:00               0               0
    #> 10 2018-08-06 09:00:00               0               0
    #> # ... with 15 more rows
    

    Created on 2018-08-13 by the reprex package (v0.2.0).
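
    If the day should not be hard-coded, the hour range can be derived from the data itself. The following is a minimal sketch, assuming all rows fall on one calendar day in the data's own time zone and that the day has no daylight-saving transition. `pad_day` is a hypothetical helper name, and the 21:00 row is taken from the table printed in the question rather than from its `dput` output.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Sample data: the ten rows printed in the question, in Europe/London time.
df <- tibble(
  hour = ymd_hm("2018-08-06 10:00", tz = "Europe/London") + 3600 * c(0:8, 11),
  associate_count = c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
  minutes_covered = c(37, 60, 42, 56, 60, 60, 60, 52, 0, 10)
)

# Hypothetical helper: pad one day of hourly data to 24 rows.
pad_day <- function(df) {
  # Midnight of the first observation's day, in the data's own time zone.
  day_start <- floor_date(min(df$hour), unit = "day")
  df %>%
    complete(
      hour = seq(from = day_start, by = "hour", length.out = 24),
      fill = list(associate_count = 0L, minutes_covered = 0)
    )
}

padded <- pad_day(df)
```

    Using `length.out = 24` instead of an end timestamp avoids the extra row for midnight of the following day.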

        2
  •  0
  •   DanY    6 years ago
    # create a sequence of hours for your day, in the same time zone as the data
    allhours <- data.frame(hour=seq(from= as.POSIXct("2018-08-06 00:00", tz = "Europe/London"), 
                                    to  = as.POSIXct("2018-08-06 23:00", tz = "Europe/London"), 
                                    by  = "hours"))
    
    # merge that sequence with your data (all=TRUE is important here)
    res <- merge(df, allhours, by="hour", all=TRUE)
    
    # convert NAs to Zeros
    res[is.na(res$associate_count), "associate_count"] <- 0
    res[is.na(res$minutes_covered), "minutes_covered"] <- 0
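
    An alternative to merging and then replacing NAs is to start from an all-zero frame and copy the observed rows into it. This is only a sketch in base R, assuming every timestamp falls exactly on the hour and that the data is in Europe/London time. `zero_fill_hours` is a hypothetical helper name, and the 21:00 row comes from the table printed in the question.

```r
# Sample data: the ten rows printed in the question, in Europe/London time.
df <- data.frame(
  hour = as.POSIXct("2018-08-06 10:00:00", tz = "Europe/London") + 3600 * c(0:8, 11),
  associate_count = c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
  minutes_covered = c(37, 60, 42, 56, 60, 60, 60, 52, 0, 10)
)

# Hypothetical helper: build an all-zero frame for every hour of the days
# spanned by the data, then copy the observed rows into place.
zero_fill_hours <- function(df, tz = "Europe/London") {
  first_day <- as.Date(min(df$hour), tz = tz)
  last_day  <- as.Date(max(df$hour), tz = tz)
  all_hours <- seq(
    from = as.POSIXct(paste(first_day, "00:00:00"), tz = tz),
    to   = as.POSIXct(paste(last_day, "23:00:00"), tz = tz),
    by   = "hour"
  )
  out <- data.frame(
    hour = all_hours,
    associate_count = 0L,
    minutes_covered = 0
  )
  # Match on the underlying seconds since the epoch; this assumes every
  # observed timestamp is present in the hourly grid.
  idx <- match(as.numeric(df$hour), as.numeric(out$hour))
  out$associate_count[idx] <- df$associate_count
  out$minutes_covered[idx] <- df$minutes_covered
  out
}

res <- zero_fill_hours(df)
```

    Because the frame starts out filled with zeros, no separate NA-replacement step is needed.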
    
        3
  •  0
  •   Matias Andina    6 years ago

    library(dplyr)
    library(lubridate)

    # Build the hourly grid in the data's time zone (ymd_hms() defaults to UTC)
    new_df <- data.frame(hour=seq(ymd_hms('2018-08-06 00:00:00', tz = 'Europe/London'),
                         ymd_hms('2018-08-06 23:00:00', tz = 'Europe/London'), by = '1 hour'))
    

    Now we can join the old data frame:

    new_df %>% left_join(df)
    Joining, by = "hour"
                      hour associate_count minutes_covered
    1  2018-08-06 00:00:00              NA              NA
    2  2018-08-06 01:00:00              NA              NA
    3  2018-08-06 02:00:00              NA              NA
    4  2018-08-06 03:00:00              NA              NA
    5  2018-08-06 04:00:00              NA              NA
    6  2018-08-06 05:00:00              NA              NA
    7  2018-08-06 06:00:00              NA              NA
    8  2018-08-06 07:00:00              NA              NA
    9  2018-08-06 08:00:00              NA              NA
    10 2018-08-06 09:00:00              NA              NA
    11 2018-08-06 10:00:00               2              37
    12 2018-08-06 11:00:00               2              60
    13 2018-08-06 12:00:00               2              42
    14 2018-08-06 13:00:00               1              56
    15 2018-08-06 14:00:00               2              60
    16 2018-08-06 15:00:00               2              60
    17 2018-08-06 16:00:00               2              60
    18 2018-08-06 17:00:00               1              52
    19 2018-08-06 18:00:00               1               0
    20 2018-08-06 19:00:00              NA              NA
    21 2018-08-06 20:00:00              NA              NA
    22 2018-08-06 21:00:00              NA              NA
    23 2018-08-06 22:00:00              NA              NA
    24 2018-08-06 23:00:00              NA              NA
    

    If you really must get rid of the NAs, extend the pipeline:

    new_df %>% left_join(df) %>% mutate_at(c(2:3), funs(replace(., is.na(.), 0)))
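
    Since `funs()` has been deprecated in later dplyr releases, one possible alternative for the same step is `tidyr::replace_na`. The sketch below uses a small made-up `joined` tibble standing in for the result of the left join. The column names match the question, but the values are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the joined result, with NAs where no data matched.
joined <- tibble(
  associate_count = c(NA, 2L, 1L, NA),
  minutes_covered = c(NA, 37, 0, NA)
)

# Replace NAs column by column, keeping each column's type.
filled <- joined %>%
  replace_na(list(associate_count = 0L, minutes_covered = 0))
```

    Naming the columns explicitly also avoids relying on their positions, as `c(2:3)` does.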