
R: Joining two data frames on a calculated time range


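    I am having trouble working out the logic for this. I could not find anything about this specific problem on Stack Overflow or elsewhere on the web.

    I have two data frames:

    Data frame 1: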

    ID  Date         Time 
    1   2017-11-13   06:34:50
    2   2017-11-13   06:40:10
    3   2017-11-14   23:58:10 
    

    Data frame 2:

    Number_Visitors   hit_time 
     20               2017-11-13 06:34:50 
     18               2017-11-13 06:34:50
     15               2017-11-15 00:06:10
     25               2018-12-14 20:58:10
    

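    What do I want?

    I want the number of visitors from table 2 that match the date and time in table 1. The hard part is the matching window: it should cover everything from the date/time in table 1 up to 10 minutes later (all visitors between the start time and the start time + 10 minutes).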

    ID  Date         Time        End_Time #I don't have this column yet.. 
    1   2017-11-13   06:34:50    06:44:50
    2   2017-11-13   06:40:10    06:50:10   
    3   2017-11-14   23:58:10    00:08:10 #Attention: it is one day later here.
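
    As a rough base R sketch of how an End_Time column like the one above could be derived: combine Date and Time into a single date-time, then add 600 seconds. The names `start_time`, `end_time` and `End_Date`, and the UTC time zone, are assumptions made for illustration. Working with a full date-time rather than a time of day is what lets row 3 roll over to the next day.

```r
# Sample rows from data frame 1
df1 <- data.frame(
  ID   = 1:3,
  Date = c("2017-11-13", "2017-11-13", "2017-11-14"),
  Time = c("06:34:50", "06:40:10", "23:58:10"),
  stringsAsFactors = FALSE
)

# Combine date and time into one POSIXct value (UTC is an assumption here)
start_time <- as.POSIXct(paste(df1$Date, df1$Time), tz = "UTC")

# POSIXct arithmetic works in seconds, so 10 minutes is 600 seconds
end_time <- start_time + 10 * 60

# Keep the time of day and the (possibly later) date of the interval end
df1$End_Time <- format(end_time, "%H:%M:%S")
df1$End_Date <- format(end_time, "%Y-%m-%d")
df1
```

    For ID 3 the end of the window falls on 2017-11-15, which is why comparing on the time of day alone would not be enough.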
    

    Result:

    ID  Date         Time        End_Time  Number_of_Visitors_in_range
    1   2017-11-13   06:34:50    06:44:50      28
    2   2017-11-13   06:40:10    06:50:10      0
    3   2017-11-14   23:58:10    00:08:10      15
    
    1 answer

    phiver · 6 years ago

    There are several possible answers. "Non-equi join" and "fuzzy join" are the terms to search for.
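
    To make clear what such a join has to compute, here is a minimal base R sketch that does the same thing by brute force: for each row of table 1 it sums the visitors whose hit_time lies between the start time and the start time + 10 minutes. The variable names and the UTC time zone are assumptions for illustration; this is not meant to replace the package-based approaches, which scale better.

```r
# Sample data from the question
df1 <- data.frame(
  ID   = 1:3,
  Date = c("2017-11-13", "2017-11-13", "2017-11-14"),
  Time = c("06:34:50", "06:40:10", "23:58:10"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Number_Visitors = c(20L, 18L, 15L, 25L),
  hit_time = c("2017-11-13 06:34:50", "2017-11-13 06:34:50",
               "2017-11-15 00:06:10", "2018-12-14 20:58:10"),
  stringsAsFactors = FALSE
)

# Parse everything as POSIXct (UTC is an assumption here)
start_time <- as.POSIXct(paste(df1$Date, df1$Time), tz = "UTC")
end_time   <- start_time + 10 * 60
hit_time   <- as.POSIXct(df2$hit_time, tz = "UTC")

# For each interval, sum the visitors with start_time <= hit_time <= end_time
visitors_in_range <- vapply(seq_along(start_time), function(i) {
  in_range <- hit_time >= start_time[i] & hit_time <= end_time[i]
  sum(df2$Number_Visitors[in_range])
}, integer(1))

cbind(df1, No_Visitors_in_range = visitors_in_range)
```

    This loop compares every interval with every hit, so it is only practical for small tables. An interval without any hit yields 0 here because the sum of an empty vector is 0.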

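    Based on your example (which was not supplied as dput output), you could use something like the following. The explanation is in the code comments.

    dplyr / fuzzyjoin: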

    library(dplyr)
    library(lubridate)
    library(fuzzyjoin)
    
    # parse hit_time as POSIXct
    df2$hit_time <- ymd_hms(df2$hit_time)
    
    # create Start_Time and End_Time in df1 so both can be compared with hit_time in df2
    df1 <- df1 %>% mutate(Start_Time = ymd_hms(paste(Date, Time)),
                          End_Time = Start_Time + minutes(10))
    
    # fuzzy join on Start_Time <= hit_time <= End_Time, then sum the visitors per interval
    # (intervals without a match give NA; sum(..., na.rm = TRUE) would give 0 instead)
    fuzzy_left_join(df1, df2,
                    by = c(Start_Time = "hit_time", End_Time = "hit_time"),
                    match_fun = list(`<=`, `>=`)) %>%
      group_by(ID, Start_Time, End_Time) %>%
      summarise(No_Visitors_in_range = sum(Number_Visitors))
    # A tibble: 3 x 4
    # Groups:   ID, Start_Time [?]
         ID Start_Time          End_Time            No_Visitors_in_range
      <int> <dttm>              <dttm>                             <int>
    1     1 2017-11-13 06:34:50 2017-11-13 06:44:50                   38
    2     2 2017-11-13 06:40:10 2017-11-13 06:50:10                   NA
    3     3 2017-11-14 23:58:10 2017-11-15 00:08:10                   15
    

    data.table:

    library(data.table)
    library(lubridate)
    
    # parse hit_time as POSIXct
    df2$hit_time <- ymd_hms(df2$hit_time)
    
    df1 <- as.data.table(df1)
    df2 <- as.data.table(df2)
    
    # create Start_Time and End_Time in df1 so both can be compared with hit_time in df2
    df1[, Start_Time := ymd_hms(paste(Date, Time))][, End_Time := Start_Time + minutes(10)]
    
    # non-equi join: for each row of df1, sum Number_Visitors of the df2 rows whose hit_time falls in the interval
    df1[, No_Visitors_in_range := df2[df1, on = .(hit_time >= Start_Time, hit_time <= End_Time), sum(Number_Visitors), by = .EACHI]$V1]
    
    df1
       ID       Date     Time          Start_Time            End_Time No_Visitors_in_range
    1:  1 2017-11-13 06:34:50 2017-11-13 06:34:50 2017-11-13 06:44:50                   38
    2:  2 2017-11-13 06:40:10 2017-11-13 06:40:10 2017-11-13 06:50:10                   NA
    3:  3 2017-11-14 23:58:10 2017-11-14 23:58:10 2017-11-15 00:08:10                   15
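
    Both outputs show NA for ID 2, whereas the question asks for 0 when no hit falls inside the window. A minimal base R sketch of two ways to get there follows; the vector `visitors` simply copies the result column shown above, and the other names are made up for illustration.

```r
# Result column as shown in the output above (NA = no matching hits)
visitors <- c(38L, NA, 15L)

# Option 1: replace NA with 0 after the join
visitors_filled <- visitors
visitors_filled[is.na(visitors_filled)] <- 0L

# Option 2: sum() with na.rm = TRUE returns 0 when every value is NA,
# which is the situation for an interval without matching rows
empty_interval_sum <- sum(NA_integer_, na.rm = TRUE)
```

    Passing `na.rm = TRUE` to the `sum()` call in either approach above should therefore give 0 instead of NA for intervals without hits.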
    

    Data:

    df1 <- structure(list(ID = 1:3, Date = c("2017-11-13", "2017-11-13", 
    "2017-11-14"), Time = c("06:34:50", "06:40:10", "23:58:10")), class = "data.frame", row.names = c(NA, 
    -3L))
    
    df2 <- structure(list(Number_Visitors = c(20L, 18L, 15L, 25L), hit_time = c("2017-11-13 06:34:50", "2017-11-13 06:34:50", "2017-11-15 00:06:10", "2018-12-14 20:58:10"
    )), class = "data.frame", row.names = c(NA, -4L))
    

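    Edit: because the 10-minute time frames can overlap, it is better to use the lead of Start_Time (the next row's start time) as End_Time.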

    df1[, End_Time := shift(Start_Time, type = "lead", fill = last(Start_Time))]
    
    # non-equi join on the half-open interval Start_Time <= hit_time < End_Time
    df1[, No_Visitors_in_range := df2[df1, on = .(hit_time >= Start_Time, hit_time < End_Time), sum(Number_Visitors), by = .EACHI]$V1]
    

    I get a warning here, and you may too. It is nothing to worry about; I have explained it here.
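
    The lead-based intervals from the edit can be sketched in base R as follows. This is an illustration under assumptions (hypothetical variable names, UTC time zone, the sample data from the question), not the data.table code itself: each interval ends where the next one starts, and the comparison uses `<` on the upper bound so a hit on a boundary is counted only once.

```r
# Start times from table 1 and hits from table 2 (sample data, UTC assumed)
start_time <- as.POSIXct(c("2017-11-13 06:34:50", "2017-11-13 06:40:10",
                           "2017-11-14 23:58:10"), tz = "UTC")
hit_time   <- as.POSIXct(c("2017-11-13 06:34:50", "2017-11-13 06:34:50",
                           "2017-11-15 00:06:10", "2018-12-14 20:58:10"),
                         tz = "UTC")
visitors   <- c(20L, 18L, 15L, 25L)

# "Lead": every interval ends at the next row's start time;
# the last row is filled with its own start time
end_time <- c(start_time[-1], start_time[length(start_time)])

# Half-open intervals [start_time, end_time) cannot overlap
visitors_in_range <- vapply(seq_along(start_time), function(i) {
  in_range <- hit_time >= start_time[i] & hit_time < end_time[i]
  sum(visitors[in_range])
}, integer(1))

visitors_in_range
```

    With the sample data the last interval is empty, because its end equals its start, so the hit at 2017-11-15 00:06:10 is not counted in any row. A different fill value would be needed if the final interval should stay open.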