
Select two random and consecutive rows from grouped data

  •  2
  • B. Davis  · Tech Community  · 6 years ago

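    In the data below ( dput included), I have repeated observations (lat and long) for three individuals (IndIDII). Note that each individual has a different number of locations, and that they are ordered by IndYear .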

      IndIDII      IndYear  WintLat  WintLong
    1 BHS_265 BHS_265-2015 47.61025 -112.7210
    2 BHS_265 BHS_265-2016 47.59884 -112.7089
    3 BHS_770 BHS_770-2016 42.97379 -109.0400
    4 BHS_770 BHS_770-2017 42.97129 -109.0367
    5 BHS_770 BHS_770-2018 42.97244 -109.0509
    6 BHS_377 BHS_377-2015 43.34744 -109.4821
    7 BHS_377 BHS_377-2016 43.35559 -109.4445
    8 BHS_377 BHS_377-2017 43.35195 -109.4566
    9 BHS_377 BHS_377-2018 43.34765 -109.4892
    

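    I would like to filter the data into a new df that holds two consecutive rows for each IndIDII . In my larger dataset, every individual has at least 2 observations (i.e. rows), and each individual has between 2 and 4. Obviously, for individuals with only 2 rows, the code will return those 2 rows. For individuals with more data, rows 1 and 2, 2 and 3, or 3 and 4 should be selected at random. The order of the rows does not matter as long as they are consecutive (i.e. either 3 and 4 or 4 and 3 may be returned).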

    As always, many thanks!

    Dat <- structure(list(IndIDII = c("BHS_265", "BHS_265", "BHS_770", "BHS_770", 
    "BHS_770", "BHS_377", "BHS_377", "BHS_377", "BHS_377"), IndYear = c("BHS_265-2015", 
    "BHS_265-2016", "BHS_770-2016", "BHS_770-2017", "BHS_770-2018", 
    "BHS_377-2015", "BHS_377-2016", "BHS_377-2017", "BHS_377-2018"
    ), WintLat = c(47.6102519805014, 47.5988417247191, 42.9737859090909, 
    42.9712914772727, 42.9724390816327, 43.3474354347826, 43.3555934579439, 
    43.3519543396226, 43.3476466990291), WintLong = c(-112.720994832869, 
    -112.708887595506, -109.039964727273, -109.036693522727, -109.050923061224, 
    -109.482114456522, -109.444522149533, -109.45659254717, -109.489241553398
    )), class = "data.frame", row.names = c(NA, -9L))
    
    3 answers  |  as of 6 years ago
        1
  •  2
  •   Henrik plannapus    6 years ago

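    Using ave . Within each group, create a row index ( i <- seq_along(x) ). To get the first index of the rows to keep, sample one row from all row indices except the last; adding 0:1 then gives that row and the one after it ( sample(head(i, -1), 1) + 0:1 ). Finally, check which row indices are among the sampled rows ( i %in% ... ).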

    Dat[as.logical(ave(Dat$IndIDII, Dat$IndIDII, FUN = function(x){
      i <- seq_along(x)
      i %in% (sample(head(i, -1), 1) + 0:1)
    })), ]
    
    #   IndIDII      IndYear  WintLat  WintLong
    # 1 BHS_265 BHS_265-2015 47.61025 -112.7210
    # 2 BHS_265 BHS_265-2016 47.59884 -112.7089
    # 4 BHS_770 BHS_770-2017 42.97129 -109.0367
    # 5 BHS_770 BHS_770-2018 42.97244 -109.0509
    # 7 BHS_377 BHS_377-2016 43.35559 -109.4445
    # 8 BHS_377 BHS_377-2017 43.35195 -109.4566
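
    To make the index logic above easier to follow, here is a minimal sketch that walks through it for a single hypothetical group of 4 rows, outside of ave . The group size and the seed are assumptions chosen only for illustration.

```r
# Walk through the index logic for one hypothetical group of 4 rows.
set.seed(1)                      # arbitrary seed, only for repeatability
i <- seq_len(4)                  # row positions within the group: 1 2 3 4
candidates <- head(i, -1)        # valid starting positions: 1 2 3
start <- sample(candidates, 1)   # one random starting position
keep <- i %in% (start + 0:1)     # TRUE for the start row and the row after it
which(keep)                      # the two consecutive positions that are kept
```

    Because the last position is never a candidate, start + 1 always stays inside the group.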
    

    Similar, but more concise, with data.table , using the row index ( .I ) and the number of rows in each group ( .N ):

    library(data.table)
    setDT(Dat)
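    # Sample a start position within each group, then map it to global row numbers via .I.
    # (sample(.I[-.N], 1) would draw from 1:k when a 2-row group leaves a single row number k.)
    Dat[Dat[ , .I[sample(.N - 1L, 1L) + 0:1], by = IndIDII]$V1]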
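
    A general point about sample() matters whenever a group's candidate row numbers reduce to a single value: given one number n >= 1, sample() draws from 1:n rather than returning n. The sketch below shows this in base R and one possible way around it, sampling a position within the group instead of a row number. The vector rows_in_group is a hypothetical example, not taken from the data above.

```r
# sample() treats a single number n >= 1 as the range 1:n.
set.seed(42)                          # arbitrary seed
draws <- replicate(200, sample(3, 1))
sort(unique(draws))                   # values come from 1:3, not only 3

# Sampling a position within the group avoids the ambiguity.
rows_in_group <- c(3L, 4L)            # hypothetical global row numbers of a 2-row group
n <- length(rows_in_group)
start <- sample.int(n - 1L, 1L)       # position of the first row to keep
picked <- rows_in_group[start + 0:1]  # the only possible pair for a 2-row group
picked
```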
    
        2
  •  2
  •   Jilber Urbina    6 years ago

    Here is a solution using base R functions:

    set.seed(505) # any seed works; 505 is used here for reproducibility
    lapply(split(Dat, Dat$IndIDII), function(x) {
      ind <- sample(nrow(x))
      cons <- if(ind[1] < max(ind)){
        c(ind[1], ind[1]+1)
      } else {
        c(ind[1], ind[1]-1)
        }
      x[cons, ]
    })
    
    $`BHS_265`
      IndIDII      IndYear  WintLat  WintLong
    1 BHS_265 BHS_265-2015 47.61025 -112.7210
    2 BHS_265 BHS_265-2016 47.59884 -112.7089
    
    $BHS_377
      IndIDII      IndYear  WintLat  WintLong
    6 BHS_377 BHS_377-2015 43.34744 -109.4821
    7 BHS_377 BHS_377-2016 43.35559 -109.4445
    
    $BHS_770
      IndIDII      IndYear  WintLat  WintLong
    3 BHS_770 BHS_770-2016 42.97379 -109.0400
    4 BHS_770 BHS_770-2017 42.97129 -109.0367
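
    Since the question asks for a single new data frame, the list returned by lapply() can be stacked with do.call(rbind, ...) . The sketch below is one way to do that; it uses a small hypothetical stand-in for the data and a hypothetical helper, pick_pair , which samples the first row uniformly from all rows but the last. That differs slightly from the code above, where a draw of the last row steps backwards, so the final pair is chosen more often in groups of 3 or 4 rows.

```r
# Small hypothetical stand-in for the question's data (coordinates omitted).
dat <- data.frame(
  id   = c("A", "A", "B", "B", "B", "C", "C", "C", "C"),
  year = c(2015, 2016, 2016, 2017, 2018, 2015, 2016, 2017, 2018),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one random pair of adjacent rows from a group.
pick_pair <- function(x) {
  start <- sample.int(nrow(x) - 1L, 1L)  # uniform over the possible first rows
  x[start + 0:1, , drop = FALSE]
}

set.seed(505)                             # arbitrary seed
pieces <- lapply(split(dat, dat$id), pick_pair)
result <- do.call(rbind, pieces)          # stack the list into one data frame
rownames(result) <- NULL                  # drop the "A.1"-style row names
result
```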
    
        3
  •  1
  •   Calum You    6 years ago

    To drop the helper row column from the result, uncomment select() at the end of the function.

    Dat <- structure(list(IndIDII = c("BHS_265", "BHS_265", "BHS_770", "BHS_770", "BHS_770", "BHS_377", "BHS_377", "BHS_377", "BHS_377"), IndYear = c("BHS_265-2015", "BHS_265-2016", "BHS_770-2016", "BHS_770-2017", "BHS_770-2018", "BHS_377-2015", "BHS_377-2016", "BHS_377-2017", "BHS_377-2018"), WintLat = c(47.6102519805014, 47.5988417247191, 42.9737859090909, 42.9712914772727, 42.9724390816327, 43.3474354347826, 43.3555934579439, 43.3519543396226, 43.3476466990291), WintLong = c(-112.720994832869, -112.708887595506, -109.039964727273, -109.036693522727, -109.050923061224, -109.482114456522, -109.444522149533, -109.45659254717, -109.489241553398)), class = "data.frame", row.names = c(NA, -9L))
    
    library(tidyverse)
    set.seed(123)
    sample_2_consecutive <- function(tbl, group_col){
      group_col <- enquo(group_col)
      with_rownums <- tbl %>%
        group_by(!!group_col) %>%
        mutate(row = row_number())
      rows_to_keep <- with_rownums %>%
        filter(row != max(row)) %>%
        sample_n(1) %>%
        mutate(row2 = row + 1) %>%
        gather(key, row, row, row2)
      with_rownums %>%
        semi_join(rows_to_keep, by = c(quo_name(quo(!!group_col)), "row")) %>%
        arrange(!!group_col, row) %>%
        ungroup() # %>%
      # select(-row)
    }
    sample_2_consecutive(Dat, IndIDII)
    #> # A tibble: 6 x 5
    #>   IndIDII IndYear      WintLat WintLong   row
    #>   <chr>   <chr>          <dbl>    <dbl> <int>
    #> 1 BHS_265 BHS_265-2015    47.6    -113.     1
    #> 2 BHS_265 BHS_265-2016    47.6    -113.     2
    #> 3 BHS_377 BHS_377-2017    43.4    -109.     3
    #> 4 BHS_377 BHS_377-2018    43.3    -109.     4
    #> 5 BHS_770 BHS_770-2016    43.0    -109.     1
    #> 6 BHS_770 BHS_770-2017    43.0    -109.     2
    

    Created by the reprex package (v0.2.0).
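
    Whichever approach is used, the result can be sanity-checked against the requirement of exactly two adjacent rows per individual. The following is a minimal base R sketch; the helper is_consecutive_pair and the toy data are hypothetical, and the check assumes that the ordering column (such as IndYear ) is unique within each group.

```r
# Hypothetical helper: TRUE if every group of `original` appears in `sampled`
# with exactly two rows that were adjacent in `original`.
is_consecutive_pair <- function(original, sampled, group, order_col) {
  ok <- vapply(split(sampled, sampled[[group]]), function(s) {
    full <- original[original[[group]] == s[[group]][1], ]
    pos  <- sort(match(s[[order_col]], full[[order_col]]))  # positions within the group
    length(pos) == 2L && !anyNA(pos) && diff(pos) == 1L
  }, logical(1))
  all(ok) && setequal(names(ok), unique(original[[group]]))
}

dat <- data.frame(
  id  = c("A", "A", "B", "B", "B"),
  key = c("A-2015", "A-2016", "B-2016", "B-2017", "B-2018"),
  stringsAsFactors = FALSE
)
good <- dat[c(1, 2, 4, 5), ]   # adjacent rows in both groups
bad  <- dat[c(1, 2, 3, 5), ]   # rows 3 and 5 skip one row in group B
is_consecutive_pair(dat, good, "id", "key")
is_consecutive_pair(dat, bad,  "id", "key")
```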