
Select two random, consecutive rows from grouped data

dplyr r

B. Davis · 6 years ago

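In the data below (dput included), I have repeated observations (lat and long) for three individuals (IndIDII). Note that each individual has a different number of locations, and that the rows are ordered by IndYear.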

  IndIDII      IndYear  WintLat  WintLong
1 BHS_265 BHS_265-2015 47.61025 -112.7210
2 BHS_265 BHS_265-2016 47.59884 -112.7089
3 BHS_770 BHS_770-2016 42.97379 -109.0400
4 BHS_770 BHS_770-2017 42.97129 -109.0367
5 BHS_770 BHS_770-2018 42.97244 -109.0509
6 BHS_377 BHS_377-2015 43.34744 -109.4821
7 BHS_377 BHS_377-2016 43.35559 -109.4445
8 BHS_377 BHS_377-2017 43.35195 -109.4566
9 BHS_377 BHS_377-2018 43.34765 -109.4892

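I want to filter Dat to make a new df that contains two consecutive rows for each IndIDII. In my larger data set, every individual has at least 2 observations (i.e. rows), with between 2 and 4 observations per individual. Obviously, for individuals with only 2 rows, the code will return the only 2 rows. For individuals with more data, rows 1 and 2, 2 and 3, or 3 and 4 should be selected at random. The order of the rows does not matter as long as they are consecutive (i.e. either 3 and 4 or 4 and 3 may be returned).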

As always, many thanks!

Dat <- structure(list(IndIDII = c("BHS_265", "BHS_265", "BHS_770", "BHS_770", 
"BHS_770", "BHS_377", "BHS_377", "BHS_377", "BHS_377"), IndYear = c("BHS_265-2015", 
"BHS_265-2016", "BHS_770-2016", "BHS_770-2017", "BHS_770-2018", 
"BHS_377-2015", "BHS_377-2016", "BHS_377-2017", "BHS_377-2018"
), WintLat = c(47.6102519805014, 47.5988417247191, 42.9737859090909, 
42.9712914772727, 42.9724390816327, 43.3474354347826, 43.3555934579439, 
43.3519543396226, 43.3476466990291), WintLong = c(-112.720994832869, 
-112.708887595506, -109.039964727273, -109.036693522727, -109.050923061224, 
-109.482114456522, -109.444522149533, -109.45659254717, -109.489241553398
)), class = "data.frame", row.names = c(NA, -9L))

3 answers | 6 years ago

Henrik plannapus 6 years ago

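Using ave. Within each group, create a row index ( i <- seq_along(x) ). To get the first index of the rows to keep, sample one index from all but the last row index, then add 0:1 so that the sampled row and the one after it are selected ( sample(head(i, -1), 1) + 0:1 ). Check which row indices are among the sampled rows ( i %in% ...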

Dat[as.logical(ave(Dat$IndIDII, Dat$IndIDII, FUN = function(x){
  i <- seq_along(x)
  i %in% (sample(head(i, -1), 1) + 0:1)
})), ]

#   IndIDII      IndYear  WintLat  WintLong
# 1 BHS_265 BHS_265-2015 47.61025 -112.7210
# 2 BHS_265 BHS_265-2016 47.59884 -112.7089
# 4 BHS_770 BHS_770-2017 42.97129 -109.0367
# 5 BHS_770 BHS_770-2018 42.97244 -109.0509
# 7 BHS_377 BHS_377-2016 43.35559 -109.4445
# 8 BHS_377 BHS_377-2017 43.35195 -109.4566
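
To make the indexing step concrete, here is a small sketch for a single group of four rows. The vector `x` and the fixed seed are illustrative assumptions and not part of the answer itself.

```r
# Illustrative stand-in for one group's values (four observations)
x <- c("a", "b", "c", "d")

set.seed(1)                      # arbitrary seed, only for repeatability
i <- seq_along(x)                # row index within the group: 1 2 3 4
start <- sample(head(i, -1), 1)  # candidate first rows: 1, 2 or 3
keep <- i %in% (start + 0:1)     # TRUE for the sampled row and the one after it

sum(keep)          # always 2
diff(which(keep))  # always 1, i.e. the kept rows are adjacent
```

Because the last index is excluded before sampling, `start + 1` can never run past the end of the group.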

Similar, but more concise, with data.table, using the row index ( .I ) and the number of rows per group ( .N ):

library(data.table)
setDT(Dat)
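# Sample a within-group position first, then look up its table-wide row number in .I.
# Calling sample() on .I[-.N] directly misbehaves for a two-row group that does not
# start at row 1, because sample(n, 1) with a single number n draws from 1:n.
Dat[Dat[ , .I[sample(.N - 1L, 1L)] + 0:1, by = IndIDII]$V1]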
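
A caveat that applies whenever row indices are passed straight to `sample()`: if the vector has length one and holds a number n >= 1, base R samples from `1:n` instead of returning n. Since `.I` holds row numbers within the whole table, a two-row group that does not begin at row 1 could be affected. The sketch below shows the behaviour and one possible way around it; `safe_pick` is a hypothetical helper name introduced only for illustration.

```r
# sample() treats a single number n >= 1 as "sample from 1:n"
set.seed(42)
draws <- replicate(200, sample(5, 1))
range(draws)   # values lie somewhere in 1..5, not always 5

# Hypothetical helper: sample a position, then index, so that a
# length-one vector returns its only element
safe_pick <- function(v) v[sample.int(length(v), 1)]

safe_draws <- replicate(200, safe_pick(5))
unique(safe_draws)   # 5
```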

Jilber Urbina 6 years ago

Here is a solution using base R functions:

set.seed(505) # you can set whatever seed you want, I set 505 for reproducibility
lapply(split(Dat, Dat$IndIDII), function(x) {
  ind <- sample(nrow(x))
  cons <- if (ind[1] < max(ind)) {
    c(ind[1], ind[1] + 1)
  } else {
    c(ind[1], ind[1] - 1)
  }
  x[cons, ]
})

$`BHS_265`
  IndIDII      IndYear  WintLat  WintLong
1 BHS_265 BHS_265-2015 47.61025 -112.7210
2 BHS_265 BHS_265-2016 47.59884 -112.7089

$BHS_377
  IndIDII      IndYear  WintLat  WintLong
6 BHS_377 BHS_377-2015 43.34744 -109.4821
7 BHS_377 BHS_377-2016 43.35559 -109.4445

$BHS_770
  IndIDII      IndYear  WintLat  WintLong
3 BHS_770 BHS_770-2016 42.97379 -109.0400
4 BHS_770 BHS_770-2017 42.97129 -109.0367
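
Note that in the code above the last pair of rows is returned whenever `ind[1]` is either the last or the second-to-last row, so for groups with more than two rows that pair is chosen twice as often as each other pair. If equal probabilities matter, one option is to sample the starting row directly. The following is a minimal sketch under that assumption; the helper name `pick_pair` and the small stand-in data frame `toy` are hypothetical.

```r
# Small stand-in data with groups of 2, 3 and 4 rows (illustrative only)
toy <- data.frame(id  = rep(c("a", "b", "c"), times = c(2, 3, 4)),
                  obs = 1:9)

# Hypothetical helper: choose the first row uniformly from 1..(n - 1),
# then keep that row and the next one
pick_pair <- function(x) {
  start <- sample.int(nrow(x) - 1L, 1L)
  x[start + 0:1, ]
}

set.seed(505)
# Bind the per-group results back into a single data frame
res <- do.call(rbind, lapply(split(toy, toy$id), pick_pair))
res
```

Using `do.call(rbind, ...)` returns one data frame rather than a list, which is closer to the `df` asked for in the question.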

Calum You 6 years ago

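A tidyverse approach. To drop the helper row column from the result, uncomment the select() at the end of the function.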

Dat <- structure(list(IndIDII = c("BHS_265", "BHS_265", "BHS_770", "BHS_770", "BHS_770", "BHS_377", "BHS_377", "BHS_377", "BHS_377"), IndYear = c("BHS_265-2015", "BHS_265-2016", "BHS_770-2016", "BHS_770-2017", "BHS_770-2018", "BHS_377-2015", "BHS_377-2016", "BHS_377-2017", "BHS_377-2018"), WintLat = c(47.6102519805014, 47.5988417247191, 42.9737859090909, 42.9712914772727, 42.9724390816327, 43.3474354347826, 43.3555934579439, 43.3519543396226, 43.3476466990291), WintLong = c(-112.720994832869, -112.708887595506, -109.039964727273, -109.036693522727, -109.050923061224, -109.482114456522, -109.444522149533, -109.45659254717, -109.489241553398)), class = "data.frame", row.names = c(NA, -9L))

library(tidyverse)
set.seed(123)
sample_2_consecutive <- function(tbl, group_col){
  group_col <- enquo(group_col)
  with_rownums <- tbl %>%
    group_by(!!group_col) %>%
    mutate(row = row_number())
  rows_to_keep <- with_rownums %>%
    filter(row != max(row)) %>%
    sample_n(1) %>%
    mutate(row2 = row + 1) %>%
    gather(key, row, row, row2)
  with_rownums %>%
    semi_join(rows_to_keep, by = c(quo_name(quo(!!group_col)), "row")) %>%
    arrange(!!group_col, row) %>%
    ungroup() # %>%
  # select(-row)
}
sample_2_consecutive(Dat, IndIDII)
#> # A tibble: 6 x 5
#>   IndIDII IndYear      WintLat WintLong   row
#>   <chr>   <chr>          <dbl>    <dbl> <int>
#> 1 BHS_265 BHS_265-2015    47.6    -113.     1
#> 2 BHS_265 BHS_265-2016    47.6    -113.     2
#> 3 BHS_377 BHS_377-2017    43.4    -109.     3
#> 4 BHS_377 BHS_377-2018    43.3    -109.     4
#> 5 BHS_770 BHS_770-2016    43.0    -109.     1
#> 6 BHS_770 BHS_770-2017    43.0    -109.     2

Created with the reprex package (v0.2.0).
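
With a more recent dplyr, the same idea can probably be written more compactly with `slice()`, which evaluates its index expression once per group. This is a sketch rather than the answer's own code; it assumes dplyr is installed and uses a small stand-in data frame `toy` instead of `Dat`.

```r
library(dplyr)

# Small stand-in data with groups of 2, 3 and 4 rows (illustrative only)
toy <- data.frame(id  = rep(c("a", "b", "c"), times = c(2, 3, 4)),
                  obs = 1:9)

set.seed(123)
res <- toy %>%
  group_by(id) %>%
  # n() is the group size: pick a first row in 1..(n - 1), keep it and the next
  slice(sample.int(n() - 1L, 1L) + 0:1) %>%
  ungroup()
res
```

No helper column is created here, so there is nothing to remove afterwards.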