
R: Estimating weighted quantiles by group and assigning them to observations

  • socialscientist · 7 years ago

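    I have sampling weights, and I am trying to compute, within each group, the quantile (0 to 100) of every observation of a continuous variable (call it "value") and assign each observation its quantile in a new variable.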

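    In other words, each row is an observation, and each observation belongs to one group. Every group has 2 or more observations. Within each group, I need to estimate the distribution of value using the sampling weights in the data, determine which percentile of its group's distribution each observation falls at, and then add that percentile to the data frame as a column.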
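One way to make this precise, assuming the percentile is read as a weighted empirical distribution function evaluated at each observation (the symbols are introduced here for illustration and do not come from the original post):

```latex
\hat{F}_g(x_i) = \frac{\sum_{j \in g} w_j \, \mathbf{1}[x_j \le x_i]}{\sum_{j \in g} w_j},
\qquad
p_i = 100 \cdot \hat{F}_{g(i)}(x_i)
```

Here $x_i$ is the value of observation $i$, $w_j$ is the sampling weight of observation $j$, $g(i)$ is the group that observation $i$ belongs to, and $p_i$ is the percentile to be stored in the new column.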

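    As far as I can tell, the survey package offers svyby() and svyquantile(), but the latter returns the values at specified quantiles rather than the quantile of a given observation.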

    # Load survey package
    library(survey)
    
    # Set seed for replication
    set.seed(123)
    
    # Create data with value, group, weight
    dat <- data.frame(value = 1:6, 
                      group = rep(1:3, 2), 
                      weight = abs(rnorm(6)))
    
    # Declare survey design (weights must be given as a formula)
    d <- survey::svydesign(ids = ~1, data = dat, weights = ~weight) 
    
    # Do something to calculate the quantile and add it to the data
    # ????
    

    This is similar to the following question, except that one is not done by subgroup: Compute quantiles incorporating Sample Design (Survey package)

    1 answer  |  as of 7 years ago
  • socialscientist · 7 years ago

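    I came up with a solution. The sequence of statements below runs inside mutate() from dplyr for each group, and the per-group results are combined with dplyr::bind_rows().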

    # Load dplyr for the pipe (%>%) and the data verbs used below
    library(dplyr)
    
    # Set seed for replication
    set.seed(123)
    
    # Create data with value, group, weight
    dat <- data.frame(value = 1:6, 
                      group = rep(1:3, 2), 
                      weight = abs(rnorm(6)))
    
    # Initialize list for storing group results
    # Setting the length of the list is quicker than
    # creating an empty list and growing it
    quantile_list <- vector("list", length(unique(dat$group)))
    
    # Initialize variable to indicate initial iteration
    iteration <- 0
    
    # estimate the decile of each respondent
    # in a large for-loop
    
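    for (g in unique(dat$group)) {
    
      # Keep only observations for the current group. The loop variable
      # must not be called `group`: inside filter(), `group == group`
      # compares the column with itself and keeps every row.
      temp <- dat %>% dplyr::filter(group == g)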
    
      # Create subset with missing values
      temp_missing <- temp %>% dplyr::filter(is.na(value))
    
      # Create subset without missing values
      temp_nonmissing <- temp %>% dplyr::filter(!is.na(value))
    
      # Sort observations with value on value, calculate cumulative
      # sum of sampling weights, create variable indicating the decile
      # of responses. 1 = lowest, 10 = highest
      temp_nonmissing <- temp_nonmissing %>% 
        dplyr::arrange(value) %>%
        dplyr::mutate(
          cumulative_weight = cumsum(weight),
          cumulative_weight_prop = cumulative_weight / sum(weight),
          decile = dplyr::case_when(
            cumulative_weight_prop < 0.10 ~ 1,
            cumulative_weight_prop >= 0.10 & cumulative_weight_prop < 0.20 ~ 2,
            cumulative_weight_prop >= 0.20 & cumulative_weight_prop < 0.30 ~ 3,
            cumulative_weight_prop >= 0.30 & cumulative_weight_prop < 0.40 ~ 4,
            cumulative_weight_prop >= 0.40 & cumulative_weight_prop < 0.50 ~ 5,
            cumulative_weight_prop >= 0.50 & cumulative_weight_prop < 0.60 ~ 6,
            cumulative_weight_prop >= 0.60 & cumulative_weight_prop < 0.70 ~ 7,
            cumulative_weight_prop >= 0.70 & cumulative_weight_prop < 0.80 ~ 8,
            cumulative_weight_prop >= 0.80 & cumulative_weight_prop < 0.90 ~ 9,
            cumulative_weight_prop >= 0.90 ~ 10))
    
      # Increment the iteration of the for loop
      iteration <- iteration + 1
    
      # Join the data with missing values and the data without
      # missing values on the value variable into
      # a single data frame
      quantile_list[[iteration]] <- dplyr::bind_rows(temp_nonmissing, temp_missing)
      }
    
    # Convert the list of data frames into a single dataframe
    out <- dplyr::bind_rows(quantile_list)
    
    # Show outcome
    head(out)
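
For comparison, here is a minimal base R sketch of the same calculation that needs no packages. It is not the approach posted above, only an illustration under some assumptions: the helper name `weighted_cum_prop` is hypothetical, weights are assumed to be non-missing and positive, and tied values are ranked in their sort order just as in the loop (they do not share a percentile). Rows with a missing value get NA, and the original row order is kept.

```r
# Same toy data as in the question
set.seed(123)
dat <- data.frame(value  = 1:6,
                  group  = rep(1:3, 2),
                  weight = abs(rnorm(6)))

# Hypothetical helper: cumulative share of the sampling weight at each
# observation of one group, after sorting by value. Rows whose value is
# missing are left as NA and do not contribute to the total weight.
weighted_cum_prop <- function(value, weight) {
  out <- rep(NA_real_, length(value))
  ok  <- !is.na(value)
  ord <- order(value[ok])
  cum <- cumsum(weight[ok][ord]) / sum(weight[ok])
  out[which(ok)[ord]] <- cum
  out
}

# Apply the helper to the row indices of each group
dat$cumulative_weight_prop <- NA_real_
for (i in split(seq_len(nrow(dat)), dat$group)) {
  dat$cumulative_weight_prop[i] <- weighted_cum_prop(dat$value[i], dat$weight[i])
}

# Percentile on a 0 to 100 scale
dat$percentile <- 100 * dat$cumulative_weight_prop

# Deciles with the same cut points as the case_when() ladder:
# below 0.1 gives 1, [0.1, 0.2) gives 2, ..., 0.9 and above gives 10
dat$decile <- findInterval(dat$cumulative_weight_prop, (1:9) / 10) + 1

dat
```

The breakpoints are written as `(1:9) / 10` rather than `seq(0.1, 0.9, by = 0.1)` so that they match the literals 0.10, 0.20, and so on exactly; accumulated floating-point error in `seq()` can shift a boundary such as 0.3 by a tiny amount.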