代码之家  ›  专栏  ›  技术社区  ›  C.Robin

正在从mutate(across())中减去{.col}

  •  1
  • C.Robin  · 技术社区  · 10 月前

    假设我有以下数据:

    df <- structure(list(treat = structure(1:4, levels = c("Control", "Treatment 1", 
    "Treatment 2", "Treatment 3"), class = "factor"), 
        female_n = c(314709L, 10456L, 10481L, 10455L), female_mean = c(0.506, 
        0.506, 0.504, 0.5), female_sd = c(0.5, 0.5, 0.5, 0.5), birth_year_n = c(314709L, 
        10456L, 10481L, 10455L), birth_year_mean = c(1973.74, 1973.654, 
        1973.486, 1973.766), birth_year_sd = c(16.867, 16.997, 16.869, 
        16.89), provided_phone_no_n = c(314709L, 10456L, 10481L, 
        10455L), provided_phone_no_mean = c(0.656, 0.666, 0.663, 
        0.647), provided_phone_no_sd = c(0.475, 0.472, 0.473, 0.478
        ), dem_n = c(314709L, 10456L, 10481L, 10455L), dem_mean = c(0.48, 
        0.474, 0.482, 0.478), dem_sd = c(0.5, 0.499, 0.5, 0.5), rep_n = c(314709L, 
        10456L, 10481L, 10455L), rep_mean = c(0.136, 0.141, 0.142, 
        0.138), rep_sd = c(0.343, 0.348, 0.349, 0.345), uaf_n = c(314709L, 
        10456L, 10481L, 10455L), uaf_mean = c(0.363, 0.365, 0.357, 
        0.363), uaf_sd = c(0.481, 0.481, 0.479, 0.481)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -4L))
    

    我想添加一个新的 *_se 列,该列将现有的 *_n *_sd 我的数据中每个变量组的列。即每个一个 female_* , birth_year_* , provided_phone_no_* , dem_* , rep_* uaf_* .

    我想,试图做到这一点 mutate(across()) 可能是正确的helper函数,但我在substring中遇到了一些问题 {.col} 并让R将其识别为列名。这是我迄今为止的尝试:

    df %>%
        mutate(
        across(ends_with("_sd"),
               list(
                se = ~.x / sqrt(!!ensym("{str_replace(.col, '_sd', '_n')}"))
               )
        )
    

    以上返回错误:

    Error in `ensym()`:
    ! `arg` must be a symbol
    Backtrace:
      1. ... %>% ...
     10. rlang::abort(message = message)
    

    有人能看到我哪里错了吗?

    1 回复  |  直到 10 月前
        1
  •  4
  •   LMc    10 月前

    这确实是将数据转换为长格式,然后再转换回宽格式的理想情况:

    library(tidyr)
    library(dplyr)
    
    df |>
      pivot_longer(cols = -treat, 
                   names_pattern = "(.*)_(.*)", 
                   names_to = c("grp", ".value")) |>
      mutate(se = sd / sqrt(n)) |>
      pivot_wider(names_from = grp, 
                  values_from = n:se, 
                  names_glue = "{grp}_{.value}", 
                  names_vary = "slowest")
    #> # A tibble: 4 × 25
    #>   treat    female_n female_mean female_sd female_se birth_year_n birth_year_mean
    #>   <fct>       <int>       <dbl>     <dbl>     <dbl>        <int>           <dbl>
    #> 1 Control    314709       0.506       0.5  0.000891       314709           1974.
    #> 2 Alexand…    10456       0.506       0.5  0.00489         10456           1974.
    #> 3 Politic…    10481       0.504       0.5  0.00488         10481           1973.
    #> 4 Mark yo…    10455       0.5         0.5  0.00489         10455           1974.
    #> # ℹ 18 more variables: birth_year_sd <dbl>, birth_year_se <dbl>,
    #> #   provided_phone_no_n <int>, provided_phone_no_mean <dbl>,
    #> #   provided_phone_no_sd <dbl>, provided_phone_no_se <dbl>, dem_n <int>,
    #> #   dem_mean <dbl>, dem_sd <dbl>, dem_se <dbl>, rep_n <int>, rep_mean <dbl>,
    #> #   rep_sd <dbl>, rep_se <dbl>, uaf_n <int>, uaf_mean <dbl>, uaf_sd <dbl>,
    #> #   uaf_se <dbl>
    

    如果你真的想以宽格式进行,这里有一个可能的解决方案:

    library(dplyr)
    library(stringr)
    
    df |> 
      mutate(across(ends_with("_sd"), \(x) {
        x / sqrt(pick(all_of(str_replace(cur_column(), "_sd", "_n")))[[1]])
      }, 
      .names = "{str_remove(.col, '_sd')}_se"
      ))