R: Counting observations in bins

r
  •  0
  • stats_noob  · 1 year ago

    I have the following dataset in R:

    library(dplyr)
    
    set.seed(123)
    n <- 100
    country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
    gender <- sample(c("M", "F"), n, replace = TRUE)
    age <- sample(18:100, n, replace = TRUE)
    height <- runif(n, min = 150, max = 180)
    owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
    
    df <- data.frame(country, gender, age, height, owns_bicycle)
    

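    My question:

    • First, I want to split height into 5 equally sized groups based on its values (e.g. 0%-20%, 20%-40%, etc.).
    • Next, I want to split age into 5 equally sized groups based on its values (e.g. 0%-20%, 20%-40%, etc.).
    • Then, for each unique combination of country, gender, age group and height group, I want to find the percentage of people who own a bicycle.
    • This kind of analysis would let me make statements such as: "If you are a man from the USA, aged 30-35 and 150-155 cm tall, you have a 43% chance of owning a bicycle."
    • To clarify: each person should belong to exactly one group, and each group should contain roughly the same number of people.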

    Here is the R code I wrote:

    final = df %>%
      mutate(height_group = cut(height, breaks = 5),
             age_group = cut(age, breaks = 5)) %>%
      group_by(country, gender, height_group, age_group) %>%
      summarise(count = n(),
                percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) 
    

    Can someone tell me whether I have done this correctly?

    > final
    # A tibble: 67 x 6
    # Groups:   country, gender, height_group [29]
       country gender height_group age_group   count percent_own_bicycle
       <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
     1 Canada  F      (151,157]    (34.2,50.4]     4                  25
     2 Canada  F      (151,157]    (66.6,82.8]     2                   0
     3 Canada  F      (157,162]    (17.9,34.2]     2                   0
     4 Canada  F      (157,162]    (34.2,50.4]     1                 100
     5 Canada  F      (157,162]    (50.4,66.6]     2                   0
     6 Canada  F      (157,162]    (82.8,99.1]     1                   0
     7 Canada  F      (162,168]    (82.8,99.1]     2                  50
     8 Canada  F      (168,174]    (17.9,34.2]     3                   0
     9 Canada  F      (168,174]    (34.2,50.4]     1                 100
    10 Canada  F      (174,180]    (17.9,34.2]     1                   0
    # ... with 57 more rows
    # i Use `print(n = ...)` to see more rows
    

    Thanks

    1 reply  |  as of 1 year ago
        1
  •  2
  •   Ricardo Semião    1 year ago

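    Be careful when asking "is my code correct" questions.

    That said, your code looks great! However, cut() with a single integer as the breaks argument is not what you want. From its help page:

    When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length

    So it does not split the data according to its distribution, but according to its range: the intervals are equally wide, not equally populated. You want to use quantile() to find the breaks. Compare: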

    > cut(df$height, 5) %>% levels()
    [1] "(151,157]" "(157,162]" "(162,168]" "(168,174]" "(174,180]"
    > cut(df$height, breaks = quantile(df$height, seq(0, 1, 0.2))) %>% levels()
    [1] "(151,156]" "(156,160]" "(160,167]" "(167,174]" "(174,180]"
    
    > cut(df$age, 5) %>% levels()
    [1] "(17.9,34.2]" "(34.2,50.4]" "(50.4,66.6]" "(66.6,82.8]" "(82.8,99.1]"
    > cut(df$age, breaks = quantile(df$age, seq(0, 1, 0.2))) %>% levels()
    [1] "(18,31]"     "(31,45.2]"   "(45.2,62.4]" "(62.4,78.4]" "(78.4,99]"
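
    A quick way to check the "roughly the same number of people" requirement is to tabulate both sets of bins. The sketch below rebuilds the question's data so that it runs on its own. Note one assumption: it passes include.lowest = TRUE, which is not in the calls above, so that the smallest height lands in the first bin instead of becoming NA.

```r
# Rebuild the data exactly as in the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# Equal-width bins: every interval has the same length,
# so the counts depend on how the data are distributed
width_counts <- table(cut(df$height, breaks = 5))

# Quantile bins: interval lengths vary, counts are (nearly) equal.
# include.lowest = TRUE is an addition made for this sketch; it closes
# the first interval on the left so the minimum is not dropped.
height_breaks <- quantile(df$height, probs = seq(0, 1, 0.2))
quantile_counts <- table(cut(df$height, breaks = height_breaks,
                             include.lowest = TRUE))

print(width_counts)
print(quantile_counts)
```

    Because the 100 simulated heights are all distinct, every quantile bin should hold 20 observations, while the equal-width counts are free to differ from bin to bin.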
    

    Applying this to your code:

    df %>%
      mutate(height_group = cut(height, breaks = quantile(height, seq(0, 1, 0.2))),
             age_group = cut(age, breaks = quantile(age, seq(0, 1, 0.2)))) %>%
      group_by(country, gender, height_group, age_group) %>%
      summarise(count = n(),
                percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) 
    
    # A tibble: 75 × 6
    # Groups:   country, gender, height_group [31]
       country gender height_group age_group   count percent_own_bicycle
       <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
     1 Canada  F      (151,156]    (31,45.2]       3                33.3
     2 Canada  F      (151,156]    (62.4,78.4]     1                 0  
     3 Canada  F      (151,156]    (78.4,99]       1                 0  
     4 Canada  F      (156,160]    (18,31]         2                 0  
     5 Canada  F      (156,160]    (31,45.2]       1               100  
     6 Canada  F      (156,160]    (62.4,78.4]     1                 0  
     7 Canada  F      (156,160]    (78.4,99]       1                 0  
     8 Canada  F      (160,167]    (45.2,62.4]     1                 0  
     9 Canada  F      (160,167]    (78.4,99]       1                 0  
    10 Canada  F      (167,174]    (18,31]         3                 0  
    # ℹ 65 more rows
    # ℹ Use `print(n = ...)` to see more rows
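
    One caveat about the code above: cut() builds intervals of the form (a, b] by default, so when the lowest break is the minimum of the data, that minimum falls outside every interval and is assigned NA. The sketch below counts those NAs and then reruns the pipeline with include.lowest = TRUE. The names height_breaks, age_breaks and final_fixed are made up for this illustration, and .groups = "drop" is an optional extra that returns an ungrouped tibble.

```r
library(dplyr)

# Rebuild the data exactly as in the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

height_breaks <- quantile(df$height, seq(0, 1, 0.2))
age_breaks <- quantile(df$age, seq(0, 1, 0.2))

# With the default (a, b] intervals, rows equal to the minimum get NA
n_na_height <- sum(is.na(cut(df$height, breaks = height_breaks)))
n_na_age <- sum(is.na(cut(df$age, breaks = age_breaks)))

# include.lowest = TRUE turns the first interval into [a, b],
# so every person is assigned to exactly one group
final_fixed <- df %>%
  mutate(height_group = cut(height, breaks = height_breaks,
                            include.lowest = TRUE),
         age_group = cut(age, breaks = age_breaks,
                         include.lowest = TRUE)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")
```

    Checking that sum(final_fixed$count) equals nrow(df) confirms that every person ended up in exactly one cell.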
    
        2
  •  0
  •   stats_noob    1 year ago

    Using the logic from the answer by @Ricardo Semião e Castro, here is a solution based on the data.table library:

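    library(data.table)
    
    # data.table(df) creates a copy, so the `:=` below does not modify df
    dt = data.table(df)
    
    # Step 1: add the quantile-based group columns to dt by reference
    # Step 2: count rows and compute the ownership percentage per combination
    data_table_result = dt[, `:=`(
        height_group = cut(height, breaks = quantile(height, seq(0, 1, 0.2))),
        age_group    = cut(age, breaks = quantile(age, seq(0, 1, 0.2)))
    )][, .(count = .N,
           percent_own_bicycle = mean(owns_bicycle == "Yes") * 100),
       by = .(country, gender, height_group, age_group)]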
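
    The solution above adds height_group and age_group to dt itself, because `:=` works by reference. If that side effect is unwanted, one possible variation is to compute the groups directly inside the grouping clause. This is only a sketch: the names h_breaks, a_breaks and res are made up here, include.lowest = TRUE is added so the minimum values are not turned into NA, and keyby is used instead of by so the result comes back sorted by the grouping columns.

```r
library(data.table)

# Rebuild the data exactly as in the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

dt <- as.data.table(df)

# Quintile break points, computed once up front
h_breaks <- quantile(dt$height, seq(0, 1, 0.2))
a_breaks <- quantile(dt$age, seq(0, 1, 0.2))

# Grouping expressions are evaluated on the fly, so dt gains no new columns
res <- dt[, .(count = .N,
              percent_own_bicycle = mean(owns_bicycle == "Yes") * 100),
          keyby = .(country, gender,
                    height_group = cut(height, breaks = h_breaks,
                                       include.lowest = TRUE),
                    age_group = cut(age, breaks = a_breaks,
                                    include.lowest = TRUE))]
```

    After this runs, dt still has only its five original columns, and the counts in res add up to the number of rows in dt.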