Optimizing plots of categorical variables with ggplot2 facet_grid: plotting the proportion for only one of the two values of a dichotomous variable

  • marcel  · 5 years ago

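    I have a large dataset with more than 150 categorical and continuous variables. Each observation (row) belongs to either group A or group B. For example: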

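    library(dplyr)    # select(), mutate(), count(), group_by(), %>%
    library(tidyr)    # pivot_longer()
    library(ggplot2)  # ggplot(), geom_col(), facet_grid()

    set.seed(16)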
    mydf <- data.frame(ID = 1:500, group = sample(c("A", "B", "B", "B"), 500, replace = TRUE), 
    length = rnorm(n = 500, mean = 0, sd = 1), 
    weight = runif(500, min=0, max=1), 
    color = sample(c("red", "orange", "yellow", "green", "blue"), 500,  replace = TRUE), 
    size = sample(c("big", "small"), 500, replace = TRUE), 
    age = sample(c("old", "young"), 500, replace = TRUE))
    

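    I am trying to optimize the plot layout to visualize the relationship between group and the proportional counts of the categorical variables. So far, with some help from an earlier post ( https://stackoverflow.com/a/59562290/1905571 ), I have a plot built with ggplot2 facet_grid, but I have run into two problems.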

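    Problem A: The facets appear in alphabetical order (e.g., big, old, small, young) instead of being grouped by variable (age: young to old; size: big to small; and so on).

    Problem B: For categorical variables with only two possible values, I want to plot the proportion of just one of those values in groups A and B. For example, plot only the proportion of "old" in groups A and B, because a plot of the proportion of "young" adds no new information. Other categorical variables with more than two values, such as color, should get a bar for every value.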

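    I solved Problem A by setting the factor levels to the desired plotting order with mutate(value = factor(value, levels = c("big", "small", "young", "old", "red", "orange", "yellow", "green", "blue"))). The facets now appear in the specified order, with the age values next to each other, the colors next to each other, and so on.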

    data_cat <- 
      mydf %>% select(-ID) %>%
      mutate_if(.predicate = is.factor, .funs = as.character) %>%
      mutate(group = factor(group)) %>%
      pivot_longer(cols = which(sapply(., is.character)), names_to = 'key', values_to =  'value')%>%
      count(group, key, value) %>%
      group_by(group, key) %>%
      mutate(percent =  n/ sum(n)) %>%
      mutate(value = factor(value, levels=c("big", "small", "young", "old", "red", "orange", "yellow", "green", "blue")))
    
    ggplot(data_cat) +
      geom_col(aes(group, percent, fill = key)) +
      facet_grid(~ value)
    

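    That leaves Problem B: suppressing the plot of one of the two outcomes for dichotomous categorical variables. I think I need a way to extract the number of factor levels of each variable and then handle the subset where that number == 2. I have searched but have not found a way to do this.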

  •   Conner Sexton    5 years ago

    library(dplyr)
    library(tidyr)
    library(ggplot2)

    set.seed(16)
    mydf <- data.frame(ID = 1:500, group = sample(c("A", "B", "B", "B"), 500, replace = TRUE), 
                       length = rnorm(n = 500, mean = 0, sd = 1), 
                       weight = runif(500, min=0, max=1), 
                       color = sample(c("red", "orange", "yellow", "green", "blue"), 500,  replace = TRUE), 
                       size = sample(c("big", "small"), 500, replace = TRUE), 
                       age = sample(c("old", "young"), 500, replace = TRUE))
    
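    # Flag categorical columns with exactly two distinct values. Counting unique
    # values also works for character columns, where levels() returns NULL
    # (data.frame() no longer creates factors by default since R 4.0).
    key <- sapply(mydf, function(x){(is.character(x) || is.factor(x)) && length(unique(x)) == 2})
    # Exclude the grouping variable by name rather than by position
    dichotomous <- setdiff(names(which(key)), "group")
    
    mydf %>% select(-ID) %>%
      mutate_if(.predicate = is.factor, .funs = as.character) %>%
      # Set the second value (in order of first appearance) of each dichotomous variable to NA
      mutate_at(.vars = dichotomous, .funs = function(x){ifelse(x == unique(x)[2], NA, x)}) %>%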
      mutate(group = factor(group)) %>%
      pivot_longer(cols = which(sapply(., is.character)), names_to = 'key', values_to =  'value')%>%
      count(group, key, value) %>%
      group_by(group, key) %>%
      mutate(percent =  n/ sum(n)) %>%
      mutate(value = factor(value, levels=c("big", "small", "young", "old", "red", "orange", "yellow", "green", "blue"))) %>%
      na.omit() -> data_cat
    
    ggplot(data_cat) +
      geom_col(aes(group, percent, fill = key)) +
      facet_grid(~ value)
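
    The code above keeps whichever value of each dichotomous variable happens to appear first in the data, which may or may not be "old". If the value to plot needs to be chosen explicitly, one option is to compute the proportions first and filter afterwards. This is only a minimal sketch: the keep_value lookup is a hypothetical helper introduced here for illustration, and the data frame is a reduced version of the example data with only the categorical columns.

```r
library(dplyr)
library(tidyr)

set.seed(16)
mydf <- data.frame(
  group = sample(c("A", "B", "B", "B"), 500, replace = TRUE),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 500, replace = TRUE),
  size  = sample(c("big", "small"), 500, replace = TRUE),
  age   = sample(c("old", "young"), 500, replace = TRUE)
)

# Hypothetical lookup (an assumption for this sketch):
# for each dichotomous variable, the single value that should be plotted.
keep_value <- c(size = "big", age = "old")

data_cat <- mydf %>%
  pivot_longer(cols = -group, names_to = "key", values_to = "value") %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  # Proportions are computed while both values are still present,
  # so the denominator is the full group size.
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  # Keep all rows of multi-valued variables, but only the chosen
  # value of each dichotomous variable.
  filter(!(key %in% names(keep_value)) | value == unname(keep_value[key]))
```

    The resulting data_cat has the same columns (group, key, value, n, percent) as before, so after setting the factor levels of value it can be passed to the same ggplot() call.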