代码之家  ›  专栏  ›  技术社区  ›  Mako212

整理并转换数据,标题成行排列

  •  1
  • Mako212  · 技术社区  · 7 年前
    demodf <- data.frame(
      name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
      Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
      Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"))
    
       name     Field    Values
    1  Mike EDUCATION EDUCATION
    2  Mike    Degree   Masters
    3  Mike     Title   Student
    4  Mike      WORK      WORK
    5  Mike     Title  VP Sales
    6   Joe EDUCATION EDUCATION
    7   Joe    Degree  Bachelors
    8   Joe     Title   Student
    9   Joe      WORK      WORK
    10  Joe     Title   Analyst
    

    我想 tidyr::spread reshape2::dcast 转换为宽格式,其中 Field 成为列标题。

    该代码看起来像 dcast(demodf, name ~ Values) demodf %>% spread(Field, Values) dcast 强制为数字,以及 spread 抛出错误。

    问题是“标题”重复出现。你可以看到,由于数据中的一个怪癖,我们在数据中有教育和工作作为“假”标题。是否可以标记每个 领域 标题大写的条目,以便 dcast 将工作(即。 Title_EDUCATION Title_WORK )? 将这种转变应用到整体上会更好 领域 ,所以“教育”和“工作”一起消失了,剩下的只有 Degree_EDUCATION, TITLE_EDUCATION

    请注意,实际数据中还有更多的标题,因此最好将“假标题”标识为所有cap条目,或其中包含的条目 Field == Values

    期望输出:

    output <- data.frame(
     Name=c("Mike", "Joe"),
     Degree_EDUCATION =c("Masters", "Bachelors"),
     Title_EDUCATION = c("Student", "Student"),
     Title_WORK= c("VP Sales", "Analyst"))
    
      Name Degree_EDUCATION Title_EDUCATION Title_WORK
    1 Mike          Masters         Student   VP Sales
    2  Joe        Bachelors         Student    Analyst
    
    2 回复  |  直到 7 年前
        1
  •  3
  •   Aaron left Stack Overflow    7 年前

    关键是将重复的分类行添加为新列,然后您可以轻松地使用它。

    首先,添加 stringsAsFactors=FALSE 所以可以比较 Field Values :

    demodf <- data.frame(
      name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
      Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
      Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"),
      stringsAsFactors=FALSE)
    

    现在使用 tidyr dplyr

    library(tidyr)
    library(dplyr)
    d2 <- demodf %>% mutate(IsCategory=Field==Values,
                            Category=ifelse(IsCategory, Field, NA)) %>%
      fill(Category) %>% subset(!IsCategory, select=-IsCategory)
    d2
    ##    name  Field    Values  Category
    ## 2  Mike Degree   Masters EDUCATION
    ## 3  Mike  Title   Student EDUCATION
    ## 5  Mike  Title  VP Sales      WORK
    ## 7   Joe Degree Bachelors EDUCATION
    ## 8   Joe  Title   Student EDUCATION
    ## 10  Joe  Title   Analyst      WORK
    

    dcast 然后会像你希望的那样工作!

    library(reshape2)    
    dcast(d2, name ~ Field+Category, value.var="Values")
    ##   name Degree_EDUCATION Title_EDUCATION Title_WORK
    ## 1  Joe        Bachelors         Student    Analyst
    ## 2 Mike          Masters         Student   VP Sales
    
        2
  •  0
  •   lmo    7 年前

    这是一个尝试 data.table

    library(data.table)
    # get groupings by titles (all caps)
    setDT(demodf)[, head := cumsum(Field == toupper(Field))]
    # merge titles onto full dataset and paste title to Field
    demodf[demodf[Field == toupper(Field), .(Field, head)], on="head",
           Field := paste(Field, i.Field, sep="_"), by=.EACHI]
    # now reshape wide
    dcast(demodf[Values != toupper(Values),], name~Field, value.var="Values")
    

       name Degree_EDUCATION Title_EDUCATION Title_WORK
    1:  Joe        Bachelors         Student    Analyst
    2: Mike          Masters         Student   VP Sales
    

    数据

    demodf <-
    structure(list(name = c("Mike", "Mike", "Mike", "Mike", "Mike", 
    "Joe", "Joe", "Joe", "Joe", "Joe"), Field = c("EDUCATION", "Degree", 
    "Title", "WORK", "Title", "EDUCATION", "Degree", "Title", "WORK", 
    "Title"), Values = c("EDUCATION", "Masters", "Student", "WORK", 
    "VP Sales", "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"
    )), .Names = c("name", "Field", "Values"), row.names = c(NA, 
    -10L), class = "data.frame")