
Manipulating a dataset to account for repeated measures

  •  0
  • D500  ·  7 years ago

    Given:

    df <- data.frame(
                      CompanyID=c("Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers"
                                ,"Drinkers","Drinkers", "Liquders","Liquders","Liquders","PelletCoffeeCo","PelletCoffeeCo"),
                      Email= c("john@coffee.com", "john@coffee.com","john@coffee.com","john@coffee.com", "john@coffee.com", 
                              "john@coffee.com", "john@coffee.com", "john@coffee.com", "john@coffee.com", "john@coffee.com",
                              "george@liquid.com","george@liquid.com","george@liquid.com","stacy@pelletcoffee.com",
                            "stacy@pelletcoffee.com"),
                      Day= c("1","2","3","4","5","6","7","8","9","10","1","2","3","1","2"),
                     var1= c(4,5,5,5,2,3,2,7,6,5,7,6,6,2,3))
    

    df2 <- data.frame(CompanyID=c("Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers","Drinkers"
                                ,"Drinkers","Drinkers", "Liquders","Liquders","Liquders","Liquders","Liquders","Liquders",
                                "Liquders","Liquders","Liquders","Liquders", "PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo",
                                "PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo","PelletCoffeeCo",
                                "PelletCoffeeCo","PelletCoffeeCo"),
                      Email= c("john@coffee.com", "john@coffee.com","john@coffee.com","john@coffee.com", "john@coffee.com", 
                                 "john@coffee.com", "john@coffee.com", "john@coffee.com", "john@coffee.com", "john@coffee.com",
                               "george@liquid.com","george@liquid.com","george@liquid.com","george@liquid.com","george@liquid.com",
                               "george@liquid.com","george@liquid.com","george@liquid.com","george@liquid.com","george@liquid.com","stacy@pelletcoffee.com",
                               "stacy@pelletcoffee.com","stacy@pelletcoffee.com","stacy@pelletcoffee.com","stacy@pelletcoffee.com",
                               "stacy@pelletcoffee.com","stacy@pelletcoffee.com","stacy@pelletcoffee.com","stacy@pelletcoffee.com",
                               "stacy@pelletcoffee.com"),
                      Day= c("1","2","3","4","5","6","7","8","9","10","1","2","3","4","5","6","7","8","9","10",
                             "1","2","3","4","5","6","7","8","9","10"),
                      var1= c(4,5,5,5,2,3,2,7,6,5,7,6,6, NA,NA,NA,NA,NA,NA,NA, 2,3,NA,NA,NA,NA,NA,NA,NA,NA))
    

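    Explanation: I have data from surveying people once a day for 10 days. In a perfect world I would get 10 responses from every participant, denoted Day 1 through Day 10. However, because of nonresponse, some participants gave 3 responses, others 6, others 10, and so on. I am setting up the data to run a growth model, so I need the "Day" column to always run from Day 1 to Day 10 for every participant, whether or not there is data for those days. df is the data I have; df2 is my attempt to show the desired result, with NA added for the days that have no response.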
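Before filling anything in, it can help to see how many days each participant is missing. The following is a minimal base R sketch, assuming a data frame shaped like the question's df (rebuilt here with rep() and with Day stored as integer for convenience; the names n_obs and n_missing are illustrative only).

```r
# Toy data mirroring the question's df (Day as integer, an assumption for convenience)
df <- data.frame(
  CompanyID = rep(c("Drinkers", "Liquders", "PelletCoffeeCo"), times = c(10, 3, 2)),
  Email     = rep(c("john@coffee.com", "george@liquid.com", "stacy@pelletcoffee.com"),
                  times = c(10, 3, 2)),
  Day       = c(1:10, 1:3, 1:2),
  var1      = c(4, 5, 5, 5, 2, 3, 2, 7, 6, 5, 7, 6, 6, 2, 3),
  stringsAsFactors = FALSE
)

# Number of observed days per participant
n_obs <- table(df$Email)

# Days still missing per participant, out of the 10 planned survey days
n_missing <- 10 - n_obs
print(n_missing)
```

The total of n_missing is the number of NA rows that any fill-in approach should add.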

    2 answers  |  last active 3 years ago
        1
  •  2
  •   www    7 years ago

    Try this:

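    library(tidyr)
    
    # Day is stored as text in df; convert it so min(), max() and seq() work
    df$Day <- as.integer(as.character(df$Day))
    
    df %>% 
      complete(nesting(CompanyID, Email), Day = seq(min(Day), max(Day), 1L)) %>%
      data.frame()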
    

    Output:

            CompanyID                  Email Day var1
    1        Drinkers        john@coffee.com   1    4
    2        Drinkers        john@coffee.com   2    5
    3        Drinkers        john@coffee.com   3    5
    4        Drinkers        john@coffee.com   4    5
    5        Drinkers        john@coffee.com   5    2
    6        Drinkers        john@coffee.com   6    3
    7        Drinkers        john@coffee.com   7    2
    8        Drinkers        john@coffee.com   8    7
    9        Drinkers        john@coffee.com   9    6
    10       Drinkers        john@coffee.com  10    5
    11       Liquders      george@liquid.com   1    7
    12       Liquders      george@liquid.com   2    6
    13       Liquders      george@liquid.com   3    6
    14       Liquders      george@liquid.com   4   NA
    15       Liquders      george@liquid.com   5   NA
    16       Liquders      george@liquid.com   6   NA
    17       Liquders      george@liquid.com   7   NA
    18       Liquders      george@liquid.com   8   NA
    19       Liquders      george@liquid.com   9   NA
    20       Liquders      george@liquid.com  10   NA
    21 PelletCoffeeCo stacy@pelletcoffee.com   1    2
    22 PelletCoffeeCo stacy@pelletcoffee.com   2    3
    23 PelletCoffeeCo stacy@pelletcoffee.com   3   NA
    24 PelletCoffeeCo stacy@pelletcoffee.com   4   NA
    25 PelletCoffeeCo stacy@pelletcoffee.com   5   NA
    26 PelletCoffeeCo stacy@pelletcoffee.com   6   NA
    27 PelletCoffeeCo stacy@pelletcoffee.com   7   NA
    28 PelletCoffeeCo stacy@pelletcoffee.com   8   NA
    29 PelletCoffeeCo stacy@pelletcoffee.com   9   NA
    30 PelletCoffeeCo stacy@pelletcoffee.com  10   NA
    

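    Edit:

    The code above fills the Day column for each group with a complete set of day values, defined by the minimum and maximum of the existing values in that column (1 and 10, respectively). The groups that get filled can be redefined as needed; here I chose company + email, via nesting(CompanyID, Email). The data.frame() line is only there to convert the output to a data.frame instead of a tibble. If data.frame output is not needed, feel free to change or remove that line.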
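One caveat with deriving the range from min(Day) and max(Day): if no participant happened to respond on the last day, the filled range would stop short of 10. A minimal sketch of spelling out the design directly, assuming the survey always runs days 1 to 10 (df_short and full are hypothetical names, and the data is a made-up subset in which nobody answered after day 3):

```r
library(tidyr)

# Hypothetical data in the shape of the question's df; no responses after day 3
df_short <- data.frame(
  CompanyID = c("Liquders", "Liquders", "Liquders", "PelletCoffeeCo", "PelletCoffeeCo"),
  Email     = c(rep("george@liquid.com", 3), rep("stacy@pelletcoffee.com", 2)),
  Day       = c(1L, 2L, 3L, 1L, 2L),
  var1      = c(7, 6, 6, 2, 3),
  stringsAsFactors = FALSE
)

# State the planned days (1 to 10) explicitly instead of using min(Day)/max(Day)
full <- complete(df_short, nesting(CompanyID, Email), Day = 1:10)
full <- as.data.frame(full)

nrow(full)  # one row per participant per planned day
```

With min/max, this data would only be completed up to day 3; passing 1:10 keeps the result tied to the study design rather than to whatever was observed.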

        2
  •  0
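  •   pyll    7 years ago

    First, create a data frame of the unique company IDs. Next, create a data frame of the required days.

    Cross join these together.

    Then join the result to the original dataset to fill in the table.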

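    # Unique company IDs and the full set of days (as text, to match df$Day)
    comp <- data.frame(CompanyID = unique(df$CompanyID))
    Day <- data.frame(Day = c("1","2","3","4","5","6","7","8","9","10"))
    
    # No shared columns, so merge() returns the cross join of companies and days
    compDay <- merge(comp, Day, all = TRUE)
    
    # Full join back to the original data; added rows get NA for Email and var1
    dfday <- merge(df, compDay, by = c("CompanyID", "Day"), all = TRUE)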
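Because the company/day grid carries no Email column, the rows added by the join end up with NA in Email, and a text Day column sorts as "1", "10", "2", and so on. A possible variant of the same base R idea, sketched under the assumption that Day can be stored as an integer and that each participant is identified by CompanyID plus Email (ids, days, grid and filled are illustrative names):

```r
# Same values as the question's df, rebuilt with rep() and Day as integer
df <- data.frame(
  CompanyID = rep(c("Drinkers", "Liquders", "PelletCoffeeCo"), times = c(10, 3, 2)),
  Email     = rep(c("john@coffee.com", "george@liquid.com", "stacy@pelletcoffee.com"),
                  times = c(10, 3, 2)),
  Day       = c(1:10, 1:3, 1:2),
  var1      = c(4, 5, 5, 5, 2, 3, 2, 7, 6, 5, 7, 6, 6, 2, 3),
  stringsAsFactors = FALSE
)

# One row per participant (company + email), so Email survives the join
ids  <- unique(df[, c("CompanyID", "Email")])
days <- data.frame(Day = 1:10)

# by = NULL asks merge() for the Cartesian product (cross join)
grid <- merge(ids, days, by = NULL)

# Left join the observed values onto the full grid; unmatched days get NA in var1
filled <- merge(grid, df, by = c("CompanyID", "Email", "Day"), all.x = TRUE)

# Order by participant and numeric day, then reset the row names
filled <- filled[order(filled$CompanyID, filled$Email, filled$Day), ]
rownames(filled) <- NULL
```

The result has 30 rows with Email filled in throughout, which matches the layout of df2 in the question apart from Day being integer rather than text.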