
Removing duplicates from a data frame in R

  •  0
  • Akki  · 7 years ago

    UserID   Quiz_answers            Quiz_Date       
      1     `a1,a2,a3`Positive       26-01-2017        
      1     `a1,a4,a3`Positive       26-01-2017        
      1     `a1,a2,a4`Negative       28-02-2017        
      1     `a1,a2,a3`Neutral        30-10-2017        
      1     `a1,a2,a4`Positive       30-11-2017        
      1     `a1,a2,a4`Negative       28-02-2018    
    
      2     `a1,a2,a3`Negative       27-01-2017            
      2     `a1,a7,a3`Neutral        28-08-2017        
      2     `a1,a2,a5`Negative       28-01-2017  
    

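    I want to remove duplicate rows.
    A row counts as a duplicate under these rules:

    1. The word that appears after the closing backtick (`) in the Quiz_answers column is the same.
    2. For such rows, if the UserID and Quiz_Date column values are also the same, the row is a duplicate.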

       UserID<-c(1,1,1,1,1,1,2,2,2)
       Quiz_answers<-c("`a1,a2,a3`Positive","`a1,a4,a3`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Neutral","`a1,a2,a4`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Negative","`a1,a7,a3`Neutral","`a1,a2,a5`Negative")  
       Quiz_Date<-as.Date(c("26-01-2017","26-01-2017","28-02-2017","30-10-2017","30-11-2017","28-02-2018","27-01-2017","28-08-2017","28-01-2017"),'%d-%m-%Y')  
       data<-data.frame(UserID,Quiz_answers,Quiz_Date)     
      

       data.removeDuplicates<- function(frames)
        {   
             apply(frames[ ,c(grep("UserID", colnames(data)),grep("Quiz_answers", colnames(data)),grep("Quiz_Date", colnames(data)))],1,function(slice){     
                 Outcome<-paste0("`",tail(strsplit(slice[2],split="`")[[1]],1))      
                 cat("\n\n Searching for records: ",slice[1],Outcome,slice[3])
                data<<-data[!( data$UserID == slice[1] &  paste0("`",sapply(strsplit(as.character(data[,2]),'`'), tail, 1)) == c(Outcome) & data[,3]==c(slice[3])), ]   
            })      
            print(frames)
        }
        data.removeDuplicates(data)
        print(data)
        [1] UserID       Quiz_answers Quiz_Date   
        <0 rows> (or 0-length row.names)
    

    I am expecting this result:

    UserID   Quiz_answers            Quiz_Date       
      1     `a1,a2,a3`Positive       26-01-2017        
      1     `a1,a2,a4`Negative       28-02-2017        
      1     `a1,a2,a3`Neutral        30-10-2017        
      1     `a1,a2,a4`Positive       30-11-2017        
      1     `a1,a2,a4`Negative       28-02-2018    
    
      2     `a1,a2,a3`Negative       27-01-2017            
      2     `a1,a7,a3`Neutral        28-08-2017        
      2     `a1,a2,a5`Negative       28-01-2017  
    

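    According to the rules, only the second row should be removed from the data frame, because it is the only row that satisfies the duplicate conditions. What am I doing wrong?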

    3 answers
        1
  •  1
  •   CPak    7 years ago

    Try this.

    Your data:

    df <- read.table(text="UserID   Quiz_answers            Quiz_Date       
    1     `a1,a2,a3`Positive       26-01-2017        
    1     `a1,a4,a3`Positive       26-01-2017        
    1     `a1,a2,a4`Negative       28-02-2017        
    1     `a1,a2,a3`Neutral        30-10-2017        
    1     `a1,a2,a4`Positive       30-11-2017        
    1     `a1,a2,a4`Negative       28-02-2018    
    2     `a1,a2,a3`Negative       27-01-2017            
    2     `a1,a7,a3`Neutral        28-08-2017        
    2     `a1,a2,a5`Negative       28-01-2017", header = TRUE, stringsAsFactors=FALSE)
    

    Solution and output:

    library(dplyr)
    ans <- df %>%
            mutate(grp = sub(".*`(\\D+)$", "\\1", Quiz_answers)) %>%
            group_by(grp, UserID, Quiz_Date) %>%
            slice(1) %>%
            ungroup() %>%
            select(-grp) %>%
            arrange(UserID, Quiz_Date)
    
    # A tibble: 8 x 3
      # UserID       Quiz_answers  Quiz_Date
       # <int>              <chr>      <chr>
    # 1      1 `a1,a2,a3`Positive 26-01-2017
    # 2      1 `a1,a2,a4`Negative 28-02-2017
    # 3      1 `a1,a2,a4`Negative 28-02-2018
    # 4      1  `a1,a2,a3`Neutral 30-10-2017
    # 5      1 `a1,a2,a4`Positive 30-11-2017
    # 6      2 `a1,a2,a3`Negative 27-01-2017
    # 7      2 `a1,a2,a5`Negative 28-01-2017
    # 8      2  `a1,a7,a3`Neutral 28-08-2017
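
    The same grouping idea can also be written with dplyr's distinct(), which keeps the first row for each combination of the listed columns. This is only a sketch of an alternative, not part of the answer above; the names grp and ans_distinct are chosen here for illustration, and the data is re-read so the block runs on its own.

```r
library(dplyr)

# Same data as above, read as whitespace-separated text.
df <- read.table(text = "UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a4,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017", header = TRUE, stringsAsFactors = FALSE)

ans_distinct <- df %>%
  # grp (illustrative name): the word after the last backtick
  mutate(grp = sub(".*`", "", Quiz_answers)) %>%
  # keep the first row per (UserID, Quiz_Date, grp), retaining all columns
  distinct(UserID, Quiz_Date, grp, .keep_all = TRUE) %>%
  select(-grp)

print(ans_distinct)
```

    With this data the result should have eight rows in their original order, with only the second input row dropped.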
    
        2
  •  0
  •   OmG    7 years ago

    You can use the sqldf package as shown below. First, select the Positive, Negative, and Neutral rows separately. Then, within each selection, use GROUP BY:

    require("sqldf")
    result <- sqldf("SELECT * FROM df WHERE Quiz_answers LIKE '%`Positive' GROUP BY UserID, Quiz_Date 
           UNION 
           SELECT * FROM df WHERE Quiz_answers LIKE '%`Negative' GROUP BY UserID, Quiz_Date 
           UNION 
           SELECT * FROM df WHERE Quiz_answers LIKE '%`Neutral' GROUP BY UserID, Quiz_Date")
    

    The value of result after running:

      UserID       Quiz_answers  Quiz_Date
    1      1  `a1,a2,a3`Neutral 30-10-2017
    2      1 `a1,a2,a4`Negative 28-02-2017
    3      1 `a1,a2,a4`Negative 28-02-2018
    4      1 `a1,a2,a4`Positive 30-11-2017
    5      1 `a1,a4,a3`Positive 26-01-2017
    6      2 `a1,a2,a3`Negative 27-01-2017
    7      2 `a1,a2,a5`Negative 28-01-2017
    8      2  `a1,a7,a3`Neutral 28-08-2017
    
        3
  •  0
  •   ssp3nc3r    7 years ago

    Here is a two-line solution using only base R:

    data[,"group"] <- with(data, sub(".*`", "", Quiz_answers))
    
    data <- data[as.integer(rownames(unique(data[, !(names(data) %in% "Quiz_answers")   ]))), !(names(data) %in% "group")]
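
    As a related sketch, duplicated() expresses the same rule without relying on row names. It also suggests why the loop in the question likely ends with an empty data frame: each row matches its own (UserID, outcome, Quiz_Date) key, so every pass deletes the row being examined, whereas duplicated() flags only the second and later occurrences. The names outcome, key, and deduped are illustrative and do not come from the thread.

```r
# Rebuild the question's data, with dates parsed as Date as in the question.
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive",
                   "`a1,a2,a4`Negative", "`a1,a2,a3`Neutral",
                   "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral",
                   "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017",
                        "30-10-2017", "30-11-2017", "28-02-2018",
                        "27-01-2017", "28-08-2017", "28-01-2017"),
                      format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# outcome (illustrative name): the word after the last backtick.
outcome <- sub(".*`", "", data$Quiz_answers)

# Build the comparison key from the three values named in the rules.
key <- data.frame(UserID = data$UserID,
                  outcome = outcome,
                  Quiz_Date = data$Quiz_Date)

# duplicated() is FALSE for the first occurrence of each key,
# so the first row of every group is kept.
deduped <- data[!duplicated(key), ]
print(deduped)
```

    On the question's data this should drop only the second row and leave the other eight untouched.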