
Filter or ifelse across multiple columns

  •  0
  • Lotw  · Tech Community  · 6 years ago

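    I am studying the lines of communication that patients follow when they become ill. For example: a person gets sick and visits the doctor (A), then goes to the hospital (B), contacts the insurance company (C), and so on. The order differs per patient: one patient goes straight to the hospital, another checks with the insurance company first, etc. We follow patients through the whole process and, after each contact with a different authority, ask them to fill in a survey again. So after every authority ("step") we get a survey score. This gives me the following dataset setup (in reality it is a very large dataset):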

    Patient<-c(1,1,1,1,1,1,1,2,2,2,2)
    sample6<-c("A","A","A","A","A","A","A","A","A","A","A")
    sample5<-c("Stop","B","B","B","B","B","B","Stop","C","C","C")
    sample4<-c(NA,"Stop","C","C","C","C","C",NA, "Stop","F","F")
    sample3<-c(NA,NA,"Stop","D","D","D","D",NA, NA,"Stop","G")
    sample2<-c(NA,NA,NA,"Stop","E","E","E",NA, NA,NA,"Stop")
    sample1<-c(NA,NA,NA,NA, "Stop","F","F",NA,NA,NA, NA)
    sample0<-c(NA,NA,NA,NA, NA,"Stop","G",NA,NA,NA, NA)
    sample00<-c(NA,NA,NA,NA, NA,NA,"Stop",NA,NA,NA, NA)
    Score<-c(90,88,65,44,78,98,66,38,93,88,80)
    Time<-c("01-01-2018", "02-01-2018", "03-01-2018", "04-01-2018", "05-01-2018", "06-01-2018", "07-01-2018","01-02-2018", "02-02-2018", "05-02-2018", "06-02-2018")
    
    df<-data.frame("Patient"=Patient, "step0"=sample6, "step1"=sample5, "step2"=sample4, "step3"=sample3, "step4"=sample2, 
                   "step5"=sample1,"step6"= sample0, "step7"=sample00, "Score"=Score, "Time"=Time)
    
    > df
       Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time
    1        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018
    2        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018
    3        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018
    4        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018
    5        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018
    6        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018
    7        1     A     B     C     D     E     F     G  Stop    66 07-01-2018
    8        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018
    9        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018
    10       2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018
    11       2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018
    

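    For example: row 1 holds the survey score after authority A, row 2 is for the same patient and holds the survey score after authority B, and so on. Now I want to compare rows that share the same final step. I will use "F" as an example, but it could also be "C" for another analysis. So I want to select every row in which "F" is the final authority, together with the row immediately before it, so that I can compare the two.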

    So I want to create this dataset:

       Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time Indicator
    1        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018         0
    2        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018         0
    3        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018         0
    4        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018         0
    5        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018    Before
    6        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018     After
    7        1     A     B     C     D     E     F     G  Stop    66 07-01-2018         0
    8        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018         0
    9        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018    Before
    10       2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018     After
    11       2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018         0
    

    I did manage to flag the rows that contain "F" and the rows before them:

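    ProcessColumns <- 2:9
    # TRUE wherever a step column equals "F" (NA where the step is missing)
    d <- df[,ProcessColumns] == "F"
    # 1 if the row contains "F" anywhere, 0 otherwise
    df$Indicator <- rowSums(d,na.rm=TRUE)
    df$Indicator[which(df$Indicator %in% 1)-1] <- "Before"
    df$Indicator[which(df$Indicator %in% 1)] <- "After"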
    

    But this flags every row that contains "F", not only the rows in which "F" is the final step. Can anyone help me?
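
    A minimal sketch (not from the original post) of the difference between "the row contains F" and "the row ends in F". The data frame `toy` and the names `contains_f` and `ends_in_f` are hypothetical; the sketch assumes every row has a "Stop" entry preceded by at least one step, as in the example data.

```r
# Toy data in the same layout as the question, reduced to four step columns
toy <- data.frame(
  step0 = c("A", "A", "A"),
  step1 = c("Stop", "F", "F"),
  step2 = c(NA, "Stop", "G"),
  step3 = c(NA, NA, "Stop"),
  stringsAsFactors = FALSE
)

# Counting approach: TRUE wherever "F" occurs anywhere in the row
contains_f <- unname(rowSums(toy == "F", na.rm = TRUE) > 0)  # FALSE TRUE TRUE

# Final-step approach: look only at the entry directly before "Stop"
final_step <- apply(toy, 1, function(r) r[[match("Stop", r) - 1]])
ends_in_f <- unname(final_step == "F")                       # FALSE TRUE FALSE
```

    The third row still contains "F" but ends in "G", which is why counting occurrences flags more rows than intended.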

    3 Answers  |  as of 6 years ago
        1
  •  0
  •   Roman    6 years ago

    A tidyverse approach. It takes quite a few lines, but it should work in general.

    library(tidyverse)
    df %>%
      rownames_to_column() %>% 
      gather(k,v,-Patient,-rowname,-Score, -Time) %>% 
      group_by(rowname) %>% 
      mutate(Indicator=ifelse(any(v %in%"F" ),"After",NA)) %>% 
      spread(k,v)  %>% 
      arrange(as.numeric(rowname)) %>% 
      group_by(Patient) %>% 
      mutate(Indicator=ifelse(duplicated(Indicator), NA, Indicator)) %>% 
      mutate(Indicator2=ifelse(lead(Indicator) == "After", "Before", NA)) %>% 
      mutate(Indicator=ifelse(!is.na(Indicator2), Indicator2, Indicator)) %>% 
      select(Patient, starts_with("step"), Score, Time,Indicator, -Indicator2,-rowname) %>% 
      ungroup()
    # A tibble: 11 x 12
       Patient step0 step1 step2 step3 step4 step5 step6 step7 Score Time       Indicator
         <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <fct>      <chr>    
     1       1 A     Stop  NA    NA    NA    NA    NA    NA       90 01-01-2018 NA       
     2       1 A     B     Stop  NA    NA    NA    NA    NA       88 02-01-2018 NA       
     3       1 A     B     C     Stop  NA    NA    NA    NA       65 03-01-2018 NA       
     4       1 A     B     C     D     Stop  NA    NA    NA       44 04-01-2018 NA       
     5       1 A     B     C     D     E     Stop  NA    NA       78 05-01-2018 Before   
     6       1 A     B     C     D     E     F     Stop  NA       98 06-01-2018 After    
     7       1 A     B     C     D     E     F     G     Stop     66 07-01-2018 NA       
     8       2 A     Stop  NA    NA    NA    NA    NA    NA       38 01-02-2018 NA       
     9       2 A     C     Stop  NA    NA    NA    NA    NA       93 02-02-2018 Before   
    10       2 A     C     F     Stop  NA    NA    NA    NA       88 05-02-2018 After    
    11       2 A     C     F     G     Stop  NA    NA    NA       80 06-02-2018 NA  
    
        2
  •  2
  •   A. Suliman    6 years ago

    We can do it like this:

    df %>% mutate(sum=rowSums(!is.na(.[2:9]))) %>% 
    group_by(Patient) %>% mutate(max = sum-max(sum), Indicator  = case_when(max == -2 ~ "Before", max == -1 ~ "After", TRUE ~ as.character(0)))
    
    # A tibble: 11 x 14
    # Groups:   Patient [2]
         Patient step0 step1 step2 step3 step4 step5 step6 step7 Score Time         sum   max Ind   
         <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <fct>      <dbl> <dbl> <chr> 
     1    1.00 A     Stop  NA    NA    NA    NA    NA    NA     90.0 01-01-2018  2.00 -6.00 0     
     2    1.00 A     B     Stop  NA    NA    NA    NA    NA     88.0 02-01-2018  3.00 -5.00 0     
     3    1.00 A     B     C     Stop  NA    NA    NA    NA     65.0 03-01-2018  4.00 -4.00 0     
     4    1.00 A     B     C     D     Stop  NA    NA    NA     44.0 04-01-2018  5.00 -3.00 0     
     5    1.00 A     B     C     D     E     Stop  NA    NA     78.0 05-01-2018  6.00 -2.00 Before
     6    1.00 A     B     C     D     E     F     Stop  NA     98.0 06-01-2018  7.00 -1.00 After 
     7    1.00 A     B     C     D     E     F     G     Stop   66.0 07-01-2018  8.00  0    0     
     8    2.00 A     Stop  NA    NA    NA    NA    NA    NA     38.0 01-02-2018  2.00 -3.00 0     
     9    2.00 A     C     Stop  NA    NA    NA    NA    NA     93.0 02-02-2018  3.00 -2.00 Before
    10    2.00 A     C     F     Stop  NA    NA    NA    NA     88.0 05-02-2018  4.00 -1.00 After 
    11    2.00 A     C     F     G     Stop  NA    NA    NA     80.0 06-02-2018  5.00  0    0 
    

    Update: inspired by @Andre Elrico's answer

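    # Needs dplyr, tidyr (unite) and stringr (str_detect), e.g. via library(tidyverse).
    # "BStop" flags rows whose final step is B; use "FStop" for the F example in the question.
    df %>% unite(All, matches("step"), sep="", remove=F ) %>% 
           mutate(Ind = str_detect(All,"BStop"), Indicator = case_when( lead(Ind) == TRUE ~ "Before", Ind == TRUE ~ "After", TRUE ~ as.character(0))) %>% 
           select(-All,-Ind)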
    
        3
  •  1
  •   Andre Elrico    6 years ago

    Or you could:

    library(dplyr)
    
    After_IND <- df %>% apply(.,1,paste,collapse="") %>% grepl("FStop",.)
    Before_IND<- lead(After_IND,1,F)
    
    df$Indicator <- 0
    df$Indicator[After_IND]<-"After"
    df$Indicator[Before_IND]<-"Before"
    
    #  Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time Indicator
    #        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018         0
    #        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018         0
    #        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018         0
    #        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018         0
    #        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018    Before
    #        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018     After
    #        1     A     B     C     D     E     F     G  Stop    66 07-01-2018         0
    #        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018         0
    #        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018    Before
    #        2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018     After
    #        2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018         0
    

    Please note:

    If you want to compare "B" instead, for example, you have to change the pattern:

    ... %>% grepl("BStop",.)
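
    To avoid editing the pattern by hand for each analysis, the same idea can be wrapped in a function. This is only a sketch under stated assumptions: `flag_before_after` and its arguments are hypothetical names, the rows are assumed to be sorted by patient and time, and authority codes are assumed to be single letters as in the example (so no code is the tail of another). Unlike the snippet above, it pastes only the step columns and marks a row as "Before" only when the following row belongs to the same patient.

```r
# Hypothetical helper: name and arguments are illustrative, not from the answer.
flag_before_after <- function(df, target,
                              step_cols = grep("^step", names(df), value = TRUE)) {
  # Collapse each row's step columns into one string, e.g. "ABCStopNANANANA"
  path <- apply(as.matrix(df[, step_cols, drop = FALSE]), 1, paste, collapse = "")
  # A row ends in `target` when the target is directly followed by "Stop"
  after <- grepl(paste0(target, "Stop"), path, fixed = TRUE)
  # Candidate "Before" rows sit directly above an "After" row ...
  before <- c(after[-1], FALSE)
  # ... and only count when both rows belong to the same patient
  same_patient <- c(df$Patient[-1] == df$Patient[-nrow(df)], FALSE)
  indicator <- rep("0", nrow(df))
  indicator[after] <- "After"
  indicator[before & same_patient] <- "Before"
  indicator
}
```

    With the data frame from the question, `df$Indicator <- flag_before_after(df, "F")` should give the requested column, and `flag_before_after(df, "C")` runs the same comparison for "C". The patient check matters for a target such as "A": row 8 ends in "A", but row 7 belongs to a different patient and is therefore not marked "Before".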