
How do I assign a unique code to the duplicated rows in this R data frame `df`?

  •  1
  • skylobo  · 6 years ago

    I have this data frame, df:

    df <- data.frame(stringsAsFactors=FALSE,
              id = c(1L, 2L, 3L, 4L, 5L, 6L),
         Country = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
            Year = c(1965L, 1965L, 1965L, 1965L, 1965L, 1965L),
       Time.step = c("Month", "Month", "Month", "Month", "Month", "Month"),
        GSA.numb = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
         Species = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
        Quantity = c(500L, 200L, 200L, 350L, 350L, 125L)
                    )
    
    df
    
       id  Country   Year    Time.step    GSA.numb  Species   Quantity
        1    ESP     1965     Month       GSA 5      Mullus     500   
        2    ESP     1965     Month       GSA 5      Mullus     200  
        3    ESP     1965     Month       GSA 5      Mullus     200 
        4    ITA     1965     Month       GSA 17     Eledone    350
        5    ITA     1965     Month       GSA 17     Eledone    350 
        6    ITA     1965     Month       GSA 17     Eledone    125
    

    Some rows are duplicates, namely rows 3 and 5. I can create a column of logical values (TRUE/FALSE) that flags the duplicated rows:

    df$dup <- duplicated(df[,2:7]) #No id! 
    

    Result:

    id  Country   Year    Time.step    GSA.numb  Species   Quantity dup
     1    ESP     1965     Month       GSA 5      Mullus     500   FALSE
     2    ESP     1965     Month       GSA 5      Mullus     200   FALSE
     3    ESP     1965     Month       GSA 5      Mullus     200   TRUE
     4    ITA     1965     Month       GSA 17     Eledone    350   FALSE
     5    ITA     1965     Month       GSA 17     Eledone    350   TRUE
     6    ITA     1965     Month       GSA 17     Eledone    125   FALSE
    

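    Now I would like a new column, built dynamically (my real df is very large, with many rows, columns and variables), that shows, whenever dup is TRUE, the id of the row being duplicated, like this: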

    expected.df
    
    id  Country Year  Time.step  GSA.numb  Species   Quantity dup  ref  
     1  ESP     1965  Month      GSA 5      Mullus     500   FALSE NA
     2  ESP     1965  Month      GSA 5      Mullus     200   FALSE NA
     3  ESP     1965  Month      GSA 5      Mullus     200   TRUE  =id2
     4  ITA     1965  Month      GSA 17     Eledone    350   FALSE NA
     5  ITA     1965  Month      GSA 17     Eledone    350   TRUE  =id4
     6  ITA     1965  Month      GSA 17     Eledone    125   FALSE NA
    

    I tried:

    with(df, ave(as.character(Species), df[,2:6], FUN = make.unique)) 
    

    But the result is:

    [1] "Mullus"    "Mullus.1"  "Mullus.2"  "Eledone"   "Eledone.1" "Eledone.2"
    

    I think I need more code than this. Which functions would be useful here (duplicated, make.unique, row.names, etc.)?

    4 answers  |  as of 6 years ago
        1
  •  4
  •   arg0naut91    6 years ago

    A data.table approach, starting from the initial data frame:

    library(data.table)
    
    setDT(df)[, `:=` (dup = seq_len(.N) > 1, ref = paste0("id", first(id))), 
              by = .(Country, Year, Time.step, GSA.numb, Species, Quantity)][dup == FALSE, ref := NA]
    

    Output:

       id Country Year Time.step GSA.numb Species Quantity   dup  ref
    1:  1     ESP 1965     Month     GSA5  Mullus      500 FALSE <NA>
    2:  2     ESP 1965     Month     GSA5  Mullus      200 FALSE <NA>
    3:  3     ESP 1965     Month     GSA5  Mullus      200  TRUE  id2
    4:  4     ITA 1965     Month    GSA17 Eledone      350 FALSE <NA>
    5:  5     ITA 1965     Month    GSA17 Eledone      350  TRUE  id4
    6:  6     ITA 1965     Month    GSA17 Eledone      125 FALSE <NA>
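
    Since the question asks for something dynamic, here is a minimal sketch of the same idea where the grouping columns are not typed out by hand. It assumes that every column except id should take part in the comparison; key_cols and dt are illustrative names, not part of the original answer.

```r
library(data.table)

# Same toy data as in the question.
df <- data.frame(
  id        = 1:6,
  Country   = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year      = rep(1965L, 6),
  Time.step = rep("Month", 6),
  GSA.numb  = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species   = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity  = c(500L, 200L, 200L, 350L, 350L, 125L),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df)

# Assumption: every column except the identifier defines a duplicate.
key_cols <- setdiff(names(dt), "id")

# Within each group of identical rows, flag everything after the first row.
dt[, dup := seq_len(.N) > 1L, by = key_cols]

# Point each flagged row back to the id of the first row in its group.
dt[, ref := ifelse(dup, paste0("id", id[1L]), NA_character_), by = key_cols]

dt
```

    Passing a character vector to by means the same code should keep working when columns are added to or removed from the real data.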
    

    A tidyverse approach (with dup created beforehand):

    library(tidyverse)
    
    df %>% 
      group_by_at(vars(2:7)) %>% 
      mutate(ref = ifelse(dup, paste0("id", first(id)), NA_character_))
    

    Output:

         id Country  Year Time.step GSA.numb Species Quantity dup   ref  
      <int> <chr>   <int> <chr>     <chr>    <chr>      <int> <lgl> <chr>
    1     1 ESP      1965 Month     GSA5     Mullus       500 FALSE NA   
    2     2 ESP      1965 Month     GSA5     Mullus       200 FALSE NA   
    3     3 ESP      1965 Month     GSA5     Mullus       200 TRUE  id2  
    4     4 ITA      1965 Month     GSA17    Eledone      350 FALSE NA   
    5     5 ITA      1965 Month     GSA17    Eledone      350 TRUE  id4  
    6     6 ITA      1965 Month     GSA17    Eledone      125 FALSE NA
    

    If you want to create the dup column within the same statement:

    df %>% 
      group_by_at(vars(2:7)) %>% 
      mutate(
        dup = row_number() > 1,
        ref = ifelse(dup, paste0("id", first(id)), NA_character_))
    

    Output:

         id Country  Year Time.step GSA.numb Species Quantity dup   ref  
      <int> <chr>   <int> <chr>     <chr>    <chr>      <int> <lgl> <chr>
    1     1 ESP      1965 Month     GSA5     Mullus       500 FALSE NA   
    2     2 ESP      1965 Month     GSA5     Mullus       200 FALSE NA   
    3     3 ESP      1965 Month     GSA5     Mullus       200 TRUE  id2  
    4     4 ITA      1965 Month     GSA17    Eledone      350 FALSE NA   
    5     5 ITA      1965 Month     GSA17    Eledone      350 TRUE  id4  
    6     6 ITA      1965 Month     GSA17    Eledone      125 FALSE NA 
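
    In dplyr 1.0.0 and later, group_by_at() is superseded by across(). A minimal sketch of the same pipeline in that style, assuming a recent dplyr is installed and that id is the only column to leave out of the comparison (res is an illustrative name):

```r
library(dplyr)

# Same toy data as in the question.
df <- data.frame(
  id        = 1:6,
  Country   = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year      = rep(1965L, 6),
  Time.step = rep("Month", 6),
  GSA.numb  = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species   = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity  = c(500L, 200L, 200L, 350L, 350L, 125L),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(across(-id)) %>%   # group on every column except id
  mutate(
    dup = row_number() > 1,   # TRUE from the second identical row onwards
    ref = if_else(dup, paste0("id", first(id)), NA_character_)
  ) %>%
  ungroup()                   # drop the grouping so later steps are not affected

res
```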
    
        2
  •  2
  •   dmi3kno    6 years ago

    You can use tidyverse functions to quickly flag the duplicates and number them sequentially:

    df$dup <- duplicated(df[,2:7]) #No id! 
    
    library(tidyverse)
    
    df %>% 
     group_by(dup) %>% 
     mutate(ref=ifelse(dup, paste0("id",1:n()), NA_character_))
    
    #> # A tibble: 6 x 9
    #> # Groups:   dup [2]
    #>      id Country  Year Time.step GSA.numb Species Quantity dup   ref  
    #>   <int> <chr>   <int> <chr>     <chr>    <chr>      <int> <lgl> <chr>
    #> 1     1 ESP      1965 Month     GSA 5    Mullus       500 FALSE NA   
    #> 2     2 ESP      1965 Month     GSA 5    Mullus       200 FALSE NA   
    #> 3     3 ESP      1965 Month     GSA 5    Mullus       200 TRUE  id1  
    #> 4     4 ITA      1965 Month     GSA 17   Eledone      350 FALSE NA   
    #> 5     5 ITA      1965 Month     GSA 17   Eledone      350 TRUE  id2  
    #> 6     6 ITA      1965 Month     GSA 17   Eledone      125 FALSE NA 
    
        3
  •  0
  •   Evan Friedland    6 years ago

    This example uses base R and matches each duplicate it finds back to the original row. It also works when a row has more than one copy.

    Sample data (produced with dput(control = NULL), so the character/factor columns have been converted to numbers):

    df <- data.frame(id = c(1, 1, 1, 2, 2, 2), 
               Country = c(1965, 1965, 1965, 1965, 1965, 1965), 
               Year = c(1, 1, 1, 1, 1, 1), 
               Time.step = c(1, 1, 1, 1, 1, 1), 
               GSA.numb = c(5, 5, 5, 17, 17, 17), 
               Species = c(2, 2, 2, 1, 1, 1), Quantity = c(500, 200, 200, 350, 350, 125))
    

    The comparisons inside the loop are vectorized, so despite the outer loop this should run reasonably fast on large data frames.

    df$dup <- duplicated(df)
    dupes <- df[df$dup,]
    df$ref <- NA # initialize 
    for(i in seq_len(nrow(dupes))){ # seq_len() skips the loop when there are no duplicates
      z=which(df[,1] == dupes[i,1]&
              df[,2] == dupes[i,2]&
              df[,3] == dupes[i,3]&
              df[,4] == dupes[i,4]&
              df[,5] == dupes[i,5]&
              df[,6] == dupes[i,6]&
              df[,7] == dupes[i,7]) # make sure not to include that $dup column!
      df$ref[z[-1]] <- paste0("=id",min(z))
    }
    df
    #  id Country Year Time.step GSA.numb Species Quantity   dup  ref
    #1  1    1965    1         1        5       2      500 FALSE <NA>
    #2  1    1965    1         1        5       2      200 FALSE <NA>
    #3  1    1965    1         1        5       2      200  TRUE =id2
    #4  2    1965    1         1       17       1      350 FALSE <NA>
    #5  2    1965    1         1       17       1      350  TRUE =id4
    #6  2    1965    1         1       17       1      125 FALSE <NA>
    

    You could dress this up with the apply family of functions, but the version above will run faster.
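
    For comparison, here is a loop-free base R sketch of the same matching idea. It is an alternative, not the code above: it runs on the question's original data, uses the id column rather than the row index for the reference, and assumes that the separator "\r" never occurs in the data. key_cols, key and first_row are illustrative names.

```r
# Same toy data as in the question.
df <- data.frame(
  id        = 1:6,
  Country   = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year      = rep(1965L, 6),
  Time.step = rep("Month", 6),
  GSA.numb  = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species   = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity  = c(500L, 200L, 200L, 350L, 350L, 125L),
  stringsAsFactors = FALSE
)

# Assumption: every column except id defines a duplicate.
key_cols <- setdiff(names(df), "id")

# Collapse each row into a single string key.
key <- do.call(paste, c(df[key_cols], sep = "\r"))

# match(key, key) returns, for every row, the index of the first row with the same key.
first_row <- match(key, key)

df$dup <- duplicated(key)
df$ref <- ifelse(df$dup, paste0("=id", df$id[first_row]), NA_character_)

df
```

    Because match() always returns the first occurrence, a row with several copies has all of its copies pointing at the same original.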

        4
  •  0
  •   tmfmnk    6 years ago

    Using tidyverse:

    library(tidyverse)

    df %>%
      group_by_at(vars(-id)) %>% # Group by all variables except id
      mutate(n = n(), # Size of each group of identical rows
             dup = seq_along(n) > 1, # FALSE for the first row of each group, TRUE for its duplicates
             ref = ifelse(dup, paste0("=id", first(id)), NA_character_)) %>% # Paste the id of the first row in the group
      select(-n)
    
         id Country  Year Time.step GSA.numb Species Quantity dup   ref  
      <int> <chr>   <int> <chr>     <chr>    <chr>      <int> <lgl> <chr>
    1     1 ESP      1965 Month     GSA 5    Mullus       500 FALSE <NA> 
    2     2 ESP      1965 Month     GSA 5    Mullus       200 FALSE <NA> 
    3     3 ESP      1965 Month     GSA 5    Mullus       200 TRUE  =id2 
    4     4 ITA      1965 Month     GSA 17   Eledone      350 FALSE <NA> 
    5     5 ITA      1965 Month     GSA 17   Eledone      350 TRUE  =id4 
    6     6 ITA      1965 Month     GSA 17   Eledone      125 FALSE <NA>