Summarise: dividing two columns to get a percentage

  • zabada  · 7 years ago

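    Using an aircraft wildlife strike dataset, I am having trouble finding the percentage of Canada geese killed during migration season.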

    #airline stats table
    airlines <- sd4 %>% 
    group_by(STATE) %>% 
    filter(SPECIES == "Canada goose" & total_kills > 1) %>% 
    mutate(fall_mig_kills = ifelse(SPECIES=="Canada goose" & INCIDENT_MONTH %in% c(9,10,11),total_kills,0)) %>% 
    summarise(
    pct_mig_kills = fall_mig_kills/total_kills
    ) %>% 
    select(STATE,SPECIES,INCIDENT_MONTH,total_kills,fall_mig_kills,pct_mig_kills)
    

    This is where I get the error: summarise( pct_mig_kills = fall_mig_kills/total_kills )

    The error is:

    Error in summarise_impl(.data, dots) : 
    Column `pct_mig_kills` must be length 1 (a summary value), not 10
    

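    I am not sure how dividing two integer columns can produce a value with length greater than 1.

    Any help would be greatly appreciated!

    Benjamin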

    1 answer
  •   Kamil  · 7 years ago

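    Let's read in the data, document everything, and see where the error occurs.

    In general, you should link to the original dataset or provide a shortened version of it, for the sake of reproducibility. I found the "aircraft wildlife strikes, 1990-2015" dataset on Kaggle, which I will use here. Note: you need a Kaggle account to download the data. It is also available on data.gov.

    Reading in the data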

    library(dplyr)
    df <- read.csv("~/../Downloads/database.csv", stringsAsFactors = FALSE)
    > df$Species.Name[grepl("Canada goose", df$Species.Name, ignore.case = TRUE)][1]
    [1] "CANADA GOOSE"
    
    > names(df)
     [1] "Record.ID"            "Incident.Year"        "Incident.Month"      
     [4] "Incident.Day"         "Operator.ID"          "Operator"            
     [7] "Aircraft"             "Aircraft.Type"        "Aircraft.Make"       
    [10] "Aircraft.Model"       "Aircraft.Mass"        "Engine.Make"         
    [13] "Engine.Model"         "Engines"              "Engine.Type"         
    [16] "Engine1.Position"     "Engine2.Position"     "Engine3.Position"    
    [19] "Engine4.Position"     "Airport.ID"           "Airport"             
    [22] "State"                "FAA.Region"           "Warning.Issued"      
    [25] "Flight.Phase"         "Visibility"           "Precipitation"       
    [28] "Height"               "Speed"                "Distance"            
    [31] "Species.ID"           "Species.Name"         "Species.Quantity"    
    [34] "Flight.Impact"        "Fatalities"           "Injuries"            
    [37] "Aircraft.Damage"      "Radome.Strike"        "Radome.Damage"       
    [40] "Windshield.Strike"    "Windshield.Damage"    "Nose.Strike"         
    [43] "Nose.Damage"          "Engine1.Strike"       "Engine1.Damage"      
    [46] "Engine2.Strike"       "Engine2.Damage"       "Engine3.Strike"      
    [49] "Engine3.Damage"       "Engine4.Strike"       "Engine4.Damage"      
    [52] "Engine.Ingested"      "Propeller.Strike"     "Propeller.Damage"    
    [55] "Wing.or.Rotor.Strike" "Wing.or.Rotor.Damage" "Fuselage.Strike"     
    [58] "Fuselage.Damage"      "Landing.Gear.Strike"  "Landing.Gear.Damage" 
    [61] "Tail.Strike"          "Tail.Damage"          "Lights.Strike"       
    [64] "Lights.Damage"        "Other.Strike"         "Other.Damage"        
    [67] "totalKills"
    

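    Note that the species names are all uppercase. Use grepl (with ignore.case = TRUE) instead of == unless you are sure you know the name exactly, letter for letter.

    There is no total_kills variable, and the Fatalities variable counts human deaths, so I will ignore that filter variable. What I did find is Species.Quantity, which is probably what you are looking for: the number of animals of that species killed in a single incident.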
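
    To illustrate the point about grepl versus ==, here is a small sketch. The sample vector is hypothetical and not taken from the dataset. Note that grepl also matches substrings, so a normalise-then-compare approach is shown as a stricter alternative.

```r
# Hypothetical sample values, not taken from the dataset
species <- c("CANADA GOOSE", "Canada goose", "MALLARD", "CANADA GOOSE ")

# Exact comparison is sensitive to both case and stray whitespace
exact <- species == "Canada goose"

# Case-insensitive pattern match (also matches substrings)
loose <- grepl("canada goose", species, ignore.case = TRUE)

# Stricter alternative: normalise case and whitespace, then compare exactly
strict <- toupper(trimws(species)) == "CANADA GOOSE"

print(exact)
print(loose)
print(strict)
```

    With ==, only the second element matches. Both grepl and the normalised comparison pick up every Canada goose entry in this sample.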

    > unique(df$Species.Quantity)
    [1] "1"        "2-10"     ""         "11-100"   "Over 100"
    

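    For this example, we can convert these values to numbers by mapping each range to its lower bound ("Over 100" becomes 100).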

    > dictNames <- unique(df$Species.Quantity)
    > dict <- c(1, 2, 0, 11, 100)
    > names(dict) <- dictNames
    > dict['1']
    1 
    1 
    > dict['2-10']
    2-10 
       2 
    > df <- df %>% mutate(totalKills = dict[Species.Quantity])
    > table(df$totalKills, useNA = "always")
    
         1      2     11    100   <NA> 
    146563  21852   1166     46   4477 
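
    The table above shows 4477 NA values rather than zeros, even though the blank category was assigned 0. My reading (not stated in the answer) is that this comes from how R indexes by name: an empty string never matches any name, not even an empty one. The sketch below uses a hypothetical sample of quantities to show this and one way to make the handling of blanks explicit.

```r
# Lower-bound mapping as above, with names written out explicitly
dict <- c("1" = 1, "2-10" = 2, "11-100" = 11, "Over 100" = 100)

# Even if a blank name is present, indexing by "" does not find it
dict_with_blank <- c(dict, 0)
names(dict_with_blank)[5] <- ""
blank_lookup <- unname(dict_with_blank[""])
print(blank_lookup)

# Hypothetical sample of Species.Quantity values
quantity <- c("1", "2-10", "", "Over 100", "11-100")
kills <- unname(dict[quantity])
print(kills)

# Decide explicitly what a blank quantity means; here it is counted as 0
kills[is.na(kills)] <- 0
print(kills)
```

    Writing the names out, rather than relying on the order returned by unique(), also keeps the mapping correct if the data is re-sorted or re-downloaded.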
    

    Great, now let's look at your code.

    Trying your code and finding the problem

    > df %>% 
    +   group_by(State) %>% 
    +   filter(Species.Name == "CANADA GOOSE" & totalKills > 1) %>% 
    +   mutate(fall_mig_kills = ifelse(Species.Name == "CANADA GOOSE" & 
    +                                    Incident.Month %in% c(9,10,11),
    +                                  totalKills,
    +                                  0)
    +          ) %>% 
    +   summarise(
    +     pct_mig_kills = fall_mig_kills/totalKills
    +   )
    Error in summarise_impl(.data, dots) : 
      Column `pct_mig_kills` must be length 1 (a summary value), not 19
    

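    Hmm, let's see why. Read the help page by typing ?summarise at the console:

    summarise {dplyr}    R Documentation    Reduces multiple values down to a single value

    Description

    summarise() is typically used on grouped data created by group_by(). The output will have one row for each group.

    OK, so the output will have one row for each group. Because you grouped by a variable, fall_mig_kills/totalKills returns one value per row in the group rather than one value per group, so we need to sum the kills first. You may also want to create a new variable, "inSeason", which lets you summarise the data appropriately.

    So, to fix your problem, you just need to add sum: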

    +   summarise(
    +     pct_mig_kills = sum(fall_mig_kills)/sum(totalKills)
    +   )
    # A tibble: 49 x 2
       State pct_mig_kills
       <chr>         <dbl>
     1          0.70212766
     2    AK    0.50000000
     3    AL    0.00000000
     4    AR    1.00000000
     5    CA    0.06185567
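
    The same fix can be tried on a toy data frame. The data below is hypothetical and only mirrors the column names used above. Within each State, the numerator and denominator are each collapsed to a single number before dividing, which is what summarise expects. Newer dplyr releases may treat multi-row summaries differently from the error shown above.

```r
library(dplyr)

# Hypothetical toy data: two states, a few incidents each
toy <- data.frame(
  State          = c("AK", "AK", "AL", "AL", "AL"),
  Incident.Month = c(9, 3, 10, 11, 5),
  totalKills     = c(2, 2, 11, 2, 2)
)

result <- toy %>%
  # kills count towards the fall migration only in months 9 to 11
  mutate(fall_mig_kills = ifelse(Incident.Month %in% 9:11, totalKills, 0)) %>%
  group_by(State) %>%
  # sum() reduces each group to one number, so the ratio has length 1
  summarise(pct_mig_kills = sum(fall_mig_kills) / sum(totalKills))

print(result)
```

    Here AK has 2 of 4 kills in season and AL has 13 of 15, so each state gets exactly one ratio.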
    

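    Rewriting the code without errors

    Now let's rewrite it to be a bit simpler. You care about State, but we start with the in-season versus out-of-season split alone: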

    > df %>%
    +   # inSeason = seasons we care about monitoring
    +   # totalKills has NA values, we choose to put deaths at 0
    +   mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
    +          totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
    +   # canadian geese only
    +   filter(grepl("canada goose", Species.Name, ignore.case = TRUE)) %>%
    +   # collect data by inSeason
    +   group_by(inSeason) %>%
    +   # sum them up
    +   summarise(totalDead = sum(totalKills)) %>%
    +   # add a ratio value
    +   mutate(percentDead = round(100*totalDead/sum(totalDead),0))
    # A tibble: 2 x 3
      inSeason totalDead percentDead
         <chr>     <dbl>       <dbl>
    1       in       838          34
    2      out      1620          66
    

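    Now you have the in-season and out-of-season groups, with total kills and percentages. If you want to break this down by State, add that variable to the grouping.

    One more note: group_by followed by summarise automatically drops the other columns, so there is no need for select at the end. In your original code, the select would in fact fail, because columns such as SPECIES and INCIDENT_MONTH no longer exist after summarise.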

    > df %>%
    +   mutate(inSeason = ifelse(Incident.Month %in% 9:11, "in", "out"),
    +          totalKills = ifelse(is.na(totalKills), 0, totalKills)) %>%
    +   filter(grepl("canada goose", Species.Name, ignore.case = TRUE)) %>%
    +   group_by(State, inSeason) %>%
    +   summarise(totalDead = sum(totalKills)) %>%
    +   mutate(percentDead = round(100*totalDead/sum(totalDead),0))
    # A tibble: 98 x 4
    # Groups:   State [51]
       State inSeason totalDead percentDead
       <chr>    <chr>     <dbl>       <dbl>
     1             in        52          52
     2            out        48          48
     3    AB       in         1          50
     4    AB      out         1          50
     5    AK       in        13          33
     6    AK      out        26          67
     7    AL       in         2          40
     8    AL      out         3          60
     9    AR       in         6         100
    10    CA       in        13           8
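
    In the output above, the "# Groups: State [51]" line shows that the result is still grouped by State after summarise, which is why sum(totalDead) in the final mutate is taken per state and each state's percentages add up to 100. If you only want one row per state with the in-season share, a possible variation is sketched below. The toy data and the column names inSeasonDead and percentIn are assumptions for illustration, not part of the answer.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered Canada goose rows
toy <- data.frame(
  State          = c("AK", "AK", "AL", "AL", "AL"),
  Incident.Month = c(9, 3, 10, 11, 5),
  totalKills     = c(2, 2, 11, 2, 2)
)

by_state <- toy %>%
  group_by(State) %>%
  summarise(
    totalDead    = sum(totalKills),
    # subset the kills to fall months before summing
    inSeasonDead = sum(totalKills[Incident.Month %in% 9:11]),
    percentIn    = round(100 * inSeasonDead / totalDead, 0)
  )

print(by_state)
```

    This keeps the numerator and denominator in the same row, so there is no need to filter the result down to the "in" rows afterwards.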