
Programmatically create a dummy for each possible value of a variable and pass these dummies to a formula

  • Zoltan  ·  6 years ago

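    I found NHL shift data and would like to estimate a model in which goals follow a Poisson distribution that depends on the players each of the two teams has on the ice.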

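    My thinking is that we already have a good picture of who can score (goals and assists), but perhaps someone is really good at helping his team score without getting on the scoresheet (maybe by forcing turnovers?), or is very good at preventing the opposing team from scoring.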

    I can create a dataset like `data` below. Each team usually has 5 players on the ice, but I only put 2 to keep the example easy to follow.

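    Basically, I have one row per shift. I know the outcome of the shift (goal_for), the shift duration, and I have a list of IDs of the players who were playing for the team (for_players) and for the opposing team (against_players).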

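    What I want to do: take the `data` dataset and create `model_data`, with one dummy variable per player indicating whether that player was on the ice during a given shift. I would then build a formula for my Poisson model that includes all the dummies and pass it to the model. I could also drop one "for" dummy and one "against" dummy myself, but I can let mgcv::gam do that for me.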

    I suspect this will involve some !! and quos(), but I don't know how to go about it.

    library(tibble)

    # One row per shift; the two list-columns hold the player IDs on the ice
    data <- tibble(
      shift_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
      shift_duration = c(12, 7, 30, 11, 14, 16, 19, 32, 11, 12),
      goal_for = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0),
      for_players = list(
        c("A", "B"),
        c("A", "C"),
        c("B", "C"),
        c("A", "C"),
        c("B", "C"),
        c("A", "B"),
        c("B", "C"),
        c("A", "B"),
        c("B", "C"),
        c("A", "B")
      ),
      against_players = list(
        c("X", "Z"),
        c("Y", "Z"),
        c("X", "Y"),
        c("X", "Y"),
        c("X", "Z"),
        c("Y", "Z"),
        c("X", "Y"),
        c("Y", "Z"),
        c("X", "Y"),
        c("Y", "Z")
      )
    )
    
    
    # (black magic goes here)
    
    model_data <- tibble(
      shift_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
      shift_duration = c(12, 7, 30, 11, 14, 16, 19, 32, 11, 12),
      goal_for = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0),
      for_player_A = c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1),
      for_player_B = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 1),
      for_player_C = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 0),
      against_player_X = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 0),
      against_player_Y = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 1),
      against_player_Z = c(1, 1, 0, 0, 1, 1, 0, 1, 0, 1)
    )
    
    
    
    mod.gam <- mgcv::gam(
      data = model_data,
      formula =  goal_for ~ offset(log(shift_duration)) + for_player_A + for_player_B  + for_player_C +
        against_player_X + against_player_Y + against_player_Z,
      family = poisson(link = log)
    )
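
    The hand-written formula above could also be assembled from the column names, so that it scales to a full roster. A minimal sketch using base R's `reformulate()`, assuming every dummy column starts with `for_player_` or `against_player_`; the four-row `model_data` here is a hypothetical miniature used only to keep the example self-contained:

```r
# Hypothetical miniature of model_data (illustration only); any data frame
# whose dummy columns share these prefixes would be handled the same way.
model_data <- data.frame(
  shift_duration   = c(12, 7, 30, 11),
  goal_for         = c(1, 1, 0, 0),
  for_player_A     = c(1, 1, 0, 1),
  for_player_B     = c(1, 0, 1, 0),
  against_player_X = c(1, 0, 1, 1),
  against_player_Y = c(0, 1, 1, 1)
)

# Collect the dummy column names by their prefix
dummy_cols <- grep("^(for|against)_player_", names(model_data), value = TRUE)

# reformulate() parses the term labels and joins them with `+`
model_formula <- reformulate(
  termlabels = c("offset(log(shift_duration))", dummy_cols),
  response   = "goal_for"
)

print(model_formula)
```

    The resulting object is an ordinary formula, so it should be usable as `formula = model_formula` in the `mgcv::gam()` call, with no tidy evaluation (`!!`, `quos()`) involved.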
    

    `data` looks like this:

    > data
    # A tibble: 10 x 5
       shift_id shift_duration goal_for for_players against_players
          <dbl>          <dbl>    <dbl> <list>      <list>         
     1     1.00          12.0      1.00 <chr [2]>   <chr [2]>      
     2     2.00           7.00     1.00 <chr [2]>   <chr [2]>      
     3     3.00          30.0      0    <chr [2]>   <chr [2]>      
     4     4.00          11.0      0    <chr [2]>   <chr [2]>      
     5     5.00          14.0      1.00 <chr [2]>   <chr [2]>      
     6     6.00          16.0      1.00 <chr [2]>   <chr [2]>      
     7     7.00          19.0      0    <chr [2]>   <chr [2]>      
     8     8.00          32.0      0    <chr [2]>   <chr [2]>      
     9     9.00          11.0      0    <chr [2]>   <chr [2]>      
    10    10.0           12.0      0    <chr [2]>   <chr [2]>
    

    `model_data` looks like this:

    > model_data
    # A tibble: 10 x 9
       shift_id shift_duration goal_for for_player_A for_player_B for_player_C against_player_X against_player_Y against_player_Z
          <dbl>          <dbl>    <dbl>        <dbl>        <dbl>        <dbl>            <dbl>            <dbl>            <dbl>
     1     1.00          12.0      1.00         1.00         1.00         0                1.00             0                1.00
     2     2.00           7.00     1.00         1.00         0            1.00             0                1.00             1.00
     3     3.00          30.0      0            0            1.00         1.00             1.00             1.00             0   
     4     4.00          11.0      0            1.00         0            1.00             1.00             1.00             0   
     5     5.00          14.0      1.00         0            1.00         1.00             1.00             0                1.00
     6     6.00          16.0      1.00         1.00         1.00         0                0                1.00             1.00
     7     7.00          19.0      0            0            1.00         1.00             1.00             1.00             0   
     8     8.00          32.0      0            1.00         1.00         0                0                1.00             1.00
     9     9.00          11.0      0            0            1.00         1.00             1.00             1.00             0   
    10    10.0           12.0      0            1.00         1.00         0                0                1.00             1.00
    

    Model results:

    Family: poisson 
    Link function: log 
    
    Formula:
    goal_for ~ offset(log(shift_duration)) + for_player_A + for_player_B + 
        for_player_C + against_player_X + against_player_Y + against_player_Z
    
    Parametric coefficients:
                      Estimate Std. Error z value Pr(>|z|)
    (Intercept)       -22.0296  4317.9341  -0.005    0.996
    for_player_A        0.0000     0.0000      NA       NA
    for_player_B       -2.3026     2.0000  -1.151    0.250
    for_player_C       -0.1542     1.4142  -0.109    0.913
    against_player_X    1.6094     1.4142   1.138    0.255
    against_player_Y    0.0000     0.0000      NA       NA
    against_player_Z   20.2378  4317.9339   0.005    0.996
    
    
    Rank: 5/7
    R-sq.(adj) =  0.353   Deviance explained = 73.6%
    UBRE = 0.26435  Scale est. = 1         n = 10
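
    The `Rank: 5/7` line and the two coefficients reported as 0 with NA statistics are consistent with perfect collinearity: each side always has exactly two players on the ice, so each group of dummies adds up to a constant multiple of the intercept. A small base R sketch to check this on the dummy columns of `model_data` (the object names `dummies`, `X`, `design_rank` and so on are illustrative only):

```r
# Dummy columns of model_data, copied from the question
dummies <- data.frame(
  for_player_A     = c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  for_player_B     = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 1),
  for_player_C     = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 0),
  against_player_X = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 0),
  against_player_Y = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 1),
  against_player_Z = c(1, 1, 0, 0, 1, 1, 0, 1, 0, 1)
)

# Design matrix with an intercept, as a model formula would build it
X <- model.matrix(~ ., data = dummies)

design_rank <- qr(X)$rank  # number of linearly independent columns
n_columns   <- ncol(X)     # intercept + 6 dummies

# Players on the ice per shift, for each side
for_total     <- rowSums(dummies[, 1:3])
against_total <- rowSums(dummies[, 4:6])

print(c(rank = design_rank, columns = n_columns))
print(table(for_total))
print(table(against_total))
```

    Dropping one "for" dummy and one "against" dummy removes this redundancy. The very large estimates and standard errors for the intercept and `against_player_Z` probably have a separate cause: in this toy data every goal was scored while Z was on the ice for the opponent, so the fitted rate without Z is pushed toward zero.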
    
    1 answer  |  last active 6 years ago
  •   CJ Yetman  ·  6 years ago

    You can convert your `data` data frame into your `model_data` data frame using functions from tidyr:

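    library(dplyr)
    library(tidyr)
    
    # Note: `.drop`, spread() and funs() belong to older tidyr/dplyr releases
    # and have since been deprecated or superseded.
    model_data <-
      data %>% 
      # one row per (shift, player); .drop = FALSE keeps the other list-column
      unnest(for_players, .drop = FALSE) %>% 
      # sep = '_' names the new columns <key>_<value>, e.g. for_players_A
      spread(for_players, for_players, sep = '_') %>% 
      unnest(against_players, .drop = FALSE) %>% 
      spread(against_players, against_players, sep = '_') %>% 
      # every column after the first three becomes a 0/1 indicator
      mutate_at(vars(-(1:3)), funs(as.numeric(!is.na(.))))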
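
    With current tidyr releases the same reshaping could be sketched with `pivot_wider()`. This is not the answer's original code: the helper name `make_dummies` and the `on_ice` column are hypothetical, and the example uses only the first three shifts of `data` to stay short.

```r
library(dplyr)
library(tidyr)

# First three shifts of the question's data (shortened for illustration)
data <- tibble(
  shift_id = 1:3,
  shift_duration = c(12, 7, 30),
  goal_for = c(1, 1, 0),
  for_players = list(c("A", "B"), c("A", "C"), c("B", "C")),
  against_players = list(c("X", "Z"), c("Y", "Z"), c("X", "Y"))
)

# Hypothetical helper: turn one list-column of player IDs into 0/1 columns,
# keyed by shift_id
make_dummies <- function(df, list_col, prefix) {
  df %>%
    select(shift_id, player = all_of(list_col)) %>%
    unnest(cols = player) %>%            # one row per (shift, player)
    mutate(on_ice = 1) %>%
    pivot_wider(
      names_from   = player,
      values_from  = on_ice,
      names_prefix = prefix,
      values_fill  = list(on_ice = 0)    # player absent from the shift
    )
}

model_data <- data %>%
  select(shift_id, shift_duration, goal_for) %>%
  left_join(make_dummies(data, "for_players", "for_player_"), by = "shift_id") %>%
  left_join(make_dummies(data, "against_players", "against_player_"), by = "shift_id")

print(model_data)
```

    Handling each list-column separately and joining on `shift_id` avoids carrying a list-column through the reshape. The dummy columns appear in order of first occurrence rather than alphabetically.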