
Understanding the Yeo-Johnson transformation in R

  • Matias Andina  · 1 year ago

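    I am trying to perform a Yeo-Johnson transformation using caret and recipes, but I think I am either not specifying the call correctly or missing some additional arguments.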

    library(tidyverse)
    library(tidytuesdayR)
    
    # Data is all numeric except for column 7
    # get it from
    # https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15/spam.csv
    # or load it with tt_load()
    spam <- tt_load(2023, week=33)$spam
    
    
    # pre-process
    pp_hpc <- caret::preProcess(spam[,1:6], 
                                method = c("center", "scale", "YeoJohnson"))
    # fails to transform all variables
    pp_hpc
    Created from 4601 samples and 6 variables
    
    Pre-processing:
      - centered (6)
      - ignored (0)
      - scaled (6)
      - Yeo-Johnson transformation (1)
    
    Lambda estimates for Yeo-Johnson transformation:
    0
    
    # I can apply the transformation, but it obviously doesn't transform all the columns as expected
    transformed <- predict(pp_hpc, newdata = spam[,1:6])
    

    Trying with recipes now:

    # recipes package 
    library(recipes)
    # do I really need this just to transform the data?
    rec <- recipe(
      yesno ~ .,
      data = spam
    )
    
    yj_transform <- step_YeoJohnson(rec, all_numeric())
    # only some variables get transformed
    yj_estimates <- prep(yj_transform, verbose = TRUE)
    yj_estimates
    
    ── Recipe ────────────────────────────────────────────────
    
    ── Inputs 
    Number of variables by role
    outcome:   1
    predictor: 6
    
    ── Training information 
    Training data contained 4601 data points and no
    incomplete rows.
    
    ── Operations 
    • Yeo-Johnson transformation on: crl.tot, bang | Trained
    

    Again, applying it works, but not all columns are transformed (I also didn't center/scale, since that is not the issue here).

    yj_te <- bake(yj_estimates, spam)
    

    The bestNormalize package seems to have no problem here:

    # works as expected
    df_transformed <- select(spam, where(is.numeric)) %>% 
      mutate_all(.funs = function(x) predict(bestNormalize::yeojohnson(x), newdata = x))
    

    Just in case, this is how I would do it in Python, or using reticulate:

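    # Python version, run from R through reticulate
    library(reticulate)
    repl_python()
    # everything below is Python, typed at the reticulate REPL
    from sklearn import preprocessing
    # r.spam is the R data frame `spam`, exposed to Python by reticulate
    X = r.spam.drop('yesno', axis = 1)
    scaler = preprocessing.PowerTransformer().set_output(transform="pandas")
    X = scaler.fit_transform(X)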
    
  • joran  · 1 year ago

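    The Yeo-Johnson transformation requires a lambda value that has to be either supplied or estimated. step_YeoJohnson() estimates an appropriate value for each variable.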
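    To make the role of lambda concrete, here is a minimal sketch of the standard Yeo-Johnson definition for a single value, written in plain Python since the question already includes a Python variant. The function name `yeo_johnson` and the `eps` tolerance are illustrative choices of mine, not part of recipes or caret.

```python
import math

def yeo_johnson(y, lam, eps=1e-8):
    """Yeo-Johnson transform of one value y for a given lambda.

    `eps` is an assumed tolerance for treating lambda as exactly 0 or 2.
    """
    if y >= 0:
        if abs(lam) < eps:
            # lambda == 0: log(y + 1)
            return math.log1p(y)
        # lambda != 0: ((y + 1)^lambda - 1) / lambda
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < eps:
        # lambda == 2: -log(-y + 1)
        return -math.log1p(-y)
    # lambda != 2: -((-y + 1)^(2 - lambda) - 1) / (2 - lambda)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)

# Same inputs under three different lambdas
values = [0.0, 0.5, 3.0, 100.0]
for lam in (1.0, 0.0, -5.0):
    print(lam, [round(yeo_johnson(v, lam), 4) for v in values])
```

    With lambda = 1 the values come back unchanged, and strongly negative lambdas squash large positive values towards a common ceiling, which is why the choice of lambda matters so much for skewed columns.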

    The limits argument sets the range of values to search over. It defaults to c(-5, 5).

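    The documentation states:

    "If the transformation parameters are estimated to be very close to the bounds, or if the optimization fails, a value of NA is used and no transformation is applied."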

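    Based on that, if you widen the bounds within which a suitable λ is searched for, you will probably see more variables being transformed. Indeed, when I run: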

    yj_transform <- step_YeoJohnson(rec, all_numeric(), limits = c(-20, 20))
    # with wider limits, all numeric variables get transformed
    yj_estimates <- prep(yj_transform, verbose = TRUE)
    yj_estimates
    

    the summary output reports that it ran the transformation on all 6 numeric variables (rather than just two), and running:

    > tidy(yj_estimates,number = 1)
    # A tibble: 6 × 3
      terms        value id              
      <chr>        <dbl> <chr>           
    1 crl.tot   0.000979 YeoJohnson_jKN6C
    2 dollar  -13.1      YeoJohnson_jKN6C
    3 bang     -3.88     YeoJohnson_jKN6C
    4 money   -14.6      YeoJohnson_jKN6C
    5 n000    -13.4      YeoJohnson_jKN6C
    6 make    -11.0      YeoJohnson_jKN6C
    

    …reports that the estimated λ values are well outside the (-5, 5) range for all but two of the variables.
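    The boundary rule quoted from the documentation can be mimicked with a rough sketch. This is not the recipes implementation: I am assuming a maximum-likelihood criterion, a plain grid search, and a boundary tolerance `tol` of 0.001. The names `profile_loglik` and `estimate_lambda` are made up for illustration, and the toy data is synthetic (mostly zeros), not the spam data.

```python
import math

def yeo_johnson(y, lam, eps=1e-8):
    """Standard Yeo-Johnson transform of one value."""
    if y >= 0:
        if abs(lam) < eps:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < eps:
        return -math.log1p(-y)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)

def profile_loglik(ys, lam):
    """Normal profile log-likelihood of lambda (up to a constant)."""
    t = [yeo_johnson(y, lam) for y in ys]
    n = len(t)
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n
    if var <= 0.0:
        return float("-inf")  # degenerate: all transformed values equal
    # Jacobian term: (lambda - 1) * sum(sign(y) * log(|y| + 1))
    jac = sum(math.copysign(1.0, y) * math.log1p(abs(y)) for y in ys)
    return -0.5 * n * math.log(var) + (lam - 1.0) * jac

def estimate_lambda(ys, limits=(-5.0, 5.0), steps=2001, tol=0.001):
    """Grid-search lambda inside `limits`.

    Returns None (playing the role of NA) when the best value sits
    within `tol` of either bound; `tol` is an assumed threshold.
    """
    lo, hi = limits
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    best = max(grid, key=lambda lam: profile_loglik(ys, lam))
    if best <= lo + tol or best >= hi - tol:
        return None
    return best

# Synthetic, mostly-zero column (an assumption for illustration only)
sparse = [0.0] * 8 + [1.0] * 2
print(estimate_lambda(sparse, limits=(-5.0, 5.0)))
print(estimate_lambda(sparse, limits=(-20.0, 20.0)))
```

    For this toy column the likelihood peaks at about -6.9, so the default-sized search stops at the lower bound and reports nothing, while the wider search returns an interior estimate. That is the same pattern as in the tidy() output, although whether the spam columns behave this way for the same reason is something I have not verified.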