
Understanding the YeoJohnson transformation in R

data-preprocessing r

Matias Andina · Tech Community · 1 year ago

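I am trying to perform a YeoJohnson transformation using caret and recipes, but I think I am not specifying the call correctly, or I am missing some extra parameters.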

library(tidyverse)
library(tidytuesdayR)

# Data is all numeric except for column 7
# get it from
# https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15/spam.csv
# or load it with tt_load()
spam <- tt_load(2023, week=33)$spam


# pre-process
pp_hpc <- caret::preProcess(spam[,1:6], 
                            method = c("center", "scale", "YeoJohnson"))
# fails to transform all variables
pp_hpc
Created from 4601 samples and 6 variables

Pre-processing:
  - centered (6)
  - ignored (0)
  - scaled (6)
  - Yeo-Johnson transformation (1)

Lambda estimates for Yeo-Johnson transformation:
0

# I can apply the transformation, but it obviously does not transform all the columns as expected
transformed <- predict(pp_hpc, newdata = spam[, 1:6])

Trying with recipes now:

# recipes package 
library(recipes)
# do I really need this just to transform the data?
rec <- recipe(
  yesno ~ .,
  data = spam
)

yj_transform <- step_YeoJohnson(rec, all_numeric())
# only transforms some variables
yj_estimates <- prep(yj_transform, verbose = TRUE)
yj_estimates

── Recipe ────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:   1
predictor: 6

── Training information 
Training data contained 4601 data points and no
incomplete rows.

── Operations 
• Yeo-Johnson transformation on: crl.tot, bang | Trained

Again, applying it works, but not all columns are transformed (I am also not centering/scaling here, since that is not the issue).

yj_te <- bake(yj_estimates, spam)

The bestNormalize package seems to have no problem here:

# works as expected
df_transformed <- select(spam, where(is.numeric)) %>% 
  mutate_all(.funs = function(x) predict(bestNormalize::yeojohnson(x), newdata = x))

Just in case, this is what I would do in Python, or using reticulate:

# Python version
library(reticulate)
repl_python()
from sklearn import preprocessing
X = r.spam.drop('yesno', axis = 1)
scaler = preprocessing.PowerTransformer().set_output(transform="pandas")
X = scaler.fit_transform(X)


joran 1 year ago

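So, the Yeo-Johnson transformation requires a lambda value that has to be either supplied or estimated. step_YeoJohnson() estimates a suitable value for each variable.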
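
For reference, this is the usual textbook definition of the transformation for a single value y and parameter λ. The symbol ψ is just my notation here, not something taken from the recipes documentation:

```latex
\psi(y, \lambda) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y \geq 0 \\[2ex]
\log(y + 1), & \lambda = 0,\ y \geq 0 \\[2ex]
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y < 0 \\[2ex]
-\log(1 - y), & \lambda = 2,\ y < 0
\end{cases}
```

With λ = 1 both branches reduce to the identity, so the data are left unchanged.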

The limits argument sets the range of values to search over. It defaults to c(-5, 5).

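The documentation states:

If the transformation parameters are estimated to be very close to the bounds, or if the optimization fails, a value of NA is used and no transformation is applied.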

So, based on that, if you widen the bounds within which a suitable λ is searched for, you may see more variables being transformed. Indeed, when I run:

yj_transform <- step_YeoJohnson(rec, all_numeric(),limits = c(-20,20))
# only transform some variables
yj_estimates <- prep(yj_transform, verbose = T)
yj_estimates

the summary output reports that it ran the transformation on all 6 numeric variables (rather than only two), and running:

> tidy(yj_estimates, number = 1)
# A tibble: 6 × 3
  terms        value id              
  <chr>        <dbl> <chr>           
1 crl.tot   0.000979 YeoJohnson_jKN6C
2 dollar  -13.1      YeoJohnson_jKN6C
3 bang     -3.88     YeoJohnson_jKN6C
4 money   -14.6      YeoJohnson_jKN6C
5 n000    -13.4      YeoJohnson_jKN6C
6 make    -11.0      YeoJohnson_jKN6C

…reports that the estimated λ values are well outside the (-5, 5) range for all but two of the variables.
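
To get a feel for what such extreme λ values do, here is a minimal base-R sketch of the transform for a fixed λ. Note that yeo_johnson() is a hypothetical helper written for illustration only (it is not a function from caret, recipes, or bestNormalize), and it assumes a numeric vector with no missing values:

```r
# Minimal base-R sketch of the Yeo-Johnson transform for a fixed lambda.
# `yeo_johnson` is a hypothetical helper, not part of caret or recipes.
# Assumes `y` is numeric and contains no NA values.
yeo_johnson <- function(y, lambda, eps = 1e-8) {
  out <- numeric(length(y))
  pos <- y >= 0

  # Non-negative values: Box-Cox-like transform of (y + 1)
  if (abs(lambda) < eps) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }

  # Negative values: mirrored transform of (1 - y) with exponent 2 - lambda
  if (abs(lambda - 2) < eps) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }

  out
}

x <- c(-2, -0.5, 0, 0.5, 2, 100)
yeo_johnson(x, lambda = 1)    # lambda = 1 leaves the data unchanged
yeo_johnson(x, lambda = 0)    # log1p() on the non-negative values
yeo_johnson(x, lambda = -13)  # non-negative values are squeezed into [0, 1/13]
```

With a strongly negative λ such as -13, every non-negative input ends up in a very narrow interval, so it may be worth looking at the transformed columns before accepting estimates that far from the default limits.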