代码之家 › 专栏 › 技术社区 › Stefan

R中配对t检验函数的公式表示数据组织

Stefan · 技术社区 · 5 年前

我试着理解配对 t.test() R中的函数使用公式规范计算受试者之间的差异( y ~ x ),即当数据以长格式排序时一按行号排列或 b 按治疗水平排列(见下文示例)。

对于配对t检验 相同的 必须先计算受试者(然后进行其他计算)。那么,如果我改变顺序,R如何计算出哪些科目属于一起?我查了源代码 t.test() 函数(见下文),但从那里我无法理解。我的直觉告诉我,它一定是“ 取级别1的第一次出现,并将其从级别2的第一次发生中减去,然后是第二次、第三次等。 “。下面的例子似乎证实了这一点。当我随机洗牌时,只有这样,结果才会略有不同 t 和 p-value .

提问 :源代码中的哪些函数 t.test() 函数(见下文)使受试者之间的差异计算正确,尽管行的顺序不同?

require(tidyverse)
#> Loading required package: tidyverse
set.seed(1)
## create some toy data to run paired t.test
d <-
  tibble(ID = 1:10 ,
         PRE = c(rnorm(10, mean = 4, sd = 2)),
         POST = c(rnorm(10, mean = 6, sd = 2)))

d
#> # A tibble: 10 x 3
#>       ID   PRE  POST
#>    <int> <dbl> <dbl>
#>  1     1  2.75  9.02
#>  2     2  4.37  6.78
#>  3     3  2.33  4.76
#>  4     4  7.19  1.57
#>  5     5  4.66  8.25
#>  6     6  2.36  5.91
#>  7     7  4.97  5.97
#>  8     8  5.48  7.89
#>  9     9  5.15  7.64
#> 10    10  3.39  7.19

## pivot long to use the formula representation in t.test
d_long <- d %>%
  pivot_longer(-ID, names_to = "treat", values_to = "values")
d_long
#> # A tibble: 20 x 3
#>       ID treat values
#>    <int> <chr>  <dbl>
#>  1     1 PRE     2.75
#>  2     1 POST    9.02
#>  3     2 PRE     4.37
#>  4     2 POST    6.78
#>  5     3 PRE     2.33
#>  6     3 POST    4.76
#>  7     4 PRE     7.19
#>  8     4 POST    1.57
#>  9     5 PRE     4.66
#> 10     5 POST    8.25
#> 11     6 PRE     2.36
#> 12     6 POST    5.91
#> 13     7 PRE     4.97
#> 14     7 POST    5.97
#> 15     8 PRE     5.48
#> 16     8 POST    7.89
#> 17     9 PRE     5.15
#> 18     9 POST    7.64
#> 19    10 PRE     3.39
#> 20    10 POST    7.19

## arrange by treat variable to change order
d_long_arranged <- d_long %>% 
  arrange(desc(treat))
d_long_arranged
#> # A tibble: 20 x 3
#>       ID treat values
#>    <int> <chr>  <dbl>
#>  1     1 PRE     2.75
#>  2     2 PRE     4.37
#>  3     3 PRE     2.33
#>  4     4 PRE     7.19
#>  5     5 PRE     4.66
#>  6     6 PRE     2.36
#>  7     7 PRE     4.97
#>  8     8 PRE     5.48
#>  9     9 PRE     5.15
#> 10    10 PRE     3.39
#> 11     1 POST    9.02
#> 12     2 POST    6.78
#> 13     3 POST    4.76
#> 14     4 POST    1.57
#> 15     5 POST    8.25
#> 16     6 POST    5.91
#> 17     7 POST    5.97
#> 18     8 POST    7.89
#> 19     9 POST    7.64
#> 20    10 POST    7.19

## randomly rearrange order of rows
d_long_random <- d_long %>% 
  slice(sample(1:n()))
d_long_random
#> # A tibble: 20 x 3
#>       ID treat values
#>    <int> <chr>  <dbl>
#>  1     5 POST    8.25
#>  2     3 POST    4.76
#>  3     8 PRE     5.48
#>  4     6 POST    5.91
#>  5     5 PRE     4.66
#>  6     4 PRE     7.19
#>  7    10 PRE     3.39
#>  8     8 POST    7.89
#>  9     4 POST    1.57
#> 10     7 PRE     4.97
#> 11     9 POST    7.64
#> 12     9 PRE     5.15
#> 13     7 POST    5.97
#> 14     1 POST    9.02
#> 15     2 PRE     4.37
#> 16    10 POST    7.19
#> 17     3 PRE     2.33
#> 18     2 POST    6.78
#> 19     6 PRE     2.36
#> 20     1 PRE     2.75

## Run t.tests
t.test(d$POST, d$PRE, paired = T)
#> 
#>  Paired t-test
#> 
#> data:  d$POST and d$PRE
#> t = 2.2879, df = 9, p-value = 0.04794
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.02508677 4.44148199
#> sample estimates:
#> mean of the differences 
#>                2.233284
t.test(values ~ treat, paired = T, d_long)
#> 
#>  Paired t-test
#> 
#> data:  values by treat
#> t = 2.2879, df = 9, p-value = 0.04794
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.02508677 4.44148199
#> sample estimates:
#> mean of the differences 
#>                2.233284
t.test(values ~ treat, paired = T, d_long_arranged)
#> 
#>  Paired t-test
#> 
#> data:  values by treat
#> t = 2.2879, df = 9, p-value = 0.04794
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.02508677 4.44148199
#> sample estimates:
#> mean of the differences 
#>                2.233284
t.test(values ~ treat, paired = T, d_long_random)
#> 
#>  Paired t-test
#> 
#> data:  values by treat
#> t = 2.3054, df = 9, p-value = 0.04658
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.04193459 4.42463416
#> sample estimates:
#> mean of the differences 
#>                2.233284

## pull up t.test.default source code
getAnywhere(t.test.default)
#> A single object matching 't.test.default' was found
#> It was found in the following places
#>   registered S3 method for t.test from namespace stats
#>   namespace:stats
#> with value
#> 
#> function (x, y = NULL, alternative = c("two.sided", "less", "greater"), 
#>     mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, 
#>     ...) 
#> {
#>     alternative <- match.arg(alternative)
#>     if (!missing(mu) && (length(mu) != 1 || is.na(mu))) 
#>         stop("'mu' must be a single number")
#>     if (!missing(conf.level) && (length(conf.level) != 1 || !is.finite(conf.level) || 
#>         conf.level < 0 || conf.level > 1)) 
#>         stop("'conf.level' must be a single number between 0 and 1")
#>     if (!is.null(y)) {
#>         dname <- paste(deparse(substitute(x)), "and", deparse(substitute(y)))
#>         if (paired) 
#>             xok <- yok <- complete.cases(x, y)
#>         else {
#>             yok <- !is.na(y)
#>             xok <- !is.na(x)
#>         }
#>         y <- y[yok]
#>     }
#>     else {
#>         dname <- deparse(substitute(x))
#>         if (paired) 
#>             stop("'y' is missing for paired test")
#>         xok <- !is.na(x)
#>         yok <- NULL
#>     }
#>     x <- x[xok]
#>     if (paired) {
#>         x <- x - y
#>         y <- NULL
#>     }
#>     nx <- length(x)
#>     mx <- mean(x)
#>     vx <- var(x)
#>     if (is.null(y)) {
#>         if (nx < 2) 
#>             stop("not enough 'x' observations")
#>         df <- nx - 1
#>         stderr <- sqrt(vx/nx)
#>         if (stderr < 10 * .Machine$double.eps * abs(mx)) 
#>             stop("data are essentially constant")
#>         tstat <- (mx - mu)/stderr
#>         method <- if (paired) 
#>             "Paired t-test"
#>         else "One Sample t-test"
#>         estimate <- setNames(mx, if (paired) 
#>             "mean of the differences"
#>         else "mean of x")
#>     }
#>     else {
#>         ny <- length(y)
#>         if (nx < 1 || (!var.equal && nx < 2)) 
#>             stop("not enough 'x' observations")
#>         if (ny < 1 || (!var.equal && ny < 2)) 
#>             stop("not enough 'y' observations")
#>         if (var.equal && nx + ny < 3) 
#>             stop("not enough observations")
#>         my <- mean(y)
#>         vy <- var(y)
#>         method <- paste(if (!var.equal) 
#>             "Welch", "Two Sample t-test")
#>         estimate <- c(mx, my)
#>         names(estimate) <- c("mean of x", "mean of y")
#>         if (var.equal) {
#>             df <- nx + ny - 2
#>             v <- 0
#>             if (nx > 1) 
#>                 v <- v + (nx - 1) * vx
#>             if (ny > 1) 
#>                 v <- v + (ny - 1) * vy
#>             v <- v/df
#>             stderr <- sqrt(v * (1/nx + 1/ny))
#>         }
#>         else {
#>             stderrx <- sqrt(vx/nx)
#>             stderry <- sqrt(vy/ny)
#>             stderr <- sqrt(stderrx^2 + stderry^2)
#>             df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 
#>                 1))
#>         }
#>         if (stderr < 10 * .Machine$double.eps * max(abs(mx), 
#>             abs(my))) 
#>             stop("data are essentially constant")
#>         tstat <- (mx - my - mu)/stderr
#>     }
#>     if (alternative == "less") {
#>         pval <- pt(tstat, df)
#>         cint <- c(-Inf, tstat + qt(conf.level, df))
#>     }
#>     else if (alternative == "greater") {
#>         pval <- pt(tstat, df, lower.tail = FALSE)
#>         cint <- c(tstat - qt(conf.level, df), Inf)
#>     }
#>     else {
#>         pval <- 2 * pt(-abs(tstat), df)
#>         alpha <- 1 - conf.level
#>         cint <- qt(1 - alpha/2, df)
#>         cint <- tstat + c(-cint, cint)
#>     }
#>     cint <- mu + cint * stderr
#>     names(tstat) <- "t"
#>     names(df) <- "df"
#>     names(mu) <- if (paired || !is.null(y)) 
#>         "difference in means"
#>     else "mean"
#>     attr(cint, "conf.level") <- conf.level
#>     rval <- list(statistic = tstat, parameter = df, p.value = pval, 
#>         conf.int = cint, estimate = estimate, null.value = mu, 
#>         stderr = stderr, alternative = alternative, method = method, 
#>         data.name = dname)
#>     class(rval) <- "htest"
#>     rval
#> }
#> <bytecode: 0x0000000017f32b58>
#> <environment: namespace:stats>

## pull up t.test.formula source code
getAnywhere(t.test.formula)
#> A single object matching 't.test.formula' was found
#> It was found in the following places
#>   registered S3 method for t.test from namespace stats
#>   namespace:stats
#> with value
#> 
#> function (formula, data, subset, na.action, ...) 
#> {
#>     if (missing(formula) || (length(formula) != 3L) || (length(attr(terms(formula[-2L]), 
#>         "term.labels")) != 1L)) 
#>         stop("'formula' missing or incorrect")
#>     m <- match.call(expand.dots = FALSE)
#>     if (is.matrix(eval(m$data, parent.frame()))) 
#>         m$data <- as.data.frame(data)
#>     m[[1L]] <- quote(stats::model.frame)
#>     m$... <- NULL
#>     mf <- eval(m, parent.frame())
#>     DNAME <- paste(names(mf), collapse = " by ")
#>     names(mf) <- NULL
#>     response <- attr(attr(mf, "terms"), "response")
#>     g <- factor(mf[[-response]])
#>     if (nlevels(g) != 2L) 
#>         stop("grouping factor must have exactly 2 levels")
#>     DATA <- setNames(split(mf[[response]], g), c("x", "y"))
#>     y <- do.call("t.test", c(DATA, list(...)))
#>     y$data.name <- DNAME
#>     if (length(y$estimate) == 2L) 
#>         names(y$estimate) <- paste("mean in group", levels(g))
#>     y
#> }
#> <bytecode: 0x000000001807a9a0>
#> <environment: namespace:stats>

^{创建于2020年1月26日

reprex package

(v0.3.0)}

0 回复 | 直到 5 年前

Allan Cameron 5 年前

聪明是数学上的,与软件实现无关。

记住,配对t检验得到平均差异的所有方法都是将第一列和第二列之间的差异相加(想想 sum(post[1] - pre[1], post[2] - pre[2], post[3] - pre[3] ... 等等)并除以数据帧的长度( nrow(d) ). 差异之和可以重新排列为 sum(post[1:10] - pre[1:10]) ,其本身可以重新排列为 sum(post) - sum(pre) 因此,差值之和等于差值之和。当然, nrow(d) 保持不变,因此无论顺序如何,差异的平均值总是相同的。

你会注意到,当你改变顺序时,p值和t分数略有不同。那是因为分布当你重新排列观察结果时,差异会有所不同。然而总和差异(及其平均值)保持不变。