代码之家  ›  专栏  ›  技术社区  ›  Zheyuan Li

R core的“split”函数背后的算法是什么?

  •  7
  • Zheyuan Li  · 技术社区  · 6 年前

    split 是R核中一个特别重要的功能。许多提供基于数据操作的R基解决方案的堆栈溢出答案都依赖于它。它是任何一个操作组的工作程序。

    还有许多问题的解决方案只是一行 分裂 . 很多人不知道

    • split.data.frame 可以将矩阵按行拆分;
    • split.default 可以逐列拆分数据帧。

    也许R文档 分裂 做得不太好。它没有提到第一种用法,但没有提到第二种用法。

    有四种方法 分裂 在R核心:

    methods(split)
    #[1] split.data.frame split.Date       split.default    split.POSIXct
    

    我将提供一个深入解释如何 拆分.data.frame , 拆分。默认 和C级 .Internal(split(x, f)) 工作。其他答案在“日期”和“posixct”对象上受到欢迎。

    1 回复  |  直到 6 年前
        1
  •  7
  •   Zheyuan Li    6 年前

    怎么做? split.data.frame 工作?

    function (x, f, drop = FALSE, ...) 
    lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), 
           function(ind) x[ind, , drop = FALSE])
    

    它调用 split.default 拆分行索引向量 seq_len(nrow(x)) ,然后使用 lapply 循环将关联行提取到列表项中。

    这不是严格意义上的“data.frame”方法。 它按第1维分割任何二维对象,包括按行分割矩阵 .


    怎么做? 拆分。默认 工作?

    function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...) 
    {
    if (!missing(...)) 
        .NotYetUsed(deparse(...), error = FALSE)
    if (is.list(f)) 
        f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
    else if (!is.factor(f)) 
        f <- as.factor(f)
    else if (drop) 
        f <- factor(f)
    storage.mode(f) <- "integer"
    if (is.null(attr(x, "class"))) 
        return(.Internal(split(x, f)))
    lf <- levels(f)
    y <- vector("list", length(lf))
    names(y) <- lf
    ind <- .Internal(split(seq_along(x), f))
    for (k in lf) y[[k]] <- x[ind[[k]]]
    y
    }
    
    1. 如果 x 没有类(即,大部分是原子矢量), .Internal(split(x, f)) 使用;
    2. 否则,它使用 .Internal(split()) 将索引拆分 X ,然后使用 for 循环将关联元素提取到列表项中。

    原子矢量(参见 ?vector )是具有以下模式的向量:

    • “逻辑”、“整数”、“数字”、“复杂”、“字符”和“原始”
    • “名单”
    • “表达式”

    具有类的对象…呃。。。有那么多!!让我举三个例子:

    • “系数”
    • “数据帧”
    • “矩阵”

    在我看来 拆分。默认 写得不好。有那么多的对象有类,但是 拆分。默认 会以同样的方式通过 "[" . 这可以与“factor”和“data.frame”一起使用(因此我们将沿着列拆分数据帧!)但是它绝对不能像我们期望的那样与矩阵一起工作。

    A <- matrix(1:9, 3)
    #     [,1] [,2] [,3]
    #[1,]    1    4    7
    #[2,]    2    5    8
    #[3,]    3    6    9
    
    split.default(A, c(1, 1, 2))  ## it does not split the matrix by columns!
    #$`1`
    #[1] 1 2 4 5 7 8
    #
    #$`2`
    #[1] 3 6 9
    

    实际上,回收规则已经应用于 c(1, 1, 2) 我们也在做:

    split(c(A), rep_len(c(1,1,2), length(A)))
    

    为什么R CORE不为“矩阵”再写一行,比如

    for (k in lf) y[[k]] <- x[, ind[[k]], drop = FALSE]
    

    到目前为止,安全地按列拆分矩阵的唯一方法是将其转置,然后 拆分.data.frame ,然后是另一个转置。

    lapply(split.data.frame(t(A), c(1, 1, 2)), t)
    

    另一个解决方案VIA lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix) 如果 A 是字符矩阵。


    怎么做? .内部(拆分(x,f)) 工作?

    这真是核心的核心!我将举一个小例子来解释:

    set.seed(0)
    f <- sample(factor(letters[1:3]), 10, TRUE)
    # [1] c a b b c a c c b b
    #Levels: a b c
    
    x <- 0:9
    

    基本上有三个步骤。为了提高可读性,为每个步骤提供等效的R代码。

    第1步:制表(计算各要素水平的发生率)

    ## a factor has integer mode so `tabulate` works
    tab <- tabulate(f, nbins = nlevels(f))
    [1] 2 4 4
    

    步骤2:结果列表的存储分配

    result <- vector("list", nlevels(f))
    for (i in 1:length(tab)) result[[i]] <- vector(mode(x), tab[i])
    names(result) <- levels(f)
    

    我将如下注释这个列表,其中每一行都是一个列表元素,在这个例子中是一个向量,并且 [ ] 是该向量项的占位符。

    $a: [ ] [ ]
    
    $b: [ ] [ ] [ ] [ ]
    
    $c: [ ] [ ] [ ] [ ]
    

    步骤3:要素分配

    现在,发现一个因子的内部整数模式很有用:

    .f <- as.integer(f)
    #[1] 3 1 2 2 3 1 3 3 2 2
    

    我们需要扫描 X .f ,填充 x[i] 进入之内 正确的入口 属于 result[[.f[i]]] ,由累加器缓冲向量通知。

    ab <- integer(nlevels(f))  ## accumulator buffer
    
    for (i in 1:length(.f)) {
      fi <- .f[i] 
      counter <- ab[fi] + 1L
      result[[fi]][counter] <- x[i]
      ab[fi] <- counter
      }
    

    在下图中, ^ 指向访问或更新的元素的指针。

    ## i = 1
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
         ^
    
    ab: [0] [0] [0]  ## on entry
                 ^
    
    $a: [ ] [ ]
    
    $b: [ ] [ ] [ ] [ ]
    
    $c: [0] [ ] [ ] [ ]
         ^
    
    ab: [0] [0] [1]  ## on exit
                 ^
    

    ## i = 2
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
             ^
    
    ab: [0] [0] [1]  ## on entry
         ^
    
    $a: [1] [ ]
         ^
    $b: [ ] [ ] [ ] [ ]
    
    $c: [0] [ ] [ ] [ ]
    
    ab: [1] [0] [1]  ## on exit
         ^
    

    ## i = 3
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                 ^
    
    ab: [1] [0] [1]  ## on entry
             ^
    
    $a: [1] [ ]
    
    $b: [2] [ ] [ ] [ ]
         ^
    $c: [0] [ ] [ ] [ ]
    
    ab: [1] [1] [1]  ## on exit
             ^
    

    ## i = 4
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                     ^
    
    ab: [1] [1] [1]  ## on entry
             ^
    
    $a: [1] [ ]
    
    $b: [2] [3] [ ] [ ]
             ^
    $c: [0] [ ] [ ] [ ]
    
    ab: [1] [2] [1]  ## on exit
             ^
    

    ## i = 5
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                         ^
    
    ab: [1] [2] [1]  ## on entry
                 ^
    
    $a: [1] [ ]
    
    $b: [2] [3] [ ] [ ]
    
    $c: [0] [4] [ ] [ ]
             ^
    
    ab: [1] [2] [2]  ## on exit
                 ^
    

    ## i = 6
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                             ^
    
    ab: [1] [2] [2]  ## on entry
         ^
    
    $a: [1] [5]
             ^
    $b: [2] [3] [ ] [ ]
    
    $c: [0] [4] [ ] [ ]
    
    ab: [2] [2] [2]  ## on exit
         ^
    

    ## i = 7
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                 ^
    
    ab: [2] [2] [2]  ## on entry
                 ^
    
    $a: [1] [5]
    
    $b: [2] [3] [ ] [ ]
    
    $c: [0] [4] [6] [ ]
                 ^
    
    ab: [2] [2] [3]  ## on exit
                 ^
    

    ## i = 8
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                     ^
    
    ab: [2] [2] [3]  ## on entry
                 ^
    
    $a: [1] [5]
    
    $b: [2] [3] [ ] [ ]
    
    $c: [0] [4] [6] [7]
                     ^
    
    ab: [2] [2] [4]  ## on exit
                 ^
    

    ## i = 9
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                         ^
    
    ab: [2] [2] [4]  ## on entry
             ^
    
    $a: [1] [5]
    
    $b: [2] [3] [8] [ ]
                 ^
    $c: [0] [4] [6] [7]
    
    ab: [2] [3] [4]  ## on exit
             ^
    

    ## i = 10
    
     x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
    .f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                             ^
    
    ab: [2] [3] [4]  ## on entry
             ^
    
    $a: [1] [5]
    
    $b: [2] [3] [8] [9]
                     ^
    $c: [0] [4] [6] [7]
    
    ab: [2] [4] [4]  ## on exit
             ^