代码之家  ›  专栏  ›  技术社区  ›  zlon

根据下标快速获取行平均值的方法

  •  3
  • zlon  · 技术社区  · 6 年前

    我有一个数据,可以用以下方式模拟:

    N = 10^6;%10^8;
    K = 10^4;%10^6; 
    
    subs = randi([1 K],N,1);
    M = [randn(N,5) subs];
    M(M<-1.2) = nan;
    

    换句话说,它是一个矩阵,最后一行是下标。 现在我想计算一下 nanmean() 对于每个下标。另外,我想为每个下标保存行数。我有一个“假”代码:

    uniqueSubs = unique(M(:,6));
    avM = nan(numel(uniqueSubs),6);
    for iSub = 1:numel(uniqueSubs)
        tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
        avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1)];
    end
    

    问题是,它太慢了。我想让它为 N = 10^8 K = 10^6 (见这些变量定义中的注释部分。

    如何才能更快地找到数据的平均值?

    2 回复  |  直到 6 年前
        1
  •  6
  •   Edric    6 年前

    这听起来是个完美的工作 findgroups splitapply .

    % Find groups in the final column
    G = findgroups(M(:,6));
    % function to apply per group
    fcn = @(group) [mean(group, 1, 'omitnan'), size(group, 1)];
    % Use splitapply to apply fcn to each group in M(:,1:5)
    result = splitapply(fcn, M(:, 1:5), G);
    % Check
    assert(isequaln(result, avM));
    
        2
  •  5
  •   Adriaan Decoder    6 年前
    M = sortrows(M,6); % sort the data per subscript
    IDX = diff(M(:,6)); % find where the subscript changes
    tmp = find(IDX);
    tmp = [0 ;tmp;size(M,1)]; % add start and end of data
    for iSub= 2:numel(tmp)
        % Calculate the mean over just a single subscript, store in iSub-1
        avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];tmp(iSub-1)];
    end
    

    这比你在我电脑上的原始代码快60倍。加速主要来自于对数据进行预排序,然后找到下标更改的所有位置。这样,您不必每次遍历完整的数组来找到正确的下标,而是只检查每次迭代所需的内容。因此,您可以计算大约100行的平均值,而不必首先签入1000000行,以确定是否需要该迭代的每一行。

    因此:在原件中 numel(uniqueSubs) ,在本例中为10000,是否全部 N ,这里是1000000,数字属于某个类别,这将导致10^12个检查。建议的代码对行进行排序(排序是 NlogN ,因此这里是6000000),然后在整个数组上循环一次,而不需要额外的检查。


    为了完成,这里是原始代码和我的版本,它显示了两个相同的代码:

    N = 10^6;%10^8;
    K = 10^4;%10^6; 
    
    subs = randi([1 K],N,1);
    M = [randn(N,5) subs];
    M(M<-1.2) = nan;
    
    uniqueSubs = unique(M(:,6));
    %% zlon's original code
    avM = nan(numel(uniqueSubs),7); % add the subscript for comparison later
    tic
    uniqueSubs = unique(M(:,6));
    for iSub = 1:numel(uniqueSubs)
        tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
        avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1) uniqueSubs(iSub)];
    end
    toc
    %%%%% End of zlon's code
    avM = sortrows(avM,7); % Sort for comparison
    
    %% Start of Adriaan's code
    avM2 = nan(numel(uniqueSubs),6);
    tic
    M = sortrows(M,6);
    IDX = diff(M(:,6));
    tmp = find(IDX);
    tmp = [0 ;tmp;size(M,1)];
    for iSub = 2:numel(tmp)
        avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];
    end
    toc %tic/toc should not be used for accurate timing, this is just for order of magnitude
    %%%% End of Adriaan's code
    
    all(avM(:,1:6) == avM2) % Do the comparison
    % End of script
    
    % Output
    Elapsed time is 58.561347 seconds.
    Elapsed time is 0.843124 seconds. % ~70 times faster
    
    ans =
    
      1×6 logical array
    
       1   1   1   1   1   1 % i.e. the matrices are equal to one another