
Extracting the first N unique integers from an array

search sorting algorithm

Nils Pipenbrinck · Tech Community · 15 years ago

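I have a large list of integers (thousands), and I want to extract the first N (on the order of 10-20) unique elements from it. Each integer in the list occurs roughly three times.

Writing an algorithm to do this is trivial, but I wonder which approach is the fastest and most memory-efficient.

There are some additional constraints and information in my case:

In my use case I extract my uniques from the array multiple times, each time skipping some elements from the beginning. The number of elements I skip is not known during unique extraction. I don't even have an upper bound. Therefore sorting is not speed-efficient (I have to preserve the order of the array).
The integers are all over the place, so a bit array as a lookup solution is not feasible.
I want to avoid temporary allocations during the search at all costs.

My current solution looks roughly like this: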

  int num_uniques = 0;
  int uniques[16];
  int startpos = 0;

  while ((num_uniques != N) && (startpos < array_length))
  {
    // a temporary used later.
    int insert_position;

    // Get next element.
    int element = array[startpos++];

    // Check whether the element already exists. If it is not found,
    // binary_search returns the position where it could be inserted
    // while keeping the array sorted.

    if (!binary_search (uniques, element, num_uniques, &insert_position))
    {

      // insert the new unique element while preserving
      // the order of the array.

      insert_into_array (uniques, element, insert_position);

      num_uniques++;
    }
  }

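The binary search / insert-into-array algorithm gets the job done, but the performance is not great. The insert_into_array call moves elements around a lot, and this slows everything down.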
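The bodies of the two helpers are not shown in the question, so the following is only a minimal sketch of what they might look like. The signatures are assumptions; in particular, this `insert_into_array` takes an extra `count` parameter (not present in the call above) so that it knows how many elements to shift.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: search sorted[0..count) for value.
   Returns 1 if found; otherwise returns 0 and stores in *pos the index
   at which value can be inserted while keeping the array sorted. */
static int binary_search(const int *sorted, int value, int count, int *pos)
{
    int lo = 0, hi = count;           /* half-open search range [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == value) {
            *pos = mid;
            return 1;
        }
        if (sorted[mid] < value)
            lo = mid + 1;
        else
            hi = mid;
    }
    *pos = lo;
    return 0;
}

/* Hypothetical helper: insert value at index pos, shifting the tail up.
   The caller must guarantee room for count + 1 elements. */
static void insert_into_array(int *sorted, int count, int value, int pos)
{
    assert(pos >= 0 && pos <= count);
    memmove(&sorted[pos + 1], &sorted[pos],
            (size_t)(count - pos) * sizeof sorted[0]);
    sorted[pos] = value;
}
```

The `memmove` in this sketch is the per-insert shifting cost that the question complains about.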

Any ideas?

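EDIT

Great answers, guys! Everyone deserves an accepted answer, but I can only give one. I'll implement a bunch of your ideas and run a performance shootout with some typical data. Whoever has the idea that leads to the fastest implementation gets the accepted answer.

I'll run the code on a modern PC and on an embedded Cortex-A8 CPU, and I'll weight the results somehow. I will post the results as well.

EDIT: Results of the shootout

Timings on a Core Duo, 100 iterations over a 160 KB test data set.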

Bruteforce (Pete):            203 ticks
Hash and Bruteforce (Antti):  219 ticks
Inplace Binary Tree (Steven): 390 ticks
Binary-Search (Nils):         438 ticks

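http://torus.untergrund.net/code/unique_search_shootout.zip (C source and test data)

Additional notes:

The Inplace Binary Tree absolutely rocks for truly random distributions (my test data has a tendency to be ascending).
The Binary-Search works very well on my test data for more than 32 uniques. It performs almost linearly.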

8 replies | latest 15 years ago

Pete Kirkham 15 years ago

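For an array that small (if you want the first 20 elements, then on average you have 10 elements to check for equality), a linear scan will often outperform a binary search, even if you don't have to insert the element.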
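A minimal sketch of this brute-force idea, assuming the caller supplies an output buffer with room for at least `n` values. The function name `first_uniques_linear` and its parameters are made up for illustration and are not taken from the shootout archive.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the first n distinct values of array[start..length) into out,
   in order of first appearance. Returns how many were found. */
static int first_uniques_linear(const int *array, size_t length, size_t start,
                                int n, int *out)
{
    int found = 0;
    assert(n >= 0);
    for (size_t i = start; i < length && found < n; i++) {
        int element = array[i];
        int k;
        /* Linear scan over the uniques collected so far. */
        for (k = 0; k < found; k++) {
            if (out[k] == element)
                break;
        }
        if (k == found)               /* not seen before: append */
            out[found++] = element;
    }
    return found;
}
```

Because new values are appended rather than inserted in sorted order, nothing ever has to be shifted, and the result keeps the order of the source array.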

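Tyler McHenry 15 years ago

Why not just start inserting the array elements into a std::set and stop when the set has N elements? Sets are guaranteed not to contain duplicates. They are also guaranteed to be sorted, so if you traverse a set from begin() to end(), you visit the elements in sorted order according to operator<.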

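John Rasch 15 years ago

The best time complexity you can achieve under the restrictions you've imposed is O(n), by using a dictionary with O(1) lookup for the unique integers instead of a binary tree. Why bother searching for them when you can look them up in constant time?

Since you're only dealing with "thousands" of items, anything else is just a trivial additive term.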
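One way to get a dictionary-style lookup without any heap allocation is a small open-addressing hash set on the stack. The sketch below is an illustration under assumptions: the table size of 64, the multiplicative hash constant, and the name `first_uniques_hashed` are all choices made here, not part of the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64   /* power of two, assumed to be at least 2 * n */

/* Copy the first n distinct values of array[start..length) into out,
   using a fixed-size open-addressing hash set for the membership test. */
static int first_uniques_hashed(const int *array, size_t length, size_t start,
                                int n, int *out)
{
    int table[TABLE_SIZE];
    uint64_t used = 0;                /* bit s set means table[s] is occupied */
    int found = 0;

    assert(n >= 0 && n <= TABLE_SIZE / 2);
    for (size_t i = start; i < length && found < n; i++) {
        int element = array[i];
        /* Illustrative multiplicative hash; keep the top 6 bits (0..63). */
        uint32_t slot = ((uint32_t)element * 2654435761u) >> 26;

        /* Linear probing: walk until we hit the element or a free slot. */
        while (((used >> slot) & 1u) && table[slot] != element)
            slot = (slot + 1) & (TABLE_SIZE - 1);

        if (!((used >> slot) & 1u)) { /* free slot: first occurrence */
            used |= (uint64_t)1 << slot;
            table[slot] = element;
            out[found++] = element;
        }
    }
    return found;
}
```

The occupancy bitmap is there because the integers can take any value, so no value can be reserved as an "empty" marker. Since at most `n` entries are ever stored, the table stays at most half full and probing always terminates.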

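Jay Kominek 15 years ago

I'd try storing the uniques in an unbalanced binary tree. That saves you the cost of rearranging the uniques list, and if the source list is random enough, inserting into the tree won't unbalance it too badly. (With a binary tree you can also do the search and the insert-if-absent in one go.) If it does become unbalanced, the worst case is the same as iterating over a 16-element list instead of doing a binary search.

You know the maximum size of the binary tree, so you can pre-allocate all the necessary memory ahead of time, so that shouldn't be a problem. You could even use the "I've run out of memory for nodes" condition to tell you when you're done.

(Edit: apparently people think I'm advocating the use of exceptions here. I'm not. I might be advocating actual Common Lisp style conditions, but not the escaping-continuation style exceptions found in most languages. Besides, it looks like he wants to do this in C.)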

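Pesto 15 years ago

Instead of storing the unique integers in an array, use an actual binary tree. That keeps you from having to shift array elements over and over.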

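Steven Huwig 15 years ago

Use an array representation of a binary tree. The array can be of size 3N. Basically:

arr[i] = value

arr[i+1] = array index of the left child

arr[i+2] = array index of the right child

Walk the "tree" on each insertion of k; if k is not found, update its parent's [i+1] or [i+2] and add it at the next empty index. When you run out of array space, you have your answer.

For example:

Find the first 3 unique items of 4 2 2 4 3 3 1 2 3: array size = 3*3 = 9.

In the table below, "v" is the value, "l" is the left child index, and "r" is the right child index.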

 v  l  r  v  l  r  v  l  r
 _________________________
-1 -1 -1 -1 -1 -1 -1 -1 -1
 4 -1 -1 -1 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1  6  3 -1 -1

Now you're out of space.

The array indices that are 0 mod 3 hold your answer.
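The three-slot layout above can be sketched in C roughly as follows. The function name `first_uniques_tree` and its parameters are hypothetical; the caller is assumed to pass a `tree` buffer of at least 3 * n ints, which this version does not require to be pre-filled with -1.

```c
#include <assert.h>
#include <stddef.h>

/* Collect the first n distinct values of array[start..length) into a
   binary search tree stored in tree[0 .. 3*n): each node is the triple
   (value, left child index, right child index), with -1 meaning "no child".
   Returns the number of nodes used; node j's value is tree[3*j]. */
static int first_uniques_tree(const int *array, size_t length, size_t start,
                              int n, int *tree)
{
    int used = 0;                     /* nodes allocated so far */
    assert(n >= 0);
    for (size_t i = start; i < length && used < n; i++) {
        int element = array[i];
        int *link = NULL;             /* parent slot to patch, NULL for the root */
        int node = used > 0 ? 0 : -1;

        /* Walk down until we find the element or fall off the tree. */
        while (node != -1 && tree[node] != element) {
            link = &tree[node + (element < tree[node] ? 1 : 2)];
            node = *link;
        }
        if (node != -1)
            continue;                 /* duplicate, nothing to insert */

        node = 3 * used++;            /* next free triple */
        tree[node] = element;
        tree[node + 1] = -1;
        tree[node + 2] = -1;
        if (link != NULL)
            *link = node;
    }
    return used;
}
```

Run on 4 2 2 4 3 3 1 2 3 with n = 3, this fills the buffer with the same final row as the table: 4 3 -1 2 -1 6 3 -1 -1.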

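You can preserve the order by using groups of 4:

arr[i] = value

arr[i+1] = position in the original array

arr[i+2] = index of the left child

arr[i+3] = index of the right child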

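Antti Huima 15 years ago

If you have thousands of integers and each one occurs roughly three times, your algorithm should find the set of N unique integers quite quickly, in roughly N(1+e) steps for a small e (assuming the integers are relatively randomly ordered).

This means your algorithm will insert N random integers into the uniques array. Inserting the k-th number shifts on average k/2 elements of the array, yielding (N^2)/4 move operations. Your binary search takes approximately N*(log(N)-1) steps. This gives your algorithm a total complexity of (N^2)/4 + N(log(N)-1) + N(1+e).

I think you can do better with something like the following: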

/* uint32_t comes from <stdint.h>, memset from <string.h>. */
int num_uniques = 0, startpos = 0, k, element;
int uniques[16];

/* Allocate and clear a bit table of 32 * 32 = 1024 bits. */
uint32_t bit_table[32], hash;
memset(bit_table, 0, sizeof(bit_table));

while (num_uniques < N && startpos < array_length) {
  element = array[startpos++];

  /* Hash the element quickly to a number from 0..1023 */
  hash = (uint32_t)element ^ ((uint32_t)element >> 16);
  hash *= 0x19191919;
  hash >>= 22;
  hash &= 1023;

  /* Map the hash value to a bit in the bit table.
     Use the low 5 bits of 'hash' to index bit_table
     and the other 5 bits to get the actual bit. */
  uint32_t slot = hash & 31;
  uint32_t bit = (1u << (hash >> 5));

  /* If the bit is NOT set, this element is guaranteed unique. */
  if (!(bit_table[slot] & bit)) {
    bit_table[slot] |= bit;
    uniques[num_uniques++] = element;
  } else { /* Otherwise it can still be unique with probability
              num_uniques / 1024. */
    for (k = 0; k < num_uniques; k++) { if (uniques[k] == element) break; }
    if (k == num_uniques) uniques[num_uniques++] = element;
  }
}

Because the probability of running the inner loop (the one with index variable k) is low, this algorithm runs in expected time N + N^2/128.

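EvilTeach 15 years ago

Given a list of integers named L, of size N:

Iterate over L once to find the highest and lowest values in the array.

Allocate (1 allocation) an integer array of size (low .. high) named A. Initialize this array to zero.

Iterate over L, using L(i) to subscript into A, and increment the integer found there.

Then do your processing. Pick your starting point in L and walk forward through the list, looking at A(L(i)). Select whichever set of A(L(i)) > 2 you want.

When done, dispose of A.

If you are really short on space, use 2 bits instead of an integer, interpreted as follows: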

00 count = 0
01 count = 1
10 count = 2
11 count > 2
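The 2-bit encoding above amounts to a saturating counter packed four to a byte. A minimal sketch, assuming the caller provides a zeroed byte buffer and the lowest value `low` found in the first pass; the function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the 2-bit counter at index idx (four counters per byte). */
static unsigned counter_get(const uint8_t *packed, size_t idx)
{
    return (packed[idx / 4] >> (2 * (idx % 4))) & 3u;
}

/* Increment the 2-bit counter at idx, saturating at 3 ("count > 2"). */
static void counter_bump(uint8_t *packed, size_t idx)
{
    unsigned c = counter_get(packed, idx);
    if (c < 3u)  /* below the cap, so adding 1 cannot spill into a neighbour */
        packed[idx / 4] = (uint8_t)(packed[idx / 4] + (1u << (2 * (idx % 4))));
}

/* Count occurrences of each value in list[0..n); value v uses counter v - low.
   packed must hold at least (high - low) / 4 + 1 zeroed bytes. */
static void count_values(const int *list, size_t n, int low, uint8_t *packed)
{
    for (size_t i = 0; i < n; i++) {
        assert(list[i] >= low);
        counter_bump(packed, (size_t)((long long)list[i] - low));
    }
}
```

For values between 5 and 9, for example, five counters are needed, which fit in two bytes.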