代码之家 › 专栏 › 技术社区 › Adam

从HashSet中选择“any”项非常慢-如何快速完成?

Adam · 技术社区 · 4 年前

我现在正在用贪婪算法做很多工作-他们不关心索引等,他们只在组/集上工作。但我发现85%的执行时间都花在了从哈希集中挑选一个项目上。

HashSet类提供了高性能的set操作。一套是不包含重复元素的集合,其元素没有特定的顺序。

……这似乎适用于除“获取值”之外的所有情况。

我试过:

ElementAt(0)-非常慢,据说是因为HashSet去生成一个临时排序,对所有内容进行排序,然后返回最先出现的内容
首先-非常慢(大概是做同样的事情)

我所期望的:

……但我似乎想不出一个办法。

HashSet<MyClass> candidates;
HashSet<MyClass> unvisited;

... // fill unvisited with typically: 100,000 to 10 million items
... // fill candidates with at least 1 item, potentially 100,000's of items

while( candidates.Count > 0 && unvisited.Count > 0 )
{
  var anyItem = candidates.First();

  while( ! CanProcess( anyItem ) ) // CanProcess probable checks some set intersections
  {
     candidates.Remove( anyItem );
     if( candidates.Count > 0 )
        anyItem = candidates.First();
     else
     {
        anyItem = null;
        break;
     }
  }

  if( anyItem == null ) // we've run out of candidates
     break;

  // For the algorithm: "processing" anyItem has a side-effect of 
  // transferring 0 or more items from "unvisited" into "candidates"
  var extraCandidates = Process( anyItem, unvisited );
  // ... Process probably does some set intersections
  
  ... // add all the extraCandidates to candidates
  ... // remove all the extraCandidates from unvisited
  
  candidates.Remove( anyItem );
}

i、例如:典型的贪心算法,有几个集合:一组“下一次迭代的起始点”,一组或多组(这里我只展示了一个)“尚未处理的数据,并且以某种方式连接到/从起始点可以到达”。

……那里的一切都很快,唯一慢的是“第一个”电话——我没有理由接第一个,我可以接任何电话,但我需要接些东西!

0 回复 | 直到 4 年前

Theodor Zoulias 3 年前

HashSet 哈希集 人口稀少。例如a 哈希集 一旦包含1000000个元素并减少为单个元素,则必须枚举1000000个插槽,然后才能生成存储在其单个非空插槽中的元素:

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
set.ExceptWith(Enumerable.Range(1, 999_999));
var item = set.First(); // Very slow

这是一个不容易解决的问题。一个解决办法是给 TrimExcess 方法在每批删除之后。该方法通过分配一个新的插槽数组,将项从现有数组复制到新数组,最后丢弃旧数组,来最小化类在内部保留的空间。这是一个昂贵的行动,所以呼吁 修剪过多

另一个解决方案可以是使用不受此问题影响的第三方实现。例如 Rock.Collections 库包含 OrderedHashSet 类,该类使项保持其添加顺序。它通过以链表方式连接内部插槽来实现这一点。类不仅可以按正常顺序枚举,还可以按相反顺序枚举。我没试过,但很可能是打电话来的 First

下面是一个使用反射来欺骗 built-in enumerator 从一个随机索引(而不是索引0)开始插槽的枚举 known problems of reflection (开销、前向兼容性等)。静电 GetRandom 方法位于泛型静态类中,以便缓存 FieldInfo 每种类型的单独信息 T .

public static class HashSetRandomizer<T>
{
    private static FieldInfo _lastIndexField;
    private static FieldInfo _indexField;
    private static ThreadLocal<Random> _random;

    static HashSetRandomizer()
    {
        const BindingFlags FLAGS = BindingFlags.NonPublic | BindingFlags.Instance;
        _lastIndexField = typeof(HashSet<T>).GetField("m_lastIndex", FLAGS) // Framework
            ?? typeof(HashSet<T>).GetField("_lastIndex", FLAGS); // Core
        _indexField = typeof(HashSet<T>.Enumerator).GetField("index", FLAGS) // Framework
            ?? typeof(HashSet<T>.Enumerator).GetField("_index", FLAGS); // Core
        _random = new ThreadLocal<Random>(() => new Random());
    }

    public static T GetRandom(HashSet<T> source, Random random = null)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        random = random ?? _random.Value;
        if (_lastIndexField == null)
            throw new NotSupportedException("FieldInfo lastIndex not found.");
        if (_indexField == null)
            throw new NotSupportedException("FieldInfo index not found.");
        if (source.Count > 0)
        {
            int lastIndex = (int)_lastIndexField.GetValue(source);
            if (lastIndex > 0)
            {
                var randomIndex = random.Next(0, lastIndex);
                using (var enumerator = source.GetEnumerator())
                {
                    _indexField.SetValue(enumerator, randomIndex);
                    if (enumerator.MoveNext()) return enumerator.Current;
                }
            }
            foreach (var item in source) return item; // Fallback
        }
        throw new InvalidOperationException("The source sequence is empty.");
    }
}

用法示例。项目将随机从 哈希集 ,直到集合为空。

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
while (set.Count > 0)
{
    var item = HashSetRandomizer<int>.GetRandom(set); // Fast
    set.Remove(item);
}

Progman 4 年前

使用 First() 所有的时间都要求 Enumerator IEnumerator foreach HashSet 你必须处理这些条目。

为了防止任何“Collection was modified”异常,在迭代完成之前,不能从HashSet中删除继续的条目。因此,您可以保存已处理的条目,然后删除它们。源代码可能如下所示:

HashSet<MyClass> hs /// approx 500,000 items

while(/* metadata based on what's been processed*/ ) // might be adjusted now
{    
    Set<MyClass> toDelete = new HashSet<MyClass>();
    while (MyClass entry in hs) // get Enumerator only once, then iterate normally
    {
        if(ShouldProcess(entry))
        {
            Process(entry);
            toDelete.Remove(entry);
        }
    }
    // finally delete them
    foreach (MyClass entry in toDelete)
    {
        hs.Remove(entry);
    }
}

当您在整个HashSet上迭代,而不是在每个条目之后运行“元数据”检查时,您可能需要调整外部数据 while 由于一个外环 虽然 循环迭代整个HashSet被迭代(条目不会立即被删除)。