Indexing large amounts of data in Lucene.Net using NHibernate

  •  1
  • Josh  · Tech Community  · 14 years ago

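    We use NHibernate as our data access layer. We have a table of 1.7 million records that we need to index one by one through Lucene for our search. When we run the console application we wrote to build the index, it starts out fast, but it gets progressively slower as it works through the items.

    Our first iteration simply indexed everything at once. The second iteration indexed by category. Now we select subsets by category and then break them up into "pages" of 100. Performance still degrades.

    I turned on SQL Profiler, and as the application iterates over the items it calls SQL Server once per item to fetch that item's images, even though the images collection is mapped with lazy loading turned off (Not.LazyLoad()).

    This is for a commerce site, and we are indexing catalog items (products). Each catalog item has zero or more images, which are stored in a separate table.

    Here is our mapping: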

    public class ItemMap : ClassMap<Item>
    {
        public ItemMap()
        {
            Table("Products");

            Id(x => x.Id, "ProductId").GeneratedBy.GuidComb();

            Map(x => x.Model);
            Map(x => x.Description);

            Map(x => x.Created);
            Map(x => x.Modified);
            Map(x => x.IsActive);
            Map(x => x.PurchaseUrl).CustomType<UriType>();

            Component(x => x.Identifier, m =>
                {
                    m.Map(x => x.Upc);
                    m.Map(x => x.Asin);
                    m.Map(x => x.Isbn);
                    m.Map(x => x.Tid);
                });

            Component(x => x.Price, m =>
                {
                    m.Map(x => x.Currency);
                    m.Map(x => x.Amount, "Price");
                    m.Map(x => x.Shipping);
                });

            References(x => x.Brand, "BrandId");
            References(x => x.Category, "CategoryId");
            References(x => x.Supplier, "SupplierId");
            References(x => x.Provider, "ProviderId");

            // Not.LazyLoad() makes the collection eager, so the images are
            // fetched whenever an Item is loaded.
            HasMany(x => x.Images)
                .Table("ProductImages")
                .KeyColumn("ProductId")
                .Not.LazyLoad();

            // TODO: Add variants
        }
    }
    

    And here is the root logic of the indexing application:

    public void IndexProducts()
    {
        Console.WriteLine("--- Begin Indexing Products ---");
        Console.WriteLine();
        var categories = categoryRepository.GetAll().ToList();
        Console.WriteLine(String.Format("--- {0} Categories found ---", categories.Count));
        categories.Add(null); // null stands for products without a category

        foreach (var category in categories)
        {
            string categoryName = "\"None\"";

            if (category != null)
                categoryName = category.Name;

            Console.WriteLine(String.Format("--- Begin Indexing Category ({0}) ---", categoryName));
            var categoryItems = from p in catalogRepository.GetList(new ActiveProductsByCategoryQuery(category))
                                select p;

            int count = categoryItems.Count();
            int pageSize = 100;
            int currentPage = 0;
            int offset = currentPage * pageSize;
            int current = 1;

            Console.WriteLine(String.Format("Indexing {0} Products...", count));

            // Walk the category one page of 100 products at a time.
            while (offset < count)
            {
                var products = (from p in categoryItems
                                select p).Skip(offset).Take(pageSize);

                foreach (var item in products)
                {
                    indexer.UpdateContent(item);
                    UpdateCounter(current, count);
                    current++;
                }

                currentPage++;
                offset = currentPage * pageSize;
            }
            Console.WriteLine();

            Console.WriteLine(String.Format("--- End Indexing Category ({0}) ---", categoryName));
            Console.WriteLine();
        }

        Console.WriteLine("--- End Indexing Products ---");
        Console.WriteLine();
    }
    

    For reference, the count for this category is 26552. The first query it runs is:

    exec sp_executesql N'SELECT TOP 100 ProductId100_1_, Upc100_1_, Asin100_1_, Isbn100_1_, Tid100_1_, Currency100_1_, Price100_1_, Shipping100_1_, Model100_1_, Descrip10_100_1_, Created100_1_, Modified100_1_, IsActive100_1_, Purchas14_100_1_, BrandId100_1_, CategoryId100_1_, SupplierId100_1_, ProviderId100_1_, CategoryId103_0_, Name103_0_, ShortName103_0_, Created103_0_, Modified103_0_, ShortId103_0_, DisplayO7_103_0_, IsActive103_0_, ParentCa9_103_0_ FROM (SELECT this_.ProductId as ProductId100_1_, this_.Upc as Upc100_1_, this_.Asin as Asin100_1_, this_.Isbn as Isbn100_1_, this_.Tid as Tid100_1_, this_.Currency as Currency100_1_, this_.Price as Price100_1_, this_.Shipping as Shipping100_1_, this_.Model as Model100_1_, this_.Description as Descrip10_100_1_, this_.Created as Created100_1_, this_.Modified as Modified100_1_, this_.IsActive as IsActive100_1_, this_.PurchaseUrl as Purchas14_100_1_, this_.BrandId as BrandId100_1_, this_.CategoryId as CategoryId100_1_, this_.SupplierId as SupplierId100_1_, this_.ProviderId as ProviderId100_1_, category1_.CategoryId as CategoryId103_0_, category1_.Name as Name103_0_, category1_.ShortName as ShortName103_0_, category1_.Created as Created103_0_, category1_.Modified as Modified103_0_, category1_.ShortId as ShortId103_0_, category1_.DisplayOrder as DisplayO7_103_0_, category1_.IsActive as IsActive103_0_, category1_.ParentCategoryId as ParentCa9_103_0_, ROW_NUMBER() OVER(ORDER BY CURRENT_TIMESTAMP) as __hibernate_sort_row FROM Products this_ left outer join Categories category1_ on this_.CategoryId=category1_.CategoryId WHERE (this_.IsActive = @p0 and (1=0 or (this_.CategoryId is not null and category1_.CategoryId = @p1)))) as query WHERE query.__hibernate_sort_row > 500 ORDER BY query.__hibernate_sort_row',N'@p0 bit,@p1 uniqueidentifier',@p0=1,@p1='A988FD8C-DD93-4119-8F84-0AF3656DAEDD'
    

    Then, for each product, it executes:

    exec sp_executesql N'SELECT images0_.ProductId as ProductId1_, images0_.ImageId as ImageId1_, images0_.ImageId as ImageId98_0_, images0_.Description as Descript2_98_0_, images0_.Url as Url98_0_, images0_.Created as Created98_0_, images0_.Modified as Modified98_0_, images0_.ProductId as ProductId98_0_ FROM ProductImages images0_ WHERE images0_.ProductId=@p0',N'@p0 uniqueidentifier',@p0='487EA053-4DD5-4EBA-AA36-95B30C42F0CD'
    

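    That is fine. The problem is that the first 2000 or so products are indexed really fast, but the longer this category runs, the slower it gets and the more memory it consumes, even though each page indexes the same number of products. The GC is working, because memory usage does drop, but overall it keeps climbing as the process runs.

    Is there anything we can do to speed up the indexer? Why does its performance degrade steadily? I don't think it is NHibernate or the queries, because it starts out so fast. We are really at a loss here.

    Thanks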

    3 replies  |  as of 14 years ago
        1
  •  3
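  •   AlexCuse    14 years ago

    Ayende posted about doing exactly this a few weeks ago (using a stateless session and a custom IList implementation).

    http://ayende.com/Blog/archive/2010/06/27/nhibernate-streaming-large-result-sets.aspx

    This sounds like exactly what you need, at least for speeding up record retrieval and minimizing memory usage.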
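    As a rough illustration of the stateless-session half of that suggestion, here is a minimal sketch. It is not the code from the linked post: it pages through the table with an ordinary Criteria query instead of a custom IList, and the class name StatelessPaging, the method ForEachEntity, and its parameters are all hypothetical. Note that, per the NHibernate documentation, a stateless session ignores collections, so a collection such as Images would probably have to be loaded with a separate query.

```csharp
using System;
using System.Collections.Generic;
using NHibernate;
using NHibernate.Criterion;

public static class StatelessPaging
{
    // Hypothetical helper: reads entities of type T page by page through a
    // stateless session (no first-level cache, no dirty checking) and hands
    // each one to the caller-supplied callback.
    public static void ForEachEntity<T>(
        ISessionFactory sessionFactory,
        string idProperty,   // property to order by, e.g. "Id"
        int pageSize,
        Action<T> handle) where T : class
    {
        using (IStatelessSession session = sessionFactory.OpenStatelessSession())
        {
            int offset = 0;
            while (true)
            {
                // A stable ORDER BY keeps the pages from overlapping.
                IList<T> page = session.CreateCriteria<T>()
                    .AddOrder(Order.Asc(idProperty))
                    .SetFirstResult(offset)
                    .SetMaxResults(pageSize)
                    .List<T>();

                if (page.Count == 0)
                    break;

                foreach (T entity in page)
                    handle(entity);

                offset += page.Count;
            }
        }
    }
}
```

    With the mapping from the question, a call might look like StatelessPaging.ForEachEntity<Item>(sessionFactory, "Id", 100, item => indexer.UpdateContent(item)), assuming a configured sessionFactory is available.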

        2
  •  0
  •   sisve    14 years ago

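    Are you using the same session for all calls? If so, it caches every entity it loads and, whenever Flush is called (how often depends on your FlushMode), loops through all of them to check whether they need to be flushed. Either use a new session for each page of items, or change the FlushMode.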
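    A minimal sketch of the session-per-page idea, combined with a flush mode that skips automatic flushing. The names SessionPerPage and ForEachEntity are hypothetical, and the sketch assumes the session factory is already configured. FlushMode.Never is the name used by NHibernate versions of that period; I believe later releases call it FlushMode.Manual.

```csharp
using System;
using System.Collections.Generic;
using NHibernate;
using NHibernate.Criterion;

public static class SessionPerPage
{
    // Hypothetical helper: opens a fresh ISession for every page, so the
    // first-level cache never holds more than pageSize entities at a time.
    public static void ForEachEntity<T>(
        ISessionFactory sessionFactory,
        string idProperty,   // property to order by, e.g. "Id"
        int pageSize,
        Action<T> handle) where T : class
    {
        int offset = 0;
        int loaded = 0;
        do
        {
            using (ISession session = sessionFactory.OpenSession())
            {
                // Read-only work: no automatic flush, so no dirty-check loop.
                session.FlushMode = FlushMode.Never;

                IList<T> page = session.CreateCriteria<T>()
                    .AddOrder(Order.Asc(idProperty))
                    .SetFirstResult(offset)
                    .SetMaxResults(pageSize)
                    .List<T>();

                foreach (T entity in page)
                    handle(entity);

                loaded = page.Count;
            } // disposing the session releases its cached entities

            offset += loaded;
        } while (loaded == pageSize);
    }
}
```

    If opening a session per page is inconvenient, calling session.Clear() after each page on a single long-lived session should have a similar effect on the cache.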

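    When using Criteria, you can specify that particular properties should be prefetched with a SQL join, which may speed up reading the data. I usually trust the Criteria API more than Linq to NHibernate, simply because I decide exactly what each call does.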
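    A minimal sketch of such a prefetch with the Criteria API. The helper name LoadWithCollection and its parameters are hypothetical; SetFetchMode, FetchMode.Join and Transformers.DistinctRootEntity are existing NHibernate members. Because the join returns one row per collection element, combining it with SetMaxResults may yield fewer root entities than the page size, so this is only a starting point.

```csharp
using System.Collections.Generic;
using NHibernate;
using NHibernate.Criterion;
using NHibernate.Transform;

public static class EagerFetch
{
    // Hypothetical helper: loads the entities matching `filter` together with
    // one of their collections in a single joined SELECT, instead of issuing
    // one extra query per entity.
    public static IList<T> LoadWithCollection<T>(
        ISession session,
        ICriterion filter,           // e.g. Restrictions.Eq("IsActive", true)
        string collectionProperty)   // e.g. "Images"
        where T : class
    {
        return session.CreateCriteria<T>()
            .Add(filter)
            .SetFetchMode(collectionProperty, FetchMode.Join)
            // The join repeats the root row once per collection element;
            // this transformer collapses the duplicates in memory.
            .SetResultTransformer(Transformers.DistinctRootEntity)
            .List<T>();
    }
}
```

    For the mapping in the question, a call could look like EagerFetch.LoadWithCollection<Item>(session, Restrictions.Eq("Category", category), "Images").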

        3
  •  0
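  •   Josh    14 years ago

    In the end we moved to Solr for indexing. We were unable to get the indexing to run efficiently, which was probably due to our implementation.

    For reference: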

    http://lucene.apache.org/solr/

    http://code.google.com/p/solrnet/