代码之家 › 专栏 › 技术社区 › PhilipGarnero

postgresql distinct on+order by query优化

query-optimization optimization performance postgresql sql

PhilipGarnero · 技术社区 · 6 年前

我在这里有一个小问题,有一个疑问。

select distinct on(“reporting_processedMazonSnapshot”.“offer_id”)。*
来自“Reporting_ProcessedMazonSnapshot”内部连接
“优惠价”
打开(“Reporting_ProcessedMazonSnapshot”。“Offer_ID”=
“offers”内部联接
“优惠设置”
on(“offers_boooffer”.“id”=“offers_offersettings”.“offerre_id”)。
其中“offerresettings”。“account_id”=20
按“报告处理”的顺序订购。按“提供ID”的顺序订购。
“Reporting_ProcessedMazonSnapshot”。“Drapping_Date”描述
< /代码> 

我有一个索引,名为latest-scrapkingonoffer-id-asc,scrapking-date-descbut for some reason,postgresql is still doing a sort after using the index caused a greg performance issue.


我不明白为什么它不使用已经排序的数据而不是重新排序。我的索引错误吗?或者我应该尝试用另一种方式进行查询?


这是解释

用它的实际数据

'unique(cost=21260.47..21263.06 rows=519 width=1288)(actual time=38053.685..38177.348 rows=1783 loops=1)'
'->排序(成本=21260.47..21261.76行=519宽度=1288)(实际时间=38053.683..38161.478行=153095循环=1)'
'排序键:Reporting_processedMazonSnapshot.Offer_id,Reporting_processedMazonSnapshot.Scriping_date desc'
'排序方法:外部合并磁盘:162088KB'
'->嵌套循环(成本=41.90..21237.06行=519宽度=1288)(实际时间=70.874..36148.348行=153095循环=1)'
'->嵌套循环(成本=41.47..17547.90行=1627宽度=8)(实际时间=54.287..126.740行=1784循环=1)'
'->提供的位图堆扫描“offersettings(cost=41.04..4823.48 rows=1627 width=4)(实际时间=52.532..84.102 rows=1784 loops=1)'
'重新检查条件:(account_id=20)'
'堆块:精确=38'
'->对提供的位图索引扫描_offerSettings_account_id_fff7a8c0(cost=0.00..40.63 rows=1627 width=0)(实际时间=49.886..49.886 rows=4132 loops=1)'
'索引条件:(account_id=20)'
'->仅使用offers_boooffer_pkey on offers_boooffer(cost=0.43..7.81 rows=1 width=4)进行索引扫描(实际时间=0.019..0.020 rows=1 loops=1784)'
'索引条件:(id=offerresettings.offerresettings)'
'堆提取:1784'
'->使用最新的“对报告进行刮削”索引扫描处理mazonSnapshot(成本=0.43..1.69行=58宽度=1288)(实际时间=0.526..20.146行=86循环=1784)'
'index cond:(offer_id=offers_boooffer.id)'
'计划时间:187.133 ms'
'执行时间:38195.266 ms'
< /代码> <

我有一个索引叫latest_scraping在offer_id ASC, scraping_date DESC但出于某种原因,PostgreSQL在使用索引后仍在进行排序,这导致了巨大的性能问题。

我不明白为什么它不使用已经排序的数据而不是重新排序。我的索引错误吗?或者我应该尝试用另一种方式进行查询?

这是解释

用它的实际数据

'Unique  (cost=21260.47..21263.06 rows=519 width=1288) (actual time=38053.685..38177.348 rows=1783 loops=1)'
'  ->  Sort  (cost=21260.47..21261.76 rows=519 width=1288) (actual time=38053.683..38161.478 rows=153095 loops=1)'
'        Sort Key: reporting_processedamazonsnapshot.offer_id, reporting_processedamazonsnapshot.scraping_date DESC'
'        Sort Method: external merge  Disk: 162088kB'
'        ->  Nested Loop  (cost=41.90..21237.06 rows=519 width=1288) (actual time=70.874..36148.348 rows=153095 loops=1)'
'              ->  Nested Loop  (cost=41.47..17547.90 rows=1627 width=8) (actual time=54.287..126.740 rows=1784 loops=1)'
'                    ->  Bitmap Heap Scan on offers_offersettings  (cost=41.04..4823.48 rows=1627 width=4) (actual time=52.532..84.102 rows=1784 loops=1)'
'                          Recheck Cond: (account_id = 20)'
'                          Heap Blocks: exact=38'
'                          ->  Bitmap Index Scan on offers_offersettings_account_id_fff7a8c0  (cost=0.00..40.63 rows=1627 width=0) (actual time=49.886..49.886 rows=4132 loops=1)'
'                                Index Cond: (account_id = 20)'
'                    ->  Index Only Scan using offers_boooffer_pkey on offers_boooffer  (cost=0.43..7.81 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=1784)'
'                          Index Cond: (id = offers_offersettings.offer_id)'
'                          Heap Fetches: 1784'
'              ->  Index Scan using latest_scraping on reporting_processedamazonsnapshot  (cost=0.43..1.69 rows=58 width=1288) (actual time=0.526..20.146 rows=86 loops=1784)'
'                    Index Cond: (offer_id = offers_boooffer.id)'
'Planning time: 187.133 ms'
'Execution time: 38195.266 ms'

1 回复 | 直到 6 年前

Laurenz Albe 6 年前

为了使用索引避免排序,PostgreSQL首先必须扫描 所有的 "reporting_processedamazonsnapshot" 按索引顺序,然后联接 所有的 "offers_boooffer" 使用嵌套循环联接 (以便保留命令)然后加入 所有的 "offers_offersettings" 又一次 使用嵌套循环联接 .

最后,所有与条件不匹配的行 "offers_offersettings"."account_id" = 20 会被扔掉。

PostgreSQL相信——在我看来是正确的——从使用条件尽可能减少行数开始,然后使用最有效的联接方法联接表,然后对 DISTINCT 条款。

我想知道下面的查询是否更快:

SELECT DISTINCT ON (q.offer_id) *
FROM offers_offersettings ofs
   JOIN offers_boooffer bo ON bo.id = ofs.offer_id
   CROSS JOIN LATERAL
      (SELECT *
       FROM reporting_processedamazonsnapshot r
       WHERE r.offer_id = bo.offer_id
       ORDER BY r.scraping_date DESC
       LIMIT 1) q
WHERE ofs.account_id = 20
ORDER BY q.offer_id ASC, q.scraping_date DESC;

执行计划与此类似,只是需要从索引中扫描的行更少,这将减少最需要的执行时间。

如果你想加快分拣速度,增加 work_mem 对于该查询(如果您能负担得起的话)约500MB。