代码之家  ›  专栏  ›  技术社区  ›  The Lazy Log

内部连接条件中列的顺序严重影响性能

  •  7
  • The Lazy Log  · 技术社区  · 6 年前

    我有两个表,它们彼此链接如下:

    answered_questions 包含以下列和索引:

    • id
    • taken_test_id :整数(外键)
    • question_id questions
    • indexes :(参加了测试,问题)

    taken_tests

    • 身份证件
    • user_id :(外键,指向表用户的链接)
    • 索引: 用户id

    第一个查询(带解释分析输出):

    EXPLAIN ANALYZE 
    SELECT 
      "answered_questions".* 
    FROM 
      "answered_questions" 
      INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id" 
    WHERE 
      "taken_tests"."user_id" = 1;
    

    Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
       ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
             Index Cond: (user_id = 1)
       ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=0.00
    2..0.003 rows=2 loops=371)
             Index Cond: (taken_test_id = taken_tests.id)
     Planning time: 0.276 ms
     Execution time: 2.365 ms
    (7 rows)
    

    另一个查询(当使用 joins (ActiveRecord中的方法)

    EXPLAIN ANALYZE 
    SELECT 
      "answered_questions".* 
    FROM 
      "answered_questions" 
      INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
    WHERE 
      "taken_tests"."user_id" = 1;
    

    这是输出

    Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
       ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
             Index Cond: (user_id = 1)
       ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=2.07
    1..3.195 rows=2 loops=371)
             Index Cond: (taken_test_id = taken_tests.id)
     Planning time: 0.302 ms
     Execution time: 1258.035 ms
    (7 rows)
    

    内部连接 条件。在第一个查询中,它是 ON "answered_questions"."taken_test_id" = "taken_tests"."id" 在第二个查询中,它是 ON "taken_tests"."id" = "answered_questions"."taken_test_id" . 但是查询时间有很大的不同。

    你知道为什么会这样吗?我读了一些文章,其中说连接条件中列的顺序不应该影响执行时间(例如: Best practices for the order of joined columns in a sql join? )

    我用的是Postgres9.6。有超过4000万排在 表中有超过300万行 参加了大学考试 桌子

    (analyze true, verbose true, buffers true)

    EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE) 
    SELECT
      "answered_questions".* 
    FROM
      "answered_questions"
      INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
    WHERE
      "taken_tests"."user_id" = 1;
    

    输出

    Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
       Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, a
    nswered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
       Buffers: shared hit=1986
       ->  Index Scan using index_taken_tests_on_user_id on public.taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
             Output: taken_tests.id
             Index Cond: (taken_tests.user_id = 1)
             Buffers: shared hit=269
       ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual ti
    me=0.002..0.003 rows=2 loops=371)
             Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated
    _at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
             Index Cond: (answered_questions.taken_test_id = taken_tests.id)
             Buffers: shared hit=1717
     Planning time: 0.238 ms
     Execution time: 2.335 ms
    
    0 回复  |  直到 5 年前
        1
  •  1
  •   Megadest    5 年前

    正如你从开头看到的 EXPLAIN ANALYZE

    差异来自于同一个单元的执行时间:

    -> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) ( 实际时间=0.014..0.483 rows=371 loops=1)

    ->索引扫描使用索引进行的测试用户id进行的测试(成本=0.43..274.18行=91宽度=4)( 行=371(循环=1)

    正如注释者已经指出的(请参阅wuestion注释中的文档链接),不管表的顺序如何,内部连接的查询计划都应该是相同的。它是根据查询计划器的决策排序的。这意味着您应该真正了解查询执行的其他性能优化部分。其中之一就是用于缓存的内存( SHARED BUFFER 增加共享缓冲区的大小可能有助于解决这个问题,但是查询的初始执行总是需要更长的时间——这只是您的磁盘访问速度。

    有关Pg数据库内存配置的更多提示,请参见: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

    :VACUUM或ANALYZE命令在这里不太可能有帮助。两个查询已经在使用同一个计划。不过,请记住,由于PostgreSQL事务隔离机制(MVCC)的原因,它可能必须读取底层表行,以验证在从索引获取结果后,这些行对当前事务仍然可见。这可以通过更新可见性地图来改进(请参见 https://www.postgresql.org/docs/10/storage-vm.html