代码之家  ›  专栏  ›  技术社区  ›  JBaba

db2中with查询的sql查询性能改进

  •  -2
  • JBaba  · 技术社区  · 6 年前

    我在示例中给出的查询运行得非常慢。我有近400万张唱片 my_task 桌子。

    我们能在这方面做些什么改进吗?

    以下表为例,

    我在这里放了号码 start_dt end_dt 而不是 timestamp 格式。

    有空的附加说明 结束日期 意思是它是一个活跃的记录,由工人来处理。

    T_ID |start_dt |end_dt |code       |p_id
    -----|---------|-------|-----------|---
    1    |8        |4      |INPROGRESS |110
    1    |4        |       |ASSIGNED   |110
    4    |10       |4      |INPROGRESS |110
    4    |4        |       |ASSIGNED   |110
    5    |4        |4      |INPROGRESS |110
    6    |12       |12     |INPROGRESS |110
    6    |8        |8      |ASSIGNED   |110
    6    |8        |       |DONE       |110
    2    |12       |12     |INPROGRESS |210
    2    |8        |8      |ASSIGNED   |210
    2    |8        |       |DONE       |210
    3    |12       |12     |INPROGRESS |111
    

    输出看起来像,

    P_ID |avg_bgn_diff |assigned |in_progress |completed | comp_diff
    -----|-------------|---------|------------|----------|----------
    110  | 4           |   2     |    1       |     1    |      10
    210  | null        |   0     |    0       |     1    |      8
    111  | null        |   0     |    1       |     0    |      null
    

    输出说明: 我已经用虚构的名字屏蔽了原始查询表ref can break我为此提前道歉。

    • 我的任务表具有唯一的t\u id
    • 我的员工桌是员工桌
    • 我的任务参考表包含有关谁有什么任务的详细信息
    • 当每个状态更改操作结果为任务表中的记录时,任务具有状态。雕像,如指定的、正在进行的和已完成的
    • 现在,如果end_dt不存在,则表示活动记录
    • 第一输出字段 avg_bgn_diff 我们只想找到所有任务的平均时间(平均结束时间为空)“分配的”任务
    • 这片田野 assigned |in_progress |completed 表示每个员工在每个类别中有多少活动任务。
    • 求平均值 comp_diff 每位员工的完成时间。当记录进入程序时,员工开始工作。我们做今天完成的任务的平均值。我们得到进程的开始日期和完成的开始日期。

    我有以下疑问,

    WITH a AS (
        SELECT
            t1.t_id AS t_id,
            t1.start_dt AS start_dt,
            t1.end_dt AS end_dt,
            t1.code AS code,
            t2.p_id AS p_id
        FROM
            my_task t2
            INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
            INNER JOIN my_people p1 ON t2.p_id = p1.p_id
        WHERE
            -- ignore DONE tasks
            t1.t_id NOT IN (
                SELECT t.t_id
                FROM my_task t
                WHERE t.code = 'DONE' AND trunc(t.execution_dt) < trunc(current_timestamp)
            )
            and p1.department_id = '1234' 
        ORDER BY p_id DESC
    ) SELECT
        d.p_id,
        d.avg_bgn_diff
        ,e.assigned
        ,e.in_progress
        ,e.completed
        ,g.comp_diff
      FROM
      `-- find average time for persons for diff ASSIGNMENT
        (
            SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
            FROM(
                    SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
                    FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
                    LEFT OUTER JOIN  ( SELECT p_id, t_id,start_dt FROM a WHERE 
                         code = 'ASSIGNED' AND   end_dt IS NULL ) x ON x.p_id = b.p_id
                ) c  GROUP BY C.p_id
        ) d
        -- find count of each codes person has
        INNER JOIN (
            SELECT 
                p_id,
                SUM( CASE WHEN code = 'ASSIGNED' THEN 1 ELSE 0 END ) AS assigned,
                SUM( CASE WHEN code = 'INPROGRESS' THEN 1 ELSE 0 END ) AS in_progress,
                SUM( CASE WHEN code = 'DONE' AND trunc(start_dt) = trunc(current_timestamp)
                        THEN 1 ELSE 0 END ) AS completed
            FROM
                a where end_dt IS NULL
            GROUP BY p_id
        ) e on D.p_id=E.p_id 
        -- find total avg diff of entire task took to compelete.
        LEFT OUTER JOIN (
            SELECT F.p_id,AVG(f.bgn_diff) AS comp_diff
            FROM
                (
                    SELECT a.p_id, timestampdiff(4,b.start_dt - a.start_dt) AS bgn_diff
                    FROM (
                            SELECT p_id, t_id, start_dt FROM a WHERE code = 'INPROGRESS'
                        ) a
                        INNER JOIN (
                            SELECT p_id, t_id, start_dt FROM a
                            WHERE code = 'DONE' AND   trunc(start_dt) = trunc(current_timestamp)
                        ) b ON a.t_id = b.t_id
                ) f GROUP BY F.p_id
        ) g ON D.p_id=G.p_id
    WITH
    ur;
    

    我们可以用不同的方式来写这篇文章来提高性能吗?

    注: 索引出现在所有必需的列中。

    提前谢谢。

    2 回复  |  直到 6 年前
        1
  •  0
  •   Clockwork-Muse anupkumar    6 年前

    如果您提供一个查询,我们当然可以做得更好 EXPLAIN 计划,索引列表,也许更好地解释您要做的事情(并更正表引用的语法错误 c ,但此版本的查询可能会加快速度。

    请注意整个评论!

    WITH Incomplete_Task AS (SELECT My_Task_Ref.t_id,
                                    My_Task_Ref.start_dt, My_Task_Ref.end_dt,
                                    My_Task_Ref.code,
                                    Task_A.p_id
                             FROM My_Task AS Task_A
                             JOIN My_Task_Ref
                               ON My_Task_Ref.t_id = Task_A.t_id
                             JOIN My_People
                               ON My_People.p_id = My_Task_Ref.p_id
                                  AND My_People.department_id = '1234'
                             -- NOT IN should be fine, I just default to NOT EXISTS
                             WHERE NOT EXISTS (SELECT 1
                                               FROM My_Task AS Task_B
                                               WHERE Task_B.t_id = Task_A.t_id
                                               AND Task_B.code = 'DONE'
                                               -- Calling a function on a column can 
                                               -- cause indices to be ignored
                                               AND Task_B.execution_dt < TIMESTAMP(CURRENT_DATE)))
    
    SELECT Average_Time_And_Code_Count.p_id,
           Average_Time_And_Code_Count.average_begin_difference,
           COALESCE(Average_Time_And_Code_Count.assigned, 0),
           COALESCE(Average_Time_And_Code_Count.in_progress, 0),
           COALESCE(Average_Time_And_Code_Count.completed, 0),
           Average_Complete_Time.average_complete_difference
    FROM (SELECT p_id,
                 -- The join you had previously was almost certainly duplicating 
                 -- some rows, distorting the results.
                 AVG(CASE WHEN code = 'ASSIGNED' 
                          -- TIMESTAMPDIFF works off an estimate, and will be wrong
                          -- if a task takes more than a month.
                          THEN TIMESTAMPDIFF(4, CURRENT_TIMESTAMP - A.start_dt) END) AS average_begin_difference,
                 SUM(CASE WHEN code = 'ASSIGNED' 
                                   THEN 1 END) AS assigned,
                 SUM(CASE WHEN code = 'INPROGRESS' 
                                   THEN 1 END) AS in_progress,
                 SUM(CASE WHEN code = 'DONE' 
                                        AND start_dt >= TIMESTAMP(CURRENT_DATE) 
                                   THEN 1 END) AS completed
          FROM Filtered_Task
          WHERE end_dt IS NULL
          GROUP BY p_id) AS Average_Time_And_Code_Count
    -- I'm not convinced this measures what you think it does,
    -- but I'm not sure what it is you think you _are_ measuring....
    LEFT JOIN (SELECT p_id, TIMESTAMPDIFF(4, Done.start_dt - InProgress.start_dt) AS average_complete_difference
               FROM Filtered_Task AS InProgress
               JOIN Filtered_Task AS Done
                 ON InProgress.t_id = Done.t_id
                    AND Done.code = 'DONE'
                    AND Done.start_dt >= TIMESTAMP(CURRENT_DATE)
               WHERE InProgress.code = 'INPROGRESS') AS Average_Complete_Time
           ON Average_Complete_Time.p_id = Averate_Time_And_Code_Count.p_id
    
        2
  •  -1
  •   DanB    6 年前

    尝试在第一个查询中删除order by p_id desc,通常order by非常昂贵。同样,在第一个查询中,not-in似乎正在查看与my_task相同的基表,因此我建议只在where子句中放入过滤器。

    WITH a AS (
    SELECT
        t1.t_id AS t_id,
        t1.start_dt AS start_dt,
        t1.end_dt AS end_dt,
        t1.code AS code,
        t2.p_id AS p_id
    FROM
        my_task t2
        INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
        INNER JOIN my_people p1 ON t2.p_id = p1.p_id
    WHERE
        -- ignore DONE tasks
        t2.code <> 'DONE' AND trunc(t2.execution_dt) < trunc(current_timestamp)
        and p1.department_id = '1234' )
    

    此外,最好尝试减少子查询的深度/数量。 所以有点像

     SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
        FROM(
                SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
                FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
                LEFT OUTER JOIN  ( SELECT p_id, t_id,start_dt FROM a WHERE 
                     code = 'ASSIGNED' AND   end_dt IS NULL ) x ON x.p_id = b.p_id
            ) c  GROUP BY C.p_id
    

    可能变成…

    SELECT a.p_id,AVG(timestampdiff(4,current_timestamp - a.start_dt)) AS 
    avg_bgn_diff
    FROM a
    WHERE end_dt IS NULL OR (code = 'ASSIGNED' AND end_dt IS NULL )
    GROUP BY a.p_id