
How to optimize a query that computes row-dependent datetime relationships?

  •  6
  • shusson  · Tech Community  · 6 years ago

    Say I have a simplified model in which a patient can have zero or more events. An event has a category and a date. I want to support questions like:

    Find all patients that were given a medication after an operation and 
    the operation happened after an admission. 
    

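    where medication, operation and admission are all event categories. There are about 100 possible categories.

    I expect about 1000 patients, each with roughly 10 events per category.

    The naive solution I came up with is to have two tables, a patient table and an event table, create an index on event.category, and then query using inner joins, like: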

    SELECT COUNT(DISTINCT(patient.id)) FROM patient
    INNER JOIN event AS medication
        ON  medication.patient_id = patient.id
        AND medication.category = 'medication'
    INNER JOIN event AS operation
        ON  operation.patient_id = patient.id
        AND operation.category = 'operation'
    INNER JOIN event AS admission
        ON  admission.patient_id = patient.id
        AND admission.category = 'admission'
    WHERE medication.date > operation.date
        AND operation.date > admission.date;
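
    For reference, the simplified model described above could be declared roughly as follows. This is only a sketch: the column types and the index name event_category_idx are assumptions, not taken from the original schema.

    ```sql
    -- Hypothetical DDL for the simplified model (types are assumed).
    CREATE TABLE patient (
        id integer PRIMARY KEY
    );

    CREATE TABLE event (
        id         integer PRIMARY KEY,
        patient_id integer,  -- refers to patient.id
        category   text,     -- e.g. 'admission', 'operation', 'medication'
        date       date
    );

    -- The single-column index mentioned in the question.
    CREATE INDEX event_category_idx ON event (category);
    ```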
    

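    However, this solution does not scale well as more categories/filters are added. With 1000 patients and 45000 events, I see the following performance behavior: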

    | number of inner joins | approx. query response |
    | --------------------- | ---------------------- |
    | 2                     | 100ms                  |
    | 3                     | 500ms                  |
    | 4                     | 2000ms                 |
    | 5                     | 8000ms                 | 
    

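    EXPLAIN output: explain

    Does anyone have suggestions on how to optimize this query/data model?

    Extra info:

    • Postgres 10.6
    • In the EXPLAIN output, project_result is equivalent to patient in the simplified model.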
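
    A plan with actual timings for the query above can be captured as sketched below. EXPLAIN with the ANALYZE and BUFFERS options is standard Postgres; note that ANALYZE really executes the statement.

    ```sql
    -- Runs the query and reports the chosen plan, row counts and timings.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT COUNT(DISTINCT(patient.id)) FROM patient
    INNER JOIN event AS medication
        ON  medication.patient_id = patient.id
        AND medication.category = 'medication'
    INNER JOIN event AS operation
        ON  operation.patient_id = patient.id
        AND operation.category = 'operation'
    INNER JOIN event AS admission
        ON  admission.patient_id = patient.id
        AND admission.category = 'admission'
    WHERE medication.date > operation.date
        AND operation.date > admission.date;
    ```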

    Advanced use case:

    Find all patients that were given a medication within 30 days after an 
    operation and the operation happened within 7 days after an admission.
    
    2 answers  |  as of 6 years ago
        1
  •  4
  •   Erwin Brandstetter    6 years ago

    First, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:

    SELECT COUNT(DISTINCT patient_id)  -- still not optimal
    FROM   event a
    JOIN   event o USING (patient_id)
    JOIN   event m USING (patient_id)
    WHERE  a.category = 'admission'
    AND    o.category = 'operation'
    AND    m.category = 'medication'
    AND    m.date > o.date
    AND    o.date > a.date;
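
    Dropping the patient table is only safe if every event.patient_id is guaranteed to point at an existing patient. A minimal sketch of such a constraint, assuming patient.id is the primary key (the constraint name is made up for illustration):

    ```sql
    -- Hypothetical FK: every event must belong to an existing patient.
    ALTER TABLE event
        ADD CONSTRAINT event_patient_id_fkey
        FOREIGN KEY (patient_id) REFERENCES patient (id);
    ```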
    

    Next, avoid the multiplication of rows, and the DISTINCT in the outer SELECT that is needed to counter it, by using EXISTS semi-joins instead:

    SELECT COUNT(*)
    FROM   event a
    WHERE  EXISTS (
       SELECT FROM event o
       WHERE  o.patient_id = a.patient_id
       AND    o.category = 'operation'
       AND    o.date > a.date
       AND    EXISTS (
          SELECT FROM event m
          WHERE  m.patient_id = a.patient_id
          AND    m.category = 'medication'
          AND    m.date > o.date
          )
       )
    AND    a.category = 'admission';
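
    To see why the plain joins multiply rows: with roughly 10 events per category per patient, three joined categories give about 10 × 10 × 10 = 1000 combinations per patient before the date predicates prune them, and each additional category multiplies that by about 10 again. The following sketch counts those intermediate rows; the figures are estimates derived from the numbers in the question, not measurements.

    ```sql
    -- Counts joined row combinations before date filters and DISTINCT.
    SELECT count(*) AS joined_rows
    FROM   event a
    JOIN   event o USING (patient_id)
    JOIN   event m USING (patient_id)
    WHERE  a.category = 'admission'
    AND    o.category = 'operation'
    AND    m.category = 'medication';
    ```

    An EXISTS semi-join can stop at the first matching row instead of producing every combination.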
    

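    Note that there can still be duplicate admissions: the query counts one row per qualifying admission, not per patient. That is probably a fundamental issue in the data model / query design and needs clarification, as discussed in the comments.

    If you really want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step and to repeat a similar approach for every additional step. This is probably the fastest for your case (re-introducing the patient table to the query):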

    SELECT count(*)
    FROM   patient p
    CROSS  JOIN LATERAL ( -- get earliest admission
       SELECT e.date
       FROM   event e
       WHERE  e.patient_id = p.id 
       AND    e.category = 'admission'
       ORDER  BY e.date
       LIMIT  1
       ) a
    CROSS  JOIN LATERAL ( -- get earliest operation after that
       SELECT e.date
       FROM   event e
       WHERE  e.patient_id = p.id 
       AND    e.category = 'operation'
       AND    e.date > a.date
       ORDER  BY e.date
       LIMIT  1
       ) o
    WHERE EXISTS (  -- the *last* step can still be a plain EXISTS
          SELECT FROM event m
          WHERE  m.patient_id = p.id
          AND    m.category = 'medication'
          AND    m.date > o.date
          );
    


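    You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store an integer (or even int2 or "char") value as FK.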
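
    A rough sketch of that layout, shown as an alternative definition of the event table. The table and column names (category, category_id, name) are assumptions for illustration:

    ```sql
    -- Hypothetical lookup table: one row per category name.
    CREATE TABLE category (
        category_id int2 PRIMARY KEY,
        name        text NOT NULL UNIQUE
    );

    -- event stores only the 2-byte key instead of the full name.
    CREATE TABLE event (
        patient_id  integer NOT NULL REFERENCES patient (id),
        category_id int2    NOT NULL REFERENCES category (category_id),
        date        date    NOT NULL
    );

    -- Queries resolve the name once, then filter on the small key.
    SELECT count(*)
    FROM   event e
    WHERE  e.category_id = (SELECT c.category_id
                            FROM   category c
                            WHERE  c.name = 'admission');
    ```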

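    For best performance (and this is crucial), have a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here: Postgres can use the index with default ASC sort order almost as efficiently in your case.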
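
    Spelled out as DDL, that could look like the sketch below. It assumes the first index column is the patient_id used in the queries; the index name is made up.

    ```sql
    -- Make all three indexed columns NOT NULL (fails if NULLs exist).
    ALTER TABLE event
        ALTER COLUMN patient_id SET NOT NULL,
        ALTER COLUMN category   SET NOT NULL,
        ALTER COLUMN date       SET NOT NULL;

    -- Equality columns first, the range/sort column last.
    CREATE INDEX event_patient_category_date_idx
        ON event (patient_id, category, date DESC);
    ```

    With equality predicates on patient_id and category and a range predicate on date, each subquery can be answered from one contiguous slice of this index.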

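    If VACUUM (preferably in the form of autovacuum) can keep up with write operations, or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.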
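
    One way to check whether index-only scans are actually used, sketched with the EXISTS query from above. In the plan, look for "Index Only Scan" nodes and a low "Heap Fetches" count; a high count suggests the visibility map is not up to date.

    ```sql
    -- Refresh the visibility map and planner statistics.
    VACUUM (ANALYZE) event;

    -- Inspect the plan of the semi-join query.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT COUNT(*)
    FROM   event a
    WHERE  EXISTS (
       SELECT FROM event o
       WHERE  o.patient_id = a.patient_id
       AND    o.category = 'operation'
       AND    o.date > a.date
       AND    EXISTS (
          SELECT FROM event m
          WHERE  m.patient_id = a.patient_id
          AND    m.category = 'medication'
          AND    m.date > o.date
          )
       )
    AND    a.category = 'admission';
    ```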

    For the advanced use case, add an upper bound to each date comparison:

    SELECT COUNT(*)                    -- to count cases
       --  COUNT(DISTINCT patient_id)  -- to count patients
    FROM   event a
    WHERE  EXISTS (
       SELECT FROM event o
       WHERE  o.patient_id = a.patient_id
       AND    o.category = 'operation'
       AND    o.date >= a.date      -- or ">"
       AND    o.date <  a.date + 7  -- based on data type "date"!
       AND    EXISTS (
          SELECT FROM event m
          WHERE  m.patient_id = a.patient_id
          AND    m.category = 'medication'
          AND    m.date >= o.date       -- or ">"
          AND    m.date <  o.date + 30  -- syntax for timestamp is different
          )
       )
    AND    a.category = 'admission';
    

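    The arithmetic above assumes the column has data type date, where adding an integer adds days. For timestamp, add an interval instead.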
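
    The comments in the query above note that the syntax differs for timestamp. A sketch of that variant, assuming the date column were of type timestamp (or timestamptz), where the bounds are written with interval literals:

    ```sql
    SELECT COUNT(*)
    FROM   event a
    WHERE  EXISTS (
       SELECT FROM event o
       WHERE  o.patient_id = a.patient_id
       AND    o.category = 'operation'
       AND    o.date >= a.date
       AND    o.date <  a.date + interval '7 days'   -- timestamp + interval
       AND    EXISTS (
          SELECT FROM event m
          WHERE  m.patient_id = a.patient_id
          AND    m.category = 'medication'
          AND    m.date >= o.date
          AND    m.date <  o.date + interval '30 days'
          )
       )
    AND    a.category = 'admission';
    ```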

        2
  •  1
  •   Gordon Linoff    6 years ago

    -- Compare per-patient aggregate dates instead of self-joining.
    select e.patient_id
    from event e
    group by e.patient_id
    having (max(e.date) filter (where e.category = 'medication') >
            min(e.date) filter (where e.category = 'operation')
           ) and
           (min(e.date) filter (where e.category = 'operation') >
            min(e.date) filter (where e.category = 'admission')
           );
    

    This uses group by and having with filtered aggregates instead of self-joins.

    admission --> operation --> admission --> medication