代码之家  ›  专栏  ›  技术社区  ›  Juan Delgado

删除多余的SQL代码

  •  5
  • Juan Delgado  · 技术社区  · 6 年前

    代码

    下面的代码根据大量数据计算线性回归的斜率和截距。然后应用方程 y = mx + b 根据相同的结果集计算每行的回归线值。

    如何连接这两个查询,以便在不执行 WHERE 条款两次?

    问题的一般形式是:

    SELECT a.group, func(a.group, avg_avg)
    FROM a
        (SELECT AVG(field1_avg) as avg_avg
         FROM (SELECT a.group, AVG(field1) as field1_avg
               FROM a
               WHERE (SOME_CONDITION)
               GROUP BY a.group) as several_lines -- potentially
        ) as one_line -- always
    WHERE (SOME_CONDITION)
    GROUP BY a.group -- again, potentially several lines
    

    我有 SOME_CONDITION 执行两次。如下所示(更新为 STRAIGHT_JOIN 优化):

    SELECT STRAIGHT_JOIN
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
      Y.YEAR as YEAR,
      MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
      ymxb.SLOPE,
      ymxb.INTERCEPT,
      ymxb.CORRELATION,
      ymxb.MEASUREMENTS
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D,
      (SELECT
        SUM(MEASUREMENTS) as MEASUREMENTS,
    
        ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
        (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
    
        ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
        (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
        (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
    
        ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
        (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
      FROM (
        SELECT STRAIGHT_JOIN
          COUNT(1) as MEASUREMENTS,
          AVG(D.AMOUNT) as AMOUNT,
          Y.YEAR as YEAR
        FROM
          CITY C,
          STATION S,
          STATION_DISTRICT SD,
          YEAR_REF Y,
          MONTH_REF M,
          DAILY D
        WHERE
          -- For a specific city ...
          --
          $X{ IN, C.ID, CityCode } AND
    
          -- Find all the stations within a specific unit radius ...
          --
          6371.009 *
          SQRT(
            POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
            (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
             POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
    
          SD.ID = S.STATION_DISTRICT_ID AND
    
          -- Gather all known years for that station ...
          --
          Y.STATION_DISTRICT_ID = SD.ID AND
    
          -- The data before 1900 is shaky; insufficient after 2009.
          --
          Y.YEAR BETWEEN 1900 AND 2009 AND
    
          -- Filtered by all known months ...
          --
          M.YEAR_REF_ID = Y.ID AND
    
          -- Whittled down by category ...
          --
          M.CATEGORY_ID = $P{CategoryCode} AND
    
          -- Into the valid daily climate data.
          --
          M.ID = D.MONTH_REF_ID AND
          D.DAILY_FLAG_ID <> 'M'
        GROUP BY
          Y.YEAR
      ) t
    ) ymxb
    WHERE
      -- For a specific city ...
      --
      $X{ IN, C.ID, CityCode } AND
    
      -- Find all the stations within a specific unit radius ...
      --
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
    
      SD.ID = S.STATION_DISTRICT_ID AND
    
      -- Gather all known years for that station ...
      --
      Y.STATION_DISTRICT_ID = SD.ID AND
    
      -- The data before 1900 is shaky; insufficient after 2009.
      --
      Y.YEAR BETWEEN 1900 AND 2009 AND
    
      -- Filtered by all known months ...
      --
      M.YEAR_REF_ID = Y.ID AND
    
      -- Whittled down by category ...
      --
      M.CATEGORY_ID = $P{CategoryCode} AND
    
      -- Into the valid daily climate data.
      --
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
    

    问题

    如何在每个查询中只执行一次重复位,而不是两次?重复代码:

      $X{ IN, C.ID, CityCode } AND
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
      SD.ID = S.STATION_DISTRICT_ID AND
      Y.STATION_DISTRICT_ID = SD.ID AND
      Y.YEAR BETWEEN 1900 AND 2009 AND
      M.YEAR_REF_ID = Y.ID AND
      M.CATEGORY_ID = $P{CategoryCode} AND
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
    

    更新1

    使用变量和拆分查询似乎允许缓存启动,因为它现在在3.5秒内运行,而它以前在7秒内运行。不过,如果有任何方法可以删除重复的代码,我会感谢任何帮助。

    更新2

    上面的代码不在jaspereports中运行,一个视图,虽然是一个可能的修复,但可能会非常低效(因为where子句是参数化的)。

    更新3

    使用Unreason提出的毕达哥拉斯公式和会合子午线验证距离:

      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
        POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )
    

    (这与问题无关,但是否有人想知道…)

    更新4

    如图所示,代码在jaspereports中工作,运行在mysql数据库上。jaspereports不允许变量或多个查询。

    更新5

    我在寻找一个执行干净的解决方案。;-)我已经编写了一些部分工作的解决方案,但不幸的是,mysql不理解 部分正确 . 请参阅与Unreason的讨论,以获取几乎有效的答案。

    更新6

    我也许可以从一开始就重用变量 哪里 子句并将它们与第二个子句进行比较(从而消除 一些 复制--检查 $P{} 但我真的希望消除重复。

    更新7

    比较 YEAR 子句,如前一次更新中所假设的,以消除重复项 BETWEEN ,不起作用。

    相关的

    How to eliminate duplicate calculation in SQL?

    谢谢您!

    4 回复  |  直到 6 年前
        1
  •  5
  •   Dave Jarvis James Eichele    14 年前

    你应该能够一次性得到你需要的一切:

     SELECT
        AVG(D.AMOUNT) as AMOUNT,
        Y.YEAR as YEAR,
        MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
        Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,             
        ((avg(AVG(D.AMOUNT) * Y.YEAR)) - avg(AVG(D.AMOUNT)) * avg(Y.YEAR)) /                  
        (stddev( AVG(D.AMOUNT) ) * stddev( Y.YEAR )) as CORRELATION,                     
        ((sum(Y.YEAR) * sum(AVG(D.AMOUNT))) - (count(1) * sum(Y.YEAR * AVG(D.AMOUNT)))) /
        (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as SLOPE,   
        ((sum( Y.YEAR ) * sum( Y.YEAR * AVG(D.AMOUNT) )) -
        (sum( AVG(D.AMOUNT) ) * sum(power(Y.YEAR, 2)))) / 
        (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as INTERCEPT
     FROM
        CITY C,
        STATION S,
        YEAR_REF Y,
        MONTH_REF M,
        DAILY D
     WHERE
        $X{ IN, C.ID, CityCode } AND
        SQRT(
            POW( C.LATITUDE - S.LATITUDE, 2 ) +
            POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} AND
        S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
        Y.YEAR BETWEEN 1900 AND 2009 AND
        M.YEAR_REF_ID = Y.ID AND
        M.CATEGORY_ID = $P{CategoryCode} AND
        M.ID = D.MONTH_REF_ID AND
        D.DAILY_FLAG_ID <> 'M'
     GROUP BY
        Y.YEAR
    

    事情 不会起作用的 直接从上面的查询(它有无意义的组合聚合和其他错误);现在是检查公式的好时机

    如果决定执行子查询确实简化了公式,则:

    • 您可以在最内部的查询中获取(您确实要获取)所有必需的数据,而不必再重复外部查询中的所有表(只需从t中选择相关列,它们已经在您的掌握中)
    • 你不必重复where条件
        2
  •  1
  •   tzot    14 年前

    这个问题比你的概括要难一点。我要声明如下:

    SELECT a.group, func(a.group, avg_avg)
    FROM a
        (SELECT AVG(field1_avg) as avg_avg
         FROM (SELECT a.group, AVG(field1) as field1_avg
               FROM a
               WHERE (YOUR_CONDITION)
               GROUP BY a.group) as several_lines -- potentially
        ) as one_line -- always
    WHERE (YOUR_CONDITION)
    GROUP BY a.group -- again, potentially several lines
    

    您有一部分数据(受条件限制),这些数据被分组并为每个组进行聚合。然后,将聚合合并为单个值,并希望将该值的函数再次应用于每个组。显然,在分组子查询的结果可以作为实体引用之前,不能重用该条件。

    在mssql和oracle中,您可以使用 WITH 接线员。在mysql中,唯一的选择是使用临时表。我假设您的报告中有一年多的时间(否则,查询会简单得多)。

    UPD :很抱歉,我现在无法发布就绪代码(明天可以),但我有一个想法:

    您可以将需要在子查询中输出的数据与 GROUP_CONCAT 并在外部查询中将其拆分为 FIND_IN_SET SUBSTRING_INDEX 功能。外部查询将只连接年份和聚合结果。

    外部查询中的条件将是 WHERE FIND_IN_SET(year, concatenated_years) .

    UPD :

    下面是使用group concat将所需数据传递给外部联接的版本。

    我的评论以 --newtover: . 顺便说一句,1)我认为直接加入不会增加任何好处,2) COUNT(*) 在mysql和 应该 要计算行数时使用。

    SELECT STRAIGHT_JOIN
      -- newtover: extract the corresponding amount back
      SUBSTRING_INDEX(SUBSTRING_INDEX(GROUPED_AMOUNTS, '|', @pos),'|', -1) as AMOUNT,
      Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
      Y.YEAR as YEAR,
      MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
      ymxb.SLOPE,
      ymxb.INTERCEPT,
      ymxb.CORRELATION,
      ymxb.MEASUREMENTS
    FROM
      -- newtover: list of tables now contains only the subquery, YEAR_REF for grouping and init_vars to define the variable
      YEAR_REF Y,
      (SELECT
        SUM(MEASUREMENTS) as MEASUREMENTS,
        ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
        (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
        ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
        (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
        (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
        ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
        (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION,
        -- newtover: grouped fields for matching years and the corresponding amounts
        GROUP_CONCAT(Y.YEAR) as GROUPED_YEARS,
        GROUP_CONCAT(AMOUNT SEPARATOR '|') as GROUPED_AMOUNTS
      FROM (
        SELECT STRAIGHT_JOIN
          COUNT(1) as MEASUREMENTS,
          AVG(D.AMOUNT) as AMOUNT,
          Y.YEAR as YEAR
        FROM
          CITY C,
          STATION S,
          STATION_DISTRICT SD,
          YEAR_REF Y,
          MONTH_REF M,
          DAILY D
        WHERE
          -- For a specific city ...
          $X{ IN, C.ID, CityCode } AND
          -- Find all the stations within a specific unit radius ...
          6371.009 *
          SQRT(
            POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
            (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
             POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
          SD.ID = S.STATION_DISTRICT_ID AND
          -- Gather all known years for that station ...
          Y.STATION_DISTRICT_ID = SD.ID AND
          -- The data before 1900 is shaky; insufficient after 2009.
          Y.YEAR BETWEEN 1900 AND 2009 AND
          -- Filtered by all known months ...
          M.YEAR_REF_ID = Y.ID AND
          -- Whittled down by category ...
          M.CATEGORY_ID = $P{CategoryCode} AND
          -- Into the valid daily climate data.
          M.ID = D.MONTH_REF_ID AND
          D.DAILY_FLAG_ID <> 'M'
        GROUP BY
          Y.YEAR
      ) t
    ) ymxb,
    (SELECT @pos:=NULL) as init_vars
    WHERE
        -- newtover: check if the year is in the list and store the index into the variable
        @pos:=CAST(FIND_IN_SET(Y.YEAR, GROUPED_YEARS) as UNSIGNED)
    GROUP BY
      Y.YEAR
    
        3
  •  0
  •   Unreason    14 年前

    由于问题中的sql被大量挂起(现在只显示相关部分),这里是我的新答案

    假设:条件实际上是相同的,子查询和外部查询之间不会出现复杂的列别名。

    回答: 可以删除外部查询中的where。

    SELECT
      /* aggregate data */
      ymxb.*
    FROM (
      SELECT
        /* similar aggregate data */
      WHERE
        /* some condition */
      GROUP BY
        YEAR
    ) ymxb
    GROUP BY
      YEAR
    

    这应该会给你同样的结果。

    (还要注意,您可以删除内部的where并保留外部的where-结果应该相同,但是性能可能不一样)。

    最后, 重复where子句可能对性能没有太大影响 -与任何i/o相比,计算额外条件(甚至是sqrt等表达式)非常便宜(而且这些条件不在任何新列上操作,因此所有i/o都已完成)

    此外,内部查询和外部查询使用相同的group by,外部查询从子查询获取所有数据。

    这使得外部查询中的任何聚合函数都是多余的(作为外部查询源的子查询中的行已经按年份分组)。

    这使得整个子选择是多余的。

        4
  •  0
  •   John Boker    14 年前

    你能在你的情况下使用临时桌子吗?尽管它仍然要求您使用where子句两次,但它应该会大大提高您的性能。

    DROP TEMPORARY TABLE IF EXISTS TEMP_DATA
    
    CREATE TEMPORARY TABLE TEMP_DATA 
        (SELECT AVG(field1_avg) as avg_avg
         FROM (SELECT a.group, AVG(field1) as field1_avg
               FROM a
               WHERE (SOME_CONDITION)
               GROUP BY a.group)
        )
    
    SELECT t.group, func(t.group, t.avg_avg)
    FROM TEMP_DATA AS t
    WHERE (SOME_CONDITION)
    GROUP BY t.group
    

    希望这有帮助! ——配音