代码之家  ›  专栏  ›  技术社区  ›  Richard

在两者之间加入U-SQL以创建每月结果

  •  1
  • Richard  · 技术社区  · 6 年前

    编辑: 我一直在研究所需的结果。让我更好地解释一下:我试图达到的是为每个SomeData的每个YearMonth创建一行,介于开始日期和结束日期之间。

    例如,一些数据“888888888888888888 1”,带有startDate “2005-12-06 00:00:00.000”和结束日期2006-03-13 00:00:000”。 我希望行的格式如下:

    88888888888888888881, 200512
    88888888888888888881, 200601
    88888888888888888881, 200602
    88888888888888888881, 200603
    

    我知道这可能会将结果“爆炸”成一个巨大的文件。

    在我的帖子下面:

    我正在尝试用U-SQL重写之前在T-SQL中所做的事情。

    问题是,U-SQL不允许在连接过程中出现中间值。

    T-SQL连接如下所示:

    SELECT rf.SomeData AS SomeData,
           rd.YearMonth AS YearMonth,
          (rf.SomeData + '-' + rd.YearMonth.ToString()) AS MonthlyKey,
           rf.SomeKey AS SomeKey
    FROM MyTable rf
        INNER JOIN dbo.DimDate rd
        ON rd.Date >= rf.StartDate
        AND rd.Date <= (CASE WHEN rf.EndDate IS NULL THEN GETDATE() ELSE rf.EndDate END)
    

    在U-SQL中,我是这样开始的,但是我现在应该如何编写连接呢

    @EditedTable =
        SELECT rf.SomeData AS SomeData,
               rd.YearMonth AS YearMonth,
               (rf.SomeData + "-" + rd.YearMonth.ToString()) AS MonthlyKey,
               rf.SomeKey AS SomeKey
        FROM @MyTable AS rf
             INNER JOIN
                 @date AS rd
             ON 
    

    重要的是,我们要获取开始日期和结束日期之间的所有数据,并创建一个月度键,以便“SomeData”稍后可以与另一个表联接。

    我尝试过使用交叉连接,但在运行时,它被卡在了80%,而且似乎永远不会结束。它在一个顶点中不断写入GB。此外,我不确定这是否会产生同样的结果。

    @EditedTableCROSS =
        SELECT rfj.SomeData AS SomeData,
               rfj.StartDate AS StartDate,
               rfj.EndDate AS EndDate,
               (rfj.SomeData + "-" + dtj.YearMonth.ToString()) AS MonthlyKey, 
               rfj.SomeKey AS SomeKey
    
            FROM
        (
            SELECT SomeData AS SomeData,
                   StartDate AS StartDate,
                   EndDate AS EndDate,
                   SomeKey AS SomeKey
            FROM @TableA
            WHERE SomeData != ""
        ) AS rfj
    
        CROSS JOIN
    
        (
        SELECT DISTINCT
               dt.Date AS Date,
               dt.YearMonth AS YearMonth,
               dt.Month AS Month,
               rf.StartDate AS StartDate
        FROM @date AS dt INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date
        WHERE rf.StartDate >= dt.Date AND
              dt.Date <= DateTime.Now
              ) AS dtj
    
        WHERE rfj.StartDate <= dtj.Date AND
              rfj.EndDate >= dtj.Date;
    

    上面代码的问题是“内部连接@TableA AS rf ON rf.StartDate==dt.Date”不在唯一键上,某些日期会多次出现。所以我怀疑这是一种方式。。。。

    请分享你的想法?

    编辑: 要求提供样本数据的人员,结束日期可以包含:

    2006-03-13 10:27:13.000
    2016-03-02 18:48:11.000
    2016-03-02 18:42:57.000
    NULL
    2013-09-12 09:19:05.000
    NULL
    2016-03-02 18:59:37.000
    NULL
    NULL
    

    开始日期:

    2005-12-06 00:00:00.000
    2011-03-29 20:57:51.000
    2007-11-01 00:00:00.000
    2007-11-01 00:00:00.000
    2007-11-01 00:00:00.000
    2011-02-28 00:00:00.000
    2011-02-28 00:00:00.000
    2011-02-28 00:00:00.000
    2008-01-17 00:00:00.000
    

    DimDate包含从2000年到2018年的日期。

    SomeDate和SomeKey看起来像:

    88888888888888888881
    88888888888888888882
    88888888888888888883
    88888888888888888884
    88888888888888888885
    88888888888888888886
    88888888888888888887
    88888888888888888888
    88888888888888888889
    
    2 回复  |  直到 6 年前
        1
  •  4
  •   Michael Rys    6 年前

    U-SQL在谓词中不支持BETWEEN的原因是没有适用于非Equijoin的横向扩展连接算法。即使我们在句法上允许它,它仍然会在计划中交叉连接。

    您要做的是获得一个可以分区的连接。一种方法是在分区键上进行相等连接,然后在该分区内进行交叉连接。

    然而,在你的情况下,我认为你真的不需要加入。我认为您要做的是在开始日期和结束日期之间每天生成一行。

    我会用 CROSS APPLY EXPLODE 没有规模限制。以下是一个示例:

    @MyTable = 
      SELECT * 
      FROM (VALUES
            (81,81,(DateTime?) DateTime.Parse("2005-12-06 00:00:00.000"),(DateTime?) DateTime.Parse("2006-03-13 10:27:13.000")),
            (82,82,(DateTime?) DateTime.Parse("2011-03-29 20:57:51.000"),(DateTime?) DateTime.Parse("2016-03-02 18:48:11.000")),
            (83,83,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:42:57.000")),
            (84,84,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) null),
            (85,85,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2013-09-12 09:19:05.000")),
            (86,86,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
            (87,87,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:59:37.000")),
            (88,88,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
            (89,89,(DateTime?) DateTime.Parse("2008-01-17 00:00:00.000"),(DateTime?) null)
        ) AS T(SomeKey, SomeData, StartDate, EndDate);
    
    @res = 
      SELECT SomeKey, SomeData, StartDate, EndDate, DailyDate 
      FROM @MyTable 
           CROSS APPLY EXPLODE 
             (Enumerable.Range(0, 
                   1 + (EndDate == (DateTime?) null ? DateTime.Now 
                                                    : EndDate.Value).Subtract(StartDate.Value).Days)
               .Select(offset => StartDate.Value.AddDays(offset))
              ) AS T(DailyDate);
    
    OUTPUT @res
    TO "/output/test.csv"
    USING Outputters.Csv(outputHeader : true);
    

    这是一个典型的例子,说明基于问题场景而不是要求翻译更容易回答:)。

        2
  •  3
  •   wBob    6 年前

    我用这个脚本处理我生成的一些示例数据。

    @dateDim =
        EXTRACT xdate DateTime,
                yearMonth string
        FROM "/input/dbo.DimDate.tsv"
        USING Extractors.Tsv();
    
    @data =
        EXTRACT 
                someKey int,
                someData string,
                startDate DateTime,
                endDate DateTime?
        FROM "/input/dbo.MyTable.tsv"
        USING Extractors.Tsv();
    
    /*
    // Use U-SQL ISNULL conditional operator which is ?
    @working =
        SELECT COUNT( * ) AS records
        FROM
        (
            SELECT *
            FROM @dateDim AS dd
                 CROSS JOIN
                     @data AS d
            WHERE dd.xdate BETWEEN d.startDate AND (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
        ) AS x;
    */
    
    @working =
        SELECT COUNT( * ) AS records
        FROM
        (
            SELECT *
            FROM @dateDim AS dd
                 CROSS JOIN
                     @data AS d
            WHERE dd.xdate >= d.startDate
              AND dd.xdate <= (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
    ) AS x;