代码之家  ›  专栏  ›  技术社区  ›  Hodza

如何在usql窗口表达式中获取最后一个非空值?

  •  1
  • Hodza  · 技术社区  · 7 年前

    我试图在窗口表达式中确定最后一个非空值:

    LAST_VALUE([b]) OVER (ORDER BY Timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS bf
    

    不幸的是,它不起作用。 我编写了自定义聚合函数,但它也不起作用。

    public class LastNonNull<T> : IAggregate<T, T>
            where T : class
        {
            T last;
    
            public override void Init()
            {
                last = null;
            }
    
            public override void Accumulate(T val)
            {
                if (val != null)
                {
                    last = val;
                }
            }
    
            public override T Terminate()
            {
                return last;
            }
    
        }
    

    尝试使用:

    AGG<DataLakeTest.LastNonNull<string>>([b]) OVER (ORDER BY Timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS bf
    

    错误E\u CSC\u USER\u UNEXPECTEDOVERCLAUSE:意外的OVER子句。 描述:OVER子句必须跟在排序函数调用之后(例如 作为RANK()或ROW_NUMBER())或组内子句。分辨率:Make 确保OVER子句紧跟在排序函数调用或 组内子句。

    使现代化

    脚本:

    @tb1 = SELECT * FROM 
            ( VALUES
            (1, "Val1"),
            (2, (string)null),
              (3, "Val3"),
              (5, (string) null),
              (6, (string)null),
              (7, "Val7"),
              (8, "Val8")
            ) AS T(Timestamp, a);
    
    @tb1 =
        SELECT Timestamp,
               ??? AS a
        FROM @tb1;
    
    OUTPUT @tb1 TO "/test.csv" USING Outputters.Csv(outputHeader: true);
    

    预期输出:

    "Timestamp","a"
    1,"Val1"
    2,"Val1"
    3,"Val3"
    5,"Val3"
    6,"Val3"
    7,"Val7"
    8,"Val8"
    

    更新2:

    不幸的是,我不能使用LAG函数,因为非null值之间的null值计数未知。不能使用交叉连接,因为当我有非常大的表时,处理步骤就会冻结。我目前的解决方案(我不乐意使用):

    @tb1 =
        SELECT Timestamp,
               [a],
               [a] != null && [a] != LEAD([a], 1) OVER(ORDER BY Timestamp ASC) AS aSwitch
        FROM @tb1;
    
    @tb1 =
        SELECT Timestamp,
               [a],
               SUM(aSwitch ? 1 : 0) OVER(ORDER BY Timestamp ASC ROWS UNBOUNDED PRECEDING) AS aGrp
        FROM @tb1;
    
    @tb1 =
        SELECT Timestamp,
               FIRST_VALUE([a]) OVER(PARTITION BY aGrp ORDER BY Timestamp ASC) AS a
        FROM @tb1;
    

    最终解决方案:

    public class ReplaceNullReducer : IReducer
    {
        string lastValue = null;
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            foreach (var row in input.Rows)
            {
                var val = row.Get<string>("a");
                if (val != null) lastValue = val;
                output.Set<string>("a", lastValue);
                output.Set<int>("Timestamp", row.Get<int>("Timestamp"));
                yield return output.AsReadOnly();
            }
        } 
    } 
    

    USQL(出于某种原因,“ALL”选项会触发E\u CSC\u USER\u语法错误,因此我引入了dumb device列):

    @tb1 = SELECT * FROM 
            ( VALUES
            (1, "Val1", 1),
            (2, (string)null, 1),
              (3, "Val3", 1),
              (5, (string) null, 1),
              (6, (string)null, 1),
              (7, "Val7", 1),
              (8, "Val8", 1)
            ) AS T(Timestamp, a, device);
    @tb1 = REDUCE @tb1 PRESORT [Timestamp] ON device
           PRODUCE [Timestamp] int, [a] string
           USING new DataLakeTest.ReplaceNullReducer();
    
    1 回复  |  直到 7 年前
        1
  •  1
  •   Michael Rys    7 年前

    假设您有多个可能为null的行,您的解决方案似乎可以工作,或者您可以编写一个自定义减速机,对预排序列表进行操作并返回值。

    @raw = SELECT * FROM 
        ( VALUES
            (1, "Val1"),
            (2, (string) null),
            (3, "Val3"),
            (5, (string) null),
            (6, (string) null),
            (7, "Val7"),
            (8, "Val8")
        ) AS T(Timestamp, a);
    
    @res = REDUCE @raw PRESORT Timestamp ALL
           PRODUCE Timestamp int, a string
           USING new ReduceSample.ReplaceNullReducer();
    

    以下博客文章详细介绍了减速机: https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/