代码之家  ›  专栏  ›  技术社区  ›  Elliott Brossard

什么等价于BigQuery中的OTRANSLATE?

  •  0
  • Elliott Brossard  · 技术社区  · 5 年前

    我正在尝试将一个查询转换为在BigQuery中运行,它使用 OTRANSLATE function 来自Teradata。例如,

    SELECT OTRANSLATE(text, 'ehlo', 'EHLO')
    FROM (
      SELECT 'hello world' AS text UNION ALL
      SELECT 'elliott'
    );
    

    这将产生:

    HELLO wOrLd
    ELLiOtt
    

    1 回复  |  直到 5 年前
        1
  •  2
  •   Mikhail Berlyant    5 年前

    另一种稍有不同的方法(BigQuery标准SQL)

    #standardSQL
    CREATE TEMP FUNCTION OTRANSLATE(text STRING, from_string STRING, to_string STRING) AS ((
      SELECT STRING_AGG(IFNULL(y, a), '' ORDER BY pos)
      FROM UNNEST(SPLIT(text, '')) a WITH OFFSET pos
      LEFT JOIN (
        SELECT x, y
        FROM UNNEST(SPLIT(from_string, '')) x WITH OFFSET
        JOIN UNNEST(SPLIT(to_string, '')) y WITH OFFSET
        USING(OFFSET)
      )
      ON a = x
    ));
    WITH `project.dataset.table` AS (
      SELECT 'hello world' AS text UNION ALL
      SELECT 'elliott'
    )
    SELECT text, OTRANSLATE(text, 'ehlo', 'EHLO') as new_text
    FROM `project.dataset.table`   
    

    有输出

    Row     text            new_text     
    1       hello world     HELLO wOrLd  
    2       elliott         ELLiOtt   
    

    注意:以上版本假设from和to字符串的长度相等,from字符串中没有重复的字符

    更新以跟踪BigQuery中该函数版本的“扩展期望”

    #standardSQL
    CREATE TEMP FUNCTION OTRANSLATE(text STRING, from_string STRING, to_string STRING) AS ((
      SELECT STRING_AGG(IFNULL(y, a), '' ORDER BY pos)
      FROM UNNEST(SPLIT(text, '')) a WITH OFFSET pos
      LEFT JOIN (
        SELECT x, ARRAY_AGG(IFNULL(y, '') ORDER BY OFFSET LIMIT 1)[OFFSET(0)] y
        FROM UNNEST(SPLIT(from_string, '')) x WITH OFFSET
        LEFT JOIN UNNEST(SPLIT(to_string, '')) y WITH OFFSET
        USING(OFFSET)
        GROUP BY x
      )
      ON a = x
    ));
    SELECT -- text, OTRANSLATE(text, 'ehlo', 'EHLO') as new_text
      OTRANSLATE("hello world", "", "EHLO") AS empty_from, -- 'hello world'
      OTRANSLATE("hello world", "hello world1", "EHLO") AS larger_from_than_source, -- 'EHLLL'
      OTRANSLATE("hello world", "ehlo", "EHLO") AS equal_size_from_to, -- 'HELLO wOrLd'
      OTRANSLATE("hello world", "ehlo", "EH") AS larger_size_from, -- 'HE wrd'
      OTRANSLATE("hello world", "ehlo", "EHLOPQ") AS larger_size_to, -- 'hello world'
      OTRANSLATE("hello world", "ehlo", "") AS empty_to; -- 'wrd'
    

    有结果的

    Row empty_from  larger_from_than_source equal_size_from_to  larger_size_from    larger_size_to  empty_to     
    1   hello world EHLLL                   HELLO wOrLd             HE wrd          HELLO wOrLd     wrd    
    .   
    

    注意:这个函数的Teradata版本是递归的,所以当前的实现不是Teradata的OTRANSLATE的精确实现

    使用说明
    如果from_string中的第一个字符出现在源_string中,则它的所有出现都将替换为to_string中的第一个字符。对from_string中的所有字符和from_string中的所有字符重复此操作。替换是逐字符执行的,也就是说,第二个字符的替换是对第一个字符替换后的字符串执行的。

    这可以很容易地用JS UDF实现,这很简单,我想我不会朝这个方向:o)

        2
  •  1
  •   Elliott Brossard    5 年前

    是的,可以使用字符串上的数组操作来完成此操作。以下是一个解决方案:

    CREATE TEMP FUNCTION OTRANSLATE(s STRING, key STRING, value STRING) AS (
      (SELECT
         STRING_AGG(
           IFNULL(
             (SELECT value[OFFSET(
                SELECT o FROM UNNEST(SPLIT(key, '')) AS k WITH OFFSET o2
                WHERE k = c)]
             ),
             c),
           '' ORDER BY o1)
       FROM UNNEST(SPLIT(s, '')) AS c WITH OFFSET o1)
      )
    );
    
    SELECT OTRANSLATE(text, 'ehlo', 'EHLO')
    FROM (
      SELECT 'hello world' AS text UNION ALL
      SELECT 'elliott'
    );
    

    这样做的目的是找到角色在 key value 弦。如果在 字符串,我们最后得到一个空偏移量,因此 IFNULL 使其返回未映射的字符。然后我们再聚合成一个字符串,按字符偏移量排序。

    编辑:这里有一个变体,它还处理键和值长度的差异:

    CREATE TEMP FUNCTION otranslate(s STRING, key STRING, value STRING) AS (
      IF(LENGTH(key) < LENGTH(value) OR LENGTH(s) < LENGTH(key), s,
      (SELECT
         STRING_AGG(
           IFNULL(
             (SELECT ARRAY_CONCAT([c], SPLIT(value, ''))[SAFE_OFFSET((
                SELECT IFNULL(MIN(o2) + 1, 0) FROM UNNEST(SPLIT(key, '')) AS k WITH OFFSET o2
                WHERE k = c))]
             ),
             ''),
           '' ORDER BY o1)
       FROM UNNEST(SPLIT(s, '')) AS c WITH OFFSET o1
      ))
    );
    SELECT
      otranslate("hello world", "", "EHLO") AS empty_from, -- 'hello world'
      otranslate("hello world", "hello world1", "EHLO") AS larger_from_than_source, -- 'hello world'
      otranslate("hello world", "ehlo", "EHLO") AS equal_size_from_to, -- 'HELLO wOrLd'
      otranslate("hello world", "ehlo", "EH") AS larger_size_from, -- 'HE wrd'
      otranslate("hello world", "ehlo", "EHLOPQ") AS larger_size_to, -- 'hello world'
      otranslate("hello world", "ehlo", "") AS empty_to; -- 'wrd'