代码之家  ›  专栏  ›  技术社区  ›  Kaushik Nayak

在PostgreSQL中按字母顺序对字符串中的字母进行排序

  •  3
  • Kaushik Nayak  · 技术社区  · 6 年前

    我目前正在使用这种方法在PostgreSQL中按字母顺序对字符串中的字母进行排序。还有其他有效的方法吗?

    select string_agg(c, '') as s
    from   (select unnest(regexp_split_to_array('ijsAafhareDbv', '')) as c 
            order  by c) as t; 
    
           s   
     --------------
     ADaabefhijrsv
    
    3 回复  |  直到 6 年前
        1
  •  4
  •   Kaushik Nayak    6 年前

    我创建了3个函数,一个使用我的查询,另一个使用Laurenz的查询,还有一个:我创建了一个用于排序的Python(plpythonu)函数。最后,我创建了一个包含100000行的表(我现在是从Mac笔记本电脑上创建的) 每个包含一个随机的15个字符的字符串,使用 random_string 在这方面的作用 Link

    create table t as select random_string(15) as s FROM generate_series(1,100000);
    

    下面是3个函数。

    CREATE or REPLACE FUNCTION sort1(x TEXT) RETURNS TEXT AS $$
    select string_agg(c, '') as s
    from   (select unnest(regexp_split_to_array($1, '')) as c 
            order  by c) as t;
    $$ LANGUAGE SQL IMMUTABLE;
    
    
    CREATE or REPLACE FUNCTION sort2(x TEXT) RETURNS TEXT AS $$
    WITH t(s) AS (VALUES ($1))
    SELECT string_agg(substr(t.s, g.g, 1), ''
                      ORDER BY substr(t.s, g.g, 1)
                     )
    FROM t
       CROSS JOIN LATERAL generate_series(1, length(t.s)) g;
    
    $$ LANGUAGE SQL IMMUTABLE;
    
    
    create language plpythonu;
    CREATE or REPLACE FUNCTION pysort(x text)
      RETURNS text
    AS $$
      return ''.join(sorted(x))
    $$ LANGUAGE plpythonu IMMUTABLE;
    

    这些是来自 EXPLAIN ANALYSE 为了这三个人。

    knayak=# EXPLAIN ANALYSE select sort1(s)  FROM t;
                                                     QUERY PLAN                                                  
    -------------------------------------------------------------------------------------------------------------
     Seq Scan on t  (cost=0.00..26541.00 rows=100000 width=32) (actual time=0.266..7097.740 rows=100000 loops=1)
     Planning time: 0.119 ms
     Execution time: 7106.871 ms
    (3 rows)
    
    knayak=# EXPLAIN ANALYSE select sort2(s)  FROM t;
                                                     QUERY PLAN                                                  
    -------------------------------------------------------------------------------------------------------------
     Seq Scan on t  (cost=0.00..26541.00 rows=100000 width=32) (actual time=0.418..7012.935 rows=100000 loops=1)
     Planning time: 0.270 ms
     Execution time: 7021.587 ms
    (3 rows)
    
    knayak=# EXPLAIN ANALYSE select pysort(s) FROM t;
                                                     QUERY PLAN                                                 
    ------------------------------------------------------------------------------------------------------------
     Seq Scan on t  (cost=0.00..26541.00 rows=100000 width=32) (actual time=0.060..389.729 rows=100000 loops=1)
     Planning time: 0.048 ms
     Execution time: 395.760 ms
    (3 rows)
    

    从这个分析来看- Python排序 是最快的,前两名之间没有显著差异。不过,我们需要实时检查系统中大型表的性能。

        2
  •  2
  •   Erwin Brandstetter    6 年前

    在中实现的函数 C 非常 比我们所能达到的任何速度都快 LANGUAGE sql plpgsql 所以 your plpythonu function 以绝对优势赢得表演比赛。

    但是 大蟒蛇 是一个 不可信 程序语言。默认情况下不会安装,只有超级用户才能使用不受信任的语言创建函数。你需要意识到安全隐患。而且在大多数云服务上根本不存在不受信任的语言。
    The current manual (quote from pg 10):

    PL/Python只能作为不受信任的语言使用,这意味着 不提供任何方式来限制用户可以在其中做什么,并且 故名 大蟒蛇 .可靠的变种 plpython 可能会变成 如果开发了安全的执行机制,则在将来可用 在Python中。在不受信任的PL/Python中编写函数的人必须 注意,该函数不能用于做任何不需要的事情,因为 它将能够做任何登录用户可以做的事情 作为数据库管理员。只有超级用户才能在中创建函数 不受信任的语言,例如 大蟒蛇 .

    您测试的SQL函数没有得到很好的优化。有一千零一个 提高绩效的方法 然而:

    演示

    -- func to create random strings
    CREATE OR REPLACE FUNCTION f_random_string(int)
      RETURNS text AS
    $func$
    SELECT array_to_string(ARRAY(
       SELECT substr('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', (ceil(random()*62))::int, 1)
       FROM generate_series(1, $1)
       ), '')
    $func$  LANGUAGE sql VOLATILE;
    
    -- test tbl with 100K rows
    CREATE TABLE tbl(str text);
    INSERT INTO tbl
    SELECT f_random_string(15)
    FROM   generate_series(1, 100000) g;
    
    VACUUM ANALYZE tbl;
    
    -- 1: your test function 1 (inefficient)
    CREATE OR REPLACE FUNCTION sort1(text)  RETURNS text AS
    $func$  -- your test function 1 (very inefficient)
    SELECT string_agg(c, '')
    FROM  (SELECT unnest(regexp_split_to_array($1, '')) AS c ORDER  BY c) t;
    $func$ LANGUAGE sql IMMUTABLE;
    
    -- 2: your test function 2 ( inefficient)
    CREATE OR REPLACE FUNCTION sort2(text)  RETURNS text AS
    $func$
    WITH t(s) AS (VALUES ($1))
    SELECT string_agg(substr(t.s, g.g, 1), '' ORDER BY substr(t.s, g.g, 1))
    FROM   t
    CROSS  JOIN LATERAL generate_series(1, length(t.s)) g;
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 3: remove pointless CTE from sort2
    CREATE OR REPLACE FUNCTION sort3(text)  RETURNS text AS
    $func$
    SELECT string_agg(substr($1, g, 1), '' ORDER BY substr($1, g, 1))
    FROM   generate_series(1, length($1)) g;
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 4: use unnest instead of calling substr N times
    CREATE OR REPLACE FUNCTION sort4(text)  RETURNS text AS
    $func$
    SELECT string_agg(c, '' ORDER BY c)
    FROM   unnest(string_to_array($1, NULL)) c
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 5: ORDER BY in subquery
    CREATE OR REPLACE FUNCTION sort5(text)  RETURNS text AS
    $func$
    SELECT string_agg(c, '')
    FROM  (
       SELECT c
       FROM   unnest(string_to_array($1, NULL)) c
       ORDER  BY c
       ) sub
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 6: SRF in SELECT list
    CREATE OR REPLACE FUNCTION sort6(text)  RETURNS text AS
    $func$
    SELECT string_agg(c, '')
    FROM  (SELECT unnest(string_to_array($1, NULL)) c ORDER BY 1) sub
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 7: ARRAY constructor instead of aggregate func
    CREATE OR REPLACE FUNCTION sort7(text)  RETURNS text AS
    $func$
    SELECT array_to_string(ARRAY(SELECT unnest(string_to_array($1, NULL)) c ORDER BY c), '')
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 8: The same with COLLATE "C"
    CREATE OR REPLACE FUNCTION sort8(text)  RETURNS text AS
    $func$
    SELECT array_to_string(ARRAY(SELECT unnest(string_to_array($1 COLLATE "C", NULL)) c ORDER BY c), '')
    $func$  LANGUAGE sql IMMUTABLE;
    
    SELECT str, sort1(str), sort2(str), sort3(str), sort4(str), sort5(str), sort6(str), sort7(str), sort8(str) FROM tbl LIMIT 1;  -- result sample 
    
    str             | sort1           | sort2           | sort3           | sort4           | sort5           | sort6           | sort7           | sort8          
    :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :--------------
    tUkmori4D1rHhI1 | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DHIUhikmorrt
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort1(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.053 ms                                                                  |
    | Execution time: 2742.904 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort2(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.105 ms                                                                  |
    | Execution time: 2579.397 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort3(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.079 ms                                                                  |
    | Execution time: 2191.228 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort4(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.075 ms                                                                  |
    | Execution time: 2194.780 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort5(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.083 ms                                                                  |
    | Execution time: 1902.829 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort6(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.075 ms                                                                  |
    | Execution time: 1866.407 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort7(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.067 ms                                                                  |
    | Execution time: 1863.713 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort8(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.074 ms                                                                  |
    | Execution time: 1569.376 ms                                                              |
    

    db<>不停摆弄 here

    最后一个没有 COLLATION 严格来说,这是由字符的字节值决定的,这要便宜得多。但你可能会也可能不会 需要 不同区域设置的排序顺序。

    The manual about COLLATION expressions.

        3
  •  2
  •   Laurenz Albe    6 年前

    如果想要一个没有正则表达式的解决方案,可以使用以下方法:

    WITH t(s) AS (VALUES ('amfjwzeils'))
    SELECT string_agg(substr(t.s, g.g, 1), ''
                      ORDER BY substr(t.s, g.g, 1)
                     )
    FROM t
       CROSS JOIN LATERAL generate_series(1, length(t.s)) g;
    
     string_agg 
    ------------
     aefijlmswz
    (1 row)
    

    我会测试哪个解决方案更快。