
How to select the substring that best matches another string

  •  1
  • Mohammad  · 6 years ago

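    Assume the complete string is

    The following example examines the string, looking for the first substring bounded by comas

    and the substring is

    substing bounded

    Is there any way, using SQL, to check whether the complete string contains a substring that matches with 90% similarity?

    For example, the strings substing bounded and substring bounded in my example.

    The substring can be a combination of several words, so I cannot simply split the complete string into words.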

    2 answers  |  6 years ago
        1
  •  2
  •   Marmite Bomber    6 years ago

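    First, convert the text into a table of words. You will find many posts on this topic, e.g. here

    You must adjust the list of delimiter characters so that only the words are extracted.

    This is a sample query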

     -- t1: the input text, t2: row generator for word positions, t3: one row per word
     with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual  ),
          t2 as (select  rownum colnum from dual connect by level < 16 /* (max) number of words */),
          t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum)))  col  from t1, t2 
          where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
     -- return only the word column, in the order the words appear in the text
     select col from t3 order by colnum;
    
    COL      
    ----------
    The        
    following  
    example    
    examines
    ...
    

    In the next step, use the Levenshtein distance to find the closest word.

     with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual  ),
          t2 as (select  rownum colnum from dual connect by level < 16 /* (max) number of words */),
          t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum)))  col  from t1, t2 
          where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
     select col, str, UTL_MATCH.EDIT_DISTANCE(col, str)  distance
     from t3
     cross join (select 'commas' str from dual)
     order by 3;
    
    COL        STR      DISTANCE
    ---------- ------ ----------
    comas      commas          1 
    for        commas          5 
    examines   commas          6 
    ...
    

    Check the definition of the Levenshtein distance and define a threshold on the distance to get your candidate words.
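
    The threshold step can be sketched as follows. This is a minimal sketch, not part of the original answer: it assumes the normalized score from UTL_MATCH.EDIT_DISTANCE_SIMILARITY (a value between 0 and 100) is an acceptable stand-in for a raw distance threshold, and the cutoff of 80 is an assumed value chosen for illustration. The word comas differs from commas by one edit in six characters, so a cutoff of 90 would likely reject it.

```sql
-- Sketch: keep only the words that are similar enough to the search term.
-- The cutoff 80 is an assumption for illustration, not a value from the answer.
with t1 as (select 'The following example examines the string, looking for the first substring bounded by comas' col from dual),
     t3 as (select regexp_substr(t1.col, '[^ ,]+', 1, level) col
            from t1
            connect by regexp_substr(t1.col, '[^ ,]+', 1, level) is not null)
select col,
       UTL_MATCH.EDIT_DISTANCE(col, 'commas')            distance,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(col, 'commas') similarity
from t3
where UTL_MATCH.EDIT_DISTANCE_SIMILARITY(col, 'commas') >= 80
order by distance;
```

    A normalized score has the advantage that one cutoff can serve for short and long search terms, whereas a fixed distance of 1 is strict for long words and loose for short ones.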

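    To match independently of word boundaries, simply scan through the input and take every substring that has the length of the match string, adjusted for the allowed difference, e.g. by adding about 10%.

    You can limit the candidates by keeping only the substrings that start on a word boundary. The rest is the same distance calculation.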

     with txt as (select  'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
          str as (select  'substing bounded' str from dual),
          t1 as (select  substr(txt, rownum, (select length(str) * 1.1 from str)) substr, /* add 10% length for the match */
                         (select str from str) str 
                 /* one candidate per start position, up to the last position where the match string still fits */
                 from txt connect by level <= (select length(txt) from txt) - (select length(str) from str) + 1) 
     select SUBSTR, STR, 
            UTL_MATCH.EDIT_DISTANCE(SUBSTR, STR)  distance
     from t1
     order by 3;
    
    SUBSTR               STR                DISTANCE
    -------------------- ---------------- ----------
    substring bounded    substing bounded          1 
    ubstring bounded     substing bounded          3 
     substring bounde    substing bounded          3 
    t substring bound    substing bounded          5 
    ...
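
    The 90% requirement from the question can be sketched on top of the same sliding window. This is a hedged sketch rather than the answer's own method: it assumes that UTL_MATCH.EDIT_DISTANCE_SIMILARITY, which returns a score between 0 and 100, is a suitable measure of a "90% match", and the column names pos and candidate are introduced here for illustration.

```sql
-- Sketch: slide a window over the text and keep only candidates that
-- score at least 90 against the search string.
with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
     str as (select 'substing bounded' str from dual),
     t1  as (select level pos,                                      -- start position of the window
                    substr(txt.txt, level, length(str.str) * 1.1) candidate, -- window is about 10% longer than the search string
                    str.str
             from txt cross join str
             connect by level <= length(txt.txt) - length(str.str) + 1)
select pos, candidate, str,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(candidate, str) similarity
from t1
where UTL_MATCH.EDIT_DISTANCE_SIMILARITY(candidate, str) >= 90
order by similarity desc;
```

    With the sample data, the candidate substring bounded is one edit away from the search string, so it should pass the cutoff, while the shifted windows around it (three or more edits) should not.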
    
        2
  •  0
  •   TenG    6 years ago

    Experiment with the SOUNDEX function.

    I have not tested this, but it may help you:

        WITH strings AS (
          -- '[^ ]+' matches runs of non-space characters, i.e. the words themselves
          select regexp_substr('The following example examines the string, looking for the first substring bounded by comas','[^ ]+', 1, level) ss 
          from dual
          connect by regexp_substr('The following example examines the string, looking for the first substring bounded by comas', '[^ ]+', 1, level) is not null
        )
        SELECT ss 
        FROM strings
        WHERE SOUNDEX(ss) = SOUNDEX( 'commas' ) ;
    

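    The REGEXP_SUBSTR with CONNECT BY splits the long string into words (on spaces). Modify the delimiter as required to include punctuation etc.

    Here we rely on the built-in SOUNDEX matching our expectations.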
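
    To see why the comparison above can succeed, the two codes can be inspected directly. This is a small illustrative query, not part of the original answer; the column aliases are made up for the example.

```sql
-- Compare the SOUNDEX codes of the misspelled word and the intended word.
-- Both evaluate to C520: the doubled "m" collapses into a single code digit.
select SOUNDEX('comas')  code_in_text,
       SOUNDEX('commas') code_searched
from dual;
```

    SOUNDEX keeps only the first letter and three digits, so unrelated words can share a code. It is safer to treat its matches as candidates and confirm them with an edit-distance check.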