
Java: replacing multiple different substrings in a string at once (or in the most efficient way)

  •  117
  • Yossale  ·  16 years ago

    I need to replace many different substrings in a string in the most efficient way.

    11 answers
        1
  •  88
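  •   Todd Owen    16 years ago

    If the strings you are operating on are very long, or you are operating on many strings, then it could be worthwhile to use java.util.regex.Matcher (this requires time up front to compile the pattern, so it won't be efficient if your input is very small or your search pattern changes frequently).

    Below is a full example, based on a list of tokens taken from a map. (It uses StringUtils from Apache Commons Lang.)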

    Map<String,String> tokens = new HashMap<String,String>();
    tokens.put("cat", "Garfield");
    tokens.put("beverage", "coffee");
    
    String template = "%cat% really needs some %beverage%.";
    
    // Create pattern of the format "%(cat|beverage)%"
    String patternString = "%(" + StringUtils.join(tokens.keySet(), "|") + ")%";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(template);
    
    StringBuffer sb = new StringBuffer();
    while(matcher.find()) {
        // quoteReplacement keeps '$' and '\' in the replacement value literal
        matcher.appendReplacement(sb, Matcher.quoteReplacement(tokens.get(matcher.group(1))));
    }
    matcher.appendTail(sb);
    
    System.out.println(sb.toString());
    

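    Once the regular expression is compiled, scanning the input string is generally very quick (although if your regular expression is complex or involves backtracking, you would still need to benchmark to confirm this!).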
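    A variant sketch of the same idea for Java 9 or later, without Commons Lang. It quotes each key with Pattern.quote and each replacement value with Matcher.quoteReplacement, so characters such as `$` or `\` are taken literally. The class and method names are illustrative and not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TokenReplacer {
    /** Replaces every %key% in the template with its mapped value in one pass. */
    static String replaceTokens(String template, Map<String, String> tokens) {
        if (tokens.isEmpty()) {
            return template; // nothing to look for
        }
        // Quote each key so regex metacharacters in keys are matched literally.
        String alternation = tokens.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile("%(" + alternation + ")%");
        // Matcher.replaceAll(Function) requires Java 9 or later.
        return pattern.matcher(template).replaceAll(
                m -> Matcher.quoteReplacement(tokens.get(m.group(1))));
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("cat", "Garfield");
        tokens.put("beverage", "coffee");
        // prints: Garfield really needs some coffee.
        System.out.println(replaceTokens("%cat% really needs some %beverage%.", tokens));
    }
}
```

    If the same token set is reused across many templates, compiling the Pattern once and keeping it in a field would avoid the repeated compile cost mentioned above.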

        2
  •  40
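  •   Community Mohan Dere    9 years ago

    One of the most efficient ways to replace matching strings (without regular expressions) is to use the Aho-Corasick algorithm with a performant Trie (pronounced "try"), a fast hashing algorithm, and an efficient collections implementation.

    Simple Code

    A simple solution leverages Apache's StringUtils.replaceEach as follows: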

      private String testStringUtils(
        final String text, final Map<String, String> definitions ) {
        final String[] keys = keys( definitions );
        final String[] values = values( definitions );
    
        return StringUtils.replaceEach( text, keys, values );
      }
    

    This slows down on large texts.
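    The keys and values helpers called above are not shown in the answer. One plausible sketch, assuming they simply turn the map into two parallel arrays (the class name is hypothetical):

```java
import java.util.Map;

class DefinitionArrays {
    /** Returns the map's keys in the map's iteration order. */
    static String[] keys(Map<String, String> definitions) {
        return definitions.keySet().toArray(new String[0]);
    }

    /** Returns the values in the same order as keys(definitions). */
    static String[] values(Map<String, String> definitions) {
        String[] keys = keys(definitions);
        String[] values = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            // Look each value up by key so values[i] always pairs with keys[i].
            values[i] = definitions.get(keys[i]);
        }
        return values;
    }
}
```

    StringUtils.replaceEach pairs the two arrays by index, which is why the sketch derives the values from the keys rather than iterating the map a second time independently.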

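    Bor's implementation of the Aho-Corasick algorithm introduces a bit more complexity, which becomes an implementation detail when hidden behind a façade with the same method signature: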

      private String testBorAhoCorasick(
        final String text, final Map<String, String> definitions ) {
        // Create a buffer sufficiently large that re-allocations are minimized.
        final StringBuilder sb = new StringBuilder( text.length() << 1 );
    
        final TrieBuilder builder = Trie.builder();
        builder.onlyWholeWords();
        builder.removeOverlaps();
    
        final String[] keys = keys( definitions );
    
        for( final String key : keys ) {
          builder.addKeyword( key );
        }
    
        final Trie trie = builder.build();
        final Collection<Emit> emits = trie.parseText( text );
    
        int prevIndex = 0;
    
        for( final Emit emit : emits ) {
          final int matchIndex = emit.getStart();
    
          sb.append( text.substring( prevIndex, matchIndex ) );
          sb.append( definitions.get( emit.getKeyword() ) );
          prevIndex = emit.getEnd() + 1;
        }
    
        // Add the remainder of the string (contains no more matches).
        sb.append( text.substring( prevIndex ) );
    
        return sb.toString();
      }
    

    Benchmarks

    For the benchmarks, the buffer was created using randomNumeric as follows:

      private final static int TEXT_SIZE = 1000;
      private final static int MATCHES_DIVISOR = 10;
    
      private final static StringBuilder SOURCE
        = new StringBuilder( randomNumeric( TEXT_SIZE ) );
    

    where MATCHES_DIVISOR dictates the number of variables to inject:

      private void injectVariables( final Map<String, String> definitions ) {
        for( int i = (SOURCE.length() / MATCHES_DIVISOR) + 1; i > 0; i-- ) {
          final int r = current().nextInt( 1, SOURCE.length() );
          SOURCE.insert( r, randomKey( definitions ) );
        }
      }
    

    The timing code itself (JMH seemed like overkill):

    long duration = System.nanoTime();
    final String result = testBorAhoCorasick( text, definitions );
    duration = System.nanoTime() - duration;
    System.out.println( elapsed( duration ) );
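    The elapsed helper is not shown in the answer. A minimal sketch, assuming it formats a nanosecond duration as whole seconds plus total milliseconds, which is the shape of the results reported in this answer (the class name is hypothetical):

```java
import java.util.concurrent.TimeUnit;

class Timing {
    /** Formats a nanosecond duration as "<seconds> seconds, <total millis> millis". */
    static String elapsed(long nanos) {
        long seconds = TimeUnit.NANOSECONDS.toSeconds(nanos);
        long millis = TimeUnit.NANOSECONDS.toMillis(nanos);
        return seconds + " seconds, " + millis + " millis";
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Stand-in workload; the answer times testBorAhoCorasick here.
        String result = "abc".repeat(100_000).replace("b", "-");
        long duration = System.nanoTime() - start;
        System.out.println(result.length() + " chars in " + elapsed(duration));
    }
}
```

    A single System.nanoTime measurement like this includes JIT warm-up and GC noise, so the figures are best read as rough orders of magnitude.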
    

    1,000,000 : 1,000

    A simple micro-benchmark with 1,000,000 characters and 1,000 randomly placed strings to replace.

    • testStringUtils: 25 seconds, 25533 millis
    • testBorAhoCorasick: 0 seconds, 68 millis

    No contest.

    10,000 : 1,000

    • testStringUtils: 1 seconds, 1402 millis
    • testBorAhoCorasick: 0 seconds, 37 millis

    The gap narrows.

    1,000 : 10

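    Using 1,000 characters and 10 matching strings to replace:

    • testStringUtils: 0 seconds, 7 millis
    • testBorAhoCorasick: 0 seconds, 19 millis

    For short strings, the overhead of setting up Aho-Corasick outweighs the cost of the brute-force approach taken by StringUtils.replaceEach.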

    Implementations

    For texts longer than 1 MB, consider comparing other implementations as well.


        3
  •  7
  •   Steve McLeod    16 years ago

    This worked for me:

    String result = input.replaceAll("string1|string2|string3","replacementString");
    

    Example:

    String input = "applemangobananaarefruits";
    String result = input.replaceAll("mango|are|ts","-");
    System.out.println(result);
    

    Output: apple-banana-frui-

        4
  •  4
  •   Brian Agnew    14 years ago

    If you are going to change a String many times, it is usually more efficient to use a StringBuilder (but measure your performance to find out):

    String str = "The rain in Spain falls mainly on the plain";
    StringBuilder sb = new StringBuilder(str);
    // do your replacing in sb - although you'll find this trickier than simply using String
    String newStr = sb.toString();
    

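    Every time you do a replace on a String, a new String object is created, because Strings are immutable. StringBuilder is mutable, that is, it can be changed as much as you want.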
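    The "do your replacing in sb" step could be sketched as follows, using only StringBuilder.indexOf and StringBuilder.replace. The helper and class names are illustrative, not from the answer.

```java
class BuilderReplace {
    /** Replaces every occurrence of target in sb with replacement, in place. */
    static void replaceAll(StringBuilder sb, String target, String replacement) {
        if (target.isEmpty()) {
            return; // an empty target would match everywhere and never terminate
        }
        int index = sb.indexOf(target);
        while (index != -1) {
            sb.replace(index, index + target.length(), replacement);
            // Resume after the inserted text so the replacement is never re-scanned.
            index = sb.indexOf(target, index + replacement.length());
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("The rain in Spain falls mainly on the plain");
        replaceAll(sb, "ain", "AIN");
        // prints: The rAIN in SpAIN falls mAINly on the plAIN
        System.out.println(sb);
    }
}
```

    Calling this helper once per search string applies the replacements one after another, so the output of an earlier replacement can be matched by a later one. It is not a simultaneous replacement.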

        5
  •  2
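  •   Avi    16 years ago

    StringBuilder will perform the replacements more efficiently, since its character array buffer can be sized to the required length.

    Of course, the real question is whether this is an optimisation too far. The JVM is very good at handling the creation of many objects and the subsequent garbage collection, and as with all optimisation questions, my first question is whether you have measured this and determined that it is a problem.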

        6
  •  2
  •   Gelin Luo    13 years ago

    Check this:

    String.format(str,STR[])
    

    For example:

    String.format( "Put your %s where your %s is", "money", "mouth" );
    
        7
  •  2
  •   bikram    7 years ago

    Rythm, a Java template engine, now ships with a new feature called String interpolation mode, which allows you to do something like this:

    String result = Rythm.render("@name is inviting you", "Diana");
    

    The example above shows that you can pass arguments to a template by position. Rythm also allows you to pass arguments by name:

    Map<String, Object> args = new HashMap<String, Object>();
    args.put("title", "Mr.");
    args.put("name", "John");
    String result = Rythm.render("Hello @title @name", args);
    

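    Note: Rythm is very fast, about 2 to 3 times faster than String.format and Velocity, because it compiles the template into Java bytecode; the runtime performance is very close to concatenation with StringBuilder.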


        8
  •  1
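  •   Community Mohan Dere    14 years ago

    The following is based on Todd Owen's answer. That solution has the problem that if the replacements contain characters that have special meaning in regular expressions, you can get unexpected results. I also wanted to be able to optionally do a case-insensitive search. Here is what I came up with: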

    /**
     * Performs simultaneous search/replace of multiple strings. Case Sensitive!
     */
    public String replaceMultiple(String target, Map<String, String> replacements) {
      return replaceMultiple(target, replacements, true);
    }
    
    /**
     * Performs simultaneous search/replace of multiple strings.
     * 
     * @param target        string to perform replacements on.
     * @param replacements  map where key represents value to search for, and value represents replacem
     * @param caseSensitive whether or not the search is case-sensitive.
     * @return replaced string
     */
    public String replaceMultiple(String target, Map<String, String> replacements, boolean caseSensitive) {
      if(target == null || "".equals(target) || replacements == null || replacements.size() == 0)
        return target;
    
      //if we are doing case-insensitive replacements, we need to make the map case-insensitive--make a new map with all-lower-case keys
      if(!caseSensitive) {
        Map<String, String> altReplacements = new HashMap<String, String>(replacements.size());
        for(String key : replacements.keySet())
          altReplacements.put(key.toLowerCase(), replacements.get(key));
    
        replacements = altReplacements;
      }
    
      StringBuilder patternString = new StringBuilder();
      if(!caseSensitive)
        patternString.append("(?i)");
    
      patternString.append('(');
      boolean first = true;
      for(String key : replacements.keySet()) {
        if(first)
          first = false;
        else
          patternString.append('|');
    
        patternString.append(Pattern.quote(key));
      }
      patternString.append(')');
    
      Pattern pattern = Pattern.compile(patternString.toString());
      Matcher matcher = pattern.matcher(target);
    
      StringBuffer res = new StringBuffer();
      while(matcher.find()) {
        String match = matcher.group(1);
        if(!caseSensitive)
          match = match.toLowerCase();
        // quoteReplacement keeps '$' and '\' in the replacement value literal
        matcher.appendReplacement(res, Matcher.quoteReplacement(replacements.get(match)));
      }
      matcher.appendTail(res);
    
      return res.toString();
    }
    

    Here are my unit test cases:

    @Test
    public void replaceMultipleTest() {
      assertNull(ExtStringUtils.replaceMultiple(null, null));
      assertNull(ExtStringUtils.replaceMultiple(null, Collections.<String, String>emptyMap()));
      assertEquals("", ExtStringUtils.replaceMultiple("", null));
      assertEquals("", ExtStringUtils.replaceMultiple("", Collections.<String, String>emptyMap()));
    
      assertEquals("folks, we are not sane anymore. with me, i promise you, we will burn in flames", ExtStringUtils.replaceMultiple("folks, we are not winning anymore. with me, i promise you, we will win big league", makeMap("win big league", "burn in flames", "winning", "sane")));
    
      assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abccbaabccba", makeMap("a", "b", "b", "c", "c", "a")));
      assertEquals("bcaCBAbcCCBb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a")));
      assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a"), false));
    
      assertEquals("c colon  backslash temp backslash  star  dot  star ", ExtStringUtils.replaceMultiple("c:\\temp\\*.*", makeMap(".", " dot ", ":", " colon ", "\\", " backslash ", "*", " star "), false));
    }
    
    private Map<String, String> makeMap(String ... vals) {
      Map<String, String> map = new HashMap<String, String>(vals.length / 2);
      for(int i = 1; i < vals.length; i+= 2)
        map.put(vals[i-1], vals[i]);
      return map;
    }
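
    The cyclic a/b/c cases in the tests above only pass because all replacements happen in a single pass. A small sketch contrasting that with chained String.replace calls; the class and method names are illustrative, and the single-pass method is a reduced, case-sensitive version of the regex technique rather than the answer's exact code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChainedVsSimultaneous {
    /** Applies the three replacements one after another. */
    static String chained(String s) {
        return s.replace("a", "b").replace("b", "c").replace("c", "a");
    }

    /** Applies all replacements in a single left-to-right pass. */
    static String simultaneous(String s, Map<String, String> replacements) {
        if (replacements.isEmpty()) {
            return s;
        }
        StringBuilder alternation = new StringBuilder();
        for (String key : replacements.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(key)); // match keys literally
        }
        Matcher matcher = Pattern.compile(alternation.toString()).matcher(s);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Each match is looked up in the original map, never in earlier output.
            matcher.appendReplacement(out,
                    Matcher.quoteReplacement(replacements.get(matcher.group())));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cycle = new HashMap<>();
        cycle.put("a", "b");
        cycle.put("b", "c");
        cycle.put("c", "a");
        System.out.println(chained("abccba"));             // prints: aaaaaa
        System.out.println(simultaneous("abccba", cycle)); // prints: bcaacb
    }
}
```

    In the chained version every character ends up as "a", because each call also rewrites the output of the previous one.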
    
        9
  •  0
  •   Robin479    9 years ago

    How about using the replaceAll() method?

        10
  •  0
  •   Community Mohan Dere    9 years ago
    public String replace(String input, Map<String, String> pairs) {
      // Reverse lexic-order of keys is good enough for most cases,
      // as it puts longer words before their prefixes ("tool" before "too").
      // However, there are corner cases, which this algorithm doesn't handle
      // no matter what order of keys you choose, eg. it fails to match "edit"
      // before "bed" in "..bedit.." because "bed" appears first in the input,
      // but "edit" may be the desired longer match. Depends which you prefer.
      final Map<String, String> sorted = 
          new TreeMap<String, String>(Collections.reverseOrder());
      sorted.putAll(pairs);
      final String[] keys = sorted.keySet().toArray(new String[sorted.size()]);
      final String[] vals = sorted.values().toArray(new String[sorted.size()]);
      final int lo = 0, hi = input.length();
      final StringBuilder result = new StringBuilder();
      int s = lo;
      for (int i = s; i < hi; i++) {
        for (int p = 0; p < keys.length; p++) {
          if (input.regionMatches(i, keys[p], 0, keys[p].length())) {
            /* TODO: check for "edit", if this is "bed" in "..bedit.." case,
             * i.e. look ahead for all prioritized/longer keys starting within
             * the current match region; iff found, then ignore match ("bed")
             * and continue search (find "edit" later), else handle match. */
            // if (better-match-overlaps-right-ahead)
            //   continue;
            result.append(input, s, i).append(vals[p]);
            i += keys[p].length();
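            s = i--;
            // Stop testing further keys here; otherwise another key could match
            // inside the region just consumed and append(input, s, i) would throw.
            break;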
          }
        }
      }
      if (s == lo) // no matches? no changes!
        return input;
      return result.append(input, s, hi).toString();
    }