
What is the fastest way to check whether words from one string appear in another string?

performance regex ruby ruby-on-rails

Mike Trpcic · Tech Community · 14 years ago

I have a string of words; let's call it bad:

bad = "foo bar baz"

I can keep this as a space-separated string or as a list:

bad = bad.split(" ");

If I have another string, like this:

str = "This is my first foo string"

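What is the fastest way to check whether any word from the bad string is within my comparison string, and what is the fastest way to remove that word if it is found?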

#Find if a word is there
bad.split(" ").each do |word|
  found = str.include?(word)
end

#Remove the word
bad.split(" ").each do |word|
  str.gsub!(/#{word}/, "")
end

8 replies | latest 7 years ago

steenslag 14 years ago

If the list of bad words gets large, a hash is much faster:

    require 'benchmark'

    bad = ('aaa'..'zzz').to_a    # 17576 words
    str= "What's the fasted way to check if any word from the bad string is within my "
    str += "comparison string, and what's the fastest way to remove said word if it's "
    str += "found" 
    str *= 10

    badex = /\b(#{bad.join('|')})\b/i

    bad_hash = {}
    bad.each{|w| bad_hash[w] = true}

    n = 10
    Benchmark.bm(10) do |x|

      x.report('regex:') {n.times do 
        str.gsub(badex,'').squeeze(' ')
      end}

      x.report('hash:') {n.times do
        str.gsub(/\b\w+\b/){|word| bad_hash[word] ? '': word}.squeeze(' ')
      end}

    end
                user     system      total        real
regex:     10.485000   0.000000  10.485000 ( 13.312500)
hash:       0.000000   0.000000   0.000000 (  0.000000)
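
The benchmark above compares a case-insensitive regex with a case-sensitive hash lookup. A possible variation on the hash approach, sketched under the assumption that matching should ignore case, is shown here. The `remove_bad_words` name is made up for illustration, and a `Set` stands in for the hash.

```ruby
require 'set'

# Hypothetical helper, not part of the original answer: removes every word
# listed in bad_words from str, ignoring case.
def remove_bad_words(str, bad_words)
  lookup = bad_words.map(&:downcase).to_set   # constant-time membership test
  # Visit each word once; replace it with '' when its lowercase form is listed.
  str.gsub(/\b\w+\b/) { |word| lookup.include?(word.downcase) ? '' : word }
     .squeeze(' ')
     .strip
end

puts remove_bad_words("This is my first Foo string", %w[foo bar baz])
# prints: This is my first string
```

Note that `squeeze(' ')` also collapses runs of spaces that were already in the input, which may or may not be acceptable.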

jeem 14 years ago

bad = "foo bar baz"

=> "foo bar baz"

str = "This is my first foo string"

=> "This is my first foo string"

(str.split(' ') - bad.split(' ')).join(' ')

=> "This is my first string"

the Tin Man 12 years ago

None of the solutions catch bad words when the case doesn't match. The regex solution is the easiest to fix, by adding the ignore-case flag:

badex = /\b(#{bad.split.join('|')})\b/i

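Also, using "String".include?(" String ") causes boundary problems with the first and last words in the string, and with strings in which the target word is next to punctuation or is hyphenated. Testing for those cases requires a lot of additional code. Because of that, I think the regex solution is the best one. It is not the fastest, but it is more flexible out of the box, and if the other algorithms are adjusted to handle case folding and compound words, the regex solution might well come out ahead.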
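
To make the boundary problem concrete, here is a small sketch. The sample strings and the `whole_word` lambda are invented for illustration.

```ruby
bad   = 'foo bar baz'
badex = /\b(#{bad.split.join('|')})\b/i

# Illustrative helper: true when the regex finds a listed word as a whole word.
whole_word = ->(s) { !(s =~ badex).nil? }

# Plain substring search reports a hit inside a longer word.
p "ofoo".include?("foo")            # true, although "foo" is not a whole word here

# Padding with spaces avoids that, but misses words at the edges
# of the string or next to punctuation.
p "foo is first".include?(" foo ")  # false
p "I said foo.".include?(" foo ")   # false

# The word-boundary regex handles these cases, and /i handles case.
p whole_word.call("ofoo")           # false
p whole_word.call("Foo is first")   # true
p whole_word.call("I said foo.")    # true
```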

#!/usr/bin/ruby

require 'benchmark'

bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10

n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex:') do 
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end

  x.report('regex with squeeze:') do 
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end

  x.report('array subtraction') do
    n.times { (str.split(' ') - bad.split(' ')).join(' ') }
  end
end

I made the str variable a lot longer, to make the routines work harder.

                          user     system      total        real
regex:                0.740000   0.010000   0.750000 (  0.752846)
regex with squeeze:   0.570000   0.000000   0.570000 (  0.581304)
array subtraction     1.430000   0.010000   1.440000 (  1.449578)

Gah! I'm too used to how other languages handle their benchmarks. Now that I have it working, it looks much better!

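Just a small comment about what the OP appears to be trying to do: blacklisted-word removal is easy to fool and a pain to maintain. L33T-SP34K makes it trivial to sneak words through. Depending on the application, people will treat it as a game to find ways to get offensive words past the filter. The best solution I found when I was asked to work on this was to create a generator that produced all the variations of a word and dumped them into a database, where some process could check as soon as possible rather than in real time. Checking a million small strings can take a while if you are searching through a long list of offensive words; I'm sure we could come up with quite a list of things that someone would find offensive, but that's an exercise for another day.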
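
A minimal sketch of the kind of variant generator described above. The substitution table and the `leet_variants` name are assumptions made up for this example, not the author's actual generator, and a realistic table would be much larger.

```ruby
# Hypothetical generator: returns every spelling of word that can be produced
# by swapping letters for the look-alike characters listed in table.
# The default table is a tiny, made-up sample.
def leet_variants(word, table = { 'a' => %w[a 4], 'e' => %w[e 3],
                                  'o' => %w[o 0], 's' => %w[s 5] })
  return [''] if word.empty?

  # For each character, the list of characters that may stand in for it.
  choices = word.downcase.chars.map { |c| table.fetch(c, [c]) }
  first, *rest = choices
  # The Cartesian product enumerates one pick per position.
  first.product(*rest).map(&:join)
end

p leet_variants('foo')   # four variants: foo, fo0, f0o, f00
```

The number of variants grows multiplicatively with each substitutable letter, which fits the idea of precomputing them into a database instead of generating them per request.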

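I haven't seen anything in Ruby similar to Perl's Regexp::Assemble, but that was a good way to attack this kind of problem. You pass it an array of words, plus options for case folding and word boundaries, and it produces a regex pattern that matches all of the words, taking their commonalities into account to yield the smallest pattern that matches every word in the list. The problem after that is working out which word in the original string corresponds to each hit found by the pattern, so that it can be removed. Differences in word case and hits inside compound words make that replacement more interesting.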
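
Ruby's core library does offer `Regexp.union`, which escapes a list of strings and joins them into one alternation. It does not merge common prefixes the way Regexp::Assemble is described as doing, so the following is only a rough stand-in. The `bad_word_regex` helper and the sample text are names and data invented here. Passing a block to `gsub` is one way to see which listed word each hit corresponds to.

```ruby
# Hypothetical helper: builds a case-insensitive, whole-word pattern from a list.
def bad_word_regex(words)
  # Regexp.union escapes each string, so metacharacters in a word stay literal.
  Regexp.new("\\b(?:#{Regexp.union(words).source})\\b", Regexp::IGNORECASE)
end

badex = bad_word_regex(%w[foo bar baz])
hits  = []

# The block receives each matched word; record it (lowercased) and remove it.
cleaned = "My Foo met a BAR and a foobar".gsub(badex) { |match| hits << match.downcase; '' }
                                         .squeeze(' ')

puts cleaned        # prints: My met a and a foobar
p hits              # prints: ["foo", "bar"]
```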

And we won't even go into words that are benign or offensive depending on context.

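I added a more comprehensive test for the array-subtraction benchmark, to reflect how it would work in real code. The answer specifies an if clause, and the benchmark now reflects it: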

#!/usr/bin/env ruby

require 'benchmark'

bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10

str_split = str.split
bad_split = bad.split

n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex') do 
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end

  x.report('regex with squeeze') do 
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end

  x.report('bad.any?') do
    n.times { 
      if (bad_split.any? { |bw| str.include?(bw) })
        (str_split - bad_split).join(' ')
      end
    }
  end

  x.report('array subtraction') do
    n.times { (str_split - bad_split).join(' ') }
  end

end

Two test runs:

ruby test.rb 
                          user     system      total        real
regex                 1.000000   0.010000   1.010000 (  1.001093)
regex with squeeze    0.870000   0.000000   0.870000 (  0.873224)
bad.any?              1.760000   0.000000   1.760000 (  1.762195)
array subtraction     1.350000   0.000000   1.350000 (  1.346043)

ruby test.rb 
                          user     system      total        real
regex                 1.000000   0.010000   1.010000 (  1.004365)
regex with squeeze    0.870000   0.000000   0.870000 (  0.868525)
bad.any?              1.770000   0.000000   1.770000 (  1.775567)
array subtraction     1.360000   0.000000   1.360000 (  1.359100)

jeem 14 years ago

I usually make a point of not optimizing without measuring, but here's a WAG (a wild guess):

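To be fast, you should iterate over each string only once. You want to avoid a loop that performs (bad word count) * (str word count) inner comparisons. So you could build one big regexp and gsub with it.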

(adding foo variants to test that the word boundaries work)

str = "This is my first foo fooo ofoo string"

=> "This is my first foo fooo ofoo string"

badex = /\b(#{bad.split.join('|')})\b/

=> /\b(foo|bar|baz)\b/

str.gsub(badex,'').gsub('  ',' ')

=> "This is my first fooo ofoo string"

Of course, the resulting huge regexp might be as slow as the nested iteration implied in my other answer. The only way to know is to measure.

glenn jackman 14 years ago

bad = %w(foo bar baz)
str = "This is my first foo string"

# find the first word in the list
found = bad.find {|word| str.include?(word)}

# remove it (find returns nil when nothing matches, so guard the assignment)
str[found] = '' if found   # str => "This is my first  string"

Levi 14 years ago

I would benchmark this:

bad = "foo bar baz".split(' ')
str = "This is my first foo string".split(' ')

# 1. What's the fastest way to check if any word from the bad string is within my comparison string
p bad.any? { |bw| str.include?(bw) }

# 2. What's the fastest way to remove said word if it's found?
p (str - bad).join(' ')

any? stops checking as soon as it sees a match. If you can order your bad words by probability, you can save some cycles.
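
The short-circuit behavior can be observed by counting how many times the block runs. The `checks_until_match` helper and the word orderings below are made up for illustration.

```ruby
# Hypothetical helper: reports whether any bad word occurs in words, and how
# many bad words any? examined before it returned.
def checks_until_match(bad, words)
  checks = 0
  found = bad.any? { |bw| checks += 1; words.include?(bw) }
  [found, checks]
end

str = "This is my first foo string".split(' ')

p checks_until_match(%w[foo bar baz], str)   # [true, 1]  likely word listed first
p checks_until_match(%w[bar baz foo], str)   # [true, 3]  likely word listed last
p checks_until_match(%w[bar baz], str)       # [false, 2] no match, every word checked
```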

Jack Rothrock 7 years ago

Here is a method that checks for both words and phrases.

 def checkContent(str)
     bad = ["foo", "bar", "this place sucks", "or whatever"]

     # may be best to map and singularize everything as well. 
     # maybe add some regex to catch those pesky, "How i make $69 dollars each second online..."
     # maybe apply some comparison stuff to check for weird characters in those pesky, "How i m4ke $69 $ollars an hour"


     bad_hash = {}
     bad_phrase_hash = {}

     bad.map(&:downcase).each do |word|
         words = word.split().map(&:downcase)
         if words.length > 1
             words.each do |inner|
                if bad_hash.key?(inner)
                    if bad_hash[inner].is_a?(Hash) && !bad_hash[inner].key?(words.length)
                         bad_hash[inner][words.length] = true
                    elsif bad_hash[inner] === 1
                        bad_hash[inner] = {1=>true,words.length => true}
                    end
                else
                    bad_hash[inner] = {words.length => true}
                end
             end
             bad_phrase_hash[word] = true
         else
             bad_hash[word] = 1
         end
     end

     string = str.split().map(&:downcase)
     string.each_with_index do |word,index|
        if bad_hash.key?(word)
            if bad_hash[word].is_a?(Hash)
                if bad_hash[word].key?(1)
                    return false
                else
                    bad_hash[word].keys.sort.each do |length|
                        value = string[index...(index + length)].join(" ")
                        if bad_phrase_hash.key?(value)
                            return false
                        end
                    end
                end
            else
                return false
            end
        end
     end
     return true
  end


z-index 14 years ago

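The include? method is what you need. The Ruby documentation for String says:

str.include?(other_str) -> true or false

Returns true if str contains the given string or character.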

"hello".include? "lo"   #=> true
"hello".include? "ol"   #=> false
"hello".include? ?h     #=> true

Note that this is O(n), whereas what you proposed is O(n^2).