代码之家 › 专栏 › 技术社区 › Jeffrey Blake

Preg_更换效率

preg-replace performance loops regex php

Jeffrey Blake · 技术社区 · 14 年前

执行摘要:

preg_replace() 比字符串比较运行得快。为什么?正则表达式不应该慢一点吗?

在一个 recent question 关于检测给定输入中任何不允许的子字符串数组,我建议比较 PrggRePosit() 调用原始输入,因为 PrggRePosit() 可以将模式数组作为输入。因此,我的方法可能是 if 而其他解决方案需要一个(或多个)循环。

我不想讨论我的答案,因为实际上它比循环的可读性/可维护性差。我的答案仍然是-1,为了可读性/易于维护,我会接受这一点,但我的方法指出的最大错误是缺乏效率。这让我很好奇,并让我做了一些测试。我的结果让我有点惊讶:其他因素都是一样的, PrggRePosit() 是更快比任何其他方法都好。

你能解释一下为什么会这样吗?

以下是我的这些测试代码以及结果:

$input = "In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a `preg_replace()` call to the original input, since `preg_replace()` can take an array of patterns as input. Thus my method for this could be a single `if` whereas the other solutions required one (or many) loops. I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. However, the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, `preg_replace()` was **faster** than any of the other methods. Can you explain why this was the case?";
$input2 = "Short sentence - no matches";
$input3 = "Word";
$input4 = "Short sentence - matches loop";

$start1 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (str_check($rejectedStrs, $input)) $p_matches++;
    if (str_check($rejectedStrs, $input2)) $p_matches++;
    if (str_check($rejectedStrs, $input3)) $p_matches++;
    if (str_check($rejectedStrs, $input4)) $p_matches++;
}

$start2 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (loop_check($rejectedStrs, $input)) $l_matches++;
    if (loop_check($rejectedStrs, $input2)) $l_matches++;
    if (loop_check($rejectedStrs, $input3)) $l_matches++;
    if (loop_check($rejectedStrs, $input4)) $l_matches++;
}

$start3 = microtime(true);
$rejectedStrs = array("/loop/", "/efficiency/", "/explain/");
$s_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (preg_check($rejectedStrs, $input)) $s_matches++;
    if (preg_check($rejectedStrs, $input2)) $s_matches++;
    if (preg_check($rejectedStrs, $input3)) $s_matches++;
    if (preg_check($rejectedStrs, $input4)) $s_matches++;
}

$end = microtime(true);
echo $p_matches." ".$l_matches." ".$s_matches."\n";
echo "str_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$start3."=".($start3-$start2)."\npreg_match: ".$start3." ".$end."=".($end-$start3);

function preg_check($rejectedStrs, $input) {
    if($input == preg_replace($rejectedStrs, "", $input)) 
        return true;
    return false;
}

function loop_check($badwords, $string) {

    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) {
                return false;
            }
        }
    }
    return true;
}

function str_check($badwords, $str) {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) {
            return false;
        }
    }
    return true;
}

结果

20000万20000

str_匹配:1282270516.6934 1282270518.5881=1.894730091095

回路匹配:1282270518.5881 1282270523.0943=4.506185700348

Preg_匹配:1282270523.0943 1282270523.6191=0.52475500106812

2 回复 | 直到 14 年前

Artefacto 14 年前

我们先看看 preg_check 和 loop_check . 它们都必须遍历整个字符串,并且必须检查每个遍历中的每个单独单词。所以他们的行为至少 O(n*m) 在哪里 n 是字符串的长度,并且 m 坏词的数目。您可以通过使用递增的值运行算法来测试这一点。 n 和 米 绘制三维图形(但是,您可能,也可能不必,必须以非常高的值运行它 n 和 米 看到这种行为)。

循环检查 是更多 (渐近)这里是有效的。原因是一个字符串的字数与它们的长度不成正比——我记得它通常跟在对数函数后面。它可能使用哈希表来存储它通过这种方式找到的单词,这是在平均常量时间内完成的(如果我们忽略了这一点,我们可能需要不时地重建哈希表以容纳更多的元素)。

因此 循环检查 会有一个渐进的行为,如下 n + m * log(n) ,哪个比 n*m .

现在,这是指算法的渐近行为,即当 米 和 n 长得非常大(可能需要“非常大”)。对于小值 米 和 n 常数起着很大的作用。尤其是,执行php操作码和php函数调用比在C中实现的相同任务成本更高,只需调用一个函数。这不会使regex算法更快,只会使较小值的regex更快。 米 和 n .

Frank Farmer 14 年前

你能解释一下为什么会这样吗?

容易的。preg_match在C中实现,其他解决方案在PHP中实现。现在,这并不意味着雷杰克斯会总是比等效的PHP快,但大多数时候,它可能会。

我最近也遇到过类似的情况,我有一个函数(一个camelcase转换器),它被称为成千上万次的10s,并且占用了相当多的CPU(我分析过)。我尝试了所有我能想到的PHP重新实现。这个 preg_replace 总是更快。最后,我离开了原来的功能,把它记忆起来,这就完成了。

在许多情况下,执行的PHP语句越少越好。如果您可以用一个函数调用来替换一个循环,该函数是在C中实现的,那么这可能是您的最佳选择。

实际上,它的可读性/可维护性不如循环

我不同意。一句话很简单。尽管我可能会选择更像

function preg_check($rejectedStrs, $input) {
    return preg_match($rejectedStrs, "", $input);
}