
Regex/code to fix corrupted serialized PHP data

php

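Simon · Tech Community · 14 years ago

I have a huge multidimensional array that was serialized by PHP and stored in MySQL. The data field wasn't big enough, so the end got cut off. I need to extract the data, but unserialize won't work. Does anyone know of code that can close all the arrays and recalculate the string lengths? There is too much data to do by hand.

Many thanks.

12 replies | latest 5 years ago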

-6

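fabrik 14 years ago

I think this is almost impossible. Before you can repair the array you need to know how it is damaged. How many children are missing? What was the content?

Sorry, but in my opinion you can't do it.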

Emil M 7 years ago

This recalculates the length of the string elements in a serialized array:

$fixed = preg_replace_callback(
    '/s:([0-9]+):\"(.*?)\";/',
    function ($matches) { return "s:".strlen($matches[2]).':"'.$matches[2].'";';     },
    $serialized
);

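However, it doesn't work if your strings contain "; . In that case the serialized array string can't be fixed automatically, and manual editing is needed.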
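To make that limitation concrete, here is a minimal sketch using a made-up input (the string `a";b` and its wrong byte count are assumptions for illustration, not data from the question). The lazy `(.*?)` stops at the first `";`, which here sits inside the value, so the "repaired" string is still invalid:

```php
<?php
// Hypothetical damaged input: the real value a";b is 4 bytes, but the stored count says 9.
$serialized = 'a:1:{i:0;s:9:"a";b";}';

$fixed = preg_replace_callback(
    '/s:([0-9]+):"(.*?)";/',
    function ($matches) {
        return 's:' . strlen($matches[2]) . ':"' . $matches[2] . '";';
    },
    $serialized
);

// The match ended at the first "; so only the leading "a" was treated as the value.
echo $fixed, "\n";

// unserialize() still fails; @ hides the notice it raises on bad input.
var_dump(@unserialize($fixed));
```

With the correct count, `a:1:{i:0;s:4:"a";b";}`, the same data unserializes fine, so the regex is what loses the information here.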

Roman Newaza 7 years ago

I tried everything in this post and nothing worked for me. After hours of pain, here is what I found deep in the Google results and what finally worked:

function fix_str_length($matches) {
    $string = $matches[2];
    $right_length = strlen($string); // yes, strlen even for UTF-8 characters, PHP wants the mem size, not the char count
    return 's:' . $right_length . ':"' . $string . '";';
}
function fix_serialized($string) {
    // securities
    if ( !preg_match('/^[aOs]:/', $string) ) return $string;
    if ( @unserialize($string) !== false ) return $string;
    $string = preg_replace("%\n%", "", $string);
    // doublequote exploding
    $data = preg_replace('%";%', "µµµ", $string);
    $tab = explode("µµµ", $data);
    $new_data = '';
    foreach ($tab as $line) {
        $new_data .= preg_replace_callback('%\bs:(\d+):"(.*)%', 'fix_str_length', $line);
    }
    return $new_data;
}

You can call the routine as follows:

//Let's consider we store the serialization inside a txt file
$corruptedSerialization = file_get_contents('corruptedSerialization.txt');

//Try to unserialize original string
$unSerialized = unserialize($corruptedSerialization);

//In case of failure let's try to repair it
if(!$unSerialized){
    $repairedSerialization = fix_serialized($corruptedSerialization);
    $unSerialized = unserialize($repairedSerialization);
}

//Keep your fingers crossed
var_dump($unSerialized);

T.Todua Laurent W. 6 years ago

Solution:

1) Try it online:

Serialized String Fixer (online tool)

2) Use a function:

unserialize( serialize_corrector( $serialized_string ) ) ;

Code:

function serialize_corrector($serialized_string){
    // at first, check if "fixing" is really needed at all. After that, security checkup.
    if ( @unserialize($serialized_string) === false && preg_match('/^[aOs]:/', $serialized_string) ) {
         $serialized_string = preg_replace_callback( '/s\:(\d+)\:\"(.*?)\";/s',    function($matches){return 's:'.strlen($matches[2]).':"'.$matches[2].'";'; },   $serialized_string );
    }
    return $serialized_string;
}
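The guard above relies on the return value of unserialize(), which is `false` on failure but also `false` for a validly serialized boolean false (`b:0;`). A minimal sketch of a check that separates the two cases follows; the helper name `is_unserializable` is hypothetical, and passing `allowed_classes => false` (available since PHP 7.0) is my assumption about how you would want to treat untrusted data, not something the answer does:

```php
<?php
// Hypothetical helper: true when $data can be unserialized, false when it is damaged.
function is_unserializable(string $data): bool
{
    // 'b:0;' is the one input for which a false result is legitimate.
    if ($data === serialize(false)) {
        return true;
    }
    // @ hides the notice raised on malformed input;
    // allowed_classes => false avoids instantiating arbitrary classes.
    return @unserialize($data, ['allowed_classes' => false]) !== false;
}

var_dump(is_unserializable('b:0;'));                   // valid, even though the value is false
var_dump(is_unserializable('a:1:{i:0;s:3:"three";}')); // wrong length, so damaged
var_dump(is_unserializable('a:1:{i:0;s:5:"three";}')); // correct length
```

Only when this kind of check reports damage is it worth running the regex repair.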

T.Todua Laurent W. 8 years ago

Use preg_replace_callback() instead of preg_replace(.../e), because the /e modifier is deprecated (since PHP 5.5) and was removed in PHP 7:

$fixed_serialized_String = preg_replace_callback('/s:([0-9]+):\"(.*?)\";/',function($match) {
    return "s:".strlen($match[2]).':"'.$match[2].'";';
}, $serializedString);

$correct_array= unserialize($fixed_serialized_String);

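lubosdz 8 years ago

The following snippet tries to read and recursively parse a corrupted serialized string (blob data), for example one that was cut off because it was too long for the database column it was stored in. Numeric primitives and bools are guaranteed to be valid; strings may be cut off and/or array keys may be missing. The routine can be useful if recovering a significant part of the data (not all of it) is an acceptable solution for you.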

class Unserializer
{
    /**
    * Parse blob string tolerating corrupted strings & arrays
    * @param string $str Corrupted blob string
    */
    public static function parseCorruptedBlob(&$str)
    {
        // array pattern:    a:236:{...;}
        // integer pattern:  i:123;
        // double pattern:   d:329.0001122;
        // boolean pattern:  b:1; or b:0;
        // string pattern:   s:14:"date_departure";
        // null pattern:     N;
        // not supported: object O:{...}, reference R:{...}

        // NOTES:
        // - primitive types (bool, int, float) except for string are guaranteed uncorrupted
        // - arrays are tolerant to corrupted keys/values
        // - references & objects are not supported
        // - we use single byte string length calculation (strlen rather than mb_strlen) since source string is ISO-8859-2, not utf-8

        if(preg_match('/^a:(\d+):{/', $str, $match)){
            list($pattern, $cntItems) = $match;
            $str = substr($str, strlen($pattern));
            $array = [];
            for($i=0; $i<$cntItems; ++$i){
                $key = self::parseCorruptedBlob($str);
                if(trim($key)!==''){ // hmm, we wont allow null and "" as keys..
                    $array[$key] = self::parseCorruptedBlob($str);
                }
            }
            $str = ltrim($str, '}'); // closing array bracket
            return $array;
        }elseif(preg_match('/^s:(\d+):/', $str, $match)){
            list($pattern, $length) = $match;
            $str = substr($str, strlen($pattern));
            $val = substr($str, 0, $length + 2); // include also surrounding double quotes
            $str = substr($str, strlen($val) + 1); // include also semicolon
            $val = trim($val, '"'); // remove surrounding double quotes
            if(preg_match('/^a:(\d+):{/', $val)){
                // parse instantly another serialized array
                return (array) self::parseCorruptedBlob($val);
            }else{
                return (string) $val;
            }
        }elseif(preg_match('/^i:(-?\d+);/', $str, $match)){ // allow negative integers
            list($pattern, $val) = $match;
            $str = substr($str, strlen($pattern));
            return (int) $val;
        }elseif(preg_match('/^d:(-?[\d.]+(?:E[+-]?\d+)?);/i', $str, $match)){ // allow sign and exponent
            list($pattern, $val) = $match;
            $str = substr($str, strlen($pattern));
            return (float) $val;
        }elseif(preg_match('/^b:(0|1);/', $str, $match)){
            list($pattern, $val) = $match;
            $str = substr($str, strlen($pattern));
            return (bool) $val;
        }elseif(preg_match('/^N;/', $str, $match)){
            $str = substr($str, strlen('N;'));
            return null;
        }
    }
}

// usage:
$unserialized = Unserializer::parseCorruptedBlob($serializedString);
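To see the kind of damage this parser targets, truncation can be reproduced on a small array. This is only an illustrative sketch with made-up data; it also shows why recalculating string lengths with a regex does not help when the tail is missing:

```php
<?php
// Hypothetical sample data, serialized and then cut off the way a too-short column would.
$original  = serialize(['a' => 'alpha', 'b' => ['x' => 1, 'y' => 2]]);
$truncated = substr($original, 0, -10); // drop the last 10 bytes

echo $truncated, "\n";

// The blob no longer unserializes (@ hides the notice).
var_dump(@unserialize($truncated));

// Length recalculation leaves it untouched: every complete string already has
// the right count, and the missing values and closing braces are not restored.
$recounted = preg_replace_callback(
    '/s:\d+:"(.*?)";/s',
    function ($m) {
        return 's:' . strlen($m[1]) . ':"' . $m[1] . '";';
    },
    $truncated
);
var_dump($recounted === $truncated);
```

A tolerant parser that walks the string element by element, like the class above, is the approach that can still return the leading complete elements.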

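mickmackusa 5 years ago

It looks like all of the earlier posts just copy-paste someone else's regex pattern. There is no reason to capture the possibly corrupted byte count if it isn't going to be used in the replacement. Also, adding the s pattern modifier is a reasonable inclusion in case a string value contains newlines/line returns.

*For anyone not aware of how serialization treats multibyte characters: the stored length is the byte count, not the character count, so use strlen() rather than mb_strlen() in the callback. See my output...

Code: (Demo)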

$corrupted = <<<STRING
a:4:{i:0;s:3:"three";i:1;s:5:"five";i:2;s:2:"newline1
newline2";i:3;s:6:"garçon";}
STRING;

$repaired = preg_replace_callback(
        '/s:\d+:"(.*?)";/s',
        function ($m) {
            return "s:" . strlen($m[1]) . ":\"{$m[1]}\";";
        },
        $corrupted
    );

echo $corrupted , "\n" , $repaired;
echo "\n---\n";
var_export(unserialize($repaired));

Output:

a:4:{i:0;s:3:"three";i:1;s:5:"five";i:2;s:2:"newline1
newline2";i:3;s:6:"garçon";}
a:4:{i:0;s:5:"three";i:1;s:4:"five";i:2;s:17:"newline1
newline2";i:3;s:7:"garçon";}
---
array (
  0 => 'three',
  1 => 'five',
  2 => 'newline1
newline2',
  3 => 'garçon',
)
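The byte-versus-character point can be checked in isolation. A small sketch, assuming UTF-8 (the `\u{e7}` escape, PHP 7.0+, is used so the example does not depend on how the file itself is encoded):

```php
<?php
// "\u{e7}" is the letter c with cedilla, encoded as 2 bytes in UTF-8.
$word = "gar\u{e7}on";

// Byte count: 7. This is the number serialize() writes.
echo strlen($word), "\n";
echo serialize($word), "\n";

// mb_strlen() needs the mbstring extension, so it is guarded here.
if (function_exists('mb_strlen')) {
    // Character count: 6. Using this in the callback would produce a wrong length.
    echo mb_strlen($word, 'UTF-8'), "\n";
}
```

That is why the repaired output above shows s:7 for a six-letter word.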

Kamal Saleh 8 years ago

Based on @Emil M's answer, here is a fixed version that works with text containing double quotes.

function fix_broken_serialized_array($match) {
    return "s:".strlen($match[2]).":\"".$match[2]."\";"; 
}
$fixed = preg_replace_callback(
    '/s:([0-9]+):"(.*?)";/',
    "fix_broken_serialized_array",
    $serialized
);

T.Todua Laurent W. 8 years ago

Best solution:

$output_array = unserialize(My_checker($serialized_string));

Code:

function My_checker($serialized_string){
    // securities
    if (empty($serialized_string))                      return '';
    if ( !preg_match('/^[aOs]:/', $serialized_string) ) return $serialized_string;
    if ( @unserialize($serialized_string) !== false ) return $serialized_string;

    return
    preg_replace_callback(
        '/s\:(\d+)\:\"(.*?)\";/s', 
        function ($matches){  return 's:'.strlen($matches[2]).':"'.$matches[2].'";';  },
        $serialized_string )
    ;
}

-2

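Quamis 14 years ago

I doubt anyone would write code to retrieve partially saved arrays :) I fixed something like this once, but by hand, and it took hours, and then I realized I didn't need that part of the array...

Unless it's seriously important data (and I do mean SERIOUSLY important), you'd be better off letting this one go.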

-2

Mike Kormendy 9 years ago

You can turn invalid serialized data back into something usable by way of an array :)

$str = "a:1:{i:0;a:4:{s:4:\"name\";s:26:\"20141023_544909d85b868.rar\";s:5:\"dname\";s:20:\"HTxRcEBC0JFRWhtk.rar\";s:4:\"size\";i:19935;s:4:\"dead\";i:0;}}";

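// $re is not defined in this answer; the loops below expect its capture groups 1 and 2 to hold the alternating keys and values.
preg_match_all($re, $str, $matches);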

if(is_array($matches) && !empty($matches[1]) && !empty($matches[2]))
{
    foreach($matches[1] as $ksel => $serv)
    {
        if(!empty($serv))
        {
            $retva[] = $serv;
        }else{
            $retva[] = $matches[2][$ksel];
        }
    }

    $count = 0;
    $arrk = array();
    $arrv = array();
    if(is_array($retva))
    {
        foreach($retva as $k => $va)
        {
            ++$count;
            if($count/2 == 1)
            {
                $arrv[] = $va;
                $count = 0;
            }else{
                $arrk[] = $va;
            }
        }
        $returnse = array_combine($arrk,$arrv);
    }

}

print_r($returnse);

-3

Ben 14 years ago

Serializing is almost always bad because you can't search it in any way. Sorry, but it seems you're backed into a corner...