HTML Purifier: Converting <body> to <div>

html htmlpurifier html-parsing php

pinkgothic sudip · Tech community · 14 years ago

Premise

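I'd like to use HTML Purifier to transform <body> tags into <div> tags, so that, for example, <body style="background:color#000000;">Hi there.</body> would turn into <div style="background:color#000000;">Hi there.</div>. I'm looking at a combination of a custom tag and a TagTransform class.

In my configuration section, I'm currently doing the following: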

$htmlDef  = $this->configuration->getHTMLDefinition(true);
// defining the element to avoid triggering 'Element 'body' is not supported'
$bodyElem = $htmlDef->addElement('body', 'Block', 'Flow', 'Core');
$bodyElem->excludes = array('body' => true);
// add the transformation rule
$htmlDef->info_tag_transform['body'] = new HTMLPurifier_TagTransform_Simple('div');

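...as well as allowing <body> and its style (and class, and id) attributes via the configuration directives (they are parsed into HTML.AllowedElements and HTML.AllowedAttributes).

I've turned definition caching off: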

$config->set('Cache.DefinitionImpl', null);

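Unfortunately, in this setup HTMLPurifier_TagTransform_Simple never has its transform() method called.

I presume the culprit is my HTML.Parent, which is set to 'div', since <div> does not allow a child <body> element. However, setting HTML.Parent to 'html' nets me: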

ErrorException: Cannot use unrecognized element as parent

Adding...

$htmlElem = $htmlDef->addElement('html', 'Block', 'Flow', 'Core');
$htmlElem->excludes = array('html' => true);

...gets rid of that error message, but still doesn't transform the tag; it is removed instead. Adding...

$htmlElem = $htmlDef->addElement('html', 'Block', 'Custom: head?, body', 'Core');
$htmlElem->excludes = array('html' => true);

...doesn't help either, because it nets me an error message:

ErrorException: Trying to get property of non-object       

[...]/library/HTMLPurifier/Strategy/FixNesting.php:237
[...]/library/HTMLPurifier/Strategy/Composite.php:18
[...]/library/HTMLPurifier.php:181
[...]

HTML.TidyLevel?

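HTML.TidyLevel is set to 'heavy'. I have yet to try every possible combination, but so far it makes no difference.

Full configuration

My configuration data is stored in JSON and then parsed into HTML Purifier. The file looks like this: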

{
    "CSS" : {
        "MaxImgLength" : "800px"
    },
    "Core" : {
        "CollectErrors" : true,
        "HiddenElements" : {
            "script"   : true,
            "style"    : true,
            "iframe"   : true,
            "noframes" : true
        },
        "RemoveInvalidImg" : false
    },
    "Filter" : {
        "ExtractStyleBlocks" : true
    },
    "HTML" : {
        "MaxImgLength" : 800,
        "TidyLevel"    : "heavy",
        "Doctype"      : "XHTML 1.0 Transitional",
        "Parent"       : "html"
    },
    "Output" : {
        "TidyFormat"   : true
    },
    "Test" : {
        "ForceNoIconv" : true
    },
    "URI" : {
        "AllowedSchemes" : {
            "http"     : true,
            "https"    : true,
            "mailto"   : true,
            "ftp"      : true
        },
        "DisableExternalResources" : true
    }
}

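(URI.Base, URI.Munge and Cache.SerializerPath are also set, but I have removed them from this paste. Also, a caveat on HTML.Parent: as mentioned, it is usually set to 'div'.)

2 replies | as of 14 years ago

Edward Z. Yang 14 years ago

This code is the reason why what you're doing doesn't work: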

/**
 * Takes a string of HTML (fragment or document) and returns the content
 * @todo Consider making protected
 */
public function extractBody($html) {
    $matches = array();
    $result = preg_match('!<body[^>]*>(.*)</body>!is', $html, $matches);
    if ($result) {
        return $matches[1];
    } else {
        return $html;
    }
}

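You can turn it off by setting %Core.ConvertDocumentToFragment to false; if the rest of your code is bug-free, it should work straight from there. I don't think your bodyElem definition is necessary.

Ben 14 years ago

Wouldn't this be easier: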
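
To see why the tag transform never fires, it may help to run the same regular expression outside the library. The following is only a standalone sketch: extractBodySketch is a hypothetical name, and the function copies just the regex from extractBody() quoted above rather than calling any HTML Purifier code.

```php
<?php
// Standalone copy of the regex from extractBody() quoted above.
// extractBodySketch is a hypothetical name; no HTML Purifier classes are used.
function extractBodySketch(string $html): string
{
    $matches = array();
    if (preg_match('!<body[^>]*>(.*)</body>!is', $html, $matches)) {
        // Only the inner content is kept; the <body> tag and its
        // attributes are discarded at this point.
        return $matches[1];
    }
    // No <body>...</body> pair found: the input passes through unchanged.
    return $html;
}

$input = '<body style="background:color#000000;">Hi there.</body>';
$fragment = extractBodySketch($input);

echo $fragment, PHP_EOL; // prints: Hi there.
```

Since the <body> wrapper and its style attribute are already gone once this extraction has run, a transform registered for 'body' has nothing left to match, which fits the behaviour described in the question.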

$search = array('<body', 'body>');
$replace = array('<div', 'div>');

$html = '<body style="background:color#000000;">Hi there.</body>';

echo str_replace($search, $replace, $html);

>> '<div style="background:color#000000;">Hi there.</div>';
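
One caveat with the plain str_replace() version: the search string 'body>' also matches inside ordinary text, so input such as 'Ask anybody>' would come out as 'Ask anydiv>'. A slightly stricter variant is sketched below. It is only a sketch under assumptions: bodyToDivSketch is a hypothetical helper name, it does no sanitizing of its own, and it would have to run before the markup is handed to the purifier.

```php
<?php
// Hypothetical helper: rewrite only real <body> open/close tags to <div>.
function bodyToDivSketch(string $html): string
{
    // Opening tag: "<body" must be followed by whitespace, ">" or "/",
    // so a longer tag name that merely starts with "body" is left alone.
    $html = preg_replace('!<body(?=[\s>/])!i', '<div', $html);
    // Closing tag: "</body>", optionally with whitespace before ">".
    return preg_replace('!</body\s*>!i', '</div>', $html);
}

$html = '<body style="background:color#000000;">Hi there.</body>';
$converted = bodyToDivSketch($html);

echo $converted, PHP_EOL;
// prints: <div style="background:color#000000;">Hi there.</div>
```

Like the str_replace() version, this still treats HTML as a plain string, so it can also rewrite tag-like text inside comments or attribute values.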