
Is there a way to keep entities intact while parsing HTML with DOMDocument?

  •  6
  • dev-null-dweller  · Tech Community  · 14 years ago

    I have this function to make sure that every img tag has an absolute URL in its src attribute:

    function absoluteSrc($html, $encoding = 'utf-8')
    {
        $dom = new DOMDocument();
        // Workaround to use proper encoding
        $prehtml  = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset={$encoding}\"></head><body>";
        $posthtml = "</body></html>";
    
        if($dom->loadHTML( $prehtml . trim($html) . $posthtml)){
            foreach($dom->getElementsByTagName('img') as $img){
                if($img instanceof DOMElement){
                    $src = $img->getAttribute('src');
                    if( strpos($src, 'http://') !== 0 ){
                        $img->setAttribute('src', 'http://my.server/' . $src);
                    }
                }
            }
    
            $html = $dom->saveHTML();
    
            // Remove remains of workaround / DomDocument additions
            $cut_start  = strpos($html, '<body>') + 6;
            $cut_length = -1 * (1+strlen($posthtml));
            $html = substr($html, $cut_start, $cut_length);
        }
        return $html;
    }
    

    It works fine, but it returns the entities decoded as Unicode characters:

    $html = <<< EOHTML
    <p><img src="images/lorem.jpg" alt="lorem" align="left">
    Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
    Cum magna. Suscipit sed vel tincidunt urna.<br>
    Vel consequat pretium Curabitur faucibus justo adipiscing elit.
    <img src="others/ipsum.png" alt="ipsum" align="right"></p>
    
    <center>&copy; Dr&nbsp;Jekyll &#38; Mr&nbsp;Hyde</center>
    EOHTML;
    
    echo absoluteSrc($html);
    

    Output:

    <p><img src="http://my.server/images/lorem.jpg" alt="lorem" align="left">
    Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
    Cum magna. Suscipit sed vel tincidunt urna.<br>
    Vel consequat pretium Curabitur faucibus justo adipiscing elit.
    <img src="http://my.server/others/ipsum.png" alt="ipsum" align="right"></p>
    
    <center>© Dr Jekyll &amp; Mr Hyde</center>
    

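    As you can see in the last line:

    • &copy; is converted to © (U+00A9),
    • &nbsp; is converted to a non-breaking space (U+00A0),
    • &#38; is converted to &amp;

    I would like them to stay exactly as they are in the input string.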

    3 answers
        1
  •  0
  •   Byron Whitlock    14 years ago

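    I would like to know the answer to this too.

    I ended up converting the & entities to **ENTITY-...-ENTITY** before parsing, and converting them back after the transformation was done, just before returning the result.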
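
    The placeholder idea described above could be sketched roughly as follows. The function names protectEntities() and restoreEntities(), and the hex-encoded token format, are assumptions made for illustration; the answer does not show its actual code.

```php
<?php
// Hypothetical helpers illustrating the placeholder idea; the names and the
// exact token format are assumptions, not taken from the answer.

// Replace every entity reference (&name;, &#38;, &#x26;) with a plain-text token.
function protectEntities(string $html): string
{
    return preg_replace_callback(
        '/&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);/',
        function (array $m): string {
            // bin2hex() keeps the token alphanumeric, so the parser leaves it alone
            return 'ENTITY-' . bin2hex($m[1]) . '-ENTITY';
        },
        $html
    );
}

// Turn the tokens back into the original entity references.
function restoreEntities(string $html): string
{
    return preg_replace_callback(
        '/ENTITY-((?:[0-9a-f]{2})+)-ENTITY/',
        function (array $m): string {
            return '&' . hex2bin($m[1]) . ';';
        },
        $html
    );
}

$input     = '<center>&copy; Dr&nbsp;Jekyll &#38; Mr&nbsp;Hyde</center>';
$protected = protectEntities($input);
// ... load $protected into DOMDocument, modify the tree, call saveHTML() ...
$restored  = restoreEntities($protected);
var_dump($restored === $input);
```

    Because the tokens contain no ampersand, DOMDocument should treat them as ordinary text in both element content and attribute values. A document that already contains text shaped like one of these tokens would be altered on restore, so the token format may need to be adjusted for real input.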

        2
  •  1
  •   Terry    13 years ago

    The following code seems to work:

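       $dom = new DOMDocument('1.0', 'UTF-8');
       // Swap named entities for ^name^ placeholders so the parser cannot decode them
       $dom->loadHTML($this->htmlentities2stringcode(rawurldecode($content)));
       $dom->preserveWhiteSpace = true;

       // Strip the wrapper markup left around the fragment
       $innerHTML = str_replace("<html></html><html><body>", "",
           str_replace("</body></html>", "",
               str_replace("+", "%2B", str_replace("<p></p>", "", $this->getInnerHTML($dom)))));

       // Turn the placeholders back into the original entities
       return $this->stringcode2htmlentities($innerHTML);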
    }
    // ----------------------------------------------------------
    function htmlentities2stringcode($string) {
       // This method will convert htmlentities such as &copy; into the pseudo string version ^copy^; etc
            $from   = array_keys($this->getHTMLEntityStringCodeArray());
            $to     = array_values($this->getHTMLEntityStringCodeArray());
       return str_replace($from, $to, $string);
     }
     // ----------------------------------------------------------
     function stringcode2htmlentities ($string) {
        // This method will convert pseudo string such as ^copy^ to the original html entity &copy; etc
        $from   = array_values($this->getHTMLEntityStringCodeArray());
        $to     = array_keys($this->getHTMLEntityStringCodeArray());
        return str_replace($from, $to, $string);
      } 
      // -------------------------------------------------------------
      function getHTMLEntityStringCodeArray() {
    
          return array('&Alpha;'=>'^Alpha^', 
                                        '&Beta;'=>'^Beta^', 
                                        '&Chi;'=>'^Chi^', 
                                        '&Dagger;'=>'^Dagger^', 
                                        '&Delta;'=>'^Delta^', 
                                        '&Epsilon;'=>'^Epsilon^', 
                                        '&Eta;'=>'^Eta^', 
                                        '&Gamma;'=>'^Gamma^', 
                                        '&Iota;'=>'^Iota^', 
                                        '&Kappa;'=>'^Kappa^', 
                                        '&Lambda;'=>'^Lambda^', 
                                        '&Mu;'=>'^Mu^', 
                                        '&Nu;'=>'^Nu^', 
                                        '&OElig;'=>'^OElig^', 
                                        '&Omega;'=>'^Omega^', 
                                        '&Omicron;'=>'^Omicron^',
                                        '&Phi;'=>'^Phi^', 
                                        '&Pi;'=>'^Pi^', 
                                        '&Prime;'=>'^Prime^', 
                                        '&Psi;'=>'^Psi^', 
                                        '&Rho;'=>'^Rho^', 
                                        '&Scaron;'=>'^Scaron^',
                                        '&Sigma;'=>'^Sigma^',
                                        '&Tau;'=>'^Tau^',
                                        '&Theta;'=>'^Theta^',
                                        '&Upsilon;'=>'^Upsilon^',
                                        '&Xi;'=>'^Xi^',
                                        '&Yuml;'=>'^Yuml^',
                                        '&Zeta;'=>'^Zeta^',
                                        '&alefsym;'=>'^alefsym^',
                                        '&alpha;'=>'^alpha^',
                                        '&and;'=>'^and^',
                                        '&ang;'=>'^ang^',
                                        '&asymp;'=>'^asymp^',
                                        '&bdquo;'=>'^bdquo^',
                                        '&beta;'=>'^beta^',
                                        '&bull;'=>'^bull^',
                                        '&cap;'=>'^cap^',
                                        '&chi;'=>'^chi^',
                                        '&circ;'=>'^circ^',
                                        '&clubs;'=>'^clubs^',
                                        '&cong;'=>'^cong^',
                                        '&crarr;'=>'^crarr^',
                                        '&cup;'=>'^cup^',
                                        '&dArr;'=>'^dArr^',
                                        '&dagger;'=>'^dagger^',
                                        '&darr;'=>'^darr^',
                                        '&delta;'=>'^delta^',
                                        '&diams;'=>'^diams^',
                                        '&empty;'=>'^empty^',
                                        '&emsp;'=>'^emsp^',
                                        '&ensp;'=>'^ensp^',
                                        '&epsilon;'=>'^epsilon^',
                                        '&equiv;'=>'^equiv^',
                                        '&eta;'=>'^eta^',
                                        '&euro;'=>'^euro^',
                                        '&exist;'=>'^exist^',
                                        '&fnof;'=>'^fnof^',
                                        '&forall;'=>'^forall^',
                                        '&frasl;'=>'^frasl^',
                                        '&gamma;'=>'^gamma^',
                                        '&ge;'=>'^ge^',
                                        '&hArr;'=>'^hArr^',
                                        '&harr;'=>'^harr^',
                                        '&hearts;'=>'^hearts^',
                                        '&hellip;'=>'^hellip^',
                                        '&image;'=>'^image^',
                                        '&infin;'=>'^infin^',
                                        '&int;'=>'^int^',
                                        '&iota;'=>'^iota^',
                                        '&isin;'=>'^isin^',
                                        '&kappa;'=>'^kappa^',
                                        '&lArr;'=>'^lArr^',
                                        '&lambda;'=>'^lambda^',
                                        '&lang;'=>'^lang^',
                                        '&larr;'=>'^larr^',
                                        '&lceil;'=>'^lceil^',
                                        '&ldquo;'=>'^ldquo^',
                                        '&le;'=>'^le^',
                                        '&lfloor;'=>'^lfloor^',
                                        '&lowast;'=>'^lowast^',
                                        '&loz;'=>'^loz^',
                                        '&lrm;'=>'^lrm^',
                                        '&lsaquo;'=>'^lsaquo^',
                                        '&lsquo;'=>'^lsquo^',
                                        '&mdash;'=>'^mdash^',
                                        '&minus;'=>'^minus^',
                                        '&mu;'=>'^mu^',
                                        '&nabla;'=>'^nabla^',
                                        '&ndash;'=>'^ndash^',
                                        '&ne;'=>'^ne^',
                                        '&ni;'=>'^ni^',
                                        '&notin;'=>'^notin^',
                                        '&nsub;'=>'^nsub^',
                                        '&nu;'=>'^nu^',
                                        '&oelig;'=>'^oelig^',
                                        '&oline;'=>'^oline^',
                                        '&omega;'=>'^omega^',
                                        '&omicron;'=>'^omicron^',
                                        '&oplus;'=>'^oplus^',
                                        '&or;'=>'^or^',
                                        '&otimes;'=>'^otimes^',
                                        '&part;'=>'^part^',
                                        '&permil;'=>'^permil^',
                                        '&perp;'=>'^perp^',
                                        '&phi;'=>'^phi^',
                                        '&pi;'=>'^pi^', 
                                        '&piv;'=>'^piv^',
                                        '&prime;'=>'^prime^',
                                        '&prod;'=>'^prod^',
                                        '&prop;'=>'^prop^',
                                        '&psi;'=>'^psi^',
                                        '&rArr;'=>'^rArr^',
                                        '&radic;'=>'^radic^',
                                        '&rang;'=>'^rang^',
                                        '&rarr;'=>'^rarr^',
                                        '&rceil;'=>'^rceil^',
                                        '&rdquo;'=>'^rdquo^',
                                        '&real;'=>'^real^',
                                        '&rfloor;'=>'^rfloor^',
                                        '&rho;'=>'^rho^',
                                        '&rlm;'=>'^rlm^',
                                        '&rsaquo;'=>'^rsaquo^',
                                        '&rsquo;'=>'^rsquo^',
                                        '&sbquo;'=>'^sbquo^',
                                        '&scaron;'=>'^scaron^',
                                        '&sdot;'=>'^sdot^',
                                        '&sigma;'=>'^sigma^',
                                        '&sigmaf;'=>'^sigmaf^',
                                        '&sim;'=>'^sim^',
                                        '&spades;'=>'^spades^',
                                        '&sub;'=>'^sub^',
                                        '&sube;'=>'^sube^',
                                        '&sum;'=>'^sum^',
                                        '&sup;'=>'^sup^',
                                        '&supe;'=>'^supe^',
                                        '&tau;'=>'^tau^',
                                        '&there4;'=>'^there4^',
                                        '&theta;'=>'^theta^',
                                        '&thetasym;'=>'^thetasym^',
                                        '&thinsp;'=>'^thinsp^',
                                        '&tilde;'=>'^tilde^',
                                        '&trade;'=>'^trade^',
                                        '&uArr;'=>'^uArr^',
                                        '&uarr;'=>'^uarr^',
                                        '&upsih;'=>'^upsih^',
                                        '&upsilon;'=>'^upsilon^',
                                        '&weierp;'=>'^weierp^',
                                        '&xi;'=>'^xi^',
                                        '&yuml;'=>'^yuml^',
                                        '&zeta;'=>'^zeta^',
                                        '&zwj;'=>'^zwj^',
                                        '&zwnj;'=>'^zwnj^');
        }
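
    The getInnerHTML() method called in the code above is not shown in the answer. One plausible stand-in, written here as a free function and purely as an assumption about what it might do, serializes the children of the body element:

```php
<?php
// Hypothetical stand-in for the getInnerHTML() method used above; the answer
// does not show its body, so this is only one plausible implementation.
function getInnerHTML(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // Passing a node to saveHTML() serializes only that subtree (PHP >= 5.3.6)
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>^copy^ Dr Jekyll</p></body></html>');
echo getInnerHTML($dom), "\n";
```

    With a helper like this one the html and body wrappers never reach the result, so the str_replace() calls that strip them would simply find nothing to remove.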
    
        3
  •  1
  •   Gavin Ballard    11 years ago

    Another solution is to use DOMDocument->saveHTMLFile() (which does not convert HTML entities) and then read the contents of the saved file back into a string.
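
    A minimal sketch of that round trip through a temporary file might look like this. The helper name saveHtmlViaFile() is hypothetical, and how entities appear in the result is the answer's claim, which may depend on the libxml version and on the declared encoding, so it is worth comparing the output against your own input.

```php
<?php
// Sketch of the saveHTMLFile() round trip. The helper name is hypothetical.
function saveHtmlViaFile(DOMDocument $dom): string
{
    // tempnam() creates an empty file and returns its path (false on failure)
    $path = tempnam(sys_get_temp_dir(), 'dom');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    try {
        // saveHTMLFile() returns the number of bytes written, or false on error
        if ($dom->saveHTMLFile($path) === false) {
            throw new RuntimeException('saveHTMLFile() failed');
        }
        $html = file_get_contents($path);
        if ($html === false) {
            throw new RuntimeException('Could not read the temporary file');
        }
        return $html;
    } finally {
        // Always clean up the temporary file
        unlink($path);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><center>&copy; Dr&nbsp;Jekyll &#38; Mr&nbsp;Hyde</center></body></html>');
echo saveHtmlViaFile($dom);
```

    The string returned here is the whole serialized document, so any wrapper markup around the fragment still has to be trimmed afterwards, as in the question's own function.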