Remove incomplete P tags (using regex or any other method)

parsing visual-studio regex .net c#

Sangram Nandkhile Viktor Klang · Tech community · 14 years ago

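First of all, the <p> tags here do not interact with any other tags, so you don't have to worry about any other tags.

I have an HTML file that is the output of a piece of software, but it has some errors, such as unclosed <p> tags.

For example, my file looks like this: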

    <html>
    ....
    ....
      <head>
      </head>
    ....
    ....
      <body>

    ...
    ...
    <p>                 // tag is to be removed as no closing tag

    <p align="left">   AAA   </p>
    <p class="style6">   BBB    </P>
    <p class="style1" align="center">    CCC    </P>

    <p align="left">  DDD               // tag is to be removed as no closing tag
    <p class="style6">   EEE              // tag is to be removed as no closing tag
    <p class="style1" align="center">    FFF             // tag is to be removed as no closing tag

    <p class="style15"><strong>xxyyzz</strong><br/></p>

    <p>                // tag is to be removed as no closing tag

    <p> stack Overflow </P>

      </body>
    </html>

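The <p> tags on the DDD, EEE, and FFF lines, along with the other unclosed <p> tags, are to be removed. As you can see, this should work for every unclosed <p> tag, whether or not it has attributes such as class or align.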

Also, a <p> tag nested inside another <p> tag, like this:

    <p>
        <p>
        </p>

        <p>
        </p>

    </p>

This case will never happen.

Thanks a lot in advance to anyone willing to help.

Regards

4 replies | up to 14 years ago

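chakrit Dutchie432 14 years ago

Use Html Agility Pack.

It is a .NET code library that lets you parse "out of the web" HTML. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to that of System.Xml, but for HTML.

Just load the document into a DOM, walk the elements to find the offending <p> tags, and filter them out, just as you would when manipulating valid XML.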

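Jim Mischel 14 years ago

Disclaimer: please note that I am not advocating parsing arbitrary HTML with regular expressions or simple substring matching. The solution below is specific to this problem, which appears to be purposely limited so that parsing is possible with simple methods. In general, I agree with the consensus: to parse HTML, use an HTML parser.

With that in mind, the following code finds each <p> tag that has no corresponding </p> tag.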

    string inputText = GetHtmlText();
    int startTag = inputText.IndexOf("<p>", 0);
    while (startTag != -1)
    {
        // Resume scanning just past the "<p>" that was found (3 characters).
        int scanPos = startTag + 3;
        // Now look for a closing tag or another open tag
        int closeTag = inputText.IndexOf("</p>", scanPos);
        int nextStartTag = inputText.IndexOf("<p>", scanPos);
        // The tag is unclosed if no "</p>" follows, or if another "<p>" opens first.
        if (closeTag == -1 || (nextStartTag != -1 && nextStartTag < closeTag))
        {
            // Error at position startTag.  No closing tag.
        }
        else
        {
            // You have a full paragraph between startTag and (closeTag+4), end exclusive.
        }
        startTag = nextStartTag;
    }

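The code assumes that the tags appear literally as <p> and </p>.

Addendum:

Handling tags with attributes, such as <p class="style6">, is less certain. If you can guarantee that every such tag starts with <p and ends with >, you can modify the code above to search for <p> and for <p followed by a space, and, when one is found, locate the closing >. It's a bit messy, but not especially hard.

Other spellings of the tag that this code does not match are also perfectly valid (and I've encountered them in the wild).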
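The modification described in the addendum could be sketched roughly as follows. This is only an illustration under the question's stated constraints (no nested <p> tags, every tag ends with >), not the answerer's actual code; the class name UnclosedParagraphs and both method names are hypothetical. It matches <p followed by > or whitespace, compares case-insensitively so that </P> counts as a closing tag, and assumes closing tags are written without inner spaces.

```csharp
using System;
using System.Text;

public static class UnclosedParagraphs
{
    // Removes every opening <p ...> tag that is not followed by a </p>
    // before the next opening <p ...> tag. The tag's content is kept.
    public static string RemoveUnclosedOpeningTags(string html)
    {
        var result = new StringBuilder(html.Length);
        int pos = 0;
        while (true)
        {
            int open = FindOpeningTag(html, pos);
            int tagEnd = open == -1 ? -1 : html.IndexOf('>', open);
            if (open == -1 || tagEnd == -1)
            {
                // No further (well-terminated) opening tag: copy the rest as-is.
                result.Append(html, pos, html.Length - pos);
                return result.ToString();
            }

            int close = html.IndexOf("</p>", tagEnd + 1, StringComparison.OrdinalIgnoreCase);
            int next = FindOpeningTag(html, tagEnd + 1);
            bool closed = close != -1 && (next == -1 || close < next);

            // Copy the text that precedes the opening tag.
            result.Append(html, pos, open - pos);
            if (closed)
            {
                // Keep the whole paragraph, through the end of "</p>".
                int end = close + 4;
                result.Append(html, open, end - open);
                pos = end;
            }
            else
            {
                // Skip only the opening tag itself.
                pos = tagEnd + 1;
            }
        }
    }

    // Finds "<p" followed by '>' or whitespace, so "<pre>" and "<param>" are skipped.
    private static int FindOpeningTag(string html, int start)
    {
        int i = start;
        while ((i = html.IndexOf("<p", i, StringComparison.OrdinalIgnoreCase)) != -1)
        {
            int after = i + 2;
            if (after < html.Length && (html[after] == '>' || char.IsWhiteSpace(html[after])))
            {
                return i;
            }
            i = after;
        }
        return -1;
    }
}
```

Building the result in a StringBuilder instead of calling Remove on the string avoids having to re-adjust the scan positions after each deletion.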

Sangram Nandkhile Viktor Klang 14 years ago

I really appreciate everyone's help, especially Jim and Alex. I tried it and it works well. Thanks a lot.

    public static string CleanUpXHTML(string xhtml)
    {
        int pOpen = 0, pClose = 0, pSlash = 0, pNext = 0, length = 0;
        pOpen = xhtml.IndexOf("<p", 0);
        if (pOpen < 0)
        {
            return xhtml;   // no <p tags at all, nothing to clean up
        }
        pClose = xhtml.IndexOf(">", pOpen);      // end of the current opening tag
        pSlash = xhtml.IndexOf("</p>", pClose);  // next closing tag
        pNext = xhtml.IndexOf("<p", pClose);     // next opening tag

        while (pSlash > -1)
        {
            if (pSlash < pNext)
            {
                // The current tag is closed before the next one opens: move on.
                pOpen = pNext;
                pClose = xhtml.IndexOf(">", pOpen);
                pSlash = xhtml.IndexOf("</p>", pClose);
                pNext = xhtml.IndexOf("<p", pClose);
            }
            else
            {
                length = pClose - pOpen + 1;
                if (pNext < 0 && pSlash > 0)
                {
                    break;   // last opening tag, and it is closed
                }

                // Another <p opens before any </p>: remove the unclosed opening tag.
                xhtml = xhtml.Remove(pOpen, length);

                pOpen = pNext - length;
                pClose = xhtml.IndexOf(">", pOpen);
                pSlash = xhtml.IndexOf("</p>", pClose);
                pNext = xhtml.IndexOf("<p", pClose);
            }

            if (pSlash < 0)
            {
                // No closing tags remain: remove every remaining opening tag.
                int lastp = 0, lastclosep = 0, lastnextp = 0, length3 = 0;

                lastp = xhtml.IndexOf("<p", pOpen - 1);
                lastclosep = xhtml.IndexOf(">", lastp);
                lastnextp = xhtml.IndexOf("<p", lastclosep);

                while (lastp > 0)
                {
                    length3 = lastclosep - lastp + 1;
                    xhtml = xhtml.Remove(lastp, length3);
                    if (lastnextp < 0)
                    {
                        break;
                    }
                    lastp = lastnextp - length3;
                    lastclosep = xhtml.IndexOf(">", lastp);
                    lastnextp = xhtml.IndexOf("<p", lastclosep);
                }

                break;
            }
        }

        return xhtml;
    }

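Community miroxlav 7 years ago

First, take a look here. If that hasn't put you off using regular expressions to parse HTML (I understand this is a very specific case that may not need a full DOM parser, even though that is by far the recommended approach), I have posted an answer to a similar question here. You can easily adapt it to your situation, but please understand that it is not recommended, and if you decide to use it, many things can go wrong (including, as outlined in the first link above, the end of the world, etc. :P).

If the regular expression I pointed you to looks too complicated, or you have trouble understanding or simplifying it, please leave a comment and I'll add more explanation.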
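The linked regular expression is not reproduced in this thread, so the following is an independent sketch of what a regex for this narrow case might look like, not the pattern from that answer. It assumes the question's constraints (no nested <p> tags, no > inside attribute values), and the class name UnclosedParagraphRegex and method name Strip are hypothetical. The pattern matches an opening <p ...> tag only when no </p> can be reached before the next opening <p ...> tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnclosedParagraphRegex
{
    // <p(?=[\s>])[^>]*>       an opening tag: "<p" followed by whitespace or '>'
    // (?! ... </p\s*>)        not followed by a closing tag that is reachable...
    // (?:(?!<p[\s>]).)*?      ...without first passing another opening <p ...> tag
    private static readonly Regex UnclosedOpeningTag = new Regex(
        @"<p(?=[\s>])[^>]*>(?!(?:(?!<p[\s>]).)*?</p\s*>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Deletes the unclosed opening tags and leaves their content in place.
    public static string Strip(string html)
    {
        return UnclosedOpeningTag.Replace(html, string.Empty);
    }
}
```

RegexOptions.Singleline lets the dot cross line breaks, and RegexOptions.IgnoreCase lets </P> count as a closing tag. The caveats from the answer still apply: this is a shortcut for one constrained input, not a general HTML parser.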