
Regular expression to remove XML tags and their content

regex vb.net xml .net c#

Vincent · 17 years ago

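I have the following string and I want to remove <bpt *>*</bpt> and <ept *>*</ept> (note that the additional tag content inside them needs to be removed too) without using an XML parser (too much overhead for tiny strings).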

The big <bpt i="1" x="1" type="bold"><b></bpt>black<ept i="1"></b></ept> <bpt i="2" x="2" type="ulined"><u></bpt>cat<ept i="2"></u></ept> sleeps.

Any regex in VB.NET or C# will do.

7 answers | last activity 14 years ago

tyshock 17 years ago

If you just want to remove all the tags from the string, use this (C#):

try {
    yourstring = Regex.Replace(yourstring, "(<[be]pt[^>]+>.+?</[be]pt>)", "");
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

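EDIT:

I decided to add a better option to my solution. The previous option does not work if you have embedded tags. This new solution should strip all <**pt*> tags, embedded or not. It also uses a backreference to the original [be] match, so that the exact matching end tag is found, and it creates a reusable Regex object so that the regular expression does not have to be recompiled on each iteration: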

try {
    // Build the pattern once; \1 makes the closing tag match the opening one
    Regex regex = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");
    // Repeat until no match is left
    while (regex.IsMatch(yourstring)) {
        yourstring = regex.Replace(yourstring, "");
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
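To try the loop on the sample string from the question, here is a minimal sketch. The class name StripPtTags and the helper Strip are made up for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripPtTags
{
    // Built once and reused; \1 forces the closing tag to match the opening one.
    static readonly Regex PtTag = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");

    // Hypothetical helper: removes <bpt ...>...</bpt> and <ept ...>...</ept> spans.
    public static string Strip(string input)
    {
        // Keep replacing until a pass finds nothing more to remove.
        while (PtTag.IsMatch(input))
        {
            input = PtTag.Replace(input, "");
        }
        return input;
    }

    static void Main()
    {
        string s = "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
                   "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";
        Console.WriteLine(Strip(s)); // The big black cat sleeps.
    }
}
```

For this particular input a single Replace call already removes every match, because Replace handles all non-overlapping matches in one pass. The loop only matters when removing one match exposes a new one.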

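ADDED NOTE:

In the comments, a user expressed concern that the '.' pattern matcher would be CPU intensive. While this is true for a standalone greedy '.', the non-greedy modifier '?' makes the regex engine look ahead only until it finds the first match of the next character in the pattern, whereas a greedy '.' requires the engine to look ahead all the way to the end of the string. I use RegexBuddy as a regex development tool, and it includes a debugger that lets you see the relative performance of different regex patterns. It also auto-comments your regexes if desired, so I decided to include those comments here to explain the regex used above: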

// <([be])pt[^>]+>.+?</\1pt>
// 
// Match the character "<" literally «<»
// Match the regular expression below and capture its match into backreference number 1 «([be])»
//    Match a single character present in the list "be" «[be]»
// Match the characters "pt" literally «pt»
// Match any character that is not a ">" «[^>]+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the character ">" literally «>»
// Match any single character that is not a line break character «.+?»
//    Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
// Match the characters "</" literally «</»
// Match the same text as most recently matched by backreference number 1 «\1»
// Match the characters "pt>" literally «pt>»
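Besides the cost, greedy and lazy quantifiers also give different results on the question's string. The sketch below shows this with a simplified bpt-only pattern; the class and method names are invented for illustration, and nothing here measures performance.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazyDemo
{
    public static string StripGreedy(string input)
    {
        // Greedy: .+ runs to the end of the line, then backtracks to the LAST </bpt>.
        return Regex.Replace(input, @"<bpt[^>]+>.+</bpt>", "");
    }

    public static string StripLazy(string input)
    {
        // Lazy: .+? grows one character at a time and stops at the FIRST </bpt>.
        return Regex.Replace(input, @"<bpt[^>]+>.+?</bpt>", "");
    }

    static void Main()
    {
        string s = "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
                   "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";

        // Greedy swallows everything between the first <bpt and the last </bpt>:
        Console.WriteLine(StripGreedy(s)); // The big cat<ept i="2"></u></ept> sleeps.

        // Lazy removes each bpt element on its own:
        Console.WriteLine(StripLazy(s));   // The big black<ept i="1"></b></ept> cat<ept i="2"></u></ept> sleeps.
    }
}
```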

davenpcj 17 years ago

I presume you want to drop the tags entirely?

(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)

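The ? after the * makes it non-greedy, so it will try to match as few characters as possible.

One problem you will have is nested tags: the pattern will not see the second one, because the first one has already matched.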
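The nesting problem can be reproduced with a small made-up input. In this sketch the nested string and the names NestedTagDemo and Strip are assumptions for illustration; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedTagDemo
{
    public static string Strip(string input)
    {
        // Pattern from the answer: lazy matching stops at the first closing tag.
        return Regex.Replace(input, @"(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)", "");
    }

    static void Main()
    {
        // Hypothetical nested input, not taken from the question.
        string nested = "a<bpt i=\"1\">x<bpt i=\"2\">y</bpt>z</bpt>b";

        // The outer opening tag pairs with the INNER closing tag,
        // so the tail of the outer element is left behind.
        Console.WriteLine(Strip(nested)); // az</bpt>b
    }
}
```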

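Andy Lester 17 years ago

Why do you say the overhead is too large? Did you measure it? Or are you guessing?

Using a regex instead of a proper parser is a shortcut that may trip you up when someone comes along with something like <bpt foo="bar>">.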

Torsten Marek 17 years ago

Does the .NET regex engine support negative lookaheads? If so, you can use

(<([eb])pt[^>]+>((?!</\2pt>).)+</\2pt>)

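This turns the string above into "The big black cat sleeps." if you remove all matches. Keep in mind, however, that it does not work if you have nested bpt / ept elements. You might also want to add \s in some places to allow for extra whitespace in closing elements etc.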
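.NET does support negative lookaheads, so the pattern can be used as it stands. The sketch below applies it through Regex.Replace and adds \s* before the final > of the closing tag, which is one possible reading of the whitespace suggestion. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // The answer's pattern; the two \s* are this sketch's addition so that
    // a closing tag written as "</bpt >" is accepted as well.
    static readonly Regex PtElement =
        new Regex(@"(<([eb])pt[^>]+>((?!</\2pt\s*>).)+</\2pt\s*>)");

    public static string Strip(string input)
    {
        return PtElement.Replace(input, "");
    }

    static void Main()
    {
        string s = "The big <bpt i=\"1\" x=\"1\" type=\"bold\"><b></bpt>black<ept i=\"1\"></b></ept> " +
                   "<bpt i=\"2\" x=\"2\" type=\"ulined\"><u></bpt>cat<ept i=\"2\"></u></ept> sleeps.";
        Console.WriteLine(Strip(s)); // The big black cat sleeps.

        // Made-up input with a space inside the closing tag.
        Console.WriteLine(Strip("x<bpt i=\"1\">y</bpt >z")); // xz
    }
}
```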

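Robert Rossney 17 years ago

If you are going to use a regex to remove XML elements, you had better be sure that your input XML does not use elements from different namespaces, or contain CDATA sections whose content you do not want to modify.

The proper (i.e. both performant and correct) way to do this is with XSLT. An XSLT transform that copies everything except a specific element to the output is a trivial extension of the identity transform. Once the transform is compiled, it will execute extremely quickly. And it will not contain any hidden defects.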
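As a rough sketch of that approach, the code below loads an identity transform with one extra empty template for bpt and ept and runs it through XslCompiledTransform. It assumes the input is well-formed XML: the wrapper element <seg> and the escaped inner markup are invented for this example, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class XsltStripDemo
{
    // Identity transform plus one empty template that drops bpt and ept elements.
    const string Stylesheet =
@"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='bpt|ept'/>
</xsl:stylesheet>";

    // Compiled once and reused for every call.
    static readonly XslCompiledTransform Compiled = Load();

    static XslCompiledTransform Load()
    {
        var transform = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(sheet);
        }
        return transform;
    }

    public static string Strip(string xml)
    {
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            Compiled.Transform(input, null, output);
            return output.ToString();
        }
    }

    static void Main()
    {
        // Hypothetical well-formed input; <seg> is an invented wrapper element.
        string xml = "<seg>The big <bpt i='1'>&lt;b&gt;</bpt>black" +
                     "<ept i='1'>&lt;/b&gt;</ept> cat sleeps.</seg>";
        Console.WriteLine(Strip(xml));
    }
}
```

Because the template for bpt|ept has a higher default priority than the node() template, those elements and everything inside them are skipped while the rest of the document is copied through.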

community wiki here4u 16 years ago

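Is there any possibility of getting a global solution for the Regex pattern for XML-type text? That way I would get rid of the Replace function and use the regex instead. The trouble is analyzing whether the < and > come in order or not, and also replacing reserved characters such as ' and & and so on. Here is the code:

' handling special chars functions
Friend Function ReplaceSpecChars(ByVal str As String) As String
  Dim arrLessThan As New Collection
  Dim arrGreaterThan As New Collection
  If Not IsDBNull(str) Then
    str = CStr(str)
    If Len(str) > 0 Then
      str = Replace(str, "&", "&amp;")
      str = Replace(str, "'", "&apos;")
      str = Replace(str, """", "&quot;")
      arrLessThan = FindLocationOfChar("<", str)
      arrGreaterThan = FindLocationOfChar(">", str)
      str = ChangeGreaterLess(arrLessThan, arrGreaterThan, str)
      str = Replace(str, Chr(13), "chr(13)")
      str = Replace(str, Chr(10), "chr(10)")
    End If
    Return str
  Else
    Return ""
  End If
End Function

Friend Function ChangeGreaterLess(ByVal lh As Collection, ByVal gr As Collection, ByVal str As String) As String
  For i As Integer = 0 To lh.Count
    If CInt(lh.Item(i)) > CInt(gr.Item(i)) Then
      str = Replace(str, "<", "&lt;") ' ///problems////
    End If
  Next

  str = Replace(str, ">", "&gt;")
End Function

Friend Function FindLocationOfChar(ByVal chr As Char, ByVal str As String) As Collection
  Dim arr As New Collection
  For i As Integer = 1 To str.Length() - 1
    If str.ToCharArray(i, 1) = chr Then
      arr.Add(i)
    End If
  Next
  Return arr
End Function

I am having trouble at the line marked "problems".

This is standard XML with different tags that I want to analyse.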

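Eamon Nerbonne 15 years ago

Did you measure this? I have run into performance problems using .NET's regex engine, but by contrast I have parsed XML files of around 40GB without problems using the XML parser (you will need to use XmlReader for larger strings, however).

Please post an actual code sample and state your performance requirements: I doubt that the Regex class is the best solution here if performance matters.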