Escaping quotes contained in certain HTML tags

mysql python

fredley · Tech Community · 14 years ago

I've made a mysqldump of a large database, about 300MB. It made a mistake, though: it did not escape any of the quotes contained inside <o:p>...</o:p> tags. Here is a sample:

...Text here\' escaped correctly, <o:p> But text in here isn't. </o:p> Out here all\'s well again...

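Is it possible to write a script (preferably in Python, but I'll take anything!) that can scan the file and fix these errors automatically? There are quite a lot of them, and Notepad++ doesn't handle a file of this size very well...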

1 answer | as of 14 years ago

Alex Martelli · 14 years ago

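If the "lines" your file is divided into are of reasonable length, and it contains no binary sequences that reading it as text would break, you can use fileinput's handy "make believe I'm rewriting the file in place" functionality: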

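   import re
   import sys
   import fileinput

   # Non-greedy match of one <o:p>...</o:p> section within a line
   tagre = re.compile(r"<o:p>.*?</o:p>")
   def sub(mo):
     # Escape every single quote inside the matched section
     return mo.group().replace(r"'", r"\'")

   # With inplace=True, standard output is redirected into the file
   for line in fileinput.input('thefilename', inplace=True):
     sys.stdout.write(tagre.sub(sub, line))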
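To see what the substitution does before letting it loose on a 300MB dump, here is a small Python 3 check on the sample line from the question. It is only an illustration: the `sample` string is copied from the question, and the pattern and helper are the same as above.

```python
import re

tagre = re.compile(r"<o:p>.*?</o:p>")

def sub(mo):
    # Escape every single quote inside the matched <o:p>...</o:p> section.
    return mo.group().replace("'", r"\'")

# Sample line from the question (raw string, so \' is backslash + quote).
sample = r"Text here\' escaped correctly, <o:p> But text in here isn't. </o:p> Out here all\'s well again"

fixed = tagre.sub(sub, sample)
print(fixed)
```

Only the quote inside the tags changes; the already-escaped quotes outside the tags are left alone.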

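If the file does not meet those conditions (reasonable line lengths, no binary sequences), you'll have to simulate the "in-place rewriting" yourself, e.g. (oversimplified...):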

   with open('thefilename', 'rb') as inf:
     with open('fixed', 'wb') as ouf:
       while True:
         b = inf.read(1024*1024)
         if not b: break
         ouf.write(tagre.sub(sub, b))

and then move 'fixed' to take the place of 'thefilename' (either in code or by hand) if the file needs to keep that name after the fix.
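If you want to do the move in code, something along these lines should work on Python 3.3 or later. This is a sketch, not part of the original recipe: `replace_original` and the `.bak` suffix are hypothetical names, and it assumes both files live on the same filesystem.

```python
import os

def replace_original(fixed_path, original_path, keep_backup=True):
    """Put the fixed file where the original was (hypothetical helper)."""
    if keep_backup:
        # Keep the defective dump as '<name>.bak' in case the fix misfired.
        os.replace(original_path, original_path + ".bak")
    # os.replace overwrites the destination if it already exists.
    os.replace(fixed_path, original_path)

# Example call, with the file names used in this answer:
# replace_original('fixed', 'thefilename')
```

Keeping the backup costs another 300MB of disk, but it means a bad substitution can be undone.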

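This is oversimplified because one of the crucial <o:p> ... </o:p> sections might end up split between two successive megabyte "blocks" and therefore go unrecognized. (In the first example, I'm assuming each such section is always fully contained within one "line"; if that's not the case, you should not use that code but the following one.) Fixing this requires, alas, more complicated code...: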

   with open('thefilename', 'rb') as inf:
     with open('fixed', 'wb') as ouf:
       while True:
         b = getblock(inf)
         if not b: break
         ouf.write(tagre.sub(sub, b))

with, e.g.:

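   # Prefixes of '<o:p>' that may be left dangling at the end of a block
   partsofastartag = '<', '<o', '<o:', '<o:p'
   def getblock(inf):
     b = ''
     while True:
       newb = inf.read(1024 * 1024)
       if not newb: return b   # end of file: return whatever we have
       b += newb
       # The block may end in the middle of an opening tag: read more
       if any(b.endswith(p) for p in partsofastartag):
         continue
       # A section was opened but not closed yet: read more
       if b.count('<o:p>') != b.count('</o:p>'):
         continue
       return b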
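Since that code is untested, one cheap way to gain some confidence is to run the same logic with a tiny block size and compare it against a whole-string substitution. Below is a Python 3 sketch of that idea. The `size` parameter, the `fix_stream` wrapper and the test data are my additions for illustration, and the pattern is a bytes pattern because the file is opened in binary mode.

```python
import io
import re

TAGRE = re.compile(rb"<o:p>.*?</o:p>")
PARTIAL_OPEN = (b"<", b"<o", b"<o:", b"<o:p")

def sub(mo):
    # Escape every single quote inside the matched section.
    return mo.group().replace(b"'", b"\\'")

def getblock(inf, size=1024 * 1024):
    """Read one chunk, extending it until no section straddles its end."""
    b = b""
    while True:
        newb = inf.read(size)
        if not newb:
            return b              # end of file
        b += newb
        if b.endswith(PARTIAL_OPEN):
            continue              # possibly cut inside an opening tag
        if b.count(b"<o:p>") != b.count(b"</o:p>"):
            continue              # a section is still open
        return b

def fix_stream(inf, ouf, size=1024 * 1024):
    """Copy inf to ouf, escaping quotes inside <o:p> sections."""
    while True:
        b = getblock(inf, size)
        if not b:
            break
        ouf.write(TAGRE.sub(sub, b))

# Tiny blocks force the tags to fall across block boundaries.
data = b"a'b <o:p> isn't </o:p> c <o:p> it's </o:p>"
out = io.BytesIO()
fix_stream(io.BytesIO(data), out, size=5)
print(out.getvalue() == TAGRE.sub(sub, data))
```

A check like this only covers balanced tags; it says nothing about how the code behaves on the defective cases discussed next.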

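As you can see, this is pretty delicate code, and since it is untested, I can't know that it is correct for your problem. In particular, can there be cases of an <o:p> that is not matched by a closing </o:p>, or vice versa? If so, a call to getblock could end up returning the whole file in a rather costly way, and even the RE matching and substitution might backfire (the latter would also happen if some of the single quotes inside such tags are already properly escaped, but not all of them).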
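If the dump turns out to mix escaped and unescaped quotes inside the tags, one possible workaround is to escape only those quotes that are not already preceded by a backslash. The sketch below (Python 3) does that with a negative lookbehind. `sub_unescaped_only` is a hypothetical replacement for `sub`, and it assumes that a backslash directly before a quote always means "already escaped", which is wrong for text that ends in an escaped backslash followed by a bare quote.

```python
import re

tagre = re.compile(r"<o:p>.*?</o:p>")
# A single quote that is NOT immediately preceded by a backslash.
bare_quote = re.compile(r"(?<!\\)'")

def sub_unescaped_only(mo):
    # A function is used as the replacement so the backslash is taken literally.
    return bare_quote.sub(lambda m: "\\'", mo.group())

line = r"<o:p> isn't, but this one\'s fine </o:p>"
once = tagre.sub(sub_unescaped_only, line)
twice = tagre.sub(sub_unescaped_only, once)
print(once)
print(once == twice)
```

Because a second pass changes nothing, rerunning the fix on a partly repaired file should not double the backslashes.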

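If you have at least a GB or so of memory, you can at least avoid the delicate block-subdivision issue, since everything should fit in memory, which makes the code much simpler: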

   with open('thefilename', 'rb') as inf:
     with open('fixed', 'wb') as ouf:
       b = inf.read()
       ouf.write(tagre.sub(sub, b))

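But the other issues mentioned above (possibly unbalanced opening/closing tags, etc.) might remain. Only you can study your existing defective data and see whether it lends itself to such a reasonably simple fix!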
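As a starting point for studying the data, a few counts can tell you whether the tags are balanced and whether some quotes inside them are already escaped. This is a rough Python 3 sketch that reads the whole file into memory. `survey` and the keys of the dictionary it returns are names I made up, and "escaped" here simply means "preceded by a backslash".

```python
import re

def survey(data):
    """Return simple counts to judge whether the regex fix is safe (sketch)."""
    sections = re.findall(rb"<o:p>.*?</o:p>", data)
    return {
        "open_tags": data.count(b"<o:p>"),
        "close_tags": data.count(b"</o:p>"),
        "matched_sections": len(sections),
        # Quotes inside sections that already have a backslash before them.
        "escaped_quotes": sum(s.count(b"\\'") for s in sections),
        # Quotes inside sections with no backslash before them.
        "bare_quotes": sum(len(re.findall(rb"(?<!\\)'", s)) for s in sections),
    }

# Example call, with the file name used in this answer:
# with open('thefilename', 'rb') as inf:
#     print(survey(inf.read()))
```

If `open_tags` and `close_tags` differ, or `escaped_quotes` is not zero, the simple substitution is likely to need adjusting first.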