
Is it possible to clone an xmlTextReader (or read a document multiple times)?

xmlreader libxml2 xml-parsing xml c

Martin Ba · 11 years ago

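I currently have to fix an existing application so that it no longer uses the DOM interface of libxml2, because it gets passed XML files that are too large to load into memory.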

I have rewritten the data loading from iterating over the DOM tree to using xmlTextReader, for the most part without too many problems. (I use xmlNewTextReaderFilename to open a local file.)

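However, it turns out that the subtree holding the large data cannot be read in document order: I first have to collect a small amount of data that comes after it. (The catch is that this very subtree contains the bulk of the data, so loading just this subtree into memory does not make much sense either.)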

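The simplest approach would be to "clone" / "copy" my current reader, read ahead, and then return to the original instance to continue reading. (It seems I am not the first one ... there is even something implemented on the C# side: XML Reader with Bookmarks.)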

However, there does not seem to be any way to "copy" the state of an xmlTextReader.

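If I cannot re-read part of the file, I could also live with re-reading the whole file. That is wasteful, but acceptable here. Even so, I would still need to remember where I was before.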

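Is there a simple way for an xmlTextReader to remember its position in the current document, so that I can find that position again when reading the document/file a second time?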

Here is an example of the problem:

<root>
  <cat1>
    <data attrib="x1">
      ... here goes up to one GB in stuff ...
    </data>
    <data attrib="y2"> <!-- <<< Want to remember this position without having to re-read the stuff before -->
      ... even more stuff ...
    </data>
    <data attrib="z3">
       <!-- I need (part of) the data here to meaningfully interpret the data in [y2] that 
            came before. The best approach would seem to first skip all that data
            and then start back there at <data attrib="y2"> ... not having to re-read
            the whole [x1] data would be a big plus! -->
    </data>
  </cat1>
  ...
</root>

1 answer | as of 11 years ago

Martin Ba 11 years ago

I would like to answer this with what I learned at the XML mailing list:

There is no easy way to "clone" the state of an xmlReader. What should be possible, and quite easy, is to count the number of reads performed on the document.

That is, to read a document with the xmlReader, you will probably call something like this:

// looping ...
status = ::xmlTextReaderRead(pReader);

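If you do this in a structured way (for example, I ended up writing a small wrapper class that encapsulates the xmlReader usage pattern), then adding a counter is relatively easy: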

// looping ...
status = ::xmlTextReaderRead(pReader);
if (1 == status) { // success
  ++m_ReadCounter;
}

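To re-read the document up to a certain position, you simply call xmlTextReaderRead m_ReadCounter times on a freshly opened reader, discarding the results, until you reach the position where you want to resume.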
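The count-and-replay idea can be sketched roughly as follows. This is only a minimal illustration, not the wrapper class mentioned above: `CountingReader`, `Position`, and `SkipTo` are hypothetical names, and the reader step is passed in as a callable so the sketch compiles without libxml2. It assumes the callable follows the return convention of xmlTextReaderRead (1 on success, 0 at end of document, -1 on error), and that both passes open the file with the same parser options so that they see the same sequence of nodes.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper (not part of libxml2): counts successful reads so that
// a position can be recorded on one pass and found again on a second pass.
class CountingReader {
public:
    // `step` advances the underlying reader by one node and must return
    // 1 on success, 0 at end of document, and -1 on error.
    explicit CountingReader(std::function<int()> step)
        : m_Step(std::move(step)) {}

    // Advance by one node; count the read only if it succeeded.
    int Read() {
        const int status = m_Step();
        if (status == 1) {
            ++m_ReadCounter;
        }
        return status;
    }

    // Number of successful reads so far; store this value as a "bookmark".
    std::size_t Position() const { return m_ReadCounter; }

    // On a freshly opened reader, discard nodes until `bookmark` successful
    // reads have been made. Returns false if the document ends or an error
    // occurs before the bookmark is reached.
    bool SkipTo(std::size_t bookmark) {
        while (m_ReadCounter < bookmark) {
            if (Read() != 1) {
                return false;
            }
        }
        return true;
    }

private:
    std::function<int()> m_Step;
    std::size_t m_ReadCounter = 0;
};
```

With libxml2, the step would presumably be something like `[pReader] { return ::xmlTextReaderRead(pReader); }`. On the first pass you would save `Position()` when you reach `<data attrib="y2">`, read ahead to `z3`, then open a second reader on the same file and call `SkipTo` with the saved value.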

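Yes, you have to re-parse the whole document, but that may well be fast enough. (It may actually be better/faster than caching a very large part of the document.)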