Parsing a docx with the ElementTree module

  •  0
  • Hypothetical Ninja  · 10 years ago

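    I have this document, and I need to parse it to get its XML equivalent. Basically, I need an object of type ElementTree, but I haven't managed to get one. I have tried many different combinations and still haven't figured it out. Here is what I did: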

    import zipfile as zf
    import xml.etree.ElementTree as ET
    z = zf.ZipFile("INTRODUCTION.docx")
    doc_xml = z.read("word/document.xml")
    print doc_xml           # type(doc_xml) is str
    
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14"><w:body><w:p w:rsidR="00470EEF" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>This is a test document for xml</w:t></w:r><w:r><w:t>.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:proofErr w:type="spellStart"/><w:proofErr w:type="gramStart"/><w:r><w:t>Lets</w:t></w:r><w:proofErr w:type="spellEnd"/><w:proofErr w:type="gramEnd"/><w:r><w:t xml:space="preserve"> see how this works.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r 
w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>Conclusion</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>It should hopefully</w:t></w:r><w:r><w:t>..</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr w:rsidR="00456755" w:rsidRPr="00456755"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>
    

    Since doc_xml is a string, I used the following to get an Element:

    rooted = ET.fromstring(doc_xml)    #type(rooted) is 'Element'
    type(rooted)
    

    And this as well:

    tree = ET.ElementTree(doc_xml)  #type(tree) is 'ElementTree'
    type(tree)
    

    I thought this had worked, but when I do this:

    for branch in tree.iter():
        print branch  
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-83-d503315fb5e6> in <module>()
    ----> 1 for branch in tree.iter():
          2     print branch
    
    C:\Anaconda\lib\xml\etree\ElementTree.pyc in iter(self, tag)
        671     def iter(self, tag=None):
        672         # assert self._root is not None
    --> 673         return self._root.iter(tag)
        674 
        675     # compatibility
    
    AttributeError: 'str' object has no attribute 'iter'
    

    The variable tree is of type ElementTree. How can I fix this?

    1 answer  |  as of 10 years ago
        1
  •  3
  •   mzjn    10 years ago

    With this line,

    rooted = ET.fromstring(doc_xml) 
    

    you get an Element by parsing an XML document given as a string. You can iterate over this instance:

    for branch in rooted.iter():
        print(branch)
    

    When you do this,

    tree = ET.ElementTree(doc_xml)
    

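    you create an ElementTree with a string as its argument. The first argument of ElementTree() is expected to be a root Element, so the string is simply stored as the root without being parsed. This does not raise an error here, but trying to iterate over the tree fails because it is not a "real" tree: the traceback shows iter() being called on self._root, which is still a plain str.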
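
    If the XML string is already in hand, a real tree can also be built by parsing it first and wrapping the resulting Element. The following is only a minimal sketch, not part of the original answer; the short doc_xml value is a made-up stand-in for the contents of word/document.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes read from word/document.xml.
doc_xml = b"<doc><p>INTRODUCTION</p><p>Conclusion</p></doc>"

root = ET.fromstring(doc_xml)   # parses the XML and returns the root Element
tree = ET.ElementTree(root)     # wraps the parsed root in an ElementTree

# Iteration now works because the root is an Element, not a str.
tags = [elem.tag for elem in tree.iter()]
print(tags)
```

    The difference from the failing call is that ElementTree() receives an already parsed Element rather than the raw string.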


    If you need an ElementTree instance, I suggest doing it like this:

    import xml.etree.ElementTree as ET
    import zipfile as zf
    
    z = zf.ZipFile("INTRODUCTION.docx")
    f = z.open("word/document.xml")   # a file-like object
    tree = ET.parse(f)                # parses the XML and returns an ElementTree instance
    
    for elem in tree.iter():          # walks every element, starting at the root
        print(elem)
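
    As a possible follow-up, the parsed tree can be used to pull out paragraph text. This is a hedged sketch rather than part of the original answer: the helper name paragraph_texts and the tiny in-memory archive are hypothetical, introduced only so the example runs without a real .docx file. The namespace URI is the xmlns:w value from the document.xml shown in the question.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:w declaration in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(docx_file):
    """Return the text of each non-empty w:p paragraph (hypothetical helper)."""
    with zipfile.ZipFile(docx_file) as z:
        with z.open("word/document.xml") as f:
            tree = ET.parse(f)  # an ElementTree instance
    texts = []
    # ElementTree reports namespaced tags as "{uri}localname".
    for p in tree.iter("{%s}p" % W_NS):
        runs = [t.text or "" for t in p.iter("{%s}t" % W_NS)]
        if runs:
            texts.append("".join(runs))
    return texts


# Build a tiny stand-in archive in memory (assumption: not a full .docx).
sample_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>INTRODUCTION</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>This is a test document for xml</w:t></w:r>'
    '<w:r><w:t>.</w:t></w:r></w:p>'
    '<w:p/>'
    '</w:body></w:document>' % W_NS
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", sample_xml)
buf.seek(0)

print(paragraph_texts(buf))
```

    With a real file, paragraph_texts("INTRODUCTION.docx") would be the equivalent call. Joining the w:t runs per paragraph is a simplification that ignores tabs, line breaks, and other run content.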