
Unable to count the characters in a docx file generated from non-empty XHTML

  • Cláudio  ·  10 years ago

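    I implemented an XHTML-to-docx converter using docx4j. It creates the docx file without any problems.

    To round off the work, I decided to write a simple test. The test counts the characters in the generated docx and compares the result with the known number of characters in the XHTML (see the source code below).

    My test code is based on an example from the docx4j site, but it does not work for me. Although I can see that the docx created by the converter has the same content as the XHTML file, my test code always reports a character count of zero for the docx file. :-\

    Can anyone help me find the cause of this unexpected result?

    Thanks in advance!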

    package main;
    
    import java.io.File;
    import java.io.IOException;
    import java.io.StringWriter;
    
    import org.docx4j.TextUtils;
    import org.docx4j.jaxb.Context;
    import org.docx4j.openpackaging.contenttype.ContentType;
    import org.docx4j.openpackaging.exceptions.Docx4JException;
    import org.docx4j.openpackaging.exceptions.InvalidFormatException;
    import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
    import org.docx4j.openpackaging.parts.PartName;
    import org.docx4j.openpackaging.parts.WordprocessingML.AlternativeFormatInputPart;
    import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
    import org.docx4j.relationships.Relationship;
    import org.docx4j.wml.CTAltChunk;
    import org.docx4j.wml.Document;
    
    /**
     * Counts the characters in a docx file generated from XHTML using docx4j.
     * 
     * @author Cláudio
     */
    public class CountChars {
    
        public static void main(String[] args) {
            String xhtml = "<html><body><table border=\"1\"><tr><td>Propriedade</td><td>Amostra 1</td><td>Amostra 2</td></tr><tr><td>Prop1</td><td>10.0</td><td>111.0</td></tr><tr><td>Prop2</td><td>20.0</td><td>222.0</td></tr></table></body></html>";
            int expectedNChars = 57;

            WordprocessingMLPackage docx = export(xhtml);
            try {
                // Saving proves that the docx is successfully created
                docx.save(new File("test.docx"));
            } catch (Docx4JException e) {
                e.printStackTrace();
            }

            if (countCharacters(docx) == expectedNChars) {
                System.out.println("Success");
            } else {
                System.out.println("Fail");
            }
        }
    
        private static WordprocessingMLPackage export(String xhtml) {
            WordprocessingMLPackage wordMLPackage = null;
            AlternativeFormatInputPart afiPart = null;
            Relationship altChunkRel = null;

            try {
                wordMLPackage = WordprocessingMLPackage.createPackage();
                afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
            } catch (InvalidFormatException e) {
                e.printStackTrace();
            }

            afiPart.setBinaryData(xhtml.getBytes());
            afiPart.setContentType(new ContentType("text/html"));

            try {
                altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
            } catch (InvalidFormatException e) {
                e.printStackTrace();
            }

            // .. the bit in document body
            CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
            ac.setId(altChunkRel.getId());
            wordMLPackage.getMainDocumentPart().addObject(ac);

            // .. content type
            wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");

            return wordMLPackage;
        }
    
        /**
         * Counts the characters (including whitespace) in a docx.
         *
         * Reference:
         * http://www.docx4java.org/forums/docx-java-f6/how-to-count-number-of-characters-in-a-docx-file-t767.html
         *
         * @param docx
         *            the document
         *
         * @return the number of characters in the document
         */
        private static int countCharacters(WordprocessingMLPackage docx) {
            String strString = null;

            MainDocumentPart documentPart = docx.getMainDocumentPart();
            Document wmlDocument = documentPart.getJaxbElement();

            StringWriter strWriter = null;
            try {
                strWriter = new StringWriter();
                TextUtils.extractText(wmlDocument, strWriter);
                strString = strWriter.toString();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (strWriter != null) {
                    try {
                        strWriter.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }

            if (strString == null) {
                throw new NullPointerException();
            }

            return strString.length();
        }
    }
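
The hard-coded expectation of 57 characters in the test above can be checked independently of docx4j. The following is a minimal sketch that uses only the JDK's DOM parser and assumes the XHTML is well-formed XML, as the sample string is. The class `ExpectedCharCount` and its method `countTextChars` are hypothetical names introduced for illustration; they are not part of the original post or of docx4j.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of docx4j or the original test.
class ExpectedCharCount {

    /** Returns the number of characters held in the text nodes of a well-formed XHTML string. */
    static int countTextChars(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() concatenates all descendant text nodes; markup is excluded.
            return doc.getDocumentElement().getTextContent().length();
        } catch (Exception e) {
            throw new IllegalStateException("XHTML could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><table border=\"1\"><tr><td>Propriedade</td><td>Amostra 1</td>"
                + "<td>Amostra 2</td></tr><tr><td>Prop1</td><td>10.0</td><td>111.0</td></tr>"
                + "<tr><td>Prop2</td><td>20.0</td><td>222.0</td></tr></table></body></html>";
        System.out.println(countTextChars(xhtml)); // prints 57
    }
}
```

For the sample table this gives 57, which matches `expectedNChars`, so the expected value itself is not the source of the failure.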
    2 Answers
        1
  • JasonPlutext  ·  10 years ago

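    You are adding the XHTML as an AlternativeFormatInputPart (AFIP), which typically leaves it to Word to convert the XHTML into real docx content.

    Until that happens, the XHTML content is not in the MainDocumentPart but in the AFIP. So, of course, counting the text in the main document part will not give you what you hoped for...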
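
To make the point above concrete, here is a rough sketch that uses only the JDK's DOM parser. It counts the characters inside `w:t` (text run) elements of two hand-written main document bodies: one that contains only a `w:altChunk` reference, and one that contains a real paragraph. Both XML strings are simplified assumptions written for illustration, not actual docx4j output, and `AltChunkTextDemo` and `countRunText` are hypothetical names. This is not how `TextUtils.extractText` is implemented; it only illustrates why a body holding nothing but an altChunk yields no text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the XML samples below are simplified, hand-written assumptions.
class AltChunkTextDemo {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Main document body that only points at an embedded HTML part.
    static final String ALT_CHUNK_DOC =
            "<w:document xmlns:w=\"" + W_NS + "\" xmlns:r=\"" + R_NS + "\">"
            + "<w:body><w:altChunk r:id=\"rId1\"/></w:body></w:document>";

    // Main document body after the content has been converted to real runs.
    static final String CONVERTED_DOC =
            "<w:document xmlns:w=\"" + W_NS + "\">"
            + "<w:body><w:p><w:r><w:t>Prop1</w:t></w:r></w:p></w:body></w:document>";

    /** Sums the lengths of all w:t elements found in the given document XML. */
    static int countRunText(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            int total = 0;
            for (int i = 0; i < texts.getLength(); i++) {
                total += texts.item(i).getTextContent().length();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("document XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countRunText(ALT_CHUNK_DOC)); // prints 0
        System.out.println(countRunText(CONVERTED_DOC)); // prints 5
    }
}
```

The altChunk body contains no text runs at all, so any count taken over the main document part alone comes out as zero until the embedded XHTML has been converted.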

        2
  • Cláudio  ·  10 years ago

    With docx4j 2.8.1, the correct implementation of the export method is as follows:

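    // Additional imports needed beyond those in the question
    import java.util.List;
    import org.docx4j.convert.in.xhtml.XHTMLImporter;

    private static WordprocessingMLPackage export(String xhtml) {
        WordprocessingMLPackage wordMLPackage = null;

        try {
            wordMLPackage = WordprocessingMLPackage.createPackage();
            // Convert the XHTML into docx content and add it to the main
            // document part, instead of embedding it as an altChunk
            List<Object> content = XHTMLImporter.convert(xhtml, null, wordMLPackage);
            wordMLPackage.getMainDocumentPart().getContent().addAll(content);
        } catch (Docx4JException e) {
            e.printStackTrace();
        }

        return wordMLPackage;
    }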