
How to build a well-formed XML document from HTML using Saxon?

  • Thufir  · Tech community  · 6 years ago

    The specific error is Exception in thread "main" java.net.MalformedURLException: no protocol, even though the input is an html URL.

    Using Saxon-HE and tagsoup, is it necessary to validate the streamResult first?

    Reading the console output, it looks like the html gets wrapped in xml when the Document is built from the streamResult.

    thufir@dur:~/NetBeansProjects/helloWorldSaxon$ gradle clean run
    
    > Task :run
    Exception in thread "main" java.net.MalformedURLException: no protocol: <?xml version="1.0" encoding="UTF-8"?><!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html xmlns:html="http://www.w3.org/1999/xhtml" class="no-js" lang="en-us"><!--<![endif]-->
       <head>
          <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
          <title>
        All products | Books to Scrape - Sandbox
    </title>
          <meta name="created" content="24th Jun 2016 09:29" />
          <meta name="description" content="" />
          <meta name="viewport" content="width=device-width" />
          <meta name="robots" content="NOARCHIVE,NOCACHE" />
          <!-- Le HTML5 shim, for IE6-8 support of HTML elements --><!--[if lt IE 9]>
            <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
            <![endif]-->
          <link rel="shortcut icon" href="static/oscar/favicon.ico" />
          <link rel="stylesheet" type="text/css" href="static/oscar/css/styles.css" />
          <link rel="stylesheet" href="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css" />
          <link rel="stylesheet" type="text/css" href="static/oscar/css/datetimepicker.css" />
       </head>
    

    ..

          <!-- Version: N/A -->
    
          </body>
    </html>
            at java.net.URL.<init>(URL.java:593)
            at java.net.URL.<init>(URL.java:490)
            at java.net.URL.<init>(URL.java:439)
            at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:620)
            at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:148)
            at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:806)
            at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
            at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
            at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
            at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
            at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
            at helloWorldSaxon.HandlerForXML.parseFromURL(HandlerForXML.java:53)
            at helloWorldSaxon.App.scrapeHTML(App.java:26)
            at helloWorldSaxon.App.main(App.java:19)
    
    > Task :run FAILED
    
    FAILURE: Build failed with an exception.
    
    * What went wrong:
    Execution failed for task ':run'.
    > Process 'command '/usr/lib/jvm/java-8-openjdk-amd64/bin/java'' finished with non-zero exit value 1
    
    * Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    BUILD FAILED in 3s
    4 actionable tasks: 4 executed
    thufir@dur:~/NetBeansProjects/helloWorldSaxon$ 
    

    Notably, there is no closing xml tag.

    Code:

        public void parseFromURL() throws SAXException, ParserConfigurationException, IOException, TransformerException {
            StringWriter writer = new StringWriter();
            StreamResult streamResult = new StreamResult(writer);
    
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            XMLReader xmlReader = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
            Source source = new SAXSource(xmlReader, new InputSource(url.toString()));
    
            Transformer transformer = transformerFactory.newTransformer();
            transformer.transform(source, streamResult);
    
            String stringResult = writer.toString();
            LOG.fine(stringResult);
    
            DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = documentBuilderFactory.newDocumentBuilder();
            Document document;
            document = builder.parse(stringResult);
    
        }
    

    Looking to build xml from stringResult.
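
    For reference, the stack trace ends in DocumentBuilder.parse(String), and that overload treats its argument as a URI rather than as XML text, which would explain the "no protocol" message followed by the whole document. A minimal sketch of parsing the text itself, assuming the serialized output is well-formed; the class name ParseXmlString, the helper parseXml and the sample markup are hypothetical stand-ins, not part of the original project:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseXmlString {

    // Parses XML that is already held in memory.
    // DocumentBuilder.parse(String) would read the argument as a URI,
    // so the text is wrapped in an InputSource over a StringReader instead.
    static Document parseXml(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the string produced by the transformer.
        String stringResult = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html><head><title>All products</title></head><body/></html>";
        Document document = parseXml(stringResult);
        System.out.println(document.getDocumentElement().getNodeName());
    }
}
```

    In parseFromURL() this would amount to replacing builder.parse(stringResult) with builder.parse(new InputSource(new StringReader(stringResult))). Another option, if the string itself is not needed, might be to transform into a javax.xml.transform.dom.DOMResult instead of a StreamResult and skip the round trip through text.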
