How to convert HTML to text while preserving the exact div positions (left, top, height, width)

  • alias_neo92  · Tech Community  · 4 years ago

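    I have tried all of the current HTML-to-text conversion tools, such as html2text and BeautifulSoup. When converting HTML to text, they lose the positions of the div boxes and print the text sequentially.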

    For HTML like this:

    <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:107px; top:372px; width:89px; height:126px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
    
    <br>Location :
    <br>Date:
    <br>Date_Assigned:
    <br>Date_Inspected:
    </div>
    <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:215px; top:375px; width:248px; height:140px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
    <br>USA
    <br>July 4, 2018
    <br>July 5, 2018
    <br>July 9, 2018
    
    
    

    The plain-text output I get from BeautifulSoup's get_text() is:

    Location : Date: Date_Assigned:Date_Inspected:USA July 4, 2018July 5, 2018July 9, 2018
    

    From html2text, the output is:

    Location :  
    Date:  
    Date_Assigned:  
    Date_Inspected:
    
    
    USA  
    July 4, 2018  
    July 5, 2018  
    July 9, 2018
    

    Taking the positions of the two divs into account, the expected output is:

    <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:107px; top:2px; width:89px; height:126px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
    
    <br>Location :
    <br>Date:
    <br>Date_Assigned:
    <br>Date_Inspected:
    </div>
    <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:215px; top:2px; width:248px; height:140px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
    <br>USA
    <br>July 4, 2018
    <br>July 5, 2018
    <br>July 9, 2018

    Is it possible to convert this to text that preserves the div positions, using Beautiful Soup or any other available Python package?
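
For illustration only, here is a minimal sketch of one way this could be approached with nothing but the standard library: read `left` and `top` from each div's inline style, split the div's text at `<br>`, and write each line onto a character grid. The names `PositionedDivParser`, `render` and `html_to_positioned_text` are hypothetical, and the pixel-to-character ratios (`char_px=7`, `line_px=18`) are guesses that would need tuning for a real font. The sample is trimmed from the HTML above; blank lines inside a div are dropped, and nested positioned divs are not handled.

```python
import re
from html.parser import HTMLParser

# Trimmed version of the HTML in the question (font and border styles removed).
SAMPLE = """
<div style="position:absolute; left:107px; top:372px; width:89px; height:126px;"><span style="font-size:15px">

<br>Location :
<br>Date:
<br>Date_Assigned:
<br>Date_Inspected:
</div>
<div style="position:absolute; left:215px; top:375px; width:248px; height:140px;"><span style="font-size:15px">
<br>USA
<br>July 4, 2018
<br>July 5, 2018
<br>July 9, 2018
"""


class PositionedDivParser(HTMLParser):
    """Collect (left_px, top_px, lines) for each <div> with left/top in its style."""

    def __init__(self):
        super().__init__()
        self.boxes = []    # finished boxes
        self._box = None   # box currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            style = dict(attrs).get("style") or ""
            left = re.search(r"(?<![\w-])left:\s*(\d+)px", style)
            top = re.search(r"(?<![\w-])top:\s*(\d+)px", style)
            if left and top:
                self._flush()
                self._box = [int(left.group(1)), int(top.group(1)), [""]]
        elif tag == "br" and self._box is not None:
            self._box[2].append("")  # <br> starts a new line inside the box

    def handle_data(self, data):
        if self._box is not None:
            self._box[2][-1] += data

    def handle_endtag(self, tag):
        if tag == "div":
            self._flush()

    def close(self):
        super().close()
        self._flush()  # the last <div> in the sample is never closed

    def _flush(self):
        if self._box is not None:
            left, top, raw = self._box
            lines = [" ".join(s.split()) for s in raw]  # collapse whitespace
            self.boxes.append((left, top, [s for s in lines if s]))
            self._box = None


def render(boxes, char_px=7, line_px=18):
    """Place each line on a character grid; char_px and line_px are assumptions."""
    rows = {}
    for left, top, lines in sorted(boxes):  # leftmost boxes are written first
        first_row = round(top / line_px)    # snap nearby tops to the same row
        col = left // char_px
        for i, text in enumerate(lines):
            row = rows.get(first_row + i, "")
            if len(row) < col:
                row = row.ljust(col)        # pad out to the target column
            elif row:
                row += " "                  # column already taken: keep one space
            rows[first_row + i] = row + text
    if not rows:
        return ""
    return "\n".join(rows.get(r, "") for r in range(min(rows), max(rows) + 1))


def html_to_positioned_text(html, char_px=7, line_px=18):
    parser = PositionedDivParser()
    parser.feed(html)
    parser.close()
    return render(parser.boxes, char_px, line_px)


if __name__ == "__main__":
    print(html_to_positioned_text(SAMPLE))
```

With these guessed ratios the two divs land on the same four rows, so each label sits on the same line as its value. Rounding `top / line_px` is what absorbs the 3px difference between `top:372px` and `top:375px`; a different `line_px` could split them onto separate rows, so clustering divs by `top` with a tolerance may be more robust.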

    0 replies  |  as of 4 years ago