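I have tried the current HTML-to-text conversion tools, such as html2text and BeautifulSoup. When converting HTML to text, they discard the positions of the div boxes and simply print the text in document order.
For HTML like this: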
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:107px; top:372px; width:89px; height:126px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
<br>Location :
<br>Date:
<br>Date_Assigned:
<br>Date_Inspected:
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:215px; top:375px; width:248px; height:140px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
<br>USA
<br>July 4, 2018
<br>July 5, 2018
<br>July 9, 2018
The plain-text output I get from BeautifulSoup's get_text() is:
Location : Date: Date_Assigned:Date_Inspected:USA July 4, 2018July 5, 2018July 9, 2018
From html2text, the output is:
Location :
Date:
Date_Assigned:
Date_Inspected:
USA
July 4, 2018
July 5, 2018
July 9, 2018
If the positions of the two divs are taken into account, the expected output is:
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:107px; top:2px; width:89px; height:126px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
<br>Location :
<br>Date:
<br>Date_Assigned:
<br>Date_Inspected:
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:215px; top:2px; width:248px; height:140px;"><span style="font-family: b\'TimesNewRomanPS-BoldMT\'; font-size:15px">
<br>USA
<br>July 4, 2018
<br>July 5, 2018
<br>July 9, 2018
Is it possible to convert this to text that preserves the div positions, using Beautiful Soup or any other available Python package?
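
To make the goal concrete, here is a minimal sketch of one possible direction, not a tested solution. It uses only the standard library (`html.parser` and `re`) rather than Beautiful Soup, reads `left` and `top` from each div's inline `style`, treats divs whose `top` values are within a few pixels of each other as one row, and prints their lines side by side. The names `PositionedDivParser`, `layout_side_by_side`, `row_tolerance`, and `gap` are hypothetical, and the sketch assumes pixel units, inline styles, and no nested divs. The `sample` string is a trimmed version of the HTML above.

```python
import re
from html.parser import HTMLParser
from itertools import zip_longest


class PositionedDivParser(HTMLParser):
    """Collect (left, top, lines) for each div that has left/top in its style."""

    def __init__(self):
        super().__init__()
        self.boxes = []       # finished boxes: (left, top, [lines])
        self._current = None  # box being filled, or None outside a div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self._close_box()  # assumption: divs are not nested
        style = dict(attrs).get("style") or ""
        left = re.search(r"(?<![-\w])left:\s*(\d+)px", style)
        top = re.search(r"(?<![-\w])top:\s*(\d+)px", style)
        if left and top:
            self._current = (int(left.group(1)), int(top.group(1)), [])

    def handle_endtag(self, tag):
        if tag == "div":
            self._close_box()

    def handle_data(self, data):
        text = data.strip()
        if text and self._current is not None:
            self._current[2].append(text)

    def close(self):
        super().close()
        self._close_box()  # tolerate a missing final </div>

    def _close_box(self):
        if self._current is not None:
            self.boxes.append(self._current)
            self._current = None


def layout_side_by_side(html, row_tolerance=10, gap=2):
    """Render positioned divs as text columns, ordered by top, then left."""
    parser = PositionedDivParser()
    parser.feed(html)
    parser.close()

    # Group boxes whose top offset is within row_tolerance px of the row's first box.
    rows = []
    for box in sorted(parser.boxes, key=lambda b: b[1]):
        if rows and box[1] - rows[-1][0][1] <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])

    out = []
    for row in rows:
        row.sort(key=lambda b: b[0])  # left to right within a row
        columns = [lines for _, _, lines in row]
        widths = [max((len(s) for s in col), default=0) + gap for col in columns]
        for cells in zip_longest(*columns, fillvalue=""):
            line = "".join(c.ljust(w) for c, w in zip(cells, widths))
            out.append(line.rstrip())
    return "\n".join(out)


sample = """
<div style="position:absolute; left:107px; top:372px; width:89px; height:126px;"><span style="font-size:15px">
<br>Location :
<br>Date:
<br>Date_Assigned:
<br>Date_Inspected:
</div>
<div style="position:absolute; left:215px; top:375px; width:248px; height:140px;"><span style="font-size:15px">
<br>USA
<br>July 4, 2018
<br>July 5, 2018
<br>July 9, 2018
</div>
"""

print(layout_side_by_side(sample))
```

Column widths here come from the longest line in each box rather than from the pixel `width`, which is a simplification. A more faithful layout would map pixel offsets to character columns.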