Extract all <script> tags from an HTML page and append them to the bottom of the document

beautifulsoup python

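Mridang Agarwalla · Tech Community · 14 years ago

Can anyone tell me how to extract and remove the <script> tags in an HTML document and add them to the end of the document, right before </body></html>? I'd like to avoid using lxml if possible.

Thanks.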

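1 answer | last activity 7 years ago

bigbounty 7 years ago

This answer is simplistic and probably misses a lot of nuances. Still, it should give you an idea of how to go about it. I'm sure it can be improved, and you should be able to do that quickly with the help of the documentation.

Reference documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html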

from bs4 import BeautifulSoup

# <p> and <body> are closed explicitly so that html.parser builds the expected tree.
doc = ['<html><script type="text/javascript">document.write("Hello World!")',
       '</script><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
       '</body></html>']
# Name the parser explicitly; html.parser ships with Python, so lxml is not needed.
soup = BeautifulSoup(''.join(doc), 'html.parser')

for tag in soup.find_all('script'):
    # extract() removes the tag from the tree and returns it
    tag.extract()
    # append() re-inserts it as the last child of <body>
    soup.body.append(tag)

print(soup.prettify())

Output:

<html>
 <head>
  <title>
   Page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">
   This is paragraph
   <b>
    one
   </b>
   .
  </p>
  <p id="secondpara" align="blah">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
  <script type="text/javascript">
   document.write("Hello World!")
  </script>
 </body>
</html>
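
The same idea can be wrapped in a small reusable function. This is only a sketch under a few assumptions: the name move_scripts_to_body_end is made up for illustration, the built-in html.parser is used so that lxml is not required, and when the document has no <body> the scripts are simply appended to the end of the parsed document.

```python
from bs4 import BeautifulSoup


def move_scripts_to_body_end(html):
    """Return the HTML with every <script> tag moved to the end of <body>.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: fall back to the document root when there is no <body>.
    target = soup.body if soup.body is not None else soup
    # find_all() returns a list-like result, so moving tags while looping is safe.
    for script in soup.find_all("script"):
        # extract() detaches the tag; append() puts it last, preserving order.
        target.append(script.extract())
    return str(soup)


page = (
    '<html><head><script src="a.js"></script></head>'
    '<body><script>var x = 1;</script><p>Hello</p></body></html>'
)
result = move_scripts_to_body_end(page)
print(result)
```

Scripts keep their original relative order, which matters when one script depends on another. Note that moving a script that calls document.write, as in the example above, changes where its output lands in the rendered page.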
