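Your code prints nothing for two reasons:
- You never decode the HTTP response, so you are trying to parse bytes rather than a string.
- link.find('http') >= 1 is always False for links that start with http or https, because find returns 0 when the match is at the very beginning of the string. Use link.find('http') == 0 or link.startswith('http') instead.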
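Both points can be checked in isolation without touching the network. The snippet below is only a small sketch: the byte string and the URL are made-up sample values, not taken from your page.

```python
# Hypothetical sample data, used only for illustration
raw = b'<a href="http://example.com/page">link</a>'

# urlopen(...).read() returns bytes; HTMLParser.feed() expects str,
# so the body has to be decoded first
text = raw.decode('utf-8')

link = 'http://example.com/page'
position = link.find('http')           # index of the first match, -1 if absent
old_check = position >= 1              # the condition from the question
fixed_check = link.startswith('http')  # the suggested replacement

print(position, old_check, fixed_check)  # prints: 0 False True
```

Note that the original condition would only accept links where http appears somewhere after the first character, which is the opposite of what you want.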
If you want to stick with HTMLParser, you can modify your code as follows:
from html.parser import HTMLParser
import urllib.request


class myParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Per-instance list, so separate parsers do not share results
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                # value is None for an attribute written without a value
                if name == 'href' and value and value.startswith('http'):
                    print(value)
                    self.links.append(value)


with urllib.request.urlopen("http://www.asriran.com") as response:
    # Decode the bytes to str before feeding the parser
    handle = response.read().decode('utf-8')

parser = myParser()
parser.feed(handle)
http_links = parser.links
Otherwise, I suggest switching to Beautiful Soup and parsing the response like this:
from bs4 import BeautifulSoup
import urllib.request

with urllib.request.urlopen("http://www.asriran.com") as response:
    html = response.read().decode('utf-8')

soup = BeautifulSoup(html, 'html.parser')
# href=True skips <a> tags without an href, which would otherwise yield None
all_links = [a['href'] for a in soup.find_all('a', href=True)]
http_links = [link for link in all_links if link.startswith('http')]
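Neither version can be tried without network access, so here is a rough offline check of the filtering logic. It is a sketch only: the class name LinkCollector and the sample HTML are invented for illustration, and it uses the standard-library parser so that nothing extra needs to be installed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values that start with 'http'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == 'href' and value and value.startswith('http'):
                self.links.append(value)


# Made-up sample markup covering the cases that matter for the filter
sample_html = """
<a href="http://a.example/">absolute http</a>
<a href="https://b.example/x">absolute https</a>
<a href="/relative/path">relative</a>
<a href="/go?url=http://c.example/">http in the middle</a>
<a name="top">no href</a>
<a href>empty href</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # prints: ['http://a.example/', 'https://b.example/x']
```

The relative link, the link with http in the middle, and the anchors without a usable href are all skipped, which is the behaviour the startswith check is meant to give.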