
Error when accessing an FTP site with BeautifulSoup and ftplib

  • Stefano Potter  ·  asked 6 years ago

    I am trying to access a web page to download the following data:

    from bs4 import BeautifulSoup
    import requests
    
    download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
    
    s = requests.session()                                                         
    
    
    page = BeautifulSoup(s.get(download_url).text, "lxml")
    

    But this returns:

    Traceback (most recent call last):
    
      File "<ipython-input-271-59c5b15a7e34>", line 1, in <module>
        r = requests.get(download_url)
    
      File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 72, in get
        return request('get', url, params=params, **kwargs)
    
      File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 58, in request
        return session.request(method=method, url=url, **kwargs)
    
      File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
        resp = self.send(prep, **send_kwargs)
    
      File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 612, in send
        adapter = self.get_adapter(url=request.url)
    
      File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 703, in get_adapter
        raise InvalidSchema("No connection adapters were found for '%s'" % url)
    
    InvalidSchema: No connection adapters were found for 'ftp://nomads.ncdc.noaa.gov/NARR_monthly/'
    

    Normally, if that worked, I would then loop through each link like this:

    for a in page.find_all('a', href=True):
        file = a['href']
        print (file)
    

    I have also tried:

    import ftplib
    
    ftp = ftplib.FTP(download_url)
    

    But this returns:

      File "<ipython-input-284-60bd19e600fe>", line 1, in <module>
        ftp = ftplib.FTP(download_url)
    
      File "/anaconda3/lib/python3.6/ftplib.py", line 117, in __init__
        self.connect(host)
    
      File "/anaconda3/lib/python3.6/ftplib.py", line 152, in connect
        source_address=self.source_address)
    
      File "/anaconda3/lib/python3.6/socket.py", line 704, in create_connection
        for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    
      File "/anaconda3/lib/python3.6/socket.py", line 745, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    
    gaierror: [Errno 8] nodename nor servname provided, or not known
    
    1 Answer  |  6 years ago
  •   t.m.adam  ·  answered 6 years ago

    Unfortunately requests does not support FTP links, but you can use the built-in urllib:

    import urllib.request
    
    download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
    with urllib.request.urlopen(download_url) as r:
        data = r.read()
    
    print(data)
    

    The response is not HTML, so you can't parse it with BeautifulSoup, but you can use a regex or plain string operations.

    links = [
        download_url + line.split()[-1] 
        for line in data.decode().splitlines()
    ]
    for link in links:
        print(link)
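
    For illustration, here is what the last-field extraction does on a single listing line, with an equivalent regex. The sample line below is made up; the exact columns the server returns may differ:

```python
import re

# Hypothetical listing line in the common Unix "LIST" format; the actual
# output of nomads.ncdc.noaa.gov may differ in its exact columns.
line = "-rw-r--r--   1 ftp  ftp   1024 Jan 01  2018 example_file.tar"

# String-operations approach (as above): the name is the last field.
name_split = line.split()[-1]

# Equivalent regex approach: grab the trailing run of non-whitespace.
name_regex = re.search(r"\S+$", line).group(0)

print(name_split, name_regex)
```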
    

    You could also use ftplib if you prefer, but then you have to pass only the hostname, not the full URL. You can then cd into "NARR_monthly" and get the data.
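
    If you only have the full URL, the hostname and path can be split out with the standard urllib.parse module, for example:

```python
from urllib.parse import urlparse

download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
parsed = urlparse(download_url)

print(parsed.hostname)  # nomads.ncdc.noaa.gov
print(parsed.path)      # /NARR_monthly/
```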

    from ftplib import FTP
    
    with FTP('nomads.ncdc.noaa.gov') as ftp:
        ftp.login() 
        ftp.cwd('NARR_monthly')
        data = ftp.nlst()
    
    path = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
    links = [path + i for i in data]
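
    To actually download one of the listed files, a sketch along these lines should work with ftplib's retrbinary; the helper name is an assumption, and anonymous login is assumed as in the listing code above. nlst() returns bare file names, which is what the RETR command expects:

```python
from ftplib import FTP

def download_file(host, directory, filename, dest_path):
    # Hypothetical helper: fetch a single file over FTP in binary mode.
    with FTP(host) as ftp:
        ftp.login()          # anonymous login, as above
        ftp.cwd(directory)
        with open(dest_path, "wb") as f:
            ftp.retrbinary("RETR " + filename, f.write)

# e.g. download_file('nomads.ncdc.noaa.gov', 'NARR_monthly', data[0], data[0])
```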