
How to submit a form on a web page to download a file using Python

Markus · 4 years ago

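    I want to download a PDF file from a website. The first time you click the PDF link, you are taken to a page where you have to click "Agree and proceed". Once you do, the browser stores a cookie (so you never have to agree again) and then opens the PDF in the browser. That PDF is what I want to download.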

    This is the link to the accept page: https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753

    <form name="showAnnouncementPDFForm" method="post" action="announcementTerms.do">
        <input value="Decline" onclick="window.close();return false;" type="submit">
        <input value="Agree and proceed" type="submit">
        <input name="pdfURL" value="/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf" type="hidden">
    </form>

    This is the final page: https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf

    import requests
    values = {}
    values['showAnnouncementRDFForm'] = 'announcementTerms.do'
    values['pdfURL'] = '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'
    req = requests.post('https://asx.com.au/', data=values)
    print(req.text)
    

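    As a final solution, I would like Python code that takes the PDF link, automatically agrees and proceeds, stores the cookie to avoid future approvals, and then downloads the PDF.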

    I hope this makes sense. Thank you for taking the time to read my question.

Dan-Dev · 4 years ago

    import requests
    # Download the PDF straight from its final URL
    response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf")
    # Write the raw bytes to disk
    with open('./test1.pdf', 'wb') as f:
        f.write(response.content)
    

    If you don't know the PDF URL, you can read it from the form and then access it directly, without cookies:

    import requests
    from bs4 import BeautifulSoup
    base_url = "https://www.asx.com.au"
    response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
    soup = BeautifulSoup(response.text, 'html.parser')
    pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
    response = requests.get(f'{base_url}{pdf_url}')
    with open('./test2.pdf', 'wb') as f:
        f.write(response.content)
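
The lookup above raises `AttributeError` when the page contains no `pdfURL` input, because `soup.find` returns `None` in that case. Below is a minimal sketch of the same extraction using only the standard library, which returns `None` instead of failing. The names `InputCollector` and `extract_pdf_url` are made up for this illustration and are not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class InputCollector(HTMLParser):
    """Collect the name/value pairs of all named <input> elements."""

    def __init__(self):
        super().__init__()
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name is not None:
                self.inputs[name] = attributes.get("value", "")


def extract_pdf_url(html, page_url):
    """Return the absolute PDF URL found in the form, or None if it is absent."""
    parser = InputCollector()
    parser.feed(html)
    relative = parser.inputs.get("pdfURL")
    if not relative:
        return None
    # Resolve the site-relative path against the URL of the page that held the form.
    return urljoin(page_url, relative)
```

Calling `extract_pdf_url(response.text, response.url)` on the terms page would then give either a full URL to download or `None`, which the caller can check before requesting the file.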
    

    import requests
    # Alternatively, send the acceptance cookie explicitly with the request
    cookies = {'companntc': 'tc'}
    response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf", cookies=cookies)
    with open('./test3.pdf', 'wb') as f:
        f.write(response.content)
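
The question also asks for the cookie to be kept so that later runs skip the approval step. One possible sketch, assuming the site sets its acceptance cookie in a response, uses the standard library's `http.cookiejar` rather than `requests`. The file name `asx_cookies.txt` and the helper names `load_cookie_jar` and `download_pdf` are hypothetical.

```python
import http.cookiejar
import os
import urllib.request

COOKIE_FILE = "asx_cookies.txt"  # hypothetical file name, chosen for this sketch


def load_cookie_jar(path=COOKIE_FILE):
    """Return an LWPCookieJar, pre-filled from `path` if that file exists."""
    jar = http.cookiejar.LWPCookieJar(path)
    if os.path.exists(path):
        # ignore_discard=True also restores session cookies (those without an expiry).
        jar.load(ignore_discard=True)
    return jar


def download_pdf(url, out_path, jar):
    """Fetch `url` with the cookies in `jar`, save the body, then persist the jar."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(url) as response, open(out_path, "wb") as f:
        f.write(response.read())
    # Store any cookies the server set so the next run can reuse them.
    jar.save(ignore_discard=True)
```

A typical run would call `jar = load_cookie_jar()` once and then `download_pdf(url, "out.pdf", jar)` for each file. The `requests` documentation describes `Session.cookies` as accepting any `cookielib.CookieJar`-compatible object, so the same jar could probably be assigned to a `requests.Session` as well.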
    

    If you really want to use POST:

    import requests
    payload = {'pdfURL': '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'}
    # The form uses method="post", so send pdfURL as form data in the request body
    response = requests.post('https://www.asx.com.au/asx/statistics/announcementTerms.do', data=payload)
    with open('./test4.pdf', 'wb') as f:
        f.write(response.content)
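
The POST target comes from the form's relative `action="announcementTerms.do"`, resolved against the URL of the page that contains the form, and a browser submits `pdfURL` URL-encoded in the request body. The sketch below builds that request with the standard library only and does not send it. `build_terms_request` and `TERMS_PAGE` are illustrative names, not taken from the original answer.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Page that holds the form (query string omitted; it does not affect resolution).
TERMS_PAGE = "https://www.asx.com.au/asx/statistics/displayAnnouncement.do"


def build_terms_request(pdf_url, page_url=TERMS_PAGE, action="announcementTerms.do"):
    """Build, without sending, the POST a browser would make for the form."""
    # Form fields are URL-encoded and placed in the request body.
    body = urlencode({"pdfURL": pdf_url}).encode("ascii")
    # A relative form action is resolved against the page URL.
    target = urljoin(page_url, action)
    return Request(
        target,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to perform the submission.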
    

    Or read pdfURL from the form and POST it:

    import requests
    from bs4 import BeautifulSoup
    base_url = "https://www.asx.com.au"
    response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
    soup = BeautifulSoup(response.text, 'html.parser')
    pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
    payload = {'pdfURL': pdf_url}
    response = requests.post(f"{base_url}/asx/statistics/announcementTerms.do", data=payload)
    with open('./test5.pdf', 'wb') as f:
        f.write(response.content)
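
Every snippet above writes whatever the server returns, so if the site answers with the HTML terms page instead of the document, the saved `.pdf` file will not open. A small guard, sketched here with the hypothetical helpers `looks_like_pdf` and `save_if_pdf`, checks for the `%PDF-` signature that PDF files start with before writing anything.

```python
def looks_like_pdf(content):
    """Return True if the bytes start with the PDF file signature."""
    return content.startswith(b"%PDF-")


def save_if_pdf(content, path):
    """Write `content` to `path` only when it looks like a PDF; report success."""
    if not looks_like_pdf(content):
        return False
    with open(path, "wb") as f:
        f.write(content)
    return True
```

For example, `save_if_pdf(response.content, './test5.pdf')` returns `False` and leaves no file behind when the response body is the HTML terms page.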