
urllib: HTML to CSV error

  •  0
  • Roman  ·  Tech community  ·  6 years ago

    I am trying to fetch the table data and save it to a CSV file, like this:

    import urllib.request
    import pandas as pd

    url = 'https://finance.yahoo.com/quote/BTC-JPY/history?period1=1314403200&period2=1314489600&interval=1d&filter=history&frequency=1d'
    fo = 'test.txt'

    response = urllib.request.urlopen(url)
    html = response.read()
    data = pd.read_html(html)
    data.to_csv(fo, index=False, header=False, sep=',', mode='w')
    

    But I get the following error:

    TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U95') dtype('<U95') dtype('<U95')
    
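    For reference, `pd.read_html` returns a *list* of DataFrames, one per `<table>` it finds, so calling `.to_csv` directly on the result cannot work; the list must be indexed first. A minimal sketch of that fix, using a hypothetical inline table in place of the fetched Yahoo page:

    ```python
    import io
    import pandas as pd

    # pd.read_html returns a list of DataFrames, one per <table> it finds,
    # so the result must be indexed before calling .to_csv.
    # (This inline HTML is a stand-in for the fetched Yahoo page.)
    html = """
    <table>
      <tr><th>Date</th><th>Close</th></tr>
      <tr><td>Aug 28, 2011</td><td>700.99</td></tr>
    </table>
    """
    tables = pd.read_html(io.StringIO(html))  # -> list of DataFrames
    df = tables[0]                            # pick the first (and only) table
    df.to_csv('test.txt', index=False)        # now to_csv works
    ```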

    Without pandas:

    lines = html.splitlines()
    for l in lines:
        fo.write(str(l) + '\n') 
    

    It writes the bytes in an unreadable format.
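    That happens because `urlopen(...).read()` returns `bytes`, and `str(l)` on a bytes line produces the literal `b'...'` text; decoding once before splitting avoids that. A minimal sketch, with a stand-in byte string in place of the real response:

    ```python
    # urlopen().read() returns bytes; decode once to str, then write text lines.
    # (The byte string below is a stand-in for the real HTTP response body.)
    html_bytes = b"<tr><td>Aug 28, 2011</td></tr>\n<tr><td>Aug 27, 2011</td></tr>\n"
    text = html_bytes.decode("utf-8")  # bytes -> str
    with open("test.txt", "w", encoding="utf-8") as fo:
        for line in text.splitlines():
            fo.write(line + "\n")
    ```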

    I only need the table data:

    Date    Open    High    Low Close*  Adj Close** Volume
    Aug 28, 2011    700.99  700.99  700.99  700.99  700.99  -
    Aug 27, 2011    700.99  700.99  700.99  700.99  700.99  700
    
    1 Answer  |  6 years ago
  •  1
  •   Rehan Azher  ·  6 years ago

    The code below will produce what you are looking for:

    import urllib.request
    from bs4 import BeautifulSoup

    url = 'https://finance.yahoo.com/quote/BTC-JPY/history?period1=1314403200&period2=1314489600&interval=1d&filter=history&frequency=1d'
    fo = 'test.txt'

    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("table")
    # header cells come from the first row of the table
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
    with open(fo, "w") as file:
        file.write(",".join(headings) + "\n")
        # data rows: skip the header row, keep only complete rows
        for row in table.find_all("tr")[1:]:
            data = [td.get_text() for td in row.find_all("td")]
            if len(data) == len(headings):
                file.write(",".join(data) + "\n")
    

    Output:

    Date,Open,High,Low,Close*,Adj Close**,Volume
    Aug 28, 2011,700.99,700.99,700.99,700.99,700.99,-
    Aug 27, 2011,700.99,700.99,700.99,700.99,700.99,700
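
    One caveat about this output: the Date field itself contains a comma (`Aug 28, 2011`), so a plain `",".join(...)` produces ambiguous CSV with eight apparent columns instead of seven. The standard `csv` module quotes such fields automatically; a small sketch:

    ```python
    import csv

    # csv.writer quotes any field containing the delimiter, so dates like
    # "Aug 28, 2011" survive a round trip (a plain ",".join would not).
    rows = [
        ["Date", "Open", "Close"],
        ["Aug 28, 2011", "700.99", "700.99"],
    ]
    with open("test.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

    with open("test.csv", newline="") as f:
        round_tripped = list(csv.reader(f))
    ```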