How do I resolve an AttributeError when using BeautifulSoup?

findall attributeerror beautifulsoup dataframe python

Job · Tech Community · 6 years ago

I am reading an HTML file in a format similar to the following:

html = '''
<tr>
<td class="SmallFormText" colspan="3">hours per response:</td><td class="SmallFormTextR">23.8</td>
</tr>
<hr>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="Form 13F-NT Header Information">
<tbody>
<tr>
<td class="FormTextC">COLUMN 1</td><td class="FormTextC">COLUMN 2</td><td class="FormTextC">COLUMN 3</td><td class="FormTextR">COLUMN 4</td><td class="FormTextC" colspan="3">COLUMN 5</td><td class="FormTextC">COLUMN 6</td><td class="FormTextR">COLUMN 7</td><td class="FormTextC" colspan="3">COLUMN 8</td>
</tr>
<tr>
<td class="FormText"></td><td class="FormText"></td><td class="FormText"></td><td class="FormTextR">VALUE</td><td class="FormTextR">SHRS OR</td><td class="FormText">SH/</td><td class="FormText">PUT/</td><td class="FormText">INVESTMENT</td><td class="FormTextR">OTHER</td><td class="FormTextC" colspan="3">VOTING AUTHORITY</td>
</tr>
<tr>
<td class="FormText">NAME OF ISSUER</td><td class="FormText">TITLE OF CLASS</td><td class="FormText">CUSIP</td><td class="FormTextR">(x$1000)</td><td class="FormTextR">PRN AMT</td><td class="FormText">PRN</td><td class="FormText">CALL</td><td class="FormText">DISCRETION</td><td class="FormTextR">MANAGER</td><td class="FormTextR">SOLE</td><td class="FormTextR">SHARED</td><td class="FormTextR">NONE</td>
</tr>
<tr>
<td class="FormData">1ST SOURCE CORP</td><td class="FormData">COM</td><td class="FormData">336901103</td><td class="FormDataR">8</td><td class="FormDataR">335</td><td class="FormData">SH</td><td>&nbsp;</td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">335</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr>
<tr>
<td class="FormData">1ST UNITED BANCORP INC FLA</td><td class="FormData">COM</td><td class="FormData">33740N105</td><td class="FormDataR">7</td><td class="FormDataR">989</td><td class="FormData">SH</td><td>&nbsp;</td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">989</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr>    '''

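From this markup, I am trying to extract the data between the <tr> and </tr> tags. In particular, I want to use Beautiful Soup to assign a given piece of information (e.g. "NAME OF ISSUER") to a column named "NAME_OF_ISSUER". However, when I run the code below, I hit an error that looks easy to solve (it is more or less a data-format issue). Since I am new to Python, I have been stuck for hours trying other solutions. Any comments or feedback would be appreciated.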

Here is my code (please also run the code above to get the html data):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')[11:]
positions = []
dic = {}
position = rows.find_all('td')
dic["NAME_OF_ISSUER"] = position[0].text
dic["CUSIP"] = position[2].text
dic["VALUE"] = int(position[3].text.replace(',', ''))*1000
dic["SHARES"] = int(position[4].text.replace(',', ''))
positions.append(dic)
df = pd.DataFrame(positions)

At the line that defines position, I get an AttributeError saying that the 'list' object has no attribute 'find_all'.

What exactly does this mean? Also, how do I need to transform the html data to avoid this problem?

Edit:

Here is the full stack trace:

position = rows.find_all('td')
Traceback (most recent call last):

  File "<ipython-input-8-37353b5ab2ef>", line 1, in <module>
    position = rows.find_all('td')

AttributeError: 'list' object has no attribute 'find_all'

1 Answer

tdelaney · 6 years ago

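soup.find_all returns a Python list of elements, and a list has no find_all method of its own. All you need to do is iterate over the list and pull the data from each element.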
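To make the error concrete, here is a minimal sketch on a made-up two-row fragment (the fragment and variable names are illustrative only, and `html.parser` is used just so no extra parser is needed). It suggests that the result of `find_all` behaves like a list, and that it is the individual elements inside it that offer `find_all`:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row fragment, for illustration only
snippet = """
<table>
<tr><td>A</td><td>1</td></tr>
<tr><td>B</td><td>2</td></tr>
</table>
"""

soup = BeautifulSoup(snippet, "html.parser")
rows = soup.find_all("tr")

# The result is list-like: a collection of Tag objects
is_list = isinstance(rows, list)

# Calling find_all on the collection itself fails, as in the question
try:
    rows.find_all("td")
    error_raised = False
except AttributeError:
    error_raised = True

# Calling find_all on each element of the collection works
cell_texts = [[td.text for td in row.find_all("td")] for row in rows]
print(is_list, error_raised, cell_texts)
```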

import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')

# scan for header row and trim list
for index, row in enumerate(rows):
    cells = row.find_all('td')
    if cells and "NAME OF ISSUER" in cells[0].text.upper():
        del rows[:index+1]
        break

# convert remaining html rows to dict to create dataframe
positions = []
for position in rows:
    dic = {}
    cells = position.find_all('td')
    dic["NAME_OF_ISSUER"] = cells[0].text
    dic["CUSIP"] = cells[2].text
    dic["VALUE"] = int(cells[3].text.replace(',', ''))*1000
    dic["SHARES"] = int(cells[4].text.replace(',', ''))
    positions.append(dic)
df = pd.DataFrame(positions)
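Real filings may contain rows that are not full data rows (blank spacers, totals, footers), which would make the indexing or the int() conversion fail. The following is a rough sketch of a more defensive variant, not a definitive implementation: the helper name `parse_positions`, the sample markup, and the choice to silently skip short or non-numeric rows are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def parse_positions(html_text):
    """Return one dict per data row that follows the 'NAME OF ISSUER' header."""
    soup = BeautifulSoup(html_text, "html.parser")
    positions = []
    header_seen = False
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if not header_seen:
            # Skip everything up to and including the header row
            header_seen = bool(cells) and "NAME OF ISSUER" in cells[0].upper()
            continue
        if len(cells) < 5:
            continue  # not a full data row
        try:
            value = int(cells[3].replace(",", "")) * 1000
            shares = int(cells[4].replace(",", ""))
        except ValueError:
            continue  # non-numeric cell, e.g. a totals or footer row
        positions.append({
            "NAME_OF_ISSUER": cells[0],
            "CUSIP": cells[2],
            "VALUE": value,
            "SHARES": shares,
        })
    return positions

# Hypothetical sample in the same layout as the table in the question
sample = """
<table>
<tr><td>NAME OF ISSUER</td><td>TITLE OF CLASS</td><td>CUSIP</td><td>(x$1000)</td><td>PRN AMT</td></tr>
<tr><td>1ST SOURCE CORP</td><td>COM</td><td>336901103</td><td>8</td><td>335</td></tr>
<tr><td>Totals</td></tr>
</table>
"""
result = parse_positions(sample)
print(result)
```

The returned list of dicts can be passed straight to pd.DataFrame, in the same way as the positions list in the answer's code.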