I have recently gotten into Scrapy and Python 3.6. My goal is to scrape all the data from a table with HTML like this:
<table class="table table-a">
<tbody><tr>
<td colspan="2">
<h2 class="text-center no-margin">Geometry</h2>
</td>
</tr>
<tr>
<td title="Depth of section">h = 267 mm</td>
<td rowspan="8" class="text-center">
<a href="http://www.staticstools.eu/assets/image/profile-ipea.png" target="_blank">
<img src="http://www.staticstools.eu/assets/image/profile-ipea-thumb.png" alt="Section IPEA" class="img-responsive">
</a>
</td>
</tr>
<tr>
<td title="Width of section">b = 135 mm</td>
</tr>
<tr>
<td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td>
</tr>
<tr>
<td title="Web thickness">t<sub>w</sub> = 5.5 mm</td>
</tr>
<tr>
<td title="Radius of root fillet">r<sub>1</sub> = 15 mm</td>
</tr>
<tr>
<td title="Distance of centre of gravity along y-axis">y<sub>s</sub> = 67.5 mm</td>
</tr>
<tr>
<td title="Depth of straight portion of web">d = 219.6 mm</td>
</tr>
<tr>
<td title="Area of section">A = 3915 mm<sup>2</sup></td>
</tr>
<tr>
<td title="Painting surface per unit lenght">A<sub>L</sub> = 1.04 m<sup>2</sup>.m<sup>-1</sup></td>
<td title="Mass per unit lenght">G = 30.7 kg.m<sup>-1</sup></td>
</tr>
</tbody></table>
In several rows I run into <sup> and <sub> index markup, which makes everything difficult. For example, using:
response.css('table.table.table-a td::text').extract()
the output is:
['\n ',
'\n ',
'h = 267 mm',
'\n ',
'\n ',
'b = 135 mm',
't',
' = 8.7 mm',
't',
' = 5.5 mm',
'r',
' = 15 mm',
'y',
' = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm',
'A',
' = 1.04 m',
'.m',
'G = 30.7 kg.m']
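So everything is a bit of a mess: the text inside the nested tags is dropped, and the text around them is split into separate fragments. I can also include the nested tags by using: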
response.css('table.table.table-a td *::text').extract()
which gives:
['\n ',
'Geometry',
'\n ',
'h = 267 mm',
'\n ',
'\n ',
'\n ',
'\n ',
'b = 135 mm',
't',
'f',
' = 8.7 mm',
't',
'w',
' = 5.5 mm',
'r',
'1',
' = 15 mm',
'y',
's',
' = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm',
'2',
'A',
'L',
' = 1.04 m',
'2',
'.m',
'-1',
'G = 30.7 kg.m',
'-1']
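I could of course post-process this data, but I would like to know whether this can be done during the scraping itself. I would like the output to look like this: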
['h = 267 mm',
'b = 135 mm',
'tf = 8.7 mm',
'tw = 5.5 mm',
'r1 = 15 mm',
'ys = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm2',
'AL = 1.04 m2.m-1',
'G = 30.7 kg.m-1']