
Scraping a table containing <sub> data

  •  bc291  ·  asked 6 years ago

    I am currently working with Scrapy on Python 3.6. My goal is to scrape all the data from a table whose HTML looks like this:

    <table class="table table-a">
                        <tbody><tr>
                            <td colspan="2">
                                <h2 class="text-center no-margin">Geometry</h2>
                            </td>
                        </tr>
                        <tr>
                            <td title="Depth of section">h = 267 mm</td>
                            <td rowspan="8" class="text-center">
                                <a href="http://www.staticstools.eu/assets/image/profile-ipea.png" target="_blank">
                                    <img src="http://www.staticstools.eu/assets/image/profile-ipea-thumb.png" alt="Section IPEA" class="img-responsive">
                                </a>
                            </td>
                        </tr>
                        <tr>
                            <td title="Width of section">b = 135 mm</td>
                        </tr>
                        <tr>
                            <td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td>
                        </tr>
                        <tr>
                            <td title="Web thickness">t<sub>w</sub> = 5.5 mm</td>
                        </tr>
                        <tr>
                            <td title="Radius of root fillet">r<sub>1</sub> = 15 mm</td>
                        </tr>
                        <tr>
                            <td title="Distance of centre of gravity along y-axis">y<sub>s</sub> = 67.5 mm</td>
                        </tr>
                        <tr>
                            <td title="Depth of straight portion of web">d = 219.6 mm</td>
                        </tr>
                        <tr>
                            <td title="Area of section">A = 3915 mm<sup>2</sup></td>
                        </tr>
                        <tr>
                            <td title="Painting surface per unit lenght">A<sub>L</sub> = 1.04 m<sup>2</sup>.m<sup>-1</sup></td>
                            <td title="Mass per unit lenght">G = 30.7 kg.m<sup>-1</sup></td>
                        </tr>
                    </tbody></table>
    

    In several of the rows I run into `<sup>` and `<sub>` index markup, which complicates everything. What I mean is that using:

    response.css('table.table.table-a td::text').extract()
    

    the output is:

    ['\n                            ',
     '\n                        ',
     'h = 267 mm',
     '\n                            ',
     '\n                        ',
     'b = 135 mm',
     't',
     ' = 8.7 mm',
     't',
     ' = 5.5 mm',
     'r',
     ' = 15 mm',
     'y',
     ' = 67.5 mm',
     'd = 219.6 mm',
     'A = 3915 mm',
     'A',
     ' = 1.04 m',
     '.m',
     'G = 30.7 kg.m']
    

    So everything ends up a bit scrambled. I can also include the nested tags with:

    response.css('table.table.table-a td *::text').extract()
    

    which outputs:

    ['\n                            ',
     'Geometry',
     '\n                        ',
     'h = 267 mm',
     '\n                            ',
     '\n                                ',
     '\n                            ',
     '\n                        ',
     'b = 135 mm',
     't',
     'f',
     ' = 8.7 mm',
     't',
     'w',
     ' = 5.5 mm',
     'r',
     '1',
     ' = 15 mm',
     'y',
     's',
     ' = 67.5 mm',
     'd = 219.6 mm',
     'A = 3915 mm',
     '2',
     'A',
     'L',
     ' = 1.04 m',
     '2',
     '.m',
     '-1',
     'G = 30.7 kg.m',
     '-1']
    

    Of course I could post-process this data, but I would like to know whether it can be done during scraping itself. I would like the output data to look like this:

     ['h = 267 mm',
      'b = 135 mm',
      'tf = 8.7 mm',
      'tw = 5.5 mm',
      'r1 = 15 mm',
      'ys = 67.5 mm',
      'd = 219.6 mm',
      'A = 3915 mm2',
      'AL = 1.04 m2.m-1',
      'G = 30.7 kg.m-1']
    
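    To get output like this during scraping, the nested `<sub>`/`<sup>` fragments need to be merged per cell instead of collected as separate strings. Outside Scrapy, the same merging idea can be demonstrated with the standard library's html.parser (a minimal sketch; the class name and the two-row sample snippet are illustrative, not taken from the page above):

```python
from html.parser import HTMLParser

class TableCellParser(HTMLParser):
    """Collect the full text of each <td>, merging nested <sub>/<sup> text."""
    def __init__(self):
        super().__init__()
        self.cells = []   # finished cell strings
        self._in_td = 0   # > 0 while inside a <td>
        self._buf = []    # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td += 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td -= 1
            text = "".join(self._buf).strip()
            if text:
                self.cells.append(text)

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)

html = ('<tr><td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td></tr>'
        '<tr><td title="Area of section">A = 3915 mm<sup>2</sup></td></tr>')
parser = TableCellParser()
parser.feed(html)
print(parser.cells)  # ['tf = 8.7 mm', 'A = 3915 mm2']
```

    In Scrapy itself, the equivalent move is to loop over the `td` selectors and join each one's descendant text nodes into a single string per cell.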
    1 Answer  ·  6 years ago
  •  Sagar Bahadur Tamang  ·  6 years ago

    Yes, you can do as much data processing as you like inside the spider class's parse method. Something like the following should work here:

    import scrapy
    import pandas as pd
    
    class MySpider(scrapy.Spider):
        name = "myspider"
    
        def start_requests(self):
            urls = [
                'http://www.example.com'
            ]
    
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            # perform data processing below
    
            # grab the raw HTML of the table and let pandas parse it
            data = response.xpath("//table").extract()
    
            data = pd.read_html(data[0])[0]
    
            # perform data processing above
    
            yield {'data': data}
    
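    One caveat about the spider above: Scrapy's feed exporters need JSON-serializable items, and a raw DataFrame is not one. A minimal sketch of shaping merged cell strings into plain dicts before yielding (the field names `symbol`/`value`/`unit` are my own, not from the answer):

```python
# Hypothetical post-processing step: turn merged cell strings into
# JSON-serializable dicts suitable for `yield` from parse().
cells = ['h = 267 mm', 'b = 135 mm', 'A = 3915 mm2']
items = []
for cell in cells:
    symbol, _, rest = cell.partition(' = ')   # 'h', ' = ', '267 mm'
    value, _, unit = rest.partition(' ')      # '267', ' ', 'mm'
    items.append({'symbol': symbol, 'value': value, 'unit': unit})
print(items)  # [{'symbol': 'h', 'value': '267', 'unit': 'mm'}, ...]
```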

    Run the following command to save the resulting df as JSON:

    scrapy crawl myspider -o table.json
    

    If you want a closer look at the kind of code to insert into the parse method, check the following:

    df = pd.read_html(html)[0]
    
    df
    
        0               1
    0   Geometry        NaN
    1   h = 267 mm      NaN
    2   b = 135 mm      NaN
    3   tf = 8.7 mm     NaN
    4   tw = 5.5 mm     NaN
    5   r1 = 15 mm      NaN
    6   ys = 67.5 mm    NaN
    7   d = 219.6 mm    NaN
    8   A = 3915 mm2    NaN
    9   AL = 1.04 m2.m-1    G = 30.7 kg.m-1
    
    df = pd.DataFrame([i.split(r' ') for i in df[0].map(str)])
    df.drop([1,3], axis=1, inplace=True)
    
    df
    
        0   2
    0   Geometry    None
    1   h   267
    2   b   135
    3   tf  8.7
    4   tw  5.5
    5   r1  15
    6   ys  67.5
    7   d   219.6
    8   A   3915
    9   AL  1.04
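    The `split`/`drop([1, 3])` step above can be sanity-checked without pandas: splitting a cell on single spaces gives `[symbol, '=', value, unit]`, and keeping columns 0 and 2 leaves the symbol/value pairs. A stdlib-only sketch of the same transformation:

```python
# Mirror the answer's split-then-drop-columns step in plain Python:
# 'tf = 8.7 mm'.split(' ') -> ['tf', '=', '8.7', 'mm'];
# keeping indices 0 and 2 drops the '=' sign and the unit.
cells = ['h = 267 mm', 'ys = 67.5 mm', 'AL = 1.04 m2.m-1']
table = [[c.split(' ')[0], c.split(' ')[2]] for c in cells]
print(table)  # [['h', '267'], ['ys', '67.5'], ['AL', '1.04']]
```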