
Scraping a table containing <sub> data

  •  bc291  ·  asked 6 years ago

    I am currently working with Scrapy on Python 3.6. My goal is to scrape all the data from a table whose HTML looks like this:

    <table class="table table-a">
                        <tbody><tr>
                            <td colspan="2">
                                <h2 class="text-center no-margin">Geometry</h2>
                            </td>
                        </tr>
                        <tr>
                            <td title="Depth of section">h = 267 mm</td>
                            <td rowspan="8" class="text-center">
                                <a href="http://www.staticstools.eu/assets/image/profile-ipea.png" target="_blank">
                                    <img src="http://www.staticstools.eu/assets/image/profile-ipea-thumb.png" alt="Section IPEA" class="img-responsive">
                                </a>
                            </td>
                        </tr>
                        <tr>
                            <td title="Width of section">b = 135 mm</td>
                        </tr>
                        <tr>
                            <td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td>
                        </tr>
                        <tr>
                            <td title="Web thickness">t<sub>w</sub> = 5.5 mm</td>
                        </tr>
                        <tr>
                            <td title="Radius of root fillet">r<sub>1</sub> = 15 mm</td>
                        </tr>
                        <tr>
                            <td title="Distance of centre of gravity along y-axis">y<sub>s</sub> = 67.5 mm</td>
                        </tr>
                        <tr>
                            <td title="Depth of straight portion of web">d = 219.6 mm</td>
                        </tr>
                        <tr>
                            <td title="Area of section">A = 3915 mm<sup>2</sup></td>
                        </tr>
                        <tr>
                            <td title="Painting surface per unit lenght">A<sub>L</sub> = 1.04 m<sup>2</sup>.m<sup>-1</sup></td>
                            <td title="Mass per unit lenght">G = 30.7 kg.m<sup>-1</sup></td>
                        </tr>
                    </tbody></table>
    

    In several of the rows I run into `<sup>` and `<sub>` index markup, which complicates everything. What I mean is that using:

    response.css('table.table.table-a td::text').extract()
    

    the output is:

    ['\n                            ',
     '\n                        ',
     'h = 267 mm',
     '\n                            ',
     '\n                        ',
     'b = 135 mm',
     't',
     ' = 8.7 mm',
     't',
     ' = 5.5 mm',
     'r',
     ' = 15 mm',
     'y',
     ' = 67.5 mm',
     'd = 219.6 mm',
     'A = 3915 mm',
     'A',
     ' = 1.04 m',
     '.m',
     'G = 30.7 kg.m']
    

    So everything ends up a bit scrambled. I can also include the nested tags with:

    response.css('table.table.table-a td *::text').extract()
    

    which outputs:

    ['\n                            ',
     'Geometry',
     '\n                        ',
     'h = 267 mm',
     '\n                            ',
     '\n                                ',
     '\n                            ',
     '\n                        ',
     'b = 135 mm',
     't',
     'f',
     ' = 8.7 mm',
     't',
     'w',
     ' = 5.5 mm',
     'r',
     '1',
     ' = 15 mm',
     'y',
     's',
     ' = 67.5 mm',
     'd = 219.6 mm',
     'A = 3915 mm',
     '2',
     'A',
     'L',
     ' = 1.04 m',
     '2',
     '.m',
     '-1',
     'G = 30.7 kg.m',
     '-1']
    

    Of course I could post-process this data, but I would like to know whether it can be done during scraping itself. I would like the output data to look like this:

     ['h = 267 mm',
      'b = 135 mm',
      'tf = 8.7 mm',
      'tw = 5.5 mm',
      'r1 = 15 mm',
      'ys = 67.5 mm',
      'd = 219.6 mm',
      'A = 3915 mm2',
      'AL = 1.04 m2.m-1',
      'G = 30.7 kg.m-1']
    
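    To get output like this during scraping, the nested `<sub>`/`<sup>` fragments need to be merged per cell instead of collected as separate strings. Outside Scrapy, the same merging idea can be demonstrated with the standard library's html.parser (a minimal sketch; the class name and the two-row sample snippet are illustrative, not taken from the page above):

```python
from html.parser import HTMLParser

class TableCellParser(HTMLParser):
    """Collect the full text of each <td>, merging nested <sub>/<sup> text."""
    def __init__(self):
        super().__init__()
        self.cells = []   # finished cell strings
        self._in_td = 0   # > 0 while inside a <td>
        self._buf = []    # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td += 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td -= 1
            text = "".join(self._buf).strip()
            if text:
                self.cells.append(text)

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)

html = ('<tr><td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td></tr>'
        '<tr><td title="Area of section">A = 3915 mm<sup>2</sup></td></tr>')
parser = TableCellParser()
parser.feed(html)
print(parser.cells)  # ['tf = 8.7 mm', 'A = 3915 mm2']
```

    In Scrapy itself, the equivalent move is to loop over the `td` selectors and join each one's descendant text nodes into a single string per cell.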
    1 Answer  ·  6 years ago
  •  Sagar Bahadur Tamang  ·  6 years ago

    Yes, you can do as much data processing as you like inside the spider class's parse method. Something like the following should work here:

    import scrapy
    import pandas as pd
    
    class MySpider(scrapy.Spider):
        name = "myspider"
    
        def start_requests(self):
            urls = [
                'http://www.example.com'
            ]
    
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            # perform data processing below
    
            # grab the raw HTML of the table and let pandas parse it
            data = response.xpath("//table").extract()
    
            data = pd.read_html(data[0])[0]
    
            # perform data processing above
    
            yield {'data': data}
    
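    One caveat about the spider above: Scrapy's feed exporters need JSON-serializable items, and a raw DataFrame is not one. A minimal sketch of shaping merged cell strings into plain dicts before yielding (the field names `symbol`/`value`/`unit` are my own, not from the answer):

```python
# Hypothetical post-processing step: turn merged cell strings into
# JSON-serializable dicts suitable for `yield` from parse().
cells = ['h = 267 mm', 'b = 135 mm', 'A = 3915 mm2']
items = []
for cell in cells:
    symbol, _, rest = cell.partition(' = ')   # 'h', ' = ', '267 mm'
    value, _, unit = rest.partition(' ')      # '267', ' ', 'mm'
    items.append({'symbol': symbol, 'value': value, 'unit': unit})
print(items)  # [{'symbol': 'h', 'value': '267', 'unit': 'mm'}, ...]
```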

    Run the following command to save the resulting df as JSON:

    scrapy crawl myspider -o table.json
    

    If you want a closer look at the kind of code to insert into the parse method, check the following:

    df = pd.read_html(html)[0]
    
    df
    
        0               1
    0   Geometry        NaN
    1   h = 267 mm      NaN
    2   b = 135 mm      NaN
    3   tf = 8.7 mm     NaN
    4   tw = 5.5 mm     NaN
    5   r1 = 15 mm      NaN
    6   ys = 67.5 mm    NaN
    7   d = 219.6 mm    NaN
    8   A = 3915 mm2    NaN
    9   AL = 1.04 m2.m-1    G = 30.7 kg.m-1
    
    df = pd.DataFrame([i.split(r' ') for i in df[0].map(str)])
    df.drop([1,3], axis=1, inplace=True)
    
    df
    
        0   2
    0   Geometry    None
    1   h   267
    2   b   135
    3   tf  8.7
    4   tw  5.5
    5   r1  15
    6   ys  67.5
    7   d   219.6
    8   A   3915
    9   AL  1.04
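    The `split`/`drop([1, 3])` step above can be sanity-checked without pandas: splitting a cell on single spaces gives `[symbol, '=', value, unit]`, and keeping columns 0 and 2 leaves the symbol/value pairs. A stdlib-only sketch of the same transformation:

```python
# Mirror the answer's split-then-drop-columns step in plain Python:
# 'tf = 8.7 mm'.split(' ') -> ['tf', '=', '8.7', 'mm'];
# keeping indices 0 and 2 drops the '=' sign and the unit.
cells = ['h = 267 mm', 'ys = 67.5 mm', 'AL = 1.04 m2.m-1']
table = [[c.split(' ')[0], c.split(' ')[2]] for c in cells]
print(table)  # [['h', '267'], ['ys', '67.5'], ['AL', '1.04']]
```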