代码之家  ›  专栏  ›  技术社区  ›  Nick

在VBA中只显示网页的特定部分

  •  0
  • Nick  · 技术社区  · 3 年前

    我再次访问web scraping,尝试开发一种可以从数据库中提取数据的工具。

    这里我使用的是一份物质档案,可在以下网址找到: https://echa.europa.eu/registration-dossier/-/registered-dossier/16016/7/1 .

    以下是本卷宗中可以找到的各种毒理学信息的列表,但我只对被称为DNELs的偏离点值(POD)感兴趣:

    基本上复制一个答案给我提供了一段时间前,我有以下代码拉第一个豆荚。

    Public Sub GetContents()
        
    
    'Start ECHA Search via XML HTTP Request
    
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/16016/7/1", False
    XMLReq.send
     
    If XMLReq.Status <> 200 Then
            
        MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
        Exit Sub
    
        End If
     
    HTMLDoc.body.innerHTML = XMLReq.responseText
    
    
    'Retrieve Data
    
    'POD Population and Route
    Set Info = HTMLDoc.getElementById("sWorkersHazardViaInhalationRoute")
    
    Debug.Print Info.innerText
    
    'POD Type
    Set Info = HTMLDoc.getElementsByClassName("HorDL")(0)
    Set data = Info.getElementsByTagName("dd")(0)
    Debug.Print data.innerText
    
    'POD Value
    Set data = Info.getElementsByTagName("dd")(1)
    Debug.Print data.innerText
    
    End Sub
    

    这项功能可以为第一条管理路线拉吊舱:WorkershardViadermalRoute

    Workers - Hazard via inhalation route
    DNEL (Derived No Effect Level)
    238 mg/m³
    

    这很好,但我真的希望能够对其进行调整,以拉动DNEL及其对每种管理方式的价值。这里用蓝色突出显示:

    enter image description here

    因此,在本例中,整个所需的输出将跨越3列(虽然只想提取数据,但现在在3列中并不重要):

    Workers - Hazard via inhalation route, DNEL (Derived No Effect Level), 238 mg/m³
    Workers - Hazard via dermal route, DNEL (Derived No Effect Level), 84 mg/kg bw/day
    General Population - Hazard via inhalation route, DNEL (Derived No Effect Level), 70 mg/m³
    General Population - Hazard via dermal route, DNEL (Derived No Effect Level), 51 mg/kg bw/day
    General Population - Hazard via oral route, DNEL (Derived No Effect Level), 24 mg/kg bw/day
    

    我遇到的问题是,我正在使用class元素“HorDL”来获取这些信息,但不幸的是,这个类并不局限于每个路由中以蓝色突出显示的部分。所以(“HorDL”)(0)工作查找,但(“HorDL”)(1)为同一条路线直接提取下面的信息。

    出于这个原因,我怀疑使用这个class元素来获取信息不是最好的方法,但是我想不出任何其他方法来做到这一点。

    我已经有了一种方法来提取相关的档案,所以如果这个方法有效的话,它将成为一个整洁的工具,只提取相关的信息。我考虑过提取所有信息,然后在excel中应用过滤器,但我认为这不是一个特别优雅的解决方案。

    非常感谢您的回复。

    0 回复  |  直到 3 年前
        1
  •  2
  •   Raymond Wu    3 年前

    这假设您只需要带有关键字的DNEL Workers General Population 在标题和其中,排除DNEL和 Hazard for the eyes

    注意:您应该声明所有变量,插入 Option Explicit 位于模块顶部,以帮助您实施它。

    Option Explicit
    
    Public Sub GetContents()
        Const DNELTitle As Long = 1
        Const DNELAssessment As Long = 2
        Const DNELValue As Long = 3
        
        Const resultFirstCell As String = "A1" 'Change the first cell address to insert the result accordingly
        
        Dim ws As Worksheet
        Set ws = ThisWorkbook.Worksheets("Sheet1") 'Change worksheet name accordingly
        
        'Start ECHA Search via XML HTTP Request
        Dim XMLReq As New MSXML2.XMLHTTP60
        Dim HTMLDoc As New MSHTML.HTMLDocument
        
        XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/16016/7/1", False
        XMLReq.send
         
        If XMLReq.Status = 200 Then
            HTMLDoc.body.innerHTML = XMLReq.responseText
            
            '==== Loop through each anchors and get the relevant ID for interested DNEL
            Dim anchors As Object
            Set anchors = HTMLDoc.getElementById("SectionAnchors")
            Set anchors = anchors.getElementsByTagName("a")
            
            Dim anchorsColl As Collection
            Set anchorsColl = New Collection
            
            Dim i As Long
            For i = 0 To anchors.Length - 1
                Dim anchorText As String
                anchorText = anchors(i).innerText
                
                If InStr(anchorText, "Workers - ") <> 0 Or _
                    InStr(anchorText, "General Population - ") <> 0 Then
                    
                    If InStr(anchorText, "Additional Information") = 0 And _
                        InStr(anchorText, "Hazard for the eyes") = 0 Then
                        
                        anchorsColl.Add Replace(anchors(i).href, "about:blank#", vbNullString)
                    End If
                End If
            Next i
            '====
            
            If anchorsColl.Count <> 0 Then
                Dim outputArr() As String
                ReDim outputArr(1 To anchorsColl.Count, 1 To 3) As String
                
                For i = 1 To anchorsColl.Count
                    Dim anchorEle As Object
                                    
                    Set anchorEle = HTMLDoc.getElementById(anchorsColl(i))
                    outputArr(i, DNELTitle) = anchorEle.innerText
                    
                    'Loop through the anchor's sibling until it finds the DL tag to extract the values
                    Do While anchorEle.nodeName <> "DL"
                        Set anchorEle = anchorEle.NextSibling
                    Loop
                    
                    'Assumes that the assessment conclusion is in the first DD tag
                    'Assumes that the value is in the second DD tag
                    outputArr(i, DNELAssessment) = anchorEle.getElementsByTagName("dd")(0).innerText
                    outputArr(i, DNELValue) = anchorEle.getElementsByTagName("dd")(1).innerText
                Next i
                
                'Write the extraction result to the worksheet starting from A1
                ws.Range(resultFirstCell).Resize(UBound(outputArr, 1), 3).Value = outputArr
            Else
                Debug.Print "No DNEL found."
            End If
            
            Set ws = Nothing
            Set HTMLDoc = Nothing
        Else
            MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
        End If
        
        Set XMLReq = Nothing
    End Sub
    
        2
  •  0
  •   Nick    3 年前

    到目前为止我自己的答案。

    接下来,我将通过一个值列表进行循环,以返回每个值的DNEL。还需要包括一些错误处理。

    Sub GetData()
        
    
    'Start ECHA Search via XML HTTP Request
    
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/16016/7/1", False
    XMLReq.send
     
    If XMLReq.Status <> 200 Then
            
        MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
        Exit Sub
    
        End If
     
    HTMLDoc.body.innerHTML = XMLReq.responseText
    
    
    'Retrieve Data for General population
    
    'Defines class element for each route
    Dim Route(1 To 3) As String
    
    Route(1) = "sGeneralPopulationHazardViaInhalationRoute"
    Route(2) = "sGeneralPopulationHazardViaDermalRoute"
    Route(3) = "sGeneralPopulationHazardViaOralRoute"
    
    'Loops through each element
    
    r = 4
    c = 6
    
    Dim i As Long
    
    For i = 1 To UBound(Route, 1)
    
    
    Set Info = HTMLDoc.getElementById(Route(i))
    Debug.Print Info.innerText
    
    Set Info = HTMLDoc.getElementById(Route(i)).NextSibling.NextSibling.NextSibling
    Set Data = Info.getElementsByTagName("dd")(0)
    Debug.Print Data.innerText
    
    Set Data = Info.getElementsByTagName("dd")(1)
    Debug.Print Data.innerText
    
    
    Cells(r, c) = Data.innerText
    
    c = c + 1
    
    Next i
    
    r = r + 1
    
    
    End Sub