
Using R to scrape a table from an ASP.NET web page with JavaScript buttons

  •  1
  • sclarky  · 10 years ago

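    I have pored over many related questions, to no avail. I need to scrape a table of price information from an ASP.NET web page ( http://www.spp.org/LIP.asp ) for a date and hour that I specify. I am familiar with R and would like to use it. My basic obstacle is that the URL is static and does not reflect the search parameters, and I don't know how to submit an HTML form that involves JavaScript on an ASP.NET site.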

    I looked at the source of the URL above and found that an iframe links to a separate "source data" page: http://www.spp.org/LIPPosting/LIP.aspx . I tried a POST request in R based on this StackOverflow thread: What if I want to web scrape with R for a page with parameters?

    ## ASP.NET site scrape
    library(RHTMLForms)  # getHTMLFormDescription(), createFunction()
    library(XML)         # htmlParse(), getNodeSet(), readHTMLTable()

    forms <- getHTMLFormDescription("http://www.spp.org/LIPPosting/LIP.aspx")
    # Name the list for easy reference
    names(forms) <- "spp"
    # Use the createFunction tool so I can submit a search
    fun <- createFunction(forms$spp, verbose = TRUE)
    # Submit an HTML form looking for data using all form defaults
    # Except change the hour to '03'
    results <- fun(ddlHour = "03")
    # Grab the table results from the HTML based on its id attribute
    tableData <- getNodeSet(htmlParse(results), "//*/table[@id = 'dgLIP']")
    readHTMLTable(tableData[[1]])
    

    The returned HTML shows that '03' is indeed selected in the "hour" form element.

               <td style="height: 42px; width: 77px;">
    <span id="lblLIPHour">Hour</span><br><select name="ddlHour" id="ddlHour"><option value="1">01</option>
    <option value="2">02</option>
    <option selected value="3">03</option>
    <option value="4">04</option>
    <option value="5">05</option>
    <option value="6">06</option>
    <option value="7">07</option>
    <option value="8">08</option>
    

    However, the request does not seem to reach the server: the table I actually get back is for the current hour, not '03'.

    > readHTMLTable(tableData[[1]])
       Publish Date   Price Date                PNode Price        Parent PNode Settlement Location
    1  201402281552 201402281600                 AECI 23.45                AECI                AECI
    2  201402281552 201402281600                 AMRN 23.45                AMRN                AMRN
    3  201402281552 201402281600                 BLKW 23.45                BLKW                BLKW
    4  201402281552 201402281600                 CLEC 23.45                CLEC                CLEC
    5  201402281552 201402281600         CSWS_AECC_LA 23.45        CSWS_AECC_LA           AECC_CSWS
    

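    In addition, I only get the HTML of the page the server returns, which does not contain all the results. The page has JavaScript arrow buttons at the bottom that page through the full result set when I use the site in a browser.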

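    On the web page itself, I have to click the "View" button to see results after choosing from the drop-down menus. Is there a way to replicate this in R, so that my '03' parameter is sent to the server as a query and the new HTML is returned?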

    If I can do that, I can probably also write something to "push" the page arrows.

    2 Answers
        1
  •  2
  •   jdharrison    10 years ago

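    You can use Selenium. See http://johndharrison.github.io/RSelenium/ . Disclaimer: I am the author of the RSelenium package. Introductory material is available in the vignettes "RSelenium basics" and "RSelenium: Testing Shiny apps".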

    library(RSelenium)
    library(XML)  # htmlParse() and readHTMLTable(), used at the end
    # RSelenium::startServer() # if needed
    remDr <- remoteDriver()
    remDr$open()
    remDr$setImplicitWaitTimeout(3000)
    remDr$navigate("http://www.spp.org/LIP.asp")
    remDr$switchToFrame("content_frame")
    dateElem <- remDr$findElement(using = "id", "txtLIPDate") # select the date
    dateRequired <- "01/14/2014"
    dateElem$clearElement()
    dateElem$sendKeysToElement(list(dateRequired, key = "enter")) # send the date to the app
    hourElem <- remDr$findElement(using = "css selector", '#ddlHour [value="5"]') # select the 5th hour
    hourElem$clickElement() # select this hour
    buttonElem <-remDr$findElement(using = "id", "cmdView")
    buttonElem$clickElement() # click the view button
    
    #Sys.sleep(5)
    tableElem <- remDr$findElement(using = "id", "dgLIP")
    readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]]))
    
    [1] "tableElem$getElementAttribute(\"outerHTML\")"
    $dgLIP
    V1           V2                   V3    V4                  V5                  V6
    1  Publish Date   Price Date                PNode Price        Parent PNode Settlement Location
    2  201401132252 201401132300                 AECI 19.14                AECI                AECI
    3  201401132252 201401132300                 AMRN 18.87                AMRN                AMRN
    4  201401132252 201401132300                 BLKW 20.28                BLKW                BLKW
    5  201401132252 201401132300                 CLEC 18.99                CLEC                CLEC
    6  201401132252 201401132300         CSWS_AECC_LA 19.77        CSWS_AECC_LA           AECC_CSWS
    7  201401132252 201401132300  CSWS_GREEN_LIGHT_LA  18.5 CSWS_GREEN_LIGHT_LA        GSEC_GL_CSWS
    8  201401132252 201401132300              CSWS_LA 19.01             CSWS_LA           AEPM_CSWS
    9  201401132252 201401132300              CSWS_LA 19.01             CSWS_LA            AEP_LOSS
    10 201401132252 201401132300         CSWS_OMPA_LA 18.66        CSWS_OMPA_LA           OMPA_CSWS
    11 201401132252 201401132300      CSWS_TENASKA_LA 18.95     CSWS_TENASKA_LA        GATEWAY_LOAD
    12 201401132252 201401132300      CSWS112_WGORLD1  18.7             CSWS_LA           AEPM_CSWS
    13 201401132252 201401132300      CSWS112_WGORLD1  18.7             CSWS_LA            AEP_LOSS
    14 201401132252 201401132300      CSWS116PEORILD1  18.9             CSWS_LA           AEPM_CSWS
    15 201401132252 201401132300      CSWS116PEORILD1  18.9             CSWS_LA            AEP_LOSS
    16 201401132252 201401132300    CSWS121EASTLDXFL1 18.92             CSWS_LA           AEPM_CSWS
    17 201401132252 201401132300    CSWS121EASTLDXFL1 18.92             CSWS_LA            AEP_LOSS
    18 201401132252 201401132300      CSWS121LYNN4LD1 18.91             CSWS_LA           AEPM_CSWS
    19 201401132252 201401132300      CSWS121LYNN4LD1 18.91             CSWS_LA            AEP_LOSS
    20 201401132252 201401132300   CSWS12TH_STLD69_12 18.92             CSWS_LA           AEPM_CSWS
    21 201401132252 201401132300   CSWS12TH_STLD69_12 18.92             CSWS_LA            AEP_LOSS
    22 201401132252 201401132300 CSWS12TH_STLD69_12_2 18.92             CSWS_LA           AEPM_CSWS
    23 201401132252 201401132300 CSWS12TH_STLD69_12_2 18.92             CSWS_LA            AEP_LOSS
    24 201401132252 201401132300      CSWS136_YALELD1  18.9             CSWS_LA           AEPM_CSWS
    25 201401132252 201401132300      CSWS136_YALELD1  18.9             CSWS_LA            AEP_LOSS
    26 201401132252 201401132300  CSWS141_PINELDXFMR1 19.09             CSWS_LA           AEPM_CSWS
    27          < >         <NA>                 <NA>  <NA>                <NA>                <NA>
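
    The date and hour in the steps above are hard-coded. As a minimal sketch, they could be derived from arguments instead. The helper name lip_inputs is hypothetical and not part of RSelenium; it only builds the two strings the Selenium calls need, and it assumes the option values in #ddlHour are unpadded integers, as in the HTML shown in the question.

```r
# Hypothetical helper (not part of RSelenium): build the date text and the
# CSS selector used by the Selenium steps from a date and an hour.
lip_inputs <- function(date, hour) {
  date <- as.Date(date)
  hour <- as.integer(hour)  # "05" becomes 5, matching <option value="5">
  stopifnot(!is.na(date), !is.na(hour), hour >= 1)
  list(
    date_text     = format(date, "%m/%d/%Y"),            # format typed into txtLIPDate
    hour_selector = sprintf('#ddlHour [value="%d"]', hour)
  )
}

inputs <- lip_inputs("2014-01-14", "05")
print(inputs$date_text)
print(inputs$hour_selector)

# With a live session these would replace the literals above, e.g.:
# dateElem$sendKeysToElement(list(inputs$date_text, key = "enter"))
# hourElem <- remDr$findElement(using = "css selector", inputs$hour_selector)
```

    Keeping the string construction separate from the browser calls means it can be checked without a running Selenium server.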
    
        2
  •  0
  •   sclarky    10 years ago

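    For posterity, I also want to post the code I am using to click through the result pages (there is no "show all" option). I have RSelenium click through the pages until there is no "forward" arrow left. On each page it scrapes the HTML table into a list: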

    # Get the first page of results
    tableElem <- remDr$findElement(using = "id", "dgLIP")
    tmp <- readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]]))
    hourlyData <- list()
    # Save the first table without the last row (row 27), which holds the pager links, not data
    hourlyData[[1]] <- tmp[[1]][-27,]

    # href of the 'greater than' arrow, a JavaScript postback link to the next page
    nextHref <- "javascript:__doPostBack('dgLIP$_ctl29$_ctl1','')"
    # Collect the href attribute of each element in a list of web elements
    getHrefs <- function(elems) unlist(lapply(elems, function(x) x$getElementAttribute("href")))

    acc <- 2
    repeat {
      webElems <- remDr$findElements("css selector", "[href]")
      isNext <- getHrefs(webElems) == nextHref
      if (!any(isNext)) break  # no forward arrow left, so this was the last page
      webElems[[which(isNext)[1]]]$clickElement()
      Sys.sleep(3)  # wait for the postback to finish before reading the table
      tableElem <- remDr$findElement(using = "id", "dgLIP")
      tmp <- readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]]))
      hourlyData[[acc]] <- tmp[[1]]
      acc <- acc + 1
    }
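
    Once every page is in the list, the pieces still need to be combined. The following is a minimal sketch of one way to do that, not part of the original code. It assumes the header row was parsed into column names (as in the question's output), that a pager row is the only row containing NA values, and that the timestamps use a YYYYMMDDHHMM layout. The sample pages are made-up stand-ins for readHTMLTable() results, and the UTC time zone is an arbitrary choice, since the thread does not state the site's time zone.

```r
# Made-up stand-ins for two scraped pages; each ends with a pager row ("< >").
page1 <- data.frame(
  "Publish Date" = c("201401132252", "201401132252", "< >"),
  "Price Date"   = c("201401132300", "201401132300", NA),
  PNode          = c("AECI", "AMRN", NA),
  Price          = c("19.14", "18.87", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)
page2 <- data.frame(
  "Publish Date" = c("201401132252", "< >"),
  "Price Date"   = c("201401132300", NA),
  PNode          = c("BLKW", NA),
  Price          = c("20.28", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)
pages <- list(page1, page2)

# Assumption: only the pager row contains NA, so complete cases are data rows.
drop_pager_row <- function(page) {
  page[stats::complete.cases(page), , drop = FALSE]
}

# Stack all pages into one data frame and reset the row names.
combined <- do.call(rbind, lapply(pages, drop_pager_row))
rownames(combined) <- NULL

# Convert text columns; as.character() guards against factor columns.
combined$Price <- as.numeric(as.character(combined$Price))
combined[["Price Date"]] <- as.POSIXct(
  as.character(combined[["Price Date"]]),
  format = "%Y%m%d%H%M", tz = "UTC"  # time zone is an assumption
)

print(combined)
```

    With the real data, pages would be replaced by hourlyData. Dropping incomplete rows on every page also makes the hard-coded row index 27 unnecessary, at the cost of discarding any genuine data row that happens to contain a missing value.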