Scraping a page in R using rvest, RCurl, or httr

  •  3
  • Sushanta Deb  · Tech Community  · 7 years ago

    https://www.mcxindia.com/market-data/spot-market-price

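    I have tried rvest and RCurl, but in both cases the downloaded page differs from what I see in the browser. I assume there is some form of redirect that I am unable to detect or follow.

    Any help would be much appreciated.

    Here is what I have been trying so far: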

    1. httr

    library(httr)
    library(XML)

    base_url <- "https://www.mcxindia.com/market-data/spot-market-price"
    # The Accept header is sent separately rather than pasted into the user-agent string.
    ua     <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    accept <- "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
    resp <- POST(base_url, user_agent(ua), add_headers(Accept = accept),
                 set_cookies(`_ga` = "GA1.2.543290785.1505100652", `_gid` = "GA1.2.1409943545.1505881384", `_gat` = "1"))
    # Parse the response body explicitly as text.
    doc <- htmlParse(content(resp, as = "text", encoding = "UTF-8"), asText = TRUE)
    poptable <- readHTMLTable(doc, which = 7)
    

    2. RCurl

    library(RCurl)
    library(XML)   # htmlParse(), readHTMLTable()

    curl <- getCurlHandle()
    curlSetOpt(curl = curl,
               ssl.verifypeer = FALSE,
               useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
               # Accept is sent as its own header instead of inside the user-agent string.
               httpheader = c(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
               timeout = 60,
               followlocation = TRUE,
               cookiejar = "./cookies",
               cookiefile = "./cookies")
    newDoc <- getURL("https://www.mcxindia.com/market-data/spot-market-price", curl = curl)
    newDoc <- htmlParse(newDoc, asText = TRUE)
    poptable <- readHTMLTable(newDoc, which = 7)
    

    Result: no data found!

    1 reply  |  as of 7 years ago
        1
  •  4
  •   Sushanta Deb    7 years ago

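    Here is the answer. Judging by what the code below reads, the prices are not in an HTML table at all: the page carries them in a JavaScript variable (vSMP) inside a <script> tag, so the script is evaluated with V8 and the variable is read back into R.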

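    library(rvest)    # read_html(), html_nodes(), html_text(), %>%
    library(stringi)  # stri_unescape_unicode(), stri_replace_all_fixed()
    library(V8)       # embedded JavaScript engine

    ctx <- v8()
    pg  <- read_html("https://www.mcxindia.com/market-data/spot-market-price")

    # Take the first <script> whose text contains 'Data', clean up its escaping,
    # and evaluate it in V8 so that the variable it defines (vSMP) exists in the context.
    html_nodes(pg, xpath = ".//script[contains(., 'Data')]")[[1]] %>%
      html_text() %>%
      stri_unescape_unicode() %>%
      stri_replace_all_fixed('\\\\', '') %>%
      ctx$eval() -> ignore_the_blank_return_value

    # Read vSMP back into R and keep the columns of interest.
    data <- ctx$get("vSMP")$Data[, c("Symbol", "TodaysSpotPrice", "Unit")]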
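
    The same idea can be tried offline, without depending on the live site. The sketch below is only an illustration under stated assumptions: the HTML string is a made-up stand-in for the page, and the symbols, prices, and units in it are invented placeholders, not real MCX data. It shows the three steps: locate the <script>, evaluate it in V8, and read the variable back as a data frame.

```r
library(rvest)
library(V8)

# Hypothetical stand-in for the real page: the "table" exists only as a
# JavaScript variable inside a <script> tag. All values are invented.
html <- '<html><body>
<script>var vSMP = {"Data":[{"Symbol":"GOLD","TodaysSpotPrice":29500,"Unit":"10 GRMS"},{"Symbol":"SILVER","TodaysSpotPrice":39000,"Unit":"1 KGS"}]};</script>
</body></html>'

pg <- read_html(html)

# Step 1: pull the JavaScript source out of the matching <script> node.
js <- html_text(html_nodes(pg, xpath = ".//script[contains(., 'vSMP')]")[[1]])

# Step 2: run it in a V8 context so the variable vSMP is defined there.
ctx <- v8()
ctx$eval(js)

# Step 3: fetch the variable; V8 converts the array of objects to a data frame.
smp <- ctx$get("vSMP")$Data
print(smp[, c("Symbol", "TodaysSpotPrice", "Unit")])
```

    If the real script needs unescaping first, as in the answer above, that cleanup would go between steps 1 and 2.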
    

    Enjoy!