代码之家  ›  专栏  ›  技术社区  ›  Kasia Kulma

R:从公司内部API获取PDF文档

  •  9
  • Kasia Kulma  · 技术社区  · 6 年前

    我正在尝试使用R从API获取文档。感谢您在 this post .我已经部分成功地执行了上述步骤,但在获取文档内容的最后一步仍然失败:

    1. 查找您感兴趣的文档归档(例如,为公司提出归档历史请求1)。在“links”:“document_metadata”:“link uri fragment here”字段中分析指向文档的链接的响应。

    没问题:

    library(httr)
    library(jsonlite)
    library(openssl)
    
    ### retrieving filing history ####
    company_num = 'FC013908'
    key = 'my_key'
    fh_path = paste0('/company/', str_to_upper(company_num), "/filing-history")
    fh_url <- modify_url("https://api.companieshouse.gov.uk/", path = fh_path)
    fh_test <- GET(fh_url, authenticate(key, "")) #status_code = 200
    fh_parsed <- jsonlite::fromJSON(content(fh_test, "text",encoding = "utf-8"), flatten = TRUE)
    docs <- fh_parsed$items
    

    完成。

    2对于给定的文档,通过ch document api3请求文档元数据。分析响应以获得可用的文档(mime)类型以及到实际文档数据(文档uri片段)的链接。

    这里没有问题:

    md_meta_url = docs$links.document_metadata[1]  
    key_pass <- paste0(key,":")
    decoded_auth <- paste0('Basic ', base64_encode(key_pass))
    
    md_test <- GET(md_meta_url,
                   add_headers(Authorization = decoded_auth)
                   )
    md_test #status_code = 200!
    md_parsed <- jsonlite::fromJSON(content(md_test, "text",encoding = "utf-8"), flatten = TRUE)
    

    这样我就可以获得内容URL:

    cont_url = md_parsed$links$document
    

    请求实际文档9,指定mime类型(例如“application/pdf”)。

    我在不遵循重定向的情况下执行此操作,如预期的那样,我使用 location 标题:

    accept = 'application/pdf'
    cont_test <- GET(cont_url, 
               add_headers(Authorization = decoded_auth,
                           Accept = accept),
               config(followlocation = FALSE)
    )
    
    final_url <- cont_test$headers$location
    
    > final_url
    [1] "https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D"
    

    但是,当我试图

    再次从Amazon请求此URI,并再次传递所需的内容类型。 我得到400个错误:

     final_test <- GET(final_url, 
                     add_headers(Authorization = decoded_auth,
                                 Accept = accept
                                 ))
    
    > final_test
    Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D]
      Date: 2018-06-20 08:37
      Status: 400
      Content-Type: application/xml
      Size: 523 B
    <BINARY BODY>
    

    不用说,执行

    browseURL(final_test$url)
    

    退货 Access Denied 错误。我怀疑这可能与亚马逊的授权问题有点类似 here .有什么办法来解决这个最后的障碍吗?

    谢谢!

    1 回复  |  直到 6 年前
        1
  •  5
  •   Kasia Kulma    6 年前

    @voracityemail my question on Companies House Developers Hub final_test 以下内容:

    final_test <- GET(final_url, add_headers(Accept = accept))
    

    200

    > final_test
    Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/Rl1qKy2kNqdskHUIsqU9u0bGzH2goTfJfnCrNg4S0lg/application-pdf?AWSAccessKeyId=ASIAJMG7NTZHYC4NH3MA&Expires=1530093768&Signature=EteMSmwXS%2FqqdOFRmYY%2Fgf187Aw%3D&x-amz-security-token=FQoDYXdzELf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDOMKrcNPR6jb5bnzGSK3A1yzaoVZWhgAeXYCN9WJnxx8b%2BTKCEEZyZui3aR5j0WoNWIQhW9GIQ8R4xTGVkRjwQIhzgDp%2BRCfXGQ0CfPCOfseaQri5m%2BWTEWBgjfToL7%2FMdcC1IINMTFRrih1APE%2FmmTcQaW7SvyZWv3Q4bVQB%2FtOsiX5k8rWVsT7%2FecfQmnJMljcKF0%2F3vDRTtLRURTCtrdegfnIFrSqXkelLxVVypKY9UeURBgxAgngOgoP7YhYt3wD%2BEz5rBdNfMvF1Zuv91hLGDyBaKuV4fRKMRXlymDHCwNgNZl3JeyuAmnX8pexK6PJzH7MerM8QX8LoPfge1yutvqEj0%2FjRSYEShOWUebecQ2tJqWIEOZly0Ji8fc%2BMtFDO1FWZBrMl6lXgkwTMpELnTH5%2BP4ULMdFfEz30bWSnAuTGXcAxsoFWsFTIE2uO35zgkOsAUT2un4UNGnL2S8XexWbgwq%2B%2Bhtxo9ruP9WA8mTpjBkup2Qe5EpvUiNwGX9APjThi7QFTllVWWvpKgzKTSBh%2Btua9xK8RgiNAYDgEa5k%2BH%2FmWIP56WglBE6r3HGsXgbi%2Bff8Rg8z2lVFLo8f9hVv%2BCYoptXM2QU%3D]
      Date: 2018-06-27 10:02
      Status: 200
      Content-Type: application/pdf
      Size: 21.7 kB
    <BINARY BODY>
    

    然后

    browseURL(final_test$url)