
Converting a series of JSON objects to a DataFrame

  •  0
  • nad  · Tech Community  · 6 years ago

    I downloaded a sample dataset from here. Each line of the file is one JSON object like this:

    {
      "id": "4cd223df721b722b1c40689caa52932a41fcc223",
      "title": "Knowledge-rich, computer-assisted composition of Chinese couplets",
      "paperAbstract": "Recent research effort in poem composition has focused on the use of automatic language generation...",
      "entities": [
        "Conformance testing",
        "Natural language generation",
        "Natural language processing",
        "Parallel computing",
        "Stochastic grammar",
        "Web application"
      ],
      "s2Url": "https://semanticscholar.org/paper/4cd223df721b722b1c40689caa52932a41fcc223",
      "s2PdfUrl": "",
      "pdfUrls": [
        "https://doi.org/10.1093/llc/fqu052"
      ],
      "authors": [
        {
          "name": "John Lee",
          "ids": [
            "3362353"
          ]
        },
        "..."
      ],
      "inCitations": [
        "c789e333fdbb963883a0b5c96c648bf36b8cd242"
      ],
      "outCitations": [
        "abe213ed63c426a089bdf4329597137751dbb3a0",
        "..."
      ],
      "year": 2016,
      "venue": "DSH",
      "journalName": "DSH",
      "journalVolume": "31",
      "journalPages": "152-163",
      "sources": [
        "DBLP"
      ],
      "doi": "10.1093/llc/fqu052",
      "doiUrl": "https://doi.org/10.1093/llc/fqu052",
      "pmid": ""
    }
    

    Ultimately I need paperAbstract together with one other field.

    import pandas as pd

    filename = "sample-S2-records"
    # lines=True: the file holds one JSON object per line (JSON Lines)
    df = pd.read_json(filename, lines=True)
    df.head()
    

    This shows that the doi and doiUrl columns are all empty.

    Also, if I select only the abstract column and look at its head, 2 of the 5 rows are empty:

    abstract = df['paperAbstract']
    abstract.head()
    
    0                                                     
    1    The search for new administrators in complex s...
    2    The human N-formyl peptide receptor (FPR) is a...
    3    Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...
    4                                                     
    Name: paperAbstract, dtype: object
    

    What am I missing? Any suggestions?

    1 answer  |  last active 6 years ago
  •  1
  •   Liudvikas Akelis    6 years ago

    I looked at your data sample, and I think you are getting the correct result. If we parse the JSON by hand:

    import json
    filename = "sample-S2-records"
    with open(filename, 'r') as f:
        d = [json.loads(x) for x in f]
    

    >>> d[0]['paperAbstract']
    ''
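
    Checking records one at a time gets tedious, so the same manual check can be extended to the whole file. The following is only a sketch using the standard library: the helper name count_empty_fields and the three inline records are made up for illustration (they are not taken from sample-S2-records), and "empty" is assumed to mean an empty string.

```python
import io
import json
from collections import Counter

# Made-up stand-in for the real file: one JSON object per line.
sample = io.StringIO(
    '{"id": "a1", "paperAbstract": "", "doi": ""}\n'
    '{"id": "b2", "paperAbstract": "Some abstract text.", "doi": ""}\n'
    '{"id": "c3", "paperAbstract": "", "doi": "10.1000/example"}\n'
)


def count_empty_fields(lines):
    """Return (record count, {field: number of records where it is ''})."""
    empty = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines in the file
            continue
        record = json.loads(line)
        total += 1
        for key, value in record.items():
            if value == "":
                empty[key] += 1
    return total, dict(empty)


total, empty_counts = count_empty_fields(sample)
print(total, empty_counts)
```

    To run it against the real data, pass the open file object instead of the in-memory sample, for example inside `with open(filename) as f:`.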
    

    So it looks like the paperAbstract field of the first record is simply empty in the data itself.
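
    If rows with an empty abstract are not wanted, one possible approach is to filter them out after loading. This is a minimal sketch, not the only way to do it: the inline records are made up, and the choice of id as the second column to keep is an assumption, since the question does not say which other field is needed.

```python
import io

import pandas as pd

# Made-up records in the same one-object-per-line layout as the sample file.
raw = io.StringIO(
    '{"id": "a1", "paperAbstract": "", "doi": ""}\n'
    '{"id": "b2", "paperAbstract": "Some abstract text.", "doi": ""}\n'
    '{"id": "c3", "paperAbstract": "Another abstract.", "doi": "10.1000/example"}\n'
)

df = pd.read_json(raw, lines=True)

# True where the abstract has real content; missing values and
# whitespace-only strings are treated the same as empty strings.
has_abstract = df["paperAbstract"].fillna("").str.strip().ne("")

n_empty = int((~has_abstract).sum())

# Keep only the columns of interest ("id" is an assumed choice).
result = df.loc[has_abstract, ["id", "paperAbstract"]].reset_index(drop=True)
print(n_empty)
print(result)
```

    With the real file, replace raw with the filename, as in the question's own read_json call.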

    Aside: I think this question should be closed; I doubt it will be helpful to anyone else.