Populating a variably nested dictionary with multiple API calls

medical json python

Chris · Tech Community · 6 years ago

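I am using the public API at www.gpcontract.co.uk to populate a large, variably nested dictionary representing the hierarchy of UK health organisations.

Some background

The top level is the four UK countries (England, Scotland, Wales and Northern Ireland), then regional organisations, all the way down to individual clinics. The depth of the hierarchy differs for each country and can change from year to year. Each organisation has a name, an orgcode and a dictionary listing its child organisations.

Unfortunately, the full nested hierarchy is not available from the API. Instead, a call to http://www.gpcontract.co.uk/api/children/[organisation code]/[year] returns the immediate child organisations of any given organisation.

So that the hierarchy can be navigated easily in my app, I want to generate an offline dictionary of the full hierarchy (on a per-year basis), which will be bundled with the app using pickle.

Getting this means a lot of API calls, and I am having trouble converting the returned JSON into the dictionary object I need.

Here is an example of one tiny part of the hierarchy (I have only shown a single child organisation as an example).

Example JSON hierarchy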

{
  "eng": {
    "name": "England",
    "orgcode": "eng",
    "children": {}
  },
  "sco": {
    "name": "Scotland",
    "orgcode": "sco",
    "children": {}
  },
  "wal": {
    "name": "Wales",
    "orgcode": "wal",
    "children": {}
  },
  "nir": {
    "name": "Northern Ireland",
    "orgcode": "nir",
    "children": {
      "blcg": {
        "name": "Belfast Local Commissioning Group",
        "orgcode": "blcg",
        "children": {
          "abc": {
            "name": "Random Clinic",
            "orgcode": "abc",
            "children": {}
          }
        }
      }
    }
  }
}
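
Because every node has the same shape (name, orgcode, children), a structure like this can be walked with one short recursive function regardless of depth. The following is only a sketch: iter_orgs and sample are hypothetical names used for illustration, and sample is a trimmed copy of the example above.

```python
def iter_orgs(orgs, depth=0):
    """Yield (depth, organisation) for every organisation in a nested hierarchy dict."""
    for org in orgs.values():
        yield depth, org
        # The children dict has the same shape as its parent, so recurse into it.
        yield from iter_orgs(org['children'], depth + 1)


# Trimmed copy of the example hierarchy (hypothetical variable name).
sample = {
    'nir': {
        'name': 'Northern Ireland', 'orgcode': 'nir',
        'children': {
            'blcg': {
                'name': 'Belfast Local Commissioning Group', 'orgcode': 'blcg',
                'children': {
                    'abc': {'name': 'Random Clinic', 'orgcode': 'abc', 'children': {}},
                },
            },
        },
    },
}

# Print each organisation indented by its depth in the hierarchy.
for depth, org in iter_orgs(sample):
    print('  ' * depth + org['name'])
```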

Below is the script I am using to make the API calls and populate the dictionary:

My script

import json, pickle, urllib.request, urllib.error, urllib.parse

# Organisation hierarchy may vary between years. Set the year here.
year = 2017

# This function returns a list containing a dictionary for each child organisation with keys for name and orgcode
def get_child_orgs(orgcode, year):
    orgcode = str(orgcode)
    year = str(year)

    # Correct 4-digit year to 2-digit
    if len(year) > 2:
        year = year[2:]

    try:
        child_data = json.loads(urllib.request.urlopen('http://www.gpcontract.co.uk/api/children/' + str(orgcode) + '/' + year).read())

        output = []

        if child_data != []:
            for item in child_data['children']:
                output.append({'name' : item['name'], 'orgcode' : str(item['orgcode']).lower(), 'children' : {}})
        return output
    except urllib.error.HTTPError:
        print('HTTP error!')
    except:
        print('Other error!')

# I start with a template of the top level of the hierarchy and then populate it
hierarchy = {'eng' : {'name' : 'England', 'orgcode' : 'eng', 'children' : {}}, 'nir' : {'name' : 'Northern Ireland', 'orgcode' : 'nir', 'children' : {}}, 'sco' : {'name' : 'Scotland', 'orgcode' : 'sco', 'children' : {}}, 'wal' : {'name' : 'Wales', 'orgcode' : 'wal', 'children' : {}}}

print('Loading data...\n')

# Here I use nested for loops to make API calls and populate the dictionary down the levels of the hierarchy. The bottom level contains the most items.
for country in ('eng', 'nir', 'sco', 'wal'): 

    for item1 in get_child_orgs(country, year):
        hierarchy[country]['children'][item1['orgcode']] = item1

        for item2 in get_child_orgs(item1['orgcode'], year):
            hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']] = item2

            # Only England and Wales hierarchies go deeper than this
            if country in ('eng', 'wal'):

                level3 = get_child_orgs(item2['orgcode'], year)
                # Check not empty array
                if level3 != []:
                    for item3 in level3:
                        hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']] = item3

                        level4 = get_child_orgs(item3['orgcode'], year)
                        # Check not empty array
                        if level4 != []:
                            for item4 in level4:
                                hierarchy[country]['children'][item1['orgcode']]['children'][item2['orgcode']]['children'][item3['orgcode']]['children'][item4['orgcode']] = item4

# Save the completed hierarchy with pickle
file_name = 'hierarchy_' + str(year) + '.dat'
with open(file_name, 'wb') as out_file:
    pickle.dump(hierarchy, out_file)

print('Success!')

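The problem

This seems to work most of the time, but it feels hacky, and it sometimes crashes when the nested for loops hit a "NoneType is not iterable" error. I realise this makes a lot of API calls and takes several minutes to run, but I can't see a way around that, because I want the completed hierarchy to be available offline so that users can search the data quickly. I will then use the API in a slightly different way to fetch the actual healthcare data for the chosen organisation.

My questions

Is there a cleaner, more flexible way to accommodate the variable nesting of the organisation hierarchy?

Is there a way to do this significantly faster?

I am relatively inexperienced with JSON, so any help would be appreciated.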

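1 answer | 6 years ago

Reid Ballard · 6 years ago

I think this question may be better suited to the Code Review Stack Exchange, but since you mention that your code sometimes crashes with a NoneType error, I'll give it the benefit of the doubt.

Looking at your description, this is what stands out to me:

Each organisation has a name, an orgcode and a dictionary listing its child organisations. [The API call] returns the immediate child organisations of any given organisation.

What this suggests to me (and how it looks in your sample data) is that all of your data is structurally identical; the hierarchy exists only because of how the data is nested and is not enforced by the format of any particular node.

That means you should be able to have a single piece of code handle the nesting of an infinitely (or arbitrarily, if you prefer) deep tree. You obviously already do this for the API call itself (get_child_orgs()), so just replicate that for building the tree: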

def populate_hierarchy(organization,year):
    """ Recursively Populate the Organization Hierarchy

        organization should be a dict with an "orgcode" key with a string value
        and "children" key with a dict value.

        year should be a 2-4 character string representing a year.
    """
    orgcode = organization['orgcode']

    ## get_child_orgs returns a list of organizations
    children = get_child_orgs(orgcode,year)

    ## get_child_orgs returns None on Errors
    if children:
        for child in children:

            ## Add child to the current organization's children, using
            ## orgcode as its key
            organization['children'][child['orgcode']] = child

            ## Recursively populate the child's sub-hierarchy
            populate_hierarchy(child,year)

    ## Technically, the way this is written, returning organization is
    ## pointless because we're modifying organization in place, but I'm
    ## doing it anyway to explicitly denote the end of the function
    return organization

for country in hierarchy.values():
    populate_hierarchy(country,year)
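
To try the recursion without touching the network, one option is to pass the lookup in as an argument and feed it a fake. This is only a sketch under that assumption: FAKE_API, fake_get_child_orgs and populate are hypothetical names, and the fake data mirrors the sample hierarchy from the question.

```python
# Hypothetical in-memory stand-in for the API: orgcode -> list of child organisations.
FAKE_API = {
    'nir': [{'name': 'Belfast Local Commissioning Group', 'orgcode': 'blcg'}],
    'blcg': [{'name': 'Random Clinic', 'orgcode': 'abc'}],
}


def fake_get_child_orgs(orgcode, year):
    """Return fresh child dicts in the same shape get_child_orgs produces."""
    return [dict(child, children={}) for child in FAKE_API.get(orgcode, [])]


def populate(organization, year, get_children):
    """Same recursion as populate_hierarchy, with the lookup function passed in."""
    # "or []" covers a lookup that returns None after an error.
    for child in get_children(organization['orgcode'], year) or []:
        organization['children'][child['orgcode']] = child
        populate(child, year, get_children)
    return organization


root = {'name': 'Northern Ireland', 'orgcode': 'nir', 'children': {}}
populate(root, 2017, fake_get_child_orgs)
print(root['children']['blcg']['children']['abc']['name'])
```

The recursion stops on its own once a lookup returns no children, so no depth limit or per-country special case is needed.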

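It is worth noting (since your original code checks for an empty list before iterating) that for x in y works fine when y is an empty list: the loop body simply never runs, so you don't need the check.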
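
A quick way to convince yourself of this, using a made-up list name:

```python
# Iterating over an empty list is a no-op, so no emptiness check is needed first.
visited = []
for item in []:
    visited.append(item)  # never reached

print(visited)  # → []
```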

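The NoneType error is most likely because you catch the exception in get_child_orgs and then implicitly return None. So, for example, level3 = get_child_orgs[etc...] results in level3 = None; the check if None != []: on the next line is then True, and the code tries to iterate over None with for item3 in None:, which raises the error. As noted in the code above, this is why I check the truthiness of children.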
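
Another way to avoid the problem is to make the lookup always return a list. The following is a sketch rather than a drop-in replacement: get_child_orgs_safe and its fetch parameter are hypothetical, and fetch exists only so the function can be exercised without the network. The URL and the response shape are taken from the question's script.

```python
import json
import urllib.error
import urllib.request

BASE_URL = 'http://www.gpcontract.co.uk/api/children/'


def _read_url(url):
    """Default fetcher: return the raw response body for url."""
    return urllib.request.urlopen(url).read()


def get_child_orgs_safe(orgcode, year, fetch=_read_url):
    """Hypothetical variant of get_child_orgs that returns [] instead of None on failure."""
    year = str(year)
    if len(year) > 2:
        year = year[2:]  # correct a 4-digit year to 2 digits, as in the question
    url = BASE_URL + str(orgcode) + '/' + year
    try:
        child_data = json.loads(fetch(url))
    except (urllib.error.URLError, ValueError) as exc:
        # HTTPError is a subclass of URLError; invalid JSON raises a ValueError subclass.
        print('Could not load', url, '-', exc)
        return []
    if not child_data:
        return []
    return [
        {'name': item['name'], 'orgcode': str(item['orgcode']).lower(), 'children': {}}
        for item in child_data['children']
    ]


# Usage with a fake fetcher, so nothing is sent over the network.
fake = lambda url: b'{"children": [{"name": "Random Clinic", "orgcode": "ABC"}]}'
print(get_child_orgs_safe('blcg', 2017, fetch=fake))
```

Catching only the exceptions you expect, instead of a bare except, also means genuine bugs such as a KeyError still surface with a traceback.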

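As for whether this can be done faster, you could try the threading or multiprocessing modules. I just don't know how much either would gain you, for three reasons:

1. I haven't tried the API, so I don't know how much time you stand to save by using multiple threads/processes.
2. I have seen APIs that time out requests from an IP address when you query them too quickly or too often (which would make the implementation pointless).
3. You say you only run this process once a year, so a run time of a few minutes seems negligible on the scale of a whole year (unless, obviously, the current API calls take days to complete).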
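
If you do want to experiment, here is a minimal sketch of one way to parallelise the lookups, using concurrent.futures from the standard library and working one hierarchy level at a time. I have not run it against the real API, so treat it as an assumption-laden starting point: populate_concurrently, max_workers, delay, FAKE_API and fake_get_child_orgs are all hypothetical names, and the fake lookup stands in for get_child_orgs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def populate_concurrently(roots, year, get_children, max_workers=4, delay=0.0):
    """Fill in 'children' for every organisation, one hierarchy level at a time."""
    def fetch(org):
        time.sleep(delay)  # optional pause before each request, in case of rate limits
        return org, get_children(org['orgcode'], year) or []

    level = list(roots)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while level:
            next_level = []
            # The lookups run in worker threads; the dicts are only modified here,
            # in the calling thread, so no locking is needed.
            for org, children in pool.map(fetch, level):
                for child in children:
                    org['children'][child['orgcode']] = child
                    next_level.append(child)
            level = next_level
    return roots


# Hypothetical in-memory stand-in for the API.
FAKE_API = {
    'nir': [{'name': 'Belfast Local Commissioning Group', 'orgcode': 'blcg'}],
    'blcg': [{'name': 'Random Clinic', 'orgcode': 'abc'}],
}


def fake_get_child_orgs(orgcode, year):
    return [dict(child, children={}) for child in FAKE_API.get(orgcode, [])]


roots = [{'name': 'Northern Ireland', 'orgcode': 'nir', 'children': {}}]
populate_concurrently(roots, 2017, fake_get_child_orgs)
print(sorted(roots[0]['children']))
```

With the real data you would pass hierarchy.values() as roots and get_child_orgs as get_children. Threads rather than processes are the natural fit here because the work is waiting on HTTP responses, not computation; keep max_workers small and delay above zero if the server starts refusing requests.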

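Finally, I would question whether pickle is the appropriate way to store this information, or whether you would be better off simply using json.dump/json.load (for the record, the json module doesn't care if you change the file extension to .dat, if you are partial to that extension).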
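
For illustration, a round trip through json might look like the sketch below. The one-entry hierarchy is a placeholder for the real data, and the temporary directory is used only so the example cleans up after itself; in the real script you would write next to the app as before.

```python
import json
import os
import tempfile

# Placeholder data in the same shape as the real hierarchy.
hierarchy = {
    'nir': {'name': 'Northern Ireland', 'orgcode': 'nir', 'children': {}},
}
year = 2017

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'hierarchy_' + str(year) + '.json')

    # Save the hierarchy as human-readable JSON.
    with open(file_name, 'w', encoding='utf-8') as out_file:
        json.dump(hierarchy, out_file, indent=2)

    # Load it back, as the app would at start-up.
    with open(file_name, encoding='utf-8') as in_file:
        loaded = json.load(in_file)

print(loaded == hierarchy)
```

The round trip is lossless here because every key is a string and every value is a string or a dict. The resulting file can also be inspected in a text editor and read from other languages, which a pickle cannot.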