
SageMaker fails when using multi-GPU with keras.utils.multi_gpu_model

aws-sagemaker sagemaker amazon-sagemaker keras tensorflow

loretoparisi · Tech Community · 6 years ago

When running AWS SageMaker with a custom model, the training job fails with an Algorithm Error when Keras with the TensorFlow backend is used in a multi-GPU configuration:

from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=K)
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
parallel_model.fit(x, y, epochs=20, batch_size=256)

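Even this simple parallel model setup fails. CloudWatch logging shows no further error or exception. The same configuration works fine on a local machine with 2x NVIDIA GTX 1080 and the same Keras TensorFlow backend.

According to the SageMaker documentation and tutorials, the multi_gpu_model utility works fine when the Keras backend is MXNet, but I did not find any mention of the case where the backend is TensorFlow with the same multi-GPU configuration.

[UPDATE]

I have updated the code with the answer suggested below, and I am adding some logging from before the training job hangs.

This logging repeats twice: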

2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)

Before that, there is logging info about each GPU, repeated 4 times:

2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB

According to the logging, all 4 GPUs are visible and loaded in the TensorFlow Keras backend. After that, no application logging follows. The training job status stays InProgress for a while and then becomes Failed with the same Algorithm Error.

Looking at CloudWatch, I can see some metrics at work. Specifically, GPU Memory Utilization and CPU Utilization are OK, while GPU Utilization is 0%.
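As a side note for anyone checking the same thing in their own CloudWatch logs, here is a minimal sketch that extracts the GPU ids from the "Adding visible gpu devices" line quoted in this question. The helper name visible_gpu_ids is hypothetical (it is not part of TensorFlow or SageMaker), and the pattern assumes the TensorFlow 1.x log wording shown in the logs here; other versions may word this line differently.

```python
import re

# Assumption: the log contains the TensorFlow 1.x startup line quoted in this
# question, e.g. "... Adding visible gpu devices: 0, 1, 2, 3".
_VISIBLE_RE = re.compile(r"Adding visible gpu devices:\s*(\d+(?:\s*,\s*\d+)*)")


def visible_gpu_ids(log_text):
    """Return the GPU ids from the last matching log line, or [] if none."""
    matches = _VISIBLE_RE.findall(log_text)
    if not matches:
        return []
    # The line can repeat (twice in the logs above); use the last occurrence.
    return [int(token) for token in matches[-1].split(",")]


if __name__ == "__main__":
    sample = (
        "2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/"
        "gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3\n"
        "2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/"
        "gpu_device.cc:988] 0 1 2 3\n"
    )
    print(visible_gpu_ids(sample))
```

This only confirms what TensorFlow reported as visible. It says nothing about whether work is actually scheduled on those devices, which is the 0% GPU Utilization symptom described in the question.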
[UPDATE]

Because of a known Keras bug related to saving a multi-GPU model, I am using this override of the multi_gpu_model utility:

from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)

This works fine locally on 2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04. It fails in the SageMaker training job.

I have posted this issue on the AWS SageMaker forum:

TrainingJob custom algorithm with Keras backend and multi GPU
SageMaker Fails when using Multi-GPU with keras.utils.multi_gpu_model

[UPDATE]

I slightly modified the tf.Session code, adding some initializers:

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())

Now at least I can see from the instance metrics that one GPU (I assume device gpu:0) is in use. Multi-GPU still does not work.
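To make the batch splitting in get_slice easier to follow, here is a minimal pure-Python sketch of the same arithmetic along the batch axis only. The helper slice_bounds is hypothetical and is not part of Keras or of the override above; it uses plain integers where the override uses tensors.

```python
def slice_bounds(batch_size, parts):
    """Return one (start, size) pair per replica along the batch axis.

    Mirrors the arithmetic in get_slice: every replica gets
    batch_size // parts samples, and the last replica takes the remainder.
    """
    step = batch_size // parts
    bounds = []
    for i in range(parts):
        start = step * i
        if i == parts - 1:
            # The last replica absorbs whatever the even split left over.
            size = batch_size - step * i
        else:
            size = step
        bounds.append((start, size))
    return bounds


if __name__ == "__main__":
    # batch_size=256 on 4 GPUs, as in the fit() call in the question.
    print(slice_bounds(256, 4))
    # A batch that does not divide evenly.
    print(slice_bounds(10, 4))
```

With batch_size=256 and 4 GPUs each replica receives 64 samples. Under this arithmetic a batch smaller than the number of GPUs gives a step of 0, so every replica except the last would receive an empty slice. Whether that matters for the failure described here is not established by anything in this thread.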

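2 answers | most recent 6 years ago

deKeijzer 6 years ago

This may not be the best answer to your problem, but it is what I use for a multi-GPU model with the TensorFlow backend. First, I initialize with the following: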

def setup_multi_gpus():
    """
    Setup multi GPU usage

    Example usage:
    model = Sequential()
    ...
    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.fit()

    About memory usage:
    https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
    """
    import tensorflow as tf
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: Tells tf to not occupy a specific amount of memory
    from keras.backend.tensorflow_backend import set_session  
    config = tf.ConfigProto()  
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU  
    sess = tf.Session(config=config)  
    set_session(sess)  # set this TensorFlow session as the default session for Keras.


    # Get the number of GPUs
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']

    num_gpu = len(get_available_gpus())
    print('Amount of GPUs available: %s' % num_gpu)

    return num_gpu

Then I call

# Setup multi GPU usage
num_gpu = setup_multi_gpus()

Create a model.

...

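After that you can turn it into a multi-GPU model.

# The import inside setup_multi_gpus() is local to that function,
# so multi_gpu_model has to be imported again at this level.
from keras.utils import multi_gpu_model

multi_model = multi_gpu_model(model, gpus=num_gpu)
multi_model.compile...
multi_model.fit...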

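The only thing that differs here from what you did is the way TensorFlow initializes the GPUs. I can't imagine that this is the problem, but it may be worth trying.

Good luck!

Edit: I noticed that sequence-to-sequence models do not work with multi-GPU. Is that the kind of model you are trying to train?

ByungWook 6 years ago

I apologize for the slow response.

There seem to be several threads running in parallel, so I would like to link them together so that others with the same issue can follow the progress and the discussion.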

https://forums.aws.amazon.com/thread.jspa?messageID=881541 https://forums.aws.amazon.com/thread.jspa?messageID=881540

https://github.com/aws/sagemaker-python-sdk/issues/512

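I have a few questions about this.

What versions of TensorFlow and Keras are you using?

I am not quite sure what is causing this issue. Does your container have all the needed dependencies, such as CUDA? https://www.tensorflow.org/install/gpu

Are you able to train with a single GPU with Keras?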
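For the version question, one way to collect the answer from inside the training container is sketched below. It uses only the standard library (importlib.metadata, Python 3.8+). The function name installed_versions is hypothetical, and the distribution names in the default list are assumptions about how the packages were installed in the container.

```python
from importlib import metadata


def installed_versions(packages=("tensorflow", "tensorflow-gpu", "keras")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Printing this at the start of the training script would put the versions into the same CloudWatch log stream as the TensorFlow device messages. It does not check the CUDA or cuDNN libraries, which are not Python distributions.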