
Running distributed TensorFlow: UnavailableError: Endpoint read failed

  • Jekyll SONG  · 6 years ago

    I am new to TensorFlow and don't have much experience. I am currently trying out distributed TensorFlow.

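    Following the official guide, I first create two servers by running the following code in two separate terminals, passing the task number (0 or 1) as the first argument: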

    import sys
    import tensorflow as tf
    
    task_number = int(sys.argv[1])
    
    cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
    server = tf.train.Server(cluster, job_name="local", task_index=task_number)
    
    print("Starting server #{}".format(task_number))
    
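    # tf.train.Server starts on construction (start=True by default), so this
    # call is redundant; it only produces the "Server already started" log line.
    server.start()
    # Block so the server keeps serving requests.
    server.join()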
    

    The servers start successfully:

    2018-01-25 20:05:37.651802: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job local -> {0 -> localhost:2222, 1 -> localhost:2223}
    2018-01-25 20:05:37.652881: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
    Starting server #0
    2018-01-25 20:05:37.652938: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:328] Server already started (target: grpc://localhost:2222)
    

    Then I run the following program:

    import tensorflow as tf
    x = tf.constant(2)
    
    with tf.device("/job:local/task:1"):
        y2 = x - 66
    
    with tf.device("/job:local/task:0"):
        y1 = x + 300
        y = y1 + y2
    
    with tf.Session("grpc://localhost:2223") as sess:
        result = sess.run(y)
        print(result)
    

    It then fails with the following error:

    E0125 20:05:49.573488650   10292 ev_epoll1_linux.c:1051]     grpc epoll fd: 5
    Traceback (most recent call last):
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
        return fn(*args)
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
        self._extend_graph()
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
        self._session, graph_def.SerializeToString(), status)
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
        c_api.TF_GetCode(self.status.status))
    tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/****/Documents/intern/sample_data/try.py", line 25, in <module>
        result = sess.run(y)
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
        run_metadata_ptr)
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
        feed_dict_tensor, options, run_metadata)
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
        options, run_metadata)
      File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed
    

    I searched on Google, and some people suggested it might be a proxy issue, so I disabled the proxy, but nothing changed.

    Does anyone know what the problem might be? Thank you very much.

    1 answer  |  6 years ago

  • Jekyll SONG  · 6 years ago

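    Never mind, the problem is solved. It was caused by the proxy settings: the proxy has to be unset on both the servers and the client for the program to work.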
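
    The answer does not say how the proxy was configured. As a minimal sketch, assuming it was set through environment variables, the snippet below removes them from the current process. The variable list and the helper name `clear_proxy_env` are illustrative assumptions, not something taken from the thread. It would need to run in each server script and in the client script, before `tf.train.Server` or `tf.Session` is created.

```python
import os

# Assumption: the proxy is configured through these environment variables.
# The thread does not name them; adjust the list to match your setup.
PROXY_VARS = ("http_proxy", "https_proxy", "grpc_proxy")


def clear_proxy_env(environ=os.environ):
    """Delete proxy variables (lower- and upper-case spellings).

    Returns the list of variable names that were actually removed.
    """
    removed = []
    for name in PROXY_VARS:
        for key in (name, name.upper()):
            if key in environ:
                del environ[key]
                removed.append(key)
    return removed


# Call this at the top of the script, before any TensorFlow server or
# session is created, so the gRPC channels are opened without a proxy.
removed = clear_proxy_env()
print("Removed proxy variables:", removed)
```

    Clearing the variables inside the script only affects that process, which is why it has to be repeated for both server terminals and for the client. Unsetting them in each shell before launching the scripts should have the same effect.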