
Dataproc PySpark workers don't have permission to use gsutil

  •  0
  • B. Sun  · technical community  · 7 years ago

    In a Datalab notebook, I run

    import subprocess
    all_parent_directory = subprocess.Popen("gsutil ls gs://parent-directories ", shell=True, stdout=subprocess.PIPE).stdout.read()
    

    This lists all the subdirectories without any problems. I then define a function that runs gsutil ls on a given subdirectory:

    def get_sub_dir(path):
        import subprocess
        p = subprocess.Popen("gsutil ls gs://parent-directories/" + path, shell=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        return p.stdout.read(), p.stderr.read()
    

    Calling get_sub_dir(sub-directory) directly works fine.

    However,

     sub_dir = sc.parallelize([sub-directory])
     sub_dir.map(get_sub_dir).collect()
    

    gives me:

     Traceback (most recent call last):
      File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 99, in <module>
        main()
      File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 30, in main
        project, account = bootstrapping.GetActiveProjectAndAccount()
      File "/usr/lib/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 205, in GetActiveProjectAndAccount
        project_name = properties.VALUES.core.project.Get(validate=False)
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1373, in Get
        required)
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1661, in _GetProperty
        value = _GetPropertyWithoutDefault(prop, properties_file)
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1699, in _GetPropertyWithoutDefault
        value = callback()
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 222, in GetProject
        return c_gce.Metadata().Project()
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 203, in Metadata
        _metadata_lock.lock(function=_CreateMetadata, argument=None)
      File "/usr/lib/python2.7/mutex.py", line 44, in lock
        function(argument)
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 202, in _CreateMetadata
        _metadata = _GCEMetadata()
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 59, in __init__
        self.connected = gce_cache.GetOnGCE()
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 141, in GetOnGCE
        return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 81, in GetOnGCE
        self._WriteDisk(on_gce)
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 113, in _WriteDisk
        with files.OpenForWritingPrivate(gce_cache_path) as gcecache_file:
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 715, in OpenForWritingPrivate
        MakeDir(full_parent_dir_path, mode=0700)
      File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 115, in MakeDir
        (u'Please verify that you have permissions to write to the parent '
    googlecloudsdk.core.util.files.Error: Could not create directory [/home/.config/gcloud]: Permission denied.
    
    Please verify that you have permissions to write to the parent directory.
    

    After checking, running whoami on the worker node shows yarn .

    How can I use gsutil as the yarn user, or is there another way to access the bucket from the Dataproc PySpark worker nodes?

    1 Answer  |  7 years ago
        1
  •  2
  •   Dennis Huo    7 years ago

    When fetching a token from the metadata service, the CLI looks at the current home directory to decide where to place a cached credentials file. The relevant code is in googlecloudsdk/core/config.py :

    def _GetGlobalConfigDir():
      """Returns the path to the user's global config area.
    
      Returns:
        str: The path to the user's global config area.
      """
      # Name of the directory that roots a cloud SDK workspace.
      global_config_dir = encoding.GetEncodedValue(os.environ, CLOUDSDK_CONFIG)
      if global_config_dir:
        return global_config_dir
      if platforms.OperatingSystem.Current() != platforms.OperatingSystem.WINDOWS:
        return os.path.join(os.path.expanduser('~'), '.config',
                            _CLOUDSDK_GLOBAL_CONFIG_DIR_NAME)
    
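    The fallback above can be sketched in plain Python; this is a simplified illustration of the resolution logic, not the SDK's actual code, and the helper name is made up:

```python
import os

# Simplified sketch of the Cloud SDK's config-dir resolution (illustrative,
# not the real implementation): an explicit CLOUDSDK_CONFIG wins; otherwise
# the path is built from the home directory, which comes from $HOME.
def get_global_config_dir(environ):
    override = environ.get('CLOUDSDK_CONFIG')
    if override:
        return override
    home = environ.get('HOME', '/')
    return os.path.join(home, '.config', 'gcloud')

# In a YARN container HOME defaults to /home/, so the cache path becomes
# /home/.config/gcloud, which the yarn user cannot create:
print(get_global_config_dir({'HOME': '/home/'}))
# With a writable homedir, the cache lands somewhere the yarn user owns:
print(get_global_config_dir({'HOME': '/var/lib/hadoop-yarn'}))
```

    This makes it clear why the traceback ends with "Could not create directory [/home/.config/gcloud]": the failing path is derived entirely from the HOME the container inherits.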

    Even though your tasks run as the yarn user, and if you simply run sudo su yarn you will see ~ resolve to /var/lib/hadoop-yarn on a Dataproc node, YARN actually propagates yarn.nodemanager.user-home-dir as the container's home directory, and this defaults to /home/ . For this reason, even though you can sudo -u yarn gsutil ... , it does not behave the same way as gsutil inside a YARN container, and naturally only root is able to create directories under the base /home/ directory.
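    One way to confirm which home directory the executors actually inherit is to collect HOME from inside a task. This is a hedged diagnostic sketch; it assumes an existing SparkContext sc , as in the question:

```python
import os

# Diagnostic sketch: report which HOME an executor task inherits.
def report_home(_):
    return os.environ.get('HOME')

# From the driver (assumes an existing SparkContext `sc`, as in the question):
# homes = sc.parallelize(range(2)).map(report_home).collect()
# On an unpatched cluster this would report the container default rather
# than /var/lib/hadoop-yarn.
```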

    Long story short, you have two options:

    1. In your code, add HOME=/var/lib/hadoop-yarn right before your gsutil statement:

       p = subprocess.Popen("HOME=/var/lib/hadoop-yarn gsutil ls gs://parent-directories/" + path, shell=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    
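    Alternatively, the same effect can be had by passing an adjusted environment to Popen instead of prefixing HOME= inside the shell string. This is a sketch, not the answer's exact code; the gsutil argument is parameterized here only so the function can be exercised without the Cloud SDK installed, and it defaults to the real binary:

```python
import os
import subprocess

# Sketch: run gsutil with a writable homedir by overriding HOME via env=,
# avoiding shell=True. `path` is the subdirectory suffix from the question.
def get_sub_dir(path, gsutil='gsutil'):
    env = dict(os.environ, HOME='/var/lib/hadoop-yarn')  # yarn-writable homedir
    p = subprocess.Popen([gsutil, 'ls', 'gs://parent-directories/' + path],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         env=env)
    out, err = p.communicate()  # drains both pipes; safer than .read() on each
    return out, err
```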
    2. When creating the cluster, specify the YARN property:

    gcloud dataproc clusters create --properties yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn ...
    

    Or, on an existing cluster, add the property to /etc/hadoop/conf/yarn-site.xml on each worker node and then restart the NodeManager with sudo systemctl restart hadoop-yarn-nodemanager.service .
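    For the manual route, the entry in /etc/hadoop/conf/yarn-site.xml would look like the following fragment (a sketch, using standard Hadoop property syntax):

```xml
<property>
  <name>yarn.nodemanager.user-home-dir</name>
  <value>/var/lib/hadoop-yarn</value>
</property>
```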