
Cassandra connector failing when running under Spark 2.3 on Kubernetes

asked by outside2344 · 6 years ago

    I am trying to use the Cassandra connector, which I have used successfully many times in the past, with the new Spark 2.3 native Kubernetes support, and I am running into a lot of trouble.

    I have a very simple job that looks like this:

    package io.rhom
    
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.cassandra._
    
    import com.datastax.spark.connector.cql.CassandraConnectorConf
    import com.datastax.spark.connector.rdd.ReadConf
    
    /** Backs up Cassandra location data to Azure blob storage as Avro */
    object BackupLocations {
      def main(args: Array[String]) {
        val spark = SparkSession
          .builder
          .appName("BackupLocations")
          .getOrCreate()
    
        spark.sparkContext.hadoopConfiguration.set(
          "fs.defaultFS",
          "wasb://<snip>"
        )
    
        spark.sparkContext.hadoopConfiguration.set(
          "fs.azure.account.key.rhomlocations.blob.core.windows.net",
          "<snip>"
        )
    
        val df = spark
          .read
          .format("org.apache.spark.sql.cassandra")
          .options(Map( "table" -> "locations", "keyspace" -> "test"))
          .load()
    
        df.write
          .mode("overwrite")
          .format("com.databricks.spark.avro")
          .save("wasb://<snip>")
    
        spark.stop()
      }
    }
    

    I am building this under SBT with Scala 2.11, and packaging it with a Dockerfile that looks like this:

    FROM timfpark/spark:20180305
    
    COPY core-site.xml /opt/spark/conf
    
    RUN mkdir -p /opt/spark/jars
    COPY target/scala-2.11/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar /opt/spark/jars
    
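    For reference, a minimal `build.sbt` for a job like this might look as follows. This is a sketch: the organization, exact version numbers, and the choice to mark every dependency `provided` (because the jars are supplied by the image and by `--jars` at submit time) are assumptions, not taken from the question.

    ```scala
    // build.sbt -- minimal sketch for the rhom-backup-locations job.
    // Versions are illustrative; they must match the Spark build in the image.
    name := "rhom-backup-locations"
    organization := "io.rhom"          // assumed from the package name
    version := "0.1.0-SNAPSHOT"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      // "provided": the Spark distribution inside the container supplies this at runtime
      "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided",
      // the connector and Avro support are passed via --jars at submit time,
      // so they are only needed here for compilation
      "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.3" % "provided",
      "com.databricks" %% "spark-avro" % "4.0.0" % "provided"
    )
    ```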

    I then launch the job with:

    bin/spark-submit --master k8s://blue-rhom-io.eastus2.cloudapp.azure.com:443  \
                                 --deploy-mode cluster  \
                                 --name backupLocations \
                                 --class io.rhom.BackupLocations \
                                 --conf spark.executor.instances=2 \
                                 --conf spark.cassandra.connection.host=10.1.0.10 \
                                 --conf spark.kubernetes.container.image=timfpark/rhom-backup-locations:20180306v12 \
                                 --jars https://dl.bintray.com/spark-packages/maven/datastax/spark-cassandra-connector/2.0.3-s_2.11/spark-cassandra-connector-2.0.3-s_2.11.jar,http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.2/hadoop-azure-2.7.2.jar,http://central.maven.org/maven2/com/microsoft/azure/azure-storage/3.1.0/azure-storage-3.1.0.jar,http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
                                   local:///opt/spark/jars/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar
    

    Everything works except the Cassandra connector piece, which ultimately fails with:

    2018-03-07 01:19:38 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 10.4.0.46, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
            at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
            at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
            at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
            at org.apache.spark.scheduler.Task.run(Task.scala:109)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
    Caused by: java.io.IOException: Exception during preparation of SELECT "user_id", "timestamp", "accuracy", "altitude", "altitude_accuracy", "course", "features", "latitude", "longitude", "source", "speed" FROM "rhom"."locations" WHERE token("user_id") > ? AND token("user_id") <= ?   ALLOW FILTERING: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:323)
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:339)
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
            at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
            at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
            at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
            at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
            at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
            at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
            at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
            at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
            at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
            at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
            ... 8 more
    Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
            at org.apache.spark.sql.catalyst.ReflectionLock$.<init>(ReflectionLock.scala:5)
            at org.apache.spark.sql.catalyst.ReflectionLock$.<clinit>(ReflectionLock.scala)
            at com.datastax.spark.connector.types.TypeConverter$.<init>(TypeConverter.scala:73)
            at com.datastax.spark.connector.types.TypeConverter$.<clinit>(TypeConverter.scala)
            at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:50)
            at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:46)
            at com.datastax.spark.connector.types.ColumnType$.converterToCassandra(ColumnType.scala:231)
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
            at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
            at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
            at scala.collection.Iterator$class.foreach(Iterator.scala:893)
            at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
            at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
            at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
            at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
            at scala.collection.AbstractTraversable.map(Traversable.scala:104)
            at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:312)
            ... 23 more
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.package$ScalaReflectionLock$
            at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
            ... 41 more
    
    2018-03-07 01:19:38 INFO  TaskSetManager:54 - Starting task 0.1 in stage 0.0 (TID 3, 10.4.0.46, executor 1, partition 0, ANY, 9486 bytes)
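    The root cause in the trace above is the final `ClassNotFoundException`: the 2.0.x connector was compiled against `org.apache.spark.sql.catalyst.package$ScalaReflectionLock$`, a Spark internal that no longer exists in Spark 2.3, so the executor cannot load it at runtime. A generic way to confirm whether a given class is visible on the current classpath (a diagnostic sketch, not part of the original job):

    ```scala
    // Reports whether a named class can be loaded from the current classpath.
    // Useful for checking which Spark internals a driver or executor image ships.
    object ClasspathCheck {
      def classPresent(name: String): Boolean =
        try {
          Class.forName(name)
          true
        } catch {
          case _: ClassNotFoundException => false
        }

      def main(args: Array[String]): Unit = {
        // On a Spark 2.3 classpath this prints false, matching the error above.
        println(classPresent("org.apache.spark.sql.catalyst.package$ScalaReflectionLock$"))
      }
    }
    ```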
    

    I have tried everything I can think of to resolve this. Does anyone have any ideas? Could this be caused by another, unrelated issue?

    1 Answer  |  until 6 years ago

  •  2
  •   outside2344    6 years ago

    It turns out that version 2.0.7 of the Datastax Cassandra connector does not currently support Spark 2.3. I have opened a JIRA ticket with Datastax for this, and hopefully it will be addressed soon.
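    Once a connector release that targets Spark 2.3 is available, the submit command can resolve it from Maven Central with `--packages` instead of raw jar URLs, so transitive dependencies come along automatically. A sketch of the same submission; the connector coordinate below is an assumption and should be replaced with whichever release explicitly declares Spark 2.3 support:

    ```shell
    # Sketch: same job, with dependencies resolved via --packages.
    # The connector version is illustrative -- pick a Spark 2.3-compatible release.
    bin/spark-submit \
      --master k8s://blue-rhom-io.eastus2.cloudapp.azure.com:443 \
      --deploy-mode cluster \
      --name backupLocations \
      --class io.rhom.BackupLocations \
      --conf spark.executor.instances=2 \
      --conf spark.cassandra.connection.host=10.1.0.10 \
      --conf spark.kubernetes.container.image=timfpark/rhom-backup-locations:20180306v12 \
      --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0,com.databricks:spark-avro_2.11:4.0.0,org.apache.hadoop:hadoop-azure:2.7.2 \
      local:///opt/spark/jars/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar
    ```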