
Filtering on non-exploded structs in a Spark DataFrame with Scala

  •  0
  •  Ged  ·  6 years ago

    I have:

     +-----------------------+-------+------------------------------------+
     |cities                 |name   |schools                             |
     +-----------------------+-------+------------------------------------+
     |[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|
     |[santa cruz]           |Andy   |[[ucsb, 2011]]                      |
     |[portland]             |Justin |[[berkeley, 2014]]                  |
     +-----------------------+-------+------------------------------------+
    

     val res = df.select ("*").where (array_contains (df("schools.sname"), "berkeley")).show(false)
    

    But, without exploding or using a UDF, how can I do something like the following, in the same or a similar manner to the above:

     return all rows where at least 1 schools.sname starts with "b"  ?
    

    E.g.:

     val res = df.select ("*").where (startsWith (df("schools.sname"), "b")).show(false)
    

    This is of course wrong and is only there to make the point. But how can I do this kind of filtering in general, without exploding and without a UDF that returns true/false or anything else? Perhaps it is not possible; I cannot find such an example. Or can it be done?

    The answers received show how things have to be approached in certain ways because some functionality does not exist in Spark's Scala API. I read a post pointing to new array functions to be implemented after this, which proves the point.
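
    For reference, those new array functions did arrive later. A minimal sketch, assuming Spark 2.4+ (where the SQL higher-order function exists is available) and the question's schema with struct field sname:

     import org.apache.spark.sql.functions.expr

     // exists(array, lambda) is true if any element satisfies the predicate,
     // so neither explode nor a UDF is needed.
     df.where(expr("exists(schools, s -> s.sname LIKE 'b%')")).show(false)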

    2 Answers  |  6 years ago
        1
  •  1
  •   stack0114106    6 years ago

    How about this?

    scala> val df = Seq ( ( Array("palo alto", "menlo park"), "Michael", Array(("stanford", 2010), ("berkeley", 2012))),
         |     (Array(("santa cruz")),"Andy",Array(("ucsb", 2011))),
         |       (Array(("portland")),"Justin",Array(("berkeley", 2014)))
         |     ).toDF("cities","name","schools")
    df: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 1 more field]
    
    scala> val df2 = df.select ("*").withColumn("sch1",df("schools._1"))
    df2: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 2 more fields]
    
    scala> val df3=df2.select("*").withColumn("sch2",concat_ws(",",df2("sch1")))
    df3: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 3 more fields]
    
    scala> df3.select("*").where( df3("sch2") rlike "^b|,b" ).show(false)
    +-----------------------+-------+------------------------------------+--------------------+-----------------+
    |cities                 |name   |schools                             |sch1                |sch2             |
    +-----------------------+-------+------------------------------------+--------------------+-----------------+
    |[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|[stanford, berkeley]|stanford,berkeley|
    |[portland]             |Justin |[[berkeley, 2014]]                  |[berkeley]          |berkeley         |
    +-----------------------+-------+------------------------------------+--------------------+-----------------+
    

    In a further step you can drop the unneeded columns.
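
    For example, a minimal sketch using DataFrame.drop:

     // drop the intermediate helper columns after filtering
     val res = df3.where(df3("sch2") rlike "^b|,b").drop("sch1", "sch2")
     res.show(false)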

        2
  •  1
  •   Leonard Kerr    6 years ago

    This would be easy if you had a Dataset[Student], where:

    case class School(sname: String, year: Int)
    case class Student(cities: Seq[String], name: String, schools: Seq[School])
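
    To get such a typed view from the question's df, a minimal sketch (assuming a SparkSession named spark is in scope, and that the school structs carry the fields sname and year):

    import spark.implicits._   // brings Encoder[Student] into scope

    // as[Student] maps the columns onto the case class fields by name
    val students = df.as[Student]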
    

    then you can simply do:

    students
        .filter(
            r => r.schools.exists(_.sname.startsWith("b")))
    

    If you have to stay with a DataFrame, then:

    import org.apache.spark.sql.Row
    
    students.toDF
        .filter(
            r => r.getAs[Seq[Row]]("schools")
                  .exists(_.getAs[String]("sname").startsWith("b")))
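
    Note that on the untyped DataFrame each struct comes back as a Row, hence the getAs[Seq[Row]]("schools") followed by a per-field getAs[String]("sname").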
    

    +-----------------------+-------+------------------------------------+
    |cities                 |name   |schools                             |
    +-----------------------+-------+------------------------------------+
    |[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|
    |[portland]             |Justin |[[berkeley, 2014]]                  |
    +-----------------------+-------+------------------------------------+