
How do I extract strings from a column in Scala?

  • user3407267  · 6 years ago

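    I have a DataFrame whose values look like List[INTERESTED_FIELD:details]. I only want to keep the field names of interest. How can I strip out the details I am not interested in?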

    Example:

    val df = Seq(
      "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low", 
      "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ", 
       "UNKOWN:#!@", 
       "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
       "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@" 
    ).toDF("raw_type")
    
    df.show(false)
    
    +-----------------------------------------------------------------+
    |raw_type                                                         |
    +-----------------------------------------------------------------+
    |TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|
    |PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low   |
    |UNKOWN:#!@                                                       |
    |BLACKLIST_ITEM:item (mejwnw) is blacklisted                      |
    |BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@               |
    +-----------------------------------------------------------------+
    

    I would like to get:

    +-----------------------------------------------------------------+
    |raw_type                                                         |
    +-----------------------------------------------------------------+
    |TESTING                                                          | 
    |PURCHASE,BLACKLIST_ITEM                                          |
    |UNKOWN                                                           |
    |BLACKLIST_ITEM                                                   |
    |BLACKLIST_ITEM, UNKOWN                                           |
    +-----------------------------------------------------------------+
    
    2 answers  |  as of 6 years ago
        1
  •  1
  •   stack0114106    6 years ago

    Check out this UDF solution:

    scala> val df = Seq(
         |   "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
         |   "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
         |    "UNKOWN:#!@",
         |    "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
         |    "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
         | ).toDF("raw_type")
    df: org.apache.spark.sql.DataFrame = [raw_type: string]
    
    scala> def matchlist(a:String):String=
         | {
         | import scala.collection.mutable.ArrayBuffer
         | val x = ArrayBuffer[String]()
         | val pt = "([A-Z_]+):".r
         | pt.findAllIn(a).matchData.foreach { m => x.append(m.group(1)) }
         | return x.mkString(",")
         | }
    matchlist: (a: String)String
    
    scala> val myudfmatchlist = udf( matchlist(_:String):String )
    myudfmatchlist: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
    
    scala> df.select(myudfmatchlist($"raw_type")).show(false)
    +-----------------------+
    |UDF(raw_type)          |
    +-----------------------+
    |TESTING                |
    |PURCHASE,BLACKLIST_ITEM|
    |UNKOWN                 |
    |BLACKLIST_ITEM         |
    |BLACKLIST_ITEM,UNKOWN  |
    +-----------------------+
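
    The extraction logic inside the UDF above does not depend on Spark, so it can be tried in a plain Scala REPL first. The following is a minimal sketch rather than the answerer's code: the names `labelPattern`, `extractLabels`, `samples`, and `labels` are introduced here for illustration, and `findAllMatchIn` with `map` stands in for the mutable `ArrayBuffer`.

```scala
// Same pattern as in the answer: a run of capitals/underscores followed by a colon.
val labelPattern = "([A-Z_]+):".r

// Collect capture group 1 of every match and join the results with commas.
def extractLabels(raw: String): String =
  labelPattern.findAllMatchIn(raw).map(_.group(1)).mkString(",")

// The sample rows from the question.
val samples = Seq(
  "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
  "UNKOWN:#!@",
  "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
  "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
)

val labels = samples.map(extractLabels)
// labels: TESTING | PURCHASE,BLACKLIST_ITEM | UNKOWN | BLACKLIST_ITEM | BLACKLIST_ITEM,UNKOWN
```

    Assuming a spark-shell session, a function like this could presumably be wrapped with `udf(...)` in the same way the answer wraps `matchlist`.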
    
    
        2
  •  1
  •   RAGHHURAAMM    6 years ago
    val p = "[A-Z_]+(?=:)".r
    df.rdd.map(x=>p.findAllIn(x.mkString).mkString(",")).toDF(df.columns:_*).show(false)
    

    scala> df.rdd.map(x=>p.findAllIn(x.mkString).mkString(",")).toDF(df.columns:_*).show(false)
    +-----------------------+
    |raw_type               |
    +-----------------------+
    |TESTING                |
    |PURCHASE,BLACKLIST_ITEM|
    |UNKOWN                 |
    |BLACKLIST_ITEM         |
    |BLACKLIST_ITEM,UNKOWN  |
    +-----------------------+
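
    The pattern in this answer differs from the UDF answer only in how it treats the colon: `(?=:)` is a lookahead, so the colon must follow the label but is not part of the match, which is why `findAllIn` can be joined directly without a capture group. A small plain-Scala sketch of the difference, with `capturing`, `lookahead`, and `sample` as illustrative names not taken from the answers:

```scala
val capturing = "([A-Z_]+):".r   // pattern from the UDF answer
val lookahead = "[A-Z_]+(?=:)".r // pattern from this answer

val sample = "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"

// Whole matches of the capturing pattern still include the trailing colon.
val withColon = capturing.findAllIn(sample).toList
// List("BLACKLIST_ITEM:", "UNKOWN:")

// With the lookahead, the whole match is already the bare label.
val fromLookahead = lookahead.findAllIn(sample).mkString(",")
// "BLACKLIST_ITEM,UNKOWN"

// Reading group 1 of the capturing pattern yields the same labels.
val fromGroup = capturing.findAllMatchIn(sample).map(_.group(1)).mkString(",")
```

    Both patterns only recognise labels made of capital letters and underscores; a label containing digits or lowercase letters would need a wider character class.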