How to extract strings from a column in Scala?

apache-spark scala regex

user3407267 · 6 years ago

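I have a DataFrame whose values look like List[INTERESTED_FIELD:details]. I only want to keep the fields of interest, that is, the names in front of each colon. How can I strip out everything else?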

Example:

val df = Seq(
  "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
  "UNKOWN:#!@",
  "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
  "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
).toDF("raw_type")

df.show(false)

+-----------------------------------------------------------------+
|raw_type                                                         |
+-----------------------------------------------------------------+
|TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|
|PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low   |
|UNKOWN:#!@                                                       |
|BLACKLIST_ITEM:item (mejwnw) is blacklisted                      |
|BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@               |
+-----------------------------------------------------------------+

I want to get:

+-----------------------------------------------------------------+
|raw_type                                                         |
+-----------------------------------------------------------------+
|TESTING                                                          | 
|PURCHASE,BLACKLIST_ITEM                                          |
|UNKOWN                                                           |
|BLACKLIST_ITEM                                                   |
|BLACKLIST_ITEM,UNKOWN                                            |
+-----------------------------------------------------------------+

2 answers

stack0114106 · 6 years ago

Check out this UDF solution:

scala> val df = Seq(
     |   "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
     |   "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
     |    "UNKOWN:#!@",
     |    "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
     |    "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
     | ).toDF("raw_type")
df: org.apache.spark.sql.DataFrame = [raw_type: string]

scala> def matchlist(a:String):String=
     | {
     | import scala.collection.mutable.ArrayBuffer
     | val x = ArrayBuffer[String]()
     | val pt = "([A-Z_]+):".r
     | pt.findAllIn(a).matchData.foreach { m => x.append(m.group(1)) }
     | return x.mkString(",")
     | }
matchlist: (a: String)String

scala> val myudfmatchlist = udf( matchlist(_:String):String )
myudfmatchlist: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.select(myudfmatchlist($"raw_type")).show(false)
+-----------------------+
|UDF(raw_type)          |
+-----------------------+
|TESTING                |
|PURCHASE,BLACKLIST_ITEM|
|UNKOWN                 |
|BLACKLIST_ITEM         |
|BLACKLIST_ITEM,UNKOWN  |
+-----------------------+
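
The matching logic inside the UDF can also be written without the mutable buffer and checked outside Spark. The following is only a sketch: the names `TypeExtractor` and `extractTypes` are made up for illustration, and the null check is an added assumption (the original `matchlist` does not handle null input).

```scala
object TypeExtractor {
  // One or more uppercase letters or underscores, followed by a colon.
  // Group 1 captures the name without the trailing colon.
  private val TypePattern = "([A-Z_]+):".r

  /** Returns the comma-separated names found in `raw`, or "" if there are none. */
  def extractTypes(raw: String): String =
    if (raw == null) ""  // assumption: treat a null cell as "no names found"
    else TypePattern.findAllMatchIn(raw).map(_.group(1)).mkString(",")
}
```

To use it on the DataFrame, the function could be wrapped the same way as above, for example `udf(TypeExtractor.extractTypes _)`.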



RAGHHURAAMM · 6 years ago

val p = "[A-Z_]+(?=:)".r
df.rdd.map(x=>p.findAllIn(x.mkString).mkString(",")).toDF(df.columns:_*).show(false)

scala> df.rdd.map(x=>p.findAllIn(x.mkString).mkString(",")).toDF(df.columns:_*).show(false)
+-----------------------+
|raw_type               |
+-----------------------+
|TESTING                |
|PURCHASE,BLACKLIST_ITEM|
|UNKOWN                 |
|BLACKLIST_ITEM         |
|BLACKLIST_ITEM,UNKOWN  |
+-----------------------+
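
Neither a UDF nor a round trip through the RDD API is strictly required. As a sketch, assuming Spark 3.1 or later (where the SQL function `regexp_extract_all` is available), the same extraction can be expressed with built-in functions only. The object name `BuiltinExtract`, the method `extractTypes`, and the local SparkSession setup are illustrative assumptions, not part of either answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object BuiltinExtract {
  // Replaces raw_type with the comma-joined names that precede a colon.
  // regexp_extract_all (Spark 3.1+) returns every match of capture group 1
  // as an array; array_join concatenates that array with ",".
  def extractTypes(df: DataFrame): DataFrame =
    df.withColumn(
      "raw_type",
      expr("array_join(regexp_extract_all(raw_type, '([A-Z_]+):', 1), ',')")
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("extract-types").getOrCreate()
    import spark.implicits._

    val df = Seq(
      "TESTING:Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
      "PURCHASE:BLACKLIST_ITEM: Foo purchase count (12, 4) is too low ",
      "UNKOWN:#!@",
      "BLACKLIST_ITEM:item (mejwnw) is blacklisted",
      "BLACKLIST_ITEM:item (1) is blacklisted, UNKOWN:#!@"
    ).toDF("raw_type")

    extractTypes(df).show(false)
    spark.stop()
  }
}
```

Keeping the work in built-in expressions lets Spark's optimizer see the whole query, which a UDF or an RDD `map` hides from it. On older Spark versions, where `regexp_extract_all` does not exist, the UDF approach remains the simpler option.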