
How to explode with a delimiter in Spark

  •  0
  • user3407267  · 6 years ago

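    I have a DataFrame:

    id  itemNames                                          coupons
    1   item (foo bar) is available, soaps                 true
    2   item (bar) is available                            false
    3   soaps, shampoo                                     false
    4   item (foo bar, bar) is available                   true
    5   item (foo bar, bar) is available, (soap, shampoo)  true
    6   null                                               false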

    I want to explode it into:

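    id  itemNames                         coupons
    1   item (foo bar) is available       true
    1   soaps                             true
    2   item (bar) is available           false
    3   soaps                             false
    3   shampoo                           false
    4   item (foo bar, bar) is available  true
    5   item (foo bar, bar) is available  true
    5   (soap, shampoo)                   true
    6   null                              false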

    I tried:

     df.withColumn("itemNames", explode(split($"itemNames", "[,]")))
    

    I get:

    itemNames                                          coupons
    item (foo bar) is available                        true       
    soaps                                              true 
    item (bar) is available                            false
    soaps                                              false
    shampoo                                            false
    item (foo bar,                                     true
    bar) is available                                  true 
    (soap,                                             true    
    shampoo)                                           true
    

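    Can someone tell me what I am doing wrong, and how I can fix it? The common pattern here is that the commas that should not split a value appear inside ().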

    2 Answers
        1
  •  1
  •   pasha701    6 years ago

    With a UDF, and a regex inspired by "Regex to match only commas not in parentheses?":

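    // Assumes a SparkSession named `spark` is in scope, as in spark-shell.
    import java.util.regex.Pattern
    import org.apache.spark.sql.functions.{explode, udf}
    import spark.implicits._

    val df = List(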
      ("item (foo bar) is available, soaps", true),
      ("item (bar) is available", false),
      ("soaps, shampoo", false),
      ("item (foo bar, bar) is available", true),
      ("item (foo bar, bar) is available, (soap, shampoo)", true)
    ).
      toDF("itemNames", "coupons")
    df.show(false)
    
    val regex = Pattern.compile(
      ",         # Match a comma\n" +
        "(?!       # only if it's not followed by...\n" +
        " [^(]*    #   any number of characters except opening parens\n" +
        " \\)      #   followed by a closing parens\n" +
        ")         # End of lookahead",
      Pattern.COMMENTS)
    
    val customSplit = (value: String) => regex.split(value)
    val customSplitUDF = udf(customSplit)
    val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
    result.show(false)
    

    Output:

    +--------------------------------+-------+
    |itemNames                       |coupons|
    +--------------------------------+-------+
    |item (foo bar) is available     |true   |
    | soaps                          |true   |
    |item (bar) is available         |false  |
    |soaps                           |false  |
    | shampoo                        |false  |
    |item (foo bar, bar) is available|true   |
    |item (foo bar, bar) is available|true   |
    | (soap, shampoo)                |true   |
    +--------------------------------+-------+
    

    If the pieces need to be trimmed, a "trim" can easily be added to "customSplit".
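A minimal sketch of that trim, written as plain Scala so it can be tried without Spark. The helper name splitOutsideParens and the null handling are assumptions, not part of the answer above: the question's data contains a null row, and Pattern.split throws a NullPointerException when given null.

```scala
import java.util.regex.Pattern

object TrimmedSplit {
  // Same idea as the commented regex above: match a comma that is NOT
  // followed by a run of non-"(" characters and then a ")".
  private val commaOutsideParens: Pattern = Pattern.compile(",(?![^(]*\\))")

  // Hypothetical helper: splits on commas outside parentheses, trims each
  // piece, and maps a null input to an empty sequence instead of throwing.
  def splitOutsideParens(value: String): Seq[String] =
    if (value == null) Seq.empty
    else commaOutsideParens.split(value).toSeq.map(_.trim)

  def main(args: Array[String]): Unit = {
    val pieces = splitOutsideParens("item (foo bar, bar) is available, (soap, shampoo)")
    // prints: item (foo bar, bar) is available | (soap, shampoo)
    println(pieces.mkString(" | "))
  }
}
```

Wrapping this function with udf in place of customSplit should give trimmed values in the exploded column. Note that explode produces no rows for an empty array, so with this sketch the null row would be dropped; whether that is acceptable depends on the use case.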

        2
  •  1
  •   stack0114106    6 years ago

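    Your data does not have a single consistent pattern to split on. Below is a workaround that works for this particular case: I split on a comma that comes right after "available", using a lookbehind. Try this on your DataFrame.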

    scala> "item (foo bar) is available, soaps".split("(?<=available),")
    res41: Array[String] = Array(item (foo bar) is available, " soaps")
    
    scala> "item (foo bar) is available, soaps".split("(?<=available),").length
    res42: Int = 2
    
    scala> "item (foo bar, bar) is available".split("(?<=available),")
    res44: Array[String] = Array(item (foo bar, bar) is available)
    
    scala> "item (foo bar, bar) is available".split("(?<=available),").length
    res45: Int = 1
    

    scala> "item (foo bar, bar) is empty, (soap, shampoo)".split("(?<=available|empty),").length
    res1: Int = 2
    
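The REPL checks above can be collected into a small helper to see where the lookbehind workaround stops working. This is a sketch in plain Scala; splitAfterKeyword and the keyword list are hypothetical, and the keywords are assumed to be plain words with no regex metacharacters.

```scala
object LookbehindSplit {
  // Hypothetical helper: splits on a comma only when it directly follows
  // one of the given keywords, e.g. "(?<=available|empty),".
  // Assumes the keywords contain no regex metacharacters.
  def splitAfterKeyword(value: String, keywords: Seq[String]): Array[String] =
    value.split(keywords.mkString("(?<=", "|", "),"))

  def main(args: Array[String]): Unit = {
    val keywords = Seq("available", "empty")
    // Comma after "available": split into 2 pieces.
    println(splitAfterKeyword("item (foo bar) is available, soaps", keywords).length)
    // Comma after "empty": split into 2 pieces, commas in () untouched.
    println(splitAfterKeyword("item (foo bar, bar) is empty, (soap, shampoo)", keywords).length)
    // No keyword before the comma: the value stays in 1 piece.
    println(splitAfterKeyword("soaps, shampoo", keywords).length)
  }
}
```

The last call shows the limitation: row 3 of the question ("soaps, shampoo") is left unsplit because no keyword precedes its comma. Since Spark's split function takes a Java regular expression, the same pattern string can be passed to it in place of "[,]".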