
Pyspark string pattern from column values and regexp expression

  • Anna  · asked 6 years ago

    Hi, I have a dataframe with two columns:

    +----------------------------------------+----------+
    |                  Text                  | Key_word |
    +----------------------------------------+----------+
    | First random text tree cheese cat      | tree     |
    | Second random text apple pie three     | text     |
    | Third random text burger food brain    | brain    |
    | Fourth random text nothing thing chips | random   |
    +----------------------------------------+----------+
    

    I want to generate a third column that shows the word that comes just before the keyword in the text:

    +----------------------------------------+----------+-------------------+
    |                  Text                  | Key_word | word_bef_key_word |
    +----------------------------------------+----------+-------------------+
    | First random text tree cheese cat      | tree     | text              |
    | Second random text apple pie three     | text     | random            |
    | Third random text burger food brain    | brain    | food              |
    | Fourth random text nothing thing chips | random   | Fourth            |
    +----------------------------------------+----------+-------------------+
    

    I tried the following, but it did not work:

    df2=df1.withColumn('word_bef_key_word',regexp_extract(df1.Text,('\\w+)'df1.key_word,1))
    

    Here is the code to create the example dataframe:

    df = sqlCtx.createDataFrame(
        [
            ('First random text tree cheese cat' , 'tree'),
            ('Second random text apple pie three', 'text'),
            ('Third random text burger food brain' , 'brain'),
            ('Fourth random text nothing thing chips', 'random')
        ],
        ('Text', 'Key_word') 
    )
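
    For reference, on Spark 2.x and later the same dataframe can be created from a SparkSession instead of a SQLContext (a minimal sketch, assuming no sqlCtx is available):

    from pyspark.sql import SparkSession
    
    # assumption: no pre-built SQLContext; get (or create) a SparkSession instead
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [
            ('First random text tree cheese cat' , 'tree'),
            ('Second random text apple pie three', 'text'),
            ('Third random text burger food brain' , 'brain'),
            ('Fourth random text nothing thing chips', 'random')
        ],
        ('Text', 'Key_word')
    )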
    
    1 Answer
        
  • pault Tanjin  · answered 5 years ago

    Update

    You can also do this without a udf, by using pyspark.sql.functions.expr to pass column values as a parameter to pyspark.sql.functions.regexp_extract :

    from pyspark.sql.functions import expr
    
    df = df.withColumn(
        'word_bef_key_word',
        # build the pattern per row: a word followed by a space and this row's Key_word
        expr(r"regexp_extract(Text, concat('\\w+(?= ', Key_word, ')'), 0)")
    )
    df.show(truncate=False)
    #+--------------------------------------+--------+-----------------+
    #|Text                                  |Key_word|word_bef_key_word|
    #+--------------------------------------+--------+-----------------+
    #|First random text tree cheese cat     |tree    |text             |
    #|Second random text apple pie three    |text    |random           |
    #|Third random text burger food brain   |brain   |food             |
    #|Fourth random text nothing thing chips|random  |Fourth           |
    #+--------------------------------------+--------+-----------------+
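
    The expr wrapper is what makes this work: the pattern is assembled per row with concat('\\w+(?= ', Key_word, ')'), whereas the Python-side regexp_extract only accepts a plain string as its pattern argument and cannot reference another column directly (which is why the attempt in the question fails).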
    

    Original answer

    One way is to use a udf to apply the regular expression:

    import re
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    
    def get_previous_word(text, key_word):
        # find every word that is immediately followed by a space and the key_word
        matches = re.findall(r'\w+(?= {kw})'.format(kw=key_word), text)
        return matches[0] if matches else None
    
    get_previous_word_udf = udf(
        lambda text, key_word: get_previous_word(text, key_word),
        StringType()
    )
    
    df = df.withColumn('word_bef_key_word', get_previous_word_udf('Text', 'Key_word'))
    df.show(truncate=False)
    #+--------------------------------------+--------+-----------------+
    #|Text                                  |Key_word|word_bef_key_word|
    #+--------------------------------------+--------+-----------------+
    #|First random text tree cheese cat     |tree    |text             |
    #|Second random text apple pie three    |text    |random           |
    #|Third random text burger food brain   |brain   |food             |
    #|Fourth random text nothing thing chips|random  |Fourth           |
    #+--------------------------------------+--------+-----------------+
    

    The regex pattern '\w+(?= {kw})'.format(kw=key_word) means: match a word that is immediately followed by a space and the key_word . If there are multiple matches, the first one is returned; if there are no matches, the function returns None .
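
    To see what the pattern does in isolation, here is a quick standalone check with Python's re module (a minimal sketch using the second example row):

    import re
    
    text = 'Second random text apple pie three'
    key_word = 'text'
    # a word followed by a space and the key word, i.e. r'\w+(?= text)'
    pattern = r'\w+(?= {kw})'.format(kw=key_word)
    print(re.findall(pattern, text))  # ['random']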