Update
You can also do this without a udf by using pyspark.sql.functions.expr to pass column values as a parameter to pyspark.sql.functions.regexp_extract:
from pyspark.sql.functions import expr
df = df.withColumn(
    'word_bef_key_word',
    expr(r"regexp_extract(Text, concat('\\w+(?= ', Key_word, ')'), 0)")
)
df.show(truncate=False)
#+--------------------------------------+--------+-----------------+
#|Text                                  |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat     |tree    |text             |
#|Second random text apple pie three    |text    |random           |
#|Third random text burger food brain   |brain   |food             |
#|Fourth random text nothing thing chips|random  |Fourth           |
#+--------------------------------------+--------+-----------------+
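To see which pattern concat builds for each row, the same logic can be mimicked in plain Python without Spark. This is only an illustrative sketch: the helper names build_pattern and extract_first are hypothetical, and Python's re module stands in for the Java regex engine that Spark uses (the two should agree for a pattern this simple).

```python
import re

def build_pattern(key_word):
    # Mirrors concat('\\w+(?= ', Key_word, ')') from the Spark expression
    return r'\w+(?= ' + key_word + ')'

def extract_first(text, key_word):
    match = re.search(build_pattern(key_word), text)
    # regexp_extract yields an empty string when the pattern does not match
    return match.group(0) if match else ''

# Same rows as the example DataFrame: (Text, Key_word)
rows = [
    ('First random text tree cheese cat', 'tree'),
    ('Second random text apple pie three', 'text'),
    ('Third random text burger food brain', 'brain'),
    ('Fourth random text nothing thing chips', 'random'),
]

results = [extract_first(text, key_word) for text, key_word in rows]
print(results)
```

Note the difference from the udf version: when nothing matches, regexp_extract gives an empty string rather than a null.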
Original answer
One way is to use a udf to perform the regex:
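import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_previous_word(text, key_word):
    # Find every word that is followed by a space and key_word
    matches = re.findall(r'\w+(?= {kw})'.format(kw=key_word), text)
    # Return the first such word, or None when there is no match
    return matches[0] if matches else None

# Wrap the function as a udf that returns a string column
get_previous_word_udf = udf(
    lambda text, key_word: get_previous_word(text, key_word),
    StringType()
)

df = df.withColumn('word_bef_key_word', get_previous_word_udf('Text', 'Key_word'))
df.show(truncate=False)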
#+--------------------------------------+--------+-----------------+
#|Text                                  |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat     |tree    |text             |
#|Second random text apple pie three    |text    |random           |
#|Third random text burger food brain   |brain   |food             |
#|Fourth random text nothing thing chips|random  |Fourth           |
#+--------------------------------------+--------+-----------------+
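The regex pattern '\w+(?= {kw})'.format(kw=key_word) matches a word that is followed by a space and the key_word. If there are multiple matches, we return the first one. If there are no matches, the function returns None.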
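One caveat with this pattern is that key_word is inserted into the regex as-is, so a key word containing regex metacharacters (such as "c++") would be interpreted as a pattern, and a key word like "text" would also match the start of "textbook". A possible hardened variant is sketched below. It is not part of the original answer: the name get_previous_word_safe is hypothetical, and the choice to escape the key word and require that no word character follows it is an assumption about the desired behavior.

```python
import re

def get_previous_word_safe(text, key_word):
    # re.escape makes characters such as '+' or '.' in key_word match literally;
    # (?!\w) requires that key_word is not followed by another word character
    pattern = r'\w+(?= {kw}(?!\w))'.format(kw=re.escape(key_word))
    match = re.search(pattern, text)
    # As in the original function, return None when there is no match
    return match.group(0) if match else None

print(get_previous_word_safe('First random text tree cheese cat', 'tree'))
print(get_previous_word_safe('I like c++ code', 'c++'))
print(get_previous_word_safe('a textbook text', 'text'))
```

This function could be wrapped with udf and StringType() in the same way as get_previous_word.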