In PySpark, try the following:
df = df[df['original_problem'].rlike('|'.join(searchfor))]
Or, equivalently:
import pyspark.sql.functions as F
df = df.where(F.col('original_problem').rlike('|'.join(searchfor)))
Alternatively, you could use a udf:
import pyspark.sql.functions as F
searchfor = ['cat', 'dog', 'frog', 'fleece']
check_udf = F.udf(lambda x: x if x in searchfor else 'Not_present')
df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
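Note that the udf version keeps only rows whose value exactly equals one of the search terms, whereas rlike matches a term anywhere in the string. The DataFrame methods are preferred in any case: they run inside Spark's own engine and avoid the per-row Python overhead of a udf, so they will be faster.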
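One caveat with the rlike approach: the joined string is interpreted as a regular expression, so a search term containing characters such as . or + would not be matched literally. Below is a minimal sketch that escapes each term before joining, assuming the terms are meant as literal substrings. The build_pattern helper and the sample terms are hypothetical, and the sketch assumes the escapes produced by Python's re.escape are also accepted by the Java regex engine that Spark uses, which should hold for ordinary punctuation.

```python
import re

def build_pattern(terms):
    """Join literal search terms into one alternation regex.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # Escape each term so metacharacters such as '.' or '+' match literally.
    return '|'.join(re.escape(t) for t in terms)

# Illustrative terms; 'c++' and 'a.b' contain regex metacharacters.
searchfor = ['cat', 'c++', 'a.b']
pattern = build_pattern(searchfor)

# With a DataFrame df as in the answer above, the filter would then be:
# df = df.where(F.col('original_problem').rlike(pattern))
```

Without the escaping, 'a.b' would also match strings like 'axb', and 'c++' would be read as a quantified expression instead of the literal text.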