我在pyspark中有一个数据表,其中包含两列,数据类型为“struc”。
请参见下面的示例数据框:
word_verb word_noun
{_1=cook, _2=VB} {_1=chicken, _2=NN}
{_1=pack, _2=VBN} {_1=lunch, _2=NN}
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN}
我想将这两列连接在一起,以便对连接的动词和名词块进行频率计数。
我尝试了下面的代码:
df = df.withColumn('word_chunk_final', F.concat(F.col('word_verb'), F.col('word_noun')))
但我得到以下错误:
AnalysisException: u"cannot resolve 'concat(`word_verb`, `word_noun`)' due to data type mismatch: input to function concat should have been string, binary or array, but it's [struct<_1:string,_2:string>, struct<_1:string,_2:string>]
我想要的输出表如下。连接的新字段的数据类型为字符串:
word_verb word_noun word_chunk_final
{_1=cook, _2=VB} {_1=chicken, _2=NN} cook chicken
{_1=pack, _2=VBN} {_1=lunch, _2=NN} pack lunch
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN} reconnected wifi