
.withColumn does not give the DataFrame's original columns, only the newly added column

  • NamrataK · 6 years ago
    val withOneDayts=userDataFrame.join(articleDataFrame,userDataFrame("cid")===articleDataFrame("id"),"left").drop(articleDataFrame("id")).drop(articleDataFrame("published_at"))
    val frame = withOneDayts.withColumn("views", getViews(withOneDayts("timestamp"), withOneDayts("oneDay_ts"))).groupBy("cid").agg(sum("views"))
    val with15MinViews = withOneDayts.withColumn("views", getViews(withOneDayts("timestamp"), withOneDayts("15_min_ts"))).groupBy("cid").agg(sum("views")).withColumnRenamed("sum(views)","views_in_15_min").drop("15_min_ts")
    val with30MinViews = with15MinViews.withColumn("views",getViews(with15MinViews("timestamp"),with15MinViews("30_min_ts"))).groupBy("cid").agg(sum("views")).withColumnRenamed("sum(views)","views_in_30_min").drop("30_min_ts")
    

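    I am trying to get the number of views in the first 15 and 30 minutes after an article is published. For the 30-minute views, however, it fails with the error: cannot resolve "timestamp" among [cid, views_in_15_min]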

    1 answer

  • koiralo · 6 years ago

    The error is in the fourth line, which uses the DataFrame with15MinViews, and that DataFrame does not contain the field timestamp:

    val with30MinViews = with15MinViews
      .withColumn("views", getViews(with15MinViews("timestamp"), with15MinViews("30_min_ts")))
      .groupBy("cid")
      .agg(sum("views"))
      .withColumnRenamed("sum(views)", "views_in_30_min")
      .drop("30_min_ts")
    

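    The with15MinViews DataFrame contains only the columns produced by the groupBy and the aggregation, so it has just cid and views_in_15_min. Every other column of withOneDayts, including timestamp and 30_min_ts, is gone after agg.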
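
    One possible way around this, sketched below, is to compute both counts from withOneDayts in a single groupBy, so that timestamp, 15_min_ts and 30_min_ts are all still available when getViews is applied. This is only a sketch under assumptions: the sample rows are made up, and getViews here is a hypothetical stand-in (1 if the view happened at or before the cutoff, otherwise 0), since the question does not show its definition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, udf}

object ViewsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("views-example")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the joined DataFrame from the question; the values are invented.
    val withOneDayts = Seq(
      (1L, 100L, 900L, 1800L),
      (1L, 1000L, 900L, 1800L),
      (2L, 50L, 900L, 1800L)
    ).toDF("cid", "timestamp", "15_min_ts", "30_min_ts")

    // Hypothetical getViews (assumption): counts a view if it is not later than the cutoff.
    val getViews = udf((ts: Long, cutoff: Long) => if (ts <= cutoff) 1 else 0)

    // Both aggregates are computed from withOneDayts, which still has all columns.
    val views = withOneDayts
      .groupBy("cid")
      .agg(
        sum(getViews(withOneDayts("timestamp"), withOneDayts("15_min_ts"))).as("views_in_15_min"),
        sum(getViews(withOneDayts("timestamp"), withOneDayts("30_min_ts"))).as("views_in_30_min")
      )

    views.show()
    spark.stop()
  }
}
```

    Naming the aggregates with .as(...) also avoids the separate withColumnRenamed("sum(views)", ...) step. An alternative would be to keep the two aggregations separate, each starting from withOneDayts, and join the results on cid.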

    I hope this helps!