
Spark MLlib Word2Vec error: The vocabulary size should be > 0

  • vish  · Tech Community  · 7 years ago

    I am trying to implement word vectorization using Spark's MLlib. I am following the example given here.

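    I have many sentences that I want to use as input to train the model. However, I am not sure whether the model takes sentences, or takes all the words as a sequence of strings.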

    My input looks like this:

    scala> v.take(5)
    res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ...
    

    However, when I try to train a Word2Vec model on this input, it fails:

    scala> val word2vec = new Word2Vec()
    word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@51567040
    
    scala> val model = word2vec.fit(v)
    java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. You may need to check the setting of minCount, which could be large enough to remove all your words in sentences.
    

    Does Word2Vec not take sentences as input?
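
    For context, the error message points at minCount, which defaults to 5 in MLlib's Word2Vec: a token is only kept in the vocabulary if it occurs at least that many times across the whole input. The sketch below is a simplified plain-Scala imitation of that filter, not Spark's actual code. It assumes that each Seq in the output above holds a single stringified array such as "[WrappedArray(0_42)]" rather than individual tokens, which is what the printed output appears to show. The names VocabSketch, vocab, stringified and tokenized are hypothetical.

```scala
// Simplified, plain-Scala imitation of a minCount vocabulary filter.
// This is an illustration only, not Spark's implementation.
object VocabSketch {
  // Count every token across all sentences and keep those
  // that occur at least minCount times.
  def vocab(sentences: Seq[Seq[String]], minCount: Int): Map[String, Int] =
    sentences.flatten
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .filter { case (_, count) => count >= minCount }

  def main(args: Array[String]): Unit = {
    // Assumption: shaped like the question's data, where each "sentence"
    // contains one long string instead of separate tokens.
    val stringified = Seq(
      Seq("[WrappedArray(0_42)]"),
      Seq("[WrappedArray(big, baller, shoe, ?)]")
    )

    // Hypothetical tokenized data: one String per word, with repeats.
    val tokenized = Seq.fill(5)(Seq("big", "baller", "shoe"))

    println(vocab(stringified, 5).size) // every "token" is unique, so nothing survives
    println(vocab(tokenized, 5).size)   // each word occurs 5 times, so all 3 survive
  }
}
```

    If that assumption holds, two things are worth checking: that each element of v is a Seq with one String per word, and whether a lower threshold, for example word2vec.setMinCount(1), is appropriate for a small corpus.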

    1 answer  |  latest 6 years ago