
fuzzymatcher returns NaN for best_match_score

  •  0
  • default_settings  · Tech community  · 3 years ago

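    I am seeing odd behaviour when running fuzzy_left_join from the fuzzymatcher library. I am trying to join two DataFrames: the left one has 5217 records and the right one has 8734. Only 71 records end up with a best_match_score, which seems really odd. To get better results I even removed all digits and kept only the alphabetic characters in the join columns. In the merged table, the id column from the right table is NaN, which is also a strange result.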

    Left table: the join column is "amazon_s3_name". First item: limonig

    +------+---------+-------+-----------+------------------------------------+
    |  id  | product | price | category  |           amazon_s3_name           |
    +------+---------+-------+-----------+------------------------------------+
    |    1 | A       |  1.49 | fruits    | limonig                            |
    | 8964 | B       |  1.39 | beverages | studencajfuzelimonilimonetatrevaml |
    | 9659 | C       |  2.79 | beverages | studencajfuzelimonilimtreval       |
    +------+---------+-------+-----------+------------------------------------+
    

    Right table: the join column is "amazon_s3_name". Last item: limoni

    +------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
    |  id  |                                                       picture                                                              |                    amazon_s3_name          |
    +------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
    |  191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg                          | ahmadcajlimonidjindjifilxg                 |
    |  192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg                       | ahmadcajlimonidjindjifilxgg                |
    |  204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg               | ahmadcajlimonidjindjifilxgg                |
    | 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg              | cajstudenfuzetealimonilimonovatrevalpet    |
    | 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg                 | lesieursalatensosslimonimaslinovomaslo     |
    | 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml  |
    | 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg                                                 | limoni                                     |
    +------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
    

    Merged table: as we can see, best_match_score is NaN in the merged table

    +----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
    | id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left  | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
    +----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
    |  0 | NaN              | 0_left    | None       |  1.49 | Fruits   | Limoni500g09700112   | NaN        | limonig             | NaN         | NaN                  |
    |  2 | NaN              | 2_left    | None       |  1.69 | Bio      | Morkovi1kgbr09700132 | NaN        | morkovikgbr         | NaN         | NaN                  |
    +----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
    
  •  1
  •   RJ Adriaansen    3 years ago

    You could give polyfuzz a try. Use the setup from its examples, for instance with TF-IDF or Bert, then run:

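    # Assumes `from polyfuzz import PolyFuzz` and that `matchers` is set up as in the polyfuzz examples.
    model = PolyFuzz(matchers).match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].tolist())
    # With a single matcher, get_matches() should return one DataFrame (From / To / Similarity).
    df1['To'] = model.get_matches()['To']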
    

    Then merge:

    df1.merge(df2, left_on='To', right_on='amazon_s3_name')
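
    To sanity-check the match-then-merge idea without installing anything, here is a minimal sketch that uses the standard library's difflib as a stand-in for polyfuzz. It is not what polyfuzz does internally; the helper best_match and the matches lookup are hypothetical names made up for illustration, and the sample strings are copied from the tables in the question.

```python
from difflib import SequenceMatcher

# Sample values copied from the question's left and right tables.
left_names = [
    "limonig",
    "studencajfuzelimonilimonetatrevaml",
    "studencajfuzelimonilimtreval",
]
right_names = [
    "ahmadcajlimonidjindjifilxg",
    "cajstudenfuzetealimonilimonovatrevalpet",
    "lesieursalatensosslimonimaslinovomaslo",
    "limoni",
]


def best_match(name, candidates):
    """Return (candidate, similarity) for the candidate with the highest difflib ratio."""
    scored = [(c, SequenceMatcher(None, name, c).ratio()) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical lookup: left name -> (closest right name, similarity in [0, 1]).
matches = {name: best_match(name, right_names) for name in left_names}

for name, (to, score) in matches.items():
    print(f"{name} -> {to} ({score:.2f})")
```

    The closest right-hand name found for each left-hand name plays the role of the 'To' column above, so it can be written back to df1 and merged in the same way. A brute-force comparison like this is quadratic in the number of rows, so for 5217 x 8734 records it is only a sanity check, not a replacement for a proper matcher.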