
Can we find a statistical correlation between text columns?

  • ASH  · 5 years ago

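    I can easily build a correlation matrix from the numeric fields in a DataFrame. I'm wondering whether there is a way to do some kind of correlation analysis between two fields that contain text. Suppose I have two fields like these: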

    Field1
    I wear a women's size 8 in every other shoes brand  
    Always been a lifelong fan of Birkenstock...        
    The wife loves them                 
    My daughter loves them.                 
    My daughter loves them! Very comfy          
    
    
    Field2
    i wear women's size 8 every shoes brand decided order size based everyone's review. the size 7-7.5/38 r fits perfectly.
    always lifelong fan birkenstock sandals suede straps...
    the wife loves
    She wears them all year round - with and without socks. 
    my daughter loves them! very comfy
    

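    These sit side by side; I'm only showing one below the other here because I think it's easier to read. Either way, is there a way to do some kind of correlation analysis between fields that contain text? Thanks.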

  •   shimo    5 years ago

    You can use difflib.SequenceMatcher to measure the similarity of two strings.

    import difflib
    
    Field1 = """I wear a women's size 8 in every other shoes brand  
    Always been a lifelong fan of Birkenstock...        
    The wife loves them                 
    My daughter loves them.                 
    My daughter loves them! Very comfy"""      
    
    
    Field2 = """i wear women's size 8 every shoes brand decided order size based everyone's review. the size 7-7.5/38 r fits perfectly.
    always lifelong fan birkenstock sandals suede straps...
    the wife loves
    She wears them all year round - with and without socks. 
    my daughter loves them! very comfy"""
    
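    # ratio() returns a similarity score between 0 and 1 (1 means identical).
    # Note that this compares the two whole blocks of text, not row by row.
    s = difflib.SequenceMatcher(None, Field1, Field2).ratio()
    
    print("ratio:", s)
    
    # ratio: 0.312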
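
    Since the question says the two fields sit side by side, a per-row score may be closer to what is wanted than one ratio for the whole text. The following is only a minimal sketch under that assumption: the lists field1 and field2 and the helper row_similarity are hypothetical names, the sample rows are the first three from the question, and lowercasing and stripping whitespace is one possible normalization choice. The result is a string similarity score, not a statistical correlation coefficient.

```python
import difflib

# Hypothetical row-aligned columns (first three rows from the question).
field1 = [
    "I wear a women's size 8 in every other shoes brand",
    "Always been a lifelong fan of Birkenstock...",
    "The wife loves them",
]
field2 = [
    "i wear women's size 8 every shoes brand decided order size based "
    "everyone's review. the size 7-7.5/38 r fits perfectly.",
    "always lifelong fan birkenstock sandals suede straps...",
    "the wife loves",
]


def row_similarity(a, b):
    """Return a SequenceMatcher ratio in [0, 1] for two strings.

    Case and surrounding whitespace are ignored (an assumed normalization).
    """
    a, b = a.strip().lower(), b.strip().lower()
    return difflib.SequenceMatcher(None, a, b).ratio()


# One score per pair of aligned rows.
scores = [row_similarity(a, b) for a, b in zip(field1, field2)]

for text, score in zip(field1, scores):
    print(f"{score:.3f}  {text}")
```

    If the data lives in a pandas DataFrame, the same helper could presumably be applied per row, for example with df.apply(lambda r: row_similarity(r["Field1"], r["Field2"]), axis=1), assuming the columns are named Field1 and Field2.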