代码之家  ›  专栏  ›  技术社区  ›  ohho

如何检测字符串内部的相同部分?

  •  0
  • ohho  · 技术社区  · 15 年前

    我试着打破 decoding algorithm wanted 把问题变成小问题。这是第一部分。

    问题:

    • 两个字符串:s1和s2
    • s1的一部分与s2的一部分相同
    • 空格是分隔符
    • 如何提取相同的零件?

    例1:

    s1 = "12 November 2010 - 1 visitor"
    s2 = "6 July 2010 - 100 visitors"
    
    the identical parts are "2010", "-", "1" and "visitor"
    

    例2:

    s1 = "Welcome, John!"
    s2 = "Welcome, Peter!"
    
    the identical parts are "Welcome," and "!"
    

    例3:(澄清“!”实例)

    s1 = "Welcome, Sam!"
    s2 = "Welcome, Tom!"
    
    the identical parts are "Welcome," and "m!"
    

    首选python和ruby。谢谢

    4 回复  |  直到 15 年前
        1
  •  2
  •   cryo    15 年前

    编辑: 更新此示例以使用所有示例,包括1:

    def scan(s1, s2):
        # Find the longest match where s1 starts with s2
        # Returns None if no matches
        l = len(s1)
        while 1:
            if not l:
                return None
            elif s1[:l] == s2[:l]:
                return s1[:l]
            else:
                l -= 1
    
    def contains(s1, s2):
        D = {} # Remove duplicates using a dict
        L1 = s1.split(' ')
        L2 = s2.split(' ')
    
        # Don't add results which have already 
        # been processed to satisfy example #1!
        DProcessed = {}
    
        for x in L1:
            yy = 0
            for y in L2:
                if yy in DProcessed:
                    yy += 1
                    continue
    
                # Scan from the start to the end of the words
                a = scan(x, y)
                if a: 
                    DProcessed[yy] = None
                    D[a] = None
                    break
    
                # Scan from the end to the start of the words
                a = scan(x[::-1], y[::-1])
                if a: 
                    DProcessed[yy] = None
                    D[a[::-1]] = None
                    break
                yy += 1
    
        return list(D.keys())
    
    print contains("12 November 2010 - 1 visitor",
                   "6 July 2010 - 100 visitors")
    print contains("Welcome, John!",
                   "Welcome, Peter!")
    print contains("Welcome, Sam!",
                   "Welcome, Tom!")
    

    哪些输出:

    ['1', 'visitor', '-', '2010']
    ['Welcome,', '!']
    ['Welcome,', 'm!']
    
        2
  •  3
  •   YOU    15 年前

    例如1

    >>> s1 = 'November 2010 - 1 visitor'
    >>> s2 = '6 July 2010 - 100 visitors'
    >>> 
    >>> [i for i in s1.split() if any(j for j in s2.split() if i in j)]
    ['2010', '-', '1', 'visitor']
    >>>
    

    对于两者

    >>> s1 = "Welcome, John!"
    >>> s2 = "Welcome, Peter!"
    >>> [i for i in s1.replace('!',' !').split() if any(j for j in s2.replace('!',' !').split() if i in j)]
    ['Welcome,', '!']
    >>>
    

    注释 :以上代码不起作用,例如3,它是op刚添加的

        3
  •  1
  •   Tim Pietzcker    15 年前
    s1 = "12 November 2010 - 1 visitor"
    s2 = "6 July 2010 - 100 visitors"
    l1 = s1.split()
    for item in l1:
       if item in s2:
          print item
    

    这在空白处分开。

    一种在单词边界上进行拆分的解决方案(以便捕获 ! 在示例2中)在python中不起作用,因为 re.split() 不会在零长度的比赛中分开。

    第三个例子,使单词的任何子串都成为潜在的匹配,因为有许多可能的变化(例如 1234 ,我得核对一下 一千二百三十四 , 123 , 234 , 12 , 23 , 34 , 1 , 2 , 3 4 ,并且每增加一个数字,置换的数量就成指数增长。

        4
  •  1
  •   clyfe    15 年前

    完整的Ruby解决方案:

    def start_similar(i, j)
        front = ''
        for ix in (0...([i.size, j.size].min))
          if i[ix] == j[ix] then
            front += i[ix].chr
          else
            break
          end
        end
        return front
    end
    
    def back_similar(i, j)
        back = ''
        for ix in (0...([i.size, j.size].min)).to_a.reverse
          dif = i.size < j.size ? j.size - i.size : i.size - j.size
          ci = i[ i.size < j.size ? ix : ix + dif ]
          cj = j[ i.size > j.size ? ix : ix + dif ]
          if ci == cj then
            back = ci.chr + back
          else
            break
          end
        end
        return back
    end
    
    def scan(x, y)
        a, b = x.split(' '), y.split(' ')
        result = []
        for i in a do
          for j in b do
            result << start_similar(i, j)
            result << back_similar(i, j)
          end
        end
        return result.uniq.select do |r| not r.empty? end
    end
    
    puts scan(
        "12 November 2010 - 1 visitor",
        "6 July 2010 - 100 visitors"
    ).inspect
    # ["1", "2010", "0", "-", "visitor"]
    
    puts scan(
        "Welcome, John!",
        "Welcome, Peter!"
    ).inspect
    # ["Welcome,", "!"]
    
    puts scan(
        "Welcome, Sam!",
        "Welcome, Tom!"
    ).inspect
    # ["Welcome,", "m!"]