
Grouping similar URLs / finding common URL patterns (Python)

  •  2
  • The Wanderer  · 8 years ago

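    I have about 100k URLs, each labeled as positive or negative. I want to see which types of URL correspond to the positive class (and likewise for the negative class).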

    I started by grouping the URLs by subdomain and identified the most common positive and negative subdomains.
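    A minimal sketch of that subdomain grouping step, assuming the data is available as (url, label) pairs with the labels 'positive' and 'negative'. The function name `count_by_host`, the pair format and the example.org entry are illustrative assumptions, not part of the original post:

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def count_by_host(labeled_urls):
    """Count labels per host name (subdomain included).

    labeled_urls: iterable of (url, label) pairs (assumed input format).
    """
    counts = defaultdict(Counter)
    for url, label in labeled_urls:
        # urlsplit(...).hostname is already lower-cased; it is None for
        # strings without a network location, hence the fallback.
        host = urlsplit(url).hostname or ""
        counts[host][label] += 1
    return counts


sample = [
    ("http://www.clarin.com/politica/", "positive"),
    ("http://www.clarin.com/tema/manifestaciones.html", "negative"),
    ("http://www.clarin.com/buscador?q=protesta", "negative"),
    ("http://example.org/a", "positive"),  # made-up entry
]
by_host = count_by_host(sample)
for host, c in sorted(by_host.items()):
    print(host, c["positive"], c["negative"])
```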

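    Now, for subdomains where the positive and negative counts are roughly equal, I want to dig deeper and look for patterns. Example patterns: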

    http://www.clarin.com/politica/ (pattern: domain/section)
    http://www.clarin.com/tema/manifestaciones.html (pattern: domain/tag/tag_name)
    http://www.clarin.com/buscador?q=protesta (pattern: domain/search?=search_term)
    

    The links are not limited to clarin.com.

    Any suggestions on how to discover such patterns?

    1 reply  |  as of 8 years ago
        1
  •  0
  •   The Wanderer    8 years ago

    I solved this by treating it as a "finding the largest common substring" problem.

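    The solution builds a parse tree from the characters of each URL. Each node in the tree stores the positive, negative and total counts. Finally, the tree is pruned to return the most common patterns.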
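    To make the tree idea concrete, here is a small sketch of a character-level prefix tree with per-node counts. It uses nested dict nodes, whereas the answer's own code stores the same counts in a flat dict keyed by prefix string. The names `build_prefix_tree` and `lookup`, the (url, label) input format and the toy URLs are assumptions made for illustration:

```python
def new_node():
    # One tree node: child nodes by character plus the three counters.
    return {"children": {}, "positive": 0, "negative": 0, "total": 0}


def build_prefix_tree(labeled_urls, max_depth=100):
    """Insert each URL character by character, updating counts on the way.

    labeled_urls: iterable of (url, label) pairs, label being
    'positive' or 'negative' (assumed input format).
    """
    root = new_node()
    for url, label in labeled_urls:
        node = root
        for ch in url.lower()[:max_depth]:
            if ch not in node["children"]:
                node["children"][ch] = new_node()
            node = node["children"][ch]
            node[label] += 1
            node["total"] += 1
    return root


def lookup(root, prefix):
    """Return the node for `prefix`, or None if no URL starts with it."""
    node = root
    for ch in prefix.lower():
        node = node["children"].get(ch)
        if node is None:
            return None
    return node


tree = build_prefix_tree([
    ("http://a.com/x/1", "negative"),
    ("http://a.com/x/2", "negative"),
    ("http://a.com/y", "positive"),
])
node = lookup(tree, "http://a.com/x/")
print(node["positive"], node["negative"], node["total"])
```

    Every node on the path of a URL is incremented once, so a node's total is the number of URLs that start with the prefix spelled out by that path.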

    Code:

    def find_patterns(incoming_urls):
        urls = {}
        # make the tree
        for line in incoming_urls:
            # each entry is assumed to look like "<url>____<class>",
            # where <class> is 'positive' or 'negative'
            url, atype = line.strip().split("____")
            # use only the first 100 characters to avoid building a sparse tree
            bound = min(len(url), 100) + 1
            for x in range(1, bound):
                prefix = url[:x].lower()
                if prefix not in urls:
                    urls[prefix] = {'positive': 0, 'negative': 0, 'total': 0}
                urls[prefix][atype] += 1
                urls[prefix]['total'] += 1
    
        new_urls = {}
        # prune the tree
        for url in urls:
            if urls[url]['total'] < 5:  # a common pattern needs at least 5 occurrences
                continue
            urls[url]['negative_percentage'] = (float(urls[url]['negative']) * 100) / urls[url]['total']
            if urls[url]['negative_percentage'] < 85.0: # Assuming I am interested in finding url patterns for negative class
                continue
            length = len(url)
            found = False
            # skip this prefix if a longer prefix extends it with the same total count
            for second in urls:
                if len(second) <= length:
                    continue
                if url == second[:length] and urls[url]['total'] == urls[second]['total']:
                    found = True
                    break
            # keep only patterns longer than 20 characters
            if not found and len(url) > 20:
                new_urls[url] = urls[url]
    
        print("URL Pattern; Positive; Negative; Total; Negative (%)")
        for url in new_urls:
            print("%s; %d; %d; %d; %.2f" % (
                url, new_urls[url]['positive'], new_urls[url]['negative'], new_urls[url]['total'],
                new_urls[url]['negative_percentage']))
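
    The inner loop of the pruning step compares each candidate with every other prefix, which may get slow when the dict holds many prefixes. One possible alternative is sketched below. It assumes that every prefix of a stored prefix is stored as well (true for the dict built above), so a prefix is extended with the same total exactly when one of its one-character-longer children has that total. The helper name `maximal_prefixes` is hypothetical and not part of the original answer, and the count, percentage and length filters would still be applied separately:

```python
def maximal_prefixes(counts):
    """Return the prefixes that no longer prefix extends with the same total.

    counts: dict mapping prefix -> {'total': int, ...}; every prefix of a
    key is assumed to be a key as well.
    """
    extended = set()
    for prefix, stats in counts.items():
        parent = prefix[:-1]
        # A child with the same total means the parent carries no extra
        # information, so the parent can be dropped.
        if parent in counts and counts[parent]["total"] == stats["total"]:
            extended.add(parent)
    return {p: s for p, s in counts.items() if p not in extended}


# Toy data (other branches omitted): "ab" occurs 3 times and always
# continues as "abc", so only "a" and "abc" survive.
counts = {
    "a": {"total": 5},
    "ab": {"total": 3},
    "abc": {"total": 3},
}
print(sorted(maximal_prefixes(counts)))  # ['a', 'abc']
```

    This makes a single pass over the dict instead of a nested one, at the cost of one extra set of extended prefixes.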