
Several similar regexes. Is there a faster way?

  •  4
  • sniperd Ali Ahmed  · Tech Community  · 6 years ago

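    I have a fairly simple set of requirements. I have a list of objects (2 million long), and two attributes on each object need a regex substitution applied (the other attributes stay unchanged).

    The words ZERO, ONE, TWO, ... TEN need to be changed to their numeric values: 0, 1, 2, ... 10.

    Examples: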

    ONE MAIN STREET -> 1 MAIN STREET
    BONE ROAD -> BONE ROAD
    BUILDING TWO, THREE MAIN ROAD -> BUILDING 2, 3 MAIN ROAD
    ELEVEN MAIN ST -> ELEVEN MAIN ST
    ONE HUNDRED FUNTOWN -> 1 HUNDRED FUNTOWN
    

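    Obviously some numbers are left unchanged and some results look odd. That is entirely expected.

    The code below does everything I need. My question is: is there a clever way to make it run faster? I considered building a list of dictionaries where the key is the number word and the value is the digit, but I don't think that would help performance. Or should I re.compile each regex and pass them into this function? Any clever ideas for making this run faster?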

    def update_word_to_numeric(entrylist):
        updated_entrylist = []
        for theentry in entrylist:
            theentry.addr_ln_1 = re.sub(r"\bZERO\b", "0", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bONE\b", "1", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bTWO\b", "2", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bTHREE\b", "3", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bFOUR\b", "4", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bFIVE\b", "5", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bSIX\b", "6", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bSEVEN\b", "7", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bEIGHT\b", "8", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bNINE\b", "9", theentry.addr_ln_1)
            theentry.addr_ln_1 = re.sub(r"\bTEN\b", "10", theentry.addr_ln_1)
    
            theentry.addr_ln_2 = re.sub(r"\bZERO\b", "0", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bONE\b", "1", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bTWO\b", "2", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bTHREE\b", "3", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bFOUR\b", "4", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bFIVE\b", "5", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bSIX\b", "6", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bSEVEN\b", "7", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bEIGHT\b", "8", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bNINE\b", "9", theentry.addr_ln_2)
            theentry.addr_ln_2 = re.sub(r"\bTEN\b", "10", theentry.addr_ln_2)
            updated_entrylist.append(theentry)
        return updated_entrylist
    

    Maybe this is already a perfectly fine way to do it. A comment saying "it's good enough" works for me too :)

    3 answers  |  as of 6 years ago
        1
  •  5
  •   L3viathan gboffi    6 years ago

    Using a single regex instead of eleven separate ones is much faster (I saw roughly a 3x speedup):

    import re

    def replace(match):
        return {
            "ZERO": "0",
            "ONE": "1",
            "TWO": "2",
            "THREE": "3",
            "FOUR": "4",
            "FIVE": "5",
            "SIX": "6",
            "SEVEN": "7",
            "EIGHT": "8",
            "NINE": "9",
            "TEN": "10",
        }[match.group(1)]
    
    pattern = re.compile(r"\b(ZERO|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)\b")
    
    def update_word_to_numeric(entrylist):
        updated_entrylist = []
        for theentry in entrylist:
            theentry.addr_ln_1 = pattern.sub(replace, theentry.addr_ln_1)
            theentry.addr_ln_2 = pattern.sub(replace, theentry.addr_ln_2)
            updated_entrylist.append(theentry)
        return updated_entrylist
    

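    I'm using a little-known feature of re.sub: the second argument can be a function. It receives the match object and returns the replacement string, which lets us look the replacement up in a dictionary.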
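    To show the callable form in isolation, here is a minimal sketch. The names `WORD_TO_DIGIT`, `PATTERN`, and `to_digit` are made up for this example and do not come from the answer above. Building the dictionary once at module level also avoids recreating it on every match.

```python
import re

# Hypothetical names, for illustration only.
WORD_TO_DIGIT = {"ONE": "1", "TWO": "2", "THREE": "3"}
PATTERN = re.compile(r"\b(ONE|TWO|THREE)\b")

def to_digit(match):
    # re.sub calls this once per match and inserts the returned string.
    return WORD_TO_DIGIT[match.group(1)]

result = PATTERN.sub(to_digit, "BUILDING TWO, THREE MAIN ROAD")
print(result)  # BUILDING 2, 3 MAIN ROAD
```

    Because of the `\b` word boundaries, a string such as "BONE ROAD" passes through unchanged.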

    I also used re.compile to precompile the regex. That improved the time as well, but not nearly as much as switching to a single pattern.
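    If you want to check the difference on your own data, a rough timing harness might look like the following. This is only a sketch: the sample strings are invented, the helper names (`many_passes`, `single_pass`) are hypothetical, and the numbers you get will depend on your data, machine, and Python version.

```python
import re
import timeit

WORDS = ["ZERO", "ONE", "TWO", "THREE", "FOUR", "FIVE",
         "SIX", "SEVEN", "EIGHT", "NINE", "TEN"]
DIGITS = {word: str(i) for i, word in enumerate(WORDS)}

# One precompiled pattern per word: eleven passes over each string.
SEPARATE = [(re.compile(r"\b" + w + r"\b"), DIGITS[w]) for w in WORDS]
# A single alternation covering every word: one pass over each string.
COMBINED = re.compile(r"\b(" + "|".join(WORDS) + r")\b")

def many_passes(text):
    for pattern, digit in SEPARATE:
        text = pattern.sub(digit, text)
    return text

def single_pass(text):
    return COMBINED.sub(lambda m: DIGITS[m.group(1)], text)

# Invented sample data; real addresses will give different timings.
sample = ["ONE MAIN STREET", "BONE ROAD", "BUILDING TWO, THREE MAIN ROAD"] * 1000

for func in (many_passes, single_pass):
    seconds = timeit.timeit(lambda: [func(s) for s in sample], number=5)
    print(f"{func.__name__}: {seconds:.3f}s")
```

    Both functions should produce identical output, so it is worth asserting that on a sample before trusting either timing.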

        2
  •  2
  •   l'L'l    6 years ago

    Here is a way to do it with a dictionary:

    import re

    s = '''
    ONE MAIN STREET
    BONE ROAD
    BUILDING TWO, THREE MAIN ROAD
    ELEVEN MAIN ST
    ONE HUNDRED FUNTOWN
    '''
    
    d = {'ZERO':'0', 'ONE':'1', 'TWO':'2', 'THREE':'3', 'FOUR':'4', 
         'FIVE':'5', 'SIX':'6', 'SEVEN':'7', 'EIGHT':'8', 'NINE':'9', 
         'TEN':'10', 'ELEVEN':'11', 'TWELVE':'12'}
    
    p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
    r = p.sub(lambda x: d[x.group()], s)
    
    print(r)
    

    Add or remove entries in the dictionary as needed.
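    If the dictionary is going to change over time, one option is to wrap the pattern construction in a small helper. The sketch below assumes a hypothetical `make_replacer` function (not part of the answer above). It escapes each key with `re.escape` and sorts longer keys first, which only matters if you later add keys that contain regex metacharacters or that are prefixes of one another.

```python
import re

def make_replacer(mapping):
    """Return a function that replaces whole-word keys of `mapping` with their values."""
    # Longest keys first, so a longer key is tried before a shorter prefix of it.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape guards against keys that contain regex metacharacters.
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return lambda text: pattern.sub(lambda m: mapping[m.group()], text)

numbers = {"ONE": "1", "TWO": "2", "ELEVEN": "11", "TWELVE": "12"}
replace_numbers = make_replacer(numbers)

print(replace_numbers("ELEVEN MAIN ST"))       # 11 MAIN ST
print(replace_numbers("ONE HUNDRED FUNTOWN"))  # 1 HUNDRED FUNTOWN
```

    The pattern is compiled once per dictionary, so the returned function can be reused across all 2 million entries.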

        3
  •  1
  •   Druta Ruslan    6 years ago

    import re

    # Raw strings are required: in a normal string "\b" is a backspace, not a word boundary.
    numbers = [r"\bZERO\b", r"\bONE\b", r"\bTWO\b", r"\bTHREE\b", r"\bFOUR\b", r"\bFIVE\b", r"\bSIX\b", r"\bSEVEN\b", r"\bEIGHT\b", r"\bNINE\b", r"\bTEN\b"]
    for theentry in entrylist:
        for i, number in enumerate(numbers):
            # The list index doubles as the numeric value (ZERO -> 0, ..., TEN -> 10).
            theentry.addr_ln_1 = re.sub(number, str(i), theentry.addr_ln_1)
            theentry.addr_ln_2 = re.sub(number, str(i), theentry.addr_ln_2)
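
    One thing to watch when patterns are stored in a list like this: each one needs to be a raw string. In an ordinary string literal, `"\b"` is the backspace character, so the regex engine never sees a word boundary. Applying `r"{}".format(...)` afterwards does not help, because the escape has already been processed. A small demonstration, using an invented sample address:

```python
import re

# In a normal string literal, "\b" is the backspace character (0x08).
print("\b" == "\x08")  # True
# A raw string keeps the backslash, so it is two characters long.
print(len(r"\b"))      # 2

# Non-raw pattern: looks for literal backspaces around ONE, so nothing matches.
print(re.sub("\bONE\b", "1", "ONE MAIN STREET"))   # ONE MAIN STREET
# Raw pattern: \b reaches the regex engine as a word boundary.
print(re.sub(r"\bONE\b", "1", "ONE MAIN STREET"))  # 1 MAIN STREET
```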