
Efficiently parsing formulas with regex and Polars

  •  3
  • Oyibo  · Tech Community  · 1 month ago

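    I am trying to parse a series of mathematical formulas and need to extract the variable names efficiently using Polars in Python. Regex support in Polars seems limited, particularly for look-around assertions. Is there a simple and efficient way to parse the symbols out of a formula?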

    Here is my code snippet:

    import re
    import polars as pl
    
    # Define the regex pattern
    FORMULA_DECODER = r"\b[A-Za-z][A-Za-z_0-9_]*\b(?!\()"
    # \b          # Assert a word boundary to ensure matching at the beginning of a word
    # [A-Za-z]    # Match an uppercase or lowercase letter at the start
    # [A-Za-z_0-9_]* # Match zero or more following valid characters (letters, digits, or underscores; the second underscore in the class is redundant)
    # \b          # Assert a word boundary to ensure matching at the end of a word
    # (?!\()      # Negative lookahead to ensure the match is not followed by an open parenthesis (indicating a function)
    
    # Sample formulas
    formulas = ["3*sin(x1+x2)+A_0",
                "ab*exp(2*x)"]
    
    # expected result
    pl.Series(formulas).map_elements(lambda formula: re.findall(FORMULA_DECODER, formula), return_dtype=pl.List(pl.String))
    # Series: '' [list[str]]
    # [
    #   ["x1", "x2", "A_0"]
    #   ["ab", "x"]
    # ]
    
    # Polars does not support this regex pattern
    pl.Series(formulas).str.extract_all(FORMULA_DECODER)
    # ComputeError: regex error: regex parse error:
    #     \b[A-Za-z][A-Za-z_0-9_]*\b(?!\()
    #                               ^^^
    # error: look-around, including look-ahead and look-behind, is not supported
    

    Edit: here is a small benchmark:

    import random
    import string
    import re
    import polars as pl
    
    def generate_symbol():
        """Generate random symbol of length 1-3."""
        characters = string.ascii_lowercase + string.ascii_uppercase
        return ''.join(random.sample(characters, random.randint(1, 3)))
    
    def generate_formula():
        """Generate random formula with 2-6 symbols (not necessarily unique)."""
        op = ['+', '-', '*', '/']
        return ''.join([generate_symbol()+random.choice(op) for _ in range(random.randint(2, 6))])[:-1]
    
    
    def generate_formulas(num_formulas):
        """Generate random formulas."""
        return [generate_formula() for _ in range(num_formulas)]
    
    # Sample formulas
    # formulas = ["3*sin(x1+x2)+(A_0+B)",
    #             "ab*exp(2*x)"]
    
    def parse_baseline(formulas):
        """Baseline serves as performance reference. It does not filter out function names."""
        FORMULA_DECODER_NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"
        return pl.Series(formulas).str.extract_all(FORMULA_DECODER_NO_LOOKAHEAD)
    
    def parse_lookahead(formulas):
        FORMULA_DECODER = r"\b[A-Za-z][A-Za-z_0-9_]*\b(?!\()"
        return pl.Series(formulas).map_elements(lambda formula: re.findall(FORMULA_DECODER, formula), return_dtype=pl.List(pl.String))
    
    def parse_no_lookahead_and_filter(formulas):
        FORMULA_DECODER_NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"
        return (
            pl.Series(formulas)
            .str.extract_all(FORMULA_DECODER_NO_LOOKAHEAD)
            # filter for matches not containing an open parenthesis
            .list.eval(pl.element().filter(~pl.element().str.contains("(", literal=True)))
        )
    
    formulas = generate_formulas(1000)
    %timeit parse_lookahead(formulas)
    %timeit parse_no_lookahead_and_filter(formulas)
    %timeit parse_baseline(formulas)
    # 10.7 ms ± 387 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # 1.31 ms ± 76.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    # 708 μs ± 6.43 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    1 reply  |  1 month ago
  •  2
  •   Hericks    1 month ago

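    As mentioned in the comments, you can drop the negative lookahead and optionally include the opening parenthesis in the match. In a post-processing step, you can then filter out any match that contains an opening parenthesis (using pl.Series.list.eval).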

    This could look as follows.

    # avoid negative lookahead and optionally match open parenthesis
    FORMULA_DECODER_NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"
    
    (
        pl.Series(formulas)
        .str.extract_all(FORMULA_DECODER_NO_LOOKAHEAD)
        # filter for matches not containing an open parenthesis
        .list.eval(pl.element().filter(~pl.element().str.contains("(", literal=True)))
    )
    
    shape: (2,)
    Series: '' [list[str]]
    [
        ["x1", "x2", "A_0"]
        ["ab", "x"]
    ]
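
    As a sanity check, the match-then-filter idea can be compared against the original lookahead pattern in plain Python, where `re` supports look-around. This is only a minimal sketch: the helper names `symbols_lookahead` and `symbols_filtered` are made up for illustration, and the third formula is the commented-out sample from the benchmark in the question.

```python
import re

# Pattern from the question: the negative lookahead rejects names followed by "(".
WITH_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b(?!\()"
# Pattern from the answer: optionally consume the "(" so it can be filtered out later.
NO_LOOKAHEAD = r"\b[A-Za-z][A-Za-z_0-9_]*\b\(?"


def symbols_lookahead(formula):
    """Reference result using the lookahead pattern (hypothetical helper)."""
    return re.findall(WITH_LOOKAHEAD, formula)


def symbols_filtered(formula):
    """Match with an optional "(" and drop matches that contain it (hypothetical helper)."""
    return [m for m in re.findall(NO_LOOKAHEAD, formula) if "(" not in m]


formulas = ["3*sin(x1+x2)+A_0", "ab*exp(2*x)", "3*sin(x1+x2)+(A_0+B)"]
for formula in formulas:
    # Both approaches should agree on every formula.
    assert symbols_lookahead(formula) == symbols_filtered(formula)
    print(formula, "->", symbols_filtered(formula))
```

    The two approaches agree because the no-lookahead pattern consumes the parenthesis together with the function name, so a function name is always returned as a single match ending in "(" and is removed by the filter.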