
Boolean text search in Python

  •  8
  • Berry Tsakala  ·  15 years ago

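    I'm looking for an existing module that lets me write basic boolean queries for matching and searching text, without having to write my own parser and so on.

    For example,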

    president AND (ronald OR (george NOT bush))
    

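    would match True against "president ronald ragen and bush"

    but False against "george bush was a president" and "i don't know how to spell ronald ragen"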

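    (So far I've found Booleano, which seems a bit overkill but would do the job. However, their group is inactive, and I couldn't figure out from the documentation what to do.)

    Thanks

    Edit: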

    3 answers  |  up to 15 years ago
        1
  •  6
  •   Kerighan    5 years ago

    For those who might land on this page: I built a package that does exactly this (still in beta).

    pip install eldar
    

    Your query would translate to the following code:

    from eldar import build_query
    
    eldar = build_query('"president" AND ("ronald" OR ("george" AND NOT "bush"))')
    
    print(eldar("President Bush"))
    # >>> False
    print(eldar("President George"))
    # >>> True
    

    You can also use it on pandas DataFrames; for more information, see the GitHub page: https://github.com/kerighan/eldar

        2
  •  2
  •   seanmac7577    15 years ago

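    You would be very lucky to find a pre-existing library that happens to parse the example expression you gave. I suggest making your expression format more machine-readable while keeping it clear. Lisp S-expressions (using prefix notation) are compact and clear.

    Writing a parser for this format is much easier than writing one for yours. Or you could switch to Lisp, which would parse it natively. :)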
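
    As a rough illustration of why a prefix format is easy to handle, here is a minimal sketch of a tokenizer, parser, and evaluator in Python. The exact S-expression spelling of the query (QUERY below) and the helper names tokenize, parse, evaluate, and sexpr_match are assumptions made for this example, not something taken from an existing library. Terms are matched as case-insensitive substrings, so "bush" would also match "bushel".

```python
import re

def tokenize(text):
    # Split the query into parentheses, quoted strings, and bare words.
    return re.findall(r'\(|\)|"[^"]*"|[^\s()]+', text)

def parse(tokens):
    # Recursively turn the flat token list into nested Python lists.
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return expr
    return token.strip('"')

def evaluate(expr, text):
    # A bare atom is a search term; a list is (operator operand ...).
    if isinstance(expr, str):
        return expr.lower() in text.lower()
    op, *args = expr
    op = op.lower()
    if op == "and":
        return all(evaluate(arg, text) for arg in args)
    if op == "or":
        return any(evaluate(arg, text) for arg in args)
    if op == "not":
        return not evaluate(args[0], text)
    raise ValueError(f"unknown operator: {op}")

def sexpr_match(query, text):
    return evaluate(parse(tokenize(query)), text)

# Hypothetical S-expression form of the query from the question.
QUERY = "(and president (or ronald (and george (not bush))))"

print(sexpr_match(QUERY, "president ronald ragen and bush"))  # True
print(sexpr_match(QUERY, "george bush was a president"))      # False
```

    The sketch does no error handling: unbalanced parentheses raise an IndexError, which is acceptable for a demonstration but not for user-facing input.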

    Side note: I assume you didn't mean for your NOT operator to be binary, right?

        3
  •  1
  •   Justin Peel    15 years ago

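    You might want to look at simpleBool.py on this page, which uses the pyparsing module. Otherwise, here is some simple code I wrote.

    It isn't a module, but it might get you going in the right direction.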

    def found(s, searchstr):
        # True if searchstr occurs anywhere in s (case-sensitive)
        return searchstr in s
    
    def booltest1(s):
        # president AND (ronald OR (george AND NOT bush))
        tmp = found(s, 'george') and not found(s, 'bush')
        return found(s, 'president') and (found(s, 'ronald') or tmp)
    
    print(booltest1('the president ronald reagan'))  # True
    print(booltest1('george bush was a president'))  # False
    

    You can test the others. I used tmp because the line was getting too long.
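
    The same idea can be made reusable by building the test out of small predicate functions instead of hard-coding it. This is only a sketch: contains, all_of, any_of, and negate are names invented for this example, and unlike found() the match here is case-insensitive.

```python
def contains(term):
    # Case-insensitive substring test for a single search term.
    return lambda text: term.lower() in text.lower()

def all_of(*preds):
    # AND: every predicate must hold.
    return lambda text: all(pred(text) for pred in preds)

def any_of(*preds):
    # OR: at least one predicate must hold.
    return lambda text: any(pred(text) for pred in preds)

def negate(pred):
    # NOT: invert a single predicate.
    return lambda text: not pred(text)

# president AND (ronald OR (george AND NOT bush))
query = all_of(
    contains("president"),
    any_of(
        contains("ronald"),
        all_of(contains("george"), negate(contains("bush"))),
    ),
)

print(query("the president ronald reagan"))  # True
print(query("george bush was a president"))  # False
```

    Each query is then an ordinary function from text to bool, so it can be passed to filter() or used in a list comprehension.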

        4
  •  1
  •   Jochen Ritzel    15 years ago

    I use sphinx for full text search. It supports boolean matching, but uses operators rather than words. For example, your query would be president (ronald|(george -bush)) .

    Lucene has the same feature.

        5
  •  -1
  •   PSK    4 years ago

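    I understand this isn't the most appropriate answer to the question. I'm posting it only because I found it useful, and the accepted solution was too slow in my case (I have a DataFrame).

    I created a pandas DataFrame named df: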

    import pandas as pd
    
    df = pd.DataFrame({'a':["the president's ronald ragen", 'the president ronald ragen and bush', 'abe'], 'b':['max bush was not a president','george bush was a president',"i don't know how to spell ronald ragen"]})
    

    I ran the query """"president"&("ronald"|("george"&~"bush"))""", which is my interpretation of the query in the question, on df:

    import re
    
    class logical_search:
        def __init__(self, expression=""):
            self.expression = expression
            
        def evaluate(self):
            # Caution: eval runs the rewritten expression as Python code. Quoted terms are replaced with attribute lookups on DataFrames of True/False values, but anything outside the quotes is passed through unchanged, so only evaluate queries from a trusted source.
            return eval(self.expression)
    
    def add_self(match):
        # Replace all special characters and spaces with an underscore to ensure valid attribute names in the 'logical_search' class
        no_spcl_var_nm = re.sub(r'[^a-zA-Z]', '_', match.group(2))
        # Append an underscore and a unique number so that even if a query section is repeated, it gets a unique attribute name in the 'logical_search' class
        fnl_var_nm = 'self.'+no_spcl_var_nm+'_'+str(add_self.counter)
        # Increment the counter
        add_self.counter += 1
        return fnl_var_nm
    
    # The query or expression is in triple quotes to allow for apostrophes
    query = """"president"&("ronald"|("george"&~"bush"))"""
    
    # Set the counter that is appended to every query section to ensure unique attribute names in the 'logical_search' class
    add_self.counter = 0
    # Get a version of the query that is appropriate to pass as an input to 'eval' in the 'evaluate' function in the 'logical_search' class
    self_query = re.sub(r'(")([^"]*)(")', add_self, query)
    
    # Instantiate the 'logical_search' class with 'self_query'
    ls = logical_search(self_query)
    
    # Get all the query sections from the expression
    var_nms = re.findall(r'"([^"]*)"', query)
    # For every query section
    for idx, var_nm in enumerate(var_nms):
        # Replacing all special characters with an '_' as in the add_self function
        no_spcl_var_nm = re.sub(r'[^a-zA-Z]', '_', var_nm)
        # Appending with an '_' and a unique number as in the add_self function
        fnl_var_nm = no_spcl_var_nm+'_'+str(idx)
        # Set attributes on the 'logical_search' object 'ls' with the same names used in 'self_query'; each value is a DataFrame of True/False flags indicating whether that query section occurs in the cell text.
        # Note: pandas 2.1 deprecated 'applymap' in favor of 'DataFrame.map'.
        setattr(ls, fnl_var_nm, df.applymap(lambda x: var_nm in x))
    
    # Since the query expression and the attributes are set for 'ls' we can evaluate the query expression by calling the 'evaluate' function of 'logical_search' class.
    result = ls.evaluate()
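
    To make the rewriting step easier to follow, here is a small standalone sketch of just the substitution, showing the intermediate string that ends up being passed to eval. The names rewrite and to_attribute are hypothetical, and the sketch uses itertools.count in a closure rather than a function attribute for the counter; otherwise it follows the same naming scheme as add_self.

```python
import re
from itertools import count

def rewrite(query):
    # Replace each quoted term with a unique attribute reference.
    counter = count()

    def to_attribute(match):
        # Non-letters become underscores so the name is a valid attribute.
        safe_name = re.sub(r'[^a-zA-Z]', '_', match.group(1))
        return f"self.{safe_name}_{next(counter)}"

    return re.sub(r'"([^"]*)"', to_attribute, query)

query = '"president"&("ronald"|("george"&~"bush"))'
print(rewrite(query))
# self.president_0&(self.ronald_1|(self.george_2&~self.bush_3))
```

    Only the quoted terms are replaced; the operators and parentheses around them reach eval exactly as typed.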