
Boolean text search in Python

boolean full-text-search python

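Berry Tsakala · 15 years ago

I'm looking for an existing module that lets me write basic boolean queries to match and search text, without having to write my own parser and so on.

For example,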

president AND (ronald OR (george NOT bush))

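would evaluate to True against "the president ronald ragen and bush"

but False against "george bush was a president" and "i don't know how to spell ronald ragen"

(So far I have found Booleano, which seems a bit overkill but could do the job. However, its group is inactive, and I couldn't work out from the documentation what to do.)

Thanks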

3 replies | as of 15 years ago

Kerighan 5 years ago

For anyone who might land on this page: I built a package that does exactly this (still in beta).

pip install eldar

Your query would translate to the following code:

from eldar import build_query

eldar = build_query('"president" AND ("ronald" OR ("george" AND NOT "bush"))')

print(eldar("President Bush"))
# >>> False
print(eldar("President George"))
# >>> True

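You can also use it on a pandas DataFrame; for more information, see the git page: https://github.com/kerighan/eldar

seanmac7577 15 years ago

You would be very lucky to find a pre-existing library that happens to parse the example expression you gave. I suggest making your expression format easier for a machine to read while keeping it clear. Lisp S-expressions (which use prefix notation) are concise and unambiguous.

Writing a parser for that format is much easier than writing one for yours. Or you could switch to Lisp, which would parse it natively. :)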
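To illustrate how little code a prefix-notation parser needs, here is a minimal Python sketch. The S-expression spelling of the query, the operator names `and`/`or`/`not`, and the helper names `tokenize`, `parse`, and `matches` are all assumptions made for this example; terms are matched as case-insensitive substrings.

```python
import re

def tokenize(text):
    # Split into parentheses, double-quoted terms, and bare words.
    return re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', text)

def parse(tokens):
    # Consume tokens from the front and build nested Python lists.
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return token

def matches(node, text):
    # A string leaf is a search term; a list is [operator, operand, ...].
    if isinstance(node, str):
        return node.strip('"').lower() in text.lower()
    op, *args = node
    if op == "and":
        return all(matches(arg, text) for arg in args)
    if op == "or":
        return any(matches(arg, text) for arg in args)
    if op == "not":
        return not matches(args[0], text)
    raise ValueError(f"unknown operator: {op!r}")

# Assumed S-expression form of the query from the question.
query = parse(tokenize('(and "president" (or "ronald" (and "george" (not "bush"))))'))

print(matches(query, "the president ronald ragen and bush"))  # True
print(matches(query, "george bush was a president"))          # False
```

Because every operator comes first inside its own parentheses, there is no precedence to resolve, which is why the parser fits in a dozen lines. The sketch does no error handling for unbalanced parentheses.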

Side note: I assume you didn't mean for your "NOT" operator to be binary, right?

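Justin Peel 15 years ago

You might want to look at simpleBool.py on this page; it uses the pyparsing module. Otherwise, here is some simple code I wrote.

This isn't a module, but it may point you in the right direction.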

def found(s, searchstr):
    # True if searchstr occurs anywhere in s (plain substring match)
    return searchstr in s

def booltest1(s):
    # president AND (ronald OR (george AND NOT bush))
    tmp = found(s, 'george') and not found(s, 'bush')
    return found(s, 'president') and (found(s, 'ronald') or tmp)

print(booltest1('the president ronald reagan'))   # True
print(booltest1('george bush was a president'))   # False

You can test the others yourself. I used tmp because the line was getting too long.
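To avoid hard-coding each query as a function like booltest1, the same substring checks could be driven by a small recursive-descent parser for the question's infix syntax. This is only a sketch under stated assumptions: the function names are hypothetical, terms are matched as lowercase substrings, and the question's binary "a NOT b" is read as "a AND NOT b".

```python
import re

def tokenize(query):
    # Parentheses and whitespace-separated words; AND/OR/NOT are keywords.
    return re.findall(r'\(|\)|[^\s()]+', query)

def parse_or(tokens):
    node = parse_and(tokens)
    while tokens and tokens[0] == "OR":
        tokens.pop(0)
        node = ("OR", node, parse_and(tokens))
    return node

def parse_and(tokens):
    node = parse_not(tokens)
    while tokens and tokens[0] in ("AND", "NOT"):
        # Assumption: "a NOT b" means "a AND NOT b", so a bare NOT is left
        # in place for parse_not to consume.
        if tokens[0] == "AND":
            tokens.pop(0)
        node = ("AND", node, parse_not(tokens))
    return node

def parse_not(tokens):
    if tokens[0] == "NOT":
        tokens.pop(0)
        return ("NOT", parse_not(tokens))
    if tokens[0] == "(":
        tokens.pop(0)
        node = parse_or(tokens)
        tokens.pop(0)  # drop the closing ")"
        return node
    return tokens.pop(0).lower()  # a plain search term

def evaluate(node, text):
    # Leaves are terms (str); inner nodes are tuples tagged with the operator.
    if isinstance(node, str):
        return node in text.lower()
    if node[0] == "NOT":
        return not evaluate(node[1], text)
    left, right = evaluate(node[1], text), evaluate(node[2], text)
    return (left and right) if node[0] == "AND" else (left or right)

def search(query, text):
    return evaluate(parse_or(tokenize(query)), text)

q = "president AND (ronald OR (george NOT bush))"
print(search(q, "the president ronald reagan"))   # True
print(search(q, "george bush was a president"))   # False
```

The three parse functions encode the usual precedence (NOT binds tighter than AND, which binds tighter than OR). There is no error reporting for malformed queries.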

Jochen Ritzel 15 years ago

I use sphinx for full text search, which has boolean matching but uses operators instead of words. For example, your query would be president (regan|(bush -george)) .

Lucene has the same feature.

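PSK 4 years ago

I realize this is not the most fitting answer to this question. I'm posting it only because I found it useful, and the accepted solution was too slow in my case (I was working with a DataFrame).

I created a pandas DataFrame named df: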

import pandas as pd

df = pd.DataFrame({'a':["the president's ronald ragen", 'the president ronald ragen and bush', 'abe'], 'b':['max bush was not a president','george bush was a president',"i don't know how to spell ronald ragen"]})

The following runs the query """"president"&("ronald"|("george"&~"bush"))""" , which is my interpretation of the query in the question, on df:

import re

class logical_search:
    def __init__(self, expression=""):
        self.expression = expression
        
    def evaluate(self):
        # For a well-formed query, 'eval' only combines the boolean DataFrames stored as attributes on this object.
        # Caution: anything written outside the quoted terms is executed as Python, so only pass queries you trust.
        return eval(self.expression)

def add_self(match):
    # Replace all special characters and spaces with an underscore to ensure valid attribute names in the 'logical_search' class
    no_spcl_var_nm = re.sub(r'[^a-zA-Z]', '_', match.group(2))
    # Append an underscore and a unique number to ensure that even if the query section is repeated we have a unique attribute name in the 'logical_search' class for that query section
    fnl_var_nm = 'self.'+no_spcl_var_nm+'_'+str(add_self.counter)
    # Increment the counter
    add_self.counter += 1
    return fnl_var_nm

# The query or expression is in triple quotes to allow for apostrophes
query = """"president"&("ronald"|("george"&~"bush"))"""

# Set the counter whose value is appended to every query section to ensure unique attribute names in the 'logical_search' class
add_self.counter = 0
# Get a version of the query that is appropriate to pass as an input to 'eval' in the 'evaluate' function in the 'logical_search' class
self_query = re.sub(r'(")([^"]*)(")', add_self, query)

# Instantiate the 'logical_search' class with 'self_query'
ls = logical_search(self_query)

# Get all the query sections from the expression
var_nms = re.findall(r'"([^"]*)"', query)
# For every query section
for idx, var_nm in enumerate(var_nms):
    # Replacing all special characters with an '_' as in the add_self function
    no_spcl_var_nm = re.sub(r'[^a-zA-Z]', '_', var_nm)
    # Appending with an '_' and a unique number as in the add_self function
    fnl_var_nm = no_spcl_var_nm+'_'+str(idx)
    # Setting the attributes for the 'logical_search' class' object 'ls' where names are the same as those in 'self_query' and the values are DataFrames that contain True or False values based on whether that query section is available in the cell text or not.
    setattr(ls, fnl_var_nm, df.applymap(lambda x: var_nm in x))

# Since the query expression and the attributes are set for 'ls' we can evaluate the query expression by calling the 'evaluate' function of 'logical_search' class.
result = ls.evaluate()
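For comparison, the same boolean masks can be combined directly with pandas' `&`, `|`, and `~` operators, without rewriting the query string or calling eval. This is a sketch rather than a drop-in replacement for the class above: the query is hard-coded in Python instead of being parsed from text, and the `contains` helper is a hypothetical name introduced for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["the president's ronald ragen",
          "the president ronald ragen and bush",
          "abe"],
    "b": ["max bush was not a president",
          "george bush was a president",
          "i don't know how to spell ronald ragen"],
})

def contains(term):
    # Hypothetical helper: boolean DataFrame, True where the cell text
    # contains `term` as a plain (non-regex) substring.
    return df.apply(lambda col: col.str.contains(term, regex=False))

# "president" & ("ronald" | ("george" & ~"bush")), written as mask algebra.
result = contains("president") & (
    contains("ronald") | (contains("george") & ~contains("bush"))
)
print(result)
```

Each `contains` call scans the whole frame once, so a query with many terms costs one pass per term. The result has the same shape as df, with True in each cell that satisfies the query.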