
How to flatten an object in Apache Solr and apply it to a FieldType

lucene solr indexing java

Bibek Shakya Dev. Joel · 6 years ago

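I am trying to migrate my Lucene tokenizers to Apache Solr. In Lucene I have already written a TokenizerFactory for each field type, such as title and body. Lucene offers a way to add a TokenStream to a field in a document. In Solr, to get the same behavior, we have to write custom tokenizers/filters. I am stuck on one particular field. I have gone through many blogs and books, and none of them solved my problem: most of them pass a String or an int directly to the field type.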

I have built a custom TokenFilterFactory for Apache Solr and placed it in schema.xml as follows:

<fieldType name="text_reversed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="analyzer.TextWithMarkUpTokenizerFactory"/>
    <filter class="analyzer.ReverseFilterFactory"/>
  </analyzer>
</fieldType>

When I try to index a document in Solr:

 TextWithMarkUp textWithMarkUp = //get from method
 SolrInputDocument solrInputDocument = new SolrInputDocument();
 solrInputDocument.addField("id", new Random().nextDouble());
 solrInputDocument.addField("title", textWithMarkUp);

the Apache Solr admin panel shows the following result:

{
    "id":"0.4470506508669744",
    "title":"com.xyz.data:[text = Several disparities are highlighted in the new report:\n\n74 percent of white male students said they felt like they belonged at school., tokens.size = 24], tokens = [Several] [disparities] [are] [highlighted] [in] [the] [new] [report] [:] [74] [percent] [of] [white] [male] [students] [said] [they] [felt] [like] [they] [belonged] [at] [school] [.] ",
    "_version_":1607597126134530048
}

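I cannot get hold of the TextWithMarkUp instance in my custom TokenStream, which prevents me from flattening the given object the way I used to with Lucene. In Lucene, I used to set the TextWithMarkUp instance after creating the custom TokenStream instance. Below is the JSON version of my TextWithMarkUp instance: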

{
"text": "The law, which was passed by the Louisiana Legislature and signed by Gov.",
"tokens": [
    {
        "category": "Determiner",
        "canonical": "The",
        "ids": null,
        "start": 0,
        "length": 3,
        "text": "The",
        "order": 0
    },
    //tokenized/stemmed/tagged all the words
],
"abbreviations": [],
"essentialTokenNumber": 12
}

The code below is what I am trying to do:

public class TextWithMarkUpTokenizer extends Tokenizer {
    private final PositionIncrementAttribute posIncAtt;
    protected int tokenIndex = -1; // index of the current token in the collection of metaQTokens
    protected List<MetaQToken> metaQTokens;
    protected TokenStream tokenTokenizer;

    public TextWithMarkUpTokenizer() {
        MetaQTokenTokenizer metaQTokenizer = new MetaQTokenTokenizer();
        tokenTokenizer = metaQTokenizer;
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
    }

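    public void setTextWithMarkUp(TextWithMarkUp text) {
        // Store the tokens in the declared metaQTokens field (no "markup" field exists in this class).
        this.metaQTokens = text == null ? null : text.getTokens();
    }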

    @Override
    public final boolean incrementToken() throws IOException {
      //get instance of TextWithMarkUp here
    }

    private void setCurrentToken(Token token) {
        ((IMetaQTokenAware) tokenTokenizer).setToken(token);
    }
}

I have traced through the TextWithMarkUpTokenizerFactory class, but once we load the jar from Solr's lib folder, Solr takes full control of the factory class.

So, is there any way to set the given instance in Solr during indexing? I have looked into Update Request Processors. Could they solve my problem in any way?

1 answer

elyograg · 6 years ago

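Solr search results return exactly what the indexing system received. That is the original input after all update processors have run. The update processor chain that Solr uses by default does not change the input.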

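The analysis chains defined in the schema have absolutely no effect on search results; they only affect the tokens generated at index time and query time. Stored data is not affected by analysis.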
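To make the stored-versus-indexed distinction concrete, here is a toy model in plain Java. It is not Solr code: the class and method names are made up for illustration, and `analyze` only imitates a keyword tokenizer followed by a reversing filter, on the assumption that this is roughly what the `text_reversed` field type in the question is meant to do.

```java
import java.util.Collections;
import java.util.List;

public class StoredVersusIndexedDemo {

    /** Toy model of one field: the stored value plus the tokens produced by analysis. */
    public static final class IndexedField {
        public final String stored;
        public final List<String> indexedTokens;

        IndexedField(String stored, List<String> indexedTokens) {
            this.stored = stored;
            this.indexedTokens = indexedTokens;
        }
    }

    /** Imitates a keyword tokenizer (whole input as one token) followed by a reversing filter. */
    public static List<String> analyze(String input) {
        String keywordToken = input; // the entire input becomes a single token
        String reversed = new StringBuilder(keywordToken).reverse().toString();
        return Collections.singletonList(reversed);
    }

    /** The stored value is the raw input; only the token list goes through analysis. */
    public static IndexedField index(String input) {
        return new IndexedField(input, analyze(input));
    }

    public static void main(String[] args) {
        IndexedField field = index("abc");
        System.out.println("stored:  " + field.stored);        // stored:  abc
        System.out.println("indexed: " + field.indexedTokens); // indexed: [cba]
    }
}
```

In this model a search result would show `stored`, while matching would run against `indexedTokens`, which is why changing the analysis chain never changes what the admin panel displays for the field.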

When you call addField with a custom object, the SolrJ code below is most likely what gets called to determine what is sent to Solr (val is the input object):

writeVal(val.getClass().getName() + ':' + val.toString());

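This creates a string consisting of the class name followed by the string representation of the object. As MatsLindh said in the comments, SolrJ knows nothing about your custom object, so the data does not arrive at Solr as your custom object type.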
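The effect of that concatenation can be reproduced without Solr. The sketch below uses hypothetical stand-ins for the classes in the question (`MarkupToken` and `TextWithMarkUp` here are simplified guesses, not the real ones) and shows two things: the class-name-prefixed string that results from the quoted expression, and one possible workaround, which is to flatten the object into a plain string yourself before calling addField. The `text|category|start|length` encoding is an assumption made for illustration; it only works if token text never contains `|` or a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CustomObjectFieldDemo {

    /** Hypothetical stand-in for one entry of the "tokens" array in the question. */
    public static final class MarkupToken {
        public final String text;
        public final String category;
        public final int start;
        public final int length;

        public MarkupToken(String text, String category, int start, int length) {
            this.text = text;
            this.category = category;
            this.start = start;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for the asker's TextWithMarkUp class. */
    public static final class TextWithMarkUp {
        public final String text;
        public final List<MarkupToken> tokens;

        public TextWithMarkUp(String text, List<MarkupToken> tokens) {
            this.text = text;
            this.tokens = tokens;
        }

        @Override
        public String toString() {
            return "[text = " + text + ", tokens.size = " + tokens.size() + "]";
        }
    }

    /** Same expression as the quoted line: class name, a colon, then toString(). */
    public static String asDefaultFieldValue(Object val) {
        return val.getClass().getName() + ':' + val.toString();
    }

    /** Assumed encoding: one "text|category|start|length" entry per token, space-separated. */
    public static String flatten(TextWithMarkUp markup) {
        List<String> parts = new ArrayList<>();
        for (MarkupToken t : markup.tokens) {
            parts.add(t.text + "|" + t.category + "|" + t.start + "|" + t.length);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        TextWithMarkUp markup = new TextWithMarkUp("The law", Arrays.asList(
                new MarkupToken("The", "Determiner", 0, 3),
                new MarkupToken("law", "Noun", 4, 3)));

        // What a class-name-prefixed field value looks like for this object.
        System.out.println(asDefaultFieldValue(markup));

        // A plain string that survives the trip to Solr unchanged.
        System.out.println(flatten(markup)); // The|Determiner|0|3 law|Noun|4|3
    }
}
```

With something along these lines, the client would pass `flatten(markup)` to addField instead of the object, and a custom tokenizer on the Solr side would have to parse that string from its input to rebuild the tokens. Note that the stored value of the field would then be the encoded string, since stored data is exactly what was sent.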