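I am trying to migrate my Lucene tokenizers to Apache Solr. I have already written a TokenizerFactory for each field type I use on the Lucene side, such as title and body. In Lucene there is a way to add a TokenStream to a field of a document. In Solr, to get the same behaviour I had with Lucene, we have to write custom tokenizers/filters. I am stuck on one particular field; I have gone through many blogs and books, and none of them solves my problem. Most of the blogs and books pass a String or int straight into the field type.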
I have built a custom TokenFilterFactory for Apache Solr and placed it in schema.xml as follows:
<fieldType name="text_reversed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="analyzer.TextWithMarkUpTokenizerFactory"/>
    <filter class="analyzer.ReverseFilterFactory"/>
  </analyzer>
</fieldType>
When I try to index a document in Solr:
TextWithMarkUp textWithMarkUp = //get from method
SolrInputDocument solrInputDocument = new SolrInputDocument();
solrInputDocument.addField("id", new Random().nextDouble());
solrInputDocument.addField("title", textWithMarkUp);
In the Apache Solr admin panel, the result looks like this:
{
  "id":"0.4470506508669744",
  "title":"com.xyz.data:[text = Several disparities are highlighted in the new report:\n\n74 percent of white male students said they felt like they belonged at school., tokens.size = 24], tokens = [Several] [disparities] [are] [highlighted] [in] [the] [new] [report] [:] [74] [percent] [of] [white] [male] [students] [said] [they] [felt] [like] [they] [belonged] [at] [school] [.] ",
  "_version_":1607597126134530048
}
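My reading of this output is that the title value is just a class name followed by the object's toString(), so the object seems to have been flattened to a plain string before the analysis chain ever saw it. The sketch below only imitates that kind of fallback in plain Java; it is not SolrJ's actual code, and both the Markup class and the flatten method are hypothetical names used for illustration.

```java
final class FallbackSerializationSketch {

    // Hypothetical stand-in for a custom value object such as TextWithMarkUp.
    static final class Markup {
        final String text;

        Markup(String text) {
            this.text = text;
        }

        @Override
        public String toString() {
            return "[text = " + text + "]";
        }
    }

    // Imitates a serializer that passes a few well-known types through
    // unchanged and turns anything else into "<class name>:<toString()>".
    static Object flatten(Object value) {
        if (value instanceof String || value instanceof Number || value instanceof Boolean) {
            return value;
        }
        return value.getClass().getName() + ":" + value.toString();
    }

    public static void main(String[] args) {
        // Prints the class name, a colon, then the toString() text.
        System.out.println(flatten(new Markup("Several disparities")));
    }
}
```

If that reading is right, a tokenizer on the Solr side can only ever receive this string through its Reader, never the original object.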
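I cannot get hold of the TextWithMarkUp instance inside my custom TokenStream, which prevents me from flattening the given object the way I used to with Lucene. In Lucene, I used to set the TextWithMarkUp instance right after creating the custom token stream instance. Below is a JSON version of my TextWithMarkUp instance: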
{
  "text": "The law, which was passed by the Louisiana Legislature and signed by Gov.",
  "tokens": [
    {
      "category": "Determiner",
      "canonical": "The",
      "ids": null,
      "start": 0,
      "length": 3,
      "text": "The",
      "order": 0
    },
    //tokenized/stemmed/tagged all the words
  ],
  "abbreviations": [],
  "essentialTokenNumber": 12
}
The code below is what I am trying to do:
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TextWithMarkUpTokenizer extends Tokenizer {
    private final PositionIncrementAttribute posIncAtt;
    protected int tokenIndex = -1; // index of the current token in the collection of metaQTokens
    protected List<MetaQToken> metaQTokens;
    protected TokenStream tokenTokenizer;

    public TextWithMarkUpTokenizer() {
        MetaQTokenTokenizer metaQTokenizer = new MetaQTokenTokenizer();
        tokenTokenizer = metaQTokenizer;
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
    }

    // With plain Lucene I call this right after constructing the tokenizer.
    public void setTextWithMarkUp(TextWithMarkUp text) {
        this.metaQTokens = text == null ? null : text.getTokens();
    }

    @Override
    public final boolean incrementToken() throws IOException {
        //get instance of TextWithMarkUp here
    }

    private void setCurrentToken(Token token) {
        ((IMetaQTokenAware) tokenTokenizer).setToken(token);
    }
}
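I have traced through the TextWithMarkUpTokenizerFactory class, but once the jar is loaded from Solr's lib folder, Solr is in full control of the factory class.
So, is there any way to set the given instance in Solr during indexing? I have looked into Update Request Processors. Could they solve my problem in any way?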
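One direction I am considering, in case it helps frame an answer: since only a string seems to reach Solr, the tokens could be serialized on the client and sent as the field value. Below is a minimal sketch in plain Java that builds a JSON string in the shape I understand Solr's PreAnalyzedField to accept ("str" for the stored text, "tokens" with "t" for the term, "s"/"e" for start/end offsets and "i" for the position increment). The MarkUpToken class is a hypothetical stand-in for my own token type, and the key names should be checked against the Solr reference guide before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one token of TextWithMarkUp, mirroring the
// "text", "start" and "length" keys of the JSON shown earlier.
final class MarkUpToken {
    final String text;
    final int start;
    final int length;

    MarkUpToken(String text, int start, int length) {
        this.text = text;
        this.start = start;
        this.length = length;
    }
}

final class PreAnalyzedJsonSketch {

    // Escapes characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Serializes the text and its tokens; every token gets a position
    // increment of 1, and the end offset is derived as start + length.
    static String toJson(String text, List<MarkUpToken> tokens) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"v\":\"1\",\"str\":\"").append(escape(text)).append("\",\"tokens\":[");
        for (int i = 0; i < tokens.size(); i++) {
            MarkUpToken t = tokens.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"t\":\"").append(escape(t.text)).append('"')
              .append(",\"s\":").append(t.start)
              .append(",\"e\":").append(t.start + t.length)
              .append(",\"i\":1}");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        List<MarkUpToken> tokens = new ArrayList<>();
        tokens.add(new MarkUpToken("The", 0, 3));
        tokens.add(new MarkUpToken("law", 4, 3));
        System.out.println(toJson("The law", tokens));
    }
}
```

The resulting string would then be passed to solrInputDocument.addField("title", json) instead of the TextWithMarkUp object, with the title field declared as a pre-analyzed type in schema.xml. This sketch drops the category and canonical form of each token; whether those can be carried along (for example as token type or payload) is something I have not verified.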