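Lucene SpanNearQuery partial match
Question
Given a document {'foo', 'bar', 'baz'}, I want to match it using SpanNearQuery with the tokens {'baz', 'extra'}.
But this fails.
How can I get around this?
Sample test (using Lucene 2.9.1), with the following results:
- givenSingleMatch - PASS
- givenTwoMatches - PASS
- givenThreeMatches - PASS
- givenSingleMatch_andExtraTerm - FAIL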
...
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
public class SpanNearQueryTest {
    private RAMDirectory directory = null;
    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {
        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                    new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);
        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {
        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                    new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                    new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);
        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {
        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                    new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                    new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                    new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);
        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {
        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                    new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                    new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);
        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}
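Answer
SpanNearQuery lets you find terms that are within a certain distance of each other.
Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):
Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters) - you could use the following SpanQuery: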
new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term(FIELD, "lucene")),
        new SpanTermQuery(new Term(FIELD, "doug"))},
    5,
    true);
(Image: http://www.lucidimagination.com/blog/wp-content/uploads/2009/07/spanquery-dia1.png)
In this sample text, Lucene is within 3 positions of Doug.
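To make the "within N positions, in order" idea concrete, here is a minimal plain-Java sketch. It is a toy model and not Lucene's implementation: the class SpanNearSketch, its methods, and the sample sentence are all made up for illustration, and I count the gap as the number of tokens strictly between the two terms, which may not line up exactly with how Lucene accounts for slop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of proximity matching over token positions; NOT Lucene's implementation.
final class SpanNearSketch {

    // Collect the positions at which `term` occurs in the token list.
    static List<Integer> positionsOf(List<String> tokens, String term) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(term)) {
                positions.add(i);
            }
        }
        return positions;
    }

    // True if some occurrence of `first` and some occurrence of `second` have at most
    // `slop` other tokens between them. With inOrder, `first` must precede `second`.
    // If either term is absent, there is nothing to pair up and the result is false.
    static boolean near(List<String> tokens, String first, String second,
                        int slop, boolean inOrder) {
        for (int p1 : positionsOf(tokens, first)) {
            for (int p2 : positionsOf(tokens, second)) {
                if (p1 == p2) {
                    continue; // same token, not a pair
                }
                if (inOrder && p1 > p2) {
                    continue; // wrong order
                }
                int gap = Math.abs(p2 - p1) - 1; // tokens strictly between the two
                if (gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, already tokenized and lowercased.
        List<String> tokens =
                Arrays.asList("lucene", "was", "started", "by", "doug", "cutting");
        System.out.println(near(tokens, "lucene", "doug", 5, true));  // true (gap of 3)
        System.out.println(near(tokens, "doug", "lucene", 5, true));  // false (wrong order)
        System.out.println(near(tokens, "doug", "lucene", 5, false)); // true (order ignored)
    }
}
```

Note that a term which never occurs yields no positions at all, so the proximity check can never succeed for it, no matter how large the slop is.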
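But for your example, the only match I can see is that both your query and the target document have "cd" (and I am assuming that all of these terms are in a single field). In that case, you don't need any special query type. Using the standard mechanisms, you will get some non-zero weighting because they both contain the same term in the same field.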
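Edit 3 - in response to the latest comment, the answer is that you cannot use SpanNearQuery to do anything other than what it is intended for, which is to find out whether multiple terms in a document occur within a certain number of positions of each other. I can't tell what your specific use case / expected results are (feel free to post them), but in the last case, if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.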
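The difference between the two query types can be sketched in plain Java. This is only a model of the matching precondition, not Lucene code: PartialMatchSketch and its method names are hypothetical, and positions and scoring are ignored. The point is that a span-near match needs every clause to occur in the document, while a BooleanQuery made only of optional clauses needs at least one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of "all clauses required" versus "any clause is enough"; NOT Lucene code.
final class PartialMatchSketch {

    // Models the SpanNearQuery precondition: every query term must occur in the document.
    static boolean allTermsPresent(Set<String> docTerms, List<String> queryTerms) {
        return docTerms.containsAll(queryTerms);
    }

    // Models a BooleanQuery of optional (SHOULD) clauses: one matching term is enough.
    static boolean anyTermPresent(Set<String> docTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("foo", "bar", "baz"));
        List<String> query = Arrays.asList("baz", "extra");
        System.out.println(allTermsPresent(doc, query)); // false: "extra" is missing
        System.out.println(anyTermPresent(doc, query));  // true: "baz" is present
    }
}
```

This mirrors the test results in the question: the three passing tests only use terms that are in the document, and the failing one adds a term that is not.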
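Edit 4 - Now that you have posted your use case, I understand what you want to do. Here is how you can do it: use a BooleanQuery, as mentioned above, to combine the individual terms you want with the SpanNearQuery, and set a boost on the SpanNearQuery.
So, the query in text form would look like: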
BAZ OR EXTRA OR "BAZ EXTRA"~100^5
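(e.g. - this will match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents in which the terms "BAZ" and "EXTRA" occur within 100 positions of each other; adjust the slop and boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's OK, because in the next section I show you how to build this using the API.)
Programmatically, you would construct this as follows: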
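import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Declared as BooleanQuery (not Query) so that add() is available
BooleanQuery top = new BooleanQuery();
// Construct the terms since they will be used more than once.
// Note: TermQuery and SpanTermQuery do not analyze their text, so the term must match
// the indexed form (e.g. lowercase if the field was indexed with StandardAnalyzer).
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");
// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);
// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other. The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
        new SpanQuery[] { new SpanTermQuery(bazTerm),
                          new SpanTermQuery(extraTerm) },
        100, true);
// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);
// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);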
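To see the effect of combining optional term clauses with a boosted proximity clause, here is a small plain-Java sketch. The scoring is deliberately simplified and is an assumption of mine, not Lucene's formula: each matching term contributes 1, and the in-order proximity clause contributes the boost. The class BoostedProximitySketch and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy scoring model for "term OR term OR boosted proximity"; NOT Lucene's scoring.
final class BoostedProximitySketch {

    // True if `first` occurs before `second` with at most `slop` tokens between them.
    static boolean inOrderWithin(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && (j - i - 1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    // Each optional clause adds to the score only when it matches.
    // A score of 0 means no clause matched, i.e. the document is not a hit.
    static double score(List<String> tokens, String first, String second,
                        int slop, double boost) {
        double score = 0.0;
        if (tokens.contains(first)) {
            score += 1.0;
        }
        if (tokens.contains(second)) {
            score += 1.0;
        }
        if (inOrderWithin(tokens, first, second, slop)) {
            score += boost;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score(Arrays.asList("foo", "bar", "baz"), "baz", "extra", 100, 5.0)); // 1.0
        System.out.println(score(Arrays.asList("baz", "extra"), "baz", "extra", 100, 5.0));      // 7.0
        System.out.println(score(Arrays.asList("extra", "baz"), "baz", "extra", 100, 5.0));      // 2.0
        System.out.println(score(Arrays.asList("foo"), "baz", "extra", 100, 5.0));               // 0.0
    }
}
```

In this model the document from the question still matches on "baz" alone, while a document with both terms in order and close together ranks well above it.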
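Hope that helps! In the future, try to start off by posting exactly what results you are expecting - even if it is obvious to you, it may not be obvious to the reader, and being explicit can avoid a lot of back and forth.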