Lucene 6.6: What is the current way to create a custom query using analyzers and filters?

Question

I'm using Lucene 6.6.0 to develop a search service, and I'm quite confused about how to create custom analyzers and queries.

I built my index from data in an RDBMS and at first just used a StandardAnalyzer. Unfortunately, it does not seem to split text on special characters like "_" and "-" or on numbers; it seems to tokenize only on whitespace. I found the WordDelimiterGraphFilter, which seems to do what I want, but I do not understand how to make it work. Right now I try to use it like this:

    mCustomAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();

            TokenStream filter = new WordDelimiterGraphFilter(source, 8, null);
            return new TokenStreamComponents(source, filter);
        }
    };

    QueryBuilder queryBuilder = new QueryBuilder(mCustomAnalyzer);
    Query query = queryBuilder.createPhraseQuery(aField, aText, 15);

I use the same analyzer for indexing. However, it does not work: if I search for "term1 term2", I expect to find things like "term1_term2" and also "term32423" or "term_232".

What am I missing here? I tried different integers as the "configurationFlags" argument for the filter [1], but it doesn't seem to work...

[1] http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html

Could you edit your question and replace the 8 with the named constants you used? It would be clearer. I think your problem lies in positions: you should debug the tokenizer and print the position increments while consuming tokens. I suspect that "term1_term2" is indexed as term1 and term2 at the same position (the position increment between the two will be 0), which will cause the phrase query to fail. The javadoc mentions a combinations param to control this, but it's unclear how to set it.
– nomoa, Commented Jul 11, 2017 at 10:03

darcula · Accepted Answer · 2017-07-19 16:07:38Z

1

It is not clear what you are indexing and what you are searching for. In your sample code you pass the flag CATENATE_NUMBERS (8), which doesn't really help with text; it only catenates number parts, e.g. 500-42 -> 50042. To break term1_term2 into term1, term2, term1term2 and term1_term2, you need the GENERATE_WORD_PARTS, CATENATE_WORDS, CATENATE_NUMBERS, CATENATE_ALL and PRESERVE_ORIGINAL flags.

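    private static class CustomAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // The flags are int constants defined on WordDelimiterGraphFilter; combine them with bitwise OR.
            final int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
                    | WordDelimiterGraphFilter.CATENATE_WORDS
                    | WordDelimiterGraphFilter.CATENATE_NUMBERS
                    | WordDelimiterGraphFilter.CATENATE_ALL
                    | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
            Tokenizer tokenizer = new StandardTokenizer();
            // Third argument is the set of protected words; null means none.
            return new TokenStreamComponents(tokenizer, new WordDelimiterGraphFilter(tokenizer, flags, null));
        }
    }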

Sample code to test your example:

    CustomAnalyzer customAnalyzer = new CustomAnalyzer();
    Directory directory = FSDirectory.open(Paths.get("directoryPath"));
    IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(customAnalyzer));
    Document doc1 = new Document();
    doc1.add(new TextField("text", "WAS_sample_tc", Field.Store.YES));
    writer.addDocument(doc1);
    writer.close();

    QueryBuilder queryBuilder = new QueryBuilder(customAnalyzer);
    Query query = queryBuilder.createPhraseQuery("text", "sample", 15);

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));

    TopDocs topDocs = searcher.search(query, 10);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.println(doc.toString());
    }

edited Jul 19, 2017 at 16:07

answered Jul 14, 2017 at 14:32


6 Comments

flixe Over a year ago

Well, an example term I want to index would be "WAS_sample_tc", and I would expect to find it if I search for "sample". Thanks for your example code, but I still don't get the use of flags: how do I know what integer values I have to assign to them?

darcula Over a year ago

For your example, the above analyzer would work. Which flags to set is application specific. If your indexed terms don't contain any numbers, you don't need the number-related flags.

darcula Over a year ago

Refer to the javadoc: lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/…. It explains the purpose of each flag value with examples.

flixe Over a year ago

Getting back to the topic: thanks again, but my problem is that I'm not used to flags in general (I'm not the most experienced programmer). If I want to use your code, I have to declare the flags in my class, like protected static final int GENERATE_WORD_PARTS = 1;. But what integer value do I have to set here to activate this property in the filter? Is it only 1 (true) and 0 (false), or something else? I tried some random values, but the flag combination did not really give me the expected results...

darcula Over a year ago

No, you don't need to do that. These are Lucene constants that are already defined in WordDelimiterGraphFilter, so you can reference them directly: lucene.apache.org/core/6_6_0/analyzers-common/…
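For readers who, like the asker, are new to bit flags, here is a minimal plain-Java sketch of how such a configuration int works. The class FlagDemo and its helpers isSet and describe are hypothetical names made up for this illustration. The numeric values are local copies that are assumed to mirror the constants on WordDelimiterGraphFilter (only CATENATE_NUMBERS = 8 is confirmed in this thread, so check the others against the javadoc). Real code should reference the Lucene constants instead of redeclaring them.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: local stand-ins for the flag constants, not the Lucene class. */
final class FlagDemo {
    // Assumed to mirror WordDelimiterGraphFilter's constants; each is a distinct power of two.
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 2;
    static final int CATENATE_WORDS = 4;
    static final int CATENATE_NUMBERS = 8;
    static final int CATENATE_ALL = 16;
    static final int PRESERVE_ORIGINAL = 32;

    // Names listed in bit order: index i corresponds to the value 1 << i.
    private static final String[] NAMES = {
        "GENERATE_WORD_PARTS", "GENERATE_NUMBER_PARTS", "CATENATE_WORDS",
        "CATENATE_NUMBERS", "CATENATE_ALL", "PRESERVE_ORIGINAL"
    };

    /** True if the given flag's bit is switched on in the combined value. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    /** Lists the names of all flags switched on in the combined value. */
    static List<String> describe(int flags) {
        List<String> enabled = new ArrayList<>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if (isSet(flags, 1 << bit)) {
                enabled.add(NAMES[bit]);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        int fromQuestion = 8; // a single bit: only CATENATE_NUMBERS
        int fromAnswer = GENERATE_WORD_PARTS | CATENATE_WORDS | CATENATE_NUMBERS
                | CATENATE_ALL | PRESERVE_ORIGINAL; // bitwise OR switches on several bits
        System.out.println(fromQuestion + " -> " + describe(fromQuestion));
        System.out.println(fromAnswer + " -> " + describe(fromAnswer));
    }
}
```

Under these assumed values, a flag is not a true/false setting per constant: each constant occupies its own bit, and OR-ing them yields one int that the filter can test bit by bit. That is why passing an arbitrary integer switches on an arbitrary mix of options.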
