I'm using Lucene 6.6.0 to develop a search service, and I'm quite confused about how to create custom analyzers and queries.

I built my index from data in an RDBMS, and at first I was just using a StandardAnalyzer. Unfortunately, it does not seem to split text on special characters like "_" and "-" or on numbers; it only seems to tokenize on whitespace. I found the WordDelimiterGraphFilter, which seems to do what I want, but I do not understand how to make it work. Right now I try to use it like this:

mCustomAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();

        TokenStream filter = new WordDelimiterGraphFilter(source, 8, null);
        return new TokenStreamComponents(source, filter);
    }
};

QueryBuilder queryBuilder = new QueryBuilder(mCustomAnalyzer);
Query query = queryBuilder.createPhraseQuery(aField, aText, 15);

For indexing I am using the same Analyzer. However, it does not work: if I search for "term1 term2", I expect to find things like "term1_term2" and also "term32423" or "term_232".

What am I missing here? I tried different integers as the "configurationFlags" argument for the filter [1], but it doesn't seem to work...

[1] http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html

  • Could you edit your question and replace 8 with the proper constants you used? It would be clearer. I think your problem lies in positions: you should debug the tokenizer and print the position increments while consuming tokens. I suspect that "term1_term2" is indexed as term1 and term2 at the same position (the position increment between the two will be 0), which will cause the phrase query to fail. The Javadoc mentions a combinations parameter to control this, but it's unclear how to set it. Commented Jul 11, 2017 at 10:03
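The debugging step suggested in this comment could look roughly like the sketch below. It assumes Lucene 6.x (lucene-core and lucene-analyzers-common) is on the classpath; `TokenDebugger` and `describe` are hypothetical names chosen for illustration, not part of Lucene. It walks the analyzer's TokenStream and records each term together with its position increment, so terms that share a position (increment 0) become visible.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Hypothetical helper (not part of Lucene): lists the tokens an analyzer emits.
final class TokenDebugger {

    // Returns one line per token: absolute position, position increment, term text.
    static List<String> describe(Analyzer analyzer, String field, String text) {
        List<String> lines = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();                 // required before the first incrementToken()
            int position = -1;          // the first token normally has increment 1, giving position 0
            while (ts.incrementToken()) {
                position += posInc.getPositionIncrement();
                lines.add("pos=" + position
                        + " inc=" + posInc.getPositionIncrement()
                        + " term=" + term);
            }
            ts.end();                   // signals end of stream before close()
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Calling `TokenDebugger.describe(mCustomAnalyzer, aField, "term1_term2").forEach(System.out::println);` with the analyzer from the question shows which terms are emitted and whether any of them share a position.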

1 Answer

It is not clear what you are indexing and what you are searching for. In your sample code, you are passing the flag CATENATE_NUMBERS (8), which doesn't really help with text; it just catenates numbers, e.g. 500-42 -> 50042. To break term1_term2 into term1, term2, term1term2, and term1_term2, you need the GENERATE_WORD_PARTS, CATENATE_WORDS, CATENATE_NUMBERS, CATENATE_ALL, and PRESERVE_ORIGINAL flags.

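    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import static org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter.*;

    private static class CustomAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Combine Lucene's predefined flag constants with bitwise OR; null means no protected words.
            final int flags = GENERATE_WORD_PARTS | CATENATE_WORDS | CATENATE_NUMBERS | CATENATE_ALL | PRESERVE_ORIGINAL;
            Tokenizer tokenizer = new StandardTokenizer();
            return new TokenStreamComponents(tokenizer, new WordDelimiterGraphFilter(tokenizer, flags, null));
        }
    }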

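Sample code to test your example:

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.QueryBuilder;

    // Index a single document with the custom analyzer.
    CustomAnalyzer customAnalyzer = new CustomAnalyzer();
    Directory directory = FSDirectory.open(Paths.get("directoryPath"));
    IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(customAnalyzer));
    Document doc1 = new Document();
    doc1.add(new TextField("text", "WAS_sample_tc", Field.Store.YES));
    writer.addDocument(doc1);
    writer.close();

    // Build the query with the same analyzer that was used for indexing.
    QueryBuilder queryBuilder = new QueryBuilder(customAnalyzer);
    Query query = queryBuilder.createPhraseQuery("text", "sample", 15);

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
    TopDocs topDocs = searcher.search(query, 10);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.println(doc.toString());
    }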

6 Comments

Well, an example term I want to index would be "WAS_sample_tc", and I would expect to find it if I search for "sample". Thanks for your example code, but I still don't get the use of flags: how do I know what integer values I have to assign to them?
For your example, the above analyzer would work. Which flags to set is application-specific. If your indexed terms don't contain any numbers, you don't need the number-related flags.
Refer to the Javadoc: lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/…. It explains the purpose of each flag value with examples.
Getting back to the topic: thanks again, but my problem is that I'm not used to flags in general (I'm not the most experienced programmer). If I want to use your code, I have to declare the flags in my class like protected static final int GENERATE_WORD_PARTS = 1;. But what integer value do I have to set here to activate this property in the filter? Is it only 1 (true) and 0 (false), or what? I tried some random values, but the flag combination did not really give me the expected results...
No, you need not do that. These are Lucene constants that are already defined in WordDelimiterGraphFilter. lucene.apache.org/core/6_6_0/analyzers-common/…
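For readers unfamiliar with bit flags, here is a minimal plain-Java sketch of how such constants combine. The class `FlagDemo` and the helper `isSet` are hypothetical, and the constant values are assumed to mirror those listed in the Lucene 6.6 Javadoc for WordDelimiterGraphFilter (the answer above confirms only CATENATE_NUMBERS = 8). In real code you would reference the Lucene constants directly instead of redeclaring them.

```java
// Hypothetical demo class; it only illustrates the bit-flag idea and does not use Lucene.
final class FlagDemo {
    // Assumed to mirror WordDelimiterGraphFilter's constants: each flag is a distinct power of two.
    static final int GENERATE_WORD_PARTS   = 1;
    static final int GENERATE_NUMBER_PARTS = 2;
    static final int CATENATE_WORDS        = 4;
    static final int CATENATE_NUMBERS      = 8;
    static final int CATENATE_ALL          = 16;
    static final int PRESERVE_ORIGINAL     = 32;

    // A flag is "on" when its bit is present in the combined value.
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    public static void main(String[] args) {
        // Bitwise OR switches several options on at once: 1 | 4 | 32 = 37.
        int flags = GENERATE_WORD_PARTS | CATENATE_WORDS | PRESERVE_ORIGINAL;
        System.out.println(flags);                              // prints 37
        System.out.println(isSet(flags, CATENATE_WORDS));       // prints true
        System.out.println(isSet(flags, CATENATE_NUMBERS));     // prints false

        // Passing the bare value 8 enables CATENATE_NUMBERS and nothing else.
        System.out.println(isSet(8, GENERATE_WORD_PARTS));      // prints false
    }
}
```

So a flag is not a 0/1 switch per constant: each constant occupies its own bit, and the single int argument carries every enabled option at once. That is why guessing integers rarely produces the intended combination.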