Extending Lucene Analyzer

Question

I have special analyzing needs in Lucene, but I want to keep using parts of the StandardAnalyzer mechanism.

In particular, I want the string

"-apple--carrot- tomato?"

to be tokenized into:

1. "-apple-"
2. "-carrot-"
3. "tomato"

(a string surrounded by dashes is treated as a separate token)

It seems that to achieve this, I have to customize the analyzer and the tokenizer. But do I have to rewrite them from scratch? For example, I don't want to have to tell the tokenizer (or token filter) that it should omit the question mark in "apple?".

Is there a way to just modify an existing analyzer?

Mysterion · Accepted Answer · 2016-08-01 09:13:00Z

3

Basically, you can't extend StandardAnalyzer, since it's a final class. But you can get the same effect by writing your own analyzer around your own tokenizer, and it's simple. Patching the existing class in place isn't a good idea either.

I could imagine something like this:

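import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
// StandardFilter, LowerCaseFilter, StopFilter and CharArraySet must be imported too;
// their packages depend on your Lucene version.

public class CustomAnalyzer extends Analyzer {

    // stop words to drop; empty here, supply your own set if you need one
    private final CharArraySet stopwords = CharArraySet.EMPTY_SET;

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // provide your own tokenizer that splits the input string the way you want
        final Tokenizer standardTokenizer = new MyStandardTokenizer();

        TokenStream tok = new StandardFilter(standardTokenizer);
        // make everything lowercase; remove if not needed
        tok = new LowerCaseFilter(tok);
        // filter out stop words, if you want that
        tok = new StopFilter(tok, stopwords);
        return new TokenStreamComponents(standardTokenizer, tok);
    }

    // static: needs no reference to the enclosing analyzer;
    // final: Lucene's assertions expect incrementToken() implementations to be final
    private static final class MyStandardTokenizer extends Tokenizer {

        @Override
        public boolean incrementToken() throws IOException {
            // mimic the logic of the standard analyzer and add your own rules
            return false;
        }
    }
}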

I put everything into one class just to make it easier to post here. In general, you need your own logic in MyStandardTokenizer: for example, you could copy the tokenizing code behind StandardAnalyzer (its StandardTokenizer is final too, so you can't extend it either) and then add the dash handling you need in incrementToken. Hope this helps.
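As a rough illustration of the "write your own logic" route, here is a minimal sketch of the splitting rule in plain Java, with no Lucene dependency. The class DashTokenScanner and its regular expression are hypothetical, made up for this example. It assumes a token is either a dash-delimited run or a plain run of letters and digits, and it does not try to reproduce everything the standard tokenizer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Lucene): pulls tokens out of a string one
 * at a time, in the same pull style as Tokenizer.incrementToken().
 */
final class DashTokenScanner {

    // Assumed rule: either a dash-delimited run such as "-apple-",
    // or a plain run of letters and digits such as "tomato".
    private static final Pattern TOKEN =
            Pattern.compile("-[^-\\s]+-|[\\p{L}\\p{N}]+");

    private final Matcher matcher;
    private String term;
    private int startOffset;
    private int endOffset;

    DashTokenScanner(String input) {
        this.matcher = TOKEN.matcher(input);
    }

    /** Advances to the next token; returns false when the input is exhausted. */
    boolean incrementToken() {
        if (!matcher.find()) {
            return false;
        }
        term = matcher.group();
        startOffset = matcher.start();
        endOffset = matcher.end();
        return true;
    }

    String term() { return term; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    /** Convenience method that collects every token into a list. */
    static List<String> tokenize(String input) {
        DashTokenScanner scanner = new DashTokenScanner(input);
        List<String> tokens = new ArrayList<>();
        while (scanner.incrementToken()) {
            tokens.add(scanner.term());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [-apple-, -carrot-, tomato]
        System.out.println(tokenize("-apple--carrot- tomato?"));
    }
}
```

In a real Tokenizer subclass, the same loop would read from the tokenizer's input Reader instead of a String, call clearAttributes() at the start of each incrementToken(), and publish the term and offsets through CharTermAttribute and OffsetAttribute rather than through getters. Lowercasing and stop-word removal can stay in the filter chain shown above.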


2 Comments

Person1 Over a year ago

Thank you for your detailed answer. One thing I didn't understand, though: you said I should mimic the logic of StandardAnalyzer inside incrementToken. Did you mean I should copy the code from the sources and then add my logic? That code is very complex, and I doubt I'll understand it well enough to know where and how to add my logic. Or did you just mean I should write my own code to achieve the same result?

Mysterion Over a year ago

Either way should be fine. What matters is that your implementation produces the behaviour you expect.
