I have special analysis needs in Lucene, but I want to keep using parts of the StandardAnalyzer mechanism.

In particular, I want the string

"-apple--carrot- tomato?"

to be tokenized into:

  1. "-apple-"
  2. "-carrot-"
  3. "tomato"

(a string surrounded by dashes is treated as a separate token, dashes included)

It seems that to achieve this, I have to customize the analyzer and the tokenizer. But do I have to rewrite them from scratch? For example, I don't want to have to tell the tokenizer (or token filter) that it should omit the question mark in "apple?".

Is there a way to just modify the existing analyzer?

1 Answer

Basically, you can't extend StandardAnalyzer, since it is a final class. But you can get the same effect with your own analyzer and tokenizer, and it's simple. Changing the existing class itself is not an option either; that would be a bad idea.

I could imagine something like this:

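// Package names below assume Lucene 7.x; several of these classes live in
// other packages in other versions, and StandardFilter is gone in Lucene 8+.
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;

public class CustomAnalyzer extends Analyzer {

    // stop words to drop; supplied by the caller
    private final CharArraySet stopwords;

    public CustomAnalyzer(CharArraySet stopwords) {
        this.stopwords = stopwords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // provide your own tokenizer, that will split input string as you want it
        final Tokenizer standardTokenizer = new MyStandardTokenizer();

        TokenStream tok = new StandardFilter(standardTokenizer);
        // make everything lowercase, remove if not needed
        tok = new LowerCaseFilter(tok);
        // provide stopwords if you want them
        tok = new StopFilter(tok, stopwords);
        return new TokenStreamComponents(standardTokenizer, tok);
    }

    // final, because Lucene expects incrementToken() implementations to be final
    private static final class MyStandardTokenizer extends Tokenizer {

        @Override
        public boolean incrementToken() throws IOException {
            // mimic the logic of standard analyzer and add your rules
            return false;
        }
    }
}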

I put everything into one class just to make it easier to post here. In general, you need your own logic in MyStandardTokenizer: for example, you could copy the code from the Lucene sources behind StandardAnalyzer (it's final, so again no extending) and then add what you need for the dashes in incrementToken. Note that the stub above returns false straight away, so it produces no tokens until you fill it in. Hope it helps.
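To give an idea of the dash rule itself, here is a rough plain-Java sketch with no Lucene dependency. The class DashTokenSketch and its tokenize method are hypothetical names, and treating only letters and digits as word characters is a simplifying assumption; it is not the Unicode segmentation the standard tokenizer performs. In a real Tokenizer the same scan would run inside incrementToken, emitting one token per call instead of building a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, only illustrates the splitting rule from the question.
final class DashTokenSketch {

    /** Returns dash-delimited tokens (dashes kept) and plain word tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '-') {
                // scan the word characters that follow the opening dash
                int j = i + 1;
                while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) {
                    j++;
                }
                boolean closed = j > i + 1 && j < text.length() && text.charAt(j) == '-';
                if (closed) {
                    // keep both dashes as part of the token
                    tokens.add(text.substring(i, j + 1));
                    i = j + 1;
                } else {
                    // stray dash: skip it like any other punctuation
                    i++;
                }
            } else if (Character.isLetterOrDigit(c)) {
                // ordinary word: runs until the first non-word character
                int j = i;
                while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) {
                    j++;
                }
                tokens.add(text.substring(i, j));
                i = j;
            } else {
                // whitespace and punctuation such as '?' are dropped
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [-apple-, -carrot-, tomato]
        System.out.println(tokenize("-apple--carrot- tomato?"));
    }
}
```

A dash that is not closed by a matching dash right after a word (as in "a-b") is simply dropped here, which splits the text into "a" and "b"; adjust that branch if you need different behaviour.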

2 Comments

Thank you for your detailed answer. One thing I didn't understand, though: you said I should mimic the logic of StandardAnalyzer inside incrementToken. Did you mean I should copy the code from the sources and then add my logic? That code is very complex, and I doubt I'll be able to understand it well enough to know where and how to add my logic. Or did you just mean I should write my own code to achieve the same result?
Both ways should be okay; what matters is that your implementation gives you the expected behaviour.
