I have an array of strings like this one (from Twitter):

String str = "The Green New Deal is viable. It is the same vision that FDR had for his New Deal programs: nationwide mobilization http://94739 #thegreendeal #nationwide";

What I want is to 1) split this string into an array of words, 2) remove stop words and apply stemming, and 3) remove all non-letter characters except '#', which marks a term as a hashtag.

I have tried this cool library, https://github.com/uttesh/exude, which removes stop words, applies stemming, lowercases the text, and strips special characters. The problem is that it also strips the '#' from the hashtags. The code for this:

String tweetString = ExudeData.getInstance().filterStoppingsKeepDuplicates(str);

I have also tried this:

String[] wordArray = str.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");

But this also strips the '#' from the hashtags. Is there a workaround with either method that keeps the hashtags? (I'd prefer to keep using the exude library for this.)

  • Extract the hashtags before processing. Append back in after processing if needed. Commented Mar 4, 2019 at 17:39
  • Great idea. Can you show me what this looks like, please? Commented Mar 4, 2019 at 18:05
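
The extract-first idea from the comment above could look roughly like the sketch below. It assumes a hashtag is '#' followed by letters, digits, or underscores, and it takes the cleaning step as a parameter so that any cleaner can be plugged in. The class name HashtagPreserver, the method process, and the stand-in regex cleaner are hypothetical names chosen for illustration; none of them come from the exude library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagPreserver {
    // Assumption: a hashtag is '#' followed by letters, digits, or underscores.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    static List<String> process(String text, UnaryOperator<String> cleaner) {
        // 1) Collect the hashtags before any cleaning can strip the '#'.
        List<String> hashtags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            hashtags.add(m.group().toLowerCase());
        }

        // 2) Blank out the hashtags and run the cleaner on the remaining text.
        String withoutTags = HASHTAG.matcher(text).replaceAll(" ");
        String cleaned = cleaner.apply(withoutTags).trim();

        // 3) Split the cleaned text into words and append the hashtags.
        List<String> tokens = new ArrayList<>();
        if (!cleaned.isEmpty()) {
            tokens.addAll(Arrays.asList(cleaned.split("\\s+")));
        }
        tokens.addAll(hashtags);
        return tokens;
    }

    public static void main(String[] args) {
        String str = "nationwide mobilization http://94739 #thegreendeal #nationwide";
        // Stand-in cleaner: the regex from the question, not the exude call.
        UnaryOperator<String> cleaner =
                s -> s.replaceAll("[^a-zA-Z ]", "").toLowerCase();
        System.out.println(process(str, cleaner));
    }
}
```

To keep using exude, the stand-in cleaner could be replaced by a lambda that wraps the filterStoppingsKeepDuplicates call from the question. That call may need a try/catch inside the lambda if it declares a checked exception. Note that the hashtags end up at the end of the list, so their original positions in the tweet are not preserved.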

1 Answer

Using the regex method, you can add # to the set of characters that should not be removed, like this:

String[] wordArray = str.replaceAll("[^a-zA-Z #]", "").toLowerCase().split("\\s+");
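
For reference, here is a small runnable sketch of the same one-liner wrapped in a method, applied to the tail of the tweet from the question. The class name HashtagTokenizer and the added trim() call are assumptions made for illustration. This only covers the character filtering and splitting; it does not remove stop words or apply stemming, and a URL such as http://94739 is reduced to the token http.

```java
import java.util.Arrays;

class HashtagTokenizer {
    // Drops everything except ASCII letters, spaces, and '#',
    // lowercases the result, and splits it on runs of whitespace.
    static String[] tokenize(String text) {
        return text.replaceAll("[^a-zA-Z #]", "")
                   .toLowerCase()
                   .trim() // avoids an empty first token when the text starts with a space
                   .split("\\s+");
    }

    public static void main(String[] args) {
        String str = "nationwide mobilization http://94739 #thegreendeal #nationwide";
        // prints [nationwide, mobilization, http, #thegreendeal, #nationwide]
        System.out.println(Arrays.toString(tokenize(str)));
    }
}
```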