I have a parsing question. I have sentences stored as Strings. I want to grab each word in each sentence, but I would like to filter which words I grab. For example, say I have a sentence like the following:

Hell0 3v3ryb0dy @ stackoverflow $people \implies queen$ equals ~queen --> ~people. /#logic

I would do the following:

  1. grab 'Hell0'
  2. grab '3v3ryb0dy'
  3. throw away the @
  4. grab 'people' from '$people'
  5. grab 'implies' from '\implies'
  6. grab 'queen' from 'queen$'
  7. grab 'equals'
  8. grab 'queen' from '~queen'
  9. throw away -->
  10. grab 'people' from '~people'
  11. grab 'logic' from '/#logic'

Essentially I want only alphanumeric characters, and whenever some other character such as a \ appears before or after a word, I want to disregard it.

Currently I am doing: sentence.split(" ")

This gets the individual words from the sentence, but it grabs '$people' and '~people' and treats them as different words when I want them to be treated the same.

  1. How can I achieve this?
  2. Would a regex help me here?

2 Answers

Split the string with the regex \\W+ (written here as a Java string literal), which splits at each run of one or more non-word characters.

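import java.util.Arrays;

// The backslash before "implies" is doubled because this is a Java string literal.
String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
// Split at each run of one or more non-word characters.
String[] split = sentence.split("\\W+");
System.out.println(Arrays.asList(split));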

Output

[Hell0, 3v3ryb0dy, stackoverflow, people, implies, queen, equals, queen, people, logic]
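A side note that is not part of the original answer: if the sentence starts with a non-word character (for example "$people ..."), split("\\W+") returns an empty string as its first element, and \W treats the underscore as a word character. A minimal sketch that sidesteps both by matching the words themselves instead of splitting on the separators is shown here. The names WordExtractor and extractWords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // Runs of ASCII letters and digits; unlike \w, this class excludes '_'.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Collects every alphanumeric run in the sentence, in order of appearance.
    static List<String> extractWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
        // prints [Hell0, 3v3ryb0dy, stackoverflow, people, implies, queen, equals, queen, people, logic]
        System.out.println(extractWords(sentence));
    }
}
```

Because nothing is split, there is no empty leading element to filter out when the input begins with punctuation.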

I am using this regex: [^A-Za-z0-9 ]+ (edited), and the output I get is:
Hell0 3v3ryb0dy stackoverflow people implies queen equals queen people logic

Is this what you are expecting?

Snippet from myregextester:

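import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    // The sentence from the question; the backslash is doubled for the Java string literal.
    String sourcestring = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
    // Match runs of characters that are not ASCII letters, digits, or spaces.
    // The class already lists both cases, so Pattern.CASE_INSENSITIVE is not needed.
    Pattern re = Pattern.compile("[^A-Za-z0-9 ]+");
    Matcher m = re.matcher(sourcestring);
    // Delete every match, leaving only letters, digits, and spaces.
    String result = m.replaceAll("");
    System.out.println(result);
  }
}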

$sourcestring after replacement:
Hell0 3v3ryb0dy stackoverflow people implies queen equals queen people logic
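If you still want an array of words, one possible sketch is to delete the unwanted characters first and split afterwards. Deleting a standalone token such as @ leaves two spaces next to each other, so splitting on a single " " would produce empty strings; the sketch splits on " +" instead. It assumes words are separated by plain spaces only, and the names CleanThenSplit and words are hypothetical.

```java
class CleanThenSplit {
    // Removes everything except ASCII letters, digits, and spaces,
    // then splits the remainder on runs of one or more spaces.
    static String[] words(String sentence) {
        String cleaned = sentence.replaceAll("[^A-Za-z0-9 ]+", "").trim();
        if (cleaned.isEmpty()) {
            // Avoid returning a single empty string for input with no words.
            return new String[0];
        }
        return cleaned.split(" +");
    }

    public static void main(String[] args) {
        String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
        for (String word : words(sentence)) {
            System.out.println(word);
        }
    }
}
```

One difference from splitting on \\W+ is worth keeping in mind: because characters are deleted rather than used as separators, an input like "a-b" becomes the single word "ab" here.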

11 Comments

I still want to split my sentence by spaces. So say I do the following: wordsInSentence = sentence.split(" "); I would like this code to also filter out non-word characters, as described in my question.
@CodeKingPlusPlus: Did you try my answer?
-1. With this I got this output: [, H, e, l, l, 0, , 3, v, 3, r, y, b, 0, d, y, , , , s, t, a, c, k, o, v, e, r, f, l, o, w, , , p, e, o, p, l, e, , , i, m, p, l, i, e, s, , q, u, e, e, n, , , e, q, u, a, l, s, , , q, u, e, e, n, , , , , p, e, o, p, l, e, , , , l, o, g, i, c].
@CodeKingPlusPlus Why don't you try the regex [^A-Za-z0-9]* on each split word?
You need to change the * to +: [^A-Za-z0-9 ]+. Your regex can match nothing, meaning it will match at every character boundary regardless of what follows it. If it happens to see any of the unwanted characters it will consume them, but something or nothing, it will always match. The regex in your comment ([^A-Za-z0-9]*) has the same problem. It won't throw an exception or return incorrect results, but it's doing a lot of work it doesn't need to do.
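To make the difference between * and + concrete, here is a rough sketch that counts how many times each pattern matches the same input. The class StarVersusPlus and the helper countMatches are illustrative names only. With replaceAll the final string comes out the same either way, which fits the point above that the * version mostly wastes work; with split, the empty matches are what break the string into single characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVersusPlus {
    // Counts how many times the regex matches, including empty matches.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "$people ~queen";

        // '*' also matches the empty string, so it matches at many positions.
        System.out.println(countMatches("[^A-Za-z0-9 ]*", input));
        // '+' needs at least one unwanted character: only "$" and "~" match here.
        System.out.println(countMatches("[^A-Za-z0-9 ]+", input));

        // Replacing the matches with "" gives the same text for both patterns.
        System.out.println(input.replaceAll("[^A-Za-z0-9 ]*", ""));
        System.out.println(input.replaceAll("[^A-Za-z0-9 ]+", ""));
    }
}
```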