Java String.split() with a regex

Question

I have a parsing question. I have sentences stored as Strings. I want to grab each word in each sentence, but I would like to filter which words I grab. For example, say I have a sentence like the following:

Hell0 3v3ryb0dy @ stackoverflow $people \implies queen$ equals ~queen --> ~people. /#logic

I would do the following:

grab 'Hell0'
grab '3v3ryb0dy'
throw away the @
grab 'people' from '$people'
grab 'implies' from '\implies'
grab 'queen' from 'queen$'
grab 'equals'
grab 'queen' from '~queen'
throw away -->
grab 'people' from '~people'
grab 'logic' from '/#logic'

Essentially, I want only alphanumeric characters. Whenever some other character, such as a \, appears before or after a word, I want to discard that character.

Currently I am doing: sentence.split(" ")

This gets the individual words from the sentence, but it returns '$people' and '~people' as different tokens when I want them to be treated the same.

How can I achieve this?
Would a regex help me here?

Bhesh Gurung · Accepted Answer · 2012-11-05 17:21:56Z

4

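Split the string with the regex \\W+ (written as it appears in a Java string literal). It splits at every run of one or more non-word characters, that is, anything other than letters, digits, and the underscore.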

import java.util.Arrays;

// "\\implies" in Java source is the text \implies
String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
String[] split = sentence.split("\\W+");
System.out.println(Arrays.asList(split));

Output

[Hell0, 3v3ryb0dy, stackoverflow, people, implies, queen, equals, queen, people, logic]
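One caveat worth illustrating: if a sentence starts with a non-word character (for example '$people ...'), split("\\W+") returns an empty string as its first element, and \W also treats the underscore as part of a word. A minimal sketch of an alternative that matches the wanted tokens instead of splitting on the unwanted ones is below. The class name WordExtractor and the method words are hypothetical and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class WordExtractor {
    // One or more ASCII letters or digits; unlike \w, this excludes '_'.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Collects every alphanumeric run in the sentence, in order.
    static List<String> words(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // The leading '$' would give split("\\W+") an empty first element.
        System.out.println(words("$people \\implies queen$")); // prints [people, implies, queen]
    }
}
```

Because only matches are collected, no empty strings can appear in the result, whatever the sentence starts or ends with.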

edited Nov 5, 2012 at 17:21

answered Nov 5, 2012 at 2:40

Bhesh Gurung


Srinivas · Accepted Answer · 2012-11-11 08:09:58Z

1

I am using this regex: [^A-Za-z0-9 ]+ (edited). The output I get is:
Hell0 3v3ryb0dy stackoverflow people implies queen equals queen people logic

Is this what you are expecting?

Snippet from myregextester:

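import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] asd) {
    // Placeholder input; substitute the sentence to be cleaned.
    String sourcestring = "source string to match with pattern";
    // Matches runs of anything other than ASCII letters, digits, and spaces.
    Pattern re = Pattern.compile("[^A-Za-z0-9 ]+");
    Matcher m = re.matcher(sourcestring);
    // Delete every matched run, keeping the words and the spaces between them.
    String result = m.replaceAll("");
    System.out.println(result);
  }
}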

$sourcestring after replacement:
Hell0 3v3ryb0dy stackoverflow people implies queen equals queen people logic
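To get an array of words out of the cleaned string, it still has to be split. A standalone symbol such as '@' or '-->' leaves two adjacent spaces behind once it is removed, so splitting on a single space would produce empty strings. The sketch below, with the hypothetical class name CleanThenSplit, splits on runs of whitespace instead.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only.
class CleanThenSplit {
    static String[] words(String sentence) {
        // Remove everything except ASCII letters, digits, and spaces.
        String cleaned = sentence.replaceAll("[^A-Za-z0-9 ]+", "");
        // Removed standalone symbols leave consecutive spaces behind, so split
        // on runs of whitespace; trim() avoids a leading empty token.
        return cleaned.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
        System.out.println(Arrays.toString(words(sentence)));
    }
}
```

Note that this approach deletes unwanted characters rather than treating them as separators, so an input like 'a-b' becomes the single word 'ab', whereas splitting on non-word characters would yield 'a' and 'b'.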

edited Nov 11, 2012 at 8:09

answered Nov 5, 2012 at 2:33

Srinivas


11 Comments

CodeKingPlusPlus Over a year ago

I still want to split my sentence by spaces. So say I do the following: wordsInSentence = sentence.split(" "); I would like this code to also filter out non-word characters, with the functionality described in my question.

Bhesh Gurung Over a year ago

@CodeKingPlusPlus: Did you try my answer?

Bhesh Gurung Over a year ago

-1. With this I got this output:

[, H, e, l, l, 0,  , 3, v, 3, r, y, b, 0, d, y,  , ,  , s, t, a, c, k, o, v, e, r, f, l, o, w,  , , p, e, o, p, l, e,  , , i, m, p, l, i, e, s,  , q, u, e, e, n, ,  , e, q, u, a, l, s,  , , q, u, e, e, n,  , ,  , , p, e, o, p, l, e, ,  , , l, o, g, i, c]


Srinivas Over a year ago

@CodeKingPlusPlus Why don't you try the regex [^A-Za-z0-9]* on each split word?

Alan Moore Over a year ago

You need to change the * to +: [^A-Za-z0-9 ]+. Your regex can match nothing, meaning it will match at every character boundary regardless of what follows it. If it happens to see any of the unwanted characters it will consume them, but something or nothing, it will always match. The regex in your comment ([^A-Za-z0-9]*) has the same problem. It won't throw an exception or return incorrect results, but it's doing a lot of work it doesn't need to do.
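The per-word approach discussed in these comments (keep sentence.split(" "), then clean each token) could look roughly like the sketch below. It uses the + quantifier as recommended above and drops tokens that become empty, such as '@' and '-->'. The class name SplitThenFilter is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only.
class SplitThenFilter {
    static List<String> words(String sentence) {
        return Arrays.stream(sentence.split(" "))
                // Strip every run of non-alphanumeric characters from the token.
                .map(token -> token.replaceAll("[^A-Za-z0-9]+", ""))
                // Tokens made only of symbols are now empty; discard them.
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sentence = "Hell0 3v3ryb0dy @ stackoverflow $people \\implies queen$ equals ~queen --> ~people. /#logic";
        System.out.println(words(sentence));
    }
}
```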
