Finding all substrings in a string in C# (Regex, Char Array?) [duplicate]

Question

I need to identify substrings found in a string such as:

"CityABCProcess Test" or "cityABCProcess Test"

to yield: [ "City" or "city", "ABC", "Process", "Test" ]

The first substring may start with either a lowercase or an uppercase letter.
A run of consecutive uppercase letters is one substring; it ends at a space or just before the capital that starts the next word: "ABCProcess" -> "ABC", "ABC Process" -> "ABC".
An uppercase letter followed by a lowercase letter starts a substring that runs until the next uppercase letter.

Can this be handled by regex, or should I convert my strings to a character array and check these cases manually with some indexing logic? Would a lambda solution work here? What is the best way to go about this?

This is largely a matter of opinion, but IMO, when in doubt, don't use regex. It may be faster (and if speed is a major concern, it might be worth considering), but maintaining it is usually a headache.
– user2366842, Commented Jul 17, 2015 at 15:31
"\\p{Lu}+" would be a starting point for your regex... But it will likely be easier to just write the code by hand. (Note that a string is already an indexable sequence of characters.) stackoverflow.com/questions/18125738/… may be of help.
– Alexei Levenkov, Commented Jul 17, 2015 at 15:43
Implement a method that loops over all characters in a for-loop and fills a StringBuilder.
– Tim Schmelter, Commented Jul 17, 2015 at 15:44
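
The hand-written loop suggested in the comments could look roughly like the sketch below. It is only one way to encode the three rules from the question. The names ManualSplitter and Flush are made up for illustration, and treating every non-letter (spaces, digits, punctuation) as a separator is an assumption the question does not settle.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not taken from the question or its comments.
public static class ManualSplitter
{
    public static List<string> Split(string input)
    {
        var parts = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            // Assumption: anything that is not a letter ends the current substring
            // and is itself discarded.
            if (!char.IsLetter(c))
            {
                Flush(parts, current);
                continue;
            }

            // An uppercase letter starts a new substring when it follows a lowercase
            // letter ("yA" in "CityABC") or when it ends an uppercase run because the
            // next letter is lowercase ("CP" followed by "r" in "ABCProcess").
            if (char.IsUpper(c) && current.Length > 0)
            {
                char prev = input[i - 1]; // a letter, because current is not empty
                bool nextIsLower = i + 1 < input.Length && char.IsLower(input[i + 1]);
                if (char.IsLower(prev) || (char.IsUpper(prev) && nextIsLower))
                {
                    Flush(parts, current);
                }
            }

            current.Append(c);
        }

        Flush(parts, current);
        return parts;
    }

    // Moves the pending substring, if any, into the result list.
    private static void Flush(List<string> parts, StringBuilder current)
    {
        if (current.Length > 0)
        {
            parts.Add(current.ToString());
            current.Clear();
        }
    }

    public static void Main()
    {
        // Prints: City, ABC, Process, Test
        Console.WriteLine(string.Join(", ", Split("CityABCProcess Test")));
        // Prints: city, ABC, Process, Test
        Console.WriteLine(string.Join(", ", Split("cityABCProcess Test")));
    }
}
```

The loop is longer than a regex, but each rule sits in its own condition, which may make later changes (for example, deciding what to do with digits) easier to reason about.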

Steven Doggart · Accepted Answer · 2015-07-17 15:54:40Z

3

Pay no attention to the naysayers! Even something like this really isn't that complicated with RegEx. I believe this pattern should do the trick:

[A-Z][a-z]+|[A-Z]+\b|[A-Z]+(?=[A-Z])|[a-z]+

See here for a working demonstration. It's just a series of alternatives (ORs) that the regex engine tries in order, left to right. Here's the breakdown:

[A-Z][a-z]+ - Any word that starts with one uppercase letter followed by one or more lowercase letters
[A-Z]+\b - Any all-uppercase word that ends at a word boundary (this keeps the last uppercase letter of the run, which the next alternative would leave out)
[A-Z]+(?=[A-Z]) - Any all-uppercase run that is followed by another uppercase letter, so the first letter of the next word is not included
[a-z]+ - Any word that's all lowercase

For instance:

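using System;
using System.Text;
using System.Text.RegularExpressions;

string input = "CityABCProcess TEST";
// Build the pattern one alternative at a time; they are tried left to right.
StringBuilder builder = new StringBuilder();
builder.Append("[A-Z][a-z]+");      // capitalized word
builder.Append("|");
builder.Append(@"[A-Z]+\b");        // uppercase run ending at a word boundary
builder.Append("|");
builder.Append("[A-Z]+(?=[A-Z])");  // uppercase run followed by another uppercase letter
builder.Append("|");
builder.Append("[a-z]+");           // lowercase word
// Prints City, ABC, Process and TEST, one per line.
foreach (Match m in Regex.Matches(input, builder.ToString()))
{
    Console.WriteLine(m.Value);
}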
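
Whether the second alternative ends in $ or in \b matters as soon as an all-uppercase word is followed by a space. The sketch below runs both variants on "City ABC Process" so the difference can be seen directly. The class name BoundaryDemo and its members are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing two variants of the pattern.
public static class BoundaryDemo
{
    // Second alternative anchored to the end of the input.
    public const string DollarPattern = @"[A-Z][a-z]+|[A-Z]+$|[A-Z]+(?=[A-Z])|[a-z]+";

    // Second alternative anchored to a word boundary.
    public const string BoundaryPattern = @"[A-Z][a-z]+|[A-Z]+\b|[A-Z]+(?=[A-Z])|[a-z]+";

    // Returns every match of the pattern in the input, in order.
    public static string[] Split(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string input = "City ABC Process";

        // With $, "ABC" is not at the end of the input, so the lookahead
        // alternative takes over and stops one letter short.
        Console.WriteLine(string.Join(", ", Split(input, DollarPattern)));   // City, AB, Process

        // With \b, the space after "ABC" counts as a boundary.
        Console.WriteLine(string.Join(", ", Split(input, BoundaryPattern))); // City, ABC, Process
    }
}
```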

edited Jul 17, 2015 at 15:54

answered Jul 17, 2015 at 15:46


6 Comments

Equalsk Over a year ago

Can confirm, @StevenDoggart is some kind of wizard :-)

Pipeline Over a year ago

This is so close! Thanks for the reply. I am testing it on regexr.com and found that it does not work for "City ABC Process": it only gets "AB" instead of "ABC". It also leaves out numbers, as in "Process 1".

Phylogenesis Over a year ago

This example shows something similar.

Steven Doggart Over a year ago

Good point. Fixed my answer to use \b instead of $ to correct that.

Steven Doggart Over a year ago

You could use ([A-Z]|[1-9]) for uppercase and ([a-z]|[1-9]) for lowercase
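
One possible way to handle digits is sketched below. It does not follow the suggestion in the comment above literally: [1-9] as written would not match 0, and folding digits into the letter classes would glue them onto neighbouring words. Instead, this hypothetical variant adds [0-9]+ as its own alternative and swaps the \b and lookahead alternatives for a single negative lookahead, because \b treats digits as word characters and would not see a boundary between "ABC" and "1". The class name DigitAwareSplitter and the pattern itself are assumptions for illustration, not the pattern from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical variant, not the pattern from the answer.
public static class DigitAwareSplitter
{
    //   [A-Z][a-z]+      capitalized word
    //   [A-Z]+(?![a-z])  uppercase run that is not followed by a lowercase letter,
    //                    so it stops before the capital of the next word
    //   [a-z]+           lowercase word
    //   [0-9]+           run of digits, reported as its own substring
    private static readonly Regex Tokens =
        new Regex(@"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|[0-9]+");

    // Returns every substring matched by the pattern, in order.
    public static string[] Split(string input)
    {
        return Tokens.Matches(input)
                     .Cast<Match>()
                     .Select(m => m.Value)
                     .ToArray();
    }

    public static void Main()
    {
        // Prints: City, ABC, Process, 1
        Console.WriteLine(string.Join(", ", Split("City ABC Process 1")));
        // Prints: city, ABC, Process, Test
        Console.WriteLine(string.Join(", ", Split("cityABCProcess Test")));
    }
}
```

Whether a digit run should stand alone or stay attached to the preceding word ("Process1") depends on the data, so the last alternative is the part most likely to need adjusting.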
