I need to identify substrings found in a string such as:

"CityABCProcess Test" or "cityABCProcess Test"

to yield: [ "City" or "city", "ABC", "Process", "Test" ]

  1. The first substring may start with either a lowercase or an uppercase letter.
  2. A run of consecutive uppercase letters forms one substring, which ends where a word continuing in lowercase begins or where a space is found: "ABCProcess" -> "ABC", "ABC Process" -> "ABC".
  3. An uppercase letter followed by a lowercase letter starts a substring that runs until the next uppercase letter.

Can this be handled by regex? Or should I convert my strings to a character array and manually check these cases using some indexing logic? Would a lambda solution work here? What is the best way to go about this?

  • This is going to be largely a matter of opinion, but IMO, when in doubt, don't use regex. It may be faster (and if speed is a huge concern, it might be worth considering), but maintaining it is usually a headache. Commented Jul 17, 2015 at 15:31
  • Now you have two problems. Commented Jul 17, 2015 at 15:36
  • "\\p{Lu}+" would be a starting point for your regex, but it will likely be easier to just write the code by hand. (Note that a string is already an indexable sequence of characters.) stackoverflow.com/questions/18125738/… may be of help. Commented Jul 17, 2015 at 15:43
  • Implement a method that loops all characters in a for-loop and fills a StringBuilder. Commented Jul 17, 2015 at 15:44
  • @user2366842: in most cases regex is the slowest option. Commented Jul 17, 2015 at 15:46
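
For comparison, the hand-written loop suggested in the comments could look roughly like the sketch below. It is only one reading of the three rules in the question: the class and method names (`ManualSplitter`, `Split`, `Flush`) are made up for illustration, and it assumes that anything that is not a letter (spaces, digits, punctuation) simply ends the current substring and is dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not taken from the original post.
public static class ManualSplitter
{
    public static List<string> Split(string input)
    {
        var parts = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (!char.IsLetter(c))
            {
                // Assumption: non-letters only separate substrings and are not kept.
                Flush(parts, current);
                continue;
            }

            if (current.Length > 0 && char.IsUpper(c))
            {
                // current is non-empty, so input[i - 1] is the letter appended last.
                bool prevIsLower = char.IsLower(input[i - 1]);
                bool nextIsLower = i + 1 < input.Length && char.IsLower(input[i + 1]);

                // "y|A" in cityABC, or "C|P" in ABCProcess (P is followed by lowercase).
                if (prevIsLower || nextIsLower)
                {
                    Flush(parts, current);
                }
            }

            current.Append(c);
        }

        Flush(parts, current);
        return parts;
    }

    // Moves the buffered characters into the result list, if there are any.
    private static void Flush(List<string> parts, StringBuilder current)
    {
        if (current.Length > 0)
        {
            parts.Add(current.ToString());
            current.Clear();
        }
    }
}
```

Calling `ManualSplitter.Split("cityABCProcess Test")` should give "city", "ABC", "Process", "Test". The loop is longer than a regex, but each rule from the question maps to one visible condition.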

1 Answer

Pay no attention to the naysayers! Even something like this really isn't that complicated with RegEx. I believe this pattern should do the trick:

[A-Z][a-z]+|[A-Z]+\b|[A-Z]+(?=[A-Z])|[a-z]+

See here for a working demonstration. The pattern is just a series of alternatives, tried in order at each position. Here's the breakdown:

  • [A-Z][a-z]+ - Any word that starts with an uppercase letter followed by one or more lowercase letters
  • [A-Z]+\b - Any run of uppercase letters that ends at a word boundary (listed before the next alternative so that the last uppercase letter, which that alternative would drop, is included)
  • [A-Z]+(?=[A-Z]) - Any run of uppercase letters that is followed by another uppercase letter; the lookahead leaves out that letter because it starts the next word
  • [a-z]+ - Any word that's all lowercase

For instance:

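using System;
using System.Text;
using System.Text.RegularExpressions;

string input = "CityABCProcess TEST";
// Build the pattern from its alternatives; the engine tries them left to right.
StringBuilder builder = new StringBuilder();
builder.Append("[A-Z][a-z]+");        // capitalized word, e.g. "City"
builder.Append("|");
builder.Append(@"[A-Z]+\b");          // uppercase run ending at a word boundary, e.g. "TEST"
builder.Append("|");
builder.Append("[A-Z]+(?=[A-Z])");    // uppercase run before the next word, e.g. "ABC"
builder.Append("|");
builder.Append("[a-z]+");             // lowercase word, e.g. "city"
foreach (Match m in Regex.Matches(input, builder.ToString()))
{
    Console.WriteLine(m.Value);       // prints City, ABC, Process, TEST on separate lines
}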

6 Comments

Can confirm, @StevenDoggart is some kind of wizard :-)
This is so close! Thanks for the reply. I am testing it using regexr.com. I found it does not work in the case "City ABC Process": it only gets "AB" instead of "ABC". It also leaves out numbers, as in "Process 1".
This example shows something similar.
Good point. Fixed my answer to use \b instead of $ to correct that.
You could use ([A-Z]|[1-9]) for uppercase and ([a-z]|[1-9]) for lowercase
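
Following up on the digits question, here is a rough sketch of one way the pattern could be extended. It is a variant, not the answer's pattern: it treats a run of digits as its own substring (using `[0-9]` so that zero is included) and replaces `\b` with a negative lookahead, because .NET counts digits as word characters and so sees no boundary between "ABC" and "1" in "ABC1". The names `DigitAwareSplitter` and `Split` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical variant of the answer's pattern; the digit handling is an assumption.
public static class DigitAwareSplitter
{
    private static readonly Regex Pattern = new Regex(
        "[A-Z][a-z]+" +        // capitalized word, e.g. "City"
        "|[A-Z]+(?![a-z])" +   // uppercase run; backs off one letter if lowercase follows
        "|[a-z]+" +            // lowercase word, e.g. "city"
        "|[0-9]+");            // run of digits, e.g. "1"

    public static List<string> Split(string input)
    {
        var parts = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            parts.Add(m.Value);
        }
        return parts;
    }
}
```

With this sketch, `DigitAwareSplitter.Split("City ABC Process 1")` should give "City", "ABC", "Process", "1", and "ABC123Process" should split into "ABC", "123", "Process". Whether digits should stand alone or attach to the neighboring word depends on the data, so treat the last alternative as a starting point.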