Split a PascalCase string into separate words

Question

I am looking for a way to split PascalCase strings, e.g. "MyString", into separate words - "My", "String". Another user posed the question for bash, but I want to know how to do it with general regular expressions or at least in .NET.

Bonus if you can find a way to also split (and optionally capitalize) camelCase strings: e.g. "myString" becomes "my" and "String", with the option to capitalize/lowercase either or both of the strings.

Possible duplicate of "is there a elegant way to parse a word and add spaces before capital letters".
– Ken Bloom, Commented Jul 9, 2010 at 20:20
This question is specific to .NET, but the regex answers could be applied elsewhere.
– Pat, Commented Jul 9, 2010 at 22:30
Check out the dupe question: the accepted answer has the regex to split AnXMLAndXSLT2.0Tool to [An][XML][And][XSLT][2.0][Tool]. It uses lookarounds, which one can argue are quite readable.
– polygenelubricants, Commented Jul 10, 2010 at 3:44

Community · Accepted Answer · 2017-05-23 11:47:32Z

29

See this question: Is there a elegant way to parse a word and add spaces before capital letters? Its accepted answer covers what you want, including numbers and several uppercase letters in a row. While this sample has words starting in uppercase, it is equally valid when the first word is in lowercase.

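// Requires: using System; using System.Text.RegularExpressions;
string[] tests = {
   "AutomaticTrackingSystem",
   "XMLEditor",
   "AnXMLAndXSLT2.0Tool",
};

// Zero-width boundaries: an acronym followed by a word, a non-uppercase
// character followed by uppercase, and a letter followed by a non-letter.
Regex r = new Regex(
    @"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])"
  );

// Split at each boundary and print the parts in brackets.
foreach (string s in tests)
  Console.WriteLine("[" + string.Join("][", r.Split(s)) + "]");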

The above will output:

[Automatic][Tracking][System]
[XML][Editor]
[An][XML][And][XSLT][2.0][Tool]
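
If the words themselves are needed rather than a bracketed string, the same pattern can be handed to Regex.Split. The following is only a minimal sketch: the class and method names (PascalCaseSplitter, SplitWords, Capitalize) are made up for illustration and are not part of .NET or of the linked answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PascalCaseSplitter
{
    // Same pattern as above: zero-width positions where a new word starts.
    private static readonly Regex Boundary = new Regex(
        @"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])");

    // Hypothetical helper: splits at every boundary; empty pieces are dropped defensively.
    public static string[] SplitWords(string input)
    {
        if (string.IsNullOrEmpty(input)) return new string[0];
        return Boundary.Split(input).Where(w => w.Length > 0).ToArray();
    }

    // Hypothetical helper for the "bonus" part of the question:
    // upper-cases the first character of a word ("my" -> "My").
    public static string Capitalize(string word)
    {
        if (string.IsNullOrEmpty(word)) return word;
        return char.ToUpperInvariant(word[0]) + word.Substring(1);
    }
}
```

For example, PascalCaseSplitter.SplitWords("myString").Select(w => PascalCaseSplitter.Capitalize(w)) should yield "My" and "String".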

edited May 23, 2017 at 11:47

answered Jul 9, 2010 at 20:11

chilltemp

8 Comments

chilltemp Over a year ago

@Steven Sudit: Yes. RegEx is one of the best tools for this type of problem. The other question just got fleshed out with a larger set of sample use cases.

Shimmy Weitzhandler Over a year ago

@chilltemp, do you know of a built-in function for it?

chilltemp Over a year ago

@Shimmy: No. I'd recommend that you use the information in the linked question to create a reusable library.

Shimmy Weitzhandler Over a year ago

I made my own function that doesn't use regex.

chilltemp Over a year ago

@Shimmy: Performance varies greatly depending upon many factors, including how complex the RegEx is and whether it is compiled. Just like the performance of C# varies depending upon how you use it. That being said, I've always found RegEx in .NET to be fast enough for my needs (real-time transactional system with high throughput). The only way to really compare is to look at the generated IL and/or do timed test runs.

Andy Rose · Accepted Answer · 2010-11-02 16:09:46Z

14

Just to provide an alternative to the RegEx and looping solutions already provided, here is an answer using LINQ, which also handles camel case and acronyms:

    string[] testCollection = new string[] { "AutomaticTrackingSystem", "XSLT", "aCamelCaseWord" };
    foreach (string test in testCollection)
    {
        // if it is not the first character and it is uppercase
        //  and the previous character is not uppercase then insert a space
        var result = test.SelectMany((c, i) => i != 0 && char.IsUpper(c) && !char.IsUpper(test[i - 1]) ? new char[] { ' ', c } : new char[] { c });
        Console.WriteLine(new String(result.ToArray()));
    }

The output from this is:

Automatic Tracking System  
XSLT  
a Camel Case Word

answered Nov 2, 2010 at 16:09

Andy Rose

2 Comments

kzu Over a year ago

This is my absolute favorite :)

dvlsg Over a year ago

Worth noting that this doesn't work for acronyms mixed with other words, if the expectation is to treat the acronym as its own word. For example, HTTPResponseException converts to HTTPResponse Exception.
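
One possible way to address this, sketched here as an assumption rather than as part of the original answer, is to also start a new word when an uppercase letter follows another uppercase letter but is itself followed by a lowercase letter. The names LinqWordSplitter and InsertSpaces are hypothetical.

```csharp
using System;
using System.Linq;

public static class LinqWordSplitter
{
    // Inserts a space before an uppercase letter when either
    //  - the previous character is not uppercase (start of a normal word), or
    //  - the previous character is uppercase and the next one is lowercase
    //    (last capital of an acronym run that begins the next word).
    public static string InsertSpaces(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;
        var chars = text.SelectMany((c, i) =>
        {
            bool startsWord = i != 0 && char.IsUpper(c) &&
                (!char.IsUpper(text[i - 1]) ||
                 (i + 1 < text.Length && char.IsLower(text[i + 1])));
            return startsWord ? new[] { ' ', c } : new[] { c };
        });
        return new string(chars.ToArray());
    }
}
```

With this rule, HTTPResponseException should come out as HTTP Response Exception, while XSLT and aCamelCaseWord behave as in the answer above.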

Community · Accepted Answer · 2017-05-23 10:30:37Z

9

Answered in a different question:

void Main()
{
    "aCamelCaseWord".ToFriendlyCase().Dump();
}

public static class Extensions
{
    public static string ToFriendlyCase(this string PascalString)
    {
        return Regex.Replace(PascalString, "(?!^)([A-Z])", " $1");
    }
}

Outputs "a Camel Case Word" (.Dump() is a LINQPad extension method; here it just prints the value).

edited May 23, 2017 at 10:30

answered Jul 9, 2010 at 20:03

Pat

4 Comments

Arseni Mourzenko Over a year ago

What should happen for strings like this: aCamelCaseXML? Reading the question, I would expect "a Camel Case XML". Instead, it gives "a Camel Case X M L".

Pat Over a year ago

@MainMa That's true. Following .NET naming standards, any acronyms three letters or longer (e.g. XML) would be in proper case (i.e. Xml), but two-letter acronyms (e.g. IP for IPAddress) would still cause a problem. It would be better to have the algorithm handle this case.

Shimmy Weitzhandler Over a year ago

Is there any out-of-the-box function that does this?

Custodio Over a year ago

I'd suggest:

new Regex(@"
    (?<=[A-Z])(?=[A-Z][a-z]) |
    (?<=[^A-Z])(?=[A-Z]) |
    (?<=[A-Za-z])(?=[^A-Za-z])",
    RegexOptions.IgnorePatternWhitespace)

as stackoverflow.com/questions/3103730/… says

Dan Tao · Accepted Answer · 2010-07-09 20:12:23Z

5

How about:

static IEnumerable<string> SplitPascalCase(this string text)
{
    var sb = new StringBuilder();
    using (var reader = new StringReader(text))
    {
        while (reader.Peek() != -1)
        {
            char c = (char)reader.Read();
            if (char.IsUpper(c) && sb.Length > 0)
            {
                yield return sb.ToString();
                sb.Length = 0;
            }

            sb.Append(c);
        }
    }

    if (sb.Length > 0)
        yield return sb.ToString();
}

answered Jul 9, 2010 at 20:12

Dan Tao

3 Comments

Steven Sudit Over a year ago

This would be a "by hand" solution.

Dan Tao Over a year ago

@Steven Sudit: Yeah... was that forbidden or something?

Steven Sudit Over a year ago

No, no, not at all. There was just some confusion about what "by hand" meant, when I suggested that to Pat as an alternative to RegExp. In fact, I think that RegExp, for all its power, is overused. For many jobs, it's a bad fit, leading to cryptic code and poor performance.

Brent · Accepted Answer · 2014-08-04 08:50:17Z

With the aims of

a) creating a function optimised for performance, and
b) having my own take on CamelCase in which capitalised acronyms are not separated (I fully accept this is not the standard definition of camel or Pascal case, but it is not an uncommon usage): "TestTLAContainingCamelCase" becomes "Test TLA Containing Camel Case" (TLA = Three Letter Acronym),

I therefore created the following (non-regex, verbose, but performance-oriented) function:

public static string ToSeparateWords(this string value)
{
    if (value==null){return null;}
    if(value.Length <=1){return value;}
    char[] inChars = value.ToCharArray();
    List<int> uCWithAnyLC = new List<int>();
    int i = 0;
    while (i < inChars.Length && char.IsUpper(inChars[i])) { ++i; }
    for (; i < inChars.Length; i++)
    {
        if (char.IsUpper(inChars[i]))
        {
            uCWithAnyLC.Add(i);
            if (++i < inChars.Length && char.IsUpper(inChars[i]))
            {
                while (++i < inChars.Length) 
                {
                    if (!char.IsUpper(inChars[i]))
                    {
                        uCWithAnyLC.Add(i - 1);
                        break;
                    }
                }
            }
        }
    }
    char[] outChars = new char[inChars.Length + uCWithAnyLC.Count];
    int lastIndex = 0;
    for (i=0;i<uCWithAnyLC.Count;i++)
    {
        int currentIndex = uCWithAnyLC[i];
        Array.Copy(inChars, lastIndex, outChars, lastIndex + i, currentIndex - lastIndex);
        outChars[currentIndex + i] = ' ';
        lastIndex = currentIndex;
    }
    int lastPos = lastIndex + uCWithAnyLC.Count;
    Array.Copy(inChars, lastIndex, outChars, lastPos, outChars.Length - lastPos);
    return new string(outChars);
}

The most surprising part was the performance tests, using 1 000 000 iterations per function:

regex pattern used = "([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))"
test string = "TestTLAContainingCamelCase":
static regex:      13 302ms
Regex instance:    12 398ms
compiled regex:    12 663ms
brent(above):         345ms
AndyRose:           1 764ms
DanTao:               995ms

The Regex instance method was only slightly faster than the static method, even over a million iterations (and I can't see the benefit of using the RegexOptions.Compiled flag), and Dan Tao's very succinct code came within a factor of three of my much less clear code!
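
The timing harness itself is not shown above. A rough sketch of how such a comparison could be run with System.Diagnostics.Stopwatch follows; the SplitBenchmark class and its Time method are hypothetical, and the numbers it produces will depend on machine, runtime, and build configuration.

```csharp
using System;
using System.Diagnostics;

public static class SplitBenchmark
{
    // Runs `split` on `input` `iterations` times and returns the elapsed milliseconds.
    public static long Time(Func<string, string> split, string input, int iterations)
    {
        split(input); // warm-up call so JIT compilation and regex construction are not timed

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            split(input);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

A call such as SplitBenchmark.Time(s => s.ToSeparateWords(), "TestTLAContainingCamelCase", 1000000) would time the function above; the other candidates can be passed in the same way.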

Pat · Accepted Answer · 2010-07-09 20:12:44Z

1

var regex = new Regex("([A-Z]+[^A-Z]+)");
var matches = regex.Matches("aCamelCaseWord")
    .Cast<Match>()
    .Select(match => match.Value);
foreach (var element in matches)
{
    Console.WriteLine(element);
}

Prints

Camel
Case
Word

(As you can see, it doesn't handle camelCase - it dropped the leading "a".)
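
One way to keep the leading lowercase part, offered only as a sketch, is to add a second alternative that matches a run of non-uppercase characters on its own. The MatchSplitter class and Words method are hypothetical names. Like the pattern above, this does not separate an acronym from the word that follows it (XMLEditor stays in one piece).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchSplitter
{
    // First alternative: one or more capitals followed by any non-capitals.
    // Second alternative: a run of non-capitals, which picks up a leading
    // lowercase part such as the "a" in "aCamelCaseWord".
    public static List<string> Words(string input)
    {
        return Regex.Matches(input, "[A-Z]+[^A-Z]*|[^A-Z]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```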

edited Jul 9, 2010 at 20:12

answered Jul 9, 2010 at 19:54

Pat

5 Comments

Steven Sudit Over a year ago

1) Compile the regexp for some speed. 2) It'll still be slower than doing it by hand.

Pat Over a year ago

@Steven I agree that it should be compiled for speed, but it's the functionality I'm going after for now. What do you mean it will be "slower than doing it by hand"? If I reflect over an object with a bunch of public properties and convert the names from PascalCase to separate words, it will be much faster (development and maintenance time) doing it programmatically than by hand.

Ron Warholic Over a year ago

I didn't see speed mentioned as a requirement. Also I think "doing it by hand" refers to writing your own string parsing code which may be faster but will be significantly more code and more testing.

Pat Over a year ago

@Ken This method doesn't handle camelCase, so the "a" was dropped (see edit to the answer).

Steven Sudit Over a year ago

@Pat: what Ron said is correct: "by hand" means writing your own code to loop over the string, character by character, building up each word into a StringBuilder and outputting as needed.

Sooraj kumar · Accepted Answer · 2020-10-02 07:49:22Z

1

string.Concat(str.Select(x => Char.IsUpper(x) ? " " + x : x.ToString())).TrimStart(' ').Dump();

This is a far better approach than using Regex. Dump is just there to print to the console.

answered Oct 2, 2020 at 7:49

Sooraj kumar

Ken Bloom · Accepted Answer · 2010-07-09 20:02:37Z

0

In Ruby:

"aCamelCaseWord".split /(?=[[:upper:]])/
=> ["a", "Camel", "Case", "Word"]

I'm using positive lookahead here, so that I can split the string right before each uppercase letter. This lets me save any initial lowercase part as well.

answered Jul 9, 2010 at 20:02

Ken Bloom

3 Comments

Pat Over a year ago

That's a positive lookahead, isn't it? I can't get an equivalent to work for .NET, even when I replace [[:upper:]] with [A-Z] (en.wikipedia.org/wiki/Regular_expression).

Alan Moore Over a year ago

.NET regex doesn't support the POSIX character class syntax. You could use \p{Lu} instead, but [A-Z] will probably suffice. Anyway, this approach is way too simplistic. Check out the other question, especially the split regex @poly came up with. It really is that complicated.
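
For reference, a rough .NET counterpart of the Ruby one-liner might look like the sketch below. It assumes Regex.Split with a zero-width lookahead, uses \p{Lu} as suggested in the comment, and adds (?<!^) so that a string starting with a capital does not produce an empty first element. The LookaheadSplitter name is hypothetical, and the approach is as simplistic as the comment says: XML would be split into X, M, L.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadSplitter
{
    // Splits right before every uppercase letter, except at the very start.
    public static string[] Split(string input)
    {
        return Regex.Split(input, @"(?<!^)(?=\p{Lu})");
    }
}
```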

Alan Moore Over a year ago

@Pat: that Wikipedia article is not meant to be used as a reference; too general and too theoretical. This site is much more useful: regular-expressions.info

Aaron Butacov · Accepted Answer · 2010-07-09 20:14:50Z

0

Check that a non-word character comes at the beginning of your regex with \W and keep the individual strings together, then split the words.

Something like: \W([A-Z][A-Za-z]+)+

For:

sdcsds sd aCamelCaseWord as dasd as aSscdcacdcdc PascelCase DfsadSsdd sd

Outputs:

48: PascelCase
59: DfsadSsdd

edited Jul 9, 2010 at 20:14

answered Jul 9, 2010 at 20:00

Aaron Butacov

2 Comments

Pat Over a year ago

Hmmm. That doesn't work straight-up for .NET's regex, but maybe with a little documentation digging...

Alan Moore Over a year ago

You should use \b (word boundary) to match the beginning of the word, not \W.
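
A sketch of what the \b variant could look like in .NET, under the assumption that the goal is to find whole words that start with a capital letter. The repeated group from the answer is simplified to [A-Z][A-Za-z]+, which accepts the same words, and BoundaryFinder is a hypothetical name. Because \b is zero-width, each match starts at the word itself rather than at the preceding space.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryFinder
{
    // Returns every whole word that begins with an uppercase ASCII letter
    // and consists only of ASCII letters.
    public static List<string> FindPascalWords(string text)
    {
        return Regex.Matches(text, @"\b[A-Z][A-Za-z]+\b")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```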

JEM · Accepted Answer · 2016-02-23 20:06:21Z

0

    public static string PascalCaseToSentence(string input)
    {
        if (input == null) return "";

        string output = Regex.Replace(input, @"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])", m => " " + m.Value);
        return output;
    }

Based on Shimmy's answer.

answered Feb 23, 2016 at 20:06

JEM
