0

I am trying to get all strings enclosed in <*> by using the following Regex:

Regex regex = new Regex(@"\<(?<name>\S+)\>", RegexOptions.IgnoreCase);
string name = e.Match.Groups["name"].Value;

But in some cases, where I have text like:

<Vendors><Vtitle/>  <VSurname/></Vendors> 

It returns two strings instead of four, i.e. the above Regex outputs

<Vendors><Vtitle/> //as one string and 
<VSurname/></Vendors> //as second string

Whereas I am expecting four strings:

<Vendors>
<Vtitle/>
<VSurname/>
</Vendors> 

Could you please tell me what change I need to make to my Regex?

I tried adding '\b' to specify a word boundary

new Regex(@"\b\<(?<name>\S+)\>\b", RegexOptions.IgnoreCase);

, but that didn't help.

6
  • 5
    Is there any good reason not to use an xml parser here? Commented Dec 14, 2009 at 16:57
  • 1
    Agreed with Marc; use an XML parser. Unless you want to build one. Commented Dec 14, 2009 at 16:58
  • Are you parsing an XML document or do you have angle bracket tags inside a mostly plain text document? XML parsers are particular about having well formatted XML documents. They wouldn't work for finding a few angle bracket tags sprinkled throughout a text document. Commented Dec 15, 2009 at 17:41
  • OK, I just saw OP's comment on Andrew's answer. These tags happen to look like XML, but this isn't about parsing XML. This is about finding angle bracket delimited text within a mostly plain text document. Commented Dec 15, 2009 at 17:52
  • Here is the best answer ever to your question. It has 2302 upvotes. stackoverflow.com/questions/1732348/… Commented Dec 15, 2009 at 21:35

3 Answers

10

You'll get most of what you want by using the regex /<([^>]*)>/. (No need to escape the angle brackets, as angle brackets aren't special characters in most regex engines, including the .NET engine.) The regex I provided will also capture trailing whitespace and any attributes on the tag--parsing those things reliably is way, way beyond the scope of a reasonable regex.
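For illustration, here is a minimal sketch of how that pattern could be applied in C# to the sample text from the question. The class and method names (TagFinder, FindTags) are made up for this example and are not from the original post; the sketch returns the full match (angle brackets included) rather than the captured group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named only for this example.
public static class TagFinder
{
    // '<', then any run of characters that are not '>', then '>'.
    private static readonly Regex TagPattern = new Regex(@"<([^>]*)>");

    // Returns every full match, angle brackets included, in order of appearance.
    public static List<string> FindTags(string input)
    {
        var tags = new List<string>();
        foreach (Match m in TagPattern.Matches(input))
        {
            tags.Add(m.Value);
        }
        return tags;
    }

    public static void Main()
    {
        // Sample text taken from the question.
        foreach (string tag in FindTags("<Vendors><Vtitle/>  <VSurname/></Vendors>"))
        {
            Console.WriteLine(tag);
        }
    }
}
```

On the question's sample this should print the four tags on separate lines. Because [^>]* accepts anything up to the next '>', a tag with attributes would come back with the attributes still inside it, as noted above.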

However, be aware that if you're trying to parse XML/HTML with a regex, that way lies madness.


1 Comment

By answering this question, however, the OP might use this regex (and more regexes) instead of the better methods. Then 2-3 years down the road someone's going to have to maintain it.
6

Regexes are the wrong tool for parsing XML. Try using the System.Xml.Linq (XElement) API.
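A rough sketch of what that might look like, assuming the input really is a well-formed XML fragment with a single root element (the comments on the question suggest that may not hold for the OP's text). The class and method names (XmlNames, ElementNames) are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, named only for this example.
public static class XmlNames
{
    // Parses the fragment and lists element names in document order.
    // XElement.Parse throws an XmlException if the input is not well-formed XML.
    public static List<string> ElementNames(string xml)
    {
        XElement root = XElement.Parse(xml);
        return root.DescendantsAndSelf()
                   .Select(e => e.Name.LocalName)
                   .ToList();
    }

    public static void Main()
    {
        // Sample text taken from the question.
        foreach (string name in ElementNames("<Vendors><Vtitle/>  <VSurname/></Vendors>"))
        {
            Console.WriteLine(name);
        }
    }
}
```

Note that this yields element names (Vendors, Vtitle, VSurname) rather than the four raw tag strings the question asks for, since a parser treats an opening and closing tag as one element.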

1 Comment

See Dennis Palmer's comment on the original question. This isn't XML.
4

Your regex is using \S+ as the wildcard. In English, this is "a series of one or more characters, none of which is whitespace". In other words, when the regex <(?<name>\S+)> is applied to a string such as <Vendors><Vtitle/> (the first part of your example), the regex will match the entire string, because angle brackets are non-whitespace characters.

I think what you want is "a series of one or more characters, none of which is an angle bracket".

The regex for that is <(?<name>[^>]+)> .
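To make the difference concrete, here is a small sketch that runs both patterns over the sample text from the question and collects the "name" group of each match. The class and method names (PatternComparison, Names) are invented for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named only for this example.
public static class PatternComparison
{
    // Returns the value of the named group "name" for every match of the pattern.
    public static string[] Names(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["name"].Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Sample text taken from the question.
        const string input = "<Vendors><Vtitle/>  <VSurname/></Vendors>";

        // \S+ is greedy and happily consumes '<' and '>', so it only stops at whitespace.
        Console.WriteLine(string.Join(" | ", Names(@"<(?<name>\S+)>", input)));

        // [^>]+ cannot cross a '>', so each tag becomes its own match.
        Console.WriteLine(string.Join(" | ", Names(@"<(?<name>[^>]+)>", input)));
    }
}
```

With the \S+ pattern the sample gives two captures (Vendors><Vtitle/ and VSurname/></Vendors); with [^>]+ it gives four (Vendors, Vtitle/, VSurname/, /Vendors). The captured group excludes the outer angle brackets, so use the full match value if you want them included.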

Ahhh, regular expressions. The language designed to look like cartoon swearing.

1 Comment

+2 if I could for cartoon swearing.
