Difficult (for me) string parsing in C# (regex?)

Question

I need help parsing some information from a mass of text. Basically, I am importing a PSD file and want to parse some data from it.

Amongst the text are strings such as this:

\r\nj78876 RANDOM TEXT STRINGS 75 £

Now what I want to do is grab all strings that fit this format (maybe the starting "\r\n" and ending "£" can be delimiters) and get the code at the start (j78876) and the price at the end (75). Note that the price may have more than 2 digits.

I then want to grab the code (such as j78876) and the price from every string like this that is found; they occur many times, with different codes and prices.

Can anyone suggest a way to do this?

I am not very proficient with Regex so guidance would be great.

Thanks.

Note: Here is a snippet of the actual text (there is a lot more in the actual file).

Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €\r\nJ9449A HP V1810-8G Switch 139,00\r\nJ9450A HP V1810-24G Switch 359,00\r\nEdge Switches - Managed \r\nHP Layer 2 Switches - Managed Stackables and Chassis\r\nHP Switch 2510 Series\r\nRéférence Ancienne référence 3Com/H3C Libellé Remarque Prix en €\r\nJ9019B HP E2510-24 Switch 359,00\r \nJ9020A HP E2510-48 Switch 599,00\r\nJ9279A HP E2510-24G Switch 779,00\r\nJ9280A HP E2510-48G Switch 1 569,00\r\nHP Switch 2520 Series\r\nRéférence Ancienne référence 3Com/H3C Libellé Remarque Prix en €\r\nJ9137A HP E2520-8-PoE Switch 489,00\r\nJ9138A HP E2520-24-PoE Switch 779,00\r\nJ9298A HP E2520-8G-PoE Switch 749,00\r\nJ9299A HP E2520- 24G-PoE Switch 1 569,00\r\nHP Layer 2 and 3 Switches - Managed Stackables and Chassis\r \nThe RBP is a recommended price only. \r\nHP Switch 2600 Series\r\nRéférence Ancienne

Update: I found this:

[\\r\\n](\w\d+\w).*?(\d+,\d\d)[\\r\\n]

It worked for me in browser-based regex testers but will not work in my C# code:

Regex reg = new Regex(@"[\\r\\n](\w\d+\w).*?(\d+,\d\d)[\\r\\n]", RegexOptions.IgnoreCase);
Match matched = reg.Match(str);
if (matched.Success)
{
    string code = matched.Groups[1].Value;
    string currencyAmt = matched.Groups[2].Value;
}

Final Update: In the browser testers I had to double-escape the \r\n; in my code it was not necessary. Then, to loop over the matches, I used the looping answer.

foreach (Match match in Regex.Matches(content, @"[\r\n](?<code>\w\d+\w).*?(?<price>\d+,\d\d)[\r\n]", RegexOptions.IgnoreCase))
{
    string code = match.Groups["code"].Value;
    string currencyAmt = match.Groups["price"].Value;
}
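
To illustrate the escaping point in the final update, here is a minimal sketch. It assumes the input holds real CR and LF characters (as text read from a file would) rather than the literal backslash sequences typed into a browser tester. The class name EscapeDemo, the helper Found and the sample string are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Real CR+LF characters, as they would appear in text read from a file.
    public const string Text = "Prix\r\nJ9449A HP V1810-8G Switch 139,00\r\n";

    // True if the pattern matches anywhere in the sample text.
    public static bool Found(string pattern)
    {
        return Regex.IsMatch(Text, pattern);
    }

    static void Main()
    {
        // In a verbatim string, \\r reaches the regex engine as an escaped
        // backslash followed by 'r', so this class matches '\', 'r' or 'n'.
        Console.WriteLine(Found(@"[\\r\\n]J9449A")); // False

        // A single backslash reaches the engine as \r and \n (CR and LF).
        Console.WriteLine(Found(@"[\r\n]J9449A"));   // True
    }
}
```

The doubled form only helps when the text being searched contains a visible backslash followed by r or n, which is likely what the browser testers were given.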

It really depends on what characters "random text strings" can contain, including whitespace.
– Jon, Commented Mar 22, 2011 at 17:06
Hi Jon, yes, the random text is all sorts of text (paragraphs with white space, "\r\n" line breaks, etc.) but it does not contain the £ symbol, so I was thinking of looking for a "£" and working back to the "\r\n", using the two as a sort of string token delimiter.
– Simon, Commented Mar 22, 2011 at 17:37
Your final update has a problem in the part .*?(?<price>\d+. The regex part .*? is aggressive and will match until the last digit before the decimal: in "...xyz 749,00\r\n" .*? will match "...xyz 74" and \d+,\d\d will match 9,00.
– Eric H, Commented Mar 22, 2011 at 21:58
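
The concern in the last comment can be checked directly. In .NET, .*? is a lazy quantifier, so it stops at the first position where the price sub-pattern can match; the greedy form .* is the one that leaves only the last digit for the price. A minimal sketch, using one line from the sample data (the class name QuantifierDemo and the helper Price are illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Returns the text captured by the "price" group, or "" if nothing matched.
    public static string Price(string pattern, string line)
    {
        Match m = Regex.Match(line, pattern);
        return m.Success ? m.Groups["price"].Value : "";
    }

    static void Main()
    {
        string line = "J9298A HP E2520-8G-PoE Switch 749,00";

        // Lazy: .*? grows one character at a time and stops at the first
        // position where the price sub-pattern can match.
        Console.WriteLine(Price(@".*?(?<price>\d+,\d\d)", line)); // 749,00

        // Greedy: .* grabs the whole line, then gives back as little as possible.
        Console.WriteLine(Price(@".*(?<price>\d+,\d\d)", line));  // 9,00
    }
}
```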

Eric H · Accepted Answer · 2011-03-22 21:34:27Z

3

Regex reg = new Regex(@"\r\n([a-z]\d+\w)\s.*\s(\d+\,?\d+?)\r\n", RegexOptions.IgnoreCase);
string productCode, productCost;
foreach (Match match in reg.Matches(str))
{
    productCode = match.Groups[1].Value;
    productCost = match.Groups[2].Value;
    //do something with values here
}

Edited because my original answer was wrong.
Based on your sample the above works.
Quick explanation of the pattern passed to new Regex(...):

@ : makes the string a verbatim literal, so I don't have to add extra escapes everywhere.
\r\n : the match starts with a CRLF.
([a-z]\d+\w)\s : matches your product code; I used the \s to frame it, as the code appears to be consistently followed by whitespace.
.* : matches your free-form product description.
\s(\d+\,?\d+?) : matches a whitespace character followed by the second capture, the price.
\r\n : the match ends with a CRLF.

If you provided a larger sample data set, I could fine tune the regex.
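
One thing to watch with the pattern above: each match consumes its closing \r\n, so when two product lines follow each other directly, the second one no longer has a leading \r\n left to match. A hedged sketch of the effect, together with a variant that only looks ahead for the closing \r\n. The lookahead variant is an assumption about what is wanted, not part of the original answer, and the names LookaheadDemo and Codes are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Collects the product codes (group 1) found by the given pattern.
    public static List<string> Codes(string pattern, string text)
    {
        var codes = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            codes.Add(m.Groups[1].Value);
        }
        return codes;
    }

    static void Main()
    {
        string text = "Prix\r\nJ9449A HP V1810-8G Switch 139,00\r\n"
                    + "J9450A HP V1810-24G Switch 359,00\r\nEdge Switches\r\n";

        // Pattern from the answer: the closing \r\n is consumed by each match.
        string consuming = @"\r\n([a-z]\d+\w)\s.*\s(\d+\,?\d+?)\r\n";

        // Variant (assumption): assert the closing \r\n without consuming it.
        string lookahead = @"\r\n([a-z]\d+\w)\s.*\s(\d+\,?\d+?)(?=\r\n)";

        Console.WriteLine(string.Join(", ", Codes(consuming, text))); // J9449A
        Console.WriteLine(string.Join(", ", Codes(lookahead, text))); // J9449A, J9450A
    }
}
```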

edited Mar 22, 2011 at 21:34

answered Mar 22, 2011 at 17:09

Eric H


2 Comments

svoop Over a year ago

[\r\n] is probably not what Simon needs as it won't match "\r\n".

Simon Over a year ago

Gave it a try but no match was found. I have added a snippet of the actual text to my question above.

svoop · Accepted Answer · 2011-03-23 11:44:46Z

2

Alright, your question is a moving target. The actual text sample has (in contradiction to your question) no £ in it. Here's an adapted expression:

new Regex(@"\r\n(\w+?).*?\s+(\d+?,\d\d)")

In prose (this is a learning site after all): Match "\r\n" followed by any alphanumerics until you hit whitespace, then anything until you hit whitespace followed by a number with two digits behind the comma. The alphanumerics and the number are the captured parts.

As I said, I don't do C# and thus can't test it. See the C# docs (and other answers here) for how to use it.
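
Since the expression above was posted untested, here is a small check of what it captures. Because \w+? is lazy and the following .*? can absorb the rest of the code, the first group ends up holding a single character. The second pattern is a possible adjustment (an assumption, not the answerer's version) that makes \w+ greedy. The names LazyCodeDemo and First are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyCodeDemo
{
    // Returns "code|price" for the first match, or "" if there is none.
    public static string First(string pattern, string text)
    {
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value + "|" + m.Groups[2].Value : "";
    }

    static void Main()
    {
        string text = "Prix\r\nJ9449A HP V1810-8G Switch 139,00\r\n";

        // Pattern from the answer: group 1 stops after one character.
        Console.WriteLine(First(@"\r\n(\w+?).*?\s+(\d+?,\d\d)", text)); // J|139,00

        // Variant (assumption): a greedy \w+ takes the whole leading word.
        Console.WriteLine(First(@"\r\n(\w+).*?\s+(\d+,\d\d)", text));   // J9449A|139,00
    }
}
```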

edited Mar 23, 2011 at 11:44

answered Mar 22, 2011 at 17:06

svoop


4 Comments

Simon Over a year ago

I tried this regex pattern in @Felice Pollano's code but it still does not find a match.

Felice Pollano Over a year ago

@Simon, OK, let's integrate this into the question with the code you used for the test, so @svoop or anyone else can help you better.

Simon Over a year ago

@felice Pollano I updated the question as I realised there is a white space between the price and the £ symbol.

svoop Over a year ago

@Simon: I've adapted the expression to the recent question updates.

Mongus Pong · Accepted Answer · 2011-03-22 17:29:09Z

1

I would use named groups to identify the groups more easily. The ?<code> part of the expression names the group.

You will want to use Matches, as you say there will be several occurrences of the pattern in your text. This will loop through them all.

foreach ( Match match in Regex.Matches(text, @"\r\n(?<code>\S+).*?(?<price>\d+)£") )
{
    string code = match.Groups["code"].Value;
    string currencyAmt = match.Groups["price"].Value;
    Console.WriteLine(code);
    Console.WriteLine(currencyAmt);
}
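
For the example line given in the question, which has a space between the price and the £, the pattern as written needs the digits to sit directly against the £. A minimal sketch of the difference, with \s* added before the £ as a possible adjustment. The £ is written as \u00A3 only to keep the source ASCII; the names PoundDemo and First are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class PoundDemo
{
    // Returns "code|price" for the first match, or "" if there is none.
    public static string First(string pattern, string text)
    {
        Match m = Regex.Match(text, pattern);
        return m.Success
            ? m.Groups["code"].Value + "|" + m.Groups["price"].Value
            : "";
    }

    static void Main()
    {
        // Example line from the question: note the space between "75" and the pound sign.
        string text = "intro\r\nj78876 RANDOM TEXT STRINGS 75 \u00A3";

        // Pattern from the answer: digits must be immediately followed by the pound sign.
        Console.WriteLine(First(@"\r\n(?<code>\S+).*?(?<price>\d+)\u00A3", text) == "");  // True

        // Variant (assumption): allow optional whitespace before the pound sign.
        Console.WriteLine(First(@"\r\n(?<code>\S+).*?(?<price>\d+)\s*\u00A3", text));     // j78876|75
    }
}
```

Note that . does not match \n by default, so if the free text between the code and the price can span several lines, RegexOptions.Singleline would also be needed.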

answered Mar 22, 2011 at 17:29

Mongus Pong


3 Comments

Simon Over a year ago

Gave it a try but no match was found. I have added a snippet of the actual text to my question above.

Mongus Pong Over a year ago

Does it work if you put the whitespace before the £ in the pattern? I am not at my computer to try it out at the minute...

Simon Over a year ago

Thanks for your input; it got me in the right direction to loop through the matches.

Simon · Accepted Answer · 2011-03-22 20:30:19Z

0

Final result was this:

foreach (Match match in Regex.Matches(content, @"[\r\n](?<code>\w\d+\w).*?(?<price>\d+,\d\d)[\r\n]", RegexOptions.IgnoreCase))
{
    string code = match.Groups["code"].Value;
    string currencyAmt = match.Groups["price"].Value;
}

answered Mar 22, 2011 at 20:30

Simon


Comments

Alan Moore · Accepted Answer · 2011-03-22 23:42:42Z

That sample data you added raises more questions than it answers. Are we supposed to treat those \r\n sequences as carriage-return+linefeed (CRLF), or as literal text? Also, it looks like space characters have been inserted at random positions--in some cases even between a \r and \n. Oh, and there are no pound symbols (£), only euro symbols (€), and they're never on the same line as a price, as you originally indicated.

If that sample really is representative of your data, you should try to clean it up (or have the people who supplied it to you clean it up) before you start searching it. I did just that so I could test my regex; if I've made any wrong assumptions, please let me know. And here it is:

  Regex rgx = new Regex(@"^(\w+).*?(\d+,\d\d)(?:[\r\n]+|\z)", RegexOptions.Multiline);

  string s = @"Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9449A HP V1810-8G Switch 139,00
J9450A HP V1810-24G Switch 359,00
Edge Switches - Managed 
HP Layer 2 Switches - Managed Stackables and Chassis
HP Switch 2510 Series
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9019B HP E2510-24 Switch 359,00
J9020A HP E2510-48 Switch 599,00
J9279A HP E2510-24G Switch 779,00
J9280A HP E2510-48G Switch 1 569,00
HP Switch 2520 Series
Référence Ancienne référence 3Com/H3C Libellé Remarque Prix en €
J9137A HP E2520-8-PoE Switch 489,00
J9138A HP E2520-24-PoE Switch 779,00
J9298A HP E2520-8G-PoE Switch 749,00
J9299A HP E2520-24G-PoE Switch 1 569,00
HP Layer 2 and 3 Switches - Managed Stackables and Chassis
The RBP is a recommended price only. 
HP Switch 2600 Series
Référence Ancienne";

  foreach (Match m in rgx.Matches(s))
  {
    Console.WriteLine("code: {0}; price: {1}", 
        m.Groups[1].Value, m.Groups[2].Value);
  }

output:

code: J9449A; price: 139,00
code: J9450A; price: 359,00
code: J9019B; price: 359,00
code: J9020A; price: 599,00
code: J9279A; price: 779,00
code: J9280A; price: 569,00
code: J9137A; price: 489,00
code: J9138A; price: 779,00
code: J9298A; price: 749,00
code: J9299A; price: 569,00

The ^ in multiline mode is sufficient to anchor the match at the beginning of a line; you don't have to match the line separator (\r\n) itself. You should be able to use $ at the end the same way, but that won't work because .NET doesn't regard \r as a line separator character. Instead I did it longhand: (?:[\r\n]+|\z)
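
In the output above, the two prices written as "1 569,00" come back as "569,00", because \d+ cannot cross the space used as a thousands separator. If the full amount is wanted, one possible adjustment is sketched below. It assumes the separator is a single plain space, that the price is the last thing on the line, and that the description does not itself end in a standalone group of one to three digits. It also uses \r?$ rather than the longhand line ending. The names ThousandsDemo and Parse are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ThousandsDemo
{
    // Assumption: groups of three digits separated by one plain space,
    // e.g. "1 569,00", with the price at the very end of the line.
    static readonly Regex Rgx = new Regex(
        @"^(\w+).*?\s(\d{1,3}(?: \d{3})*,\d\d)\r?$", RegexOptions.Multiline);

    // Returns one "code|price" entry per matching line.
    public static List<string> Parse(string text)
    {
        var rows = new List<string>();
        foreach (Match m in Rgx.Matches(text))
        {
            rows.Add(m.Groups[1].Value + "|" + m.Groups[2].Value);
        }
        return rows;
    }

    static void Main()
    {
        string s = "J9279A HP E2510-24G Switch 779,00\r\n"
                 + "J9280A HP E2510-48G Switch 1 569,00\r\n"
                 + "HP Switch 2520 Series\r\n";

        foreach (string row in Parse(s))
        {
            Console.WriteLine(row);
        }
        // J9279A|779,00
        // J9280A|1 569,00
    }
}
```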
