2

EDIT: To explain my motivation: I'm writing a command-line utility that takes a log file and a pattern (a non-regex string that describes what each log entry looks like), converts the pattern into a regex, and matches each line of the file against it. The resulting collection of log events is then output in another format (e.g., JSON). I can't assume anything about the input pattern or the file's contents.


I'd like to parse a CSV list of key-value pairs. I need to capture the individual keys and values from the list. An example input string:

07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!\n

I verified that the regex below correctly extracts the keys and values from the input when it is applied repeatedly:

// regex
(([^,\s=]+)=([^,\s=]+)(?:,\s*(?:[^,\s=]+)=(?:[^,\s=]+))*?)

// input string
a=1, b=foo, c=bar

The result is:

// 1st call
group(1) == "a"
group(2) == "1"

// 2nd call
group(1) == "b"
group(2) == "foo"

// 3rd call
group(1) == "c"
group(2) == "bar"

But this regex (same as regex above with extra "stuff") does not work as expected:

// regex
\d{2}/\d{2}/\d{4} <DEBUG> (([^,\s=]+)=([^,\s=]+)(?:,\s*(?:[^,\s=]+)=(?:[^,\s=]+))*?) : .*

// input string
07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world! 

For some reason, the result is:

group(1) == "a=1, b=foo, c=bar"
group(2) == "a"
group(3) == "1"
// no more matches

What's the correct Java regex to extract the keys and values?

3 Answers

1

Regex:

\d{2}/\d{2}/\d{4}\s<DEBUG>\s([^=]+)=([^,\s]+)[,\s]([^=]+)=([^,\s]+)[,\s]([^=]+)=([^\s]+)\s:.*
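As a rough sketch of how this fixed three-pair regex could be applied in Java (the class name `FixedThreePairs` and the `groups` helper are made up for illustration): note that `[,\s]` consumes only the comma, so the second and third keys are captured with a leading space and are trimmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex itself comes from the answer.
class FixedThreePairs {
    static final Pattern LINE = Pattern.compile(
        "\\d{2}/\\d{2}/\\d{4}\\s<DEBUG>\\s([^=]+)=([^,\\s]+)[,\\s]"
        + "([^=]+)=([^,\\s]+)[,\\s]([^=]+)=([^\\s]+)\\s:.*");

    // Returns the six captured groups, or an empty list if the line
    // does not contain exactly the expected three pairs.
    static List<String> groups(String line, boolean trim) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(trim ? m.group(i).trim() : m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!";
        System.out.println(groups(line, true)); // [a, 1, b, foo, c, bar]
    }
}
```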

Edit: If the count can be an arbitrary number, try the one below.

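    // Needs java.util.Arrays, java.util.Scanner, and java.util.regex.Pattern.
    Scanner s = new Scanner("07/04/2012 <DEBUG> a=1, b=foo, c=bar : d=erere  m=abcd hello world!");
    // A key=value token that is preceded by whitespace or a comma.
    Pattern p = Pattern.compile("(?<=\\s|,)[^\\s=]+=[^,\\s]+");
    String out;
    // findInLine returns null once the line has no further match.
    while ((out = s.findInLine(p)) != null) {
        System.out.println(Arrays.toString(out.split("=")));
    }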

Output:

[a, 1]
[b, foo]
[c, bar]
[d, erere]
[m, abcd]

2 Comments

Almost. It only works for CSV lists of exactly 3 elements, but the element count can actually be any reasonable positive number.
I really appreciate your effort. :) At first, I thought this programmatic parsing solution (as opposed to pure regex capture groups) was too restrictive, because it assumes there are no other regexes in the pattern (which is incorrect). However, all the answers so far lead me to think that the only viable solution (at least for now) is manual parsing to get to the CSV list, followed by regex matching of the list by itself.
1

Use "\\w+=\\w+" and call find() repeatedly to get the matches ("a=1", "b=foo", "c=bar"), then split each one on =.
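A minimal sketch of that idea, assuming keys and values consist only of word characters (letters, digits, underscore). The class name `WordPairFinder` is hypothetical, and capturing groups are used here in place of the split on =:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class WordPairFinder {
    // Same pattern as above, with groups around the key and the value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\w+)");

    // Collects every key=value occurrence in the line, in order of appearance.
    static Map<String, String> findPairs(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!";
        System.out.println(findPairs(line)); // {a=1, b=foo, c=bar}
    }
}
```

Because the pattern is not anchored to the list, it would also pick up any key=value text that appears in the message after the colon.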

1

The correct regex depends on what you are trying to achieve. In the second case, the result is correct with respect to the regex. The literal <DEBUG> prefix and the trailing : .* are both part of the pattern, so a single match has to span the whole line, and only one fragment of the string can match. Within that one match, groups 2 and 3 hold only the first pair, because the remaining pairs are consumed by the non-capturing repetition.
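To illustrate, here is a small sketch that counts how often the question's full-line regex matches, and then works around the limitation in two stages: match the line once, then run the single-pair sub-pattern over group(1). The class and method names are hypothetical; both patterns are taken from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class TwoStageMatch {
    // The full-line regex from the question, unchanged.
    static final Pattern LINE = Pattern.compile(
        "\\d{2}/\\d{2}/\\d{4} <DEBUG> "
        + "(([^,\\s=]+)=([^,\\s=]+)(?:,\\s*(?:[^,\\s=]+)=(?:[^,\\s=]+))*?) : .*");

    // The single-pair part of that regex, applied separately in stage two.
    static final Pattern PAIR = Pattern.compile("([^,\\s=]+)=([^,\\s=]+)");

    // Counts how many times find() succeeds with the full-line regex.
    static int countLineMatches(String line) {
        Matcher m = LINE.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Stage one isolates the list via group(1); stage two iterates over its pairs.
    static Map<String, String> extractPairs(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher whole = LINE.matcher(line);
        if (!whole.find()) {
            return pairs;
        }
        Matcher pair = PAIR.matcher(whole.group(1));
        while (pair.find()) {
            pairs.put(pair.group(1), pair.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!";
        System.out.println(countLineMatches(line)); // 1
        System.out.println(extractPairs(line));     // {a=1, b=foo, c=bar}
    }
}
```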

Personally, I would go for another solution: instead of using regexes directly, I would use split. For example, if the part you are interested in is always between > and : and neither character occurs inside that part, you can get by with substring, indexOf, and split. You can split twice: first on , to get all the key=value pairs, then on = within each pair. That is only my solution, though, and it might not be optimal; I would be happy to see a better one.
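The approach above might look roughly like this. It is a sketch that assumes the list sits between the first > and the next : on the line, and that neither character appears inside the list; `SplitParser` and `parse` are names invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration.
class SplitParser {
    static Map<String, String> parse(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        // Assumption: the list starts after the first '>' and ends at the next ':'.
        int start = line.indexOf('>');
        int end = line.indexOf(':', start + 1);
        if (start < 0 || end < 0) {
            return pairs; // delimiters not found, nothing to parse
        }
        String list = line.substring(start + 1, end);
        // First split on ',' to get the pairs, then on '=' within each pair.
        for (String pair : list.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                pairs.put(kv[0].trim(), kv[1].trim());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!";
        System.out.println(parse(line)); // {a=1, b=foo, c=bar}
    }
}
```

The limit of 2 in the second split keeps any further = characters inside the value rather than dropping them.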
