0

I'm looking for a regex to split the following strings

red 12478
blue 25 12375
blue 25, 12364

This should give

Keywords red, ID 12478
Keywords blue 25, ID 12375
Keywords blue IDs 25, 12364

Each line has 2 parts, a set of keywords and a set of IDs. Keywords are separated by spaces and IDs are separated by commas.

I came up with the following regex: \s*((\S+\s+)+?)([\d\s,]+)

However, it fails for the second line. I've been trying to work with lookahead, but can't quite work it out.

I am trying to split the string into its component parts (keywords and IDs)

The format of each line is one or more space separated keywords followed by one or more comma separated IDs. IDs are numeric only and keywords do not contain commas.

I'm using Java to do this.

  • Please specify the language you are using. Commented Sep 17, 2013 at 16:25
  • Does the comma after red have to be there, but not after blue? Commented Sep 17, 2013 at 16:32
  • What should be matched to split the string? Commented Sep 17, 2013 at 16:34
  • Looks more like a match and replace than a split imo. Commented Sep 17, 2013 at 16:35
  • Updated. Apologies, the comma was missing from the output Commented Sep 18, 2013 at 8:14

3 Answers

2

I found a two-line solution using replaceAll and split:

String pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
String[] ids = theString.split(pattern)[1].split(",\\s?");

I assumed that each comma comes immediately after an ID (this can be enforced by first removing any spaces in front of a comma), and that the line has no trailing space.

I also assumed that the first keyword is a sequence of non-whitespace characters that does not end in a comma, \\S+(?<!,)\\s+, and that the remaining keywords (if any) consist of digits only, (\\d+\\s+)*. I based this assumption on your regex attempt.

The regex itself is simple: it greedily takes the longest sequence of valid keywords in which every keyword is followed by whitespace. That longest match is the keyword list, and whatever remains is the ID list.

Full Code:

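import java.util.Scanner;

public class SplitKeywordsAndIds {
    public static void main(String[] args){
        // First keyword: non-whitespace run not ending in a comma.
        // Further keywords: runs of digits, each followed by whitespace.
        String pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
        Scanner sc = new Scanner(System.in);
        // Stop at end of input instead of throwing NoSuchElementException.
        while(sc.hasNextLine()){
            String theString = sc.nextLine();
            // Skip blank lines, which would otherwise make ids[1] fail.
            if(theString.trim().isEmpty()) continue;

            // Keep only the keyword prefix, then split it on spaces.
            String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
            // Whatever follows the keyword prefix is the comma-separated ID list.
            String[] ids = theString.split(pattern)[1].split(",\\s?");

            System.out.println("Keywords:");
            for(String keyword: keywords){
                System.out.println("\t"+keyword);
            }
            System.out.println("IDs:");
            for(String id: ids){
                System.out.println("\t"+id);
            }
            System.out.println();
        }
    }
}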

Sample run:

red 124
Keywords:
    red
IDs:
    124

red 25 124
Keywords:
    red
    25
IDs:
    124

red 25, 124
Keywords:
    red
IDs:
    25
    124
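
For comparison, here is a minimal single-regex sketch of the format as stated in the question (space-separated keywords, then comma-separated numeric IDs running to the end of the line). It is not the approach used above: the class name KeywordIdSplitter, the parse helper and the pattern are all assumptions made for illustration. The keyword group is lazy and the ID group is anchored to the end of the line, so a bare number only counts as an ID when nothing but IDs follows it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordIdSplitter {
    // Group 1: keywords, matched lazily and starting with a non-space character.
    // Group 2: one or more comma-separated numbers that must reach the end of the line.
    private static final Pattern LINE =
            Pattern.compile("^\\s*(\\S.*?)\\s+(\\d+(?:\\s*,\\s*\\d+)*)\\s*$");

    /** Returns [keywords, ids], or null if the line does not fit the assumed format. */
    public static List<List<String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> keywords = Arrays.asList(m.group(1).split("\\s+"));
        List<String> ids = Arrays.asList(m.group(2).split("\\s*,\\s*"));
        return Arrays.asList(keywords, ids);
    }

    public static void main(String[] args) {
        String[] samples = {"red 12478", "blue 25 12375", "blue 25, 12364"};
        for (String line : samples) {
            List<List<String>> parts = parse(line);
            System.out.println("Keywords " + parts.get(0) + ", IDs " + parts.get(1));
        }
    }
}
```

Unlike the pattern above, this sketch does not require the second and later keywords to be numeric, and it returns null instead of throwing when a line has no ID part.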

0

I came up with:

(red|blue)( \d+(?!$)(?:, \d+)*)?( \d+)?$

as illustrated in http://rubular.com/r/y52XVeHcxY which seems to pass your tests. It's a straightforward matter to insert your keywords between the match substrings.
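
The linked demo runs on Rubular, which uses Ruby's regex engine. Below is a rough sketch of how the same pattern might be applied from Java. The class name ColorRegexDemo and the describe helper are hypothetical, and the output label is simplified to always read "IDs". The point it illustrates is that group 2 plays two roles: when group 3 matches, group 2 holds a numeric keyword; otherwise group 2 holds the comma-separated ID list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColorRegexDemo {
    // Same pattern as above, escaped for a Java string literal.
    private static final Pattern P =
            Pattern.compile("(red|blue)( \\d+(?!$)(?:, \\d+)*)?( \\d+)?$");

    /** Returns a "Keywords ..., IDs ..." line, or null when the pattern does not match. */
    public static String describe(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        String keywords = m.group(1);
        String ids;
        if (m.group(3) != null) {
            // A trailing number without a comma: it is the single ID,
            // and group 2 (if present) is a numeric keyword.
            if (m.group(2) != null) {
                keywords += m.group(2);
            }
            ids = m.group(3).trim();
        } else {
            // No separate trailing number: group 2 is the whole ID list.
            ids = m.group(2) == null ? "" : m.group(2).trim();
        }
        return "Keywords " + keywords + ", IDs " + ids;
    }

    public static void main(String[] args) {
        System.out.println(describe("red 12478"));
        System.out.println(describe("blue 25 12375"));
        System.out.println(describe("blue 25, 12364"));
    }
}
```

Note that the pattern hard-codes red and blue as the only leading keywords, so any other keyword would need to be added to the alternation.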


0

OK, since the OP didn't specify a target language, I am willing to tilt at this windmill over lunch as a brain teaser and provide a C#/.NET Regex replace with a match evaluator, which gives the required output:

Keywords red, ID 12478
Keywords blue 25 ID 12375
Keywords blue IDs 25, 12364

Note that there is no error checking. This is a fine example of using a lambda expression as the match evaluator, which builds the replacement text for each match according to the rules. Also note that, because the sample data is so small, it does not handle multiple IDs or multiple keywords in general, as real data may require.

string data = @"red 12478
blue 25 12375
blue 25, 12364";

var pattern = @"(?xmn)   # x=IgnorePatternWhiteSpace m=multiline n=explicit capture
^
(?<Keyword>[^\s]+)       # Match Keyword Color
[\s,]+
(
  (?<Numbers>\d+)       
  (?<HasComma>,)?       # If there is a comma that signifies IDs
  [,\s]*
)+                      # 1 or more values
$";


Console.WriteLine (Regex.Replace(data, pattern, (mtch) =>
{
    StringBuilder sb = new StringBuilder();

    sb.AppendFormat("Keywords {0}", mtch.Groups["Keyword"].Value);

    var values = mtch.Groups["Numbers"]
                     .Captures
                     .OfType<Capture>()
                     .Select (cp => cp.Value)
                     .ToList();

    if (mtch.Groups["HasComma"].Success)
    {
        sb.AppendFormat(" IDs {0}", string.Join(", ", values));
    }
    else
    {
        if (values.Count() > 1)
            sb.AppendFormat(" {0} ID {1}", values[0], values[1]  );
        else
            sb.AppendFormat(", ID {0}", values[0]);
    }

    return sb.ToString();
}));
