
I have some text like this:

 //(10,0,'Computer_accessibility','',''),(13,0,'History_of_Afghanistan','',''),(14,0,'Geography_of_Afghanistan','','')

and I wrote a pattern:

public final static Pattern r_english = Pattern.compile("\\((.*?),(.*?),(.*?),(.*?),(.*?)\\)");

This works well in Java to extract m.group(1) (e.g. 13) and m.group(3) (e.g. History_of_Afghanistan), where m is a Matcher. However, it breaks on text like the following, because Washington,_D.C. (i.e. m.group(3)) contains a comma:

(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','','')

Can someone help me modify the regex so that it extracts Washington,_D.C. correctly? Thanks

3 Answers

3

Change your third capture group to capture everything until a closing ' is reached. That allows every character (including your comma) to be captured.

UPDATE: to allow escaped 's as well, the regex looks like this. Credits go to Pshemo, see the comments.

public final static Pattern r_english = Pattern.compile("\\((.*?),(.*?),('(?:[^']|\\\\')*'),(.*?),(.*?)\\)");
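As a minimal sketch of the first suggestion (before the update for escaped quotes), the third group can be written as '[^']*' so that it runs from one quote to the next. The class name QuotedTitleDemo and the helper extract are made up for illustration, and the ([^,]*) groups follow the hint from the comments rather than the original pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class QuotedTitleDemo {

    // Groups 1 and 2 stop at the next comma; group 3 runs from one quote to the
    // next, so a comma inside the title no longer ends the group early.
    static final Pattern R_ENGLISH =
            Pattern.compile("\\(([^,]*),([^,]*),('[^']*'),(.*?),(.*?)\\)");

    // Returns "id -> title" for every record found in the input.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = R_ENGLISH.matcher(text);
        while (m.find()) {
            out.add(m.group(1) + " -> " + m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','','')";
        extract(text).forEach(System.out::println);
    }
}
```

Note that group 3 still includes the surrounding quotes; move the parentheses inside the quotes if only the bare title is wanted.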

10 Comments

Now I am waiting for info from OP that title may also have more than ' in it. But for now this looks OK.
So far it's not in the example input :D
BTW, you probably don't need that ? in '[^']*?'. Also (.*?) can be changed into ([^,]*).
@Knight In that case you need to create some format which will allow us to determine which ' is part of the title, and which ' is the quote marking the end of the title. In other words, you need to introduce some escaping mechanism for that non-special ', such as preceding it with \ (but this would also make \ special, so just like in String literals you would need to escape it too if you wanted to write that symbol), or, as SQL does, escaping each textual ' with another ', like 'Marvin''s_Room'.
@Knight In that case, instead of [^'], which accepts only non-quote characters, try to accept non-quotes OR an escaped quote. So try to change ('[^']*') into something like ('(?:[^']|\\')*') (in a Java String literal you need to write \\ as \\\\). If you are wondering what (?:...) is, it is called a non-capturing group: a group which is not included in the group numbering, so you can't access it via group(index). I used it so that your current group indexes are not affected.
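To make the escaping point in the last comment concrete, here is a small sketch. It assumes titles escape an inner quote with a backslash; the record with Marvin\'s_Room and the names EscapedQuoteDemo and firstTitle are invented for illustration. The regex backslash pair \\ has to be typed as \\\\ in the Java source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the comment above.
public class EscapedQuoteDemo {

    // Regex seen by the engine: \((.*?),(.*?),('(?:[^']|\\')*'),(.*?),(.*?)\)
    // Each regex backslash is doubled again inside the Java string literal.
    static final Pattern R_ENGLISH =
            Pattern.compile("\\((.*?),(.*?),('(?:[^']|\\\\')*'),(.*?),(.*?)\\)");

    // Returns group 3 (the quoted title) of the first record, or null if none matches.
    static String firstTitle(String text) {
        Matcher m = R_ENGLISH.matcher(text);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Assumed input: the title contains the two characters \ and ' in a row.
        String text = "(1,0,'Marvin\\'s_Room','',''),(2,0,'Washington,_D.C.','','')";
        System.out.println(firstTitle(text));
    }
}
```

With only "\\'" in the literal, the engine would see \' (a plain quote), so the group could run past the closing quote of the title.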
1

You should make your regex more specific to your case. For example:

\((.*?),(.*?),('.*?'),('.*?'),('.*?')\)

I used the quotes ' as delimiters, so this pattern also tolerates extra characters such as commas inside groups 3-5.
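A rough sketch of how this quote-anchored pattern could be tried out in Java follows. The class name QuoteAnchoredDemo, the helper titles, and the Mercury_(planet) record are assumptions made up for the example; only the Washington,_D.C. record comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class QuoteAnchoredDemo {

    // Groups 3-5 each start and end at a quote, so commas or parentheses
    // inside a quoted value do not split the record.
    static final Pattern RECORD =
            Pattern.compile("\\((.*?),(.*?),('.*?'),('.*?'),('.*?')\\)");

    // Collects group 3 (the quoted title) of every record in the input.
    static List<String> titles(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            out.add(m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        // The first record is an invented example with parentheses in the title.
        String text = "(19,0,'Mercury_(planet)','',''),(8543,0,'Washington,_D.C.','','')";
        titles(text).forEach(System.out::println);
    }
}
```

Because .*? is lazy, a value containing the sequence ', would still confuse this pattern, so it is only a fit for data shaped like the samples above.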

Regards

1

You need to change your regular expression so that it fits all the matches you want to retrieve, e.g.:

/\((.*?),(.*?),'(.*?)','(.*?)','(.*?)'\)/g

Working Example @ regex101

You need to translate/escape the above regular expression into a Java-compatible one, e.g.:

public static String REGEX_PATTERN = "\\((.*?),(.*?),'(.*?)','(.*?)','(.*?)'\\)";

Then iterate through all the matches to mimic the /g modifier, e.g.:

while (matcher.find()) {
    // read matcher.group(1) ... matcher.group(5) for the current match here
}

Java Working Example:

package SO40002225;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static String INPUT;
    public static String REGEX_PATTERN;

    static {
        INPUT = "(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','',''),(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','','')";
        REGEX_PATTERN = "\\((.*?),(.*?),'(.*?)','(.*?)','(.*?)'\\)";
    }


    public static void main(String[] args) {
        String text = INPUT;

        Pattern pattern = Pattern.compile(REGEX_PATTERN);
        Matcher matcher = pattern.matcher(text);

        while (matcher.find()) {
            String mg1 = matcher.group(1);
            String mg2 = matcher.group(2);
            String mg3 = matcher.group(3);
            String mg4 = matcher.group(4);
            String mg5 = matcher.group(5);

            System.out.println("Matching group #1: " + mg1);
            System.out.println("Matching group #2: " + mg2);
            System.out.println("Matching group #3: " + mg3);
            System.out.println("Matching group #4: " + mg4);
            System.out.println("Matching group #5: " + mg5);
        }

    }

}

Update #1

Removed the escaping of the commas within the regular expression. As pointed out by Pshemo, the , is not a metacharacter here; it only has a special role inside a bounded repetition quantifier such as {min,max}.

2 Comments

Sorry, but what is the point of escaping ,? It is not one of the regex metacharacters (at least not here; the only case where , doesn't represent a literal is inside quantifiers like {min,max} or {min,}, but even then there is no point in escaping it).
@Pshemo, Thank you for pointing this out, I completely forgot about this.
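The point about the comma can be checked with a few lines of Java. This is only an illustrative sketch; the class name CommaDemo and the helper sameResult are invented for it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the comment thread above.
public class CommaDemo {

    // True when the escaped and unescaped comma patterns agree on the input.
    static boolean sameResult(String input) {
        boolean plain = Pattern.matches("a,b", input);      // regex: a,b
        boolean escaped = Pattern.matches("a\\,b", input);  // regex: a\,b
        return plain == escaped;
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("a,b", "a,b"));    // comma is a literal
        System.out.println(Pattern.matches("a\\,b", "a,b"));  // escaping changes nothing
        // Inside {min,max} the comma separates the bounds instead of matching itself.
        System.out.println(Pattern.matches("a{1,3}", "aa"));
        System.out.println(Pattern.matches("a{1,3}", "a,3"));
    }
}
```

Run as written, the first three checks should print true and the last one false.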
