My text looks like this:

| birth_date          = {{birth date|1925|09|2|df=y}}
| birth_place         = [[Bristol]], [[England]], UK
| death_date          = {{death date and age|2000|11|16|1925|09|02|df=y}}
| death_place         = [[Eastbourne]], [[Sussex]], England, UK
| origin              = 
| instrument          = [[Piano]]
| genre               = 
| occupation          = [[Musician]]

I would like to extract everything that is inside [[ ]]. I tried using replaceAll to remove everything that is not inside [[ ]], and then splitting on newlines to get a list of the [[ ]] items.

input = input.replaceAll("^[\\[\\[(.+)\\]\\]]", "");

Required output:

[[Bristol]]
[[England]]
[[Eastbourne]]
[[Sussex]]
[[Piano]]
[[Musician]]

But this does not give the desired output. What am I missing here? There are thousands of documents; is this the fastest way to do it? If not, what is the best way to get the desired output?

  • In addition to the other problems, note that the + in (.+) is a "greedy" quantifier: it grabs as many characters as it can between [[ and ]], so for birth_place you'll get "Bristol]], [[England" as one of the matches. Adding ? after .+, as in falsetru's answer, prevents this. Commented Oct 4, 2013 at 16:56
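To make the greedy versus reluctant difference from the comment concrete, here is a minimal sketch run on the birth_place line from the question. The class name and the findAll helper are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Hypothetical helper: returns every match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "| birth_place         = [[Bristol]], [[England]], UK";

        // Greedy: .+ runs on to the last "]]" in the line, giving one long match.
        System.out.println(findAll("\\[\\[.+\\]\\]", line));
        // prints [[[Bristol]], [[England]]]

        // Reluctant: .+? stops at the first "]]" it reaches, giving two matches.
        System.out.println(findAll("\\[\\[.+?\\]\\]", line));
        // prints [[[Bristol]], [[England]]]
    }
}
```

The two printed lines look identical because List.toString adds its own brackets and commas; the first list holds a single element, the second holds two.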

3 Answers

6

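You need to match it, not replace it:

// Requires java.util.regex.Matcher and java.util.regex.Pattern.
// Note: \w+ only matches links made of word characters, e.g. [[Bristol]].
Matcher m = Pattern.compile("\\[\\[\\w+\\]\\]").matcher(input);
while (m.find()) {
    String link = m.group(); // the matched text, brackets included
    System.out.println(link);
}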
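On the "thousands of documents" part of the question: one common approach is to compile the Pattern once and reuse it for every document, creating only a new Matcher per input. The sketch below assumes that setup. The class and method names are hypothetical, and the negated character class [^\]]+ is swapped in for \w+ as an assumption, so that links containing spaces would also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiLinkExtractor {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // [^\]]+ (an assumption, not from the answer) matches anything up to the next ']'.
    private static final Pattern LINK = Pattern.compile("\\[\\[[^\\]]+\\]\\]");

    // Hypothetical helper: returns every [[...]] occurrence in one document.
    static List<String> extract(String document) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(document); // Matcher is cheap, but not thread-safe
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String doc = "| birth_place         = [[Bristol]], [[England]], UK\n"
                   + "| instrument          = [[Piano]]\n";
        for (String link : extract(doc)) {
            System.out.println(link);
        }
    }
}
```

Whether this is actually the fastest option depends on the documents, so it is worth measuring on real data before settling on it.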

2

Use Matcher.find. For example:

import java.util.regex.*;

...

String text =
    "| birth_date          = {{birth date|1925|09|2|df=y}}\n" +
    "| birth_place         = [[Bristol]], [[England]], UK\n" +
    "| death_date          = {{death date and age|2000|11|16|1925|09|02|df=y}}\n" +
    "| death_place         = [[Eastbourne]], [[Sussex]], England, UK\n" +
    "| origin              = \n" +
    "| instrument          = [[Piano]]\n" +
    "| genre               = \n" +
    "| occupation          = [[Musician]]\n";
Pattern pattern = Pattern.compile("\\[\\[.+?\\]\\]");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group());
}
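If only the text between the brackets is wanted, a capturing group can be added to the same reluctant pattern. The following is a sketch that assumes Java 9 or later, since it uses Matcher.results(); the class and method names are invented for the example.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WikiLinkText {
    // Group 1 captures the text between "[[" and "]]".
    private static final Pattern LINK = Pattern.compile("\\[\\[(.+?)\\]\\]");

    // Hypothetical helper: returns the inner text of each link, without brackets.
    static List<String> innerTexts(String text) {
        return LINK.matcher(text)
                   .results()                    // Stream<MatchResult>, Java 9+
                   .map(r -> r.group(1))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String line = "| death_place         = [[Eastbourne]], [[Sussex]], England, UK";
        System.out.println(innerTexts(line)); // prints [Eastbourne, Sussex]
    }
}
```

Using r.group() instead of r.group(1) would return the links with their brackets, as in the required output of the question.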


0

Just for fun, using replaceAll:

 String output = input.replaceAll("(?s)(\\]\\]|^).*?(\\[\\[|$)", "$1\n$2");
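To get from that string to a list, as the question originally planned, the result can be trimmed and split on newlines. This is only a sketch of that idea: the class and method names are hypothetical, and the empty-input guard is an addition that is not part of the answer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ReplaceAllSplit {
    // Hypothetical helper: applies the replaceAll from the answer, then splits.
    static List<String> extract(String input) {
        String output = input.replaceAll("(?s)(\\]\\]|^).*?(\\[\\[|$)", "$1\n$2");
        // The replacement leaves newlines at both ends, so trim before splitting.
        String trimmed = output.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList(); // no [[...]] in the input
        }
        return Arrays.asList(trimmed.split("\n"));
    }

    public static void main(String[] args) {
        String input = "| birth_place         = [[Bristol]], [[England]], UK\n"
                     + "| instrument          = [[Piano]]\n";
        for (String link : extract(input)) {
            System.out.println(link);
        }
    }
}
```

This variant assumes the brackets are balanced and that no link contains a line break; the Matcher.find loop is usually easier to reason about.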
