Parsing text from a Java string

Question

I have a String in the following format: -----BEGIN MESSAGE-----, followed by a variable length encrypted session key, followed by a newline, followed by an encrypted message, followed by a newline, followed by a digital signature, followed by -----END MESSAGE-----.

-----BEGIN MESSAGE-----
SNyeWtz8QD8AKdioMG11wu7U6gG2wD9tekvVrx6VYW+6oJj4Wl8NE+7i5MHbu4Au
+vN1Z886lOWka7ekgPF8N7t9MpiFo2pBPHuFcOsaY5ETYuEyk5gaX7BYP7qT6wKG
BRILmX6DblWqGxG2tKs/AdcHDqQ5QBXrP03uhN68wgo=

U2FsdGVkX18gtpQSqyH4H5242SZzcZrb0oH7FWw7/MSCxo7h7BVaesZV2N38sr9y

kVr+wabiNn4RfAB4nNi9gAZHQLok4uxRMALGF2kZk2zpVNPQo6jcdz85fy68gylX
OCQIIdk8JPIwxzHfVvRZqNHDRADZRlNHUMYScjRPU+DB8avghYAVKMJhLgA/2Tdp
a59uBMBg/yB1yqA5FivxPzOhq92Y4nZuP1R9/yGE9O8K
-----END MESSAGE-----

What is the best way to parse out the three pieces of information (the session key, encrypted message, and digital signature)?

I tried using the Scanner class but I couldn't figure out what to use as the delimiter. I also tried using the Pattern class, but couldn't figure that method out either. Thank you!

I just did something similar. The question is, do you want those three pieces of data in one match with 3 capture groups, or in 3 separate matches?
– Suamere, Commented Apr 10, 2013 at 0:25
Suamere, I want 3 separate matches. Jaynathan, I tried using "\n" as a delimiter, but it wouldn't work because every line is followed by a newline. For example, the encrypted session key is 3 lines long, each line followed by a newline. I even tried using "\n\n" as a delimiter, but that failed as well.
– Luke, Commented Apr 10, 2013 at 0:30

Ted Hopp · Accepted Answer · 2013-04-10 00:36:12Z

1

You actually have newlines embedded in the various parts. What delimits them is the blank line—two newlines in a row. I assume you want each part with the line breaks removed. I'd suggest a brute force approach:

StringBuilder sb = new StringBuilder();
String[] parts = input.split("\\r?\\n\\r?\\n"); // should be 3 long
// strip out header and newlines from session key
String[] lines = parts[0].split("\\r?\\n");
for (int i = 1; i < lines.length; ++i) { // skip first line
    sb.append(lines[i]);
}
parts[0] = sb.toString();
// strip out header and newlines from message
sb.setLength(0);
lines = parts[1].split("\\r?\\n");
for (int i = 0; i < lines.length; ++i) {
    sb.append(lines[i]);
}
parts[1] = sb.toString();
// finally, deal with the signature
sb.setLength(0);
lines = parts[2].split("\\r?\\n");
for (int i = 0; i < lines.length - 1; ++i) {
    sb.append(lines[i]);
}
parts[2] = sb.toString();

Not elegant, but it makes clear what's happening.

An alternative approach would be to use a Scanner to read each line and decide what to do with it. Three lines—the header, the trailer, and a blank line—would have special treatment and affect the processing. Otherwise just append each line as you read it to a StringBuffer.
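A minimal sketch of that Scanner-based alternative might look like the following. The class and method names are made up for illustration, and it assumes the header and trailer lines start with "-----BEGIN " and "-----END " and that stray whitespace around a line can be trimmed safely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper; the name and structure are illustrative only.
final class ScannerMessageParser {

    /** Returns the parts of the message, each with its line breaks removed. */
    static List<String> parse(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                if (line.startsWith("-----BEGIN ")) {
                    continue; // header: nothing to collect yet
                }
                if (line.isEmpty() || line.startsWith("-----END ")) {
                    // a blank line or the trailer closes the current part
                    if (current.length() > 0) {
                        parts.add(current.toString());
                        current.setLength(0);
                    }
                    continue;
                }
                current.append(line); // ordinary line: belongs to the current part
            }
        }
        if (current.length() > 0) { // input ended without a trailer
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "-----BEGIN MESSAGE-----\nAAAA\nBBBB\n\nCCCC\n\nDDDD\nEEEE\n-----END MESSAGE-----\n";
        System.out.println(parse(input));
    }
}
```

Because Scanner.nextLine() recognizes \r\n, \r, and \n on its own, this version needs no line-terminator regex at all.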

answered Apr 10, 2013 at 0:36

Ted Hopp


6 Comments

Luke Over a year ago

I was hoping there would be a way of solving this with regex, but your method most definitely works. Thank you!

Ted Hopp Over a year ago

@Andy - There probably is a way using regex, but I was too lazy to work one out. :)

Suamere Over a year ago

A note for the OP and Ted. In Regex, the best way to get linebreaks is [\r\n]+. This means any combination of linefeeds and carriage returns, in any order, one or more times. It's much cleaner than \r?\n\r?\n, or the "classic" (\r*\n\r*|\n*\r\n*)+. But even cleaner is \s+, especially in this case where there are no spaces within each string, only spaces between them.

Ted Hopp Over a year ago

@Suamere - Good points. However, for my code, the first split must be only on two consecutive line breaks. The single line breaks must not match; however, [\r\n]+ (or \s+) will match a single line break. It won't help to make it [\r\n]{2} (or \s\s) because \r\n will match and you can't make it [\r\n]{4} (or \s{4}) because \r or \n may be missing from the line terminators (depending on the server). For the second splitting, \s+ is perfect. I used \r?\n out of inertia.

Suamere Over a year ago

All of your statements are correct. I only have my thought pattern because as far as I know, there are only single linebreaks for formatted viewing. When parsing an actual cert, if you don't count the Begin and End lines, any whitespace at all separates pieces of data. But I think your answer is more literal toward the example the OP gave, so I can't argue with you there.
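For anyone who, like Luke, was hoping for a regex: here is one possible sketch using Pattern with three capture groups. It keeps Ted's constraint that only a blank line separates the parts. The class name, the EOL and BLANK helper strings, and the choice to return null on a mismatch are all assumptions made for this example, not something from the answers above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class RegexMessageParser {
    // One line terminator (\r\n, \r, or \n), matched atomically so that a
    // single "\r\n" can never be counted as two terminators.
    private static final String EOL = "(?>\\r\\n|\\r|\\n)";
    // A blank line: two terminators with optional spaces or tabs in between.
    private static final String BLANK = EOL + "[ \\t]*" + EOL;

    private static final Pattern MESSAGE = Pattern.compile(
            "-----BEGIN MESSAGE-----\\s*(.+?)" + BLANK + "(.+?)" + BLANK
            + "(.+?)\\s*-----END MESSAGE-----",
            Pattern.DOTALL);

    /** Returns {sessionKey, encryptedMessage, digitalSignature}, or null if the input does not match. */
    static String[] parse(String input) {
        Matcher m = MESSAGE.matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] parts = new String[3];
        for (int i = 0; i < 3; i++) {
            // drop the line breaks embedded in each part
            parts[i] = m.group(i + 1).replaceAll("\\s+", "");
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "-----BEGIN MESSAGE-----\nAAAA\nBBBB\n\nCCCC\n\nDDDD\nEEEE\n-----END MESSAGE-----\n";
        System.out.println(Arrays.toString(parse(input)));
    }
}
```

The atomic group (?>...) is what lets the pattern accept all three terminator styles without treating a lone \r\n as a blank line.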


Suamere · Accepted Answer · 2013-04-11 15:52:40Z

1

Right, remove the Begin and End lines like Sergii said. Then do a regex split against "\s+", e.g. in .NET:

Regex.Split(Regex.Replace(strCert, "(?i)\s*-{5}(BEGIN|END)\sMESSAGE-{5}\s*", ""), "\s+")

That is, assuming the only reason your example has single line breaks within the body of each piece of data is formatting; as far as I know, those don't exist in the actual cert. The actual cert would look like:

-----BEGIN MESSAGE-----
SNyeWtz8QD8AKdioMG11wu7U6gG2wD9tekvVrx6VYW+6oJj4Wl8NE+7i5MHbu4Au+vN1Z886lOWka7ekgPF8N7t9MpiFo2pBPHuFcOsaY5ETYuEyk5gaX7BYP7qT6wKGBRILmX6DblWqGxG2tKs/AdcHDqQ5QBXrP03uhN68wgo=

U2FsdGVkX18gtpQSqyH4H5242SZzcZrb0oH7FWw7/MSCxo7h7BVaesZV2N38sr9y

kVr+wabiNn4RfAB4nNi9gAZHQLok4uxRMALGF2kZk2zpVNPQo6jcdz85fy68gylXOCQIIdk8JPIwxzHfVvRZqNHDRADZRlNHUMYScjRPU+DB8avghYAVKMJhLgA/2Tdpa59uBMBg/yB1yqA5FivxPzOhq92Y4nZuP1R9/yGE9O8K
-----END MESSAGE-----

Ya?
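Since the question is about Java, a rough Java equivalent of the .NET one-liner above could look like this. The class and method names are invented for the example, and it only gives three parts under the assumption stated above: each part sits on a single line with no whitespace inside it.

```java
import java.util.Arrays;

// Hypothetical helper; names are illustrative only.
final class WhitespaceSplit {

    static String[] split(String cert) {
        // Remove the BEGIN/END marker lines together with any surrounding whitespace.
        String body = cert.replaceAll("(?i)\\s*-{5}(BEGIN|END)\\sMESSAGE-{5}\\s*", "");
        // Every remaining run of whitespace is treated as a separator.
        return body.split("\\s+");
    }

    public static void main(String[] args) {
        String cert = "-----BEGIN MESSAGE-----\nAAAA\n\nBBBB\n\nCCCC\n-----END MESSAGE-----\n";
        System.out.println(Arrays.toString(split(cert)));
    }
}
```

If the parts are wrapped across several lines, as in the question's sample, this returns one element per line rather than one per part.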

edited Apr 11, 2013 at 15:52

answered Apr 10, 2013 at 0:34

Suamere


3 Comments

Ted Hopp Over a year ago

This will split at every line break, including those line breaks embedded in each part of the text. OP needs to separate first on blank lines (two consecutive line terminator sequences). For robustness, it needs to work for all varieties of line terminator sequences: \r\n (Windows, HTTP standard), \r (Mac), or \n (Unix).

Suamere Over a year ago

Not true. \s+ will gather one or more whitespace as one. Which means linebreaks and possible spaces on those "empty" lines. So if the Begin and End were manually removed, all that would be left is a collection of the three areas. \s+ will work for all varieties of terminator sequences. The only reason \s+ may not work is if one of those lines of information included whitespace, which according to the rules of certs, never would. But if there also would never be whitespace on the "empty" line, [\r\n]+ would be good too, but \s+ is still more "robust"

Ted Hopp Over a year ago

Good point that line breaks should be absent from the cert, message, and signature. I'm not so sure that this requirement applies to the message, however. Also good point about using \s+ to be robust against white space in the supposedly blank lines between the parts.
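To make the disagreement in these comments concrete, here is a small sketch of splitting only on blank lines while accepting \r\n, \r, or \n, and tolerating spaces or tabs on the "empty" lines. The class name and the EOL helper string are assumptions for illustration. The main method also shows what goes wrong when [\r\n]{2} is applied to Windows-style input.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class BlankLineSplit {
    // One line terminator, matched atomically so "\r\n" is never read as two breaks.
    private static final String EOL = "(?>\\r\\n|\\r|\\n)";
    // One or more blank lines, each of which may contain spaces or tabs.
    private static final Pattern BLANK_LINES =
            Pattern.compile(EOL + "(?:[ \\t]*" + EOL + ")+");

    static String[] split(String text) {
        return BLANK_LINES.split(text);
    }

    public static void main(String[] args) {
        String windows = "AAAA\r\nBBBB\r\n\r\nCCCC";
        // [\r\n]{2} treats every single \r\n as a separator: prints 4
        System.out.println(windows.split("[\\r\\n]{2}").length);
        // the blank-line pattern keeps AAAA and BBBB together: prints 2
        System.out.println(split(windows).length);
    }
}
```

Each returned piece still contains its internal line breaks, so a second pass such as replaceAll("\\s+", "") would be needed to join the wrapped lines.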

Sergii Zagriichuk · Accepted Answer · 2013-04-10 00:26:23Z

0

newline ?

And delete -----BEGIN MESSAGE----- from first value and -----END MESSAGE----- from last value.

answered Apr 10, 2013 at 0:26

Sergii Zagriichuk


1 Comment

Suamere Over a year ago

I agree. If you're just dealing with one certificate and there are three lines, just replace the begin/end message with nothing, then string or regex split by \s+ (I don't believe there are spaces within each string, just between the strings).

Aubin · Accepted Answer · 2013-04-10 00:34:34Z

Code:

public class MessageParser {

   public static void main( String[] args ) {
      String message =
         "-----BEGIN MESSAGE-----\n" +
         "SNyeWtz8QD8AKdioMG11wu7U6gG2wD9tekvVrx6VYW+6oJj4Wl8NE+7i5MHbu4Au\n" +
         "+vN1Z886lOWka7ekgPF8N7t9MoiFo2pBPHuFcOsaY5ETYuEyk5gaX7BYP7qT6wKG\n" +
         "BRILmX6DblWqGxG2tKs/AdcHDqQ5QBXrP03uhN68wgo=\n" +
         "\n" +
         "U2FsdGVkX18gtpQSqyH4H5242gZzcZrb0oH7FWw7/MSCxo7h7BVaesZV2N38sr9y\n" +
         "\n" +
         "kVr+wabiNn4RfAB4nNi9gAZHQLok4uxRMALGF2kZk2zpVNPQo6jcdz85fy68gylX\n" +
         "OCQIIdk8JPIwxzHfVvRZqNHDRFDZRlNHUMYScjRPU+DB8avghYAVKMJhLgA/2Tdp\n" +
         "a59uBMBg/yB1yqA5FivxPzOhq92Y4nZuP1R9/yGE9O8K\n" +
         "-----END MESSAGE-----\n";
      String[] lines = message.split( "\\r?\\n" ); // accept \n or \r\n
      int i = 1; // skip the BEGIN header
      // The bound is checked before lines[i] is read, so malformed input
      // cannot run past the end of the array.
      String sessionKey = "";
      while( i < lines.length && lines[i].length() > 0 ) {
         sessionKey += lines[i++];
      }
      ++i; // skip the blank separator line
      String encryptedMessage = "";
      while( i < lines.length && lines[i].length() > 0 ) {
         encryptedMessage += lines[i++];
      }
      ++i; // skip the blank separator line
      String digitalSignature = "";
      while( i < lines.length && ! lines[i].equals( "-----END MESSAGE-----" )) {
         digitalSignature += lines[i++];
      }
      System.out.println( "sessionKey      : " + sessionKey );
      System.out.println( "encryptedMessage: " + encryptedMessage );
      System.out.println( "digitalSignature: " + digitalSignature );
   }
}

Output:

sessionKey      : SNyeWtz8QD8AKdioMG11wu7U6gG2wD9tekvVrx6VYW+6oJj4Wl8NE+7i5MHbu4Au+vN1Z886lOWka7ekgPF8N7t9MoiFo2pBPHuFcOsaY5ETYuEyk5gaX7BYP7qT6wKGBRILmX6DblWqGxG2tKs/AdcHDqQ5QBXrP03uhN68wgo=
encryptedMessage: U2FsdGVkX18gtpQSqyH4H5242gZzcZrb0oH7FWw7/MSCxo7h7BVaesZV2N38sr9y
digitalSignature: kVr+wabiNn4RfAB4nNi9gAZHQLok4uxRMALGF2kZk2zpVNPQo6jcdz85fy68gylXOCQIIdk8JPIwxzHfVvRZqNHDRFDZRlNHUMYScjRPU+DB8avghYAVKMJhLgA/2Tdpa59uBMBg/yB1yqA5FivxPzOhq92Y4nZuP1R9/yGE9O8K

You forced \n's into the example in order to parse by them, though.

Nathaniel Waisbrot · Accepted Answer · 2013-04-10 00:39:07Z

0

String[] parts = string.split("\r?\n");
// Assumes each part occupies a single line, with blank lines in between.
String sessionKey = parts[1];
String encryptedMessage = parts[3];
String digitalSignature = parts[5];

The \r? allows Windows EOLs (\r\n) or Unix EOLs (\n).

edited Apr 10, 2013 at 0:39

Nathaniel Waisbrot


answered Apr 10, 2013 at 0:29

Ali


2 Comments

Suamere Over a year ago

Newlines can't always be relied upon to be \r\n. \s+ would be perfect with the assertion that there are no spaces within each part itself.

Ted Hopp Over a year ago

@Suamere - Each part has newlines in it. What distinguishes the parts is the blank lines.
