
I am trying to parse this XML:

<?xml version="1.0" encoding="UTF-8"?>
<veranstaltungen>
  <veranstaltung id="201611211500#25045271">
    <titel>Mal- und Zeichen-Treff</titel>
    <start>2016-11-21 15:00:00</start>
    <veranstaltungsort id="20011507">
      <name>Freizeitclub - ganz unbehindert </name>
      <anschrift>Macht los e.V.
Lipezker Straße 48
03048 Cottbus
</anschrift>
      <telefon>xxxx xxxx </telefon>
      <fax>0355 xxxx</fax>
[...]
</veranstaltungen>

As you can see, some of the text values contain extra whitespace or even line breaks. I am having trouble with the text of the anschrift node, because I need it to look up the right location data in a database. The problem is that the returned String is:

Macht los e.V.Lipezker Straße 4803048 Cottbus

instead of:

Macht los e.V. Lipezker Straße 48 03048 Cottbus

I know the correct way to parse it should be with normalize-space(), but I cannot quite work out how to do it. I tried this:

// Does not work; afaik because XPath 1.0 normalizes just the first node
xPath.compile("normalize-space(veranstaltungen/veranstaltung[position()=1]/veranstaltungsort/anschrift/text())");

// Does not work
xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort[normalize-space(anschrift/text())]");

I also tried the solution given here: xpath-normalize-space-to-return-a-sequence-of-normalized-strings

xPathExpression = xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort");
NodeList result = (NodeList) xPathExpression.evaluate(doc, XPathConstants.NODESET);

String normalize = "normalize-space(.)";
xPathExpression = xPath.compile(normalize);

int length = result.getLength();
for (int i = 0; i < length; i++) {
    System.out.println(xPathExpression.evaluate(result.item(i), XPathConstants.STRING));
}

System.out prints:

Macht los e.V.Lipezker Straße 4803048 Cottbus

What am I doing wrong?

Update

I already have a workaround, but this can't be the real solution. The following few lines show how I build the String from the HttpResponse:

try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
  final StringBuilder stringBuilder = new StringBuilder();
  String              line;

  while ((line = reader.readLine()) != null) {
    // stringBuilder.append(line);
    // WORKAROUND: Add a space after each line
    stringBuilder.append(line).append(" ");
  }

  // Work with the read lines
}

I would rather have a solid solution.

  • normalize-space() strips leading and trailing whitespace and converts other sequences of whitespace characters (including newlines) into a single space character. As your result doesn't have a space between the lines of the text content of the anschrift element, something must eat your newlines before normalize-space() gets to do its job. Commented Nov 22, 2016 at 10:35

2 Answers


Originally, you seem to be using the following code for reading the XML:

try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
  final StringBuilder stringBuilder = new StringBuilder();
  String              line;

  while ((line = reader.readLine()) != null) {
    stringBuilder.append(line);
  }

}

This is where your newlines get eaten: readLine() returns each line without its trailing line terminator. If you then parse the contents of the stringBuilder object, you get an incorrect DOM whose text nodes no longer contain the original newlines from the XML, so normalize-space() has nothing left to turn into a space.
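To make the effect visible, here is a minimal sketch that runs the same line-based loop over a shortened, hypothetical copy of the XML from the question and compares the result of normalize-space() with and without it. The class name ReadLineDemo and the helpers joinLines and normalizedAddress are made up for this illustration; only standard JDK classes are used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ReadLineDemo {

    // Hypothetical sample, shortened from the XML in the question (\u00DF is the sharp s).
    static final String XML =
        "<veranstaltungsort><anschrift>Macht los e.V.\n"
        + "Lipezker Stra\u00DFe 48\n"
        + "03048 Cottbus\n"
        + "</anschrift></veranstaltungsort>";

    // Mimics the loop from the question: readLine() drops each line terminator.
    static String joinLines(String text) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    // Parses the XML text and evaluates normalize-space() on the anschrift element.
    static String normalizedAddress(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return XPathFactory.newInstance().newXPath()
            .evaluate("normalize-space(/veranstaltungsort/anschrift)", doc);
    }

    public static void main(String[] args) throws Exception {
        // Newlines already removed before parsing: the lines run together.
        System.out.println(normalizedAddress(joinLines(XML)));
        // Untouched input: each newline becomes a single space.
        System.out.println(normalizedAddress(XML));
    }
}
```

The XPath expression is the same in both calls; only the text handed to the parser differs, which suggests the problem lies in how the input is read rather than in the XPath.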


3 Comments

Didn't know this. Thank you for the info. My solution is then to check whether the line ends with a '>' and, if not, to add a "&#xA;".
Don't do this. You're modifying the input again. Why do you want to do line based reading? Why not parse the input stream as is?
I should get my head clear for a while. You are right. Will do this now.
0

Thanks to Markus's help, I was able to solve the issue. The cause was the readLine() method of BufferedReader discarding line breaks. The following code snippet works for me (maybe it can be improved):

public Document getDocument() throws IOException, ParserConfigurationException, SAXException {

  final HttpResponse response = getResponse(); // returns an HttpResponse
  final HttpEntity   entity   = response.getEntity();
  final Charset      charset  = ContentType.getOrDefault(entity).getCharset();  

  // Not 100% sure if I have to close the InputStreamReader. But I guess so.
  try (InputStreamReader isr = new InputStreamReader(entity.getContent(), charset == null ? Charset.forName("UTF-8") : charset)) {
    return documentBuilderFactory.newDocumentBuilder().parse(new InputSource(isr));
  }
}
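As a possible variation, the byte stream can also be handed to the parser directly, as suggested in the comments above. The sketch below is not the code from this answer: it uses a ByteArrayInputStream as a stand-in for entity.getContent(), and it assumes the encoding named in the XML declaration is correct, so any charset from the HTTP Content-Type header is ignored. The names StreamParseDemo, parse and firstAddress are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class StreamParseDemo {

    // Hypothetical sample, shortened from the XML in the question (\u00DF is the sharp s).
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<veranstaltungen>\n"
        + "  <veranstaltung id=\"201611211500#25045271\">\n"
        + "    <veranstaltungsort id=\"20011507\">\n"
        + "      <anschrift>Macht los e.V.\n"
        + "Lipezker Stra\u00DFe 48\n"
        + "03048 Cottbus\n"
        + "</anschrift>\n"
        + "    </veranstaltungsort>\n"
        + "  </veranstaltung>\n"
        + "</veranstaltungen>\n";

    // Lets the parser read the raw bytes and pick the encoding from the XML declaration.
    static Document parse(InputStream in) throws Exception {
        try (InputStream stream = in) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(stream);
        }
    }

    // Returns the whitespace-normalized address of the first veranstaltung.
    static String firstAddress(Document doc) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate(
            "normalize-space(/veranstaltungen/veranstaltung[1]/veranstaltungsort/anschrift)",
            doc);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for entity.getContent(): the sample encoded as UTF-8 bytes.
        InputStream body = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstAddress(parse(body)));
    }
}
```

If the server's Content-Type charset can disagree with the XML declaration, wrapping the stream in an InputStreamReader with an explicit charset, as in the snippet above, is the safer choice.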
