I have an XML document which after sending it through my XSLT no longer has line breaks before the XML attributes. So for example

<myoutertag one="a"
            two="b"
            three="c">
    <myinnertag four="d"
                five="e"/>
</myoutertag>

would become

<myoutertag one="a" two="b" three="c">
    <myinnertag four="d" five="e"/>
</myoutertag>

This is of course perfectly valid XML, but it's more difficult to read, especially if there are many long attribute values. From what I've read, XSLT cannot preserve these line breaks, because the whitespace between attributes is not significant and is never passed on to the XSLT processor.

So, what I'm looking for now is a command line based pretty printer (usable in Linux) which ideally would only change the document in that it adds line breaks between the attributes. Whether it adds one before the first attribute or not is pretty much irrelevant to me, just as long as it's more easily readable.

What I've tried unsuccessfully so far:

I'm using the input file

<?xml version="1.0" encoding="UTF-8"?>

<myoutertag one="a" two="b" three="c">
    <myinnertag four="d" five="e"/>
</myoutertag>

xmllint --format

I tried both xmllint --format test.xml and cat test.xml | xmllint --format - with the same result:

<?xml version="1.0" encoding="UTF-8"?>
<myoutertag one="a" two="b" three="c">
  <myinnertag four="d" five="e"/>
</myoutertag>

So, the changes are:

  • the line break after the XML declaration is gone
  • the indentation of <myinnertag> was reduced from four spaces to two spaces

I want neither of those changes. This is using libxml version 20706.

xml_pp -s

I tried the styles none, nsgmls, nice, indented, record and record_c. The only one that comes close is nsgmls which will add line breaks, but the result looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<myoutertag
one="a"
two="b"
three="c"
><myinnertag
four="d"
five="e"
/></myoutertag>

So, no indentation and weird line breaking.

xmlstarlet

The output of xmlstarlet fo test.xml is the same as with xmllint. I also tried something like xmlstarlet -ed -P --insert "//@*" -t text -n "" -v "\\n" test.xml, but that resulted in a glibc pointer error. Not surprising, I guess, as I'm trying to add text in between attributes.

tidy

This is the closest I've gotten so far. Running the command tidy -quiet -xml -indent -wrap 1 test.xml gives me:

<?xml version="1.0"
encoding="UTF-8"?>
<myoutertag one="a"
two="b"
three="c">

  <myinnertag four="d"
  five="e"/>
</myoutertag>

So, if I could get it to indent some more before those attributes in new lines that would basically solve my problem (I think).

Any further suggestions?

  • Does the top answer to this StackOverflow question meet your requirements? stackoverflow.com/questions/16090869/… Commented Aug 22, 2014 at 12:40
  • @DWRoelands No, xmllint doesn't add those line breaks, I tried that before I asked here. Commented Aug 22, 2014 at 12:43
  • How about this? pauldeden.com/2009/01/pretty-printing-xml-in-ubuntu-on.html Commented Aug 22, 2014 at 12:44
  • @DWRoelands As far as I can see, xml_pp only offers predefined styles, none of which would change only this one thing. Or am I missing something? Commented Aug 22, 2014 at 12:49
  • I agree with the initial comment from @DWRoelands. I ran cat file.xml | xmllint --format - and obtained exactly what you want. Commented Aug 22, 2014 at 12:59

2 Answers

OK, I've found a solution. The tool I used is called HTML Tidy (well, actually I used jTidy, a port of HTML Tidy to Java, which is therefore portable). The tool offers many configuration options; the one I was looking for is indent-attributes, set to true. In fact, my whole configuration file is:

add-xml-decl: true
drop-empty-paras: false
fix-backslash: false
fix-bad-comments: false
fix-uri: false
input-xml: true
join-styles: false
literal-attributes: true
lower-literals: false
output-xml: true
preserve-entities: true
quote-ampersand: false
quote-marks: false
quote-nbsp: false

indent: auto
indent-attributes: true
indent-spaces: 4
tab-size: 4
vertical-space: true
wrap: 150

char-encoding: utf8
input-encoding: utf8
newline: CRLF
output-encoding: utf8

quiet: true

The meanings of those options are explained in the Tidy manual (or the man page if you install it on a Linux system). I mostly cared about the middle block, which holds the indentation settings.

I can now call the tool using the command java -jar jtidy-r938.jar -config tidy.config test.xml and the output will be

<?xml
  version="1.0"
  encoding="UTF-8"?>
<myoutertag
 one="a"
 two="b"
 three="c">
    <myinnertag
     four="d"
     five="e" />
</myoutertag>

Now I'm happy. :-)
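
If neither Tidy nor jTidy is available, the layout asked for in the question (each attribute on its own line, aligned under the first one) can be approximated with a few lines of Python. This is only a sketch and not what Tidy does internally: the function name break_attributes and the regular expressions are my own, and it assumes well-formed input in which every attribute value is quoted. Comments, CDATA sections and processing instructions (including the XML declaration) are deliberately left alone.

```python
import re

# Either something to leave untouched (comment, CDATA section, processing
# instruction) or a start/empty-element tag that carries attributes.
_TOKEN = re.compile(
    r"""(?P<skip><!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>)
      | <(?P<name>[^\s<>/!?=]+)
        (?P<attrs>(?:\s+[^\s<>/=]+\s*=\s*(?:"[^"]*"|'[^']*'))+)
        \s*(?P<close>/?>)""",
    re.DOTALL | re.VERBOSE,
)
# A single name="value" (or name='value') pair.
_ATTR = re.compile(r"""[^\s<>/=]+\s*=\s*(?:"[^"]*"|'[^']*')""", re.DOTALL)


def break_attributes(xml_text: str) -> str:
    """Put each attribute on its own line, aligned under the first one.

    Hypothetical helper: everything outside of start tags is copied verbatim.
    """

    def rewrite(match):
        if match.group("skip") is not None:
            return match.group(0)          # comment / CDATA / PI: unchanged
        attrs = _ATTR.findall(match.group("attrs"))
        if len(attrs) < 2:
            return match.group(0)          # nothing to break up
        # Rebuild the text that precedes the tag on its line as whitespace,
        # so that tabs used for indentation are kept as tabs.
        line_start = xml_text.rfind("\n", 0, match.start()) + 1
        prefix = re.sub(r"\S", " ", xml_text[line_start:match.start()])
        # "<" + tag name + one space brings us to the first attribute column.
        pad = prefix + " " * (len(match.group("name")) + 2)
        joined = ("\n" + pad).join(attrs)
        return "<" + match.group("name") + " " + joined + match.group("close")

    return _TOKEN.sub(rewrite, xml_text)
```

Applied to the contents of test.xml from the question, break_attributes should keep the XML declaration, the blank line after it and the four-space indentation as they are, and only split the attribute lists. Because it works on the text rather than on a parsed tree, it will not catch malformed input; running xmllint --noout first is a reasonable safeguard.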

1 Comment

I compared jTidy (latest git version) with tidy and found that the latter has more options by now. Most importantly (for my needs), it has the --sort-attributes option.

Try xml_pp with style "cvs"

xml_pp has a style called "cvs". This is designed to be nice with line-based diffing.

("CVS" was a popular versioning system before git came along: https://en.wikipedia.org/wiki/Concurrent_Versions_System)

Input:

$ cat compact.xml
<myoutertag one="a" two="b" three="c">
    <myinnertag four="d" five="e"/>
</myoutertag>

Available styles:
There are 11 styles in my version of xml_pp:

$ which xml_pp | xargs head -n2
#!/usr/bin/perl -w
# $Id: /xmltwig/trunk/tools/xml_pp/xml_pp 32 2008-01-18T13:11:52.128782Z mrodrigu  $

$ xml_pp -s help
usage: /usr/local/bin/xml_pp [-v] [-i<extension>] [-s (none|nsgmls|nice|indented|indented_close_tag|indented_c|wrapped|record_c|record|cvs|indented_a)] [-p <tag(s)>] [-e <encoding>] [-l] [-f <file>] [<files>] at /usr/local/bin/xml_pp line 100.

$ xml_pp -s help 2>&1 | cut -d"(" -f2 | cut -d")" -f1 | tr "|" "\n" | nl
     1  none
     2  nsgmls
     3  nice
     4  indented
     5  indented_close_tag
     6  indented_c
     7  wrapped
     8  record_c
     9  record
    10  cvs
    11  indented_a

Style "cvs" applied:

$ cat compact.xml | xml_pp -s cvs
<myoutertag
    one="a"
    three="c"
    two="b">
  <myinnertag
      five="e"
      four="d"
  />
</myoutertag>

From the XML::Twig documentation: (dead link replaced).
FYI: The page misspells indented_a as idented_a with a missing "n".

There are several rules inside that Oracle document. I guess only rules 5 and 7 are relevant for automatic formatting:

Writing Version Controllable XML
Rule 1: Avoid making implicit changes
Rule 2: Avoid marking files "dirty" when they have not changed
Rule 3: Don't arbitrarily reorder elements inside an XML file
Rule 4: Don't reformat XML files if the user can manually edit them
Rule 5: Separate XML elements and attributes onto separate lines
Rule 6: Avoid magic values
Rule 7: Sort elements to avoid conflicts
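
To make rule 5 and the attribute sorting visible in the "cvs" output more concrete, here is a rough Python sketch of a diff-friendly serializer. It is not how xml_pp or XML::Twig implement the style and it does not reproduce their exact layout. The function name diff_friendly is made up, and the sketch assumes a document without namespaces and without mixed content (text that follows a child element is dropped).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def diff_friendly(elem, depth=0, indent="  "):
    """Serialize elem with one attribute per line, sorted by name.

    Hypothetical helper: ignores namespaces and tail text.
    """
    pad = indent * depth
    attr_pad = pad + indent * 2            # attributes sit deeper than children
    open_tag = pad + "<" + elem.tag
    for name, value in sorted(elem.attrib.items()):
        open_tag += "\n" + attr_pad + name + "=" + quoteattr(value)

    children = list(elem)
    text = (elem.text or "").strip()
    if not children and not text:
        return open_tag + "/>"             # empty element
    if not children:
        return open_tag + ">" + escape(text) + "</" + elem.tag + ">"

    parts = [open_tag + ">"]
    if text:
        parts.append(pad + indent + escape(text))
    parts.extend(diff_friendly(child, depth + 1, indent) for child in children)
    parts.append(pad + "</" + elem.tag + ">")
    return "\n".join(parts)


root = ET.fromstring(
    '<myoutertag one="a" two="b" three="c">'
    '<myinnertag four="d" five="e"/></myoutertag>'
)
print(diff_friendly(root))
```

With one attribute per line in a fixed order, adding or changing a single attribute shows up as a one-line change in a line-based diff, which is the point of the style.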

Installing XML::Twig was hard

BTW: I installed XML::Twig on Windows 10 inside MobaXterm, and this was very painful. I had to keep going back and forth between MobaXterm and Google, installing prerequisites of prerequisites of prerequisites. Try, fail, google, install something. Try again, fail again, install some other lib, repeat. Not fun.
