Strip HTML tags from libreoffice command line conversion

Question

I'm trying to convert an HTML file to a TXT file on my Linux server. The conversion works, but the output still contains the HTML tags. Is there a command-line option that strips all HTML tags during the conversion?

libreoffice4.2 --headless --convert-to txt 2000.html 2000.txt

Opening the file in the LibreOffice GUI and saving it as TXT already strips the HTML, so there must be a way to do the same from the command line.

I think I've found something: using the sed command with a regular expression to strip the tags from the HTML file instead of using LibreOffice. Will report back if it works.
– Warface, Commented Jul 20, 2014 at 3:17
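
The comment does not say which expression was used, so the following is only a minimal sketch of what a sed-based approach might look like. The helper name strip_tags and the sample file are made up for illustration. It assumes every tag opens and closes on the same line, and it does not decode entities such as &amp; or drop the contents of script and style elements.

```shell
# Hypothetical helper (not from the thread): delete anything that looks
# like an HTML tag. Assumes each tag starts and ends on the same line.
strip_tags() {
    sed -e 's/<[^>]*>//g' "$1"
}

# Try it on a throwaway sample file.
sample=$(mktemp)
printf '<p>Hello <b>world</b></p>\n' > "$sample"
strip_tags "$sample"    # prints: Hello world
rm -f "$sample"
```

For the file in the question this would be run as strip_tags 2000.html > 2000.txt. A regex is fine for simple pages, but a real HTML parser is safer for anything complicated.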

François Bruneau · Accepted Answer · 2014-07-18 16:03:51Z

1

You need to tell LibreOffice which filter to use for the conversion (see http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/):

libreoffice4.2 --headless --convert-to txt:text 2000.html

2 Comments

Warface Over a year ago

Hmm... That's the same command line I'm using, except that you didn't add the output filename. The HTML tags are still in the file after conversion, so sorry, no good.

François Bruneau Over a year ago

The other difference is the addition of ":text" after "txt". This tells LibreOffice to use the "text" filter, which actually takes care of removing the HTML tags. As for the output filename, it is ignored by LibreOffice. It basically takes the name of the source file, replacing the extension according to the target filetype.
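
To illustrate the naming rule described in the comment above, here is a rough sketch. The helper expected_output and the target name converted.txt are hypothetical. It assumes the converted file lands in the current working directory, which is what happens when no --outdir is given.

```shell
# Hypothetical helper (not part of LibreOffice): predict the output name,
# i.e. the source file's base name with its extension replaced.
expected_output() {
    name=$(basename "$1")
    printf '%s.%s\n' "${name%.*}" "$2"
}

expected_output 2000.html txt    # prints: 2000.txt

# Only attempt the conversion if the binary and the input file exist.
if command -v libreoffice4.2 >/dev/null 2>&1 && [ -f 2000.html ]; then
    libreoffice4.2 --headless --convert-to txt:text 2000.html &&
        mv "$(expected_output 2000.html txt)" converted.txt
fi
```

Since the output filename on the command line is ignored, renaming afterwards (as the mv does here) is one way to get a different name.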

z-- · Accepted Answer · 2014-07-19 19:59:48Z

0

An alternative might be to use pandoc.
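
The answer gives no command, so this is just a sketch of how that could look, assuming pandoc is installed. The wrapper name html_to_plain is made up. It uses pandoc's html reader and plain writer.

```shell
# Hypothetical wrapper (not from the answer): convert an HTML file to
# plain text with pandoc's html reader and plain writer.
html_to_plain() {
    pandoc --from html --to plain "$1" --output "$2"
}

# Only run if pandoc is installed and the input file exists.
if command -v pandoc >/dev/null 2>&1 && [ -f 2000.html ]; then
    html_to_plain 2000.html 2000.txt
fi
```

Unlike LibreOffice, pandoc lets you choose the output filename directly.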
