
I'm trying to convert an HTML file to a TXT file on my Linux server. The conversion itself works fine, but the output still contains the HTML tags. Is there a command that strips all HTML tags during the conversion?

libreoffice4.2 --headless --convert-to txt 2000.html 2000.txt

Opening the file in the LibreOffice GUI and saving it as TXT already strips the HTML, so there must be a way to do the same from the command line.

  • I think I've found something: using sed with a regular expression to strip the tags from the HTML file instead of using LibreOffice. I'll report back on whether it works. Commented Jul 20, 2014 at 3:17
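The comment above does not show the actual sed command, so the following is only a rough sketch of what such an approach might look like. The helper name `strip_tags` is hypothetical. It assumes every tag opens and closes on the same line, and it does not decode entities such as `&amp;` or drop the contents of `<script>` and `<style>` elements.

```shell
# Hypothetical helper (not from the thread): print a file with anything
# that looks like an HTML tag removed.
strip_tags() {
    # Delete every "<...>" run that starts and ends on the same line.
    sed -e 's/<[^>]*>//g' "$1"
}

# Usage, with the file names from the question:
#   strip_tags 2000.html > 2000.txt
```

For anything beyond simple markup, a real HTML-aware converter is likely to give cleaner results than a regular expression.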

2 Answers


You need to tell LibreOffice which filter to use for the conversion (see http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/):

libreoffice4.2 --headless --convert-to txt:text 2000.html

2 Comments

Hmm... That's the same command line I used, except that you didn't add the output filename. The HTML tags are still in the file after conversion, so sorry, no good.
The other difference is the ":text" added after "txt". This tells LibreOffice to use the "text" filter, which takes care of removing the HTML tags. As for the output filename, LibreOffice ignores it: it takes the name of the source file and replaces the extension according to the target file type.
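If a specific output name is still wanted, one possible workaround is to let LibreOffice write its derived name into a scratch directory and then rename the result. This is a sketch, not something from the thread: the helpers `converted_name` and `html_to_txt` are hypothetical, and it assumes the `libreoffice4.2` binary from the question accepts the `--outdir` option.

```shell
# Hypothetical helper: the file name LibreOffice is expected to produce,
# i.e. the source basename with its extension replaced by ".txt".
converted_name() {
    base=$(basename "$1")
    printf '%s.txt\n' "${base%.*}"
}

# Hypothetical wrapper: convert SRC with the "text" filter into a scratch
# directory, then move the result to the name the caller asked for.
html_to_txt() {
    src=$1
    dest=$2
    tmpdir=$(mktemp -d) || return 1
    libreoffice4.2 --headless --convert-to txt:text --outdir "$tmpdir" "$src" &&
        mv "$tmpdir/$(converted_name "$src")" "$dest"
    status=$?
    rm -rf "$tmpdir"
    return $status
}

# Usage:
#   html_to_txt 2000.html output.txt
```

Writing to a scratch directory first avoids overwriting an unrelated `2000.txt` that might already sit next to the source file.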

An alternative might be to use pandoc.
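The answer gives no command, so here is a minimal sketch of how a pandoc conversion might be invoked, assuming pandoc is installed. The wrapper name `html_to_txt_pandoc` is hypothetical. The `plain` output format produces text without markup.

```shell
# Hypothetical wrapper: convert an HTML file to plain text with pandoc.
# $1 is the HTML source, $2 is the text file to write.
html_to_txt_pandoc() {
    pandoc --from html --to plain --output "$2" "$1"
}

# Usage, with the file names from the question:
#   html_to_txt_pandoc 2000.html 2000.txt
```

Unlike LibreOffice, pandoc takes the output filename explicitly, so no renaming step is needed.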
