
I'm trying to display an HTML preview of a .msg file, using Java on the backend. When I read the file with Apache POI, I only get the body as plain text or as RTF. The RTF does contain the HTML content, but I can't work out how to convert it back into HTML for the email preview on my website. I tried parsing the RTF with Apache Tika, but that didn't work. I'm currently using the method detailed in a Stack Overflow thread, but it doesn't recover all of the HTML tags. Any suggestions on how to resolve this?

Here is the code I am using:

    public static String rtfToHtml(String rtfText)
    {
        StringBuilder sb = new StringBuilder();

        if (rtfText != null)
        {
            String[] lignes = rtfText.split("[\\r\\n]+");
            for (String ligne : lignes)
            {
                String tempLine = ligne.replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*([^}]*)\\}", "$1")
                    .replaceAll("\\\\htmlrtf0([^\\\\]*)\\\\htmlrtf", "$1")
                    .replaceAll("\\\\htmlrtf \\{(.*)\\}\\\\htmlrtf0", "$1")
                    .replaceAll("\\\\htmlrtf (.*)\\\\htmlrtf0", "")
                    .replaceAll("\\\\htmlrtf[0]?", "")
                    .replaceAll("\\\\field\\{\\\\\\*\\\\fldinst\\{[^}]*\\}\\}", "")
                    .replaceAll("\\{\\\\fldrslt\\\\cf1\\\\ul([^}]*)\\}", "$1")
                    .replaceAll("\\\\htmlbase", "")
                    .replaceAll("\\\\\\*\\\\bkmkstart BM_", "")
                    .replaceAll("\\\\par", "\n")
                    .replaceAll("\\\\tab", "\t")
                    .replaceAll("\\\\line", "\n")
                    .replaceAll("\\\\page", "\n\n")
                    .replaceAll("\\\\sect", "\n\n")
                    .replaceAll("\\\\emdash", "&#x2014;")
                    .replaceAll("\\\\endash", "&#x2013;")
                    .replaceAll("\\\\emspace", "&#x2003;")
                    .replaceAll("\\\\enspace", "&#x2002;")
                    .replaceAll("\\\\qmspace", "&#x2005;")
                    .replaceAll("\\\\bullet", "&#x2022;")
                    .replaceAll("\\\\lquote", "&#x2018;")
                    .replaceAll("\\\\rquote", "&#x2019;")
                    .replaceAll("\\\\ldblquote", "&#x201C;")
                    .replaceAll("\\\\rdblquote", "&#x201D;")
                    .replaceAll("\\\\row", "\n")
                    .replaceAll("\\\\cell", "|")
                    .replaceAll("\\\\nestcell", "|")
                    .replaceAll("([^\\\\])\\{", "$1")
                    .replaceAll("([^\\\\])}", "$1")
                    .replaceAll("[\\\\](\\{)", "$1")
                    .replaceAll("[\\\\](})", "$1")
                    .replaceAll("\\\\u([0-9]{2,5})", "&#$1;")
                    .replaceAll("\\\\'([0-9A-Fa-f]{2})", "&#x$1;")
                    .replaceAll("\"cid:(.*)@.*\"", "\"$1\"")
                    .replaceAll(" {2,}", " ")
                    .replaceAll("\\\\htmlrtf[1]?(.*)\\\\htmlrtf0", "")
                    .replaceAll("\\\\htmlrtf[01]?", "");

                if (!tempLine.replaceAll("\\s+", "").isEmpty())
                {
                    sb.append(tempLine).append("\r\n");
                }
            }

            rtfText = sb.toString();

            int index = rtfText.indexOf("<html");
            if (index != -1)
            {
                return rtfText.substring(index);
            }
        }

        return null;
    }

    InputStream inputStream = new FileInputStream(file);
    MAPIMessage msgMessage = new MAPIMessage(inputStream);
    return rtfToHtml(msgMessage.getRtfBody());
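
Before converting, it may be worth confirming that the RTF body really encapsulates HTML. RTF that Outlook generates from an HTML body is normally marked with the `\fromhtml1` control word in its header, while RTF generated from a plain-text body carries `\fromtext` instead. The check below is only a sketch: the class name `RtfHtmlCheck`, the method name, and the 512-character header window are illustrative assumptions, not part of Apache POI.

```java
class RtfHtmlCheck {

    /**
     * Returns true if the RTF header contains the \fromhtml1 control word.
     * Hypothetical helper; the 512-character window is an arbitrary choice.
     */
    static boolean encapsulatesHtml(String rtf) {
        if (rtf == null) {
            return false;
        }
        // The marker is expected near the start of the document, so only the head is scanned.
        String head = rtf.substring(0, Math.min(rtf.length(), 512));
        return head.contains("\\fromhtml1");
    }
}
```

If this returns false for a given message, the RTF was probably not produced from HTML, and no amount of tag extraction will yield an HTML body from it.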

I also tried Apache Tika, but it could not parse the content and returned empty HTML:

    public String convertUsingApacheTika(String rtfString) throws TikaException, IOException, SAXException
    {
        // Wrap the RTF text as a stream for Tika (assumes the RTF is effectively ASCII/UTF-8)
        InputStream stream = TikaInputStream.get(rtfString.getBytes(StandardCharsets.UTF_8));
        // Note: BodyContentHandler collects the body as plain text, not HTML markup; -1 disables the write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(stream, handler, metadata);

        return handler.toString();
    }
  • Tika should work. Have you tried it with different RTF documents? Commented Jan 16, 2024 at 13:17
  • *.msg files are not the best choice for email display. If you can get the emails as *.eml (perhaps by pulling them directly from the mail box server), you can use Jakarta Mail to parse them, and to extract the HTML portion (as HTML). This HTML is (usually) complete and can be displayed as is. Commented Jan 16, 2024 at 14:03
  • @Sam Yes, I tried with different RTF documents (all extracted from .msg files). Also, I need to parse MSG attachments and can't pull them from the mailbox server. Commented Jan 16, 2024 at 14:09
  • Parsing RTF-wrapped HTML is a lot more involved than that. Parsing code runs a couple thousand lines in Redemption (I am its author) - you can try to use it (it requires Outlook/MAPI system to be installed): RDOSession.MessageFromMsgFile / RDOMail.HTMLBody would give you the HTML. Or you can save it in MHTML (if you want to handle embedded HTML images) or EML formats using RDOMail.SaveAs. Commented Jan 16, 2024 at 16:17
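
To illustrate the last comment's point that per-line regexes are too weak for this, here is a minimal sketch of a character-by-character approach. It is a simplification under stated assumptions, not a complete de-encapsulator: the class `RtfHtmlSketch` and its methods are hypothetical names, `\'hh` bytes are decoded as Latin-1 rather than via the document's `\ansicpg` code page, one fallback character is assumed after each `\uN`, and every `{\*...}` destination other than `\htmltag` (including `\mhtmltag`) is skipped.

```java
class RtfHtmlSketch {

    /** Returns the de-encapsulated HTML, or null if no {\*\htmltag group is found. */
    static String extractHtml(String rtf) {
        if (rtf == null) return null;
        StringBuilder out = new StringBuilder();
        boolean started = false;  // set at the first {\*\htmltag group; header text is dropped
        boolean suppress = false; // true between \htmlrtf and \htmlrtf0 (RTF-only content)
        int i = 0, n = rtf.length();
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{') {
                if (rtf.startsWith("{\\*\\htmltag", i)) {
                    started = true;
                    i += 11; // length of {\*\htmltag
                    while (i < n && Character.isDigit(rtf.charAt(i))) i++;
                    if (i < n && rtf.charAt(i) == ' ') i++; // delimiter space
                } else if (rtf.startsWith("{\\*", i)) {
                    i = skipGroup(rtf, i); // other ignorable destination
                } else {
                    i++; // ordinary group start
                }
            } else if (c == '}' || c == '\r' || c == '\n') {
                i++; // group end, or a raw line break (not significant in RTF)
            } else if (c == '\\' && i + 1 < n) {
                char d = rtf.charAt(i + 1);
                if (d == '\\' || d == '{' || d == '}') {
                    if (started && !suppress) out.append(d); // escaped literal
                    i += 2;
                } else if (d == '\'' && i + 3 < n) {
                    int hi = Character.digit(rtf.charAt(i + 2), 16);
                    int lo = Character.digit(rtf.charAt(i + 3), 16);
                    // Assumption: treat the byte as Latin-1
                    if (hi >= 0 && lo >= 0 && started && !suppress) out.append((char) (hi * 16 + lo));
                    i += 4;
                } else if (isAlpha(d)) {
                    int j = i + 1;
                    while (j < n && isAlpha(rtf.charAt(j))) j++;
                    String word = rtf.substring(i + 1, j);
                    int k = j;
                    if (k + 1 < n && rtf.charAt(k) == '-' && Character.isDigit(rtf.charAt(k + 1))) k++;
                    while (k < n && Character.isDigit(rtf.charAt(k))) k++;
                    int param = k > j ? Integer.parseInt(rtf.substring(j, k)) : 1;
                    i = (k < n && rtf.charAt(k) == ' ') ? k + 1 : k; // consume delimiter space
                    if (word.equals("htmlrtf")) {
                        suppress = param != 0;
                    } else if (started && !suppress) {
                        if (word.equals("par") || word.equals("line")) {
                            out.append("\r\n");
                        } else if (word.equals("tab")) {
                            out.append('\t');
                        } else if (word.equals("u")) {
                            out.append((char) (param < 0 ? param + 65536 : param));
                            // Skip the single fallback character that follows \uN
                            if (rtf.startsWith("\\'", i)) i = Math.min(n, i + 4);
                            else if (i < n && "\\{}".indexOf(rtf.charAt(i)) < 0) i++;
                        }
                        // all other control words are formatting and are ignored
                    }
                } else {
                    i += 2; // other control symbols (\~, \-, \* ...) are ignored here
                }
            } else {
                if (started && !suppress) out.append(c);
                i++;
            }
        }
        return started ? out.toString() : null;
    }

    private static boolean isAlpha(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    /** Skips a whole group starting at the '{' at index i; returns the index after its '}'. */
    private static int skipGroup(String s, int i) {
        int depth = 0;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') i++;                         // skip the escaped character
            else if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i + 1;
        }
        return i;
    }
}
```

With this sketch, the call `rtfToHtml(msgMessage.getRtfBody())` could be swapped for `RtfHtmlSketch.extractHtml(msgMessage.getRtfBody())`. Because the input is consumed as one stream, tags that span several lines or contain escaped braces are not cut apart the way they can be when each line is processed independently.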
