2

I need to decode HTML into plain text. I know that there are a lot of questions like this but I noticed one problem with those solutions and don't know how to solve it.

For example we have this piece of HTML: <h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>

I tried regex solutions and the HttpUtility.HtmlDecode method, and all of them give this output: Some textSome more text. Words get joined where they should be separate. Is there a way to decode the string without merging words?

8
  • You can take a substring to take all strings after ">" and all strings before "<" Commented Feb 8, 2019 at 13:02
  • What would you want to use to separate the two phrases? What would determine when one phrase ends and the next begins? Commented Feb 8, 2019 at 13:02
  • html-agility-pack.net will allow you to parse HTML pretty successfully and gain access to all parts of the HTML (including tags and inner text). Commented Feb 8, 2019 at 13:03
  • Space between words would work for me. Just want to make sure words don't get blended. Commented Feb 8, 2019 at 13:03
  • 1
    RegEx is not a good answer for this. Sure you might find you can get it to work 99% of the time, but HTML is not XML. It's too irregular for regular expressions. Commented Feb 8, 2019 at 13:05

4 Answers

4

It's not clear what separator you want between things that were not separated in the first place, so I used a newline (\n).
The Where(x => !string.IsNullOrWhiteSpace(x)) filter removes empty elements, which would otherwise produce runs of \n\n in a more complex HTML document.

using System.Linq;
using HtmlAgilityPack;

var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);

// Join the text of each top-level node with a newline, skipping empty nodes such as <p><br></p>.
var result = string.Join(
    "\n",
    htmlDocument
        .DocumentNode
        .ChildNodes
        .Select(x => x.InnerText)
        .Where(x => !string.IsNullOrWhiteSpace(x))
);

Result:

"Some text\nSome more text"
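
If the markup is nested more deeply, joining only the top-level nodes can still merge words inside a single node. The following is a minimal sketch of one possible variation, not part of the original answer: it collects every text node in the document and joins them with a space, which is the separator the asker mentioned in the comments. The class and method names (HtmlText, ToPlainText) are hypothetical; the sketch assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlText
{
    // Hypothetical helper: joins every non-empty text node with a single space.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = doc.DocumentNode
            .Descendants()                                   // all nodes, at any depth
            .Where(n => n.NodeType == HtmlNodeType.Text)     // keep text nodes only
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode entities such as &amp;
            .Where(t => t.Length > 0);                       // drop whitespace-only nodes

        return string.Join(" ", parts);
    }
}
```

For the input from the question this should give Some text Some more text. Note the trade-off: a space is also inserted between adjacent inline fragments (for example <b>Some</b>thing becomes Some thing), and the text of <script> or <style> elements is not filtered out, so this is only suitable when that behavior is acceptable.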

2

An easy way to do it is to use the HTML Agility Pack:

HtmlDocument htmlDocument = new HtmlDocument();
// LoadHtml parses an HTML string; Load expects a file path or stream.
htmlDocument.LoadHtml(htmlString);
string res = htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTERESTING ELEMENT").InnerText;

2 Comments

This gives the same result, Some textSome more text, while the expected result is Some text Some more text.
@Sparrow So you should either 1. choose the HTML element that contains them both, or 2. choose each one of them and concatenate the strings. But that's not the elegant way to do it.

0

You can use something like the following. In this sample I have used a newline to separate the inner text; hopefully you can adapt it to suit your scenario.

public static string GetPlainTextFromHTML(string inputText)
    {
        // Extracted plain text
        var plainText = string.Empty;

        if(string.IsNullOrWhiteSpace(inputText))
        {
            return plainText;
        }

        var htmlNote = new HtmlDocument();
        htmlNote.LoadHtml(inputText);

        var nodes = htmlNote.DocumentNode.ChildNodes;
        if(nodes == null)
        {
            return plainText;
        }

        StringBuilder innerString = new StringBuilder();

        // Append each top-level node's text, followed by a newline
        foreach (HtmlNode node in nodes)
        {
            innerString.Append(node.InnerText);
            innerString.Append("\n");
        }

        plainText = innerString.ToString();
        return plainText;
    }

-1

You can use a regex to match the block-level tags and replace them with a separator: <(div|/div|br|p|/p)[^>]{0,}>
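
The answer does not say how the pattern is meant to be applied, so the following is only a sketch under one assumption: the matched tags are replaced with a space, any remaining tags are then stripped, and whitespace is collapsed. The names RegexStrip and StripTags are hypothetical. As the comments on the question point out, regular expressions are fragile for real-world HTML, so treat this as an illustration rather than a robust solution.

```csharp
using System.Text.RegularExpressions;

public static class RegexStrip
{
    // Hypothetical helper illustrating one way the pattern above could be used.
    public static string StripTags(string html)
    {
        // 1. Replace the tags matched by the answer's pattern with a space.
        var spaced = Regex.Replace(html, "<(div|/div|br|p|/p)[^>]{0,}>", " ");

        // 2. Remove every remaining tag (e.g. <h1>, <strong>) without a separator.
        var noTags = Regex.Replace(spaced, "<[^>]+>", string.Empty);

        // 3. Collapse runs of whitespace and trim the ends.
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }
}
```

For the input from the question this produces Some text Some more text, but only because a <p> happens to follow the heading: the pattern does not include h1, so two adjacent headings would still be merged. The pattern also matches any tag whose name merely starts with p or br (such as <pre>), and HTML entities are left undecoded.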

1 Comment

Hi, you might want to read: stackoverflow.com/a/1732454/1336590
