How to decode HTML into string?

Question

I need to decode HTML into plain text. I know that there are a lot of questions like this, but I noticed one problem with those solutions and don't know how to solve it.

For example, we have this piece of HTML: <h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>

I tried regex solutions and the HttpUtility.HtmlDecode method, and all of them give this output: Some textSome more text. Words get merged where they should stay separate. Is there a way to decode the string without merging words?

You can take a substring to take all strings after ">" and all strings before "<"
– Adas, Commented Feb 8, 2019 at 13:02
What would you want to use to separate the two phrases? What would determine when one phrase ends and the next begins?
– Andy G, Commented Feb 8, 2019 at 13:02
html-agility-pack.net will allow you to parse HTML pretty successfully and gain access to all parts of the HTML (including tags and inner text).
– Neil, Commented Feb 8, 2019 at 13:03
Space between words would work for me. Just want to make sure words don't get blended.
– PovilasZ, Commented Feb 8, 2019 at 13:03
RegEx is not a good answer for this. Sure you might find you can get it to work 99% of the time, but HTML is not XML. It's too irregular for regular expressions.
– Neil, Commented Feb 8, 2019 at 13:05

Drag and Drop · Accepted Answer · 2019-02-08 13:51:14Z

4

It's not clear what separator you want between things that were not separated in the first place, so I used a newline (\n).
Where(x => !string.IsNullOrWhiteSpace(x)) removes the empty elements, which would otherwise produce a lot of \n\n in a more complex HTML document.

// Requires: using System.Linq; using HtmlAgilityPack;
var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);

// Join the inner text of each top-level node, skipping empty ones
var result = string.Join(
    "\n",
    htmlDocument
        .DocumentNode
        .ChildNodes
        .Select(x => x.InnerText)
        .Where(x => !string.IsNullOrWhiteSpace(x))
);

Result:

"Some text\nSome more text"
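
One caveat with the snippet above: ChildNodes only covers the top-level nodes, so if the same markup were wrapped in a single <div>, the InnerText of that one node would run the words together again. A minimal sketch of one way around that, assuming the HtmlAgilityPack NuGet package is installed, is to collect every text node in the document and join them with a separator. The HtmlText class and JoinTextNodes method are hypothetical names made up for this illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public static class HtmlText
{
    // Hypothetical helper: joins all non-empty text nodes with a separator.
    public static string JoinTextNodes(string html, string separator = " ")
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = doc.DocumentNode
            .Descendants()                                  // every node at any depth
            .Where(n => n.NodeType == HtmlNodeType.Text)    // keep text nodes only
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);                      // drop whitespace-only nodes

        return string.Join(separator, parts);
    }
}
```

For the question's markup wrapped in a <div>, this should yield "Some text Some more text". The trade-off is that text split across inline tags, such as Some<b>thing</b>, also gets a separator inserted, so this is only a starting point.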


Or Yaacov · Accepted Answer · 2019-02-08 13:04:34Z

2

An easy way to do it is to use the HTML Agility Pack:

var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlString); // LoadHtml parses a string; Load expects a file path or stream
string res = htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTERESTING ELEMENT").InnerText;


2 Comments

PovilasZ Over a year ago

This gives the same result, "Some textSome more text", while the expected result is "Some text Some more text".

Or Yaacov Over a year ago

@Sparrow So you should either 1. choose the HTML element that contains them both, or 2. choose each one of them and concatenate the strings. But that's not an elegant way to do it.

Hasitha · Accepted Answer · 2019-02-08 13:24:53Z

You can use something like the following. In this sample I have used a new line to separate the inner text; hope you can adapt this to suit your scenario.

// Requires: using System.Text; using HtmlAgilityPack;
public static string GetPlainTextFromHTML(string inputText)
{
    // Extracted plain text
    var plainText = string.Empty;

    if (string.IsNullOrWhiteSpace(inputText))
    {
        return plainText;
    }

    var htmlNote = new HtmlDocument();
    htmlNote.LoadHtml(inputText);

    var nodes = htmlNote.DocumentNode.ChildNodes;
    if (nodes == null)
    {
        return plainText;
    }

    StringBuilder innerString = new StringBuilder();

    // Append the inner text of each top-level node, followed by a newline
    foreach (HtmlNode node in nodes)
    {
        innerString.Append(node.InnerText);
        innerString.Append("\n"); // a real newline, not the two characters '\' and 'n'
    }

    plainText = innerString.ToString();
    return plainText;
}

Magnus Dot · Accepted Answer · 2019-02-08 13:03:34Z

-1

You can use a regex: <(div|/div|br|p|/p)[^>]{0,}>
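
The answer gives only the pattern, so here is a rough sketch of how it might be applied, using nothing beyond the .NET base class library. The RegexHtmlStripper class and ToPlainText method are hypothetical names, and the extra steps (stripping the remaining tags, decoding entities, collapsing whitespace) are assumptions about what is wanted, not part of the answer. As the comments on the question point out, regular expressions are fragile on real-world HTML, so treat this as suitable for simple, trusted fragments at most.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexHtmlStripper
{
    // Pattern from the answer: matches div, br and p tags
    // (it also matches any other tag whose name starts with "p", e.g. <pre>).
    private static readonly Regex BlockTags =
        new Regex(@"<(div|/div|br|p|/p)[^>]{0,}>", RegexOptions.IgnoreCase);

    // Assumption: every remaining tag is simply dropped.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
        {
            return string.Empty;
        }

        var text = BlockTags.Replace(html, " ");    // block-level tags become spaces
        text = AnyTag.Replace(text, string.Empty);  // other tags are removed outright
        text = WebUtility.HtmlDecode(text);         // &amp; -> &, &lt; -> <, and so on
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

For the markup in the question this returns "Some text Some more text". Note that the separation there comes from the <p> tags; headings such as <h1> are not in the pattern, so two adjacent headings would still be merged unless the tag list is extended.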


1 Comment

Corak Over a year ago

Hi, you might want to read: stackoverflow.com/a/1732454/1336590
