I wrote a C# console application that is supposed to display the HTML source of a page.

Instead, it prints HtmlAgilityPack.HtmlDocument.

Can anyone explain why?

class Program
{
    public HtmlDocument read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            return document;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;     
        }
    }     

    static void Main(string[] args)
    {
        Program dis = new Program();
        string text = Convert.ToString(dis.read());
        Console.WriteLine(text);
        Console.ReadLine();        
    }
}
  • The output is "HtmlAgilityPack.HtmlDocument". Commented Jul 3, 2013 at 15:28
  • I don't know the HtmlDocument model, but clearly its ToString() is not implemented to return the HTML. You will need to inspect its properties and use the one that contains the source. Commented Jul 3, 2013 at 15:30
  • Possible duplicate: stackoverflow.com/questions/5599012/… Commented Jul 3, 2013 at 15:33
  • How do I then convert the document to a string? Commented Jul 3, 2013 at 15:33

2 Answers

3

Replace

 return document;

with:

 return document.DocumentNode.InnerHtml;

or, if you want to extract only the text (without the HTML tags):

 return document.DocumentNode.InnerText;

The whole code, with the return type of read() changed from HtmlDocument to string, would be:

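using System;
using HtmlAgilityPack;

class Program
{
    // Downloads the page and returns its HTML as a string (null on failure).
    public string read()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        try
        {
            HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.yahoo.com");
            // InnerHtml of the root node is the markup of the whole document.
            return document.DocumentNode.InnerHtml;
        }
        catch (Exception e)
        {
            Console.WriteLine("Error : " + e.ToString());
            return null;
        }
    }

    static void Main(string[] args)
    {
        Program dis = new Program();
        // read() already returns a string, so no Convert.ToString() is needed.
        string text = dis.read();
        Console.WriteLine(text);
        Console.ReadLine();
    }
}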
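
To see the difference between the two properties without hitting the network, here is a minimal sketch that parses a small literal string. The sample markup and the class name are made up for illustration; it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

class InnerHtmlVersusInnerText
{
    static void Main()
    {
        // Parse an in-memory string instead of downloading a page.
        var document = new HtmlDocument();
        document.LoadHtml("<p>Hello <b>world</b></p>");

        // Markup of everything under the document root, tags included.
        Console.WriteLine(document.DocumentNode.InnerHtml);

        // Text content only, with the tags stripped.
        Console.WriteLine(document.DocumentNode.InnerText);
    }
}
```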

2

The default implementation of .ToString(), inherited from System.Object, just returns the fully qualified name of the type, which is what you're seeing. So HtmlDocument from the HtmlAgilityPack evidently doesn't override it.
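
A small self-contained sketch of that behaviour, with no HtmlAgilityPack involved. PlainPage and PrintablePage are hypothetical types invented for this illustration:

```csharp
using System;

namespace ToStringDemo
{
    // Hypothetical type that does NOT override ToString().
    class PlainPage
    {
        public string Html = "<p>hi</p>";
    }

    // Hypothetical type that overrides ToString() to return its content.
    class PrintablePage
    {
        public string Html = "<p>hi</p>";

        public override string ToString()
        {
            return Html;
        }
    }

    class Demo
    {
        static void Main()
        {
            // Object.ToString() yields the fully qualified type name.
            Console.WriteLine(Convert.ToString(new PlainPage()));     // prints "ToStringDemo.PlainPage"

            // The override is used instead of the default.
            Console.WriteLine(Convert.ToString(new PrintablePage())); // prints "<p>hi</p>"
        }
    }
}
```

Convert.ToString(object) ends up calling the object's own ToString() here, which is why the original program printed the type name rather than the page source.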

From glancing at the code over on CodePlex, it looks like you need to use the Save function to save the output to an XmlWriter and then use that to get the string. I don't see another way to get at the whole contents of the page directly from that object (though admittedly I just scanned it).
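
For reference, a rough sketch of the Save route. It uses a StringWriter with the Save(TextWriter) overload rather than an XmlWriter, on the assumption that your installed version of HtmlAgilityPack exposes that overload. The sample markup and class name are made up.

```csharp
using System;
using System.IO;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

class SaveToStringSketch
{
    static void Main()
    {
        // Parse an in-memory string so the sketch does not need network access.
        var document = new HtmlDocument();
        document.LoadHtml("<html><body><p>Hello</p></body></html>");

        // Serialize the document into a StringWriter, then read the text back.
        using (var writer = new StringWriter())
        {
            document.Save(writer);
            string html = writer.ToString();
            Console.WriteLine(html);
        }
    }
}
```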

Edit: Amine Hajyoussef pointed you in the right direction with document.DocumentNode.InnerHtml, though note that you'll also need to change the return type of the function from HtmlDocument to string.
