I'm trying to parse an HTML file so I can extract the data from a table.
So I did some Google magic and ended up here, where a similar question was asked.

In that question they suggested using HTMLP for parsing the HTML, so I downloaded the units and tried it out.

It works, but I think I'm missing something: I have absolutely no clue how to get the actual text from an element.

I looked through the source but I can't find anything on this, so I was hoping someone here knows the answer.

Thanks in advance.

Edit

As requested: the data I'm trying to get is found here.
I want to get this data and turn each row into an object, which will be stored so I can compare different practices, qualifications and races.

  • Can you post at least the actual HTML you are trying to parse and the result you want to achieve? Commented Aug 22, 2014 at 14:34
  • @whosrdaddy I added more info on that. Commented Aug 22, 2014 at 14:43

1 Answer

The problem with your code (could you please reinstate it in this question?) is with the line:

for i:=0 to doc.body.all.length-1 do

When this executes, an Invalid Variant Operation occurs. Here's the code I used to investigate this:

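procedure GetTable2(FSource : TStrings);
var
  Doc : IHtmlDocument2;
  Body : IHtmlElement;
  All : IHtmlElementCollection;
  V : OleVariant;
begin
  Doc := coHTMLDocument.Create as IHTMLDocument2;
  Assert(Doc <> Nil);
  // Doc.Write expects a SAFEARRAY of variants, so wrap the text in a
  // one-element variant array rather than casting the string directly.
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := FSource.Text;
  Doc.Write(PSafeArray(TVarData(V).VArray));
  Doc.Close;
  Body := Doc.body;
  Assert(Body <> Nil);
  All := Body.All as IHtmlElementCollection;
  Assert(All <> Nil);
  Assert(All.Length <> 0);
end;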

This gets passed a TStringList that has been loaded with a locally saved copy of your racing results page.

You've been using "late binding", i.e. variants, to interact with the MS DOM parser. That's fine, if a little slower than using early binding like the code I've just quoted, but it can hide or obscure some kinds of error.

My code splits access to the parsed HTML into several stages and uses the Assert()s to check that the DOM objects actually exist. The first three Asserts pass, but the last one, that the length of the All collection is not zero, fails.

You might like to run my code above and inspect the OuterHtml property of the Body object. It is just '' plus a few embedded CRLFs. (The original version of this answer stopped here).

Update: A bit more digging revealed the cause of your problem. To see it, save your problem web page locally, then create a new VCL project, add to its form a TWebBrowser, two TMemos and two TButtons, then paste the following code into it (obviously, you'll need to adjust FormCreate to load your local copy of the page):

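procedure GetTable(All : IHtmlElementCollection; Output : TStrings);
var
  el:OleVariant;
  i,tdc,mc:integer;
  v:string;
begin
  v:='';
  mc:=4;    // number of cells (columns) per table row
  tdc:=0;   // number of cells collected for the current row
  for i:=0 to all.length -1 do
  begin
    el:= All.item(i, '');
    if el.tagname='TD' then
    begin
      inc(tdc);
      if tdc>mc then
      begin
        // current row is complete: emit it and start the next one
        Output.Add(v);
        v:='';
        tdc:=1;
      end;
      if v='' then v:=el.InnerText
      else v:=v+'^'+el.InnerText;
    end;
  end;
  // emit the final row, which the loop above never flushes
  if v<>'' then Output.Add(v);
end;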

procedure ProcessDoc(Doc : IHtmlDocument2; Output : TStrings);
var
  Body : IHtmlElement;
  All : IHtmlElementCollection;
  V : OleVariant;
begin
  Assert(Doc <> Nil);
  Body := Doc.Body;
  Assert(Body <> Nil);
  All := Body.All as IHtmlElementCollection;
  Assert(All <> Nil);
  Assert(All.Length <> 0);
  GetTable(All, Output);
end;


procedure TForm1.FormCreate(Sender: TObject);
begin
  Memo1.Lines.LoadFromFile('D:\aaad7\html\race.htm');
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  V : OleVariant;
begin
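  // Doc is not a local variable: declare it as a field of TForm1
  // (Doc : IHTMLDocument2), because Button2Click uses it as well.
  WebBrowser1.Navigate('about:blank');  //  This line is so that the WebBrowser
    // has a Doc object
  Doc := WebBrowser1.Document as IHTMLDocument2;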
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := Memo1.Lines.Text;
  try
    Doc.Write(PSafeArray(TVarData(V).VArray));
  finally
    Doc.Close;
  end;  
end;

procedure TForm1.Button2Click(Sender: TObject);
begin
  ProcessDoc(Doc, Memo2.Lines);
end;

When you click Button1, you'll soon see the cause of the problem: a cascade of seven JavaScript error pop-ups. (I'm using IE11, but you may get them with earlier versions too.) If you click Yes through them, you'll see that the second memo receives the output of a slightly adapted version of your code.

So I think the problem with your code was that, because you were creating an IHTMLDocument object with no GUI, there was no way for the script errors to manifest. IIRC, the MS specification for COM objects requires that exceptions never propagate across the boundary between the COM host and its client, so with a non-GUI Doc object you never get to find out about the errors. The obvious workaround is to load the page into a TWebBrowser and use the Doc object from that.

Update #2: Something I hadn't realised when I first wrote this answer is that you can tell your IHtmlDocument not to pop up JavaScript errors, so that it loads the page instead of refusing to. All you need to do is put

Doc.DesignMode := 'On';

before you try to load anything into it, e.g. by calling its .Write method. FWIW, you can do a similar thing with a TWebBrowser by setting its Silent property to True.
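
Putting that together, a minimal sketch of a GUI-less loader might look like the following. The function name LoadHtmlSilently is made up for illustration, and the sketch assumes COM is already initialised on the calling thread (as it is in a VCL application). Depending on the page, you may also need to wait until the document's readyState reports 'complete' before reading from it; that is not shown here.

```pascal
uses
  Classes, Variants, ActiveX, MSHTML;

// Hypothetical helper (not part of MSHTML): creates a GUI-less document,
// switches on design mode so that script errors do not block loading,
// then writes the HTML source into it.
function LoadHtmlSilently(Source : TStrings) : IHTMLDocument2;
var
  V : OleVariant;
begin
  Result := coHTMLDocument.Create as IHTMLDocument2;
  // Must be set before anything is written to the document
  Result.DesignMode := 'On';
  // Write expects a SAFEARRAY of variants holding the HTML text
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := Source.Text;
  try
    Result.Write(PSafeArray(TVarData(V).VArray));
  finally
    Result.Close;
  end;
end;
```

You could then call ProcessDoc(LoadHtmlSilently(Memo1.Lines), Memo2.Lines); without needing a TWebBrowser on the form.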

Btw, if you're trying to parse your table to get at the data you might want to have a look at this earlier answer of mine:

Delphi: Some tip to parse this html table?
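
Since you want to turn each row into an object, here is a rough sketch of how one of the '^'-separated lines produced by GetTable could be split back into its cells. TResultRow and ParseRow are hypothetical names, and the four-cell layout simply mirrors mc = 4 in GetTable; what each cell means depends on the page. It also assumes the cell text itself never contains '^' and does not start with a double quote, which TStringList would treat as a quote character.

```pascal
uses
  Classes, SysUtils;

type
  // Hypothetical container for one table row; one string per cell
  TResultRow = record
    Cells : array[0..3] of string;
  end;

// Splits a line such as 'a^b^c^d' into its cells.
// Missing cells are returned as empty strings.
function ParseRow(const Line : string) : TResultRow;
var
  Parts : TStringList;
  i : Integer;
begin
  Parts := TStringList.Create;
  try
    Parts.StrictDelimiter := True;  // split on '^' only, not on spaces
    Parts.Delimiter := '^';
    Parts.DelimitedText := Line;
    for i := Low(Result.Cells) to High(Result.Cells) do
      if i < Parts.Count then
        Result.Cells[i] := Parts[i]
      else
        Result.Cells[i] := '';
  finally
    Parts.Free;
  end;
end;
```

From there you can copy the cells into whatever class you use to store and compare practices, qualifications and races.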

2 Comments

Thanks a lot for finding the problem and also for explaining it properly :) Much appreciated.
Not sure if SO will notify you automatically, but just in case not, this is to let you know that I've added to my answer a mention of how to get an IHtmlDocument2 object to load a web page despite JavaScript errors.
