Efficient Delphi method to parse HTML code for data

Question

I sometimes need to parse data like this:

<tr>
  <td data-th="Name">
    John Smith
  </td>
  <td data-th="Phone">
    1234567
  </td>
  <td data-th="Postal">
    16803
  </td>
  <td data-th="Office Number">
    12345678
  </td>
  <td data-th="Remarks">
    Hello
  </td>
</tr>
<tr>
  <td data-th="Name">
    Mary Smith
  </td>
  <td data-th="Phone">
    1234589
  </td>
  <td data-th="Postal">
    16801
  </td>
  <td data-th="Office Number">
    2385234
  </td>
  <td data-th="Remarks">
    Hi There
  </td>
</tr>

I would load this into a TStringList and do something like:

for i := 0 to oStringList.Count-1 do
begin
  if oStringList[i].Trim = '<tr>' then
  begin
    // start of record
  end else if oStringList[i].Trim = '</tr>' then
  begin 
    // end of record
  end else
  begin
    // part of record data
  end;
end;

Is there a better way to do this, either with some very efficient code or with an existing good Delphi component (preferably free/open source)? I saw a thread on Stack Overflow (dated 3+ years ago) that mentioned a component; I am just wondering whether something better has appeared since.

Thanks.

Update: I am trying the htmlp component. How do I set up the code to parse the data above? The sketchy example did not help. I want to loop through each <tr>...</tr> and get the cell data:

  var HtmlParser: THtmlParser;
  var  HtmlDoc: TDocument;
  var  x: Integer;
  var  body, el: TElement;
  var  node: TNode;
  begin
    HtmlParser := THtmlParser.Create;
    try
      HtmlDoc := HtmlParser.parseString(memo1.Text);
      try
        body := GetDocBody(HtmlDoc);
        if Assigned(body) then
        for x := 0 to body.childNodes.length - 1 do
        begin
          node := body.childNodes.item(x);
          if (node is TElement) then
          begin
            el := node as TElement;
            if (el.tagName = 'td') then //and (el.GetAttribute('data-th') = 'Name') then
            begin
              // iterate el.childNodes here...
              //ShowMessage(IntToStr(el.childNodes.length));
              memo1.Lines.Add(IntToStr(el.childNodes.length));
            end else
            begin

            end;
          end else
          begin
            memo1.Lines.Add('node is not element');
          end;
        end;
      finally
        HtmlDoc.Free;
      end;
    finally
      HtmlParser.Free
    end;
  end;

"Is there a better way to do this[?]" Yes, use an HTML parser. We cannot recommend one, though, since software recommendation is explicitly disallowed by Stack Overflow rules.
– Andreas Rejbrand, Commented Sep 29, 2021 at 13:18
Then may I ask whether htmlp is a good choice, since you cannot recommend one?
– Peter Jones, Commented Sep 29, 2021 at 15:04
Do not expect HTML to come in multiple lines: putting everything together without a single line break is legal. An HTML parser will most likely choke on syntax/logic errors (such as your two </td> in a row) that a human could easily adapt to. On the other hand, it will most likely also support entities (&lt;) and whatever else must be expected in HTML.
– AmigoJack, Commented Sep 29, 2021 at 15:11
That's true. I have corrected the double </td>. May I ask whether anyone has experience with the htmlp parser and knows how to parse this case?
– Peter Jones, Commented Sep 29, 2021 at 15:25
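
To illustrate the comment about line breaks, here is a minimal sketch of why the line-by-line TStringList check from the question is fragile. The program name and the CountRowStarts helper are made up for this example, and it assumes a Delphi version with string helpers (XE3 or later). The same markup is counted once with line breaks and once without.

```delphi
program LineBasedPitfall;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: counts lines that consist of exactly '<tr>',
// which is the test used by the loop in the question.
function CountRowStarts(const Html: string): Integer;
var
  Lines: TStringList;
  i: Integer;
begin
  Result := 0;
  Lines := TStringList.Create;
  try
    Lines.Text := Html; // splits the input on line breaks
    for i := 0 to Lines.Count - 1 do
      if Lines[i].Trim = '<tr>' then
        Inc(Result);
  finally
    Lines.Free;
  end;
end;

begin
  // One tag per line: the row start is found.
  Writeln(CountRowStarts('<tr>'#13#10'<td>A</td>'#13#10'</tr>')); // prints 1
  // Same markup on a single line: no line equals '<tr>', so nothing is found.
  Writeln(CountRowStarts('<tr><td>A</td></tr>'));                 // prints 0
end.
```

Both inputs describe the same row, so any approach that depends on where the line breaks fall will miss data as soon as the page is minified or reformatted.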

Kim Madsen · Accepted Answer · 2021-09-29 19:36:55Z

When the HTML is well formed like that, with every start tag matched by an end tag (<TR>...</TR>), it is basically XML, so you can use an XML reader to parse the document.

Using kbmMW's XML parser like this:

const
  HTML =
'  <tr>'+
'  <td data-th="Name">'+
'    John Smith'+
'  </td>'+
'  <td data-th="Phone">'+
'    1234567'+
'  </td>'+
'  <td data-th="Postal">'+
'    16803'+
'  </td>'+
'  <td data-th="Office Number">'+
'    12345678'+
'  </td>'+
'  <td data-th="Remarks">'+
'    Hello'+
'  </td>'+
'</tr>'+
'<tr>'+
'  <td data-th="Name">'+
'    Mary Smith'+
'  </td>'+
'  <td data-th="Phone">'+
'    1234589'+
'  </td>'+
'  <td data-th="Postal">'+
'    16801'+
'  </td>'+
'  <td data-th="Office Number">'+
'    2385234'+
'  </td>'+
'  <td data-th="Remarks">'+
'    Hi There'+
'  </td>'+
'</tr>';
    
procedure TForm1.Button1Click(Sender: TObject);
var
   xml:TkbmMWDOMXML;
   i,j:integer;
   nTR,nTD:TkbmMWDOMXMLNodeList;
   n1,n2:TkbmMWDOMXMLNode;
begin
     Memo1.Clear;
     xml:=TkbmMWDOMXML.Create(HTML);
     try
        nTR:=xml.Root.ChildrenByName['tr'];
        try
           for i:=0 to nTR.Count-1 do
           begin
                n1:=nTR.Nodes[i];
                nTD:=n1.ChildrenByName['td'];
                try
                   for j:=0 to nTD.Count-1 do
                   begin
                        n2:=nTD.Nodes[j];
                        Memo1.Lines.Add(n2.AttribByName['data-th']+'='+n2.Data);
                   end;
                finally
                   nTD.Free;
                end;
           end;
        finally
           nTR.Free;
        end;
     finally
        xml.Free;
     end;
end;

Results in this:

Name=John Smith
Phone=1234567
Postal=16803
Office Number=12345678
Remarks=Hello
Name=Mary Smith
Phone=1234589
Postal=16801
Office Number=2385234
Remarks=Hi There

The kbmMW XML parser is included, along with much much more, in the free Community Edition that can be downloaded from https://portal.components4developers.com after registering.

7 Comments

David Heffernan Over a year ago

Isn't HTML that is also XML known as XHTML?

AmigoJack Over a year ago

@DavidHeffernan Definitely yes. HTML is SGML, so only the "ML" part is the common denominator. HTML5 should be impossible to parse as XML because missing closing tags are legal now, so this answer implies HTML2 through 4 or XHTML1.

David Heffernan Over a year ago

@peter There are many XML parsers available, and many of them have more permissive licences than Kim's parser that he is promoting here. Indeed the standard Delphi libraries have a fully functional XML parser that should be more than enough to meet your needs.

Peter Jones Over a year ago

@KimMadsen My apologies: when I saw your answer the other day, the site was going down for maintenance, so I was not able to accept the answer as correct.

Peter Jones Over a year ago

@AmigoJack thanks for your reply
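
Following up on the comment that the standard Delphi libraries already include an XML parser, here is a minimal sketch using Xml.XMLDoc and Xml.XMLIntf. It rests on several assumptions that are not from the thread: a Windows console application on Delphi XE2 or later, the default MSXML DOM vendor, and input that is well-formed XML. The rows wrapper element, the program name, and the DumpRows procedure are made up for this example, and the sample data is shortened.

```delphi
program ParseRowsWithRtlXml;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Variants, Winapi.ActiveX,
  Xml.XMLIntf, Xml.XMLDoc;

const
  // Shortened copy of the sample rows from the question.
  Rows =
    '<tr><td data-th="Name"> John Smith </td><td data-th="Phone"> 1234567 </td></tr>' +
    '<tr><td data-th="Name"> Mary Smith </td><td data-th="Phone"> 1234589 </td></tr>';

procedure DumpRows(const Fragment: string);
var
  Doc: IXMLDocument;
  Row, Cell: IXMLNode;
  i, j: Integer;
begin
  // XML allows exactly one root element, so the fragment is wrapped.
  // 'rows' is an arbitrary wrapper name chosen for this sketch.
  Doc := LoadXMLData('<rows>' + Fragment + '</rows>');
  for i := 0 to Doc.DocumentElement.ChildNodes.Count - 1 do
  begin
    Row := Doc.DocumentElement.ChildNodes[i];
    if Row.NodeName <> 'tr' then
      Continue; // skip anything that is not a row, e.g. text nodes
    for j := 0 to Row.ChildNodes.Count - 1 do
    begin
      Cell := Row.ChildNodes[j];
      // Text is meant for elements that hold only text; cells with
      // nested markup would need their child nodes walked instead.
      if Cell.NodeName = 'td' then
        Writeln(VarToStr(Cell.Attributes['data-th']), '=', Trim(Cell.Text));
    end;
  end;
end;

begin
  CoInitialize(nil); // the MSXML vendor relies on COM being initialized
  try
    try
      DumpRows(Rows);
    except
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    CoUninitialize;
  end;
end.
```

In a VCL application COM is normally initialized already, so the CoInitialize/CoUninitialize pair would not be needed there. As with the accepted answer, this only suits input like the sample in the question: markup that is legal HTML but not well-formed XML, such as an unclosed <br> or an &nbsp; entity, is expected to be rejected by the parser.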
