I sometimes need to parse data like this:

<tr>
  <td data-th="Name">
    John Smith
  </td>
  <td data-th="Phone">
    1234567
  </td>
  <td data-th="Postal">
    16803
  </td>
  <td data-th="Office Number">
    12345678
  </td>
  <td data-th="Remarks">
    Hello
  </td>
</tr>
<tr>
  <td data-th="Name">
    Mary Smith
  </td>
  <td data-th="Phone">
    1234589
  </td>
  <td data-th="Postal">
    16801
  </td>
  <td data-th="Office Number">
    2385234
  </td>
  <td data-th="Remarks">
    Hi There
  </td>
</tr>

I would load this into a TStringList and do something like:

for i := 0 to oStringList.Count-1 do
begin
  if oStringList[i].Trim = '<tr>' then
  begin
    // start of record
  end else if oStringList[i].Trim = '</tr>' then
  begin 
    // end of record
  end else
  begin
    // part of record data
  end;
end;
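
For reference, here is a minimal sketch of how that loop could be filled in. It assumes every tag and every value sits on its own line, exactly as in the sample, so it breaks as soon as the markup is laid out differently. The names ScanLines, ExtractHeader and Sample are hypothetical and only serve the illustration.

```delphi
program LineScan;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

const
  // Shortened sample; assumption: one tag or one value per line.
  Sample =
    '<tr>'#10 +
    '  <td data-th="Name">'#10 +
    '    John Smith'#10 +
    '  </td>'#10 +
    '  <td data-th="Phone">'#10 +
    '    1234567'#10 +
    '  </td>'#10 +
    '</tr>';

// Returns the value of data-th="..." from a line such as <td data-th="Name">.
function ExtractHeader(const ALine: string): string;
const
  Marker = 'data-th="';
var
  p, q: Integer;
begin
  Result := '';
  p := Pos(Marker, ALine);
  if p = 0 then
    Exit;
  Inc(p, Length(Marker));
  q := Pos('"', ALine, p);
  if q > 0 then
    Result := Copy(ALine, p, q - p);
end;

// Walks the text line by line and emits one "Header=Value" line per cell.
function ScanLines(const AText: string): string;
var
  Source, Output: TStringList;
  i: Integer;
  Line, Header: string;
begin
  Source := TStringList.Create;
  try
    Output := TStringList.Create;
    try
      Source.Text := AText;
      Header := '';
      for i := 0 to Source.Count - 1 do
      begin
        Line := Source[i].Trim;
        if Line = '' then
          Continue;
        if Line = '<tr>' then
          Output.Add('--- record ---')        // start of record
        else if (Line = '</tr>') or (Line = '</td>') then
          Header := ''                        // end of record or cell
        else if Line.StartsWith('<td') then
          Header := ExtractHeader(Line)       // remember the column name
        else if Header <> '' then
          Output.Add(Header + '=' + Line);    // part of record data
      end;
      Result := Output.Text;
    finally
      Output.Free;
    end;
  finally
    Source.Free;
  end;
end;

begin
  Write(ScanLines(Sample));
end.
```

The sketch keeps the same start-of-record and end-of-record branches as the loop above and only adds a Header variable to remember which column the next value line belongs to.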

Is there a better way to do this, either via some very efficient code, or are there already some really good Delphi components (preferably free/open source) that can accomplish this? I saw a thread (dated 3+ years ago) on Stack Overflow that mentioned a component; I am just wondering if something better has popped up.

Thanks.

Update: I am trying the htmlp component. How do I configure the code to parse the data above? The sketchy example did not help. I want to loop through each TR/TR and get the

  var HtmlParser: THtmlParser;
  var  HtmlDoc: TDocument;
  var  x: Integer;
  var  body, el: TElement;
  var  node: TNode;
  begin
    HtmlParser := THtmlParser.Create;
    try
      HtmlDoc := HtmlParser.parseString(memo1.Text);
      try
        body := GetDocBody(HtmlDoc);
        if Assigned(body) then
        for x := 0 to body.childNodes.length - 1 do
        begin
          node := body.childNodes.item(x);
          if (node is TElement) then
          begin
            el := node as TElement;
            if (el.tagName = 'td') then //and (el.GetAttribute('data-th') = 'Name') then
            begin
              // iterate el.childNodes here...
              //ShowMessage(IntToStr(el.childNodes.length));
              memo1.Lines.Add(IntToStr(el.childNodes.length));
            end else
            begin

            end;
          end else
          begin
            memo1.Lines.Add('node is not element');
          end;
        end;
      finally
        HtmlDoc.Free;
      end;
    finally
      HtmlParser.Free
    end;
  end;
  • "Is there a better way to do this[?]" Yes, use an HTML parser. We cannot recommend one, though, since software recommendation is explicitly disallowed by Stack Overflow rules. Commented Sep 29, 2021 at 13:18
  • Then may I ask whether htmlp is a good choice, since you cannot recommend one? Commented Sep 29, 2021 at 15:04
  • Do not expect HTML to come in multiple lines - stuffing everything together without one single linebreak is legal. A (HTML) parser will most likely choke on syntax/logic errors (such as your 2 </td> in a row) that could be easy for you as a human to adapt to, but on the other hand it also most likely supports entities (&lt;) and whatnot that must be expected with HTML. Commented Sep 29, 2021 at 15:11
  • That's true. I have corrected the double </td>. May I ask if anyone has experience with the htmlp parser and knows how to parse this case? Commented Sep 29, 2021 at 15:25
  • Why do you need to parse HTML in the first place? Commented Sep 29, 2021 at 18:26

1 Answer

When the HTML is well formed like that, where every start tag has a matching end tag (<TR>...</TR>), it is basically XML. So you can use an XML reader to parse the document.

Using kbmMW's XML parser like this:

const
  HTML =
'  <tr>'+
'  <td data-th="Name">'+
'    John Smith'+
'  </td>'+
'  <td data-th="Phone">'+
'    1234567'+
'  </td>'+
'  <td data-th="Postal">'+
'    16803'+
'  </td>'+
'  <td data-th="Office Number">'+
'    12345678'+
'  </td>'+
'  <td data-th="Remarks">'+
'    Hello'+
'  </td>'+
'</tr>'+
'<tr>'+
'  <td data-th="Name">'+
'    Mary Smith'+
'  </td>'+
'  <td data-th="Phone">'+
'    1234589'+
'  </td>'+
'  <td data-th="Postal">'+
'    16801'+
'  </td>'+
'  <td data-th="Office Number">'+
'    2385234'+
'  </td>'+
'  <td data-th="Remarks">'+
'    Hi There'+
'  </td>'+
'</tr>';
    
procedure TForm1.Button1Click(Sender: TObject);
var
   xml:TkbmMWDOMXML;
   i,j:integer;
   nTR,nTD:TkbmMWDOMXMLNodeList;
   n1,n2:TkbmMWDOMXMLNode;
begin
     Memo1.Clear;
     xml:=TkbmMWDOMXML.Create(HTML);
     try
        nTR:=xml.Root.ChildrenByName['tr'];
        try
           for i:=0 to nTR.Count-1 do
           begin
                n1:=nTR.Nodes[i];
                nTD:=n1.ChildrenByName['td'];
                try
                   for j:=0 to nTD.Count-1 do
                   begin
                        n2:=nTD.Nodes[j];
                        Memo1.Lines.Add(n2.AttribByName['data-th']+'='+n2.Data);
                   end;
                finally
                   nTD.Free;
                end;
           end;
        finally
           nTR.Free;
        end;
     finally
        xml.Free;
     end;
end;

Results in this:

Name=John Smith
Phone=1234567
Postal=16803
Office Number=12345678
Remarks=Hello
Name=Mary Smith
Phone=1234589
Postal=16801
Office Number=2385234
Remarks=Hi There

The kbmMW XML parser is included, along with much much more, in the free Community Edition that can be downloaded from https://portal.components4developers.com after registering.
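
As a comparison, the same idea can be sketched with the XML support that ships with Delphi (Xml.XMLDoc and Xml.XMLIntf). This is only a sketch under a few assumptions: it targets Windows, where the default DOM vendor is MSXML and COM must be initialized in a console program; the rows are wrapped in a single <table> element because XML allows only one root; and the names RowsToText and Sample are hypothetical.

```delphi
program XmlRows;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Variants, Winapi.ActiveX, Xml.XMLIntf, Xml.XMLDoc;

const
  // Assumption: the fragment is wrapped in one root element (<table>).
  Sample =
    '<table>' +
    '<tr><td data-th="Name"> John Smith </td>' +
        '<td data-th="Phone"> 1234567 </td></tr>' +
    '<tr><td data-th="Name"> Mary Smith </td>' +
        '<td data-th="Phone"> 1234589 </td></tr>' +
    '</table>';

// Returns one "Header=Value" line per <td> found under each <tr>.
function RowsToText(const AXml: string): string;
var
  Doc: IXMLDocument;
  Row, Cell: IXMLNode;
  i, j: Integer;
begin
  Result := '';
  Doc := LoadXMLData(AXml);
  for i := 0 to Doc.DocumentElement.ChildNodes.Count - 1 do
  begin
    Row := Doc.DocumentElement.ChildNodes[i];
    if not SameText(Row.NodeName, 'tr') then
      Continue;
    for j := 0 to Row.ChildNodes.Count - 1 do
    begin
      Cell := Row.ChildNodes[j];
      // Text is only valid for cells that contain plain text, as here.
      if SameText(Cell.NodeName, 'td') then
        Result := Result + VarToStr(Cell.Attributes['data-th']) + '=' +
          Trim(Cell.Text) + sLineBreak;
    end;
  end;
end;

begin
  CoInitialize(nil);   // needed for MSXML in a console program
  try
    Write(RowsToText(Sample));
  finally
    CoUninitialize;
  end;
end.
```

All interface references live inside RowsToText, so they are released before CoUninitialize runs. In a VCL application COM is normally already initialized, so the CoInitialize/CoUninitialize pair would not be needed there.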

7 Comments

Isn't HTML that is also XML known as XHTML?
@DavidHeffernan Definitely yes. HTML is SGML, so only the "ML" part is the most common denominator. HTML5 should be impossible to parse as XML because missing closing tags are legal now, so this answer implies HTML2 through 4 or XHTML1.
@peter There are many XML parsers available, and many of them have more permissive licences than Kim's parser that he is promoting here. Indeed the standard Delphi libraries have a fully functional XML parser that should be more than enough to meet your needs.
@KimMadsen My apologies: when I saw your answer the other day, the site was going under maintenance, so I was not able to accept the answer as correct.
@AmigoJack thanks for your reply