Extracting information from HTML file using perl script

Question

This is my code to extract certain data under the header Item Drop%. I want to extract the 90.5% under that header. But i can only extract the whole column not just that value. any idea ?

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'data.html';
unless ( -e $file ) {
    my $rc = getstore(
        'proj/Desktop/folder1/data.html',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}



my $te = HTML::TableExtract->new( headers => qw(Item Drop%)]);

$te->parse_file($file);

my ($table) = $te->tables;

foreach my $ts (ts->tables) {
    print "Table (", join(',', $ts->coords), ");\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

My data.html is:


 ..
 ..
 ..
<table align = "center" class="" style= .......>
<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href= "proj/Desktop/folder1/data.html" class = ..../th>
<td> 90.5%</td>
</tr>
..
..
..
..
<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href= "proj/Desktop/folder1/data.html" class = ..../th>
<td> 40%</td>
</tr>

</table>

The data snippet you provided contains no table tag and thus contains no tables. Please make a minimal but reproducible example, i.e. skip everything not actually needed to show your problem (like use of LWP) and provide a complete data file which should work in your opinion but does not with your code. — Steffen Ullrich
– Steffen Ullrich, Commented Jul 14, 2021 at 4:20
Presumably there is a table there but you've only shown a fragment which isn't nearly enough for us to help you out. Next, if $te is an "undefined value" as that error indicates then things are really not right since the good module you use couldn't even parse a table. Then, the method rows goes on tables, so $table->rows. See documentation; also, there are posts here with complete examples for processing tables (for example, this one, with links in it for two more complete examples) — zdim
– zdim, Commented Jul 14, 2021 at 4:58
Thanks for the comments , i think i solved it somehow. will update again soon ! — Vyshunavi
– Vyshunavi, Commented Jul 14, 2021 at 5:02
@Vyshunavi OK, good :) I just updated my comment with a link to a complete example, have a look — zdim
– zdim, Commented Jul 14, 2021 at 5:03
@zdim will do thanks ! i've updated my question. please take a look ~~ — Vyshunavi
– Vyshunavi, Commented Jul 14, 2021 at 5:56

zdim · Accepted Answer · 2021-07-15 18:27:15Z

3

Here are basics, with the given table fragment completed to a meaningful table.

use warnings;
use strict;
use feature 'say';

use HTML::TableExtract;
use Scalar::Util qw(looks_like_number);

my $filename = shift // die "Usage: $0 file\n";

my $te = HTML::TableExtract->new;
$te->parse_file($filename);    

my ($tbl) = $te->tables;  # one table in the sample file

my (@values1, @values2);  # for 90.5% and such, processing options

foreach my $row ($tbl->rows) { 
    # Tables often come with empty fields; keep them, for counting and such
    my @fields = map { defined($_) ? $_ : '--' } @$row;
    printf "%8s ", $_  for @fields;
    say ''; 

    # Criteria for how to identify the number aren't explained,
    # but may it be the fourth column in a row starting with 'EM'?
    if ($fields[0] =~ /^\s*EM\s*$/) {
        push @values1, $fields[3] =~ s/^\s*|\s*$//gr;  # see note in text
    }

    # Or is it simply the number ending with % sign?
    foreach my $fld (@fields) { 
        if ($fld =~ /\s*(.+)\s*%/ and looks_like_number($1)) { 
            push @values2, $1;
        }
    }
}
say "@values1";
say "@values2";

NOTE: The /r modifier in the regex was added in v5.14. If your Perl is older see footnote^†

Much of processing is shown for a demo. One needn't print those values (once you figure out where the items of interest are), and I wouldn't replace undef fields with --, done for clearer printout, but rather with '' (empty string). Also, we would need one criterion, not two.

Note that in the first case we keep the % sign, and use regex to clean up the spaces; in the second case the percentile sign is left out (and spaces end up cleaned out by regex naturally, in matching). These can both be adjusted to what you actually need of course.

Since neither the actual table nor data in it, nor the exact criteria, are given I can only offer hints and examples of code. With more detail this can be made more specific.

Note, when it comes to nailing down the last details, often by regex, things tend to become picky and sensitive to details; so careful.

The html file, completed off of the fragment in the question, used above:

<html>    
<style> th, td { padding: 10px } </style>  <!-- to better see it -->

<table align="center" rules="all">    
<tr>
    <th rowspan="2">EM</th>
    <th colspan="2"><a href="http://www.google.com">ggl</a></th>
    <td> 90.5%</td>
</tr>
<tr>
    <td>data</td> <td>more</td> <td>etc</td>
</tr>    

<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href="http://www.google.com">ggl</a></th>
<td> 0.0%</td>
</tr>
<tr> <td>data</td> <td>more</td> <td>etc</td> </tr>    
</table>    

</html>

^† The /r modifier on the substitution regex makes it return the changed string (and it leaves the original as it is). This is precisely what one wants in many situations, the one in the code here being a good example (we merely want to add the changed string to an array).

However, in Perls older than version 5.14, when this feature was introduced, one has to do otherwise: either create a new variable that has the change, if you need to keep the original unchanged, or change that original and then use it.

Since we are pruning extra spaces here it may well be best to simply change $fields[3] and then add it to the array. So instead of

if ($fields[0] =~ /^\s*EM\s*$/) {
    push @values1, $fields[3] =~ s/^\s*|\s*$//gr;
}

do

if ($fields[0] =~ /^\s*EM\s*$/) {
    $fields[3] =~ s/^\s*|\s*$//g;   # strip leading/trailing spaces
    push @values1, $fields[3];
}

edited Jul 15, 2021 at 18:27

answered Jul 15, 2021 at 6:12

zdim

67.2k5 gold badges59 silver badges87 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

Vyshunavi Over a year ago

ok let me try ur code and get back to u soon !

Vyshunavi Over a year ago

hi, it says that bareword found where the operation expected at if ($fields[0] =~ /^\s*EM\s*$/) { push @values1, $fields[3] =~ s/^\s*|\s*$//r;

zdim Over a year ago

@Vyshunavi Not for me, it doesn't; I tested this (just did it again) and it works as expected. Are you using your real tables with this code? I don't know your real data so this is written to process a file I made up from what you gave in the question (shown at the end). If your actual HTML has a different structure then you'd need to adjust the code accordingly. I can't do that since I don't know your table.

zdim Over a year ago

@Vyshunavi On the other hand ... what version of Perl do you have? That /r modifier in the regex was added in version 5.14. If yours is earlier it won't work so let me know to add a different way in that case.

Vyshunavi Over a year ago

mine is v5.8.9, does that mean i cant use it? is there anyway i can sent u my html file ~

|

Collectives™ on Stack Overflow

Extracting information from HTML file using perl script

1 Answer 1

10 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related