HTML parser using perl

Question

I'm trying to parse an HTML file with a Perl script and extract all the text inside <p> tags. In the page source, the data is written in this format:

<p> Metrics are all virtualization specific and are prioritized and grouped as follows: </p>

Here is my code:

use HTML::TagParser();

use URI::Fetch;

//my @list = $html->getElementsByTagName( "p" );

    foreach my $elem ( @list ) {
        my $tagname = $elem->tagName;
        my $attr = $elem->attributes;
        my $text = $elem->innerText;

        push (@array,"$text");

        foreach $_  (@array) {
           # print "$_\n"; 
           print $html_fh "$_\n";   
          chomp ($_);        
           push (@array1, "$_");
         }
       } 
    }

$end = $#array1+1;

print "Elements in the array: $end\n";

close $html_fh;

The problem I'm facing is that the generated output is 4.60 MB and many of the array elements are just repeated sentences. How can I avoid this repetition? Is there a more efficient way to extract only the lines I'm interested in?

TLP · Accepted Answer · 2012-12-09 08:44:47Z

The reason you are seeing duplicated lines is that you are printing your entire array once for every element in it.

foreach my $elem ( @list ) {
    my $tagname = $elem->tagName;
    my $attr = $elem->attributes;
    my $text = $elem->innerText;

    push (@array,"$text");      # this array is printed below

    foreach $_  (@array) {      # This is inside the other loop
       # print "$_\n"; 
       print $html_fh "$_\n";   # here comes the print
      chomp ($_);        
       push (@array1, "$_");
     }
   }

So for example, if you have an array "foo", "bar", "baz", it would print:

foo   # first iteration
foo   # second
bar
foo   # third
bar
baz

So, to fix your duplication errors, move the second loop outside the first one.
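
As a minimal sketch of that fix, the example below uses a hard-coded list of strings as a stand-in for the innerText values, so it runs without any HTML module. The names @texts, @collected and @printed are made up for illustration. The collecting loop and the printing loop run one after the other instead of being nested:

```perl
use strict;
use warnings;

# Stand-in for the text of each <p> element (assumed data, not the real page).
my @texts = ("foo", "bar", "baz");

# First loop: only collect the text.
my @collected;
for my $text (@texts) {
    push @collected, $text;
}

# Second loop, now outside the first one: each element is printed once.
my @printed;
for my $line (@collected) {
    print "$line\n";
    push @printed, $line;
}

print "Elements in the array: " . @printed . "\n";
```

With the loops separated, the three sample strings are printed once each and the reported count is 3.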

Some other notes:

You should always use these two pragmas:

use strict;
use warnings;

They will provide more help than any other single thing that you can do. The short learning curve associated with fixing the errors they report is more than made up for by the massively reduced time spent debugging.

//my @list = $html->getElementsByTagName( "p" );

Comments in Perl start with #, not //. I'm not sure if this is a typo, because you use this array below.

foreach my $elem ( @list ) {

You don't need to actually store the tags into an array unless you need an array. This is an intermediate variable only in this case. You can simply do the following (note that for and foreach are exactly the same):

for my $elem ($html->getElementsByTagName("p")) {

These variables are also intermediate, and two of them are unused.

my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");

Also note that you never have to quote a variable this way. You can simply do this:

push @array, $elem->innerText;

foreach $_  (@array) {

A foreach loop uses $_ by default, so there is no need to specify it explicitly.

print $html_fh "$_\n";   
chomp ($_);        
push (@array1, "$_");

I'm not sure why you chomp the variable after printing it but before storing it in the second array; that order doesn't make sense to me. Also, the second array will contain exactly the same elements as the first one, only duplicated.

$end = $#array1+1;

This is another intermediate variable, and also it can be simplified. The $# sigil will give you the index of the last element, but the array itself in scalar context will give you the size of it:

$end = @array1;   # size = last index + 1

But you can do this in one go:

print "Elements in the array: " . @array1 . "\n";

Note that using the concatenation operator . here enforces scalar context on the array. If you had used the comma operator , the array would have been in list context and would have been expanded into a list of its elements. This is a typical way to control context.
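
A small sketch of the difference, using made-up sample data. The variable names $size, $last_index, $with_dot and $with_comma are only for illustration, and join is used here to capture what print would emit if it were given the same comma-separated list:

```perl
use strict;
use warnings;

my @array1 = ("foo", "bar", "baz");

my $size       = @array1;     # scalar context: number of elements
my $last_index = $#array1;    # index of the last element (size - 1)

# Concatenation forces scalar context, so the array yields its size.
my $with_dot = "Elements in the array: " . @array1;

# In a list (as with the comma operator in print), the array is expanded.
my $with_comma = join "", "Elements in the array: ", @array1;

print "$with_dot\n";      # Elements in the array: 3
print "$with_comma\n";    # Elements in the array: foobarbaz
```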

close $html_fh;

Explicitly closing a file handle is not required, as it will be closed automatically when the script ends.
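
A hedged sketch of that behaviour: a lexical file handle is also closed as soon as it goes out of scope, so the data below is flushed to disk without an explicit close. The temporary file from File::Temp is only a stand-in for the real output file, and checking the return value of close on the read handle is optional; it is shown because it reports I/O errors that would otherwise go unnoticed:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical output file; a temp file keeps the sketch self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "Cannot close $path: $!";

{
    open my $html_fh, '>', $path or die "Cannot open $path: $!";
    print {$html_fh} "Paragraph 1\n";
    # No close here: the handle is closed when $html_fh goes out of scope.
}

# Read the file back to confirm the line was written.
open my $in, '<', $path or die "Cannot read $path: $!";
my @lines = <$in>;
close $in or die "Cannot close $path: $!";    # optional, but reports errors

print "Lines written: " . @lines . "\n";
```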

creaktive · Accepted Answer · 2012-12-09 14:11:45Z

If you use Web::Scraper instead, your code gets even simpler and clearer (as long as you are able to construct CSS selectors or XPath queries):

#!/usr/bin/env perl
use strict;
use warnings qw(all);

use URI;
use Web::Scraper;

my $result = scraper {
    process 'p',
        'paragraph[]' => 'text';
}->scrape(URI->new('http://www.perl.org/'));

for my $test (@{$result->{paragraph}}) {
    print "$test\n";
}

print "Elements in the array: " . (scalar @{$result->{paragraph}});

answered Dec 9, 2012 at 14:11


Joel Berger · Accepted Answer · 2012-12-09 16:12:35Z

Here is another way to get all the content from between <p> tags, this time using Mojo::DOM, which is part of the Mojolicious project.

#!/usr/bin/env perl

use strict;
use warnings;
use v5.10; # say

use Mojo::DOM;

my $html = <<'END';
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<div>Should not find this</div>
<p>Paragraph 3</p>
END

my $dom = Mojo::DOM->new($html);
# Older Mojolicious releases used ->pluck('text'); it was later removed in favor of ->map('text').
my @paragraphs = $dom->find('p')->map('text')->each;

say for @paragraphs;

edited Dec 9, 2012 at 16:12

answered Dec 9, 2012 at 16:01
