I'm trying to parse an HTML file with a Perl script and extract the text of every <p> tag. In the page source, the data is written in this format:

<p> Metrics are all virtualization specific and are prioritized and grouped as follows: </p>

Here is my code:

use HTML::TagParser();

use URI::Fetch;

//my @list = $html->getElementsByTagName( "p" );

    foreach my $elem ( @list ) {
        my $tagname = $elem->tagName;
        my $attr = $elem->attributes;
        my $text = $elem->innerText;

        push (@array,"$text");

        foreach $_  (@array) {
           # print "$_\n"; 
           print $html_fh "$_\n";   
          chomp ($_);        
           push (@array1, "$_");
         }
       } 
    }

$end = $#array1+1;

print "Elements in the array: $end\n";

close $html_fh;

The problem is that the generated output is 4.60 MB, and many of the array elements are just repeated sentences. How can I avoid this repetition? Is there a more efficient way to extract only the lines I'm interested in?

3 Answers

The reason you are seeing duplicated lines is that you are printing your entire array once for every element in it.

foreach my $elem ( @list ) {
    my $tagname = $elem->tagName;
    my $attr = $elem->attributes;
    my $text = $elem->innerText;

    push (@array,"$text");      # this array is printed below

    foreach $_  (@array) {      # This is inside the other loop
       # print "$_\n"; 
       print $html_fh "$_\n";   # here comes the print
      chomp ($_);        
       push (@array1, "$_");
     }
   } 

So for example, if you have an array "foo", "bar", "baz", it would print:

foo   # first iteration
foo   # second
bar
foo   # third
bar
baz

So, to fix your duplication errors, move the second loop outside the first one.
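As a minimal sketch of that fix, the two loops can be separated as shown here. The @texts array is a hypothetical stand-in for the innerText of each <p> element, and @printed exists only to record what was written, so neither name comes from the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the text of each <p> element.
my @texts = ("foo", "bar", "baz");

# First loop: only collect the text.
my @array;
for my $text (@texts) {
    push @array, $text;
}

# Second loop, now outside the first: each element is printed once.
my @printed;
for my $line (@array) {
    print "$line\n";
    push @printed, $line;
}
```

With the loops side by side instead of nested, the number of printed lines equals the number of collected elements.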

Some other notes:

You should always use these two pragmas:

use strict;
use warnings;

They will provide more help than any other single thing that you can do. The short learning curve associated with fixing the errors they report more than makes up for itself in massively reduced debugging time.

//my @list = $html->getElementsByTagName( "p" );

Comments in Perl start with #, not //. I'm not sure whether this is a typo, because you use this array below.

foreach my $elem ( @list ) {

You don't need to actually store the tags into an array unless you need an array. This is an intermediate variable only in this case. You can simply do the following (note that for and foreach are exactly the same):

for my $elem ($html->getElementsByTagName("p")) {

These variables are also intermediate, and two of them unused.

my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");

Also note that you never have to quote a variable this way. You can simply do this:

push @array, $elem->innerText;

foreach $_  (@array) {

The $_ variable is used by default, no need to specify it explicitly.

print $html_fh "$_\n";   
chomp ($_);        
push (@array1, "$_");

I'm not sure why you chomp the variable after printing it but before storing it in the second array; that doesn't make sense to me. Also, the second array will contain exactly the same elements as the first, only duplicated.

$end = $#array1+1;

This is another intermediate variable, and also it can be simplified. The $# sigil will give you the index of the last element, but the array itself in scalar context will give you the size of it:

$end = @array1;   # size = last index + 1

But you can do this in one go:

print "Elements in the array: " . @array1 . "\n";

Note that using the concatenation operator . here enforces scalar context on the array. If you had used the comma operator , instead, the array would have been in list context and would have been expanded into a list of its elements. This is a typical way to control context.
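To illustrate the difference between the two contexts, here is a small sketch using only core Perl. The variable names $count, $last_index, $concat and $listed are made up for this example.

```perl
use strict;
use warnings;

my @array1 = ("foo", "bar", "baz");

# Scalar context: the array evaluates to its number of elements.
my $count      = @array1;
# $# gives the index of the last element, which is the size minus one.
my $last_index = $#array1;

# Concatenation forces scalar context, so the size is appended.
my $concat = "Elements in the array: " . @array1;

# The comma puts the array in list context, so its elements are expanded.
my $listed = join "", "Elements in the array: ", @array1;

print "$concat\n";
print "$listed\n";
```

The first print shows the element count, while the second shows the elements themselves run together.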

close $html_fh;

Explicitly closing a file handle is not required, as it will be closed automatically when the script ends.


If you use Web::Scraper instead, your code gets even simpler and clearer (as long as you are able to construct CSS selectors or XPath queries):

#!/usr/bin/env perl
use strict;
use warnings qw(all);

use URI;
use Web::Scraper;

my $result = scraper {
    process 'p',
        'paragraph[]' => 'text';
}->scrape(URI->new('http://www.perl.org/'));

for my $test (@{$result->{paragraph}}) {
    print "$test\n";
}

print "Elements in the array: " . (scalar @{$result->{paragraph}}) . "\n";


Here is another way to get all the content between <p> tags, this time using Mojo::DOM, which is part of the Mojolicious project.

#!/usr/bin/env perl

use strict;
use warnings;
use v5.10; # say

use Mojo::DOM;

my $html = <<'END';
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<div>Should not find this</div>
<p>Paragraph 3</p>
END

my $dom = Mojo::DOM->new($html);
# map('text') calls ->text on each element; older Mojolicious releases used pluck('text')
my @paragraphs = $dom->find('p')->map('text')->each;

say for @paragraphs;
