1

I'm trying to save the whole web page on my system as a .html file and then parse that file, to find some tags and use them.

I'm able to save/parse http://<url>, but not able to save/parse https://<url>. I'm using Perl.

I'm using the following code to save pages. It works fine for HTTP URLs but doesn't work for HTTPS:

use strict; 
use warnings; 
use LWP::Simple qw($ua get);
use LWP::UserAgent;
use LWP::Protocol::https;
use HTTP::Cookies;

sub main
{
  my $ua = LWP::UserAgent->new();

  my $cookies = HTTP::Cookies->new(
    file => "cookies.txt",
    autosave => 1,
    );
 
  $ua->cookie_jar($cookies);
 
  $ua->agent("Google Chrome/30");
 

#$ua->ssl_opts( SSL_ca_file => 'cert.pfx' );

  $ua->proxy('http','http://proxy.com');
  my $response = $ua->get('http://google.com');

#$ua->credentials($response, "", "usrname", "password");
 
  # Stop here on failure, so an error page is not saved as if it were content.
  unless($response->is_success) {
    die "Error: " . $response->status_line . "\n";
  }
 
         
    # Let's save the output.
  my $save = "save.html";
 
  unless(open SAVE, '>' . $save) {
    die "\nCannot create save file '$save'\n";
  }
 
    # Without this line, we may get a
    # 'wide characters in print' warning.
  binmode(SAVE, ":utf8");
 
  print SAVE $response->decoded_content;
 
  close SAVE;
 
  print "Saved ",
      length($response->decoded_content),
      " bytes of data to '$save'.";
}

main();

Is it possible to parse an HTTPS page?

1
  • Any errors running this one-liner? perl -MLWP::UserAgent -e '$ua=LWP::UserAgent->new;print $ua->get("https://github.com")->decoded_content();' Commented Oct 18, 2013 at 6:26

2 Answers

5

Always worth checking the documentation for the modules that you're using...

You're using modules from libwww-perl. That includes a cookbook. And in that cookbook, there is a section about HTTPS, which says:

URLs with https scheme are accessed in exactly the same way as with http scheme, provided that an SSL interface module for LWP has been properly installed (see the README.SSL file found in the libwww-perl distribution for more details). If no SSL interface is installed for LWP to use, then you will get "501 Protocol scheme 'https' is not supported" errors when accessing such URLs.

The README.SSL file says this:

As of libwww-perl v6.02 you need to install the LWP::Protocol::https module from its own separate distribution to enable support for https://... URLs for LWP::UserAgent.

So you just need to install LWP::Protocol::https.
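To tell a missing SSL interface apart from any other failure, the status line quoted from the cookbook can be checked directly. The following is only a minimal sketch: `explain_failure` is a hypothetical helper (not part of LWP), and the URL is taken from the command line rather than hard-coded.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): turn a failed response's
# status line into a more useful hint.
sub explain_failure {
    my ($status_line) = @_;

    # This is the error the cookbook describes when no SSL interface is installed.
    return "No SSL interface for LWP is installed; install LWP::Protocol::https"
        if $status_line =~ /^501\b.*Protocol scheme 'https' is not supported/;

    return "Request failed: $status_line";
}

# Only fetch when a URL is given on the command line.
if (my $url = shift @ARGV) {
    require LWP::UserAgent;
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);

    if ($response->is_success) {
        print "Fetched ", length($response->decoded_content), " characters\n";
    }
    else {
        die explain_failure($response->status_line), "\n";
    }
}
```

Saved under a name of your choice (say, check_https.pl, a made-up filename), it could be run as `perl check_https.pl https://github.com`. If the hint about LWP::Protocol::https appears, installing that module (for example with `cpan LWP::Protocol::https`) should be the fix.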

0

You need to have Crypt::SSLeay (https://metacpan.org/module/Crypt::SSLeay) installed for https links.

It provides SSL support for LWP.

Bit me in the ass with a project of my own.

1 Comment

Actually, in newer versions of libwww-perl you need to make sure that LWP::Protocol::https is installed (installing it also forces an SSL module to be installed).
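A quick way to see which of these modules are actually present is to try loading each one. This is a rough sketch using only core Perl. `module_available` is a hypothetical helper name, and IO::Socket::SSL is included on the assumption that it is the SSL module pulled in by LWP::Protocol::https; it is not named in the answers here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;

    # Convert Foo::Bar to Foo/Bar.pm so require accepts it as a path.
    (my $file = "$module.pm") =~ s{::}{/}g;

    return eval { require $file; 1 } ? 1 : 0;
}

# Modules relevant to HTTPS support in LWP.
for my $module (qw(LWP::UserAgent LWP::Protocol::https IO::Socket::SSL Crypt::SSLeay)) {
    printf "%-22s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

If LWP::Protocol::https shows as MISSING, that would explain why http:// URLs work while https:// URLs do not.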
