
I'm on Ubuntu 14.04, using Python 2.7 to scrape with rotating proxies... After a few minutes of scraping I get this error:

raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))


            if keyword1 in text and keyword2 in text and keyword3 in text:
                print("LINK SCRAPED")
                print(text, "link scraped")
                found = True 
                break 

except requests.exceptions.ConnectionError as err:
    print("Encountered ConnectionError, retrying: {}".format(err))

If this is not the correct way to implement try, am I right that only the request goes inside the try clause and everything else comes after the except?

  • I will remove the beautifulsoup tag. Commented Jan 6, 2017 at 2:46

1 Answer


Instead of restarting the script, you can handle the error using a try / except statement.

For example:

try:
    response = requests.get(url)  # the line of code that is failing
except requests.exceptions.ConnectionError as err:
    print("Encountered ConnectionError, retrying: {}".format(err))

Then just retry the original call.
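
For example, a minimal retry wrapper might look like this (the fetch name and the max_attempts cap are my own additions, not something from your script):

import time
import requests

def fetch(url, max_attempts=5, **kwargs):
    # Hypothetical helper: retry the GET on ConnectionError,
    # up to max_attempts times.
    for attempt in range(max_attempts):
        try:
            return requests.get(url, **kwargs)
        except requests.exceptions.ConnectionError as err:
            print('Encountered ConnectionError, retrying: {}'.format(err))
            time.sleep(1)  # brief pause before the next attempt
    raise RuntimeError('gave up on {} after {} attempts'.format(url, max_attempts))

Capping the attempts keeps a dead proxy from looping forever, and requests.get accepts proxies and headers through **kwargs just like your direct calls.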

UPDATE: Based on your updated code sample, here's what I'd do:

from bs4 import BeautifulSoup  # lxml must be installed for the 'lxml' parser
import random
import requests
import time


proxies = {'https': '100.00.00.000:00000'}
hdr1 = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

hdrs = [hdr1] #, hdr2, hdr3, hdr4, hdr5, hdr6, hdr7]
# random.choice(hdrs) returns a whole header dict, not a User-Agent string,
# so copy the chosen dict and override its Connection header.
head = dict(random.choice(hdrs))
head['Connection'] = 'close'

#####   REQUEST  1  ####
# Retry the request until it succeeds instead of letting a
# ConnectionError kill the script.
done = False
while not done:
    try:
        a = requests.get('https://store.fabspy.com/sitemap.xml', proxies=proxies, headers=head)
        done = True
    except requests.exceptions.ConnectionError as err:
        print('Encountered ConnectionError, retrying: {}'.format(err))
        time.sleep(1)

scrape = BeautifulSoup(a.text, 'lxml')
# Pull the first <loc> entry that points at the products sitemap.
links = scrape.find_all('loc')
for link in links:
    if 'products' in link.text:
        sitemap = str(link.text)
        break

keyword1 = 'not'
keyword2 = 'on'
keyword3 = 'site'

#########    REQUEST 2 #########
done = False
while not done:
    try:
        r = requests.get(sitemap, proxies=proxies, headers=head)
        done = True
    except requests.exceptions.ConnectionError as err:
        print('Encountered ConnectionError, retrying: {}'.format(err))
        # back off a little longer between retries here
        time.sleep(random.randint(4, 6))

soup = BeautifulSoup(r.text, 'lxml')
links = soup.find_all('loc')
for link in links:
    text = link.text
    # Only report links that contain all three keywords.
    if keyword1 in text and keyword2 in text and keyword3 in text:
        print('{} link scraped'.format(text))  # plain string, not a tuple, on Python 2
        break
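
Note the pattern in both request blocks: the while not done loop keeps retrying the same URL until a request finally succeeds, sleeping between attempts so a flaky proxy isn't hammered. If you'd rather give up after a while, swap in a bounded loop like the fetch sketch above.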

5 Comments

I have attempted to apply this to a slimmer version of the script I am running; I have edited it in above. Can you verify?
Should the try statement contain the entire request loop? Or only the initial request, with the rest of the loop after the except?
@ColeWorld I just updated my answer to include a re-written code sample for ya.
Thanks, so far this seems to solve the error issue, but I think it conflicts with the keyword search. If you pass any string to the keywords where one keyword matches some link on the site, it returns that link.
Sorry, not sure what you mean by keyword search. I was only looking at handling the error correctly, I'm not terribly familiar with the other logic in your program.
