
A script reads links from a CSV file and scrapes some info from the corresponding webpages. Some links don't work and the script fails on them. I've added a try/except, but this messes up my output, since I need the same number of output rows as in the original file.

import csv
import urllib2
import lxml.html

# 'reader' is a csv.reader over the input file
for row in reader:
    try:
        url = row[4]
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except:
        continue

Is there a way to delete a row from the CSV file when its link is faulty? Something like:

for row in reader:
    try:
        url = row[4]
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except:
        DELETE_THE_ROW  # pseudocode: drop this row from the CSV
        continue
  • Why do you "need the same number of output rows as in the original file"? Commented Oct 3, 2014 at 15:42

2 Answers


The best approach would be to create a new CSV file and write to it only the rows whose links are valid.

f = open('another_csv.csv', 'w+')
for row in reader:
    try:
        url = row[4]
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
        print >>f, ','.join(row)
    except:
        # could log the faulty links in another file here
        continue
f.close()
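A minimal sketch of that logging idea, assuming illustrative file names (bad_links.txt and original.csv are my inventions, not from the question):

import csv
import urllib2
import lxml.html

# Sketch: write good rows to one file and failed URLs to another.
# The file names are arbitrary choices for illustration.
reader = csv.reader(open('original.csv', 'rb'))
bad = open('bad_links.txt', 'w')
f = open('another_csv.csv', 'w+')
for row in reader:
    url = row[4]
    try:
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
        print >>f, ','.join(row)
    except:
        print >>bad, url  # record the URL that failed
        continue
f.close()
bad.close()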

You can rename the new CSV to the original name, or keep both.
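For the rename, a sketch (file names are illustrative; on Windows, os.rename raises an error if the destination exists, so remove the old file first):

import os

# Replace the original file with the cleaned copy. On POSIX systems
# os.rename overwrites an existing destination; this loses the rows
# with faulty links, so keep a backup if you still need them.
os.rename('another_csv.csv', 'original.csv')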


3 Comments

That works, but with some complications. Since there are commas in the original file (like in article headlines), the new file with ',' delimiter is super messed up. Is there a way to circumvent this problem?
Here you go: print >>f, '"' + '","'.join(row) + '"'
Or you could use csv.writer directly, as @Yann mentioned. It'll quote only those fields that contain a comma. Quoting all fields also increases the file size.
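To illustrate that quoting behavior: csv.writer's default QUOTE_MINIMAL dialect quotes only fields containing the delimiter, a quote character, or a newline, so embedded commas survive a round trip.

import csv

# The default quoting mode is csv.QUOTE_MINIMAL: only fields that
# need quoting get quoted. 'wb' because Python 2's csv module
# expects binary-mode files.
with open('another_csv.csv', 'wb') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['Headline, with a comma', 'http://example.com'])
# File contents: "Headline, with a comma",http://example.com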

If all goes well, why don't you write the good rows to another file?

writer = csv.writer(out_file_handle)
for row in reader:
    try:
        url = row[4]
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except:
        continue
    else:
        writer.writerow(row)
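The snippet assumes reader and out_file_handle already exist; one plausible setup (the file names are illustrative, and Python 2's csv module wants binary-mode files):

import csv
import urllib2
import lxml.html

# Illustrative setup for the handles the answer assumes.
in_file_handle = open('original.csv', 'rb')
out_file_handle = open('good_rows.csv', 'wb')

reader = csv.reader(in_file_handle)
writer = csv.writer(out_file_handle)
for row in reader:
    try:
        url = row[4]
        req = urllib2.Request(url)
        tree = lxml.html.fromstring(urllib2.urlopen(req).read())
    except Exception:  # narrower than a bare except: lets Ctrl-C through
        continue       # skip rows whose link could not be fetched
    else:
        writer.writerow(row)

in_file_handle.close()
out_file_handle.close()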

