Parsing and Iterating over URL List in Python

Question

website_list = [
    'https://www.zillow.com/62347390?location=Chicago%2N%23253',
    'https://www.zillow.com/82983250?location=Boston%3B%53324',
    'https://www.zillow.com/12917837?location=Miami%7K%26345',
]

How does one create a Python function (e.g. city_finder()) that produces the following output when given website_list as input?

>>> city_finder(website_list)
['Chicago', 'Boston', 'Miami']

You could use a simple regular expression like location=([^%]+) and grab the first group; see regex101.com/r/aSJxn7/1
– Jan, commented Feb 18, 2018 at 6:48

ndmeiri · Accepted Answer · 2018-02-18 07:18:41Z

3

The previous answers assume that the format of the URLs will not change. Using regular expressions does not account for unexpected URL forms.

To handle changes in the URL format, use the urllib.parse module from the Python standard library.

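Namely, use the urlparse() function, which parses a URL into its components. The component you want is the query component, which urlparse() exposes as the string attribute .query (for example, 'location=Chicago%2N%23253'). Split that string into key/value pairs and look up the location key. Avoid parse_qs() for this step: it percent-decodes each value, so 'Boston%3B%53324' would become 'Boston;S324' and the % delimiter would be lost. Finally, extract the substring before the first %.

Here's a code snippet:

from urllib.parse import urlparse

def city_finder(links):
    cities = []
    for url in links:
        # Raw query string, e.g. 'location=Chicago%2N%23253'
        query = urlparse(url).query
        # Split by hand to keep the value undecoded (parse_qs() would decode it)
        params = dict(pair.split('=', 1) for pair in query.split('&') if '=' in pair)
        cities.append(params['location'].split('%')[0])
    return cities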
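
To see what the two urllib.parse helpers return for one of the question's URLs, here is a minimal sketch that prints the raw query string next to the parse_qs() result. The variable names are illustrative only. Note that parse_qs() percent-decodes values, so any logic that splits on % is safer working on the raw string.

```python
from urllib.parse import urlparse, parse_qs

url = 'https://www.zillow.com/82983250?location=Boston%3B%53324'

# urlparse() returns the query component as an undecoded string.
raw_query = urlparse(url).query
print(raw_query)   # location=Boston%3B%53324

# parse_qs() maps each key to a list of percent-decoded values.
decoded = parse_qs(raw_query)
print(decoded)     # {'location': ['Boston;S324']}
```

Here %3B decodes to ';' and %53 decodes to 'S', which is why the decoded value no longer contains a % to split on.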

edited Feb 18, 2018 at 7:18

answered Feb 18, 2018 at 7:13

ndmeiri

ZaxR · Accepted Answer · 2018-02-18 06:52:11Z

0

You can use str.find() to find the index of "location=" and of the "%" that follows the city name. Use a list comprehension to loop through the URL list:

def city_finder(website_list):
    # Slice from just past "location=" (9 characters) up to the first "%".
    return [site[site.find("location=") + 9:site.find("%")] for site in website_list]
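
The str.find() approach relies on the first "%" in the URL being the one after the city name. As a hedged variant, the sketch below starts the search for "%" only after "location=" and skips URLs that lack the marker. The function name city_finder_from_marker and the example.com URLs are hypothetical, chosen only for illustration.

```python
def city_finder_from_marker(website_list):
    marker = "location="
    cities = []
    for site in website_list:
        start = site.find(marker)
        if start == -1:
            # No "location=" in this URL, so there is nothing to extract.
            continue
        start += len(marker)
        # Look for "%" only after the marker; -1 means the value runs to the end.
        end = site.find("%", start)
        cities.append(site[start:] if end == -1 else site[start:end])
    return cities


sample = [
    'https://www.zillow.com/62347390?location=Chicago%2N%23253',
    'https://example.com/a%20b?location=Denver%2C123',  # "%" before the marker
    'https://example.com/no-query',                      # skipped
]
print(city_finder_from_marker(sample))  # ['Chicago', 'Denver']
```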

answered Feb 18, 2018 at 6:52

ZaxR

Austin · Accepted Answer · 2018-02-18 07:08:25Z

0

Use the re module to capture the text following location= in each item of website_list. Append each extracted city to a list and return that list.

import re

website_list = [
    'https://www.zillow.com/62347390?location=Chicago%2N%23253',
    'https://www.zillow.com/82983250?location=Boston%3B%53324',
    'https://www.zillow.com/12917837?location=Miami%7K%26345',
]
regexp = re.compile("location=(.*)%")

def city_finder(website_list):
    city = []  # built per call so repeated calls do not accumulate results
    for site in website_list:
        # The group is greedy, so keep only the text before its first "%".
        city.append(regexp.search(site).group(1).split('%')[0])
    return city

print(city_finder(website_list))

Outputs:

['Chicago', 'Boston', 'Miami']
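
The extra split('%') is needed because (.*) is greedy and runs up to the last "%" in the URL. A small sketch comparing the greedy group with a negated character class on one of the question's URLs (the variable names are illustrative):

```python
import re

url = 'https://www.zillow.com/62347390?location=Chicago%2N%23253'

# Greedy: (.*) extends to the last "%" in the string.
greedy = re.search(r'location=(.*)%', url).group(1)
# Negated class: ([^%]+) stops at the first "%".
negated = re.search(r'location=([^%]+)', url).group(1)

print(greedy)   # Chicago%2N
print(negated)  # Chicago
```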

edited Feb 18, 2018 at 7:08

answered Feb 18, 2018 at 6:54

Austin

Jan · Accepted Answer · 2018-02-18 08:55:08Z

0

As per my comment, you could use

import re

website_list = [
    'https://www.zillow.com/62347390?location=Chicago%2N%23253',
    'https://www.zillow.com/82983250?location=Boston%3B%53324',
    'https://www.zillow.com/12917837?location=Miami%7K%26345',
]

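def city_finder(lst=None):
    rx = re.compile(r'location=([^%]+)')
    # "lst or []" keeps the default lst=None from raising a TypeError;
    # items without a match are skipped by the trailing "if city".
    return [city.group(1)
            for item in lst or []
            for city in [rx.search(item)]
            if city]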

print(city_finder(website_list))

Which yields

['Chicago', 'Boston', 'Miami']

answered Feb 18, 2018 at 8:55

Jan
