I'm working on a web scraping project in Python to collect data from a real estate website. I'm running into an issue with the addresses, as they are not always consistent.
I've already handled simple issues like pipes (|) and newlines. The main problem is that some addresses have a repeated street name, separated by a comma.
For example, I'm getting addresses like this:
'747 Geary Street, 747 Geary St, Oakland, CA 94609'
The goal is to get a single, clean address without the repetition, like:
'747 Geary Street, Oakland, CA 94609'
I've tried a few things, but I'm having trouble handling both kinds of duplication (pipe-separated and comma-separated) in a single line.
This is a training project, and the goal is to write the code myself rather than use tools such as AI.
Here is an example:
# Here is an example of the addresses I am trying to clean.
addresses_to_clean = [
'The Gantry | 1340 3rd St, San Francisco, CA',
'845 Sutter, 845 Sutter St APT 509, San Francisco, CA',
'1350 Washington Street | 1350 Washington St, San Francisco, CA',
'Parkmerced 3711 19th Ave, San Francisco, CA',
'747 Geary Street, 747 Geary St, Oakland, CA 94609'
]
# Here is the code I am using:
cleaned_addresses = [address.strip().replace("|", "") for address in addresses_to_clean]
# Of course, this does not solve the problem of repeated parts, which I am struggling with.
# This is what I want the list to look like after it's cleaned:
desired_output = [
'The Gantry, 1340 3rd St, San Francisco, CA',
'845 Sutter St APT 509, San Francisco, CA',
'1350 Washington St, San Francisco, CA',
'Parkmerced 3711 19th Ave, San Francisco, CA',
'747 Geary Street, Oakland, CA 94609'
]
# How can I write the code to get from my 'addresses_to_clean' list
# to the 'desired_output' list?
I have been trying to use a single list comprehension with .split() and .replace() to clean the addresses. I expected to get a single, clean address string for each property. However, my code either removed too much information (such as the city and state) or did not correctly handle all the different formatting issues.
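For illustration, here is a minimal sketch of one split-and-rebuild approach that reproduces the desired list for the five sample addresses. It rests on assumptions inferred only from those samples, not on any general address standard: two parts count as duplicates when they start with the same house number; for a pipe, the part after the pipe is kept; for a comma, the longer spelling is kept. The helper names `leading_number` and `clean_address` are hypothetical, and the rule would wrongly merge genuinely different streets that share a house number.

```python
import re

addresses_to_clean = [
    'The Gantry | 1340 3rd St, San Francisco, CA',
    '845 Sutter, 845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington Street | 1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, 747 Geary St, Oakland, CA 94609',
]


def leading_number(part):
    """Return the digits a part starts with, or None (hypothetical helper)."""
    match = re.match(r"\d+", part)
    return match.group() if match else None


def clean_address(address):
    """Drop a repeated street part (assumed rule: same leading house number)."""
    # Step 1: handle a pipe. This sketch assumes at most one pipe per address.
    left, sep, right = address.partition("|")
    if sep:
        left, right = left.strip(), right.strip()
        number = leading_number(left)
        if number is not None and number == leading_number(right):
            # Both sides start with the same house number: keep the right side.
            address = right
        else:
            # Otherwise the left side is treated as a building name and kept.
            address = f"{left}, {right}"

    # Step 2: walk the comma-separated parts and merge adjacent duplicates.
    cleaned = []
    for part in (p.strip() for p in address.split(",")):
        number = leading_number(part)
        if cleaned and number is not None and number == leading_number(cleaned[-1]):
            # Same house number as the previous part: keep the longer spelling.
            if len(part) > len(cleaned[-1]):
                cleaned[-1] = part
        else:
            cleaned.append(part)
    return ", ".join(cleaned)


cleaned_addresses = [clean_address(address) for address in addresses_to_clean]
```

The list comprehension stays on one line, while the per-address logic lives in a function, which keeps the duplicate check readable and testable on its own.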