
I'm working on a web scraping project in Python to collect data from a real estate website. I'm running into an issue with the addresses, as they are not always consistent.

I've already handled simple issues like pipes (|) and newlines. The main problem is that some addresses have a repeated street name, separated by a comma.

For example, I'm getting addresses like this:

'747 Geary Street, 747 Geary St, Oakland, CA 94609'

The goal is to get a single, clean address without the repetition, like:

'747 Geary Street, Oakland, CA 94609'

I've tried a few things, but I'm having trouble handling both types of addresses in a single line.

This is a training project, and the goal is to write the code myself rather than use tools such as AI.

Here is an example:

# Here is an example of the addresses I am trying to clean.
addresses_to_clean = [
    'The Gantry | 1340 3rd St, San Francisco, CA',
    '845 Sutter, 845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington Street | 1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, 747 Geary St, Oakland, CA 94609'
]

# Here is the code I am using:
cleaned_addresses = [address.strip().replace("|", "") for address in addresses_to_clean]
# Of course, this does not solve the problem of repeated parts, which I am struggling with.

# This is what I want the list to look like after it's cleaned:
desired_output = [
    'The Gantry, 1340 3rd St, San Francisco, CA',
    '845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, Oakland, CA 94609'
]

# How can I write the code to get from my 'addresses_to_clean' list
# to the 'desired_output' list?

I am trying to use a single list comprehension with .split() and .replace() to clean the addresses. I expected to get a single, clean address string for each property. However, my code either removed too much information (such as the city and state) or failed to handle all the different formatting issues.

5 Comments

  • "the goal is to not use any tools such as ai" - Why? If the problem is that input data is structurally inconsistent but can be intuitively interpreted with reasonable accuracy, that's exactly what AI tools are for. Such a tool could be trained on a reasonable set of expected data, repeating that process until the results achieve the desired level of accuracy. If the goal is to replicate that functionality manually without using those tools then "How do I do that" may be a pretty broad question. Commented Sep 11 at 12:15
  • This seems like the right tool for AI, it would do a pretty good job. The alternative could be to have a database of known cities, building names, streets, state codes, postcodes etc and try to extract them out of the text. Commented Sep 11 at 12:20
  • Agreed on AI, especially GPT models. It would deal much better with yet unseen outliers. Especially as it is very hard to create a "complete" test set. I nevertheless proposed a manual way in the answer to point the way. Commented Sep 11 at 12:23
  • I appreciate that ai would be a good tool, but the idea of this project is to develop my skills with writing code without any ai. I want to make sure I have a good foundation of knowledge to resolve problems before using other tools. Commented Sep 11 at 12:50
  • Your desired output is confusing. How do you decide which of the repeated addresses to use? Commented Sep 11 at 13:51

1 Answer


You won't be able to solve this using only split() and replace(). The following code works on your examples and uses a three-step approach:

  1. Convert pipes to commas; extend this to other separator characters as needed.
  2. Normalize street suffixes to a common abbreviation, e.g. Street becomes St.
  3. Find and drop any part that is already contained in a later part of the address.
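
Step 3 on its own can be illustrated with a minimal sketch. The name drop_contained is hypothetical and used only for this illustration, and the third input is a made-up address showing a limitation of plain substring matching:

```python
def drop_contained(parts):
    """Keep only the parts that do not occur inside a later part."""
    kept = []
    for i, part in enumerate(parts):
        # Only later parts (j > i) are checked, so of two equal parts
        # the first is dropped and the last one survives.
        if not any(i < j and part in other for j, other in enumerate(parts)):
            kept.append(part)
    return kept

print(drop_contained(['845 Sutter', '845 Sutter St APT 509', 'San Francisco', 'CA']))
# ['845 Sutter St APT 509', 'San Francisco', 'CA']
print(drop_contained(['747 Geary St', '747 Geary St', 'Oakland', 'CA 94609']))
# ['747 Geary St', 'Oakland', 'CA 94609']

# Caveat (made-up input): an unrelated short part is dropped as well,
# because 'Oak' is a substring of '12 Oakland Ave'.
print(drop_contained(['Oak', '12 Oakland Ave', 'Oakland', 'CA']))
# ['12 Oakland Ave', 'Oakland', 'CA']
```

If inputs like the last one can occur in your data, comparing whole words instead of raw substrings would be one way to tighten the check.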

Feel free to adapt the steps to your needs. Your test set is almost certainly incomplete in terms of potential input, so you will have to handle further cases as they come up. But this should get you started.

import re

def clean_address(address):
    # Normalize common street suffixes
    suffix_map = {
        r'\bStreet\b': 'St',
        r'\bAvenue\b': 'Ave',
        r'\bRoad\b': 'Rd',
        r'\bBoulevard\b': 'Blvd',
        r'\bDrive\b': 'Dr',
        r'\bLane\b': 'Ln',
        r'\bCourt\b': 'Ct',
    }
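    # Treat pipes as separators, the same as commas
    normalized = address.replace('|', ',')
    # Abbreviate suffixes so that 'Street' and 'St' compare as equal
    for pattern, abbr in suffix_map.items():
        normalized = re.sub(pattern, abbr, normalized, flags=re.IGNORECASE)
    # Split into parts and trim surrounding whitespace
    parts = [part.strip() for part in normalized.split(',')]

    # Drop a part if it is contained in a later part; of two equal parts, the first is dropped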
    cleaned = []
    for i, part in enumerate(parts):
        if not any(i < j and part in other for j, other in enumerate(parts)):
            cleaned.append(part)

    return ', '.join(cleaned)

Output:

Original: The Gantry | 1340 3rd St, San Francisco, CA
Cleaned:  The Gantry, 1340 3rd St, San Francisco, CA

Original: 845 Sutter, 845 Sutter St APT 509, San Francisco, CA
Cleaned:  845 Sutter St APT 509, San Francisco, CA

Original: 1350 Washington Street | 1350 Washington St, San Francisco, CA
Cleaned:  1350 Washington St, San Francisco, CA

Original: Parkmerced 3711 19th Ave, San Francisco, CA
Cleaned:  Parkmerced 3711 19th Ave, San Francisco, CA

Original: 747 Geary Street, 747 Geary St, Oakland, CA 94609
Cleaned:  747 Geary St, Oakland, CA 94609
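
Note that the last result is '747 Geary St', while the question's desired_output keeps '747 Geary Street'. One possible variation, sketched here under the assumption that the original spelling should survive, compares parts on their normalized form but returns the untouched text. The names normalize and clean_keep_original are hypothetical, and keeping the first of two equivalent parts is a design choice of this sketch, not something stated in the question:

```python
import re

# Same suffix table as in the function above, keyed in lowercase.
SUFFIXES = {
    'street': 'St', 'avenue': 'Ave', 'road': 'Rd', 'boulevard': 'Blvd',
    'drive': 'Dr', 'lane': 'Ln', 'court': 'Ct',
}
SUFFIX_RE = re.compile(r'\b(' + '|'.join(SUFFIXES) + r')\b', re.IGNORECASE)

def normalize(part):
    """Abbreviate street suffixes; used for comparison only."""
    return SUFFIX_RE.sub(lambda m: SUFFIXES[m.group(1).lower()], part)

def clean_keep_original(address):
    # Split on pipes and commas, trim whitespace, skip empty parts.
    parts = [p.strip() for p in address.replace('|', ',').split(',')]
    parts = [p for p in parts if p]
    keys = [normalize(p) for p in parts]
    kept = []
    for i, key in enumerate(keys):
        # Equivalent to an earlier part: the first spelling wins.
        dup_earlier = key in keys[:i]
        # Strictly contained in a later part: the longer part wins.
        inside_later = any(key != other and key in other for other in keys[i + 1:])
        if not (dup_earlier or inside_later):
            kept.append(parts[i])
    return ', '.join(kept)

print(clean_keep_original('747 Geary Street, 747 Geary St, Oakland, CA 94609'))
# 747 Geary Street, Oakland, CA 94609
print(clean_keep_original('1350 Washington Street | 1350 Washington St, San Francisco, CA'))
# 1350 Washington Street, San Francisco, CA
```

This matches four of the five entries in desired_output. The exception is the Washington address, where desired_output keeps the second spelling ('St') while for the Geary address it keeps the first ('Street'). A single "first wins" or "last wins" rule cannot satisfy both, so you would need to decide which spelling to prefer.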

2 Comments

Thank you so much for this solution. It is excellent and has really helped me resolve this issue and also learn a lot.
This does not produce the desired output
