How to clean inconsistent address strings in Python?

Question

I'm working on a web scraping project in Python to collect data from a real estate website. I'm running into an issue with the addresses, as they are not always consistent.

I've already handled simple issues like pipes (|) and newlines. The main problem is that some addresses have a repeated street name, separated by a comma.

For example, I'm getting addresses like this:

'747 Geary Street, 747 Geary St, Oakland, CA 94609'

The goal is to get a single, clean address without the repetition, like:

'747 Geary Street, Oakland, CA 94609'

I've tried a few things, but I'm having trouble handling both types of addresses in a single line.

This is a training project, and the goal is to not use any tools such as ai, but to write the code myself.

Here is an example:

# Here is an example of the addresses I am trying to clean.
addresses_to_clean = [
    'The Gantry | 1340 3rd St, San Francisco, CA',
    '845 Sutter, 845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington Street | 1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, 747 Geary St, Oakland, CA 94609'
]

# Here is the code I am using:
cleaned_addresses = [address.strip().replace("|", "") for address in addresses_to_clean]
# Of course, this does not solve the problem of repeated parts, which I am struggling with.

# This is what I want the list to look like after it's cleaned:
desired_output = [
    'The Gantry, 1340 3rd St, San Francisco, CA',
    '845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, Oakland, CA 94609'
]

# How can I write the code to get from my 'addresses_to_clean' list
# to the 'desired_output' list?

I am trying to use a single list comprehension with .split() and .replace() to clean the addresses. I was expecting to get a single, clean address string for each property. However, my code either removed too much information (like the city and state) or didn't correctly handle all the different formatting issues.

"the goal is to not use any tools such as ai" - Why? If the problem is that input data is structurally inconsistent but can be intuitively interpreted with reasonable accuracy, that's exactly what AI tools are for. Such a tool could be trained on a reasonable set of expected data, repeating that process until the results achieve the desired level of accuracy. If the goal is to replicate that functionality manually without using those tools then "How do I do that" may be a pretty broad question.
– David, Commented Sep 11 at 12:15
This seems like the right tool for AI; it would do a pretty good job. The alternative could be to have a database of known cities, building names, streets, state codes, postcodes etc. and try to extract them out of the text.
– Tom McLean, Commented Sep 11 at 12:20
Agreed on AI, especially GPT models. It would deal much better with as-yet-unseen outliers, especially as it is very hard to create a "complete" test set. I nevertheless proposed a manual way in the answer to point the way.
– André, Commented Sep 11 at 12:23
I appreciate that AI would be a good tool, but the idea of this project is to develop my skills with writing code without any AI. I want to make sure I have a good foundation of knowledge to resolve problems before using other tools.
– Adamzam15, Commented Sep 11 at 12:50
Your desired output is confusing. How do you decide which of the repeated addresses to use?
– jackal, Commented Sep 11 at 13:51

André · Accepted Answer · 2025-09-11 12:21:51Z

You won't be able to solve this using split() and replace() alone. The following code works on your examples and uses a three-step approach:

1. Convert pipes to commas. You can extend this to other separator characters as needed.
2. Normalize street suffixes to a common abbreviation, e.g. Street becomes St.
3. Find and drop any part that is already contained in a later part of the address.

Feel free to adapt the steps to your needs. Your test set almost certainly does not cover every possible input, so you will have to handle further cases as they come up. But this should get you started.

import re

def clean_address(address):
    # Normalize common street suffixes
    suffix_map = {
        r'\bStreet\b': 'St',
        r'\bAvenue\b': 'Ave',
        r'\bRoad\b': 'Rd',
        r'\bBoulevard\b': 'Blvd',
        r'\bDrive\b': 'Dr',
        r'\bLane\b': 'Ln',
        r'\bCourt\b': 'Ct',
    }
    normalized = address.replace('|', ',')
    for pattern, abbr in suffix_map.items():
        normalized = re.sub(pattern, abbr, normalized, flags=re.IGNORECASE)
    parts = [part.strip() for part in normalized.split(',')]

    # Drop a part if it is contained in (or equal to) a later part, so the shorter or first duplicate is removed
    cleaned = []
    for i, part in enumerate(parts):
        if not any(i < j and part in other for j, other in enumerate(parts)):
            cleaned.append(part)

    return ', '.join(cleaned)

Output:

Original: The Gantry | 1340 3rd St, San Francisco, CA
Cleaned:  The Gantry, 1340 3rd St, San Francisco, CA

Original: 845 Sutter, 845 Sutter St APT 509, San Francisco, CA
Cleaned:  845 Sutter St APT 509, San Francisco, CA

Original: 1350 Washington Street | 1350 Washington St, San Francisco, CA
Cleaned:  1350 Washington St, San Francisco, CA

Original: Parkmerced 3711 19th Ave, San Francisco, CA
Cleaned:  Parkmerced 3711 19th Ave, San Francisco, CA

Original: 747 Geary Street, 747 Geary St, Oakland, CA 94609
Cleaned:  747 Geary St, Oakland, CA 94609
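
One input shape the examples above do not cover is a shorter duplicate that comes after the longer one, e.g. '845 Sutter St APT 509, 845 Sutter, ...'. The check in clean_address only looks at later parts, so both would be kept. A minimal sketch of a direction-independent check follows; drop_contained_parts and contains_whole are hypothetical helper names, and matching on word boundaries (so that 'CA' is not found inside 'CASTRO') is an assumption of this sketch rather than something the answer does.

```python
import re


def contains_whole(part, other):
    """True if `part` occurs in `other` and is not embedded in a longer word."""
    pattern = r'(?<!\w)' + re.escape(part) + r'(?!\w)'
    return re.search(pattern, other) is not None


def drop_contained_parts(parts):
    """Drop parts that repeat an earlier part or sit inside a longer part."""
    kept = []
    for i, part in enumerate(parts):
        redundant = any(
            (other == part and j < i)  # exact duplicate of an earlier part
            or (other != part and contains_whole(part, other))  # inside a longer part
            for j, other in enumerate(parts)
        )
        if not redundant:
            kept.append(part)
    return kept


# The shorter duplicate comes AFTER the longer one here.
parts = ['845 Sutter St APT 509', '845 Sutter', 'San Francisco', 'CA']
print(drop_contained_parts(parts))
# ['845 Sutter St APT 509', 'San Francisco', 'CA']
```

This operates on the already split and normalized parts, so it could stand in for the final loop of clean_address if that behaviour is wanted.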

2 Comments

Adamzam15 Sep 11 at 12:51

Thank you so much for this solution. It is excellent and has really helped me resolve this issue and also learn a lot.

jackal Sep 11 at 13:47

This does not produce the desired output.
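
On the mismatch with the desired output: clean_address returns '747 Geary St' for the last example because the suffix replacement is applied to the text that is returned, while the question asks for '747 Geary Street'. One possible variation, sketched here under assumptions, uses the normalized form only for comparison and returns the original spelling of whichever part is kept. The names clean_address_keep_spelling, match_key and is_sublist, the shortened SUFFIXES table, and the "first duplicate wins" rule are all assumptions of this sketch, not part of the answer.

```python
# Hypothetical, shortened suffix table; extend it to match your data.
SUFFIXES = {'street': 'st', 'avenue': 'ave', 'road': 'rd', 'boulevard': 'blvd'}


def match_key(part):
    """Lowercased token list with suffixes abbreviated; used for comparison only."""
    return [SUFFIXES.get(word, word) for word in part.lower().split()]


def is_sublist(shorter, longer):
    """True if the tokens of `shorter` appear consecutively inside `longer`."""
    n = len(shorter)
    return any(longer[k:k + n] == shorter for k in range(len(longer) - n + 1))


def clean_address_keep_spelling(address):
    # Split on pipes and commas, dropping empty pieces.
    parts = [p.strip() for p in address.replace('|', ',').split(',')]
    parts = [p for p in parts if p]
    keys = [match_key(p) for p in parts]

    kept = []
    for i, key in enumerate(keys):
        # Assumption: among equivalent parts, the first one wins.
        duplicate_of_earlier = key in keys[:i]
        # A part whose tokens sit inside a different, longer part is redundant.
        inside_longer = any(
            other != key and is_sublist(key, other) for other in keys
        )
        if not (duplicate_of_earlier or inside_longer):
            kept.append(parts[i])  # original spelling, not the normalized key
    return ', '.join(kept)


print(clean_address_keep_spelling('747 Geary Street, 747 Geary St, Oakland, CA 94609'))
# 747 Geary Street, Oakland, CA 94609
```

With this rule the third example comes out as '1350 Washington Street, San Francisco, CA' rather than the 'St' form in desired_output. The desired output keeps the first spelling in one case and the second in the other, so neither a plain first-wins nor a plain last-wins rule can reproduce both.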
