1

Hello, I have a string that contains an email address, for example ([email protected]), and a list that contains only domains ('bar.com', 'stackoverflow.com', etc.).

I want to check whether the list contains my string's domain. Right now I am using code like this:

if tokens[1].partition("@")[2] in domainlist:

tokens[1] contains the email address and domainlist contains the domains. As you can see, tokens[1].partition("@")[2] returns foo.bar.com, but my list only has the domain bar.com. How can I make this if statement evaluate to true? It also needs to be very fast, because hundreds of email addresses will arrive every second.

1

5 Answers

4

It should work like this:

if any(tokens[1].endswith(domain) for domain in domainlist): 

3 Comments

Too simple: the domain 'foo.com' would also match 'somefoo.com'.
Oops, sorry... it has one problem: it returns true for '[email protected]' if 'bar.com' is in the domainlist. Probably not what OP wants... :-(
See stackoverflow.com/questions/5908190/… for a way to avoid doing 1000x the work when len(domainlist) > 1000. Otherwise this is a valid answer.
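
To make the false positive from the comments concrete, here is a minimal sketch. The function names and sample addresses are made up for illustration; the second function is one possible fix (requiring an '@' or '.' right before the domain), not necessarily what the answerer intended.

```python
domainlist = ['bar.com', 'stackoverflow.com']

def naive_match(address):
    # Plain suffix test, as in the answer above.
    return any(address.endswith(domain) for domain in domainlist)

def boundary_match(address):
    # Require '@' or '.' immediately before the listed domain.
    # str.endswith accepts a tuple of candidate suffixes.
    return any(address.endswith(('@' + d, '.' + d)) for d in domainlist)

print(naive_match('user@foobar.com'))      # True (the false positive)
print(boundary_match('user@foobar.com'))   # False
print(boundary_match('user@foo.bar.com'))  # True
```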
2

If speed is really an issue for you, you can look into methods like Aho-Corasick. There are plenty of implementations available, such as esmre/esm: http://code.google.com/p/esmre/

As pointed out by @Riccardo Galli, simple string matching will produce some false positives, so you can try esmre first, adding the corresponding regexes to the index, something like index.enter("(^|\.){0}$".format(domain))
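
To show what that anchored pattern is meant to do, here is a rough sketch that uses Python's standard re module instead of esmre (so it says nothing about esmre's actual API or speed). The helper names are hypothetical. Note that the domains are passed through re.escape so the dots are matched literally, and \Z is used in place of $ so a trailing newline cannot sneak through.

```python
import re

def compile_domain_matcher(domains):
    # One alternation of all listed domains, each escaped so '.' is literal.
    alternation = '|'.join(re.escape(d) for d in domains)
    # The match must begin at the start of the string or right after a dot,
    # and must run to the very end of the string.
    return re.compile(r'(?:^|\.)(?:' + alternation + r')\Z')

matcher = compile_domain_matcher(['bar.com', 'stackoverflow.com'])

def domain_matches(address):
    # Everything after the last '@' is treated as the domain part.
    domain = address.rpartition('@')[2]
    return matcher.search(domain) is not None

print(domain_matches('user@foo.bar.com'))  # True
print(domain_matches('user@foobar.com'))   # False
```

With a very long domain list a single big alternation may get slow, which is where a dedicated multi-pattern matcher would come in.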

1

Unlike the other answers, here 'foo.com' does not also match '@y.afoo.com':

def mailInDomains(mail, domains):
    # True if mail ends with a listed domain that is
    # immediately preceded by '.' or '@'.
    for domain in domains:
        if mail.endswith(('.' + domain, '@' + domain)):
            return True

    return False

1

First, make domainlist a set. It will be faster to check whether there is something contained in it.

Second, add all 'superdomains' into this set, such as 'bar.com' for 'foo.bar.com'.

domainlist = ['foo.bar.com', 'bar2.com', 'foo3.bar3.foobar.com']
domainset = set()
for domain in domainlist:
    parts = domain.split('.')
    domainset.update('.'.join(parts[i:]) for i in range(len(parts)-1))

#domainset is now:
set(['bar.com',
     'bar2.com',
     'bar3.foobar.com',
     'foo.bar.com',
     'foo3.bar3.foobar.com',
     'foobar.com'])

And now you can test

if tokens[1].partition("@")[2] in domainset:

1

Hundreds of email addresses per second should not be an issue. The following is a one-liner:

any(domain.endswith(d) for d in MY_DOMAINS)

Here, you can get domain with user,sep,domain = address.rpartition('@'). Use rpartition rather than partition, because your current method will fail for email addresses such as "B@tm4n"@something.com, which are valid according to https://www.rfc-editor.org/rfc/rfc5322

If performance becomes a factor, you can use a trie (a tree-shaped data structure). If performance is still a factor, you can use other tricks.

The one-liner above goes through each element of the domains you're checking, so if you have 1000 domains in your list, you need to do 1000 comparisons for each email address. If this is an issue, you can do the following to get an O(1) set lookup per suffix (you also probably want to make sure you're not checking more than 5 suffixes, to protect yourself from maliciously crafted email addresses):

MY_DOMAINS = set(MY_DOMAINS)

def suffixes(domain):
    """
    suffixes('foo.bar.com') -yields-> 'foo.bar.com', 'bar.com', 'com'
    """
    while True:
        yield domain
        parts = domain.split('.', 1)
        if len(parts) > 1:
            domain = parts[1]  # drop the leftmost label
        else:
            break

def isInList(address):
    # rpartition splits on the last '@', so quoted local parts still work
    user, sep, domain = address.rpartition('@')
    return any(suffix in MY_DOMAINS for suffix in suffixes(domain))
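
For the trie mentioned earlier in this answer, together with the cap on how many labels get examined, a minimal sketch might look like the following. Everything here is an assumption for illustration: the names build_trie and in_trie, the nested-dict layout, the use of None as an end-of-domain marker, and the default cap of 5 labels.

```python
def build_trie(domains):
    """Build a nested-dict trie keyed on domain labels, right to left."""
    root = {}
    for domain in domains:
        node = root
        for label in reversed(domain.lower().split('.')):
            node = node.setdefault(label, {})
        node[None] = True  # None marks "a listed domain ends here"
    return root

def in_trie(address, trie, max_labels=5):
    """True if the address's domain is a listed domain or a subdomain of one."""
    _user, _sep, domain = address.rpartition('@')
    # Only the rightmost max_labels labels are examined, which bounds the
    # work done for absurdly long, maliciously crafted domains.
    labels = domain.lower().split('.')[-max_labels:]
    node = trie
    for label in reversed(labels):
        node = node.get(label)
        if node is None:
            return False  # no listed domain continues with this label
        if None in node:
            return True   # reached the end of a listed domain
    return False

trie = build_trie(['bar.com', 'stackoverflow.com'])
print(in_trie('user@foo.bar.com', trie))  # True
print(in_trie('user@foobar.com', trie))   # False
```

The walk stops at the first label that no listed domain shares, so the cost depends on the number of labels in the address rather than on the size of the domain list. One consequence of the cap is that a listed domain with more than max_labels labels would never match.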
