Extract domain name from URL in Python

Question

I am trying to extract the domain names from a list of URLs, just like in https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can look like almost anything. A few examples:
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
And so on.
The diversity of the domains doesn't allow me to use a regex as shown in "how to get domain name from URL": my script will run on an enormous number of URLs from real network traffic, so the regex would have to be enormous to catch every kind of domain mentioned above.
Unfortunately, my web research didn't turn up any efficient solution.
Does anyone have an idea of how to do this?
Any help will be appreciated!
Thank you

Gather a list of top-level domains, split your URL by dots, strip the TLD from the right, and extract the name.
– Pearley, Commented May 17, 2017 at 10:10
Yes, I can use external libs. It is not a duplicate (I even attached a link to that thread); I couldn't find a satisfying answer there.
– kobibo, Commented May 17, 2017 at 10:20
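
The approach from Pearley's comment can be sketched roughly as below. This is only an illustration: `KNOWN_SUFFIXES` and `registered_name` are made-up names, and the suffix set is a tiny hand-picked assumption covering the examples in the question, not a real suffix list.

```python
# Hypothetical, hand-picked suffix set for illustration only; real traffic
# would need a complete list such as the Public Suffix List.
KNOWN_SUFFIXES = {"com", "net", "info", "co.uk", "ac.us"}


def registered_name(host: str) -> str:
    """Return the label directly to the left of the longest known suffix."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before a shorter suffix could be.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return labels[i - 1]
    raise ValueError(f"no known suffix in {host!r}")


for host in ("m.docs.google.com",
             "www.someisotericdomain.innersite.mall.co.uk",
             "www.ouruniversity.department.mit.ac.us"):
    print(host, "=>", registered_name(host))
```

The quality of the result depends entirely on how complete the suffix set is, which is the part a maintained library takes care of.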

robertspierre · Accepted Answer · 2024-12-17 11:17:27Z

47

Use tldextract. Unlike urlparse, which only splits a URL into components such as scheme and netloc, it knows about public suffixes such as co.uk.

tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL.

>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'

edited Dec 17, 2024 at 11:17

robertspierre

answered May 17, 2017 at 10:40

akash karothiya

2 Comments

alphazwest Over a year ago

Note: the tldextract library makes an HTTP request on first use and creates a cache of the latest TLD data. This can raise a permission error in some remote deployments. See here: github.com/john-kurkowski/tldextract#note-about-caching

manjy Dec 11, 2024 at 11:44

@alphazwest I wish I had read your comment before pushing my changes to prod

Mariano Anaya · Accepted Answer · 2017-05-17 10:12:36Z

6

It seems you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) to parse the URL and then take its netloc.

From the netloc you can then extract the domain name by splitting on dots.

answered May 17, 2017 at 10:12

Mariano Anaya

3 Comments

kobibo Over a year ago

Thank you for your response, unfortunately, using urlparse on url like m.city.domain.com returned me ParseResult(scheme='', netloc='', path='m.city.domain.com', params='', query='', fragment=''), while the expected output was domain

user9608133 Over a year ago

Use a valid URL (//m.city.domain.com/), not something like (m.city.domain.com). Nobody can guess what you passed once you removed the slashes.

jfs Over a year ago

@kobibo urlparse follows rfc 1808 syntax which requires // before net_loc docs.python.org/3/library/urllib.parse.html
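
A minimal sketch of the point made in these comments: put `//` in front of scheme-less input before handing it to urlparse. The helper name `host_of` and the scheme check are assumptions made for illustration, not part of the answer.

```python
import re
from urllib.parse import urlparse

# Matches a leading scheme such as "http://" or "ftp://".
_HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def host_of(url: str) -> str:
    """Return the host part of a URL, with or without a scheme."""
    # Without a leading "//", urlparse puts the whole string into `path`.
    if not url.startswith("//") and not _HAS_SCHEME.match(url):
        url = "//" + url
    return urlparse(url).hostname or ""


print(host_of("m.city.domain.com"))
print(host_of("https://m.city.domain.com/a?b=1"))
```

Both calls give the full host name m.city.domain.com; reducing that to just "domain" still needs knowledge of the suffix.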

Kapil_Khatik · Accepted Answer · 2023-01-24 08:23:32Z

3

To extract the domain from a URL:

from urllib.parse import urlparse

url = "https://stackoverflow.com/questions/44021846/extract-domain-name-from-url-in-python"
domain = urlparse(url).netloc
print(domain)  # stackoverflow.com

To check whether the URL's domain is in a list:

if urlparse(url).netloc in ["domain1", "domain2", "domain3"]:
    pass  # do something

edited Jan 24, 2023 at 8:23

answered Jan 24, 2023 at 7:45

Kapil_Khatik

1 Comment

jfs Over a year ago

urlparse(url).netloc may include the port. You might want urlparse(url).hostname instead (to get the domain without the port).
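
To illustrate the difference, here is a small example with a made-up URL that carries user info and a port:

```python
from urllib.parse import urlparse

# Illustrative URL, not taken from the thread.
parts = urlparse("https://user@Example.com:8443/path")

print(parts.netloc)    # user@Example.com:8443
print(parts.hostname)  # example.com
print(parts.port)      # 8443
```

`hostname` also lowercases the name, which helps when comparing against a list of known domains.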

Sharif Orzikulov · Accepted Answer · 2020-05-20 12:03:27Z

1

Simple solution via string splitting (no regex needed)

def domain_name(url):
    # keep what follows "www." and "//", then take the first label
    return url.split("www.")[-1].split("//")[-1].split(".")[0]

answered May 20, 2020 at 12:03

Sharif Orzikulov

2 Comments

Steve Gon Over a year ago

Gets the first part of the domain, not the actual domain. Only works for things like www.google.com

Pedro Lobito Over a year ago

Unreliable solution, avoid.

oddRaven · Accepted Answer · 2017-05-17 10:34:05Z

0

With regex, you could use something like this:

(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))

https://regex101.com/r/WQXFy6/5

Note that every multi-part suffix such as co.uk has to be listed in the pattern explicitly.
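
A rough sketch of how the pattern above could be applied from Python. The wrapper name `domain_from_host` is an assumption for illustration; the regex itself is copied unchanged.

```python
import re

# Pattern from this answer, unchanged.
PATTERN = re.compile(r"(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))")


def domain_from_host(host):
    """Return the first capture group, or None when nothing matches."""
    match = PATTERN.search(host)
    return match.group(1) if match else None


print(domain_from_host("m.google.com"))
print(domain_from_host("www.someisotericdomain.innersite.mall.co.uk"))
print(domain_from_host("github.com"))      # no dot before "github"
print(domain_from_host("www.shop.co.jp"))  # co.jp is not listed
```

The first two calls give google and mall. The last two show the limits: a bare github.com does not match at all because of the lookbehind, and an unlisted suffix such as co.jp yields co.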

answered May 17, 2017 at 10:34

oddRaven

Denis · Accepted Answer · 2022-05-31 16:08:21Z

0

Check the replace and split methods.

PS: ONLY WORKS FOR SIMPLE LINKS LIKE https://youtube.com (output=youtube) AND (www.user.ru.com) (output=user)

def domain_name(url):
    # [-1] rather than [1], so URLs with both a scheme and "www." also work
    return url.replace("www.", "http://").split("//")[-1].split(".")[0]

answered May 31, 2022 at 16:08

Denis

Jup · Accepted Answer · 2022-12-25 09:12:49Z

0

import re

def getDomain(url: str) -> str:
    '''
        Return the last two labels of the host name in a URL.
        Multi-part suffixes are not handled: www.mall.co.uk gives co.uk.
    '''
    # drop the scheme (http://, https://, ...) if present
    host = url.split('//', 1)[-1]

    # drop the path, query and fragment
    host = re.split(r'[/?#]', host, maxsplit=1)[0]

    # drop the port number, if any
    host = re.sub(r':[0-9]+$', '', host)

    # keep only the last two labels
    return '.'.join(host.split('.')[-2:])

edited Dec 25, 2022 at 9:12

answered Dec 25, 2022 at 9:10

Jup

1 Comment

Community Over a year ago

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

gies0r · Accepted Answer · 2023-01-21 18:15:25Z

from urllib.parse import urlparse
import validators  # third-party package

# `rows` is assumed to be an iterable of text lines whose second
# space-separated field is a URL or a host name.
hostnames = []
counter = 0
errors = 0
for row_orig in rows:
    try:
        row = row_orig.strip().split(' ')[1]
        if len(row) < 5:
            print(f"Empty row {row_orig}")
            errors += 1
            continue
        if row.startswith('http'):
            domain = urlparse(row).netloc  # works for https and http
        else:
            domain = row

        if ':' in domain:
            domain = domain.split(':')[0]  # cut off the port once the scheme is gone

        # Finally validate it
        if not (validators.domain(domain) or validators.ipv4(domain)):
            print(f"Invalid domain/IP {domain}. RAW: {row}")
            errors += 1
            continue

        hostnames.append(domain)
        if counter % 10000 == 1:
            print(f"Added {counter}. Errors {errors}")
        counter += 1
    except Exception:
        print("Error in extraction")
        errors += 1

FastFingertips · Accepted Answer · 2024-02-10 13:40:37Z

tests = {
  "m.google.com": 'google',
  "m.docs.google.com": 'google',
  "www.someisotericdomain.innersite.mall.co.uk": 'mall',
  "www.ouruniversity.department.mit.ac.us": 'mit',
  "www.somestrangeurl.shops.relevantdomain.net": 'relevantdomain',
  "www.example.info": 'example',
  "github.com": 'github',
}

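def get_domain(url, loop=0, data=None):
  # Heuristic only: no suffix list is used, so this covers
  # host names shaped like the test cases above.

  dot_count = url.count('.')

  if not dot_count:
    raise ValueError("Invalid URL")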

  # basic
  if not loop:
    if dot_count < 3:
      data = {
        'main':  url.split('.')[0 if dot_count == 1 else 1]
        }

  # advanced
  if not data and '.' in url:
      if dot_count > 1:
        loop += 1
        start = url.find('.')+1
        end = url.rfind('.') if dot_count != 2 else None
        return get_domain(url[start:end], loop, data)
      else:
        data ={
          'main': url.split('.')[-1]
          }

  return data

for u, v in tests.items():
  result = get_domain(u)['main']
  print(u, '=>', result, 'OK' if result == v else f'expected {v}')
