Extract domain name from URL using Python's re regex

Question

I want to input a URL and extract the domain name, which is the string that comes after http:// or https:// and contains letters, digits, dots, underscores, or dashes.

I wrote the regex and used Python's re module as follows:

import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)

My understanding is that m.group(1) will extract the part between () in the re.search.

The output that I expect is google.co.uk, but I am getting this:

<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>

Can you show me how to use re to achieve this?

Jan · Accepted Answer · 2019-04-26 06:35:25Z

10

You need to print the captured group rather than the match object:

print(m.group(1))

Better yet, check that there was a match first:

m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
    print(m.group(1))
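
To make the difference concrete, here is a minimal sketch contrasting the whole match with its first capture group, and showing the no-match case. The helper name extract_domain and the compiled raw-string pattern are illustrative choices, not part of the original answer.

```python
import re

# Same character class as in the question, written as a raw string and
# without the trailing '.*', which is not needed to capture the domain.
DOMAIN_RE = re.compile(r'https?://([A-Za-z_0-9.-]+)')

def extract_domain(url):
    """Return the captured domain, or None if the URL does not match.

    extract_domain is a hypothetical helper name used for illustration.
    """
    m = DOMAIN_RE.search(url)
    if m is None:
        return None
    return m.group(1)

m = DOMAIN_RE.search('https://google.co.uk?link=something')
print(m.group(0))  # whole match: https://google.co.uk
print(m.group(1))  # first capture group: google.co.uk
print(extract_domain('not a url'))  # None
```

re.search returns None when nothing matches, so calling .group(1) without the check would raise AttributeError on such input.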

Sam · Accepted Answer · 2023-04-13 15:55:10Z

3

The easiest way to do it is with the standard library's urllib.parse module:

from urllib.parse import urlsplit
s = "https://google.co.uk?link=something"
urlsplit(s).netloc

The output of this is:

'google.co.uk'
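
As a hedged sketch of how this behaves on less tidy input: netloc keeps any port or credentials, while the hostname attribute returns only the lowercased host. The example URLs and the helper name host_of are made up for illustration.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the host part of url, or None if no host is found.

    host_of is a hypothetical helper, not part of urllib.
    """
    # .hostname drops any 'user@' prefix and ':port' suffix that
    # .netloc would keep, and lowercases the result.
    return urlsplit(url).hostname

print(urlsplit("https://google.co.uk?link=something").netloc)  # google.co.uk
print(urlsplit("https://user@Example.com:8080/path").netloc)   # user@Example.com:8080
print(host_of("https://user@Example.com:8080/path"))           # example.com
print(host_of("google.co.uk/path"))                            # None: no '//' so no netloc
```

Note that a string without a scheme such as google.co.uk/path is parsed as a path, so both netloc and hostname come back empty.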

loophole_sameer · Accepted Answer · 2020-07-01 07:27:36Z

Jan has already provided a solution for this. But just to note, we can implement the same thing without using re. All it needs is the punctuation set !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ for validation purposes, which is available as string.punctuation in the string module.

import string

def domain_finder(link):
    # Note: this assumes the host has exactly three dot-separated parts,
    # as in 'google.co.uk'.
    dot_splitter = link.split('.')

    # Skip the scheme ('https://') if the first part contains one.
    separator_first = 0
    if '//' in dot_splitter[0]:
        separator_first = dot_splitter[0].find('//') + 2

    # Find the first punctuation character after the last domain label.
    separator_end = ''
    for char in dot_splitter[2]:
        if char in string.punctuation:
            separator_end = char
            break

    # Cut the last label at that character, if one was found.
    if separator_end:
        end_ = dot_splitter[2].split(separator_end)[0]
    else:
        end_ = dot_splitter[2]

    domain = [dot_splitter[0][separator_first:], dot_splitter[1], end_]
    return '.'.join(domain)

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain)  # prints: google.co.uk

This is just another way of solving the same problem without re.
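
The function above assumes the host has exactly three dot-separated parts. A sketch of a more general variant of the same idea, still without re, cuts the string at the first punctuation character that the question does not allow in a domain. The name domain_finder_general and the extra test URLs are hypothetical.

```python
import string

# Punctuation that cannot appear in the domain per the question's rules:
# everything in string.punctuation except dot, dash, and underscore.
STOP_CHARS = set(string.punctuation) - set('.-_')

def domain_finder_general(link):
    """Return the text between '//' and the first disallowed character."""
    # Skip the scheme separator if present; otherwise start at the beginning.
    marker = link.find('//')
    rest = link[marker + 2:] if marker != -1 else link
    for index, char in enumerate(rest):
        if char in STOP_CHARS:
            return rest[:index]
    return rest

print(domain_finder_general('https://google.co.uk?link=something'))        # google.co.uk
print(domain_finder_general('https://test.pythonhosted.org/Flask-Mail/'))  # test.pythonhosted.org
print(domain_finder_general('http://example.com'))                         # example.com
```

Because ':' is also treated as a stop character, a port such as :8080 is cut off as well.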

Dhamodharan · Accepted Answer · 2019-04-26 09:41:03Z

0

There is a library called tldextract which is very reliable in this case.

Here is how it works:

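import tldextract

def extractDomain(url):
    # Only handle strings that look like URLs; anything else is reported as "NA".
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        # Join the non-empty parts (subdomain, domain, suffix) with dots.
        parts = [parsed.subdomain, parsed.domain, parsed.suffix]
        return ".".join(part for part in parts if part)
    return "NA"

# To process a file of URLs, one per line:
# with open("test.txt") as ptr, open("out.txt", "w") as op:
#     for line in ptr.read().split("\n"):
#         op.write(str(extractDomain(line)) + "\n")

print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))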

The output is as follows:

test.pythonhosted.org

2 Comments

user9371654 Over a year ago

But I need the subdomains, BTW. I think the first one is more reliable. This library depends on hard-coded lists, so it depends on how up to date the list is.

Dhamodharan Over a year ago

Yes, both give good results. In my use case I have to fetch the domain name alone, and there it helps me a lot. I also ran a test on 10K different URLs, and both worked without any issues.
