5

I want to take a URL and extract the domain name, that is, the string that comes after http:// or https:// and consists of letters, digits, dots, underscores, or dashes.

I wrote the regex and used Python's re module as follows:

import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)

My understanding is that m.group(1) will extract the part between () in the re.search.

The output I expect is google.co.uk, but I am getting this:

<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>

Can you show me how to use re to achieve this?

0

4 Answers

10

You need to print the captured group rather than the match object:

print(m.group(1))

Better still, check that the match succeeded first:

import re
m = re.search(r'https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
    print(m.group(1))
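
The same check can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the name extract_domain is hypothetical, the pattern is the one from the question (without the trailing .*, which is not needed for the capture), and returning None for a non-matching input is an assumption about what the caller wants.

```python
import re

# Pattern from the question: the scheme, then the captured host characters.
DOMAIN_RE = re.compile(r'https?://([A-Za-z_0-9.-]+)')

def extract_domain(url):
    """Return the captured domain, or None if the URL does not match."""
    m = DOMAIN_RE.search(url)
    return m.group(1) if m else None

print(extract_domain('https://google.co.uk?link=something'))  # google.co.uk
print(extract_domain('not a url'))                            # None
```

Returning None instead of raising keeps the "if m:" check from the answer in one place.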

3

The easiest way to do this is with the standard library's urllib.parse module:

from urllib.parse import urlsplit
s = "https://google.co.uk?link=something"
urlsplit(s).netloc

The output is:

'google.co.uk'
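
One detail worth knowing: netloc keeps any user info and port that appear in the URL, while the hostname attribute strips them and lowercases the result. A short sketch, using a made-up URL purely for illustration:

```python
from urllib.parse import urlsplit

# Hypothetical URL with user info, mixed case, and an explicit port.
parts = urlsplit("https://user@Google.co.uk:8443/path?link=something")

print(parts.netloc)    # user@Google.co.uk:8443
print(parts.hostname)  # google.co.uk
print(parts.port)      # 8443
```

If only the bare domain is wanted, hostname is usually the safer attribute to read.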

1

Jan has already provided a solution for this. Just to note, the same can be done without re. All it needs is the set of punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ to detect where the domain ends. That set is available as string.punctuation in the string module.

def domain_finder(link):
    import string
    # Note: this assumes the host has exactly three dot-separated labels,
    # as in google.co.uk.
    dot_splitter = link.split('.')

    # Skip past the scheme separator ('//') in the first piece, if present.
    separator_first = 0
    if '//' in dot_splitter[0]:
        separator_first = (dot_splitter[0].find('//') + 2)

    # Find the first punctuation character that ends the last label.
    separator_end = ''
    for i in dot_splitter[2]:
        if i in string.punctuation:
            separator_end = i
            break

    if separator_end:
        end_ = dot_splitter[2].split(separator_end)[0]
    else:
        end_ = dot_splitter[2]

    domain = [dot_splitter[0][separator_first:], dot_splitter[1], end_]
    domain = '.'.join(domain)

    return domain

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain)  # prints: google.co.uk

This is just another way of solving the same without re.
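
For hosts with any number of labels, a shorter string-only variant is possible. This is a rough sketch rather than the answer's method: the name domain_without_re is hypothetical, and it assumes the host ends at the first '/', '?', '#', or ':' and that the URL carries no user info.

```python
def domain_without_re(link):
    """Return the host part of a URL using only string methods."""
    # Drop the scheme ("https://") if there is one.
    _, sep, rest = link.partition('//')
    if not sep:
        rest = link
    # Cut at the first character that ends the host.
    for i, ch in enumerate(rest):
        if ch in '/?#:':
            return rest[:i]
    return rest

print(domain_without_re('https://google.co.uk?link=something'))  # google.co.uk
print(domain_without_re('https://google.com'))                   # google.com
```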

0

There is a library called tldextract that is very reliable for this.

Here is how it works:

import tldextract

def extractDomain(url):
    # Only handle strings that look like URLs.
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        # Join the non-empty parts: subdomain, domain, suffix.
        parts = [parsed.subdomain, parsed.domain, parsed.suffix]
        return ".".join(p for p in parts if p)
    return "NA"

# To process a file of URLs, one per line:
# with open("test.txt") as ptr, open("out.txt", "w") as op:
#     for line in ptr.read().split("\n"):
#         op.write(str(extractDomain(line)) + "\n")

print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))

The output is:

test.pythonhosted.org

2 Comments

But I need the subdomains, BTW. I think the first answer is more reliable. This library depends on hard-coded lists, so it depends on how up to date the list is.
Yes, both give good results. In my use case I only need the domain name, and there it helps a lot. I also tested 10K different URLs, and both worked without any issues.
