3

I have a list of URLs like this:

http://noto.zrobimystrone.pl/pucenter/images/NGdocs/
http://visionwebmkt.com/unsubscribe.php?M=879552&C=b744d324e38f5f3b0bcf549f1d57a3ab&L=20&N=497
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18982d71cd&L=1&N=3
http://www.contatoruy.in/link.php?M=86457&N=4&L=1&F=H
http://www.maxxivrimoveis.com.br/
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18982d71cd&L=1&N=2
http://arm.smilecire.com/ch+urch38146263923bpa.stor/imp-roved258021029his+health212149011
http://hurl.zonalrems.com/ge.tyo-ur584372780599hea+lth247408058un/der+control21211901
http://harp.doomyjupe.com/see.this-better/life+58291551346csexdrive663295668+better/how.981692016
http://beefy.toneyvaws.com/no+tice/how/35306640b+see/app=5429204last/attempt=457943182
http://kirk.yournjuju.com/shop/sam.sclub-win=ter/58387369768esame+673844946.bett.er-loo.k981686408
http://idly.theirpoem.com/veri-fy/notice-7853508818b2glob/al=who.43639603inc.lusion-610549278
http://wva188.suleacatan.com/credit-score/review/-551694841511001sfdghsfdgsdfg63887839
http://cop.forterins.com/app.lyto=face962540097dtolo+oko.ung268570307yo.un-ger8752507
http://vni116.gaelsyaray.com/qertqetert//-dghjghjghd5531864856415612229498430
http://ticket.prategama.com/shop/sam.sclub-win=ter/752490935same+226373195.bett.er-loo.k212801
http://cbu125.quetxviii.com/cvbnvbn7551116db537203--swrtytry664896546
http://c5a.dicadodia.com.br/pass4sp09/NetAffProTeste-1.html
http://snub.woadsbevy.com/ama/zing-753773417oppe-tun/ity+217801.is-here/now=236922473
http://mkt.livrariacultura.com.br/pub/cc?_ri_=X0Gzc2X%3DWQpglLjHJlYQGgzfB7tPi0PuyyJ71ES

I want to extract only the parent domain names, for example, turning:

http://noto.zrobimystrone.pl/pucenter/images/NGdocs/
http://visionwebmkt.com/unsubscribe.php?M=879552&C=b744d324e38f5f3b0bcf549f1d57a3ab&L=20&N=497
http://www.meguiatramandai.com.br/unsubscribe.php?M=722&C=8410431be55bf12faac13d18

into

zrobimystrone.pl
visionwebmkt.com
meguiatramandai.com.br

I have tried

awk '{gsub("http://|/.*","")}1' list.txt

and got the following results:

noto.zrobimystrone.pl
visionwebmkt.com
www.meguiatramandai.com.br
www.contatoruy.in
www.maxxivrimoveis.com.br
www.meguiatramandai.com.br
arm.smilecire.com
hurl.zonalrems.com
harp.doomyjupe.com
beefy.toneyvaws.com

but I don't know how to get only the parent name, for instance zrobimystrone.pl from noto.zrobimystrone.pl.

5 Answers

6

Using awk

awk -F \/ '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." file|sort -u

contatoruy.in
dicadodia.com.br
doomyjupe.com
forterins.com
gaelsyaray.com
livrariacultura.com.br
maxxivrimoveis.com.br
meguiatramandai.com.br
prategama.com
quetxviii.com
smilecire.com
suleacatan.com
theirpoem.com
toneyvaws.com
visionwebmkt.com
woadsbevy.com
yournjuju.com
zonalrems.com
zrobimystrone.pl
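
For readers who find the one-liner dense, here is a spelled-out sketch of the same rule. The function name parent_domain is hypothetical, and the two guards for very short hosts are an addition that the one-liner does not have. It assumes every input line starts with a scheme such as http://, so that the host is the third "/"-separated field.

```shell
# Sketch only: same "keep three labels if the second-to-last is com" rule
# as the one-liner above, with guards added for short hosts.
parent_domain() {
  awk -F/ '{
    # Fields are "http:", "", host, path..., so $3 is the host name.
    n = split($3, part, ".")
    if (n < 2) {
      print $3                                    # no dot at all: print as is
    } else if (n > 2 && part[n-1] == "com") {
      print part[n-2] "." part[n-1] "." part[n]   # e.g. name.com.br
    } else {
      print part[n-1] "." part[n]                 # e.g. name.pl
    }
  }'
}
```

It reads URLs on standard input, so `parent_domain < file | sort -u` would be the equivalent call. Like the one-liner, it only knows about suffixes of the form com.xx.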

1 Comment

Well, what else can I say but thank you? It's working perfectly. Cheers!
1

You can use this awk:

awk -F'.' '{gsub("http://|/.*","")} NF>2{$1="";$0=substr($0, 2)}1' OFS='.' list.txt
zrobimystrone.pl
visionwebmkt.com
meguiatramandai.com.br
contatoruy.in
maxxivrimoveis.com.br
meguiatramandai.com.br
smilecire.com
zonalrems.com
doomyjupe.com
toneyvaws.com
yournjuju.com
theirpoem.com
suleacatan.com
forterins.com
gaelsyaray.com
prategama.com
quetxviii.com
dicadodia.com.br
woadsbevy.com
livrariacultura.com.br

3 Comments

Cool, but I found a problem with this approach: for the domain meudis.com.br from http://meudis.com.br/media/wb.php?p=u8/u4/rs/eot/s5/rs it shows only "com.br".
If you want to be as accurate as possible, then I guess the whois database needs to be consulted; otherwise cases like this will arise with a solution like this one.
awk -F \/ '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." a.txt, where a.txt is " tiktok.com ads.faceBoook.com sub.ads.faCebook.com api.tiktok.com Google.com aws.amazon.com ", gives the output: tiktok.com faceBoook.com faCebook.com . . .
1

I guess it depends on what you mean by "parent". If you mean the zone apex in DNS (e.g., zrobimystrone.pl), then the right way to do this is to look it up in DNS. There's a trick with DNS: you get back the SOA record of the enclosing zone when you ask for the SOA of any name inside it. So, try this:

for i in $(awk '{gsub("http://|/.*","")}1' list.txt); do dig soa $i | grep -v ^\; | grep SOA | awk '{print $1}'; done

This will give you a much more accurate list, but it runs far slower, since it makes one DNS query per URL. The other answers don't take into account all the multi-label suffixes in use under some TLDs, e.g., www.somecompany.org.uk, so it all depends on how accurate you need this to be.
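
If a DNS query per URL is too slow, a middle ground is to check the last two labels against a list of known multi-label suffixes. The sketch below assumes a tiny hard-coded sample list (com.br, org.uk, co.uk) chosen purely for illustration; a real list, such as the Public Suffix List, is far longer. The function name registrable_domain is hypothetical.

```shell
# Sketch only: keeps three labels when the last two form a known suffix,
# otherwise two. The suffix list here is an illustrative sample, not complete.
registrable_domain() {
  awk -F/ -v suffixes="com.br org.uk co.uk" '
    BEGIN {
      m = split(suffixes, s, " ")
      for (i = 1; i <= m; i++) known[s[i]] = 1
    }
    {
      n = split($3, p, ".")            # $3 is the host in scheme://host/path
      tail = p[n-1] "." p[n]           # last two labels, e.g. "com.br"
      keep = 2
      if (n >= 2 && (tail in known)) keep = 3
      if (keep > n) keep = n           # host has fewer labels than that
      out = p[n - keep + 1]
      for (i = n - keep + 2; i <= n; i++) out = out "." p[i]
      print out
    }'
}
```

With this rule, http://meudis.com.br/... comes out as meudis.com.br and http://www.somecompany.org.uk/ as somecompany.org.uk, but only because both suffixes are in the sample list.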


1

A "simple" bash solution. Tested in bash shell on Solaris 11.2 x86.

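#!/bin/bash
# Split each URL on "/": "http:", "", host name, rest of the path
while IFS=/ read -r HTTP NULL FQDN PAGE
do
    PARENT=${FQDN#*.}    # drop the leftmost label and its dot
    if [[ $PARENT != *"."* ]]
        then echo "$FQDN"    # only one label left: keep the full host
        else echo "$PARENT"
    fi
done < fileOfURLs.txt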

Without the string-contains pattern test, too much of the domain could be stripped away. The if block can be reduced, so the whole script now looks like this:

#!/bin/bash
while IFS=/ read -r HTTP NULL FQDN PAGE
do
    PARENT=${FQDN#*.}
    [[ $PARENT != *"."* ]] && echo "$FQDN" || echo "$PARENT"
done < fileOfURLs.txt

The bash parameter expansion ${FQDN#*.} takes the contents of the variable FQDN and strips from the left every character up to and including the first dot.

The test condition asks whether the contents of the PARENT variable lack a dot. If there is no dot anywhere in the value, the test evaluates to true and the original FQDN contents are displayed. If the test evaluates to false (there is still a dot in the value), the contents of PARENT are displayed.
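
To see the expansion in isolation, here is a small sketch. The helper name strip_first_label is made up for this example; the sample host names come from the question and from the meudis.com.br case raised in a comment on another answer.

```shell
# Sketch only: "#" removes the shortest leading match of the pattern "*.",
# i.e. everything up to and including the first dot.
strip_first_label() {
  printf '%s\n' "${1#*.}"
}

strip_first_label www.meguiatramandai.com.br   # -> meguiatramandai.com.br
strip_first_label visionwebmkt.com             # -> com (no dot left, so the script prints the full host)
strip_first_label meudis.com.br                # -> com.br (still has a dot, so the script prints com.br)
```

The last call shows the limit of the rule: a host with a two-label suffix and no subdomain loses its own name.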


0

An easy way to get the host name from a URL (this prints the full host, www included, not the parent domain):

echo http://www.humkinar.pk | awk -F '/' '{print $3}'
www.humkinar.pk

