
I have a list of URLs like:

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']

I just want to resolve every short URL in that list to its full URL.

Here is my approach:

import urllib2

for i in l:
    print urllib2.urlopen(i).url

But when the list contains thousands of URLs, the program takes a long time.

My question: is there any way to reduce the execution time, or another approach I should follow?

  • Might be worth looking at dev.bitly.com (specifically dev.bitly.com/links.html#v3_expand, which allows 15 URLs to be expanded at a time). No doubt there are some Python bitly wrappers on pypi or code.google - but I'll leave you to search for those. Commented Aug 11, 2014 at 14:24
  • Do all of the URLs have a hostname of bit.ly? Commented Aug 11, 2014 at 14:31
  • @Robᵩ No, not all of the URLs are associated with bit.ly. Commented Aug 11, 2014 at 14:32
  • @JonClements But not all of the URLs are associated with bitly. Commented Aug 11, 2014 at 14:34
  • Well, use the bitly API for the ones that are... if there are other common shorteners, they'll probably have APIs that can be used as well... otherwise, you're stuck with your current approach of seeing where you end up after redirection. You may wish to consider multi-threading/processing to make multiple requests at the same time (a sketch follows this list). Commented Aug 11, 2014 at 14:37
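
Following up on the multi-threading suggestion in the last comment, here is a minimal sketch, assuming Python 3, the requests package, and a plain thread pool; the helper names expand and expand_all are illustrative and not taken from any answer below:

import requests
from concurrent.futures import ThreadPoolExecutor

def expand(url):
    # HEAD avoids downloading the body; allow_redirects=True follows the
    # whole redirect chain, so resp.url is the final destination.
    try:
        resp = requests.head("http://" + url, allow_redirects=True, timeout=10)
        return url, resp.url
    except requests.RequestException as exc:
        return url, str(exc)

def expand_all(short_urls, workers=20):
    # Resolve many short URLs concurrently with a simple thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(expand, short_urls))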

4 Answers


First method

As suggested in the comments, one way to accomplish the task would be to use the official bitly API, which does, however, have limitations (e.g., no more than 15 shortUrl parameters per request).
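
A minimal sketch of this first method, assuming the v3 expand endpoint linked in the comments (https://api-ssl.bitly.com/v3/expand) accepts an access_token plus up to 15 repeated shortUrl parameters; the token placeholder and the exact response fields should be checked against the current bitly documentation:

import requests

ACCESS_TOKEN = "YOUR_BITLY_ACCESS_TOKEN"  # placeholder, not a real token

def expand_batch(short_urls):
    # Assumption: /v3/expand takes up to 15 shortUrl parameters per request
    # and returns the results under data -> expand in the JSON body.
    resp = requests.get(
        "https://api-ssl.bitly.com/v3/expand",
        params={"access_token": ACCESS_TOKEN, "shortUrl": short_urls},
    )
    resp.raise_for_status()
    entries = resp.json()["data"]["expand"]
    return {e.get("short_url"): e.get("long_url") for e in entries}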

Second method

As an alternative, one could simply avoid fetching the contents, e.g. by using the HEAD HTTP method instead of GET. Here is some sample code that makes use of the excellent requests package:

import requests

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']

for i in l:
    print requests.head("http://"+i).headers['location']
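
If a short link goes through more than one hop, the Location header above only gives the next hop. As a small variation (not part of the original snippet), requests.head() can be asked to follow the whole chain and report the final URL:

for i in l:
    # allow_redirects=True follows the whole redirect chain; .url is the final URL
    print requests.head("http://"+i, allow_redirects=True).url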

2 Comments

As a bonus, requests.head() doesn't follow the redirect, so it saves at least one HTTP transaction.
Actually it doesn't return the final link; see my answer
from requests import get

def get_real_url_from_shortlink(url):
    # get() follows redirects by default, so resp.url holds the final URL
    resp = get(url)
    return resp.url



I'd try Twisted's asynchronous web client. Be careful with this, though: it doesn't rate-limit at all.

#!/usr/bin/python2.7

from twisted.internet import reactor
from twisted.internet.defer import Deferred, DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16
agent = Agent(reactor, pool)
locks = defaultdict(DeferredLock)  # one lock per (host, slot) pair to cap per-host concurrency
locations = {}  # maps each input URL to its Location header (or an error message)

def getLock(url, simultaneous = 1):
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host
    # Tweak this as desired, but make sure that it is no larger than
    # pool.maxPersistentPerHost.
    lock = getLock(url,4)
    yield lock.acquire()
    try:
        resp = yield agent.request('HEAD', url)
        locations[url] = resp.headers.getRawHeaders('location',[None])[0]
    except Exception as e:
        locations[url] = str(e)
    finally:
        lock.release()


dl = DeferredList(getMapping(url.strip()) for url in fileinput.input())
dl.addCallback(lambda _: reactor.stop())

reactor.run()
pprint(locations)
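
Since the script reads its input via fileinput.input(), it can be given a file name on the command line (e.g. python expand.py urls.txt, where expand.py is just an illustrative name) or fed a list of URLs on standard input.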



You can use the 'pyurlextract' library to extract the full link, and all redirections, from a shortened link.

pip install pyurlextract

You will find all the details here: https://pypi.org/project/pyurlextract/

from pyurlextract import extract_shorturl

short_url = "https://url.com/3Bg19uM"  # The actual short URL
full_link, all_links = extract_shorturl(short_url)

if full_link is None:
    print("Failed to expand the URL")
    print("Details:", all_links)
else:
    print("Original URL:", short_url)
    print("Full Link:", full_link)
    print("All Possible Redirections:", all_links)
