How can I split a URL string up into separate parts in Python?

Question

I decided to learn Python tonight :) I know C pretty well (I wrote an OS in it), so I'm not new to programming, and everything in Python seems pretty easy so far. But I don't know how to solve this problem. Let's say I have this address:

http://example.com/random/folder/path.html

Now how can I create two strings from this, one containing the "base" name of the server, so in this example it would be

http://example.com/

and another containing the thing without the last filename, so in this example it would be

http://example.com/random/folder/

Of course I know I could just find the third slash and the last slash, respectively, but is there a better way?

Also it would be cool to have the trailing slash in both cases, but I don't care since it can be added easily. So is there a good, fast, effective solution for this? Or is there only "my" solution, finding the slashes?

Come back tomorrow and let us know how that's going for you. I suspect you'll just be writing C code in Python rather than real Python code :-).
– paxdiablo, Commented Jan 16, 2009 at 7:55
You can find a Python regex for a partial split (i.e. URL, scheme, domain, TLD, port and query path) here: stackoverflow.com/questions/9760588/…
– Paolo Rovelli, Commented Aug 11, 2015 at 21:19

Peter Mortensen · Accepted Answer · 2022-12-01 16:11:13Z

60

The urlparse module in Python 2.x (or urllib.parse in Python 3.x) would be the way to do it.

>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'
>>>

If you want to do more work on the path of the file under the URL, you can use the posixpath module:

>>> from posixpath import basename, dirname
>>> basename(parse_object.path)
'path.html'
>>> dirname(parse_object.path)
'/random/folder'

After that, you can use posixpath.join to glue the parts together.

Note: Don't use os.path for this. On Windows, os.path uses the backslash as its path separator, whereas URL paths always use '/'. The posixpath module documentation has a special reference to URL manipulation, so all's good.
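Putting the pieces together, here is a minimal sketch that builds the two strings asked for in the question. The helper name split_url is made up for this illustration, and the sketch assumes an absolute URL with a scheme and a host:

```python
from posixpath import dirname
from urllib.parse import urlparse


def split_url(url):
    """Return (base, folder) for an absolute URL, both with a trailing slash."""
    parts = urlparse(url)
    # Scheme plus host, e.g. 'http://example.com/'
    base = f"{parts.scheme}://{parts.netloc}/"
    # Path without the file name, e.g. '/random/folder'
    directory = dirname(parts.path)
    # Strip the slashes so the pieces join without doubling them up
    folder = base + directory.strip("/")
    if not folder.endswith("/"):
        folder += "/"
    return base, folder


base, folder = split_url("http://example.com/random/folder/path.html")
print(base)    # http://example.com/
print(folder)  # http://example.com/random/folder/
```

Any query string or fragment in the URL is simply dropped by this sketch, which may or may not be what you want.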

edited Dec 1, 2022 at 16:11

Peter Mortensen

answered Jan 16, 2009 at 8:14

sykloid

5 Comments

bobince Over a year ago

+1 on urlparse, but don't use os.path to manipulate the .path part. os.path's handling necessarily differs from OS to OS, whereas URIs always use '/' as the path part separator.

nosklo Over a year ago

yeah, remove the os.path part. Maybe use the posixpath module instead. Then you'll have my vote.

sykloid Over a year ago

argh, missed that one completely. It's been ages since I used windows :|. Fixed.

patrick Over a year ago

Just for ease of reference, here is the procedure for Py 2 : import urlparse; parse_object = urlparse.urlparse(url)

J. Gwinner Over a year ago

"windows users will choke on ..." I like to think of it as Linux users will choke on the path specifier that was around before it was :)

Peter Mortensen · Accepted Answer · 2022-11-28 02:24:33Z

12

If this is the extent of your URL parsing, Python's built-in str.rpartition method will do the job:

>>> URL = "http://example.com/random/folder/path.html"
>>> Segments = URL.rpartition('/')
>>> Segments[0]
'http://example.com/random/folder'
>>> Segments[2]
'path.html'

From Pydoc, str.rpartition:

Splits the string at the last occurrence of sep, and returns a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself.

What this means is that rpartition does the searching for you, and splits the string at the last (rightmost) occurrence of the separator you specify (in this case '/'). It returns a tuple containing:

(everything to the left of the separator, the separator itself, everything to the right of the separator)
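A short sketch of that behaviour, including the two edge cases worth knowing about. Note that sep is a plain string, not a regular expression; the variable names here are only illustrative:

```python
url = "http://example.com/random/folder/path.html"

# rpartition returns (head, separator, tail), split at the last '/'
head, sep, tail = url.rpartition("/")
folder = head + sep  # re-attach the separator to keep the trailing slash

print(folder)  # http://example.com/random/folder/
print(tail)    # path.html

# When the separator is absent, the first two items are empty strings
print("path.html".rpartition("/"))  # ('', '', 'path.html')

# Caveat: a URL without a path gets split inside the '//' after the scheme
print("http://example.com".rpartition("/"))  # ('http:/', '/', 'example.com')
```

The last case is why this approach is only safe when you know the URL has a path component.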

edited Nov 28, 2022 at 2:24

Peter Mortensen

answered Jan 16, 2009 at 8:11

Mike Hamer

3 Comments

NeilG Over a year ago

I know everyone says (with good reasons) to use a library like urllib for things like this but I have an emotional preference for this approach in cases like this. As far as I can see this will be fully reliable. There's no fancy parsing required for this limited case. As long as the incoming URL is well formed it should be guaranteed that the path starts at the third slash character. I am drawn to this because it looks simpler and faster without the library import, but I'm ready to be corrected if I'm wrong.

Peter Mortensen Over a year ago

The Pydoc link is broken (404).

Peter Mortensen Over a year ago

What is 'sep'? A simple (fixed) string? A regular expression?

Stephen Rauch · Accepted Answer · 2018-06-24 17:24:29Z

10

I have no experience with Python, but I found the urlparse module, which should do the job.

edited Jun 24, 2018 at 17:24

Stephen Rauch♦

answered Jan 16, 2009 at 7:49

Sebastian Dietz

Comments

Peter Mortensen · Accepted Answer · 2022-12-01 16:01:01Z

In Python, a lot of operations are done using lists. The urlparse module mentioned by Sebastian Dietz may well solve your specific problem, but if you're generally interested in Pythonic ways to find slashes in strings, for example, try something like this:

url = 'http://example.com/random/folder/path.html'

# Create a list of each bit between slashes
slashparts = url.split('/')

# Now join back the first three sections 'http:', '' and 'example.com'
basename = '/'.join(slashparts[:3]) + '/'

# All except the last one
dirname = '/'.join(slashparts[:-1]) + '/'

# print() with a single argument works in both Python 2 and Python 3
print('slashparts = %s' % slashparts)
print('basename = %s' % basename)
print('dirname = %s' % dirname)

The output of this program is this:

slashparts = ['http:', '', 'example.com', 'random', 'folder', 'path.html']
basename = http://example.com/
dirname = http://example.com/random/folder/

The interesting bits are split, join, the slice notation array[A:B] (including negatives for offsets-from-the-end) and, as a bonus, the % operator on strings to give printf-style formatting.
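To make the slice notation and the % operator a little more concrete, here is a small sketch on the same URL. It is only an illustration of the indexing rules, not a robust URL parser:

```python
url = 'http://example.com/random/folder/path.html'
slashparts = url.split('/')

# [:3] takes the first three items; the empty string comes from the '//'
print(slashparts[:3])    # ['http:', '', 'example.com']

# Negative indices count from the end of the list
print(slashparts[-1])    # path.html

# [3:-1] is everything after the host and before the file name
middle = slashparts[3:-1]
print(middle)            # ['random', 'folder']

# % accepts a tuple when the format string has several placeholders
summary = '%d parts, last is %s' % (len(slashparts), slashparts[-1])
print(summary)           # 6 parts, last is path.html
```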

Peter Mortensen · Accepted Answer · 2022-12-01 16:07:41Z

2

It seems like the posixpath module mentioned in sykora's answer is not available in my Python setup (Python 2.7.3).

As per this article, it seems that the "proper" way to do this would be using...

urlparse.urlparse and urlparse.urlunparse can be used to detach and reattach the base of the URL
The functions of os.path can be used to manipulate the path
urllib.url2pathname and urllib.pathname2url (to make path name manipulation portable, so it can work on Windows and the like)

So for example (not including reattaching the base URL)...

>>> import urlparse, urllib, os.path
>>> os.path.dirname(urllib.url2pathname(urlparse.urlparse("http://example.com/random/folder/path.html").path))
'/random/folder'
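For reference, a rough Python 3 sketch of the same idea that also reattaches the base URL. It uses urllib.parse (the Python 3 home of urlparse and urlunparse) and swaps in posixpath for the os.path plus url2pathname combination above, so the result does not depend on the operating system; that substitution is a choice made for this sketch:

```python
import posixpath
from urllib.parse import urlparse, urlunparse

url = "http://example.com/random/folder/path.html"
parts = urlparse(url)

# ParseResult is a named tuple, so _replace returns a modified copy.
# rstrip avoids a double slash when the file sits directly under the root.
folder_path = posixpath.dirname(parts.path).rstrip("/") + "/"
folder_url = urlunparse(
    parts._replace(path=folder_path, params="", query="", fragment="")
)

# The base URL is the same thing with the path reduced to '/'
base_url = urlunparse(
    parts._replace(path="/", params="", query="", fragment="")
)

print(folder_url)  # http://example.com/random/folder/
print(base_url)    # http://example.com/
```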

edited Dec 1, 2022 at 16:07

Peter Mortensen

answered Feb 6, 2013 at 5:35

Abbafei

1 Comment

Peter Mortensen Over a year ago

The doughellmann.com link is broken: "404. Page Not Found"

Peter Mortensen · Accepted Answer · 2022-12-01 16:16:45Z

1

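You can use the third-party furl library (it is not part of the standard library, so it has to be installed first):

import furl

f = furl.furl("http://example.com/random/folder/path.html")
print(str(f.path))  # '/random/folder/path.html'
print(str(f.path).split("/"))  # ['', 'random', 'folder', 'path.html']

To access the word after the first "/", index into the list (element 0 is the empty string before the leading slash):

str(f.path).split("/")[1]  # 'random'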

edited Dec 1, 2022 at 16:16

Peter Mortensen

answered Dec 2, 2016 at 15:58

Mayank Jaiswal

1 Comment

Peter Mortensen Over a year ago

Re "Python's library": But not part of the batteries(?)
