
I have a Python script that crawls various websites and downloads files from them. My problem is that some of the websites seem to be using PHP, at least that's my theory, since the URLs look like this: https://www.portablefreeware.com/download.php?dd=1159

The problem is that I can't get any file names or extensions from a link like this, and therefore can't save the file. Currently I'm only saving the URLs.

Is there any way to get to the actual file name behind the link?

This is my stripped-down download code:

import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=True)

file = open("name.something", 'wb')
file.write(r.content)
file.close()

Disclaimer: I've never done any work with PHP, so please forgive any incorrect terminology or understanding I have of that. I'm happy to learn more, though.

  • You can try checking the headers, for example: r.headers. Commented Jun 23, 2020 at 12:00
  • "Like stated in the question, I don't want to save the HTML, I want the actual file" I can't understand this. What does "the actual file behind the link" mean? If I go to a URL like the one in question, why should it correspond to "a file"? Why does the HTML content of the page, saved to disk, not qualify as "a file"? Commented Sep 5, 2022 at 14:45

5 Answers


To get the file name from the URL after redirects, check the Content-Disposition header of the response. Here's code that extracts the file name and extension from the response headers:

import requests
from urllib.parse import unquote
    
def get_filename_from_cd(cd):
    """
    Get filename from content disposition header
    """
    if not cd:
        return None
    fname = None
    for param in cd.split(';'):
        if 'filename' in param.lower():
            # Strip whitespace and any surrounding quotes from the value
            fname = param.split('=')[1].strip().strip('"')
            break
    return unquote(fname) if fname else None
    
url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=True)
    
# Get the file name from the headers
filename = get_filename_from_cd(r.headers.get('content-disposition'))
    
if filename:
    with open(filename, 'wb') as file:
        file.write(r.content)
else:
    print("Couldn't find the file name in the response headers.")

If that doesn't work, you can fall back to guessing the extension from the Content-Type header:

import requests
import mimetypes

response = requests.get('https://www.portablefreeware.com/download.php?dd=1159')
content = response.content
# Drop any parameters (e.g. "; charset=...") before guessing the extension
content_type = response.headers['Content-Type'].split(';')[0]
ext = mimetypes.guess_extension(content_type)  # None if the type is unknown

print(content)       # raw bytes of the download
print(ext)           # .zip
print(content_type)  # e.g. application/zip or application/octet-stream

# guess_extension already includes the leading dot, so don't add another
with open("newFile" + ext, 'wb') as f:
    f.write(content)
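
For completeness, here's a hedged sketch chaining the three strategies: Content-Disposition first, then the basename of the final URL, then a generic name with a guessed extension. The helper name pick_filename and the 'download.bin' fallback are my own choices, not part of the answer above:

import os
import mimetypes
import requests
from urllib.parse import unquote, urlparse

def pick_filename(r):
    # 1. Content-Disposition, if the server provides it
    cd = r.headers.get('content-disposition', '')
    for param in cd.split(';'):
        if 'filename' in param.lower():
            return unquote(param.split('=')[1].strip().strip('"'))
    # 2. Basename of the final URL after redirects
    name = os.path.basename(urlparse(r.url).path)
    if name:
        return name
    # 3. Generic name with an extension guessed from Content-Type
    ext = mimetypes.guess_extension(r.headers.get('Content-Type', '').split(';')[0])
    return 'download' + (ext or '.bin')

r = requests.get('https://www.portablefreeware.com/download.php?dd=1159', allow_redirects=True)
with open(pick_filename(r), 'wb') as f:
    f.write(r.content)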

With your use of the allow_redirects=True option, requests.get automatically follows the URL in the Location header of the response to make another request, so you end up with the headers of the final response instead, which is why you can't find the file name information anywhere.

You should instead use the allow_redirects=False option so that you can read the Location header, which contains the actual download URL:

import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])

This outputs:

https://www.diskinternals.com/download/Linux_Reader.exe

Demo: https://replit.com/@blhsing/TrivialLightheartedLists

You can then make another request to the download URL, and use os.path.basename to obtain the name of the file to which the content will be written:

import os

url = r.headers['Location']
r = requests.get(url)
with open(os.path.basename(url), 'wb') as file:
    file.write(r.content)
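
A related shortcut, not from this answer but a documented property of requests: after a redirected GET, r.url holds the final URL, so you can keep allow_redirects=True and still recover the name. A minimal sketch:

import os
import requests

r = requests.get('https://www.portablefreeware.com/download.php?dd=1159', allow_redirects=True)
filename = os.path.basename(r.url)  # r.url is the URL after all redirects
with open(filename, 'wb') as file:
    file.write(r.content)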


You're using requests for downloading. This doesn't work well with downloads of this kind.

Try urllib instead:

import urllib.request

# url is the download link; filepath is the name you choose to save it under
urllib.request.urlretrieve(url, filepath)
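
Note that urlretrieve still needs a file name from you. If you stay in the standard library, the response's headers object is an email.message.Message, whose get_filename() parses the Content-Disposition header. A minimal sketch; whether the server actually sends that header is an assumption, hence the fallback name:

import urllib.request

url = 'https://www.portablefreeware.com/download.php?dd=1159'
with urllib.request.urlopen(url) as response:
    # get_filename() returns the Content-Disposition name, or None if absent
    name = response.headers.get_filename() or 'download.bin'
    with open(name, 'wb') as f:
        f.write(response.read())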


You can download the file with the file name obtained from the response header.

Here's my code for a download with a progress bar and chunked writes:

  1. To display a progress bar, use tqdm (pip install tqdm).
  2. Writing the response in chunks saves memory during the download.
import os

import requests
import tqdm

url = "https://www.portablefreeware.com/download.php?dd=1159"
# requests.head does not follow redirects by default,
# so the redirect's Location header is still available here
response_header = requests.head(url)
file_path = response_header.headers["Location"]
file_name = os.path.basename(file_path)
with open(file_name, "wb") as file:
    response = requests.get(url, stream=True)
    total_length = int(response.headers.get("content-length", 0))
    for chunk in tqdm.tqdm(response.iter_content(chunk_size=1024), total=total_length / 1024, unit="KB"):
        if chunk:
            file.write(chunk)
            file.flush()

Progress output:

6%|▌         | 2848/46100.1640625 [00:04<01:11, 606.90KB/s]
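
If you prefer a byte-accurate bar (the fractional total above comes from dividing by 1024), tqdm can count bytes directly. A minimal variant of the loop above, assuming the same url and file_name and a fresh streamed response:

response = requests.get(url, stream=True)
total_length = int(response.headers.get("content-length", 0))
with open(file_name, "wb") as file, tqdm.tqdm(total=total_length, unit="B", unit_scale=True) as bar:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            file.write(chunk)
            bar.update(len(chunk))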


Redirects can be bounced via DNS across a distributed network anywhere. The example answers above show https://www, but in my case they resolve to Europe, so my fastest local source comes in as

https://eu.diskinternals.com/download/Linux_Reader.exe

By far the simplest approach is to try a raw curl first; if the result is good, there's no need to inspect or scrape anything. Without bothering to resolve anything:

curl -o 1159.tmp https://www.portablefreeware.com/download.php?dd=1159

However, I know that in this case that's not the expected result, so the next level is:

curl -I https://www.portablefreeware.com/download.php?dd=1159 |find "Location"

and that gives the result shown by others, https://www.diskinternals.com/download/Linux_Reader.exe, but that's still not the full picture, since if we feed that back with

curl.exe -K location.txt

we get

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://eu.diskinternals.com/download/Linux_Reader.exe">here</a>.</p>
</body></html>

hence the nested redirects to

https://eu.diskinternals.com/download/Linux_Reader.exe

All of that can be scripted from the command line to run in loops in a line or two, but I don't use Python, so you will need to write perhaps a dozen lines to do something similar (see the sketch just below).
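
Here is a hedged Python sketch of that "dozen lines": it walks the redirect chain manually with requests (allow_redirects=False) so every hop is visible, then saves from the final URL. The hop limit of 10 is an arbitrary safety choice of mine:

import os
import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
for _ in range(10):  # arbitrary safety limit on redirect depth
    r = requests.get(url, allow_redirects=False)
    if r.status_code in (301, 302, 303, 307, 308) and 'Location' in r.headers:
        url = r.headers['Location']  # assumes absolute Location URLs
        print('redirected to:', url)
    else:
        break

# r now holds the final (non-redirect) response
with open(os.path.basename(url), 'wb') as f:
    f.write(r.content)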

C:\Users\WDAGUtilityAccount\Desktop>curl -O https://eu.diskinternals.com/download/Linux_Reader.exe
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.9M  100 44.9M    0     0  3057k      0  0:00:15  0:00:15 --:--:-- 3640k

C:\Users\WDAGUtilityAccount\Desktop>dir /b lin*.*
Linux_Reader.exe

and from the help file, yesterday's extra update (Sunday, September 4, 2022):

curl -O https://eu.diskinternals.com/download/Uneraser_Setup.exe
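
As a footnote, and assuming a reasonably modern curl build: curl can do the whole dance in one go, following redirects with -L and taking the server-suggested file name from Content-Disposition with -J (which requires -O):

curl -OJL https://www.portablefreeware.com/download.php?dd=1159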

