
I have a Python script that crawls various websites and downloads files from them. My problem is that some of the websites seem to be using PHP, at least that's my theory, since the URLs look like this: https://www.portablefreeware.com/download.php?dd=1159

The problem is that I can't get any file names or extensions from a link like this, and therefore can't save the file. Currently I'm only saving the URLs.

Is there any way to get to the actual file name behind the link?

This is my stripped-down download code:

import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=True)

file = open("name.something", 'wb')
file.write(r.content)
file.close()

Disclaimer: I've never done any work with PHP, so please forgive any incorrect terminology or understanding I have of that. I'm happy to learn more, though.

  • You can try checking the headers, for example: r.headers. Commented Jun 23, 2020 at 12:00
  • "Like stated in the question, I don't want to save the HTML, I want the actual file" I can't understand this. What does "the actual file behind the link" mean? If I go to a URL like the one in question, why should it correspond to "a file"? Why does the HTML content of the page, saved to disk, not qualify as "a file"? Commented Sep 5, 2022 at 14:45

5 Answers


To get the file name from the URL after redirects, check the Content-Disposition header of the response. Here's code that extracts the file name and extension from the response headers:

import requests
from urllib.parse import unquote
    
def get_filename_from_cd(cd):
    """
    Get filename from content disposition header
    """
    if not cd:
        return None
    fname = None
    for param in cd.split(';'):
        if 'filename' in param.lower():
            # Strip whitespace and any surrounding quotes from the value
            fname = param.split('=')[1].strip().strip('"')
            break
    return unquote(fname) if fname else None
    
url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=True)
    
# Get the file name from the headers
filename = get_filename_from_cd(r.headers.get('content-disposition'))
    
if filename:
    with open(filename, 'wb') as file:
        file.write(r.content)
else:
    print("Couldn't find the file name in the response headers.")

If that doesn't work, you can fall back to guessing the extension from the Content-Type header:

import requests
import mimetypes

response = requests.get('https://www.portablefreeware.com/download.php?dd=1159')
content = response.content
# Drop any parameters (e.g. "; charset=...") before guessing the extension
content_type = response.headers['Content-Type'].split(';')[0]
ext = mimetypes.guess_extension(content_type)  # None if the type is unknown

print(content)       # raw bytes of the download
print(ext)           # .zip
print(content_type)  # e.g. application/zip or application/octet-stream

# guess_extension already includes the leading dot, so don't add another
with open("newFile" + ext, 'wb') as f:
    f.write(content)
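
For completeness, here's a hedged sketch chaining the three strategies: Content-Disposition first, then the basename of the final URL, then a generic name with a guessed extension. The helper name pick_filename and the 'download.bin' fallback are my own choices, not part of the answer above:

import os
import mimetypes
import requests
from urllib.parse import unquote, urlparse

def pick_filename(r):
    # 1. Content-Disposition, if the server provides it
    cd = r.headers.get('content-disposition', '')
    for param in cd.split(';'):
        if 'filename' in param.lower():
            return unquote(param.split('=')[1].strip().strip('"'))
    # 2. Basename of the final URL after redirects
    name = os.path.basename(urlparse(r.url).path)
    if name:
        return name
    # 3. Generic name with an extension guessed from Content-Type
    ext = mimetypes.guess_extension(r.headers.get('Content-Type', '').split(';')[0])
    return 'download' + (ext or '.bin')

r = requests.get('https://www.portablefreeware.com/download.php?dd=1159', allow_redirects=True)
with open(pick_filename(r), 'wb') as f:
    f.write(r.content)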

With your use of the allow_redirects=True option, requests.get automatically follows the URL in the Location header of the response to make another request, so you end up with the headers of the final response instead, which is why you can't find the file name information anywhere.

You should instead use the allow_redirects=False option so that you can read the Location header, which contains the actual download URL:

import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])

This outputs:

https://www.diskinternals.com/download/Linux_Reader.exe

Demo: https://replit.com/@blhsing/TrivialLightheartedLists

You can then make another request to the download URL, and use os.path.basename to obtain the name of the file to which the content will be written:

import os

url = r.headers['Location']
r = requests.get(url)
with open(os.path.basename(url), 'wb') as file:
    file.write(r.content)
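
A related shortcut, not from this answer but a documented property of requests: after a redirected GET, r.url holds the final URL, so you can keep allow_redirects=True and still recover the name. A minimal sketch:

import os
import requests

r = requests.get('https://www.portablefreeware.com/download.php?dd=1159', allow_redirects=True)
filename = os.path.basename(r.url)  # r.url is the URL after all redirects
with open(filename, 'wb') as file:
    file.write(r.content)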


You're using requests for downloading. This doesn't work well with downloads of this kind.

Try urllib instead:

import urllib.request

# url is the download link; filepath is the name you choose to save it under
urllib.request.urlretrieve(url, filepath)
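
Note that urlretrieve still needs a file name from you. If you stay in the standard library, the response's headers object is an email.message.Message, whose get_filename() parses the Content-Disposition header. A minimal sketch; whether the server actually sends that header is an assumption, hence the fallback name:

import urllib.request

url = 'https://www.portablefreeware.com/download.php?dd=1159'
with urllib.request.urlopen(url) as response:
    # get_filename() returns the Content-Disposition name, or None if absent
    name = response.headers.get_filename() or 'download.bin'
    with open(name, 'wb') as f:
        f.write(response.read())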


You can download the file with the file name obtained from the response header.

Here's my code for a download with a progress bar and chunked writes:

  1. To display a progress bar, use tqdm (pip install tqdm).
  2. Writing the response in chunks saves memory during the download.
import os

import requests
import tqdm

url = "https://www.portablefreeware.com/download.php?dd=1159"
# requests.head does not follow redirects by default,
# so the redirect's Location header is still available here
response_header = requests.head(url)
file_path = response_header.headers["Location"]
file_name = os.path.basename(file_path)
with open(file_name, "wb") as file:
    response = requests.get(url, stream=True)
    total_length = int(response.headers.get("content-length", 0))
    for chunk in tqdm.tqdm(response.iter_content(chunk_size=1024), total=total_length / 1024, unit="KB"):
        if chunk:
            file.write(chunk)
            file.flush()

Progress output:

6%|▌         | 2848/46100.1640625 [00:04<01:11, 606.90KB/s]
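
If you prefer a byte-accurate bar (the fractional total above comes from dividing by 1024), tqdm can count bytes directly. A minimal variant of the loop above, assuming the same url and file_name and a fresh streamed response:

response = requests.get(url, stream=True)
total_length = int(response.headers.get("content-length", 0))
with open(file_name, "wb") as file, tqdm.tqdm(total=total_length, unit="B", unit_scale=True) as bar:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            file.write(chunk)
            bar.update(len(chunk))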


Redirects can be bounced via DNS across a distributed network anywhere. The example answers above show https://www, but in my case they resolve to Europe, so my fastest local source comes in as

https://eu.diskinternals.com/download/Linux_Reader.exe

By far the simplest approach is to try a raw curl first; if the result is good, there's no need to inspect or scrape anything. Without bothering to resolve anything:

curl -o 1159.tmp https://www.portablefreeware.com/download.php?dd=1159

However, I know that in this case that's not the expected result, so the next level is:

curl -I https://www.portablefreeware.com/download.php?dd=1159 |find "Location"

and that gives the result shown by others, https://www.diskinternals.com/download/Linux_Reader.exe, but that's still not the full picture, since if we feed that back with

curl.exe -K location.txt

we get

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://eu.diskinternals.com/download/Linux_Reader.exe">here</a>.</p>
</body></html>

hence the nested redirects to

https://eu.diskinternals.com/download/Linux_Reader.exe

All of that can be scripted from the command line to run in loops in a line or two, but I don't use Python, so you will need to write perhaps a dozen lines to do something similar (see the sketch just below).
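
Here is a hedged Python sketch of that "dozen lines": it walks the redirect chain manually with requests (allow_redirects=False) so every hop is visible, then saves from the final URL. The hop limit of 10 is an arbitrary safety choice of mine:

import os
import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
for _ in range(10):  # arbitrary safety limit on redirect depth
    r = requests.get(url, allow_redirects=False)
    if r.status_code in (301, 302, 303, 307, 308) and 'Location' in r.headers:
        url = r.headers['Location']  # assumes absolute Location URLs
        print('redirected to:', url)
    else:
        break

# r now holds the final (non-redirect) response
with open(os.path.basename(url), 'wb') as f:
    f.write(r.content)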

C:\Users\WDAGUtilityAccount\Desktop>curl -O https://eu.diskinternals.com/download/Linux_Reader.exe
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.9M  100 44.9M    0     0  3057k      0  0:00:15  0:00:15 --:--:-- 3640k

C:\Users\WDAGUtilityAccount\Desktop>dir /b lin*.*
Linux_Reader.exe

and from the help file, yesterday's extra update (Sunday, September 4, 2022):

curl -O https://eu.diskinternals.com/download/Uneraser_Setup.exe
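
As a footnote, and assuming a reasonably modern curl build: curl can do the whole dance in one go, following redirects with -L and taking the server-suggested file name from Content-Disposition with -J (which requires -O):

curl -OJL https://www.portablefreeware.com/download.php?dd=1159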

