I'm writing a small Java program that downloads blacklists from the Internet.
The URLs can be of two types:
1) A direct link, e.g. http://www.shallalist.de/Downloads/shallalist.tar.gz
No problem here: we can use a library such as org.apache.commons.io.FilenameUtils, or simply look for the last occurrence of "/" and ".".
2) A "friendly URL", which looks something like: http://urlblacklist.com/cgi-bin/commercialdownload.pl?type=download&file=bigblacklist
Here the URL contains no explicit filename or extension, but if I use my browser or Internet Download Manager (IDM), the file is saved as "bigblacklist.tar.gz".
How can I solve this in Java and get the filename and extension from "friendly" URLs?
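
For reference, the "last / and ." approach for direct links could be sketched roughly as below. The class and method names (`UrlFileNames`, `fileNameOf`, `extensionOf`) are made up for illustration, and treating everything after the first dot as the extension (so that "tar.gz" stays together) is an assumption, not what FilenameUtils does. Note that for the friendly URL it only yields "commercialdownload.pl", which is exactly the problem.

```java
import java.net.URI;

class UrlFileNames {
    /** Returns the last path segment of the URL ("" if the path ends with "/"). */
    static String fileNameOf(String url) {
        // getPath() excludes the query string, so "?type=download&..." is ignored
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Assumption: the extension is everything after the FIRST dot, e.g. "tar.gz". */
    static String extensionOf(String fileName) {
        int dot = fileName.indexOf('.');
        return dot <= 0 ? "" : fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        String name = fileNameOf("http://www.shallalist.de/Downloads/shallalist.tar.gz");
        System.out.println(name + " / " + extensionOf(name)); // shallalist.tar.gz / tar.gz
    }
}
```
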

P.S. I know about the Content-Disposition and Content-Type fields, but the response headers for the urlblacklist link are:

Transfer-Encoding : [chunked]
Keep-Alive : [timeout=5, max=100]
null : [HTTP/1.1 200 OK]
Server : [Apache/2.4.10 (Debian)]
Connection : [Keep-Alive]
Date : [Sat, 05 Sep 2015 23:51:35 GMT]
Content-Type : [ application/octet-stream]

As we can see, nothing here points to gzip (.gz). How can I deal with this in Java?
And how do web browsers and download managers recognize the correct name and extension?

===============UPDATE=====================
Thanks to @eugenioy, the problem is solved. The real trouble was that my IP had been blocked after multiple download attempts, which is why I decided to use proxies. The code now looks like this (for both types of URL):

// uses java.net.HttpURLConnection, java.net.InetSocketAddress, java.net.Proxy, java.net.URL
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyIP, port));
HttpURLConnection httpConn = (HttpURLConnection) new URL(downloadFrom).openConnection(proxy);
String disposition = httpConn.getHeaderField("Content-Disposition");
String fullFileName = null;
if (disposition != null) {
    // extracts the file name from the header field, dropping optional quotes
    int index = disposition.indexOf("filename=");
    if (index >= 0) {
        fullFileName = disposition.substring(index + "filename=".length()).replace("\"", "").trim();
    }
}
if (fullFileName == null) {
    // no usable header: extracts the file name from the URL instead
    fullFileName = downloadFrom.substring(downloadFrom.lastIndexOf("/") + 1);
}

Now fullFileName contains the name of the file to download + its extension.

1 Answer

Take a look at the output from curl:

curl -s -D - 'http://urlblacklist.com/cgi-bin/commercialdownload.pl?type=download&file=bigblacklist' -o /dev/null

You will see this response:

HTTP/1.1 200 OK
Date: Sun, 06 Sep 2015 00:55:51 GMT
Server: Apache/2.4.10 (Debian)
Content-disposition: attachement; filename=bigblacklist.tar.gz
Content-length: 22840787
Content-Type: application/octet-stream

I guess that's how browsers get the filename and extension:

Content-disposition: attachement; filename=bigblacklist.tar.gz

Or to do it from Java:

    URL obj = new URL("http://urlblacklist.com/cgi-bin/commercialdownload.pl?type=download&file=bigblacklist");
    URLConnection conn = obj.openConnection();
    String disposition = conn.getHeaderField("Content-disposition");
    System.out.println(disposition);
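
To go from that header value to the bare filename, something like the following sketch might help. The class and method names (`ContentDispositionNames`, `fileNameFrom`) are hypothetical, and it only covers the plain `filename=value` and `filename="value"` forms; the extended `filename*=` form is deliberately not handled. It ignores the disposition type, so the server's "attachement" spelling does not matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentDispositionNames {
    // Matches filename=value or filename="value" (case-insensitive).
    // Assumption: the extended filename*= form is out of scope for this sketch.
    private static final Pattern FILENAME = Pattern.compile(
            "filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the filename parameter, or null if the header has none. */
    static String fileNameFrom(String disposition) {
        if (disposition == null) {
            return null;
        }
        Matcher m = FILENAME.matcher(disposition);
        if (!m.find()) {
            return null;
        }
        // group 1 is the quoted form, group 2 the unquoted one
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        // header value as returned by the server above
        System.out.println(fileNameFrom("attachement; filename=bigblacklist.tar.gz"));
    }
}
```

When `fileNameFrom` returns null, falling back to the last path segment of the URL is one reasonable option.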

NOTE: The server seems to block your IP after several attempts, so make sure to try this from a "clean" IP if you have already tried many times today.

1 Comment

Thank you for your reply! The actual problem was IP blocking. That's why I decided to use proxies, and now it works for me!