1

I'm using Django validators and python-magic to check the MIME type of uploaded documents and accept only PDF, ZIP and RAR files.

Accepted MIME types are: 'application/pdf', 'application/zip', 'multipart/x-zip', 'application/x-zip-compressed', 'application/x-compressed', 'application/rar', 'application/x-rar', 'application/x-rar-compressed', 'compressed/rar'.

The problem is that sometimes PDF files seem to have 'application/octet-stream' as their MIME type. 'application/octet-stream' means a generic binary file, so I can't simply add it to the list of accepted types: other files, such as Excel files, would then be accepted too, and I don't want that to happen.

What can I do in this case?

Thanks in advance.

3 Answers

3

As a follow-up to Liyosi's answer, I also used python-magic. There seems to be a bug in libmagic where it still incorrectly identifies some files as application/octet-stream. It is described better in the code:

    def _handle509Bug(self, e):
        # libmagic 5.09 has a bug where it might fail to identify the
        # mimetype of a file and returns null from magic_file (and
        # likely _buffer), but also does not return an error message.
        if e.message is None and (self.flags & MAGIC_MIME):
            return "application/octet-stream"
        else:
            raise e

To get around this issue, I had to instantiate a Magic object and make use of its uncompress and mime arguments. To complete Liyosi's example:

import magic
from django.core.exceptions import ValidationError

def validate_file_type(upload):
    allowed_filetypes = [
        'application/pdf', 'image/jpeg', 'image/jpg', 'image/png',
        'application/msword']
    # uncompress=True lets libmagic look inside compressed data;
    # mime=True makes it return a MIME type instead of a description.
    validator = magic.Magic(uncompress=True, mime=True)
    # mime is already set on the instance, so the method takes only the buffer.
    file_type = validator.from_buffer(upload.read())
    if file_type not in allowed_filetypes:
        raise ValidationError('Unsupported file')

2

The most foolproof way to tell is to look into the file contents and read the metadata in the file header.

In most files, this header is stored at the beginning of the file, though in some it may be located elsewhere.

python-magic helps you do this, but the trick is to always reset the read pointer to the beginning of the file before trying to guess its MIME type. Otherwise you will sometimes get application/octet-stream, because the pointer has advanced past the file header to a location that contains only an arbitrary stream of bytes.

For example, if you have a Django validator function that validates the MIME type of uploaded files:

import magic
from django.core.exceptions import ValidationError

def validate_file_type(upload):
    allowed_filetypes = [
        'application/pdf', 'image/jpeg', 'image/jpg', 'image/png',
        'application/msword']
    # Rewind first: an earlier read may have moved the pointer past the header.
    upload.seek(0)
    # The first 1024 bytes are usually enough for libmagic to see the header.
    file_type = magic.from_buffer(upload.read(1024), mime=True)
    if file_type not in allowed_filetypes:
        raise ValidationError(
            'Unsupported file')
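
The effect of the read pointer can be seen without Django or libmagic. The following is a minimal sketch in which an in-memory io.BytesIO stands in for the uploaded file, and the hypothetical helper first_bytes plays the role of the buffer that would be handed to magic.from_buffer. Neither the helper nor the fake file content comes from python-magic or Django.

```python
import io

# Hypothetical stand-in for an uploaded PDF: a header followed by filler bytes.
fake_pdf = b"%PDF-1.4\n" + b"\x00" * 4096


def first_bytes(upload, size=1024, rewind=True):
    """Return the bytes a detector would be given, optionally rewinding first."""
    if rewind:
        upload.seek(0)
    return upload.read(size)


upload = io.BytesIO(fake_pdf)
upload.read(2048)  # something earlier (e.g. another validator) consumed part of the file

without_rewind = first_bytes(upload, rewind=False)
with_rewind = first_bytes(upload, rewind=True)

print(without_rewind.startswith(b"%PDF-"))  # False: the header was already skipped
print(with_rewind.startswith(b"%PDF-"))     # True: seek(0) brought the header back
```

Without the rewind, the detector only sees filler bytes, which is the situation in which a generic type such as application/octet-stream would be expected.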

2 Comments

This still gives application/octet-stream with certain files.
Thank you. Using upload.seek(0) did the trick. Many thanks.
0

You should not rely on the MIME type provided, but rather the MIME type discovered from the first few bytes of the file itself.

This will help eliminate the generic MIME type issue.

The problem with this approach is that it usually relies on a third-party tool. For example, the file command commonly found on Linux systems is great: use it with -b --mime - and pass in the first few bytes of your file, and it will give you the MIME type.
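
When pulling in an external tool is undesirable, the three formats from the question have well-known leading signatures that can be compared directly. A minimal sketch, assuming the start of the upload is available as bytes; the function name sniff_type and the labels it returns are hypothetical, and a matching signature only shows how the file starts, not that the rest of it is valid.

```python
# Leading byte signatures of the formats the question wants to accept.
SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),              # regular ZIP archive
    (b"PK\x05\x06", "zip"),              # empty ZIP archive
    (b"Rar!\x1a\x07\x00", "rar"),        # RAR 1.5 to 4.x
    (b"Rar!\x1a\x07\x01\x00", "rar"),    # RAR 5
)


def sniff_type(head: bytes):
    """Return 'pdf', 'zip' or 'rar' if head starts with a known signature, else None."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    return None


print(sniff_type(b"%PDF-1.7\n..."))                      # pdf
print(sniff_type(b"PK\x03\x04rest of archive"))          # zip
print(sniff_type(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"))   # None (legacy Excel/OLE2 header)
```

Two caveats apply to this sketch. Modern .xlsx files are ZIP containers, so they would pass as zip here and need a further check if they must be rejected. Some PDFs may also carry a few bytes before the %PDF- marker, in which case searching the first 1024 bytes instead of using startswith would be a more lenient variant.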

The other option you have is to accept the file, and try to validate it by opening it with a library.

So if pypdf cannot open the file, the built-in zipfile module cannot open it, and rarfile cannot open it, it's most likely something that you don't want to accept.
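
A rough sketch of this "try to open it" approach, assuming the upload has been read into bytes. The helper names opens_as_zip and opens_as_pdf are hypothetical. The ZIP check uses only the standard library; the PDF check assumes the third-party pypdf package is installed and imports it lazily, so the rest of the example runs without it. A rarfile check would follow the same pattern and is left out here.

```python
import io
import zipfile


def opens_as_zip(data: bytes) -> bool:
    """True if the standard-library zipfile module can read the archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns the first corrupt member name, or None if all are fine.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


def opens_as_pdf(data: bytes) -> bool:
    """True if pypdf can parse the document (needs the third-party pypdf package)."""
    from pypdf import PdfReader
    from pypdf.errors import PdfReadError

    try:
        reader = PdfReader(io.BytesIO(data))
        return len(reader.pages) > 0
    except PdfReadError:
        return False


# Build a small valid archive in memory to exercise the ZIP check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello")
valid_zip = buffer.getvalue()

print(opens_as_zip(valid_zip))          # True
print(opens_as_zip(b"not an archive"))  # False
```

Note that testzip() reads every member of the archive, which may be expensive for large uploads, and that badly damaged PDFs might raise exceptions other than PdfReadError.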

2 Comments

I don't rely on the MIME type provided. I already read the first few bytes to discover it with python-magic: mime = magic.from_buffer(value.read(1024), mime=True). But this method also sometimes gives "application/octet-stream".
Go with the second option.
