18

I need to identify which files in a directory are binary and which are text.

I tried using mimetypes, but it isn't a good fit in my case because it can't identify every file's MIME type, and I have some strange ones here... I just need to know: binary or text. Simple? But I couldn't find a solution...

Thanks

5
  • 2
    What is a text file for you? Does UTF-16-BE encoded Unicode count, for example? Commented Sep 18, 2009 at 20:04
  • 3
    You need to define precisely what is meant by 'binary' and 'text' before anyone can help you. Commented Sep 18, 2009 at 20:07
  • A text file is any file that is readable by humans. Say, any file that you can read with the "cat" (Linux) or "type" (Windows) command. Commented Sep 19, 2009 at 14:07
  • This similar question has a few good answers, stackoverflow.com/questions/898669/… file(1) is pretty reliable, so you could go with the pure-python solution that is based on file(1) behaviour; or you could trust the mimetypes module. Commented Mar 14, 2013 at 2:52
  • Use this library: pypi.python.org/pypi/binaryornot It is very simple and based on code found in this stackoverflow question. Commented Nov 7, 2014 at 9:10

4 Answers

11

Thanks everybody, I found a solution that suited my problem. I found this code at http://code.activestate.com/recipes/173220/ and I changed just a little piece to suit me.

It works fine.

from __future__ import division
import string 

def istext(filename):
    # Read in binary mode so Windows newline translation cannot alter the bytes
    with open(filename, "rb") as f:
        s = f.read(512)
    text_characters = "".join(map(chr, range(32, 127)) + list("\n\r\t\b"))
    _null_trans = string.maketrans("", "")
    if not s:
        # Empty files are considered text
        return True
    if "\0" in s:
        # Files with null bytes are likely binary
        return False
    # Get the non-text characters (maps a character to itself then
    # use the 'remove' option to get rid of the text characters.)
    t = s.translate(_null_trans, text_characters)
    # If more than 30% non-text characters, then
    # this is considered a binary file
    if float(len(t))/float(len(s)) > 0.30:
        return False
    return True
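
The code above targets Python 2 (`string.maketrans` and the `map(...) + list(...)` expression do not work on Python 3). Here is a rough Python 3 sketch of the same heuristic, working on bytes. The names `looks_like_text` and `is_text_file`, and the `blocksize` and `threshold` parameters, are my own and not part of the original recipe.

```python
# Printable ASCII plus the same control characters the recipe allows.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"


def looks_like_text(data, threshold=0.30):
    """Guess whether a chunk of bytes is text, using the recipe's rules."""
    if not data:
        # Empty input is considered text
        return True
    if b"\0" in data:
        # Null bytes strongly suggest binary content
        return False
    # bytes.translate(None, delete) drops every byte listed in `delete`,
    # leaving only the non-text bytes.
    non_text = data.translate(None, TEXT_BYTES)
    # On Python 3, / is always true division, so no float() is needed
    return len(non_text) / len(data) <= threshold


def is_text_file(filename, blocksize=512):
    """Apply the heuristic to the first `blocksize` bytes of a file."""
    with open(filename, "rb") as f:
        return looks_like_text(f.read(blocksize))
```

Keeping the byte-level check separate from the file reading makes it easy to try on literals, e.g. `looks_like_text(b"hello\n")`. Note that text in UTF-16 contains null bytes, so this heuristic will report it as binary.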

7 Comments

A little correction for your code: if float(len(t))/float(len(s)) > 0.30: return 0. Otherwise, Python will use integer division, and the comparison will only be true when len(t) == len(s)
Thomas, please apply that "float" correction to the answer! Activestate should fix their recipe, too! ;) but I can't be bothered signing up to bump the comments there.
@cedriv-julien, @sam-watkins, I think it's fine without the use of float, because of the from __future__ import division line, isn't it?
TypeError: unsupported operand type(s) for +: 'map' and 'list'
This code is not valid for python 3
8

It's inherently not simple. There's no way of knowing for sure, although you can take a reasonably good guess in most cases.

Things you might like to do:

  • Look for known magic numbers in binary signatures
  • Look for the Unicode byte-order-mark at the start of the file
  • If the file is regularly 00 xx 00 xx 00 xx (for arbitrary xx) or vice versa, that's quite possibly UTF-16
  • Otherwise, look for 0 bytes in the file; a file containing a 0 byte is unlikely to be a single-byte-encoding text file.

But it's all heuristic - it's quite possible to have a file which is a valid text file and a valid image file, for example. It would probably be nonsense as a text file, but legitimate in some encoding or other...
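
As a rough illustration, the checks listed above could be chained like this. It is only a sketch: `guess_kind` is a made-up name, the `MAGIC` tuple holds just a handful of well-known signatures (real tools know far more), and the return strings are arbitrary labels.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

# Illustrative signatures only: PNG, PDF, ZIP, ELF, GIF.
MAGIC = (b"\x89PNG\r\n\x1a\n", b"%PDF-", b"PK\x03\x04", b"\x7fELF", b"GIF8")


def guess_kind(data):
    """Return 'binary', 'text', or 'text (<encoding guess>)' for a byte prefix."""
    if any(data.startswith(sig) for sig in MAGIC):
        return "binary"
    for bom, name in BOMS:
        if data.startswith(bom):
            return "text (%s)" % name
    if len(data) >= 4:
        even, odd = data[0::2], data[1::2]
        # 00 xx 00 xx ...: every even byte is zero, no odd byte is
        if even.count(0) == len(even) and odd.count(0) == 0:
            return "text (utf-16-be?)"
        # xx 00 xx 00 ...: the reverse pattern
        if odd.count(0) == len(odd) and even.count(0) == 0:
            return "text (utf-16-le?)"
    if b"\0" in data:
        return "binary"
    return "text"
```

The UTF-16 pattern test only fires for text made up entirely of characters below U+0100, so it is a hint rather than a proof, in keeping with the heuristic nature of the whole approach.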


8

It might be possible to use libmagic to guess the MIME type of the file using python-magic. If you get back something in the "text/*" namespace, it is likely a text file, while anything else is likely a binary file.
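
A minimal sketch of that idea, assuming the third-party python-magic package (and the libmagic library it wraps) is installed. `is_text_mime` and `is_text_file` are names I made up for illustration; `magic.from_file(path, mime=True)` is the python-magic call that returns the MIME type as a string.

```python
def is_text_mime(mime_type):
    """Treat anything in the text/* namespace as text."""
    return mime_type.startswith("text/")


def is_text_file(path):
    """Guess whether `path` is text from its libmagic MIME type."""
    # Imported here so is_text_mime() stays usable without python-magic.
    import magic  # third-party: python-magic, requires libmagic
    return is_text_mime(magic.from_file(path, mime=True))
```

Some formats that people think of as text are reported outside text/* (application/json is a common example), so you may want to extend `is_text_mime` with your own allow-list.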


5

If your script is running on *nix, you could use something like this:

import subprocess
import re

def is_text(fn):
    msg = subprocess.Popen(["file", fn], stdout=subprocess.PIPE).communicate()[0]
    return re.search('text', msg) != None
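
A possible variant for Python 3.7+, offered as a sketch rather than a drop-in replacement. It asks `file` for the MIME type only (`--mime-type`) and suppresses the filename (`--brief`), so a path that happens to contain "text" cannot cause a false match, and `text=True` makes the output a str instead of bytes.

```python
import subprocess


def is_text(path):
    """Ask file(1) for the MIME type and accept anything under text/*."""
    result = subprocess.run(
        # "--" stops option parsing, so paths starting with "-" are safe
        ["file", "--brief", "--mime-type", "--", path],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on failure
    )
    return result.stdout.strip().startswith("text/")
```

This still depends on the `file` command being available, so it remains a *nix-only approach.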

4 Comments

No need for re if just finding substring.
Doesn't work if "text" is part of a binary file's path.
I suggest Popen(["file", "--mime", fn]. ...). Otherwise the word "text" might not appear. On my Linux, the answer for something that looks like a Fortran program is "FORTAN program". If you add the mime switch you get "text/x-fortran; charset=us-ascii".
If you're using Python 3 the msg will be bytes rather than a string, so you'd have to use return re.search("text", msg.decode()) != None or return "text" in msg.decode() instead.
