
Let's say you download a .zip, a .tar.gz, or the source code of a project from somewhere. What you get is a seemingly random bunch of files, some containing code and some being images. Is there a way to find out what percentage of the total is images and other media files, and what percentage is text files? If there is a tool that does this, please share. If not, how would one go about writing a script that does it?

Update: adding more information in response to the comments received.

As an example, this is what I'm talking about:

┌─[shirish@debian] - [~/games/I-Nex] - [10054]
└─[$] [$] ll -h

total 236K
drwxr-xr-x 3 shirish shirish 4.0K 2016-11-13 21:25 debian
drwxr-xr-x 3 shirish shirish 4.0K 2016-11-13 19:16 I-Nex
drwxr-xr-x 2 shirish shirish 4.0K 2016-11-13 19:16 JSON
drwxr-xr-x 3 shirish shirish 4.0K 2016-11-13 02:12 dists
-rw-r--r-- 1 shirish shirish 7.8K 2016-11-13 02:12 i2c_smbus.rules
-rw-r--r-- 1 shirish shirish 1.4K 2016-11-13 02:12 i-nex.mk
drwxr-xr-x 2 shirish shirish 4.0K 2016-11-13 02:12 manpages
drwxr-xr-x 2 shirish shirish 4.0K 2016-11-13 02:12 pixmaps
-rw-r--r-- 1 shirish shirish   97 2016-11-13 02:12 release.conf
-rw-r--r-- 1 shirish shirish  280 2016-11-13 02:12 requirements.md
-rwxr-xr-x 1 shirish shirish 1.4K 2016-11-13 02:12 changelog.awk
-rwxr-xr-x 1 shirish shirish 2.5K 2016-11-13 02:12 Makefile
-rw-r--r-- 1 shirish shirish 6.6K 2016-11-13 02:12 README.md
-rw-r--r-- 1 shirish shirish 176K 2016-11-13 02:12 Changelog.md

This example is simple because only the pixmaps directory contains images. Even so, the listing doesn't tell me how much space is consumed by text files and directories of text files, and how much by pixmaps.

  • The simplest solution is probably to look at the file extensions, even if it isn't a perfect solution. Can you show us what you've done so far and where you got stuck? Commented Jan 27, 2017 at 18:57
  • percentage by bytes or by number of files? what's your criteria for image vs media vs text? (is ascii art text or image? is an animated GIF an image or media?) Commented Jan 27, 2017 at 19:03
  • @JuliePelletier while what you are saying is true, I'm trying to get sense of how the directory is structured, based on space occupied. I'll update the question so it makes more sense. Commented Jan 27, 2017 at 19:03
  • @JeffSchaller percentage by bytes, hmm... guess I need to provide more details. Commented Jan 27, 2017 at 19:04
  • You could use file to determine the basic file type and extract the size from the file listing ls and sum it up based on the file's extension. Now show us what you did so we can help you. This site is not a free script writing service and you need to be specific on where you get stuck. Commented Jan 27, 2017 at 19:26

2 Answers

3
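#!/bin/bash

# Walk everything below "$1" that is not a directory.
find "$1" ! -type d |
while IFS= read -r fpath; do      # IFS= and -r keep spaces and backslashes intact
    fname="${fpath##*/}"          # strip the directory part
    suffix="${fname##*.}"         # keep what follows the last dot

    # No dot in the name: there is no suffix to report.
    if [[ "$suffix" == "$fname" ]]; then
        suffix="(none)"
    fi

    size="$( stat --format '%s' "$fpath" )"

    printf '%s\t%d\n' "$suffix" "$size"
done |
# Split on tabs only, so a suffix containing spaces stays in one field.
awk -F '\t' '{ sz[$1] += $2 }
     END { for (s in sz) { printf("%s:\t%d\n", s, sz[s]) } }'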

Given a directory on the command line, the above bash script uses stat [1] to get the size, in bytes, of each individual file in that directory and its subdirectories. The while loop also extracts the filename suffix of each file and outputs it together with the size of the file.

The awk script [2] at the end will summarize and print the information.

Example, running over a directory of one of my work projects:

$ bash ./script.sh /home/kk/Work/Development/project/src/
c:      4559172
am:     369
h:      151369
o:      4613432
in:     42216
out:    3282712
(none): 2908962
Po:     18414
txt:    7129

The output may then be further filtered and formatted if need be.

Modifying this to do percentages of total size, or to use file to get the filetype rather than relying on the filename suffix, or to output the sizes in a unit other than bytes, is left to the reader as an exercise.
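As a starting point for that exercise, here is a minimal sketch that groups files by the top-level MIME type reported by file (text, image, application and so on) and prints each group's share of the total. The function names list_by_mime and share_by_category are made up for this illustration, and it assumes a file implementation that supports -b and --mime-type; treat it as one possible approach rather than a finished tool.

```bash
#!/bin/bash

# Print "<top-level MIME type><TAB><size in bytes>" for every regular file
# below the directory given as $1.
# Assumption: file(1) understands -b (brief) and --mime-type.
list_by_mime() {
    find "$1" -type f -exec sh -c '
        for f do
            mime=$(file -b --mime-type -- "$f")
            size=$(wc -c < "$f")
            # ${mime%%/*} turns "image/png" into "image".
            printf "%s\t%d\n" "${mime%%/*}" "$size"
        done' sh {} +
}

# Read "<category><TAB><bytes>" lines on standard input and print
# "<category><TAB><total bytes><TAB><share of grand total>".
share_by_category() {
    awk -F '\t' '
        { sz[$1] += $2; total += $2 }
        END {
            for (c in sz)
                printf("%s\t%d\t%.1f%%\n", c, sz[c],
                       total ? 100 * sz[c] / total : 0)
        }' | sort
}
```

With both functions defined, something like `list_by_mime ~/games/I-Nex | share_by_category` would print one line per category. Because the two steps are separate, the suffix-based loop above could feed share_by_category just as well.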

[1] The stat call here is tailored for GNU stat from the GNU coreutils package. The stat on OpenBSD is totally different.

[2] The awk script relies on associative arrays, which any POSIX-conforming awk provides, including GNU awk and mawk.
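If GNU stat is not available, one possible workaround is to let wc count the bytes instead. This is only a sketch, the helper name file_size is invented here, and wc may read the whole file on some systems, so it can be slower than stat.

```bash
#!/bin/bash

# Print the size of the file named by $1 in bytes without calling stat.
file_size() {
    # With input redirected, wc -c prints only the count, not the file name.
    # The arithmetic expansion strips the padding some wc versions add.
    echo "$(( $(wc -c < "$1") ))"
}
```

In the script above, the stat line could then become `size="$( file_size "$fpath" )"`.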

  • stat- part of coreutils ? Could you edit it so the readout is in more human readable form ? Commented Jan 27, 2017 at 20:32
  • I do like it, it at least gives some more idea/indication, I could use graphviz to generate a circle graph or something after this. Commented Jan 27, 2017 at 20:38
  • @shirish Yes, stat appears to be part of GNU coreutils. Thanks, I'll update the text. Commented Jan 27, 2017 at 20:46
  • @shirish Now also with a tab between the columns. Commented Jan 27, 2017 at 20:51
0

If it's in a compressed archive, like a .zip file or .tgz file, you can compare the compressed size to the uncompressed size. Binary files within the archive will tend to compress a lot less, especially image and media files (they're usually already compressed). Text files compress a whole lot more (sometimes by more than 90%).

I'm too hungry to do the math right now, but if your archive is "a lot" smaller than the directory it unpacks into, you've got an archive with "a lot" of text files. If you've got an archive that's "pretty close" to the size of the directory it unpacks into, you've got an archive that's "pretty close" to all binary files.

Hope that helps
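For anyone who wants to do that math, here is a rough sketch for gzip-compressed files such as a .tar.gz or .tgz. The function name gz_ratio is made up for this example. For a tarball, the uncompressed size includes tar's own headers and padding, so the figure is only an approximation.

```bash
#!/bin/bash

# Print the compressed size of the gzip file $1 as a percentage of its
# uncompressed size. A small number suggests mostly text; a number close
# to 100% suggests mostly already-compressed data such as images.
gz_ratio() {
    local compressed uncompressed
    compressed=$(wc -c < "$1")
    # Decompress to a pipe and count the bytes; nothing is written to disk.
    uncompressed=$(gzip -dc -- "$1" | wc -c)
    awk -v c="$compressed" -v u="$uncompressed" \
        'BEGIN { if (u > 0) printf("%.1f%%\n", 100 * c / u); else print "n/a" }'
}
```

It would be called as `gz_ratio project.tar.gz`, where project.tar.gz is a placeholder name. For .zip files, `unzip -l` lists the uncompressed sizes, which can be compared with the archive size in the same way.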

  • That I know as well, I am looking for more implementation-wise details. What you have shared is more of rule of thumb sort of thing. Commented Jan 27, 2017 at 20:33
