
I needed to compress/archive a folder, so I ran the following command:

gzip -v --rsyncable --fast -r myFolder/ -c > myFolderArchive.gz

...foolishly thinking this would do just what I expected: create an archive of myFolder and its files, recursively. It even produced nice-looking output:

./myFolder/file1 ... 80%
./myFolder/file2 ... 20%
...

Opening the archive later, however, I only saw a single file in it. A quick search led me to understand my mistake: gzip (or, I guess, I) took every file, compressed it, and concatenated the results one after another into a single file, losing all file/directory structure.
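In hindsight, what I should have run is tar, which records the directory structure and pipes the whole archive through gzip:

tar -czvf myFolderArchive.tar.gz myFolder/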

In the meantime, I've rm -r'd the original folder. All I have now is myFolderArchive.gz.

Does anyone see a way to take that archive and reconstruct the original set of files from myFolderArchive.gz's content, now that everything is run together in a single gzipped file?

I do still have access to the original disk (for a limited time) and could potentially attempt to recover at least the original directory structure (the filesystem is ext4). Technically, the content/data itself is all in myFolderArchive.gz; it would "just" need to be sliced right...
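For what it's worth, gzip itself treats a concatenation of members as a single stream, so the data should all be reachable, just unsegmented:

gzip -t myFolderArchive.gz          # checks every concatenated member
zcat myFolderArchive.gz > blob.bin  # one long run of all the files' contents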

1 Answer


You can try your luck with binwalk. It will show offsets and filenames.

$ binwalk testarchive.gz
-----------------------------------------------------------------------------------------------
DECIMAL                            HEXADECIMAL                        DESCRIPTION
-----------------------------------------------------------------------------------------------
0                                  0x0                                gzip compressed data,
                                                                      original file name:
                                                                      "yes.txt", operating
                                                                      system: Unix, timestamp:
                                                                      2024-12-22 09:56:00,
                                                                      total size: 4618 bytes
4618                               0x120A                             gzip compressed data,
                                                                      original file name:
                                                                      "no.txt", operating
                                                                      system: Unix, timestamp:
                                                                      2024-12-22 09:56:06,
                                                                      total size: 36503 bytes
41121                              0xA0A1                             gzip compressed data,
                                                                      original file name:
                                                                      "zero.txt", operating
                                                                      system: Unix, timestamp:
                                                                      2024-12-22 09:56:12,
                                                                      total size: 36506 bytes
-----------------------------------------------------------------------------------------------

binwalk can also extract directly, but unfortunately, it does not retain the filenames it's showing there.

I'm not sure if gzip itself, or any other utility, can do that more directly, so for now this still involves some scripting.

An example of how to extract the files from this output would be as follows:

binwalk big_blob.gz
[...]
37126805        0x2368295       gzip compressed data, was "some_file_1", from Unix, last modified: Tue Jul 29 11:09:01 2014, max speed
37128788        0x2368A54       gzip compressed data, was "some_file_2", from Unix, last modified: Thu Jul  3 14:02:42 2014, max speed
[...]
echo "37128788 - 37126805" | bc                                                                                                                                                                        
1983  # size of some_file_1 in bytes
dd if=big_blob.gz bs=1 skip=37126805 count=1983 iflag=skip_bytes,count_bytes of=some_file_1.gz
gzip -d some_file_1.gz # successfully unzipped the file

Note that binwalk's "decimal" numbers are 0-based byte offsets, not 1-based positions. So when it says a file starts at, say, offset 335, and you use dd's skip to get there, you need to skip exactly 335 bytes, since offset 335 is the 336th byte of the blob.

Using skip=37126804 in the example above would land you on the 37126805th byte (offset 37126804), one byte before the member actually starts, and gzip would then complain that the file is not in gzip format.
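To automate the slicing, a rough sketch along these lines should work (untested; it assumes GNU dd/stat, bash 4+, and the plain three-column binwalk output shown above, and uses gzip -N to restore each member's stored original filename):

#!/usr/bin/env bash
blob=myFolderArchive.gz

# Starting offsets of every gzip member, taken from binwalk's DECIMAL
# column, plus the blob's total size as a sentinel end offset.
mapfile -t offsets < <(binwalk "$blob" | awk '/gzip compressed data/ {print $1}')
offsets+=( "$(stat -c %s "$blob")" )

for (( i = 0; i < ${#offsets[@]} - 1; i++ )); do
    start=${offsets[i]}
    count=$(( offsets[i+1] - offsets[i] ))
    dd if="$blob" iflag=skip_bytes,count_bytes bs=1M \
       skip="$start" count="$count" of="member_$i.gz" status=none
    gzip -dN "member_$i.gz"   # -N restores the name stored in the gzip header
done

That still only recovers flat files with their stored names; the directory layout is a separate problem (see the comments below).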

  • Also it seems that this mode of compression is particularly poor for some reason. Compressed individually with gzip --fast, each file (1 MiB of yes-pattern, no-pattern, zero-pattern) is ~4600 bytes long, but in this example it somehow turned into ~36500 bytes starting from the 2nd file. The resulting gz file can be compressed by quite a lot. (Won't help you either way, just a bit of a mystery on the side.) Commented Dec 22, 2024 at 10:52
  • This seems like an amazing start, thanks a lot! I edited your answer to include actual example steps for getting a file extracted; I wanted to post it as a comment, but it was over the comment length limit. Commented Dec 23, 2024 at 4:13
  • One limitation of this method is that binwalk does not see folders or any of the original directory structure (since those weren't turned into .gz members by gzip and thus weren't concatenated into the resulting blob). Do you have any idea how we might leverage ext4's journaling to recover the original directory structure? Commented Dec 23, 2024 at 4:14
  • I'm not sure. You can try debugfs (logdump, undel, ...?) or other undelete tools. For a more generic / agnostic / stupid approach, if your filenames are unique enough, you can do something like strings /dev/filesystem | grep -C 100 -F filename; if you see a bunch of filenames next to each other, those might belong in the same directory. If you had a locate db running, that might also have a list of paths/filenames stored somewhere, but I guess it's old-fashioned. Commented Dec 23, 2024 at 9:24
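A minimal sketch of the two ideas from that last comment (the device /dev/sdXN and the inode number <1234> are hypothetical placeholders; debugfs opens the device read-only by default):

# 1. debugfs: list deleted inodes, then inspect a deleted directory's
#    entries by inode number
debugfs -R 'lsdel' /dev/sdXN
debugfs -R 'ls -d <1234>' /dev/sdXN

# 2. Brute force: scan the raw device for a known filename; names that
#    cluster together in the output may have lived in the same directory
strings /dev/sdXN | grep -C 100 -F file1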
