
I am trying to unzip a set of files (3 zip files). These archives contain a lot of random files, duplicate files, etc. To do this, I first extract the list of file names into a text file, clean up that list, and then unzip the files.

I was originally extracting files by dumping all the necessary file names into a text file all at once, and then running them through unzip.

So for example I have 3 zip files (file1, file2 and file3), all from the same source. First I extracted the list of the contents and cleaned it up into unzip.txt. Then I ran the following script on the file unzip.txt:

zipdir="/Volumes/filedir"
i=0
while IFS= read -r line; do
    i=$((i+1))
    # try to extract this member from every archive in turn
    for f in "$zipdir"/*.zip; do
        # echo "$f"
        unzip -qq "$f" "$line"
    done
    echo "$i"
done < unzip.txt

Not every file listed in unzip.txt is present in all 3 zip files. So I received a lot of errors, and I suspect that because I am unnecessarily trying to extract lines which are not in file1, it's wasting a lot of unzipping time. I am concerned about this, as I have a much larger set of files I have to run this on.

So I came up with a better way of handling the unzipping using ChatGPT, but I am not sure what errors I made:

function unzip1() {
    f=$1
    echo "$f"
    # NUL-delimit each line so xargs -0 can split on it safely
    while IFS= read -r line; do
        printf '%s\0' "$line"
    done < unzip1.txt | xargs -0 -n 1000 unzip "$f"
}

To explain this: I am now extracting only the file names corresponding to file1, cleaning them up, and then unzipping the files in one single pass. Let's assume that the extraction of the lines into unzip1.txt is OK, because I can see that the total number of lines came out to be the same.

The lines have a lot of random characters, so I have to pass each one through printf '%s\0' "$line" to put a NUL delimiter at the end of each line before passing it to xargs. The reason I have to use xargs is that the number of lines in unzip1.txt is very large, and I can't just pass them all to unzip on a single command line.
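As a side note, the per-line printf loop can be sketched more compactly with tr, which turns every newline into a NUL in one pass. In this illustration echo stands in for unzip "$f" purely to make the batching visible:

```shell
# Sketch: NUL-delimit a newline-separated list and batch it through xargs.
# `echo` is a stand-in for `unzip "$f"`; names with spaces survive intact.
printf 'file with spaces.txt\nother.txt\n' |
  tr '\n' '\0' |
  xargs -0 -n 1000 echo
# prints: file with spaces.txt other.txt
```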

At the end of this, I was able to get an unzipped archive. However, when I ran a test on the first 3 files using method 1 and method 2, method 2 gave me an archive half the size of the one from method 1.

Am I doing anything incorrectly?


The error in the size of the files is happening because of something to do with xargs. When I use xargs -0, I get 30% fewer files; when I use xargs -0 -n 1000, I get 20% fewer files; and when I use xargs -0 -n 10, I get 5% fewer files. So I am not sure what to do with xargs, and I switched to the following method of running one line at a time.

function unzip1() {
    f=$1
    echo "$f"
    # extract one member per unzip invocation -- slow but reliable
    while IFS= read -r line; do
        unzip -qq "$f" "$line"
    done < unzip1.txt
}

There must be a bug in xargs and how it works with unzip. Is this something that should be reported as an xargs bug, or am I imagining it?

  • I have used unzip *.zip for years.
  • @Wastrel you might have run unzip *.zip on Microsoft operating systems (where globbing is done, each in their own way, by applications). On other OSes, that wouldn't make sense. unzip having an MS-DOS-like API, unzip '*.zip' would work in a Unix shell, though (unzipping all the .zip files in the current directory, not only the file literally called *.zip).
  • @StéphaneChazelas I haven't used any Microsoft in years. The unzip *.zip command works fine on my Ubuntu. The command unrar x *.rar does not work, though. The .rar files have to be "unrared" by name. And just to be complete, I am not sure what tar would do.
  • @Wastrel, unzip *.zip in a directory that has more than one zip file (and whose names don't start with -) would be expanded to something like unzip file1.zip file2.zip file3.zip, which is a request to extract the file2.zip and file3.zip members of the file1.zip zip file.

3 Answers


It looks like you're using macOS, whose tar (the one from libarchive, as on modern BSDs or Microsoft Windows, or the bsdtar you can install on GNU/Linux distributions) can extract zip files, and can take the list of members to extract, newline-delimited or NUL-delimited, from a file (with the -T¹ option, aka --files-from (as in GNU tar), aka -I²).

So:

for f ($zipdir/*.zip(N)) tar -xf $f -T unzip.txt

Beware that, as for unzip, lines of unzip.txt are treated as patterns, and it doesn't look like there's a way to disable that other than escaping the wildcard operators. There's also special handling of a line containing just -C³.

That will complain if some members cannot be matched.

It can also generate a tar archive on the fly from a list of zip archives, with the tar -cf - @file1.zip @file2.zip... syntax, so you could also do:

(){tar -cf - @$^@ | tar -xf - -T unzip.txt} $zipdir/*.zip

Where the first tar outputs a tar concatenation of all the zip archives and the second extracts the requested members.

Then an error will only be reported if members listed in unzip.txt are found in none of the archives.

If you also have GNU tar (which doesn't do pattern-matching unless explicitly requested with --wildcards) installed, for instance as gtar or /opt/gnu/bin/tar, you can replace the second tar with gtar --verbatim-files-from so the file passed to -T is treated as a list of literal file paths. That works because it is the tar format (as opposed to zip) that is fed to it, which GNU tar can handle.

Above, the list of archives is passed as arguments to an anonymous function, so we can use @$^@ to prepend @ to each of them. You could also prepend that @ as part of the glob expansion using the e glob qualifier ($zipdir/*.zip(e:REPLY[1,0]=@:)) or with the histsubstpattern option enabled with the :s/pattern/replacement/ modifier ($zipdir/*.zip(:s/#/@)).

Note that your:

while IFS= read -r line; do
  printf "%s\0" "$line" 
done

can be written as tr '\n' '\0'⁴.

You can also get the contents of unzip.txt as an array of its non-empty lines using the f (to split on linefeed) parameter expansion flag:

files=( ${(f)"$(<unzip.txt)"} )

And use zsh's own zargs instead of xargs to split that list to avoid the E2BIG execve() error in case of very large lists:

autoload -U zargs
zargs -r --eof= -- $files '' unzip $f

As to why you get a different number of extracted files with xargs and different numbers passed to -n, that may be caused by unzip prompting the user and reading the response from stdin, which in this case is the pipe your loop is feeding, which means unzip may read parts of or the whole of the file list as soon as it issues a prompt.
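That stdin-stealing effect can be sidestepped by giving each child spawned by xargs its own empty stdin, so it cannot swallow the list feeding the pipe. A sketch, with cat standing in for an unzip that reads from stdin:

```shell
# `cat` plays the role of an unzip that reads its stdin (e.g. to answer a
# prompt); without the </dev/null redirection it could consume the rest of
# the list that xargs is still reading.
printf 'a\nb\nc\n' |
  xargs -n 1 sh -c 'cat >/dev/null </dev/null; echo "$1"' _
# prints: a, b and c, each on its own line
```

With unzip you would put `unzip -qq "$1" </dev/null` (or unzip's own -n/-o options to avoid prompting) in place of the cat line.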

With GNU xargs, you'd use:

xargs -rd '\n' -a unzip.txt unzip '/Volumes/filedir/*.zip'

(note the quotes around the globs, as unzip being more like a MS-DOS application actually does globbing by itself and in the process is able to extract members from more than one archive at a time).

Then xargs' and unzip's stdin remain untouched meaning that if you run that from a terminal, you'll be able to answer the prompts.

BSD xargs has neither -d nor -a, but it has -o, which tells it to reopen stdin on /dev/tty (you'd only use it when running from a terminal), so you can do:

<unzip.txt tr '\n' '\0' | xargs -or0 unzip '/Volumes/filedir/*.zip'

As noted by Matija Nalis, if there are lines of unzip.txt that start with -d or -x, they'll be interpreted by unzip as the destination directory or a file pattern to exclude for the rest of the entries passed to that unzip invocation, even if -- is used.

Prefixing all bytes except the line delimiters with \ in unzip.txt would work around that and also avoid problems with entries that contain wildcard characters:

<unzip.txt LC_ALL=C sed 's/./\\&/g' |
  tr '\n' '\0' |
  xargs -or0 unzip '/Volumes/filedir/*.zip'
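To see what that escaping does, here is the first stage on its own, fed a couple of awkward entries (-d, which unzip would otherwise take as an option, and a name containing a wildcard):

```shell
# Every byte gets a backslash prefix, so unzip treats each entry as a
# literal member name rather than an option or a pattern.
printf '%s\n' '-d' '*.txt' | LC_ALL=C sed 's/./\\&/g'
# prints:
# \-\d
# \*\.\t\x\t
```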

¹ from John Gilmore's public domain tar (upon which GNU tar is based) since at least as far back as 1986.

² From SunOS 4.0 from 1988. I wouldn't be surprised if John Gilmore was also involved in that.

³ A "feature" apparently inherited or inspired by GNU tar (added there in 1991). It's actually worse in GNU tar, where there's also whitespace trimming by default and where all strings starting with - in the file are treated as tar options, though only the position sensitive options are allowed. In GNU tar, you'd add the --verbatim-files-from (added in 1.29 from 2016) to make sure the file is interpreted as just a list of file paths.

⁴ well, strictly speaking, the while loop would discard the part after the last newline character if any, while tr wouldn't.

  • out of curiosity, how did you determine the OP was using macOS?
  • @MatijaNalis I can't be 100% sure, but /Volumes and zsh are two potential telltales. Their posting history here is an even stronger one.

Firstly, double-quote your variables. e.g. use "$f" rather than just unquoted $f.

After that, the next thing that comes to mind is that if the unzip1.txt input file is newline-separated, then there's no point converting it to NUL-separated for xargs. You should use -d '\n' instead of -0; then you can just cat or redirect unzip1.txt into xargs and avoid the excruciatingly slow shell while-read loop. e.g. xargs -d '\n' -n 100 unzip "$f" < unzip1.txt

But why use -n 100 to limit it to 100 files at a time? Is there any good reason for xargs to run unzip once per 100 files rather than the default of as many as will fit into a command line?
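As a quick illustration of the -d '\n' approach with GNU xargs (echo standing in for unzip here, since the point is only how arguments are split): each input line becomes exactly one argument, spaces and all, with no NUL-conversion step needed:

```shell
# GNU xargs only: -d '\n' treats each input line as a single argument,
# so names containing spaces are passed through unmangled.
printf 'file with spaces.txt\nother.txt\n' | xargs -d '\n' echo
# prints: file with spaces.txt other.txt
```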

BTW, you may want to use unzip's -n option to always skip extracting files that already exist (e.g. were already extracted in a previous run of unzip). Or, more dangerously, -o to always overwrite existing files. See man unzip.

Also, $1 / $f is obviously the zipfile to be unzipped - personally, I'd use $zipfile or $zf rather than just $f, but that's a trivial detail. More importantly, is it being passed to the function with a full pathname or just the basename? And what is the $zipdir variable meant to be used for? Did you mean to cd to that directory before unzipping, or pass it to unzip as "$zipdir/$f"? Or is it meant to be the directory to extract into (using unzip's -d exdir option)?


Finally, ChatGPT is an LLM. That means it's a stochastic bullshit generator. It's not "intelligent", it's not "smart", and it doesn't understand anything, it just spews out text that is likely to follow from the prompt and previously-seen input.

I wouldn't trust any code "written" (i.e. plagiarised from the original author(s) without attribution and without regard to copyright) by it or any other LLM for any reason. I'd trust it even less than I would trust code posted by some random stranger on the net - at least the human stranger has a chance of understanding what they're doing, while an LLM has no chance of understanding anything.

I still wouldn't trust code from either unless I took the time to read and understand what that code was doing and how it was doing it.

The use of unquoted variables is an obvious example of LLM failure - it has just copied the error made by thousands or millions of novice shell programmers. There are far more examples of variables not being quoted, so it's far more likely to perpetuate that bad programming practice.

  • Note that -d is specific to the GNU implementation of xargs. zsh and /Volumes suggest the OP might be on macOS. Unquoted variables are less of a problem in zsh (the use of echo instead of print -r - or echo -E - is more of a problem). The only potentially unwanted side effect of leaving parameter expansions unquoted there is empty-removal.
  • 1. For versions of xargs that don't support -d, using tr '\n' '\0' < unzip1.txt | xargs -0 ... would be better than a shell while-read loop. 2. IMO, avoiding the removal of empty args is a good thing, as that can hide bugs in the rest of the script... it lets you know that a variable isn't being set correctly, which could lead to catastrophic results. Better for the script to print an error and die than to run a command with incorrect, possibly dangerous, args (or lack of args).
  • As for zsh's print... I wasn't aware that zsh had that as a built-in. But now that I do know, I consider it to be a problem in itself, because it conflicts with existing uses of print for sending stuff to a printer that have been around for decades (e.g. the mailcap package has a print command for doing exactly that). Admittedly, there aren't many other keyword names they could have used, but zprint or zecho would be better than just print.
  • print has been a built-in of ksh since the early 80s. It's also in bash as a loadable builtin.
  • print is also among the list for which POSIX warns that the look-up will be unspecified (typically because the ones like echoti/typeset/print are used as builtins in some POSIX sh implementations and have been for decades).

TL;DR: add -- to unzip.


There must be a bug in xargs, and how it works with unzip. Is this something that should be reported to xargs, or am I imagining this as a bug?

Statistically, when one has a problem with a tool that has been used many times a day by many millions of people for decades, I'd say it is quite unlikely that one has found a genuine bug at that level.

As for the problem, I'd say it is (as it most often is) a user error. Try changing -n 1000 in your experiments to -n 1, and suddenly it will work as well as it did before.

The reason likely lies in the filenames. When one bad filename causes unzip to abort, only one file is lost. But when there were 10 other files in the same unzip invocation and unzip aborts, all 10 of those files are lost too. Hence the difference.
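That batch effect can be sketched without any zip files at all. Here sh -c plays the part of an unzip that aborts when it hits an option-looking name: with batches of 2, only the bad name's batch-mate is lost, while one big batch loses everything after the bad name:

```shell
# A stand-in "extractor": prints each name, but aborts on anything
# starting with '-' (like unzip choking on an option-looking member).
extract='for x in "$@"; do case $x in -*) exit 1;; esac; echo "$x"; done'

printf 'a\n-bad\nb\nc\n' | xargs -n 2 sh -c "$extract" _  # prints a, b, c
printf 'a\n-bad\nb\nc\n' | xargs -n 4 sh -c "$extract" _  # prints only a
```

In the first run the batches are (a, -bad) and (b, c), so only -bad itself is skipped; in the second, the single batch (a, -bad, b, c) aborts after a, losing b and c as well.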

I think you'll find that even your original script loses files, just less often (i.e. ls | wc -l will differ from wc -l unzip.txt).


So, changing xargs -0 -n 1000 unzip $f to xargs -0 -n 1000 unzip -- $f (or, even better, just xargs -0 unzip -- $f, to use the maximum available and safe command-line size) should help in this particular case.

-- in many (but not all) commands forces that command to stop interpreting any following parameters as possible options, and to only consider them as filenames. It happens to work in my zip 3.0-13 from Debian Bookworm.

Consider the following example:

% mkdir /tmp/test
% cd /tmp/test
% touch ./a ./b ./c ./-d
% zip all.zip *

zip warning: all.zip not found or empty
zip error: Nothing to do! (all.zip)

% zip all.zip -- *
  adding: -d (stored 0%)
  adding: a (stored 0%)
  adding: b (stored 0%)
  adding: c (stored 0%)

Same problem would occur with unzip, of course.

If -- didn't work, you'd have to pre-process the filenames in unzip.txt so they are in a format compatible with unzip(1). Prepending ./ might work (as I did with touch(1) in the example above), but other workarounds might be needed depending on the filename breakage (e.g. not all filesystems support the full UTF-8 complement of characters in filenames, but that is a deeper problem for another question).
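That pre-processing step could look like the following sketch, prefixing every line with ./ so nothing starts with a dash (whether unzip then matches the stored member names is a separate question, as noted above):

```shell
# Prefix every member name with ./ so none can be mistaken for an option.
printf '%s\n' '-d' 'normal.txt' | sed 's|^|./|'
# prints:
# ./-d
# ./normal.txt
```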

I'd also note, somewhat relatedly, that if that zip archive contains all kinds of strange stuff, adding -j to that unzip might be prudent too.

And always verify that the number of files extracted is the same as the number of files you were expecting to get.

  • Looks like -- doesn't work for unzip and -d/-x. unzip file.zip -- whatever complains that -- can't be found in the archive, unzip -- file.zip -d/e/f complains with cannot create extraction directory: /e/f. unzip file.zip ./-d doesn't extract a member whose path is -d. You'd need something like unzip file.zip '[-]d'
  • Or unzip file.zip '\-d'. I've updated my answer with that.
