Revisions to Splitting a small file into 512 byte segments changes it, but splitting it in 1k segments doesn't

Turn "determined" into "deterministic" which I think is more thorough

edited Nov 5, 2021 at 20:00

dhag

The order in which find processes the files is not deterministic. It may simply be the order the underlying system call returns, which probably depends on the underlying filesystem structure and can be essentially random. Some implementations might process the list somehow, but don't expect it to be sorted.

Let's try with a smaller file. cat frag* reproduces the right file, since shell globs do sort the filenames:

$ split -b512 orig.bin frag
$ cat frag* > new.bin
$ sha256sum orig.bin new.bin 
8d12b42623eeefee872f123bd0dc85d535b00df4d42e865f993c40f7bfc92b1e  orig.bin
8d12b42623eeefee872f123bd0dc85d535b00df4d42e865f993c40f7bfc92b1e  new.bin

But find doesn't, so we get a different file:

$ find . -name 'frag*' -exec cat {} + > second.bin
$ sha256sum second.bin 
821325739ca65d1cb568ecf3a16bd2e01ac4eef1419b4d714834fab07d2f135c  second.bin

Just running find to print the names reveals this nicely:

$ find . -name 'frag*' |head -5
./fragzbgv
./fragzbmg
./fragvt
./fragyd
./fragzayc

That was on Linux and ext4. I think it uses some sort of hashing and trees to store the filenames, thus producing a random-looking order. On tmpfs, I got the list in reverse creation order, which isn't that random but still would mess up this case.

Explicitly sorting the list of filenames should help:

$ find . -name 'frag*' -print0 | sort -z | xargs -0 cat > third.bin
$ sha256sum third.bin 
8d12b42623eeefee872f123bd0dc85d535b00df4d42e865f993c40f7bfc92b1e  third.bin
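
If you want to try this without touching real data, here is a minimal sketch that builds a scratch file, splits it, and reassembles it through a sorted find. The directory and the file names orig.bin and rebuilt.bin are just placeholders I made up, and it assumes head -c, sort -z, find -print0 and xargs -0 are available (they are on GNU systems). Forcing LC_ALL=C for sort is my addition, so that the names are compared as plain bytes regardless of locale:

```shell
#!/bin/sh
# Sketch: split a scratch file into 512-byte pieces and rebuild it
# by concatenating the fragments in sorted name order.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# 100 KiB of random data gives 200 fragments of 512 bytes each.
head -c 102400 /dev/urandom > orig.bin
split -b 512 orig.bin frag

# NUL-delimited names, sorted bytewise, then concatenated in that order.
find . -name 'frag*' -print0 | LC_ALL=C sort -z | xargs -0 cat > rebuilt.bin

# cmp -s is silent and only sets the exit status.
if cmp -s orig.bin rebuilt.bin; then
    result=identical
else
    result=different
fi
echo "$result"

# Clean up the scratch directory.
cd /
rm -rf "$workdir"
```

Using cmp instead of comparing checksums by eye keeps the check scriptable.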

That it worked for you with 1k blocks is probably an accident...
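
One way to tell whether a particular run just got lucky is to check if the list find produced happens to be sorted already. A small sketch, where is_sorted is a helper name I made up and the check relies on sort -c, which only tests the order and reports it through the exit status:

```shell
# Report whether the list of names on standard input is already in
# bytewise sorted order. sort -c checks the order without sorting.
is_sorted() {
    if LC_ALL=C sort -c 2>/dev/null; then
        echo sorted
    else
        echo unsorted
    fi
}

# Usage, in the directory holding the fragments:
#   find . -name 'frag*' | is_sorted
```

If that prints "sorted", concatenating in find order would give the right file on that run, but nothing guarantees the same on the next one.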

As an aside, I'm not sure why you're using dd bs=512 for the input there instead of just cat, or giving split the filename directly. All dd does is read and write with a particular block size, but the pipe it writes to doesn't preserve block sizes; it's just a stream of bytes. The filesystem really shouldn't care what block size you use to read the file either, be it 512 (2^9) or 521 (a prime) bytes.
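
To convince yourself that the block size makes no difference to the bytes coming out of the pipe, something like the following should do. It is only a sketch: the scratch file is random data created for the test, and the variable names are mine.

```shell
# Read the same file three ways and checksum what arrives on the pipe.
set -eu

tmpfile=$(mktemp)
head -c 10000 /dev/urandom > "$tmpfile"

# dd prints its transfer statistics on stderr, so silence that.
sum_cat=$(cat "$tmpfile" | sha256sum)
sum_512=$(dd if="$tmpfile" bs=512 2>/dev/null | sha256sum)
sum_521=$(dd if="$tmpfile" bs=521 2>/dev/null | sha256sum)

printf '%s\n' "$sum_cat" "$sum_512" "$sum_521"

rm -f "$tmpfile"
```

All three lines should show the same hash, since the reader of the pipe only ever sees the byte stream.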

added 200 characters in body

edited Nov 5, 2021 at 12:44

ilkkachu

added 377 characters in body

edited Nov 4, 2021 at 19:59

ilkkachu

added 377 characters in body

edited Nov 4, 2021 at 19:50

ilkkachu

answered Nov 4, 2021 at 19:43

ilkkachu
