Sort is not sorting?

Question

I seem to be having the same issue as described in 'The "join" utility reports: file is not sorted, but in fact it is sorted'. However, I have piped BOTH files through sort before attempting to join. I have also tried sort -d and sort -g.

This is running on Amazon Linux 2, using sort from coreutils-8.22-24

The following illustrates the issue:

root@host:/home/user# cat /tmp/db_schema_size | sort
directory       0.000106811523
directory_1      1.059814453265
directory_123    0.564987182688
directory_123123 0.564987182688
directory_1234   0.564987182688
directory_12345  0.564987182688
directory_1234567        0.564987182688
directory_82473  0.934677124123
directory_82475  0.751586914161
directory_82477  0.881881713968
directory_82479  0.751571655373
directory_82481  0.750396728614
directory_82483  0.589370727610

root@host:/home/user# cat /tmp/db_dir_sizes | sort
directory       132
directory_1      1115936
directory_123123 613244
directory_12345  613248
directory_1234567        613248
directory_1234   613244
directory_123    613244
directory_82473  1015140
directory_82475  818764
directory_82477  958628
directory_82479  818756
directory_82481  817500
directory_82483  638820

Both files have the same structure: no leading/trailing whitespace, and a single tab character between the values.

Both files are processed by sort but produce output in a different order.

I do see that on Ubuntu 22.04 LTS, the output is consistent (and of the first form above).

What am I missing here?

update

For clarification...

On AWS Linux 2 with LANG=en_US.UTF-8, I get output as above - i.e. output differs

On AWS Linux 2 with LANG=C.UTF-8 output is the same

On Ubuntu with both LANG=C.UTF-8 and LANG=en_US.UTF-8, output is the same

erk. While I still do not understand why there is a difference, setting LANG=C.UTF-8 (instead of LANG=en_US.UTF-8) gives consistent ordering across the 2 files.
– symcbean, Commented Jun 19, 2024 at 13:45
For each of these files please show the corresponding alternative. Also provide the output of locale from both systems.
– Chris Davies, Commented Jun 19, 2024 at 13:58
I already did that? (LC_ values are inherited from $LANG when not explicitly set - but that DOES NOT EXPLAIN why sort gives different results WITHOUT changing the locale vars)
– symcbean, Commented Jun 19, 2024 at 14:12
Are the two files being shown from the same host? If so, on this host what is the locale?
– Chris Davies, Commented Jun 19, 2024 at 14:16
Output quoted in the question is on the same AWS host with LANG=en_US.UTF-8. Applying LANG=en_US.UTF-8 on my Ubuntu box still gives consistent results across both files, with data order as per the first example. Same with LANG=C.UTF-8. i.e. the issue ONLY manifests with LANG=en_US.UTF-8 on the AWS Linux 2 host.
– symcbean, Commented Jun 19, 2024 at 14:20

user763861 · Accepted Answer · 2024-08-31 01:56:31Z

The short answer is that en_US.UTF-8 has non-intuitive sorting behavior.

A longer answer is provided by https://stackoverflow.com/questions/51930948/understanding-gnu-sorting-with-en-us-utf-8. Since en_US.UTF-8 collation ignores whitespace, sort is ignoring your tabs, which is disastrous given your goal of joining on the directory names.

Here's what happens when the tabs are removed using sed $'s/\t//g':

$ cat db_schema_size | sed $'s/\t//g' | LC_COLLATE=en_US.UTF-8 sort | cat -T
directory0.000106811523
directory_11.059814453265
directory_1230.564987182688
directory_1231230.564987182688
directory_12340.564987182688
directory_123450.564987182688
directory_12345670.564987182688
directory_824730.934677124123
directory_824750.751586914161
directory_824770.881881713968
directory_824790.751571655373
directory_824810.750396728614
directory_824830.589370727610

$ cat db_dir_sizes | sed $'s/\t//g' | LC_COLLATE=en_US.UTF-8 sort | cat -T
directory_11115936
directory_123123613244
directory_12345613248
directory_1234567613248
directory_1234613244
directory_123613244
directory132
directory_824731015140
directory_82475818764
directory_82477958628
directory_82479818756
directory_82481817500
directory_82483638820

If you're wondering why different hosts (Ubuntu vs AWS Linux 2) would behave differently, you might be seeing a difference in how en_US.UTF-8 is implemented on those two platforms. My examples were run on Ubuntu 22.04.4 LTS.

It's conceivable that a shell might have sort aliased to LC_COLLATE=C sort to avoid the troubles with en_US.UTF-8. The way to find out is type sort.
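A quick sketch of that check, extended to print which locale setting governs collation in the current environment. effective_collate is a name made up for this example, not part of any tool; it only applies the usual precedence of LC_ALL over LC_COLLATE over LANG. Run it in the interactive shell you actually use, since aliases are typically not visible from a script.

```shell
# Show how the shell resolves "sort"; an alias or function would be reported here.
type sort

# Hypothetical helper: print the setting that decides the collation order.
# LC_ALL wins over LC_COLLATE, which wins over LANG; unset or empty falls
# through to the next one, and to the C locale when none is set.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```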

On GNU systems at least, en_US.UTF-8 doesn't ignore whitespace. Whitespace characters have an undefined primary (and secondary and tertiary, for that matter) weight, like many other accessory characters including _ and ., which are also used in those inputs. But if two strings compare the same after the primary, secondary and tertiary weights have been processed, they still have a relative order, with tab < spc < . < _. The issue here is that you need to sort on the join field, not the full line.
– Stéphane Chazelas, Commented Aug 31, 2024 at 4:17

Stéphane Chazelas · Accepted Answer · 2024-08-31 04:32:27Z

join expects its inputs to be sorted on their respective join field, not on the whole line. The sort must be lexical and use the same collation order as the one join will use to compare fields, so the same locale at least in the LC_CTYPE and LC_COLLATE categories.

For sort, by default, fields are delimited by the transition from a non-blank to a blank. That's the same for join, except that leading blanks are ignored, like when sort is called with -b. With both sort and join, one can specify a single-character¹ field separator with the -t option.

As the POSIX specification of the join utility puts it:

The files file1 and file2 shall be ordered in the collating sequence of sort -b on the fields on which they shall be joined, by default the first in each line. All selected output shall be written in the same collating sequence.

-t char
Use character char as a separator, for both input and output. Every appearance of char in a line shall be significant. When this option is specified, the collating sequence shall be the same as sort without the -b option.

So, if joining file1 and file2 (where fields are blank-separated) on the first field, you need:

join <(sort -bk1,1 file1) <(sort -bk1,1 file2)

(here assuming a shell with support for ksh-style process substitution such as ksh, zsh or bash)

And if the fields are TAB-separated:

join -t $'\t' <(sort -t $'\t' -k1,1 file1) <(sort -t $'\t' -k1,1 file2)
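To verify that a file really is ordered on its join field rather than on the whole line, sort -c can be restricted to that key. Below is a minimal sketch for TAB-separated files, written for plain sh. is_sorted_on_key is a name made up for this example, and the C locale is an assumption that has to match whatever locale join is later run with. It is meant for files whose join keys are unique: with duplicate keys, sort may also compare the rest of the line, so the check can be stricter than what join needs.

```shell
# A literal TAB, obtained portably (no $'\t' needed).
tab=$(printf '\t')

# Hypothetical helper: succeeds if the file is ordered on its first
# TAB-separated field in the C locale, fails otherwise.
is_sorted_on_key() {
    LC_ALL=C sort -c -t "$tab" -k1,1 "$1" 2>/dev/null
}

# Two lines that are out of order on purpose.
demo=$(mktemp)
printf '%s\t%s\n' directory_123 613244 directory 132 > "$demo"

if is_sorted_on_key "$demo"; then
    echo "sorted on field 1"
else
    echo "not sorted on field 1"
fi
rm -f "$demo"
```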

Now, en_US.UTF-8 is probably not the best choice of locale. It will give non-deterministic outcomes if the input contains sequences of bytes that can't be decoded as UTF-8, the decoding and the complex en_US collation order are costly to process, and, at least on GNU systems, those human collation orders have characters that sort the same, so they can give non-deterministic outcomes even on valid UTF-8 encoded text.

If you don't care about the actual order the join keys are sorted in, as long as the files are joined, using the C/POSIX locale would be much more efficient and reliable, as it's a single-byte locale with no decoding taking place and the comparison function is just a byte-to-byte comparison.

LC_ALL=C join -t $'\t' <(LC_ALL=C sort -t $'\t' -k1,1 file1) \
                       <(LC_ALL=C sort -t $'\t' -k1,1 file2)
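For shells without process substitution, the same thing can be sketched with temporary files. The few rows below are taken from the question's data; the file names under the temporary directory are made up for this example.

```shell
tab=$(printf '\t')
work=$(mktemp -d)

# Deliberately unsorted, TAB-separated sample rows from the question.
printf '%s\t%s\n' \
    directory_1234 0.564987182688 \
    directory      0.000106811523 \
    directory_123  0.564987182688 \
    directory_1    1.059814453265 > "$work/schema"
printf '%s\t%s\n' \
    directory_123  613244 \
    directory_1    1115936 \
    directory_1234 613244 \
    directory      132 > "$work/dirs"

# Sort both inputs on the join field only, in the locale join will use.
LC_ALL=C sort -t "$tab" -k1,1 "$work/schema" > "$work/schema.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$work/dirs"   > "$work/dirs.sorted"

joined=$(LC_ALL=C join -t "$tab" "$work/schema.sorted" "$work/dirs.sorted")
printf '%s\n' "$joined"

rm -rf "$work"
```

Each output line carries the directory name followed by the value from each file, in byte order of the directory names.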

Now beware that tab and newline are as valid a character as any in a file or directory name, and neither sort nor join has any provision to escape the field separator or record delimiter.

The GNU implementations of sort and join have a -z option to process NUL-delimited records instead of newline-delimited ones, which helps in that NUL cannot occur in a file name, but that won't help for Tab.

So to process arbitrary file paths, your options are either to encode those TAB/NL characters one way or another, such as \t, \n (and \\ for backslash), or with URI encoding (%09, %0A, and %25 for %).

Or use the more advanced forms of TSV as recognised by mlr for instance where fields with newlines and/or tabs are quoted like in CSVs. But you can't process those with sort/join, though you could use mlr instead which would also have the benefit of allowing you to use headers and not require the inputs to be sorted (though its join supports a -s for sorted input which helps for sorted files that can't fit in memory).

~$ sed -n l file1
dir\tvalue$
directory\t0.000106811523$
directory_1\t1.059814453265$
directory_123\t0.564987182688$
directory_123123\t0.564987182688$
"directory$
with newline"\t0.123$
"directory with\ttab"\t0.456$
~$ mlr --tsv join -j dir --lp file1. --rp file2. -f file1 file1
dir     file1.value     file2.value
directory       0.000106811523  0.000106811523
directory_1     1.059814453265  1.059814453265
directory_123   0.564987182688  0.564987182688
directory_123123        0.564987182688  0.564987182688
"directory
with newline"   0.123   0.123
"directory with tab"    0.456   0.456

¹ Beware that with many sort implementations, including current versions of GNU sort, that character has to be single-byte, which is the case for TAB.
