Is jq internal sort slower than GNU sort?

Asked 4 years, 5 months ago

Modified 4 years, 4 months ago

Viewed 648 times

While filtering this JSON file, I ran a benchmark and found that jq's built-in sort and unique are about 25% slower than sort --unique!

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `jq "[.[].category] \| sort \| unique" channels.json` | 172.0 ± 2.6 | 167.8 | 176.8 | 1.25 ± 0.06 |
| `jq "[.[].category \| select((. != null) and (. != \"XXX\"))] \| sort \| unique" channels.json` | 151.9 ± 4.1 | 146.5 | 163.9 | 1.11 ± 0.06 |
| `jq ".[].category" channels.json \| sort -u` | 137.2 ± 6.6 | 131.8 | 156.6 | 1.00 |

Summary
  'jq ".[].category" channels.json | sort -u' ran
    1.11 ± 0.06 times faster than 'jq "[.[].category | select((. != null) and (. != \"XXX\"))] | sort | unique" channels.json'
    1.25 ± 0.06 times faster than 'jq "[.[].category] | sort | unique" channels.json'

test command:

hyperfine --warmup 3 \
    'jq "[.[].category] | sort | unique" channels.json'  \
    'jq "[.[].category | select((. != null) and (. != \"XXX\"))] | sort | unique" channels.json' \
    'jq ".[].category" channels.json | sort -u'

If we test only sorting (without uniqueness), jq is again slower than sort, this time by 9%:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `jq "[.[].category] \| sort" channels.json` | 133.9 ± 1.6 | 131.1 | 138.2 | 1.09 ± 0.02 |
| `jq ".[].category" channels.json \| sort` | 123.0 ± 1.3 | 120.5 | 125.7 | 1.00 |

Summary
  'jq ".[].category" channels.json | sort' ran
    1.09 ± 0.02 times faster than 'jq "[.[].category] | sort" channels.json'

versions:

jq-1.5-1-a5b5cbe
sort (GNU coreutils) 8.28

I expected jq's built-in functions to be faster than piping into an external program, which has to be spawned as a separate process. Am I using jq poorly?
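One detail worth checking in the first two commands: according to the jq manual, `unique` already returns its elements in sorted order, so the preceding `sort` looks redundant and may account for part of the gap. A minimal sketch on made-up inline data (not channels.json):

```shell
# unique alone: the result is already sorted, with duplicates removed
with_unique=$(jq -cn '["b", "a", null, "b"] | unique')

# sort followed by unique: same result, but the array gets ordered twice
with_both=$(jq -cn '["b", "a", null, "b"] | sort | unique')

echo "$with_unique"
echo "$with_both"
```

If both lines print the same array, `jq "[.[].category] | unique" channels.json` would be the fairer counterpart to `sort -u`.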

Update: I just repeated this experiment on a host with flash storage, an Arm CPU, and these versions:

jq-1.6
sort (GNU coreutils) 8.32

result:

Benchmark #1: jq "[.[].category] | sort" channels.json
  Time (mean ± σ):     587.8 ms ±   3.9 ms    [User: 539.5 ms, System: 44.2 ms]
  Range (min … max):   582.8 ms … 594.2 ms    10 runs
 
Benchmark #2: jq ".[].category" channels.json | sort
  Time (mean ± σ):     606.0 ms ±   8.6 ms    [User: 569.5 ms, System: 49.0 ms]
  Range (min … max):   589.6 ms … 616.2 ms    10 runs
 
Summary
  'jq "[.[].category] | sort" channels.json' ran
    1.03 ± 0.02 times faster than 'jq ".[].category" channels.json | sort'

Now jq sort runs 3% faster than GNU sort :D

edited Jun 26, 2021 at 20:28

asked Jun 26, 2021 at 8:39

Zeta.Investigator

jq uses your C library's qsort() implementation. All your tests run in sub-second time, which suggests that memory management overhead is likely affecting the results significantly, so we can't say much more than "okay". Test again on data that takes around 2 to 5 seconds to sort. Spawning a single process to run sort is quick (it's not as if you're running a shell loop that spawns sort in every iteration).

– Kusalananda ♦
Commented Jun 26, 2021 at 8:55
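To follow that suggestion, a larger input is needed. The sketch below generates one with awk; the helper name `make_channels`, the file name `big.json`, and the field values are all assumptions made up for illustration, and only the `category` field mirrors the question's data.

```shell
#!/bin/sh
# Hypothetical helper: write N objects, each with a "category" field drawn
# from a pool of 1000 names, as a single JSON array to the given file.
make_channels() {
    n=$1
    out=$2
    awk -v n="$n" 'BEGIN {
        srand(42)   # fixed seed so repeated runs produce the same file
        print "["
        for (i = 1; i <= n; i++) {
            # no trailing comma after the last element, to keep the JSON valid
            printf "  {\"name\": \"ch%d\", \"category\": \"cat%03d\"}%s\n", i, int(rand() * 1000), (i < n ? "," : "")
        }
        print "]"
    }' > "$out"
}

# Raise the count until a single run takes a few seconds on your machine.
make_channels 200000 big.json

# Then benchmark as in the question, for example:
# hyperfine --warmup 3 'jq "[.[].category] | sort" big.json' 'jq ".[].category" big.json | sort'
```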

jq's sort operation is not specialised to strings, so it is also fundamentally doing more work with every comparison.

– Michael Homer
Commented Jun 26, 2021 at 9:28
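To illustrate the point: jq's `sort` has to order arbitrary JSON values, not only strings. The jq manual documents the ordering as null, false, true, numbers, strings, arrays, objects. A small sketch on made-up data:

```shell
# jq's comparison must first consider the type of each value, and only
# then compare two strings; GNU sort only ever compares lines of text.
mixed=$(jq -cn '["b", 3, null, true, "a", false, [1], {"k": 1}] | sort')
echo "$mixed"
```

On a homogeneous array of strings the result matches a plain string sort, but each comparison still goes through the generic path.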
I reran your benchmarks with much more data and can confirm your numbers. The difference in timing is probably due to the tools using different sorting algorithms. Running the sort utility with --debug on my OpenBSD system indicates that it's using radix sort.

– Kusalananda ♦
Commented Jun 26, 2021 at 9:29

Just to say that the OpenBSD sort allows for specifying --qsort. Running sort --qsort | uniq (-u can't be used with --qsort), I get exactly the same timings as with sort | unique in jq.

– Kusalananda ♦
Commented Jun 26, 2021 at 10:07
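Splitting `sort -u` into `sort | uniq`, as described above, should not change the output when whole lines are compared. A quick sketch on made-up sample lines, using only POSIX `sort` and `uniq` (the OpenBSD-specific `--qsort` flag is left out, since GNU sort does not have it):

```shell
# Sample input with duplicates (made up for illustration).
sample=$(printf '%s\n' '"news"' '"sport"' '"news"' '"music"' '"sport"')

# LC_ALL=C forces plain byte-wise comparison, so both pipelines order
# and deduplicate lines the same way.
one_step=$(printf '%s\n' "$sample" | LC_ALL=C sort -u)
two_step=$(printf '%s\n' "$sample" | LC_ALL=C sort | uniq)

[ "$one_step" = "$two_step" ] && echo "same output"
```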

@Zeta.Investigator You can't really draw any conclusions from benchmarks that run in sub-second time. Use data that requires at least a few seconds (ideally 10 seconds or more) to sort. If you don't have that amount of data, then I'd say that this whole question is a bit irrelevant.

– Kusalananda ♦
Commented Nov 20, 2024 at 6:00
