I'm afraid your assumption is wrong. The version with a critical section does produce a correct answer - well at least not a deterministic answer.
For simplicity take the case where get_key always returns 0. The serial version would copy the array, the parallel one would perform a arbitrary reshuffle. There is an ordering dependency between all iterations in which get_key returns the same value.
Generally speaking. Simple critical sections can often be replaced by a reduction which allows a independent execution while incurring some merge overhead after the parallel part. Atomics can also be an option for simple operations, but they also suffer from a general performance penalty and often additional negative cache issues. Technically your incorrect critical section code would be equivalent to this slightly more efficient atomic code:
int index;
#pragma omp atomic capture
index = count[key]++;
#pragma omp atomic write
toArray[index] = fromArray[i];
I wonder if there is a way to parallelize it with good performance?
Any question about performance needs more specific information. What are the involved types, data sizes, parallelism level, ...? There is no general answer to "this is the best way for performance".