Just for the record, vectorize_width counts in elements, not bytes, so 8 is 8x uint32_t = one 256-bit YMM vector. interleave_count(4) is just an unroll count of that many logical vectors.
TL:DR: bumping up the vectorize_width() beyond the HW / asm vector register width it's willing to use is effectively just a way to make it unroll more the way it already unrolls. At least for simple cases; I'd be worried about it making inefficient asm if it had to widen or narrow elements, like if you were using a uint8_t[] array with a uint32_t[] array.
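For reference, a minimal sketch of the simple same-width case I'm talking about (uint32_t arrays; the names and compile command are placeholders, not from the question):

```c++
#include <stdint.h>
#include <stddef.h>

// vectorize_width(8): 8 x uint32_t = one 256-bit YMM vector.
// interleave_count(4): unroll by 4 of those logical vectors.
// Compile with e.g.  clang++ -O3 -march=haswell -Rpass=loop-vectorize
// to see clang's report of the width / interleave count it actually chose.
void add_arrays(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
#pragma clang loop vectorize_width(8) interleave_count(4)
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```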
Out-of-order exec can already interleave independent work across loop iterations, and clang already likes to unroll tiny loops by 2 or 4 vectors, depending on how tiny they are, and sometimes even 8 with some -mtune settings. (Clang's unrolling also interleaves, doing 4 loads, then 4 vpaddd ymm, ymm, [mem], then 4 stores, rather than 4x load/add/store. Which might matter on an in-order CPU like a low-power ARM Cortex-A53 efficiency core.)
Bumping up to vectorize_width(64), so one logical "vector" takes 8x 32-byte (8-element) vector registers, I think it's seeing that the loop is already big enough with one "64-element vector" per iteration (8 instructions each for the loads, load+adds, and stores) and deciding not to unroll to a multiple of that amount of work¹. Thus interleave=1, for a total unroll factor in the asm of 8, exactly the same as vectorize_width(8) with interleave_count(8).
When asking for "vectors" wider than the target HW supports, the chunks of that logical vector are just more independent work, so it produces about the same asm as a higher unroll count would, at least for this very simple problem where input and output element widths are the same so it doesn't need to invent any shuffles.
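A sketch of that, reusing the placeholder loop from above (same headers):

```c++
// Asking for 64-element "vectors" of uint32_t when it's only willing to use
// 256-bit YMM: each logical vector is 8 YMM registers' worth of independent
// work, and clang reports an interleaved count of 1, i.e. a total unroll of
// 8 YMM vectors per iteration of the asm loop.
void add_arrays_wide(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
#pragma clang loop vectorize_width(64)
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```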
I guess this could be useful as a way to get it to unroll a loop more than it would with the current -mtune= options implied by -march=core-avx2 or, better², -march=haswell (the first Intel "Core" CPU with AVX2). But normally clang's default amount of unrolling is generous enough.
It might be more relevant in a reduction (like a sum of an array or a dot product), where there is a data dependency across loop iterations. In that case, unrolling with more vector registers really does interleave more chains of work in ways out-of-order exec can't do for you: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
Clang does already unroll with multiple accumulators for associative math (integers, or FP with -ffast-math or #pragma omp simd reduction (+:my_sum)), but a hot loop might benefit from more unrolling than it does by default; without profile-guided optimization, it doesn't want to spend too much code size on loops that might not be hot or might typically be run with fairly small n.
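For example, a sketch with an integer sum (unsigned addition is associative, so no -ffast-math needed); the function name and counts are just placeholders:

```c++
// clang vectorizes this reduction and can use multiple vector accumulators.
// A larger interleave count is a way to ask for more independent dependency
// chains (more accumulator registers) than it would pick by default.
uint32_t sum_u32(const uint32_t *a, size_t n)
{
    uint32_t sum = 0;
#pragma clang loop vectorize_width(8) interleave_count(8)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```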
If you compile with -march=x86-64-v4 (which includes AVX-512), even asking for 16-element vectors doesn't get it to use 64-byte ZMM vectors, unfortunately³. For that you want -mprefer-vector-width=512. Or -march=znver4, which implies -mtune=znver4: Zen 4 has no downside for using 512-bit vectors (because they're actually double-pumped 256-bit ALUs), unlike Intel, so compilers will freely use them when tuning for it.
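A sketch of that interaction, with the flag combinations from above as comments (same placeholder loop as before):

```c++
// Even with AVX-512 available, the pragma alone doesn't widen the asm vectors:
//   clang -O3 -march=x86-64-v4                             -> 256-bit YMM asm
//   clang -O3 -march=x86-64-v4 -mprefer-vector-width=512   -> 512-bit ZMM asm
//   clang -O3 -march=znver4                                -> ZMM (Zen 4 tuning
//                                                             freely uses 512-bit)
void add_arrays_zmm(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
#pragma clang loop vectorize_width(16)  // 16 x uint32_t = one 64-byte ZMM, if tuning allows it
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```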
#pragma clang loop vectorize_width(64) can reduce the vector width used in the asm from the -mtune default, down to scalar if you use 1, or down to XMM if you use 4 for 4-byte elements. (Or 16 for 1-byte elements.) With a width of 2, it uses vmovq 64-bit loads/stores on XMM registers, fortunately not MMX!
vectorize_width(1) could perhaps be useful to stop a compiler from vectorizing a cleanup loop after a manually-vectorized loop (with intrinsics), if it can't already see that the iteration count is 0..3 or something. But it might still want to make unrolled scalar code, so that might not help. As always, check the asm. (And often there are ways of making the cleanup loop's trip-count more obviously a small number, like deriving it from n & 3 instead of just resuming iteration with the i from the manually-vectorized loop, like for ( ; i < n ; i++ ).)
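A sketch of what I mean, using 8-element AVX2 vectors for the main loop (so the leftover count is n & 7 rather than n & 3); all names are placeholders:

```c++
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void add_arrays_manual(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
    // Manually-vectorized main loop: 8 x uint32_t per 256-bit YMM vector.
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }

    // Cleanup: the trip count is visibly 0..7 because it's derived from n & 7.
    // vectorize_width(1) asks clang not to auto-vectorize it; check the asm to
    // see whether it still unrolls it as scalar.
    size_t leftover = n & 7;
#pragma clang loop vectorize_width(1)
    for (size_t j = n - leftover; j < n; j++)
        dst[j] = a[j] + b[j];
}
```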
Footnote 1: unroll choices with AVX-512 for 256 or 512-bit regs
With -march=znver4 (or -march=icelake-client -mprefer-vector-width=512) so it will use 64-byte ZMM registers (16-element for uint32_t), vectorize_width(64) does get it to unroll by a total of 16 ZMM vectors. That's 4x ZMM for each of the "64 element vectors" we asked for, and it's choosing to unroll by 4 because it thinks the loop is still small?
Godbolt with Clang 17 for znver4 or -march=x86-64-v4 -mprefer-vector-width=512:
vectorized loop (vectorization width: 64, interleaved count: 4)
AVX-512 makes 32 vector regs available, but I don't think it was worried about using up all 16 YMM registers anyway; with just -march=x86-64-v4, or another option that allows AVX-512 but prefers 256-bit vector width, we get "vectorization width: 64", "interleaved count: 1", i.e. unroll by 8x YMM. This is still more unrolling than its default of 4 vectors (of YMM or ZMM width, depending on tuning).
Footnote 2: -march= strings: core-avx2 is an obsolete way to specify Haswell, Skylake, etc.
Those old arch strings like core-whatever are pretty clunky and unclear, since Intel made many generations of CPU with the same "core" naming; avoid them. Use a newer clang that understands -march=x86-64-v3 if you want a CPU-tuning-neutral AVX2+FMA+BMI2 microarchitecture feature level, or use -march=skylake, -march=znver3, or -march=icelake-client -mno-avx512f or whatever to optimize for a specific CPU as well as enabling everything it has. Or -march=x86-64-v3 -mtune=skylake. For Skylake-family, see also How can I mitigate the impact of the Intel jcc erratum on gcc? (that workaround isn't enabled by default as part of -mtune=skylake).
AFAIK, there's no clear definition of what -mtune is implied by -march=core-avx2, like is that supposed to be all Haswell-and-later CPUs with "core" in their name, or is it specifically Haswell? If LLVM's optimizer does know a difference between Haswell and Skylake or Ice Lake (e.g. like that popcnt's false output-dependency is fixed in Ice Lake, and same for lzcnt/tzcnt in Skylake), then you'd rather specify a specific CPU.
GCC at least doesn't have tuning settings for Generic-CPUs-with-AVX2. -march=x86-64-v3 leaves -mtune=generic, which fortunately has stopped catering to first-gen Sandybridge, so it doesn't split 32-byte vector loads/stores that it can't prove must be aligned. (That splitting was worse for later CPUs, especially if your data was aligned all or most of the time but you hadn't jumped through hoops to promise that to the compiler.) It would be good if compilers did have tune options that leave out workarounds for CPUs which couldn't run the asm we're generating anyway (because they lack the required ISA extensions), instead of only offering either a specific CPU or pure generic.
(-mtune=generic is always a moving target that changes with compiler version as old CPUs become sufficiently obsolete that we stop working around their performance potholes, especially for things that aren't total showstoppers. And as new CPUs are released with their own quirks.)
Footnote 3: Interaction with AVX-512 256 vs. 512-bit vector-width tuning choices
It might be nice if there were a per-loop way to override that tuning preference, for a program that has phases of sustained heavy-duty work on mostly-aligned data, where 512-bit vectors are worth the penalties on Intel: reduced turbo clock speed (significant on older Intel CPUs, negligible on Sapphire Rapids) and port 1's vector ALUs being shut down.
There might be a way to influence auto-vectorization if per-function tune options are a thing, but #pragma clang loop vectorize_width(16) isn't it. Compiling separate files with different options (and without -flto) can work, but then you don't get -flto's cross-file inlining and link-time optimization.