I'm writing code that needs to find the advance of each grapheme in the original text (e.g., for cursor positioning and selection), and I would like to use the Unicode grapheme boundaries rather than HarfBuzz's shaping boundaries. It seems cluster IDs are necessary for this.
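To make the goal concrete, here is a minimal sketch of the per-grapheme advance computation I have in mind. It does not call HarfBuzz at all: the `(cluster, x_advance)` pairs are hypothetical stand-ins for the `cluster` field of `hb_glyph_info_t` and the `x_advance` field of `hb_glyph_position_t`, and the numbers are made up. It also assumes each glyph's cluster value identifies exactly one grapheme, which may not hold if the shaper merges clusters (for a ligature, say).

```python
from collections import defaultdict


def advances_by_cluster(glyphs):
    """Sum the x-advances of all glyphs that share a cluster value.

    `glyphs` is a list of (cluster, x_advance) pairs, standing in for the
    shaped output (hypothetical data, not read from a real hb_buffer).
    """
    totals = defaultdict(int)
    for cluster, x_advance in glyphs:
        totals[cluster] += x_advance
    return dict(totals)


# Made-up shaped output: cluster 1 consists of a base glyph plus a
# zero-advance mark glyph, so both contribute to the same grapheme.
shaped = [(0, 600), (1, 550), (1, 0), (2, 620)]
print(advances_by_cluster(shaped))
```

A grapheme whose cluster value never appears in the output would simply be missing from the result, so the caller would have to decide how to treat it.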
I assume that I need to specify cluster IDs manually in the hb_buffer before calling hb_shape(), and that I have a choice about how to do this. (What happens if I don't write these fields?) I think it would make the most sense to give all code points within a grapheme the same cluster ID, so that they are guaranteed to have the same cluster ID in the shaped output. If the cluster ID is just the grapheme's index in the (conceptual) list of graphemes, then I can easily go backwards from a glyph to a string position, since I'm keeping an array of all the graphemes' UTF-8 offsets.
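Here is a rough sketch of the scheme described above, again without calling HarfBuzz. The function name and the sample data are hypothetical, and the grapheme start offsets are assumed to come from a separate Unicode segmenter. My understanding is that each resulting pair would be passed to `hb_buffer_add(buffer, codepoint, cluster)`, but that is exactly the usage I'm unsure about.

```python
import bisect


def codepoints_with_clusters(text, grapheme_byte_starts):
    """Pair each code point with the index of the grapheme containing it.

    `grapheme_byte_starts` is a sorted list of UTF-8 byte offsets at which
    graphemes begin (assumed to be computed elsewhere).
    """
    pairs = []
    byte_offset = 0
    for ch in text:
        # Index of the last grapheme start that is <= this code point's offset.
        cluster = bisect.bisect_right(grapheme_byte_starts, byte_offset) - 1
        pairs.append((ord(ch), cluster))
        byte_offset += len(ch.encode("utf-8"))
    return pairs


text = "e\u0301a"              # "e" + combining acute accent, then "a"
grapheme_byte_starts = [0, 3]  # assumed segmenter output: two graphemes
pairs = codepoints_with_clusters(text, grapheme_byte_starts)

# Going backwards: a glyph's cluster value indexes straight into the
# array of grapheme offsets.
offset_of_second_grapheme = grapheme_byte_starts[pairs[2][1]]
```

Both code points of the first grapheme get cluster 0 here, which is the contiguous duplication I'm asking about.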
My problem is that the documentation is not clear on whether it's acceptable to give contiguous code points in the Unicode input the same cluster ID. I can't find any example that does this (or really any example at all that shows setting cluster IDs).
Is this permitted? Is this the best way to accomplish this?