If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

Question

The problem I have is that given a sequence of bytes, I want to determine its longest prefix which forms a valid Unicode character (extended grapheme cluster) assuming UTF8 encoding.

I am using Swift, so I would like to use Swift's built-in function(s) to do so. But these functions only decode a complete sequence of bytes. So I was thinking to convert prefixes of the byte sequence via Swift and take the last prefix that didn't fail and consists of 1 character only. Obviously, this might lead to trying out the entire sequence of bytes, which I want to avoid. A solution would be to stop trying out prefixes after 4 prefixes in a row failed. If the property asked in my question holds, this would then guarantee that all longer prefixes must also fail.

I find the Unicode Text Segmentation Standard unreadable, otherwise I would try to directly implement boundary detection of extended grapheme clusters...

"implement boundary detection of extended grapheme clusters." haha you don't want to do that. That way lies madness. Honestly using init?(validatingUTF8:) to check incrementally longer prefixes isn't a bad idea. While extended grapheme clusters can get pathologically big (try copying this answer into a unicode inspector lol), the vast majority of them are tiny. — Alexander
– Alexander, Commented Apr 19, 2020 at 22:14
I definitely don't want to do that :-) But that's why I am asking the question! I like my code to have predictable performance even in an error case. — Steven Obua
– Steven Obua, Commented Apr 19, 2020 at 22:17
It looks like the perfect solution would a function like findLongestValidCharacter(in input: [UnicodeScalar]) -> Character. Even if such a function existed, Pathologically long characters would still be an issue. The "bad" cases would be less bad given that the function could be O(input.count) rather than the O(input.count^2) of the brute force approach, but that can still get bad. — Alexander
– Alexander, Commented Apr 19, 2020 at 22:21
You could try spelunking into the standard library implementation, and see how it does its cluster breaking: github.com/apple/swift/blob/… — Alexander
– Alexander, Commented Apr 19, 2020 at 22:23
I am not so much worried about actual valid characters that have very long code point sequences. If that happens, oh well. But in my scenario it can very well be that the byte sequence is not a UTF-8 string at all. If that byte sequence is 80GB long, I don't want to check it all to find out that it isn't valid. — Steven Obua
– Steven Obua, Commented Apr 19, 2020 at 22:25

Steven Obua · Accepted Answer · 2020-04-20 08:21:51Z

After taking a long hard look at the specification for computing the boundaries for extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules, it is obvious that the rules for EGCs all have the shape of describing when it is allowed to append a code point to an existing EGC to form a longer EGC. From that fact alone my two questions follow: 1) Yes, every non-empty prefix of code points which form an EGC is also an EGC. 2) No, by adding a code point to a valid Unicode string you will not decrease its length in terms of number of EGCs it consists of.

So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):

    func lex<S : Sequence>(_ input : S) -> (length : Int, out: Character)? where S.Element == UInt8 {
        // This code works under three assumptions, all of which are true:
        // 1) If a sequence of codepoints does not form a valid character, then appending codepoints to it does not yield a valid character
        // 2) Appending codepoints to a sequence of codepoints does not decrease its length in terms of extended grapheme clusters
        // 3) a codepoint takes up at most 4 bytes in an UTF8 encoding
        var chars : [UInt8] = []
        var result : String = ""
        var resultLength = 0
        func value() -> (length : Int, out : Character)? {
            guard let character = result.first else { return nil }
            return (length: resultLength, out: character)
        }
        var length = 0
        var iterator = input.makeIterator()
        while length - resultLength <= 4 {
            guard let char = iterator.next() else { return value() }
            chars.append(char)
            length += 1
            guard let s = String(bytes: chars, encoding: .utf8) else { continue }
            guard s.count == 1 else { return value() }
            result = s
            resultLength = length
        }
        return value()
    }

Collectives™ on Stack Overflow

If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related