glib iconv() - force conversion to single bytes

Question

I have my own personal movie database system, within which context I NEVER want to see "extended" characters (with accents, umlauts, etc.) in any text fields.

MS Copilot tells me that I could use something based on...

iconv_t cd = iconv_open("ASCII//TRANSLIT", "ISO-8859-1");
...
size_t result = iconv(cd, &inptr, &inbytesleft, &outptr, &outbytesleft);

...to reliably convert anything I get back from API calls to https://www.omdbapi.com and themoviedb.org into "nearest equivalent" ASCII characters, but it also tells me there's NO STANDARD WAY of forcing "single byte in = single byte out". So if the input happens to contain the SINGLE BYTE 'ß' (Eszett or sharp S) then iconv() may convert it to TWO BYTES ("ss").

I find this hard to believe. So before I go to the trouble of writing my own logic to convert my text byte-by-byte (replacing any multi-byte output with some 'special' placeholder char), I thought I'd ask here.

Is there a standard way to reduce every "extended" character (a single byte with the high bit, 0x80, set) to its "nearest equivalent" ASCII char (i.e. one WITHOUT the high bit set)?

In my context, fixed text length is more important than "accuracy", so just "s" would be better than "ss" for 'ß'.

Are you coding in C? C++? What would you want ß to convert to?
– Dan Getz, Commented Jul 19 at 22:49
In my context, fixed text length is more important than "accuracy" (I'm English, so I don't much care about non-English orthography). But just s is fine for ß if that's the best I can do.
– FumbleFingers, Commented Jul 20 at 19:46
man 3 iconv_open confirms what Copilot told you. Do you really need to keep the byte count? Or were you just hoping to avoid writing code to handle E2BIG? To keep the byte count, writing your own function that simply looks up the equivalent in a 256-byte string might be simpler than trying to bend someone else's function to your will.
– Dan Getz, Commented Jul 21 at 0:45
Also might want to double check if your input is really ISO-8859-1, as often text can be in the similar ISO-8859-15 or CP-1252 encodings instead.
– Dan Getz, Commented Jul 21 at 0:54
Doesn't the API use UTF-8, implied by output in XML or JSON format?
– Ian Abbott, Commented Jul 22 at 16:19
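
The lookup-table idea from the comments could be sketched roughly as below. This is only a minimal illustration, assuming the input really is ISO-8859-1 (not UTF-8 or CP-1252). The function name latin1_to_ascii and the table contents are hypothetical choices, not part of iconv or GLib. Bytes 0x80..0x9F (control codes in ISO-8859-1) and symbols with no obvious stand-in become '?'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: one ASCII stand-in per ISO-8859-1 byte 0xA0..0xFF. */
static const char latin1_tab[96 + 1] =
    " !cL?Y|S\"ca<--r-"   /* 0xA0..0xAF: NBSP, punctuation, symbols */
    "o+23'uP.,1o>????"    /* 0xB0..0xBF: symbols, fractions, inverted ? */
    "AAAAAAACEEEEIIII"    /* 0xC0..0xCF */
    "DNOOOOOxOUUUUYTs"    /* 0xD0..0xDF: 0xDF is sharp s, mapped to 's' */
    "aaaaaaaceeeeiiii"    /* 0xE0..0xEF */
    "dnooooo/ouuuuyty";   /* 0xF0..0xFF */

/* Rewrites a NUL-terminated ISO-8859-1 string in place.
   Every input byte yields exactly one output byte, so the length is unchanged. */
void latin1_to_ascii(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 0xA0)
            s[i] = latin1_tab[c - 0xA0];  /* accented letter or symbol */
        else if (c >= 0x80)
            s[i] = '?';                   /* C1 control range, no equivalent */
        /* bytes below 0x80 are already ASCII and are left alone */
    }
}
```

With this table, the ISO-8859-1 bytes for "Café Straße" come out as "Cafe Strase", the same 11 bytes long. If the data turns out to be UTF-8, as the last comment suggests, it would need decoding first, because each accented character is then more than one byte.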

MSalters · Accepted Answer · 2025-07-23 14:54:43Z

No. The paradox you'd run into is that the conversion you imagine makes zero sense for the people (mostly non-English Europeans) who use ISO 8859-1. "Nearest ASCII character" doesn't make much sense to a German, for instance: the nearest equivalent to ü is ue. (But in Dutch it's just u. This is locale-specific, and 8859-1 is not tied to one locale.)

2 Comments

FumbleFingers Jul 23 at 15:54

Well, I did say it's for my own personal movie database system, in which context I don't care in the least what sense it makes to German or Dutch people. But equally, as an Anglophone, I don't approve of using accents and other non-ASCII characters in any English text or names. So I'm heartened to see that IMDB is coming round to my position here, with things like Cafe Society Original title: Café Society (2016).

MSalters Jul 23 at 19:03

That's a slightly naïve approach though 😳. The usual approach here is Unicode decomposition followed by dropping the now-isolated accents. That doesn't work for ß though, since it has no decomposition.
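
The "drop the isolated accents" step might look something like the sketch below. It assumes the text is UTF-8 and has already been decomposed (NFD) by a Unicode-aware library, for example GLib's g_utf8_normalize() with G_NORMALIZE_NFD. The decomposition itself is not shown, and the function name strip_combining_marks is made up for illustration. It only removes marks from the Combining Diacritical Marks block (U+0300..U+036F), which covers the usual Western European accents.

```c
#include <assert.h>
#include <stddef.h>

/* Removes combining marks U+0300..U+036F from a NUL-terminated, already
   decomposed (NFD) UTF-8 string, in place. Returns the new byte length. */
size_t strip_combining_marks(char *s)
{
    assert(s != NULL);
    const unsigned char *in = (const unsigned char *)s;
    char *out = s;

    while (*in != '\0') {
        /* In UTF-8, U+0300..U+033F is CC 80..BF and U+0340..U+036F is CD 80..AF.
           Reading in[1] is safe: in[0] is non-NUL, so in[1] is at worst the terminator. */
        if ((in[0] == 0xCC && in[1] >= 0x80 && in[1] <= 0xBF) ||
            (in[0] == 0xCD && in[1] >= 0x80 && in[1] <= 0xAF)) {
            in += 2;            /* skip the two-byte combining mark */
            continue;
        }
        *out++ = (char)*in++;   /* copy every other byte unchanged */
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

A decomposed "é" ('e' followed by U+0301) is reduced to a plain 'e', while ß (bytes C3 9F) passes through untouched, so it would still need its own special case.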