I have my own personal movie database system, within which context I NEVER want to see "extended" characters (with accents, umlauts, etc.) in any text fields.

MS Copilot tells me that I could use something based on...

iconv_t cd = iconv_open("ASCII//TRANSLIT", "ISO-8859-1");
...
size_t result = iconv(cd, &inptr, &inbytesleft, &outptr, &outbytesleft);

...to reliably convert anything I get back from API calls to https://www.omdbapi.com and themoviedb.org into "nearest equivalent" ASCII characters, but it also tells me there's NO STANDARD WAY of forcing "single byte in = single byte out". So if the input happens to contain the SINGLE BYTE 'ß' (Eszett or sharp S) then iconv() may convert it to TWO BYTES ("ss").

I find this hard to believe. So before I go to the trouble of writing my own logic to convert my text byte by byte (replacing any multi-byte output with some 'special' char), I thought I'd ask here.

Is there a standard way to reduce every "extended" character (a single byte with the high bit, 0x80, set) to its "nearest equivalent" ASCII char (i.e. WITHOUT the high bit set)?

In my context, fixed text length is more important than "accuracy", so just "s" would be better than "ss" for 'ß'.

  • Are you coding in C? C++? What would you want ß to convert to? Commented Jul 19 at 22:49
  • In my context, fixed text length is more important than "accuracy" (I'm English, so I don't much care about non-English orthography). But just s is fine for ß if that's the best I can do. Commented Jul 20 at 19:46
  • man 3 iconv_open confirms what Copilot told you. Do you really need to keep the byte count? Or were you just hoping to avoid writing code to handle E2BIG? To keep the byte count, writing your own function that simply looks up the equivalent in a 256-byte string might be simpler than trying to bend someone else's function to your will. Commented Jul 21 at 0:45
  • Also might want to double check if your input is really ISO-8859-1, as often text can be in the similar ISO-8859-15 or CP-1252 encodings instead. Commented Jul 21 at 0:54
  • Doesn't the API use UTF-8, implied by output in XML or JSON format? Commented Jul 22 at 16:19
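
The lookup idea from the comments above can be sketched in C roughly as follows. This is only an illustration, assuming the input really is ISO-8859-1: the names latin1_hi, latin1_to_ascii and latin1_fold are made up for this example, and the replacement chosen for each byte is a matter of taste rather than any standard. Bytes 0x80..0x9F are control codes in ISO-8859-1 (CP-1252 puts printable characters there), so the sketch simply turns them into '?'.

```c
#include <stddef.h>

/* Hypothetical one-byte replacements for ISO-8859-1 bytes 0xA0..0xFF,
   16 per row. The choices are arbitrary, e.g. 0xDF (sharp s) becomes 's'. */
const char latin1_hi[] =
    " !cL?Y|S\"ca<--r-"   /* 0xA0..0xAF: NBSP, punctuation, symbols */
    "o+23'uP.,1o>????"    /* 0xB0..0xBF: symbols, fractions -> '?'  */
    "AAAAAAACEEEEIIII"    /* 0xC0..0xCF */
    "DNOOOOOxOUUUUYTs"    /* 0xD0..0xDF */
    "aaaaaaaceeeeiiii"    /* 0xE0..0xEF */
    "dnooooo/ouuuuyty";   /* 0xF0..0xFF */

/* Map one ISO-8859-1 byte to exactly one ASCII byte. */
char latin1_to_ascii(unsigned char c)
{
    if (c < 0x80)
        return (char)c;           /* already ASCII */
    if (c < 0xA0)
        return '?';               /* C1 control codes: no sensible equivalent */
    return latin1_hi[c - 0xA0];
}

/* Fold a buffer in place; the length never changes. */
void latin1_fold(char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        s[i] = latin1_to_ascii((unsigned char)s[i]);
}
```

Called as latin1_fold(buf, strlen(buf)), this rewrites the buffer in place, so every field keeps its original byte count. It will not help if the API actually returns UTF-8, because each accented letter then arrives as two or more bytes.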

1 Answer

No. The paradox you'd run into is that the conversion you imagine makes zero sense for the people (mostly non-English Europeans) who use ISO 8859-1. "Nearest ASCII character" doesn't make much sense to a German, for instance: the nearest to ü is ue. But in Dutch, it's just u. This is locale-specific, and 8859-1 is not tied to one locale.

2 Comments

Well, I did say it's for my own personal movie database system, in which context I don't care in the least what sense it makes to German or Dutch people. But equally, as an Anglophone, I don't approve of using accents and other non-ASCII characters in any English text or names. So I'm heartened to see that IMDB is coming round to my position here, with things like Cafe Society Original title: Café Society (2016).
That's a slightly naïve approach though 😳. The usual approach here is Unicode decomposition followed by dropping the now isolated accents. Doesn't work for ß though.
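
If the APIs return UTF-8, as one of the question comments suggests, a variant working on code points rather than bytes might look like the sketch below. Standard C has no Unicode decomposition routine, so this is not real decomposition: it swaps in a small hand-written table for the precomposed letters U+00C0..U+00FF (which also covers ß), and only drops combining accents U+0300..U+036F when the input happens to be decomposed already. The names latin_letters, utf8_decode and utf8_fold are hypothetical, and the decoder is deliberately minimal (it does not reject overlong encodings).

```c
#include <stddef.h>

/* Hypothetical one-char replacements for code points U+00C0..U+00FF. */
const char latin_letters[] =
    "AAAAAAACEEEEIIII"    /* U+00C0..U+00CF */
    "DNOOOOOxOUUUUYTs"    /* U+00D0..U+00DF */
    "aaaaaaaceeeeiiii"    /* U+00E0..U+00EF */
    "dnooooo/ouuuuyty";   /* U+00F0..U+00FF */

/* Decode one UTF-8 sequence at s into *cp and return the bytes consumed.
   Malformed input yields U+FFFD and consumes a single byte. */
size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else { *cp = 0xFFFD; return 1; }
    for (i = 1; i < len; i++) {
        /* A NUL terminator fails this check, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Write one ASCII byte per input character into out (NUL-terminated,
   at most outsz bytes including the terminator). Returns bytes written. */
size_t utf8_fold(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;
    if (outsz == 0)
        return 0;
    while (*p && n + 1 < outsz) {
        unsigned long cp;
        p += utf8_decode(p, &cp);
        if (cp >= 0x0300 && cp <= 0x036F)
            continue;                         /* isolated accent: drop it */
        if (cp < 0x80)
            out[n++] = (char)cp;              /* plain ASCII */
        else if (cp >= 0xC0 && cp <= 0xFF)
            out[n++] = latin_letters[cp - 0xC0];
        else
            out[n++] = '?';                   /* anything else, e.g. emoji */
    }
    out[n] = '\0';
    return n;
}
```

Here the output has one byte per character instead of one byte per input byte, so it is the character count that stays fixed, and ß comes out as a single 's' because the table says so.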
