How to convert a string array with multiple locales to float?

Question

In Python, I have an array with float numbers expressed as strings with multiple locales.

E.g.:

str_array = ['1,234.56', '7.890,12', '123 456,789']

I would like to convert all of them to float, and append to a new array.

float_array = [1234.56, 7890.12, 123456.789]

I was wondering whether there's a concise and elegant way to have the job done in Python. I tried something with the locale module:

import locale
locale.setlocale(locale.LC_NUMERIC, locale='en_EN')
num_string = '1,234.56'
float_num = locale.atof(num_string)
# It prints 1234.56
print(float_num)

But I'm searching for a smart way to handle all of the different locales that can appear inside my input array (potentially, any locale on earth). Can anyone help me?

Thank you!

You need to know about each datum. For example, what would "1,234" mean? Unless you know the locale for that datum, the result will be ambiguous because it might be equivalent to 1234.0 or 1.234 (where dot is the decimal separator).
– jackal, Commented May 28, 2024 at 16:56

Aicody · Accepted Answer · 2024-05-28 17:20:27Z

If there is no other option, you can just do it manually:

import re, string


def _sanatize(arr):
    # Match the trailing run of digits (everything after the last separator),
    # optionally followed by whitespace.
    p = r"[^\s\r\n.,]*\d+\s*$"

    float_array = []
    for s in arr:
        for match in re.finditer(p, s):
            st = match.start()
            gr = match.group()
            # Collect every digit that appears before the trailing group.
            float_num = ''.join(ch for ch in s[:st] if ch in string.digits)
            if float_num:
                float_num += f'.{gr}'
            else:
                # No separator found: the whole string is an integer.
                float_num = gr
            float_array.append(float(float_num))
    return float_array


str_array = ['1, 234.56', '7.890, 12', '123 456, 789', '123435', '123, 134', '1, 231', '1  , 231  ']

print(_sanatize(str_array))

Prints

[1234.56, 7890.12, 123456.789, 123435.0, 123.134, 1.231, 1.231]

`r"[^\s\r\n.,]*\d+\s*$"`

Here we look at the end of the number and save the fractional part in gr.
We then concatenate it to all the digits before it, dropping any spaces, commas, or dots.
Note that if there is no separator, the number is an integer equal to gr.
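To make the capture concrete, here is a minimal sketch that runs the same pattern on single strings. The sample inputs are only illustrative.

```python
import re

p = r"[^\s\r\n.,]*\d+\s*$"

# The pattern is anchored at the end of the string, so it captures only the
# digits after the last separator.
m = re.search(p, '7.890, 12')
print(m.start(), repr(m.group()))  # 7 '12'

# With no separator at all, the whole string is the match.
m = re.search(p, '123435')
print(m.start(), repr(m.group()))  # 0 '123435'
```

Because the last group is always treated as the fractional part, a string such as '1,234' is read as 1.234 with this approach, never as 1234.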

jsbueno · Accepted Answer · 2024-05-28 17:26:35Z

locale is certainly not a one-stop solution for what you need: it is a process-wide state, so once you set a locale before a call, all functions (in all threads) will work with that locale.

Still, it would be possible to dynamically switch the locale before consuming each element in the list - some "flickering" thing - the problem is: there is no "guess the locale for this number" call.
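A minimal sketch of that "flickering" idea, assuming you already know which locale each string uses. The helper names numeric_locale and atof_in are hypothetical, and the example uses the "C" locale only because it is always available; any other locale name must be installed on the system. Since the locale is process-wide, this is not safe to use from several threads at once.

```python
import locale
from contextlib import contextmanager


@contextmanager
def numeric_locale(name):
    """Temporarily switch LC_NUMERIC, restoring the previous value on exit."""
    # Calling setlocale without a locale argument only queries the current one.
    previous = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
        yield
    finally:
        locale.setlocale(locale.LC_NUMERIC, previous)


def atof_in(text, locale_name):
    """Parse text with locale.atof under the given (known) locale."""
    with numeric_locale(locale_name):
        return locale.atof(text)


print(atof_in('1234.56', 'C'))  # 1234.56
```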

And that is actually a problem, so in the end it will be easier to parse the numbers manually (constructing a locale dictionary from the installed locales, or just hard-coding them directly), using some in-code heuristics to try to guess each one. And even then, without some extra information about each number, they are ambiguous. For example, 1.234 can mean "one and two hundred thirty-four thousandths" in one locale or "one thousand, two hundred thirty-four" in another. Only the presence of another dot or comma elsewhere in the number could disambiguate it.

That said, I'd write code to scan each number from right to left: if there are two non-digit symbols in the number, assume the rightmost is the decimal separator, and just ignore the other. If there is one symbol, assume it is the decimal separator, unless it is "." or "," with exactly three digits after it, in which case assume it is a thousands separator. Then attach some "probability" to the ambiguous cases, keeping a record of the possible values, instead of blindly assuming "1234" when one means "1.234" (US locale).

If you have any other information about each number (even the scale they should be constrained to, for example), that might help. Otherwise you will end up with some ambiguous numbers, no matter what.
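A rough sketch of that heuristic, under a few assumptions of my own: inputs are unsigned, any whitespace is always a grouping separator, and a single symbol that repeats (as in '1.234.567') can only be a grouping separator. The function name candidate_values is hypothetical. Instead of a probability, it simply returns every plausible value, so ambiguous inputs yield two candidates.

```python
import re


def candidate_values(text):
    """Return every plausible float for one unsigned formatted number."""
    # Assumption: whitespace (including thin and no-break spaces) is grouping.
    s = re.sub(r'\s', '', text)
    symbols = re.findall(r'\D', s)      # non-digit symbols, in order
    digits_only = re.sub(r'\D', '', s)
    if not symbols:
        return [float(digits_only)]

    last = symbols[-1]
    head, _, tail = s.rpartition(last)
    # Reading 1: the rightmost symbol is the decimal separator.
    as_decimal = float(re.sub(r'\D', '', head) + '.' + tail)
    # Reading 2: every symbol is a grouping separator.
    as_integer = float(digits_only)

    if len(set(symbols)) > 1:
        # Two different symbols: the rightmost one is the decimal separator.
        return [as_decimal]
    if len(symbols) > 1:
        # The same symbol repeated can only be a grouping separator.
        return [as_integer]
    if last in '.,' and len(tail) == 3:
        # Ambiguous, e.g. '1.234': keep both readings.
        return [as_decimal, as_integer]
    return [as_decimal]


for n in ['1,234.56', '7.890,12', '1234,56', '1.234', '1.234.567']:
    print(n, candidate_values(n))
# 1,234.56 [1234.56]
# 7.890,12 [7890.12]
# 1234,56 [1234.56]
# 1.234 [1.234, 1234.0]
# 1.234.567 [1234567.0]
```

Note that '123 456,789' from the question also comes back with two candidates (123456.789 and 123456789.0), since after the space is dropped only one comma with three digits after it remains.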

Andj · Accepted Answer · 2024-06-05 08:00:27Z

It is fairly easy to convert individual formatted numbers to an int or float. Likewise it is possible to convert a series of formatted numbers using the same locale or number formatting.

It becomes challenging when different conventions are being used in the list.

The following code will work if the formatted numbers are string formatted floats. It will fail if some of the numbers are integers.

It is important to note that locale data varies from libc implementation to libc implementation. For instance many macOS locales do not have a thousands separator defined, and will format numbers without a thousands separator, unless you process the number as currency. I've tried to accommodate macOS formatted numbers in the code.

The following code uses a list comprehension to pass each formatted number and its separators to a function to convert it to a float:

import unicodedata as ud
import regex

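def convert_digits(text, sep=(",", ".")):
    # Accept an optional leading sign or "(", digits from any decimal script,
    # common separators, and an optional trailing sign or ")".
    nd = regex.compile(r'^[+(-]?\p{Nd}[,.\u066B\u066C\u0020\u2009\u202F\p{Nd}]*[+)-]?$')
    tsep, dsep = sep
    if nd.match(text):
        # Drop the thousands separator, if there is one.
        if tsep:
            text = text.replace(tsep, "")
        # Map every decimal digit (e.g. Thai, Arabic-Indic) to its ASCII value.
        text = ''.join([str(ud.decimal(c, c)) for c in text])
        # Move a trailing sign to the front.
        if text[-1] in ["-", "+"]:
            text = text[-1] + text[:-1]
        # Accounting notation: "(1234)" means -1234.
        if text[0] == "(" and text[-1] == ")":
            text = "-" + text[1:-1]
        return float(text.replace(dsep, ".")) if dsep != "." else float(text)
    return None

def get_separators(n):
    # Unique non-digit characters, in order of first appearance.
    t = tuple(dict.fromkeys(regex.sub(r'\d+', '', n)))
    # Strip sign and parenthesis markers from either end.
    if t[0] in ["-", "+", "("]:
        t = t[1:]
    if t[-1] in ["-", "+", ")"]:
        t = t[:-1]
    # A single separator is assumed to be the decimal separator.
    if len(t) == 1:
        t = ("", t[0])
    return t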

numbers = ['1,234.56', '7.890,12', '123 456,789', '1234,56', '1234.56']
result = [convert_digits(n, sep=get_separators(n)) for n in numbers]
print(result)
# [1234.56, 7890.12, 123456.789, 1234.56, 1234.56]

But ideally it is better to track the locale of each segment of data, and process each accordingly.
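As a sketch of what tracking the locale could look like, assume each string arrives paired with a locale tag. The SEPARATORS table and the parse_tagged name are hypothetical, and the separators are hard-coded for illustration rather than read from the system's locale data.

```python
# Hypothetical, hard-coded table: locale tag -> (thousands sep, decimal sep).
SEPARATORS = {
    'en_US': (',', '.'),
    'de_DE': ('.', ','),
    'fr_FR': ('', ','),  # grouping spaces are stripped separately below
}


def parse_tagged(text, locale_tag):
    """Parse one formatted number whose locale is already known."""
    tsep, dsep = SEPARATORS[locale_tag]
    # Remove every kind of whitespace (plain, thin, and no-break spaces).
    cleaned = ''.join(text.split())
    if tsep:
        cleaned = cleaned.replace(tsep, '')
    return float(cleaned.replace(dsep, '.'))


tagged = [('1,234.56', 'en_US'), ('7.890,12', 'de_DE'),
          ('123 456,789', 'fr_FR'), ('1,234', 'en_US'), ('1,234', 'de_DE')]
print([parse_tagged(text, tag) for text, tag in tagged])
# [1234.56, 7890.12, 123456.789, 1234.0, 1.234]
```

With the tag available, the otherwise ambiguous '1,234' resolves to 1234.0 for en_US and 1.234 for de_DE.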

One benefit of the above code is that it will work with other decimal number systems:

numbers2 = ['-1.234,56', '123 456,789', '๓๔.๕๕', '٣٫١٤١٥٩٢٦٥٣٥٨']
result2 = [convert_digits(n, sep=get_separators(n)) for n in numbers2]
print(result2)
# [-1234.56, 123456.789, 34.55, 3.14159265358]

It should also be able to handle signed numbers:

numbers3 = ['(1.234,56)', '-1.234,56', '+1.234,56', '1.234,56-']
result3 = [convert_digits(n, sep=get_separators(n)) for n in numbers3]
print(result3)
# [-1234.56, -1234.56, 1234.56, -1234.56]
