1

I need a function that takes a (non-binary) string as input and returns a numpy array.

NumPy provides numpy.fromstring, which works for valid input (given the proper arguments):

>>> np.fromstring('1 2 3.1415', dtype=float, sep=' ')
array([ 1.    ,  2.    ,  3.1415])

My problem is that it works in too many cases. For example, in the following case it fails silently:

>>> np.fromstring('not a string', dtype=float, sep=' ')
array([], dtype=float64)

Is there a way to safely convert non-binary strings to numpy arrays that properly throws an error if the input cannot be converted to numbers?

3 Answers

2

You can work with the string directly and convert it to a NumPy array using split and np.array, like this:

>>> np.array('1 2 3.1415'.split(' '), dtype=float)
array([ 1.    ,  2.    ,  3.1415])
>>> np.array('not a string'.split(' '), dtype=float)
ValueError: could not convert string to float: not
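
Wrapped up as a reusable function, the same idea might look like the sketch below. The name parse_floats is hypothetical, and two choices are my own assumptions rather than part of the snippet above: split() is called with no argument so that leading, trailing and repeated whitespace is tolerated, and an empty input is treated as an error.

```python
import numpy as np

def parse_floats(s):  # hypothetical helper name
    # split() with no argument splits on any run of whitespace
    tokens = s.split()
    if not tokens:
        # assumption: an empty or all-whitespace string counts as invalid input
        raise ValueError('no numbers found in input')
    # np.array raises ValueError if any token cannot be converted to float
    return np.array(tokens, dtype=float)
```

Because the conversion is delegated to np.array, a string such as '5 not a number' raises ValueError instead of being partially parsed.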

fromstring stops at the first token it cannot convert, so if your input string does not begin with real-valued data you should expect an empty array rather than an error.

>>> np.fromstring('not a string', dtype=float, sep=' ')
array([], dtype=float64)
>>> np.fromstring('not a string 5', dtype=float, sep=' ')
array([], dtype=float64)
>>> np.fromstring('8 5', dtype=float, sep=' ')
array([ 8.,  5.])

EDIT: You can implement your own fromstring that first verifies the format of input_string. If it matches the pattern you are looking for (in your case, all floats), convert it to a numpy.array. On failure, either raise an exception explicitly or return an empty array.

In [1]: import re
In [2]: import numpy as np
In [3]: def my_fromstring(input_string):
...:     # split() with no argument ignores leading, trailing and repeated whitespace
...:     tokens = input_string.split()
...:     # unsigned integer or decimal; fullmatch rejects partial matches such as '5abc'
...:     float_pattern = r'\d+(?:\.\d+)?'
...:     if tokens and all(re.fullmatch(float_pattern, t) for t in tokens):
...:         return np.array(tokens, dtype=float)
...:     else:
...:         raise ValueError('Incorrect input format')
...:     

You can now use your custom function to check:

In [4]: my_fromstring(' 7 5      8  3  ')
Out[4]: array([ 7.,  5.,  8.,  3.])

In [5]: my_fromstring('not a string')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-88cd38f7ad26> in <module>()
----> 1 my_fromstring('not a string')

<ipython-input-65-e355cf28acb0> in my_fromstring(input_string)
      7         return np.array(tokens, dtype=float)
      8     else:
----> 9         raise ValueError('Incorrect input format')
     10 

ValueError: Incorrect input format

5 Comments

Try np.fromstring(' not a string', dtype=float, sep=' '); this returns array([-1.]).
If you know your input data format, you can apply strip() first and still get the desired output.
np.array(s.split(), dtype=float) will throw an error if it can't convert one of the 'words' to float.
It depends on what you want to do with your data; we are assuming arbitrary input. If you explicitly want float numbers, then you have to do some checking, and you would end up with the behavior of the np.fromstring function.
If I add dtype=float, it seems that simply calling np.array is the best solution so far. Can you update with that?
1

You can write a regular expression, since the language of numbers is not very complicated; the JSON spec shows the diagram for a floating-point number. A pattern that allows arbitrary newlines and spaces between numbers would look like:

[\s\n]*(?:-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?[\s\n]*)*

Breaking that down we have:

[\s\n]*                                                        leading ws (whitespace)
       (?:                                           [\s\n]*)* repeat with trailing ws
          -?(?:0|[1-9]\d*)                                     an integer, no leading 0s
                          (?:\.\d+)?                           opt. decimal part
                                    (?:[eE][-+]?\d+)?          opt. base-10 exponent

Use it by enclosing the pattern in ^ for start-of-string and $ for end-of-string, e.g.:

re.match(r'^[\s\n]*(?:-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?[\s\n]*)*$', 
         '1 2 3.12345')
# returns a Match object

re.match(r'^[\s\n]*(?:-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?[\s\n]*)*$', 
         '1, 2, 3.12345')
# returns None because we did not allow commas in the regex.

To allow optional commas, add ,? right after the optional exponent; square brackets or semicolons, if needed, are not hard to add either. Also consider changing the * in the "repeat with trailing ws" part to a + to force the array to be nonempty.
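
As a rough sketch of how the validation and the conversion could be combined: the helper name parse_validated and the constant NUMBERS are hypothetical, the outer repeat uses + as suggested above so that empty input is rejected, and re.fullmatch stands in for the explicit ^ and $ anchors.

```python
import re
import numpy as np

# The number grammar from the pattern above; the outer repeat is + rather
# than * so that an empty or all-whitespace string is rejected.
NUMBERS = re.compile(
    r'[\s\n]*(?:-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?[\s\n]*)+'
)

def parse_validated(s):  # hypothetical helper name
    # fullmatch anchors the pattern at both ends, so ^ and $ are not needed
    if NUMBERS.fullmatch(s) is None:
        raise ValueError('input is not a whitespace-separated list of numbers')
    # the string has been validated, so split on whitespace and convert
    return np.array(s.split(), dtype=float)
```

Note that the pattern makes the whitespace between numbers optional, so a string like '1-2' passes the regex; in that case np.array still raises ValueError when it tries to convert the token.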


1

Why not check if the array is empty after the operation and throw an error if that is the case?

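import numpy as np

def extract(s):
    a = np.fromstring(s.strip(), dtype=float, sep=' ')
    # reject an empty result, or a single value whose printed length differs from the input's
    if a.size == 0 or (a.size == 1 and len(str(a[0])) != len(s.strip())):
        raise Exception('No numbers found')
    return a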
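
A stricter variant of the same idea, offered only as a sketch: compare the number of parsed values with the number of whitespace-separated tokens, so that a partial parse such as '5 not a number' is rejected too. The name extract_checked is hypothetical, and suppressing DeprecationWarning assumes a NumPy version that warns (rather than raises) when fromstring cannot read the whole string.

```python
import warnings
import numpy as np

def extract_checked(s):  # hypothetical helper name
    tokens = s.split()
    with warnings.catch_warnings():
        # assumption: some NumPy versions warn when parsing stops early
        warnings.simplefilter('ignore', DeprecationWarning)
        a = np.fromstring(s.strip(), dtype=float, sep=' ')
    # every token must have produced exactly one value
    if a.size == 0 or a.size != len(tokens):
        raise ValueError('input could not be fully converted to numbers')
    return a
```

This still relies on fromstring's own idea of what a number looks like, so it is a consistency check rather than a full validation of the input format.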

8 Comments

This fails; try, e.g., np.fromstring(' not a string', dtype=float, sep=' ').
If whitespace is the issue, we can strip the string before parsing. See changes.
Good update, now at least I cannot get it to fail, but do we know there are no other failure cases?
That would depend on the kind of strings you intend to use the function with.
Hmm, your example still gives array([ 5.]) for np.fromstring('5 not a number', dtype=float, sep=' '), which (at least to me) is not the expected answer.