how to convert string array of mixed data types

Question

Let's say I have read a file and loaded it into a 2D matrix of mixed data stored as strings (an example is provided below).

# an example row of the matrix
['529997' '46623448' '2122110124' '2310' '2054' '2' '66' '' '2010/11/03-12:42:08' '26' 'CLEARING' '781' '30' '3' '0' '0' '1']

I want to convert this chunk of data into their data types to be able to do statistical analysis on it with numpy and scipy.

The data type of every column is integer, except index 8, which is a DateTime, and index 10, which is a pure string.

Question:

What is the easiest way to do this conversion?

EDIT

Performance is more important than readability: I have to convert 4.5m rows of data and then process them!

Have you tried anything so far? And what about the empty string?
– Kasravnd, Commented Sep 9, 2016 at 9:37
@Kasramvd those are N/A integers; the empty values will be replaced with 0 or -1. As I said, the data type of every column is integer, except index 8, which is a DateTime, and index 10, which is a pure string.
– dariush, Commented Sep 9, 2016 at 9:39
You have an empty string at index 7 (although your items are not separated by commas!), so what do you want to change that empty string to?
– Kasravnd, Commented Sep 9, 2016 at 9:41
@Kasramvd the shown row is the result of print mtx[0]! It has already been loaded, so there is no need for a , separator! "so you want to change that empty string to what?": to 0.
– dariush, Commented Sep 9, 2016 at 9:54
How did you load this? Have you tried np.genfromtxt with dtype=None? What kind of processing will you do next?
– hpaulj, Commented Sep 9, 2016 at 12:14

Kasravnd · Accepted Answer · 2016-09-09 09:58:12Z

2

Here is a one-liner using a list comprehension:

In [24]: from datetime import datetime
In [25]: func = lambda x: datetime.strptime(x, "%Y/%m/%d-%H:%M:%S")
In [26]: [{8:func, 10:str}.get(ind)(item) if ind in {8, 10} else int(item or '0') for ind, item in enumerate(lst)]
Out[26]: 
[529997,
 46623448,
 2122110124,
 2310,
 2054,
 2,
 66,
 0,
 datetime.datetime(2010, 11, 3, 12, 42, 8),
 26,
 'CLEARING',
 781,
 30,
 3,
 0,
 0,
 1]

answered Sep 9, 2016 at 9:58

Kasravnd


6 Comments

dariush Over a year ago

I know I can apply this in a for loop to convert my 1m-row matrix. Just wondering, is there any way to extend this solution to convert the entire matrix? Nice solution btw. +1

Kasravnd Over a year ago

@Dariush I don't recommend this approach for every problem. If you don't care about performance, note that readability is more important, and you can simply implement it with a regular loop. Another problem with this solution is that you can't handle unexpected exceptions that might occur during execution, for example TypeError.

dariush Over a year ago

Performance is more important than readability here: I have to convert 4.5m rows and then process them! Knowing this, do you recommend this solution?

Georg Schölly Over a year ago

@Dariush: Converting probably takes a fraction of the time it takes to read the data from disk, so I wouldn't bother too much unless you see at the end that it is a problem.

Kasravnd Over a year ago

@Dariush So why don't you use NumPy or pandas to handle your data and convert it at load time? That is much faster than pure Python.


Georg Schölly · Accepted Answer · 2016-09-09 12:12:06Z

1

I like clear code like this:

from datetime import datetime

input_row = ['529997', '46623448', '2122110124', '2310', '2054',
             '2', '66', '', '2010/11/03-12:42:08', '26',
             'CLEARING', '781', '30', '3', '0', '0', '1']

_date = lambda x: datetime.strptime(x, "%Y/%m/%d-%H:%M:%S")
# only necessary because '' should be treated as 0
_int  = lambda x: int('0' + x)

# specify the type parsers for each column
parsers = 8 * [_int] + [_date, _int, str] + 6 * [_int]

output_row = [parse(input) for parse, input in zip(parsers, input_row)]

Depending on your needs, use an iterator instead of a list. This could greatly reduce the amount of memory you need.
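
One way to read the iterator suggestion is a generator that yields one converted row at a time, so the converted data never has to sit in memory all at once. This is only a minimal sketch under the column layout from the question; `iter_rows` and `PARSERS` are hypothetical names, and `int(text or "0")` is swapped in for `int('0' + x)` so that a value such as '-1' would also parse.

```python
from datetime import datetime


def _date(text):
    return datetime.strptime(text, "%Y/%m/%d-%H:%M:%S")


def _int(text):
    # '' is treated as 0; unlike int('0' + x) this also accepts '-1'
    return int(text or "0")


# one parser per column: index 8 is a datetime, index 10 a plain string
PARSERS = 8 * [_int] + [_date, _int, str] + 6 * [_int]


def iter_rows(rows):
    """Yield converted rows lazily instead of building one big list."""
    for row in rows:
        yield [parse(value) for parse, value in zip(PARSERS, row)]


sample = [['529997', '46623448', '2122110124', '2310', '2054',
           '2', '66', '', '2010/11/03-12:42:08', '26',
           'CLEARING', '781', '30', '3', '0', '0', '1']]

for converted in iter_rows(sample):
    # each row is converted only when the loop asks for it
    print(converted[7], converted[8], converted[10])
```

Whether this saves memory in practice depends on the downstream processing: if every converted row is kept anyway, a list is just as good.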

edited Sep 9, 2016 at 12:12

answered Sep 9, 2016 at 10:14

Georg Schölly


1 Comment

dariush Over a year ago

I am going to accept your answer for the sake of its readability, although @Kasramvd's answer is valid too.

dariush · Accepted Answer · 2016-09-09 15:38:32Z

I developed the following function to convert the 4.5m rows of the matrix; it also handles values that fail to parse. It could be improved by parallelizing the process, but it did the job for me, so for what it's worth I am posting it here.

from datetime import datetime

import numpy as np


def cnvt_data(mat):
    _date = lambda x: datetime.strptime(x, "%Y/%m/%d-%H:%M:%S")
    # only necessary because '' should be treated as 0
    _int = lambda x: int('0' + x)

    # specify the type parsers for each column
    parsers = 8 * [_int] + [_date, _int, str] + 6 * [_int]

    def try_parse(parse, value, _def):
        # return (parsed value, True) on success, (default, False) on failure
        try:
            return parse(value), True
        except ValueError:
            return _def, False

    matrix = []

    for idx, row in enumerate(mat):
        try:
            # fast path: the whole row parses cleanly
            # (mixed ints, a datetime and a str, hence an object array)
            matrix.append(np.asarray([parse(value) for parse, value in zip(parsers, row)], dtype=object))
        except ValueError:
            # slow path: parse field by field, replacing bad values with 0
            bad = []
            matrix.append([])
            for _idx, (parse, value) in enumerate(zip(parsers, row)):
                val, ok = try_parse(parse, value, 0)
                matrix[-1].append(val)
                if not ok:
                    bad.append(_idx)
            print("\r[Error] value error @row %d @indices(%s): replaced with 0"
                  % (idx, ', '.join(str(x) for x in bad)))

        print("\r[.] %d%% converted" % (idx * 100 // len(mat)), end='')

    print("\r[+] 100% converted.")

    return matrix
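
Since performance is the stated concern, it may be worth trying a column-wise conversion before parallelizing, letting NumPy parse all integer columns in one call instead of one Python call per field. The following is a rough sketch and not a drop-in replacement: `convert_columns` is a hypothetical helper, it assumes the matrix fits in memory as a 2D string array, and unlike the function above it raises on an invalid value instead of replacing it with 0.

```python
from datetime import datetime

import numpy as np

DATE_COL, TEXT_COL = 8, 10  # column layout taken from the question


def convert_columns(mat):
    """Convert a 2D array of strings column by column (hypothetical helper)."""
    mat = np.asarray(mat, dtype=str)
    int_cols = [i for i in range(mat.shape[1]) if i not in (DATE_COL, TEXT_COL)]

    # replace '' with '0', then let NumPy parse all integer columns at once
    raw = mat[:, int_cols]
    ints = np.where(raw == "", "0", raw).astype(np.int64)

    # dates still go through strptime, one value at a time
    dates = np.array(
        [datetime.strptime(s, "%Y/%m/%d-%H:%M:%S") for s in mat[:, DATE_COL]],
        dtype="datetime64[s]",
    )
    return ints, dates, mat[:, TEXT_COL]


row = ['529997', '46623448', '2122110124', '2310', '2054',
       '2', '66', '', '2010/11/03-12:42:08', '26',
       'CLEARING', '781', '30', '3', '0', '0', '1']
ints, dates, text = convert_columns([row, row])
print(ints.shape, dates[0], text[0])
```

Keeping the integers in one int64 array also means numpy and scipy statistics can be applied to them directly, which an object array of mixed values does not allow.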

hpaulj · Accepted Answer · 2016-09-09 18:05:08Z

Usually when people ask about reading csv files we ask for a sample of the file. I've attempted to reconstruct your line from the string list:

In [590]: txt
Out[590]: b'529997, 46623448, 2122110124, 2310, 2054, 2, 66, , 2010/11/03-12:42:08, 26, CLEARING, 781, 30, 3, 0, 0, 1'

(b for bytestring in Py3, which is how genfromtxt expects its input)

genfromtxt expects a filename, open file, or anything that feeds it lines, so a list of lines works fine.

With dtype=None it deduces column types:

In [591]: data=np.genfromtxt([txt], dtype=None, delimiter=',', autostrip=True)
In [592]: data
Out[592]: 
array((529997, 46623448, 2122110124, 2310, 2054, 2, 66, False, b'2010/11/03-12:42:08', 26, b'CLEARING', 781, 30, 3, 0, 0, 1), 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '?'), ('f8', 'S19'), ('f9', '<i4'), ('f10', 'S8'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4')])

The result is a bunch of int fields, 2 string fields. The blank is interpreted as boolean.

If I spell out the column types I get a slightly different array:

In [593]: dt=[int,int,int,int,int,int,int,float,'U20',int, 'U10',int,int,int,int,int,int]
In [594]: data=np.genfromtxt([txt], dtype=dt, delimiter=',', autostrip=True)
In [595]: data
Out[595]: 
array((529997, 46623448, 2122110124, 2310, 2054, 2, 66, nan, '2010/11/03-12:42:08', 26, 'CLEARING', 781, 30, 3, 0, 0, 1), 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<f8'), ('f8', '<U20'), ('f9', '<i4'), ('f10', '<U10'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4')])

I specified float for the blank column, which it then interprets as nan. Handling of blanks can be refined.
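
One possible refinement, sketched here on made-up two-line input rather than the real file, is genfromtxt's `filling_values` parameter, which supplies the value used for missing fields. With it the blank column can stay an integer column and come out as 0, which matches what the question asks for.

```python
import numpy as np

# made-up sample lines with an empty field in each
lines = ["1, , 3", "4, 5, "]

# filling_values supplies the value used for missing (here: empty) fields
data = np.genfromtxt(lines, dtype=int, delimiter=",", autostrip=True,
                     filling_values=0)
print(data)
```

`filling_values` can also be given per column if only some columns should default to 0.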

I changed the string fields to unicode (the default py3 string).

I should be able to specify a datetime conversion, for example to np.datetime64.

With just one line, data is a single element array, 0d, with a compound dtype.

Fields are accessed by name

In [598]: data['f8']
Out[598]: 
array('2010/11/03-12:42:08', 
      dtype='<U20')
In [599]: data['f2']
Out[599]: array(2122110124)
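
With more than one input line the result is a 1d array of records, and each field comes back as an ordinary array that numpy functions accept. A small sketch with made-up three-column lines (not the real 17-column layout):

```python
import numpy as np

# made-up sample lines: two integer columns and one string column
lines = ["1, 10, CLEARING", "2, 20, CLEARING", "3, 60, PENDING"]
dt = [int, int, "U10"]

data = np.genfromtxt(lines, dtype=dt, delimiter=",", autostrip=True)

print(data.shape)          # one record per input line
print(data["f1"].mean())   # a field is a plain integer array
print(data["f2"])          # the string field
```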

Speed-wise this is probably the same as your custom reader. genfromtxt reads the file line by line and parses it. It collects the parsed lines in a list and creates an array once at the end (I don't recall whether parsed lines are lists or dtype arrays; I suspect lists, but would have to study the code).

To handle the date, I have to use 'datetime64[s]' and somehow change the date to read "2010-11-03T12:42:08", probably in a converter.
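
The string rewrite itself could look like the sketch below. `to_datetime64` is a hypothetical helper that takes a str; when used as a genfromtxt converter the value may arrive as bytes and would need decoding first.

```python
import numpy as np


def to_datetime64(text):
    """Turn '2010/11/03-12:42:08' into a second-resolution np.datetime64."""
    # the only '-' in this format separates the date from the time
    date_part, time_part = text.split("-")
    iso = date_part.replace("/", "-") + "T" + time_part
    return np.datetime64(iso, "s")


print(to_datetime64("2010/11/03-12:42:08"))
```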

===================

I can make a converter based on your datetime parsing (with the dtype of column 8 in dt changed to 'datetime64[s]', as the output below shows):

In [649]: from datetime import datetime
In [650]: dateconvert=lambda x: datetime.strptime(x.decode(),"%Y/%m/%d-%H:%M:%S")
In [651]: data=np.genfromtxt([txt], dtype=dt, delimiter=',',  autostrip=True, converters={8:dateconvert})
In [652]: data
Out[652]: 
array((529997, 46623448, 2122110124, 2310, 2054, 2, 66, nan, datetime.datetime(2010, 11, 3, 12, 42, 8), 26, 'CLEARING', 781, 30, 3, 0, 0, 1), 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<f8'), ('f8', '<M8[s]'), ('f9', '<i4'), ('f10', '<U10'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4')])
