0

Let's say I have read and loaded a file into a 2D matrix of mixed data stored as strings (an example is provided below).

# an example row of the matrix
['529997' '46623448' '2122110124' '2310' '2054' '2' '66' '' '2010/11/03-12:42:08' '26' 'CLEARING' '781' '30' '3' '0' '0' '1']

I want to convert each column of this data to its proper data type so that I can do statistical analysis on it with numpy and scipy.

All of the columns are integers, except index 8, which is a DateTime, and index 10, which is a plain string.

Question:

What is the easiest way to do this conversion?


EDIT

Performance is more important than readability: I have to convert 4.5m rows of data and then process them!

7
  • Have you tried anything so far? And what about the empty string? Commented Sep 9, 2016 at 9:37
  • @Kasramvd those are N/A integers; the empty values will be replaced with 0 or -1. As I said, all of the columns are integers, except index 8, which is a DateTime, and index 10, which is a plain string. Commented Sep 9, 2016 at 9:39
  • You have an empty string at index 7 (although your items are not separated by commas!), so what do you want to change that empty string to? Commented Sep 9, 2016 at 9:41
  • @Kasramvd the row shown is the result of print mtx[0]! It has already been loaded, so there is no need for a , separator! "So what do you want to change that empty string to?": to 0. Commented Sep 9, 2016 at 9:54
  • How did you load this? Have you tried np.genfromtxt with dtype=None? What kind of processing will you do next? Commented Sep 9, 2016 at 12:14

4 Answers

2

Here is a one-liner using a list comprehension:

In [24]: from datetime import datetime
In [25]: func = lambda x: datetime.strptime(x, "%Y/%m/%d-%H:%M:%S")
In [26]: [{8:func, 10:str}.get(ind)(item) if ind in {8, 10} else int(item or '0') for ind, item in enumerate(lst)]
Out[26]: 
[529997,
 46623448,
 2122110124,
 2310,
 2054,
 2,
 66,
 0,
 datetime.datetime(2010, 11, 3, 12, 42, 8),
 26,
 'CLEARING',
 781,
 30,
 3,
 0,
 0,
 1]

6 Comments

I know I can apply this in a for loop to convert my 1m-row matrix. Just wondering, is there any way to extend this solution to convert the entire matrix at once? Nice solution btw. +1
@Dariush I don't recommend this approach for every problem. If you don't care about performance, readability is more important, and you can simply implement it with a regular loop. Another problem with this solution is that you can't handle unexpected exceptions that might occur during execution, for example TypeErrors.
Performance is more important than readability: I have to convert 4.5m rows of data and then process them! Knowing this, do you recommend this solution?
@Dariush: Converting probably takes a fraction of the time it takes to read the data from disk, so I wouldn't bother too much unless you see at the end that it is a problem.
@Dariush So why don't you use NumPy or pandas to handle your data and convert it at load time? That is much faster than pure Python.
1

I like clear code like this:

from datetime import datetime

input_row = ['529997', '46623448', '2122110124', '2310', '2054',
             '2', '66', '', '2010/11/03-12:42:08', '26',
             'CLEARING', '781', '30', '3', '0', '0', '1']

_date = lambda x: datetime.strptime(x, "%Y/%m/%d-%H:%M:%S")
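# only necessary because '' should be treated as 0;
# `x or '0'` (unlike '0' + x) also accepts negative values such as '-1'
_int  = lambda x: int(x or '0')

# specify the type parsers for each column
parsers = 8 * [_int] + [_date, _int, str] + 6 * [_int]

# `value` rather than `input`, to avoid shadowing the built-in
output_row = [parse(value) for parse, value in zip(parsers, input_row)]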

Depending on your needs, use an iterator instead of a list. This could greatly reduce the amount of memory you need.
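As a rough sketch of the iterator idea, assuming the rows arrive as lists of strings like the one above: a generator can yield one converted row at a time, so the full converted matrix never has to sit in memory. The names parse_int, parse_date, PARSERS and iter_rows are made up for this illustration.

```python
from datetime import datetime


def parse_date(text):
    # Same date format as in the example row.
    return datetime.strptime(text, "%Y/%m/%d-%H:%M:%S")


def parse_int(text):
    # An empty string is treated as 0.
    return int(text or "0")


# One parser per column: 8 ints, date, int, string, 6 ints (17 columns).
PARSERS = 8 * [parse_int] + [parse_date, parse_int, str] + 6 * [parse_int]


def iter_rows(rows):
    """Yield converted rows one by one instead of building a full list."""
    for row in rows:
        yield [parse(value) for parse, value in zip(PARSERS, row)]


sample = [['529997', '46623448', '2122110124', '2310', '2054',
           '2', '66', '', '2010/11/03-12:42:08', '26',
           'CLEARING', '781', '30', '3', '0', '0', '1']]

for converted in iter_rows(sample):
    print(converted[7], converted[8], converted[10])
```

Whether this actually saves memory depends on the consumer: it only helps if each row is processed and then dropped rather than collected again into a list.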

1 Comment

I am going to accept your answer for the sake of its readability, although @Kasramvd's answer is valid too.
1

I have developed the following function to convert the 4.5m rows of the matrix; it also handles invalid-value exceptions. It could be improved by parallelizing the process, but it did the job OK for me, so for what it's worth, I am going to post it here.

import numpy as np
from datetime import datetime


def cnvt_data(mat):
    _date = lambda x: datetime.strptime(x, "%Y/%m/%d-%H:%M:%S")
    # only necessary because '' should be treated as 0;
    # `x or '0'` (unlike '0' + x) also accepts negative values such as '-1'
    _int = lambda x: int(x or '0')

    # specify the type parsers for each column
    parsers = 8 * [_int] + [_date, _int, str] + 6 * [_int]

    def try_parse(parse, value, _def):
        # return (parsed value, True), or (default, False) if parsing fails
        try:
            return parse(value), True
        except ValueError:
            return _def, False

    matrix = []

    for idx, row in enumerate(mat):
        try:
            # fast path: every cell in the row parses cleanly
            matrix.append(np.asarray([parse(value) for parse, value in zip(parsers, row)]))
        except ValueError:
            # slow path: parse cell by cell, replacing bad cells with 0
            bad = []
            matrix.append([])
            for _idx, (parse, value) in enumerate(zip(parsers, row)):
                val, ok = try_parse(parse, value, 0)
                matrix[-1].append(val)
                if not ok:
                    bad.append(_idx)
            print("\r[Error] value error @row %d @indices(%s): replaced with 0"
                  % (idx, ', '.join(str(x) for x in bad)))

        # progress indicator, rewritten in place on the same line
        print("\r[.] %d%% converted" % (idx * 100 // len(mat)), end='')

    print("\r[+] 100% converted.")

    return matrix


1

Usually when people ask about reading csv files we ask for a sample of the file. I've attempted to reconstruct your line from the string list:

In [590]: txt
Out[590]: b'529997, 46623448, 2122110124, 2310, 2054, 2, 66, , 2010/11/03-12:42:08, 26, CLEARING, 781, 30, 3, 0, 0, 1'

(b for bytestring in Py3, which is how genfromtxt expects its input)

genfromtxt expects a filename, an open file, or anything that feeds it lines, so a list of lines works fine.

With dtype=None it deduces the column types:

In [591]: data=np.genfromtxt([txt], dtype=None, delimiter=',', autostrip=True)
In [592]: data
Out[592]: 
array((529997, 46623448, 2122110124, 2310, 2054, 2, 66, False, b'2010/11/03-12:42:08', 26, b'CLEARING', 781, 30, 3, 0, 0, 1), 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '?'), ('f8', 'S19'), ('f9', '<i4'), ('f10', 'S8'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4')])

The result is a set of int fields and 2 string fields. The blank is interpreted as a boolean.

If I spell out the column types, I get a slightly different array:

In [593]: dt=[int,int,int,int,int,int,int,float,'U20',int, 'U10',int,int,int,int,int,int]
In [594]: data=np.genfromtxt([txt], dtype=dt, delimiter=',', autostrip=True)
In [595]: data
Out[595]: 
array((529997, 46623448, 2122110124, 2310, 2054, 2, 66, nan, '2010/11/03-12:42:08', 26, 'CLEARING', 781, 30, 3, 0, 0, 1), 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<f8'), ('f8', '<U20'), ('f9', '<i4'), ('f10', '<U10'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4')])

I specified float for the blank column, which it then interprets as nan. Handling of blanks can be refined.
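One way the blank handling might be refined is genfromtxt's filling_values argument, which supplies the value to use when a field is missing. The following is only a sketch: it assumes a recent NumPy that accepts plain str lines (older versions may need the bytestring form shown above), and declaring column 7 as int with a fill of 0 is an assumption based on the question's comments, not something tested on the real file.

```python
import numpy as np

# Reconstructed sample line; column 7 is blank.
txt = ("529997, 46623448, 2122110124, 2310, 2054, 2, 66, , "
       "2010/11/03-12:42:08, 26, CLEARING, 781, 30, 3, 0, 0, 1")

# All columns int, except the date (kept as text here) and the label.
dt = [int] * 8 + ['U20', int, 'U10'] + [int] * 6

# filling_values maps a column index to the value used for missing entries.
data = np.genfromtxt([txt], dtype=dt, delimiter=',', autostrip=True,
                     filling_values={7: 0})

print(data['f7'])
print(data['f10'])
```

With this, the blank column stays an integer field instead of becoming a float holding nan.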

I changed the string fields to unicode (the default py3 string).

I should be able to specify a datetime conversion, for example to np.datetime64.

With just one line, data is a single element array, 0d, with a compound dtype.

Fields are accessed by name

In [598]: data['f8']
Out[598]: 
array('2010/11/03-12:42:08', 
      dtype='<U20')
In [599]: data['f2']
Out[599]: array(2122110124)

Speed-wise, this is probably about the same as your custom reader. genfromtxt reads the file line by line and parses it. It collects the parsed lines in a list, and creates an array once at the end (I don't recall if parsed lines are lists or dtype arrays - I suspect lists, but would have to study the code).

To handle the date, I have to use 'datetime64[s]', and somehow change the date to read "2010-11-03T12:42:08", probably in a converter.
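The string rewrite itself is simple. Here is a minimal sketch using only string operations, assuming every date has exactly the "YYYY/MM/DD-HH:MM:SS" layout of the sample; to_iso is a made-up helper name, and it does no validation.

```python
def to_iso(text):
    """Rewrite 'YYYY/MM/DD-HH:MM:SS' as 'YYYY-MM-DDTHH:MM:SS'."""
    # The only '-' in the input separates the date part from the time part.
    date_part, time_part = text.split('-')
    return date_part.replace('/', '-') + 'T' + time_part


print(to_iso("2010/11/03-12:42:08"))  # 2010-11-03T12:42:08
```

The result is in the ISO 8601 form that np.datetime64 parses. Plain string slicing like this should also be cheaper per row than strptime, though that is worth measuring on the real data.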

===================

I can make a converter based on your datetime parsing:

In [649]: from datetime import datetime
In [650]: dateconvert=lambda x: datetime.strptime(x.decode(),"%Y/%m/%d-%H:%M:%S")
In [651]: data=np.genfromtxt([txt], dtype=dt, delimiter=',',  autostrip=True, converters={8:dateconvert})
In [652]: data
Out[652]: 
array((529997, 46623448, 2122110124, 2310, 2054, 2, 66, nan, datetime.datetime(2010, 11, 3, 12, 42, 8), 26, 'CLEARING', 781, 30, 3, 0, 0, 1), 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<f8'), ('f8', '<M8[s]'), ('f9', '<i4'), ('f10', '<U10'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4'), ('f15', '<i4'), ('f16', '<i4')])
