
Here is my test_data.csv:

A,1,2,3,4,5
B,6,7,8,9,10
C,11,12,13,14,15
A,16,17,18,19,20

And I am reading it to a numpy array using the code below:

import csv
import numpy

def readCSVToNumpyArray(dataset):
    with open(dataset) as f:
        values = [row for row in csv.reader(f)]

    data = numpy.array(values)

    return data

In the main code, I have:

    numpyArray = readCSVToNumpyArray('test_data.csv')
    print(numpyArray)

which gives me the output:

(array([['A', '1', '2', '3', '4', '5'],
       ['B', '6', '7', '8', '9', '10'],
       ['C', '11', '12', '13', '14', '15'],
       ['A', '16', '17', '18', '19', '20']], 
      dtype='|S2'))

But all the numbers in the array are treated as strings. Is there a good way to store them as floats without going through each element and assigning the type?

Thanks!

  • numpy.ndarrays are homogeneous; that's part of why they perform well. Maybe you could have two separate arrays, one for numbers and one for strings? Or a list of strings and an array of numbers? Otherwise, you need to look into numpy records or some other data structure. Have you considered pandas DataFrames? Commented Mar 17, 2016 at 16:44
  • Take a look at pandas; it is really good at loading csv. You can convert a pandas table (a DataFrame, actually) to a numpy array easily with asarray(table). Commented Mar 17, 2016 at 16:54
  • If you don't want to involve an extra package (pandas), note that np.fromfile and np.genfromtxt are also good utilities for reading text files; in your case you have to define a data type and pass it to these functions. Go and see their docstrings and also take a look at np.dtype. Commented Mar 17, 2016 at 17:00
  • What kind of array do you want? In particular, how should the string and integer values be combined? All elements of an array must be of the same type (though the dtype may be complex; see structured arrays). Commented Mar 17, 2016 at 17:30
  • @hpaulj: I need them to be strings and numbers for different columns, so I can use DictVectorizer to transform the categorical variables later. Commented Mar 17, 2016 at 17:32
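The two-array idea from the first comment can be sketched like this (a minimal sketch, assuming the same test_data.csv layout; readCSVSplit is a hypothetical helper name, not from the question):

```python
import csv
import numpy

def readCSVSplit(dataset):
    # keep the label column as a plain list of strings ...
    labels = []
    # ... and the numeric columns as rows of floats
    numbers = []
    with open(dataset) as f:
        for row in csv.reader(f):
            labels.append(row[0])
            numbers.append([float(x) for x in row[1:]])
    # the numeric part becomes a homogeneous float64 array
    return labels, numpy.array(numbers)
```

The labels stay available for something like DictVectorizer later, while the numbers land in a true float array.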

3 Answers


Since the first value on each line is a string, you'll have to use numpy's more flexible "object" dtype. Try this function and see if it's what you are looking for:

    def readCSVToNumpyArray(dataset):
        values = []
        with open(dataset) as f:
            for row in csv.reader(f):
                converted = []
                for j in row:
                    try:
                        # store numeric fields as floats ...
                        converted.append(float(j))
                    except ValueError:
                        # ... and leave non-numeric fields as strings
                        converted.append(j)
                values.append(converted)

        data = numpy.array(values, dtype='object')

        return data

    numpyArray = readCSVToNumpyArray('test_data.csv')
    print(numpyArray)

The results are:

    [['A' 1.0 2.0 3.0 4.0 5.0]
     ['B' 6.0 7.0 8.0 9.0 10.0]
     ['C' 11.0 12.0 13.0 14.0 15.0]
     ['A' 16.0 17.0 18.0 19.0 20.0]]
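One thing to note about object arrays like this: each element keeps its original Python type, so the numeric columns can be pulled back out as a real float array later. A small sketch with hypothetical literal data:

```python
import numpy

# an object array shaped like the one the function above produces
a = numpy.array([['A', 1.0, 2.0, 3.0],
                 ['B', 4.0, 5.0, 6.0]], dtype='object')

# elements keep their Python types inside the object array
label = a[0, 0]   # a str
value = a[0, 1]   # a float

# slicing off the label column gives a homogeneous float array
nums = a[:, 1:].astype(float)
```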



I'd read it in using pandas, which lets you set the dtype per column very easily.

import numpy as np 
import pandas as pd 

pdDF = pd.read_csv(
    'test_data.csv', 
    header=None, 
    names=list('abcdef'), 
    dtype=dict(zip(list('abcdef'),[str]+[float]*5)))

Now each column has the appropriate dtype.

pdDF.b
Out[24]: 
0     1
1     6
2    11
3    16
Name: b, dtype: float64

If you still want it in numpy arrays, you can just take the values attribute.

npArr = pdDF.values

npArr
Out[27]: 
array([['A', 1.0, 2.0, 3.0, 4.0, 5.0],
       ['B', 6.0, 7.0, 8.0, 9.0, 10.0],
       ['C', 11.0, 12.0, 13.0, 14.0, 15.0],
       ['A', 16.0, 17.0, 18.0, 19.0, 20.0]], dtype=object)

The dtype is still object, because you can't make 'A' into a float, but the individual numeric values are floats, as desired.

type(npArr[0,1])
Out[28]: float

Finally, if you want just an array of floats, that's easy enough too: take all but the first column, and the resulting array will have dtype float instead of object.

pdDF.loc[:,pdDF.columns>='b'].values
Out[28]: 
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.],
       [ 11.,  12.,  13.,  14.,  15.],
       [ 16.,  17.,  18.,  19.,  20.]])

pdDF.loc[:,pdDF.columns>='b'].values.dtype
Out[29]: dtype('float64')



np.genfromtxt can easily load your data into a structured array. It will be a 1d array, with a field for each column:

Simulate the file with a list of lines:

    In [265]: txt=b"""A,1,2,3,4,5
       .....: B,6,7,8,9,10
       .....: C,11,12,13,14,15
       .....: A,16,17,18,19,20"""
    In [266]: txt=txt.splitlines()
    In [267]: A=np.genfromtxt(txt,delimiter=',',names=None,dtype=None)
    In [268]: A
    Out[268]: 
    array([(b'A', 1, 2, 3, 4, 5), (b'B', 6, 7, 8, 9, 10),
           (b'C', 11, 12, 13, 14, 15), (b'A', 16, 17, 18, 19, 20)], 
          dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4')])

It deduced the dtype from the column values: strings and ints. Fields are accessed by name:

In [269]: A['f0']
Out[269]: 
array([b'A', b'B', b'C', b'A'], 
      dtype='|S1')
In [270]: A['f1']
Out[270]: array([ 1,  6, 11, 16])

I could also define a dtype that would put the strings in one field, and all the other values in another field.

In [271]: A=np.genfromtxt(txt,delimiter=',',names=None,dtype='S2,(5)int')
In [272]: A
Out[272]: 
array([(b'A', [1, 2, 3, 4, 5]), (b'B', [6, 7, 8, 9, 10]),
       (b'C', [11, 12, 13, 14, 15]), (b'A', [16, 17, 18, 19, 20])], 
      dtype=[('f0', 'S2'), ('f1', '<i4', (5,))])
In [273]: A['f1']
Out[273]: 
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])
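If you only need the numeric part as a plain 2-d float array, genfromtxt's usecols parameter can skip the string column entirely (a minimal sketch with the same simulated lines):

```python
import numpy as np

txt = ["A,1,2,3,4,5",
       "B,6,7,8,9,10",
       "C,11,12,13,14,15",
       "A,16,17,18,19,20"]

# read only columns 1-5; the default dtype is float
B = np.genfromtxt(txt, delimiter=',', usecols=range(1, 6))
```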

