Simple NumPy array reference

Question

I am having issues understanding how X and y are referenced for training.

I have a simple CSV file with 5 numeric columns that I am loading into a NumPy array as follows:

url = "http://www.xyz/shortDataFinal.data"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes

X = dataset[:,0:3] #Does this mean columns 1-4?
y = dataset[:,4] #Is this the 5th column?

I think I am referencing my X values incorrectly.

Here is what I need:

The X values should come from columns 1-4 and the y value from the last (5th) column. If I understand correctly, I should be referencing array indices 0:3 for the X values and index 4 for y, as I have done above. However, the values returned by the array don't match the values in the data: they are off by one column (index).

You want 0:4 (to get 4 columns).

– hpaulj, Commented May 16, 2016 at 6:25

Anthony E · Accepted Answer · 2016-05-16 03:44:35Z


Yes, your reading of the row, column format is correct. dataset is a 2-D array in this case, so the NumPy indexing operator ([]) uses the conventional row, column order.

X = dataset[:,0:3] is interpreted as "all rows for columns 0 up to, but not including, 3" (that is, columns 0 through 2), and y = dataset[:,4] is interpreted as "all rows for column 4". To select the first four columns, use dataset[:,0:4].
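
A quick way to check how many columns a slice selects is to print its shape. The sketch below uses a made-up 2x5 array as a stand-in for the 5-column dataset in the question; the values are placeholders, not the asker's data.

```python
import numpy as np

# Hypothetical stand-in for the 5-column dataset in the question
dataset = np.arange(10, dtype=float).reshape(2, 5)

print(dataset[:, 0:3].shape)  # (2, 3): columns 0, 1, 2
print(dataset[:, 0:4].shape)  # (2, 4): columns 0, 1, 2, 3
print(dataset[:, 4].shape)    # (2,): column 4 as a 1-D array
```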


hpaulj · Accepted Answer · 2016-05-16 06:28:03Z

Using a multiline string as a stand-in for a CSV file:

In [332]: txt=b"""0, 1, 2, 4, 5
   .....: 6, 7, 8, 9, 10
   .....: """

In [333]: data=np.loadtxt(txt.splitlines(), delimiter=',')

In [334]: data
Out[334]: 
array([[  0.,   1.,   2.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.]])

In [335]: data.shape
Out[335]: (2, 5)

In [336]: data[:,0:4]
Out[336]: 
array([[ 0.,  1.,  2.,  4.],
       [ 6.,  7.,  8.,  9.]])

In [337]: data[:,4]
Out[337]: array([  5.,  10.])

NumPy indexing starts at 0; the slice [0:4] selects (more or less) the same indices as the list of numbers starting at 0, up to but not including 4.

In [339]: np.arange(0,4)
Out[339]: array([0, 1, 2, 3])

Another way to get all but the last column is to use -1 indexing:

In [352]: data[:,:-1]
Out[352]: 
array([[ 0.,  1.,  2.,  4.],
       [ 6.,  7.,  8.,  9.]])
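
Applied to the X/y split from the question, the negative-index form might look like the sketch below. The helper name split_features_target is hypothetical, and the code assumes the target is always the last column.

```python
import numpy as np

def split_features_target(data):
    """Return (X, y): every column but the last, and the last column."""
    X = data[:, :-1]   # all rows, all columns except the last
    y = data[:, -1]    # all rows, last column only (1-D)
    return X, y

data = np.array([[0., 1., 2., 4., 5.],
                 [6., 7., 8., 9., 10.]])
X, y = split_features_target(data)
print(X.shape, y.shape)  # (2, 4) (2,)
```

This form keeps working if the number of feature columns changes, since no column count is hard-coded.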

Often a CSV file is a mix of numeric and string values. The documentation for the loadtxt dtype parameter has a short explanation of how you can load and access such a file as a structured array. genfromtxt is easier to use for that (though no less confusing).
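
As a rough illustration of the genfromtxt route, the sketch below loads a made-up CSV with one string column and two numeric columns. The sample text and field names are invented for the example; dtype=None asks genfromtxt to infer a type per column, and names=True takes field names from the header row.

```python
import io
import numpy as np

# Made-up sample: one string column and two numeric columns
txt = """name,height,weight
ann,1.5,50
bob,2.0,80
"""

# dtype=None infers a type per column; names=True reads the header row
data = np.genfromtxt(io.StringIO(txt), delimiter=",", dtype=None,
                     names=True, encoding="utf-8")

print(data.dtype.names)  # ('name', 'height', 'weight')
print(data["height"])    # numeric field accessed by name
print(data["name"])      # string field accessed by name
```

The result is a 1-D structured array with one record per row, so columns are selected by field name rather than by a column index such as [:, 0:4].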
