
I am having issues understanding how X and y are referenced for training.

I have a simple csv file with 5 numeric columns that I am loading into a NumPy array as follows:

import urllib.request
import numpy as np

url = "http://www.xyz/shortDataFinal.data"
# download the file (Python 3; in Python 2 this was urllib.urlopen)
raw_data = urllib.request.urlopen(url)
# load the CSV file as a numpy array
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes

X = dataset[:,0:3] #Does this mean columns 1-4?
y = dataset[:,4] #Is this the 5th column?

I think I am referencing my X values incorrectly.

Here is what I need:

The X values should come from columns 1-4 and the y value from the last (5th) column. If I understand correctly, I should be referencing array indices 0:3 for the X values and index 4 for y, as I have done above. However, the results aren't correct: the values returned by the array don't match the values in the data, and they are off by one column (index).

    You want 0:4 (to get 4 columns). Commented May 16, 2016 at 6:25

2 Answers


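Your interpretation is nearly correct. dataset is a 2-D array in this case, so the numpy indexing operator ([]) uses the conventional [row, column] format. The catch is that the stop index of a slice is exclusive.

X = dataset[:,0:3] is interpreted as "all rows for columns 0 through 2", which is only three columns, so use dataset[:,0:4] to get the first four. y = dataset[:,4] is interpreted as "all rows for column 4", which is the 5th column.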
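A quick way to check the slice semantics is to compare shapes. This is only a sketch: the 2x5 array below is made up and stands in for the loaded dataset, and the names first_three and first_four are just for illustration.

```python
import numpy as np

# Hypothetical 2x5 array standing in for the loaded CSV data.
dataset = np.arange(10, dtype=float).reshape(2, 5)

first_three = dataset[:, 0:3]  # columns 0, 1, 2 (stop index 3 is excluded)
first_four = dataset[:, 0:4]   # columns 0, 1, 2, 3
y = dataset[:, 4]              # column 4 only, returned as a 1-D array

print(first_three.shape)  # (2, 3)
print(first_four.shape)   # (2, 4)
print(y.shape)            # (2,)
```

If the shape of X shows one column fewer than expected, the stop index of the slice is the likely culprit.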


Using a multiline string as a stand-in for a CSV file:

In [332]: txt=b"""0, 1, 2, 4, 5
   .....: 6, 7, 8, 9, 10
   .....: """

In [333]: data=np.loadtxt(txt.splitlines(), delimiter=',')

In [334]: data
Out[334]: 
array([[  0.,   1.,   2.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.]])

In [335]: data.shape
Out[335]: (2, 5)

In [336]: data[:,0:4]
Out[336]: 
array([[ 0.,  1.,  2.,  4.],
       [ 6.,  7.,  8.,  9.]])

In [337]: data[:,4]
Out[337]: array([  5.,  10.])

numpy indexing starts at 0; the slice [0:4] selects the same indices (more or less) as the list of numbers starting at 0, up to, but not including, 4.

In [339]: np.arange(0,4)
Out[339]: array([0, 1, 2, 3])

Another way to get all but the last column is to use negative indexing:

In [352]: data[:,:-1]
Out[352]: 
array([[ 0.,  1.,  2.,  4.],
       [ 6.,  7.,  8.,  9.]])
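Putting the two pieces together, a feature/target split that does not depend on the column count might look like the sketch below. The array repeats the sample data from above; y_col is an extra name I introduced to show the 2-D variant.

```python
import numpy as np

# Same sample values as the session above.
data = np.array([[0., 1., 2., 4., 5.],
                 [6., 7., 8., 9., 10.]])

X = data[:, :-1]      # every column except the last
y = data[:, -1]       # last column as a 1-D array
y_col = data[:, -1:]  # last column kept as a 2-D column

print(X.shape)      # (2, 4)
print(y)            # [ 5. 10.]
print(y_col.shape)  # (2, 1)
```

Whether you want the 1-D y or the 2-D y_col depends on what the training code expects, so check the shape it asks for.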

Often a CSV file is a mix of numeric and string values. The documentation for the loadtxt dtype parameter has a short explanation of how to load and access such a file as a structured array. genfromtxt is easier to use for that (though no less confusing).
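As a rough illustration of the structured-array route, here is a small genfromtxt sketch. The CSV text, the column names (height, weight, label) and the variable names are all invented for the example; they are not from the question's file.

```python
import io
import numpy as np

# Hypothetical CSV with a header row, two numeric columns and one string column.
csv_text = "height,weight,label\n1.5,2.0,cat\n3.5,4.0,dog\n"

# names=True takes field names from the header; dtype=None lets numpy
# pick a type per column, giving a 1-D structured array.
records = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                        names=True, dtype=None, encoding="utf-8")

print(records.dtype.names)  # ('height', 'weight', 'label')

# Fields are selected by name rather than by column index.
X = np.column_stack([records["height"], records["weight"]])
y = records["label"]

print(X.shape)  # (2, 2)
```

Note that a structured array is 1-D (one record per row), so slices like [:, 0:4] do not apply to it; columns are picked by field name instead.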
