Trying to create a labeled numpy array

Question

I want to have a numpy array with values and corresponding labels for each value. I am using this array for linear regression and it will be my X data vector in the equation y = Xb + error.

My X vector consists of about 20 variables, each of which I would like to be able to reference by name like so X['variable1']. I was initially using a dictionary to do this but realized that the scikit library for linear regression requires a numpy matrix, so I am trying to build a numpy array that is labeled.

I keep getting an error stating:

TypeError: a bytes-like object is required, not 'int'.

This is what I'm doing:

X = np.array([3],dtype=[('label1','int')])

I eventually want to have 20 labeled values, something like this:

X = np.array([3,40,7,2,...],
             dtype=[('label1','int'),('label2','int'),('label3','int')...])

Would really appreciate any help on the syntax here. Thanks!

hpaulj · Accepted Answer · 2015-11-04 22:14:23Z

5

The correct way to create a structured array, with values, is with a list of tuples:

In [54]: X=np.array([(3,)],dtype=[('label1',int)])

In [55]: X
Out[55]: 
array([(3,)], 
      dtype=[('label1', '<i4')])

In [56]: X=np.array([(3,4)],dtype=[('label1',int),('label2',int)])

In [57]: X
Out[57]: 
array([(3, 4)], 
      dtype=[('label1', '<i4'), ('label2', '<i4')])

But I should caution you that such an array is not 2d (or a matrix); it is 1d with fields:

In [58]: X.shape
Out[58]: (1,)

In [59]: X.dtype
Out[59]: dtype([('label1', '<i4'), ('label2', '<i4')])

And you can't do math across fields; X*2 and X.sum() will produce errors. Using X in an equation like y = X*b + error will be hopeless.
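
To illustrate the limitation, here is a minimal sketch. It assumes NumPy 1.16 or later, which provides numpy.lib.recfunctions.structured_to_unstructured. It shows that arithmetic on the structured array as a whole raises a TypeError, and one possible way to get an ordinary 2d array back out of it:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

X = np.array([(3, 4)], dtype=[('label1', int), ('label2', int)])

# Individual fields can be read by name; each one is a plain 1d array.
first = X['label1']

# Arithmetic on the structured array as a whole is not defined.
try:
    X * 2
    multiplied = True
except TypeError:
    multiplied = False

# Convert to an ordinary 2d numeric array, one column per field.
X2d = rfn.structured_to_unstructured(X)
print(multiplied, X2d.shape)
```

Once converted, X2d behaves like any other numeric array, but the field names are no longer attached to it.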

You are probably better off working with real 2d numeric arrays, and doing the mapping between labels and column numbers in your head, or with a dictionary.
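
The dictionary approach could look something like the sketch below. The variable names, the numbers, and the coefficient vector b are all made up for illustration, and np.linalg.lstsq stands in for whatever regression routine is actually used:

```python
import numpy as np

# Hypothetical variable names; the dict maps each name to its column index.
labels = ['variable1', 'variable2', 'variable3']
col = {name: i for i, name in enumerate(labels)}

# 4 observations (rows) of 3 variables (columns); made-up numbers.
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 0.0, 1.5],
              [3.0, 1.0, 2.5],
              [4.0, 3.0, 0.0]])

# Look up a column by name.
v2 = X[:, col['variable2']]

# X is an ordinary 2d array, so y = Xb works directly.
b = np.array([2.0, -1.0, 0.5])
y = X @ b

# Recover the coefficients by least squares (no error term in this toy data).
b_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(labels, b_hat)))
```

The labels live entirely outside the array, so X can be handed to any routine that expects a plain numeric matrix.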

Or use Pandas.
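
A minimal Pandas sketch, assuming pandas 0.24 or later (for DataFrame.to_numpy) and using made-up column names and values: the DataFrame keeps the labels, and a plain 2d array can be pulled out when a library needs one.

```python
import pandas as pd

# Each column is a labeled variable; each row is one observation.
df = pd.DataFrame({'variable1': [1.0, 2.0, 3.0],
                   'variable2': [2.0, 0.0, 1.0]})

# Reference a variable by name.
v1 = df['variable1']

# Plain 2d float array with the same column order as df.columns.
X = df.to_numpy()
print(list(df.columns), X.shape)
```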

edited Nov 4, 2015 at 22:14

answered Nov 4, 2015 at 22:08


1 Comment

covfefe Over a year ago

Thanks, I went with your first approach and did something like: keyValues = [('A',0), ('R',0), ('N',0)]

Dietrich · Answer · 2015-11-04 22:01:31Z

0

With only 20 variables, memory is not an issue, so you could keep using dictionaries:

from collections import OrderedDict  # Dictionary that remembers insertion order
import numpy as np

dd = OrderedDict()
dd["Var1"] = 10
dd["Var2"] = 20
dd["Var3"] = 30

# make numpy array from dict:
xx = np.array(list(dd.values()))

# make dict() from array:
xx2 = 2*xx
dd2 = OrderedDict(zip(dd.keys(), xx2))

answered Nov 4, 2015 at 22:01

