TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray' whilst trying to do PCA

Question

I'm trying to do PCA on a sparse matrix, but I am encountering an error:

TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'

Here is my code:

import sys
import csv
from sklearn.decomposition import PCA

data_sentiment = []
y = []
data2 = []
csv.field_size_limit(sys.maxint)
with open('/Users/jasondou/Google Drive/data/competition_1/speech_vectors.csv') as infile:
    reader = csv.reader(infile, delimiter=',', quotechar='|')
    n = 0
    for row in reader:
        # sample = row.split(',')
        n += 1
        if n%1000 == 0:
            print n
        data_sentiment.append(row[:25000])

pca = PCA(n_components=3)
pca.fit(data_sentiment)
PCA(copy=True, n_components=3, whiten=False)
print(pca.explained_variance_ratio_) 
y = pca.transform(data_sentiment)

The input data is speech_vectors.csv, a 2740 × 50000 matrix, available here.

Here is the full error traceback:

Traceback (most recent call last):
  File "test.py", line 45, in <module>
    y = pca.transform(data_sentiment)
  File "/Users/jasondou/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 397, in transform
    X = X - self.mean_
TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'

I do not quite understand what self.mean_ refers to here.

It would be useful to know on which line the error occurs. Also, your code in its current form is nonsense, as you're passing an empty list to pca.fit.
– EdChum, Commented Apr 15, 2015 at 20:34
I think this happens elsewhere (e.g. in pca.fit() or pca.transform()); I don't see any subtraction in this top-level code that could have raised the error directly.
– Kevin, Commented Apr 15, 2015 at 20:36
I don't know what you're referring to when you say "did not quite understand what self.mean_ here".
– ali_m, Commented Apr 15, 2015 at 20:47
Please update the question to include a minimal, complete example that demonstrates the problem (stackoverflow.com/help/mcve). You haven't shown in the code or stated in the question how PCA is imported.
– Warren Weckesser, Commented Apr 15, 2015 at 20:51
This is still not a complete example: we don't have access to your CSV file, so we can't know what data_sentiment looks like. Could you please add a few rows from data_sentiment to your question? Also, please edit your question to contain the full traceback for the error message you are seeing.
– ali_m, Commented Apr 15, 2015 at 22:39

ali_m · Accepted Answer · 2015-04-15 23:50:53Z

You are not parsing the CSV file correctly. Each row that your reader returns will be a list of strings, like this:

row = ['0.0', '1.0', '2.0', '3.0', '4.0']

Your data_sentiment will therefore be a list-of-lists-of-strings, for example:

data_sentiment = [row, row, row]

When you pass this directly to pca.fit(), it is internally converted to a numpy array, also containing strings:

import numpy as np

X = np.array(data_sentiment)
print(repr(X))
# array([['0.0', '1.0', '2.0', '3.0', '4.0'],
#        ['0.0', '1.0', '2.0', '3.0', '4.0'],
#        ['0.0', '1.0', '2.0', '3.0', '4.0']],
#       dtype='|S3')

numpy has no rule for subtracting an array of strings from another array of strings:

X - X
# TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'
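
To see the failure without scikit-learn, here is a minimal sketch. It assumes that self.mean_ in the traceback is the per-column mean that PCA stores during fit, which is how the mean_ attribute is documented. The sample values are made up, and the exact wording of the TypeError differs between NumPy versions.

```python
import numpy as np

# Made-up sample: csv.reader yields strings, not numbers.
rows = [['0.0', '1.0', '2.0'],
        ['3.0', '4.0', '5.0']]

X_str = np.array(rows)  # string dtype, e.g. 'S3' on Python 2 or '<U3' on Python 3

try:
    X_str - X_str  # same kind of operation as `X - self.mean_` in transform()
    subtraction_failed = False
except TypeError:
    subtraction_failed = True  # NumPy defines no subtraction for string dtypes

# After converting to float, the centering step is well defined.
X = X_str.astype(float)
mean_ = X.mean(axis=0)   # per-column mean, analogous to pca.mean_ (assumption)
X_centered = X - mean_
```

Checking X.dtype (or printing a few rows) before calling pca.fit is usually enough to catch this kind of problem.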

This mistake would have been very easy to spot if you had bothered to show us some of the contents of data_sentiment in your question, as I asked you to.

What you need to do is convert your strings to floats, for example:

data_sentiment.append([float(s) for s in row[:25000]])
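
In context, the corrected loop might look like the sketch below, written for Python 3. The in-memory sample_csv and the n_columns value are hypothetical stand-ins for the real file and for the 25000-column slice in the question.

```python
import csv
import io

# Hypothetical stand-in for the real file; use open('/path/to/file.csv') instead.
sample_csv = io.StringIO(u"0.0,1.0,2.0,3.0\n4.0,5.0,6.0,7.0\n")

n_columns = 3  # hypothetical; the question keeps the first 25000 columns

data_sentiment = []
reader = csv.reader(sample_csv, delimiter=',', quotechar='|')
for row in reader:
    # Convert each field from str to float before storing it.
    data_sentiment.append([float(s) for s in row[:n_columns]])
```

float() raises ValueError on an empty or non-numeric field, so a malformed row is reported at parse time rather than later inside PCA.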

A much easier way would be to use np.loadtxt to parse the CSV file:

data_sentiment = np.loadtxt('/path/to/file.csv', delimiter=',')

If you have pandas installed, then pandas.read_csv will probably be faster than np.loadtxt for a large array such as this one.
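
For the np.loadtxt route, the sketch below also restricts the columns with usecols, to mirror the row[:25000] slice in the question. The in-memory sample and n_columns are hypothetical; with a real file you would pass its path instead.

```python
import io

import numpy as np

# Hypothetical stand-in for the real file; a path string works the same way.
sample_csv = io.StringIO(u"0.0,1.0,2.0,3.0\n4.0,5.0,6.0,7.0\n")

n_columns = 3  # hypothetical; the question keeps the first 25000 columns

# loadtxt parses every field as float by default and returns a 2-D array.
data_sentiment = np.loadtxt(sample_csv, delimiter=',',
                            usecols=list(range(n_columns)))
```

The result is already a float array, so it can be passed to pca.fit without further conversion.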

edited Apr 15, 2015 at 23:50

answered Apr 15, 2015 at 23:35

ali_m


2 Comments

ali_m Over a year ago

If my answer solves your problem then you should accept it (click the tick next to my answer)

ali_m Over a year ago

No problem, and welcome to StackOverflow! As a new user of the site, learning how to ask good questions is the most important skill for you to pick up. Please remember to include as much relevant information as you can in your question. If other users have to ask for important details in the comments then they are likely to get impatient with you, and may downvote or close your question instead of trying to answer it.
