I'm trying to figure out which variables affect the toAnalyse variable, using scikit-learn's LogisticRegression. When I run the code below, I get an error (shown after the code).

Code:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib import rcParams
from sklearn.linear_model import LogisticRegression

rcParams['figure.figsize'] = 14, 7
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

data = pd.read_csv('file.txt', sep=",")

df = pd.concat([
    pd.DataFrame(data, columns=data.columns),
    pd.DataFrame(data, columns=['toAnalyse'])
], axis=1)

X = df.drop(['notimportant', 'test', 'toAnalyse'], axis=1)
y = df['toAnalyse']
#y.drop(y.columns[0], axis=1, inplace=True)   <----------------- From 2 to 0 variables when running this?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

The error:

ValueError: y should be a 1d array, got an array of shape (258631, 2) instead.

The shape in the message matches what y.info() prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344842 entries, 0 to 344841
Data columns (total 2 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   toAnalyse          343480 non-null  float64
 1   toAnalyse          343480 non-null  float64

So the toAnalyse column appears in y twice. I then tried to remove the first one (by index) so that I'm left with a 1d array. However, when I use y.drop(y.columns[0], axis=1, inplace=True), the error says there are no columns left at all:

ValueError: y should be a 1d array, got an array of shape (258631, 0) instead.

What's going on, and how can I run this with a 1d array?

1 Answer

It looks like after

df = pd.concat([
    pd.DataFrame(data, columns=data.columns),
    pd.DataFrame(data, columns=['toAnalyse'])
], axis=1)

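you have the column 'toAnalyse' in your dataframe twice, so df['toAnalyse'] returns a two-column DataFrame rather than a Series. This is the reason for the wrong shape of y in the first place. drop matches columns by label, not by position, and y.columns[0] is just the label 'toAnalyse', so the drop statement removes both columns and you end up with none.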
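The behaviour can be reproduced on a tiny frame. This is only a sketch: the data values and the "feature" column are made up, and only the column name 'toAnalyse' and the concat construction come from the question.

```python
import pandas as pd

# Made-up stand-in for the CSV in the question.
data = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "toAnalyse": [0.0, 1.0, 0.0]})

# Same construction as in the question: 'toAnalyse' ends up in df twice.
df = pd.concat([
    pd.DataFrame(data, columns=data.columns),
    pd.DataFrame(data, columns=["toAnalyse"]),
], axis=1)

# With a repeated label, column selection returns a DataFrame, not a Series.
y = df["toAnalyse"]
print(type(y).__name__, y.shape)

# y.columns[0] is the label 'toAnalyse'; drop removes every column carrying it.
y_dropped = y.drop(y.columns[0], axis=1)
print(y_dropped.shape)

# Selecting by position keeps exactly one of the two columns.
y_1d = y.iloc[:, 0]
print(y_1d.shape)
```

Positional selection with iloc would also give a 1d result, but it only works around the duplicate rather than removing it.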

To fix that, I would simply remove the df = pd.concat(...) statement. data already seems to contain everything you need, so

X = data.drop(['notimportant', 'test', 'toAnalyse'], axis=1)
y = data['toAnalyse']

should work.
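As a minimal sketch of that fix on made-up data (the "feature_a" column and all values are hypothetical; the other column names come from the question), together with one way to clean up a frame that already has repeated labels:

```python
import pandas as pd

# Made-up stand-in for the CSV; 'feature_a' is a hypothetical feature column.
data = pd.DataFrame({
    "notimportant": [9, 9, 9, 9],
    "test": [0, 1, 0, 1],
    "feature_a": [0.1, 0.4, 0.35, 0.8],
    "toAnalyse": [0.0, 1.0, 0.0, 1.0],
})

# Build X and y directly from `data`, without the concat step.
X = data.drop(["notimportant", "test", "toAnalyse"], axis=1)
y = data["toAnalyse"]
print(X.columns.tolist(), y.ndim)

# If a frame already contains repeated labels, keep the first occurrence of each.
df = pd.concat([data, data[["toAnalyse"]]], axis=1)
deduped = df.loc[:, ~df.columns.duplicated()]
y_alt = deduped["toAnalyse"]
print(deduped.columns.tolist(), y_alt.ndim)
```

Either way, y is a Series with a single dimension, which is what train_test_split and the estimator expect.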

1 Comment

Thanks! I just did the same as this example (first method here: towardsdatascience.com/…), but indeed, what you're saying is more logical. Thanks!
