Python Pandas Regression

Question

I am struggling to figure out if regression is the route I need to go in order to solve my current challenge with Python. Here is my scenario:

I have a Pandas Dataframe that is 195 rows x 25 columns
All data (except for index and headers) are integers
I have one specific column (Column B) that I would like compared to all other columns
Attempting to determine if there is a range of numbers in any of the columns that influences or impacts column B
An example of the results I would like to calculate in Python is something similar to: Column B is above 3.5 when data in Column D is between 10.20 - 16.4

The examples I've been reading online about regression in Python appear to produce charts and statistics that I don't need (or maybe I am interpreting them incorrectly). I believe the proper way to describe what I am asking is: identify specific values, or a range of values, that are linearly related between two columns in a Pandas dataframe.

Can anyone help point me in the right direction?

Thank you all in advance!

So what you want to achieve is to determine whether Column B is above 3.5 when the data in Column D is between 10.20 - 16.4? Can you provide a sample data frame?
– 2342G456DI8, Commented Jan 8, 2016 at 1:45
I want to understand which numbers or ranges influence the outcome of column B. I'll post a sample data frame shortly.
– Giltzer, Commented Jan 8, 2016 at 1:50
Sorry, I'm not able to add attachments, and a copy/paste of the data frame is not displaying properly.
– Giltzer, Commented Jan 8, 2016 at 4:56

Stefan · Accepted Answer · 2016-01-08 16:13:07Z

Your goals sound very much like exploratory data analysis at this point. You should probably first calculate the correlation between your target column B and every other column using pandas.Series.corr (for two variables, the Pearson correlation is the regression slope on standardized data, so this is closely related to bivariate regression), which you could list:

# df is the DataFrame; correlate target column 'B' with every other column
other_cols = [col for col in df.columns if col != 'B']
corr_B = [{other: df.loc[:, 'B'].corr(df.loc[:, other])} for other in other_cols]
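As a rough, self-contained illustration of the listing above, the following sketch builds a small synthetic frame and ranks the columns by how strongly they correlate with B. The data, the column names other than B and D, and the random seed are all made up for the example; DataFrame.corrwith is used here as a shortcut for looping over Series.corr. The last lines check, for one pair, that the Pearson coefficient matches the least-squares slope on standardized data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 195-row integer frame (hypothetical data).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 50, size=(195, 5)),
    columns=["A", "C", "D", "E", "F"],
)
# Make B depend on D plus noise so there is something to find.
df["B"] = df["D"] // 5 + rng.integers(0, 3, size=len(df))

# Pearson correlation of B with every other column.
corr_with_b = df.drop(columns="B").corrwith(df["B"])

# Order the columns by absolute correlation, strongest first.
order = corr_with_b.abs().sort_values(ascending=False).index
ranked = corr_with_b.reindex(order)
print(ranked)

# For a single pair, r equals the least-squares slope on standardized data.
pair = df[["B", "D"]]
z = (pair - pair.mean()) / pair.std()
slope = np.polyfit(z["D"], z["B"], deg=1)[0]
print(round(slope, 6), round(corr_with_b["D"], 6))
```

A high absolute value only says that two columns move together roughly linearly over the whole range; it does not by itself identify the ranges in which that happens.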

To get a handle on specific ranges, I would recommend looking at the cut and qcut functions in pandas, which let you bin your data as you like and then either plot or correlate the subsets accordingly; see the pandas documentation for pandas.cut and pandas.qcut.
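A minimal sketch of the binning idea, assuming made-up data in which B is deliberately higher when D lies between 10 and 16. The bin edges, the 3.5 threshold (borrowed from the question), and the seed are illustrative choices, not values derived from the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: B tends to be higher when D falls in 10..16.
rng = np.random.default_rng(1)
d = rng.integers(0, 30, size=195)
b = np.where((d >= 10) & (d <= 16), 5, 2) + rng.integers(0, 2, size=195)
df = pd.DataFrame({"B": b, "D": d})

# Split D into hand-picked bins, closed on the left: [0, 5), [5, 10), ...
# pd.qcut(df["D"], q=5) would instead give bins with roughly equal counts.
edges = [0, 5, 10, 17, 25, 30]
df["D_bin"] = pd.cut(df["D"], bins=edges, right=False)

# Summarise B inside each range of D.
summary = df.groupby("D_bin", observed=True)["B"].agg(["count", "mean", "min", "max"])
print(summary)

# Ranges of D in which B is above the threshold on average.
threshold = 3.5
high = summary[summary["mean"] > threshold]
print(high.index.tolist())
```

With real data the interesting edges are not known in advance, so it may help to start with qcut or with many narrow bins and then merge neighbouring bins that behave alike.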

To visualize bivariate and simple multivariate relationships, I would recommend the seaborn package because it includes various types of plots designed to help you get a quick grasp of covariation among variables. See, for instance, the seaborn tutorial sections on univariate and bivariate distributions, linear relationship plots, and categorical data plots.
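The kind of plot meant here could look like the following sketch, which assumes seaborn and matplotlib are installed and again uses invented data. It draws a scatter plot with a fitted line for B against D, and a box plot of B within three hand-picked ranges of D; the range labels and figure size are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data with a roughly linear link between D and B.
rng = np.random.default_rng(2)
d = rng.integers(0, 30, size=195)
df = pd.DataFrame({"D": d, "B": d // 4 + rng.integers(0, 3, size=195)})

# Label three ranges of D so they can be used as plot categories.
df["D_bin"] = pd.cut(
    df["D"], bins=[0, 10, 20, 30], right=False, labels=["0-9", "10-19", "20-29"]
)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
# Scatter plot of B against D with a fitted regression line.
sns.regplot(data=df, x="D", y="B", ax=ax_left)
# Distribution of B within each range of D.
sns.boxplot(data=df, x="D_bin", y="B", ax=ax_right)
fig.tight_layout()
```

Calling plt.show() or fig.savefig with a file name of your choice displays or stores the figure.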

The above should help you understand bivariate relationships. Once you want to progress to multivariate relationships, you could turn to the scikit-learn or statsmodels packages, which IMHO are the best suited for this in Python. Hope this helps to get you started.
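For the multivariate step, one possible starting point is an ordinary least-squares fit with scikit-learn's LinearRegression, sketched below on invented data in which B depends on C and D but not on E. The column names, coefficients, and seed are assumptions made for the example; statsmodels would fit the same model and additionally report standard errors and p-values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: B depends on C and D but not on E.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.integers(0, 20, size=(195, 3)), columns=["C", "D", "E"])
df["B"] = 2 * df["C"] - df["D"] + rng.integers(0, 3, size=len(df))

predictors = ["C", "D", "E"]
model = LinearRegression().fit(df[predictors], df["B"])

# One fitted coefficient per predictor.
coefs = pd.Series(model.coef_, index=predictors)
print(coefs.round(2))

# Share of the variance in B explained by the three predictors (R squared).
r_squared = model.score(df[predictors], df["B"])
print(round(r_squared, 3))
```

A coefficient near zero, as for E here, suggests that the column adds little once the other predictors are accounted for.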

2 Comments

Giltzer Over a year ago

Thank you so much for your post and recommendations.

Stefan Over a year ago

You're welcome. Just let me know if you need clarification on the above, or if this answers your question for now.
