I am struggling to figure out if regression is the route I need to go in order to solve my current challenge with Python. Here is my scenario:

  • I have a Pandas Dataframe that is 195 rows x 25 columns
  • All data (except for index and headers) are integers
  • I have one specific column (Column B) that I would like compared to all other columns
  • Attempting to determine if there is a range of numbers in any of the columns that influences or impacts column B
  • An example of the results I would like to calculate in Python is something similar to: Column B is above 3.5 when data in Column D is between 10.20 - 16.4

The examples I've been reading online about regression in Python appear to produce charts and statistics that I don't need (or maybe I am interpreting them incorrectly). I believe the proper way to describe what I am asking is to identify specific values, or a range of values, that are linearly related between two columns in a Pandas dataframe.

Can anyone help point me in the right direction?

Thank you all in advance!

3 Comments
  • So what you want to achieve is to determine if Column B is above 3.5 when data in Column D is between 10.20 - 16.4? Can you provide a sample data frame? Commented Jan 8, 2016 at 1:45
  • I want to understand which numbers or ranges influence the outcome of column B. I'll post a sample data frame shortly. Commented Jan 8, 2016 at 1:50
  • Sorry, I'm not able to add attachments. Copy/paste of the data frame is not displaying properly. Commented Jan 8, 2016 at 4:56

1 Answer

Your goals sound very much like exploratory data analysis at this point. You should probably first calculate the correlation between your target column B and every other column using pandas.Series.corr (for two variables this is closely related to bivariate regression: Pearson's r is the regression slope on standardized data), which you could collect in a list:

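# df is your existing DataFrame; collect every column except the target 'B'
other_cols = [col for col in df.columns if col != 'B']
# one {column: correlation with B} entry per remaining column
corr_B = [{other: df.loc[:, 'B'].corr(df.loc[:, other])} for other in other_cols]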
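
As a self-contained variant of the same idea, here is a minimal sketch that ranks columns by the strength of their correlation with B. The data is synthetic and purely illustrative (column names C and D and the way B is built are assumptions, not your data); DataFrame.corrwith is used in place of the loop.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real 195-row frame (an assumption for illustration).
rng = np.random.default_rng(0)
n = 195
d = rng.integers(0, 30, size=n)
df = pd.DataFrame({
    "B": 2 * d + rng.integers(0, 5, size=n),  # B is built to depend strongly on D
    "C": rng.integers(0, 30, size=n),         # unrelated noise column
    "D": d,
})

# Pearson correlation of every non-target column with B.
corr_with_b = df.drop(columns="B").corrwith(df["B"])

# Order the columns by absolute correlation, strongest first.
order = corr_with_b.abs().sort_values(ascending=False).index
ranked = corr_with_b.reindex(order)
print(ranked)
```

Columns near the top of ranked are the first candidates to inspect for ranges that matter; a low correlation does not rule out a non-linear effect, which is where binning helps.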

To get a handle on specific ranges, I would recommend looking at:

  • the pandas cut and qcut functions, which bin your data as you like so that you can either plot or correlate the subsets accordingly: see the pandas documentation for both.
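
To make the binning idea concrete, here is a minimal sketch under stated assumptions: the data is synthetic, the bin width of 5 and the 3.5 threshold are illustrative choices, and B is deliberately made larger when D lies in 10..19 so that there is something to find.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption): B is high only when D is between 10 and 19.
rng = np.random.default_rng(0)
d = rng.integers(0, 30, size=195)
b = np.where((d >= 10) & (d <= 19), 5, 2) + rng.integers(0, 2, size=195)
df = pd.DataFrame({"B": b, "D": d})

# Bin D into half-open intervals [0, 5), [5, 10), ..., [25, 30).
edges = list(range(0, 31, 5))
df["D_bin"] = pd.cut(df["D"], bins=edges, right=False)

# Summarise B inside each D bin.
summary = df.groupby("D_bin", observed=True)["B"].agg(["count", "mean", "min", "max"])

# Keep only the D ranges where the average of B exceeds the threshold.
threshold = 3.5
high_b_ranges = summary[summary["mean"] > threshold]
print(high_b_ranges)
```

The index of high_b_ranges then reads as statements of the form "B averages above 3.5 when D is in [10, 15)". Swapping pd.cut for pd.qcut(df["D"], q=5) would give bins with roughly equal row counts instead of equal width.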

To visualize bivariate and simple multivariate relationships, I would recommend

  • the seaborn package, because it includes various types of plots designed to help you get a quick grasp of covariation among variables. See for instance the seaborn tutorial examples for univariate and bivariate distributions, linear relationship plots, and categorical data plots.
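
As one possible starting point for such a plot, here is a short sketch using seaborn's regplot, which draws a scatter plot of two columns with a fitted regression line. The data is synthetic and the Agg backend is chosen only so the snippet runs without a display; both are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (an assumption): B rises roughly in steps as D grows.
rng = np.random.default_rng(1)
d = rng.integers(0, 30, size=195)
b = d // 5 + rng.integers(0, 3, size=195)
df = pd.DataFrame({"B": b, "D": d})

# Scatter of B against D with a fitted regression line; returns a matplotlib Axes.
ax = sns.regplot(data=df, x="D", y="B")
# Call ax.figure.savefig("b_vs_d.png") to write the figure to disk if needed.
```

Repeating this for each column against B gives a quick visual check of which relationships look linear and which look step-like or flat.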

The above should help you understand bivariate relationships. Once you want to progress to multivariate relationships, you could return to the scikit-learn or statsmodels packages, which are best suited for this in Python IMHO. Hope this helps to get you started.

2 Comments

Thank you so much for your post and recommendations.
You're welcome. Just let me know if you need clarification on the above, or if this answers your question for now.
