3

I have a DataFrame with a mix of data types (object and numeric). I want to draw a scatter plot of every numeric column in the dataset against four specific columns (col_32, col_69, col_74 and col_80), so that each numeric column produces 4 plots.

Example:

  • col_1 against col_32, col_69, col_74 and col_80 (4 plots)

  • col_2 against col_32, col_69, col_74 and col_80 (4 plots)

  • col_3 against col_32, col_69, col_74 and col_80 (4 plots)

  • ...

  • col_85 against col_32, col_69, col_74 and col_80 (4 plots)

import pandas as pd 
from random import uniform
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gmean


# Seed NumPy's global RNG first so both the data and the NaN mask are reproducible
np.random.seed(123)

# Generate dataframe

df = pd.DataFrame(
    data=np.random.uniform(low=5.5, high=30.75, size=(160, 84)),
    columns=[f'col_{i}' for i in range(1, 85)],)

df.insert(
    loc=0, column='Location',
    value=np.repeat(['A', 'B', 'C', 'D'], 40, axis=0),)

# Insert NaN in the dataset just like the original dataset
# Define the probability of introducing a NaN (e.g., 15%)
nan_probability = 0.15

df = df.mask(np.random.random(df.shape) < nan_probability)

# final dataset
df

This is where I need help. My attempt so far:

# select numeric columns 
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
print(f"Numeric columns: {numeric_cols}")

# create a list of specific columns col_32,col_69,col_74 and col_80
specific_x_cols = ['col_32','col_69','col_74','col_80']

for x_col in specific_x_cols:
    # Create a new figure for each  numeric column against the 4 specific_x_columns
    plt.subplots(nrows=2, ncols=2, figsize=(10, 8))
        
    
    for y_col in numeric_cols:
        if y_col != x_col: # Avoid plotting a column against itself
            
            sns.scatterplot(x =x_col, y = y_col,data=df)
            
    plt.title(f"Scatterplot of {y_col} against {x_col}")
   
    plt.xlabel(x_col)
    plt.ylabel("numeric columns")
    plt.grid(True)
    plt.legend()
    plt.savefig(f'{y_col}_scatterplot.png') # Save as a PNG file with a descriptive name
    plt.show()
    

print("scatterplot generated and saved successfully!")
    

Please share your code if you can

2
  • It's hard to know exactly what you want, so I can't provide code, but I'll leave you a tip: using the melt function to reshape your data should get you there easily. Commented Oct 30 at 7:21
  • @PandaKim - I want a scatterplot of all the numeric columns in the dataframe against 4 target columns, which are col_32, col_69, col_74 and col_80 (each numeric column vs the 4 target columns, making a total of 81 x 4 scatterplots). Commented Oct 30 at 7:27
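
A minimal sketch of the melt idea from the first comment, assuming seaborn's relplot is an acceptable way to draw the facets (the commenter did not give code). The frame below is a simplified stand-in for the question's data (no Location column, no NaNs), and the names long_df, target and target_value are made up for illustration. It handles a single numeric column, col_1; repeating it for the other columns would need a loop.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for scripted runs
import numpy as np
import pandas as pd
import seaborn as sns

# Simplified stand-in for the question's DataFrame (no 'Location', no NaNs).
rng = np.random.default_rng(123)
df = pd.DataFrame(
    rng.uniform(5.5, 30.75, size=(160, 84)),
    columns=[f"col_{i}" for i in range(1, 85)],
)
target_cols = ["col_32", "col_69", "col_74", "col_80"]
y_col = "col_1"  # the numeric column to compare against the targets

# Long format: one row per (observation, target column) pair.
long_df = df.melt(
    id_vars=[y_col],
    value_vars=target_cols,
    var_name="target",            # hypothetical name: which target column
    value_name="target_value",    # hypothetical name: that column's value
)

# One facet per target column, arranged in a 2x2 grid.
grid = sns.relplot(
    data=long_df, x="target_value", y=y_col,
    col="target", col_wrap=2, kind="scatter", height=3,
)
grid.set_titles(f"{y_col} against {{col_name}}")
```

The reshaping is what lets a single plotting call produce all four panels, since the target column becomes an ordinary variable that seaborn can facet on.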

2 Answers

2

You need to specify the Axes on which each plot is drawn; otherwise seaborn draws on the current Axes, which is the last one created (the fourth). And IIUC, the nesting of the loops needs to be reversed:

specific_x_cols = ['col_32','col_69','col_74','col_80']
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()

for y_col in numeric_cols:
    fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))
    for x_col, ax in zip(specific_x_cols, axs.flatten()):
        sns.scatterplot(x=x_col, y=y_col, data=df, ax=ax)
        ax.set_title(f"Scatterplot of {y_col} against {x_col}")
        ax.grid()
        # customize the current Axes here
    # customize the current Figure here
    fig.set_tight_layout(True)

    # rest of your code here

Output:

[image: scatterplot]
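
One way the loop above might be completed, offered as a sketch and not as the answerer's own code: it saves each 2x2 figure to a PNG and closes it, and it assumes the four target columns should not be plotted against themselves. The function name save_scatter_grids, the file naming and the temporary output directory are all assumptions made for illustration, and the demo frame omits the Location column and the NaNs.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # assumption: render to files, no GUI window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def save_scatter_grids(df, x_cols, out_dir, y_cols=None):
    """Save one 2x2 figure per y column, plotted against each column in x_cols."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if y_cols is None:
        # Every numeric column except the targets themselves, in original order.
        numeric = df.select_dtypes(include="number").columns
        y_cols = numeric.difference(x_cols, sort=False)
    saved = []
    for y_col in y_cols:
        fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))
        for x_col, ax in zip(x_cols, axs.flatten()):
            sns.scatterplot(x=x_col, y=y_col, data=df, ax=ax)
            ax.set_title(f"{y_col} against {x_col}")
            ax.grid(True)
        fig.tight_layout()
        path = out_dir / f"{y_col}_scatterplots.png"
        fig.savefig(path)
        plt.close(fig)  # free the figure before creating the next one
        saved.append(path)
    return saved


# Simplified stand-in for the question's DataFrame.
rng = np.random.default_rng(123)
df = pd.DataFrame(
    rng.uniform(5.5, 30.75, size=(160, 84)),
    columns=[f"col_{i}" for i in range(1, 85)],
)
targets = ["col_32", "col_69", "col_74", "col_80"]

# Demo on three columns only; leave y_cols out to process all 80 others.
paths = save_scatter_grids(
    df, targets, tempfile.mkdtemp(), y_cols=["col_1", "col_2", "col_3"]
)
```

Closing each figure matters here because the full run creates 80 of them, and matplotlib warns once more than 20 figures are open at the same time.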

1 Comment

I think this is an excellent answer. After reviewing the `if` check in the OP's code, I have just one suggestion, which excludes the target columns from the y columns: numeric_cols = df.select_dtypes(include=['number']).columns.difference(specific_x_cols)

0
import pandas as pd
import matplotlib.pyplot as plt
# df = pd.read_csv("your_data.csv")
target_col = "target_column"
# Select numeric columns except the target
numeric_cols = df.select_dtypes(include=['number']).columns.drop(target_col)
# Loop through and plot
for col in numeric_cols:
    plt.figure(figsize=(6, 4))
    plt.scatter(df[col], df[target_col], alpha=0.6)
    plt.title(f"{col} vs {target_col}")
    plt.xlabel(col)
    plt.ylabel(target_col)
    plt.grid(True)
    plt.show()

2 Comments

With your code, I got a ValueError: x and y must be the same size. I replaced `target_col = "target_column"` with `target_col = ['col_32','col_69','col_74','col_80']`
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
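
The ValueError in the first comment is most likely because indexing with a list, df[target_col], returns a four-column DataFrame instead of a single Series, so plt.scatter receives 160 x values and 640 y values. A sketch of one possible fix is to add an inner loop over the targets. The frame below is a simplified stand-in for the question's data, and only the first two numeric columns are plotted to keep the demo short. It keeps this answer's orientation (target on the y axis); swap the two arguments for the layout asked for in the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simplified stand-in for the question's DataFrame (no 'Location', no NaNs).
rng = np.random.default_rng(123)
df = pd.DataFrame(
    rng.uniform(5.5, 30.75, size=(160, 84)),
    columns=[f"col_{i}" for i in range(1, 85)],
)
target_cols = ["col_32", "col_69", "col_74", "col_80"]

# Selecting with a list gives a DataFrame, not a Series.
print(df[target_cols].shape)  # (160, 4)

# Numeric columns except the targets.
numeric_cols = df.select_dtypes(include="number").columns.drop(target_cols)

figures = []
for col in numeric_cols[:2]:          # first two columns only, for the demo
    for target_col in target_cols:    # one figure per (column, target) pair
        fig, ax = plt.subplots(figsize=(6, 4))
        ax.scatter(df[col], df[target_col], alpha=0.6)
        ax.set_title(f"{col} vs {target_col}")
        ax.set_xlabel(col)
        ax.set_ylabel(target_col)
        ax.grid(True)
        figures.append(fig)
        plt.close(fig)  # replace with plt.show() or fig.savefig(...) as needed
```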
