I generated a random dataset that has 1 categorical column and 30 numeric columns. The categorical variable has 3 classes: X, Y and Z. See the data generation code below.
I need help with Steps 2, 3 and 4:
1. Count the number of non-NaN values by group across all 30 numeric columns
2. Calculate the observed effect size across all 30 numeric columns
3. Calculate the observed power across all 30 numeric columns
4. Export the final dataframe as a CSV file
In short, I want to run a post-hoc power analysis and compute the effect size for all numeric columns in the dataset. The dataset has a categorical variable named Group with 3 classes: X, Y and Z.
The desired output should look something like this (one row per numeric column, with the per-group counts, observed effect size and observed power):
Note: The observed effect size and power numbers in the table above are randomly generated (mockup numbers).
To generate the dataset, copy and paste the code below:
# Generate a dataset with 160 records and 31 columns (30 numeric + 1 categorical) with random missing values
import pandas as pd
import numpy as np
np.random.seed(123)
data = np.random.uniform(13.5,38.8, size=(160, 30))
df1 = pd.DataFrame(data, columns=[f'column_{i}' for i in range(1, 31)])
# Randomly insert missing values: define the probability of introducing a NaN (12% here)
nan_probability = 0.12
np.random.seed(123)
df1 = df1.mask(np.random.random(df1.shape) < nan_probability)
# create and insert a categorical variable
df1["Group"] = ['X'] * 55 + ['Y'] * 48 + ['Z'] * 57
# make group the first column
col = df1.pop('Group')
df1.insert(0, 'Group', col)
df1
My attempt:
# Step 1: count the number of non-NaN values in each column, grouped by the 'Group' column
df1.groupby('Group').count()
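To get these counts into the shape of the desired output (one row per numeric column), one option would be to transpose the grouped counts and rename the columns; count_X, count_Y and count_Z are just illustrative names, not something defined above:
# non-NaN counts per group, transposed so each numeric column becomes a row
counts = df1.groupby('Group').count().T
counts.columns = ['count_' + c for c in counts.columns]  # count_X, count_Y, count_Z
counts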
# Steps 2 and 3: effect size and power
# Goal: calculate both effect size and power for each column and save as dataframe
import statsmodels.api as sm
from statsmodels.stats.power import FTestAnovaPower
from statsmodels.formula.api import ols
import pingouin as pg
# I need help with all columns, using column_1 as an example here
model = ols('column_1 ~ C(Group)', data=df1).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
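For the observed effect size, here is a minimal sketch of one common approach, assuming eta squared converted to Cohen's f is acceptable (Cohen's f is the scale that FTestAnovaPower expects):
# observed effect size for column_1, derived from the ANOVA table above
ss_between = anova_table.loc['C(Group)', 'sum_sq']
ss_residual = anova_table.loc['Residual', 'sum_sq']
eta_squared = ss_between / (ss_between + ss_residual)  # proportion of variance explained by Group
cohens_f = np.sqrt(eta_squared / (1 - eta_squared))    # Cohen's f, as used by FTestAnovaPower
print(f"Observed eta squared: {eta_squared:.4f}, Cohen's f: {cohens_f:.4f}")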
effect_size_f = 0.5  # placeholder; the observed Cohen's f derived from the ANOVA table should be used instead
n_groups = 3  # number of groups
n_observations_per_group = 53  # average sample size per group; ideally the actual non-NaN counts for X, Y and Z should be used here
# Initialize the power analysis object
power_analysis = FTestAnovaPower()
# Calculate post-hoc power
power = power_analysis.solve_power(
effect_size=effect_size_f,
nobs=n_observations_per_group * n_groups, # Total observations
alpha=0.05,
k_groups=3
)
print(f"Observed Power: {power:.4f}")
# store each column's effect size and power as a dataframe
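Here is a minimal sketch of how that loop might look, assuming the eta-squared-to-Cohen's-f conversion above, re-fitting the ANOVA per column on the rows where that column is non-NaN, and using the actual non-NaN counts per group; numeric_cols, results and final_df are names I made up:
numeric_cols = df1.columns.drop('Group')
group_counts = df1.groupby('Group').count()  # non-NaN counts per group and column
results = []
for col in numeric_cols:
    # keep only the rows where this column is non-NaN before fitting the ANOVA
    sub = df1[['Group', col]].dropna()
    model = ols(f'{col} ~ C(Group)', data=sub).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    # observed effect size: eta squared converted to Cohen's f
    ss_between = anova_table.loc['C(Group)', 'sum_sq']
    ss_residual = anova_table.loc['Residual', 'sum_sq']
    eta_squared = ss_between / (ss_between + ss_residual)
    cohens_f = np.sqrt(eta_squared / (1 - eta_squared))
    # post-hoc power using the actual total non-NaN sample size for this column
    power = FTestAnovaPower().solve_power(
        effect_size=cohens_f,
        nobs=len(sub),
        alpha=0.05,
        k_groups=3
    )
    results.append({
        'column': col,
        'count_X': group_counts.loc['X', col],
        'count_Y': group_counts.loc['Y', col],
        'count_Z': group_counts.loc['Z', col],
        'observed_effect_size': cohens_f,
        'observed_power': power,
    })
final_df = pd.DataFrame(results)
final_df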
# Step 4: export the final dataframe as a CSV file
final_df.to_csv('/path/posthocanalysis.csv', index=False)
